surveyeval.evaluation_engine module

Core classes for the instrument evaluation engine.

class surveyeval.evaluation_engine.EvaluationEngine(evaluation_model: str = '', evaluation_provider: str = '', openai_api_key: str = '', azure_api_key: str = '', azure_api_base: str = '', azure_api_version: str = '', anthropic_api_key: str | None = None, bedrock_region: str = 'us-east-1', bedrock_aws_profile: str | None = None, temperature: float = 0.1, reasoning_effort: str | None = None, max_retries: int = 3, logger: Logger | None = None, extra_evaluation_instructions: str = '', langsmith_api_key: str = '', langsmith_project: str = 'surveyeval', langsmith_endpoint: str = 'https://api.smith.langchain.com', summarize_model: str = '', summarize_provider: str = '', tiktoken_model_name: str = '')

Bases: object

Main class for the instrument evaluation engine.

__init__(evaluation_model: str = '', evaluation_provider: str = '', openai_api_key: str = '', azure_api_key: str = '', azure_api_base: str = '', azure_api_version: str = '', anthropic_api_key: str | None = None, bedrock_region: str = 'us-east-1', bedrock_aws_profile: str | None = None, temperature: float = 0.1, reasoning_effort: str | None = None, max_retries: int = 3, logger: Logger | None = None, extra_evaluation_instructions: str = '', langsmith_api_key: str = '', langsmith_project: str = 'surveyeval', langsmith_endpoint: str = 'https://api.smith.langchain.com', summarize_model: str = '', summarize_provider: str = '', tiktoken_model_name: str = '')

Initialize evaluation engine.

Parameters:

evaluation_model (str) – LLM model to use for instrument evaluation (when using Azure, the deployment or engine name must be the same as the model name).
evaluation_provider (str) – Provider name for the evaluation model (“openai”, “azure”, “anthropic”, or “bedrock”).
openai_api_key (str) – API key for OpenAI services (if evaluation_provider is “openai”).
azure_api_key (str) – API key for Azure services (if evaluation_provider is “azure”).
azure_api_base (str) – Base URL for Azure API (if evaluation_provider is “azure”).
azure_api_version (str) – Version of the Azure API (if evaluation_provider is “azure”).
anthropic_api_key (str | None) – API key for Anthropic (if evaluation_provider is “anthropic”). Default is None.
bedrock_region (str) – AWS Bedrock region (if evaluation_provider is “bedrock”). Default is “us-east-1”.
bedrock_aws_profile (str | None) – AWS profile for Bedrock access (if evaluation_provider is “bedrock”). Default is None.
temperature (float) – Temperature setting for AI model responses. Default is 0.1.
reasoning_effort (str | None) – Reasoning effort setting for AI model responses (e.g., “low”, “medium”, “high”). Only supported by certain models. Default is None.
max_retries (int) – Maximum number of retries for asking questions. Default is 3.
logger (logging.Logger | None) – Logger instance for logging messages. Default is None.
extra_evaluation_instructions (str) – Extra evaluation instructions (optional).
langsmith_api_key (str) – API key for LangSmith services (optional).
langsmith_project (str) – LangSmith project name. Default is ‘surveyeval’.
langsmith_endpoint (str) – LangSmith endpoint URL. Default is ‘https://api.smith.langchain.com’.
summarize_model (str) – LLM model to use for summarizing multistep conversations (not currently used).
summarize_provider (str) – Provider name for the summarization model (“openai” or “azure”) (not currently used).
tiktoken_model_name (str) – Name of the model used with TikToken, if different from evaluation_model (not currently used).
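
Because only the credentials for the selected provider are needed, it can help to assemble the constructor arguments in one place. The following is a minimal sketch, not part of surveyeval: the helper name engine_kwargs_from_env and the SURVEYEVAL_* environment variable names are hypothetical, and only the keyword names come from the parameter list above.

```python
import os


def engine_kwargs_from_env(env=None):
    """Collect EvaluationEngine keyword arguments from environment variables.

    The environment variable names are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    provider = env.get("SURVEYEVAL_PROVIDER", "openai")
    kwargs = {
        "evaluation_provider": provider,
        "evaluation_model": env.get("SURVEYEVAL_MODEL", ""),
        "temperature": float(env.get("SURVEYEVAL_TEMPERATURE", "0.1")),
        "max_retries": int(env.get("SURVEYEVAL_MAX_RETRIES", "3")),
    }
    # Only pass the credentials that the chosen provider needs.
    if provider == "openai":
        kwargs["openai_api_key"] = env.get("OPENAI_API_KEY", "")
    elif provider == "azure":
        kwargs["azure_api_key"] = env.get("AZURE_API_KEY", "")
        kwargs["azure_api_base"] = env.get("AZURE_API_BASE", "")
        kwargs["azure_api_version"] = env.get("AZURE_API_VERSION", "")
    elif provider == "anthropic":
        kwargs["anthropic_api_key"] = env.get("ANTHROPIC_API_KEY")
    elif provider == "bedrock":
        kwargs["bedrock_region"] = env.get("AWS_REGION", "us-east-1")
        kwargs["bedrock_aws_profile"] = env.get("AWS_PROFILE")
    else:
        raise ValueError(f"Unsupported provider: {provider!r}")
    return kwargs


# With surveyeval installed, the result could be passed straight through:
#   from surveyeval.evaluation_engine import EvaluationEngine
#   engine = EvaluationEngine(**engine_kwargs_from_env())
```

Passing a plain dict as env keeps the helper easy to test without touching the real environment.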

async a_followup_question(condition_func: Callable, condition_key: str, condition_value, prompt_template: str, response_dict: dict, llm_chain: LLMInterface, chat_history: list | None = None, json_validation_schema: str = '') → dict

Ask a follow-up question (asynchronously).

Parameters:

condition_func (callable) – Function to call to evaluate whether the follow-up should be asked (True) or not (False).
condition_key (str) – Key to check in the response dictionary.
condition_value (Any) – Value to check for in the response dictionary, according to the logic of condition_func.
prompt_template (str) – Template for the follow-up question to ask (which can include variables from the response dict).
response_dict (dict) – Full response dictionary.
llm_chain (LLMInterface) – LLM interface (a runnable conversation chain) to use for asking the follow-up question.
chat_history (list) – Chat history to use for the evaluation chain (or None for none).
json_validation_schema (str) – JSON schema to use for validating the response.

Returns:

A dict with result (“success”, “error”, or “skipped”), error (if result is “error”), prompt (a str), response_json (a str), and response (a dict).

Return type:

dict

async a_run_evaluation_chain(task_system_prompt: str, question: str, followups: list[dict], chat_history: list | None = None) → dict

Run an evaluation chain (asynchronously).

Parameters:

task_system_prompt (str) – System prompt to use for the evaluation chain. It should specify a specific JSON format for all responses.
question (str) – Initial question to ask, to begin the evaluation chain.
followups (list[dict]) – List of follow-up questions to ask, based on the JSON response to the initial question. Should be a list of dicts, each with a value for each of the following keys: “condition_func”: function that returns True when the follow-up should be asked and False when it shouldn’t (should take the following parameters: response_dict: dict, condition_key: str, condition_value: value to check against the one in the response dict); “condition_key”: the key in the response dict to check; “condition_value”: the value to check against the one in the response dict; “prompt_template”: the template for the follow-up question to ask (which can include variables from the response dict).
chat_history (list) – Chat history to use for the evaluation chain (or None for none).

Returns:

A dict with result (“success” or “error”), error (if result is “error”), response (a dict), and history (a list with the full history of the evaluation chain, each item of which is a list with two strings, a prompt and a response).

Return type:

dict
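
The followups structure described above can be illustrated with a small standalone sketch. The condition function score_is_below, the clarity_score key, and the helper pending_followups are hypothetical, and the sketch assumes that prompt templates are filled in with str.format-style variables from the response dict; the engine's actual rendering may differ.

```python
def score_is_below(response_dict: dict, condition_key: str, condition_value) -> bool:
    """Hypothetical custom condition: True when a numeric field is below a threshold."""
    value = response_dict.get(condition_key)
    return isinstance(value, (int, float)) and value < condition_value


# One follow-up, using the four keys documented for each entry.
followups = [
    {
        "condition_func": score_is_below,
        "condition_key": "clarity_score",
        "condition_value": 3,
        "prompt_template": "You rated clarity {clarity_score}/5. What would you change first?",
    },
]


def pending_followups(response_dict: dict, followups: list[dict]) -> list[str]:
    """Return the rendered prompt for every follow-up whose condition holds."""
    prompts = []
    for f in followups:
        if f["condition_func"](response_dict, f["condition_key"], f["condition_value"]):
            # Assumption: templates use str.format-style placeholders.
            prompts.append(f["prompt_template"].format(**response_dict))
    return prompts
```

A list shaped like followups is what would be passed as the followups argument; pending_followups only mimics the selection step so the behavior can be checked offline.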

static clean_whitespace(s: str) → str

Strip whitespace from prompt string in order to economize on tokens.

Parameters:

s (str) – Prompt string to clean.

Returns:

Cleaned prompt string.

Return type:

str
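
The exact cleaning rules are not documented here. As a rough illustration of the idea only, the sketch below shows one plausible way to reduce whitespace in a prompt; clean_whitespace_sketch is a hypothetical stand-in, not the library's implementation.

```python
import re


def clean_whitespace_sketch(s: str) -> str:
    """Collapse runs of spaces/tabs, trim each line, and drop repeated blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in s.splitlines()]
    text = "\n".join(lines)
    # Collapse three or more newlines (two or more blank lines) into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Indented triple-quoted prompts are a common source of the extra spaces this kind of cleanup removes.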

get_llm_interface(system_prompt: str = '', starting_chat_history: list[tuple] | None = None) → LLMInterface

Get an LLM interface for use in evaluating an instrument.

Parameters:

system_prompt (str) – System prompt to use for all LLM calls.
starting_chat_history (list[tuple]) – Starting chat history to use for the conversation chain (or None for none). Should be tuples, each with a human and an AI message.

Returns:

LLM interface (a runnable conversation chain) to use for instrument evaluation.

Return type:

LLMInterface

run_evaluation_chain(task_system_prompt: str, question: str, followups: list[dict], chat_history: list | None = None) → dict

Run an evaluation chain (synchronously).

Parameters:

task_system_prompt (str) – System prompt to use for the evaluation chain. It should specify a specific JSON format for all responses.
question (str) – Initial question to ask, to begin the evaluation chain.
followups (list[dict]) – List of follow-up questions to ask, based on the JSON response to the initial question. Should be a list of dicts, each with a value for each of the following keys: “condition_func”: function that returns True when the follow-up should be asked and False when it shouldn’t (should take the following parameters: response_dict: dict, condition_key: str, condition_value: value to check against the one in the response dict); “condition_key”: the key in the response dict to check; “condition_value”: the value to check against the one in the response dict; “prompt_template”: the template for the follow-up question to ask (which can include variables from the response dict).
chat_history (list) – Chat history to use for the evaluation chain (or None for none).

Returns:

A dict with result (“success” or “error”), error (if result is “error”), response (a dict), and history (a list with the full history of the evaluation chain, each item of which is a list with two strings, a prompt and a response).

Return type:

dict

static trim_json(json_str: str) → str

Trim common leading and trailing characters from JSON string.

Parameters:

json_str (str) – JSON string to trim.

Returns:

Trimmed JSON string.

Return type:

str
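
Which characters count as "common" is not specified here. A typical case is an LLM wrapping its JSON in explanatory prose, and the sketch below shows one way such trimming might work under that assumption; trim_json_sketch is a hypothetical stand-in, not the library's implementation.

```python
import json


def trim_json_sketch(json_str: str) -> str:
    """Keep the text from the first '{' or '[' through the last corresponding closer."""
    s = json_str.strip()
    starts = [i for i in (s.find("{"), s.find("[")) if i != -1]
    if not starts:
        return s  # nothing that looks like JSON; leave the input alone
    start = min(starts)
    closer = "}" if s[start] == "{" else "]"
    end = s.rfind(closer)
    if end < start:
        return s  # opener without a closer; leave the input alone
    return s[start:end + 1]


raw = 'Sure, here is the JSON: {"ok": true} Let me know.'
parsed = json.loads(trim_json_sketch(raw))
```

The trimmed string still has to be parsed and validated; trimming only removes surrounding text.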

class surveyeval.evaluation_engine.EvaluationLens(task_system_prompt_template: str, question_template: str, followups: list[dict], evaluation_engine: EvaluationEngine, lens_description: str = 'Evaluation lens with unknown description')

Bases: object

Class for instrument evaluation lens, which is used to conduct a particular type of evaluation.

__init__(task_system_prompt_template: str, question_template: str, followups: list[dict], evaluation_engine: EvaluationEngine, lens_description: str = 'Evaluation lens with unknown description')

Initialize evaluation lens.

Parameters:

task_system_prompt_template (str) – System prompt template to use for the evaluation chain. This can include the {survey_context} and {survey_locations} variables to include information about the survey context. It should specify a specific JSON format for all responses.
question_template (str) – Initial question to ask, to begin the evaluation chain. This should include the {survey_excerpt} variable to include the appropriate excerpt being evaluated.
followups (list[dict]) – List of follow-up questions to ask, based on the JSON response to the initial question. Should be a list of dicts, each with a value for each of the following keys: “condition_func”: function that returns True when the follow-up should be asked and False when it shouldn’t (should take the following parameters: response_dict: dict, condition_key: str, condition_value: value to check against the one in the response dict); “condition_key”: the key in the response dict to check; “condition_value”: the value to check against the one in the response dict; “prompt_template”: the template for the follow-up question to ask (which can include variables from the response dict).
evaluation_engine (EvaluationEngine) – Evaluation engine instance to use for conducting evaluation.
lens_description (str) – High-level description of the evaluation lens, sufficient for LLM-as-judge evaluation of results.
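
To make the template variables concrete, here is a small sketch of a prompt template pair and how it could be filled in. It assumes str.format-style substitution, which is why the literal JSON braces are doubled; the survey text and values are invented for illustration.

```python
task_system_prompt_template = (
    "You are reviewing a survey instrument.\n"
    "Context: {survey_context}\n"
    "Locations: {survey_locations}\n"
    # Doubled braces survive formatting as literal braces (str.format assumption).
    'Respond only with JSON of the form {{"issues": [...]}}.'
)
question_template = "Evaluate this excerpt:\n\n{survey_excerpt}"

# Illustrative values; in practice these would be supplied as keyword arguments.
variables = {
    "survey_context": "Household health survey",
    "survey_locations": "Kenya, Uganda",
    "survey_excerpt": "Q1. How many people live in your home?",
}
system_prompt = task_system_prompt_template.format(**variables)
question = question_template.format(**variables)
```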

async a_evaluate(chat_history: list | None = None, **kwargs) → dict

Run an evaluation chain (asynchronously).

Parameters:

chat_history (list) – Chat history to use for the evaluation chain (or None for none).
kwargs (Any) – Keyword arguments to use for formatting the task system prompt and question.

Returns:

A dict with result (“success” or “error”), error (if result is “error”), response (a dict), and history (a list with the full history of the evaluation chain, each item of which is a list with two strings, a prompt and a response).

Return type:

dict
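
An asynchronous method like this one is typically useful for evaluating several excerpts concurrently. The sketch below uses a hypothetical StubLens in place of a real lens so that it runs offline; only the a_evaluate call shape and the result keys follow the documentation above. Since format_result and standardize_result can fall back to an evaluation_result attribute, it is probably safer to work with the returned dicts than with that attribute when calls overlap on one lens.

```python
import asyncio


class StubLens:
    """Hypothetical stand-in for EvaluationLens, so the sketch runs offline."""

    async def a_evaluate(self, chat_history=None, **kwargs) -> dict:
        await asyncio.sleep(0)  # placeholder for the real LLM round trip
        return {
            "result": "success",
            "response": {"excerpt_length": len(kwargs["survey_excerpt"])},
            "history": [],
        }


async def evaluate_excerpts(lens, excerpts: list[str]) -> list[dict]:
    """Evaluate every excerpt concurrently; results come back in input order."""
    tasks = [lens.a_evaluate(survey_excerpt=e) for e in excerpts]
    return await asyncio.gather(*tasks)


results = asyncio.run(evaluate_excerpts(StubLens(), ["Q1. Age?", "Q2. Household size?"]))
```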

static condition_is_in_list(response_dict: dict, condition_key: str, condition_value: str) → bool

Check if a condition is met in a given response dictionary: “is in list” (string in list of strings).

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (str) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool
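
Read together with the signature, "string in list of strings" suggests that the response dictionary holds a list under condition_key and that condition_value is the string to look for. The sketch below illustrates that reading; is_in_list_sketch is a hypothetical stand-in, and the handling of missing keys is an assumption.

```python
def is_in_list_sketch(response_dict: dict, condition_key: str, condition_value: str) -> bool:
    """True when condition_value is one of the strings stored under condition_key."""
    values = response_dict.get(condition_key, [])
    # Assumption: a missing key or a non-list value means the condition is not met.
    return isinstance(values, list) and condition_value in values


response = {"phrases_flagged": ["leading", "double-barreled"]}
needs_followup = is_in_list_sketch(response, "phrases_flagged", "leading")
```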

static condition_is_not_in_list(response_dict: dict, condition_key: str, condition_value: str) → bool

Check if a condition is met in a given response dictionary: “is not in list” (string not in list of strings).

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (str) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool

static condition_is_not_value(response_dict: dict, condition_key: str, condition_value: str) → bool

Check if a condition is met in a given response dictionary: “is not value” (string doesn’t match).

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (str) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool

static condition_is_value(response_dict: dict, condition_key: str, condition_value: str) → bool

Check if a condition is met in a given response dictionary: “is value” (string match).

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (str) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool

static condition_list_has_greater_or_equal_elements(response_dict: dict, condition_key: str, condition_value: int) → bool

Check if a condition is met in a given response dictionary: “list has greater than or equal to n elements”.

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (int) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool

static condition_list_has_greater_than_elements(response_dict: dict, condition_key: str, condition_value: int) → bool

Check if a condition is met in a given response dictionary: “list has greater than n elements”.

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (int) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool

static condition_list_has_less_or_equal_elements(response_dict: dict, condition_key: str, condition_value: int) → bool

Check if a condition is met in a given response dictionary: “list has less than or equal to n elements”.

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (int) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool

static condition_list_has_less_than_elements(response_dict: dict, condition_key: str, condition_value: int) → bool

Check if a condition is met in a given response dictionary: “list has less than n elements”.

Parameters:

response_dict (dict) – Response dictionary to check.
condition_key (str) – Key to check in response dictionary.
condition_value (int) – Value to check for in response dictionary.

Returns:

True if condition is met, False otherwise.

Return type:

bool
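
The four list-length conditions share one shape and differ only in the comparison applied. As an illustration of that shape (not the library's code), the sketch below builds functions with the same three-argument signature from the operator module; list_length_condition and the "issues" key are hypothetical.

```python
import operator
from typing import Callable

_COMPARATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def list_length_condition(op: str) -> Callable[[dict, str, int], bool]:
    """Build a condition function comparing a list's length against condition_value."""
    compare = _COMPARATORS[op]

    def condition(response_dict: dict, condition_key: str, condition_value: int) -> bool:
        values = response_dict.get(condition_key)
        # Assumption: a missing or non-list value means the condition is not met.
        return isinstance(values, list) and compare(len(values), condition_value)

    return condition


has_at_least = list_length_condition(">=")
has_fewer_than = list_length_condition("<")
response = {"issues": ["leading question", "double-barreled question"]}
```

Any function with this signature can be supplied as a follow-up's “condition_func”, so custom conditions can sit alongside the built-in static methods.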

evaluate(chat_history: list | None = None, **kwargs) → dict

Run an evaluation chain (synchronously).

Parameters:

chat_history (list) – Chat history to use for the evaluation chain (or None for none).
kwargs (Any) – Keyword arguments to use for formatting the task system prompt and question.

Returns:

A dict with result (“success” or “error”), error (if result is “error”), response (a dict), and history (a list with the full history of the evaluation chain, each item of which is a list with two strings, a prompt and a response).

Return type:

dict

format_result(result: dict | None = None, minimum_importance: int = 0) → str

Format the evaluation result as a human-readable string.

Parameters:

result (dict | None) – Evaluation result to format (or None to use the evaluation_result attribute).
minimum_importance (int) – Minimum importance score for filtering results (defaults to 0, which doesn’t filter).

Returns:

Formatted evaluation result.

Return type:

str

standardize_result(result: dict | None = None) → list[dict]

Reorganize the evaluation result into a list of recommendations in a standardized format.

Parameters:

result (dict | None) – Evaluation result to format (or None to use the evaluation_result attribute).

Returns:

List of recommendations, each of which is a dict with the following keys: importance (int 1-5), replacement_original (str), replacement_suggested (str), explanation (str).

Return type:

list[dict]
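
Because every recommendation carries the same four keys, downstream filtering and sorting are straightforward. The sketch below works on invented sample data in the documented shape; the helper top_recommendations is hypothetical and not part of surveyeval.

```python
# Invented sample data, shaped like the documented return value.
recommendations = [
    {
        "importance": 2,
        "replacement_original": "How old are you?",
        "replacement_suggested": "What is your age in completed years?",
        "explanation": "Clarifies the unit of measurement.",
    },
    {
        "importance": 5,
        "replacement_original": "Do you agree that clinics are good and cheap?",
        "replacement_suggested": "How would you rate the quality of local clinics?",
        "explanation": "Removes a double-barreled, leading question.",
    },
]


def top_recommendations(recs: list[dict], minimum_importance: int = 0) -> list[dict]:
    """Keep recommendations at or above a threshold, most important first."""
    kept = [r for r in recs if r["importance"] >= minimum_importance]
    return sorted(kept, key=lambda r: r["importance"], reverse=True)


urgent = top_recommendations(recommendations, minimum_importance=4)
```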