openfactcheck.evaluator.LLMEvaluator#

class openfactcheck.evaluator.LLMEvaluator(ofc)[source]#

This class is used to evaluate the performance of a Language Model.

Parameters:
  • model_name (str) – The name of the Language Model.

  • input_path (Union[str, pd.DataFrame]) – The path to the CSV file or the DataFrame containing the LLM responses. The CSV file should have the following two columns (see the example after this parameter list):
      - index: The index of the response.
      - response: The response generated by the LLM.

  • output_path (str) – The path to store the output files.

  • dataset_path (str) – The path to the dataset file containing the questions.

  • datasets (list) – The list of datasets to evaluate the LLM on.

  • analyze (bool) – Whether to analyze the results.

  • save_plots (bool) – Whether to save the plots.

  • save_report (bool) – Whether to save the report.

  • ofc (OpenFactCheck)
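
For reference, a minimal sketch of what the expected input might look like, assuming the two-column layout described above; the file name responses.csv and the example responses are placeholders:

    # Build a small responses table with the two documented columns.
    import pandas as pd

    responses = pd.DataFrame(
        {
            "index": [0, 1],
            "response": [
                "Paris is the capital of France.",
                "The Great Wall of China is visible from space.",
            ],
        }
    )

    # Pass the DataFrame directly as input_path, or write it to a CSV
    # file and pass the path instead ("responses.csv" is a placeholder).
    responses.to_csv("responses.csv", index=False)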

model_name#

The name of the Language Model.

Type: str

run_id#

The unique identifier for the run.

Type: str

input_path#

The path to the CSV file or the DataFrame containing the LLM responses.

Type: Union[str, pd.DataFrame]

output_path#

The path to store the output files.

Type: str

dataset_path#

The path to the dataset file containing the questions.

Type: str

datasets#

The list of datasets to evaluate the LLM on.

Type: list

combined_result#

The combined evaluation results for all datasets.

Type: dict

evaluate(model_name: str, input_path: Union[str, pd.DataFrame], output_path: str = "", dataset_path: str = "", datasets: list = ["snowballing"], analyze: bool = True, save_plots: bool = True, save_report: bool = True):

Evaluate the performance of the Language Model.

read_input():

Read the input file and the dataset file and return a DataFrame containing the combined data.

filter_responses(df: pd.DataFrame, dataset: str):

Filter the responses based on the dataset.

generate_plots(fig_path, save_plots=True):

Generate plots for the evaluation.
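
A minimal usage sketch follows. Only LLMEvaluator(ofc), evaluate(...), and the combined_result attribute are taken from this page; the OpenFactCheck imports and configuration, the model name, and the input path are assumptions or placeholders:

    # Assumed imports for the wider openfactcheck package; adjust to your
    # installation. Only LLMEvaluator is documented on this page.
    from openfactcheck import OpenFactCheck, OpenFactCheckConfig
    from openfactcheck.evaluator import LLMEvaluator

    ofc = OpenFactCheck(OpenFactCheckConfig())  # assumed default configuration
    evaluator = LLMEvaluator(ofc)

    # "gpt-4o" and "responses.csv" are placeholder values.
    evaluator.evaluate(
        model_name="gpt-4o",
        input_path="responses.csv",
        datasets=["snowballing"],
    )

    # The combined evaluation results for all datasets are exposed as
    # the combined_result attribute.
    print(evaluator.combined_result)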

__init__(ofc)[source]#

Initialize the LLMEvaluator object.

Parameters:
  • ofc (OpenFactCheck)

Methods

__init__(ofc)

Initialize the LLMEvaluator object.

assess_freetext(output_path)

Assess the free-text experiment, i.e., the number and type of claims, via Exact Matching (EM).

calculate_price(num_claims[, cost_openai, ...])

Calculate the cost (in USD) of the API calls for the free-text experiment.

call_fresheval(prefix, question, response, ...)

Call the FreshEval API to evaluate responses.

call_openai_api(prompt, temperature, max_tokens)

Call the OpenAI API to generate responses.

cut_sentences(content)

Cut the content into sentences.

cut_sub_string(input_string[, window_size, ...])

Cut the input string into sub-strings of a fixed window size.

evaluate(model_name, input_path[, ...])

Evaluate the performance of the Language Model.

evaluate_freetext(llm_responses, model_name, ...)

Evaluate the LLM responses on free-text datasets.

evaluate_freshqa(llm_responses)

Evaluate the responses generated by the LLM on FreshQA questions.

evaluate_selfaware(llm_responses)

Evaluate the LLM responses on the SelfAware dataset.

evaluate_snowballing(llm_responses)

Evaluate the LLM responses on the Snowballing dataset.

extract_ratings(response)

Extract the rating from the evaluation response.

filter_responses(df, dataset)

Filter the responses based on the dataset.

freetext_barplot(results[, fig_path, save])

Create a bar plot for the free-text evaluation results, ensuring full row utilization.

freshqa_piechart(result[, fig_path, save])

Plot a pie chart of the true and false answers on FreshQA.

generate_plots([fig_path, save_plots])

Generate plots for the evaluation.

generate_report(report_path)

Generate the evaluation report.

get_boolean(response[, strict])

Get a boolean value from the response.

get_unanswerable(response, model, tokenizer)

Predict whether the response is unanswerable or not.

group_cosine_similarity(model, tokenizer, ...)

Calculate the cosine similarity between two groups of sentences.

read_evaluations()

Read the evaluations from the output directory.

read_input()

Read the input file and the dataset file and return a DataFrame containing the combined data.

read_results(evaluations)

Read the results from the evaluations.

remove_punctuation(input_string)

Remove the punctuation from the input string.

selfaware_barplot(result[, fig_path, save])

Create a bar plot of the performance on the SelfAware dataset.

selfaware_cm(labels, preds[, fig_path, save])

Create a confusion matrix for the SelfAware dataset.

snowballing_barplot(result[, fig_path, save])

Create a bar plot of the accuracy of the LLM responses on the Snowballing dataset for each topic and the overall accuracy.

snowballing_cm(labels, preds[, fig_path, save])

Create a confusion matrix for the Snowballing dataset.

sum_all_elements(obj)

Sum all elements of an object.
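
Continuing from the earlier sketch, the plotting and reporting helpers can also be called directly, for example to regenerate figures or the report after an evaluation run; the paths are placeholders, and the assumption that a prior evaluate(...) call has populated the output directory is based on the method descriptions above:

    # Placeholder paths; assumes evaluate(...) has already written its
    # evaluations to the output directory.
    evaluator.generate_plots(fig_path="figures", save_plots=True)
    evaluator.generate_report(report_path="report.pdf")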