Create An Evaluation Dataset

An evaluation dataset can be loaded from:

  • CSV files
  • JSON files
  • Custom data sources (to be supported soon)

Defining A Dataset

An evaluation dataset is a list of test cases, designed to make it easy to run a large number of them at once. Testing at that scale is important for enterprise production use cases. There are several ways to get started quickly.

Example

from deepeval.dataset import EvaluationDataset

# Load from a CSV file. sample.csv contains:
#   input,expected_output,id
#   sample_input,sample_output,312
ds = EvaluationDataset.from_csv(
    csv_filename="sample.csv",
    input_column="input",
    expected_output_column="expected_output",
    id_column="id",  # the column name, not a value from that column
)
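
To try the example above, sample.csv has to exist first. The following is a minimal sketch that writes the file described in the comments using only Python's standard csv module. The helper name write_sample_csv is hypothetical and is not part of deepeval.

```python
import csv


def write_sample_csv(path="sample.csv"):
    """Write the three-column sample file described in the example comments."""
    rows = [
        {"input": "sample_input", "expected_output": "sample_output", "id": "312"},
    ]
    # newline="" lets the csv module control line endings on every platform
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "expected_output", "id"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Calling write_sample_csv() with no arguments creates sample.csv in the current working directory.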

Running Tests

Running the tests is easy with the run_evaluation method. When you call run_evaluation, it writes the results to a text file for you to review.

ds.run_evaluation(
    callable_fn=generate_llm_output,
)
# Returns the evaluation
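
The call above passes generate_llm_output without defining it. Below is a minimal sketch of such a callable, under the assumption that it receives one test-case input as a string and returns the output to score as a string. The source does not state the expected signature, so check it against your installed version. The canned answer stands in for a real model call and keeps the sketch deterministic.

```python
# Hypothetical stand-in for a real model call; replace with your own LLM client.
CANNED_ANSWERS = {
    "What is the customer success number": "Our customer success phone line is 1200-231-231.",
}


def generate_llm_output(query: str) -> str:
    """Return the text to evaluate for one test-case input (assumed signature)."""
    return CANNED_ANSWERS.get(query.strip(), "I don't know.")
```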

Once the tests have run, you get a table like the one below, which is also saved to a text file.

Test Passed  Metric Name            Score        Output                                            Expected output  Message
-----------  ---------------------  -----------  ------------------------------------------------  ---------------  ------------------------------------------
True         EntailmentScoreMetric  0.000830871  Our customer success phone line is 1200-231-231.  1800-213-123     EntailmentScoreMetric was unsuccessful for
                                                                                                                    What is the customer success number
                                                                                                                    which should have matched
                                                                                                                    1800-213-123

View a sample of data inside the Evaluation Dataset

To view a sample of data, simply run:

ds.sample(5)

From CSV

You can set up an evaluation dataset from a CSV file with the from_csv method:

dataset = EvaluationDataset.from_csv(
    csv_filename="input.csv",
    input_column="input",
    expected_output_column="expected_output",
)

Parameters

  • csv_filename - the name of the CSV file
  • input_column - the name of the column containing the inputs
  • expected_output_column - the name of the column containing the expected outputs
  • id_column - the name of the column containing the IDs
  • metrics - the list of metrics to use when running the tests
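
Because the column parameters above are names that must appear in the CSV header, it can help to check the header before loading. This is a minimal sketch using only the standard library. The helper missing_columns is hypothetical and not part of deepeval.

```python
import csv


def missing_columns(csv_filename, required):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_filename, newline="", encoding="utf-8") as f:
        # The first row is the header; an empty file yields an empty header
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]
```

For example, missing_columns("input.csv", ["input", "expected_output"]) returns an empty list when both columns are present, so a non-empty result signals a mismatch to fix before calling from_csv.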