Part 4 - Maximize your fine-tuned model performance with the new Azure AI Evaluation SDK
By Cedric Vidal, Principal AI Advocate, Microsoft
Part of the Future of AI 🚀 series initiated by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.
Fictitious retro representation of judges at the Paris Olympics evaluating LLama athletes. Generated using Azure OpenAI DALL-E 3
In earlier posts of this distillation series, we detailed the process of distilling a Llama 3.1 405B model into a more compact Llama 3.1 8B model. This journey included generating a synthetic dataset using RAFT, as well as fine-tuning and deploying our student model on Azure AI Serverless.
But how can we confirm that our distilled model performs optimally? The crucial final step is evaluating the model.
Effective model evaluation is key to ensuring that our AI systems function as expected and meet the desired standards. With the introduction of the Azure AI Evaluation Python SDK, we now have a powerful toolkit for assessing AI models through advanced metrics. In this blog post, we’ll look at evaluating a distilled student model, which was trained with data generated by RAFT, and compare it against a baseline model.
In our setup, Llama 3.1 405B functions as the teacher, Llama 3.1 8B serves as the student model and GPT-4 serves as the judge.
Why evaluate?
Evaluating distilled student models is crucial because it allows us to assess how effectively knowledge has been transferred from the teacher model to the student model. Distillation aims to compress a larger, more complex model into a smaller, more efficient one without significantly sacrificing performance. By thoroughly evaluating the distilled models, we ensure they not only mimic the teacher model’s outputs but also maintain high levels of accuracy, coherence, and relevance. This evaluation process helps identify areas where the student model may need further fine-tuning and ensures that the distilled models are ready for deployment in resource-constrained environments where computational efficiency is paramount.
Process Overview
Evaluating the performance of our models involves several key steps, which can be broadly categorized under Testing and Scoring.
- Testing
- Run the Baseline Model on the Evaluation Split: Our first step is to run the teacher model (Llama 3.1 405B) on the evaluation split to generate its predictions.
- Run the Student Model on the Evaluation Split: Next, we run the student model on the same evaluation dataset to generate its predictions.
- Scoring
- Calculate Metrics for the Baseline Model: Using the predictions from the baseline model, we calculate various performance metrics.
- Calculate Metrics for the Student Model: Similarly, we calculate the performance metrics for the student model’s predictions.
- Compare Metrics: Finally, we compare the performance of both models, highlighting the results through visuals and diagrams.
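The steps above can be sketched as a small loop. In this sketch, `run_model` and `compute_metric` are hypothetical stand-ins for the `eval.py` script and the SDK's evaluators, not real APIs:

```python
# A minimal sketch of the test-then-score flow described above.
# run_model and compute_metric are made-up placeholders.

def run_model(model_name, questions):
    # In practice this calls a deployed model endpoint once per question.
    return [f"[{model_name}] answer to: {q}" for q in questions]

def compute_metric(answers, gold_answers):
    # Placeholder metric: exact-match rate against the gold answers.
    matches = sum(a == g for a, g in zip(answers, gold_answers))
    return matches / len(gold_answers)

questions = ["What types of waves do strong direct offshore winds create?"]
gold = ["plunging or large barrel waves"]

scores = {}
for model in ("baseline", "student"):
    answers = run_model(model, questions)          # Testing phase
    scores[model] = compute_metric(answers, gold)  # Scoring phase

print(scores)
```

The key design point is that testing (inference) and scoring are decoupled: answers are persisted once, and metrics can then be recomputed as often as needed.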
Testing the baseline and student models
Installing the SDK
First, you need to install the Azure AI Evaluation SDK:
```bash
pip install openai azure-ai-evaluation azure-identity promptflow-azure
```
Note on SDK Availability: It’s important to highlight that the Azure AI Evaluation SDK is currently in beta. This means that while the SDK offers a comprehensive suite of tools and features for evaluating AI models, it may still undergo changes and improvements. Users should stay updated with any modifications or enhancements introduced by Azure, and consider providing feedback to help refine and optimize the SDK for wider use in its official release.
Baseline Model Testing
This will generate answers to the questions in the evaluation dataset using the baseline model:
```bash
env $(cat .env .env.state) python .gorilla/raft/eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer_baseline \
    --model $BASELINE_OPENAI_DEPLOYMENT \
    --env-prefix BASELINE \
    --mode $BASELINE_MODEL_API
```
Note: The JSONL output still needs to be converted into a format suitable for testing; see the eval notebook for details.
Student Model Testing
This will generate answers to the questions in the evaluation dataset using the student model:
```bash
env $(cat .env .env.state) python .gorilla/raft/eval.py \
    --question-file $dataset_path_hf_eval \
    --answer-file $dataset_path_hf_eval_answer \
    --model $STUDENT_DEPLOYMENT_NAME \
    --env-prefix STUDENT \
    --mode $STUDENT_MODEL_API
```
Note: The JSONL output still needs to be converted into a format suitable for testing; see the eval notebook for details.
Let’s look at a sample
This sample is extracted from the evaluation split and shows the baseline and student answers:
question: What types of waves do strong direct offshore winds create?
gold_final_answer: plunging or large barrel waves
context: <DOCUMENT>Lefts, Rights, and A-frames could be directed from this pump design providing for rippable surf and barrel rides. The Ocean Dome cost about $2 billion to build and was expensive to maintain.[31] The Ocean Dome was closed in 2007.</DOCUMENT> <DOCUMENT>However, the waves that are produced by reef breaks are some of the best in the world. Famous reef breaks are present in Padang Padang (Indonesia), Pipeline (Hawaii), Uluwatu (Bali), and Teahupo’o (Tahiti).[49][52] A ledge break is formed by steep rock ledges that make intense waves because the waves travel through deeper water then abruptly reach shallower water at the ledge. Shark Island, Australia is a location with a ledge break.</DOCUMENT>
baseline_answer: Strong direct offshore winds create plunging or large barrel waves. These waves are characterized by their increased height and intensity due to the shallow water depth when they break.
student_answer: plunging or large barrel waves
This sample was chosen randomly and in this case, the student model answer is identical to the gold answer. This is not always the case.
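For reference, the answer files store one JSON object per line. A hypothetical record matching the sample above might look like the following; the field names follow the column mapping used with the evaluate function later in this post, and real files may carry additional fields:

```python
import json

# Hypothetical evaluation record; real files produced by eval.py may
# include additional fields.
record = {
    "question": "What types of waves do strong direct offshore winds create?",
    "gold_final_answer": "plunging or large barrel waves",
    "final_answer": "plunging or large barrel waves",
    "context": "<DOCUMENT>...</DOCUMENT>",
}

# One serialized record per line in the JSONL file
line = json.dumps(record)
print(line)
```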
Evaluating the baseline and student model responses
Built-in Evaluators
The Azure AI Evaluation SDK offers an extensive suite of built-in metrics, designed to facilitate comprehensive evaluation of AI models. In the following sections, we’ll highlight selected evaluators and provide detailed examples of their application, showcasing how they can enhance your model assessments.
They are categorized into two main groups: (1) metrics that leverage GPT models for scoring, providing advanced qualitative assessments, and (2) metrics that utilize straightforward mathematical calculations for evaluation.
GPT based metrics
| Category | Evaluator Class | Notes |
| --- | --- | --- |
| Quality | GroundednessEvaluator | Groundedness measures the extent to which the generated content is based on factual correctness and aligns with the provided data or context. |
| | RelevanceEvaluator | Relevance assesses how pertinent the generated text is to the given input or prompt. Higher relevance scores indicate that the generated responses are more appropriate and closely aligned with the query or topic. |
| | CoherenceEvaluator | Coherence measures how logically consistent and semantically meaningful the generated text is. Higher coherence indicates better understanding and logical consistency. |
| | FluencyEvaluator | Fluency evaluates how naturally the generated text reads. Fluent text should be grammatically correct and smooth in its flow. |
| | SimilarityEvaluator | Measures the similarity between the predicted answer and the correct answer. |
| Content Safety | ViolenceEvaluator | |
| | SexualEvaluator | |
| | SelfHarmEvaluator | |
| | HateUnfairnessEvaluator | |
| Composite | QAEvaluator | Built on top of individual quality evaluators. |
| | ChatEvaluator | Similar to QAEvaluator but designed for evaluating chat messages. |
| | ContentSafetyEvaluator | Built on top of individual content safety evaluators. |
Math based metrics
| Evaluator Class | Notes |
| --- | --- |
| BleuScoreEvaluator | BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality of text generated by an AI by comparing it to one or more reference texts. It particularly looks at the precision of n-grams in the generated text. |
| RougeScoreEvaluator | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) primarily measures recall, comparing n-grams between the generated text and reference texts. It is commonly used for evaluation in summarization tasks. |
| F1ScoreEvaluator | A balance between precision and recall, the F1 score provides a single metric that combines both, offering a more comprehensive view of performance in classification problems. |
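As a quick illustration of how these n-gram metrics work, here is a simplified ROUGE-1 computation in plain Python. This is a sketch only: the SDK's RougeScoreEvaluator applies its own tokenization, and the example strings below are made up:

```python
from collections import Counter

def rouge_1(reference, candidate):
    # ROUGE-1 compares unigram (single-word) overlap: recall is measured
    # against the reference, precision against the candidate.
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1("plunging or large barrel waves",
                 "strong winds create plunging barrel waves")
print(scores)
```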
Running metrics individually
The Azure AI Evaluation SDK enables the utilization of individual metrics. This feature is particularly useful for experimentation, gaining deeper insights, and incorporating metrics into bespoke evaluation workflows.
Tech Tip: This blog post is crafted using the Quarto writing system, a versatile tool for publishing with code. The Azure AI Evaluation metrics are seamlessly executed and displayed inline within this post.
Let’s look first at the F1 Score math metric
For a response that is accurate but includes additional information not found in the ground truth:
```python
from azure.ai.evaluation import F1ScoreEvaluator

f1_score_evaluator = F1ScoreEvaluator()
f1_score = f1_score_evaluator(
    ground_truth="The capital of Japan is Tokyo.",
    response="Tokyo is Japan's capital, known for its blend of traditional culture")
print(f"The F1 Score is {round(f1_score['f1_score'], 2)}")
```
The F1 Score is 0.5
For a response that is accurate but uses the same words turned differently:
```python
from azure.ai.evaluation import F1ScoreEvaluator

f1_score_evaluator = F1ScoreEvaluator()
f1_score = f1_score_evaluator(
    ground_truth="The capital of Japan is Tokyo.",
    response="Tokyo is Japan's capital")
print(f"The F1 Score is {round(f1_score['f1_score'], 2)}")
```
The F1 Score is 0.67
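Where do these scores come from? A SQuAD-style token-overlap F1 (lowercase, strip punctuation and the articles a/an/the, then compare token multisets) reproduces both numbers. Treat this as a sketch of the idea rather than the SDK's exact implementation:

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation, then drop the articles a/an/the
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(ground_truth, response):
    gt, resp = normalize(ground_truth), normalize(response)
    # Multiset intersection counts shared tokens
    overlap = sum((Counter(gt) & Counter(resp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("The capital of Japan is Tokyo.",
                     "Tokyo is Japan's capital"), 2))  # prints 0.67
```

Note how "Japan's" fails to match "Japan" after normalization, which is exactly the kind of surface-level brittleness the GPT-based similarity metric below avoids.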
Let’s now look at the Similarity GPT metric
We first need to instantiate the Judge model client:
```python
from os import getenv
from azure.ai.evaluation import AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint = getenv("JUDGE_AZURE_OPENAI_ENDPOINT"),
    azure_deployment = getenv("JUDGE_AZURE_OPENAI_DEPLOYMENT"),
    api_version = getenv("JUDGE_OPENAI_API_VERSION"),
)
```
Let’s now instantiate the Similarity score metric:
```python
from azure.ai.evaluation import SimilarityEvaluator

similarity_evaluator = SimilarityEvaluator(model_config)
```

For a response that is accurate but includes additional information not found in the ground truth:

```python
similarity = similarity_evaluator(
    query="What's the capital of Japan?",
    ground_truth="The capital of Japan is Tokyo.",
    response="Tokyo is Japan's capital, known for its blend of traditional culture")
print(f"The Similarity is {similarity['gpt_similarity']}")
```
The Similarity is 4.0
For a response that is accurate but uses the same words turned differently:
```python
similarity = similarity_evaluator(
    query="What's the capital of Japan?",
    ground_truth="The capital of Japan is Tokyo.",
    response="Tokyo is Japan's capital")
print(f"The Similarity is {similarity['gpt_similarity']}")
```
The Similarity is 5.0
GPT-based similarity metrics demonstrate greater robustness in evaluating correct responses that are phrased differently compared to traditional F1 Scores.
Running metrics in bulk
While evaluating metrics individually helps in understanding how they behave, obtaining statistically meaningful results requires running them at scale across an evaluation dataset.
The Azure AI Evaluation SDK provides a convenient bulk evaluation capability via the evaluate function.
To begin, we need to initialize the evaluators that will be used to assess the student and baseline models:
```python
from azure.ai.evaluation import (
    CoherenceEvaluator, F1ScoreEvaluator, FluencyEvaluator,
    GroundednessEvaluator, RelevanceEvaluator, SimilarityEvaluator,
    BleuScoreEvaluator, RougeScoreEvaluator, RougeType,
)

# Initializing evaluators
evaluators = {
    # GPT based metrics
    "coherence":    CoherenceEvaluator(model_config),
    "f1_score":     F1ScoreEvaluator(),
    "fluency":      FluencyEvaluator(model_config),
    "groundedness": GroundednessEvaluator(model_config),
    "relevance":    RelevanceEvaluator(model_config),
    "similarity":   SimilarityEvaluator(model_config),
    # Math metrics
    "bleu":         BleuScoreEvaluator(),
    "rouge_1":      RougeScoreEvaluator(RougeType.ROUGE_1),
    "rouge_2":      RougeScoreEvaluator(RougeType.ROUGE_2),
}
```
Note that we have previously executed the baseline and student models on the evaluation dataset, which means the JSONL file provided to the evaluate function already includes their responses. Consequently, further model invocations are unnecessary at this stage.
Recommendation: It’s often beneficial to run the baseline and student models once initially. By doing so, you can execute the evaluate function multiple times with various metrics configurations without re-incurring the inference time and costs associated with model executions. Note that while this avoids repeated inference expenses, using GPT-based metrics will still incur costs and time for each evaluate execution, as the Judge model is utilized.
```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test-results-[baseline|student].jsonl",
    evaluators=evaluators,
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.final_answer}",
                "ground_truth": "${data.gold_final_answer}",
                "context": "${data.context}",
            }
        }
    },
)
```
This command initiates a background process that hosts a user interface locally. Here is an example of its appearance:
The interface updates in real-time to display the progress of the scoring process on the evaluation dataset.
Additionally, you can click on each completed line to view the detailed trace of the calls. This feature is particularly useful for GPT-based metrics, as it reveals the system prompt used and provides insights into the underlying logic that contributed to the final score:
Comparing Metrics and Visualizing Results
Note: You can find the implementation details for generating the comparison figures of baseline and student metrics in the repository notebook. This resource provides comprehensive insights into how the metric comparisons were conducted, along with the code necessary to reproduce these visualizations.
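To give a flavor of what that comparison step looks like, here is a minimal sketch that tabulates per-metric deltas between two runs. The score dictionaries below are made-up placeholders; in practice they come from the aggregate metrics returned by each evaluate run:

```python
# Made-up aggregate scores standing in for the results of two
# evaluate() runs, one per model.
baseline = {"similarity.gpt_similarity": 4.2, "f1_score.f1_score": 0.58}
student = {"similarity.gpt_similarity": 4.5, "f1_score.f1_score": 0.71}

# Positive deltas mean the student improved on that metric
for name in baseline:
    delta = student[name] - baseline[name]
    print(f"{name:28} baseline={baseline[name]:.2f} "
          f"student={student[name]:.2f} delta={delta:+.2f}")
```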
Going further with continuous model evaluation and GenAIOps
This marks the beginning of a continuous improvement journey. It’s quite common to find that the student model’s initial performance does not meet expectations. Through our evaluation, we may uncover areas needing adjustment—whether it’s refining the synthetically generated dataset, optimizing fine-tuning parameters, or other elements. This initiates a cycle of iterative improvement and reassessment before the model is ready for deployment in production.
To help you navigate this process effectively, we came up with the GenAIOps Maturity Model, which serves as a comprehensive guide for evaluating your progress and maturity in operationalizing AI models.
Conclusion
By leveraging the Azure AI Evaluation Python SDK, we gain a detailed understanding of how our distilled student model compares to the baseline model across a spectrum of performance indicators. This structured evaluation framework not only helps in refining our models but also ensures that we are continuously improving and delivering robust AI solutions.
Explore, fork and clone the comprehensive 🚀🔥 GitHub Recipe Repository for complete code coverage on executing the full distillation process, including in-depth evaluations as detailed in this blog post. Discover step-by-step notebooks and resources to master the entire pipeline efficiently.
Stay tuned for more insights and tutorials on advanced AI topics and the latest tools available in the Azure ecosystem!