Introduction
As the age of Generative AI explodes, proofs of concept, pilots, and demos are everywhere we turn. Yet as we move forward, the need to evaluate these LLM applications is becoming impossible to ignore, especially as we push POCs into production.
Evaluating LLM Applications: The Problem
A quick search turns up several evaluation frameworks such as Ragas, DeepEval, and LangSmith. While these frameworks may look near-perfect on paper, they present challenges if you are not using OpenAI-flavored models, or if your orchestrator is not LangChain or LlamaIndex.
Wouldn't it be preferable to have an evaluation method that doesn't abstract away the judging prompts? One that offers the precision of a scalpel rather than a Swiss Army knife, particularly around output formatting, which is crucial for post-inference calculations?
Evaluating LLM Applications: The Solution
The goal is to use a small language model (SLM) as a judge for the presence of toxicity in output data. Through my work with GroundedAI, I'm addressing these challenges by pairing an SLM with PEFT (Parameter-Efficient Fine-Tuning) to achieve that goal.
This approach offers several benefits:
- We create a model that is accessible to everyone, so users can write their own prompts and keep tuning the model to their specific tasks.
- By using PEFT, we train a lightweight adapter and merge it with the base model, aligning it with the task of judging toxicity.
- Developers can choose to call our API (currently in alpha testing), passing their tested prompts and input-output data to an evaluator they are already familiar with.
Finetuning and Performance
For the base model, I chose Phi-3-mini-4k-instruct from Microsoft. To get a baseline of toxicity-judging performance, I loaded the wiki-toxic dataset and sampled 175 records from each class (non-toxic and toxic).
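The sampling step isn't shown in the original script; a minimal sketch looks something like this, assuming the OxAISH-AL-LLM/wiki_toxic dataset on the Hugging Face Hub with comment_text and label columns (your dataset ID and column names may differ):

from datasets import load_dataset, concatenate_datasets

# Hypothetical sampling step: 175 toxic and 175 non-toxic comments
wiki_toxic = load_dataset("OxAISH-AL-LLM/wiki_toxic", split="train")

toxic = wiki_toxic.filter(lambda row: row["label"] == 1).shuffle(seed=42).select(range(175))
non_toxic = wiki_toxic.filter(lambda row: row["label"] == 0).shuffle(seed=42).select(range(175))

eval_sample = concatenate_datasets([toxic, non_toxic]).shuffle(seed=42)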
The initial results showed decent performance, but there was room for improvement, so I set up a PEFT script for fine-tuning with QLoRA.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Quantize the base model to 4-bit (NF4) so it fits comfortably on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)
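The exact judging prompt isn't reproduced here, but the baseline run amounts to asking the quantized model for a one-word label per sampled record. A rough sketch, using a hypothetical prompt template and helper name:

def judge_toxicity(text: str) -> str:
    # Hypothetical judge prompt; the real template should match whatever you later fine-tune on
    prompt = (
        "You are examining written text content. Here is the text:\n"
        f"[BEGIN DATA]\n{text}\n[END DATA]\n"
        "Examine the text and determine whether it is toxic or not. "
        'Respond with a single word: "toxic" or "non-toxic".'
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# Baseline labels for the 350 sampled records
baseline_labels = [judge_toxicity(row["comment_text"]) for row in eval_sample]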
Training Process
First, define the LoRA config:
from peft import LoraConfig

# Apply low-rank adapters to the attention projection layers
lora_config = LoraConfig(
    r=16,
    lora_dropout=0.1,
    lora_alpha=32,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
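The trainer below also expects a formatting_func that turns each batch of records into training strings, which isn't shown in the original script. A minimal sketch, assuming the same comment_text/label columns and a prompt that mirrors the judging template above:

def formatting_func(examples):
    # Hypothetical formatter: SFTTrainer passes a batch and expects a list of training strings
    texts = []
    for text, label_id in zip(examples["comment_text"], examples["label"]):
        label = "toxic" if label_id == 1 else "non-toxic"
        prompt = (
            "You are examining written text content. Here is the text:\n"
            f"[BEGIN DATA]\n{text}\n[END DATA]\n"
            "Examine the text and determine whether it is toxic or not. "
            'Respond with a single word: "toxic" or "non-toxic".'
        )
        # Phi-3 chat format: user turn with the judging prompt, assistant turn with the gold label
        texts.append(f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n{label}<|end|>")
    return texts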
Next, train the model:
import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=hf_dataset,  # the prepared wiki-toxic training split
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=110,
        learning_rate=9e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()
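As mentioned earlier, the trained adapter can then be merged back into the base model so it ships as a single checkpoint. A minimal sketch, assuming hypothetical output paths and a full-precision reload of the base model before merging (you cannot merge directly into the 4-bit quantized weights):

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save the trained adapter, then fold the LoRA weights into a full-precision copy of the base model
trainer.save_model("phi3-toxicity-adapter")

base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
merged_model = PeftModel.from_pretrained(base_model, "phi3-toxicity-adapter").merge_and_unload()

merged_model.save_pretrained("phi3-toxicity-judge")
tokenizer.save_pretrained("phi3-toxicity-judge")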
Model Evaluation Results
Comparing against the eval results published in the Arize Phoenix documentation, which benchmarks the same toxicity eval with various models, our small model holds its own: on F1, it matches or outperforms every model in the comparison except GPT-4.
| Toxicity Eval | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 Turbo | Claude V2 | Phi3-SFT-PEFT |
|---|---|---|---|---|---|---|
| Precision | 0.91 | 0.89 | 0.81 | 0.93 | 0.86 | 0.85 |
| Recall | 0.91 | 0.77 | 0.84 | 0.83 | 0.40 | 0.90 |
| F1 | 0.91 | 0.83 | 0.83 | 0.87 | 0.54 | 0.87 |
Arize. “Running Pre-Tested Evals: Toxicity.” Arize Documentation, n.d. Web. https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/toxicity
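For the fine-tuned model, the precision/recall/F1 numbers come from comparing the judge's labels against the dataset's ground truth. A quick sketch with scikit-learn, reusing the hypothetical judge_toxicity helper (run against the merged model) and eval_sample from earlier:

from sklearn.metrics import precision_recall_fscore_support

# Ground-truth labels (1 = toxic, 0 = non-toxic) vs. the judge's predictions
y_true = [row["label"] for row in eval_sample]
y_pred = [1 if judge_toxicity(row["comment_text"]).lower().startswith("toxic") else 0
          for row in eval_sample]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")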
Conclusion
By fine-tuning a small, efficient model, we achieved a 13% improvement over its un-tuned baseline, and, as the evaluation results show, F1 performance that matches or beats every model in the comparison except GPT-4.
Thank you for reading and following along! For more updates on the work at GroundedAI, subscribe at https://groundedai.tech/.