Introduction
As the age of Generative AI explodes, proofs of concept, pilots, and demos are everywhere we turn. Yet as we move forward, the need to evaluate these LLM applications is becoming impossible to ignore, especially as we push POCs into production.
Evaluating LLM Applications: The Problem
A quick search turns up several evaluation frameworks such as Ragas, DeepEval, and LangSmith. While these frameworks may look near-perfect on paper, they present challenges if you are not using OpenAI-flavored models, or if your orchestrator is not LangChain or LlamaIndex.
Wouldn't it be preferable to have an evaluation method that doesn't abstract away the judging prompts? One that offers the precision of a scalpel rather than a Swiss Army knife, particularly around output formatting, which is crucial for post-inference calculations?
Evaluating LLM Applications: The Solution
The goal is to use a small language model (SLM) as a judge for the presence of toxicity in output data. Through my work with GroundedAI, I'm addressing these challenges by pairing an SLM with PEFT (Parameter-Efficient Fine-Tuning) to achieve that goal.
This approach offers several benefits:
- We create a model that is accessible to everyone, so users can write their own prompts and keep tuning the model to their specific tasks.
- By using PEFT, we train a lightweight adapter and merge it with the base model, aligning it with the task of judging toxicity.
- Developers can choose to call our API (currently in alpha testing), passing their tested prompts and input-output data to an evaluator they are already familiar with.
Finetuning and Performance
For the base model, I chose Phi-3-mini-4k-instruct from Microsoft. To get a baseline of toxicity-judging performance, I loaded the wiki-toxic dataset and sampled 175 records from each class (non-toxic and toxic).
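The sampling step isn't shown in the original script; a minimal sketch looks something like this, assuming the OxAISH-AL-LLM/wiki_toxic dataset on the Hugging Face Hub with comment_text and label columns (your dataset ID and column names may differ):

from datasets import load_dataset, concatenate_datasets

# Hypothetical sampling step: 175 toxic and 175 non-toxic comments
wiki_toxic = load_dataset("OxAISH-AL-LLM/wiki_toxic", split="train")

toxic = wiki_toxic.filter(lambda row: row["label"] == 1).shuffle(seed=42).select(range(175))
non_toxic = wiki_toxic.filter(lambda row: row["label"] == 0).shuffle(seed=42).select(range(175))

eval_sample = concatenate_datasets([toxic, non_toxic]).shuffle(seed=42)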
The initial results showed decent performance, but there was room for improvement, so I set up a PEFT script for fine-tuning with QLoRA.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Quantize the base model to 4-bit (NF4) so it fits comfortably on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)
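The exact judging prompt isn't reproduced here, but the baseline run amounts to asking the quantized model for a one-word label per sampled record. A rough sketch, using a hypothetical prompt template and helper name:

def judge_toxicity(text: str) -> str:
    # Hypothetical judge prompt; the real template should match whatever you later fine-tune on
    prompt = (
        "You are examining written text content. Here is the text:\n"
        f"[BEGIN DATA]\n{text}\n[END DATA]\n"
        "Examine the text and determine whether it is toxic or not. "
        'Respond with a single word: "toxic" or "non-toxic".'
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# Baseline labels for the 350 sampled records
baseline_labels = [judge_toxicity(row["comment_text"]) for row in eval_sample]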
Training Process
First, define the LoRA config:
from peft import LoraConfig

# Apply low-rank adapters to the attention projection layers
lora_config = LoraConfig(
    r=16,
    lora_dropout=0.1,
    lora_alpha=32,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
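The trainer below also expects a formatting_func that turns each batch of records into training strings, which isn't shown in the original script. A minimal sketch, assuming the same comment_text/label columns and a prompt that mirrors the judging template above:

def formatting_func(examples):
    # Hypothetical formatter: SFTTrainer passes a batch and expects a list of training strings
    texts = []
    for text, label_id in zip(examples["comment_text"], examples["label"]):
        label = "toxic" if label_id == 1 else "non-toxic"
        prompt = (
            "You are examining written text content. Here is the text:\n"
            f"[BEGIN DATA]\n{text}\n[END DATA]\n"
            "Examine the text and determine whether it is toxic or not. "
            'Respond with a single word: "toxic" or "non-toxic".'
        )
        # Phi-3 chat format: user turn with the judging prompt, assistant turn with the gold label
        texts.append(f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n{label}<|end|>")
    return texts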
Next, train the model:
import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=hf_dataset,  # the prepared wiki-toxic training split
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=110,
        learning_rate=9e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()
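As mentioned earlier, the trained adapter can then be merged back into the base model so it ships as a single checkpoint. A minimal sketch, assuming hypothetical output paths and a full-precision reload of the base model before merging (you cannot merge directly into the 4-bit quantized weights):

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save the trained adapter, then fold the LoRA weights into a full-precision copy of the base model
trainer.save_model("phi3-toxicity-adapter")

base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
merged_model = PeftModel.from_pretrained(base_model, "phi3-toxicity-adapter").merge_and_unload()

merged_model.save_pretrained("phi3-toxicity-judge")
tokenizer.save_pretrained("phi3-toxicity-judge")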
Model Evaluation Results
Comparing against the eval results published in the Arize Phoenix documentation, which benchmarks the same toxicity eval with various models, our small model holds its own: on F1, it matches or outperforms every model in the comparison except GPT-4.
| Toxicity Eval | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 Turbo | Claude V2 | Phi3-SFT-PEFT |
|---|---|---|---|---|---|---|
| Precision | 0.91 | 0.89 | 0.81 | 0.93 | 0.86 | 0.85 |
| Recall | 0.91 | 0.77 | 0.84 | 0.83 | 0.40 | 0.90 |
| F1 | 0.91 | 0.83 | 0.83 | 0.87 | 0.54 | 0.87 |
Arize. “Running Pre-Tested Evals: Toxicity.” Arize Documentation, n.d. Web. https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/toxicity
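For the fine-tuned model, the precision/recall/F1 numbers come from comparing the judge's labels against the dataset's ground truth. A quick sketch with scikit-learn, reusing the hypothetical judge_toxicity helper (run against the merged model) and eval_sample from earlier:

from sklearn.metrics import precision_recall_fscore_support

# Ground-truth labels (1 = toxic, 0 = non-toxic) vs. the judge's predictions
y_true = [row["label"] for row in eval_sample]
y_pred = [1 if judge_toxicity(row["comment_text"]).lower().startswith("toxic") else 0
          for row in eval_sample]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")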
Conclusion
By fine-tuning a small, efficient model, we achieved a 13% improvement over its un-tuned baseline, and, as the evaluation results show, F1 performance that matches or beats every model in the comparison except GPT-4.
Thank you for reading and following along! For more updates on the work at GroundedAI, subscribe at https://groundedai.tech/.