I’m happy to share that Amazon SageMaker Clarify now supports foundation model (FM) evaluation (preview). As a data scientist or machine learning (ML) engineer, you can now use SageMaker Clarify to evaluate, compare, and select FMs in minutes based on metrics such as accuracy, robustness, creativity, factual knowledge, bias, and toxicity. This new capability adds to SageMaker Clarify’s existing ability to detect bias in ML data and models and explain model predictions.
The new capability provides both automatic and human-in-the-loop evaluations for large language models (LLMs) anywhere, including LLMs available in SageMaker JumpStart, as well as models trained and hosted outside of AWS. This removes the heavy lifting of finding the right model evaluation tools and integrating them into your development environment. It also simplifies the complexity of trying to adapt academic benchmarks to your generative artificial intelligence (AI) use case.
Evaluate FMs with SageMaker Clarify
With SageMaker Clarify, you now have a single place to evaluate and compare any LLM based on predefined criteria during model selection and throughout the model customization workflow. In addition to automatic evaluation, you can also use the human-in-the-loop capabilities to set up human reviews for more subjective criteria, such as helpfulness, creative intent, and style, by using your own workforce or a managed workforce from SageMaker Ground Truth.
To get started with model evaluations, you can use curated prompt datasets that are purpose-built for common LLM tasks, including open-ended text generation, text summarization, question answering (Q&A), and classification. You can also extend the model evaluation with your own custom prompt datasets and metrics for your specific use case. Human-in-the-loop evaluations can be used for any task and evaluation metric. After each evaluation job, you receive an evaluation report that summarizes the results in natural language and includes visualizations and examples. You can download all metrics and reports and also integrate model evaluations into SageMaker MLOps workflows.
In SageMaker Studio, you can find Model evaluation under Jobs in the left menu. You can also select Evaluate directly from the model details page of any LLM in SageMaker JumpStart.
Select Evaluate a model to set up the evaluation job. The UI wizard will guide you through the selection of automatic or human evaluation, model(s), relevant tasks, metrics, prompt datasets, and review teams.
Once the model evaluation job is complete, you can view the results in the evaluation report.
In addition to the UI, you can also start with example Jupyter notebooks that walk you through step-by-step instructions on how to programmatically run model evaluation in SageMaker.
Evaluate models anywhere with the FMEval open source library
To run model evaluation anywhere, including models trained and hosted outside of AWS, use the FMEval open source library. The following example demonstrates how to use the library to evaluate a custom model by extending the ModelRunner class.
For this demo, I choose GPT-2 from the Hugging Face model hub and define a custom HFModelConfig and HuggingFaceCausalLLMModelRunner class that works with causal decoder-only models from the Hugging Face model hub such as GPT-2. The example is also available in the FMEval GitHub repo.
!pip install fmeval

# ModelRunners invoke FMs
from fmeval.model_runners.model_runner import ModelRunner

# Additional imports for custom model
import warnings
from dataclasses import dataclass
from typing import Tuple, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class HFModelConfig:
    model_name: str
    max_new_tokens: int
    normalize_probabilities: bool = False
    seed: int = 0
    remove_prompt_from_generated_text: bool = True


class HuggingFaceCausalLLMModelRunner(ModelRunner):
    def __init__(self, model_config: HFModelConfig):
        self.config = model_config
        self.model = AutoModelForCausalLM.from_pretrained(self.config.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name)

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Tokenize the prompt and generate a continuation
        input_ids = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        generations = self.model.generate(
            **input_ids,
            max_new_tokens=self.config.max_new_tokens,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # Check whether the generation starts with the prompt tokens
        generation_contains_input = (
            input_ids["input_ids"][0] == generations[0][: input_ids["input_ids"].shape[1]]
        ).all()
        if self.config.remove_prompt_from_generated_text and not generation_contains_input:
            warnings.warn(
                "Your model does not return the prompt as part of its generations. "
                "`remove_prompt_from_generated_text` does nothing."
            )
        if self.config.remove_prompt_from_generated_text and generation_contains_input:
            # Keep only the newly generated tokens
            output = self.tokenizer.batch_decode(generations[:, input_ids["input_ids"].shape[1] :])[0]
        else:
            output = self.tokenizer.batch_decode(generations, skip_special_tokens=True)[0]

        with torch.inference_mode():
            input_ids = self.tokenizer(self.tokenizer.bos_token + prompt, return_tensors="pt")["input_ids"]
            input_ids = input_ids.to(self.model.device)
            model_output = self.model(input_ids, labels=input_ids)
            # The first model output is the mean token loss; negating it gives a log-likelihood score
            likelihood = -model_output[0].item()

        return output, likelihood

Next, create an instance of HFModelConfig and HuggingFaceCausalLLMModelRunner with the model information.
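As a minimal sketch of this step, the configuration could be created as follows. The dataclass is repeated here only so the snippet runs on its own. The model name gpt2 matches the demo, while max_new_tokens=32 is an assumed value chosen for illustration. The runner line is left commented out because it downloads the model weights.

```python
from dataclasses import dataclass


@dataclass
class HFModelConfig:
    # Same fields as the HFModelConfig defined earlier in this post
    model_name: str
    max_new_tokens: int
    normalize_probabilities: bool = False
    seed: int = 0
    remove_prompt_from_generated_text: bool = True


# "gpt2" is the Hugging Face hub ID for GPT-2; 32 new tokens is an assumed value
hf_config = HFModelConfig(model_name="gpt2", max_new_tokens=32)
print(hf_config)

# With the runner class from above in scope, the model would be loaded like this:
# model = HuggingFaceCausalLLMModelRunner(model_config=hf_config)
```

A small max_new_tokens keeps each prediction fast, which matters when the evaluation loops over a whole dataset.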
Then, select and configure the evaluation algorithm.
Let’s first test with one sample. The evaluation score is the percentage of factually correct responses.
Though it’s not an ideal response, it contains “UK.”
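To make the scoring rule concrete, here is a simplified stand-in for a factual knowledge check. It is not the FMEval implementation. It assumes that acceptable answers are packed into one string separated by an <OR> delimiter and that a response scores 1 if it contains any of them, ignoring case. The function names, the delimiter, and the sample strings are all illustrative assumptions.

```python
from typing import List


def factual_knowledge_score(model_output: str, target_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any acceptable answer appears in the model output, else 0."""
    answers = [a.strip().lower() for a in target_output.split(delimiter) if a.strip()]
    response = model_output.lower()
    return int(any(answer in response for answer in answers))


def dataset_score(scores: List[int]) -> float:
    """Fraction of responses judged factually correct."""
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical response to the prompt "London is the capital of"
response = "the UK, and the UK is the largest producer of food in the world."
sample = factual_knowledge_score(response, "UK<OR>England<OR>United Kingdom")
print(sample)  # 1, because the response contains "UK"

print(dataset_score([sample, 0, 1, 1]))  # 0.75
```

Plain substring matching is forgiving by design, which is why an imperfect response can still count as correct. It can also produce false positives (for example, "uk" inside "Ukraine"), so a stricter check may be worth considering for short answers.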
Next, you can evaluate the FM using built-in datasets or define your custom dataset. If you want to use a custom evaluation dataset, create an instance of DataConfig.
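A custom dataset is often stored as JSON Lines, with one record per line. The sketch below writes such a file using only the standard library. The file name dataset.jsonl and the field names question and answer are assumptions for illustration. Whatever names you choose are the ones the DataConfig instance would need to reference for the model input and the target output.

```python
import json
import os
import tempfile

# Hypothetical records: a prompt plus the acceptable answers for it
records = [
    {"question": "London is the capital of", "answer": "UK<OR>England<OR>United Kingdom"},
    {"question": "Paris is the capital of", "answer": "France"},
]

# Write one JSON object per line into a temporary directory
dataset_path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
with open(dataset_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm that every line parses as JSON
with open(dataset_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), "records written to", dataset_path)
```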
The evaluation results will return a combined evaluation score across the dataset and detailed results for each model input stored in a local output path.
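The shape of that output can be sketched with plain Python: one aggregate number plus a file of per-record details. This is an illustration under stated assumptions, not the format FMEval writes. The record fields, the sample values, and the output file name are hypothetical.

```python
import json
import os
import tempfile
from statistics import mean

# Hypothetical per-record results; field names are illustrative
results = [
    {"model_input": "London is the capital of", "model_output": "the UK", "score": 1},
    {"model_input": "Paris is the capital of", "model_output": "a region", "score": 0},
]

# Combined score across the dataset: the mean of the per-record scores
combined_score = mean(r["score"] for r in results)

# Detailed results: one JSON line per model input in a local output path
output_path = os.path.join(tempfile.mkdtemp(), "factual_knowledge_results.jsonl")
with open(output_path, "w", encoding="utf-8") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")

print(f"combined score: {combined_score}")
print(f"detailed results: {output_path}")
```

Keeping the per-record file alongside the aggregate makes it possible to inspect exactly which prompts the model got wrong.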
Join the preview
FM evaluation with Amazon SageMaker Clarify is available today in public preview in AWS Regions US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland). The FMEval open source library is available on GitHub. To learn more, visit Amazon SageMaker Clarify.
Get started
Log in to the AWS Management Console and start evaluating your FMs with SageMaker Clarify today!
— Antje