Tuesday, July 2, 2024

Phi 3 – Small but Powerful Models from Microsoft

Introduction

The Phi models from Microsoft have been at the forefront of many open-source Large Language Models. The Phi architecture has led to many of the popular small open-source models we see today, including TPhixtral, Phi-DPO, and others. The Phi family has taken LLM architecture a step forward with the introduction of Small Language Models, arguing that these are sufficient to accomplish many different tasks. Now Microsoft has finally unveiled Phi 3, the next generation of Phi models, which further improves on the previous generation. In this article, we will go through Phi 3 and test it with different prompts.

Learning Objectives

  • Understand the advancements in the Phi 3 model compared to previous iterations.
  • Learn about the different variants of the Phi 3 model.
  • Explore the improvements in context length and performance achieved by Phi 3.
  • Recognize the benchmarks where Phi 3 surpasses other popular language models.
  • Understand how to download, initialize, and use the Phi 3 mini model.

This article was published as a part of the Data Science Blogathon.

Phi 3 – The Next Iteration of the Phi Family

Recently, Microsoft released Phi 3, showcasing its commitment to open source in the field of Artificial Intelligence. Microsoft has released two variants of Phi 3: one with a 4k context size and the other with a 128k context size. Both share the same architecture and a size of 3.8 Billion Parameters, called Phi 3 mini. Microsoft has also announced two larger variants of Phi, a 7 Billion version called Phi 3 Small and a 14 Billion version called Phi 3 Medium, though these are still in the training stages. All the Phi 3 models come in an instruct version and are thus ready to be deployed in chat applications.

Unique Features

  • Extended Context Length: Phi 3 increases the context length of the Large Language Model from 2k to 128k, facilitated by LongRope technology, with the default context length doubled to 4k.
  • Training Data Size and Quality: Phi 3 is trained on 3.3 Trillion tokens, featuring larger and more advanced datasets compared to Phi 2.
  • Model Variants:
    • Phi 3 Mini: Trained on 3.3 Trillion tokens, with a 32k vocabulary size and the same tokenizer as Llama 2.
    • Phi 3 Small (7B Version): Default context length of 8k, vocabulary size of 100k using the tiktoken tokenizer, and Grouped Query Attention with 4 Queries sharing 1 Key to reduce the memory footprint (see the sketch after this list).
  • Model Architecture: Incorporates Grouped Query Attention to optimize memory usage; training starts with pretraining and moves to supervised fine-tuning, aligned with Direct Preference Optimization for responsible AI outputs.
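For intuition, here is a minimal sketch of how Grouped Query Attention shares key/value heads across groups of query heads. The head counts and tensor shapes below are illustrative assumptions chosen to mirror the 4-queries-per-key ratio mentioned above, not the actual Phi 3 implementation.

import torch
import torch.nn.functional as F

# Toy dimensions (assumptions for illustration only)
batch, seq_len, head_dim = 1, 16, 64
n_query_heads, n_kv_heads = 8, 2           # 4 query heads share each KV head
group_size = n_query_heads // n_kv_heads   # = 4

q = torch.randn(batch, n_query_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each key/value head so every group of 4 query heads attends to it;
# the KV cache only ever stores the 2 original KV heads, saving memory.
k = k.repeat_interleave(group_size, dim=1)   # -> (1, 8, 16, 64)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 16, 64])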

Benchmarks – Phi 3

Coming to the benchmarks, Phi 3 mini, i.e. the 3.8 Billion Parameter model, has overtaken Gemma 7B from Google. It scores 68.8 on MMLU and 76.7 on HellaSwag, exceeding Gemma, which scores 63.6 on MMLU and 49.8 on HellaSwag, and even the Mistral 7B model, which scores 61.7 on MMLU and 58.5 on HellaSwag. Phi-3 has even surpassed the recently released Llama 3 8B model on both of these benchmarks.

It also surpasses these and other models on other popular evaluation tests like WinoGrande, TruthfulQA, HumanEval, and others. In the table below, we can compare the scores of the Phi 3 family of models with other popular open-source large language models.

[Table: benchmark scores of the Phi 3 family compared with other popular open-source LLMs]

Getting Started with Phi 3

To get started with Phi-3, we need to follow certain steps. Let us dive deeper into each step.

Step 1: Downloading Libraries

Let's start by downloading the following libraries.

!pip install -q transformers huggingface-cli bitsandbytes accelerate
  • transformers – We need this library to download the Large Language Models and work with them
  • huggingface-cli – We need to log in to HuggingFace so that we can work with the official HuggingFace model (a login sketch is shown after this list)
  • bitsandbytes – We cannot directly run the full-precision 3.8 Billion Parameter model on the free GPU instance of Colab, hence we need this library to quantize the LLM to 4-bit
  • accelerate – We need this to speed up GPU inference for the Large Language Models
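If the model or your environment requires authentication, you can log in programmatically. This is a minimal sketch assuming you already have a Hugging Face access token; the token string below is a placeholder.

from huggingface_hub import login

# Log in to the Hugging Face Hub; replace the placeholder with your own token.
login(token="hf_...")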

Now, before we start downloading the model, we need to define our quantization config. This is because we cannot load the entire full-precision model within the free Google Colab GPU, and even if it fit, inference would be slow. So, we will quantize our model to 4-bit precision and then work with it.

Step 2: Defining the Quantization Configuration

The configuration for this quantization can be seen below:

import torch
from transformers import BitsAndBytesConfig


config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
  • Here we start by importing torch and the BitsAndBytesConfig class from the transformers library.
  • Then we create an instance of this BitsAndBytesConfig class and save it to a variable called config.
  • While creating this instance, we give it the following parameters.
  • load_in_4bit: This tells that we want to quantize our model into the 4-bit precision format, which will drastically reduce the size of the model.
  • bnb_4bit_quant_type: This specifies the type of 4-bit quantization we wish to work with. Here we go with the normal float, called nf4, which is proven to give better results.
  • bnb_4bit_use_double_quant: Setting this to True will quantize the quantization constants that are internal to BitsAndBytes, which will further reduce the size of the model.
  • bnb_4bit_compute_dtype: Here we specify the datatype we will work with when computing the forward pass through the model. For Colab, we can set it to brain float16, called bfloat16, which tends to give better results than the regular float16.

Running this code will create our quantization configuration.
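As a rough back-of-the-envelope estimate (ignoring embeddings, layers kept in higher precision, and the small quantization constants), 4-bit quantization cuts the weight storage to roughly a quarter of the fp16 footprint:

# Rough estimate of weight memory for a 3.8B-parameter model.
# Real memory usage differs because some layers stay in higher precision
# and double quantization adds small per-block constants.
params = 3.8e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter  -> ~7.6 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per parameter -> ~1.9 GB
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")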

Step 3: Downloading the Model

Now, we are ready to download the model and quantize it with the quantization configuration defined above. The code for this is:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    quantization_config=config
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
  • Here we start by importing AutoModelForCausalLM and AutoTokenizer from the transformers library.
  • We pass the name of the model we will work with, which here is the Phi-3-mini 4k Instruct version.
  • Then we create an instance through AutoModelForCausalLM.from_pretrained(), passing it the model name, the device map, which will place the model on the GPU, and the quantization config that we have just created.
  • In a similar way, we create a tokenizer object with the same model name.

Running this code will download the Phi-3 mini 4k context instruct LLM and then quantize it to the 4-bit level based on the configuration we have provided. After that, the tokenizer is downloaded as well.
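To sanity-check that quantization actually reduced the model size, we can query the in-memory footprint of the loaded weights; the exact number will vary with the configuration.

# Report approximately how much memory the quantized weights occupy.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")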

Step 4: Testing Phi-3-mini

Now we will test the Phi-3-mini. For this, the code is:

messages = [
    {"role": "user", "content": "A clock shows 12:00 p.m. now. How many degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "The minute hand moves 360 degrees in one hour (60 minutes). Therefore, in 15 minutes, it will move (15/60) * 360 degrees = 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand move in 15 minutes?"}
]

model_inputs = tokenizer.apply_chat_template(messages,
                                             return_tensors="pt").to("cuda")

output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])
  • First, we create a list of messages. This is a list of dictionaries, each containing two key-value pairs, where the keys are role and content.
  • The role tells whether the message is from the user or the assistant, and the content is the actual message.
  • Here we create a conversation about angles between the hands of a clock. In the last message from the user, we ask a question about the angle made by the hour hand.
  • Then we apply a chat template to this conversation. The chat template is essential for the model to understand, because the instruct data the model was trained on contains this chat template formatting.
  • We need the corresponding tensors for this conversation, and we move them to CUDA for faster processing.
  • Now model_inputs contains the tokenized conversation.
  • These model_inputs are passed to the model.generate() function, which takes these tokens along with some additional parameters, like the number of new tokens to generate, which we set to 1000, and do_sample, which will sample from the high-probability tokens.
  • Finally, we decode the output generated by the Large Language Model to convert the tokens back to English text.

Hence, running this code will take in the list of messages, do the proper formatting by applying the chat template, convert them into tokens, pass them to the generate function to produce the response, and finally decode the generated tokens back into English text.
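Note that pipeline was imported earlier but not used. As an alternative, the same chat interaction can be wrapped in a text-generation pipeline; this is a minimal sketch assuming the model and tokenizer loaded above and a recent transformers version that accepts chat-style message lists.

# Optional alternative: wrap the quantized model and tokenizer in a pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

response = pipe(messages, max_new_tokens=1000, do_sample=True)
print(response[0]["generated_text"][-1]["content"])  # last assistant turn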

Output

Running this code produced the following output.

[Output of Phi 3 for the clock question]

Looking at the generated output, the model has answered the question correctly, with a very detailed approach similar to a chain of thought. The model starts by explaining how the minute hand moves and how the hour hand moves per hour. From there, it calculates the necessary intermediate result and then goes on to solve the actual user question.

Implementation with Another Question

Now let's try another question.

messages = [
    {"role": "user", "content": "If a plane crashes on the border of the United States and Canada, where do they bury the survivors?"},
]

model_inputs = tokenizer.apply_chat_template(messages,
                                             return_tensors="pt").to("cuda")

output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])
"

In the above example, we asked the Phi 3 LLM a tricky question, and it was able to provide a fairly convincing answer. The LLM caught the tricky part: we cannot bury the survivors, because survivors are alive, so there is no one to bury at all. Let's try another tricky question and check the generated output.

messages = [
    {"role": "user", "content": "How many smartphones can a human eat?"},
]

model_inputs = tokenizer.apply_chat_template(messages,
                                             return_tensors="pt").to("cuda")

output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])
[Output of Phi 3 for the smartphone question]

Here we asked Phi-3-mini another tricky question, about how many smartphones a human can eat. This tests the Large Language Model's common sense ability. The Phi-3 LLM caught this by pointing out that the question rests on a misunderstanding. This suggests that Phi-3-mini was trained on a quality dataset containing a good mix of common sense, reasoning, and maths.

Conclusion

Phi-3 represents Microsoft's next generation of Phi models, bringing significant advancements over Phi-2. It boasts a drastically increased context length, reaching up to 128k tokens with minimal performance impact. Moreover, Phi-3 is trained on a much larger and more comprehensive dataset compared to its predecessor. Benchmarks indicate that Phi-3 outperforms other popular models on various tasks, demonstrating its effectiveness. With its capability to handle complex questions and incorporate common sense reasoning, Phi-3 holds great promise for various applications.

Key Takeaways

  • Phi 3 performs well in practical scenarios, handling tricky and ambiguous questions effectively.
  • Model Variants: Different versions of Phi 3 include Mini (3.8B), Small (7B), and Medium (14B), providing options for various use cases.
  • Phi 3 surpasses other open-source models on key benchmarks like MMLU and HellaSwag.
  • Compared to the previous model, Phi 2, the default context size of Phi 3 is doubled to 4k, and with the LongRope method the context length is further extended to 128k with very little degradation in performance.
  • Phi 3 is trained on 3.3 Trillion tokens from highly curated datasets; it was supervised fine-tuned and then aligned with Direct Preference Optimization.

Frequently Asked Questions

Q1. What kind of prompts can I use with Phi 3?

A. Phi 3 models are trained on data with a specific chat template format. So, it is recommended to use the same format when providing prompts or questions to the model. This template can be applied by calling apply_chat_template.
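To inspect what the formatted prompt actually looks like, you can render the template as text instead of tokens. A minimal sketch, assuming the tokenizer loaded earlier in this article:

# Render the chat template as plain text to see the special tokens
# (e.g. <|user|>, <|assistant|>, <|end|>) that the instruct model expects.
prompt_text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt_text)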

Q2. What is Phi 3 and which models are part of its family?

A. Phi 3 is the next generation of Phi models from Microsoft, part of a family that includes Phi 3 mini, Small, and Medium. The mini version is a 3.8 Billion Parameter model, while the Small is a 7 Billion Parameter model and the Medium is a 14 Billion Parameter model.

Q3. Can I use Phi 3 for free?

A. Yes, Phi 3 models are available for free through the Hugging Face platform. Right now, only Phi 3 mini, i.e. the 3.8 Billion Parameter model, is available on HuggingFace. This model can be used for commercial applications too, based on the given license.

Q4. How well does Phi 3 handle tricky questions?

A. Phi 3 shows promising results with common sense reasoning. The examples provided demonstrate that Phi 3 can answer tricky questions that involve humor or logic.

Q5. Are there any changes to the tokenizers in the new Phi family of models?

A. Yes. While Phi 3 Mini still works with the regular Llama 2 tokenizer, with a vocabulary size of 32k, the new Phi 3 Small model gets a new tokenizer where the vocabulary size is extended to 100k tokens.
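A quick way to confirm the vocabulary size of the Phi 3 Mini tokenizer loaded earlier (the count from len() also includes a few added special tokens):

# Base vocabulary size vs. total including added special tokens.
print(tokenizer.vocab_size, len(tokenizer))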
