Today, we're announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.
At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce the time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.
SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.
With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.
SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.
You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.
After cloning the repository, edit the recipe config.yaml file to specify the model and cluster type.
$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml
The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, logs, and so on.
defaults:
- cluster: slurm # supported: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of the model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs, etc.
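A quick pre-flight check can catch a mistyped cluster value or an empty field before the launcher runs. The sketch below is illustrative only: `validate_launcher_config` is a hypothetical helper, not part of the recipes repository, and it operates on a plain dictionary that mirrors the fields shown above. The results path is a placeholder.

```python
# Hypothetical pre-flight check for the launcher settings shown above.
# The allowed cluster values come from the comment in config.yaml;
# everything else here is an assumption made for illustration.
SUPPORTED_CLUSTERS = {"slurm", "k8s", "sm_jobs"}


def validate_launcher_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    cluster = cfg.get("cluster")
    if cluster not in SUPPORTED_CLUSTERS:
        problems.append(f"unsupported cluster: {cluster!r}")
    if not cfg.get("recipes"):
        problems.append("no recipe selected")
    if not cfg.get("instance_type"):
        problems.append("instance_type is empty")
    if not cfg.get("base_results_dir"):
        problems.append("base_results_dir is empty")
    return problems


config = {
    "cluster": "slurm",
    "recipes": "fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    "instance_type": "ml.p5.48xlarge",
    "base_results_dir": "/fsx/results",  # placeholder path, not from the post
}
print(validate_launcher_config(config))
```

Returning a list of problems rather than raising on the first one lets you fix every issue in a single pass over the file.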
You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.
run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params
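Two values in this excerpt are worth working out by hand: the total number of accelerators the job uses, and the directory that `results_dir` resolves to. The following sketch copies the numbers from the recipe above. The base directory is a placeholder, and the string formatting only mimics the `${base_results_dir}/${.name}` interpolation, which is resolved when the configuration is loaded.

```python
# Values copied from the recipe excerpt above.
trainer = {"devices": 8, "num_nodes": 2}
run = {"name": "llama-405b"}

# Total accelerators taking part in the job: devices per node times nodes.
world_size = trainer["devices"] * trainer["num_nodes"]

# results_dir is written as ${base_results_dir}/${.name} in the recipe.
# Here the same substitution is spelled out manually with a placeholder
# base directory (an assumption, not a path from the post).
base_results_dir = "/fsx/results"
results_dir = f"{base_results_dir}/{run['name']}"

print(world_size)   # 16
print(results_dir)  # /fsx/results/llama-405b
```

With two ml.p5.48xlarge nodes of 8 GPUs each, the job therefore spans 16 GPUs.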
To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instructions.
Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to inspect the content before starting the training job.
$ python3 main.py --config-path recipes_collection --config-name=config
After training completes, the trained model is automatically saved to your assigned data location.
To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster, and use the HyperPod Command Line Interface (CLI) to run the recipe.
$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
  --persistent-volume-claims fsx-claim:data \
  --override-parameters \
  '{
    "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
    "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
    "cluster": "k8s",
    "cluster_type": "k8s",
    "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
    "recipes.model.data.train_dir": "<your_train_data_dir>",
    "recipes.model.data.val_dir": "<your_val_data_dir>"
  }'
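Hand-written JSON inside a shell command is easy to get wrong (a stray trailing comma or a missing quote is enough). One way to sidestep that, sketched below, is to build the override dictionary in Python and let `json.dumps` and `shlex.join` handle the serialization and quoting. The flags and keys mirror the command above; the angle-bracket values remain placeholders, and the script only prints the command rather than running it.

```python
import json
import shlex

recipe = "fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora"

# Keys mirror the --override-parameters example above; the angle-bracket
# values are placeholders to replace with your own settings.
overrides = {
    "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
    "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
    "cluster": "k8s",
    "cluster_type": "k8s",
    "container": "658645717510.dkr.ecr.<region>.amazonaws.com/"
                 "smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
    "recipes.model.data.train_dir": "<your_train_data_dir>",
    "recipes.model.data.val_dir": "<your_val_data_dir>",
}

argv = [
    "hyperpod", "start-job",
    "--recipe", recipe,
    "--persistent-volume-claims", "fsx-claim:data",
    "--override-parameters", json.dumps(overrides),
]

# shlex.join quotes each argument so the line can be pasted into a shell.
command = shlex.join(argv)
print(command)
```

The same `argv` list could be passed to `subprocess.run` directly, which avoids shell quoting altogether.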
You can also run recipes on SageMaker training jobs using the SageMaker Python SDK. The following example runs PyTorch training scripts on SageMaker training jobs and overrides parts of the training recipe.
...
from sagemaker.pytorch import PyTorch

# Point the recipe at the paths SageMaker training jobs use inside the container.
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
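To make the effect of nested overrides concrete, here is a small stand-alone illustration. It assumes overrides are merged key by key into the matching sections of the recipe, leaving untouched fields as they were; the SDK's actual merge logic is not shown in this post, and `apply_overrides` and the trimmed `base_recipe` are hypothetical.

```python
import copy


def apply_overrides(recipe: dict, overrides: dict) -> dict:
    """Return a copy of `recipe` with `overrides` merged in, key by key."""
    merged = copy.deepcopy(recipe)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge recursively.
            merged[key] = apply_overrides(merged[key], value)
        else:
            # Leaf value (or new key): the override wins.
            merged[key] = copy.deepcopy(value)
    return merged


# A trimmed stand-in for a recipe; field values are illustrative.
base_recipe = {
    "run": {"name": "llama-405b", "results_dir": "/fsx/results/llama-405b"},
    "model": {"train_batch_size": 4, "data": {"train_dir": None, "val_dir": None}},
}
overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "model": {"data": {"train_dir": "/opt/ml/input/data/train"}},
}

merged = apply_overrides(base_recipe, overrides)
print(merged["run"])
```

Under this assumption, only `results_dir` and `train_dir` change, while `name`, `train_batch_size`, and `val_dir` keep their recipe values.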
As training progresses, model checkpoints are stored in Amazon Simple Storage Service (Amazon S3) by the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.
Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.
Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.
— Channy