I Tried Fine-Tuning DeepSeek-R1 on SageMaker HyperPod

森田力

2025.02.03

Hello, this is Morita.

DeepSeek is getting a lot of attention right now, and work is under way to make DeepSeek models usable on AWS as well.

As part of that, SageMaker HyperPod recipes now appear to support DeepSeek-R1 Distill Llama, so in this article I try fine-tuning it.

Hands-On

This walkthrough is based on the SageMaker HyperPod workshop.

Reference

I also referred to a related article while working through the workshop.

Prerequisites

Service quotas

The fine-tuning in this article uses the ml.g6e.48xlarge instance type.

You therefore need to request an increase to the ml.g6e.48xlarge for cluster usage quota in advance.
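To confirm the current quota from the command line rather than the console, a sketch like the following may help. It assumes the quota's display name is exactly `ml.g6e.48xlarge for cluster usage` as quoted above and that the cluster region is us-west-2; `quota_value` and `quota_is_sufficient` are hypothetical helper names.

```shell
# Print the applied value of a SageMaker quota, looked up by its display name.
# Assumption: the name matches the Service Quotas console text exactly.
# JSON output is used so that --query runs on the aggregated result, not per page.
quota_value() {
  quota_name="$1"
  region="${2:-us-west-2}"
  aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --region "$region" \
    --query "Quotas[?QuotaName=='${quota_name}'].Value | [0]" \
    --output json
}

# Succeed only when the current quota covers the number of instances needed.
# awk handles decimal values such as "1.0"; a missing quota ("null") counts as 0.
quota_is_sufficient() {
  needed="$1"
  current="$2"
  awk -v c="$current" -v n="$needed" 'BEGIN { exit !(c + 0 >= n + 0) }'
}
```

For example, `quota_is_sufficient 1 "$(quota_value 'ml.g6e.48xlarge for cluster usage')"` returns a non-zero status while the quota is still too low for one worker instance.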

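Local environment

The steps are run from the command line, so the AWS CLI is assumed to be installed.

The cluster creation script also runs Python, so create a virtual environment and install the required libraries.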

python3 -m venv venv
source venv/bin/activate
pip install boto3 jsonschema

Connections to the instances go through Session Manager, so install the Session Manager plugin for the AWS CLI beforehand.
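As a quick sanity check of these prerequisites, the sketch below reports which required commands are not on PATH. `missing_commands` is a hypothetical helper, and it assumes the plugin's executable is named `session-manager-plugin`.

```shell
# Print the name of every required command that is not found on PATH.
# Prints nothing when all of them are available.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}
```

Running `missing_commands aws session-manager-plugin python3` should print nothing once everything is installed.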

Preparing the dataset

This article assumes tokenized_dataset.tar has already been prepared in advance by following a separate article.

Creating the cluster

Use "Easy Cluster Setup" from the workshop to create the cluster interactively.

Download the script and run it.

# Create a working directory
mkdir hyperpod && cd hyperpod

# Download the cluster creation script
curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/refs/heads/main/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/automate-cluster-creation.sh

# Run the script
bash automate-cluster-creation.sh

The script prompts for several values interactively.

Region

This time the cluster is created in us-west-2.

🌎 AWS Region Configuration
Please confirm that your AWS region is ap-northeast-1 (default).
If not, enter the AWS region where you want to set up your cluster (e.g., us-west-2):
> us-west-2
✅ Region updated to: us-west-2

controller instance type

Specify ml.c5.xlarge as the instance type.

Enter the name for the controller instance group [controller-machine]: 
Enter the instance type for the controller [ml.m5.12xlarge]: ml.c5.xlarge

worker instance type

Specify ml.g6e.48xlarge as the instance type, matching the service quota, cluster-config.json, and launcher script used in this article.

=== Worker Group Configuration ===
Do you want to add a worker instance group? (yes/no): [yes]: 
Configuring Worker Group 1
Enter the instance type for worker group 1 [ml.c5.4xlarge]: ml.g6e.48xlarge

Accept the defaults for everything else and let cluster creation complete.

Output on success

✅ Creating cluster for you!
✅ Cluster creation request submitted successfully. To monitor the progress of cluster creation, you can either check the SageMaker console, or you can run:.
watch -n 1 aws sagemaker list-clusters --output table
Thank you for using the SageMaker HyperPod Cluster Creation Script!
For any issues or questions, please refer to the AWS documentation.
https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started.html

Exiting script. Good luck with your SageMaker HyperPod journey! 👋

After the command finishes, cluster creation takes roughly 5 to 15 minutes.
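Instead of watching `list-clusters` by hand, the wait can be scripted. The following is a minimal sketch that polls `aws sagemaker describe-cluster` until the cluster reaches `InService`; `wait_for_cluster` is a hypothetical helper, and the cluster name `ml-cluster` and the 30-second default interval are assumptions.

```shell
# Poll the cluster status until it is InService (success) or in a state
# that will not become InService on its own (failure).
wait_for_cluster() {
  cluster_name="$1"
  region="${2:-us-west-2}"
  interval="${3:-30}"   # seconds between polls
  while :; do
    status=$(aws sagemaker describe-cluster \
      --cluster-name "$cluster_name" \
      --region "$region" \
      --query 'ClusterStatus' \
      --output text) || return 1
    echo "ClusterStatus: $status"
    case "$status" in
      InService) return 0 ;;
      Failed|RollingBack|Deleting) return 1 ;;
    esac
    sleep "$interval"
  done
}
```

Calling `wait_for_cluster ml-cluster us-west-2` blocks until the cluster is ready, which makes it easy to chain the later steps after it.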

For those who want a quicker setup

Replace each placeholder in the cluster-config.json below with the corresponding resource from the CFn (CloudFormation) stack.

cluster-config.json

    {
        "ClusterName": "ml-cluster",
        "InstanceGroups": [
            {
                "InstanceGroupName": "controller-machine",
                "InstanceType": "ml.c5.xlarge",
                "InstanceStorageConfigs": [
                    {
                        "EbsVolumeConfig": {
                            "VolumeSizeInGB": 500
                        }
                    }
                ],
                "InstanceCount": 1,
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://<LCScriptsBucket from the CFn stack>/src",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::<account ID>:role/<AmazonSagemakerClusterExecutionRole from the CFn stack>",
                "ThreadsPerCore": 2
            },
            {
                "InstanceGroupName": "worker-group-1",
                "InstanceType": "ml.g6e.48xlarge",
                "InstanceCount": 1,
                "InstanceStorageConfigs": [
                    {
                        "EbsVolumeConfig": {
                            "VolumeSizeInGB": 500
                        }
                    }
                ],
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://<LCScriptsBucket from the CFn stack>/src",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::<account ID>:role/<AmazonSagemakerClusterExecutionRole from the CFn stack>",
                "ThreadsPerCore": 1
            }
        ],
        "VpcConfig": {
            "SecurityGroupIds": ["<SecurityGroup ID from the CFn stack>"],
            "Subnets": ["<PrimaryPrivateSubnet from the CFn stack>"]
        }
    }

You can then create the cluster by running the following command.

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region us-west-2 --output json
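Because the file is edited by hand, it may be worth confirming that it still parses as JSON and lists the intended instance types before submitting the request above. The helpers below are a hypothetical sketch that relies on the `python3` already used in this setup; they check syntax only and do not validate the values against the SageMaker API.

```shell
# Succeed only if the file parses as JSON.
is_valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Print "<InstanceGroupName> <InstanceType>" for each instance group in the file.
instance_types() {
  python3 -c '
import json, sys
with open(sys.argv[1]) as f:
    cfg = json.load(f)
for group in cfg["InstanceGroups"]:
    print(group["InstanceGroupName"], group["InstanceType"])
' "$1"
}
```

For example, `is_valid_json cluster-config.json && instance_types cluster-config.json` prints one line per instance group when the file is well-formed.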

Attaching a policy to the IAM role

While waiting for the cluster to be created, run the following command to grant the role access to ECR.

source env_vars && aws iam attach-role-policy --role-name $(echo $ROLE | sed 's/.*role\///') --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
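The `sed` expression in the command above turns the role ARN into a bare role name. The sketch below isolates that step and adds a way to confirm the attachment afterwards. It assumes `env_vars` defines `ROLE` as the execution role ARN, as the command implies; `role_name_from_arn` and `attached_policy_names` are hypothetical helper names.

```shell
# Strip everything up to the last "role/" from an IAM role ARN,
# mirroring the sed expression used in the attach-role-policy command.
role_name_from_arn() {
  printf '%s\n' "$1" | sed 's/.*role\///'
}

# List the names of the managed policies attached to a role.
attached_policy_names() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}
```

After attaching the policy, `attached_policy_names "$(role_name_from_arn "$ROLE")"` should include `AmazonEC2ContainerRegistryReadOnly`.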

Associating the S3 bucket with FSx

Once the cluster has been created, run the following command to associate the bucket that holds the dataset with FSx.

source env_vars && aws fsx create-data-repository-association \
    --region ${AWS_REGION} \
    --file-system-id ${FSX_ID} \
    --file-system-path /dataset \
    --data-repository-path s3://<S3 bucket containing the dataset> \
    --s3 "AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}" \
    --batch-import-meta-data-on-create
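The association is created asynchronously, so the dataset may not appear under /fsx/dataset right away. A minimal sketch for checking its state, assuming `AWS_REGION` and `FSX_ID` come from `env_vars` as in the command above (`dra_lifecycle` is a hypothetical helper):

```shell
# Print the lifecycle state of each data repository association
# on the given file system (for example CREATING or AVAILABLE).
dra_lifecycle() {
  aws fsx describe-data-repository-associations \
    --region "$1" \
    --filters "Name=file-system-id,Values=$2" \
    --query 'Associations[].Lifecycle' \
    --output text
}
```

Running `dra_lifecycle "$AWS_REGION" "$FSX_ID"` until it prints `AVAILABLE` is one way to know when the metadata import has been set up.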

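Connection setup

Once the cluster has been created, connect to the instance over SSH through Session Manager.

Use the connection script to connect.

Note that the script assumes an SSH key named id_rsa already exists, so if your key has a different name, change id_rsa inside easy-ssh.sh accordingly.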

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh

bash easy-ssh.sh -c controller-machine ml-cluster

When the script runs successfully, you are connected to the instance as shown below.

Result

# /bin/bash -c 'cd ~ && exec bash'

ubuntu@ip-xxxx:~$
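For reference, HyperPod nodes are addressed in Session Manager with a target of the form `sagemaker-cluster:<cluster ID>_<instance group>-<instance ID>`. The sketch below builds that string by hand; it reflects my understanding of what easy-ssh.sh does internally rather than a copy of the script, and the helper names and sample IDs are hypothetical.

```shell
# Take the cluster ID from the last path segment of the cluster ARN.
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Build the Session Manager target for one node:
#   $1 = cluster ID, $2 = instance group name, $3 = instance ID
ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}
```

The instance IDs can be listed with `aws sagemaker list-cluster-nodes --cluster-name ml-cluster`, and the resulting string can be passed to `aws ssm start-session --target`.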

After connecting, set up a tunnel as follows so that the instance can be operated from VS Code.

curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz;
tar -xf vscode_cli.tar.gz;
./code tunnel;

Instance setup

Connect to controller-machine using VS Code.

Python environment

First, run the following commands to set up the Python environment.

Installing Python and creating a virtual environment

sudo apt install python3.9-venv python3.9-dev -y
python3.9 -m venv ${PWD}/venv
source venv/bin/activate

Next, clone SageMaker HyperPod recipes from GitHub and install the required libraries.

git clone https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install --upgrade setuptools wheel
pip3 install -r requirements.txt

Pulling the container image

Run the following commands to pull the container image.

export AWS_REGION="us-west-2"
REGION=$AWS_REGION
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}

Preparing the dataset

The file placed in /fsx/dataset is a tar archive, so extract it with the following commands.

cd /fsx/dataset
sudo tar xvf tokenized_dataset.tar
tree

# .
# ├── tokenized_dataset
# │   ├── test
# │   │   ├── data-00000-of-00001.arrow
# │   │   ├── dataset_info.json
# │   │   └── state.json
# │   ├── train
# │   │   ├── data-00000-of-00008.arrow
# │   │   ├── data-00001-of-00008.arrow
# │   │   ├── data-00002-of-00008.arrow
# │   │   ├── data-00003-of-00008.arrow
# │   │   ├── data-00004-of-00008.arrow
# │   │   ├── data-00005-of-00008.arrow
# │   │   ├── data-00006-of-00008.arrow
# │   │   ├── data-00007-of-00008.arrow
# │   │   ├── dataset_info.json
# │   │   └── state.json
# │   └── validation
# │       ├── data-00000-of-00001.arrow
# │       ├── dataset_info.json
# │       └── state.json
# └── tokenized_dataset.tar
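Before pointing the recipe at these directories, a quick structural check can catch an incomplete extraction. The sketch below only verifies the file layout shown in the tree above (dataset_info.json, state.json, and at least one .arrow shard); `is_dataset_split` is a hypothetical helper and does not validate the data itself.

```shell
# Succeed if the directory looks like one saved dataset split:
# it has dataset_info.json, state.json, and at least one .arrow shard.
is_dataset_split() {
  dir="$1"
  [ -f "$dir/dataset_info.json" ] || return 1
  [ -f "$dir/state.json" ] || return 1
  for shard in "$dir"/*.arrow; do
    [ -f "$shard" ] && return 0
  done
  return 1
}
```

For example, `is_dataset_split /fsx/dataset/tokenized_dataset/train && is_dataset_split /fsx/dataset/tokenized_dataset/validation` should succeed before training is launched.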

Preparing launcher_scripts

This time DeepSeek-R1-Distill-Llama-8B is fine-tuned with LoRA, so use the script sagemaker-hyperpod-recipes/launcher_scripts/deepseek/run_hf_deepseek_r1_llama_8b_seq8k_gpu_lora.sh.

Note that the base script needs a few modifications.

run_hf_deepseek_r1_llama_8b_seq8k_gpu_lora.sh

#!/bin/bash

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

#Users should setup their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="hf_xxx" # Optional HuggingFace access token

TRAIN_DIR="/fsx/dataset/tokenized_dataset/train"
VAL_DIR="/fsx/dataset/tokenized_dataset/validation"

EXP_DIR="/fsx/ubuntu/exp"

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/deepseek/hf_deepseek_r1_distilled_llama_8b_seq8k_gpu_lora \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-deepseek-r1-distilled-llama-8b-lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.train_batch_size=2 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN" \
    instance_type=ml.g6e.48xlarge \
    container=/fsx/ubuntu/smdistributed-modelparallel.sqsh \
    recipes.model.shard_degree=8 \
    recipes.trainer.devices=8

Specifying the dataset paths and model

HF_MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Llama-8B" # Model path on HuggingFace
HF_ACCESS_TOKEN="hf_xxx" # HuggingFace access token

TRAIN_DIR="/fsx/dataset/tokenized_dataset/train" # Training data
VAL_DIR="/fsx/dataset/tokenized_dataset/validation" # Validation data

EXP_DIR="/fsx/ubuntu/exp" # Where training results are saved

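Specifying the compute

Set shard_degree (the degree of sharding parallelism) and devices (the number of GPUs) to match the instance type.
For container, specify the container image imported earlier.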

    instance_type=ml.g6e.48xlarge \
    container=/fsx/ubuntu/smdistributed-modelparallel.sqsh \
    recipes.model.shard_degree=8 \
    recipes.trainer.devices=8
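To double-check these values against the actual hardware, the GPU count on a worker can be read from `nvidia-smi -L`. The sketch below is a rough aid with hypothetical helper names; the rule that shard_degree should evenly divide the total number of GPUs (nodes × devices) is my assumption here, not something stated by the recipe.

```shell
# Count GPUs from `nvidia-smi -L` output supplied on stdin
# (one line per GPU, each starting with "GPU ").
count_gpus() {
  grep -c '^GPU '
}

# Assumption: shard_degree should evenly divide nodes * devices.
#   $1 = number of nodes, $2 = devices per node, $3 = shard_degree
shard_degree_fits() {
  nodes="$1"
  devices="$2"
  shard_degree="$3"
  [ "$shard_degree" -gt 0 ] && [ $(( (nodes * devices) % shard_degree )) -eq 0 ]
}
```

From the controller, something like `srun -N 1 nvidia-smi -L | count_gpus` should print 8 for a single ml.g6e.48xlarge worker, which matches `recipes.trainer.devices=8`.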

Preparing Slurm

Cluster orchestration settings are specified in recipes_collection/cluster/slurm.yaml.

Because this setup adds directories for the dataset and the training results, add the following entries to container_mounts.

recipes_collection/cluster/slurm.yaml

# Original Copyright (c), NVIDIA CORPORATION. Modifications © Amazon.com

exclusive: True
mem: 0
job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False # Setting to True if just want to create submission file
stderr_to_stdout: True # Setting to False to split the stderr and stdout logs
srun_args:
  # - "--no-container-mount-home"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia" # this is required if the docker runtime version is low
  post_launch_commands: # commands will run after launching the docker container using bash
container_mounts:
  - /fsx/dataset:/fsx/dataset
  - /fsx/ubuntu/exp:/fsx/ubuntu/exp
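A mount whose host-side directory does not exist is an easy thing to miss, so it may help to check the entries before launching. The sketch below takes `host_path:container_path` pairs and reports host paths that are missing; `missing_mount_sources` is a hypothetical helper.

```shell
# For each "host_path:container_path" entry, print the host path
# if it does not exist as a directory.
missing_mount_sources() {
  for mount in "$@"; do
    src="${mount%%:*}"
    [ -d "$src" ] || printf '%s\n' "$src"
  done
}
```

If `missing_mount_sources /fsx/dataset:/fsx/dataset /fsx/ubuntu/exp:/fsx/ubuntu/exp` prints `/fsx/ubuntu/exp`, creating it with `mkdir -p /fsx/ubuntu/exp` beforehand is one way to resolve that.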

Training the model

Once the scripts are ready, run the following command to start training the model.

bash launcher_scripts/deepseek/run_hf_deepseek_r1_llama_8b_seq8k_gpu_lora.sh

When the run completes successfully, the training results are stored under /fsx/ubuntu/exp.

├── checkpoints
│   └── peft_full
│       └── steps_50
│           ├── README.md
│           ├── adapter_config.json
│           ├── adapter_model.safetensors
│           └── final-model
│               ├── config.json
│               ├── generation_config.json
│               ├── model-00001-of-00004.safetensors
│               ├── model-00002-of-00004.safetensors
│               ├── model-00003-of-00004.safetensors
│               ├── model-00004-of-00004.safetensors
│               └── model.safetensors.index.json
└── experiment
    ├── 2025-02-02_15-17-16
    │   ├── cmd-args.log
    │   ├── events.out.tfevents.1738509443.ip-10-1-28-160.12180.0
    │   ├── git-info.log
    │   ├── hparams.yaml
    │   ├── lightning_logs.txt
    │   ├── nemo_error_log.txt
    │   └── sagemaker_log_globalrank-0_localrank-0.txt
    ├── sagemaker_log_globalrank-1_localrank-1.txt
    ├── sagemaker_log_globalrank-2_localrank-2.txt
    ├── sagemaker_log_globalrank-3_localrank-3.txt
    ├── sagemaker_log_globalrank-4_localrank-4.txt
    ├── sagemaker_log_globalrank-5_localrank-5.txt
    ├── sagemaker_log_globalrank-6_localrank-6.txt
    └── sagemaker_log_globalrank-7_localrank-7.txt
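Given the layout above, a couple of small helpers can locate the most recent run and confirm that the LoRA adapter was written. This is a sketch with hypothetical helper names; it assumes run directories keep the timestamp naming shown in the tree, so that sorting by name also sorts by time.

```shell
# Print the newest timestamped run directory under <exp_dir>/experiment
# (with a trailing slash). Plain files at that level are ignored.
latest_run_dir() {
  exp_dir="$1"
  ls -1d "$exp_dir"/experiment/*/ 2>/dev/null | sort | tail -n 1
}

# Succeed if a checkpoint step directory contains the LoRA adapter files.
has_lora_adapter() {
  [ -f "$1/adapter_config.json" ] && [ -f "$1/adapter_model.safetensors" ]
}
```

For example, `tail -f "$(latest_run_dir /fsx/ubuntu/exp)sagemaker_log_globalrank-0_localrank-0.txt"` follows the rank 0 log while the job is running, and `has_lora_adapter /fsx/ubuntu/exp/checkpoints/peft_full/steps_50` checks the final output.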

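Closing thoughts

DeepSeek-R1 is already a high-performing model, but fine-tuning it on large volumes of company-specific data can yield a more specialized, higher-value model.

For cases that require training on large-scale data or continuous model improvement, SageMaker HyperPod is an effective option.

In addition, as tried here, recipes provide a simple and scalable way to train, which greatly lowers the barrier to model training that used to be difficult.

I encourage you to try HyperPod for your own use cases.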
