Run Large Language Models on a Single GPU!? Trying Out FlexGen

A future where large language models are within reach is just around the corner.
2023.02.24


Hello.

This is Nakamura from the Machine Learning Team, Integration Department, Data Analytics Division.

In this post, I'll try out and introduce FlexGen, which is said to be able to run large language models on a single GPU.

What is FlexGen?

FlexGen is a high-throughput generation engine that can run large language models (LLMs) on a single GPU (for example, a 16GB T4 or a 24GB RTX 3090).

The GitHub repository is at https://github.com/FMInference/FlexGen.

FlexGen can run OPT (Open Pre-trained Transformer), developed by Meta, and you can actually hold a conversation with an AI assistant.

For reference, the paper on OPT is below.

Environment

I'm using a Google Colaboratory Pro environment. The runtime specs are changed depending on which model is run.

  • OPT-1.3B
    • Hardware accelerator: GPU
    • GPU class: Standard (Tesla T4)
    • Runtime shape: High-RAM (26GB)
  • OPT-6.7B
    • Hardware accelerator: GPU
    • GPU class: Premium (NVIDIA A100-SXM4-40GB)
    • Runtime shape: Standard (85GB)
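
Before running anything, you can confirm what Colab actually assigned you. The commands below are standard Colab/Linux utilities (not part of FlexGen) for checking the GPU and available RAM:

# Check the assigned GPU
!nvidia-smi
# Check available system RAM
!free -h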

Also, FlexGen itself was run as of the commit shown below.

Trying It Out

Setup

First, clone FlexGen.

!git clone https://github.com/FMInference/FlexGen.git
Cloning into 'FlexGen'...
remote: Enumerating objects: 4113, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 4113 (delta 51), reused 54 (delta 33), pack-reused 4019
Receiving objects: 100% (4113/4113), 36.90 MiB | 26.08 MiB/s, done.
Resolving deltas: 100% (878/878), done.
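
FlexGen was moving quickly at the time, and as noted above this post pins a specific commit. To record exactly which commit you cloned:

# Print the commit hash of the freshly cloned repository
!git -C FlexGen log -1 --format=%H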

Move into the directory and run pip install.

%cd FlexGen
!pip3 install -e .
/content/FlexGen
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/FlexGen
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (1.22.4)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (4.64.1)
Collecting pulp
  Downloading PuLP-2.7.0-py3-none-any.whl (14.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.3/14.3 MB 73.3 MB/s eta 0:00:00
Requirement already satisfied: attrs in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (22.2.0)
Collecting transformers>=4.24
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 87.6 MB/s eta 0:00:00
Requirement already satisfied: torch>=1.12 in /usr/local/lib/python3.8/dist-packages (from flexgen==0.0.0) (1.13.1+cu116)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.8/dist-packages (from torch>=1.12->flexgen==0.0.0) (4.5.0)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 93.3 MB/s eta 0:00:00
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (2.25.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (2022.6.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (23.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers>=4.24->flexgen==0.0.0) (3.9.0)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 KB 18.4 MB/s eta 0:00:00
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (1.24.3)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers>=4.24->flexgen==0.0.0) (2022.12.7)
Installing collected packages: tokenizers, pulp, huggingface-hub, transformers, flexgen
  Running setup.py develop for flexgen
Successfully installed flexgen-0.0.0 huggingface-hub-0.12.1 pulp-2.7.0 tokenizers-0.13.2 transformers-4.26.1
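
Before moving on, you can quickly confirm that the editable install is visible to pip:

# Show the metadata registered by the develop-mode install
!pip3 show flexgen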

Fetching the Model and Benchmarking

As a sanity check, let's run a benchmark with the small OPT-1.3B model.

!python3 -m flexgen.flex_opt --model facebook/opt-1.3b
2023-02-24 02:52:38.559822: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 02:52:39.402260: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:52:39.402364: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:52:39.402379: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading (…)okenizer_config.json: 100% 685/685 [00:00<00:00, 109kB/s]
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 89.5kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 7.55MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 4.78MB/s]
Downloading (…)cial_tokens_map.json: 100% 221/221 [00:00<00:00, 98.4kB/s]
model size: 2.443 GB, cache size: 0.398 GB, hidden size (prefill): 0.008 GB
warmup - init weights
Load the pre-trained pytorch weights of opt-1.3b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)lve/main/config.json: 100% 653/653 [00:00<00:00, 103kB/s]
Downloading (…)"pytorch_model.bin";: 100% 2.63G/2.63G [01:13<00:00, 35.9MB/s]
Downloading (…)neration_config.json: 100% 137/137 [00:00<00:00, 22.8kB/s]
Convert the weights to numpy format under /root/opt_weights/opt-1.3b-np ...
100% 388/388 [00:09<00:00, 39.76it/s]
warmup - generate
benchmark - generate
benchmark - delete weights
/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:262: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
Outputs:
----------------------------------------------------------------------
0: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------
3: Paris is the capital city of France. It is the most populous city in France, with an estimated population of 6,848,000 in 2016. It is the second most populous city
----------------------------------------------------------------------

TorchDevice: cuda:0
  cur_mem: 0.0000 GB,  peak_mem: 3.2399 GB
TorchDevice: cpu
  cur_mem: 0.0000 GB,  peak_mem: 0.0000 GB
model size: 2.443 GB    cache size: 0.398 GB    hidden size (p): 0.008 GB
peak gpu mem: 3.240 GB  projected: False
prefill latency: 0.339 s    prefill throughput: 6043.428 token/s
decode latency: 0.820 s decode throughput: 151.180 token/s
total latency: 1.159 s  total throughput: 110.431 token/s

It seems to be working correctly. If you don't get similar output, the run may have failed due to insufficient memory or a similar issue, so check your environment.

Launching the Chatbot (OPT-1.3B)

You can launch the chatbot as follows.

!python3 apps/chatbot.py --model facebook/opt-1.3b
2023-02-24 02:23:46.886822: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 02:23:47.722337: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:23:47.722426: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 02:23:47.722441: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Initialize...
A chat between a curious human and a knowledgeable artificial intelligence assistant.
Human: Hello! What can you do?
Assistant: As an AI assistant, I can answer questions and chat with you.
Human: What is the name of the tallest mountain in the world?
Assistant: Everest.
Human: Good morning.
Assistant: How long does it take to fly from New York to London?
Human: I think it takes about 12 hours.
Assistant: How fast does a bullet go through a bulletproof vest?
Human: Please answer my question.
Assistant: The bullet goes through the vest in less than one second.

If you place the cursor in the Colab output, you can type your side of the conversation in at the Human: prompts.

Given that it starts returning nothing but questions, it doesn't look like it's actually holding a conversation. This run was just a sanity check, so it seems we'll need to use a different model.

Launching the Chatbot (OPT-6.7B)

Next, let's switch the Colab specs and launch the chatbot with the OPT-6.7B model.

!python apps/chatbot.py --model facebook/opt-6.7b
2023-02-24 03:15:52.548444: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-24 03:15:52.696922: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-24 03:15:53.479595: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 03:15:53.479707: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-02-24 03:15:53.479734: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading (…)okenizer_config.json: 100% 685/685 [00:00<00:00, 97.8kB/s]
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 103kB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 5.30MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 3.34MB/s]
Downloading (…)cial_tokens_map.json: 100% 221/221 [00:00<00:00, 85.3kB/s]
Initialize...
Load the pre-trained pytorch weights of opt-6.7b from huggingface. The downloading and cpu loading can take dozens of minutes. If it seems to get stuck, you can monitor the progress by checking the memory usage of this process.
Downloading (…)lve/main/config.json: 100% 651/651 [00:00<00:00, 99.7kB/s]
Downloading (…)model.bin.index.json: 100% 41.9k/41.9k [00:00<00:00, 1.27MB/s]
Downloading (…)00001-of-00002.bin";: 100% 9.96G/9.96G [01:13<00:00, 136MB/s]
Downloading (…)00002-of-00002.bin";: 100% 3.36G/3.36G [00:35<00:00, 95.0MB/s]
Loading checkpoint shards: 100% 2/2 [00:09<00:00,  4.77s/it]
Downloading (…)neration_config.json: 100% 137/137 [00:00<00:00, 51.5kB/s]
Convert the weights to numpy format under /root/opt_weights/opt-6.7b-np ...
100% 516/516 [00:32<00:00, 16.07it/s]
A chat between a curious human and a knowledgeable artificial intelligence assistant.
Human: Hello! What can you do?
Assistant: As an AI assistant, I can answer questions and chat with you.
Human: What is the name of the tallest mountain in the world?
Assistant: Everest.
Human: Good morning.
Assistant: Good morning.
Human: Please tell me about popular sports in Japan.
Assistant: Baseball, sumo, soccer, and judo.
Human: Can you tell me more about sumo?
Assistant: Sumo comes from the word sūmu, which means “to pull down.” A wrestler enters the ring and the referee calls for him to pull down his opponent with his legs or body. The wrestler then tries to push his opponent out of the ring with his hands. If he is successful, he is considered to be the winner.

This time it answered the questions properly!

Other Models

Which other models can be selected should be clear from the reference below.
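
As a sketch, the OPT checkpoints published on Hugging Face (facebook/opt-125m up to facebook/opt-66b) can be passed to the same --model flag. For example, assuming the 2.7B model fits the runtime you chose above:

# Benchmark a different OPT size by swapping the model name
!python3 -m flexgen.flex_opt --model facebook/opt-2.7b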

According to the README, even models like OPT-30B can be run on a single GPU, but you need to tune the offloading parameters, as sketched below.
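
For reference, the README (as of the commit used in this post) gives an example along the following lines for running OPT-30B on a single 16GB GPU. The six numbers passed to --percent set the offloading percentages for weights on GPU/CPU, attention (KV) cache on GPU/CPU, and activations on GPU/CPU; treat them as a starting point rather than a tuned setting:

# Offload all weights to CPU; keep the KV cache and activations on the GPU
!python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0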

Handling out-of-memory errors is also described in the documentation below, so refer to that as well.
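
If CPU RAM also runs short, the README additionally mentions compression as a knob. The sketch below assumes the --compress-weight flag from the commit used here (group-wise quantization of the weights), so check the current README before relying on it:

# Also compress the weights to reduce memory use further
!python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0 --compress-weight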

Summary

So, what did you think?

Large language models had felt out of reach, but with FlexGen I feel we can now see a real path to putting them to use.

According to the roadmap, the following developments are planned:

  • Support Apple silicon M1/M2 deployment
  • Support Colab deployment
  • Add applications such as text summarization and more throughput-oriented applications
  • Optimize the latency of the chatbot application
  • Support more models (BLOOM, CodeGen, GLM)
  • Release the cost model and policy optimizer
  • Release a pip-installable package

I'm looking forward to seeing how development progresses!