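I converted a pose estimation (skeleton detection) model from g4dn (NVIDIA GPU) instances so that it runs on Inf1 (AWS Inferentia) instances.
NVIDIA GPUs are the de facto standard for machine-learning inference, but a wave of new chips packed with each vendor's technology has appeared recently, such as Intel's gaming GPU "Xe HPG" and Apple's "M1" series SoCs for the Mac, and the field now looks like a warring-states era.
Against this backdrop, AWS offers Inf1 instances, which carry the inference-dedicated chip "AWS Inferentia" (an inference accelerator, not a GPU) and became available in the Tokyo region last year.
OpenPose is an implementation of the pose estimation paper published in 2017 by Zhe Cao et al. of Carnegie Mellon University. The model estimates the skeletal keypoints of the people in an image.
For the AMI, use the latest Deep Learning AMI.
In the security group, open the SSH port (22) and the Jupyter Lab port (8888). For long-term use, open them only to the IP address you will connect from. For this test they are left open to everyone.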
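As a rough illustration of restricting both ports to a single source address, the sketch below builds ingress rules in the shape the EC2 API accepts as `IpPermissions`. The function name `build_ingress_rules` and the example address are hypothetical, and nothing is sent to AWS.

```python
import ipaddress


def build_ingress_rules(source_ip, ports=(22, 8888)):
    """Build one TCP ingress rule per port, limited to a single IPv4 address."""
    # Raises ValueError if source_ip is not a valid IPv4 address.
    addr = ipaddress.IPv4Address(source_ip)
    cidr = f"{addr}/32"  # /32 means exactly one host
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


# 203.0.113.10 is a documentation-only example address (assumption).
rules = build_ingress_rules("203.0.113.10")
for rule in rules:
    print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

Passing `0.0.0.0/0` instead of a `/32` range corresponds to the "open to everyone" setting used for this test.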
ssh -i "deeplearningtest.pem" ubuntu@ec2-XX-XX-XXX-XXX.ap-northeast-1.compute.amazonaws.com
The Deep Learning AMI ships with a variety of preinstalled environments. You switch between them with the commands shown at login.
Please use one of the following commands to start the required environment with the framework of your choice:
for TensorFlow 2.8 with Python3.8 (CUDA 11.2 and Intel MKL-DNN) ____________________________ source activate tensorflow2_p38
for PyTorch 1.10 with Python3.8 (CUDA 11.1 and Intel MKL) ______________________________________ source activate pytorch_p38
for AWS MX 1.8 (+Keras2) with Python3.7 (CUDA 11.0 and Intel MKL-DNN) ____________________________ source activate mxnet_p37
for AWS MX(+AWS Neuron) with Python3 __________________________________________________ source activate aws_neuron_mxnet_p36
for Tensorflow(+AWS Neuron) with Python3 _________________________________________ source activate aws_neuron_tensorflow_p36
for PyTorch (+AWS Neuron) with Python3 ______________________________________________ source activate aws_neuron_pytorch_p36
for AWS MX(+Amazon Elastic Inference) with Python3 ______________________________________ source activate amazonei_mxnet_p36
for base Python3 (CUDA 11.0) _______________________________________________________________________ source activate python3
source activate pytorch_p38
mkdir work
cd work/
git clone https://github.com/tensorboy/pytorch_Realtime_Multi-Person_Pose_Estimation.git
cd pytorch_Realtime_Multi-Person_Pose_Estimation
python -m pip install --upgrade pip
pip install -r requirements.txt
The Deep Learning AMI also comes with the AWS CLI preinstalled, so files stored in S3 can be copied to the server easily.
aws s3 cp s3://inf1test-ap-northeast-1/transformedmodels/pose_model.pth .
cd lib/pafprocess/
sudo apt install swig
swig -python -c++ pafprocess.i && python3 setup.py build_ext --inplace
cd ../../
I want to run inference from Jupyter Lab, so install Jupyter Lab as well.
conda install jupyterlab
Everything is now ready. Start Jupyter Lab.
jupyter-lab --ip=''
When Jupyter Lab starts, a token is printed in the terminal. Access the EC2 public address on the Jupyter Lab port (8888), passing that token.
jupyter-lab --ip=''
[I 2022-03-14 22:35:46.116 ServerApp] ipyparallel | extension was successfully linked.
[W 2022-03-14 22:35:46.119 LabApp] Config option `kernel_spec_manager_class` not recognized by `LabApp`.
[W 2022-03-14 22:35:46.122 LabApp] Config option `kernel_spec_manager_class` not recognized by `LabApp`.
[W 2022-03-14 22:35:46.127 LabApp] Config option `kernel_spec_manager_class` not recognized by `LabApp`.
[I 2022-03-14 22:35:46.128 ServerApp] jupyterlab | extension was successfully linked.
[W 2022-03-14 22:35:46.130 NotebookApp] Config option `kernel_spec_manager_class` not recognized by `NotebookApp`.
[W 2022-03-14 22:35:46.134 NotebookApp] 'kernel_spec_manager_class' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-03-14 22:35:46.134 NotebookApp] 'iopub_data_rate_limit' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-03-14 22:35:46.135 NotebookApp] Config option `kernel_spec_manager_class` not recognized by `NotebookApp`.
[W 2022-03-14 22:35:46.139 NotebookApp] Config option `kernel_spec_manager_class` not recognized by `NotebookApp`.
[I 2022-03-14 22:35:46.149 ServerApp] Writing Jupyter server cookie secret to /home/ubuntu/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2022-03-14 22:35:46.492 ServerApp] nb_conda | extension was found and enabled by nbclassic. Consider moving the extension to Jupyter Server's extension paths.
[I 2022-03-14 22:35:46.492 ServerApp] nb_conda | extension was successfully linked.
[I 2022-03-14 22:35:46.492 ServerApp] nbclassic | extension was successfully linked.
[I 2022-03-14 22:35:46.494 ServerApp] Using EnvironmentKernelSpecManager...
[I 2022-03-14 22:35:46.494 ServerApp] Started periodic updates of the kernel list (every 3 minutes).
[I 2022-03-14 22:35:47.895 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-03-14 22:35:47.897 ServerApp] Loading IPython parallel extension
[I 2022-03-14 22:35:47.897 ServerApp] ipyparallel | extension was successfully loaded.
[I 2022-03-14 22:35:47.898 LabApp] JupyterLab extension loaded from /home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/jupyterlab
[I 2022-03-14 22:35:47.898 LabApp] JupyterLab application directory is /home/ubuntu/anaconda3/envs/pytorch_p38/share/jupyter/lab
[I 2022-03-14 22:35:47.901 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-03-14 22:35:47.902 ServerApp] [nb_conda] enabled
[I 2022-03-14 22:35:47.902 ServerApp] nb_conda | extension was successfully loaded.
[I 2022-03-14 22:35:47.903 ServerApp] Serving notebooks from local directory: /home/ubuntu
[I 2022-03-14 22:35:47.903 ServerApp] Jupyter Server 1.12.0 is running at:
[I 2022-03-14 22:35:47.903 ServerApp] http://ip-172-31-32-87:8888/lab?token=171f02250e50db8deec991d3f9da8d72414e5b18ab0841ea
[I 2022-03-14 22:35:47.903 ServerApp]  or
[I 2022-03-14 22:35:47.903 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 2022-03-14 22:35:48.014 ServerApp] No web browser found: could not locate runnable browser.
[C 2022-03-14 22:35:48.014 ServerApp]
    To access the server, open this file in a browser:
        file:///home/ubuntu/.local/share/jupyter/runtime/jpserver-16377-open.html
    Or copy and paste one of these URLs:
        http://ip-172-31-32-87:8888/lab?token=171f02250e50db8deec991d3f9XXXXXXXXXXXXXXXXXXXXX
     or
[I 2022-03-14 22:35:48.015 ServerApp] Starting initial scan of virtual environments...
[I 2022-03-14 22:35:56.899 ServerApp] Found new kernels in environments: conda_tensorflow2_p38, conda_aws_neuron_mxnet_p36, conda_aws_neuron_pytorch_p36, conda_anaconda3, conda_amazonei_mxnet_p36, conda_python3, conda_pytorch_p38, conda_mxnet_p37, conda_aws_neuron_tensorflow_p36
Opening [http://<EC2 public address>:8888/lab?token=<token shown in the terminal>] in another browser tab displays Jupyter Lab.
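The URL printed in the log uses the instance's internal host name, so it has to be rewritten with the public address. A minimal sketch of that step, assuming the token appears as a `token=` query parameter as in the log; `jupyter_url` and the example host name are hypothetical:

```python
import re
from urllib.parse import urlencode

# Matches the token query parameter in a Jupyter startup log line.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9A-Za-z]+)")


def jupyter_url(log_line, public_host, port=8888):
    """Build the externally reachable Jupyter Lab URL from a log line."""
    match = TOKEN_PATTERN.search(log_line)
    if match is None:
        raise ValueError("no token found in log line")
    query = urlencode({"token": match.group(1)})
    return f"http://{public_host}:{port}/lab?{query}"


# Placeholder token and host, not real values.
line = "[I ServerApp] http://ip-172-31-32-87:8888/lab?token=abc123"
print(jupyter_url(line, "ec2-host.example.com"))
```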
import os
import re
import sys
sys.path.append('.')
import cv2
import math
import time
import scipy
import argparse
import matplotlib
import numpy as np
import pylab as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from collections import OrderedDict
from scipy.ndimage.morphology import generate_binary_structure
from scipy.ndimage.filters import gaussian_filter, maximum_filter

from lib.network.rtpose_vgg import get_model
from lib.network import im_transform
from evaluate.coco_eval import get_outputs, handle_paf_and_heat
A Python error appears at this point, but it is only a warning that operations written for a standalone Python script (such as exit or quit) cannot be used in Jupyter notebooks, so leave it as is for now.
from lib.utils.common import Human, BodyPart, CocoPart, CocoColors, CocoPairsRender, draw_humans
from lib.utils.paf_to_pose import paf_to_pose_cpp
from lib.config import cfg, update_config
parser = argparse.ArgumentParser()
parser.add_argument('--cfg', help='experiment configure file name',
                    default='./experiments/vgg19_368x368_sgd.yaml', type=str)
parser.add_argument('--weight', type=str,
                    default='pose_model.pth')
parser.add_argument('opts',
                    help="Modify config options using the command-line",
                    default=None,
                    nargs=argparse.REMAINDER)
#args = parser.parse_args()
args = parser.parse_args(args=[])

# update config file
update_config(cfg, args)
from lib.datasets.preprocessing import (inception_preprocess,
                                        rtpose_preprocess,
                                        ssd_preprocess, vgg_preprocess)

def get_outputs2(img, model, preprocess):
    """Computes the heatmap and paf for the given image

    :param img: numpy array, the image being processed
    :param model: pytorch model
    :param preprocess: preprocessing name ('rtpose', 'vgg', 'inception' or 'ssd')
    :returns: numpy arrays (paf, heatmap) and the scale factor im_scale
    """
    inp_size = cfg.DATASET.IMAGE_SIZE

    # padding
    im_croped, im_scale, real_shape = im_transform.crop_with_factor(
        img, inp_size, factor=cfg.MODEL.DOWNSAMPLE, is_ceil=True)

    if preprocess == 'rtpose':
        im_data = rtpose_preprocess(im_croped)
    elif preprocess == 'vgg':
        im_data = vgg_preprocess(im_croped)
    elif preprocess == 'inception':
        im_data = inception_preprocess(im_croped)
    elif preprocess == 'ssd':
        im_data = ssd_preprocess(im_croped)

    batch_images = np.expand_dims(im_data, 0)

    # move the batch to the GPU and run the model
    batch_var = torch.from_numpy(batch_images).cuda().float()
    predicted_outputs, _ = model(batch_var)
    output1, output2 = predicted_outputs[-2], predicted_outputs[-1]
    heatmap = output2.cpu().data.numpy().transpose(0, 2, 3, 1)[0]
    paf = output1.cpu().data.numpy().transpose(0, 2, 3, 1)[0]

    return paf, heatmap, im_scale
model = get_model('vgg19')
model.load_state_dict(torch.load(args.weight))
#model = torch.nn.DataParallel(model)
model = torch.nn.DataParallel(model).cuda()
model.float()
model.eval()

test_image = './readme/ski.jpg'
oriImg = cv2.imread(test_image) # B,G,R order
shape_dst = np.min(oriImg.shape[0:2])

num_inferences = 100
start = time.time()
for _ in range(num_inferences):
    with torch.no_grad():
        paf, heatmap, im_scale = get_outputs2(oriImg, model, 'rtpose')
elapsed_time = time.time() - start
print('num_inferences:{:>6}[images], elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(num_inferences, elapsed_time, num_inferences / elapsed_time))
num_inferences: 100[images], elapsed_time: 7.76[sec], Throughput: 12.88[images/sec]
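The throughput figure is the number of inferences divided by the elapsed wall-clock time (100 / 7.76 sec, roughly 12.9 images/sec). The GPU loop has no warm-up call, so its first iteration may include one-time setup cost. Below is a minimal sketch of the same measurement with an optional warm-up, using only the standard library. `measure_throughput` and `run_once` are hypothetical names, not part of the repository.

```python
import time


def measure_throughput(run_once, num_inferences=100, warmup=1):
    """Call run_once repeatedly and return (images_per_sec, elapsed_sec)."""
    # Warm-up calls are executed but excluded from the timing.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(num_inferences):
        run_once()
    elapsed = time.perf_counter() - start
    return num_inferences / elapsed, elapsed


# Stand-in workload (assumption): sleep 1 ms instead of running a model.
throughput, elapsed = measure_throughput(lambda: time.sleep(0.001),
                                         num_inferences=20)
print('elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(
    elapsed, throughput))
```

In the notebook, `run_once` would wrap the `get_outputs2(oriImg, model, 'rtpose')` call inside `torch.no_grad()`.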
print(im_scale)
humans = paf_to_pose_cpp(heatmap, paf, cfg)
out = draw_humans(oriImg, humans)
cv2.imwrite('result.png',out)
Now let's finally try Inf1. To run the current model, which targets NVIDIA GPUs, on Inf1, we use the Neuron SDK (the SDK for Inferentia) to convert it to the Neuron Executable File Format (NEFF) and run it on the Neuron runtime. Adding this step to the inference code above should be enough to make it work.
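Compilation to NEFF only needs to happen once; later runs can load the saved artifact. The sketch below shows that "compile if missing, otherwise load" pattern in a framework-agnostic way. `load_or_compile` and its two callback parameters are hypothetical helpers; in this article the callbacks would correspond to `torch.neuron.trace(...)` followed by `.save(...)`, and to `torch.jit.load(...)`.

```python
import os


def load_or_compile(path, compile_and_save, load):
    """Return the compiled model stored at path, compiling it first if absent.

    compile_and_save(path) must write the compiled artifact to path.
    load(path) must read it back and return the model object.
    """
    if not os.path.exists(path):
        compile_and_save(path)
    return load(path)
```

Keeping the compile step behind a file-existence check avoids re-running the compiler each time the notebook is restarted.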
source activate aws_neuron_pytorch_p36
python -m pip install --upgrade pip
pip install -r requirements.txt
cd lib/pafprocess/
sudo apt install swig
swig -python -c++ pafprocess.i && python3 setup.py build_ext --inplace
cd ../../
Start Jupyter Lab, open the inference code from earlier, and make the changes needed for Inf1.
First, import the Neuron SDK.
import torch_neuron
from lib.datasets.preprocessing import (inception_preprocess,
                                        rtpose_preprocess,
                                        ssd_preprocess, vgg_preprocess)

def get_preprocess(img, preprocess):
    """Preprocesses the given image into a model input batch

    :param img: numpy array, the image being processed
    :param preprocess: preprocessing name ('rtpose', 'vgg', 'inception' or 'ssd')
    :returns: the input tensor (batch of one image) and the scale factor im_scale
    """
    inp_size = cfg.DATASET.IMAGE_SIZE

    # padding
    im_croped, im_scale, real_shape = im_transform.crop_with_factor(
        img, inp_size, factor=cfg.MODEL.DOWNSAMPLE, is_ceil=True)

    if preprocess == 'rtpose':
        im_data = rtpose_preprocess(im_croped)
    elif preprocess == 'vgg':
        im_data = vgg_preprocess(im_croped)
    elif preprocess == 'inception':
        im_data = inception_preprocess(im_croped)
    elif preprocess == 'ssd':
        im_data = ssd_preprocess(im_croped)

    batch_images = np.expand_dims(im_data, 0)

    # several scales as a batch
    # batch_var = torch.from_numpy(batch_images).cuda().float()
    batch_var = torch.from_numpy(batch_images).float()

    return batch_var, im_scale
def get_outputs2(batch_var, model):
    # for compilation
    # torch.neuron.analyze_model(model, example_inputs=[batch_var])
    # model_neuron = torch.neuron.trace(model, example_inputs=[batch_var])
    # model_neuron.save("openpose_neuron.pt")

    predicted_outputs, _ = model(batch_var)
    output1, output2 = predicted_outputs[-2], predicted_outputs[-1]
    heatmap = output2.cpu().data.numpy().transpose(0, 2, 3, 1)[0]
    paf = output1.cpu().data.numpy().transpose(0, 2, 3, 1)[0]

    return paf, heatmap

model = get_model('vgg19')
model.load_state_dict(torch.load(args.weight))
model = torch.nn.DataParallel(model)
#model = torch.nn.DataParallel(model).cuda()
model.float()
model.eval()

def benchmark(model, oriImg):
    # The first inference loads the model so exclude it from timing
    with torch.no_grad():
        batch_var, im_scale = get_preprocess(oriImg, 'rtpose')
        paf, heatmap = get_outputs2(batch_var, model)
    print('Input image shape is {}'.format(list(batch_var.shape)))

    # Collect throughput and latency metrics
    latency = []
    throughput = []

    # Run inference for 100 iterations and calculate metrics
    num_infers = 100
    for _ in range(num_infers):
        delta_start = time.time()
        with torch.no_grad():
            batch_var, im_scale = get_preprocess(oriImg, 'rtpose')
            paf, heatmap = get_outputs2(batch_var, model)
        delta = time.time() - delta_start
        latency.append(delta)
        throughput.append(batch_var.size(0)/delta)

    # Calculate and print the model throughput and latency
    print("Avg. Throughput: {:.0f}, Max Throughput: {:.0f}".format(np.mean(throughput), np.max(throughput)))
    print("Latency P50: {:.0f}".format(np.percentile(latency, 50)*1000.0))
    print("Latency P90: {:.0f}".format(np.percentile(latency, 90)*1000.0))
    print("Latency P95: {:.0f}".format(np.percentile(latency, 95)*1000.0))
    print("Latency P99: {:.0f}\n".format(np.percentile(latency, 99)*1000.0))

test_image = './readme/ski.jpg'
oriImg = cv2.imread(test_image) # B,G,R order
shape_dst = np.min(oriImg.shape[0:2])

num_inferences = 1
start = time.time()
for _ in range(num_inferences):
    with torch.no_grad():
        batch_var, im_scale = get_preprocess(oriImg, 'rtpose')
        paf, heatmap = get_outputs2(batch_var, model)
elapsed_time = time.time() - start
print('num_inferences:{:>6}[images], elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(num_inferences, elapsed_time, num_inferences / elapsed_time))
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/_trace.py:965: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
  argument_names,
INFO:Neuron:The following operations are currently supported in torch-neuron for this model:
INFO:Neuron:aten::cat
INFO:Neuron:prim::ListConstruct
INFO:Neuron:prim::TupleConstruct
INFO:Neuron:aten::relu
INFO:Neuron:prim::TupleUnpack
INFO:Neuron:prim::Constant
INFO:Neuron:aten::_convolution
INFO:Neuron:aten::max_pool2d
INFO:Neuron:The following operations are currently not supported in torch-neuron for this model:
INFO:Neuron:profiler::_record_function_enter
INFO:Neuron:profiler::_record_function_exit
INFO:Neuron:99.90% of all operations (including primitives) (2075 of 2077) are supported
INFO:Neuron:100.00% of arithmetic operations (180 of 180) are supported
INFO:Neuron:There are 1 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: profiler::_record_function_enter, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 180, fused = 180, percent fused = 100.0%
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/_trace.py:793: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
  get_callable_argument_names(func)
INFO:Neuron:Compiling function _NeuronGraph$1152 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/bin/neuron-cc compile /tmp/tmpvkes4fym/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpvkes4fym/graph_def.neff --io-config {"inputs": {"0:0": [[1, 3, 368, 392], "float32"]}, "outputs": ["transpose_260:0", "transpose_281:0", "transpose_56:0", "transpose_71:0", "transpose_92:0", "transpose_113:0", "transpose_134:0", "transpose_155:0", "transpose_176:0", "transpose_197:0", "transpose_218:0", "transpose_239:0"]} --verbose 35'
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/_trace.py:965: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
  argument_names,
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 180, compiled = 180, percent compiled = 100.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron: => aten::_convolution: 92
INFO:Neuron: => aten::cat: 5
INFO:Neuron: => aten::max_pool2d: 3
INFO:Neuron: => aten::relu: 80
# Running test on Inf1 Neuron core

# for loading compiled model
model = torch.jit.load('openpose_neuron.pt')

# warmup
with torch.no_grad():
    batch_var, im_scale = get_preprocess(oriImg, 'rtpose')
    paf, heatmap = get_outputs2(batch_var, model)

num_inferences = 100
start = time.time()
for _ in range(num_inferences):
    with torch.no_grad():
        batch_var, im_scale = get_preprocess(oriImg, 'rtpose')
        paf, heatmap = get_outputs2(batch_var, model)
elapsed_time = time.time() - start
print('num_inferences:{:>6}[images], elapsed_time:{:6.2f}[sec], Throughput:{:8.2f}[images/sec]'.format(num_inferences, elapsed_time, num_inferences / elapsed_time))
num_inferences: 100[images], elapsed_time: 3.28[sec], Throughput: 30.50[images/sec]
benchmark(model, oriImg)
Input image shape is [1, 3, 368, 392]
Avg. Throughput: 30, Max Throughput: 31
Latency P50: 33
Latency P90: 33
Latency P95: 33
Latency P99: 33
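The latency values are in milliseconds, and for a batch of one image they tie directly to the throughput: 1 / 0.033 sec is about 30 images/sec. The sketch below reproduces that summary with a standard-library percentile using linear interpolation between neighbouring ranks. The names `percentile` and `summarize` are hypothetical, and the constant latencies are illustrative rather than measured data.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def summarize(latencies_sec, batch_size=1):
    """Summarize per-call latencies (seconds) as ms percentiles and images/sec."""
    p50 = percentile(latencies_sec, 50)
    return {
        "p50_ms": p50 * 1000.0,
        "p99_ms": percentile(latencies_sec, 99) * 1000.0,
        "throughput": batch_size / p50,
    }


# Illustrative input: 100 calls that each took 33 ms.
print(summarize([0.033] * 100))
```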
def benchmark2(model, batch_var):
    # The first inference loads the model so exclude it from timing
    with torch.no_grad():
        paf, heatmap = get_outputs2(batch_var, model)
    print('Input image shape is {}'.format(list(batch_var.shape)))

    # Collect throughput and latency metrics
    latency = []
    throughput = []

    # Run inference for 1000 iterations and calculate metrics
    num_infers = 1000
    for _ in range(num_infers):
        delta_start = time.time()
        with torch.no_grad():
            paf, heatmap = get_outputs2(batch_var, model)
        delta = time.time() - delta_start
        latency.append(delta)
        throughput.append(batch_var.size(0)/delta)

    # Calculate and print the model throughput and latency
    print("Avg. Throughput: {:.0f}, Max Throughput: {:.0f}".format(np.mean(throughput), np.max(throughput)))
    print("Latency P50: {:.0f}".format(np.percentile(latency, 50)*1000.0))
    print("Latency P90: {:.0f}".format(np.percentile(latency, 90)*1000.0))
    print("Latency P95: {:.0f}".format(np.percentile(latency, 95)*1000.0))
    print("Latency P99: {:.0f}\n".format(np.percentile(latency, 99)*1000.0))
batch_size = 1
num_neuron_cores = 4

model_neuron_parallel = torch.neuron.DataParallel(model)

batch_image = batch_var
for i in range(batch_size * num_neuron_cores - 1):
    batch_image = torch.cat([batch_image, batch_var], 0)

benchmark2(model_neuron_parallel, batch_image)
Input image shape is [4, 3, 368, 392]
Avg. Throughput: 94, Max Throughput: 98
Latency P50: 42
Latency P90: 44
Latency P95: 45
Latency P99: 45
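To put the three measurements side by side, the sketch below computes each throughput relative to the GPU run. The figures are the ones printed in this article; the labels and the `speedups` helper are hypothetical. Note that the GPU loop had no warm-up call and that all runs used a single test image, so the ratios are only a rough guide.

```python
# Throughput in images/sec, as reported by the runs in this article.
RESULTS = {
    "g4dn, GPU": 12.88,
    "Inf1, single model": 30.50,
    "Inf1, DataParallel (4 NeuronCores)": 94.0,
}


def speedups(results, baseline):
    """Return each throughput divided by the baseline entry's throughput."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


for name, ratio in speedups(RESULTS, "g4dn, GPU").items():
    print("{:<36} {:5.2f}x".format(name, ratio))
```

With four images per call at a P50 of 42 ms, the DataParallel run works out to about 10.5 ms per image, compared with 33 ms for the single-image run.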