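I Tried Predicting Abalone Age with XGBoost on Amazon SageMaker

Launch a SageMaker notebook instance, then follow SageMaker Examples → Introduction to Amazon algorithms → xgboost_abalone.ipynb → Use to copy the notebook from the samples, and open it. For how to create a notebook instance, please see here.

Checking environment variables and the role

Decide the name of the S3 bucket that will store the training data and other files, and the prefix for the stored object names.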

%%time

import os
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name

bucket='<your_s3_bucket_name_here>' # put your s3 bucket name here, and create s3 bucket
prefix = 'sagemaker/DEMO-xgboost-regression'
# customize to your bucket where you have stored the data
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket)

Getting the data

First, define the functions used to get the data.

%%time

import io
import boto3
import random

def data_split(FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST):
    data = [l for l in open(FILE_DATA, 'r')]
    train_file = open(FILE_TRAIN, 'w')
    valid_file = open(FILE_VALIDATION, 'w')
    tests_file = open(FILE_TEST, 'w')

    num_of_data = len(data)
    num_train = int((PERCENT_TRAIN/100.0)*num_of_data)
    num_valid = int((PERCENT_VALIDATION/100.0)*num_of_data)
    num_tests = int((PERCENT_TEST/100.0)*num_of_data)

    data_fractions = [num_train, num_valid, num_tests]
    split_data = [[],[],[]]

    rand_data_ind = 0

    for split_ind, fraction in enumerate(data_fractions):
        for i in range(fraction):
            rand_data_ind = random.randint(0, len(data)-1)
            split_data[split_ind].append(data[rand_data_ind])
            data.pop(rand_data_ind)

    for l in split_data[0]:
        train_file.write(l)

    for l in split_data[1]:
        valid_file.write(l)

    for l in split_data[2]:
        tests_file.write(l)

    train_file.close()
    valid_file.close()
    tests_file.close()

def write_to_s3(fobj, bucket, key):
    return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fobj)

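def upload_to_s3(bucket, channel, filename):
    # The object key is prefix/channel; the local filename is not part of the key.
    key = prefix+'/'+channel
    url = 's3://{}/{}'.format(bucket, key)
    print('Writing {} to {}'.format(filename, url))
    # Open the file in a with-block so it is closed after the upload.
    with open(filename, 'rb') as fobj:
        write_to_s3(fobj, bucket, key)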

Download the abalone dataset from LIBSVM and save it to a file. Then split it into training, validation, and test sets, and save each one as a local file and to S3.

%%time
import urllib.request

# Load the dataset
FILE_DATA = 'abalone'
urllib.request.urlretrieve("https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone", FILE_DATA)

#split the downloaded data into train/test/validation files
FILE_TRAIN = 'abalone.train'
FILE_VALIDATION = 'abalone.validation'
FILE_TEST = 'abalone.test'
PERCENT_TRAIN = 70
PERCENT_VALIDATION = 15
PERCENT_TEST = 15
data_split(FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST)

#upload the files to the S3 bucket
upload_to_s3(bucket, 'train', FILE_TRAIN)
upload_to_s3(bucket, 'validation', FILE_VALIDATION)
upload_to_s3(bucket, 'test', FILE_TEST)
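
The split done by data_split can also be sketched with a single shuffle followed by slicing. This is only a minimal sketch, not the sample notebook's code: split_lines, its parameters, and the fixed seed are hypothetical names and choices made for illustration. Unlike data_split, it uses integer arithmetic for the sizes and puts every remaining line into the test set, so no line is dropped by rounding.

```python
import random

def split_lines(lines, percent_train, percent_validation, seed=0):
    """Shuffle a copy of `lines` and cut it into train/validation/test lists."""
    shuffled = list(lines)
    # A seeded generator makes the split reproducible (an assumption of this sketch).
    random.Random(seed).shuffle(shuffled)
    num_train = len(shuffled) * percent_train // 100
    num_valid = len(shuffled) * percent_validation // 100
    train = shuffled[:num_train]
    valid = shuffled[num_train:num_train + num_valid]
    test = shuffled[num_train + num_valid:]  # the remainder goes to the test set
    return train, valid, test

# Made-up LIBSVM-style lines, used only to exercise the function.
lines = ['{} 1:{}\n'.format(i, i) for i in range(100)]
train, valid, test = split_lines(lines, 70, 15)
print(len(train), len(valid), len(test))
```

Because the three lists are slices of one shuffled copy, they never overlap, which is the same guarantee data_split gets by popping each chosen line.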

Training

Get the name of the container image for XGBoost.

from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

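Configure the settings needed for training, such as the hyperparameters, the output location, and the training container image, then run the training job. See the documentation for details on the hyperparameters.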

%%time
import boto3
from time import gmtime, strftime

job_name = 'DEMO-xgboost-regression-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

#Ensure that the training and validation data folders generated above are reflected in the "InputDataConfig" parameter below.

create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": bucket_path + "/" + prefix + "/single-xgboost"
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m4.4xlarge",
        "VolumeSizeInGB": 5
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"reg:linear",
        "num_round":"50"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + '/train',
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "libsvm",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + '/validation',
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "libsvm",
            "CompressionType": "None"
        }
    ]
}


client = boto3.client('sagemaker')
client.create_training_job(**create_training_params)

import time

status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
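# 'Stopped' is also a terminal status; without it a manually stopped job would loop forever.
while status not in ('Completed', 'Failed', 'Stopped'):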
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)
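
The same polling pattern is used again later for the endpoint, so it can be factored into a small helper. The following is a rough sketch under assumptions: wait_for_status and its parameters are hypothetical names, and the describe callable stands in for something like client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']. The demo uses a canned sequence of statuses instead of a real AWS call.

```python
import time

def wait_for_status(describe, terminal_statuses, poll_seconds=60, sleep=time.sleep):
    """Call `describe()` repeatedly until it returns one of `terminal_statuses`."""
    status = describe()
    while status not in terminal_statuses:
        sleep(poll_seconds)
        status = describe()
    return status

# Canned statuses standing in for the real describe_training_job responses.
fake_statuses = iter(['InProgress', 'InProgress', 'Stopped'])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    terminal_statuses={'Completed', 'Failed', 'Stopped'},
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(final_status)
```

Passing the set of terminal statuses explicitly keeps the loop from running forever when a job ends in a state other than Completed or Failed.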

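Because a validation channel is configured, the model is evaluated against the validation data at every round. The results are written to Amazon CloudWatch Logs, in the log group /aws/sagemaker/TrainingJobs under the log stream "job_name"/algo-1-"unixtime at execution".

In the logs, you can see that the RMSE gets smaller further down, that is, as the rounds progress.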

Deploying the model

Create a model from the model artifacts produced by the training job above.

%%time
import boto3
from time import gmtime, strftime

model_name=job_name + '-model'
print(model_name)

info = client.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

create_model_response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

Create the endpoint configuration.

from time import gmtime, strftime

endpoint_config_name = 'DEMO-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Use that configuration to create an endpoint and deploy the model. Note that you are charged for as long as the endpoint is running.

%%time
import time

endpoint_name = 'DEMO-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Checking the model

Get a client for sending data to the endpoint.

runtime_client = boto3.client('runtime.sagemaker')

Take just one row of the test data.

!head -1 abalone.test > abalone.single.test

Send the data to the endpoint, receive the prediction, and compare the predicted value with the actual value.

%%time
import json
from itertools import islice
import math
import struct

file_name = 'abalone.single.test' #customize to your test file
with open(file_name, 'r') as f:
    payload = f.read().strip()
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='text/x-libsvm',
                                   Body=payload)
result = response['Body'].read()
result = result.decode("utf-8")
result = result.split(',')
result = [math.ceil(float(i)) for i in result]
label = payload.strip(' ').split()[0]
print ('Label: ',label,'\nPrediction: ', result[0])

The error was 1. The prediction missed, but the prediction process itself worked.

Next, predict on the entire test set. First, the function definitions.

import sys
import math
def do_predict(data, endpoint_name, content_type):
    payload = '\n'.join(data)
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType=content_type,
                                   Body=payload)
    result = response['Body'].read()
    result = result.decode("utf-8")
    result = result.split(',')
    preds = [float((num)) for num in result]
    preds = [math.ceil(num) for num in preds]
    return preds

def batch_predict(data, batch_size, endpoint_name, content_type):
    items = len(data)
    arrs = []

    for offset in range(0, items, batch_size):
        if offset+batch_size < items:
            results = do_predict(data[offset:(offset+batch_size)], endpoint_name, content_type)
            arrs.extend(results)
        else:
            arrs.extend(do_predict(data[offset:items], endpoint_name, content_type))
        sys.stdout.write('.')
    return(arrs)
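
The batching in batch_predict comes down to slicing the data in steps of batch_size. As a small illustration (iter_batches is a hypothetical helper, not part of the notebook), Python slices clamp an end index that runs past the list, so the last, shorter batch needs no special case:

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most `batch_size` items."""
    for offset in range(0, len(data), batch_size):
        # The slice end is clamped to len(data), so the final batch may be shorter.
        yield data[offset:offset + batch_size]

batches = list(iter_batches(list(range(7)), 3))
print(batches)
```

Each batch could then be passed to do_predict in the same way batch_predict does.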

Predict the age of the abalone for each row of the test data and compute the Median Absolute Percent Error (MdAPE). MdAPE is the median of the relative errors over all rows.

%%time
import json
import numpy as np

with open(FILE_TEST, 'r') as f:
    payload = f.read().strip()

labels = [int(line.split(' ')[0]) for line in payload.split('\n')]
test_data = [line for line in payload.split('\n')]
preds = batch_predict(test_data, 100, endpoint_name, 'text/x-libsvm')

print('\n Median Absolute Percent Error (MdAPE) = ', np.median(np.abs(np.array(labels) - np.array(preds)) / np.array(labels)))

The MdAPE is about 14%, so it looks reasonably good. Careful parameter tuning should improve the accuracy further.
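
To make the metric concrete, here is a minimal sketch of the same calculation in plain Python. The mdape function name and the four label/prediction pairs are made up for illustration and are not values from the abalone test set.

```python
from statistics import median

def mdape(labels, preds):
    """Median of |label - prediction| / label over all rows (labels must be non-zero)."""
    return median(abs(y - p) / y for y, p in zip(labels, preds))

# Made-up values for illustration only.
labels = [10, 8, 12, 9]
preds = [11, 8, 9, 10]
value = mdape(labels, preds)
print(value)
```

The relative errors here are 0.1, 0, 0.25, and 1/9, so the median is the average of the two middle values, 0.1 and 1/9. Because it is a median, a few very bad predictions affect MdAPE much less than they would affect a mean-based metric.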

Deleting the endpoint

Delete the endpoint so that you don't spend money unnecessarily.

client.delete_endpoint(EndpointName=endpoint_name)
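
Deleting the endpoint does not remove the endpoint configuration or the model created earlier. If you also want to clean those up, something like the following sketch should work. delete_demo_resources is a hypothetical helper name; delete_endpoint, delete_endpoint_config, and delete_model are methods of the boto3 SageMaker client. The RecordingClient class is only a stand-in so the example runs without AWS access.

```python
def delete_demo_resources(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)

class RecordingClient:
    """Stand-in for boto3.client('sagemaker') that only records the calls it receives."""
    def __init__(self):
        self.calls = []

    def delete_endpoint(self, **kwargs):
        self.calls.append(('delete_endpoint', kwargs))

    def delete_endpoint_config(self, **kwargs):
        self.calls.append(('delete_endpoint_config', kwargs))

    def delete_model(self, **kwargs):
        self.calls.append(('delete_model', kwargs))

# Placeholder resource names, used only for this demo.
fake_client = RecordingClient()
delete_demo_resources(fake_client, 'demo-endpoint', 'demo-config', 'demo-model')
print([name for name, _ in fake_client.calls])
```

In the notebook you would pass the real client together with endpoint_name, endpoint_config_name, and model_name from the earlier cells.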

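Summary

Using the XGBoost regression model, one of Amazon SageMaker's built-in algorithms, we were able to predict the age of abalone. This example used the low-level SageMaker API in boto3, so there were many settings to fill in and it may have felt like a lot of work. SageMaker also has its own SDK, which may make things easier. The following entry uses the SDK to introduce classification with XGBoost. Please take a look if you are interested.

I Tried Classifying MNIST Handwritten Digits with XGBoost on Amazon SageMaker

The following series also covers the other built-in algorithms of Amazon SageMaker. Please take a look if you are interested.

Introduction to Amazon SageMaker Built-in Algorithms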

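Bonus

About the abalone dataset

The abalone dataset used in this entry is the one published by LIBSVM. The original data source is the UCI Machine Learning Repository, but in the LIBSVM version the categorical data has already been converted to numbers, which makes it easier to use.

For the contents of the data, see UCI Machine Learning Repository: Abalone Data Set.

What is the LIBSVM format?

It is the data format used by LIBSVM.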

y 1:x1 2:x2 3:x3 ... i:xi ... n:xn

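y is the label (the target value), xi (i=1,2,...,n) are the input features, and n is the number of input features.

Example

Suppose you have the following data. If sex is the target to predict, sex becomes the label and everything else becomes the input features.

Age	Height	Weight	Sex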
20	160	47	1
25	170	73	2
30	180	65	2

The LIBSVM format for this data is as follows.

1 1:20 2:160 3:47
2 1:25 2:170 3:73
2 1:30 2:180 3:65
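
The conversion above can be sketched in a few lines. The helper names to_libsvm_line and parse_libsvm_line are hypothetical, and this sketch writes every feature; real LIBSVM files are sparse and may omit zero-valued features.

```python
def to_libsvm_line(label, features):
    """Format one row as 'label 1:x1 2:x2 ...' with 1-based feature indices."""
    pairs = ' '.join('{}:{}'.format(i, x) for i, x in enumerate(features, start=1))
    return '{} {}'.format(label, pairs)

def parse_libsvm_line(line):
    """Split a LIBSVM line into its label string and a {index: value} dict."""
    tokens = line.split()
    label = tokens[0]
    features = {}
    for token in tokens[1:]:
        index, value = token.split(':')
        features[int(index)] = float(value)
    return label, features

# Rows of (age, height, weight, sex) from the example table; sex is the label.
rows = [(20, 160, 47, 1), (25, 170, 73, 2), (30, 180, 65, 2)]
lines = [to_libsvm_line(row[-1], row[:-1]) for row in rows]
print('\n'.join(lines))
```

parse_libsvm_line goes the other way, which is handy for checking the first token of a line, the same position the notebook reads the label from.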

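About root mean squared error

Root mean squared error (RMSE) is one of the metrics used to evaluate prediction accuracy when building a predictive model, for example in regression. It is defined as follows.

[latex] RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i-y'_i)}^2} [/latex]

Here n is the number of data points, and for i=1,...,n, y_i is the observed value and y'_i is the predicted value.

The error between the observed and predicted value is squared for each data point, the squares are averaged, and the square root of that average is taken, so a smaller RMSE means higher prediction accuracy.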
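
As a quick check of the definition, here is a minimal sketch in plain Python. The rmse function name and the sample values are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean of squared differences between actual and predicted."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

# Made-up values: the squared errors are 4, 0, 0, 4, so the mean is 2.
actual = [3, 5, 7, 9]
predicted = [1, 5, 7, 11]
result = rmse(actual, predicted)
print(result)
```

The result is the square root of 2, about 1.414, even though two of the four predictions are exact. Squaring gives large errors more weight than small ones.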
