Aiming for ML Specialist! We worked through the sample questions together, so here's a write-up!

Machine Learning, AWS Certification, AWS, Certifications

2019.04.23

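Hi, this is me from the Fukuoka office.

AWS recently released a new exam with a reputation for being tough, the AWS Certified Machine Learning – Specialty, and my senior colleagues have been racing to pass it one after another.

Meanwhile, at the Fukuoka office

(Everyone's signing up for the ML exam...)

I could feel the momentum building,
so I summoned Iwata-san, known for the article below and holder of 10 certifications (amazing...), to the Fukuoka room and asked for a lecture.

AWS Certified Machine Learning – Specialty合格までにやったこと (What I did to pass the AWS Certified Machine Learning – Specialty)

Trying a mokumoku (heads-down study) session

An ML room was set up on our company Slack and things were picking up, so we connected with other offices over HO and studied together. This time we translated the sample questions from the official page and worked through them as a group. I've written up several of them, so please read them along with the translations. Each one is laid out in the order: original text, translation, answer.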

Q1

A Machine Learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a Machine Learning Specialist do to address this concern?

A. Use Amazon SageMaker Pipe mode.

B. Use Amazon Machine Learning to train the models.

C. Use Amazon Kinesis to stream the data to Amazon SageMaker.

D. Use AWS Glue to transform the CSV dataset to the JSON format.

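Translation

Several large CSV files stored in S3 have been used for training with the Amazon SageMaker Linear Learner algorithm, but training takes hours, and the Machine Learning team wants to speed up the training process. What should they do?

A. Use Amazon SageMaker Pipe mode

B. Use Amazon Machine Learning

C. Use Amazon Kinesis to stream the data to Amazon SageMaker

D. Use AWS Glue to convert the CSV to JSON

Answer: A

Explanation:

With Pipe input mode, the dataset is streamed directly to the training instances instead of being downloaded first. This means training jobs start sooner, finish faster, and need less disk space. Amazon SageMaker's built-in algorithms are designed to be fast and highly scalable.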
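As a rough sketch of where Pipe mode gets switched on, the snippet below builds the request for the SageMaker CreateTrainingJob API (the shape boto3's `create_training_job` accepts) without calling AWS. The job name, image URI, role ARN, bucket paths, and instance type are all placeholders assumed for illustration; the only part that matters here is `TrainingInputMode`.

```python
def build_training_request(input_mode="Pipe"):
    """Build a CreateTrainingJob request dict. Nothing is sent to AWS here."""
    return {
        "TrainingJobName": "linear-learner-pipe-demo",  # hypothetical name
        "AlgorithmSpecification": {
            "TrainingImage": "<linear-learner-image-uri>",  # placeholder
            # "Pipe" streams from S3; "File" downloads the data first.
            "TrainingInputMode": input_mode,
        },
        "RoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://example-bucket/train/",  # placeholder
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
                "ContentType": "text/csv",
            }
        ],
        "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.c5.xlarge",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


request = build_training_request("Pipe")
print(request["AlgorithmSpecification"]["TrainingInputMode"])
```

With real values filled in, a dict like this could be passed as `boto3.client("sagemaker").create_training_job(**request)`. Changing `"Pipe"` to `"File"` gives the download-first behavior that the question is trying to get away from.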

References

Amazon SageMaker アルゴリズムのパイプ入力モードを使用する (Using Pipe input mode for Amazon SageMaker algorithms)

Q2

A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: 1. Please call the number below. 2. Please do not call us. What are the dimensions of the tf–idf matrix?

A. (2, 16)

B. (2, 8)

C. (2, 10)

D. (8, 10)

Translation

A term frequency–inverse document frequency (tf–idf) matrix using unigrams and bigrams is built to analyze the following two sentences. Give the dimensions of the tf–idf matrix.

Please call the number below.

Please do not call us.

A. (2, 16)

B. (2, 8)

C. (2, 10)

D. (8, 10)

Answer: A

Explanation

There are 2 sentences, and between them they contain 8 unique unigrams (single words) and 8 unique bigrams (pairs of adjacent words), so the matrix has 2 rows and 8 + 8 = 16 columns: (2, 16).

The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” “the,” “number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” “the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.”
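The count can be checked with a few lines of plain Python. This is only a sketch of the vocabulary-building step (lowercasing and splitting on word characters are assumptions about tokenization), not a full tf–idf implementation; the column count of the matrix is simply the vocabulary size.

```python
import re


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(sentences, max_n=2):
    """Collect every unique 1..max_n-gram across all sentences."""
    vocab = set()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)


corpus = ["Please call the number below.", "Please do not call us."]
vocab = vocabulary(corpus)
shape = (len(corpus), len(vocab))  # one row per sentence, one column per n-gram
print(shape)  # (2, 16)
```

scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` should give a matrix of the same shape for this corpus, since its default tokenizer keeps every word here.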

References

tf-idfについてざっくりまとめ_理論編 (A rough summary of tf-idf: theory)

TfidfVectorizer

Q3

A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?

A. Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.

B. Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.

C. Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.

D. Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.

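Translation

A company wants to set up a system to manage all of the datasets it stores in S3. It wants to automate running transformation jobs on the data and maintaining a catalog of metadata about the datasets, with as little setup and maintenance as possible. Which of the following solutions meets these goals?

A. Create an Amazon EMR cluster with Apache Hive installed, then create a Hive metastore and a script that runs transformation jobs on a schedule

B. Create an AWS Glue crawler to populate the AWS Glue Data Catalog, then author an AWS Glue ETL job and schedule the data transformation jobs

C. Create an Amazon EMR cluster with Apache Spark installed, then create an Apache Hive metastore and a script that runs transformation jobs on a schedule

D. Create an AWS Data Pipeline that transforms the data, then create an Apache Hive metastore and a script that runs transformation jobs on a schedule

Answer: B

Explanation

The key phrase is "the least amount of setup and maintenance." AWS Glue is serverless, so there is no infrastructure to build. Options A, C, and D can also meet the requirements, but they take more effort to configure and cost more to operate and maintain.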

References

AWS Glue
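To make option B a bit more concrete, here is a minimal sketch of the three pieces it describes (crawler, ETL job, schedule), written as the request parameters that boto3's Glue client takes for `create_crawler`, `create_job`, and `create_trigger`. Nothing is sent to AWS, and every name, role, S3 path, and cron expression below is a made-up placeholder.

```python
DATABASE = "example_catalog_db"   # hypothetical Data Catalog database
ROLE = "ExampleGlueServiceRole"   # hypothetical IAM role

# 1. A crawler that scans S3 and populates the Glue Data Catalog.
crawler_request = {
    "Name": "example-s3-crawler",
    "Role": ROLE,
    "DatabaseName": DATABASE,
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/datasets/"}]},
    "Schedule": "cron(0 1 * * ? *)",  # re-crawl daily at 01:00 UTC
}

# 2. An ETL job that runs a transformation script stored in S3.
job_request = {
    "Name": "example-transform-job",
    "Role": ROLE,
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/transform.py",
    },
}

# 3. A scheduled trigger that starts the job every day at 02:00 UTC.
trigger_request = {
    "Name": "example-nightly-trigger",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",
    "Actions": [{"JobName": job_request["Name"]}],
    "StartOnCreation": True,
}

print(trigger_request["Actions"][0]["JobName"])
```

With real values, these would be passed as `glue.create_crawler(**crawler_request)` and so on, where `glue = boto3.client("glue")`. There is no cluster or metastore to stand up, which is the point of the answer.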

Q4

A Data Scientist is working on optimizing a model during the training process by varying multiple parameters. The Data Scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the Data Scientist do to improve the training process?

A. Increase the learning rate. Keep the batch size the same.

B. Reduce the batch size. Decrease the learning rate.

C. Keep the batch size the same. Decrease the learning rate.

D. Do not change the learning rate. Increase the batch size.

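Translation

A Data Scientist is tuning a machine learning model by varying several parameters. Across multiple training runs with identical parameters, the loss function converges to different, yet stable, values.

How can the Data Scientist improve the training process?

A. Increase the learning rate and keep the batch size the same

B. Reduce the batch size and decrease the learning rate

C. Keep the batch size the same and decrease the learning rate

D. Keep the learning rate the same and increase the batch size

Answer: B

Explanation

When training a neural network model, overfitting can occur, or training can get stuck in a local minimum, so that performance does not improve.

Source: ニューラルネットワークの訓練の問題点：過学習と局所的極小点 (Problems in training neural networks: overfitting and local minima)

The question comes down to how to avoid getting stuck in local minima and improve training. Converging to a different stable value on each run suggests the loss surface has several local minima that training keeps settling into. Reducing the batch size adds noise to each gradient step, which helps training escape local minima stochastically, and decreasing the learning rate at the same time keeps the updates from overshooting the global minimum.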
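The two effects can be seen in a toy setting. The sketch below is not a neural network: it uses a made-up one-parameter squared loss to show that smaller batches give noisier gradient estimates, and plain gradient descent on f(x) = x^2 to show a learning rate that is too large overshooting the minimum. All data and constants are assumptions chosen for illustration.

```python
import random
import statistics


def minibatch_gradient_std(batch_size, n_trials=2000, seed=0):
    """Spread of the mini-batch gradient of mean((w - x_i)^2 / 2) at w = 0."""
    rng = random.Random(seed)
    data = [rng.gauss(1.0, 2.0) for _ in range(1024)]  # synthetic dataset
    w = 0.0
    grads = []
    for _ in range(n_trials):
        batch = rng.sample(data, batch_size)
        grads.append(sum(w - x for x in batch) / batch_size)
    return statistics.pstdev(grads)


def gradient_descent_on_parabola(lr, steps=50, x0=1.0):
    """Plain gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x


small_batch_noise = minibatch_gradient_std(batch_size=4)
large_batch_noise = minibatch_gradient_std(batch_size=64)
print(small_batch_noise > large_batch_noise)  # True: smaller batch, noisier steps

print(abs(gradient_descent_on_parabola(lr=0.1)))  # shrinks toward the minimum at 0
print(abs(gradient_descent_on_parabola(lr=1.1)))  # grows: each step overshoots
```

The extra noise from small batches is what lets training hop out of a shallow local minimum, while the smaller learning rate keeps those hops from flying past the minimum you actually want.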

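References

ニューラルネットワークの訓練の問題点：過学習と局所的極小点 (Problems in training neural networks: overfitting and local minima)

ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA

損失関数について、ざっくりと考える (A rough look at loss functions)

Loss Functions in Neural Networks

目的関数、コスト関数、誤差関数、損失関数いろいろあるけど、なにが違うのかを検討 (Objective, cost, error, and loss functions: examining how they differ)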

Q6

A Data Scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitively help the model detect more than 10% of fraud cases?

A. Using undersampling to balance the dataset

B. Decreasing the class probability threshold

C. Using regularization to reduce overfitting

D. Using oversampling to balance the dataset

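Translation

A Data Scientist builds a fraud detection model using logistic regression. The model's accuracy is 99%, but it fails to detect 90% of the fraud cases. Which action will definitively help the model detect more than 10% of fraud cases?

A. Use undersampling to balance the dataset

B. Decrease the class probability threshold

C. Use regularization to reduce overfitting

D. Use oversampling to balance the dataset

Answer: B

Explanation:

・Class probability threshold: the cutoff on the predicted probability above which a case is classified as positive (here, fraud)
In this question the model detects only 10% of fraud cases even though its accuracy is 99%, so the reading is that the threshold is too high. Lowering it makes the model flag more cases as fraud, so more of the actual fraud is caught, at the cost of more false alarms.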
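Here is a small sketch of that effect with made-up scores. The 1,000 predicted probabilities below are invented so that a 0.5 threshold reproduces the situation in the question (99% accuracy, only 10% of fraud detected); they do not come from any real model.

```python
def evaluate(scores, labels, threshold):
    """Return (accuracy, recall, false_positives) at a probability threshold."""
    tp = sum(1 for p, y in zip(scores, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(scores, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(scores, labels) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(scores, labels) if p < threshold and y == 0)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn)
    return accuracy, recall, fp


# Hypothetical predicted fraud probabilities: 10 fraud cases, 990 legitimate.
fraud_scores = [0.9, 0.45, 0.4, 0.35, 0.3, 0.3, 0.25, 0.2, 0.15, 0.05]
legit_scores = [0.6] + [0.3] * 20 + [0.02] * 969

scores = fraud_scores + legit_scores
labels = [1] * len(fraud_scores) + [0] * len(legit_scores)

for threshold in (0.5, 0.2):
    accuracy, recall, false_positives = evaluate(scores, labels, threshold)
    print(threshold, accuracy, recall, false_positives)
```

At 0.5 only one of the ten fraud cases is flagged. Dropping the threshold to 0.2 catches eight of them, while the number of legitimate cases wrongly flagged goes from 1 to 21. Because the model itself is unchanged, lowering the threshold can never reduce the number of fraud cases flagged, which is why the question calls it "definitive."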

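References

不均衡データにおけるsampling (Sampling for imbalanced data)

Over-/Under-samplingをして学習した2クラス分類器の予測確率を調整する式 (A formula for adjusting the predicted probabilities of a binary classifier trained with over-/under-sampling)

不均衡データのクラス分類 (Classification of imbalanced data)

過学習を防ぐ正則化 (Regularization to prevent overfitting)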

Q7

A company is interested in building a fraud detection model. Currently, the Data Scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?

A. Oversampling using bootstrapping

B. Undersampling

C. Oversampling using SMOTE

D. Class weight adjustment

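Translation

A company is interested in building a fraud detection model. Currently, the Data Scientist does not have enough information because the number of fraud cases is low. Which method is most likely to detect the greatest number of valid fraud cases?

A. Oversampling using bootstrapping

B. Undersampling

C. Oversampling using SMOTE

D. Class weight adjustment

Answer: C

Explanation:

・SMOTE algorithm (Synthetic Minority Over-sampling Technique): "SMOTE is a technique that brings imbalanced data closer to balanced by artificially generating data for the minority class and removing data from the majority class. (Excerpt from SMOTE で不均衡データの分類)" Note that the synthetic generation is the SMOTE step itself: each new minority sample is interpolated between an existing minority sample and one of its nearest minority-class neighbors. Removing majority-class data is undersampling, which is often combined with SMOTE.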
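The core of SMOTE, generating a new minority sample on the line between an existing minority sample and one of its nearest minority neighbors, fits in a few lines. This is a simplified sketch of that interpolation step only, with invented 2-D points; a real project would more likely use a maintained implementation such as the one in the imbalanced-learn package.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # how far to move from base toward the neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class (fraud) samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(minority) + len(new_points))  # minority class grows from 4 to 10
```

Unlike bootstrapping, which only repeats existing fraud cases, the interpolated points are new, which gives the model more varied minority examples to learn from.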

References

SMOTE で不均衡データの分類 (Classifying imbalanced data with SMOTE)

不均衡データの分類問題 (Classification problems with imbalanced data)

Q8

A Machine Learning Engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML Engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML Engineer do to minimize bias due to missing values?

A. Replace each missing value by the mean or median across non-missing values in same row.

B. Delete observations that contain missing values because these represent less than 5% of the data.

C. Replace each missing value by the mean or median across non-missing values in the same column.

D. For each feature, approximate the missing values using supervised learning based on other features.

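Translation

A Machine Learning Engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML Engineer notices that the target label classes are highly imbalanced and that multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML Engineer do to minimize bias due to missing values?

A. Replace each missing value with the mean or median of the non-missing values in the same row

B. Delete the observations that contain missing values, since they make up less than 5% of the data

C. Replace each missing value with the mean or median of the non-missing values in the same column

D. For each feature, estimate the missing values using supervised learning based on the other features

Answer: D

Explanation:

Different supervised learning approaches may give different results, but a properly implemented one should approximate the missing values as well as, or better than, the mean or median.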
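A tiny example shows why predicting from other features can beat a column mean. The data below is invented so that the second feature depends linearly on the first, and a one-variable least-squares line stands in for "supervised learning"; real data would be noisier and call for a proper model.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


# Hypothetical rows of (feature_a, feature_b); the last row is missing feature_b.
rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0), (10.0, None)]

complete = [(a, b) for a, b in rows if b is not None]
xs = [a for a, _ in complete]
ys = [b for _, b in complete]

# Option C: fill with the column mean, ignoring the other feature.
mean_fill = mean(ys)

# Option D: predict the missing value from feature_a.
slope, intercept = fit_line(xs, ys)
model_fill = slope * rows[-1][0] + intercept

print(mean_fill, model_fill)
```

Here the column mean fills in 6.0, while the fitted line fills in 21.0, which matches the pattern in the complete rows (feature_b = 2 * feature_a + 1). The mean ignores everything else known about the row, and that is the bias the question asks you to minimize.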

References

A Comparison of Six Methods for Missing Data Imputation

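Thoughts

Normally I'd think, "Oh, I don't know this, I'll google it later, and if I still don't get it I'll ask on Slack..." But the exam covers so much ground that googling everything would never keep up, so I took the chance to just ask about everything.

By deliberately throwing out even small questions like "What was AWS Glue again?" or "What is SageMaker Pipe mode?", we could study in a lively way, and it was a lot of fun.

I've also become interested in Machine Learning, which had felt like a high hurdle.

Quietly signing up

I asked my boss whether I could take the Specialty exam even without a Professional certification and got an easy OK, so I think I'll give it a try. Let's do this... let's do this...!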
