TD & AWS連携：Treasure Dataの結果をtd-agent経由でAmazon Redshiftに直接取り込む

2014.11.05

この記事は公開されてから1年以上経過しています。情報が古い可能性がありますので、ご注意ください。

Treasure DataとAWSの連携は非常に注目度も高く、弊社で展開している"顧客理解のためのビッグデータ分析基盤"、『カスタマーストーリー』に於いても、構成要素の中でもTreasure Dataは重要な位置を占めています。

カスタマーストーリー｜クラスメソッド株式会社

そして、Treasure Data社から提供されているTreasure Agent(td-agent)では実に様々な処理を行う事が出来ます。各所で収集したログをTreasure Dataに集約する際にこのコマンドを用いているというケースが恐らく一番良く利用されている・知られているケースになると思います。

Overview of Streaming Import by Treasure Agent (td-agent) | Treasure Data

このtd-agentを使って、Treasure Dataから様々な環境へデータの転送を行う事が可能です。当エントリでは、Treasure Dataに蓄積されたデータをAmazon Redshiftへ直接転送、インポートする際の手順について、環境準備の手順を踏まえながらご紹介して行きたいと思います。

超ざっくりな感じですが、処理のイメージはこんな感じです。何らか蓄積されたTreasure Dataのデータに対して、EC2で整えたTreasure Dataのツールからデータを取得、そのままその内容をRedshiftにコピーする...というイメージです。当エントリではこのEC2に於ける環境構築、受け手側のAmazon Redshiftの環境構築、そしてTreasure Dataのコマンドでデータを取得、Redshiftに連携...となります。

Treasure Dataのアカウント取得

まず何より必要なのはTreasure Dataのアカウントです。アカウント登録を済ませ、サービスが利用可能な状態にしておいてください。

Treasure Data - 企業のデータ基盤となり、ビジネスバリューの発見を促す | Treasure Data

Treasure Data実行環境の構築

Amazon Linux環境にgemを使って環境を用意する場合は、以下の大瀧さん、横田さんのエントリを参考に整える事が出来ます。

まずは手頃なAmazon Linux EC2を用意し、サーバにログイン後以下コマンド群を実行。

## 一括更新&Rubyの現行バージョン確認
$ sudo yum -y update
$ ruby -v
ruby 2.0.0p481 (2014-05-08 revision 45883) [x86_64-linux]

## Ruby入れ替え及び環境整備
$ sudo yum remove -y ruby ruby20
$ sudo yum install -y ruby21 ruby21-devel
$ sudo yum groupinstall -y "Development Tools"
$ sudo yum install -y fakeroot
$ gem install bundler

## Rubyのバージョンを改めて確認
$ ruby -v
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux-gnu]

次いで、gemを使ってtdをインストールします。tdのバージョン確認まで出来ればOKです。

$ gem install td
Fetching: msgpack-0.5.9.gem (100%)
Building native extensions.  This could take a while...
Successfully installed msgpack-0.5.9
:
(中略)
:
Installing ri documentation for yajl-ruby-1.2.1
Done installing documentation for fluent-logger, hirb, httpclient, msgpack, parallel, rubyzip, td, td-client, td-logger, yajl-ruby after 6 seconds
10 gems installed
$ 
$ td --version
0.11.5

Treasure Data サンプルデータベース環境を用意する

tdコマンドを利用するには、アカウント情報を設定しておく必要があります。td accountでメールアドレス、パスワードを設定しておきます。

$ td account
Enter your Treasure Data credentials.
Email: **********@***************
Password (typing will be hidden): XXXXXXXXXXX
Authenticated successfully.
Use 'td db:create <db_name>' to create a database.
$

データベース及びテーブルの作成。この辺りは公式ドキュメントが用意されていますので参考にしつつ、sample_dbデータベースに、www_accessテーブル(こちらはサンプルで用意されているテーブルとなります)を構築します。

Treasure Data Toolbelt: Command-line Interface | Treasure Data

## データベース作成
$ td db:create sample_db
Database 'sample_db' is created.
Use 'td table:create sample_db <table_name>' to create a table.

## テーブルの作成及びデータ投入
$ td sample:apache sample_apache.json
Created sample_apache.json with 5000 records whose time is in the [2014-10-30 06:47:53 +0000, 2014-10-31 00:40:23 +0000] range.
Use 'td table:import <db> <table> --json sample_apache.json' to import this file.
$ td table:import sample_db www_access --auto-create-table -f json sample_apache.json
Table 'sample_db.www_access' is created.
importing sample_apache.json...
  uploading 117364 bytes...
  imported 5000 entries from sample_apache.json.
done.
$

試しにクエリを実行してみましょう。ちゃんと結果が返って来ました！

$ td query -w -d sample_db \
>   "SELECT code, COUNT(1) AS cnt FROM www_access GROUP BY code"
Job 16562958 is queued.
Use 'td job:show 16562958' to show the status.
queued...
  started at 2014-10-31T00:42:14Z
  Hive history file=/mnt/hive/tmp/5315/hive_job_log__1070356623.txt
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks not specified. Estimated from input data size: 1
  In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
  In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
  In order to set a constant number of reducers:
    set mapred.reduce.tasks=<number>
  **
  ** WARNING: time index filtering is not set!
  ** This query could be very slow as a result.
  ** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
  **
  Starting Job = job_201410210918_296428, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_296428
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201410210918_296428
  Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  2014-10-31 00:42:43,153 Stage-1 map = 0%,  reduce = 0%
  2014-10-31 00:43:01,967 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.41 sec
  2014-10-31 00:43:02,990 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.41 sec
  :
  (中略)
  :
  2014-10-31 00:43:10,158 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.35 sec
  2014-10-31 00:43:11,172 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.35 sec
  2014-10-31 00:43:12,190 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.35 sec
  finished at 2014-10-31T00:43:16Z
  2014-10-31 00:43:13,334 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.35 sec
  MapReduce Total cumulative CPU time: 7 seconds 350 msec
  Ended Job = job_201410210918_296428
  MapReduce Jobs Launched: 
  Job 0: Map: 1  Reduce: 1   Cumulative CPU: 7.35 sec   HDFS Read: 412 HDFS Write: 145 SUCCESS
  Total MapReduce CPU Time Spent: 7 seconds 350 msec
  OK
  MapReduce time taken: 52.458 seconds
  Fetching results...
  Total CPU Time: 7350
  Time taken: 52.555 seconds
Status     : success
Result     :
+------+------+     B / 100.0%                                                                                                    
| code | cnt  |
+------+------+
| 200  | 4981 |
| 404  | 17   |
| 500  | 2    |
+------+------+
3 rows in set
$

Redshiftクラスタ環境の構築

連携元となるTreasure Dataの環境はこれでひとまず用意出来ました。次いで、Amazon Redshiftの環境構築に取り掛かりたいと思います。

と、その前に、連携元テーブルの『www_access』について、その内容を確認しておく必要があります。td:tableコマンドで構成要素を確認します。

$ td table:show sample_db www_access
Name        : sample_db.www_access
Type        : log
Count       : 5000
Schema      : (
    host:string
    path:string
    method:string
    referer:string
    code:long
    agent:string
    user:string
    size:long
)
$

上記コマンドだけでは文字列型項目の項目長が分かりませんので、現行登録されている5000件の中でそれぞれの最大桁数を求めます。

$ td query -w -d sample_db \
> "SELECT \
>     LENGTH(MAX(host)), \
>     LENGTH(MAX(path)), \
>     LENGTH(MAX(method)), \
>     LENGTH(MAX(referer)), \
>     LENGTH(MAX(agent)), \
>     LENGTH(MAX(user)) \
> FROM \
>     www_access"
Job 16564288 is queued.
Use 'td job:show 16564288' to show the status.
queued...
  started at 2014-10-31T01:26:34Z
  Hive history file=/mnt/hive/tmp/5315/hive_job_log__1683480054.txt
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks determined at compile time: 1
  In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
  In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
  In order to set a constant number of reducers:
    set mapred.reduce.tasks=<number>
  **
  ** WARNING: time index filtering is not set!
  ** This query could be very slow as a result.
  ** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
  **
  Starting Job = job_201410210918_297193, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_297193
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201410210918_297193
  Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  2014-10-31 01:27:05,259 Stage-1 map = 0%,  reduce = 0%
  2014-10-31 01:27:25,172 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.67 sec
  2014-10-31 01:27:26,672 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.67 sec
  :
  (中略)
  :
  2014-10-31 01:30:59,343 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 27.39 sec
  2014-10-31 01:31:00,353 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 27.39 sec
  2014-10-31 01:31:01,412 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 27.39 sec
  finished at 2014-10-31T01:31:06Z
  2014-10-31 01:31:02,762 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 27.39 sec
  MapReduce Total cumulative CPU time: 27 seconds 390 msec
  Ended Job = job_201410210918_297193
  MapReduce Jobs Launched: 
  Job 0: Map: 1  Reduce: 1   Cumulative CPU: 27.39 sec   HDFS Read: 412 HDFS Write: 117 SUCCESS
  Total MapReduce CPU Time Spent: 27 seconds 390 msec
  OK
  MapReduce time taken: 261.67 seconds
  Fetching results...
  Total CPU Time: 27390
  Time taken: 262.34 seconds
Status     : success
Result     :
+-----+-----+-----+-----+-----+-----+                                                                                                                       
| _c0 | _c1 | _c2 | _c3 | _c4 | _c5 |
+-----+-----+-----+-----+-----+-----+
| 11  | 27  | 4   | 176 | 136 | 1   |
+-----+-----+-----+-----+-----+-----+
1 row in set
$

Redshiftクラスタにログインし、上記桁数情報を踏まえたテーブルを作成します。

# CREATE TABLE public.www_access (
#     access_time DATE encode delta NOT NULL,
#     access_host VARCHAR(15) encode lzo NOT NULL,
#     access_path VARCHAR(40) encode lzo NOT NULL,
#     method CHAR(5) encode bytedict NOT NULL,
#     referer VARCHAR(200) encode lzo NOT NULL,
#     status_code SMALLINT encode bytedict NOT NULL,
#     access_agent VARCHAR(150) encode lzo NOT NULL,
#     access_user VARCHAR(3) encode lzo NOT NULL,
#     access_size SMALLINT encode mostly8 NOT NULL,
#     PRIMARY KEY(access_host)
# )
# distkey(access_host)
# sortkey(time);
CREATE TABLE
# 
# SELECT * FROM pg_table_def WHERE schemaname = 'public' AND tablename = 'www_access';
 schemaname | tablename  |    column    |          type          | encoding | distkey | sortkey | notnull 
------------+------------+--------------+------------------------+----------+---------+---------+---------
 public     | www_access | access_time  | date                   | delta    | f       |       1 | t
 public     | www_access | access_host  | character varying(15)  | lzo      | t       |       0 | t
 public     | www_access | access_path  | character varying(40)  | lzo      | f       |       0 | t
 public     | www_access | method       | character(5)           | bytedict | f       |       0 | t
 public     | www_access | referer      | character varying(200) | lzo      | f       |       0 | t
 public     | www_access | status_code  | smallint               | bytedict | f       |       0 | t
 public     | www_access | access_agent | character varying(150) | lzo      | f       |       0 | t
 public     | www_access | access_user  | character varying(3)   | lzo      | f       |       0 | t
 public     | www_access | access_size  | smallint               | mostly8  | f       |       0 | t
(9 行)
#

上記Redshiftへのクエリと同様の内容でTDでも、(Amazon Redshiftに投入するのに使用する)クエリを実行してみます。項目名を変更しているのはRedshift側で予約語として使われていた項目名(user)があったので、その辺りも踏まえています。

$ td query -w -d sample_db \
> "SELECT \
>   TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST') AS access_time, \
>   host AS access_host, \
>   path AS access_path, \
>   method, \
>   referer, \
>   code AS status_code, \
>   agent AS access_agent, \
>   user AS access_user, \
>   size AS access_size \
> FROM \
>   www_access \
> LIMIT 10;"
Job 16568656 is queued.
Use 'td job:show 16568656' to show the status.
queued...
  started at 2014-10-31T04:04:09Z
  Hive history file=/mnt/hive/tmp/5315/hive_job_log__595277233.txt
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks is set to 0 since there's no reduce operator
  **
  ** WARNING: time index filtering is not set!
  ** This query could be very slow as a result.
  ** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
  **
  Starting Job = job_201410210918_300198, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_300198
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201410210918_300198
  Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
  2014-10-31 04:04:38,627 Stage-1 map = 0%,  reduce = 0%
  2014-10-31 04:05:12,535 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.04 sec
  2014-10-31 04:05:13,573 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.04 sec
  :
  (中略)
  :
  2014-10-31 04:05:25,644 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.04 sec
  2014-10-31 04:05:26,672 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.04 sec
  2014-10-31 04:05:27,696 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.04 sec
  MapReduce Total cumulative CPU time: 5 seconds 40 msec
  Ended Job = job_201410210918_300198
  MapReduce Jobs Launched: 
  Job 0: Map: 1   Cumulative CPU: 5.04 sec   HDFS Read: 412 HDFS Write: 2028 SUCCESS
  Total MapReduce CPU Time Spent: 5 seconds 40 msec
  OK
  MapReduce time taken: 70.659 seconds
  Fetching results...
  Total CPU Time: 5040
  Time taken: 70.755 seconds
  finished at 2014-10-31T04:05:29Z
Status     : success
Result     :
+---------------------+-----------------+------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| access_time         | access_host     | access_path                  | method | referer                                                                                                                                                                                | status_code | access_agent                                                                                                                             | access_user | access_size |
+---------------------+-----------------+------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| 2014-10-31 09:40:23 | 128.216.140.97  | /item/sports/2511            | GET    | http://www.google.com/search?ie=UTF-8&q=google&sclient=psy-ab&q=Sports+Electronics&oq=Sports+Electronics&aq=f&aqi=g-vL1&aql=&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&biw=3994&bih=421 | 200         | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7                                | -           | 95          |
| 2014-10-31 09:40:22 | 224.225.147.72  | /category/electronics        | GET    | -                                                                                                                                                                                      | 200         | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)                                                                          | -           | 43          |
| 2014-10-31 09:40:20 | 172.75.186.56   | /category/jewelry            | GET    | -                                                                                                                                                                                      | 200         | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                 | -           | 79          |
| 2014-10-31 09:40:18 | 196.183.34.70   | /search/?c=Electronics+Games | POST   | -                                                                                                                                                                                      | 200         | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1                                                             | -           | 86          |
| 2014-10-31 09:40:16 | 116.198.217.173 | /category/electronics        | GET    | /item/games/4012                                                                                                                                                                       | 200         | Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1                                                                    | -           | 94          |
| 2014-10-31 09:40:14 | 100.126.194.214 | /category/games              | GET    | /category/computers                                                                                                                                                                    | 200         | Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3 | -           | 58          |
| 2014-10-31 09:40:12 | 216.51.135.153  | /item/games/4772             | GET    | /item/electronics/4627                                                                                                                                                                 | 200         | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1                                                             | -           | 131         |
| 2014-10-31 09:40:10 | 40.81.151.94    | /item/electronics/3883       | GET    | /item/giftcards/855                                                                                                                                                                    | 200         | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)                                                                   | -           | 127         |
| 2014-10-31 09:40:08 | 76.75.28.148    | /category/cameras            | GET    | -                                                                                                                                                                                      | 200         | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)                                                                          | -           | 133         |
| 2014-10-31 09:40:05 | 108.78.162.171  | /category/computers          | GET    | /category/games                                                                                                                                                                        | 200         | Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11                                     | -           | 57          |
+---------------------+-----------------+------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
10 rows in set
$

Treasure DataからAmazon Redshiftへのデータ投入

さぁ、いよいよ本題です。Redshiftへのアクセスに関する情報は下記ページにまとまっていますので、必要に応じた設定値で実行文を組み立てます。

Writing Job Results into AWS Redshift | Treasure Data

# デフォルトスキーマ指定の場合 redshift://<username>:<password>@<hostname>:<port>/<database>/<table> # 個別スキーマ指定の場合 redshift://<username>:<password>@<hostname>:<port>/<database>/<table>?schema=<schema>

以下が実行ログです。下記情報で処理を行ったという想定です。(※1行目の設定は内容を置き換えていますが、以降の処理は実際の内容となっています)45行目まででデータ取得を行い、46行目以降でRedshiftへのCOPY処理を行っているのがログからも読み取れると思います。

ユーザー名：username
パスワード：PassWord123
ホスト名(エンドポイント)：redshift-sample.xxxxxxxx.us-east-1.redshift.amazonaws.com
ポート番号：5439
データベース名：testdb
テーブル名：www_access
スキーマ名：public

$ td query -w -d sample_db \
> --result 'redshift://username:PassWord123@redshift-sample.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/testdb/www_access?schema=public' \
> "SELECT \
>   TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST') AS access_time, \
>   host AS access_host, \
>   path AS access_path, \
>   method, \
>   referer, \
>   code AS status_code, \
>   agent AS access_agent, \
>   user AS access_user, \
>   size AS access_size \
> FROM \
>   www_access" 
Job 16569297 is queued.
Use 'td job:show 16569297' to show the status.
queued...
  started at 2014-10-31T04:23:43Z
  Hive history file=/mnt/hive/tmp/5315/hive_job_log__1638434606.txt
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks is set to 0 since there's no reduce operator
  **
  ** WARNING: time index filtering is not set!
  ** This query could be very slow as a result.
  ** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
  **
  Starting Job = job_201410210918_300609, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_300609
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_201410210918_300609
  Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
  2014-10-31 04:24:10,023 Stage-1 map = 0%,  reduce = 0%
  2014-10-31 04:24:22,401 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.19 sec
  2014-10-31 04:24:23,421 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.19 sec
  2014-10-31 04:24:24,431 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.19 sec
  :
  (中略)
  :
  2014-10-31 04:24:40,856 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.19 sec
  2014-10-31 04:24:41,891 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.19 sec
  MapReduce Total cumulative CPU time: 6 seconds 190 msec
  Ended Job = job_201410210918_300609
  MapReduce Jobs Launched: 
  Job 0: Map: 1   Cumulative CPU: 6.19 sec   HDFS Read: 412 HDFS Write: 1041176 SUCCESS
  Total MapReduce CPU Time Spent: 6 seconds 190 msec
  OK
  MapReduce time taken: 52.557 seconds
  Fetching results...
  Total CPU Time: 6190
  Time taken: 52.964 seconds
  finished at 2014-10-31T04:24:44Z
  14/10/31 04:24:49 INFO result.AbstractJDBCResult: Mode: in-place append
  14/10/31 04:24:49 INFO result.RedshiftResult: Method: copy
  14/10/31 04:24:49 INFO result.RedshiftResult: S3 region: us-east-1
  14/10/31 04:24:49 INFO result.ResultWorker: Data size: 109,514 bytes (compressed)
  14/10/31 04:24:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
  14/10/31 04:24:49 INFO compress.CodecPool: Got brand-new decompressor [.gz]
  14/10/31 04:24:49 INFO result.AbstractJDBCResult: Getting list of tables...
  14/10/31 04:24:49 INFO result.AbstractJDBCResult: Table "www_access" exists. Using its schema to store query results.
  14/10/31 04:24:49 INFO result.RedshiftResult: Copy SQL: COPY "www_access" ("access_time", "access_host", "access_path", "method", "referer", "status_code", "access_agent", "access_user", "access_size") FROM ? GZIP DELIMITER '\t' NULL '\N' ESCAPE TRUNCATECOLUMNS ACCEPTINVCHARS STATUPDATE OFF COMPUPDATE OFF
  14/10/31 04:24:50 INFO jdbc.RedshiftCopyBatchInsert: Uploading file id 7ab95991-507e-4d3f-ab95-28197436478f to S3 (90754 bytes)
  14/10/31 04:24:50 INFO jdbc.RedshiftCopyBatchInsert: Running COPY from file id 7ab95991-507e-4d3f-ab95-28197436478f
  14/10/31 04:24:54 INFO jdbc.RedshiftCopyBatchInsert: Loaded 1 files.
Status     : success
Result     :
+---------------------+-----------------+------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| access_time         | access_host     | access_path                        | method | referer                                                                                                                                                                                          | status_code | access_agent                                                                                                                                                                                               | access_user | access_size |
+---------------------+-----------------+------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| 2014-10-31 09:40:23 | 128.216.140.97  | /item/sports/2511                  | GET    | http://www.google.com/search?ie=UTF-8&q=google&sclient=psy-ab&q=Sports+Electronics&oq=Sports+Electronics&aq=f&aqi=g-vL1&aql=&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&biw=3994&bih=421           | 200         | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7                                                                                                  | -           | 95          |
| 2014-10-31 09:40:22 | 224.225.147.72  | /category/electronics              | GET    | -                                                                                                                                                                                                | 200         | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)                                                                                                                                            | -           | 43          |
| 2014-10-31 09:40:20 | 172.75.186.56   | /category/jewelry                  | GET    | -                                                                                                                                                                                                | 200         | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                                   | -           | 79          |
| 2014-10-31 09:40:18 | 196.183.34.70   | /search/?c=Electronics+Games       | POST   | -                                                                                                                                                                                                | 200         | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1                                                                                                                               | -           | 86          |
| 2014-10-31 09:40:16 | 116.198.217.173 | /category/electronics              | GET    | /item/games/4012                                                                                                                                                                                 | 200         | Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1                                                                                                                                      | -           | 94          |
:
(中略)
:
| 2014-10-30 15:48:09 | 32.219.129.36   | /item/sports/1511                  | GET    | /item/games/103                                                                                                                                                                                  | 200         | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; .NET4.0C)                             | -           | 115         |
| 2014-10-30 15:47:53 | 200.129.205.208 | /category/electronics              | GET    | -                                                                                                                                                                                                | 200         | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11                                                                                                | -           | 62          |
+---------------------+-----------------+------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
5000 rows in set
$

Redshiftに接続して中身を確認してみます。内容も問題無さそうですね。

# SELECT COUNT(*) FROM public.www_access;
 count 
-------
  5000
(1 行)

# SELECT
#   access_time,
#   access_host,
#   TRIM(access_path) AS access_path,
#   method,
#   SUBSTRING(TRIM(referer),0,20) AS referer,
#   status_code,
#   access_user,
#   access_size
# FROM
#    www_access
# LIMIT 10;
 
 access_time |   access_host   |      access_path      | method |       referer       | status_code | access_user | access_size 
-------------+-----------------+-----------------------+--------+---------------------+-------------+-------------+-------------
 2014-10-30  | 96.54.24.116    | /category/office      | GET    | /item/software/494  |         200 | -           |         130
 2014-10-30  | 104.192.45.190  | /category/books       | GET    | /item/books/2860    |         200 | -           |          90
 2014-10-30  | 108.78.209.95   | /category/software    | GET    | /category/networkin |         200 | -           |          48
 2014-10-30  | 220.183.213.181 | /item/software/2553   | GET    | /item/books/383     |         200 | -           |          96
 2014-10-30  | 44.135.67.116   | /item/health/1189     | GET    | /search/?c=Electron |         200 | -           |          57
 2014-10-30  | 132.54.226.209  | /item/sports/2198     | GET    | /category/sports    |         200 | -           |         116
 2014-10-30  | 208.18.229.78   | /category/electronics | GET    | /category/electroni |         200 | -           |          63
 2014-10-30  | 92.63.34.188    | /item/office/4429     | GET    | /item/games/2637    |         200 | -           |          72
 2014-10-30  | 80.132.220.20   | /category/books       | GET    | /item/software/2593 |         200 | -           |         133
 2014-10-30  | 116.105.196.64  | /item/software/2872   | GET    | -                   |         200 | -           |          62
(10 行)
#

まとめ

以上、Treasure DataとAmazon Redshiftのデータ連携に関するご紹介でした。Treasure Dataからはこの他にも様々なリソースへの出力に対応しています。その辺りについても、機会を見てご紹介して行ければと思います。こちらからは以上です。

TD & AWS連携：Treasure Dataの結果をtd-agent経由でAmazon Redshiftに直接取り込む

目次

処理イメージ

Treasure Dataのアカウント取得

Treasure Data実行環境の構築

Treasure Data サンプルデータベース環境を用意する

Redshiftクラスタ環境の構築

Treasure DataからAmazon Redshiftへのデータ投入

まとめ

関連記事

AWSで探す

注目のテーマ

プロダクトやサービスで探す

特集やシリーズから探す

EVENTS