この記事は公開されてから1年以上経過しています。情報が古い可能性がありますので、ご注意ください。
Treasure DataとAWSの連携は非常に注目度も高く、弊社で展開している"顧客理解のためのビッグデータ分析基盤"、『カスタマーストーリー』に於いても、構成要素の中でもTreasure Dataは重要な位置を占めています。
そして、Treasure Data社から提供されているTreasure Agent(td-agent)では実に様々な処理を行う事が出来ます。各所で収集したログをTreasure Dataに集約する際にこのコマンドを用いているというケースが恐らく一番良く利用されている・知られているケースになると思います。
このtd-agentを使って、Treasure Dataから様々な環境へデータの転送を行う事が可能です。当エントリでは、Treasure Dataに蓄積されたデータをAmazon Redshiftへ直接転送、インポートする際の手順について、環境準備の手順を踏まえながらご紹介して行きたいと思います。
目次
- 処理イメージ
- Treasure Dataのアカウント取得
- Treasure Data実行環境の構築
- サンプルデータベース環境の構築
- Redshiftクラスタ環境の構築
- Treasure DataからAmazon Redshiftへのデータ投入
- まとめ
処理イメージ
超ざっくりな感じですが、処理のイメージはこんな感じです。何らか蓄積されたTreasure Dataのデータに対して、EC2で整えたTreasure Dataのツールからデータを取得、そのままその内容をRedshiftにコピーする...というイメージです。当エントリではこのEC2に於ける環境構築、受け手側のAmazon Redshiftの環境構築、そしてTreasure Dataのコマンドでデータを取得、Redshiftに連携...となります。
Treasure Dataのアカウント取得
まず何より必要なのはTreasure Dataのアカウントです。アカウント登録を済ませ、サービスが利用可能な状態にしておいてください。
Treasure Data実行環境の構築
Amazon Linux環境にgemを使って環境を用意する場合は、以下の大瀧さん、横田さんのエントリを参考に整える事が出来ます。
まずは手頃なAmazon Linux EC2を用意し、サーバにログイン後以下コマンド群を実行。
## 一括更新&Rubyの現行バージョン確認
$ sudo yum -y update
$ ruby -v
ruby 2.0.0p481 (2014-05-08 revision 45883) [x86_64-linux]
## Ruby入れ替え及び環境整備
$ sudo yum remove -y ruby ruby20
$ sudo yum install -y ruby21 ruby21-devel
$ sudo yum groupinstall -y "Development Tools"
$ sudo yum install -y fakeroot
$ gem install bundler
## Rubyのバージョンを改めて確認
$ ruby -v
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux-gnu]
次いで、gemを使ってtdをインストールします。tdのバージョン確認まで出来ればOKです。
$ gem install td
Fetching: msgpack-0.5.9.gem (100%)
Building native extensions. This could take a while...
Successfully installed msgpack-0.5.9
:
(中略)
:
Installing ri documentation for yajl-ruby-1.2.1
Done installing documentation for fluent-logger, hirb, httpclient, msgpack, parallel, rubyzip, td, td-client, td-logger, yajl-ruby after 6 seconds
10 gems installed
$
$ td --version
0.11.5
Treasure Data サンプルデータベース環境を用意する
tdコマンドを利用するには、アカウント情報を設定しておく必要があります。td accountでメールアドレス、パスワードを設定しておきます。
$ td account
Enter your Treasure Data credentials.
Email: **********@***************
Password (typing will be hidden): XXXXXXXXXXX
Authenticated successfully.
Use 'td db:create <db_name>' to create a database.
$
データベース及びテーブルの作成。この辺りは公式ドキュメントが用意されていますので参考にしつつ、sample_dbデータベースに、www_accessテーブル(こちらはサンプルで用意されているテーブルとなります)を構築します。
## データベース作成
$ td db:create sample_db
Database 'sample_db' is created.
Use 'td table:create sample_db <table_name>' to create a table.
## テーブルの作成及びデータ投入
$ td sample:apache sample_apache.json
Created sample_apache.json with 5000 records whose time is in the [2014-10-30 06:47:53 +0000, 2014-10-31 00:40:23 +0000] range.
Use 'td table:import <db> <table> --json sample_apache.json' to import this file.
$ td table:import sample_db www_access --auto-create-table -f json sample_apache.json
Table 'sample_db.www_access' is created.
importing sample_apache.json...
uploading 117364 bytes...
imported 5000 entries from sample_apache.json.
done.
$
試しにクエリを実行してみましょう。ちゃんと結果が返って来ました!
$ td query -w -d sample_db \
> "SELECT code, COUNT(1) AS cnt FROM www_access GROUP BY code"
Job 16562958 is queued.
Use 'td job:show 16562958' to show the status.
queued...
started at 2014-10-31T00:42:14Z
Hive history file=/mnt/hive/tmp/5315/hive_job_log__1070356623.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
Starting Job = job_201410210918_296428, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_296428
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201410210918_296428
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-10-31 00:42:43,153 Stage-1 map = 0%, reduce = 0%
2014-10-31 00:43:01,967 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.41 sec
2014-10-31 00:43:02,990 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.41 sec
:
(中略)
:
2014-10-31 00:43:10,158 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.35 sec
2014-10-31 00:43:11,172 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.35 sec
2014-10-31 00:43:12,190 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.35 sec
finished at 2014-10-31T00:43:16Z
2014-10-31 00:43:13,334 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.35 sec
MapReduce Total cumulative CPU time: 7 seconds 350 msec
Ended Job = job_201410210918_296428
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.35 sec HDFS Read: 412 HDFS Write: 145 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 350 msec
OK
MapReduce time taken: 52.458 seconds
Fetching results...
Total CPU Time: 7350
Time taken: 52.555 seconds
Status : success
Result :
+------+------+ B / 100.0%
| code | cnt |
+------+------+
| 200 | 4981 |
| 404 | 17 |
| 500 | 2 |
+------+------+
3 rows in set
$
Redshiftクラスタ環境の構築
連携元となるTreasure Dataの環境はこれでひとまず用意出来ました。次いで、Amazon Redshiftの環境構築に取り掛かりたいと思います。
と、その前に、連携元テーブルの『www_access』について、その内容を確認しておく必要があります。td:tableコマンドで構成要素を確認します。
$ td table:show sample_db www_access
Name : sample_db.www_access
Type : log
Count : 5000
Schema : (
host:string
path:string
method:string
referer:string
code:long
agent:string
user:string
size:long
)
$
上記コマンドだけでは文字列型項目の項目長が分かりませんので、現行登録されている5000件の中でそれぞれの最大桁数を求めます。
$ td query -w -d sample_db \
> "SELECT \
> LENGTH(MAX(host)), \
> LENGTH(MAX(path)), \
> LENGTH(MAX(method)), \
> LENGTH(MAX(referer)), \
> LENGTH(MAX(agent)), \
> LENGTH(MAX(user)) \
> FROM \
> www_access"
Job 16564288 is queued.
Use 'td job:show 16564288' to show the status.
queued...
started at 2014-10-31T01:26:34Z
Hive history file=/mnt/hive/tmp/5315/hive_job_log__1683480054.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
Starting Job = job_201410210918_297193, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_297193
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201410210918_297193
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-10-31 01:27:05,259 Stage-1 map = 0%, reduce = 0%
2014-10-31 01:27:25,172 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.67 sec
2014-10-31 01:27:26,672 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.67 sec
:
(中略)
:
2014-10-31 01:30:59,343 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 27.39 sec
2014-10-31 01:31:00,353 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 27.39 sec
2014-10-31 01:31:01,412 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 27.39 sec
finished at 2014-10-31T01:31:06Z
2014-10-31 01:31:02,762 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 27.39 sec
MapReduce Total cumulative CPU time: 27 seconds 390 msec
Ended Job = job_201410210918_297193
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 27.39 sec HDFS Read: 412 HDFS Write: 117 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 390 msec
OK
MapReduce time taken: 261.67 seconds
Fetching results...
Total CPU Time: 27390
Time taken: 262.34 seconds
Status : success
Result :
+-----+-----+-----+-----+-----+-----+
| _c0 | _c1 | _c2 | _c3 | _c4 | _c5 |
+-----+-----+-----+-----+-----+-----+
| 11 | 27 | 4 | 176 | 136 | 1 |
+-----+-----+-----+-----+-----+-----+
1 row in set
$
Redshiftクラスタにログインし、上記桁数情報を踏まえたテーブルを作成します。
# CREATE TABLE public.www_access (
# access_time DATE encode delta NOT NULL,
# access_host VARCHAR(15) encode lzo NOT NULL,
# access_path VARCHAR(40) encode lzo NOT NULL,
# method CHAR(5) encode bytedict NOT NULL,
# referer VARCHAR(200) encode lzo NOT NULL,
# status_code SMALLINT encode bytedict NOT NULL,
# access_agent VARCHAR(150) encode lzo NOT NULL,
# access_user VARCHAR(3) encode lzo NOT NULL,
# access_size SMALLINT encode mostly8 NOT NULL,
# PRIMARY KEY(access_host)
# )
# distkey(access_host)
# sortkey(time);
CREATE TABLE
#
# SELECT * FROM pg_table_def WHERE schemaname = 'public' AND tablename = 'www_access';
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+------------+--------------+------------------------+----------+---------+---------+---------
public | www_access | access_time | date | delta | f | 1 | t
public | www_access | access_host | character varying(15) | lzo | t | 0 | t
public | www_access | access_path | character varying(40) | lzo | f | 0 | t
public | www_access | method | character(5) | bytedict | f | 0 | t
public | www_access | referer | character varying(200) | lzo | f | 0 | t
public | www_access | status_code | smallint | bytedict | f | 0 | t
public | www_access | access_agent | character varying(150) | lzo | f | 0 | t
public | www_access | access_user | character varying(3) | lzo | f | 0 | t
public | www_access | access_size | smallint | mostly8 | f | 0 | t
(9 行)
#
上記Redshiftへのクエリと同様の内容でTDでも、(Amazon Redshiftに投入するのに使用する)クエリを実行してみます。項目名を変更しているのはRedshift側で予約語として使われていた項目名(user)があったので、その辺りも踏まえています。
$ td query -w -d sample_db \
> "SELECT \
> TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST') AS access_time, \
> host AS access_host, \
> path AS access_path, \
> method, \
> referer, \
> code AS status_code, \
> agent AS access_agent, \
> user AS access_user, \
> size AS access_size \
> FROM \
> www_access \
> LIMIT 10;"
Job 16568656 is queued.
Use 'td job:show 16568656' to show the status.
queued...
started at 2014-10-31T04:04:09Z
Hive history file=/mnt/hive/tmp/5315/hive_job_log__595277233.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
Starting Job = job_201410210918_300198, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_300198
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201410210918_300198
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-10-31 04:04:38,627 Stage-1 map = 0%, reduce = 0%
2014-10-31 04:05:12,535 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.04 sec
2014-10-31 04:05:13,573 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.04 sec
:
(中略)
:
2014-10-31 04:05:25,644 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.04 sec
2014-10-31 04:05:26,672 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.04 sec
2014-10-31 04:05:27,696 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.04 sec
MapReduce Total cumulative CPU time: 5 seconds 40 msec
Ended Job = job_201410210918_300198
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 5.04 sec HDFS Read: 412 HDFS Write: 2028 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 40 msec
OK
MapReduce time taken: 70.659 seconds
Fetching results...
Total CPU Time: 5040
Time taken: 70.755 seconds
finished at 2014-10-31T04:05:29Z
Status : success
Result :
+---------------------+-----------------+------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| access_time | access_host | access_path | method | referer | status_code | access_agent | access_user | access_size |
+---------------------+-----------------+------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| 2014-10-31 09:40:23 | 128.216.140.97 | /item/sports/2511 | GET | http://www.google.com/search?ie=UTF-8&q=google&sclient=psy-ab&q=Sports+Electronics&oq=Sports+Electronics&aq=f&aqi=g-vL1&aql=&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&biw=3994&bih=421 | 200 | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7 | - | 95 |
| 2014-10-31 09:40:22 | 224.225.147.72 | /category/electronics | GET | - | 200 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) | - | 43 |
| 2014-10-31 09:40:20 | 172.75.186.56 | /category/jewelry | GET | - | 200 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | - | 79 |
| 2014-10-31 09:40:18 | 196.183.34.70 | /search/?c=Electronics+Games | POST | - | 200 | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | - | 86 |
| 2014-10-31 09:40:16 | 116.198.217.173 | /category/electronics | GET | /item/games/4012 | 200 | Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | - | 94 |
| 2014-10-31 09:40:14 | 100.126.194.214 | /category/games | GET | /category/computers | 200 | Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3 | - | 58 |
| 2014-10-31 09:40:12 | 216.51.135.153 | /item/games/4772 | GET | /item/electronics/4627 | 200 | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | - | 131 |
| 2014-10-31 09:40:10 | 40.81.151.94 | /item/electronics/3883 | GET | /item/giftcards/855 | 200 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0) | - | 127 |
| 2014-10-31 09:40:08 | 76.75.28.148 | /category/cameras | GET | - | 200 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) | - | 133 |
| 2014-10-31 09:40:05 | 108.78.162.171 | /category/computers | GET | /category/games | 200 | Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11 | - | 57 |
+---------------------+-----------------+------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
10 rows in set
$
Treasure DataからAmazon Redshiftへのデータ投入
さぁ、いよいよ本題です。Redshiftへのアクセスに関する情報は下記ページにまとまっていますので、必要に応じた設定値で実行文を組み立てます。
以下が実行ログです。下記情報で処理を行ったという想定です。(※1行目の設定は内容を置き換えていますが、以降の処理は実際の内容となっています)45行目まででデータ取得を行い、46行目以降でRedshiftへのCOPY処理を行っているのがログからも読み取れると思います。
- ユーザー名:username
- パスワード:PassWord123
- ホスト名(エンドポイント):redshift-sample.xxxxxxxx.us-east-1.redshift.amazonaws.com
- ポート番号:5439
- データベース名:testdb
- テーブル名:www_access
- スキーマ名:public
$ td query -w -d sample_db \
> --result 'redshift://username:PassWord123@redshift-sample.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/testdb/www_access?schema=public' \
> "SELECT \
> TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'JST') AS access_time, \
> host AS access_host, \
> path AS access_path, \
> method, \
> referer, \
> code AS status_code, \
> agent AS access_agent, \
> user AS access_user, \
> size AS access_size \
> FROM \
> www_access"
Job 16569297 is queued.
Use 'td job:show 16569297' to show the status.
queued...
started at 2014-10-31T04:23:43Z
Hive history file=/mnt/hive/tmp/5315/hive_job_log__1638434606.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
**
** WARNING: time index filtering is not set!
** This query could be very slow as a result.
** Please see http://docs.treasure-data.com/articles/performance-tuning#leveraging-time-based-partitioning
**
Starting Job = job_201410210918_300609, Tracking URL = http://ip-10-28-65-146.ec2.internal:50030/jobdetails.jsp?jobid=job_201410210918_300609
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201410210918_300609
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-10-31 04:24:10,023 Stage-1 map = 0%, reduce = 0%
2014-10-31 04:24:22,401 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.19 sec
2014-10-31 04:24:23,421 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.19 sec
2014-10-31 04:24:24,431 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.19 sec
:
(中略)
:
2014-10-31 04:24:40,856 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.19 sec
2014-10-31 04:24:41,891 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.19 sec
MapReduce Total cumulative CPU time: 6 seconds 190 msec
Ended Job = job_201410210918_300609
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 6.19 sec HDFS Read: 412 HDFS Write: 1041176 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 190 msec
OK
MapReduce time taken: 52.557 seconds
Fetching results...
Total CPU Time: 6190
Time taken: 52.964 seconds
finished at 2014-10-31T04:24:44Z
14/10/31 04:24:49 INFO result.AbstractJDBCResult: Mode: in-place append
14/10/31 04:24:49 INFO result.RedshiftResult: Method: copy
14/10/31 04:24:49 INFO result.RedshiftResult: S3 region: us-east-1
14/10/31 04:24:49 INFO result.ResultWorker: Data size: 109,514 bytes (compressed)
14/10/31 04:24:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/10/31 04:24:49 INFO compress.CodecPool: Got brand-new decompressor [.gz]
14/10/31 04:24:49 INFO result.AbstractJDBCResult: Getting list of tables...
14/10/31 04:24:49 INFO result.AbstractJDBCResult: Table "www_access" exists. Using its schema to store query results.
14/10/31 04:24:49 INFO result.RedshiftResult: Copy SQL: COPY "www_access" ("access_time", "access_host", "access_path", "method", "referer", "status_code", "access_agent", "access_user", "access_size") FROM ? GZIP DELIMITER '\t' NULL '\N' ESCAPE TRUNCATECOLUMNS ACCEPTINVCHARS STATUPDATE OFF COMPUPDATE OFF
14/10/31 04:24:50 INFO jdbc.RedshiftCopyBatchInsert: Uploading file id 7ab95991-507e-4d3f-ab95-28197436478f to S3 (90754 bytes)
14/10/31 04:24:50 INFO jdbc.RedshiftCopyBatchInsert: Running COPY from file id 7ab95991-507e-4d3f-ab95-28197436478f
14/10/31 04:24:54 INFO jdbc.RedshiftCopyBatchInsert: Loaded 1 files.
Status : success
Result :
+---------------------+-----------------+------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| access_time | access_host | access_path | method | referer | status_code | access_agent | access_user | access_size |
+---------------------+-----------------+------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
| 2014-10-31 09:40:23 | 128.216.140.97 | /item/sports/2511 | GET | http://www.google.com/search?ie=UTF-8&q=google&sclient=psy-ab&q=Sports+Electronics&oq=Sports+Electronics&aq=f&aqi=g-vL1&aql=&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&biw=3994&bih=421 | 200 | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7 | - | 95 |
| 2014-10-31 09:40:22 | 224.225.147.72 | /category/electronics | GET | - | 200 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) | - | 43 |
| 2014-10-31 09:40:20 | 172.75.186.56 | /category/jewelry | GET | - | 200 | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | - | 79 |
| 2014-10-31 09:40:18 | 196.183.34.70 | /search/?c=Electronics+Games | POST | - | 200 | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | - | 86 |
| 2014-10-31 09:40:16 | 116.198.217.173 | /category/electronics | GET | /item/games/4012 | 200 | Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1 | - | 94 |
:
(中略)
:
| 2014-10-30 15:48:09 | 32.219.129.36 | /item/sports/1511 | GET | /item/games/103 | 200 | Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; .NET4.0C) | - | 115 |
| 2014-10-30 15:47:53 | 200.129.205.208 | /category/electronics | GET | - | 200 | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11 | - | 62 |
+---------------------+-----------------+------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------+
5000 rows in set
$
Redshiftに接続して中身を確認してみます。内容も問題無さそうですね。
# SELECT COUNT(*) FROM public.www_access;
count
-------
5000
(1 行)
# SELECT
# access_time,
# access_host,
# TRIM(access_path) AS access_path,
# method,
# SUBSTRING(TRIM(referer),0,20) AS referer,
# status_code,
# access_user,
# access_size
# FROM
# www_access
# LIMIT 10;
access_time | access_host | access_path | method | referer | status_code | access_user | access_size
-------------+-----------------+-----------------------+--------+---------------------+-------------+-------------+-------------
2014-10-30 | 96.54.24.116 | /category/office | GET | /item/software/494 | 200 | - | 130
2014-10-30 | 104.192.45.190 | /category/books | GET | /item/books/2860 | 200 | - | 90
2014-10-30 | 108.78.209.95 | /category/software | GET | /category/networkin | 200 | - | 48
2014-10-30 | 220.183.213.181 | /item/software/2553 | GET | /item/books/383 | 200 | - | 96
2014-10-30 | 44.135.67.116 | /item/health/1189 | GET | /search/?c=Electron | 200 | - | 57
2014-10-30 | 132.54.226.209 | /item/sports/2198 | GET | /category/sports | 200 | - | 116
2014-10-30 | 208.18.229.78 | /category/electronics | GET | /category/electroni | 200 | - | 63
2014-10-30 | 92.63.34.188 | /item/office/4429 | GET | /item/games/2637 | 200 | - | 72
2014-10-30 | 80.132.220.20 | /category/books | GET | /item/software/2593 | 200 | - | 133
2014-10-30 | 116.105.196.64 | /item/software/2872 | GET | - | 200 | - | 62
(10 行)
#
まとめ
以上、Treasure DataとAmazon Redshiftのデータ連携に関するご紹介でした。Treasure Dataからはこの他にも様々なリソースへの出力に対応しています。その辺りについても、機会を見てご紹介して行ければと思います。こちらからは以上です。