11 points Amazon Redshift beginners should be aware of

2014.10.09


I have written down the points worth noting that I experienced in a project using Amazon Redshift. I hope you find them helpful. If you are wondering "What is a data warehouse?", please refer to the following introductory article I wrote before. (Sorry, it is available in Japanese only;;) 気軽に始めてみよう!クラウド時代のデータウェアハウス超入門

List of points

  1. Differences from PostgreSQL
  2. Inserting data one row at a time is too slow
  3. No error is reported even if primary key, unique, or foreign key constraints are violated
  4. The COPY command inserts all records
  5. When you upload CSV files
  6. Supported data types
  7. The character encoding cannot be selected
  8. When you add a user who has read-only privileges
  9. About the time zone
  10. About node types
  11. When the PostgreSQL client times out

Differences from PostgreSQL

Amazon Redshift is based on PostgreSQL. I myself have not used PostgreSQL in production. If you are already used to PostgreSQL, you should check the differences between Amazon Redshift and PostgreSQL in the following document, where you can see the unsupported functions, data types, and so on. Amazon Redshift and PostgreSQL - Amazon Redshift

In addition, Amazon Redshift has concepts that PostgreSQL does not have, such as the sort key, the distribution key, and column compression encodings. It is a good idea to read the following article when you design tables.

Amazon Redshift Best Practices for Designing Tables - Amazon Redshift

The following is a brief summary:

Distribution key
You can define only one distribution key per table. It cannot be altered after the table is created.
Sort key
You can define a multi-column sort key with up to 400 SORTKEY columns per table. It cannot be altered after the table is created.
Column compression type
It cannot be altered after the table is created. If no compression encodings are defined and the table has no data, they can be set automatically by running the COPY command with the COMPUPDATE option set to ON.
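As a reference, here is a minimal sketch of a table definition that fixes all three at creation time (the table name, column names, and chosen encodings are hypothetical):

-- Distribution key, sort key and column compression encodings are all
-- fixed when the table is created and cannot be altered afterwards.
CREATE TABLE sales (
    sale_id     BIGINT        NOT NULL,
    customer_id INTEGER       NOT NULL ENCODE delta,
    sale_date   DATE          NOT NULL,
    amount      DECIMAL(12,2)          ENCODE mostly16
)
DISTKEY (customer_id)
SORTKEY (sale_date, customer_id);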

Inserting data one row at a time is too slow

ETL tools are often used to load data from one database into another. They are very convenient, but Amazon Redshift is bad at inserting records one by one. If you want to load data faster, you should use the COPY command.

INSERT - Amazon Redshift

However, there is no problem with using INSERT in the form of an INSERT ... SELECT statement.
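For reference, here is a minimal sketch (the table names, S3 path, and credentials are hypothetical placeholders):

-- Bulk-load from Amazon S3 with COPY instead of issuing single-row INSERTs.
COPY sales
FROM 's3://my-bucket/data/sales.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
CSV;

-- INSERT ... SELECT is fine because it runs as a single set-based statement.
INSERT INTO sales_summary
SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date;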

No error is reported even if primary key, unique, or foreign key constraints are violated

You can define these constraints in the CREATE TABLE statement, but data that violates them is inserted without any error. An error is reported only when a NOT NULL constraint is violated.

Define Primary Key and Foreign Key Constraints - Amazon Redshift

The COPY command inserts all records

When you use the COPY command to load data, the data is loaded even if records with the same primary key already exist in the target table. As mentioned above, no error is reported when a primary key constraint is violated. If records with the same primary key may already exist in the target table, you can use a staging table as described in the following document:

Use a Staging Table to Perform a Merge (Upsert) - Amazon Redshift
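Here is a minimal sketch of the staging-table merge (upsert) pattern described in the document above (the table names, key column, S3 path, and credentials are hypothetical):

-- Load the new data into a temporary staging table, then replace matching rows.
BEGIN TRANSACTION;

CREATE TEMP TABLE stage (LIKE sales);

COPY stage
FROM 's3://my-bucket/data/sales_delta.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
CSV;

-- Delete the rows that will be replaced, then insert the new versions.
DELETE FROM sales
USING stage
WHERE sales.sale_id = stage.sale_id;

INSERT INTO sales SELECT * FROM stage;

DROP TABLE stage;
END TRANSACTION;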

When you upload CSV files

It is important to note the following cases when you use the COPY command to load CSV files:

  • Data containing commas.
  • Data containing line breaks.
  • Data containing empty character strings.
  • Data containing NULL characters.

If the data contains commas or line breaks, you should use the COPY command with the CSV option, and such fields in the CSV file must be enclosed in double quotations. If the data contains empty character strings, you should decide whether to store them as NULL or as empty strings in the database; if you want to treat empty strings as NULL, add the EMPTYASNULL option to the COPY command. Finally, if the text data contains NULL characters, the load will fail, so you should delete them in advance.
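Here is a minimal sketch (the table name, S3 path, and credentials are hypothetical):

-- CSV handles quoted fields that contain embedded commas and line breaks;
-- EMPTYASNULL loads empty character strings as NULL.
COPY customers
FROM 's3://my-bucket/data/customers.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
CSV
EMPTYASNULL;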

Supported data types

Amazon Redshift supports 11 data types, which is not very many. You can specify the TEXT type or other unsupported data types in a CREATE TABLE statement without getting an error, but such a type is converted to VARCHAR internally. The resulting column size is sometimes not what you intended, so you had better check the actual data type in the pg_table_def table. You will also notice that there are no large object or binary data types.

Data Types - Amazon Redshift
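To check what Redshift actually stored, you can query pg_table_def as in this minimal sketch (the schema and table names are hypothetical; note that pg_table_def only shows tables in schemas on your search_path):

-- Show the stored type and compression encoding of each column.
SET search_path TO my_schema;
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'sales';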

The character encoding cannot be selected

In Amazon Redshift you cannot select any character encoding other than UTF-8. If the character encoding of the file to load is not UTF-8, you must convert it to UTF-8 first. You will notice this when you migrate from another database. If the file contains unsupported multibyte characters, an error occurs by default; you can use the COPY command with the ACCEPTINVCHARS option to replace them with a specified character.
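Building on the COPY shape shown earlier, here is a minimal sketch (the table name, S3 path, and credentials are hypothetical):

-- Replace invalid UTF-8 byte sequences with '?' instead of failing the load.
COPY customers
FROM 's3://my-bucket/data/customers.csv'
CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
CSV
ACCEPTINVCHARS AS '?';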

When you add a user who has read-only privileges

By default, the database has a schema named 'public'; this is inherited from PostgreSQL. A user added with the CREATE USER command has the CREATE and USAGE privileges on that schema by default. So you should create your own schema when you want to create a user who has read-only privileges.

CREATE USER - Amazon Redshift
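Here is a minimal sketch (the schema, table, user name, and password are hypothetical):

-- Keep the tables out of 'public' and grant the user SELECT only.
CREATE SCHEMA reporting;
CREATE TABLE reporting.sales (sale_id BIGINT, amount DECIMAL(12,2));

CREATE USER readonly_user PASSWORD 'Str0ngPassw0rd1';
GRANT USAGE ON SCHEMA reporting TO readonly_user;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO readonly_user;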

About the time zone

You can use the getdate function to get the system time as follows, but the time zone is fixed to UTC, so you cannot get Japan time directly.

select getdate();

If you want Japan time, use the convert_timezone function to convert the time zone as follows:

select convert_timezone('JST', getdate());

If you want only the date part, use the trunc function as follows:

select trunc(convert_timezone('JST', getdate()));

About node types

With Amazon Redshift you cannot choose the storage size of an instance the way you can with Amazon RDS. Instead, the storage size per node is determined by the node type you select, so you adjust the total storage size by changing the number of nodes. The main difference between the dw1 series (dw1.xlarge, dw1.8xlarge) and the dw2 series (dw2.large, dw2.8xlarge) is whether the storage is HDD-based or SSD-based. If you need a large amount of storage, you should select the HDD-based dw1 node types. Also note that an 8xlarge node type cannot run as a single node; you need at least two nodes.

When the PostgreSQL client times out

There are cases where the connection to the Redshift instance times out during heavy processing when you use a client such as SQL Workbench, but the requested query does not stop; you can see that it is still running in the AWS Management Console. In this case you have to cancel the query as follows:

-- Identify the PID
select pid, trim(user_name), starttime, substring(query,1,20) 
from stv_recents
where status='Running';

-- Cancel the process
cancel [PID];

Note that the query continues to run in spite of the time-out. You had better define Amazon CloudWatch alarms for the CPU utilization and the storage usage rate.

Step 6: Cancel a Query - Amazon Redshift

Appendix

The following SQL statements are useful. I also often use the SQL introduced above that identifies the PID and cancels a query.

select * from stl_load_errors; -- when you want to check load errors
select * from pg_tables where schemaname = 'schema name' and tablename = 'table name'; -- when you want to check whether a table exists
select * from pg_table_def where schemaname = 'schema name' and tablename = 'table name'; -- when you want to see the column list of a table

Finally, here are the articles about Amazon Redshift that I refer to often. (Sorry, they are Japanese only;;)

  • Amazon Redshiftで良く使いそうな便利系SQLをまとめてみた
  • Amazon Redshift クエリパフォーマンスチューニング ベストプラクティスを読んでみた
  • Amazon Redshift データ型と列圧縮タイプのまとめ(データ型&列圧縮タイプ対応表付)