Input DataツールにいろいろなパターンのCSVファイルを読ませてみる#alteryx #24 | Alteryx Advent Calendar 2016

alteryx

こんにちは。たごまさゆきです。

当エントリは『Alteyx Advent Calendar 2016』の24日目のエントリです。

今回は、Input DataツールにいろいろなパターンのCSVファイルを読み込ませて、その結果を確認します。

背景

データの受け渡しによく使われるデータ形式として代表的なものの一つにCSV形式が挙げられます。
このCSV形式ですが、RFCで仕様が成文化されているものの、値に含まれる区切り文字や囲み文字の扱いなどの解釈がツールごとに違っていたりして、なかなか一筋縄ではいかない場合があります。

そこでAlteryxのInput DataツールでいろいろなパターンのCSV形式ファイルを読み込ませてみて、期待通りの読み込み結果が得られるかを確認してみたいと思います。

確認方法

実行環境

今回の実行環境は以下のとおりです。

  • MacBook Pro
  • VMwareで動作しているWindows 7 Enterprise (ホストOSは、MacOS X El Capitan)
  • Alteryx Designer x64 Ver.10.6.8

ワークフロー

ワークフローは以下のとおりです。データを読んで、中身を確認するだけです。

workflow

Input DataツールのOption設定は、以下の通りです。

InputData-Options

読み込みデータ

サンプルで付属する「PetStoresDATE.csv」を加工して、以下のパターンを作成しました。
なお1行目はヘッダになっています。

# パターン
1 値囲み文字なしのCSV (サンプルデータそのまま)
2 値囲み文字あり(")のCSV
3 値囲み文字あり(')、値囲み文字(')を含む値ありのCSV
4 値囲み文字あり(")、値区切り文字(,)を含む値ありのCSV
5 値囲み文字あり(")、値囲み文字(")と区切り文字(,)が連続した値ありのCSV
6 値囲み文字あり(")、ただし一部の値はなしのCSV
7 値囲み文字あり(")、改行(CRLF)を含む値ありのCSV

読み込んでみる

では用意したCSVファイルを順番に読み込んでみます。

パターン1

まずは値囲み文字なしのCSV、サンプルデータそのままです。

NAME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
All Creatures Great and Small,1/1/2005,(307) 687-0354,900 Camel Dr, Gillette,WY,82716,,900 CAMEL DR,GILLETTE WY 82716,5.6005E+14,GILLETTE,44.28167,-105.492272,1000,900,,,,,A,,WY,56,CAMEL,,DR,,,,82716
Animal House,3/27/2002,(307) 358-2022,123 S 2nd St, Douglas,WY,82633,,123 S 2ND ST,DOUGLAS WY 82633,5.601E+14,DOUGLAS,42.759113,-105.385263,1000,123,,,S,,A,,WY,56,2ND,,ST,,,,82633
Aquatics and Critters,10/14/2003,(307) 235-1890,601 SE Wyoming Blvd Unit 252, Casper,WY,82609,,601 SE WYOMING BLVD UNIT 252,CASPER WY 82601,5.6025E+14,CASPER,42.845431,-106.26731,996,601,,,SE,,A,8,WY,56,WYOMING,,BLVD,,UNIT,252,82601
Barn,7/17/1975,(307) 587-9515,2625 Big Horn Ave, Cody,WY,82414,,2625 BIG HORN AVE,CODY WY 82414,5.603E+14,CODY,44.533497,-109.03955,1000,2625,,,,,A,,WY,56,BIG HORN,,AVE,,,,82414
Callia's Pet Shop,5/18/1995,(307) 324-5300,210 5th St, Rawlins,WY,82301,,210 5TH ST,RAWLINS WY 82301,5.6008E+14,RAWLINS,41.787613,-107.240378,1000,210,,,,,A,,WY,56,5TH,,ST,,,,82301
(以下略)

早速読み込んでみます。

Pattern01-result

サンプルとして付属しているデータなので、さすがに何事もなく読み込みできました。

パターン2

値を"で囲んだ場合です。

AME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
"All Creatures Great and Small","1/1/2005","(307) 687-0354","900 Camel Dr"," Gillette","WY","82716","","900 CAMEL DR","GILLETTE WY 82716","5.6005E+14","GILLETTE","44.28167","-105.492272","1000","900","","","","","A","","WY","56","CAMEL","","DR","","","","82716"
"Animal House","3/27/2002","(307) 358-2022","123 S 2nd St"," Douglas","WY","82633","","123 S 2ND ST","DOUGLAS WY 82633","5.601E+14","DOUGLAS","42.759113","-105.385263","1000","123","","","S","","A","","WY","56","2ND","","ST","","","","82633"
"Aquatics and Critters","10/14/2003","(307) 235-1890","601 SE Wyoming Blvd Unit 252"," Casper","WY","82609","","601 SE WYOMING BLVD UNIT 252","CASPER WY 82601","5.6025E+14","CASPER","42.845431","-106.26731","996","601","","","SE","","A","8","WY","56","WYOMING","","BLVD","","UNIT","252","82601"
"Barn","7/17/1975","(307) 587-9515","2625 Big Horn Ave"," Cody","WY","82414","","2625 BIG HORN AVE","CODY WY 82414","5.603E+14","CODY","44.533497","-109.03955","1000","2625","","","","","A","","WY","56","BIG HORN","","AVE","","","","82414"
"Callia's Pet Shop","5/18/1995","(307) 324-5300","210 5th St"," Rawlins","WY","82301","","210 5TH ST","RAWLINS WY 82301","5.6008E+14","RAWLINS","41.787613","-107.240378","1000","210","","","","","A","","WY","56","5TH","","ST","","","","82301"
(以下略)

では、読んでみましょう。

Pattern02-result

パターン1と違った点は、STD_ERROR_カラムが[Null]ではなくなっています。空の値を囲み文字なしで読み込むと[Null]として扱われ、囲み文字がある場合、空文字列として扱われるようです。
また、CITYカラムの値は、値の前に半角スペースを1つ付加した状態で値を囲んでいましたが、そちらは半角スペースがついたまま読み込まれました(結果一覧のCITYカラムには、「This cell has leading spaces.」とコメントがついています)。

パターン3

値をで囲んだ場合です。6行目のNAMEカラムの値に、囲み文字を含む値があります。

NAME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
'All Creatures Great and Small','1/1/2005','(307) 687-0354','900 Camel Dr',' Gillette','WY','82716','','900 CAMEL DR','GILLETTE WY 82716','5.6005E+14','GILLETTE','44.28167','-105.492272','1000','900','','','','','A','','WY','56','CAMEL','','DR','','','','82716'
'Animal House','3/27/2002','(307) 358-2022','123 S 2nd St',' Douglas','WY','82633','','123 S 2ND ST','DOUGLAS WY 82633','5.601E+14','DOUGLAS','42.759113','-105.385263','1000','123','','','S','','A','','WY','56','2ND','','ST','','','','82633'
'Aquatics and Critters','10/14/2003','(307) 235-1890','601 SE Wyoming Blvd Unit 252',' Casper','WY','82609','','601 SE WYOMING BLVD UNIT 252','CASPER WY 82601','5.6025E+14','CASPER','42.845431','-106.26731','996','601','','','SE','','A','8','WY','56','WYOMING','','BLVD','','UNIT','252','82601'
'Barn','7/17/1975','(307) 587-9515','2625 Big Horn Ave',' Cody','WY','82414','','2625 BIG HORN AVE','CODY WY 82414','5.603E+14','CODY','44.533497','-109.03955','1000','2625','','','','','A','','WY','56','BIG HORN','','AVE','','','','82414'
'Callia's Pet Shop','5/18/1995','(307) 324-5300','210 5th St',' Rawlins','WY','82301','','210 5TH ST','RAWLINS WY 82301','5.6008E+14','RAWLINS','41.787613','-107.240378','1000','210','','','','','A','','WY','56','5TH','','ST','','','','82301'
(以下略)

今回は、読み込み実行前にInput Dataのオプションを変更します。

InputData-Options-03

では、読み込んでみましょう。

Pattern03-result

読み込みは出来たようですが、ワーニングが出ました。

Pattern03-warning

「エスケープされていない引用符がある」ということですので、値に含まれる'を二重にしてエスケープし、もう一度読み込みます。

'Callia''s Pet Shop','5/18/1995','(307) 324-5300','210 5th St',' Rawlins','WY','82301','','210 5TH ST','RAWLINS WY 82301','5.6008E+14','RAWLINS','41.787613','-107.240378','1000','210','','','','','A','','WY','56','5TH','','ST','','','','82301'
(以下略)

Pattern03-no-warning

今度はワーニングなしで読み込みできました。読み込み結果は変わりませんでした。
ある程度推測でエスケープなしでも読んでくれるようですが、エスケープしておく方が安全でしょう。
なお囲み文字が"の場合も同じ結果となりました。

パターン4

値を"で囲んだ場合で、6行目のNAMEの値に区切り文字,が含まれます。

NAME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
"All Creatures Great and Small","1/1/2005","(307) 687-0354","900 Camel Dr"," Gillette","WY","82716","","900 CAMEL DR","GILLETTE WY 82716","5.6005E+14","GILLETTE","44.28167","-105.492272","1000","900","","","","","A","","WY","56","CAMEL","","DR","","","","82716"
"Animal House","3/27/2002","(307) 358-2022","123 S 2nd St"," Douglas","WY","82633","","123 S 2ND ST","DOUGLAS WY 82633","5.601E+14","DOUGLAS","42.759113","-105.385263","1000","123","","","S","","A","","WY","56","2ND","","ST","","","","82633"
"Aquatics and Critters","10/14/2003","(307) 235-1890","601 SE Wyoming Blvd Unit 252"," Casper","WY","82609","","601 SE WYOMING BLVD UNIT 252","CASPER WY 82601","5.6025E+14","CASPER","42.845431","-106.26731","996","601","","","SE","","A","8","WY","56","WYOMING","","BLVD","","UNIT","252","82601"
"Barn","7/17/1975","(307) 587-9515","2625 Big Horn Ave"," Cody","WY","82414","","2625 BIG HORN AVE","CODY WY 82414","5.603E+14","CODY","44.533497","-109.03955","1000","2625","","","","","A","","WY","56","BIG HORN","","AVE","","","","82414"
"Callia,s Pet Shop","5/18/1995","(307) 324-5300","210 5th St"," Rawlins","WY","82301","","210 5TH ST","RAWLINS WY 82301","5.6008E+14","RAWLINS","41.787613","-107.240378","1000","210","","","","","A","","WY","56","5TH","","ST","","","","82301"
(以下略)

Pattern04-result

問題なく読み込みができました。

パターン5

少し意地悪をしてみます。囲み文字"と区切り文字,が連続して値の中に現れる場合はどうなるでしょうか。

NAME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
"All Creatures Great and Small","1/1/2005","(307) 687-0354","900 Camel Dr"," Gillette","WY","82716","","900 CAMEL DR","GILLETTE WY 82716","5.6005E+14","GILLETTE","44.28167","-105.492272","1000","900","","","","","A","","WY","56","CAMEL","","DR","","","","82716"
"Animal House","3/27/2002","(307) 358-2022","123 S 2nd St"," Douglas","WY","82633","","123 S 2ND ST","DOUGLAS WY 82633","5.601E+14","DOUGLAS","42.759113","-105.385263","1000","123","","","S","","A","","WY","56","2ND","","ST","","","","82633"
"Aquatics and Critters","10/14/2003","(307) 235-1890","601 SE Wyoming Blvd Unit 252"," Casper","WY","82609","","601 SE WYOMING BLVD UNIT 252","CASPER WY 82601","5.6025E+14","CASPER","42.845431","-106.26731","996","601","","","SE","","A","8","WY","56","WYOMING","","BLVD","","UNIT","252","82601"
"Barn","7/17/1975","(307) 587-9515","2625 Big Horn Ave"," Cody","WY","82414","","2625 BIG HORN AVE","CODY WY 82414","5.603E+14","CODY","44.533497","-109.03955","1000","2625","","","","","A","","WY","56","BIG HORN","","AVE","","","","82414"
"Callia",s Pet Shop","5/18/1995","(307) 324-5300","210 5th St"," Rawlins","WY","82301","","210 5TH ST","RAWLINS WY 82301","5.6008E+14","RAWLINS","41.787613","-107.240378","1000","210","","","","","A","","WY","56","5TH","","ST","","","","82301"
(以下略)

読み込みしてみましょう。

Pattern05-error

今回はさすがに読み込みエラーとなりました。
原因は囲み文字"と区切り文字,が連続して値の中に現れる箇所なので、囲み文字を二重化してエスケープします。

"Callia"",s Pet Shop","5/18/1995","(307) 324-5300","210 5th St"," Rawlins","WY","82301","","210 5TH ST","RAWLINS WY 82301","5.6008E+14","RAWLINS","41.787613","-107.240378","1000","210","","","","","A","","WY","56","5TH","","ST","","","","82301"
(以下略)

読み込んでみましょう。

Pattern05-result

今度は読み込みできました。

パターン6

文字列の値は"で囲われているけど数値は囲まれていない、といったケースもよくあると思います。
次はそんなケースを試してみます。
GEO_LATITUGEO_LONGITGEO_MATCH_GEO_NUMBERカラムの値だけ囲み文字なしです。(このパターンは6行目までです)

NAME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
"All Creatures Great and Small","1/1/2005","(307) 687-0354","900 Camel Dr"," Gillette","WY","82716","","900 CAMEL DR","GILLETTE WY 82716","5.6005E+14","GILLETTE",44.28167,-105.492272,1000,900,"","","","","A","","WY","56","CAMEL","","DR","","","","82716"
"Animal House","3/27/2002","(307) 358-2022","123 S 2nd St"," Douglas","WY","82633","","123 S 2ND ST","DOUGLAS WY 82633","5.601E+14","DOUGLAS",42.759113,-105.385263,1000,123,"","","S","","A","","WY","56","2ND","","ST","","","","82633"
"Aquatics and Critters","10/14/2003","(307) 235-1890","601 SE Wyoming Blvd Unit 252"," Casper","WY","82609","","601 SE WYOMING BLVD UNIT 252","CASPER WY 82601","5.6025E+14","CASPER",42.845431,-106.26731,996,601,"","","SE","","A","8","WY","56","WYOMING","","BLVD","","UNIT","252","82601"
"Barn","7/17/1975","(307) 587-9515","2625 Big Horn Ave"," Cody","WY","82414","","2625 BIG HORN AVE","CODY WY 82414","5.603E+14","CODY",44.533497,-109.03955,1000,2625,"","","","","A","","WY","56","BIG HORN","","AVE","","","","82414"
"Callia's Pet Shop","5/18/1995","(307) 324-5300","210 5th St"," Rawlins","WY","82301","","210 5TH ST","RAWLINS WY 82301","5.6008E+14","RAWLINS",41.787613,-107.240378,1000,210,"","","","","A","","WY","56","5TH","","ST","","","","82301"

読み込みます。

Pattern06-result
Pattern06-result02

特に問題なさそうです。

パターン7

値に改行(CRLF)が含まれるケースです。CSVとしてはこれもアリなんですね。
2行目NAMEの値に改行をつっこんでみました。(All Creatures GreatCRLF and Small)

NAME,DATE_OPENED,PHONE,ADDRESS,CITY,STATE,ZIP,STD_ERROR_,GEO_ADDR_F,GEO_ADDR_L,GEO_CENSUS,GEO_CITY,GEO_LATITU,GEO_LONGIT,GEO_MATCH_,GEO_NUMBER,GEO_POSTDI,GEO_POSTD2,GEO_PREDIR,GEO_PREDI2,GEO_RESULT,GEO_RESUL2,GEO_STATE,GEO_STATE_,GEO_STREET,GEO_STREE2,GEO_STREE3,GEO_STREE4,GEO_UNIT_D,GEO_UNIT_N,GEO_ZIP
"All Creatures Great
 and Small","1/1/2005","(307) 687-0354","900 Camel Dr"," Gillette","WY","82716","","900 CAMEL DR","GILLETTE WY 82716","5.6005E+14","GILLETTE","44.28167","-105.492272","1000","900","","","","","A","","WY","56","CAMEL","","DR","","","","82716"
"Animal House","3/27/2002","(307) 358-2022","123 S 2nd St"," Douglas","WY","82633","","123 S 2ND ST","DOUGLAS WY 82633","5.601E+14","DOUGLAS","42.759113","-105.385263","1000","123","","","S","","A","","WY","56","2ND","","ST","","","","82633"
"Aquatics and Critters","10/14/2003","(307) 235-1890","601 SE Wyoming Blvd Unit 252"," Casper","WY","82609","","601 SE WYOMING BLVD UNIT 252","CASPER WY 82601","5.6025E+14","CASPER","42.845431","-106.26731","996","601","","","SE","","A","8","WY","56","WYOMING","","BLVD","","UNIT","252","82601"
"Barn","7/17/1975","(307) 587-9515","2625 Big Horn Ave"," Cody","WY","82414","","2625 BIG HORN AVE","CODY WY 82414","5.603E+14","CODY","44.533497","-109.03955","1000","2625","","","","","A","","WY","56","BIG HORN","","AVE","","","","82414"
"Callia's Pet Shop","5/18/1995","(307) 324-5300","210 5th St"," Rawlins","WY","82301","","210 5TH ST","RAWLINS WY 82301","5.6008E+14","RAWLINS","41.787613","-107.240378","1000","210","","","","","A","","WY","56","5TH","","ST","","","","82301"
(以下略)

では、読み込んでみましょう。

Pattern07-result

あっさり読み込めました。ちゃんと改行を含む値として読めています。
しかし、後続のデータ処理で面倒なことになりそうな気もします。

まとめ

いろいろ読み込んでみた結果、次のようになりました。

# パターン 読み込み結果
1 値囲み文字なしのCSV (サンプルデータそのまま) OK
2 値囲み文字あり(")のCSV OK (ただし、空の値は空文字になる。[Null]にしたい値は囲み文字なし。)
3 値囲み文字あり(')、値囲み文字(')を含む値ありのCSV OK (値のなかの囲み文字は二重化してエスケープした方が良い)
4 値囲み文字あり(")、値区切り文字(,)を含む値ありのCSV OK
5 値囲み文字あり(")、値囲み文字(")と区切り文字(,)が連続した値ありのCSV OK (ただし、正しく読むためには、値のなかの囲み文字のエスケープが必要)
6 値囲み文字あり(")、ただし一部の値はなしのCSV OK
7 値囲み文字あり(")、改行(CRLF)を含む値ありのCSV OK (ただし後続処理では注意)

おまけ:TSVは?

CSVとならんでTSV(タブ区切り)形式もよく使われますね。
Input DataツールのFile Formatの一覧にTSVは見当たりませんが、どうなんでしょうか。

結論としてはちゃんとサポートしてます。Communityに回答がありました。

最後に

今回は必ずお世話になるInput Dataツールで、いろいろなCSVファイルを読み込んでみました。結果として、ほぼ期待通りに読み込めることはわかりました。
しかし無用の手間を防ぐため、データ連携の相手とは「あなたのCSVと私のCSVは本当に同じ形式かどうか」、事前にしっかり確認しておきましょう。

明日25日目は川崎の『Alteryxでセグメンテーションをしてみる』の予定です。明日もお楽しみに!