BigQuery の正規表現関数の使いどころを自分なりに整理してみた
こんにちは!エノカワです。
BigQueryにはREGEXP_
で始まる正規表現関数が用意されていますが、「こういうことをしたい場合、どの関数を使えば良いんだっけ?」と迷った経験はありませんか?
(私だけ?)
正規表現関数の使いどころ
ということで、正規表現関数の使いどころを自分なりに整理してみました。
関数 | 使いどころ | 戻り値の型 |
---|---|---|
REGEXP_CONTAINS | 文字列が正規表現と一致するかを判定したい | BOOL |
REGEXP_EXTRACT | 正規表現と一致する最初の部分文字列を抽出したい | STRING または BYTES |
REGEXP_EXTRACT_ALL | 正規表現と一致するすべての部分文字列を抽出したい | ARRAY |
REGEXP_INSTR | 正規表現と一致する部分が初めて出現する位置を取得したい | INT64 |
REGEXP_REPLACE | 文字列内の正規表現と一致する部分を置き換えたい | STRING または BYTES |
REGEXP_SUBSTR (REGEXP_EXTRACTと同じ) |
正規表現と一致する最初の部分文字列を抽出したい | STRING または BYTES |
使用例
以下、パブリックデータセットbigquery-public-data.samples.github_timeline
を使った正規表現関数の使用例です。
REGEXP_CONTAINS
リポジトリのURLにgoogle
を含むレコードのみを取得します。
SELECT repository_url FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') LIMIT 10
| repository_url | | ------------------------------------------------------------ | | https://github.com/sunnysideup/silverstripe-google-analytics | | https://github.com/piersharding/moodle-google | | https://github.com/wuffers/getoffgoogle | | https://github.com/Granicus/google-api-ruby-client | | https://github.com/joelcuevas/googlestaticmap | | https://github.com/raphink/puppet-googleauthenticator | | https://github.com/terut/google-gwt-sample | | https://github.com/carswellgould/fuel-google-oauth2 | | https://github.com/oughton/google-code-jam-2012 | | https://github.com/brunonery/google-mock |
REGEXP_EXTRACT
リポジトリのURLからgoogle
に続く文字列を抽出します。
SELECT repository_url, REGEXP_EXTRACT(repository_url, r'google(.*)') AS url_extract FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') LIMIT 10
| repository_url | url_extract | | --------------------------------------------------------- | --------------------- | | https://github.com/gak/pygooglechart | chart | | https://github.com/Yesware/omniauth-google | | | https://github.com/wuffers/getoffgoogle | | | https://github.com/wuffers/getoffgoogle | | | https://github.com/brookinsconsulting/all2egooglesitemaps | sitemaps | | https://github.com/binzume/simple-googlespreadsheet-ruby | spreadsheet-ruby | | https://github.com/pmgodin/google-API-helpers | -API-helpers | | https://github.com/tzafrir/google-plus-extension-jsapi | -plus-extension-jsapi | | https://github.com/gregorynicholas/googlephonefix | phonefix | | https://github.com/google/signet | /signet |
REGEXP_EXTRACT_ALL
リポジトリのURLを/
で分割した文字列を抽出します。
SELECT repository_url, REGEXP_EXTRACT_ALL(repository_url, r'/([^/]+)') AS url_extract_all FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') LIMIT 10
| repository_url | url_extract_all | | ------------------------------------------------------------- | --------------------------------------------------------- | | https://github.com/gimite/google-spreadsheet-ruby | "[github.com,gimite,google-spreadsheet-ruby]" | | https://github.com/vorbeth/android_vendor_google | "[github.com,vorbeth,android_vendor_google]" | | https://github.com/jcline/fuse-google-drive | "[github.com,jcline,fuse-google-drive]" | | https://github.com/arthur-debert/google-code-issues-migrator | "[github.com,arthur-debert,google-code-issues-migrator]" | | https://github.com/ubilabs/google.maps.polyline.edit | "[github.com,ubilabs,google.maps.polyline.edit]" | | https://github.com/1Marc/jquery-ui-google-maps | "[github.com,1Marc,jquery-ui-google-maps]" | | https://github.com/r0man/google-maps-clj | "[github.com,r0man,google-maps-clj]" | | https://github.com/wuffers/getoffgoogle | "[github.com,wuffers,getoffgoogle]" | | https://github.com/mohamedmansour/google-plus-extension-jsapi | "[github.com,mohamedmansour,google-plus-extension-jsapi]" | | https://github.com/jarib/google-closure-library | "[github.com,jarib,google-closure-library]" |
REGEXP_INSTR
リポジトリのURLにgoogle
初めて出現する位置を取得します。
SELECT repository_url, REGEXP_INSTR(repository_url, r'google') AS url_instr FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') LIMIT 10
| repository_url | url_instr | | ------------------------------------------------------------- | --------- | | https://github.com/gimite/google-spreadsheet-ruby | 27 | | https://github.com/vorbeth/android_vendor_google | 43 | | https://github.com/jcline/fuse-google-drive | 32 | | https://github.com/arthur-debert/google-code-issues-migrator | 34 | | https://github.com/ubilabs/google.maps.polyline.edit | 28 | | https://github.com/1Marc/jquery-ui-google-maps | 36 | | https://github.com/r0man/google-maps-clj | 26 | | https://github.com/wuffers/getoffgoogle | 34 | | https://github.com/mohamedmansour/google-plus-extension-jsapi | 35 | | https://github.com/jarib/google-closure-library | 26 |
REGEXP_REPLACE
リポジトリのURLのgoogle
をGOOGLE
に置き換えます。
SELECT repository_url, REGEXP_REPLACE(repository_url, r'google', 'GOOGLE') AS url_replace FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') LIMIT 10
| repository_url | url_replace | | ------------------------------------------------------- | ------------------------------------------------------- | | https://github.com/googleapi/googleapi | https://github.com/GOOGLEapi/GOOGLEapi | | https://github.com/zquestz/omniauth-google-oauth2 | https://github.com/zquestz/omniauth-GOOGLE-oauth2 | | https://github.com/zquestz/omniauth-google-oauth2 | https://github.com/zquestz/omniauth-GOOGLE-oauth2 | | https://github.com/DAddYE/google_tasks | https://github.com/DAddYE/GOOGLE_tasks | | https://github.com/yoavaviram/python-google-spreadsheet | https://github.com/yoavaviram/python-GOOGLE-spreadsheet | | https://github.com/waj/redmine_google_apps | https://github.com/waj/redmine_GOOGLE_apps | | https://github.com/mtoader/google-go-lang-idea-plugin | https://github.com/mtoader/GOOGLE-go-lang-idea-plugin | | https://github.com/brentsowers1/googlestaticmap | https://github.com/brentsowers1/GOOGLEstaticmap | | https://github.com/jarib/google-closure-library | https://github.com/jarib/GOOGLE-closure-library | | https://github.com/jarib/google-closure-library | https://github.com/jarib/GOOGLE-closure-library |
REGEXP_SUBSTR
リポジトリのURLからgoogle
に続く文字列を抽出します。
(REGEXP_EXTRACT と同じ)
SELECT repository_url, REGEXP_SUBSTR(repository_url, r'google(.*)') AS url_substr FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') ORDER BY repository_url LIMIT 10
| repository_url | url_substr | | ----------------------------------------------------- | ---------- | | https://github.com/136629170/google-maps-api | -maps-api | | https://github.com/1Marc/jquery-ui-google-maps | -maps | | https://github.com/1Marc/jquery-ui-google-maps | -maps | | https://github.com/1Marc/jquery-ui-google-maps | -maps | | https://github.com/1Marc/tutorials-jquery-google-maps | -maps | | https://github.com/9flats/google-translate | -translate | | https://github.com/9re/android-support-v4-googlemaps | maps | | https://github.com/9re/android-support-v4-googlemaps | maps | | https://github.com/9re/android-support-v4-googlemaps | maps | | https://github.com/9re/android-support-v4-googlemaps | maps |
上記クエリの REGEXP_SUBSTR を REGEXP_EXTRACT を置き換えても同じ結果となります。
SELECT repository_url, REGEXP_EXTRACT(repository_url, r'google(.*)') AS url_extract FROM `bigquery-public-data.samples.github_timeline` WHERE REGEXP_CONTAINS(repository_url, r'google') ORDER BY repository_url LIMIT 10
| repository_url | url_extract | | ----------------------------------------------------- | ----------- | | https://github.com/136629170/google-maps-api | -maps-api | | https://github.com/1Marc/jquery-ui-google-maps | -maps | | https://github.com/1Marc/jquery-ui-google-maps | -maps | | https://github.com/1Marc/jquery-ui-google-maps | -maps | | https://github.com/1Marc/tutorials-jquery-google-maps | -maps | | https://github.com/9flats/google-translate | -translate | | https://github.com/9re/android-support-v4-googlemaps | maps | | https://github.com/9re/android-support-v4-googlemaps | maps | | https://github.com/9re/android-support-v4-googlemaps | maps | | https://github.com/9re/android-support-v4-googlemaps | maps |
まとめ
以上、BigQuery の正規表現関数の使いどころを自分なりに整理してみました。
自分自身への備忘録的な記事になってしまいましたが、他の方にも参考になれば幸いです!