こんにちは!エノカワです。
BigQueryにはREGEXP_
で始まる正規表現関数が用意されていますが、「こういうことをしたい場合、どの関数を使えば良いんだっけ?」と迷った経験はありませんか?
(私だけ?)
正規表現関数の使いどころ
ということで、正規表現関数の使いどころを自分なりに整理してみました。
関数 | 使いどころ | 戻り値の型 |
---|---|---|
REGEXP_CONTAINS | 文字列が正規表現と一致するかを判定したい | BOOL |
REGEXP_EXTRACT | 正規表現と一致する最初の部分文字列を抽出したい | STRING または BYTES |
REGEXP_EXTRACT_ALL | 正規表現と一致するすべての部分文字列を抽出したい | ARRAY |
REGEXP_INSTR | 正規表現と一致する部分が初めて出現する位置を取得したい | INT64 |
REGEXP_REPLACE | 文字列内の正規表現と一致する部分を置き換えたい | STRING または BYTES |
REGEXP_SUBSTR (REGEXP_EXTRACTと同じ) |
正規表現と一致する最初の部分文字列を抽出したい | STRING または BYTES |
使用例
以下、パブリックデータセットbigquery-public-data.samples.github_timeline
を使った正規表現関数の使用例です。
REGEXP_CONTAINS
リポジトリのURLにgoogle
を含むレコードのみを取得します。
SELECT
repository_url
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
LIMIT
10
| repository_url |
| ------------------------------------------------------------ |
| https://github.com/sunnysideup/silverstripe-google-analytics |
| https://github.com/piersharding/moodle-google |
| https://github.com/wuffers/getoffgoogle |
| https://github.com/Granicus/google-api-ruby-client |
| https://github.com/joelcuevas/googlestaticmap |
| https://github.com/raphink/puppet-googleauthenticator |
| https://github.com/terut/google-gwt-sample |
| https://github.com/carswellgould/fuel-google-oauth2 |
| https://github.com/oughton/google-code-jam-2012 |
| https://github.com/brunonery/google-mock |
REGEXP_EXTRACT
リポジトリのURLからgoogle
に続く文字列を抽出します。
SELECT
repository_url,
REGEXP_EXTRACT(repository_url, r'google(.*)') AS url_extract
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
LIMIT
10
| repository_url | url_extract |
| --------------------------------------------------------- | --------------------- |
| https://github.com/gak/pygooglechart | chart |
| https://github.com/Yesware/omniauth-google | |
| https://github.com/wuffers/getoffgoogle | |
| https://github.com/wuffers/getoffgoogle | |
| https://github.com/brookinsconsulting/all2egooglesitemaps | sitemaps |
| https://github.com/binzume/simple-googlespreadsheet-ruby | spreadsheet-ruby |
| https://github.com/pmgodin/google-API-helpers | -API-helpers |
| https://github.com/tzafrir/google-plus-extension-jsapi | -plus-extension-jsapi |
| https://github.com/gregorynicholas/googlephonefix | phonefix |
| https://github.com/google/signet | /signet |
REGEXP_EXTRACT_ALL
リポジトリのURLを/
で分割した文字列を抽出します。
SELECT
repository_url,
REGEXP_EXTRACT_ALL(repository_url, r'/([^/]+)') AS url_extract_all
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
LIMIT
10
| repository_url | url_extract_all |
| ------------------------------------------------------------- | --------------------------------------------------------- |
| https://github.com/gimite/google-spreadsheet-ruby | "[github.com,gimite,google-spreadsheet-ruby]" |
| https://github.com/vorbeth/android_vendor_google | "[github.com,vorbeth,android_vendor_google]" |
| https://github.com/jcline/fuse-google-drive | "[github.com,jcline,fuse-google-drive]" |
| https://github.com/arthur-debert/google-code-issues-migrator | "[github.com,arthur-debert,google-code-issues-migrator]" |
| https://github.com/ubilabs/google.maps.polyline.edit | "[github.com,ubilabs,google.maps.polyline.edit]" |
| https://github.com/1Marc/jquery-ui-google-maps | "[github.com,1Marc,jquery-ui-google-maps]" |
| https://github.com/r0man/google-maps-clj | "[github.com,r0man,google-maps-clj]" |
| https://github.com/wuffers/getoffgoogle | "[github.com,wuffers,getoffgoogle]" |
| https://github.com/mohamedmansour/google-plus-extension-jsapi | "[github.com,mohamedmansour,google-plus-extension-jsapi]" |
| https://github.com/jarib/google-closure-library | "[github.com,jarib,google-closure-library]" |
REGEXP_INSTR
リポジトリのURLにgoogle
初めて出現する位置を取得します。
SELECT
repository_url,
REGEXP_INSTR(repository_url, r'google') AS url_instr
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
LIMIT
10
| repository_url | url_instr |
| ------------------------------------------------------------- | --------- |
| https://github.com/gimite/google-spreadsheet-ruby | 27 |
| https://github.com/vorbeth/android_vendor_google | 43 |
| https://github.com/jcline/fuse-google-drive | 32 |
| https://github.com/arthur-debert/google-code-issues-migrator | 34 |
| https://github.com/ubilabs/google.maps.polyline.edit | 28 |
| https://github.com/1Marc/jquery-ui-google-maps | 36 |
| https://github.com/r0man/google-maps-clj | 26 |
| https://github.com/wuffers/getoffgoogle | 34 |
| https://github.com/mohamedmansour/google-plus-extension-jsapi | 35 |
| https://github.com/jarib/google-closure-library | 26 |
REGEXP_REPLACE
リポジトリのURLのgoogle
をGOOGLE
に置き換えます。
SELECT
repository_url,
REGEXP_REPLACE(repository_url, r'google', 'GOOGLE') AS url_replace
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
LIMIT
10
| repository_url | url_replace |
| ------------------------------------------------------- | ------------------------------------------------------- |
| https://github.com/googleapi/googleapi | https://github.com/GOOGLEapi/GOOGLEapi |
| https://github.com/zquestz/omniauth-google-oauth2 | https://github.com/zquestz/omniauth-GOOGLE-oauth2 |
| https://github.com/zquestz/omniauth-google-oauth2 | https://github.com/zquestz/omniauth-GOOGLE-oauth2 |
| https://github.com/DAddYE/google_tasks | https://github.com/DAddYE/GOOGLE_tasks |
| https://github.com/yoavaviram/python-google-spreadsheet | https://github.com/yoavaviram/python-GOOGLE-spreadsheet |
| https://github.com/waj/redmine_google_apps | https://github.com/waj/redmine_GOOGLE_apps |
| https://github.com/mtoader/google-go-lang-idea-plugin | https://github.com/mtoader/GOOGLE-go-lang-idea-plugin |
| https://github.com/brentsowers1/googlestaticmap | https://github.com/brentsowers1/GOOGLEstaticmap |
| https://github.com/jarib/google-closure-library | https://github.com/jarib/GOOGLE-closure-library |
| https://github.com/jarib/google-closure-library | https://github.com/jarib/GOOGLE-closure-library |
REGEXP_SUBSTR
リポジトリのURLからgoogle
に続く文字列を抽出します。
(REGEXP_EXTRACT と同じ)
SELECT
repository_url,
REGEXP_SUBSTR(repository_url, r'google(.*)') AS url_substr
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
ORDER BY
repository_url
LIMIT
10
| repository_url | url_substr |
| ----------------------------------------------------- | ---------- |
| https://github.com/136629170/google-maps-api | -maps-api |
| https://github.com/1Marc/jquery-ui-google-maps | -maps |
| https://github.com/1Marc/jquery-ui-google-maps | -maps |
| https://github.com/1Marc/jquery-ui-google-maps | -maps |
| https://github.com/1Marc/tutorials-jquery-google-maps | -maps |
| https://github.com/9flats/google-translate | -translate |
| https://github.com/9re/android-support-v4-googlemaps | maps |
| https://github.com/9re/android-support-v4-googlemaps | maps |
| https://github.com/9re/android-support-v4-googlemaps | maps |
| https://github.com/9re/android-support-v4-googlemaps | maps |
上記クエリの REGEXP_SUBSTR を REGEXP_EXTRACT を置き換えても同じ結果となります。
SELECT
repository_url,
REGEXP_EXTRACT(repository_url, r'google(.*)') AS url_extract
FROM
`bigquery-public-data.samples.github_timeline`
WHERE
REGEXP_CONTAINS(repository_url, r'google')
ORDER BY
repository_url
LIMIT
10
| repository_url | url_extract |
| ----------------------------------------------------- | ----------- |
| https://github.com/136629170/google-maps-api | -maps-api |
| https://github.com/1Marc/jquery-ui-google-maps | -maps |
| https://github.com/1Marc/jquery-ui-google-maps | -maps |
| https://github.com/1Marc/jquery-ui-google-maps | -maps |
| https://github.com/1Marc/tutorials-jquery-google-maps | -maps |
| https://github.com/9flats/google-translate | -translate |
| https://github.com/9re/android-support-v4-googlemaps | maps |
| https://github.com/9re/android-support-v4-googlemaps | maps |
| https://github.com/9re/android-support-v4-googlemaps | maps |
| https://github.com/9re/android-support-v4-googlemaps | maps |
まとめ
以上、BigQuery の正規表現関数の使いどころを自分なりに整理してみました。
自分自身への備忘録的な記事になってしまいましたが、他の方にも参考になれば幸いです!