BigQuery の正規表現関数の使いどころを自分なりに整理してみた

2023.07.07

こんにちは!エノカワです。

BigQueryにはREGEXP_で始まる正規表現関数が用意されていますが、「こういうことをしたい場合、どの関数を使えば良いんだっけ?」と迷った経験はありませんか?
(私だけ?)

正規表現関数の使いどころ

ということで、正規表現関数の使いどころを自分なりに整理してみました。

関数 使いどころ 戻り値の型
REGEXP_CONTAINS 文字列が正規表現と一致するかを判定したい BOOL
REGEXP_EXTRACT 正規表現と一致する最初の部分文字列を抽出したい STRING または BYTES
REGEXP_EXTRACT_ALL 正規表現と一致するすべての部分文字列を抽出したい ARRAY または ARRAY
REGEXP_INSTR 正規表現と一致する部分が初めて出現する位置を取得したい INT64
REGEXP_REPLACE 文字列内の正規表現と一致する部分を置き換えたい STRING または BYTES
REGEXP_SUBSTR
(REGEXP_EXTRACTと同じ)
正規表現と一致する最初の部分文字列を抽出したい STRING または BYTES

使用例

以下、パブリックデータセットbigquery-public-data.samples.github_timelineを使った正規表現関数の使用例です。

REGEXP_CONTAINS

リポジトリのURLにgoogleを含むレコードのみを取得します。

SELECT
  repository_url
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
LIMIT
  10
| repository_url                                               |
| ------------------------------------------------------------ |
| https://github.com/sunnysideup/silverstripe-google-analytics |
| https://github.com/piersharding/moodle-google                |
| https://github.com/wuffers/getoffgoogle                      |
| https://github.com/Granicus/google-api-ruby-client           |
| https://github.com/joelcuevas/googlestaticmap                |
| https://github.com/raphink/puppet-googleauthenticator        |
| https://github.com/terut/google-gwt-sample                   |
| https://github.com/carswellgould/fuel-google-oauth2          |
| https://github.com/oughton/google-code-jam-2012              |
| https://github.com/brunonery/google-mock                     |

REGEXP_EXTRACT

リポジトリのURLからgoogleに続く文字列を抽出します。

SELECT
  repository_url,
  REGEXP_EXTRACT(repository_url, r'google(.*)') AS url_extract
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
LIMIT
  10
| repository_url                                            | url_extract           |
| --------------------------------------------------------- | --------------------- |
| https://github.com/gak/pygooglechart                      | chart                 |
| https://github.com/Yesware/omniauth-google                |                       |
| https://github.com/wuffers/getoffgoogle                   |                       |
| https://github.com/wuffers/getoffgoogle                   |                       |
| https://github.com/brookinsconsulting/all2egooglesitemaps | sitemaps              |
| https://github.com/binzume/simple-googlespreadsheet-ruby  | spreadsheet-ruby      |
| https://github.com/pmgodin/google-API-helpers             | -API-helpers          |
| https://github.com/tzafrir/google-plus-extension-jsapi    | -plus-extension-jsapi |
| https://github.com/gregorynicholas/googlephonefix         | phonefix              |
| https://github.com/google/signet                          | /signet               |

REGEXP_EXTRACT_ALL

リポジトリのURLを/で分割した文字列を抽出します。

SELECT
  repository_url,
  REGEXP_EXTRACT_ALL(repository_url, r'/([^/]+)') AS url_extract_all
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
LIMIT
  10
| repository_url                                                | url_extract_all                                           |
| ------------------------------------------------------------- | --------------------------------------------------------- |
| https://github.com/gimite/google-spreadsheet-ruby             | "[github.com,gimite,google-spreadsheet-ruby]"             |
| https://github.com/vorbeth/android_vendor_google              | "[github.com,vorbeth,android_vendor_google]"              |
| https://github.com/jcline/fuse-google-drive                   | "[github.com,jcline,fuse-google-drive]"                   |
| https://github.com/arthur-debert/google-code-issues-migrator  | "[github.com,arthur-debert,google-code-issues-migrator]"  |
| https://github.com/ubilabs/google.maps.polyline.edit          | "[github.com,ubilabs,google.maps.polyline.edit]"          |
| https://github.com/1Marc/jquery-ui-google-maps                | "[github.com,1Marc,jquery-ui-google-maps]"                |
| https://github.com/r0man/google-maps-clj                      | "[github.com,r0man,google-maps-clj]"                      |
| https://github.com/wuffers/getoffgoogle                       | "[github.com,wuffers,getoffgoogle]"                       |
| https://github.com/mohamedmansour/google-plus-extension-jsapi | "[github.com,mohamedmansour,google-plus-extension-jsapi]" |
| https://github.com/jarib/google-closure-library               | "[github.com,jarib,google-closure-library]"               |

REGEXP_INSTR

リポジトリのURLにgoogle初めて出現する位置を取得します。

SELECT
  repository_url,
  REGEXP_INSTR(repository_url, r'google') AS url_instr
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
LIMIT
  10
| repository_url                                                | url_instr |
| ------------------------------------------------------------- | --------- |
| https://github.com/gimite/google-spreadsheet-ruby             | 27        |
| https://github.com/vorbeth/android_vendor_google              | 43        |
| https://github.com/jcline/fuse-google-drive                   | 32        |
| https://github.com/arthur-debert/google-code-issues-migrator  | 34        |
| https://github.com/ubilabs/google.maps.polyline.edit          | 28        |
| https://github.com/1Marc/jquery-ui-google-maps                | 36        |
| https://github.com/r0man/google-maps-clj                      | 26        |
| https://github.com/wuffers/getoffgoogle                       | 34        |
| https://github.com/mohamedmansour/google-plus-extension-jsapi | 35        |
| https://github.com/jarib/google-closure-library               | 26        |

REGEXP_REPLACE

リポジトリのURLのgoogleGOOGLEに置き換えます。

SELECT
  repository_url,
  REGEXP_REPLACE(repository_url, r'google', 'GOOGLE') AS url_replace
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
LIMIT
  10
| repository_url                                          | url_replace                                             |
| ------------------------------------------------------- | ------------------------------------------------------- |
| https://github.com/googleapi/googleapi                  | https://github.com/GOOGLEapi/GOOGLEapi                  |
| https://github.com/zquestz/omniauth-google-oauth2       | https://github.com/zquestz/omniauth-GOOGLE-oauth2       |
| https://github.com/zquestz/omniauth-google-oauth2       | https://github.com/zquestz/omniauth-GOOGLE-oauth2       |
| https://github.com/DAddYE/google_tasks                  | https://github.com/DAddYE/GOOGLE_tasks                  |
| https://github.com/yoavaviram/python-google-spreadsheet | https://github.com/yoavaviram/python-GOOGLE-spreadsheet |
| https://github.com/waj/redmine_google_apps              | https://github.com/waj/redmine_GOOGLE_apps              |
| https://github.com/mtoader/google-go-lang-idea-plugin   | https://github.com/mtoader/GOOGLE-go-lang-idea-plugin   |
| https://github.com/brentsowers1/googlestaticmap         | https://github.com/brentsowers1/GOOGLEstaticmap         |
| https://github.com/jarib/google-closure-library         | https://github.com/jarib/GOOGLE-closure-library         |
| https://github.com/jarib/google-closure-library         | https://github.com/jarib/GOOGLE-closure-library         |

REGEXP_SUBSTR

リポジトリのURLからgoogleに続く文字列を抽出します。
REGEXP_EXTRACT と同じ)

SELECT
  repository_url,
  REGEXP_SUBSTR(repository_url, r'google(.*)') AS url_substr
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
ORDER BY
  repository_url
LIMIT
  10
| repository_url                                        | url_substr |
| ----------------------------------------------------- | ---------- |
| https://github.com/136629170/google-maps-api          | -maps-api  |
| https://github.com/1Marc/jquery-ui-google-maps        | -maps      |
| https://github.com/1Marc/jquery-ui-google-maps        | -maps      |
| https://github.com/1Marc/jquery-ui-google-maps        | -maps      |
| https://github.com/1Marc/tutorials-jquery-google-maps | -maps      |
| https://github.com/9flats/google-translate            | -translate |
| https://github.com/9re/android-support-v4-googlemaps  | maps       |
| https://github.com/9re/android-support-v4-googlemaps  | maps       |
| https://github.com/9re/android-support-v4-googlemaps  | maps       |
| https://github.com/9re/android-support-v4-googlemaps  | maps       |

上記クエリの REGEXP_SUBSTRREGEXP_EXTRACT を置き換えても同じ結果となります。

SELECT
  repository_url,
  REGEXP_EXTRACT(repository_url, r'google(.*)') AS url_extract
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  REGEXP_CONTAINS(repository_url, r'google')
ORDER BY
  repository_url
LIMIT
  10
| repository_url                                        | url_extract |
| ----------------------------------------------------- | ----------- |
| https://github.com/136629170/google-maps-api          | -maps-api   |
| https://github.com/1Marc/jquery-ui-google-maps        | -maps       |
| https://github.com/1Marc/jquery-ui-google-maps        | -maps       |
| https://github.com/1Marc/jquery-ui-google-maps        | -maps       |
| https://github.com/1Marc/tutorials-jquery-google-maps | -maps       |
| https://github.com/9flats/google-translate            | -translate  |
| https://github.com/9re/android-support-v4-googlemaps  | maps        |
| https://github.com/9re/android-support-v4-googlemaps  | maps        |
| https://github.com/9re/android-support-v4-googlemaps  | maps        |
| https://github.com/9re/android-support-v4-googlemaps  | maps        |

まとめ

以上、BigQuery の正規表現関数の使いどころを自分なりに整理してみました。

自分自身への備忘録的な記事になってしまいましたが、他の方にも参考になれば幸いです!

参考