AWS Glue 5.0 now lets you specify Python libraries with requirements.txt, so I tested it

One of the new features of AWS Glue 5.0, announced at re:Invent 2024, is the ability to specify the Python libraries used by a Spark job with `requirements.txt`, so I tried it out.


morimorikochan

2025.04.14

Hello everyone, this is morimorikochan from the Retail App Co-Creation Department.
AWS Glue 5.0 was announced at re:Invent 2024, and one of its new features lets you specify the Python libraries used by a Spark job with requirements.txt.

In AWS Glue 4.0, libraries had to be listed inline in the --additional-python-modules option. With this new feature, installing a large number of libraries becomes much easier.
https://aws.amazon.com/jp/about-aws/whats-new/2024/12/aws-glue-5-0/
In this post I verify how it actually works.

Preparing the Spark job

I use the job created in the official AWS Glue for Spark tutorial as the base for this test.
https://docs.aws.amazon.com/ja_jp/glue/latest/dg/aws-glue-programming-intro-tutorial.html
The tutorial has a few tricky points, so I summarized my own procedure in the following blog post. Please refer to it as well.
https://dev.classmethod.jp/articles/aws-glue-for-spark-tutorial
Once the job is ready, confirm that its AWS Glue version is set to 5.0.

Creating requirements.txt

Create requirements.txt.

This time I use a library called art to print our company name to the log as ASCII art.
https://www.ascii-art.site/
requirements.txt contains a single line:
art==6.4
Upload this file to an S3 bucket and note its S3 URI.
s3://glue-test-0412-requirementtxt/requirements.txt
Also, do not forget to grant the job's IAM role read access to this S3 bucket.
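
The read permission mentioned above could look roughly like the following. This is a minimal sketch, assuming that `s3:GetObject` on the single requirements.txt object is enough for the job role; the helper names `split_s3_uri` and `read_policy_for` are hypothetical, and the resulting JSON still has to be attached to the role in whatever way you normally manage IAM.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not an S3 object URI: {uri}")
    return parsed.netloc, key


def read_policy_for(uri):
    """Build an IAM policy document allowing reads of one S3 object.

    Assumption: s3:GetObject on the object itself is sufficient.
    """
    bucket, key = split_s3_uri(uri)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{key}"],
            }
        ],
    }


if __name__ == "__main__":
    uri = "s3://glue-test-0412-requirementtxt/requirements.txt"
    print(json.dumps(read_policy_for(uri), indent=2))
```

Scoping the resource to the single object rather than the whole bucket is a design choice of this sketch, not something the test in this article depends on.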

Configuring the Spark job

In the job settings screen of Glue Studio, set the following two parameters under Job details > Advanced properties > Job parameters.

Key	Value
--python-modules-installer-option	-r
--additional-python-modules	{the S3 URI noted earlier}
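
The same two parameters can also be written down as a dictionary, for example to pass as the job's default arguments when the job is defined through the AWS SDK instead of the console. This is a minimal sketch; `requirements_job_arguments` is a hypothetical helper, and the S3 URI is the one used in this article.

```python
def requirements_job_arguments(requirements_s3_uri):
    """Return the two job parameters that point Glue 5.0 at a requirements.txt."""
    if not requirements_s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    return {
        # "-r" tells the installer to treat the value below as a requirements file.
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": requirements_s3_uri,
    }


if __name__ == "__main__":
    args = requirements_job_arguments(
        "s3://glue-test-0412-requirementtxt/requirements.txt"
    )
    for key, value in args.items():
        print(key, value)
```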

In addition, import art in the Spark job script and print the ASCII art.
+ from art import *
import sys
from awsglue.transforms import *
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets", table_name="tickets", transformation_ctx="S3bucket_node1"
)

+ tprint("class method")

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
Full script:
from art import *
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets", table_name="tickets", transformation_ctx="S3bucket_node1"
)

tprint("class method")

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("tag_number_masked", "string", "tag_number_masked", "string"),
        ("date_of_infraction", "string", "date_of_infraction", "string"),
        ("ticket_date", "string", "ticket_date", "string"),
        ("ticket_number", "decimal", "ticket_number", "float"),
        ("officer", "decimal", "officer_name", "decimal"),
        ("infraction_code", "decimal", "infraction_code", "decimal"),
        ("infraction_description", "string", "infraction_description", "string"),
        ("set_fine_amount", "decimal", "set_fine_amount", "float"),
        ("time_of_infraction", "decimal", "time_of_infraction", "decimal"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://morifuji-aws-glue-test-20250412", "partitionKeys": []},
    format_options={"compression": "gzip"},
    transformation_ctx="S3bucket_node3",
)

job.commit()
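
If you want the job log to state explicitly which version of a library was installed, a check along the following lines could be added near the top of the script. This is only a sketch: `installed_version` and `check_pin` are hypothetical helpers built on the standard library's `importlib.metadata`, and they were not part of the job tested in this article.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package, expected):
    """Describe whether the installed version matches the pinned one."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found != expected:
        return f"{package} {found} is installed, expected {expected}"
    return f"{package} {found} is installed as pinned"


if __name__ == "__main__":
    # Matches the pin in the requirements.txt used in this article.
    print(check_pin("art", "6.4"))
```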
Run the Spark job in this state.

Execution result

The job completed successfully.

Looking at the driver node log, the following lines show that requirements.txt was loaded from the S3 bucket.
INFO	2025-04-12T09:09:19,414	17133	com.amazonaws.services.glue.PythonModuleInstaller	[pool-4-thread-1]	176	Command to Execute /usr/bin/bash /tmp/glue-job-14018748905261537258/pythonmodules/pythonmoduleinstaller.sh /tmp/glue_venv -r /tmp/glue-job-14018748905261537258/requirements.txt
INFO	2025-04-12T09:09:19,414	17133	com.amazonaws.services.glue.FileDownloader	[sdk-async-response-2-2]	145	Object downloaded. Details: GetObjectResponse(AcceptRanges=bytes, LastModified=2025-04-12T09:03:50Z, ContentLength=8, ETag="8db0eb3773366754becd61afd4b0ebe2", ContentType=text/plain, ServerSideEncryption=AES256, Metadata={})
INFO	2025-04-12T09:09:18,840	16559	com.amazonaws.services.glue.FileDownloader	[pool-4-thread-1]	95	downloading s3://glue-test-0412-requirementtxt/requirements.txt file to destination location: /tmp/glue-job-14018748905261537258/requirements.txt
INFO	2025-04-12T09:09:18,840	16559	com.amazonaws.services.glue.utils.AWSS3Utils$	[pool-4-thread-1]	21	Encoding S3 URI s3://glue-test-0412-requirementtxt/requirements.txt
INFO	2025-04-12T09:09:18,841	16560	com.amazonaws.services.glue.utils.AWSS3Utils$	[pool-4-thread-1]	26	Encoded S3 URI to s3://glue-test-0412-requirementtxt/requirements.txt
INFO	2025-04-12T09:09:18,841	16560	com.amazonaws.services.glue.FileDownloader	[pool-4-thread-1]	138	Download bucket: glue-test-0412-requirementtxt key: requirements.txt to /tmp/glue-job-14018748905261537258/requirements.txt with usingProxy: false and isProxyDisabled: true
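
The log lines above are not in chronological order, which makes the sequence harder to follow. A small sketch for parsing and sorting them follows. It assumes the fields are tab-separated as they appear in the excerpt, and that the third column is an elapsed time in milliseconds (the differences match the timestamps in this excerpt, but that meaning is my assumption). `GlueLogLine`, `parse_line`, and `chronological` are hypothetical names.

```python
from typing import NamedTuple


class GlueLogLine(NamedTuple):
    level: str
    timestamp: str
    elapsed_ms: int  # assumption: milliseconds since the process started
    logger: str
    thread: str
    source_line: int
    message: str


def parse_line(raw):
    """Parse one tab-separated driver log line into its fields."""
    level, timestamp, elapsed, logger, thread, source_line, message = (
        raw.rstrip("\n").split("\t", 6)
    )
    return GlueLogLine(
        level, timestamp, int(elapsed), logger,
        thread.strip("[]"), int(source_line), message,
    )


def chronological(raw_lines):
    """Return parsed lines sorted by the elapsed-time column."""
    return sorted((parse_line(r) for r in raw_lines), key=lambda e: e.elapsed_ms)


# Two lines copied from the log excerpt above.
SAMPLE = [
    "INFO\t2025-04-12T09:09:18,841\t16560\tcom.amazonaws.services.glue.utils.AWSS3Utils$"
    "\t[pool-4-thread-1]\t26\tEncoded S3 URI to s3://glue-test-0412-requirementtxt/requirements.txt",
    "INFO\t2025-04-12T09:09:18,840\t16559\tcom.amazonaws.services.glue.utils.AWSS3Utils$"
    "\t[pool-4-thread-1]\t21\tEncoding S3 URI s3://glue-test-0412-requirementtxt/requirements.txt",
]

if __name__ == "__main__":
    for entry in chronological(SAMPLE):
        print(entry.timestamp, entry.logger.rsplit(".", 1)[-1], entry.message)
```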
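The processing that uses the library (the ASCII art output) also succeeded. (It was written to a different log group from the one AWS Glue itself logs to, so it took me a while to find.)

Running on AWS Glue 4.0

Just to be sure, I ran the same job on AWS Glue 4.0, and the following error occurred.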
LAUNCH ERROR | Python Module Installer indicates modules that failed to install, check logs from the PythonModuleInstaller.Please refer logs for details.
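
On AWS Glue 4.0 the libraries have to be listed inline in --additional-python-modules, as noted at the start of this article. A minimal sketch of converting the contents of a requirements.txt into that comma-separated form follows. `inline_modules` is a hypothetical helper that only handles plain requirement lines: comments and blank lines are skipped, and lines it cannot express inline are rejected rather than guessed at.

```python
def inline_modules(requirements_text):
    """Convert requirements.txt contents to a comma-separated module list."""
    modules = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw.split(" #", 1)[0].strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("-"):
            raise ValueError(f"pip option lines are not supported: {line}")
        if "," in line:
            # A comma inside a specifier would be ambiguous in a comma-separated list.
            raise ValueError(f"multi-clause specifiers are not supported: {line}")
        modules.append(line)
    return ",".join(modules)


if __name__ == "__main__":
    # The requirements.txt used in this article.
    print(inline_modules("art==6.4"))
```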
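
Summary

That wraps up the test. Starting with AWS Glue 5.0, it is easy to specify many libraries for a job.

This should make library management simpler and improve development efficiency.

I will keep an eye on new AWS Glue features going forward.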

References

https://docs.aws.amazon.com/ja_jp/glue/latest/dg/aws-glue-programming-python-libraries.html#addl-python-modules-requirements-txt
https://docs.aws.amazon.com/ja_jp/glue/latest/dg/aws-glue-programming-intro-tutorial.html
