A workaround for DynamoDB resources not being found outside the N. Virginia region when running AWS Glue for Spark jobs in a local Docker image
Hello, I'm morimorikochan from the Retail App Co-Creation Division.
To get hands-on with AWS Glue for Spark, I used its Docker image to run a job script locally that operates on DynamoDB resources. Getting it to work in any region other than N. Virginia required a workaround, so I'd like to share it for anyone in the same situation.
Summary up front
- With public.ecr.aws/glue/aws-glue-libs:5, the DynamoDB region is pinned to N. Virginia (us-east-1). If you need to access DynamoDB resources in any other region, use the AWS Glue 4.0 Docker image described below.
- With public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01, the DynamoDB region is also pinned to N. Virginia by default.
- Running the following at the very start of the script forces the region to change:
sc._jsc.hadoopConfiguration().set("dynamodb.region", "ap-northeast-1")
sc._jsc.hadoopConfiguration().set("dynamodb.regionid", "ap-northeast-1")
The issue
Following the setup steps in the AWS documentation (the first link in the references below), I used the AWS Glue Docker image to launch a local REPL shell (PySpark). When I tried to access a DynamoDB table in the Tokyo region from inside the shell, the following error occurred.
Caused by: software.amazon.awssdk.services.dynamodb.model.ResourceNotFoundException: Requested resource not found: Table: someTable not found (Service: DynamoDb, Status Code: 400, Request ID: VC9HBP8H0E0MSR242F0VBNKIDFVV4KQNSO5AEMVJF66Q9ASUAAJG)
This error says the DynamoDB table does not exist, yet the table did exist in the Tokyo region. It also occurred on both reads from and writes to DynamoDB.
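For reference, a minimal sketch of the write path that failed the same way (assuming dyf is a DynamicFrame built earlier in the session; the table name is just an example):

# Writing a DynamicFrame to a DynamoDB table raised the same
# ResourceNotFoundException when the table lived outside N. Virginia.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "someTable",
        "dynamodb.throughput.write.percent": "1",
    },
)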
Full output from REPL shell startup to the error
docker run -it --rm \
-e DISABLE_SSL=true \
-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-e AWS_REGION=$AWS_DEFAULT_REGION \
-e AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION \
-p 4040:4040 \
-p 18080:18080 \
--name glue_pyspark \
amazon/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/glue_user/spark/logs/spark-glue_user-org.apache.spark.deploy.history.HistoryServer-1-7cb28f02c391.out
Python 3.10.2 (main, Oct 8 2024, 04:02:18) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/glue_user/spark/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/glue_user/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/glue_user/aws-glue-libs/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/glue_user/aws-glue-libs/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.0-amzn-1
/_/
Using Python version 3.10.2 (main, Oct 8 2024 04:02:18)
Spark context Web UI available at http://7cb28f02c391:4040
Spark context available as 'sc' (master = local[*], app id = local-1745831606152).
SparkSession available as 'spark'.
>>> import sys
>>> from awsglue.transforms import *
>>> from awsglue.utils import getResolvedOptions
>>> from pyspark.context import SparkContext
>>> from awsglue.context import GlueContext
>>> from awsglue.job import Job
>>> from awsglue.dynamicframe import DynamicFrame
>>> from pyspark.sql import functions as F
>>>
>>> glueContext = GlueContext(sc)
/home/glue_user/spark/python/pyspark/sql/context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
warnings.warn(
>>> spark = glueContext.spark_session
>>> import_table = glueContext.create_dynamic_frame.from_options(
... connection_type="dynamodb",
... connection_options={
... "dynamodb.input.tableName": "someTable",
... "dynamodb.throughput.read.percent": "1"
... }
... )
>>> import_table.show(10)
Traceback (most recent call last): (0 + 1) / 1]
File "<stdin>", line 1, in <module>
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 78, in show
File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/home/glue_user/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o69.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (7cb28f02c391 executor driver): java.lang.RuntimeException: Could not lookup table someTable in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:142)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:68)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:59)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:150)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:82)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:31)
at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:135)
at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:653)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: someTable not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 14JURSQ1TCGKDNTIBH478TLTNBVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:106)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:81)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:131)
... 23 more
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: someTable not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 14JURSQ1TCGKDNTIBH478TLTNBVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6243)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6210)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:2256)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:2220)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:135)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:132)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:78)
... 24 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2229)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2269)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1470)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
at org.apache.spark.rdd.RDD.take(RDD.scala:1443)
at com.amazonaws.services.glue.DynamicFrame.$anonfun$showString$1(DynamicFrame.scala:338)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245)
at com.amazonaws.services.glue.DynamicFrame.showString(DynamicFrame.scala:338)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: Could not lookup table someTable in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:142)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:68)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:59)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:150)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:82)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:31)
at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:135)
at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:653)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:138)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: someTable not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 14JURSQ1TCGKDNTIBH478TLTNBVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:106)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:81)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:131)
... 23 more
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: someTable not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: 14JURSQ1TCGKDNTIBH478TLTNBVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6243)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6210)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:2256)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:2220)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:135)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:132)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:78)
... 24 more
Thinking the environment variables passed to the Docker image might be the cause, I tried fetching data from the Tokyo-region DynamoDB with boto3, which succeeded without issue. I also printed the default region, and it was correctly set to the Tokyo region.
>>> import boto3
>>> dynamodb = boto3.client('dynamodb')
>>> response = dynamodb.list_tables()
>>> for table_name in response['TableNames']:
... print(f"- {table_name}")
...
- someTable1
- someTable2
- ...
>>> print(f"デフォルトリージョン: {boto3.Session().region_name}")
デフォルトリージョン: ap-northeast-1
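As one more check, you can read back what the connector-level Hadoop configuration actually holds. A sketch using the sc the REPL provides (Configuration.get returns None for unset keys):

>>> hadoop_conf = sc._jsc.hadoopConfiguration()
>>> print(hadoop_conf.get("dynamodb.region"))    # None when unset
>>> print(hadoop_conf.get("dynamodb.regionid"))  # None when unset

When these keys are unset, the connector resolves the region on its own instead of honoring AWS_REGION, which would explain why requests end up pinned to N. Virginia.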
That pointed to the AWS Glue code bundled in the aws-glue-libs Docker image, and a search turned up a similar issue, albeit filed against AWS Glue 4.0.
According to it, updating the setting spark.hadoop.dynamodb.customAWSCredentialsProvider to com.amazonaws.auth.DefaultAWSCredentialsProviderChain should resolve the problem, but for me the result was unchanged.
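For completeness, a sketch of how such a setting can be applied at runtime (the spark.hadoop. prefix is Spark's convention for forwarding a property into the Hadoop configuration, so that form drops the prefix):

# Suggested by the issue: switch the connector's credentials provider.
# This did not fix the region problem in my environment.
sc._jsc.hadoopConfiguration().set(
    "dynamodb.customAWSCredentialsProvider",
    "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
)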
Solution
In the end, I could not resolve this on the AWS Glue 5.0 Docker image.
Instead, by using the AWS Glue 4.0 Docker image (public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01) and overriding the following settings at the top of the script, I was able to access DynamoDB resources outside N. Virginia:
sc._jsc.hadoopConfiguration().set("dynamodb.region", "ap-northeast-1")
sc._jsc.hadoopConfiguration().set("dynamodb.regionid", "ap-northeast-1")
These dynamodb.region and dynamodb.regionid settings change how awslabs/emr-dynamodb-connector resolves the region (see the third link in the references).
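Putting it together, a minimal job-script sketch with the override at the top (region and table name are examples; the read mirrors the REPL session above):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
# Force the emr-dynamodb-connector to Tokyo before any DynamoDB access.
sc._jsc.hadoopConfiguration().set("dynamodb.region", "ap-northeast-1")
sc._jsc.hadoopConfiguration().set("dynamodb.regionid", "ap-northeast-1")

glueContext = GlueContext(sc)
spark = glueContext.spark_session

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Same read as in the failing REPL session, now against ap-northeast-1.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "someTable",
        "dynamodb.throughput.read.percent": "1",
    },
)
dyf.show(10)

job.commit()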
Plenty of people are probably hitting this, so I expect it will be fixed at some point, but I hope this helps someone in the meantime.
References
- https://docs.aws.amazon.com/ja_jp/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-docker-image
- https://nsakki55.hatenablog.com/entry/2024/10/09/084043
- https://github.com/awslabs/emr-dynamodb-connector/blob/0b69d2d3ce59f2d47b18d6298e50fd6d8e856daa/emr-dynamodb-hadoop/src/main/java/org/apache/hadoop/dynamodb/DynamoDBUtil.java#L252-L282
- https://github.com/awslabs/aws-glue-libs/issues/39