[Update] I tried AWS Glue Interactive Sessions' Spark Connect support with a notebook from VS Code
This is Ishikawa from the Cloud Business Division. AWS Glue Interactive Sessions now support Apache Spark Connect, enabling you to run PySpark directly from SageMaker Unified Studio notebooks or your local IDE, so I tried connecting from a VS Code notebook to test it out.
AWS Glue Interactive Sessions is a serverless mechanism for interactively running and debugging PySpark code before turning it into a job. Previously, you would interact with sessions via a notebook kernel (SparkMagic) or the Livy-based Statement API.
With this update, AWS Glue Interactive Sessions natively support Apache Spark Connect. Spark Connect is a lightweight client-server architecture that separates the client from the Spark execution environment, enabling direct connections to AWS Glue's serverless Spark from SageMaker Unified Studio notebooks, IDEs with a Python interpreter such as VS Code or PyCharm, or any Python script. A key feature is the ability to perform ad-hoc data exploration, step-by-step debugging, and incremental PySpark job development before production deployment, all within your usual tools.
What is Spark Connect
Apache Spark Connect is an architecture introduced in Spark 3.4 that separates the client from the Spark driver process. The client sends a logical execution plan to the server via gRPC, and results are streamed back using Apache Arrow. The client side does not need a full Spark runtime — any environment supporting PySpark's remote() API can connect.
In AWS Glue, Spark Connect is supported for Interactive Sessions with Glue version 5.1 and later. The differences from traditional Livy sessions are as follows.
| Item | Livy | Spark Connect |
|---|---|---|
| Protocol | REST | gRPC (logical execution plan) + Apache Arrow (result streaming) |
| Connection method | Statement API (RunStatement / GetStatement, etc.) | Direct connection to endpoint URL via PySpark remote() |
| Client requirements | aws-glue-sessions package or AWS SDK | Spark Connect-compatible PySpark |
| Supported environments | Jupyter with SparkMagic kernel | SageMaker Unified Studio notebooks, IDEs such as VS Code and PyCharm |
The overall connection picture looks like this.
Note that Spark Connect is not available in AWS Glue Studio. For interactive development, use SageMaker Unified Studio notebooks or an IDE with a Python interpreter.
Trying It Out
This time, without using SageMaker Unified Studio, I connected to an AWS Glue Spark Connect session and ran PySpark from a VS Code notebook (Jupyter extension) in my usual development environment.
Prerequisites
- Verification region: ap-northeast-1 / Tokyo
- IAM role for AWS Glue Interactive Sessions (in this verification, AWSGlueServiceRoleDefault with access to Glue Data Catalog and S3 was used)
- Required permissions for the calling principal: glue:CreateSession / glue:GetSession / glue:GetSessionEndpoint / glue:DeleteSession, and iam:PassRole to pass the role to Glue
- VS Code (with Python extension and Jupyter extension installed)
- Local Python 3.11 environment
The Spark Connect endpoint is a public gRPC endpoint (443/TLS) using token authentication. There is no need to place it inside a VPC, and you can connect from your local PC over the internet.
Preparing the Client Environment and Kernel
Set up a Spark Connect client locally. A full Spark installation or Java is not required — you can connect with just the PySpark Spark Connect client (a pure Python gRPC client). Since the target Glue 5.1 uses Spark 3.5, match the client's PySpark version accordingly. Also register a Jupyter kernel for selection from VS Code.
python3.11 -m venv .venv
source .venv/bin/activate
pip install boto3 "pyspark[connect]==3.5.6" pandas pyarrow grpcio grpcio-status ipykernel
python -m ipykernel install --user --name sparkconnect-glue --display-name "Python 3.11 (Glue Spark Connect)"
In my environment, I open a notebook (.ipynb) in VS Code and select Python 3.11 (Python 3.11.13) from the "Select Kernel" option in the top right (you can also directly select the venv interpreter). There is no longer a need to prepare a dedicated Spark kernel (SparkMagic) as before — you can connect directly to Glue from a regular Python kernel.
0. Loading Libraries and Preparing Session Information
Build the role ARN for the Glue session from the executing account (retrieved via sts to avoid hardcoding). Make the session ID unique each time.
import time, uuid, urllib.parse
import boto3
from pyspark.sql import SparkSession
REGION = "ap-northeast-1"
glue = boto3.client("glue", region_name=REGION)
account_id = boto3.client("sts", region_name=REGION).get_caller_identity()["Account"]
role_arn = f"arn:aws:iam::{account_id}:role/AWSGlueServiceRoleDefault"
session_id = "blog-vscode-sc-" + uuid.uuid4().hex[:8]
print("session_id:", session_id)
print("role_arn :", role_arn)

1. Creating a Spark Connect Session
Specify SessionType="SPARK_CONNECT" and GlueVersion="5.1". To keep costs down, the worker configuration is minimal (G.1X × 2) and the idle timeout is set short.
resp = glue.create_session(
    Id=session_id,
    Role=role_arn,
    Command={"Name": "glueetl"},
    GlueVersion="5.1",
    SessionType="SPARK_CONNECT",
    NumberOfWorkers=2,
    WorkerType="G.1X",
    IdleTimeout=15,
    Timeout=30,
    DefaultArguments={"--language": "python", "--enable-glue-datacatalog": "true"},
)
s = resp["Session"]
print("Status :", s["Status"])
print("SessionType :", s["SessionType"])
print("GlueVersion :", s["GlueVersion"])

The session was created with SessionType SPARK_CONNECT, and the management console shows the Interactive Session in the Provisioning state.
2. Waiting Until READY
Poll until the session reaches READY status. In this case, it reached READY within a few tens of seconds.
while True:
    status = glue.get_session(Id=session_id)["Session"]["Status"]
    print(status)
    if status in ("READY", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(10)

You can confirm that the Interactive Session is Ready.
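The polling loop above runs until the session reaches a terminal status, with no upper bound on the wait. As a minimal sketch, the same idea can be wrapped in a helper that gives up after a deadline and raises if the session ends in anything other than READY. The name `wait_for_session`, the 600-second default, and the injectable `sleep` / `clock` arguments are assumptions made for this illustration, not part of the Glue API.

```python
import time
from typing import Callable

# Statuses treated as terminal, taken from the polling loop in this article.
TERMINAL_STATES = {"READY", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_session(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,   # assumed upper bound, adjust as needed
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until READY; raise on failure or when the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            if status != "READY":
                raise RuntimeError(f"session ended in status {status}")
            return status
        if clock() >= deadline:
            raise TimeoutError(f"session still {status} after {timeout_s} seconds")
        sleep(interval_s)


# Usage with the client and session ID from the cells above:
# wait_for_session(lambda: glue.get_session(Id=session_id)["Session"]["Status"])
```

Passing the status lookup as a callable keeps the helper independent of boto3, so it can be exercised with a fake status source.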

3. Retrieving the Endpoint and Token
Use get_session_endpoint (the new API) to retrieve the Spark Connect connection URL and authentication token, then construct the connection string to pass to remote(). The token is masked in the output as it is sensitive information.
ep = glue.get_session_endpoint(SessionId=session_id)["SparkConnect"]
token = urllib.parse.quote(ep["AuthToken"], safe="")
remote = f"{ep['Url']}:443/;use_ssl=true;x-aws-proxy-auth={token}"
print("Url :", ep["Url"])
print("AuthToken (len) :", len(ep["AuthToken"]), "chars (masked)")
print("AuthTokenExpirationTime:", ep["AuthTokenExpirationTime"])

The SparkConnect object in the response contains Url, AuthToken, and AuthTokenExpirationTime (the token expiration time).
4. Connecting via Spark Connect
Pass the constructed connection string to SparkSession.builder.remote(...) to connect. The VS Code kernel (local PC) acts purely as a thin client, and processing is executed on the Spark side in Glue.
spark = SparkSession.builder.remote(remote).getOrCreate()
spark.version

The Spark version on the connection target was 3.5.6-amzn-1.
5. Running PySpark / Spark SQL
Run the DataFrame API and Spark SQL from cells.

The DataFrame API, Spark SQL, and aggregations using Temp Views all ran without issues from VS Code cells. Result collection (show) is performed via Apache Arrow.
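The cells for this step are not shown here, so the following is only a rough sketch of what such cells might look like. The sample rows, the column names, the `items` view name, and the `run_examples` wrapper are all made up for illustration; `spark` is assumed to be the session obtained in step 4.

```python
def run_examples(spark):
    """Run a small DataFrame API and Spark SQL example on a connected SparkSession."""
    # Illustrative data only; not the data used in the original verification.
    rows = [
        ("apple", "fruit", 120),
        ("banana", "fruit", 80),
        ("carrot", "vegetable", 60),
    ]
    df = spark.createDataFrame(rows, ["item", "category", "price"])

    # DataFrame API: filter with a SQL expression string and display the result.
    df.filter("price >= 80").show()

    # Spark SQL: register a temp view and aggregate over it.
    df.createOrReplaceTempView("items")
    agg = spark.sql(
        "SELECT category, COUNT(*) AS n, SUM(price) AS total "
        "FROM items GROUP BY category ORDER BY category"
    )
    agg.show()
    return agg


# In the notebook: run_examples(spark)
```

The local list is sent to the server when the DataFrame is created, and the filter and aggregation are executed on the Glue side.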
6. Cleanup (Deleting the Session)
Interactive Sessions incur charges the entire time they are running, so delete the session when finished. Rather than relying on idle timeout, it is safer to explicitly delete it in the last cell of the notebook.
spark.stop()
glue.delete_session(Id=session_id)
print("deleted:", session_id)
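When the same flow runs as a plain Python script rather than as notebook cells, an exception partway through would skip the cleanup lines. One way to guard against that is a small context manager that always deletes the session on exit. This is a sketch under assumptions: `glue_session_cleanup` is a hypothetical helper name, and only the `delete_session(Id=...)` call and `spark.stop()` already used in this article are relied on.

```python
from contextlib import contextmanager


@contextmanager
def glue_session_cleanup(glue_client, session_id, spark=None):
    """Delete the Glue session on exit, even if the body raises."""
    try:
        yield session_id
    finally:
        if spark is not None:
            try:
                spark.stop()
            except Exception:
                # A failed stop should not prevent the session from being deleted.
                pass
        glue_client.delete_session(Id=session_id)


# Usage with the objects from the cells above:
# with glue_session_cleanup(glue, session_id, spark):
#     spark.sql("SELECT 1").show()
```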

Discussion
Here is a summary of the insights and points to note gained from actually trying this out from a VS Code notebook.
- Even without a full Spark or Java installation on the local PC, it was possible to interactively connect to Glue's serverless Spark with just pyspark[connect]. There is no longer a need to prepare a dedicated Spark kernel (SparkMagic) as before; the ease of connecting directly from a regular Python kernel is a practical advantage.
- The connection uses a public gRPC endpoint (443/TLS) plus token authentication, with no VPC setup required. Being able to connect directly from your everyday VS Code environment is useful in practice.
- The authentication token has an expiration time (AuthTokenExpirationTime). For long-running sessions, it is advisable to implement a mechanism, as shown in the official documentation samples, to refresh the token by calling get_session_endpoint again and reconnecting.
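As a rough sketch of the refresh idea, the two helpers below rebuild the connection string in the same format used in step 3 and decide whether the token is close to expiry. The names `build_remote` and `token_needs_refresh` and the 5-minute margin are assumptions for illustration, and the code assumes AuthTokenExpirationTime comes back as a timezone-aware datetime. This is not the official sample.

```python
import urllib.parse
from datetime import datetime, timedelta, timezone


def build_remote(url, auth_token):
    """Build the string passed to SparkSession.builder.remote(), as in step 3."""
    token = urllib.parse.quote(auth_token, safe="")
    return f"{url}:443/;use_ssl=true;x-aws-proxy-auth={token}"


def token_needs_refresh(expiration, now=None, margin=timedelta(minutes=5)):
    """Return True when the token expires within `margin` (or has already expired)."""
    now = now or datetime.now(timezone.utc)
    return expiration - now <= margin


# Usage with the objects from the cells above:
# ep = glue.get_session_endpoint(SessionId=session_id)["SparkConnect"]
# if token_needs_refresh(ep["AuthTokenExpirationTime"]):
#     ep = glue.get_session_endpoint(SessionId=session_id)["SparkConnect"]
#     remote = build_remote(ep["Url"], ep["AuthToken"])
```

When the check returns True, the new connection string would then be used to reconnect, as described in the bullet above.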
The following limitations are worth keeping in mind.
- The session type (Livy / Spark Connect) cannot be changed after creation.
- The Statement API (RunStatement, etc.) cannot be used with Spark Connect sessions. Interaction is done directly from the client.
- Fine-grained access control (FGAC) via Lake Formation is not supported; only full table access is available. In practice, when I separately tried SHOW DATABASES, access was denied due to insufficient permissions (Required Describe on default) for catalogs managed by Lake Formation. This indicates that when working with tables in Glue Data Catalog, it is necessary to design Lake Formation permissions for the session role.
- Spark Connect is not available in AWS Glue Studio. It can only be used from SageMaker Unified Studio notebooks or an IDE with a Python interpreter.
When using SageMaker Unified Studio notebooks, the sagemaker-studio library's Spark utilities allow you to connect without worrying about endpoint retrieval or constructing the connection string.
from sagemaker_studio import sparkutils
spark = sparkutils.init(connection_name="my-glue-spark-connection")
In terms of cost, a single verification run (session creation → PySpark execution → deletion) consumed DPUSeconds=331.892 (approximately 0.092 DPU hours). At the Tokyo region rate ($0.44/DPU hour), this comes to roughly $0.04. Interactive Sessions have a one-minute minimum charge and keep incurring charges while running, so it is safest to delete the session when you are done. Pricing varies by region and over time, so check the official pricing page for the latest rates.
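The arithmetic behind that estimate can be checked with a few lines. The sketch uses only the two figures quoted above (331.892 DPU-seconds and $0.44 per DPU hour) and ignores any minimum billing duration; `session_cost` is a hypothetical helper name.

```python
DPU_SECONDS = 331.892          # measured value quoted in this article
RATE_USD_PER_DPU_HOUR = 0.44   # Tokyo region rate quoted in this article


def session_cost(dpu_seconds, rate_per_dpu_hour):
    """Convert DPU-seconds to DPU-hours and multiply by the hourly rate."""
    dpu_hours = dpu_seconds / 3600
    return dpu_hours, dpu_hours * rate_per_dpu_hour


dpu_hours, cost_usd = session_cost(DPU_SECONDS, RATE_USD_PER_DPU_HOUR)
print(f"{dpu_hours:.3f} DPU-hours, about ${cost_usd:.2f}")
# prints: 0.092 DPU-hours, about $0.04
```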
Closing Thoughts
With AWS Glue Interactive Sessions now supporting Spark Connect, it is possible to directly connect to serverless Glue Spark from SageMaker Unified Studio notebooks, local IDEs, or Python scripts, enabling interactive PySpark development. In this walkthrough, I confirmed the ability to connect from a VS Code notebook using pyspark[connect] and successfully run both the DataFrame API and Spark SQL.
In recent years, Apache Spark has seen greatly improved support for Apache Iceberg tables, and there are an increasing number of cases where Spark is required when creating tables. The entire workflow — from creating an Interactive Session to running Spark SQL and deleting the session — can be described in a standard notebook file rather than IaC (Infrastructure as Code), which makes it easy to incorporate into Git-based code management.
Being able to incrementally build PySpark jobs on Glue within your everyday development environment, without worrying about cluster management, seems particularly valuable during the trial-and-error phase before productionizing jobs. When working with tables in Glue Data Catalog, be mindful of Lake Formation permission design (and the FGAC limitation), but the recommended starting point is simply connecting from your local VS Code and giving it a try.
