Query Neptune Graph using Gremlin Query Language

2021.09.09

Introduction to Neptune

Amazon Neptune

Neptune is a fully managed graph database service, used for storing and querying complex relationships in highly connected datasets. It can query billions of relationships with millisecond latency. A graph database models the relationships between data items directly: in a property graph, nodes (vertices) describe the entities and edges describe the relationships between them, while in RDF each statement combines a subject, a predicate, an object and, optionally, a graph.

 

 

Query Languages

Neptune mainly supports the following graph query languages:

  • Gremlin 
    • Gremlin is defined by Apache TinkerPop.
    • It is used for querying property graphs in Neptune.
  • SPARQL 
    • It is used for querying RDF data.
    • SPARQL is great for multiple data sources with a large variety of datasets.

Both property-graph data (Gremlin) and RDF data (SPARQL) can be stored on the same Neptune cluster, but they are kept as separate databases: data inserted using one query language can be queried only with that language. Here, I am using Gremlin to query the Neptune database.

 

Implementation

A Neptune cluster can be accessed only from within its VPC, for example from an EC2 instance in the same VPC or from a Jupyter notebook attached to the cluster. Here, I am creating a Neptune cluster together with a Jupyter notebook.

 

Creating Neptune Cluster

Use the following steps to create a Neptune cluster with a Jupyter notebook through the console:

  • Open the Neptune console and click 'Create database'.
  • Select the Engine version and enter a name for your database. Choose either 'Production' or 'Development and testing' as your template for the database.

 

 

  • Select the 'Burstable classes' DB instance class with 'db.t3.medium' as the instance. You can either select the default VPC with a default subnet or create a new VPC with a new subnet. This time I am choosing the default VPC with the default subnet and the default security group.

 

  • To create the notebook, check the 'Create notebook' checkbox and add the following configuration details:
    • Instance type : ml.t3.medium
    • Notebook name : enter a name for your notebook
    • IAM role : select 'Create an IAM role'
    • IAM role name : enter a name for the IAM role
    • Internet access : choose 'Direct access through Amazon SageMaker'.

 

 

  • Creating the database takes a while. The result is a Neptune cluster with default reader and writer endpoints; you can also create custom endpoints if needed.

 

Loading Data from S3

  • Create two files: 'vertices.csv', containing the node properties, and 'edges.csv', containing the edge properties. Create an S3 bucket in the same region as Neptune and upload the two files.
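As a rough sketch of what the two files might contain, the snippet below builds them in the CSV layout that Neptune's Gremlin bulk loader expects: '~id' and '~label' system columns for vertices, and '~id', '~from', '~to' and '~label' for edges. The rows themselves (person1 to person3, tea, the 'drink' label and the edge IDs) are hypothetical sample data chosen to match the IDs used in the queries later on, not the actual files used here.

```python
import csv
import io

# Hypothetical sample data; only the header columns follow Neptune's loader format.
VERTEX_ROWS = [
    ["~id", "~label", "name:String"],
    ["v1", "person", "person1"],
    ["v2", "person", "person2"],
    ["v3", "person", "person3"],
    ["v4", "drink", "tea"],
]
EDGE_ROWS = [
    ["~id", "~from", "~to", "~label"],
    ["e1", "v1", "v2", "friend"],
    ["e2", "v2", "v3", "friend"],
    ["e3", "v3", "v4", "likes"],
]


def to_csv(rows):
    """Serialize a list of rows to CSV text with plain newline endings."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


vertices_csv = to_csv(VERTEX_ROWS)
edges_csv = to_csv(EDGE_ROWS)
print(vertices_csv)
print(edges_csv)
```

Saving the two strings as 'vertices.csv' and 'edges.csv' gives files that can be uploaded to the bucket.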

 

 

  • Edit the inbound rule of the security group attached to Neptune and set its source to an IP address (or range) that allows the S3 access.
  • Create an IAM role that allows Neptune to access S3, with the following configuration:
    • AWS Service - S3
    • Policies - AmazonS3ReadOnlyAccess
    • Role name - NeptuneLoadFromS3
    • After creating the role, edit the trust relationship and add the following trust policy:

 

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "rds.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

 

  • Attach the IAM role to the Neptune Cluster.
  • Create a VPC endpoint so that Neptune can communicate with the S3 bucket. Select the S3 service of type Gateway as the service name and choose the same VPC as the Neptune cluster. Choose the full access policy and click 'Create'.

 

 

  • Open the Jupyter notebook that was created; it opens in SageMaker. Create a new Python 3 notebook and load the data from S3 using the load command, specifying the S3 bucket name and the ARN of the IAM role created above.

This command loads the data from S3 into Neptune.
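For reference, a bulk load can also be started by sending a POST request to the cluster's loader endpoint on port 8182. The sketch below only builds such a request so its shape is visible. The endpoint host, bucket name and account ID are placeholders, and build_load_request is a hypothetical helper name, not part of any AWS library.

```python
import json
import urllib.request


def build_load_request(endpoint, bucket, role_arn, region, prefix=""):
    """Build (without sending) a bulk-load request for the Neptune loader."""
    payload = {
        "source": f"s3://{bucket}/{prefix}",  # folder or file holding the CSVs
        "format": "csv",                      # Gremlin CSV load format
        "iamRoleArn": role_arn,               # role attached to the cluster
        "region": region,                     # region of the S3 bucket
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: replace them with your own endpoint, bucket and role ARN.
request = build_load_request(
    endpoint="my-cluster.cluster-example.us-east-1.neptune.amazonaws.com",
    bucket="my-neptune-bucket",
    role_arn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    region="us-east-1",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

From a machine inside the VPC, passing the request to urllib.request.urlopen would submit the load job; the call is left out here so that the sketch runs anywhere.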

 

Querying a Neptune Graph using Gremlin

  • Listing Vertices and Edges

    • Command to list all the Vertices.
%%gremlin
g.V()
    • Command to list all the Edges.
%%gremlin
g.E()

 

  • Adding vertices and Edges

    • Adding a vertex with a label, ID and property.
%%gremlin
g.addV('person').property(id, 'v6').property('name', 'person6')
    • Updating a property of the vertex. Without 'single', Neptune uses set cardinality and adds the new value alongside the existing value of the name property instead of replacing it.
%%gremlin
g.V('v6').property(single, 'name', 'person7')
    • Adding an edge connecting two vertices. The target vertex is given as an anonymous traversal (__).
%%gremlin
g.V('v6').addE('friend').to(__.V('v2')).property(id, 'e6')
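The difference between the default set cardinality and 'single' when updating a property can be illustrated with a small in-memory model. This is only a sketch of the semantics using plain Python dictionaries and sets; set_property is a hypothetical helper and does not talk to Neptune.

```python
def set_property(vertex, key, value, cardinality="set"):
    """Mimic the effect of Gremlin's property() step on a dict-based vertex."""
    if cardinality == "single":
        # 'single' replaces every existing value of the property.
        vertex[key] = {value}
    else:
        # 'set' (Neptune's default) keeps the old values and adds the new one.
        vertex.setdefault(key, set()).add(value)
    return vertex


appended = set_property({"name": {"person6"}}, "name", "person7")
replaced = set_property({"name": {"person6"}}, "name", "person7", cardinality="single")

print(sorted(appended["name"]))  # both the old and the new value remain
print(sorted(replaced["name"]))  # only the new value remains
```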

 

  • Deleting vertices and Edges 

    • Deleting a single vertex.
%%gremlin
g.V().has(id, '1').drop()
    • Deleting all the vertices.
%%gremlin
g.V().drop().iterate()
    • Deleting a single edge.
%%gremlin
g.E().has(id, 'e6').drop()
    • Deleting all the Edges.
%%gremlin
g.E().drop().iterate()

 

  • Traversing through the graph

    • Traversing the graph to get the friends of person2 (v2).
%%gremlin
g.V('v2').out('friend')

output
v[v3] //represents person3
v[v6] //represents person6
    • Finding out what person2's friends like.
%%gremlin
g.V('v2').out('friend').out('likes')

output
v[v4] //represents tea
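To make the two-hop traversal concrete, here is a minimal in-memory sketch of what chaining out() steps does: each step follows the outgoing edges with the given label from every vertex produced by the previous step. The edge list is a hypothetical one chosen so that it reproduces the outputs shown above; it is not read from Neptune.

```python
# Hypothetical edges as (source, label, target) tuples.
EDGES = [
    ("v2", "friend", "v3"),
    ("v2", "friend", "v6"),
    ("v3", "likes", "v4"),
]


def out(vertices, label):
    """Follow outgoing edges with the given label from each input vertex."""
    return [
        target
        for vertex in vertices
        for (source, edge_label, target) in EDGES
        if source == vertex and edge_label == label
    ]


friends = out(["v2"], "friend")  # first hop: person2's friends
likes = out(friends, "likes")    # second hop: what those friends like
print(friends)
print(likes)
```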

 

Summary

We have successfully created a Neptune database, loaded data into it from S3 and queried it using the Gremlin query language. Amazon Neptune is very useful for highly connected datasets with complex relationships. It is highly available through read replicas, and it provides data security features such as encryption at rest and in transit.