Store and Process Graph Data using the Classmethod Tupl Storage Backend for Titan

alexp

2016.03.31 01:05

この記事は公開されてから1年以上経過しています。情報が古い可能性がありますので、ご注意ください。

Graph databases store and manage the relationships (edges) between different entities (vertices). Here is an example of a simple graph:

What are graph databases?

A simplified social network represented as a graph (original image from the Amazon DynamoDB Storage Backend for Titan documentation).

Graphs can be used to model social networks, gaming networks, master data, recommendations, logistics, genomics, fraud, geodesy, maps, networks and workflows, to name a few. Both vertices and edges can be typed; some vertices could be people as in our example, and others categories of popular items. Similarly, some edges could denote friendship relationships, and others could denote “likes” relationships (as above). Graph databases that support directed property graphs all allow properties to describe vertices and edges, often in the form of name-value pairs. Once a graph has been built, it is processed by traversing the edges between the vertices. In the graph above, we could traverse from Justin to Anna, and from there to Books.

Titan is a scalable graph database that is optimized for storing and querying graphs that contain hundreds of billions of vertices and edges. It is transactional, and can support concurrent access from thousands of users.

Classmethod Tupl Storage Backend for Titan

Titan’s pluggable data storage layer already supports several NoSQL databases and key-value stores. This allows you to choose the backend that provides the performance and features required by your application, while giving you the freedom to switch from one backend to another with minimal changes to your application code.

Today we are making a new Classmethod Tupl Storage Backend for Titan available. Storing your Titan graphs on Amazon's EC2 and Elastic Block Store (EBS) lets you handle large graphs without having to worry about building, running, or maintaining your own database cluster. You can combine multiple EBS block devices in a logical RAID array to achieve the level of availability and performance you require, even for small graphs. You can also run the Classmethod Tupl Storage Backend for Titan on your laptop in an in-memory mode for rapid development and testing. The Classmethod Tupl Storage Backend for Titan uses the Tupl: The Unnamed Persistence Library as its underlying storage mechanism.

The backend works with Titan version 1.0.0. This version supports fast traversals, edges that are both directed and typed, stored relationships, vertex partitioning, vertex labels, user defined transaction logs, and more. The backend did not make any changes to Tupl to support it. You are simply using Tupl as an efficient way to store your graphs.

Version 1.0.0 of Titan is compatible with version 3.0.1-incubating of the TinkerPop stack. TinkerPop is a “graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP)”.

Since I am talking about graphs, I should illustrate all of the items that I have talked about in the form of a graph! Here you go:

This graph shows the relationships between the various software elements of TinkerPop, Titan and this storage backend.

I created the following Gremlin script. It replicates the graph above using the Classmethod Storage Backend for Titan.

[gist id="f83a422fec48c2adf7a6"]

Benchmarks

I re-ran some of the experiments described in Beis, Papadopoulos and Kompatsiaris’ paper to compare the performance of this storage backend with other Titan backends and Neo4J. First, I ran the Massive Insertion Workload (MIW) on Titan-BerkeleyDB, this storage backend and Neo4J. MIW only commits once, after all the vertices and edges are loaded. Other performance optimizations are also employed on the Neo4J and Titan sides. For Titan, this involves using custom vertex ids, turning on the batch-loading option, and turning off transaction support. For all databases, a vertex cache keyed on the vertex id is employed to speed up edge creation.

MIW time for Titan-BerkeleyDB, Titan-Tupl and Neo4J

I ran the MIW on /dev/shm of an m4.10xlarge EC2 instance. I graphed the performance of each graph database for the MIW as a function of columns. Execution time appears to be linear in the number of columns, so I used least squares to graph a linear approximation for each graph database. A column is a piece of information (a vertex, a vertex property, an edge, and Titan composite index entries) that relates to a particular vertex. The “column” parlance is an artifact of the Titan KCV implementation of edgestore and graphindex tables. I multiply the number of vertices for each graph by three because one column corresponds to the vertex itself, another to the ID property and a final for the index entry in the ID index for that vertex. For each edge, there is one column in the intended direction and one column for the opposite direction, absent configuration options to the contrary. For the MIW, the Classmethod Tupl Storage Backend for Titan appears to be around 1.5 times faster than Titan-BerkeleyDB, and Neo4J appears to be approximately 2.5 times faster than the Classmethod Tupl Storage Backend for Titan.

Second, I measured the storage footprint of each of these databases when the graphs are persisted in the MIW.

MIW Storage Footprint of Titan-BerkeleyDB, Titan-Tupl and Neo4J

The persistent storage footprint of all three graph databases appears to be linear in the number of columns. Neo4J had the smallest footprint at 18 MiB per million columns. The Classmethod Tupl Storage Backend for Titan came in second at 27 MiB per million columns, and Titan-BerkeleyDB came in last at 54 MiB per million columns. In short, the performance Classmethod Tupl Storage Backend for Titan represents a step up from Titan-BerkeleyJE's performance on the write side, but it still lags behind Neo4J.

Getting Started

The Classmethod Tupl Storage Backend for Titan is available as a Maven project on GitHub. It runs on Windows, OSX and Linux and requires Maven and Java 1.8 (or later). The Classmethod Tupl Storage Backend for Titan includes installation instructions and an example that makes creative use of the Marvel Universe Social Graph public dataset. We have also created a CloudFormation template that will Launch an EC2 instance that has the Gremlin Server stack with the Classmethod Tupl Storage Backend for Titan, installed and ready to use. Finally, I updated the graphdb-benchmarks project on GitHub to support TinkerPop 3.0 and this storage backend to compare performance with other Titan storage backends and other graph databases.

-Alex;