At the recent PyData conference in New York, the data science team at PokitDok described how it used the TinkerPop stack, together with the Titan graph database, Python and containers, in its healthcare API platform. The stack allowed the group to show a simulated data set of doctors and their connections to consumers who might have scheduled appointments with them.
Expanding the available languages will be a major focus for TinkerPop, a graph computing framework for tying together relational and graph databases in one unified model, which just became a top-level project at the Apache Software Foundation.
“TinkerPop was always great when it came to technology, but it was just a handful of people working on something. When it’s under Apache, larger companies will take more notice of that,” said Stephen Mallette, vice president of Apache TinkerPop and a software engineer at DataStax.
Graph databases have grown far beyond the province of social networking sites as companies seek to understand the relationships in their data. DataStax’s DSE Graph and IBM Graph are two recent commercial applications of TinkerPop, and another focus will be on further expanding the ecosystem, Mallette said.
“Our focus is our framework and our interfaces and some reference implementations that are production ready. It will be growing into other databases. Apache Flink might be one [possible integration]; there’s another project in the [Apache] incubator based on HBase called S2Graph,” he said.
The graph computing framework relies on Gremlin, a graph traversal machine and language. Gremlin lets users write complex queries, called traversals, that can execute as real-time transactional (OLTP) queries, as batch analytic (OLAP) queries, or as a hybrid of the two.
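To give a rough sense of what a traversal is, the sketch below chains steps over a tiny in-memory graph in plain Python. It is an illustration only, not Gremlin and not TinkerPop's API: the `TinyTraversal` class, the `start` helper, the `booked` edge label and the sample names are all hypothetical. A comparable Gremlin traversal might read roughly like `g.V().has('name','bob').out('booked').in('booked').dedup()`, assuming a graph with a `name` property and `booked` edges.

```python
# Toy in-memory graph as (source, edge label, target) triples.
# All names and the "booked" label are made up for illustration.
EDGES = [
    ("alice", "booked", "dr_lee"),
    ("bob", "booked", "dr_lee"),
    ("bob", "booked", "dr_kim"),
    ("carol", "booked", "dr_kim"),
]


class TinyTraversal:
    """Hypothetical, Gremlin-inspired traversal; not the TinkerPop API."""

    def __init__(self, edges, current):
        self.edges = edges
        self.current = list(current)  # vertices the traversers currently sit on

    def out(self, label):
        # Move each traverser along outgoing edges with the given label.
        nxt = [t for v in self.current
               for (s, l, t) in self.edges if s == v and l == label]
        return TinyTraversal(self.edges, nxt)

    def in_(self, label):
        # Move each traverser along incoming edges with the given label.
        nxt = [s for v in self.current
               for (s, l, t) in self.edges if t == v and l == label]
        return TinyTraversal(self.edges, nxt)

    def dedup(self):
        # Drop repeated vertices; dict.fromkeys keeps first-seen order.
        return TinyTraversal(self.edges, dict.fromkeys(self.current))

    def to_list(self):
        return list(self.current)


def start(vertex):
    # Begin a traversal at a single named vertex.
    return TinyTraversal(EDGES, [vertex])


# Consumers who booked any of bob's doctors (bob himself included).
shared = start("bob").out("booked").in_("booked").dedup().to_list()
print(shared)  # ['alice', 'bob', 'carol']
```

Each step returns a new traversal, which is what allows the calls to be chained. In a real TinkerPop-enabled system, the query text would stay the same while the underlying engine decides how to execute it.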
Reference implementations exist for a number of different data systems, including Neo4j, Apache Giraph, Spark and Hadoop. These implementations include commercial and open source graph databases and processors, Gremlin language variants, visualization applications for graph analysis and other tools and libraries.
“We’re taking Gremlin, which is written in Java, and producing it in a different language like Scala or Clojure or Python. We’re developing those things so people both on and off the JVM can learn something. Up until TinkerPop 3 released general availability, there really weren’t options for people working in Python and other languages outside the JVM,” Mallette said.
Amazon uses TinkerPop and Gremlin to process its order fulfillment graph, which contains approximately 1 trillion edges (relationships), according to the ASF. The company has also integrated its DynamoDB offering with Titan, one of the TinkerPop-enabled graph databases.
“TinkerPop’s strength is to have this unified approach to graphs of any size or complexity,” Mallette said. “It’s a really neat thing to just learn Gremlin and TinkerPop and be able to use that with a small graph on a single machine or use that same exact code and distribute it across a large cluster of machines. That’s a really powerful thing.
“Avoiding vendor lock-in as a side effect of doing that is also really cool. In the morning I can work with Neo4j, in the afternoon I can work with Titan and between I can execute a graph analytic job in Spark, and all of that is just knowledge of the Gremlin traversal language. I don’t necessarily need to learn in depth all these underlying technologies; just learning TinkerPop is enough to get you going.”
TinkerPop originated in 2009 at the Los Alamos National Laboratory and, after two releases, was donated to the Apache Incubator in January 2015.
Though interest is growing, graph databases remain a small segment of the database market. 451 Research has estimated that the segment represents $200 million of the $286 billion sector; Forrester has predicted 25 percent industry uptake by 2017.
The Panama Papers, which involved the leak of 2.6TB of data from Panamanian law firm Mossack Fonseca about hidden offshore accounts, pose one of the more interesting use cases for graph databases.
The International Consortium of Investigative Journalists spent four years poring over more than 11.5 million documents, trying to connect names to accounts by matching emails, legal documents and other unstructured data. Late in the process, it employed the Linkurious data visualization tool and the Neo4j graph database to help bring those connections to light.