Sunday, June 18, 2017

Trip Report: Graph Day SF 2017


Yesterday I attended the Graph Day SF 2017 conference. Lately, my interest in graphs has been around Knowledge Graphs. Last year, I worked on a project that used an existing knowledge graph and entity-pair versus relation co-occurrences across a large body of text to predict new relations from the text. Although we modeled the co-occurrence as a matrix instead of a graph, I was hoping to learn techniques that I could apply to graphs. One other area of recent interest is learning how to handle large graphs.

So anyway, that was why I went. In this post, I describe the talks I attended. The conference ran for a single day with 4 parallel tracks of very deep, awesome talks, and there were at least 2 that I would have liked to attend but couldn't because I had to make a choice.

Keynote - Marko Rodriguez, DataStax

I have always thought of graph people as being somewhat more intellectual than mere programmers, starting with the classes I took at college. The keynote kind of confirms this characterization. The object of the talk was to refute the common assertion by graph people that everything is a graph. The speaker does this by showing that a graph can be thought of structurally, as a collection of vertices and edges, and also as a process, as a collection of functions and streams. Differentiating a graph repeatedly oscillates between the two representations, leading to the conclusion that a graph is infinitely differentiable. Here is the paper on which the talk is based, and here are the slides.

Time for a new Relation: going from RDBMS to graph - Patrick McFadin, DataStax

This talk was decidedly less highbrow compared to the keynote, focusing on why one might want to move from the relational to the graph paradigm. The speaker has lots of experience with RDBMS and tabular NoSQL databases (Cassandra), and is currently making the shift to graph databases. One key insight is that he places the different types of database technology on a continuum - key-value stores, tabular NoSQL databases, document NoSQL databases, RDBMS, and graph databases. He also differentiates between the strengths of an RDBMS and those of a graph database as follows - the RDBMS makes it easy to describe relations, but the graph database makes it easy to find relations. He looks at Property Graphs as possible drop-in replacements for RDBMS tables, and points out a free learning resource, DS330: DataStax Enterprise Graph, which seems likely to be product specific, although the introductory video suggests that there is some product agnostic content around data modeling.
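
To make the "describe vs find" distinction concrete, here is a minimal sketch of my own (not from the talk): on a toy property graph, finding friend-of-a-friend relations is a short traversal, where the SQL equivalent would need a self-join per hop. All vertex names and edge labels below are invented for illustration.

    from collections import deque

    # Toy property graph: vertex -> list of (edge_label, neighbor)
    graph = {
        "alice":   [("knows", "bob")],
        "bob":     [("knows", "carol"), ("works_at", "acme")],
        "carol":   [("works_at", "initech")],
        "acme":    [],
        "initech": [],
    }

    def find_related(start, edge_label, max_hops=3):
        """Breadth-first traversal following only edges with the given label."""
        seen, frontier, found = {start}, deque([(start, 0)]), []
        while frontier:
            vertex, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for label, neighbor in graph.get(vertex, []):
                if label == edge_label and neighbor not in seen:
                    seen.add(neighbor)
                    found.append(neighbor)
                    frontier.append((neighbor, hops + 1))
        return found

    print(find_related("alice", "knows"))  # ['bob', 'carol']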

Comparing Giraph and GraphX - Jenny Zhao, Drawbridge

Drawbridge's business is to disambiguate your devices from other people's, using their activity logs. In this particular presentation, they describe how they switched from using map-reduce for their feature selection process to using Apache Giraph, saving about 8 hours of processing time. Instead of writing out the pair data, then doing a pairwise compare followed by a pairwise join, they ingest the paired data as a graph and compute distances on the edges to find the best pairs for their downstream process. They also tried Spark GraphX, but found that it doesn't scale as well to large data volumes. Code using GraphX and Giraph was also shown to highlight an important difference between the two.
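
As a rough sketch of this idea (my own toy version, not Drawbridge's code): once candidate device pairs are edges in a graph, each edge is scored exactly once and each device simply keeps its best-scoring neighbor, instead of materializing a full pairwise join. The devices and distance values below are made up.

    candidate_pairs = [          # (device_a, device_b, feature_distance)
        ("phone_1", "laptop_7", 0.12),
        ("phone_1", "tablet_3", 0.40),
        ("phone_2", "laptop_7", 0.55),
    ]

    # Keep the closest neighbor seen so far for each device (lower distance is better)
    best_match = {}
    for a, b, dist in candidate_pairs:
        for src, dst in ((a, b), (b, a)):
            if src not in best_match or dist < best_match[src][1]:
                best_match[src] = (dst, dist)

    print(best_match["phone_1"])  # ('laptop_7', 0.12)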

Graphs in Genomics - Jason Chin, Pacific Biosciences

Interesting presentation about the use of graphs in the field of genomics. The human genome is currently not readable in its entirety, so it is cut into many pieces of random length and sequenced piece by piece. One possibility is to represent it as 23 bipartite graphs, one for each of our 23 chromosomes. The presentation then focuses on how researchers use graph theory to fill in gaps between the pieces of the genome. Here is a link to an older presentation by the same presenter which covers much of the same material as this talk; I will update with the current presentation when it becomes available.
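
As a toy illustration of the underlying idea (my own sketch, far simpler than a real assembler): the sequenced reads become vertices, and an edge connects two reads whenever the suffix of one matches the prefix of another, giving an overlap graph that can then be walked to stitch the pieces back together. The reads below are made up.

    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]  # made-up reads

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that is also a prefix of b."""
        for length in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:length]):
                return length
        return 0

    # Build the overlap graph as a list of weighted, directed edges
    edges = [(a, b, overlap(a, b)) for a in reads for b in reads
             if a != b and overlap(a, b) > 0]
    for a, b, length in sorted(edges, key=lambda e: -e[2]):
        print(f"{a} -> {b} (overlap {length})")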

Knowledge Graph in Watson Discovery - Anshu Jain and Nidhi Rajshree, IBM

The talk focuses on lessons learned while the presenters were building the knowledge graph for IBM Watson. I thought this was a good mix of practical ideas and theory. A few things I found particularly noteworthy: one was including surprise as a parameter - the user can specify a parameter that indicates their willingness to see serendipitous results. Another was keeping the Knowledge Graph lighter and using it to fine-tune queries at runtime (local context) rather than baking it in at creation time (global context); thus you are using the Knowledge Graph itself as a context vector. Yet another idea was using Mutual Information as a similarity metric for the element of surprise (useful in intelligence and legal work), since it treats noise equally for both documents. Here is the link to the presentation slides.
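
For the Mutual Information idea, here is a back-of-the-envelope sketch (my own, with made-up data) of the general flavor of such a metric: pointwise mutual information scores a pair of entities by how much more often they co-occur in documents than their individual frequencies would predict, so frequent "noisy" terms do not dominate the ranking.

    import math
    from collections import Counter
    from itertools import combinations

    docs = [                                  # invented entity mentions per document
        {"acme", "fraud", "offshore"},
        {"acme", "merger"},
        {"fraud", "offshore"},
        {"acme", "fraud"},
    ]

    n = len(docs)
    single = Counter(e for d in docs for e in d)
    pair = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

    def pmi(a, b):
        """log p(a,b) / (p(a) * p(b)); positive means more co-occurrence than chance."""
        p_ab = pair[frozenset((a, b))] / n
        return math.log(p_ab / ((single[a] / n) * (single[b] / n))) if p_ab else float("-inf")

    print(round(pmi("fraud", "offshore"), 3))  # positive: a genuinely associated pair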

A Cognitive Knowledge Base as an Enterprise Database - Haikal Pribadi, GRAKN.AI

The presenter showcases his product GRAKN.AI (sounds like Kraken), which is a distributed knowledge base with a reasoning query language. It was awarded product of the year for 2017 by the University of Cambridge Computer Lab. It has a unified syntax that allows you to define and populate a graph and then query it. The query language feels a bit like Prolog, but much more readable. It is open source and free to use. I was quite impressed with this product and hope to try it soon. One other thing I noted in his presentation was the use of the DeepDive project for knowledge acquisition, which is a nice confirmation since I am looking at its sister project Snorkel for a similar use case.

Graph Based Taxonomy Generation - Rob McDaniel, LiveStories

The presenter describes building taxonomies from queries. The resulting taxonomies are focused on a small area of knowledge, and can be useful for building custom taxonomies for applications focused on a specific domain. Examples mentioned in the presentation, produced as a result of using his approach, were “health care costs” and “poisoning deaths”. The idea is to take a group of (manually created) seed queries about a given subject, hit some given search engine using an API, and collect the top N documents for each query. You then do topic modeling on these documents and generate a document-topic co-occurrence graph (using only topics that have p(topic|document) above a certain threshold). You then partition the graph into subgraphs using an iterative partitioning strategy of coarsening, bisecting and un-coarsening; the graph partitioning algorithm covered in the presentation was Heavy Edge Matching, but other partitioning algorithms could be used as well. Once the partitions are stable, the node with the highest degree of connectedness in each partition becomes the root level element in the taxonomy. This node is then removed from the subgraph, and the subgraph is recursively partitioned again into its own subgraphs, until the number of topics in a partition falls below some threshold. The presentation slides and code are available.
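
Below is a simplified sketch of that recursive scheme. Note my assumptions: I swap in networkx's Kernighan-Lin bisection in place of the Heavy Edge Matching partitioner covered in the talk, skip the coarsening/un-coarsening steps entirely, and the toy topic co-occurrence graph is invented for illustration.

    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    def build_taxonomy(graph, min_size=3):
        """Pick the best-connected topic as the root, then recursively split the rest."""
        if graph.number_of_nodes() == 0:
            return []
        if graph.number_of_nodes() <= min_size:
            return sorted(graph.nodes())
        root = max(graph.degree, key=lambda nd: nd[1])[0]      # highest-degree node
        rest = graph.subgraph(n for n in graph if n != root).copy()
        children = [build_taxonomy(rest.subgraph(part).copy(), min_size)
                    for part in kernighan_lin_bisection(rest)]
        return [root, children]

    g = nx.Graph()
    g.add_weighted_edges_from([
        ("costs", "insurance", 3), ("costs", "hospital", 2), ("insurance", "premium", 2),
        ("hospital", "er_visits", 1), ("premium", "deductible", 1), ("costs", "premium", 1),
    ])
    print(build_taxonomy(g))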

Project Konigsburg: A Graph AI - Gunnar Kleemann and Denis Vrdoljak, Berkeley Data Science Group

The presenters describe a similarity metric based on counting triangles and wedges (subgraph motifs) that seems to work better with connected elements in a graph than more traditional metrics. They use this approach to rank features for feature selection. They have used this metric to rank academics from a citation network extracted from Pubmed, and in several applications that focus on recruiting from the applicant side (resume building, finding the job that best suits your profile, etc).
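
To make the triangle/wedge idea concrete, here is a back-of-the-envelope sketch (my guess at the general flavor, not the presenters' actual definition): count the triangles and wedges (open two-paths) touching each node and rank nodes by how often their wedges close into triangles. The ratio computed here is just the familiar local clustering coefficient; their metric presumably goes further.

    import networkx as nx

    g = nx.karate_club_graph()                  # stand-in for a real citation network
    triangles = nx.triangles(g)                 # number of triangles through each node
    wedges = {n: d * (d - 1) // 2 for n, d in g.degree()}   # two-paths centred at n

    closure = {n: triangles[n] / wedges[n] if wedges[n] else 0.0 for n in g}
    top = sorted(closure, key=closure.get, reverse=True)[:5]
    print([(n, round(closure[n], 2)) for n in top])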

Knowledge Graph Platform: Going beyond the database - Michael Grove, Stardog

This was a slightly higher-level talk by the CTO of Stardog. He outlined what people generally mean when they say Enterprise Knowledge Graph Platform, and the common fallacies in these definitions.

Two presentations I missed because, with 4 tracks running simultaneously, I had to choose between awesome talks in the same time slot:

  • DGraph: A native, distributed graph database - Manish Jain, Dgraph Labs.
  • Start Flying with Apache and Tinkerpop - Jason Plurad, IBM

Overall, I thought the conference had really good talks, the venue was excellent, and the event was very well organized. There was no breakfast or snacks, but there was coffee and tea, and the lunch was delicious. One thing I noticed was the absence of video recording, so unfortunately there are not going to be any videos of these talks. There were quite a few booths, mostly from graph database vendors. I learned quite a few things here, although I might have learned more if the conference had been spread over 2 days with 2 parallel tracks instead of 4.