Since Gephi is my tool of choice for graph visualization, and Hadoop stores all my data in HDFS or HBase, I created a direct connector between Hadoop and Gephi. Now all node and edge properties are accessible directly, no matter whether they are stored in Hive or HBase tables; via Impala we can query the data in SQL style. Time-dependent graph data in particular can now be partitioned efficiently with Hive, and all the heavy processing is done in Hadoop. Gephi is no longer responsible for the large filter procedures and only has to hold in memory what really has to be shown.
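As a sketch of the partitioning idea (the table layout and column names here are illustrative, not the connector's actual schema), an edge list in Hive might be partitioned by time frame like this:

```sql
-- Hypothetical edge-list table, partitioned by time frame.
CREATE TABLE edges (
  source STRING,
  target STRING,
  weight DOUBLE
)
PARTITIONED BY (time_frame STRING)
STORED AS PARQUET;

-- A query for a single time frame then only touches one partition:
SELECT source, target, weight
FROM edges
WHERE time_frame = '2015-01';
```

Because the `WHERE` clause matches the partition column, Hive or Impala can prune all other partitions, which is what makes the time-dependent queries cheap.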
Using GraphX in the cluster even allows large-scale analysis of graphs that do not fit into the workstation's memory. Once the results are available, Gephi loads the relevant nodes and draws nice plots.
But the data still has to get from Hadoop into Gephi. Creating graph files and transferring them to the workstation used to be a regular chore. I had to repeat these steps manually all the time, and it was important to remember the parameters that had been applied during processing or generation of the graph. This became a critical aspect over time, and manual handling of all the files was not an option in the long term.
So I created the Gephi-Hadoop-Connector, which uses the JDBC interface provided by Impala and Hive to load edge and node lists.
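The connector itself is written in Java and talks to Impala or Hive over JDBC. The same idea can be sketched in Python; the table name, columns, and the use of the `pyhive` package below are my assumptions for illustration, not part of the connector:

```python
# Sketch: fetch one edge list from Hive/Impala for Gephi.
# Table name and columns are hypothetical.

def edge_list_query(table, time_frame=None):
    """Build the SQL that fetches one edge list, optionally for one time frame."""
    query = "SELECT source, target, weight FROM {}".format(table)
    if time_frame is not None:
        query += " WHERE time_frame = '{}'".format(time_frame)
    return query

# Against a running cluster one could execute it, e.g. via pyhive (assumption):
#   from pyhive import hive
#   cursor = hive.Connection(host="hiveserver2-host", port=10000).cursor()
#   cursor.execute(edge_list_query("edges", "2015-01"))
#   edges = cursor.fetchall()

print(edge_list_query("edges", "2015-01"))
# SELECT source, target, weight FROM edges WHERE time_frame = '2015-01'
```

The node list works the same way; only the column list and table differ.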
One really important feature of Gephi is its support for time-dependent analysis and visualization of networks. To build such a timeline, an individual query can be defined for each single time frame of each individual layer. If the data is already partitioned by time, the whole procedure is really efficient.
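Those per-frame, per-layer queries can be generated systematically rather than written by hand. A minimal sketch, where the layer names, table names, and time frames are all made up:

```python
# Sketch: one query per (layer, time frame) pair for a Gephi timeline.
# Layer/table names and time frames are illustrative only.

def timeline_queries(layers, time_frames):
    """Map each (layer, frame) pair to the SQL fetching that slice of the graph."""
    return {
        (layer, frame):
            "SELECT source, target, weight FROM {} WHERE time_frame = '{}'"
            .format(table, frame)
        for layer, table in layers.items()
        for frame in time_frames
    }

layers = {"friendship": "edges_friend", "coauthor": "edges_coauthor"}
frames = ["2015-01", "2015-02"]

queries = timeline_queries(layers, frames)
print(len(queries))  # 4 queries: 2 layers x 2 time frames
```

Keeping this mapping in one place also solves the bookkeeping problem from above: the parameters used to build each slice of the graph are recorded alongside the query itself.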
Finally, we have to think about how to handle all the metadata of such a time-dependent multilayer graph. For this we use the Etosha-Graph-Metastore, which will be released soon.
Please clone this repository or download the zip file.
The Gephi-Hadoop-Connector was built by @kamir.