Application Developer

Social Network Analysis

Use Oracle Big Data Spatial and Graph to analyze social networks.

By Mark Rittman, Oracle ACE Director

September/October 2016

Many organizations are using Oracle Big Data Appliance and tools such as Oracle Data Integrator 12c and Oracle Business Intelligence Enterprise Edition 12c to capture, process, and analyze social media activity involving their brands and products. By aggregating and then analyzing mentions of their brands across different networks and territories, they can correlate those mentions with page hits on their websites and with ecommerce sales. Tools such as Oracle Big Data Discovery can then analyze the sentiment of those mentions and enrich them with reference data to add more context.

In addition to counting and quantifying mentions of a brand on social media, organizations can analyze the relationships in the networks themselves. For example, in most personal networks of friends or colleagues, there are a few people who seem to know everyone, have the respect of their group, and are often the main source of information and news. Similarly, within social networks, there is usually a smaller number of users central to each network who are quoted, followed, or connected to more than others and whose opinions influence customers and prospects.

Oracle Big Data Spatial and Graph provides a distributed graph database on Apache Hadoop (HBase) and Oracle NoSQL Database and a powerful in-memory analyst with more than 35 built-in, parallelized analytics. Using a property graph model, Oracle Big Data Spatial and Graph enables you to capture the relationships between participants in your social networks and then analyze them to find influencers, communities, anomalies, recommendations, and other patterns of interest. In this example, I use Oracle Big Data Spatial and Graph to analyze Twitter activity within the Oracle Business Analytics community over a period of time in 2015.

To follow along with the steps in this article, download four components: a VM that includes Oracle Big Data Spatial and Graph, sample data, an open source tool, and Oracle support for that tool. You will use Oracle Big Data Lite VM VirtualBox 4.4.0; this virtual machine comes with Oracle Big Data Spatial and Graph 1.1.2 preinstalled. You can download it from Oracle Technology Network. You will use data from two sample files; they are available for download in a single zip file.

To visualize the graph data, you will also use an open source tool called Cytoscape that you can download separately from the Cytoscape Foundation website. You will use an Oracle-developed Cytoscape app (plugin) to enable Oracle Big Data Spatial and Graph support for Cytoscape. It is available to Oracle Big Data Spatial and Graph customers from their My Oracle Support account and can be downloaded for evaluation from Oracle Technology Network.

Creating and Analyzing the Twitter Activity Property Graph

In this article’s example, you will create a property graph with Oracle Big Data Spatial and Graph and use it to identify the most-influential Twitter users in the Oracle Business Analytics community, judging from their network centrality. Network centrality is a concept in network analysis that identifies which nodes (or vertices) are the most important, based on the number and strength of connections (or edges) they have to other vertices in the network. You will also explore what clusters of users exist within that group by discovering common connections between them.

The process includes three main steps:

  1. Load into either Apache HBase or Oracle NoSQL Database a list of nodes (vertices), the set of relationships (edges) that connect them, and properties to create the property graph data model.
  2. Run the Oracle Big Data Spatial and Graph in-memory analyst algorithms on that graph data model to calculate network centrality scores, identify clusters of nodes sharing common connections, and calculate the shortest distance between nodes.
  3. Visualize the property graph with Cytoscape to explore top-ranked nodes for potential influencers and clusters for communities within the social network.

Social networks such as Twitter are particularly good sources of relationship data for graph analysis, because they give users a means of replying to one another’s status updates, quoting or “retweeting” each other to imply endorsement of an opinion, and flagging membership in informal communities through hashtags or mentions of other users’ IDs in their own status updates. Figure 1 shows the type of connections in the data set, defined by replies, retweets, and user mentions in the text of the tweets captured for analysis.


Figure 1: Relationships in Twitter modeled as a graph

You will use the number and quality of these relationships in calculating network centrality scores to ascertain the most influential users in the group. You’ll also identify clusters of users who communicate among themselves predominantly and may constitute unique communities. Finally, you’ll use the weight scores for connection types to calculate the shortest-distance paths between nodes in the network. Lower weighting denotes a stronger connection or relationship—you can think of this weight as less friction associated with a connection. Here are the connection types and their weight scores:

  • Retweets, which imply endorsement of a user, have a weight of 10.
  • Replies, which are less indicative of influence, have a weight of 25.
  • Mentions, which indicate only a common community, have a weight of 50.
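
To make the role of these weights concrete, here is a minimal, standalone Python sketch. It is not part of Oracle Big Data Spatial and Graph and does not use its API; it simply runs Dijkstra's algorithm over a few made-up edges weighted as in the list above. The vertex names and the edges themselves are hypothetical, and the edges are treated as directed.

```python
import heapq

# Weight per connection type, as listed above (lower = stronger connection).
WEIGHTS = {"retweets": 10.0, "replies": 25.0, "mentions": 50.0}

# Hypothetical edges for illustration only: (source, target, connection type).
EDGES = [
    ("a", "b", "retweets"),
    ("b", "c", "retweets"),
    ("a", "c", "mentions"),
    ("c", "d", "replies"),
]


def shortest_path(edges, source, target):
    """Return (total_weight, path) for the lowest-weight directed path.

    Returns (inf, []) when the target cannot be reached from the source.
    """
    adjacency = {}
    for src, dst, kind in edges:
        adjacency.setdefault(src, []).append((dst, WEIGHTS[kind]))

    best = {source: 0.0}   # lowest known cost to reach each vertex
    previous = {}          # predecessor of each vertex on its best path
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, weight in adjacency.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))

    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


cost, path = shortest_path(EDGES, "a", "d")
print(cost, path)  # 45.0 ['a', 'b', 'c', 'd']
```

In this toy graph, two retweets in a row (10 + 10) make a shorter route from a to c than a single mention (50), which is the "less friction" idea described above.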

The two sample files, oraclemag_sna.opv and oraclemag_sna.ope, contain the list of vertices (Twitter users) and edges (tweets) that make up this article’s sample social network. The oraclemag_sna.opv vertices file contains every user who appears at either the start or the end of a relationship. Each user has two lines, both beginning with the vertex ID: the first defines the name, and the second records the number of followers. Here’s an example:

1,name,1,markrittman,,
1,followers,2,,4527,
2,name,1,gwenshap,,
2,followers,2,,5397,
3,name,1,rmoff,,
3,followers,2,,1204,
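
As a rough illustration of how these lines fit together, the following standalone Python sketch groups the sample lines above into one record per vertex. It does not use any Oracle API. The field interpretation (vertex ID, property name, type code, string value, numeric value) is inferred from the sample, and the function name parse_vertices is hypothetical.

```python
import csv
import io

# The six sample lines shown above, inlined so the sketch is self-contained.
SAMPLE_OPV = """\
1,name,1,markrittman,,
1,followers,2,,4527,
2,name,1,gwenshap,,
2,followers,2,,5397,
3,name,1,rmoff,,
3,followers,2,,1204,
"""


def parse_vertices(text):
    """Group flat-file lines into {vertex_id: {property_name: value}}."""
    vertices = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        vertex_id, key, _type_code, string_value, numeric_value = row[:5]
        # Assumption from the sample: a property fills either the string
        # field or the (integer) numeric field, never both.
        value = string_value if string_value else int(numeric_value)
        vertices.setdefault(int(vertex_id), {})[key] = value
    return vertices


vertices = parse_vertices(SAMPLE_OPV)
print(vertices[1])  # {'name': 'markrittman', 'followers': 4527}
```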

The oraclemag_sna.ope edges file contains the edges within the property graph. Each line gives an edge ID, the start and end vertices, the edge type, and the edge weight. Here’s an example:

1165,37,6,mentions,weight,3,,50.0,
1166,37,93,mentions,weight,3,,50.0,
1167,2,1,retweets,weight,3,,10.0,
1168,3,1,retweets,weight,3,,10.0,
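
The edge lines can be read in the same spirit. This standalone Python sketch (again independent of any Oracle API) turns the four sample lines above into simple records. The field order is inferred from the sample, and the names Edge and parse_edges are hypothetical.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type for one edge of the sample network.
Edge = namedtuple("Edge", "edge_id source target label weight")

# The four sample lines shown above, inlined so the sketch is self-contained.
SAMPLE_OPE = """\
1165,37,6,mentions,weight,3,,50.0,
1166,37,93,mentions,weight,3,,50.0,
1167,2,1,retweets,weight,3,,10.0,
1168,3,1,retweets,weight,3,,10.0,
"""


def parse_edges(text):
    """Return a list of Edge records, one per 'weight' line."""
    edges = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        (edge_id, source, target, label,
         key, _type_code, _string_value, numeric_value) = row[:8]
        if key != "weight":
            continue  # this sketch only reads the weight property
        edges.append(Edge(int(edge_id), int(source), int(target),
                          label, float(numeric_value)))
    return edges


edges = parse_edges(SAMPLE_OPE)
for edge in edges:
    print(edge)
```

Read this way, the third sample line says that vertex 2 (gwenshap in the vertices example) retweeted vertex 1 (markrittman), with a weight of 10.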

To load these files into an Oracle Big Data Spatial and Graph property graph in Apache HBase, first ensure that the correct Oracle Big Data Lite VM services are running, by double-clicking on the Start/Stop Services icon on the desktop and confirming that the following services are running:

  • Zookeeper
  • Hadoop Distributed File System (HDFS)
  • HBase
  • Solr

Download this article’s sample data zip file, and extract and copy the oraclemag_sna.opv and oraclemag_sna.ope files to the virtual machine. Use the Terminal (Applications -> System Tools -> Terminal) command-line prompt to move the files to the correct location and start the Gremlin shell to prepare the property graph. (Note that your download location may differ; alter the first command as appropriate for your download location.)

sudo mv Downloads/oraclemag_sna.* /u01/oracle-spatial-graph/property_graph/data
cd /u01/oracle-spatial-graph/property_graph/dal/groovy
unset CLASSPATH
./gremlin-opg-hbase.sh

Now enter the following commands in the Gremlin shell to load and prepare the graph model:

cfg = GraphConfigBuilder.forPropertyGraphHbase()            \
.setName("connectionHBase")                               \
.setZkQuorum("bigdatalite").setZkClientPort(2181)          \
.setZkSessionTimeout(120000).setInitialEdgeNumRegions(3)   \
.setInitialVertexNumRegions(3).setSplitsPerRegion(1)       \
.addEdgeProperty("weight", PropertyType.DOUBLE, "1000000") \
.build();
opg = OraclePropertyGraph.getInstance(cfg);
opg.clearRepository();
vfile="../../data/oraclemag_sna.opv"
efile="../../data/oraclemag_sna.ope"
opgdl=OraclePropertyGraphDataLoader.getInstance();
opgdl.loadData(opg, vfile, efile, 2);
// read through the vertices
opg.getVertices();
// read through the edges
opg.getEdges();

So Who Are the Most Influential Users in This Twitter Group?

You can now use this property graph model to find the five most influential users in this network, using the PageRank function. Continuing in the Gremlin command-line shell (which should still be open in Terminal), enter the following commands:

vOutput="/tmp/mygraph.opv"
eOutput="/tmp/mygraph.ope"
OraclePropertyGraphUtils.exportFlatFiles(opg, vOutput, eOutput, 2, false);
session = Pgx.createSession("session-id-1");
analyst = session.createAnalyst();
graph = session.readGraphWithProperties(opg.getConfig());
rank = analyst.pagerank(graph, 0.001, 0.85, 100);
rank.getTopKValues(5);

Note that only the IDs of the vertices are displayed in the output:

==>PgxVertex with ID 1=0.13885623487462861
==>PgxVertex with ID 3=0.08686102641801993
==>PgxVertex with ID 101=0.06757752513733056
==>PgxVertex with ID 6=0.06743774001139484
==>PgxVertex with ID 37=0.0481517609757462

To display the actual names of these Twitter users, enter the following, again in the Gremlin shell, to retrieve the name properties for each of these vertices:

v1=opg.getVertex(1L); v2=opg.getVertex(3L); v3=opg.getVertex(101L);   \
v4=opg.getVertex(6L); v5=opg.getVertex(37L);
System.out.println("Top 5 influencers: \n " + v1.getProperty("name") + \
                     "\n " + v2.getProperty("name") + \
                     "\n " + v3.getProperty("name") + \
                     "\n " + v4.getProperty("name") + \
                     "\n " + v5.getProperty("name"));

You should then see the following output as the top five users by influence:

Top 5 influencers: 
 markrittman
 rmoff
 rittmanmead
 mRainey
 JeromeFr
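
For intuition about what a PageRank score represents, here is a minimal, standalone power-iteration sketch in Python. It is not the in-memory analyst's implementation. The defaults mirror the three numeric arguments passed to pagerank earlier (0.001, 0.85, 100), which I take to be the error tolerance, the damping factor, and the maximum number of iterations; that reading, the convergence test, and the handling of vertices with no outgoing edges are all assumptions. The edge list reuses the four sample edges from the edges file, so the scores will not match the output above.

```python
def pagerank(edges, tolerance=0.001, damping=0.85, max_iterations=100):
    """Return {vertex: score} for a directed edge list of (source, target)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(max_iterations):
        # Rank held by vertices without outgoing edges is spread evenly
        # over all vertices (an assumption of this sketch).
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        # Stop when the total change across all vertices is small enough.
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tolerance:
            break
    return rank


def top_k(rank, k):
    """Return the k highest-scoring (vertex, score) pairs."""
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)[:k]


# The four sample edges from the edges file: (start vertex, end vertex).
EDGES = [(37, 6), (37, 93), (2, 1), (3, 1)]
ranks = pagerank(EDGES)
for vertex, score in top_k(ranks, 3):
    print(vertex, round(score, 4))
```

Vertex 1 comes out on top here because two other vertices point at it and pass on all of their rank, which is the same intuition behind markrittman leading the real result.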

Although this ranking is interesting, it would be even more useful to display the results graphically. It would be more useful still to use graph visualization to extend the insights resulting from the in-memory graph analysis.

Visualizing and Analyzing This Social Network with Cytoscape

To visualize the graph model, install and configure the Cytoscape graph data visualization tool and the Oracle Big Data Spatial and Graph support for Cytoscape. Then open Cytoscape and connect to the property graph you created in Apache HBase. To start Cytoscape from a new Terminal session, select File -> Open Tab from the Terminal menu bar and enter

cd /u02/Cytoscape
./cytoscape.sh

where /u02/Cytoscape is the Cytoscape installation location. Adjust the path to match your own installation.

The Cytoscape graphical user interface appears. For this example, you’ll create a connection to a property graph in Apache HBase. (If you are using Oracle NoSQL Database, refer to the Oracle Big Data Spatial and Graph support for Cytoscape documentation.) To create the connection,

  1. Select File -> Load -> Property graph -> Connect to Apache HBase Database from the Cytoscape application menu.
  2. In the Load from Apache HBase Database dialog box, enter the following to define the connection:
    Quorum : bigdatalite.localdomain
    Port : 2181 

    and click the magnifying glass icon to retrieve the list of HBase connections from Zookeeper, the registry you configured when setting up the graph model earlier.
  3. Select connectionHBase from the list of property graphs, but do not click any other buttons yet.

Oracle Big Data Spatial and Graph support for Cytoscape connects to an existing property graph stored in HBase or Oracle NoSQL Database and enables you to execute Oracle’s in-memory analyst’s built-in analytics, including the PageRank function. It also includes community detection algorithms.

For example, the following exercise demonstrates how to use Oracle Big Data Spatial and Graph within Cytoscape to calculate the score of the top five Twitter users by influence:

  1. In the Load from Apache HBase Database dialog box, confirm that connectionHBase is displayed as a menu item in the Graphs area.
  2. In the In-memory Analytics Starting Point section of the dialog box, as shown in Figure 2:
    • Choose page rank.
    • Enter 5 in the Top-ranked: text box.
    • Click the Load graph button next to the Top-ranked: text box. Cytoscape then displays the five top-ranked nodes, together with a list of details.

    Figure 2: Calculating the page rank
  3. To show the names associated with these top-ranked nodes, right-click anywhere in the Property Graph panel and select Apps -> Show label value -> name, where name is the attribute key from the graph database.
  4. To show the inbound connections for one of these influential users, such as mRainey (Michael Rainey, an Oracle ACE), right-click that node and select Apps -> Expand Node so that all inbound connections (tweets mentioning, retweeting, and replying to that user) are displayed in the graph visualization, as shown in Figure 3.


Figure 3: Visualizing the connections around a node

Detecting and thematically coloring the different communities in this property graph can be helpful to, for example, identify distinct groups with which you might communicate, using different strategies. To do this with Cytoscape, do the following:

 
  1. In the Property Graph pane, click the background of the panel and press Ctrl-A to select all nodes. Then right-click one of the selected nodes and select Apps -> Expand 2-hop selected nodes. After the nodes are expanded, click the Apply Preferred Layout button to display all the nodes in the pane.
  2. To detect the clusters or communities within this superset of all nodes and edges, select Apps -> Property graph -> Detect communities -> Sparsify/WCC and click OK. Figure 4 displays the graphical output.


Figure 4: Finding interaction clusters that may indicate one or more communities

The clusters that Oracle Big Data Spatial and Graph detected were based on actual connections among the nodes, not on attributes such as hashtags or other self-declared community labels. Therefore, this analysis can be particularly useful for detecting groupings of users who may not even be aware that a community exists. Cluster analysis can help you organize and communicate with emergent communities and find patterns that become apparent only when the data is viewed as part of an overall relationship graph.
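
To illustrate clustering that relies on connections alone, here is a minimal, standalone Python sketch. It assumes that the WCC in the Sparsify/WCC menu option stands for weakly connected components, and it covers only that part, not the sparsification step or the actual Cytoscape plugin behavior. It uses a simple union-find structure over the four sample edges from the edges file, and the function name is hypothetical.

```python
def weakly_connected_components(edges):
    """Group vertices that are linked by edges, ignoring edge direction."""
    parent = {}

    def find(node):
        # Return the representative of the node's group, shortening the
        # chain of parents along the way (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_a, root_b = find(source), find(target)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted((sorted(group) for group in groups.values()),
                  key=lambda group: group[0])


# The four sample edges from the edges file: (start vertex, end vertex).
EDGES = [(37, 6), (37, 93), (2, 1), (3, 1)]
print(weakly_connected_components(EDGES))  # [[1, 2, 3], [6, 37, 93]]
```

No hashtags or labels are involved: the two groups emerge purely from who interacted with whom.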


Summary

Graph analysis is a useful complement to more-traditional types of tabular big data analysis. In this article, you performed social network analysis on Twitter data, using Oracle Big Data Spatial and Graph to identify the influencers and communities in the Oracle Business Analytics community as well as where their interests lie. You also analyzed the data to determine with whom to communicate about future community events. Finally, you used Cytoscape with the Oracle Big Data Spatial and Graph in-memory analyst for community detection. This final example divided the community into subgroups of interest, based on actual relationships and communication patterns.

Next Steps

 READ more about Oracle Big Data Spatial and Graph.

 READ more by Mark Rittman.

 DOWNLOAD the sample data for this article.

 

Photography by Alvaro Pinot, Unsplash