September/October 2016
Many organizations are using Oracle Big Data Appliance and tools such as Oracle Data Integrator 12c and Oracle Business Intelligence Enterprise Edition 12c to capture, process, and analyze social media activity involving their brands and products. By aggregating and then analyzing mentions of their brands across different networks and territories, they can correlate those mentions with page hits on their websites and ecommerce sales, and tools such as Oracle Big Data Discovery can analyze the sentiments of those mentions and enrich them with reference data to add more context.
In addition to counting and quantifying mentions of a brand on social media, organizations can analyze the relationships in the networks themselves. For example, in most personal networks of friends or colleagues, there are a few people who seem to know everyone, have the respect of their group, and are often the main source of information and news. Similarly, within social networks, there is usually a smaller number of users central to each network who are quoted, followed, or connected to more than others and whose opinions influence customers and prospects.
Oracle Big Data Spatial and Graph provides a distributed graph database on Apache Hadoop (HBase) and Oracle NoSQL Database and a powerful in-memory analyst with more than 35 built-in, parallelized analytics. Using a property graph model, Oracle Big Data Spatial and Graph enables you to capture and then analyze the relationships between participants in your social networks, including influencers, communities, anomalies, recommendations, and other patterns of interest. In this example, I use Oracle Big Data Spatial and Graph to analyze Twitter activity within the Oracle Business Analytics community over a period of time in 2015.
To follow along with the steps in this article, download four components: a VM that includes Oracle Big Data Spatial and Graph, sample data, an open source tool, and Oracle support for that tool. You will use Oracle Big Data Lite VM VirtualBox 4.4.0; this virtual machine comes with Oracle Big Data Spatial and Graph 1.1.2 preinstalled. You can download it from Oracle Technology Network. You will use data from two sample files; they are available for download in a single zip file.
To visualize the graph data, you will also use an open source tool called Cytoscape that you can download separately from the Cytoscape Foundation website. You will use an Oracle-developed Cytoscape app (plugin) to enable Oracle Big Data Spatial and Graph support for Cytoscape. It is available to Oracle Big Data Spatial and Graph customers from their My Oracle Support account and can be downloaded for evaluation from Oracle Technology Network.
Creating and Analyzing the Twitter Activity Property Graph
In this article’s example, you will create a property graph with Oracle Big Data Spatial and Graph and use it to identify the most-influential Twitter users in the Oracle Business Analytics community, judging from their network centrality, a concept in network analysis that identifies which nodes (or vertices) are the most important, based on the number and strength of connections (or edges) they have to other vertices in the network. You will also explore what clusters of users exist within that group by discovering common connections between them.
The process includes three main steps:
Social networks such as Twitter are particularly good sources of relationship data for graph analysis, because they give users a means of replying to one another’s status updates, quoting or “retweeting” each other to imply endorsement of an opinion, and flagging membership in informal communities through hashtags or mentions of other users’ IDs in their own status updates. Figure 1 shows the type of connections in the data set, defined by replies, retweets, and user mentions in the text of the tweets captured for analysis.
Figure 1: Relationships in Twitter modeled as a graph
You will use the number and quality of these relationships in calculating network centrality scores to ascertain the most influential users in the group. You’ll also identify clusters of users who communicate among themselves predominantly and may constitute unique communities. Finally, you’ll use the weight scores for connection types to calculate the shortest-distance paths between nodes in the network. Lower weighting denotes a stronger connection or relationship—you can think of this weight as less friction associated with a connection. Here are the connection types and their weight scores:
The two sample files, oraclemag_sna.opv and oraclemag_sna.ope, contain the list of vertices (Twitter users) and edges (tweets) that make up this article’s sample social network. The oraclemag_sna.opv vertices file contains a superset of all users, either at the start or the end of a relationship. The first line for each user defines the name, and the second line records the number of followers. Here’s an example:
1,name,1,markrittman,,
1,followers,2,,4527,
2,name,1,gwenshap,,
2,followers,2,,5397,
3,name,1,rmoff,,
3,followers,2,,1204,
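As a rough illustration of how these rows fit together, the following Python sketch groups the sample rows by vertex ID into one property dictionary per user. It is not part of Oracle's tooling. The column reading (ID, property name, type code, string value, numeric value) is inferred from the sample rows above, and the function name `parse_vertices` is hypothetical.

```python
import csv
from collections import defaultdict
from io import StringIO

# The six sample rows shown in the article.
SAMPLE_OPV = """\
1,name,1,markrittman,,
1,followers,2,,4527,
2,name,1,gwenshap,,
2,followers,2,,5397,
3,name,1,rmoff,,
3,followers,2,,1204,
"""

def parse_vertices(text):
    """Group .opv rows into {vertex_id: {property_name: value}}."""
    vertices = defaultdict(dict)
    for row in csv.reader(StringIO(text)):
        if not row:
            continue
        vertex_id, key, _type_code, str_value, num_value = row[:5]
        # Assumption: each row fills either the string column or the
        # numeric column, and numeric vertex values are whole numbers.
        value = str_value if str_value else int(num_value)
        vertices[int(vertex_id)][key] = value
    return dict(vertices)

vertices = parse_vertices(SAMPLE_OPV)
print(vertices[1])
```

Each user ends up as a single record, which is how the name and followers properties are exposed once the graph is loaded.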
The oraclemag_sna.ope edges file contains the edges within the property graph, defined as start and end vertices, edge type, and edge weight. Here’s an example:
1165,37,6,mentions,weight,3,,50.0,
1166,37,93,mentions,weight,3,,50.0,
1167,2,1,retweets,weight,3,,10.0,
1168,3,1,retweets,weight,3,,10.0,
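To make the role of the edge weights more concrete, here is a minimal Python sketch that reads the four sample edge rows and computes a lowest-cost path with Dijkstra's algorithm. It is an illustration only, not the algorithm used by the in-memory analyst. The column positions are inferred from the sample rows, edges are treated as undirected (an assumption), and the function names are hypothetical.

```python
import heapq
from collections import defaultdict

# The four sample rows shown in the article.
SAMPLE_OPE = """\
1165,37,6,mentions,weight,3,,50.0,
1166,37,93,mentions,weight,3,,50.0,
1167,2,1,retweets,weight,3,,10.0,
1168,3,1,retweets,weight,3,,10.0,
"""

def parse_edges(text):
    """Return (edge_id, source, target, label, weight) tuples from .ope rows."""
    edges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(",")
        weight = float(fields[7])  # numeric value column of the weight property
        edges.append((int(fields[0]), int(fields[1]), int(fields[2]),
                      fields[3], weight))
    return edges

def shortest_path_cost(edges, start, goal):
    """Dijkstra's algorithm; returns None when no path exists."""
    adjacency = defaultdict(list)
    for _edge_id, source, target, _label, weight in edges:
        # Assumption: a connection can be traversed in both directions.
        adjacency[source].append((target, weight))
        adjacency[target].append((source, weight))
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in adjacency[node]:
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return None

edges = parse_edges(SAMPLE_OPE)
print(shortest_path_cost(edges, 2, 3))
```

In this sample, users 2 and 3 are joined through user 1 by two retweet edges, so the path between them costs less than a single mention edge. This matches the idea that a lower weight means less friction.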
To load these files into an Oracle Big Data Spatial and Graph property graph in Apache HBase, first ensure that the correct Oracle Big Data Lite VM services are running, by double-clicking on the Start/Stop Services icon on the desktop and confirming that the following services are running:
Download this article’s sample data zip file, and extract and copy the oraclemag_sna.opv and oraclemag_sna.ope files to the virtual machine. Use the Terminal (Applications -> System Tools -> Terminal) command-line prompt to move the files to the correct location and start the Gremlin shell to prepare the property graph. (Note that your download location may differ; alter the first command as appropriate for your download location.)
sudo mv Downloads/oraclemag_sna.* /u01/oracle-spatial-graph/property_graph/data
cd /u01/oracle-spatial-graph/property_graph/dal/groovy
unset CLASSPATH
./gremlin-opg-hbase.sh
Now enter the following commands in the Gremlin shell to load and prepare the graph model:
cfg = GraphConfigBuilder.forPropertyGraphHbase() \
  .setName("connectionHBase") \
  .setZkQuorum("bigdatalite").setZkClientPort(2181) \
  .setZkSessionTimeout(120000).setInitialEdgeNumRegions(3) \
  .setInitialVertexNumRegions(3).setSplitsPerRegion(1) \
  .addEdgeProperty("weight", PropertyType.DOUBLE, "1000000") \
  .build();
opg = OraclePropertyGraph.getInstance(cfg);
opg.clearRepository();
vfile="../../data/oraclemag_sna.opv"
efile="../../data/oraclemag_sna.ope"
opgdl=OraclePropertyGraphDataLoader.getInstance();
opgdl.loadData(opg, vfile, efile, 2);
// read through the vertices
opg.getVertices();
// read through the edges
opg.getEdges();

So Who Are the Most Influential Users in This Twitter Group?
You can now use this property graph model to calculate the top five most influential users in this network, using the PageRank function. Continuing with the Gremlin command-line shell (which should still be open in Terminal), enter the following shell commands to calculate the top five vertices by influence:
vOutput="/tmp/mygraph.opv"
eOutput="/tmp/mygraph.ope"
OraclePropertyGraphUtils.exportFlatFiles(opg, vOutput, eOutput, 2, false);
session = Pgx.createSession("session-id-1");
analyst = session.createAnalyst();
graph = session.readGraphWithProperties(opg.getConfig());
rank = analyst.pagerank(graph, 0.001, 0.85, 100);
rank.getTopKValues(5);
Note that only the IDs of the vertices are displayed in the output:
==>PgxVertex with ID 1=0.13885623487462861
==>PgxVertex with ID 3=0.08686102641801993
==>PgxVertex with ID 101=0.06757752513733056
==>PgxVertex with ID 6=0.06743774001139484
==>PgxVertex with ID 37=0.0481517609757462
To display the actual names of these Twitter users, enter the following, again in the Gremlin shell, to retrieve the name properties for each of these vertices:
v1=opg.getVertex(1l); v2=opg.getVertex(3l); v3=opg.getVertex(101l); \
v4=opg.getVertex(6l); v5=opg.getVertex(37l);
System.out.println("Top 5 influencers: \n " + v1.getProperty("name") + \
  "\n " + v2.getProperty("name") + \
  "\n " + v3.getProperty("name") + \
  "\n " + v4.getProperty("name") + \
  "\n " + v5.getProperty("name"));
You should then see the following output as the top five users by influence:
Top 5 influencers:
 markrittman
 rmoff
 rittmanmead
 mRainey
 JeromeFr
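For readers curious about what a PageRank calculation does, the following Python sketch runs a basic power iteration on a tiny made-up graph. It is not Oracle's implementation. Reading the three numeric arguments in the call above as error tolerance (0.001), damping factor (0.85), and maximum iterations (100) is an assumption, and the user names and the `pagerank` and `top_k` helpers are hypothetical.

```python
def pagerank(out_links, tolerance=0.001, damping=0.85, max_iterations=100):
    """Power-iteration PageRank over {node: [nodes it links to]}."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tolerance:
            break
    return rank

def top_k(rank, k):
    """Return the k highest-ranked (node, score) pairs."""
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)[:k]

# Made-up example: an edge points from the user who retweets or mentions
# to the user being retweeted or mentioned.
out_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(out_links)
for user, score in top_k(ranks, 2):
    print(user, round(score, 4))
```

The user that the others point to collects the most rank, which is why heavily retweeted and mentioned accounts rise to the top of the list.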
Although this is all very interesting, it would be even more useful to display the results graphically. And it would be more useful still to use graph visualization to extend the insights resulting from the in-memory graph analysis.
Visualizing and Analyzing This Social Network with Cytoscape
To visualize the graph model, install and configure the Cytoscape graph data visualization tool and the Oracle Big Data Spatial and Graph support for Cytoscape. Then open Cytoscape and connect to the property graph stored in Apache HBase. To start Cytoscape from a new Terminal session, select File -> Open Tab from the Terminal menu bar and enter
cd /u02/Cytoscape
./cytoscape.sh
where /u02/Cytoscape is the Cytoscape installation location. Adjust the path to match your own installation.
The Cytoscape graphical user interface appears. For this example, you’ll create a connection to a property graph in Apache HBase. (If you are using Oracle NoSQL Database, refer to the Oracle Big Data Spatial and Graph support for Cytoscape documentation.) To create the connection,
Quorum : bigdatalite.localdomain
Port : 2181
Oracle Big Data Spatial and Graph support for Cytoscape connects to an existing property graph stored in HBase or Oracle NoSQL Database and enables you to execute Oracle’s in-memory analyst’s built-in analytics, including the PageRank function. It also includes community detection algorithms.
For example, the following exercise demonstrates how to use Oracle Big Data Spatial and Graph within Cytoscape to calculate the score of the top five Twitter users by influence:
Figure 3: Visualizing the connections around a node
Detecting and thematically coloring the different communities in this property graph can be helpful to, for example, identify distinct groups with which you might communicate, using different strategies. To do this with Cytoscape, do the following:
Figure 4: Finding interaction clusters that may indicate one or more communities
The clusters that Oracle Big Data Spatial and Graph detected were based on actual connections among the nodes, not on attributes such as hashtags or other self-declared community labels. Therefore, this analysis can be particularly useful for detecting groupings of users who may not even be aware that a community exists. Cluster analysis can be used when organizing and communicating with emergent communities and finding patterns where patterns may become apparent only when viewed as part of an overall relationship graph.
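To show how communities can emerge from connections alone, here is a small Python sketch of label propagation, one common community detection technique. The article does not say which algorithm produced the clusters in Figure 4, so this is an illustration rather than a description of the tool. The graph is made up, the function name is hypothetical, and ties go to the largest label purely to keep the run reproducible (implementations typically break ties at random).

```python
from collections import Counter

def label_propagation(adjacency, max_rounds=20):
    """Each node repeatedly adopts the most common label among its neighbors."""
    labels = {node: node for node in adjacency}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[neighbor] for neighbor in adjacency[node])
            if not counts:
                continue  # isolated node keeps its own label
            top = max(counts.values())
            # Assumption: deterministic tie-break on the largest label.
            new_label = max(label for label, count in counts.items()
                            if count == top)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Made-up graph: two tightly connected triangles joined by one link (3-4).
adjacency = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}
labels = label_propagation(adjacency)
print(labels)
```

No hashtags or declared memberships are involved. The two groups separate only because their members are more densely connected to each other than to the rest of the graph.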
Graph analysis is a useful complement to more-traditional types of tabular big data analysis. In this article, you performed social network analysis on Twitter data, using Oracle Big Data Spatial and Graph to identify the influencers and communities in the Oracle Business Analytics community as well as where their interests lie. You also analyzed the data to determine with whom to communicate about future community events. Finally, you used Cytoscape with the Oracle Big Data Spatial and Graph in-memory analyst for community detection. This final example divided the community into subgroups of interest, based on actual relationships and communication patterns.
Photography by Alvaro Pinot, Unsplash