X
  • Tuesday, October 31, 2017

Getting started with OAAgraph - vignette

By: Mark Hornick | Director, Advanced Analytics and Machine Learning

Following up on the introductory post on OAAgraph, here is a vignette that illustrates using some of the OAAgraph package's capabilities. Recall that OAAgraph enables seamless interaction between R users of Oracle R Enterprise (ORE) of the Oracle Advanced Analytics option, Oracle Database, and the Parallel Graph Engine (PGX) of the Oracle Spatial and Graph option.

In this post, we highlight a few aspects of OAAgraph:

  • Creating a graph from node and edge tables residing in Oracle Database
  • Invoking graph analytics algorithm: countTriangles, degree, pagerank
  • Using the oaa.cursor object
  • Creating a graph from a snaphot in-memory representation as stored in the database
  • Creating tables from the nodes, edges, and their properties in an oaa.graph proxy object
  • Cleaning up in-memory graphs and database objects

In the architecture figure below, Oracle Database with both ORE and PGX reside at the database server machine, while the client R engine loads packages for both ORE and OAAgraph. Note: using OAAgraph requires installation of the Oracle Spatial and Graph PGX engine. See link for details.

Let's begin. First, we load the ORE and OAAgraph packages at the client and connect to ORE using ore.connect, and then to the PGX server using oaa.graphConnect. Provide the same database credentials in oaa.graphConnect to allow the PGX server to access and create database tables in Oracle Database.

library(ORE)
library(OAAgraph)

dbHost     <- "myHost"
dbUser     <- "myUserID"
dbPassword <- "myPassword"
dbSid      <- "myDatabaseSID"
pgxBaseUrl <- "myPGXBaseURL"

ore.connect(host=dbHost, user=dbUser, 
            password=dbPassword, sid=dbSid)
oaa.graphConnect(pgxBaseUrl=pgxBaseUrl, dbHost=dbHost,
                 dbSid=dbSid, dbUser=dbUser, dbPassword=dbPassword)

To keep the example simple, we create a small set of nodes in a data.frame and then create that as a node table in Oracle Database. Note that VID refers to the numeric vertex identifier. The VID must be numeric.
VID <- c(1, 2, 3, 4, 5)
NP1 <- c("node1", "node2", "node3", "node4", "node5")
NP2 <- c(111.11, 222.22, 333.33, 444.44, 555.55)
NP3 <- c(1, 2, 3, 4, 5)

nodes <- data.frame(VID, NP1, NP2, NP3)
ore.drop(table="MY_NODES")
ore.create(nodes, table = "MY_NODES")
Similarly, we create the edge table in Oracle Database, where EID refers to the required numeric edge identifier, SVID to the source vertex identifier, DVID to the destination vertex identifier, EL to the edge label, and any other named edge properties, here we chose EP1, but any name may be used.
EID  <- c(1, 2, 3, 4, 5)
SVID <- c(1, 3, 3, 2, 4)
DVID <- c(2, 1, 4, 3, 2)
EL   <- c("label1", "label2", "label3", "label4", "label5")
EP1  <- c("edge1", "edge2", "edge3", "edge4", "edge5")

edges <- data.frame(EID, SVID, DVID, EP1, EL)
ore.drop(table="MY_EDGES")
ore.create(edges, table = "MY_EDGES")
Using ore.ls, we can then verify that the tables exist as ore.frames, and use the transparency layer of ORE to view their dimensions and summary statistics.
> ore.ls(pattern="MY")   # view newly created tables in schema
[1] "MY_EDGES"  "MY_EDGES1" "MY_NODES"  "MY_NODES1"
> class(MY_NODES)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> colnames(MY_NODES)
[1] "VID" "NP1" "NP2" "NP3"
> dim(MY_NODES)
[1] 5 4
> summary(MY_NODES)
      VID       NP1         NP2             NP3   
 Min.   :1   node1:1   Min.   :111.1   Min.   :1  
 1st Qu.:2   node2:1   1st Qu.:222.2   1st Qu.:2  
 Median :3   node3:1   Median :333.3   Median :3  
 Mean   :3   node4:1   Mean   :333.3   Mean   :3  
 3rd Qu.:4   node5:1   3rd Qu.:444.4   3rd Qu.:4  
 Max.   :5             Max.   :555.5   Max.   :5  
> 
Now we're ready to create a graph in PGX from the database node and edge tables. We can assign a name to this graph as well, here 'MY_PGX_GRAPH'. Note that in OAAgraph 2.4.2, names will be converted to upper case. The graph name is used in particular when creating snapshots, as depicted below.
graph <- oaa.graph(MY_EDGES, MY_NODES, "MY_PGX_GRAPH")
graph
names(graph, "nodes")
names(graph, "edges")
Upon executing these statements, we see that printing the oaa.graph object displays the name, the number of nodes and edges, whether the graph is persisted (i.e., has a snapshot), along with the node and edge property names. We can also get the node and edge property names using the overloaded names function.
> graph <- oaa.graph(MY_EDGES, MY_NODES, "MY_PGX_GRAPH")
> graph  
Graph Name: MY_PGX_GRAPH 
Number of Nodes: 5 
Number of Edges: 5 
Persistent Graph: FALSE 
Node Properties: NP1, NP3, NP2 
Edge Properties: EP1 
> names(graph, "nodes")  
[1] "NP1" "NP3" "NP2"
> names(graph, "edges")
[1] "EP1"
> 
Let's see the result of the countTriangles function, which counts the number of triangles in the graph, giving an overview of the number of connections between nodes in neighborhoods. Here, we see this simple graph has two such triangles. Note that this algorithm is intended for undirected graphs. To make a graph undirected, use the oaa.undirect function.
> countTriangles(graph, sortVerticesByDegree=FALSE) 
[1] 2
> 
Next, we'll look at the degree function and it's variants. As a result of invoking functions like degree, new properties are added to graph nodes with name as specified in the 'name' argument. These properties can be accessed through the cursor object shown below.
degree(graph, name = "OutDegree")
degree(graph, name = "InDegree", variant = "in")
degree(graph, name = "InOutDegree", variant = "all")

After executing these functions, we have three new node properties:

> names(graph, "nodes") 
[1] "OutDegree"  "NP1"  "NP3" "NP2"  "InOutDegree" "InDegree"   
> 

To access these metrics computed by the degree function, we create a cursor including the names of the degree properties provided above:

cursor <- oaa.cursor(graph, 
                     c("OutDegree", "InOutDegree", "InDegree"), "nodes")
Let's view the 5 entries from the cursor:
> oaa.next(cursor, 5) 
  OutDegree InOutDegree InDegree
1         1           2        1
2         1           3        2
3         2           3        1
4         1           2        1
5         0           0        0
> 
We can also use a Parallel Graph Query Language (PGQL) query to retrieve the same values, to compute values, specify ordering, perform a graph pattern search, and more. See this link to PGQL for details.
cursor <- oaa.cursor(graph, 
                     query = "select n.id(),n.OutDegree,n.InOutDegree,n.InDegree 
                     where (n) order by n.OutDegree desc")
Again, view the 5 entries from the cursor. Note that the node identifier can be accessed using n.id in the select portion of the PGQL query.
> oaa.next(cursor, 5) 
  n.id() n.OutDegree n.InOutDegree n.InDegree
1      3           2             3          1
2      2           1             3          2
3      1           1             2          1
4      4           1             2          1
5      5           0             0          0
> 
Similarly, we can compute the pagerank metric on the graph, which produces the default property named 'pagerank'.
pagerankCursor <- pagerank(graph, error=0.085, damping=0.1, itermations=100)
oaa.next(pagerankCursor, 5)
Next, we create a cursor over the pagerank property using PGQL and view the results.
cursor <- oaa.cursor(graph, 
                     query = "select n.pagerank where (n) 
                              order by n.pagerank desc")
> oaa.next(pagerankCursor, 5)  
2     0.22
3     0.20
1     0.19
4     0.19
5     0.18
> 
This could be done using the ordering argument to the oaa.cursor function as well.
cursor <- oaa.cursor(graph, "pagerank", ordering="desc")
OAAgraph provides the capability to persist a graph as a snapshot in the database. Since reconstructing a graph from node and edge tables can take longer than loading a binary representation of a graph, persisting a graph that will be used often can save the user significant load time. To list available graph snapshots, use the oaa.graphSnapshotList function.
oaa.graphSnapshotList()

To export a binary snapshot of the whole graph into Oracle Database, use the oaa.graphSnapshotPersist function, where we can specify that some or all node or edge properties should be maintained.

By setting argument overwrite to TRUE, if the named snapshot already exists, it will be replaced, otherwise an error is returned. After creating the snapshot,  view the listing to see the name of graph appear.

oaa.graphSnapshotPersist(graph, nodeProperties = TRUE, 
                         edgeProperties = TRUE, overwrite=TRUE)
> oaa.graphSnapshotList()  
[1] "MY_PGX_GRAPH"
> 
We can load the snapshot into memory by name, creating a new oaa.graph proxy object, graph2.
graph2 <- oaa.graphSnapshot("MY_PGX_GRAPH")
Viewing this graph, we see it has the same components as our original graph.
> graph2
Graph Name: MY_PGX_GRAPH_2 
Number of Nodes: 5 
Number of Edges: 5 
Persistent Graph: TRUE 
Node Properties: OutDegree, NP1, pagerank, NP3, NP2, InOutDegree, InDegree 
Edge Properties: EP1 
> 
To wrap up our vignette, we export the graph into database tables. First, we'll export all nodes and their properties. Using oaa.create, we can overwrite existing tables and specify the number of connections to open to the database. If the number of connections is greater than one, PGX will write the graph to the database in parallel.
oaa.create(graph2, nodeTableName = "RANKED_NODES", nodeProperties = TRUE,
           overwrite=TRUE, numConnections = 1)
Next, we export both nodes and edges as tables into the database, but only export the pagerank node property:
oaa.create(graph2, nodeTableName = "RANKED_GRAPH_N",
           nodeProperties = c("NP1", "pagerank"),
           edgeTableName = "RANKED_GRAPH_E", 
           overwrite=TRUE, numConnections = 1)
Lastly, we export only graph edges and their properties:
oaa.create(graph2, edgeTableName = "RANKED_EDGES", edgeProperties = TRUE)
Now that we're finished with our oaa.graph proxy objects, we free the graphs at the PGX server, invoking oaa.rm on the two graphs.
oaa.rm(graph)
oaa.rm(graph2)
To clean up the tables created above, we use the ORE function ore.drop:
ore.drop("MY_NODES")
ore.drop("MY_EDGES")
ore.drop("RANKED_NODES")
ore.drop("RANKED_GRAPH_N")
ore.drop("RANKED_GRAPH_E")
ore.drop("RANKED_EDGES")
Finally, we can remove the snapshot using oaa.dropSnapshots:
oaa.dropSnapshots("MY_PGX_GRAPH")
In this vignette, we exercised a variety of the capabilities provided with OAAgraph. See the OAAgraph webpage on OTN for more details.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha