
Oracle Spatial and Graph – technical tips, best practices, and news from the product team

A simple Customer 360 analytics example with Oracle Property Graph

Jayant Sharma
Sr. Director, Product Mgmt

Customer 360 example

Note: This is based on Ryota Yamanaka's (Senior Solutions Consultant, Big Data & Analytics, APAC) demos. The original content is in his GitHub repository.

 

Introduction

 

This example shows how integrating multiple datasets into a graph facilitates additional analytics and can lead to new insights. We will use three small datasets for illustrative purposes. The first contains accounts and account owners. The second contains purchases by the people who own those accounts. The third contains transactions between these accounts.

 

The combined dataset is then used to perform the following common graph queries and analyses: Pattern Matching (e.g. a path meeting specified criteria), Detection of cycles (e.g. circular payments), Finding weakest or strongest links (e.g. influencer identification), Community Detection (i.e. connected components), and Recommendation (a.k.a. link prediction).

 

The datasets

The data can come from multiple sources, including database tables, web services, or files. We use a simple property graph description format to list each dataset. Each file lists the nodes, edges, and their properties. Each line consists of an ID (a single ID for nodes; source and destination IDs for edges), the type (or label), and key:value pairs of properties.
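To make the format concrete, here is a minimal Python sketch of a parser for this description format. The grammar is inferred from the sample files shown below, and `parse_line` is a hypothetical helper for illustration only, not part of any Oracle tooling:

```python
import re

def parse_line(line):
    """Parse one line of the simple node/edge description format (inferred grammar)."""
    # Tokenize, keeping quoted property values (which may contain spaces) intact.
    tokens = re.findall(r'\S+:"[^"]*"|\S+', line.strip())
    rec = {}
    if "->" in tokens:                       # edge line: SRC -> DST :label [props]
        rec["src"], rec["dst"] = tokens[0], tokens[2]
        rest = tokens[3:]
    else:                                    # node line: ID [props]
        rec["id"] = tokens[0]
        rest = tokens[1:]
    for tok in rest:
        if tok.startswith(":"):              # bare :label token on edges
            rec["label"] = tok[1:]
        else:
            key, _, val = tok.partition(":")
            rec[key] = val.strip('"')
    return rec
```

For example, `parse_line('101 type:customer name:"John" age:15')` yields a dict with `id`, `type`, `name`, and `age` keys; property values are kept as strings since the format itself is untyped.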

 

Accounts and account owners

 

# Nodes

101 type:customer name:"John" age:15 location:"Boston"

102 type:customer name:"Mary" gender:"F"

103 type:customer name:"Jill" location:"Boston"

104 type:customer name:"Todd" student:"true"

201 type:account account_no:"xxx-yyy-201" open:"2015-10-04"

202 type:account account_no:"xxx-yyy-202" open:"2012-09-13"

203 type:account account_no:"xxx-yyy-203" open:"2016-02-04"

204 type:account account_no:"xxx-yyy-204" open:"2018-01-05"

 

# Edges

201 -> 101 :owned_by

202 -> 102 :owned_by

203 -> 103 :owned_by

204 -> 104 :owned_by

103 -> 104 :parent_of

 

The resulting subgraph (for these nodes and edges) is shown in the figure below. The PGQL query for it is

SELECT * MATCH ()-[e:parent_of|owned_by]->() 

which selects all edges (and their nodes) that have a label of owned_by or parent_of. 

(Figure: the accounts-and-owners subgraph, with owned_by and parent_of edges)

Purchases:

 

# Nodes

301 type:merchant name:"Apple Store"

302 type:merchant name:"PC Paradise"

303 type:merchant name:"Kindle Store"

304 type:merchant name:"Asia Books"

305 type:merchant name:"ABC Travel"

 

# Edges

201 -> 301 :purchased amount:800

201 -> 302 :purchased amount:15

202 -> 301 :purchased amount:150

202 -> 302 :purchased amount:20

202 -> 304 :purchased amount:10

203 -> 301 :purchased amount:350

203 -> 302 :purchased amount:20

203 -> 303 :purchased amount:15

204 -> 303 :purchased amount:10

204 -> 304 :purchased amount:15

204 -> 305 :purchased amount:45

 

The resulting subgraph (for these nodes and edges) is shown in the figure below. The PGQL query for it is

SELECT * MATCH ()-[e:purchased]->() 

which selects all edges (and their nodes) that have a label of purchased.

(Figure: the purchases subgraph, with purchased edges from accounts to merchants)

Transactions:

 

# Nodes

211 type:account account_no:"xxx-zzz-001"

212 type:account account_no:"xxx-zzz-002"

 

# Edges

201 -> 202 :transfer amount:200 date:"2018-10-05"

211 -> 202 :transfer amount:900 date:"2018-10-06"

202 -> 212 :transfer amount:850 date:"2018-10-06"

201 -> 203 :transfer amount:500 date:"2018-10-07"

203 -> 204 :transfer amount:450 date:"2018-10-08"

204 -> 201 :transfer amount:400 date:"2018-10-09"

202 -> 203 :transfer amount:100 date:"2018-10-10"

202 -> 201 :transfer amount:300 date:"2018-10-10"

 

The resulting subgraph (for these nodes and edges) is shown in the figure below. The PGQL query for it is

SELECT * MATCH ()-[e:transfer]->() 

which selects all edges (and their nodes) that have a label of transfer.

(Figure: the transfer subgraph between accounts)

And the full graph, once created, is as shown below. 

(Figure: the full Customer 360 graph)

Getting started

 

The sample data, and some config files, are available from Ryota's GitHub pgx-training repository under the c360/2019-05-06 directory.

PGX is part of the supported Big Data Spatial and Graph or Database Spatial and Graph releases and included in those downloads.

A standalone, unsupported, version of the PGX server package is also available from the Oracle Labs PGX site on the Oracle Technology Network.

Download and install the PGX server and the required Groovy, and set the GROOVY_HOME environment variable to point to the Groovy installation.

Verify that the PGX shell works, e.g.:

 

echo $GROOVY_HOME

/opt/groovy-2.5.6 

cd ~/Examples/pgx-19.1.0

./bin/pgx

[WARNING] PGX shell is using Groovy installed in /opt/groovy-2.5.6. It is the responsibility of the user to use the latest version of Groovy and to protect against known Groovy vulnerabilities.

PGX Shell 19.1.0

PGX server version: 19.1.0 type: SM

PGX server API version: 3.3.0

PGQL version: 1.2

type :help for available commands

variables instance, session and analyst ready to use

 

Load the C360 sample data. Let's assume the pgx-training repository was downloaded to ~/Examples/pgx-training-master. 

 

pgx> c360graph = session.readGraphWithProperties("~/Examples/pgx-training-master/c360/all.pgx.json")

==> PgxGraph[name=all.pgx,N=15,E=19,created=1556742411188] 

 

Performing Graph Analyses 

 

Now let's use PGQL to perform some pattern matching and graph analyses like detecting cycles, or finding communities and influencers.

 

Pattern Matching

 

Let’s say we’re looking for accounts that are mostly, or only, used as a temporary holding area and thus have inbound and outbound transfers within a short time period. For example, we might look for accounts that had an inbound and an outbound transfer, of over 500, on the same day. The PGQL query for this is:

SELECT a.account_no, t1.amount, t2.amount, t1.date

MATCH (a) <-[t1:transfer]-(a1),

      (a)- [t2:transfer]->(a2) 

WHERE t1.date = t2.date AND t1.amount > 500 AND t2.amount > 500  

 

We can execute this in the PGX shell using (note: the triple quotes """ mark the start and end of a multiline statement)

 

c360graph.queryPgql("""

SELECT a.account_no, t1.amount, t2.amount, t1.date

MATCH (a) <-[t1:transfer]-(a1),

      (a)- [t2:transfer]->(a2) 

WHERE t1.date = t2.date AND t1.amount > 500 AND t2.amount > 500 

“"").print()

 

and the result is

 

+---------------------------------------------------+

| a.account_no | t1.amount | t2.amount | t1.date    |

+---------------------------------------------------+

| xxx-yyy-202  | 900       | 850       | 2018-10-06 |

+---------------------------------------------------+

(Figure: the matched pattern highlighted in the transfer subgraph)

In this particular instance the transfers are from and to external entities, and hence this account may be unprofitable depending on its fee structure.
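The same pattern can be checked outside PGX as well. Here is a minimal Python sketch over the sample transfer edges, assuming a plain list of (source, destination, amount, date) tuples; `pass_through_accounts` is a hypothetical helper name:

```python
# Transfer edges as (src, dst, amount, date) tuples, taken from the sample data.
transfers = [
    ("201", "202", 200, "2018-10-05"),
    ("211", "202", 900, "2018-10-06"),
    ("202", "212", 850, "2018-10-06"),
    ("201", "203", 500, "2018-10-07"),
    ("203", "204", 450, "2018-10-08"),
    ("204", "201", 400, "2018-10-09"),
    ("202", "203", 100, "2018-10-10"),
    ("202", "201", 300, "2018-10-10"),
]

def pass_through_accounts(edges, threshold=500):
    """Accounts with an inbound and an outbound transfer over `threshold` on the same day."""
    hits = []
    for (s1, d1, amt1, dt1) in edges:        # candidate inbound transfer
        for (s2, d2, amt2, dt2) in edges:    # candidate outbound transfer
            if d1 == s2 and dt1 == dt2 and amt1 > threshold and amt2 > threshold:
                hits.append((d1, amt1, amt2, dt1))
    return hits
```

On this data the only hit is account 202, with 900 in and 850 out on 2018-10-06, matching the PGQL result above. The nested loop is quadratic in the number of edges; the point of PGQL is that the declarative pattern scales to much larger graphs.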

Detect Cycles

 

Next we use PGQL to find a series of transfers that start and end at the same account such as A to B to A, or A to B to C to A.

The first query could be expressed as:

 

SELECT a1.account_no, t1.date, t1.amount, a2.account_no, t2.date, t2.amount

MATCH (a1)-[t1:transfer]->(a2)-[t2:transfer]->(a1)

WHERE t1.date < t2.date

 

Which when executed in the PGX shell using

 

c360graph.queryPgql("""

SELECT a1.account_no, t1.date, t1.amount, a2.account_no, t2.date, t2.amount

MATCH (a1)-[t1:transfer]->(a2)-[t2:transfer]->(a1)

WHERE t1.date < t2.date

“"").print();

 

gives the following result.

 

+---------------------------------------------------------------------------------+

| a1.account_no | t1.date    | t1.amount | a2.account_no | t2.date    | t2.amount |

+---------------------------------------------------------------------------------+

| xxx-yyy-201   | 2018-10-05 | 200       | xxx-yyy-202   | 2018-10-10 | 300       |

+---------------------------------------------------------------------------------+

(Figure: the detected two-hop transfer cycle)

The second query just adds one more transfer to the pattern and could be expressed as:

 

SELECT a1.account_no, t1.date, t1.amount, a2.account_no, t2.date, t2.amount, a3.account_no, t3.amount 

MATCH (a1)-[t1:transfer]->(a2)-[t2:transfer]->(a3)-[t3:transfer]->(a1)

WHERE t1.date < t2.date AND t2.date < t3.date

 

Which when executed in the PGX shell using

 

c360graph.queryPgql("""

SELECT a1.account_no, t1.date, t1.amount, a2.account_no, t2.date, t2.amount, a3.account_no, t3.amount

MATCH (a1)-[t1:transfer]->(a2)-[t2:transfer]->(a3)-[t3:transfer]->(a1)

WHERE t1.date < t2.date AND t2.date < t3.date

“"").print();

 

gives the following result.

 

+-------------------------------------------------------------------------------------------------------------+

| a1.account_no | t1.date    | t1.amount | a2.account_no | t2.date    | t2.amount | a3.account_no | t3.amount |

+-------------------------------------------------------------------------------------------------------------+

| xxx-yyy-201   | 2018-10-07 | 500       | xxx-yyy-203   | 2018-10-08 | 450       | xxx-yyy-204   | 400       |

+-------------------------------------------------------------------------------------------------------------+

(Figure: the detected three-hop transfer cycle)
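As a cross-check, the same date-ordered cycles can be enumerated with a short brute-force search in Python. This is a sketch over the sample transfer tuples, not how PGX evaluates the pattern; `find_cycles` is a hypothetical helper:

```python
# Transfer edges as (src, dst, amount, date); ISO dates compare correctly as strings.
transfers = [
    ("201", "202", 200, "2018-10-05"),
    ("211", "202", 900, "2018-10-06"),
    ("202", "212", 850, "2018-10-06"),
    ("201", "203", 500, "2018-10-07"),
    ("203", "204", 450, "2018-10-08"),
    ("204", "201", 400, "2018-10-09"),
    ("202", "203", 100, "2018-10-10"),
    ("202", "201", 300, "2018-10-10"),
]

def find_cycles(edges, length):
    """Chains of `length` transfers with strictly increasing dates
    that start and end at the same account."""
    cycles = []

    def extend(path):
        last = path[-1]
        for e in edges:
            if e[0] != last[1] or e[3] <= last[3]:  # must continue the chain on a later date
                continue
            if len(path) + 1 == length:
                if e[1] == path[0][0]:              # closes the cycle back at the start
                    cycles.append(path + [e])
            else:
                extend(path + [e])

    for e in edges:
        extend([e])
    return cycles
```

On this data `find_cycles(transfers, 2)` finds the single 201 → 202 → 201 round trip, and `find_cycles(transfers, 3)` finds 201 → 203 → 204 → 201, agreeing with the two PGQL results.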

Finding Communities (or connected components)

 

Let's find which subsets of accounts form communities, that is, subsets in which every account can reach every other account through some chain of transfers. We'll use the built-in Kosaraju strongly connected components algorithm.

The first step is to create a subgraph that only has the accounts and the transfers among them. This is done by creating an edge filter (for edges with the label 'transfer') and applying it to c360graph.

 

acctsSubgraph = c360graph.filter(new EdgeFilter("edge.label()='transfer' "))

==> PgxGraph[name=sub-graph_13,N=6,E=8,created=1556744669838]

 

Then run sccKosaraju on that subgraph.

 

result = analyst.sccKosaraju(acctsSubgraph)

==> ComponentCollection[name=compproxy_14,graph=sub-graph_13]


 

Check the results; i.e. iterate through the result and print the number of vertices in each component (or community).

 

result.eachWithIndex {

  it, index -> println "Partition ${index} has ${it.size()} vertices"

}

Partition 0 has 1 vertices

Partition 1 has 4 vertices

Partition 2 has 1 vertices

 

By storing this result as a vertex property, we can then retrieve it with PGQL queries.

 

cs = acctsSubgraph.createChangeSet()

rs = acctsSubgraph.queryPgql("SELECT DISTINCT a MATCH (a)-[:transfer]-()")

for (r in rs) {

  v = r.getVertex(1)

  i = result.getPartitionIndexOfVertex(v)

  cs.updateVertex(v.getId()).setProperty("component", i)

}

acctsSubgraph = cs.build()

acctsSubgraph.queryPgql("""

  SELECT a.component, COUNT(a.account_no), MAX(a.account_no)

  MATCH (a)

  GROUP BY a.component

  ORDER BY a.component

""").print()

+-------------------------------------------------------+

| a.component | COUNT(a.account_no) | MAX(a.account_no) |

+-------------------------------------------------------+

| 0           | 1                   | xxx-zzz-001       |

| 1           | 4                   | xxx-yyy-204       |

| 2           | 1                   | xxx-zzz-002       |

+-------------------------------------------------------+

 

If we look at the initial subgraph of accounts we see that these correspond to the red (4 vertices), pink (1) and orange (1) vertices.

(Figure: the transfer subgraph colored by component: red = 4 accounts, pink = 1, orange = 1)
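For intuition, Kosaraju's algorithm itself fits in a few lines of Python: one depth-first pass over the forward graph to compute a post-order, then a second pass over the reversed graph in reverse post-order. This is a sketch over the sample transfer edges, not the PGX implementation:

```python
from collections import defaultdict

# Account-to-account transfer edges from the sample data (amounts and dates omitted).
edges = [("201", "202"), ("211", "202"), ("202", "212"), ("201", "203"),
         ("203", "204"), ("204", "201"), ("202", "203"), ("202", "201")]
nodes = sorted({v for e in edges for v in e})

def kosaraju_scc(nodes, edges):
    """Strongly connected components via Kosaraju's two-pass DFS."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for s, d in edges:
        fwd[s].append(d)
        rev[d].append(s)
    order, seen = [], set()
    def dfs1(v):
        seen.add(v)
        for w in fwd[v]:
            if w not in seen:
                dfs1(w)
        order.append(v)                 # post-order on the forward graph
    for v in nodes:
        if v not in seen:
            dfs1(v)
    comps, assigned = [], set()
    def dfs2(v, comp):
        assigned.add(v)
        comp.add(v)
        for w in rev[v]:
            if w not in assigned:
                dfs2(w, comp)
    for v in reversed(order):           # sweep the reversed graph in reverse post-order
        if v not in assigned:
            comp = set()
            dfs2(v, comp)
            comps.append(comp)
    return comps
```

Run on the eight transfers, this yields one component containing accounts 201-204 (they sit on the payment cycles found earlier) and singleton components for 211 and 212, matching the partition sizes 4, 1, 1 reported by PGX.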

Lastly let's use Personalized PageRank to find stores that John may purchase from given that people he is connected to have made purchases from those stores.

 

Recommendation

 

Create a subgraph with customers and merchants, i.e. filter the original graph based on the edge label "purchased".

 

customersAndMerchants = c360graph.filter(new EdgeFilter("edge.label()='purchased' "))

 

The algorithm requires bidirectional edges, so we add them programmatically. The first step is to create a change set with the required updates, i.e. adding reverse edges for the purchases. Then we build a new graph using that change set.

 

purchases_changeSet = customersAndMerchants.createChangeSet()

// get the vertices and add a reverse edge

purchases_resultSet  = customersAndMerchants.queryPgql("SELECT a, x MATCH (a)-[:purchased]->(x)")

for (r in purchases_resultSet) {

      a = r.getVertex(1).getId()

      x = r.getVertex(2).getId()

      purchases_changeSet.addEdge(x, a).setLabel("purchased_by")

}

// build the graph

purchases_Graph = purchases_changeSet.build()

// query it to test that the build worked 

purchases_Graph.queryPgql("SELECT x.name, label(r), a.account_no MATCH (x)-[r:purchased_by]->(a)")

(Figures: the purchases subgraph before and after adding the reverse purchased_by edges)

Now let's compute the personalized page rank (PPR) for John (vertex ID "101"). A PPR is biased towards a specified set of vertices. So first we create and populate that vertex set. Then we compute the PPR.

 

ppr_vertexSet = purchases_Graph.createVertexSet();

ppr_vertexSet.addAll("101");

ppr = analyst.personalizedPagerank(purchases_Graph, ppr_vertexSet);

 

Now query the graph to get the results (i.e. the computed PPR values, which are stored in a new property named "pagerank"). The query "find merchants that John has not purchased from, and list them ordered by their PPR value" is expressed as follows.

 

purchases_Graph.queryPgql("""
SELECT ID(x), x.name, x.pagerank MATCH (x) WHERE x.type = 'merchant'
AND NOT EXISTS ( SELECT * MATCH (x)-[:purchased_by]->(a) WHERE ID(a)='101')
ORDER BY x.pagerank DESC

""").print()

 

which gives the results shown below.

 

+--------------------------------------------+

| ID(x) | x.name       | x.pagerank          |

+--------------------------------------------+

| 303   | Kindle Store | 0.04932640133302745 |

| 304   | Asia Books   | 0.04932640133302745 |

| 305   | ABC Travel   | 0.01565535511504672 |

+--------------------------------------------+
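The mechanics behind this ranking can be sketched with plain power iteration in Python. This is an illustrative approximation, not the PGX implementation: to keep the graph self-contained it uses only accounts and merchants, and it personalizes the walk on John's account vertex 201 (rather than on John's customer vertex 101 as in the shell example above), so the exact scores will differ from the PGX output while the relative ordering of candidate merchants is the same idea:

```python
from collections import defaultdict

# Account -> merchant purchase edges from the sample data, made bidirectional so
# the random walk can also move merchant -> account (the "purchased_by" direction).
purchases = [("201", "301"), ("201", "302"), ("202", "301"), ("202", "302"),
             ("202", "304"), ("203", "301"), ("203", "302"), ("203", "303"),
             ("204", "303"), ("204", "304"), ("204", "305")]
adj = defaultdict(list)
for a, m in purchases:
    adj[a].append(m)
    adj[m].append(a)                    # reverse edge

def personalized_pagerank(adj, source, damping=0.85, iters=100):
    """Power iteration for PageRank with all teleport mass restarting at `source`."""
    nodes = list(adj)
    pr = {v: (1.0 if v == source else 0.0) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping if v == source else 0.0) for v in nodes}
        for v in nodes:
            share = damping * pr[v] / len(adj[v])   # spread rank evenly over neighbors
            for w in adj[v]:
                nxt[w] += share
        pr = nxt
    return pr

pr = personalized_pagerank(adj, "201")
# Merchants John's account has not purchased from, best candidates first.
candidates = sorted(["303", "304", "305"], key=pr.get, reverse=True)
```

As in the PGX result, Kindle Store (303) and Asia Books (304) come out tied, because swapping accounts 202/203 and merchants 303/304 is a symmetry of this small graph, while ABC Travel (305) ranks lowest since it is reachable only through account 204.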
