X

Oracle Big Data Spatial and Graph - technical tips, best practices, and news from the product team

Intuitive Explanation of Personalized Page Rank and its Application in Recommendation

Alan Wu
Architect
In this blog post, I am going to talk about
personalized page rank, its definition and application. Let's
start with some basic terms and definitions.


Definition


Random Walk: Given a graph, a random walk is an
iterative process that starts from a random vertex, and at
each step, either follows a random
outgoing edge of the current vertex or jumps to a random vertex. The jump part is
important because some vertices may not have any outgoing
edges so a walk will terminate at those places without jumping
to another vertex.


In the following graph, an example random walk can start from
A, follow an outgoing edge of A to B, follow an outgoing edge
of B to E, follow one of E's outgoing edges to B, jump to D,
follow an outgoing edge of D to F, and jumps to C.

random walk



Page Rank (PR) measures stationary distribution of
one specific kind of random walk that starts from a
random vertex and in each iteration, with a predefined
probability p, jumps to a random vertex, and with probability1-p follows a random outgoing edge of the current vertex. Page
rank is usually conducted on a graph with homogeneous edges,
for example, a graph with edges in the form of "A  linksTo B",
"A references B", or "A likes B",  or "A endorses B", or  "A
readsBlogsWrittenBy B", or "A hasImpactOn B".


  Running the page rank algorithm on a graph will generate
rankings (PR value) for vertices and the numeric PR values can
be viewed as "importance" or "relevance" of vertices. A vertex
with a high PR value is usually considered more "important" or
more "influential" or having higher "relevance" than a vertex
with a low PR value.


Personalized Page Rank (PPR) is the same as PR other
than the fact that jumps are
back to one of a given set of starting vertices. In a way, the
walk in PPR is biased towards (or personalized for) this set
of starting vertices and is more localized compared to the
random walk performed in PR.

PPR for Recommendation



Now that we are clear about the
terms of random walk, PR, and PPR. Let's take a look at how PPR can
be used for recommendations. Assume we have the following Customer-purchase-Product graph (with
reverse edges). Say we are going to start a PPR from user
John. Quite intuitively, the random walk in PPR will very
likely touch products purchased by John, and other users who
purchased those products, and also products purchases by those
users, so on and so forth. In a way, this walk is able to
reach users that are similar to John because they purchased the same (or
similar, or related) products. In addition, the walk in PPR
discovers similar/related products because they were purchased
by the same (or similar, or related) users.


customer purchase product graph

Recommendation Workflow Using Oracle Big Data
Spatial and Graph Property Graph



The following Java code snippets can be executed in the Groovy
environment supported by Oracle Big Data Spatial and Graph.
For some basic information on Groovy, refer to a previous blog
post



// First, execute
gremlin-opg-hbase.sh or gremlin-opg-nosql.sh

// ...

// Get an instance of OraclePropertyGraph which is a key Java

// class to manage property graph data

opg = OraclePropertyGraph.getInstance(cfg);


opg.clearRepository();   // remove all vertices and edges


// Add vertices for the users and products

vJohn=opg.addVertex(1l); vJohn.setProperty("name","John");
vJohn.setProperty("age",10i);

vMary=opg.addVertex(2l); vMary.setProperty("name","Mary");
vMary.setProperty("sex","F");

vJill=opg.addVertex(3l); vJill.setProperty("name","Jill");
vJill.setProperty("city","Boston");

vTodd=opg.addVertex(4l); vTodd.setProperty("name","Todd");
vTodd.setProperty("student",true);



sdf = new java.text.SimpleDateFormat("mm/dd/yyyy");

vPhone=opg.addVertex(10l); vPhone.setProperty("type","Prod");
vPhone.setProperty("desc","iPhone5");

                          
vPhone.setProperty("released",sdf.parse("02/21/2012"));

vKindle=opg.addVertex(11l);
vKindle.setProperty("type","Prod");
vKindle.setProperty("desc","Kindle Fire");

vFitbit=opg.addVertex(12l);
vFitbit.setProperty("type","Prod");
vFitbit.setProperty("desc","Fitbit Flex Wireless");

                           
vFitbit.setProperty("rating","****");

vPotter=opg.addVertex(13l);
vPotter.setProperty("type","Prod");
vPotter.setProperty("desc","Harry Potter");

vHobbit=opg.addVertex(14l);
vHobbit.setProperty("type","Prod");
vHobbit.setProperty("desc","Hobbit");


// List the vertices

opg.getVertices()


==>Vertex ID 14 {type:str:Prod, desc:str:Hobbit}

==>Vertex ID 2 {name:str:Mary, sex:str:F}

==>Vertex ID 4 {name:str:Todd, student:bol:true}

==>Vertex ID 11 {type:str:Prod, desc:str:Kindle Fire}

==>Vertex ID 10 {type:str:Prod, desc:str:iPhone5,
released:dat:Sat Jan 21 00:02:00 EST 2012}

==>Vertex ID 12 {type:str:Prod, desc:str:Fitbit Flex
Wireless, rating:str:****}

==>Vertex ID 13 {type:str:Prod, desc:str:Harry Potter}

==>Vertex ID 1 {name:str:John, age:int:10}

==>Vertex ID 3 {name:str:Jill, city:str:Boston}


// Add edges with Blueprints Java APIs. Note
that if the number of

// edges is much bigger, then use the parallel data
loader & flat file format.



opg.addEdge(1l,  vJohn, vPhone, "purchased");

opg.addEdge(2l,  vPhone, vJohn, "purchased by");

opg.addEdge(3l,  vJohn, vKindle, "purchased");

opg.addEdge(4l,  vKindle, vJohn, "purchased by");

opg.addEdge(5l,  vMary, vPhone, "purchased");

opg.addEdge(6l,  vPhone, vMary, "purchased by");

opg.addEdge(7l,  vMary, vKindle, "purchased");

opg.addEdge(8l,  vKindle, vMary, "purchased by");

opg.addEdge(9l,  vMary, vFitbit, "purchased");

opg.addEdge(10l, vFitbit, vMary, "purchased by");

opg.addEdge(11l,  vJill, vPhone, "purchased");

opg.addEdge(12l,  vPhone, vJill, "purchased by");

opg.addEdge(13l,  vJill, vKindle, "purchased");

opg.addEdge(14l,  vKindle, vJill, "purchased by");

opg.addEdge(15l,  vJill, vFitbit, "purchased");

opg.addEdge(16l, vFitbit, vJill, "purchased by");

opg.addEdge(17l, vTodd, vFitbit, "purchased");

opg.addEdge(18l, vFitbit, vTodd, "purchased by");

opg.addEdge(19l, vTodd, vPotter, "purchased");

opg.addEdge(20l, vPotter, vTodd, "purchased by");

opg.addEdge(21l, vTodd, vHobbit, "purchased");

opg.addEdge(22l, vHobbit, vTodd, "purchased by");

opg.commit();

// Create in-memory analytics session and analyst

session=Pgx.createSession("session_ID_1");

analyst=session.createAnalyst();

// Read the graph from database into memory

pgxGraph = session.readGraphWithProperties(opg.getConfig());


// We are going to get a recommendation for user
John.

// Find this vertex with a simple query
v=opg.getVertices("name", "John").iterator().next();


// Add John to the start vertex set
vertexSet=pgxGraph.createVertexSet();

vertexSet.addAll(v.getId());


// Run personalized page rank using John as the start
vertex


ppr=analyst.personalizedPagerank(pgxGraph, vertexSet, \

            0.0001 /*maxError*/, 0.85 /*dampingFactor*/,
1000);


// Examine the top 9 entries of the PPR output

// The vertices for John, iPhone5, and Kindle Fire have the
highest PR

// values (shown below in the first column) because John is the starting point of PPR

// and iPhone5 and Kindle Fire were purchased by John.

// Mary and Jill also have high PR values because they made
similar purchases as John.

//

it=ppr.getTopKValues(9).iterator(); while
(it.hasNext()) {

     entry=it.next(); vid=entry.getKey().getId();

     System.out.format("ppr=%.4f  vertex=%s\n",
entry.getValue(), opg.getVertex(vid));}

==>

ppr=0.2496  vertex=Vertex ID 1 {name:str:John, age:int:10}

ppr=0.1758  vertex=Vertex ID 11 {type:str:Prod,
desc:str:Kindle Fire}

ppr=0.1758  vertex=Vertex ID 10 {type:str:Prod,
desc:str:iPhone5, released:dat:Sat Jan 21 00:02:00 EST 2012}

ppr=0.1229  vertex=Vertex ID 3 {name:str:Jill,
city:str:Boston}

ppr=0.1229  vertex=Vertex ID 2 {name:str:Mary, sex:str:F}

ppr=0.0824  vertex=Vertex ID 12 {type:str:Prod,
desc:str:Fitbit Flex Wireless, rating:str:****}

ppr=0.0451  vertex=Vertex ID 4 {name:str:Todd,
student:bol:true}

ppr=0.0128  vertex=Vertex ID 13 {type:str:Prod, desc:str:Harry
Potter}

ppr=0.0128  vertex=Vertex ID 14 {type:str:Prod,
desc:str:Hobbit}


// Now, let's filter out users from this list
(assume we only want to recommend products for John, not
other similar buyers)


// If we exclude the top 2 products that John purchased
before, the remaining three, Fitbit, Harry Potter,

// and Hobbit, are our recommendations for John, in that order.
Note that for John, Fitbit has a much higher PR value

// than Harry Potter and Hobbit.

//

it=ppr.getTopKValues(9).iterator(); while (it.hasNext()) {

     entry=it.next(); vid=entry.getKey().getId();
vertex=opg.getVertex(vid);

     if ("Prod".equals(vertex.getProperty("type")))

     System.out.format("ppr=%.4f  vertex=%s\n",
entry.getValue(), opg.getVertex(vid));}

=>

ppr=0.1758  vertex=Vertex ID 11 {type:str:Prod,
desc:str:Kindle Fire}

ppr=0.1758  vertex=Vertex ID 10 {type:str:Prod,
desc:str:iPhone5, released:dat:Sat Jan 21 00:02:00 EST 2012}

ppr=0.0824  vertex=Vertex ID 12 {type:str:Prod,
desc:str:Fitbit Flex Wireless, rating:str:****}

ppr=0.0128  vertex=Vertex ID 13 {type:str:Prod, desc:str:Harry
Potter}

ppr=0.0128  vertex=Vertex ID 14 {type:str:Prod,
desc:str:Hobbit}


// So the final recommended products for John are,
Fitbit, Harry Potter, and Hobbit

ppr=0.0824  vertex=Vertex ID 12 {type:str:Prod,
desc:str:Fitbit Flex Wireless, rating:str:****}

ppr=0.0128  vertex=Vertex ID 13 {type:str:Prod, desc:str:Harry
Potter}

ppr=0.0128  vertex=Vertex ID 14 {type:str:Prod,
desc:str:Hobbit}



In a future blog post, I am going to show an optimization that
can reduce the number of edges needed in this graph.

Acknowledgement: thanks Jay Banerjee for his input on this blog post, thanks to Sungpack Hong, Martin Sevenich, Korbi Schmid, Hassan Chafi for all those discussions, and thanks to Gaby Montiel, Jane Tao, and Hugo Labra for their work on the backend databases.

Join the discussion

Comments ( 2 )
  • guest Monday, March 14, 2016

    Very interesting example.

    I have a couple of questions to ask at this point

    1) does the initial choice to model a twofold relationship between nodes customers/products affect the performance/effectiveness of the PPR Algorithm ?

    2) is there any effective and agile strategy to clean up the already owned products from the final result set ?


  • guest Tuesday, March 15, 2016

    Reply from Zhe Wu of the Oracle BDSG team:

    1) does the initial choice to model a twofold relationship between nodes customers/products affect the performance/effectiveness of the PPR Algorithm ?

    Yes. PPR follows outgoing edges. So it is important to have a graph that links customers to products and also from products to customers.

    2) is there any effective and agile strategy to clean up the already owned products from the final result set ?

    It is quite easy. From the starting vertex: v, use the following data access layer Java API and get back the list of adjacent vertices. Read out the vertices, add the vertex ID into a hash set, and use this hash set to skip already-owned products.

    v.getVertices(direction, "purchased")

    Javadoc of OracleVertex can be found in the following web pages.

    http://docs.oracle.com/cd/E69290_01/doc.44/e62126/oracle/pg/hbase/OracleVertex.html

    http://docs.oracle.com/cd/E69290_01/doc.44/e62125/oracle/pg/nosql/OracleVertex.html


Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.