See your data through new eyes: Everything can be a graph

Did you know you can expose and analyze your data like a graph?

Graphs put relationships between entities in the focus of interest. They are ideally suited to model and analyze complex, deeply hierarchical, or circular relationships.

How can you do that?

Very simple. You first create a graph view of your data and then run pattern-matching queries to analyze your graph. Both steps use PGQL, a SQL-like syntax that is intuitive and, therefore, easy to use.

This post wants to give you a head start on using PGQL and Oracle Graph technologies with existing data residing in an Oracle Database 19c (or 12.2). Here is what you need:

Ingredients:
- Data (stored in your Oracle Database)
Tools:
- SQL Developer (link to download the current version) or
- SQLcl plus the PGQL plug-in (link to version 24.2)

Please note that with Oracle Database 23ai, working with graphs becomes even more straightforward. Why? Oracle has implemented SQL/PGQ, a new extension to the SQL:2023 standard supporting Property Graph Queries. You will see that the PGQL syntax is very similar to SQL/PGQ due to Oracle´s contribution to the standard definition. Therefore, the effort now to familiarize yourself with PGQL is not a waste of time.

Graph Query Languages supported by the Oracle Database — Property Graph Query Languages supported by different Oracle Database versions

(Note: The image still refers to 23c from when the article was published first. It is 23ai now.)

Now, are you ready to learn something you didn´t know before? Then, let´s get started.

Here is what we do:

Download and install SQLcl and the PGQL plug-in.
Load sample data into your Oracle Database using SQLcl.
Create a graph model (view) of your data.
Analyze your graph using SQLcl.

Step 1: Download and install SQLcl and the PGQL plugin

Use this link to download the latest version of SQLcl. Extract the zip file into a folder of your choice. The SQLcl utility resides in the ./sqlcl/bin folder.

You can now download the PGQL plug-in from Oracle Software Delivery Cloud (search for “Oracle Graph Server and Client”) or from Oracle Graph Server and Client Downloads.

To install the plug-in, you need to unzip the downloaded file into the lib/ext directory of your local SQLcl installation.
That´s it. Start SQLcl.

sql /nolog

Connect with a privileged user to create a new schema where you later store the sample data.

--------------------------------------------------------
-- Log into your Oracle Database instance using SYSTEM.
-- Adapt the following statements to your needs.
-- The new user is required to have the GRAPH_DEVELOPER role.
--------------------------------------------------------

connect system/Welcome2PGQL_1234#@localhost:1521/pdb1
create user graphuser identified by Welcome2PGQL_9876#;
grant graph_developer, create session, create table, create view to graphuser;
alter user graphuser quota unlimited on users;

Note: The GRAPH_DEVELOPER role is a convenient grouping of common roles like CONNECT and RESOURCE plus several roles with the prefix PGX_ that are required to work with Oracle Graph Server (commonly known as PGX Server).

Verify that you can connect with the newly created user.

connect graphuser/Welcome2PGQL_9876#@localhost:1521/pdb1

Step 2: Load sample data

Download the sample data from this folder available on GitHub. It contains two .csv files, named BANK_ACCOUNTS.csv and BANK_TXNS.csv.

Use your newly created database user to set up the tables.

--------------------------------------------------------
--  Create table BANK_ACCOUNTS
--------------------------------------------------------
create table bank_accounts (
    acct_id number(10),
    name varchar2(100),
      balance number(13,2)
)
/
--------------------------------------------------------
--  Add unique index to table BANK_ACCOUNTS
--------------------------------------------------------
create unique index bank_accounts_pk on bank_accounts (acct_id)
/
--------------------------------------------------------
--  Add primary key constraint to table BANK_ACCOUNTS
--------------------------------------------------------
alter table bank_accounts add constraint bank_accounts_pk primary key (acct_id)
/
--------------------------------------------------------
--  Create staging table BANK_TXNS_TEMP
--    to avoid problems with autogenerated transaction ID
--    during the data import
--------------------------------------------------------
create table bank_txns_temp (
      from_acct_id number(10),
      to_acct_id number(10),
      txn_date timestamp,
      amount number(13,2),
      currency varchar2(3)
)
/
--------------------------------------------------------
--  Create table BANK_TXNS
--------------------------------------------------------
create table bank_txns (
      txn_id number(10) generated always as identity,
      from_acct_id number(10),
      to_acct_id number(10),
      txn_date timestamp,
      amount number(13,2),
      currency varchar2(3)
)
/
--------------------------------------------------------
--  Add unique index to table BANK_TXNS
--------------------------------------------------------
create unique index bank_txns_pk on bank_txns (txn_id)
/
--------------------------------------------------------
--  Add primary key constraints to table BANK_TXNS
--------------------------------------------------------
alter table bank_txns add constraint bank_txns_pk primary key (txn_id)
/
--------------------------------------------------------
--  Add foreign key constraints to table BANK_TXNS
--------------------------------------------------------
alter table bank_txns add constraint bank_txns_from_acct_fk foreign key (from_acct_id) references bank_accounts(acct_id)
/
alter table bank_txns add constraint bank_txns_to_acct_fk foreign key (to_acct_id) references bank_accounts(acct_id)
/

Now, you can fill the tables with the downloaded sample data using SQLcl.

----------------------------------------------------------------------
-- Import data using LOAD
--    Load bank accounts directly
--    Load bank transactions into a staging table first
--    (to avoid errors related to the autogenerated transaction ID´s)
-- Adapt the 
  
    to match the folder where the files are located

  ----------------------------------------------------------------------
  
   
set loadformat csv
-- Make sure to have the decimals in the data correctly interpreted
set load locale American America

-- Import data using LOAD
load table bank_accounts BANK_ACCOUNTS.csv
load table bank_txns_temp BANK_TXNS.csv

/* Expected results:

Load data into table GRAPHUSER.BANK_ACCOUNTS
#INFO Number of rows processed: 10.000
#INFO Number of rows in error: 0
#INFO Last row processed in final committed batch: 10.000
SUCCESS: Processed without errors

Load data into table GRAPHUSER.BANK_TXNS_TEMP
#INFO Number of rows processed: 50.000
#INFO Number of rows in error: 0
#INFO Last row processed in final committed batch: 50.000
SUCCESS: Processed without errors

*/

-- Verify transaction data in the staging table
select * from bank_txns_temp fetch first 10 rows only;

-- Copy data from staging table BANK_TXNS_TEMP into BANK_TXNS
insert into bank_txns (from_acct_id, to_acct_id, txn_date, amount, currency)(
    select from_acct_id, to_acct_id, txn_date, amount, currency from bank_txns_temp);
commit;

-- Verify that your data was loaded correctly
select * from bank_accounts fetch first 10 rows only;
select * from bank_txns fetch first 10 rows only;

-- Drop staging table BANK_TXNS_TEMP
drop table bank_txns_temp purge;

All statements should run without any errors. If this is not the case, check what´s wrong. The GitHub repository contains a script to clean up all created objects. Use it to start from scratch if needed.

Step 3: Create a graph model

PGQL provides a data definition language (DDL) to create graphs. To make sure that SQLcl from now on speaks PGQL instead of SQL, activate the plug-in:

pgql auto on;

A graph consists of vertices (also called nodes) and edges. Both vertices and edges typically have labels and properties assigned. Properties are expressed as key-value-pairs.

The DDL statement to create a graph view on your existing two tables can look like the following:

CREATE PROPERTY GRAPH bank_graph
  VERTEX TABLES (
    bank_accounts
      KEY (acct_id)
      LABEL account
      PROPERTIES ( acct_id, name, balance )
  )
  EDGE TABLES (
    bank_txns
      KEY (txn_id)
      SOURCE KEY ( from_acct_id ) REFERENCES bank_accounts
      DESTINATION KEY ( to_acct_id ) REFERENCES bank_accounts
      LABEL transfers
      PROPERTIES ( txn_id, from_acct_id, to_acct_id, amount, txn_date ) 
  ) 
  OPTIONS (PG_PGQL);

The statement specifies

the name of the graph,
which tables serve as vertices,
which tables serve as edges,
which columns of the BANK_ACCOUNTS and BANK_TXNS tables are used as properties for further analysis,
which labels are assigned to the vertices and edges,
and the source format for the graph (PG_PGQL). Please note, that in previous PGQL versions the format was named PG_VIEW. It was changed to better distinguish with the new SQL-based Property Graph support in Oracle Database 23. The format for those graphs in named PG_SQL.

All details about the graph creation are stored in five metadata tables, starting with the name of the graph. For the newly created graph, those are:

BANK_GRAPH_ELEM_TABLE$
BANK_GRAPH_KEY$
BANK_GRAPH_LABEL$
BANK_GRAPH_PROPERTY$
BANK_GRAPH_SRC_DST_KEY$

Make sure you do not alter the table contents. If you did, you will compromise further analysis of the graph. You find a detailed description of the CREATE PROPERTY GRAPH syntax in the documentation.

You are done with all preparation steps, which you need to do only once. You can now go ahead and analyze your graph.

Step 4: Query your data using the Property Graph Query Language

Let me start with a few words about PGQL:
Alongside familiar SQL constructs like SELECT, FROM, WHERE, GROUP BY, and ORDER BY, PGQL allows for matching fixed-length graph patterns and variable-length graph patterns. Fixed-length graph patterns match a fixed number of vertices and edges per solution. The types of the vertices and edges can be defined through arbitrary label expressions such as account|company, for example to match edges that have either the label account or the label company. This means that edge patterns are higher-level joins that can relate different types of entities at once. Variable-length graph patterns, on the other hand, contain one or more quantifiers like *,+,? or {2,4} for matching vertices and edges in a recursive fashion. This allows for encoding graph reachability (transitive closure) queries as well as shortest and cheapest path finding queries.

PGQL deeply integrates graph pattern matching with subquery capabilities so that vertices and edges that are matched in one query can be passed into another query for continued joining or pattern matching. Since PGQL is built on top of SQL’s foundation, it benefits from all existing SQL features and any new SQL features that will be added to the standard over time.

Coming back to our graph example, since the metadata structure that keeps the information about all created graphs is a graph of its own, we start with querying that metadata graph.

SELECT g.graph_name
FROM MATCH (g IS property_graph) ON property_graph_metadata
ORDER BY g.graph_name;

The result of this query is one row returning the name of the graph you created.

Let me explain the query syntax in more details based on the next simple, fixed-length path pattern matching query:

--------------------------------------------------------
-- Fixed-length path query example:
--   List 10 transactions between two accounts
--   that are directly connected via a transaction.
--------------------------------------------------------
SELECT
  a.acct_id as src_account,
  b.acct_id as dst_account,
  t.amount as txn_amount,
  t.txn_date as txn_date
FROM MATCH (a IS account)-[t IS transfers]->(b IS account) ON bank_graph
ORDER BY src_account, dst_account, txn_date
LIMIT 10;

In the query above

bank_graph is the name of the graph.
(a IS account) and (b IS account) are vertex patterns in which a and b are variable names and IS account a label expression.
[e IS transfers] is an edge pattern in which e is a variable name and IS transfers a label expression
Variable names like a , b or e can be freely chosen by the user. The vertices or edges that match the pattern are said to bind to the variable.
The label expression IS account specifies that we match only vertices that have the label account.
The label expression IS transfers specifies that we match only edges that have the label transfers.
a.acct_id as src_account, b.acct_id as dst_account, t.amount as txn_amount, t.txn_date as txn_date are property references including aliases to return the vertex and edge properties of interest.
The query returns a maximum of 10 results caused by the LIMIT clause.
The results are ordered by the properties src_acct, dst_account, and txn_date referenced in the ORDER BY clause.

Below is the query result in tabular form.

   SRC_ACCOUNT    DST_ACCOUNT    TXN_AMOUNT TXN_DATE
______________ ______________ _____________ ______________________________
         28196     2906884860       8913,77 05.08.22 09:09:02,015793000
         28196     5377683457      10786,16 02.06.23 09:09:01,672170000
         28196     5641650742      10086,27 06.03.23 09:09:02,014685000
         28196     8674100639       2383,45 04.01.23 09:09:02,102609000
        831045      838779907       2586,65 17.03.23 09:09:01,702506000
        831045     4398554048       8700,66 28.12.22 09:09:01,486478000
        831045     6702362841       5251,38 16.04.23 09:09:01,754032000
        831045     7370032576       8465,92 23.09.22 09:09:01,715073000
        831045     8175906549      11495,07 30.11.22 09:09:02,007173000
        831045     8840313657       7863,51 16.11.22 09:09:02,088441000

To better display the advantage of querying a graph versus plain relational tables have a look at the next query. It is an example for variable-length path pattern matching queries.

--------------------------------------------------------
-- Variable-length path query example:
--   List all direct and indirect transactions
--   between two accounts
--   starting from a defined account
--   with a maxiumum hop distance of 6.
--------------------------------------------------------
SELECT
  a.acct_id AS src_account,
  b.acct_id AS dst_account,
  LISTAGG(t.amount, ' + ') || ' = ',
  SUM(t.amount) AS total_amount
FROM MATCH ALL (a IS account) -[t IS transfers]->{,6} (b IS account) ONE ROW PER MATCH ON bank_graph
WHERE
  src_account = 831045 AND
  src_account <> dst_account
ORDER BY total_amount DESC, dst_account
LIMIT 10;

What is different in this query compared with the previous one?

The use of quantifiers such as *, +, {n}, {n,}, {n,m}, or {,m} make it possible to match variable-length paths such as shortest paths. Variable-length path patterns match a variable number of vertices and edges such that different solutions (different rows) potentially have different numbers of vertices and edges. n specifies the lower bound, m the upper bound of the path length.
MATCH ALL specifies the path finding goal, which in case of ALL requires an upper bound on the path length.
ONE ROW PER MATCH defines that one row per pattern match is returned.
SUM returns the sum property values for all pattern matches.
LISTAGG constructs a concatenation of the property values for all pattern matches.
The WHERE clause sets a filter to the account in the graph from which you start searching.

Next you see the result of that query.

   SRC_ACCOUNT    DST_ACCOUNT LISTAGG(t.amount, ' + ') || ' = '                                      TOTAL_AMOUNT
______________ ______________ ___________________________________________________________________ _______________
        831045      887618973 8700,66 + 27443,9 + 85211,94 + 93275,09 + 43920,15 + 70504,6 =            329056,34
        831045     1708280095 8700,66 + 27443,9 + 85211,94 + 93275,09 + 43920,15 + 65431,65 =           323983,39
        831045     5248045533 8700,66 + 27443,9 + 85211,94 + 93275,09 + 54627,81 + 40364,23 =           309623,63
        831045     4786875159 8700,66 + 27443,9 + 85211,94 + 93275,09 + 64038,38 + 28089,68 =           306759,65
        831045     6333971885 8700,66 + 27443,9 + 87542,25 + 17596,89 + 86283 + 78035,15 =              305601,85
        831045     8238066150 8679,35 + 14202,8 + 68621,57 + 38262,49 + 80749,54 + 93518,05 =            304033,8
        831045     4309466287 8700,66 + 27443,9 + 85211,94 + 93275,09 + 43920,15 + 44413,62 =           302965,36
        831045     9239631859 11495,07 + 70723,13 + 7308,67 + 46515,58 + 80905,57 + 85198,1 =           302146,12
        831045     7794150572 8700,66 + 27443,9 + 84010,65 + 80040,96 + 27873,49 + 72333,19 =           300402,85
        831045     5898575341 8700,66 + 27443,9 + 85211,94 + 93275,09 + 54627,81 + 30401,49 =           299660,89

Starting from account with the ID 831045 the query searches all path to other accounts the source account is connected to via transfers. The number of list items (transfer amounts) is equal with the number of edges (hops) passed from the source to the destination account in the graph.

The next example of variable-length paths query looks for circular payment chains, meaning money transfers from one account via one or more intermediate accounts back to the same account.

---------------------------------------------------------------------
-- Another variable-length path query example:
--   Find all circular payment chains for a given account
---------------------------------------------------------------------
SELECT
  COUNT(t) AS num_hops,
  a.acct_id AS account,
  LISTAGG(b.acct_id, ' -> ' ) AS circular_payment_chain
FROM MATCH ALL (a IS account) (-[t IS transfers]-> (b IS account)){1,5} (a) ON bank_graph
WHERE a.acct_id = 4877749745
ORDER BY num_hops, circular_payment_chain;

The result you can see in the following graphic captured from the Oracle Graph Visualization application. This web application is available as one of the clients tools previously mentioned or can be deployed on OCI via the Oracle Cloud MarketPlace.

GraphViz query example: Find circular payment chains

It is up to you now to try out with the various capabilities of PGQL to query your graph.

You are also free to extend the underlying data set with new tables, such as for account holders, or new properties in the existing tables. Then you can create one or more alternative graphs that includes the changes.

Have fun!

See your data through new eyes: Everything can be a graph

Step 1: Download and install SQLcl and the PGQL plugin

Step 2: Load sample data

Step 3: Create a graph model

Step 4: Query your data using the Property Graph Query Language

Further reading

Karin Patenge

Senior Principal Product Manager | Oracle Spatial and Graph

Running GoldenGate 23ai on kubernetes

Oracle Database Monthly News - September 2024 - Quick Links

See your data through new eyes: Everything can be a graph

Step 1: Download and install SQLcl and the PGQL plugin

Step 2: Load sample data

Step 3: Create a graph model

Step 4: Query your data using the Property Graph Query Language

Further reading

Authors

Karin Patenge

Senior Principal Product Manager | Oracle Spatial and Graph

Running GoldenGate 23ai on kubernetes

Oracle Database Monthly News - September 2024 - Quick Links