X

Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

Graph machine learning with PyPGX and OML4Py

Rhicheek Patra
Research Manager

PyPGX and OML4Py enable users to develop fast, scalable, and secure graph machine learning applications. In this post, I'll show you how to build an intrusion detection system using PyPGX and OML4Py for graph machine learning

 

What are PyPGX and OML4Py? 

PyPGX is the newly released Python client for PGX, a toolkit for graph analysis developed by Oracle (available as the Property Graph feature of Oracle Database).

OML4Py is the Python interface to Oracle Machine Learning (OML), which is a set of products that support scalable machine learning algorithms for in-database and big data environments.

The key advantage of OML is that all the machine learning functionality can be run on data in an Oracle Database. This provides several advantages, including: 

  • Scalability (Oracle Database can store huge volumes of data)
  • Performance improvements (you don't need to constantly fetch and push back your data to and from the database)
  • Security (your data does not need to leave the protected environment of the Oracle Database)

 

Subscribe to the Oracle Data Science Newsletter to get the latest machine learning and data science content sent straight to your inbox!

 

Building an intrusion detection system

What is intrusion detection?

Intrusion detection is a broad term for the monitoring of a wide range of system and public/private network traffic. For our use case, we define intrusion detection as the problem of monitoring public network (e.g. the world wide web) traffic and detecting malicious activity.

Network traffic, in our case, is so-called packet captures, which are logs of IP addresses interacting with other IP addresses. The following table is an example of such a packet capture:

Packet capture table

The previous table can be regarded as an edge-list representation of a graph. A resulting graph might look something like this:These packet captures can naturally be expressed as graphs, by treating IP addresses as vertices and an interaction between two IP addresses A and B as an edge between the two vertices A and B.

Packet captures graph

This figure shows an example of a malicious packet capture.

The victim searches for something over a web search engine; he then clicks on one of the results beloinging to IP address which, unfortunately, has been compromised. A then redirects the victim over several links to the IP address B, which lets the victim download an exploit. This exploit tries to get administrator privileges on the victim's host machine and, upon success, reports the successful infiltration to its master C. C then redirects the victim to the server D, which downloads further malicious software (for example, ransomware that will encrypt your computer and demand money for its decryption).

For a more detailed explanation of packet captures and the whole problem of intrusion detection, I would like to point the interested reader to this report on DynaMiner and malware detection

Training the classifier

For our example, we have a lot of different packet captures, and our goal is to train a classifier that can distinguish malicious packet captures from benign packet captures.

Classifier: Malicious packet captures from benign packet captures.

 

Intrusion detection using PyPGX and OML4Py

Overview

Solving the problem using PyPGX and OML4Py

We will solve the problem in three main stages. In the image above, you can see the main stages (the three columns) and the different steps that are included in each stage. The three main stages consist of the following steps:

  1. Data preparation: In the first stage, we will fetch our raw packet capture data and convert it into a form that can later be used by the PGX server to create graphs. These so-called graph files are then stored in an Oracle Database.
  2. Feature generation: In the second stage, the PGX server reads in the graph files from the database and uses them to build graphs. These graphs are then analyzed extensively using the rich graph-algorithm library provided by PGX. The result of these analyses is then stored in a large feature table, which itself is stored in the Oracle Database.
  3. Classification: In the third stage, we use the feature table we created in stage two to train a variety of different classifiers in order to achieve an as high as possible prediction accuracy. For this, we will also OML's AutoML functionality and embedded Python execution.

Creating the graphs

Build graph files

Connect to the PGX and the OML server

In order to use the PGX and OML functionalities, we first need to connect sessions on the respective servers. This is quite straightforward and can be done with the pypgx.get_session() command for PGX and the oml.connect() function for OML.

import pypgx
import oml
 
# Connect to the PGX server
session = pypgx.get_session(base_url="http://localhost:7007",  token="some_token")
analyst = session.analyst
 
data_source_name = '(DESCRIPTION=data_source_of_the_corresponding_oracle_database)'
 
# Connect with existing oracle database
oml.connect("example_user","example_password", dsn = data_source_name, automl = data_source_name)
 
# Check whether connection attempt was successful
print(oml.isconnected())-->


Note: In order to use OML, you need to have a running Oracle Database with OML capabilities. Otherwise, the oml.connect() function will throw an error.

Create the graph files

In this step, we fetch our raw input data from some external source (for example, the file system or a database) and transform it into a form that the PGX server later can use to create the actual graphs.

In the next step, will create two graphs: one graph for the graph data of malicious packet captures, and one graph for the graph data of benign packet captures.

For each of these graphs we prepare the following two tables:

  • Vertex table: A table which stores all vertices with the corresponding vertex properties
  • Edge table: A table which stores all edges and the corresponding edge properties.

As this step depends on how and where the raw input data is stored, and doesn't include any PyPGX or OML4PY functionality, we omit the actual code to create the tables, and just show the end results.

After this step, our tables are stored in a pandas DataFrame and look like this:

Vertex table (for malicious graph data):

Vertex table for malicious graph data

Edge table (for malicious graph data):

Edge table (for malicious graph data)
The corresponding tables for the benign graph data have the exact same schema.

Store graph files in Oracle Database

We now store the created graph files inside an Oracle Database. This can be done by using the oml.create() command which will create a new table in the database. The function takes as an input the name that this new table shall receive, and a pandas DataFrame that stores the data that will be stored in the newly created table. The function returns a so-called proxy object, which is basically a pointer to the table inside the database. This proxy object can be used to access and manipulate the data stored in the database.

# create the benign table
benign_vertices = oml.create(df_first_benign, table = 'BENIGN_VERTICES')
benign_edges = oml.create(df_second_benign, table = 'BENIGN_EDGES')
 
# create the malware table
malware_vertices = oml.create(df_first_malware, table = 'MALWARE_VERTICES')
malware_edges = oml.create(df_second_malware, table = 'MALWARE_EDGES')


After executing the code above, we created four proxy objects: benign_vertices, benign_edges, malware_vertices, and malware_edges.

 

Load graph files in PGX

In this step, we are going to create the graphs inside the PGX server, using the graph files we prepared in the previous step. For this, we first need to create a graph config for each of our two graphs. The graph config tells the PGX server all the important information it needs to load the graph. The config can be expressed as a simple Python dictionary object:

config_malware = {
     
    'jdbc_url': 'jdbc:oracle:thin:@'+data_source_name,
    'format': 'two_tables',
    'datastore': 'rdbms',
     
    'username': 'test_user',
    'keystore_alias': 'database_keystore',
     
    'vertex_id_type': 'string',
     
    'nodes_table_name': 'MALWARE_VERTICES',
    'edges_table_name': 'MALWARE_EDGES',
     
    'nodes_key_column' : 'V_ID',
    'edges_key_column' : 'E_ID',
    'from_nid_column' : 'SRC_ID',
    'to_nid_column' : 'DST_ID',
     
    'vertex_props': [
        {'name': 'V_LABEL', 'type': 'string'},
        {'name': 'TYPE', 'type': 'string'},
        {'name': 'GRAPH_ID', 'type': 'long'}
    ],
    'edge_props' : [
        {'name': 'SRC_TYPE', 'type': 'string'},
        {'name': 'DST_TYPE', 'type': 'string'}
    ]
}


This config contains information on how the graph files are stored, corresponding credentials to access the database, and information about the schema of the table. In order for the PGX server to access the Oracle Database, we first register a keystore that contains the password for the user test_user to access the database.


# Register the keystore
session.registerKeystore("/path/to/keystore/keystore_test.p12", "test_password")

After this has been done, creating the graphs is straightforward. We use for this the read_graph_with_properties() function and provide the corresponding graph configs:


# Read in the graphs
graphlets_b = session.read_graph_with_properties(config_benign)
graphlets_m = session.read_graph_with_properties(config_malware)

The function read_graph_with_properties() returns a proxy object to the corresponding graph inside PGX. This object can now be used to access and manipulate the graph.

 

Next step: Using a featured table

Workflow: Raw packet captures, partitioned graph, feature table

Let's recap for a moment what we have done so far.

First, we fetched a long list of raw packet capture data. We then created two partitioned graphs, graphlets_b and graphlets_m. Both of these graphs are composed of many small subgraphs. Each of these subgraphs represents one of the packet captures we read in previously.

As a next step, we will analyze each of these small graphlets, and store the information about them in a large feature table. For each graphlet, we will store one row in the feature table. We will then attach a label to each of these rows that tells whether this row contains information about a malicious or a benign packet capture.

So without further ado, let's continue our journey!

From feature generation to classification

Create graphlet features

Now that we created the two graphs inside PGX, we are going to analyze them using a wide variety of graph algorithms. We'll later use the resulting information to train the different classifiers. For each graph, we will iterate over each individual graphlet and analyze it using the following functionality:


# A list which will store the generated feature information
feature_strings_malware = []
 
# Iterate over each of the packet captures inside the graph
for val in range(1,amount_of_malware_graphs+1,1):
    # Reduce the graph to the subgraph of this packet capture
    graphlet = graphlets_m.filter(VertexFilter("vertex.GRAPH_ID = "+str(val)))
     
    # Generate the feature information for this packet capture
    features = generate_graphlet_features(graphlet)
     
    # Add the feature information to our list. The '1' represents the label
    # of this packet capture. It denotes that this is a malicious packet capture
    feature_string = [1] + features
    feature_strings_malware.append(feature_string)

After we got the feature list of a single graphlet, we append it to a list of feature lists (feature_strings_malware). Before doing this, we append to each list of features a binary label, which indicates whether this list belongs to a malicious (label = 1) or benign (label = 0) packet capture.

The function generate_graphlet_features() takes as input a single graphlet and uses the rich set of graph algorithms that PGX provides to compute a number of  properties of the given graphlet:


def generate_graphlet_features(graphlet):
    features = []
    features.append(graphlet.num_vertices)
    features.append(graphlet.num_edges)
    features.append(get_average_property(analyst.degree_centrality(graphlet)))
    features.append(get_average_property(analyst.local_clustering_coefficient(graphlet)))
    features.append(get_average_property(analyst.out_degree_centrality(graphlet)))
    features.append(get_average_property(analyst.in_degree_centrality(graphlet)))
    features.append(get_average_property(analyst.vertex_betweenness_centrality(graphlet)))
    features.append(get_average_property(analyst.pagerank(graphlet)))
    features.append(get_average_property(analyst.eigenvector_centrality(graphlet,max_iter=1000,tol=0.00001, l2_norm=True, in_edges=True)))
    features.append(get_sum_property(analyst.degree_centrality(graphlet)))
    return features

The two functions get_average_property() and get_sum_property() are just helper functions, to average or sum a list of properties. 


def get_average_property(property):
    numerator = 0
    count = 0
    for vertex, val in property.get_values():
        numerator += val
        count += 1
    return numerator / count
 
def get_sum_property(property):
    total_sum = 0
    for vertex, val in property.get_values():
        total_sum += val
    return total_sum

Train data with the OML classifier

Prepare the data for classification

Remember that, until now, we did every step twice: once for the graph which stores the benign packet captures, and a second time for the graph which stores the malicious packet captures. Now that we appended a label to each feature list, which denotes whether this feature list belongs to a benign (label = 0) or malicious (label = 1) packet capture, we can finally combine the two feature lists onto one large feature table:

# Combine the two dataframes
feature_strings = feature_strings_malware.append(feature_strings_benign)
 
# Shuffling the dataframe
feature_strings = feature_strings.sample(frac=1, random_state=0)
feature_strings = feature_strings.reset_index(drop=True)
 
# Renaming the label column to 'label'
feature_strings.rename(columns = {'0':'label'}, inplace = True)

We now created the final table, which we will use to train our classifiers. The table looks something like this:

Table of data to train classifiers

We can now push this table back in the Oracle Database and execute the whole machine learning functionality directly on data inside the database:


# Store the features in the database
data_oml = oml.create(feature_strings, table = 'data_complete')

Finally, to train and test classifiers on this data, we should split our data into a training set and a test set. Note that the data does not need to be fetched from the database to do this. OML provides the split() function for this, which returns two proxy objects to the two splits of the data:


# Split the data
train_data_oml, test_data_oml = data_oml.split(ratio=(.8, .2), seed=0)
 
# Create the train data
train_x_oml = train_data_oml.drop('label')
train_y_oml = train_data_oml['label']
 
# Create the test data
test_x_oml = test_data_oml.drop('label')
test_y_oml = test_data_oml['label']

 

First classification using OML

After having prepared our data for classification, it's time to start training our classifiers!

Let's start with a neural network. OML4Py lets you create a neural network via the oml.nn() constructor. You can provide a large set of parameters to this constructor in order to individualise your neural network. In the code below, we created four layers with different numbers of nodes and different activation functions. 


# Create an oml Neural Network model object.
params = {
    # The architecture of the neural network
    'nnet_hidden_layers' : 4,
    'nnet_nodes_per_layer' : '50, 30, 50, 30',
 
    # The differnt activation functions in each layer
    'nnet_activations' : "'NNET_ACTIVATIONS_LINEAR','NNET_ACTIVATIONS_LINEAR','NNET_ACTIVATIONS_LINEAR','NNET_ACTIVATIONS_LINEAR'",
 
    # Info about the number of rounds to use for fitting
    'nnet_iterations' : 500,
    'nnet_heldaside_max_fail' : 100,
 
    # A seed for reproducibility
    'odms_random_seed' : 1
}
nn_mod = oml.nn(**params)

Training and scoring the neural network is also quite easy. OML4Py provides two functions, fit() and score(), that each take OML proxy objects as parameters. These proxy objects tell the functions on which data they should apply the training and testing.


# Fit the NN model according to the training data and parameter settings.
nn_mod = nn_mod.fit(train_x_oml, train_y_oml)
 
# Score the model
accuracy = nn_mod.score(test_x_oml, test_y_oml)
 
print(accuracy)

Training this model on our data and then scoring it, yields an accuracy of 87.5%. But, can we do better?

OML AutoML: Let OML do the work for you

OML provides a large set of different classifiers, from neural networks over linear classifiers to naive bayes classifiers. However, it's cumbersome to try out all of these different classifiers to find out which one delivers the best results, especially since each single classifier has a large number of hyperparameters which could be tuned. But, OML has a solution for this: AutoML. The AutoML model provides a variety of functions that one can use to let OML find the best model/features/hyperparameters for you.

Let's apply the automatic algorithm selection to our data to find out which model yields the best result for our data:


# Create an algorithm selection object
alg_selection = oml.automl.AlgorithmSelection(mining_function='classification', score_metric='accuracy',
                                             parallel=1)
 
# Find the 3 classifiers which provide the best accuracy
algo_ranking = alg_selection.select(train_x_oml, train_y_oml, cv=None,
                                    X_valid=test_x_oml, y_valid=test_y_oml, k=3)
 
# Output these classifiers
print("The best models are: " + str(algo_ranking))

Running the code above gives us the following ranking:

  1. Decision tree (Accuracy: ~ 99.6%)
  2. General linearized model (Accuracy: ~ 99.6%)
  3. Random forest (Accuracy: ~ 99.5%)

Let's check how accurate this estimation is.

Training a decision tree classifier

The syntax for creating a decision tree classifier is straightforward. Just like in the example with the neural network, we first specify a set of parameters in a Python dictionary. This dictionary can then be passed to the oml.dt() function, which is the constructor function for the decision tree classifier:


# Specify settings.
setting = {
    # Maximum tree depth
    'TREE_TERM_MAX_DEPTH':'7',
     
    # Maximum number of bins for each attribute
    'CLAS_MAX_SUP_BINS':'32',
     
    # Minimum number of rows in a node
    'TREE_TERM_MINREC_NODE':'10'
}
 
# Create an oml DT model object.
dt_mod = oml.dt(**setting)

We can use the exact same functions we used for the neural network to train and test our decision tree classifier:


# Fit the DT model according to the training data and parameter settings.
dt_mod.fit(train_x_oml, train_y_oml)
 
# Score the model
accuracy = dt_mod.score(test_x_oml, test_y_oml)
 
print(accuracy)

And indeed, when we run this code, we receive an accuracy of 99.6%

Embedded Python execution: Create a custom classifier specifically targeted to your needs

Despite all the different parameters that the various OML classifiers and regressors offer, the amount of different models that one can create is still restricted. What if you need to build a very specialized classifier for which there don't exist the right parameters? Fear not, because OML provides a solution for that: embedded Python execution.

This is a functionality where a user can write any Python code he/she would like and then send the code to the Oracle Database, where it gets executed on the specified data. Let's look at an example of this by using the function table_apply(). 

First, we write a simple function to execute on our data in the database. In this example, we use a gradient boosting classifier from the Python machine learning library scikit-learn:


def table_apply_func_sklearn_booster(data):
    # Import the necessary modules
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
     
    # Define a simple one-hot encoder
    def oneHotEncoder(data, n_classes):
        y = np.array([int(i[0]) for i in data])
        y_onehot = [0]*len(y)
        for i,j in enumerate(y):
            y_onehot[i] = [0]*n_classes
            y_onehot[i][j]= 1
        return (y,y_onehot)
     
    # Create a train-test split of the data
    msk = np.random.rand(len(data)) < 0.8
    train_data = data[msk]
    test_data = data[~msk]
    x_train = np.array([ i[1::] for i in train_data])
    y_train, y_train_onehot = oneHotEncoder(train_data, 2)
    x_test = np.array([ i[1::] for i in test_data])
    y_test, y_test_onehot = oneHotEncoder(test_data, 2)
     
    # Create the gradient boosting classifier
    gbc = GradientBoostingClassifier(random_state = 0)
     
    # Train the model
    gbc.fit(x_train, y_train)
     
    # Get the accuracy of the model
    score = gbc.score(x_test, y_test)
     
    # return the trained model and the accuracy it achieved
    return (gbc, score)

Now, to execute this code on our data table in the database, we can just call the following function:


# Execute the function 'table_apply_func_sklearn_booster' on the table 'data_oml'
output_proxy = oml.table_apply(data=data_oml, func=table_apply_func_sklearn_booster)
 
model, score = output_proxy.pull()
print("Score = " + str(score))

The objects we return inside the function table_apply_func_sklearn_booster() get returned into our local Python session as proxy objects (pointers to the real objects which still reside inside the Oracle Database). To fetch the real objects into the local session, we can cull the pull() on the proxy objects.

Storing Python objects in Oracle Databases

What if we would like to reuse the classifier we trained previously some time in the future?

One way would be to store it in the local filesystem. However, this might be quite insecure. Another way is to store them as binary large objects (BLOBs) in the Oracle Database. To do this, we can make use of datastores, namely the oml.ds.save() function:

# Save the previously trained model in the datastore 'my_datastore'
oml.ds.save(objs={'gbc_model':model}, name="my_datastore",
            description = "A trained gradient boosting classifier", overwrite=True)

You can now safely delete the current model, as it has been persisted in the database. When you would like to load the model back into the current Python session, you can use the oml.ds.load() function:


# Load the previously stored model from the datastore
oml.ds.load(name="my_datastore", objs=["gbc_model"])

This will put the model variable back into the global namespace of the current Python session. The model can be accessed by the name it has been assigned during storing. In our case, that would be gbc_model.

 

Graph machine learning features

In this blog post, we cover how graph-based features can boost standard machine learning. However, PGX provides more complex graph machine learning functionalities like vertex representations (DeepWalk, Graph Convolutional Networks) as well as sub-graph representations (PG2Vec).

 

Explore machine learning and data science at Oracle. 

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.