This is the second installment of the series "Using RDF Knowledge Graphs in the Oracle Public Cloud." In this blog post, we will use SQL Developer to load some publicly available RDF data into our DBCS instance and execute a few SPARQL queries with SQL Developer's SPARQL query editor. Click here for part 1 of this series.
The data we will use comes from the Linked Geo Data project: http://linkedgeodata.org/About
The project provides several downloadable datasets, which are indexed here: https://hobbitdata.informatik.uni-leipzig.de/LinkedGeoData/downloads.linkedgeodata.org/releases/
We will be using a 1.2 million triple dataset of sports facilities: https://hobbitdata.informatik.uni-leipzig.de/LinkedGeoData/downloads.linkedgeodata.org/releases/2015-11-02/2015-11-02-SportThing.node.sorted.nt.bz2
Download SportThing.node.sorted.nt.bz2 to your client computer and use a program such as WinSCP to copy the file to your DBCS instance. Refer to the DBCS user guide for more information on how to copy files to/from your DBCS instance. Detailed instructions for WinSCP can be found in the HOW-TO document available at the end of this blog post. In this example, we have copied the file to /home/oracle on the DBCS instance.
We will use a normal database user with connect, resource, and appropriate tablespace privileges to perform the load operation. Recall that we used the SYSTEM user to create and configure our semantic network in the previous installment. Open a connection for your desired database user in SQL Developer. We are using RDFUSER in this example. Expand the RDFUSER connection and expand the RDF Semantic Graph component.
The first step is to create a semantic model to hold our RDF dataset. Right click on Models and choose New Model.
Enter a Model Name and choose to create a new Application table with TRIPLE column. Choose the tablespace you used when creating the semantic network for Model tablespace. Click Apply.
We have now created a model to hold our RDF data. If you expand Models and Regular Models under RDF Semantic Graph, you should see the LGD_SPORT model we created.
Now, we will bulk load the downloaded RDF file. The bulk load process consists of two major steps:
Loading the file from the file system into a simple staging table in the database.
Loading data from the staging table into our semantic model.
The first step involves loading from an external table, so we need to use the SYSTEM connection to create a DIRECTORY in the database and grant privileges on it to RDFUSER.
Expand the SYSTEM connection and right-click on Directories. Then choose Create Directory.
Enter a Directory Name and the full path for the directory on the database server. Click Apply.
Expand Directories and click the directory name to see details.
Now we need to grant privileges on this directory to RDFUSER. Click actions and select Grant.
Grant READ and WRITE privileges to RDFUSER. Click Apply.
RDFUSER now has access to the /home/oracle directory on our DBCS instance.
Before loading into the staging table, we need to do a few operations from the Unix command line. The downloaded RDF file is compressed, so we will use a Unix named pipe to stream the uncompressed data rather than storing an uncompressed version of the file.
Use PuTTY to make an SSH connection to the remote DBCS instance as the oracle user (see the DBCS user guide for more information on how to make an SSH connection). Then execute the following commands to create a named pipe to stream the uncompressed data.
bzcat 2015-11-02-SportThing.node.sorted.nt.bz2 > named_pipe.nt &
Now, expand the RDFUSER connection in SQL Developer and expand the RDF Semantic Graph component. Then right click on Models and select “Load RDF data into staging table (External Table).”
Choose a name for the external table to create (we are using LGD_EXT_TAB) and fill in the other fields on the Source External Table tab.
Enter the names of the files to load (named_pipe.nt in this example) on the Input Files tab.
Finally, use the Staging Table tab to enter a name for the staging table that will be created (we are using LGD_STG_TAB) and choose the appropriate format.
Now click Apply to load the data into LGD_STG_TAB. Check the contents of LGD_STG_TAB.
Next, we will load the data from LGD_STG_TAB into the LGD_SPORT semantic model. To bring up the bulk load interface, expand RDF Semantic Graph under the RDFUSER connection. Then, right-click on Models and select “Bulk Load into model from staging table”.
Enter LGD_SPORT for Model and unselect the Create model option since we have already created this semantic model. Also, choose the LGD_STG_TAB for Staging table name. Be careful not to select the external table (LGD_EXT_TAB), as it will also be listed. Consult the user guide for more information on the other options for bulk load. Click Apply, and the load will finish in a minute or so.
Now that we have completed the bulk load, it is a good idea to gather statistics on the whole RDF network. Only a privileged user can gather statistics on the whole RDF network, so we need to use the SYSTEM user or some other DBA user. Expand the RDF Semantic Graph component under the SYSTEM connection, right click on RDF Semantic Graph and select Gather Statistics.
Enter the desired Degree of parallelism and click Apply.
After gathering statistics, the data is ready to query.
Now, we will use SQL Developer’s SPARQL query editor to query our dataset. Go back to the RDFUSER connection and expand Models and Regular Models under RDF Semantic Graph. Clicking LGD_SPORT will open the SPARQL query editor for this semantic model.
You can edit and execute SPARQL queries here. In addition, several pre-created templates are available.
Click Templates -> Demographics -> COUNT ALL to count all triples in the LGD_SPORT model.
Click Yes when the warning comes up.
Click the green triangle to run the query.
You can also directly edit SPARQL queries. The following example shows a SPARQL query to get the top 10 properties and their triple counts.
In addition to SPARQL SELECT queries, CONSTRUCT and DESCRIBE queries are also supported. The query below describes a particular resource in the LGD_SPORT model. Note that any namespace PREFIXes used in the query will also be used to simplify values in the query result. Here we have added several more PREFIXes.
That’s it. We have successfully bulk loaded a publicly available RDF dataset into our DBCS instance and executed some queries with SQL Developer’s SPARQL query editor. In the next installment, we will setup a W3C-standard SPARQL endpoint to provide a REST interface.
A detailed HOW-TO is available HERE.