JSON is a popular lightweight data structure used by Big Data. Increasingly, a lot of data produced by Big
Data are in JSON format. For example, web
logs generated in the middle tier web servers are likely in JSON format. NoSQL database vendors have chosen JSON as
their primary data representation. Moreover, the JSON format is widely used in the RESTful style Web
services responses generated by most popular social media websites like
Facebook, Twitter, LinkedIn, etc. This
JSON data could potentially contain wealth of information that is valuable for
business use. So it is important that we
can bring this data over to Data Miner for analysis and mining purposes.
Oracle database 184.108.40.206 provides ability to store and query JSON data. To take advantage of the database JSON
support, the upcoming Data Miner 4.1 added a new JSON Query node that allows
users to query JSON data as relational format. In additional, the current Data Source node and Create Table node are
enhanced to allow users to specify JSON data in the input data source.
In this blog, I will show you how to specify a JSON data in the input data
source and use JSON Query node to selectively query desirable attributes and
project the result in relational format. Once the data is in relational format, users can treat it as a normal
relational data source and start analyzing and mining it immediately. The Data Miner repository installation
installs a sample JSON dataset ODMR_SALES_JSON_DATA, which I will be using it
here. However, Oracle
Big Data SQL supports queries against vast amounts of big data
stored in multiple data sources, including Hadoop. Users can view and analyze data from various
data stores together, as if it were all stored in an Oracle database.
The Data Source node and Create Table nodes are enhanced to allow users to
specify the JSON data type in the input data source.
For this demo, we will focus on the Data Source node. To specify JSON data, create a new workflow
with a Data Source node. In the Define
Data Source wizard, select the ODMR_SALES_JSON_DATA table. Notice there is only one column (JSON_DATA)
in this table, which contains the JSON data.
Click Next to go to the next step where it shows the JSON_DATA is selected
with the JSON(CLOB) data type. The JSON
prefix indicates the data stored is in JSON format; the CLOB is the original
data type. The JSON_DATA column is
defined with the new “IS JSON” constraint, which indicates only valid JSON
document can be stored there. The UI can
detect this constraint and automatically select the column as JSON type. If there was not a “IS JSON” constraint
defined, the column would be shown with a CLOB data type. To manually designate a column as a JSON
type, click on the data type itself to bring up a in-place dropdown where it
lists the original data type (e.g. CLOB) and a corresponding JSON type (e.g.
JSON(CLOB)), so just select the JSON type. Note: only the following data types can be set to JSON type: VARCHAR2,
CLOB, BLOB, RAW, NCLOB, and NVARCHAR2.
Click Finish and run the node now.
Once the node is run successfully, open the editor to examine the generated
Notice the message “System Generated Data Guide is available” at the bottom
of the Selected Attributes listbox. What
happens here is when the Data Source node is run, it parsed the JSON documents
to produce a schema that represents the document structure. Here is what the schema looks like:
The JSON Path expression syntax and associated data type info (OBJECT,
ARRAY, NUMBER, STRING, BOOLEAN, NULL) are used to represent JSON document
structure. We will refer to this JSON
schema as Data Guide throughout the product.
Before we look at the Data Guide in the UI, let’s look at the settings that
can affect how it is generated. Click
the “JSON Settings…” button to open the JSON Parsing Settings dialog.
The settings are described below:
· Generate Data Guide if necessary
o Generate a Data Guide if it is not already
generated in parent node.
o Sample JSON documents for Data Guide generation.
· Max. number of documents
o Specify maximum number of JSON documents to be
parsed for Data Guide generation.
· Limit Document Values to Process
o Sample JSON document values for Data Guide
· Max. number per document
o Specify maximum number of JSON document scalar
values (e.g. NUMBER, STRING, BOOLEAN, NULL) per document to be parsed for Data
The sampling option is enabled by default to prevent long-running parsing of
JSON documents; parsing could take a while for large number of documents. However, users may supply a Data Guide
(Import from File) or reuse an existing Data Guide (Import from Workflow) if
compatible Data Guide is available.
Now let’s look at the Data Guide, go back to the Edit Data Source Node
dialog, select the JSON_DATA column and click the above to open the Edit Data Guide dialog. The dialog shows the JSON structure in a
hierarchical tree view with data type information. The “Number of Values Processed” shows the
total number of JSON scalar values was parsed to produce the Data Guide.
Users can control whether to enable Data Guide generation or import a
compatible Data Guide via the menu under the icon.
The menu options are described below:
o Use the “Generate Data Guide if necessary”
setting found in the JSON Parsing Setting dialog (see above).
o Always generate a Data Guide.
o Do not generate a Data Guide.
· Import From Workflow
o Import a compatible Data Guide from a workflow
node (e.g. Data Source, Create Table). The option will be set to Off after the import (disable Data Guide
· Import From File
o Import a compatible Data Guide from a file. The option will be set to Off after the
import (disable Data Guide generation).
Users can also export the current Data Guide to a file via the icon.
In Data Miner 4.1, a new JSON Query node is added to allow
users to selectively bring over desirable JSON attributes as relational format.
The JSON Query node is added to the Transforms group of the
Let’s create a JSON Query node and connect the Data Source node to it.
Double click the JSON Query node to open the editor. The editor consists of four tabs, and these
tabs are described as followings:
The Column dropdown lists all available columns in
the data source where JSON structure (Data Guide) is found. It consists of the following two sub tabs:
o Show the JSON structure of the selected column
in a hierarchical tree view.
o Show sample of JSON documents found in the selected
column. By default it displays first
2,000 characters (including spaces) of the documents. Users can change the sample size (max. 50,000
chars) and run the query to see more of the documents.
· Addition output
o Allow users to select any non-JSON columns in
the data source as additional output columns.
o Allow users to define aggregations of JSON
o Output Columns
o Show columns in the generated relational output.
o Output Data
o Show data in the generated relational output.
Let’s select some JSON attributes to bring over. Skip the SALES attributes because we want to
define aggregations for these attributes (QUANTITY_SOLD and AMOUNT_SOLD).
To peek at the JSON documents, go to the Data tab. You can change the Sample Size to look at
more JSON data. Also, you can search for
specific data within the displayed documents by using the search control.
If you have any non-JSON columns in the data source that you
want to carry over for output, you can select those columns here.
Let’s define aggregations (use SUM function) for
QUANTITY_SOLD and AMOUNT_SOLD attributes (within the SALES array) for each
customer group (group by CUST_ID).
Click the icon in the top toolbar to open the Edit Group
By dialog, where you can select the CUST_ID as the Group-By attribute. Notice the Group-By attribute can consists of
Click OK to return to the Aggregate tab, where you can see
the selected CUST_ID Group-By attribute is now added to the Group By Attributes
table at the top.
Click the icon in the bottom toolbar to open the Add
Aggregations dialog, where you can define the aggregations for both
QUANTITY_SOLD and AMOUNT_SOLD attributes using the SUM function.
Next, click the icon in the toolbar to open the Edit Sub Group
By dialog, where you can specify a Sub-Group By attribute (PROD_ID) to
calculate quantity sold and amount sold per product per customer.
Specifying a Sub-Group By column creates a nested table; the
nested table contains columns with data type DM_NESTED_NUMERICALS.
Click OK to return to the Aggregate tab, where you can see
the defined aggregations are now added to the Aggregation table at the bottom.
Let’s go to the Preview tab to look at the generated
relational output. The Output Columns
tab shows all output columns and their corresponding source JSON
attributes. The output columns can be
renamed by using the in-place edit control.
The Output Data tab shows the actual data in the generated
Click OK to close the editor when you are done. The generated relational output is
single-record case format; each row represents a case. If we had not defined the aggregations for
the JSON array attributes, the relational output would have been in
multiple-record case format. The
multiple-record case format is not suitable for building mining models except
for Association model (which accepts transactional data format with transaction
id and item id).
Here is an example of how JSON Query node is used to project
the JSON data source to relational format, so that the data can be consumed by
Explore Data node for data analysis and Class Build node for building models.
This blog shows how JSON data can be brought over to Data
Miner via the new JSON Query node. Once
the data is projected to relational format, it can easily be consumed by Data
Miner for graphing, data analysis, text processing, transformation, and modeling.