How to generate Scatterplot Matrices using R script in Data Miner
By Denny Wong-Oracle on Feb 03, 2014
Data Miner provides Explorer node that produces descriptive statistical data and histogram graph, which allows analyst to analyze input data columns individually. Often time an analyst is interested in analyzing the relationships among the data columns, so that he can choose the columns that are closely correlated to the target column for model build purpose. To examine relationships among data columns, he can create scatter plots using the Graph node.
For example, an analyst may want to build a regression model that predicts the customer LTV (long term value) using the INSUR_CUST_LTV_SAMPLE demo data. Before building the model, he can create the following workflow with the Graph node to examine the relationships between interested data columns and the LTV target column.
In the Graph node editor, create a scatter plot with an interested data column (X Axis) against the LTV target column (Y Axis). For the demo, let’s create three scatter plots using these data columns: HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT.
Here are the scatter plots generated by the Graph node. As you can see the HOUSE_OWNERSHIP and N_MORTGAGES are quite positively correlated to the LTV target column. However, the MORTGAGE_AMOUNT seems less correlated to the LTV target column.
The problem with the above approach is it is laborious to create scatter plots one by one and you cannot examine relationships among those data columns themselves. To solve the problem, we can create a Scatterplot matrix graph as the following:
This is a 4 x4 scatterplot matrix of data column LTV, HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT. In the top row, you can examine the relationships between HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT against the LTV target column. In the second row, you can examine the relationships between LTV, N_MORTGAGES, and MORTGAGE_AMOUNT against the HOUSE_OWNERSHIP column. In the third and forth rows, you can examine the relationships of other columns against the N_MORTGAGES, and MORTGAGE_AMOUNT respectively.
To generate this scatterplot matrix, we need to invoke the readily available R script RQG$pairs (via the SQL Query node) in the Oracle R Enterprise. Please refer to http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html?ssSourceSiteId=ocomen for Oracle R Enterprise installation.
Let’s create the following workflow with the SQL Query node to invoke the R script. Note: a Sample node may be needed to sample down the data size (e.g. 1000 rows) for large data set before it is used for charting.
Enter the following SQL statement in the SQL Query editor. The rqTableEval is a R SQL function that allows user to invoke R script from the SQL side. The first SELECT statement within the function specifies the input data (LTV, HOUSE_OWNERSHIP, N_MORTGAGES, and MORTGAGE_AMOUNT). The second SELECT statement specifies the optional parameter to the R script, where we define the graph title “Scatterplot Matrices”. The output of the function is an XML document with the graph data embedded in it.
SELECT VALUE FROM TABLE
from "INSUR_CUST_LTV_SAMPLE_N$10001"), -- Input Cursor
cursor(select 'Scatterplot Matrices' as MAIN from DUAL), -- Param Cursor
'XML', -- Output Definition
'RQG$pairs' -- R Script
You can see what default R scripts are available in the R Scripts tab. This tab is visible only when the Oracle R Enterprise installation is detected.
Click the button in the toolbar to invoke the R script to produce the Scatterplot matrix below.
You can copy the Scatterplot matrix image to a clipboard or save it to an image file (PNG) for reporting purpose. To do so, right click on the graph to bring up the pop-up menu below.
The Scatterplot matrix is also available in the Data Viewer of the SQL Query node. To open the Data Viewer, select the “View Data” item in the pop-up menu of the node.
The returning XML data is shown in the Data Viewer as shown below. To view the Scatterplot matrix embedded in the data, click on the XML data to bring up the icon in the far right of the cell, and then click on the icon to bring up the viewer.