Using machine learning to understand your data

January 28, 2022 | 11 minute read

Building high-performance machine learning models with SQL using Oracle Machine Learning for SQL (OML4SQL)

With the Oracle Machine Learning for SQL (OML4SQL) component within Oracle Database (previously known as Oracle Data Mining), you can train, evaluate, and deploy machine learning (ML) models within a database using SQL statements. OML4SQL lets you take advantage of in-database algorithms to perform a variety of ML tasks, such as classification, regression, and clustering.

In this article, you’ll learn how to use OML4SQL to solve a regression problem: estimating house prices based on real estate data. As the example data source, this article uses the historical market dataset of real estate valuation available from the Machine Learning Repository, a public collection of more than 550 datasets maintained by the University of California, Irvine.

By the way, did you know that SQL has become one of the most popular and powerful tools for data processing and analysis? According to Stack Overflow’s 2020 Developer Survey, SQL is the third-most-popular language among all developers.

The real estate valuation dataset

The sample dataset used in this article contains several hundred records of historical real estate data for a large city. Because OML4SQL is integrated into the database, it’s suitable for the analysis of very large datasets, but the small sample used here is perfectly fine for demonstrating how you can create and evaluate an ML model in Oracle Database using SQL.

To prepare for this article’s example, download the sample dataset, Real_estate_valuation_data_set.csv. The comma-separated value file was converted from Real estate valuation data set.xlsx taken from the Machine Learning Repository. To make the conversion, I used the XLSX Converter tool.

Understanding the data. It is worth spending some time to examine the sample dataset used in this article. To start, you might want to look at the input and output variables to be used in an ML model. For that, check out the “Attribute Information” section on the dataset page. This section gives a short description of the input and output variables included in the dataset. The names of the input variables start with X and include an ordinal (X1 through X6), and the name of the output variable is Y, as follows:

X1 is the transaction date (for example, 2013.250 stands for 2013 March, 2013.500 stands for 2013 June, and so on)
X2 is the house age (unit: year)
X3 is the distance to the nearest mass rapid transit (MRT) station (unit: meter)
X4 is the number of convenience stores available in the living circle on foot (integer)
X5 is the geographic coordinate, latitude (unit: degree)
X6 is the geographic coordinate, longitude (unit: degree)

Y is the house price of the unit area

The example given for the first input variable, X1 (transaction date), shows that the data in this field has been preprocessed so that the month is encoded as a three-digit decimal fraction of the year (for example, 2013.500 stands for June 2013, because June is the sixth of twelve months). Converting text into numbers is one of the most common techniques used when preparing data for ML, because many ML algorithms are designed to work with numerical values rather than textual values.
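
To make the encoding concrete, here is a minimal Python sketch of the transformation (my own illustration of the dataset’s convention, not part of any OML4SQL pipeline): the month number is divided by 12 and appended to the year as a three-digit fraction.

```python
def encode_transaction_date(year, month):
    """Encode a (year, month) pair the way the dataset does:
    the month becomes a three-digit fraction of the year (month / 12)."""
    return round(year + month / 12, 3)

# A few values matching the dataset's convention
print(encode_transaction_date(2013, 3))   # 2013.25  (2013 March)
print(encode_transaction_date(2013, 6))   # 2013.5   (2013 June)
print(encode_transaction_date(2012, 11))  # 2012.917 (2012 November)
```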

Examining the subsequent input variables in the dataset, you can observe that some of them provide data in quite a raw form. For example, the X3 variable contains the distance to the nearest MRT station in meters. From the standpoint of ML algorithms, however, it might be more useful to have a distance type instead, such as by foot or by public transport, possibly with several subcategories based on the number of minutes spent on the way to the station.
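
As a sketch of what such a categorization could look like (the cutoff values below are invented for this illustration, not derived from the dataset):

```python
def distance_category(meters):
    """Map the raw distance to the nearest MRT station (in meters)
    to a coarse category. The cutoffs are illustrative only."""
    if meters < 400:
        return "short_walk"        # roughly a 5-minute walk or less
    elif meters < 1200:
        return "long_walk"
    else:
        return "public_transport"  # likely too far to walk

print(distance_category(84.87882))  # short_walk
print(distance_category(561.9845))  # long_walk
```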

The X5 and X6 variables contain the geographic coordinates of a real estate object. But again, it would be much more useful to have the name (or numeric code) of the area instead. To extract an area name from geographic coordinates, you need access to a geocoding API or database. However, a discussion of such techniques is beyond the scope of this article.

You might also want to look at the actual data found in the .csv file.

No,  X1,       X2,   X3,       X4,  X5,       X6,        Y
1,   2012.917, 32,   84.87882, 10,  24.98298, 121.54024, 37.9
2,   2012.917, 19.5, 306.5947, 9,   24.98034, 121.53951, 42.2
3,   2013.583, 13.3, 561.9845, 5,   24.98746, 121.54391, 47.3
...
414, 2013.500, 6.5,  90.45606, 9,   24.97433, 121.5431,  63.9

You may notice that the X5 and X6 geocoordinate values are each within a very narrow range. This reveals the need to solve another common preprocessing problem: transforming values by scaling each value to a given range. In this case, you could arrange the values on a scale from 0 to 1 to make the data more suitable for training an ML algorithm.
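
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. A minimal Python sketch of the idea, applied to the latitude values from the rows shown above:

```python
def minmax_scale(values):
    """Linearly rescale a sequence of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The X5 (latitude) values from the first rows of the dataset
latitudes = [24.98298, 24.98034, 24.98746, 24.97433]
print([round(v, 3) for v in minmax_scale(latitudes)])
```

The narrow band of latitudes now spans the full 0-to-1 range, which is a more convenient shape for many training algorithms.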

To perform this transformation on the data stored in Oracle Database, you can take advantage of the DBMS_DATA_MINING_TRANSFORM PL/SQL package used for Oracle Machine Learning, employing the INSERT_NORM_LIN_MINMAX procedure, in particular, from this package. To keep things simple, however, this article won’t go into those details.

Loading the data into a database table. Once you have the data in a .csv file, one simple way to make it available in the database is with the external tables feature. This feature lets you access a large number of records in an external file as if they were in a database table (referred to as an external table), with each record in the file appearing as its own row in the table.

Before you can create an external table, create a directory object in the database to specify an alias for the file system directory in which the file of interest is located. Next, grant read and write permissions on this directory object to your database schema. After that, you’ll be able to create the external table. The real_estate.sql file accompanying this article contains the SQL statements required to accomplish all these operations and create the external table for the sample data; they are as follows:

CONN sys AS sysdba

--you'll be prompted to enter the password

CREATE DIRECTORY dataset_dir AS '/home/oracle/Downloads';
GRANT read,write ON DIRECTORY dataset_dir TO usr;
GRANT create table TO usr;

CONN usr/pswd

DROP TABLE real_estate;
CREATE TABLE real_estate
      (No NUMBER(3),
       X1  NUMBER(8, 3),
       X2  NUMBER(4, 1),
       X3  NUMBER(12, 5),
       X4  NUMBER(3),
       X5  NUMBER(10, 5),
       X6  NUMBER(10, 5),
       Y   NUMBER(4, 1))
    ORGANIZATION EXTERNAL
      (TYPE ORACLE_LOADER
      DEFAULT DIRECTORY dataset_dir
      ACCESS PARAMETERS
      (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
      )
      LOCATION ('Real_estate_valuation_data_set.csv')
     )
REJECT LIMIT UNLIMITED
/

Now you have the external table containing the sample dataset rows and you can query it as a regular database table, as follows:

SELECT No,X1,X2,X3,X4,X5,X6,Y FROM real_estate;

The produced output should look as follows:

        NO        X1       X2         X3       X4          X5         X6        Y
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
        1   2012.917        32   84.87882      10       24.98298   121.54024       37.9
        2   2012.917      19.5   306.5947       9       24.98034   121.53951       42.2
        3   2013.583      13.3   561.9845       5       24.98746   121.54391       47.3
       ...
       414    2013.5       6.5   90.45606       9       24.97433   121.5431        63.9
414 rows selected.

As you can see, both the structure and data of the dataset have been correctly loaded into the table.

Implementing an ML model

Now that you have the data in a database table, you can analyze it using in-database algorithms for performing ML. First, though, you’ll want to split the data into two sets: one for training the algorithms and the other for testing.

Splitting the data. In supervised learning (such as regression or classification), splitting the data into training and testing samples is one of the most important preprocessing steps. The training samples are used to train a predictive model, and the testing samples are used to test the model’s accuracy on data unseen during training. The following PL/SQL block splits the sample dataset into 80% training data and 20% test data:

BEGIN
    EXECUTE IMMEDIATE 'CREATE OR REPLACE VIEW training_set AS SELECT * FROM real_estate SAMPLE (80) SEED (1)';
    EXECUTE IMMEDIATE 'CREATE OR REPLACE VIEW test_set AS SELECT * FROM real_estate MINUS SELECT * FROM training_set';
END;
/

The following listing contains the statements, along with their output, showing that the split has been correctly done in accordance with the 80/20 pattern:

-- the number of rows before splitting
SQL> SELECT count(*) FROM real_estate;
  COUNT(*)
----------
       414
-- the number of rows in the training set
SQL> SELECT count(*) FROM training_set;
  COUNT(*)
----------
       332
-- the number of rows in the test set
SQL> SELECT count(*) FROM test_set;
  COUNT(*)
----------
        82
-- making sure that no rows are sampled in both sets
SQL> SELECT COUNT(*) FROM training_set train
JOIN test_set test
ON train.No = test.No;
  COUNT(*)
----------
       0

According to the listing above, the training_set view contains about 80% of the data samples, and the rest are in test_set. Moreover, there are no rows sampled in both views. It’s now time to train a model and then test it.
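
For reference, a quick arithmetic check of the row counts reported above confirms the 80/20 proportion (plain Python, outside the database):

```python
# Row counts from the listing above
total_rows, training_rows, test_rows = 414, 332, 82

# The two views should partition the data with no overlap
assert training_rows + test_rows == total_rows

print(f"training: {training_rows / total_rows:.1%}")  # training: 80.2%
print(f"test: {test_rows / total_rows:.1%}")          # test: 19.8%
```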

Building a model. This section builds a regression model on the training_set, using a generalized linear model (GLM) algorithm. Before you can create a data mining model in the database, you must grant the CREATE MINING MODEL privilege to your database schema. The CREATE TABLE and CREATE VIEW privileges are also required.

The following PL/SQL block specifies the model settings and builds the model using the DBMS_DATA_MINING.CREATE_MODEL2() procedure:

DECLARE
    v_settings DBMS_DATA_MINING.SETTING_LIST;
    BEGIN
    v_settings('PREP_AUTO') := 'OFF';
    v_settings('ALGO_NAME') := 'ALGO_GENERALIZED_LINEAR_MODEL';
    v_settings('GLMS_FTR_SELECTION') := 'GLMS_FTR_SELECTION_ENABLE';
    v_settings('GLMS_FTR_GENERATION') := 'GLMS_FTR_GENERATION_ENABLE';
   
    DBMS_DATA_MINING.CREATE_MODEL2(
      MODEL_NAME          =>  'real_estate_model',
      MINING_FUNCTION     =>  'REGRESSION',
      DATA_QUERY          =>  'SELECT * FROM training_set',
      SET_LIST            =>  v_settings,
      CASE_ID_COLUMN_NAME =>  'No',
      TARGET_COLUMN_NAME  =>  'Y');
END;
/

Use the v_settings variable to store a SETTING_LIST, through which you specify the settings for the model being created. For this example, the PREP_AUTO setting is set to OFF to disable the database’s Automatic Data Preparation feature, avoiding automatic data transformations during the model-building process. Generally, you should experiment with this feature to see which mode (ON or OFF) works best for your model and data.

By setting ALGO_NAME to ALGO_GENERALIZED_LINEAR_MODEL, you instruct OML4SQL to use the GLM algorithm for the model being created. GLM is an advanced statistical modeling technique used for classification and regression problems. The list of algorithms supported by OML4SQL is quite extensive.

Enable the feature selection option so that the model can select only those features (inputs) for analysis that are most important in predicting an output (a target attribute). Also enable feature generation to allow the model to generate new features from existing features when that might be useful for analysis.

Next, invoke the CREATE_MODEL2() procedure, passing in the following parameters: a unique name for the model being created, the ML function to be used, a SQL query to obtain the training data for the model, the list of settings for the model, the unique key column in the training dataset, and the output column (column to be predicted). If the model has been successfully created, you should see a message stating that the PL/SQL procedure successfully completed.

Testing the model. Now that you have the model, you need to make sure that it is a good fit for prediction. The following query predicts the output variable (Y) value on the test data, so that you can visually compare it with the corresponding actual value:

SELECT No, ROUND(PREDICTION(real_estate_model USING *), 1) AS
predicted_value, Y AS actual_value FROM test_set;

Here is what the output might look like.

        NO PREDICTED_VALUE ACTUAL_VALUE
---------- --------------- ------------
        2          43.1       42.2
        4          41.9       54.8
       11          34.5       41.4
       15          35.5       34.3
      ...
      414          57.9       63.9
82 rows selected.

You can also evaluate the quality of your regression model by computing the mean absolute error (MAE), a popular regression statistic, as follows:

SELECT ROUND(AVG(ABS(P.PREDICTED_Y - R.Y)),2) MAE
  FROM (SELECT No, PREDICTION(real_estate_model using *) PREDICTED_Y
          FROM test_set) P, test_set R
  WHERE P.No = R.No;

The output might look as follows:

   MAE
----------
  6.37
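
The MAE statistic computed by the query above is simply the average of the absolute differences between predicted and actual values. A minimal Python sketch of the same formula, using only the few predicted/actual pairs visible in the earlier output (so the result differs from the query’s 6.37, which covers all 82 test rows):

```python
def mean_absolute_error(predicted, actual):
    """MAE: the average absolute difference between predictions and actuals."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# The first few predicted/actual pairs from the output above
predicted = [43.1, 41.9, 34.5, 35.5]
actual = [42.2, 54.8, 41.4, 34.3]
print(round(mean_absolute_error(predicted, actual), 2))
```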

Experimenting with the model’s data and settings

Once you have a working model, it’s always good practice to experiment with the model parameters to try to improve the model’s accuracy. For example, you might want to exclude some of the input variables that you suspect reduce the accuracy of the model. Thus, in the model discussed here, you might test excluding the X5 (latitude) and X6 (longitude) input variables, since raw geocoordinates may not help accurately predict a real estate object’s value.

To implement the task of excluding selected input variables from the model, you can first create another table from the table containing the original dataset. Do this using the CREATE TABLE ... SELECT statement and select only those columns that are needed, as follows:

CREATE TABLE real_estate2 AS SELECT No,X1,X2,X3,X4,Y FROM real_estate;

Then, you need to split the data in the newly created table, as discussed earlier. Before you can re-create the model on the new dataset, you first need to drop the existing one.

BEGIN
  DBMS_DATA_MINING.DROP_MODEL(model_name => 'real_estate_model');
END;
/

Then, you can re-create the model and test it again.

Similarly, you can experiment with the model settings (for example, turn on the PREP_AUTO setting) to see if that improves the model’s accuracy. Remember, before you can re-create the model with the new settings, you first need to drop the existing one.

Conclusion

Oracle SQL provides access to the state-of-the-art features built into Oracle Database, allowing you to perform advanced data analysis within the database. In this article, you saw an example of using Oracle Machine Learning for SQL (OML4SQL), a comprehensive set of in-database ML algorithms available to you through the PL/SQL and SQL language operators. In particular, you looked at how to implement a regression task for estimating house prices.


Yuli Vasiliev

Yuli Vasiliev is a programmer, freelance author, and consultant currently specializing in open source development; Oracle database technologies; and, more recently, natural-language processing (NLP).

