This philosophical phrase translates very well to business: all companies try to plan for the future as well as they can in a constant atmosphere of uncertainty. To reduce this uncertainty, companies look at the past for clues about what might happen. The challenge is getting the maximum value from past experience: finding the patterns that really matter.
As an example, let’s take the prediction of future sales. Companies need an accurate prediction in order to buy the right amount of stock, schedule the right number of employees, et cetera. In many cases, companies use straightforward econometric models that often simply extrapolate historical demand. In other words, to predict demand, the only data they learn from is the historical values of the demand itself. The danger with these models is that there’s no true understanding of why demand goes up or down.
Nowadays we can do much better than simple extrapolation. Machine learning can create more accurate predictions by taking a range of relevant data sources into consideration. The magic trick is to let the machine learn which factors really influence demand, and how much of an influence each factor has. We basically throw a lot of data at an algorithm, and it automatically identifies the pieces of data that are relevant to solving the puzzle. The good news is that companies often already have the relevant data sources in house, in their databases!
In this post, I will go through a practical example of predicting customer demand with machine learning. We will use the Oracle Autonomous Data Warehouse Cloud with its built-in machine learning capabilities. What I especially like about this approach, compared to using an external ML tool, is that it lets us work directly with the data at its source, without any data movement.
Keep reading if you like, or watch the video here if you so prefer:
One of the most important tasks is to identify the data sources that have predictive value. Imagine that our company has three datasets:
1. Weather information: I suspect that sales will be higher with better weather.
2. A calendar with basic working day/holiday info: sales are typically not the same in the weekend as on weekdays.
3. Historical sales data: I’m convinced that the historical values of sales are related to the future values of sales; I just don’t know exactly how.
In our case, we’re going to treat this as a supervised learning problem. There is quite a lot of mathematics involved, but the process can also be explained on an intuitive level:
Let’s call the thing that we try to predict the target, in this case tomorrow’s sales. And let’s call the variables that we believe influence the target the features, in this case the weather columns, the calendar information, and the historical sales volumes.
In supervised learning, we learn from examples. We present the algorithm with the value of our target at a certain moment in time, together with the values of the corresponding features at that same moment. This allows the algorithm to calculate the relationship between the features and the target for that example case. We do this for a large set of historical examples so that the algorithm can come up with an “average formula” that best describes the relationship between the features and the target. This formula is called the model.
Later on, when we want to predict a value, we take the values of the features and feed them into the model to predict our target value. In our case, we take a weather prediction for tomorrow, calendar information and today’s sales (=historical sales) as inputs and let the model calculate the expected sales for tomorrow.
There are many different types of algorithms that find and describe these relationships in their own way. We will use the Support Vector Machines algorithm.
The Oracle Autonomous Data Warehouse comes with a visual component that helps you build models and run predictions: the Zeppelin notebook. The great thing is that you can use it as an investigation playground: you are free to mix SQL with comments, charts, et cetera.
Before we can present our training data to the algorithm, we must first prepare it. An important thing to know is that we typically must have all related features and the target variable in one record. So in our case, our goal is to create a table or a view that holds a column with the actual sales volume (target) for a certain moment in time, and columns for the corresponding weather and calendar information for that same moment in time. Here’s the SQL to do this:
create table dem_full_set as
select sum(s.volume) as volume, s.day, s.daynum, s.dayhour,
       d.season, d.holiday, d.workingday,
       w.weather, w.temp, w.atemp, w.humidity, w.windspeed
from dem_sale_v s
   , dem_day d
   , dem_weather w
where d.day = s.day
and w.dayhour = s.dayhour
group by s.day, s.daynum, s.dayhour, d.season, d.holiday, d.workingday,
         w.weather, w.temp, w.atemp, w.humidity, w.windspeed;
If you paid close attention, you might notice that there’s one feature still missing from this dataset. One feature that I want the algorithm to learn from is the historical value of sales. Remember how all relevant attributes for a data point have to be in a single row? Although the historical values of sales are in the dataset, we don’t yet have a historical sales value per record. So to complete our data preparation, I will take the value of yesterday’s sales for the same hour and add it to each record.
This SQL will create the new column:
alter table dem_full_set add volume_this_hr_yesterday number;
And this will actually fill the value of the column:
merge into dem_full_set fs
using (select dayhour, volume from dem_full_set) new
on (fs.dayhour - 1 = new.dayhour)
when matched then update set fs.volume_this_hr_yesterday = new.volume;
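To make the intent of this merge concrete, here is a small Python sketch of the same lag-feature idea on invented toy records: each row gets the volume of the same hour one day earlier attached as a new attribute.

```python
# Sketch of the lag-feature step: for each (day, hour) record, look up
# the volume at the same hour one day earlier and attach it as a new
# column. Toy data; the structure is illustrative only.
records = [
    {"day": 1, "hour": 9,  "volume": 120},
    {"day": 1, "hour": 10, "volume": 150},
    {"day": 2, "hour": 9,  "volume": 130},
    {"day": 2, "hour": 10, "volume": 160},
]

# Index volumes by (day, hour) so we can look up "yesterday, same hour".
by_slot = {(r["day"], r["hour"]): r["volume"] for r in records}

for r in records:
    # Same hour, previous day; None when no history exists yet.
    r["volume_this_hr_yesterday"] = by_slot.get((r["day"] - 1, r["hour"]))

print(records[2]["volume_this_hr_yesterday"])  # day 2, 09:00 gets day 1's 120
```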
Let’s have a look at the resulting dataset so far:
In this step our goal is to restructure the dataset in such a way that our algorithm can pick up important elements more easily. For example, I believe that the TIME of day is an important feature, as it might hold a clue to the level of sales. Therefore I separated the TIME element from the general DATE/TIME column of each transaction, as follows:
alter table dem_full_set add hr number(1);
update dem_full_set set hr = to_number(to_char(dayhour, 'HH24'));
And the result is:
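The hour extraction itself is a one-liner in most languages; a Python equivalent of to_char(dayhour, 'HH24') would be:

```python
# Sketch of extracting the hour-of-day feature from a timestamp,
# mirroring to_char(dayhour, 'HH24') in the SQL above.
from datetime import datetime

dayhour = datetime(2019, 5, 14, 17, 0)  # a sample transaction timestamp
hr = int(dayhour.strftime("%H"))        # 24-hour clock, as a number

print(hr)
```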
Now that we have our data prepared, we have to split it into a part for training and a part for testing. The testing is done to check the quality of the model. The idea is simple: by taking a set of known historical data points, we can compare the actual value of sales with the predicted value of sales. It’s important that the records we use for testing are not included in the training batch, to prevent our model from being influenced by this “insider knowledge”. In my case I’ve chosen to reserve 10% of the records for testing and the rest for training.
Again, we can do this with SQL:
create table dem_train as select * from dem_full_set where mod(daynum, 10) <> 0;
create table dem_test as select * from dem_full_set where mod(daynum, 10) = 0;
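The effect of this mod-based split can be sketched in Python on invented day numbers: every day whose number is divisible by 10 lands in the test set, giving a 90/10 split while keeping all records of a given day together.

```python
# Sketch of the mod-based train/test split: days whose number is
# divisible by 10 go to the test set, the rest to the training set.
# Day numbers are invented for the example.
daynums = list(range(1, 101))  # 100 days of toy data

train = [d for d in daynums if d % 10 != 0]
test  = [d for d in daynums if d % 10 == 0]

print(len(train), len(test))  # 90 days for training, 10 for testing
```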
Building the model is done using a PL/SQL package. The model itself will be stored as a special database object.
Apart from the training dataset, we have to pass a number of hyperparameters to the build procedure. This happens in a rather special way: We first insert the parameters into a separate table and then pass the name of that table to the build procedure.
Here’s how I created my hyperparameters table and filled it with relevant values for the algorithm:
create table dem_regr_param(setting_name varchar2(30), setting_value varchar2(4000));
insert into dem_regr_param(setting_name, setting_value) values (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
insert into dem_regr_param(setting_name, setting_value) values (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
insert into dem_regr_param(setting_name, setting_value) values (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_gaussian);
insert into dem_regr_param(setting_name, setting_value) values (dbms_data_mining.svms_complexity_factor, 15);
insert into dem_regr_param(setting_name, setting_value) values (dbms_data_mining.svms_epsilon, 0.3);
And this brings us to the point where we actually build the model:
declare
  v_xlst dbms_data_mining_transform.transform_list; --Empty transform list; automatic data preparation does the work
begin
  dbms_data_mining.create_model(
    model_name          => 'DEM_REGR',                   --The name we give to the resulting model
    mining_function     => dbms_data_mining.regression,  --We use Support Vector Machines as a regression algorithm
    data_table_name     => 'DEM_TRAIN',                  --The name of our input data to train on
    case_id_column_name => 'ID',                         --The column that uniquely identifies each row
    target_column_name  => 'VOLUME',                     --The column to predict; all other columns are automatically treated as features
    settings_table_name => 'DEM_REGR_PARAM',             --The name of the hyperparameters table that we created earlier
    xform_list          => v_xlst);
end;
/
After this we have a model by the name of DEM_REGR that’s ready to go!
Now we have arrived at the testing stage. To see how well our model performs, we’re going to let it make predictions on our testing dataset. Remember that we also have the actual sales volume for that dataset, so this allows us to evaluate how close the model’s prediction is for each record.
Let’s first add a placeholder for the prediction to the testing dataset:
alter table dem_test add predicted_volume number(10);
The following shows the syntax to run the prediction:
update dem_test set predicted_volume = PREDICTION(DEM_REGR USING *);
Notice how simple the syntax is? That’s all it takes to create a prediction. Basically, we’re telling the algorithm to use the DEM_REGR model and all of the available feature columns.
Let’s compare the 24 predictions for a random day, in this case, day number 10:
select daynum * 24 + hr, sum(volume) as actual_volume, sum(predicted_volume) as predicted_volume
from dem_test
where daynum = 10
group by daynum, hr
order by daynum, hr;
Next, we configure a chart as follows:
My first impression is that the prediction (in light blue) resembles the actual sales volume (dark blue) quite closely.
Let’s try the same query for another day, the 20th day in this case.
Day 20 is a weekend day. Clearly, the sales pattern during a weekend day is very different, but also here the prediction seems to align very well with reality.
The visual approach is enough to get a feeling for the accuracy, but it’s not very exact. We want to express the performance of the model as a number, and in this case we’ll use the Root Mean Square Error (also known as Root Mean Square Deviation). What that basically means is that we calculate the error for each data point, square those errors, average the squares, and take the square root of the result. Therefore, the lower this number, the better. The exact formula is: RMSE = √( Σᵢ (predictedᵢ − actualᵢ)² / n ).
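For illustration, the same computation in a few lines of Python, with invented values:

```python
# Root Mean Square Error: square each prediction error, average the
# squares, then take the square root. Toy values for illustration.
from math import sqrt

actual    = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]

rmse = sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
print(rmse)  # 10.0 -- every prediction is off by exactly 10
```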
But how can we know if a certain number is good or bad? Typically we do this by comparing the scores of different types of models. And today we will compare the score of our regression model with the score of the “Naïve” model. Naïve means that we simply use the value of today’s sales as the prediction of the sales of tomorrow.
Let’s go ahead and calculate the scores for both of these models:
select sqrt(avg((predicted_volume - actual_volume) * (predicted_volume - actual_volume))) as rmse
     , sqrt(avg((naive_predicted_volume - actual_volume) * (naive_predicted_volume - actual_volume))) as rmse_naive
from (select daynum, sum(volume) as actual_volume, sum(predicted_volume) as predicted_volume, sum(volume_this_hr_yesterday) as naive_predicted_volume
      from dem_test
      group by daynum);
And the result is:
So we can conclude that we’ve improved our prediction quality considerably compared to a simple naïve model!
Using machine learning for forecasting lets us find the real drivers behind whatever we’re trying to predict. Because its predictions are based on the real driving factors, they are more accurate than models that simply extrapolate historical data of the same target.
The machine learning notebooks in Autonomous Data Warehouse make the development of machine learning processes very easy. In many cases the data that we need for our predictions is stored in the database and applying machine learning directly at the source eliminates the need for complex data operations. Apart from that, I think that many developers will be positively surprised that their experience with SQL already covers a large part of the skillset required for machine learning!
Looking at the next step: for a business to really reap the benefits of machine learning, the resulting predictions must be integrated into their end user applications. Otherwise, it stays a theoretical exercise. The good news here is that these applications already communicate with databases, so it should be a fairly small effort to adapt them to pick up these predictions.
If you want to try this yourself, get a free Oracle Cloud trial account here and create your Autonomous Data Warehouse.