Using Python and deep learning to understand your data

February 20, 2022 | 17 minute read

Use deep learning to analyze financial data with Python and the Keras library.

Machine learning is a robust method of data analysis that makes it possible to build applications capable of learning from data. Deep learning is a subfield of machine learning, and it structures algorithms in layers, allowing you to create more-accurate models. This article illustrates an example of how you can create a deep learning model for stock price analysis using Python’s Keras deep learning library. In particular, you’ll see whether the old saying “history repeats itself” works when it comes to stock price prediction implemented with deep learning algorithms.

In deep learning, a model may learn to perform regression or classification tasks directly from the raw input (such as images, sound, or text), using multiple layers to gradually extract higher-level insights from the data.

This doesn’t mean, however, that you don’t need to prepare the raw input for deep learning modeling. In some cases, you may even need to do manual feature extraction, with the expectation that the model will derive new implicit features from those manually generated ones.

The technique of manual feature extraction can be especially useful when dealing with time-series data where you have only the output (target) variable tied to points in time. In such cases, it’s natural to generate several metrics reflecting changes between nearby members of the series, so you can use those metrics as the model features. You can experiment with both the data and the model settings until you achieve the desired accuracy (or until the accuracy stops improving).

The following are the general steps for deep learning modeling:

  1. Obtain data to build a model.
  2. Prepare the data for modeling.
  3. Configure the model.
  4. Compile the model.
  5. Train the model.
  6. Evaluate the model.
  7. Repeat steps 2 through 6 until the model’s accuracy meets the goals or no longer significantly improves.

This article will show you how to apply these steps when building a deep learning model for stock price prediction in Python with Keras.

Preparing your Python environment

The article’s example uses stock data that you can obtain with the yfinance library, a Python wrapper for the Yahoo Finance API. You can install yfinance with the pip command; make sure you are getting the latest version.

pip install -U yfinance

The deep learning model (neural network) for stock price prediction will be configured and trained with Keras, which requires the TensorFlow platform. You’ll need to install them both, as follows:

pip install tensorflow
pip install keras

Obtaining data for building a model

Obtaining stock data with yfinance is easy. With a couple of lines of code, you can get data for one or more tickers for a specified period and interval. For further details, refer to the yfinance project page.

The following snippet retrieves the stock price data for the TSLA ticker for the past five years for the default interval of one day:

import yfinance as yf
tkr = yf.Ticker('TSLA')
df = tkr.history(period="5y")

The data is returned as a DataFrame sorted by date in ascending order. You will need to re-sort it in descending order to prepare it for further analysis.

df = df.iloc[::-1]

See what you have using the following command:

print(df)

Here is what the output might look like.

Date        Open         High          Low         Close        Volume    Dividends  Stock Splits
                                                                                           
2021-12-10  1008.750000  1020.979675   982.530029  1017.030029  19229185          0           0.0
2021-12-09  1060.640015  1062.489990  1002.359985  1003.799988  19446486          0           0.0
2021-12-08  1052.709961  1072.380005  1033.000122  1068.959961  13968790          0           0.0
2021-12-07  1044.199951  1057.673950  1026.810059  1051.750000  18694857          0           0.0
2021-12-06  1001.510010  1021.640015   950.500000  1009.010010  27221037          0           0.0
...                 ...          ...          ...          ...       ...        ...           ...
2016-12-16   198.080002   202.589996   197.600006   202.490005   3796889          0           0.0
2016-12-15   198.410004   200.740005   197.389999   197.580002   3219567          0           0.0
2016-12-14   198.740005   203.000000   196.759995   198.690002   4150927          0           0.0
2016-12-13   193.179993   201.279999   193.000000   198.149994   6823884          0           0.0
2016-12-12   192.800003   194.419998   191.039993   192.429993   2438876          0           0.0

[1259 rows x 7 columns]

As you can see, the dataset contains 1,259 rows, which is more than enough for the purposes of this example. You won’t need all the columns, though: you’ll use only one, Close, which contains the day’s closing prices of the stock. For clarity, rename it to Price, as follows:

df = df.rename(columns={'Close': 'Price'})

Extracting features from the data

In machine learning and deep learning, features (also known as X variables) are independent variables that act as input to a model’s training (and evaluating) process.

For example, car “make” can be used as a feature in a model designed to predict car prices, because you’d expect some car makes to be generally more expensive than others.

In some cases, the original dataset may contain all the features you require (or desire) from the outset, saving you the need to generate features for the model manually. Continuing with the example of a model for predicting car prices, the dataset you use might already contain data sufficient to be turned directly into model input, such as car make, year, and fuel economy.
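
For illustration, here is a minimal, entirely hypothetical dataset of that kind, where every column except the target (price) can serve as a feature as-is:

import pandas as pd
# hypothetical car data; make, year, and mpg are ready-made features,
# and price is the target variable
cars = pd.DataFrame({
    'make':  ['Toyota', 'BMW', 'Ford'],
    'year':  [2018, 2020, 2016],
    'mpg':   [32, 27, 24],
    'price': [15000, 35000, 9000]
})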

In contrast, when you want to build a prediction model on a time series where you have only the output variable (price, for example) whose values are tied to points in time, it’s your job to generate the features needed to train the model. In such cases, the features can be extracted from the time series itself. For example, when creating a model for predicting stock prices, you can compute the one-day shifts in the value of a security across the entire series and save the results to an individual independent variable, thus manually extracting a feature from the original dataset.

In simple terms, the one-day price change feature mentioned above represents the difference between a day’s price and the previous day’s price. Because this feature must be available before the current day’s price is known, you calculate the difference between the previous day’s price and the price of the day before that, and associate the result with the current day’s row. With the pandas library, this can be implemented by shifting the index of the DataFrame accordingly.

# the previous day's price minus the price of the day before that,
# aligned with the current day's row (the DataFrame is sorted in descending order)
df['OneDayChange'] = df['Price'].shift(-1) - df['Price'].shift(-2)

It might also be useful to have a feature that measures the rate at which the price change itself changes, that is, a discrete second derivative. To generate such a feature, you need to analyze three consecutive points at a time. So, in the context of this example, you should analyze two-day intervals across the series, creating a new column in the DataFrame for this rate feature, as follows:

# a discrete second difference computed over three consecutive days
df['Derivative'] = df['Price'].shift(-1) - 2*df['Price'].shift(-2) + df['Price'].shift(-3)

After performing that operation, you can drop the unnecessary columns, keeping only the needed ones.

df = df[['Price','OneDayChange','Derivative']]

If you print out the DataFrame now, it should look as follows:

Date        Price        OneDayChange  Derivative
                                           
2021-12-10  1017.030029    -65.159973  -82.369934
2021-12-09  1003.799988     17.209961  -25.530029
2021-12-08  1068.959961     42.739990   48.699951
2021-12-07  1051.750000     -5.959961   63.670044
2021-12-06  1009.010010    -69.630005  -59.229980
...                 ...           ...         ...
2016-12-16    40.498001     -0.222000   -0.329998
2016-12-15    39.515999      0.107998   -1.036003
2016-12-14    39.737999      1.144001         NaN
2016-12-13    39.630001           NaN         NaN
2016-12-12    38.486000           NaN         NaN

[1259 rows x 3 columns]

Be sure to drop the rows that contain NaN values; they appeared because of the shifts used to create the OneDayChange and Derivative columns.

df = df.dropna()

As a result, there should be three fewer rows, and just three columns, in the DataFrame.

print(df.shape)
(1256, 3)

In terms of upcoming model training, the Price column in the DataFrame contains the output (target) variable, while the other two contain input variables (features). This structure is quite sufficient for training a prediction model. However, adding more features could potentially improve the model’s prediction abilities. One common, yet simple, technique to significantly increase the number of features when dealing with time-series data is to turn the data in the preceding rows into new features, as discussed in the next section.

Transforming data to create more features

If you generate new features from the data found in the nine preceding rows of each row, you will have 9 x 3 = 27 new features, along with the two you already have in the current row. To achieve this, you need to transform the DataFrame accordingly. Technically, this transformation amounts to adding a certain number of new columns (27, in this example) to the DataFrame and populating them with the data from the fields of the specified number (9, in this example) of preceding rows.

As the first step, convert the DataFrame to a flat (one-dimensional) NumPy array.

import numpy as np
# lay the rows end to end: [[p0, c0, d0], [p1, c1, d1], ...]
# becomes [p0, c0, d0, p1, c1, d1, ...]
arr = df.values.flatten()

Next, define a sliding window to move through the array with a specified step, forming the rows in a new array along the way. Before you can implement the transformation with a sliding window, you need to define the following parameters:

# the size (height) of the sliding window:
# the number of points (days) you want to include in a sample
sw_height = 10
# the step size of the sliding window:
# the number of columns in the DataFrame, including the Y and X columns
sw_step = len(df.columns)
# space for the sliding window movement: the number of rows in the DataFrame
rows = len(df)

You can now apply the technique known as fancy indexing, which lets you access multiple array elements in a particular order and assemble them into a new array. To implement this, you need to create an indexer, an array of indices that maps the original array to the desired array.
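
To see the idea on a toy example (not part of the stock data flow), here is how an indexer turns a small flat array into overlapping windows:

import numpy as np
a = np.arange(10, 17)            # a small flat array: [10 11 12 13 14 15 16]
idx = np.array([[0, 1, 2],       # each row of the indexer selects one window
                [2, 3, 4],
                [4, 5, 6]])
print(a[idx])
# [[10 11 12]
#  [12 13 14]
#  [14 15 16]]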

In this case, you need to create an indexer that reshapes the flat array so the first column in the new array contains the target variable (the price for a certain day) while the other columns (designed for holding the input variables) are formed from the subsequent elements of the array found within the sliding window.

The indexer can be calculated for this example, as follows:

# each row of the indexer covers one full window (sw_step*sw_height flat elements);
# successive rows start sw_step elements (one DataFrame row) later
idx = np.arange(sw_step*sw_height)[None, :] + sw_step*np.arange(rows-sw_height+1)[:, None]

Once you have the indexer, you can reshape the original array.

arr = arr[idx]

To verify that these steps have worked as expected, look at the shape of the new array.

print(arr.shape)
(1247, 30)

The second element in the tuple above shows that the array has 30 columns, meaning you now have 29 features (input variables) and still a single target (output) variable. Note that the new features have been formed not only from the features found in the preceding days’ records but also from the target variable values found in those records.
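
As an optional sanity check (not part of the original flow), you can confirm that the first row of the transformed array is simply the first ten rows of the DataFrame laid end to end:

print(np.array_equal(arr[0], df.values[:sw_height].flatten()))
# True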

Preparing the data for training and evaluating a model

Now that you have the array containing both the features and the target variable, you need to split it into two arrays accordingly. Below, you extract the first column from the array and save it in a new array, thus defining the target array.

target = arr[:, 0]

Then, you save all the other columns (29, in this example) in the features array.

features = arr[:, 1:]

Before going further, look at the shape of the newly created arrays to ensure you’ve done the split correctly.

print("target array shape: ", target.shape, "\nfeatures array shape: ", features.shape)

The following is what you should see:

target array shape: (1247,)
features array shape: (1247, 29)

Next, split the data into the training and testing sets. A common tool for this task is the sklearn.model_selection.train_test_split() utility, which performs random splitting with a single call.

In this case, however, using random splitting would be a mistake, since nearby rows contain almost the same data sequences, differing only in their first and final elements. As a result, the model would score deceptively well on the testing set but would fail to make accurate predictions on new, previously unseen data.

Thus, in this case, you need to split the data without shuffling it. In such cases, it’s good practice to put the most-recent data into the testing set. Since the rows are sorted in descending order by date and you have 1,247 samples in this example, you can put the last (that is, the oldest) 1,000 samples into the training set by slicing from index 247, as follows:

x_train = features[247:]
y_train = target[247:]

Put the first 247 - 10 = 237 samples (allowing for the height of the sliding window) into the testing set that you will use to evaluate the model.

x_eval = features[:247-sw_height]
y_eval = target[:247-sw_height]

You exclude the last 10 samples before the split point because their windows partially overlap the data in the first samples of the training set. Now, make sure that the split has been done as expected.

print(x_train.shape, x_eval.shape, y_train.shape, y_eval.shape)

This should produce the following output:

(1000, 29) (237, 29) (1000,) (237,)

You now have 1,000 samples for training and 237 samples reserved for evaluation.

Configuring the model

Now that you have the data ready for deep learning, proceed to create a model. As the first step, define the model’s architecture; this includes configuring the number of layers and the number of units (nodes) in each layer.

In the following snippet, you create the model as a stack of dense layers. In a dense layer, each unit (neuron) is connected to every unit in the previous layer, which lets the network learn relationships between any combination of input variables (features).

import keras
from keras import layers
model = keras.Sequential([
  # input layer: input_shape is the per-sample feature shape
  layers.Dense(128, input_shape=x_train.shape[1:], activation="relu"),
  # hidden layer
  layers.Dense(128, activation="relu"),
  # output layer: a single linear unit producing the predicted price
  layers.Dense(1)
])

The code builds a three-layer model, defining 128 units on the first (input) layer, 128 units on the second (hidden) layer, and a single unit on the third (output) layer. You should experiment with the number of units in the input and hidden layers to find the configuration that yields the most accurate predictions.

Note that the output layer must have a single unit, since you’re predicting a single output variable in this regression task. This layer has no activation function, which makes it purely linear: it produces a continuous value.

Also note the use of the input_shape parameter in the input layer. This parameter tells the model the number of features in your input.
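
As a quick check, you can print that shape:

print(x_train.shape[1:])  # (29,): 29 features per sample, batch dimension excluded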

Compiling, training, and evaluating the model

Training a model is an iterative process that updates the weights of the connections in the neural network so that the error between the model’s predictions and the expected output steadily decreases from iteration to iteration.

One of the most common algorithms for finding the appropriate weights of the network is gradient descent, which updates the weights gradually by an amount controlled by the learning rate. The learning rate is a configurable parameter, often in the range between 0.0 and 1.0, that governs the behavior of the training process: a smaller learning rate may let the model learn a more optimal set of weights, at the cost of taking significantly longer to train.
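
The following toy loop (an illustration only, not what Keras does internally) shows the core idea: each step moves a weight against the gradient of the loss, scaled by the learning rate.

# minimize f(w) = (w - 2)**2 by gradient descent
w = 5.0
learning_rate = 0.1
for _ in range(50):
    grad = 2 * (w - 2)       # derivative of f at the current w
    w -= learning_rate * grad
print(round(w, 4))           # approaches the minimum at w = 2.0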

When you compile your model, you can specify the learning rate as part of the training configuration that also includes an optimizer, a loss function, and metrics.

from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=0.00001), loss="mse", metrics=["mae"])

In this example, you use the optimizer that implements the Adam gradient descent algorithm. With the Adam optimizer, the learning rate defaults to 0.001, but in this case, use the smaller value of 0.00001 to allow the model to learn a more optimal set of weights.

As the loss function, use mean squared error (MSE), which averages the squared differences between the predictions and the actual values. This loss function is commonly used in regression tasks.

For the metric, use mean absolute error (MAE), which averages the absolute differences between the predictions and the actual values.
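
To make the two measures concrete, here is a small NumPy illustration (not Keras code) of what they compute:

import numpy as np
actual    = np.array([10.0, 12.0, 11.0])
predicted = np.array([11.0, 11.5, 10.0])
mse = np.mean((predicted - actual) ** 2)   # 0.75
mae = np.mean(np.abs(predicted - actual))  # 0.8333...
print(mse, mae)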

After compiling the model, look at its architecture.

model.summary()

The model’s summary should look as follows:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param # 
=================================================================
 dense (Dense)               (None, 128)               3840    
                                                               
 dense_1 (Dense)             (None, 128)               16512   
                                                               
 dense_2 (Dense)             (None, 1)                 129     
                                                               
=================================================================
Total params: 20,481
Trainable params: 20,481
Non-trainable params: 0

Now you can train and evaluate the model. The fit() method trains the model by repeatedly iterating over the training set for a specified number of epochs (iterations on the training set).

model.fit(x_train, y_train, verbose=1, epochs=2000)

With the verbose parameter set to 1, you can view the loss function values and the metric values during training for each epoch.

Epoch 1/2000
32/32 [==============================] - 0s 2ms/step - loss: 131444.2656 - mae: 345.9956
Epoch 2/2000
32/32 [==============================] - 0s 2ms/step - loss: 129994.7344 - mae: 344.0551
Epoch 3/2000
32/32 [==============================] - 0s 2ms/step - loss: 128554.0312 - mae: 342.1249

...

Epoch 1000/2000
32/32 [==============================] - 0s 3ms/step - loss: 142.8597 - mae: 7.9147
Epoch 1001/2000
32/32 [==============================] - 0s 3ms/step - loss: 140.6166 - mae: 7.8482
Epoch 1002/2000
32/32 [==============================] - 0s 3ms/step - loss: 140.4102 - mae: 7.8409

...

Epoch 1990/2000
32/32 [==============================] - 0s 3ms/step - loss: 121.6326 - mae: 7.2967
Epoch 1999/2000
32/32 [==============================] - 0s 3ms/step - loss: 121.1263 - mae: 7.3090
Epoch 2000/2000
32/32 [==============================] - 0s 3ms/step - loss: 120.8455 - mae: 7.2707

The output above shows that the loss and metric values improved unevenly from epoch to epoch, with little significant improvement after about the midpoint. This suggests that you could use fewer epochs to train your model.
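
One way to act on that observation, sketched here under the assumption that you retrain from scratch, is to let Keras stop training automatically once the loss stops improving, using the EarlyStopping callback:

from tensorflow.keras.callbacks import EarlyStopping
# stop if the training loss hasn't improved for 50 consecutive epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor="loss", patience=50, restore_best_weights=True)
model.fit(x_train, y_train, verbose=1, epochs=2000, callbacks=[early_stop])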

Then, you can evaluate the model, computing the loss function value and metrics values on the testing set.

mse, mae = model.evaluate(x_eval, y_eval, verbose=0)
print("MSE: ", mse, "\nMAE: ", mae)

The results might look as follows:

MSE:  866.2902221679688
MAE:  20.61849021911621

It would also be interesting to look at particular predictions. Below, you generate predictions for the first five samples from the testing set and compare them with the corresponding actual values.

predictions = model.predict(x_eval[0:5])
print("Actual: ", y_eval[0:5], "\nPredictions: ", predictions.flatten())

The output might look somewhat like the following:

Actual:       [1017.0300293  1003.79998779 1068.95996094 1051.75       1009.01000977]
Predictions:  [1019.1517     1072.6211     1056.2333     1002.4659     1010.0428]

The results above suggest that the model can’t be considered reliable, because only some predictions come close to their corresponding actual values. This reflects the fact that a stock’s price history alone doesn’t contain enough information to make accurate predictions (stock traders’ lives would be very easy otherwise).

To build a more accurate stock prediction model, you need to use more indicators. For example, you might use information about a company’s upcoming financial reports (interest in the company’s shares often increases on the eve of such events), thus turning the schedule of financial statements into a feature in your model. A discussion of how to choose the right indicators, however, is far beyond the scope of this article.
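
As a rough, purely hypothetical sketch of that idea (the dates below are made up, and a real report schedule would come from another data source), you could compute a days-until-next-report value for each trading day and use it as a feature:

import pandas as pd
report_dates = pd.to_datetime(['2022-01-26', '2022-04-20'])  # hypothetical schedule
trading_days = pd.to_datetime(['2022-01-10', '2022-02-01'])
days_to_report = [(report_dates[report_dates >= d].min() - d).days
                  for d in trading_days]
print(days_to_report)  # [16, 78]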

Conclusion

In this article, you looked at an example of gathering, preparing, and using market data for building a deep learning model designed to predict the movements of a stock’s price. In particular, you learned to extract features from a series containing a stock’s price history, and then you saw how to configure, compile, train, and evaluate a prediction model on that data.


Yuli Vasiliev

Yuli Vasiliev is a programmer, freelance author, and consultant currently specializing in open source development; Oracle database technologies; and, more recently, natural-language processing (NLP).

