Machine learning is a robust method of data analysis that makes it possible to build applications capable of learning from data. Deep learning is a subfield of machine learning, and it structures algorithms in layers, allowing you to create more-accurate models. This article illustrates an example of how you can create a deep learning model for stock price analysis using Python’s Keras deep learning library. In particular, you’ll see whether the old saying “history repeats itself” works when it comes to stock price prediction implemented with deep learning algorithms.
In deep learning, a model may learn to perform regression or classification tasks directly from the raw input (such as images, sound, or text), using multiple layers to gradually extract higher-level insights from the data.
This doesn’t mean, however, that the raw input needs no preparation for deep learning modeling. In some cases, you may even need to do manual feature extraction, with the expectation that the model will derive new, implicit features from those manually generated ones.
The technique of manual feature extraction can be especially useful when dealing with time-series data where you have only the output (target) variable tied to points in time. In such cases, it’s natural to generate several metrics reflecting changes between nearby members of the series, so you can use those metrics as the model features. You can experiment with both the data and the model settings until you achieve the desired accuracy (or until the accuracy stops improving).
The following are the general steps for deep learning modeling:
1. Obtain the data.
2. Prepare the data, extracting the features needed to train the model.
3. Split the data into training and testing sets.
4. Define the model architecture.
5. Compile the model, specifying the optimizer, loss function, and metrics.
6. Train the model.
7. Evaluate the model.
This article will show you how to apply these steps when building a deep learning model for stock price prediction in Python with Keras.
The article’s example uses stock data that you can obtain with the yfinance library, a Python wrapper for the Yahoo Finance API. You can install yfinance with the pip command; make sure you are getting the latest version.
pip install -U yfinance
The deep learning model (neural network) for stock price prediction will be configured and trained with Keras, which requires the TensorFlow platform. You’ll need to install them both, as follows:
pip install tensorflow
pip install keras
Obtaining stock data with yfinance is easy. With a couple of lines of code, you can get data for a certain ticker or multiple tickers for a specified period and interval. For further details, refer to the yfinance project page.
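If you need data for several tickers at once, or for a different bar interval, the library also provides the download() function. The following is a quick sketch; the tickers, period, and interval below are placeholders chosen purely for illustration:

import yfinance as yf

# Hypothetical example: daily bars for two tickers over the past year.
data = yf.download("TSLA AAPL", period="1y", interval="1d")
print(data.head())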
The following snippet retrieves the stock price data for the TSLA ticker for the past five years for the default interval of one day:
import yfinance as yf
tkr = yf.Ticker('TSLA')
df = tkr.history(period="5y")
The data is returned as a DataFrame sorted by date in ascending order. You will need to re-sort it in descending order to prepare it for further analysis.
df = df.iloc[::-1]
See what you have using the following command:
print(df)
Here is what the output might look like.
Date Open High Low Close Volume Dividends Stock Splits
2021-12-10 1008.750000 1020.979675 982.530029 1017.030029 19229185 0 0.0
2021-12-09 1060.640015 1062.489990 1002.359985 1003.799988 19446486 0 0.0
2021-12-08 1052.709961 1072.380005 1033.000122 1068.959961 13968790 0 0.0
2021-12-07 1044.199951 1057.673950 1026.810059 1051.750000 18694857 0 0.0
2021-12-06 1001.510010 1021.640015 950.500000 1009.010010 27221037 0 0.0
... ... ... ... ... ... ... ...
2016-12-16 198.080002 202.589996 197.600006 202.490005 3796889 0 0.0
2016-12-15 198.410004 200.740005 197.389999 197.580002 3219567 0 0.0
2016-12-14 198.740005 203.000000 196.759995 198.690002 4150927 0 0.0
2016-12-13 193.179993 201.279999 193.000000 198.149994 6823884 0 0.0
2016-12-12 192.800003 194.419998 191.039993 192.429993 2438876 0 0.0
[1259 rows x 7 columns]
As you can see, the dataset contains 1,259 rows, which is quite enough for the purpose of this example. You won’t need all the columns, though; use only one, Close, which contains the day’s closing prices of the stock. For clarity, rename it to Price, as follows:
df=df.rename(columns={'Close': 'Price'})
In machine learning and deep learning, features (also known as X variables) are independent variables that act as input to a model’s training (and evaluating) process.
For example, car “make” can be used as a feature in a model designed to predict car prices, because you’d expect some car makes to be generally more expensive than others.
In some cases, the original dataset may contain all the features you require (or desire) from the outset, saving you the need to generate the features for the model manually. Continuing with the example of a model for predicting car prices, the dataset might already contain data that can be turned directly into model input, such as car make, year, and fuel economy.
In contrast, when you want to build a prediction model on a time series where you have only the output variable (price, for example) whose values are tied to points in time, it’s your job to generate the features needed to train the model. In such cases, the features can be extracted from the time series itself. For example, when creating a model for predicting stock prices, you can compute the one-day shifts in the value of a security across the entire series and save the results to an individual independent variable, thus manually extracting a feature from the original dataset.
In simple terms, the one-day price change feature mentioned above represents the difference between a day’s price and the previous day’s price. Because the features paired with a given day’s price should rely only on information available before that day, the value you store in a row is the change observed over the two days preceding it: the difference between the previous day’s price and the price of the day before that. With the pandas library, this can be implemented by shifting the index of the DataFrame accordingly.
df['OneDayChange'] = df['Price'].shift(-1) - df['Price'].shift(-2)
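As a quick, optional sanity check (assuming the descending date order used above), the value in the newest row should equal the difference between the previous day’s price and the price of the day before that:

# Both lines should print the same number.
print(df['OneDayChange'].iloc[0])
print(df['Price'].iloc[1] - df['Price'].iloc[2])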
It might also be useful to have a feature that measures how quickly the price change itself is changing. To generate such a feature, you’ll need to analyze three consecutive points at a time, that is, a two-day interval across the series. The following creates a new column in the DataFrame for this rate feature by, in effect, taking the difference between two consecutive one-day changes:
df['Derivative'] = df['Price'].shift(-1) - 2*df['Price'].shift(-2) + df['Price'].shift(-3)
After performing that operation, you can drop the unnecessary columns, keeping only the needed ones.
df = df[['Price','OneDayChange','Derivative']]
If you print out the DataFrame now, it should look as follows:
Date Price OneDayChange Derivative
2021-12-10 1017.030029 -65.159973 -82.369934
2021-12-09 1003.799988 17.209961 -25.530029
2021-12-08 1068.959961 42.739990 48.699951
2021-12-07 1051.750000 -5.959961 63.670044
2021-12-06 1009.010010 -69.630005 -59.229980
... ... ... ...
2016-12-16 40.498001 -0.222000 -0.329998
2016-12-15 39.515999 0.107998 -1.036003
2016-12-14 39.737999 1.144001 NaN
2016-12-13 39.630001 NaN NaN
2016-12-12 38.486000 NaN NaN
[1259 rows x 3 columns]
Be sure to drop the rows that contain NaN. These were formed due to the use of offsets (shifts) when the OneDayChange and Derivative columns were created.
df = df.dropna()
As a result, there should be three fewer rows and just three columns in the DataFrame.
print(df.shape)
(1256, 3)
For the upcoming model training, the Price column in the DataFrame contains the output (target) variable, while the other two contain input variables (features). This structure is quite sufficient for training a prediction model. However, adding more features could potentially improve the model’s predictive ability. One common yet simple technique to significantly increase the number of features when dealing with time-series data is to turn the data in the preceding rows into new features, as discussed in the next section.
If you generate new features from the data found in the nine preceding rows of each row, you will have 9 x 3 = 27 new features, along with the two you already had in the current row. To achieve this, you need to transform the DataFrame accordingly. Technically, this transformation amounts to adding a certain number of new columns (27, in this example) to the DataFrame and populating them with the data from the fields of the specified number (9, in this example) of preceding rows.
As the first step, convert the DataFrame to a flat (one-dimensional) NumPy array.
import numpy as np
arr = df.values.flatten()
Next, define a sliding window to move through the array with a specified step, forming the rows in a new array along the way. Before you can implement the transformation with a sliding window, you need to define the following parameters:
#the size (height) of the sliding window
sw_height = 10 #the number of points (days) you want to include in a sample
#the step size of the sliding window
sw_step = len(df.columns) #the number of columns in the DataFrame, including Y and X columns
#space for the sliding window movement
rows = len(df) #the number of rows in the DataFrame
You can now apply the technique known as fancy indexing, which lets you access multiple array elements in a particular order and thereby form a new array. To implement this, you need to create an indexer: an array of indices that maps the original array onto the desired array.
In this case, you need to create an indexer that reshapes the flat array so the first column in the new array contains the target variable (the price for a certain day) while the other columns (designed for holding the input variables) are formed from the subsequent elements of the array found within the sliding window.
The indexer can be calculated for this example, as follows:
idx = np.arange(sw_step*sw_height)[None, :] + sw_step*np.arange(rows-sw_height+1)[:, None]
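To see what the indexer does, here is a small, optional illustration using the NumPy import from earlier and toy parameters made up purely for demonstration: four rows of three values each, flattened to 12 elements, with a window of two rows sliding forward one row at a time.

toy_step, toy_height, toy_rows = 3, 2, 4
toy_idx = np.arange(toy_step*toy_height)[None, :] + toy_step*np.arange(toy_rows-toy_height+1)[:, None]
print(toy_idx)
# [[ 0  1  2  3  4  5]
#  [ 3  4  5  6  7  8]
#  [ 6  7  8  9 10 11]]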
Once you have the indexer, you can reshape the original array.
arr = arr[idx]
To verify that these steps have worked as expected, look at the shape of the new array.
print(arr.shape)
(1247, 30)
The second element in the tuple above shows that the array has 30 columns, meaning you now have 29 features (input variables) plus a single target (output) variable. Note that the new features have been formed not only from the features found in the previous days’ records but also from the target variable values found in those records.
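If you want to double-check the transformation, an optional sanity check is to compare a slice of the first sample against a row of the original DataFrame (the indices below assume the window parameters used above). The second group of three values in the first sample should match the Price, OneDayChange, and Derivative of the next row, which holds the previous day’s data:

print(arr[0, 3:6])
print(df.iloc[1].values)  # should show the same three numbers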
Now that you have the array containing both the features and the target variable, you need to split it into two arrays accordingly. Below, you extract the first column from the array and save it in a new array, thus defining the target array.
target = arr[:, 0]
Then, you save all the other columns (29, in this example) in the features array.
features = arr[:, 1:]
Before going further, look at the shape of the newly created arrays to ensure you’ve done the split correctly.
print("target array shape: ", target.shape, "\nfeatures array shape: ", features.shape)
The following is what you should see:
target array shape: (1247,)
features array shape: (1247, 29)
Next, split the data into the training and testing sets. A common tool for this task is the sklearn.model_selection.train_test_split() utility, which performs random splitting with a single call.
In this case, however, using random splitting would be a mistake, since nearby rows include the same data sequences, differing only in their first and final elements. As a result, the model would score deceptively well on the testing set but would fail to make accurate predictions on new, previously unseen data.
Thus, in this case, you need to split the data without shuffling it, and it’s good practice to put the most-recent data into the testing set. Since you have 1,247 samples in this example and the array is ordered from newest to oldest, you can put the last 1,000 samples (the oldest data) into the training set by starting the slice at index 247, as follows:
x_train = features[247:]
y_train = target[247:]
Put the first 247 – 10 samples (allowing for the height of the sliding window) into the testing set that you will use to evaluate the model.
x_eval = features[:247-sw_height]
y_eval = target[:247-sw_height]
The last 10 samples of that slice are excluded (hence the subtraction of sw_height) because their sliding windows partially overlap with the data in the first samples of the training set. Now, make sure that the split has been done as expected.
print(x_train.shape, x_eval.shape, y_train.shape, y_eval.shape)
This should produce the following output:
(1000, 29) (237, 29) (1000,) (237,)
You now have 1,000 samples for training and 237 samples reserved for evaluation.
Now that you have the data ready for deep learning, proceed to creating a model. As the first step, create the architecture of the model; this process includes configuring the number of the model’s layers and the number of nodes on each layer.
In the following snippet, you create a model as a stack of dense layers. In a dense layer, each unit (neuron) is connected to every unit in the previous layer, which enables mapping relationships between any two input variables (features).
import keras
from keras import layers
model = keras.Sequential([
layers.Dense(128, input_shape=x_train.shape[1:], activation="relu"),
layers.Dense(128, activation="relu"),
layers.Dense(1)
])
The code builds a three-layer model, defining 128 units on the first (input) layer, 128 units on the second (hidden) layer, and a single unit on the third (output) layer. You should experiment with the number of units in the input and hidden layers to find the configuration with which the model renders more-accurate predictions.
Note that the output layer must have a single unit since you’re trying to predict a single variable—you have a single output variable in your regression task. This layer doesn’t have an activation function, which makes it purely linear. That is, the layer is supposed to produce a continuous value.
Also note the use of the input_shape parameter in the input layer. This parameter tells the model the number of features in your input.
Training a model is an iterative process that updates the weights of the connections in the neural network so that the error between the model’s predictions and the expected output steadily decreases from iteration to iteration.
One of the most common algorithms for finding the appropriate weights of the network is gradient descent, which updates the weights in small steps whose size is controlled by the learning rate. The learning rate is a configurable parameter (often in the range between 0.0 and 1.0) that governs the behavior of the training process. A smaller learning rate may let the model learn a more optimal set of weights, at the cost of significantly longer training.
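For intuition, here is a tiny sketch of a single gradient descent update for one weight; the numbers are made up purely to show how the learning rate scales each step and are not meant to reflect Keras internals:

learning_rate = 0.01
weight = 0.5            # current weight value
gradient = 2.0          # gradient of the loss with respect to this weight
weight = weight - learning_rate * gradient  # step scaled by the learning rate
print(weight)           # 0.48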
When you compile your model, you can specify the learning rate as part of the training configuration that also includes an optimizer, a loss function, and metrics.
from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=0.00001), loss="mse", metrics=["mae"])
In this example, you use the optimizer that implements the Adam gradient descent algorithm. With the Adam optimizer, the learning rate defaults to 0.001, but in this case, use the smaller value of 0.00001 to allow the model to learn a more optimal set of weights.
As the loss function, use mean squared error (MSE), which calculates the average of the squared differences between the predictions and the actual values. This loss function is commonly used in regression tasks.
For the metric, use mean absolute error (MAE), which shows the average absolute difference between the predictions and the actual values.
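To make the two measures concrete, here is a small, optional sketch that computes them directly with NumPy on made-up values:

actual = np.array([100.0, 102.0, 98.0])      # toy values, purely illustrative
predicted = np.array([101.0, 99.0, 100.0])

mse = np.mean((predicted - actual) ** 2)     # mean squared error
mae = np.mean(np.abs(predicted - actual))    # mean absolute error
print(mse, mae)  # 4.666..., 2.0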
After compiling the model, look at its architecture.
model.summary()
The model’s summary should look as follows:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 3840
dense_1 (Dense) (None, 128) 16512
dense_2 (Dense) (None, 1) 129
=================================================================
Total params: 20,481
Trainable params: 20,481
Non-trainable params: 0
Now you can train and evaluate the model. The fit() method trains the model by repeatedly iterating over the training set for a specified number of epochs (iterations on the training set).
model.fit(x_train, y_train, verbose=1, epochs=2000)
With the verbose parameter set to 1, you can view the loss function values and the metric values during training for each epoch.
Epoch 1/2000
32/32 [==============================] - 0s 2ms/step - loss: 131444.2656 - mae: 345.9956
Epoch 2/2000
32/32 [==============================] - 0s 2ms/step - loss: 129994.7344 - mae: 344.0551
Epoch 3/2000
32/32 [==============================] - 0s 2ms/step - loss: 128554.0312 - mae: 342.1249
...
Epoch 1000/2000
32/32 [==============================] - 0s 3ms/step - loss: 142.8597 - mae: 7.9147
Epoch 1001/2000
32/32 [==============================] - 0s 3ms/step - loss: 140.6166 - mae: 7.8482
Epoch 1002/2000
32/32 [==============================] - 0s 3ms/step - loss: 140.4102 - mae: 7.8409
...
Epoch 1990/2000
32/32 [==============================] - 0s 3ms/step - loss: 121.6326 - mae: 7.2967
Epoch 1999/2000
32/32 [==============================] - 0s 3ms/step - loss: 121.1263 - mae: 7.3090
Epoch 2000/2000
32/32 [==============================] - 0s 3ms/step - loss: 120.8455 - mae: 7.2707
The output above shows that the loss function values and the metric values improved unevenly from epoch to epoch, with little significant improvement after about the midpoint of training. This suggests that you could train the model with fewer epochs.
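One way to act on that observation, rather than hand-picking an epoch count, is to let Keras stop training automatically once the loss stops improving. The following is a minimal sketch using the EarlyStopping callback; the patience and min_delta values are assumptions you would tune for your own data:

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the training loss hasn't improved by at least min_delta
# for `patience` consecutive epochs (the values here are illustrative).
early_stop = EarlyStopping(monitor="loss", min_delta=0.1, patience=50)
model.fit(x_train, y_train, verbose=1, epochs=2000, callbacks=[early_stop])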
Then, you can evaluate the model, computing the loss function value and the metric value on the testing set.
mse, mae = model.evaluate(x_eval, y_eval, verbose=0)
print("MSE: ", mse, "\nMAE: ", mae)
The results might look as follows:
MSE: 866.2902221679688
MAE: 20.61849021911621
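Because MSE is expressed in squared price units, it can be easier to interpret its square root, the root mean squared error (RMSE), which puts the error back into the same units as the price. This step is optional:

rmse = np.sqrt(mse)
print("RMSE: ", rmse)  # roughly 29.4 for the MSE shown above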
It would also be interesting to look at particular predictions. Below, you generate predictions for the first five samples from the testing set and compare them with the corresponding actual values.
predictions = model.predict(x_eval[0:5])
print("Actual: ", y_eval[0:5], "\nPredictions: ", predictions.flatten())
The output might look somewhat like the following:
Actual: [1017.0300293 1003.79998779 1068.95996094 1051.75 1009.01000977]
Predictions: [1019.1517 1072.6211 1056.2333 1002.4659 1010.0428]
The results above suggest that the model can’t be considered reliable, because only some predictions come close to their corresponding actual values. This illustrates that a stock’s price history alone does not contain enough information to make accurate predictions (stock traders’ lives would be very easy otherwise).
To build a more accurate stock prediction model, you need to use more indicators. For example, you might use the information about upcoming financial reports of a company (interest in the company’s shares often increases on the eve of such events), thus turning the schedule of financial statements into a feature in your model. A discussion about choosing the right indicators, however, is far beyond the scope of this article.
In this article, you looked at an example of gathering, preparing, and using market data for building a deep learning model designed to predict the movements of a stock’s price. In particular, you learned to extract features from a series containing a stock’s price history, and then you saw how to configure, compile, train, and evaluate a prediction model on that data.
Yuli Vasiliev is a programmer, freelance author, and consultant currently specializing in open source development; Oracle database technologies; and, more recently, natural-language processing (NLP).