Time-series analysis is a basic concept within the field of statistical learning that allows the user to find meaningful information in data collected over time. To demonstrate the power of this technique, we'll be applying it to the S&P 500 Stock Index in order to find the best model to predict future stock values.

Throughout this tutorial, we'll leverage the horse-power of RStudio and deliver, where appropriate, gorgeous interactive data visualizations using ggplot2 and plotly. Just to be clear, using a time-series analysis to invest in stocks is highly discouraged. We've chosen to predict stock values for the sake of example only.

The GitHub repository you'll need to follow this tutorial is located here.

Load Packages

First, to correctly load the packages we'll be using in this tutorial into your RStudio console, select *File,* *Open Project in New Session, *and click on the preexisting *R Project* in the repo. With the `packrat`

package, the `.Rprofile`

script will automatically run files in the `packrat`

directory and load the appropriate packages into the R environment.

Once this is complete, you should have a new RStudio session. Here we will be referencing the `helper_functions`

script, which contains all the necessary packages. We use:

```
source(here("src",'helper_functions.R'))
```

For this demonstration we will load them in manually:

```
# LOAD YOUR PACKAGES
library(ggplot2)
library(forecast)
library(plotly)
library(ggfortify)
library(tseries)
library(gridExtra)
library(docstring)
library(readr)
library(here)
here() # Should output current work directory
```

Using the `here`

package, we are going to set the working directory. We'll be able to reference different subdirectories to make the workflow easier.

Get Data

Now we collect our data. We want to use reliable sources of complete and accurate data. For the purposes of this tutorial, we collected 21 years* *(1995-2015) of S&P 500 Stock Index data at a monthly frequency (a total of 252 observations) from Yahoo Finance, which you can download here. We chose to use the Adjusted Closing Value for this analysis.

Load the Data

We must include our data set within our working R environment. For this we use:

```
data_master <- read.csv(here("data", "data_master_1.csv"))
attach(data_master)
```

Now we can call our S&P 500 Stock Index data by typing `data_master$sp_500`

into our terminal.

Helper Functions

For this project, we created a helper script which contains functions that automate many of our plots.

In order to provide well-written documentation for these functions, we utilized the `docstrings`

package which allows you to view them on RStudio:

`?function_name`

Exploratory Analysis

Now, we want to get a feel for our data in order to decide which models may be appropriate for our forecast. To do this, we will plot our data and diagnose for *trend*, *seasonality*, *heteroskedasticity*, and *stationarity*. We will go over these concepts in further detail below.

Create a Time-Series Data Object

Our S&P 500 Stock Index data is in the form of a time series; this means that our data exists over a continuous time interval with equal spacing between every two consecutive measurements.

In **R** we are able to create time-series objects for our data vectors using the `ts()`

method. Select the vector you would like to use as the first argument, and tune the `start`

and `freq`

(frequency) parameters. Then output the time-series data to the terminal by calling your newly-created time-series object.

```
sp_500 <- ts(data_master$sp_500, start=c(1995, 1), freq=12)
```

Here, we use our own function called `plot_time_series`

, which does as its name suggests:

`plot_time_series(sp_500, 'S&P 500')`

Before you begin any analysis, you will be splitting the data to remove 2015 to use as a test set.

`sp500_training <- ts(sp_500, start=c(1995, 1), end=c(2014, 12), freq=12)`

Plot a Time Series

Plotting data is arguably the most critical step in the exploratory analysis phase. (We chose to emphasize the time-series object that has intervals from 1995 to 2014, a choice we will explain later!) Visualizing our time-series data enables us to make inferences about important components, such as trend, seasonality, heteroskedasticity, and stationarity. Here is a quick summary of each:

**Trend**: We say that a dataset has a trend when it has either a *long-term increase* or *decrease*.

**Seasonality**: We say that a dataset has seasonality when it has patterns that repeat over known, fixed periods of time (e.g. monthly, quarterly, yearly).

**Heteroskedasticity**: We say that a data is *heteroskedastic* when its variability is not constant (i.e., its variance increases or decreases as a function of the explanatory variable).

**Stationarity**: A stochastic process is called *stationary* if the mean and variance are constant (i.e., their joint distribution does not change over time).

We start our analysis by plotting our time series object to give us a visual basis to start our modeling.

```
plot_time_series(sp500_training, 'S&P 500')
```

You can easily see that our time series has instances of both positive and negative trend. Overall, it is very volatile, which tells us that we will have transform the data in order for the **Box-Jenkins Methodology** to predict with better accuracy.

Test for Stationarity

We will utilize a few statistical methods to test for stationarity. We must be wary of our model having a *unit root;* this will lead to non-stationary processes.

```
Box.test(sp_500, lag = 20, type = 'Ljung-Box')
```

Our output: ```
> Box.test(sp500_training, lag = 20, type = 'Ljung-Box')
Box-Ljung test
data: sp500_training
X-squared = 2024.8, df = 20, p-value < 2.2e-16
```

Now we will utilize the **Augmented Dickey-Fuller Test** for stationarity. The null hypothesis states that large *p* values indicate non-stationarity and smaller* p* values indicate stationarity. (We will be using 0.05 as our alpha value.)

`adf.test(sp_500)`

### Here's our output:

```
> adf.test(sp500_training)
Augmented Dickey-Fuller Test
data: sp500_training
Dickey-Fuller = -1.7877, Lag order = 6, p-value = 0.6652
alternative hypothesis: stationary
```

You can see our *p* value for the ADF test is relatively high. For that reason, we need to do some further visual inspection — but we know we will most likely have to difference our time series for stationarity.

Decompose a Time Series

Beyond understanding the *trend* of your time series, you want to further understand the anatomy of your data. For this reason, we will break down our time series into its *seasonal component*, *trend*, and *residuals*.

```
plot_decomp(sp500_training, 'S&P 500')
```

The trend line shows us what we already know; we can see there might be some seasonality in our time-series object.

Model Estimation

### Diagnosing the ACF and PACF Plots of Our Time-Series Object

*ACF* stands for "autocorrelation function" and *PACF* stands for "partial autocorrelation function." The *ACF* and *PACF *diagnosis is employed over a time-series to determine the order in which we are going to create our model using *ARIMA* modeling. Loosely speaking, a time series is *stationary* when its mean, variance, and *autocorrelation* remain constant over time.

These functions help us understand the correlation component of different data points at different time *lags*. *Lag *refers to the time difference between one observation and a previous observation in a dataset. Let's examine our plots!

```
# DIAGNOSING ACF AND PACF PLOTS
plot_acf_pacf(sp500_training, 'S&P 500')
```

When there is large autocorrelation within our lagged values, we see geometric decay in our plots. This is a huge indicator that we will have to take the difference of our time series object.

### Transform Data to Adjust for Non-Stationarity

Based on our visual inspection of the time-series object and the statistical tests used for exploratory analysis, it is appropriate to difference our time-series object to account for the *non-stationarity. *Let's see how the object fares!

A way to make a time series *stationary* is to find the difference across its consecutive values. This helps stabilize the mean, thereby making the time-series object stationary.

For this we use the `diff()`

method.

```
tsDiff <- diff(sp500_training)
```

Next we plot our transformed time series:

```
plot_time_series(tsDiff, 'First Difference')
```

This plot suggests that our working data is stationary. You will want to confirm this by running the same tests and looking at the *ACF* and *PACF* diagnostics over the differenced data to find out if you can proceed to estimating a model.

### Test for Stationarity

Let's apply the same tests to our differenced time-series object:

```
> Box.test(tsDiff, lag = 20, type = 'Ljung-Box')
Box-Ljung test
data: tsDiff
X-squared = 58.2, df = 20, p-value = 1.347e-05
```

Now let's use the ADF test:

```
> adf.test(tsDiff)
Augmented Dickey-Fuller Test
data: tsDiff
Dickey-Fuller = -4.9552, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In adf.test(tsDiff) : p-value smaller than printed p-value
```

This cryptic warning message means that the result yields a small* p* value, which makes us reject the null suggestion stationarity.

Diagnose the ACF and PACF of a Transformed Time-Series Object

The plot below helps us confirm that we have stationarity and helps us decide which model we will use. It is important to keep in mind that we have a difference parameter equal to one (i.e., *d = 1*) because of the previous transformation we carried out.

```
plot_acf_pacf(tsDiff, 'First Difference Time Series Object')
```

From the above plots, we deduce that an *MA(1)* model (where *MA* stands for **moving average**) best fits our data because the *ACF* cuts off at one significant lag and the *PACF* shows geometric decay.

Because we are examining the differenced time series, we have to use the combined model *ARIMA *(**Autoregressive integrated moving average**); thus, our model so far is *ARIMA(0, 1, 1)*.

Build a Model

Our findings in the exploratory analysis phase suggest that model *ARIMA(0, 1, 1)* might be the best fit. Fortunately, there is a function in **R** that we can use to test our findings.

The `auto.arima()`

method, found within the `forecast`

package, yields the best model for a time series based on **Akaike-Information-Criterion** (*AIC*). The *AIC* is a measurement of quality used across various models to find the best fit. After running our original and differenced data sets through the `auto.arima()`

method, we confirmed that the *ARIMA(0, 1, 1)* is our best fit model.

We use the `Arima()`

method to fit our model and include our training data set, `sp500_training`

, as the first argument.

```
fit <- Arima(sp500_training, order = c(0,1,1),
include.drift = TRUE)
summary(fit)
```

Here's the summary of our model (using the `summary()`

method):

```
> summary(fit)
Series: sp500_training
ARIMA(0,1,1) with drift
Coefficients:
ma1 drift
0.5666 6.4975
s.e. 0.0551 3.4326
sigma^2 estimated as 1161: log likelihood=-1181.58
AIC=2369.17 AICc=2369.27 BIC=2379.59
Training set error measures:
ME RMSE MAE
Training set -0.00911296 33.85289 24.84955
MPE MAPE MASE
Training set -0.00840343 2.141218 0.1310854
ACF1
Training set -0.01137429
```

Our next step is to run residual diagnostics to ensure our residuals are white noise under our initial assumptions. For this we will use the `ggtsdiplay()`

method.

```
# RESIDUAL DIAGNOSTICS
ggtsdiag_custom(fit) +
theme(panel.background = element_rect(fill = "gray98"),
panel.grid.minor = element_blank(),
axis.line.y = element_line(colour="gray"),
axis.line.x = element_line(colour="gray"))
residFit <- ggplot(data=fit, aes(residuals(fit))) +
geom_histogram(aes(y =..density..),
binwidth = 5,
col="turquoise4", fill="white") +
geom_density(col=1) +
theme(panel.background = element_rect(fill = "gray98"),
panel.grid.minor = element_blank(),
axis.line = element_line(colour="gray"),
axis.line.x = element_line(colour="gray")) +
ggtitle("Plot of SP 500 ARIMA Model Residuals")
```

Based on our diagnostic plot, the residuals to seem to display a normal distribution.

This model appears to confirm all of our assumptions, which means we can continue to the forecasting phase!

Forecast

Since we believe we've found the appropriate model, let's begin forecasting! As you'll see, we relied quite heavily on the `autoplot()`

function, since we couldn't find a way to add the actual values to the plot. The code for the workaround we used can be found in the project's Github Repository. You should run the `autoplot.forecast()`

in order to get the plots we have here.

For now, we will create a time-series object that will include the actual 2015 values. In our function, it will be called up by the parameter `holdout`

.

Next we use the `forecast`

function, `ggplot2`

, and `plotly`

to visualize the predictions for the year 2015! Within the plots, the forecasted values are **BLUE**, the actual 2015 values are in **RED**, the 80% confidence intervals are encompassed in the **YELLOW** bands, and 95% confidence intervals are encompassed in the **ORANGE** bands.

```
for_sp500_all <- forecast(fit, h = 12)
```

Next, let's create the test set:

`sp500_test <- window(sp_500, 2015, c(2015, 12))`

Now let's plot the forecast and compare to the actual values for 2016:

We can see that the model performs well and within the 80% and 95% confidence intervals. You can forecast values even further into the future by tuning the appropriate parameters. Please not that this forecast project is for educational purposes only and we do not recommend investing based on predictions made during this project. The stock market is very volatile.

Conclusions

The forecasting method we used to find the best model receives the lowest *MAE* and *MAPE. *You can read more about this in "Evaluating Forecast Accuracy" by Rob J. Hyndman, which can be found here.

Below, we use the accuracy method that includes the test set to give us metrics for all models. The functions is called as follows:

```
# Using round function to make it more readable
round(accuracy(fit, sp500_test), 3)
```

Here are the results for the test set (the function will return both training and test set metrics, but we're only concerned with the test set metrics):

Surprisingly, other models performed better on the test set than the ARIMA model. This could be a result of overfitting, since our test set was really small. After a few iterations, we have been able to expand the test set to include data up until 2017 on the repo and the original inertia7 project.

For the sake of the dashboard, we have decided to leave the data in their original state. We do, however, encourage that you explore the models with a larger test set.

Sources Cited

Below are some resources that we used to write this piece, and that we would highly recommend you read too! They all cover the material very well without getting overly technical. (Although, if you'd like something more in depth, we would recommend "Time Series Analysis and Its Application with R Examples," which can be found here). I would also like to acknowledge Rob J. Hyndman for his major contributions to time-series analysis and the R community with the creation of the `forecast`

package, among other contributions, and Hadley Wickham for his contribution to both **RStudio** and `ggplot2 `

— without these tools, none of this work would have been possible. Thank you.

Hyndman, Rob J., and George Athanasopoulos. "Forecasting: Principles and Practice," Otexts. N.p., May 2012. Web.

NIST/SEMATECH e-Handbook of Statistical Methods, "Introduction to Time Series Analysis." June, 2016.

Schmidt, Drew. Autoplot: Graphical Methods with ggplot2 Wrathematics, my stack runneth over. June, 2012. Web.

"StackExchange." To all the contributors who provide answers on StackExchange, we can't thank you enough.

Citations for Packages Used

Citations created using the function (in **R**):

```
citation(""packageName"")
```

About the Authors

This post was a group effort. In addition to Raul Eulogio, whose information can be viewed below, the following people contributed their knowledge and creativity to this project:

David A. Campos

David is a startup and technology junkie with a passion for building products that help human beings reach their highest potential. Most recently, David cofounded Inertia7, a social platform for computer and data science enthusiasts to share their open-source projects. Still in its early development, Inertia7 has attracted more than 30 contributors,100 open-source projects, and 9,000 users. David was also the founder of Napses, an EdTech startup that placed third in the 2013 New Venture Competition at University of California, Santa Barbara (UCSB). In his free time, he loves learning, coding, drumming, painting, hiking, music festivals, and hanging out with friends. He holds a B.A. in Economics and Spanish, as well as a Technology Management Certificate from UCSB.

Nathan Fritter

Nathan has a B.S. in Statistics from UCSB, and is currently working as a business inteligence analyst for PointCare in the SF Bay Area. He contributes to inertia7, a open source platform for data science projects, and was a very active member in the Data Science at UCSB club during his time as an undergrad. Nathan can be seen catching a Golden State Warriors game with a nice IPA, putting in a good workout at the local gym, or working hard on an interesting machine learning project.

Amil Khan

Amil is currently a full time student at UCSB studying statistics. He also works as a materials science researcher at UCSB Beyerlein Lab, using his skills in machine learning to further the mechanical performance of advanced materials. His specialties are data visualization, time series, scientific computing, and storytelling. In his free time, he works on his Jeep JK and watches Lakers basketball.

Kimberly Specht

Kim's background is in applied statistics at UCSB. Her interest in data science has led her to a career in healthcare analytics. She is now working at a not-for-profit healthcare system in the San Francisco Bay Area where she is hoping to help transform patient care with predictive analytics.