In this series, I investigate the potential of several well-known neural network architectures to predict Canadian stock prices, given only historical prices and trade volumes. In part 1 of this series, I investigated convolutional neural networks and discovered that they had some predictive capability, though perhaps not enough to prove useful for trading.
For this instalment of the series, I’ve trained two different neural network architectures - the Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU) - on the same data. Both LSTM and GRU are instances of a broader class of neural network architectures known as Recurrent Neural Networks (RNNs).
RNNs are theoretically a good tool for analyzing stock price and volume history, because they are well suited to time series data (i.e. any data set that’s ordered in time). To help you understand why, let me briefly explain how RNNs work.
Let me start with an analogy from something we all have experience with: watching a movie. When we watch a movie, we only really see one frame of it at a time. But as we view each frame in succession, we develop memories around different aspects of the movie - who the characters are, why they are running from one place to another, etc. This not only helps us understand what is happening in the movie, but also evokes emotions in us.
Now, let’s suppose we want to construct a neural network that analyzes a movie. A non-RNN would try to analyze the movie by examining all the frames at once. In other words, it would examine the ending and the beginning scenes at the same time. While such a network would undoubtedly be able to derive some important information this way, such as the plot line, it may have trouble simulating the emotional response that human viewers experience while actually watching the movie. It would be like watching a movie for which you already know the ending.
An RNN, on the other hand, would analyze the movie frame by frame, in order. After the RNN analyzes each frame, it transfers some of what it learned to analyze the next frame. Since this is much closer to the human experience, it would theoretically be better at simulating the human viewer’s experience.
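To make the frame-by-frame idea concrete, here is a minimal sketch of a recurrent loop in plain Python. It is only an illustration, not the network used in this series: the names rnn_step and run_rnn, the single-number "frames", and the fixed weights are all assumptions chosen for readability. The point is that the same frames shown in a different order leave the network in a different final state.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: blend the new input with the state carried over."""
    # The weights here are arbitrary illustrative values, not trained ones.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, h0=0.0):
    """Process the inputs in order, passing the state from step to step."""
    h = h0
    states = []
    for x in sequence:
        h = rnn_step(x, h)   # what was learned so far feeds the next step
        states.append(h)
    return states

# The same three "frames", viewed in opposite orders.
forward = run_rnn([1.0, 0.0, -1.0])
backward = run_rnn([-1.0, 0.0, 1.0])
print(forward[-1], backward[-1])
```

Because each state depends on the one before it, the final values differ even though both runs saw exactly the same inputs.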
I think there are many similarities between watching a movie and trading stocks. In both cases, how a person feels in the present depends on the path taken to get there. If a stock had been mostly falling over the past year, then an investor in that stock would most likely feel fear. If the stock had been rising instead, the investor would probably feel optimistic.
Such emotions can have a large influence on the short term price of a stock. Indeed, such emotions are the theoretical basis that underpins technical analysis, which is the style of investing that tries to predict short term price movements based on historical price charts. Proponents of technical analysis claim that it works because it predicts the emotional response of investors when a stock price follows a pattern.
An RNN refers to any neural network architecture that transfers knowledge from one step to the next. Therefore, there are many different “incarnations” of RNNs. Two of the most well known incarnations are the ones we’re examining today - the LSTM and the GRU.
I won’t go into the deep technical details of how LSTMs work (here’s an excellent article for those interested). Instead, let me just explain the gist.
LSTMs have what’s known as a cell state and a hidden state. Despite its name, the hidden state is the output at each step (in our case, the stock price prediction), so it’s not really hidden at all. The cell state, on the other hand, stores other information that’s used to calculate the hidden state. This could include information about the investor’s psyche, such as the degree of fear and greed.
As the LSTM ingests information in each step, it updates both the hidden and cell states. In our case, a step consists of a single day’s stock price movement and traded volume. The LSTM will therefore take this new set of data and combine it with the stock price prediction and the investors’ emotional state from the day before, in order to produce a new stock price prediction and a new emotional state.
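The update described above can be sketched with a single-unit LSTM written in plain Python. This follows the standard LSTM equations, but it is a simplified illustration rather than my actual model: the input is a single number per day instead of price and volume, and the function name lstm_step, the params dictionary, and the weights are hypothetical placeholders.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM.

    p maps a gate name to illustrative weights (w_x, w_h, b).
    Returns the new hidden state h and the new cell state c.
    """
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to let in
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # updated cell state ("emotional state")
    h = o * math.tanh(c)               # updated hidden state (the output)
    return h, c

# Untrained placeholder weights; a real model would learn these.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for daily_return in [0.01, -0.02, 0.015]:   # made-up daily price moves
    h, c = lstm_step(daily_return, h, c, params)
print(h, c)
```

Each call takes today's data together with yesterday's hidden and cell states, and hands back the two states for tomorrow.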
As with every other neural network architecture, there are different ways to configure LSTMs. I tried a couple of dozen configurations using a relatively new machine learning library called PyTorch, and settled on the configuration that seemed to work best. The following graph shows the predictions generated by this “best” LSTM.
The x-axis of the graph represents the actual stock price return a week after the last historical date. In other words, if the input data consisted of daily historical price and volume data until March 7, then the x-axis shows the stock’s performance from March 7 to March 14. The y-axis represents the predicted stock price return by the LSTM in the same period.
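For readers who want to reproduce the x-axis values, here is a minimal sketch of how a one-week forward return could be computed from a list of daily closing prices. The function name forward_returns, the sample prices, and the choice of 5 trading days as "a week" are my assumptions for illustration only.

```python
def forward_returns(prices, horizon=5):
    """Return earned by holding from day t to day t + horizon, for each valid t."""
    return [prices[t + horizon] / prices[t] - 1.0
            for t in range(len(prices) - horizon)]

# Made-up daily closing prices, oldest first.
closes = [10.0, 10.2, 10.1, 10.4, 10.3, 10.5, 10.8]
actual = forward_returns(closes)
print(actual)
```

The last few days of any price history have no forward return yet, which is why the output is shorter than the input by the length of the horizon.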
As you can see, most of the dots in the graph belong in the region encircled in blue. This represents the area where y is close to 0. This means that most of the time, the LSTM predicts that the stock price won’t move. In other words, the LSTM is “shy” about making bold predictions, and likes to play it safe by suggesting that the stock price won’t go up or down.
Some machine learning practitioners may wonder if I’m getting this result because I didn’t let the LSTM train enough. The more data the LSTM sees, the more confident it will get about making bolder predictions. Indeed, I could let the model train for longer, and this would cause more dots to appear outside the blue encircled region. However, doing so seems to reduce the overall quality of the predictions as well. In other words, the bolder predictions tend to backfire, leading to more losses.
The R2 (R-squared) of the LSTM’s predictions appears to be close to 0, which means that the LSTM models I tried didn’t appear to have any predictive power at all. At first, I was surprised by the result, given the LSTM’s success in other applications, such as speech recognition. But on the other hand, it helped me understand why I couldn’t find any reputable publication on applying LSTMs to stock price prediction. I’m sure I wasn’t the first one to use LSTMs to predict stock prices. The lack of such publications suggests to me that other researchers have similarly had no luck with LSTMs.
The other popular instance of RNNs that I investigated is the GRU. There are many differences between how LSTMs and GRUs work, but the biggest conceptual difference is the fact that GRUs don’t have cell states. Instead, GRUs only carry hidden states forward.
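To illustrate the difference, here is a single-unit GRU step in plain Python, following one common formulation of the GRU equations. As before, this is a sketch and not my actual model: gru_step, the params dictionary, and the weights are hypothetical, and the input is reduced to a single number per day.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One step of a single-unit GRU.

    p maps a gate name to illustrative weights (w_x, w_h, b).
    Only the hidden state is carried forward; there is no cell state.
    """
    def linear(name, h):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    z = sigmoid(linear("update", h_prev))   # how much of the old state to keep
    r = sigmoid(linear("reset", h_prev))    # how much of the old state to consult
    h_cand = math.tanh(linear("candidate", r * h_prev))
    return (1.0 - z) * h_cand + z * h_prev  # blend the candidate with the old state

# Untrained placeholder weights; a real model would learn these.
params = {name: (0.5, 0.5, 0.0) for name in ("update", "reset", "candidate")}

h = 0.0
for daily_return in [0.01, -0.02, 0.015]:   # made-up daily price moves
    h = gru_step(daily_return, h, params)
print(h)
```

Compared with an LSTM step, there are three sets of weights instead of four and one state instead of two, which is the simpler structure referred to here.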
Despite its simpler structure, researchers appear to have had more success on many applications using GRUs than they did with LSTMs, especially on those applications without a lot of data. I believe this is because the simpler structure actually helps to prevent overfitting (see this article for a discussion on overfitting). In fact, that appears to be the case with predicting stock prices as well.
As with LSTMs, I tried many different configurations of GRUs to find out what works best. The graph below shows the predictions generated by my best GRU.
As you can see, the GRU is also mostly gun-shy about making bolder predictions, judging by the concentration of green dots around the level where y is close to 0. However, the dots are more dispersed than in the LSTM graph, suggesting that the GRU is less shy about making predictions than the LSTM.
We see this phenomenon because the GRU sometimes does make predictions that are useful. The R2 of the GRU’s predictions is roughly 6%, which means that the predictions explain roughly 6% of the variation in the actual stock returns. This is high enough to say that there’s something there, but not high enough that we can expect to create an exciting trading strategy around it.
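For reference, here is a minimal sketch of how R2 can be computed from actual and predicted returns, using the common definition of one minus the ratio of residual to total sum of squares. The function name r_squared and the sample numbers are made up for illustration, and I'm assuming this definition rather than, say, the squared correlation.

```python
def r_squared(actual, predicted):
    """R2 = 1 - (sum of squared errors) / (total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up weekly returns.
actual = [0.02, -0.01, 0.03, -0.04]

# A "shy" model that always predicts the average return.
mean_return = sum(actual) / len(actual)
shy = [mean_return] * len(actual)

print(r_squared(actual, shy))      # the shy model
print(r_squared(actual, actual))   # a perfect model
```

Under this definition, a model that always predicts the average return scores an R2 of 0 and a perfect model scores 1, so an R2 near 0 means the predictions are no better than assuming the stock does nothing unusual.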
Interestingly enough, 6% is the same level of R2 that I arrived at using convolutional neural networks (CNNs), which I explored in part 1 of this series. This makes me wonder if GRUs and CNNs are detecting the same stock price influencers through different means, or whether they’re actually detecting different influencers that just happen to have the same levels of predictability.
In future instalments of this series, I plan on investigating the effectiveness of combining CNNs and GRUs. If they are indeed detecting the same influencers, then we probably won’t see R2 values much higher than 6%. But if they are detecting different influencers, then we might start to see R2 values that are high enough to merit serious investigation as a trading strategy.