You’d think that machine learning would thrive in finance. After all, the whole industry operates on data. Banks look at your credit history (data!) before they approve your mortgages. Credit card companies look at payment patterns (data!) to sniff out fraud. Investors use financial statements (data!) to select stocks to invest in.
Whenever there’s an abundance of data, machine learning can often help you make better decisions, faster and at lower cost. You may therefore be surprised to learn not only that machine learning adoption has been slow in finance, but also that many finance professionals are skeptical of its potential.
Over the past several years, I’ve heard the same basic refrain over and over again from the skeptics: “I knew so-and-so who started a hedge fund based on machine learning, and it collapsed,” or “We hired a machine learning PhD, but his algorithm didn’t perform any better than our existing methodology.” In other words, the slowness of adoption is not for lack of trying. Rather, machine learning just hasn’t produced good results for those who have tried it.
The data bears out this disappointment. From September 2014 to September 2019, the Eurekahedge AI Hedge Fund Index, which tracks funds powered by machine learning, returned 45.5%. An S&P 500 index fund, by contrast, would have returned over 66%.
But why has machine learning failed to produce positive results? Whenever I’ve seen that question asked, I’ve heard two types of answers.
The first type of answer blames machine learning practitioners for their lack of financial knowledge. A physicist, for example, may not be aware of the difference between a financial statement’s fiscal and filing dates. Such gaps in knowledge have probably led to flawed machine learning models.
The second type of answer comes from mathematicians such as Lopez de Prado, who have criticized the statistical methodology employed by investment firms. For example, Lopez de Prado points out that many firms have abused backtesting by running backtests over and over again until they got results that they liked. Many firms have undoubtedly fallen into the traps that Lopez de Prado describes, which would account for their failures.
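As a rough illustration of that backtesting trap, the Python sketch below simulates 200 “strategies” whose daily returns are pure noise and then picks the one with the best backtest. Everything in it is an assumption made for illustration (the `sharpe` and `random_strategy` helpers, the 250-day window, the 1% daily volatility, the number of trials); it is not Lopez de Prado’s code or any firm’s actual process.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def sharpe(returns):
    """Mean divided by standard deviation of per-period returns (not annualized)."""
    return statistics.mean(returns) / statistics.stdev(returns)


def random_strategy(n_days=250):
    """Hypothetical 'strategy' with no edge: zero-mean Gaussian daily returns."""
    return [rng.gauss(0.0, 0.01) for _ in range(n_days)]


# Run the same kind of backtest many times and keep only the best result.
n_trials = 200
in_sample = [sharpe(random_strategy()) for _ in range(n_trials)]
best = max(in_sample)
typical = statistics.median(in_sample)

# The selected strategy has no real edge, so fresh data is just another draw.
fresh = sharpe(random_strategy())

print(f"best backtest Sharpe:   {best:.3f}")
print(f"median backtest Sharpe: {typical:.3f}")
print(f"same process, new data: {fresh:.3f}")
```

The best of many tries looks good only because it was selected after the fact; the underlying process never had any skill to begin with.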
In summary, those with finance backgrounds blame the architects’ lack of financial knowledge for machine learning algorithms’ failures, while mathematicians blame the finance folks for their lack of mathematical knowledge. But is this the whole story?
For the past several months, my firm has been working on a sophisticated machine learning model that we hope to turn into a fund. In the process of creating the model, it occurred to me that there’s one particular challenge that nobody ever talks about, a challenge that could have caused many machine learning models to fall short of expectations. The lack of discussion surrounding this challenge is a shame, because I think it is the most difficult component of machine learning to manage.
So what is this challenge I’m talking about? I’m referring to the mundane topic of producing code without bugs.
While we humans are good at a lot of things, avoiding mistakes is not one of them. Some people estimate that there are typically between 1 and 25 bugs per 1,000 lines of code. That’s a wide range. Several factors determine whether the defect rate is high or low, and I believe that the business domain of the software is one of them.
Unfortunately, financial software lends itself to high defect rates. Think about it. Let’s say a programmer is developing a content management system that lets users change the background colour of their websites. It would be very easy to see whether such a system was working: either the background colour would change as specified or it wouldn’t. In such a case, the programmer would fix any bugs in this area very quickly.
But things are not so obvious with software that deals primarily with numbers. For example, let’s say a programmer creates a function that calculates the standard deviation of a list of numbers. He then observes that when the list [1, -2, 5] is fed in, the function returns the value ‘0.8’. Now, if you’re a good statistician, that answer might make you pause and re-evaluate the function: the values span seven units, so a standard deviation of 0.8 is implausibly small. But a typical programmer without that background may not see that there’s a problem.
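To put numbers on that example, here is a minimal Python sketch that computes the reference values with the standard library’s `statistics` module and adds a cheap plausibility check. The helper `plausible_population_sd` is hypothetical and only meant as an illustration: it relies on the fact that, for two or more values, the population standard deviation lies between range / sqrt(2n) and range / 2.

```python
import math
import statistics


def plausible_population_sd(xs, result):
    """Hypothetical sanity check: for n >= 2 values, the population standard
    deviation lies between range / sqrt(2n) and range / 2."""
    spread = max(xs) - min(xs)
    lower = spread / math.sqrt(2 * len(xs))
    upper = spread / 2
    tol = 1e-9  # small tolerance for floating-point rounding
    return lower - tol <= result <= upper + tol


values = [1, -2, 5]
population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(round(population_sd, 3), round(sample_sd, 3))    # 2.867 3.512
print(plausible_population_sd(values, 0.8))            # False
print(plausible_population_sd(values, population_sd))  # True
```

A check like this won’t tell you which formula the function got wrong, but it flags 0.8 as impossible without anyone having to eyeball the output.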
Some people think the answer to this problem is to get finance and statistics people to double-check the programmers’ work. But in my experience, this method isn’t very effective. I have yet to meet anyone who enjoys manually double-checking programmers’ work, and lacking interest, they tend to do the job poorly. Also, I’ve seen finance and statistics people point out errors to programmers, only for the programmers to fix the problem for that specific scenario and leave the root cause intact.
This is a much more severe problem than many people realize. As an anecdote, we once took over a software project where the client had previously contracted offshore developers to create financial algorithms. We found that none, and I mean none, of the major calculations were being done correctly, despite the fact that the client had double-checked the developers’ work. I don’t think it’s any different at major banks or financial institutions. I should know, since I used to work for a bank.
Yet the consequences of even a single bug can be severe for a machine learning algorithm. As an example, imagine mistakenly including some out-of-sample data with the training data. The algorithm would probably appear to perform really well, but fall flat when used in the real world. Just one bug would be responsible for the illusion of a strong performance.
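One way to guard against this particular bug is to split the data by date and then assert that the split really is clean. The Python sketch below is a minimal illustration under assumed conditions: the rows are hypothetical `(date, value)` tuples, and `chronological_split` and `assert_no_leakage` are made-up helper names rather than functions from any library.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Split (date, value) rows so that training data ends strictly before cutoff."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


def assert_no_leakage(train, test):
    """Raise if any training row is dated on or after the earliest test row."""
    if not train or not test:
        return
    latest_train = max(r[0] for r in train)
    earliest_test = min(r[0] for r in test)
    if latest_train >= earliest_test:
        raise ValueError(
            f"leakage: training data runs to {latest_train}, "
            f"but test data starts on {earliest_test}"
        )


# Twelve monthly observations (made-up values for illustration).
rows = [(date(2019, m, 1), float(m)) for m in range(1, 13)]
train, test = chronological_split(rows, cutoff=date(2019, 10, 1))
assert_no_leakage(train, test)  # clean split: passes silently

# Simulate the bug: one out-of-sample row slips into the training set.
leaky_train = train + [test[0]]
try:
    assert_no_leakage(leaky_train, test)
    caught = False
except ValueError:
    caught = True

print(len(train), len(test), caught)  # 9 3 True
```

A guard like this only catches the simplest form of leakage. Features computed from future data, for example, would pass it untouched.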
Unfortunately, it’s difficult to avoid such mistakes even if you’re extremely meticulous. Machine learning projects tend to be a lot bigger than traditional investment algorithm projects, giving them much more room for error. Our project, for example, consists of over 10,000 lines of code, not counting all the third-party libraries we use. Even if we assume a very low defect rate of 1 bug per 1,000 lines, that still means we probably have at least 10 bugs in our software. For similarly sized projects written by programmers without finance or math knowledge, a rate closer to 25 bugs per 1,000 lines would mean over 200 bugs.
Of course, there are ways to mitigate this problem. We blanket our code base with tests. We also ensure that multiple people review code before it gets merged into production. We only hire people who are detail-oriented. (We’ve had to let go of people in the past for making too many mistakes.) But I’ve found that none of these methods is a panacea. We have to fight the never-ending battle of discovering and fixing bugs.
Thankfully, I feel confident that we’ll develop an algorithm free of major bugs. We’re able to do this because at least some of our programmers have the finance, math and programming knowledge to be able to recognize bugs and write good tests. But whenever we find an unintuitive bug that could have derailed our results, I wonder, how many other firms currently have similar defects in their machine learning algorithms?