MLFinLab Package & Research Update.

First of all, we want to thank everyone who has reached out to us with ideas and contributions to our package. Without your help, none of this would be possible.

We have done a lot of work this week and hope that this update gives you more insight into both the package for Advances in Financial Machine Learning and the research notebooks, which answer the questions at the back of every chapter.

In the next two weeks we will wrap up the first part of our research project with a short paper titled: Does Meta-Labeling Add to Signal Efficacy? If possible, we will also publish it as a blog post.

So let's get started!

Barriers to Entry

As most of you know, getting through the first three chapters of the book is challenging because it relies on HFT (tick-level) data to create the new financial data structures. Sourcing this data is very difficult, so we resorted to purchasing the full history of S&P 500 E-mini futures tick data from TickData LLC.

We are not affiliated with TickData in any way, but we would recommend their service to others. They have done a great job of cleaning the data and providing it in a user-friendly manner.

Sample Data

TickData offers about 20 days' worth of raw tick data, which can be sourced from their website (link).

For those of you interested in working with two years (2015-01-01 to 2017-01-01) of sample tick, volume, and dollar bars, they are provided in the research repo. For a more detailed view of how the data is cleaned, you can check out this repo. Please note that we don’t share the tick data itself, only the transformed data structures as described by de Prado.

You should be able to work on a few implementations of the code with this set and we hope that it helps the community.

Research Notebooks

We added the following notebooks to the research repo:

Chapter 2: Sample Techniques

This notebook analyses the various sampling techniques (Tick, Volume, & Dollar bars) and their statistical properties. In particular we have a look at:

  • Which bar type produces the most stable weekly count.
  • Compute the serial correlation of returns for the three bar types and determine which has the lowest serial correlation.
  • Apply the Jarque-Bera normality test to returns from the three bar types to see which has the lowest test statistic, that is, which is closest to a normal distribution.
  • Standardize & Plot the Distributions. Readers will note that we try to match our plot with that from the paper: The Volume Clock.
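
The checks listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the notebook's code: the function names are hypothetical, and the two samples are randomly generated stand-ins for bar returns rather than real data.

```python
import random


def lag1_autocorr(returns):
    """Lag-1 serial correlation of a sequence of returns."""
    n = len(returns)
    mean = sum(returns) / n
    num = sum((returns[i] - mean) * (returns[i - 1] - mean) for i in range(1, n))
    den = sum((r - mean) ** 2 for r in returns)
    return num / den


def standardize(returns):
    """Rescale returns to zero mean and unit (population) variance."""
    n = len(returns)
    mean = sum(returns) / n
    std = (sum((r - mean) ** 2 for r in returns) / n) ** 0.5
    return [(r - mean) / std for r in returns]


def jarque_bera(returns):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(returns)
    z = standardize(returns)
    skew = sum(v ** 3 for v in z) / n
    kurt = sum(v ** 4 for v in z) / n
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


# Synthetic stand-ins for bar returns (assumption: not real market data).
random.seed(0)
gaussian = [random.gauss(0.0, 0.01) for _ in range(5000)]
fat_tailed = [random.gauss(0.0, 0.01) * random.gauss(0.0, 1.0) for _ in range(5000)]

for name, sample in (("gaussian", gaussian), ("fat_tailed", fat_tailed)):
    print(name, round(lag1_autocorr(sample), 4), round(jarque_bera(sample), 1))
```

A lower Jarque-Bera statistic means the sample's skewness and kurtosis sit closer to those of a normal distribution, which is the comparison made across the bar types.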

Chapter 3: Labeling

We spent most of our research time this week on Chapter 3 and developed four notebooks.

Chapter3-Part1:

  • Answers the first, easier set of questions at the back of Chapter 3. We split the meta-labeling section out into three other notebooks.
  • Apply CUSUM filter
  • Compute vertical barriers
  • Apply the triple barrier method (labeling)
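
As a rough illustration of the three steps above, here is a minimal sketch in plain Python. It makes several simplifying assumptions that are ours, not the notebook's: the CUSUM filter runs on raw price changes, the horizontal barriers are fixed percentages rather than volatility-scaled, the vertical barrier is a fixed number of bars, and a label of 0 is returned when the vertical barrier is reached first. All function and parameter names are hypothetical.

```python
def cusum_events(prices, threshold):
    """Symmetric CUSUM filter: indices where cumulative drift exceeds threshold."""
    events = []
    s_pos = s_neg = 0.0
    for i in range(1, len(prices)):
        change = prices[i] - prices[i - 1]
        s_pos = max(0.0, s_pos + change)
        s_neg = min(0.0, s_neg + change)
        if s_neg < -threshold:
            s_neg = 0.0
            events.append(i)
        elif s_pos > threshold:
            s_pos = 0.0
            events.append(i)
    return events


def vertical_barrier(prices, start, max_holding):
    """Index of the last bar an event may stay open (capped at end of data)."""
    return min(start + max_holding, len(prices) - 1)


def triple_barrier_label(prices, start, profit_take, stop_loss, max_holding):
    """Return 1 (upper barrier first), -1 (lower first) or 0 (vertical first)."""
    entry = prices[start]
    end = vertical_barrier(prices, start, max_holding)
    for j in range(start + 1, end + 1):
        ret = prices[j] / entry - 1.0
        if ret >= profit_take:
            return 1
        if ret <= -stop_loss:
            return -1
    return 0


# Toy price path (assumption: illustrative numbers only).
prices = [100, 101, 102, 103, 102, 101, 100, 99, 98, 100, 104]
events = cusum_events(prices, threshold=2.5)
labels = [triple_barrier_label(prices, e, 0.03, 0.03, max_holding=5) for e in events]
print(events, labels)
```

The filter resets its running sum after each event, so events are only sampled when the price has drifted by more than the threshold since the last one.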

Meta-Labels MNIST:

  • This notebook explores the idea of Meta-Labels using the MNIST data set.
  • We chose MNIST because it is a solved problem, which lets us build an intuition for what the model is doing.
  • We also read the 39 papers in the bibliography section to filter down to the core papers that we believe are the inspiration for meta-labels.
  • Conclusion: Meta-Labeling is a genius concept that can be applied to all types of primary models (including fundamental PMs).
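
To make the mechanics concrete, here is a minimal sketch of how meta-labels can be built and used for a binary classifier. It assumes a high-recall primary model and a secondary model that outputs a probability that the primary call is correct. The predictions and probabilities below are made-up numbers, and the function names are hypothetical, not taken from the notebook.

```python
def meta_labels(primary_preds, truths):
    """Meta-label is 1 where the primary model was right, 0 where it was wrong."""
    return [1 if p == t else 0 for p, t in zip(primary_preds, truths)]


def apply_meta_filter(primary_preds, meta_probs, threshold=0.5):
    """Keep a positive primary call only if the secondary model is confident."""
    return [
        1 if (p == 1 and m >= threshold) else 0
        for p, m in zip(primary_preds, meta_probs)
    ]


def precision(preds, truths):
    """Share of positive calls that were correct."""
    tp = sum(1 for p, t in zip(preds, truths) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, truths) if p == 1 and t == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(preds, truths):
    """Share of true positives that were called."""
    tp = sum(1 for p, t in zip(preds, truths) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(preds, truths) if p == 0 and t == 1)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up example (assumption): a high-recall, low-precision primary model.
truths = [1, 0, 1, 1, 0, 0, 1, 0]
primary = [1, 1, 1, 1, 1, 0, 1, 1]
# Targets a secondary model would be trained on.
targets = meta_labels(primary, truths)
# Hypothetical probabilities produced by such a secondary model.
meta_probs = [0.9, 0.2, 0.8, 0.4, 0.3, 0.5, 0.7, 0.6]
filtered = apply_meta_filter(primary, meta_probs)

print(precision(primary, truths), recall(primary, truths))
print(precision(filtered, truths), recall(filtered, truths))
```

In this toy example the filter trades some recall for higher precision, which is the kind of trade-off a secondary model is meant to manage.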

Trend Following Question:

  • We apply a trend following strategy as the primary model and then fit a Random Forest meta-labeling model.
  • Conclusion: It's great to see in this notebook that meta-labeling improved the portfolio’s performance metrics, reducing the maximum drawdown from -41.8% to -20.8% and increasing the return from -4.2% to 4.2%!
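
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two metrics quoted above, computed from a series of per-period returns. The return series and the keep/skip mask are invented for illustration and are not the notebook's results; the function names are hypothetical.

```python
def total_return(returns):
    """Compounded return over the whole series."""
    equity = 1.0
    for r in returns:
        equity *= 1.0 + r
    return equity - 1.0


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve (a negative number)."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


# Invented per-bet returns of a primary model (assumption).
primary = [0.05, -0.20, 0.03, -0.10, 0.04]
# Hypothetical meta-model decisions: 1 = take the bet, 0 = skip it.
keep = [1, 0, 1, 1, 1]
filtered = [r if k else 0.0 for r, k in zip(primary, keep)]

print(round(total_return(primary), 4), round(max_drawdown(primary), 4))
print(round(total_return(filtered), 4), round(max_drawdown(filtered), 4))
```

Skipped bets are modeled here as zero-return periods, which is a simplification that ignores transaction costs and bet sizing.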

Mean Reverting Question:

  • Apply a mean reverting strategy using Bollinger Bands.
  • The standalone strategy works extremely well, most likely on account of the better statistical properties of the dollar bars.
  • For the Bollinger Bands mean-reverting strategy, the meta-labeling process loses some upside (annualized return of 44% vs. 58%) but reduces the risk of the strategy compared to the primary model. The maximum drawdown falls from 24% to 12.3%.
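
A Bollinger Bands primary model of this kind can be sketched as follows. This is a minimal version under our own assumptions: a 20-bar window, bands at two population standard deviations, the current bar included in the window, and a signal of +1 (long) below the lower band and -1 (short) above the upper band. The notebook's exact parameters may differ.

```python
import statistics


def bollinger_signals(prices, window=20, num_std=2.0):
    """Mean-reversion side: +1 below the lower band, -1 above the upper, else 0."""
    signals = [0] * len(prices)
    for i in range(window - 1, len(prices)):
        recent = prices[i - window + 1 : i + 1]
        mid = statistics.mean(recent)
        band = num_std * statistics.pstdev(recent)
        if prices[i] > mid + band:
            signals[i] = -1  # price stretched above the band: bet on a fall
        elif prices[i] < mid - band:
            signals[i] = 1  # price stretched below the band: bet on a rise
    return signals


# Toy series (assumption): flat prices followed by a sharp move.
spike_up = [100] * 19 + [110]
spike_down = [100] * 19 + [90]
print(bollinger_signals(spike_up)[-1], bollinger_signals(spike_down)[-1])
```

The resulting side (+1 or -1) is what a meta-labeling model would then decide to act on or skip.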

mlfinlab Package Updates

We have implemented the standard bar types. The implementation works on very large CSV files (25 GB and up) and runs fast, so there are no long waiting periods. As soon as we are done with the write-up for the project (end of next week), we will add the code for Chapter 3. This takes a while because of all the unit tests we need to write; 100% code coverage takes time.

We did, however, add quite a few README files, a code of conduct, templates for raising issues, and contributing guidelines. Our hope is that this makes it easier for early adopters to create issues. We also have a public agile board so everyone can see what we are working on.
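
To give a feel for how bars can be built from a file that does not fit in memory, here is a minimal sketch of dollar bars computed by streaming a CSV row by row. It is not the mlfinlab implementation: the column names (timestamp, price, volume), the function names, and the threshold are all assumptions made for illustration.

```python
import csv
import io


def read_ticks(handle):
    """Yield (timestamp, price, volume) one row at a time from an open CSV."""
    for row in csv.DictReader(handle):
        yield row["timestamp"], float(row["price"]), float(row["volume"])


def dollar_bars(ticks, threshold):
    """Yield a bar each time the cumulative traded value reaches threshold."""
    bar = None
    dollars = 0.0
    for timestamp, price, volume in ticks:
        if bar is None:
            bar = {"open": price, "high": price, "low": price, "volume": 0.0}
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["volume"] += volume
        bar["timestamp"] = timestamp
        dollars += price * volume
        if dollars >= threshold:
            yield bar
            bar, dollars = None, 0.0
    # Any unfinished bar at the end of the file is dropped in this sketch.


# In-memory stand-in for a large tick file (assumption: made-up ticks).
sample = io.StringIO(
    "timestamp,price,volume\n"
    "t1,100,5\n"
    "t2,101,5\n"
    "t3,102,10\n"
    "t4,99,3\n"
    "t5,100,8\n"
)
bars = list(dollar_bars(read_ticks(sample), threshold=1000))
for b in bars:
    print(b)
```

Because both functions are generators, only the current row and the bar under construction are held in memory. Swapping the accumulated quantity for traded volume or a tick count would give volume or tick bars in the same way.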

Community Feedback

We have been following the principles of the Lean Startup to see which parts of the project most interest the community. The feedback we have had is amazing, and we would like to thank everyone who reached out to us with contributions and guidance.

Interestingly, most of our feedback came from the Quant / Computer Science communities. We got a very low response from traditional channels such as those associated with discretionary portfolio managers.

Last but not least, we have the performance metrics for our GitHub group:

It’s clear from the image above that the majority want us to work further on developing the package, so that is where we will be spending our time. We do, though, expect interest in the research repo to rise now that it contains sample data.

That’s all the news we have for now! Got to get back to writing the report.