Mlfinlab 0.5.2 Version Release

Over the last four months the research team has focused on wrapping up the final chapters of Advances in Financial Machine Learning, as well as a few extra papers from the Journal of Financial Data Science. We are very excited that 2020 brings the release of three new financial machine learning textbooks, which promise many rewarding tools.

We have recently set up a Patreon account where members of the community can support our work via sponsorship and help the group continue its development for the community at large. A special thank you to our sponsors: it is because of you that we are able to pay for data, academic journals, and other third-party expenses. In particular, the work we did on the Model Fingerprint and the Codependence module was a direct result of your donations.

A special thank you to the following sponsors:

  1. Machine Factor Technologies
  2. E.P. Chan & Associates
  3. Markov Capital
  4. John B. Keown
  5. Roberto Spadim
  6. Zack Gow
  7. Jack Yu

If you haven’t already, please do star our GitHub repo. This helps us rank our package amongst our peers and obtain funding from open-source foundations.


So what’s new?

Bar data structures

We have refactored the BaseBars class and added extra fields to the bar data schema (a brief usage sketch follows the list below):

  • Added the tick_num field, which corresponds to the tick number at which the bar was formed. Our research team faced a problem where several ticks may share the same timestamp; in that case the timestamp is no longer a unique identifier, but the tick number (index) is. This new column helped us a lot in the MicrostructuralFeaturesGenerator implementation, which is described below.
  • Added the cum_dollar_value field (sum of tick price * tick size), which is used in the Amihud/Hasbrouck lambda calculations.
  • Added the cum_buy_volume field, which corresponds to the volume classified as buy volume using the tick rule algorithm.
  • Added pandas.DataFrame as a possible input for bar data generation.
  • Added a Time Bars structure
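
For illustration, here is a minimal sketch of generating dollar bars and inspecting the new fields. The ticks file path and threshold are placeholders, and the exact import path is an assumption; check the documentation for your version:

import pandas as pd
from mlfinlab.data_structures import get_dollar_bars

# 'TICKS_PATH' is a placeholder CSV of raw ticks (date_time, price, volume);
# a pandas.DataFrame with the same columns now works as input as well
dollar_bars = get_dollar_bars('TICKS_PATH', threshold=70000000, batch_size=1000000, verbose=False)

# The new schema fields are available directly on the resulting bars
print(dollar_bars[['tick_num', 'cum_dollar_value', 'cum_buy_volume']].head())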

Information-driven bars – changes:

  • We were faced with the problem of having to define hyperparameters for the information-driven bars. We cross-referenced our implementation with others and found that they suffered from the same problem: different settings would either result in bars which converge to ticks, or diverge to almost daily or even weekly bars. We decided to add the analyse_thresholds flag to information-driven bars generation. If this value is True, the bar compression function returns the generated bars along with a DataFrame of threshold values for each tick, so that the user can understand how the hyperparameters influence threshold values.
  • Introduced ConstantImbalance/RunBars structures. Finding the optimal values for expected_num_ticks_window and num_prev_bars may take some time, so we decided to introduce a data structure with only one hyperparameter (num_prev_bars), while the expected number of ticks is fixed by the researcher.
  • These two solutions make the information-driven bars more practical. A usage sketch follows this list.
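
As a hedged sketch of the analyse_thresholds flag (the function name get_ema_dollar_imbalance_bars and the other parameter names below are assumptions modeled on the package's naming conventions; consult the documentation for exact signatures):

from mlfinlab.data_structures import get_ema_dollar_imbalance_bars

# With analyse_thresholds=True the function returns the generated bars
# together with a DataFrame of per-tick threshold values for diagnostics
bars, thresholds = get_ema_dollar_imbalance_bars('TICKS_PATH', num_prev_bars=3,
                                                 exp_num_ticks_init=20000,
                                                 analyse_thresholds=True)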

Structural breaks tests:

We’ve added the Chu-Stinchcombe-White CUSUM, Chow-type Dickey-Fuller, and SADF tests (the latter with autoregressive, sub- and super-martingale model specifications). The following code can be used:

import pandas as pd
import numpy as np
from mlfinlab.structural_breaks import get_chu_stinchcombe_white_statistics, get_chow_type_stat, get_sadf

bars = pd.read_csv('BARS_PATH', index_col=0, parse_dates=[0])
log_prices = np.log(bars.close) # see p.253, 17.4.2.1 Raw vs Log Prices

# SADF test
linear_sadf = get_sadf(log_prices, model='linear', add_const=True, min_length=20, lags=5)
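
The Chu-Stinchcombe-White and Chow-type statistics imported above can be computed in the same fashion (the keyword arguments shown are indicative and may differ slightly across versions):

# Chu-Stinchcombe-White CUSUM test on levels
cusum_stats = get_chu_stinchcombe_white_statistics(log_prices, test_type='one_sided')

# Chow-type Dickey-Fuller test
chow_stats = get_chow_type_stat(log_prices, min_length=20)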

Market microstructure features:

There are two types of microstructural features:

  1. Those that use bar data only (OHLC prices, volumes, dollar volumes, …).
  2. Those that are formed on a bar basis but require information from the ticks forming each bar.

For example, the Kyle/Amihud/Hasbrouck lambdas can be formed using either bar data alone or tick data. Entropy features, however, need tick data to encode the message used for entropy estimation.

Bar-based features added (a brief usage sketch follows this list):

  • Bar-based Kyle/Amihud/Hasbrouck lambdas
  • Roll measure, Roll impact, Corwin-Schultz estimators
  • VPIN
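
A minimal sketch of the bar-based helpers, reusing the bars DataFrame loaded in the structural-breaks example above (the function and column names here are assumptions based on the docs; window sizes are illustrative):

from mlfinlab.microstructural_features import get_bar_based_kyle_lambda, get_roll_measure, get_vpin

# Rolling estimates over an illustrative 20-bar window
kyle_lambda = get_bar_based_kyle_lambda(bars.close, bars.volume, window=20)
roll_measure = get_roll_measure(bars.close, window=20)
vpin = get_vpin(bars.volume, bars.cum_buy_volume, window=20)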

While implementing tick-based features, we faced a problem: all feature values are generated for each bar, but they need to use tick-level information. That is why we implemented the MicrostructuralFeaturesGenerator class.

How it works:

The mlfinlab bar schema was extended to store the tick number (tick_num) at which a given bar was formed. The tick number series and the raw tick data are then used as inputs to the MicrostructuralFeaturesGenerator, which forms the features on a per-bar basis. The class also generates VWAP, average tick size, tick rule sum, and tick rule entropies.

import numpy as np
import pandas as pd
from mlfinlab.microstructural_features import quantile_mapping, MicrostructuralFeaturesGenerator

df_trades = pd.read_csv('TRADES_PATH', parse_dates=[0])
# Log returns; the first value is NaN and is filtered out on the next line
df_trades['log_ret'] = np.log(df_trades.Price / df_trades.Price.shift(1))
non_null_log_ret = df_trades[df_trades.log_ret != 0].log_ret.dropna()

# Take unique volumes only
volume_mapping = quantile_mapping(df_trades.Volume.drop_duplicates(), num_letters=10)

returns_mapping = quantile_mapping(non_null_log_ret, num_letters=10)

# Compress bars from ticks
compressed_bars = pd.read_csv('BARS_PATH', index_col=0, parse_dates=[0])
bar_index = compressed_bars.index

gen = MicrostructuralFeaturesGenerator('TRADES_PATH', bar_index, volume_encoding=volume_mapping,
                                           pct_encoding=returns_mapping)
features = gen.get_features(to_csv=False, verbose=False)

Entropy features

  • Added Shannon, Lempel-Ziv, Plug-In, and Kontoyiannis entropy estimators
  • Added Quantile and Sigma encoding schemes
  • Price/Volume/Tick rule entropy estimation needs tick data for message encoding, and is part of the MicrostructuralFeaturesGenerator. If the volume_encoding value is not None and is set to an encoding dictionary (generated by either the quantile or sigma encoding scheme), volume entropies will be generated. The same holds for the pct_encoding parameter. A direct-call sketch follows this list.
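
The entropy estimators can also be called directly on an encoded message; a toy sketch (the import paths are assumed to mirror the rest of the module):

from mlfinlab.microstructural_features import get_shannon_entropy, get_lempel_ziv_entropy, get_plug_in_entropy

message = '11100001'  # toy encoded message
shannon = get_shannon_entropy(message)
lempel_ziv = get_lempel_ziv_entropy(message)
plug_in = get_plug_in_entropy(message, word_length=1)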

Other features

  • Added Garman-Klass, Parkinson, Bekker-Parkinson, and Yang-Zhang volatility estimators
  • Added the BVC (bulk volume classification) algorithm (a hedged sketch follows this list)
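
A hedged sketch of the BVC helper, again reusing the bars loaded earlier (the name get_bvc_buy_volume and its arguments are assumptions; see the documentation):

from mlfinlab.microstructural_features import get_bvc_buy_volume

# Estimate buy volume via BVC with an illustrative 20-bar window
buy_volume = get_bvc_buy_volume(bars.close, bars.volume, window=20)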

Clustering

  • Added the Optimal Number of Clusters (ONC) algorithm, used in detecting overfit strategies. A usage sketch follows.
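
A sketch of the intended usage (the function name get_onc_clusters and its return values are assumptions; see the documentation for the exact API):

import numpy as np
import pandas as pd
from mlfinlab.clustering import get_onc_clusters

# Toy correlation matrix of four strategies' returns
returns = pd.DataFrame(np.random.randn(250, 4), columns=['s1', 's2', 's3', 's4'])
corr_mat = returns.corr()

# Returns the reordered correlation matrix, cluster assignments, and silhouette scores
corr, clusters, silh_scores = get_onc_clusters(corr_mat, repeat=10)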

Codependence

  • Added mutual information estimation
  • Added variation of information estimation
  • Added angular (absolute and squared) distance estimation (a modification of the correlation coefficient so that it satisfies the conditions of a metric)
  • Added distance correlation estimation (a usage sketch follows this list)
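
A minimal sketch of the codependence estimators on two toy series (the function names are assumptions modeled on the docs):

import numpy as np
from mlfinlab.codependence import get_mutual_info, variation_of_information_score, distance_correlation

# Two series with a non-linear dependence that linear correlation would understate
x = np.random.normal(size=1000)
y = x ** 2 + np.random.normal(size=1000) * 0.1

mutual_info = get_mutual_info(x, y)
var_of_info = variation_of_information_score(x, y)
dist_corr = distance_correlation(x, y)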

Feature importance

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from mlfinlab.feature_importance import RegressionModelFingerprint

data = load_boston() # Get a dataset
X = pd.DataFrame(columns=data['feature_names'], data=data['data'])
y = pd.Series(data['target'])

# Fit the model
reg = RandomForestRegressor(n_estimators=10, random_state=42)
reg.fit(X, y)

reg_fingerprint = RegressionModelFingerprint()

# Fit the fingerprint to the trained model
reg_fingerprint.fit(reg, X, num_values=20, pairwise_combinations=[('CRIM', 'ZN'), ('RM', 'AGE'), ('LSTAT', 'DIS')])
linear_effect, non_linear_effect, pair_wise_effect = reg_fingerprint.get_effects()  # Get linear, non-linear, and pairwise effects

# Plot the results
fig = reg_fingerprint.plot_effects()
fig.show()

Bug fixes

  • Fixed an error in labeling/get_daily_vol when the timestamp index contains time-zone info
  • Fixed import errors caused by modules not being exposed in the __init__.py files (portfolio_optimisation, z_score filter, label uniqueness)

Note:

We have recently updated our online documentation with all of the new additions to the code base. It is by far the easiest way to learn how to use the package.