An implementation of meta-labeling on minute-frequency stock data. An LSTM serves as the primary model for predicting price direction (side), on which a trading strategy is built. A secondary meta-labeling layer (size) filters false positives and improves risk-return metrics.
This repository is a reorganisation of my master's dissertation.
- Using minute-level price and volume data for Apple, 20 technical indicators are derived and used as inputs to an LSTM model that predicts price movement.
- Based on this side prediction (up or down), a threshold mechanism is used to construct the underlying trading strategy.
- Then, meta-labeling applies a secondary model to assess the credibility of this trading decision. It uses additional features to predict the appropriate trading size. Meta-labeling helps to refine trading signals, filtering false positives and improving the profitability of trading strategies.
- Two different position sizing methods, all-or-nothing and ECDF, are compared. Both classification metrics for the machine learning models (precision and recall) and risk-return metrics for the strategy (Sharpe ratio) are used to evaluate the effectiveness of the LSTM and of meta-labeling.
- LSTM and Meta-labeling
- Table of Contents
- 0. Thoughts after Five Months
- 1. Meta-labeling
- 2. LSTM
- 3. Implementation
- 4. Results
- 5. Main Reference
Although this is a relatively complete study and the implementation involves a fair amount of technical detail, its current practical value is quite limited and there is a lot of room for improvement. The deficiencies in each area are listed below.
LSTM
- The primary model relies almost exclusively on readily available technical indicators from TA-Lib. Indicators derived solely from price and volume data carry limited information; at the very least, feature selection should be performed.
- Data from different dates are treated as one continuous, uninterrupted time series, ignoring market closures. The simplest fix would be to add a feature indicating the time elapsed since the market open.
- The label predicted by the primary model is a binary up/down move over the next minute. Such labelling is coarse and hard for the model to learn. Moreover, data points with large absolute returns and data points with near-zero absolute returns are given the same importance; they should be treated differently, for example by weighting samples (see the sketch after this list).
- In expanding window cross-validation, the accuracy of the LSTM decreases as the training set grows, indicating the need to retrain the model on up-to-date data more frequently.
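A minimal sketch of the suggested sample-weighting fix; the function and variable names are illustrative and not part of the repository:

```python
import numpy as np
import pandas as pd

def return_based_sample_weights(next_returns: pd.Series) -> np.ndarray:
    """Weight each training sample by the magnitude of its next-minute return,
    normalised to an average weight of one, so near-zero moves matter less."""
    abs_ret = next_returns.abs()
    return (abs_ret / abs_ret.mean()).to_numpy()

# Keras-style usage (hypothetical names):
# model.fit(X_train, y_train, sample_weight=return_based_sample_weights(ret_train))
```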
Meta-labeling
- Meta-labeling is meant to address the trade-off between trading precision and recall. But as the threshold changes, precision and recall do not move in a synchronised way: recall decreases slowly, while precision remains almost constant until it jumps once a certain value is exceeded. As a result, the benefits of meta-labelling are not effectively tested. This is also partly due to the poor performance of the LSTM in the primary model.
- The secondary model reuses the original features of the primary model in the hope of extracting additional information. For that purpose, a heterogeneous secondary model (a non-LSTM model) would be more appropriate.
- An additional comparison should be carried out to show that the improved metrics are actually attributable to meta-labeling rather than to the introduction of additional features in the secondary model. This would be the crucial evidence for the effect of meta-labeling; it was not included due to the word limit of my dissertation.
Trading Strategy
- Information leakage during preprocessing: although min-max scaling is applied after splitting the training and testing sets, future information that would not have been available at the time is still used. Solution: perform min-max scaling with a rolling minimum and maximum (see the sketch after this list).
- The backtest is a rather simple vectorised backtest in which transaction costs, including commissions and bid-ask spreads, are ignored. At such a high trading frequency, the final risk-return metrics are bound to be distorted.
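A sketch of the proposed rolling min-max scaling; the window length (e.g. one trading day of 390 minutes) is an assumption:

```python
import pandas as pd

def rolling_min_max_scale(s: pd.Series, window: int = 390) -> pd.Series:
    """Scale each value using only the minimum and maximum of the preceding
    `window` observations, so no future information leaks into the transform."""
    roll_min = s.rolling(window, min_periods=window).min()
    roll_max = s.rolling(window, min_periods=window).max()
    return ((s - roll_min) / (roll_max - roll_min)).clip(0.0, 1.0)

# Example with a hypothetical DataFrame `df`:
# df["close_scaled"] = rolling_min_max_scale(df["close"])
```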
Code
- Hard to read, with confusing variable names.
- Jupyter notebooks are used exclusively, resulting in unnecessary code duplication and coupling between different parts.
Meta-labeling is proposed by Lopez de Prado in Advances in Financial Machine Learning. It addresses the challenge faced by practitioners who possess a model for predicting the direction of an investment (long or short) but need a separate model to determine the position size (the amount of money to invest in that bet).
Meta-labeling involves applying a secondary machine learning model on top of a primary classifier to assess the quality or reliability of the primary model. By combining the secondary and primary models, it is possible to filter out false positives, which in turn improves the strategy’s risk-return metrics.
Trading Confusion Matrix
To evaluate the effectiveness of the classification model, precision, recall, and F1-score serve as common performance metrics:
- Precision measures the proportion of correctly predicted positive instances (profitable trades) out of all instances predicted as positive, indicating the model’s ability to avoid false positives (losing trades).
- Recall measures the proportion of correctly predicted positive instances (profitable trades) out of all actual positive instances (potential opportunities), reflecting the model’s ability to avoid false negatives and capture profitable trades.
- F1 score is a combination of both.
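In terms of the trading confusion matrix, where a true positive (TP) is a predicted trade that turns out profitable, a false positive (FP) is a predicted trade that loses, and a false negative (FN) is a missed profitable trade, these metrics are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
$$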
How can the model’s performance metrics, such as precision and recall, be related to the risk-return metrics of the strategy, e.g., the Sharpe ratio?
Consider the following example, derived from Advances in Financial Machine Learning. Suppose a strategy generates a number of independent bets per year, each of which is profitable with probability equal to the model's precision.
In the case of symmetric profits and losses, the risk-adjusted performance of the strategy is positively correlated with precision and the number of investment opportunities (recall). These conclusions hold true for asymmetric payouts as well.
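One way to make this relationship concrete, following the symmetric-payout derivation in the book (the original numerical example is not reproduced here): suppose the strategy places $n$ independent bets per year, each returning $+\pi$ with probability $p$ (the precision) and $-\pi$ otherwise. The annualized Sharpe ratio is then

$$
\theta(p, n) = \frac{2p - 1}{2\sqrt{p(1-p)}}\,\sqrt{n},
$$

which grows with both precision $p$ and the number of bets $n$; recall drives $n$, since a higher recall means more opportunities are acted upon.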
Conclusion: to achieve a high-Sharpe strategy, the classifier that decides whether to trade needs both high recall and high precision.
However, it can be challenging to train a classifier that achieves both high recall and high precision due to the inherent trade-off between these two metrics. Increasing recall involves capturing a broader range of positive instances, but it can lead to more false positives and lower precision. Conversely, maximizing precision requires being more cautious in classifying instances, potentially resulting in missed positive instances and lower recall.
One motivation for employing meta-labeling is to manage this trade-off between recall and precision more effectively and thereby achieve better risk-return performance. The appropriate balance between these metrics depends on the specific characteristics of the investment strategy. For example, in high-frequency trading, precision may be prioritized due to the limited profit per trade and the abundance of trading opportunities. Conversely, in medium to long-term momentum strategies, recall may be more crucial given the potential profitability of individual trades.
Furthermore, meta-labeling offers additional motivations stemming from its architecture, which involves separating side prediction from size prediction. This separation brings several benefits, including improved interpretability, prevention of overfitting, and enhanced flexibility.
The meta-labeling framework consists of a primary model and a secondary model. The primary model is responsible for predicting the side of a profitable trade, indicating whether to buy, sell, or close the position. In the following figure, the upper part illustrates the three possible labels: -1 for sell, 1 for buy, and 0 for no trade. This primary model represents a generic and commonly used approach for side prediction in trading.
The primary model can take various forms, encompassing discretionary or quantitative strategies, rule-based or model-based approaches, and traditional methods or machine learning algorithms.
When the primary model is a machine learning model, there are two additional benefits, as depicted in the lower part of the figure. Many machine learning classifiers output predicted probabilities rather than hard labels, which makes both benefits possible:
- The first benefit is that the trade-off between recall and precision can be adjusted through the threshold values. If the two thresholds $\tau_1$ and $\tau_2$ coincide, so that every signal is either buy or sell, recall is 1 because no trades are missed, but precision is typically low. Widening the gap between the two thresholds reduces the frequency of trades, leading to lower recall but potentially higher precision.
- The second benefit is the ability to use the predicted probabilities to compute evaluation metrics for the primary model, such as rolling cross-entropy or rolling accuracy. These metrics can serve as inputs for the secondary model, allowing the primary model's performance to be monitored over time.
The secondary model is a key component of the meta-labeling framework. It receives more input data than the primary model and produces a meta-label indicating whether the output of the primary model is correct. Specifically, the secondary model assesses whether the predictions made by the primary model will result in a profit or a loss. This assessment allows for adjustments to the size of the actual position, so it is said that the secondary model predicts the size of positions.
The initial output of the secondary model is a value between 0 and 1, representing the confidence level of the primary model. Various position sizing methods can be applied based on this output:
- For instance, an all-or-nothing strategy invests a full unit of position when the secondary model’s probability of a positive outcome exceeds a certain threshold, while not investing at all when it falls below the threshold.
- A more sophisticated position sizing method allocates more capital to high-confidence trading signals. The higher the probability of true positive (profitable trade), the greater the investment amount. Approaches such as using the empirical cumulative distribution function can be employed for this purpose.
The secondary model takes four types of data as inputs:
- Firstly, it receives the output of the primary model.
- Secondly, it incorporates the original features used by the primary model. As the secondary model may employ a different model type, it has the potential to uncover distinct information, providing an informational advantage.
- The third input is evaluation data, which assesses the recent performance of the primary model. This evaluation helps determine if the primary model should be deactivated.
- Lastly, additional features are utilized to assess the suitability of the primary model for the current market conditions. These additional features can include macro indicators (e.g., inflation rate, yield curve slope) that relate to the market environment. Other options encompass price moments (e.g., skewness, kurtosis), which reflect shifts in the statistical distribution of an asset. Such shifts may render the primary model inappropriate for the current market.
In summary: the secondary model uses more features and a different model, which helps to screen out false positives. This yields an informational advantage, and overfitting is mitigated by decoupling the prediction of side and size. Additionally, multiple position sizing methods can be implemented to suit specific investment needs. Finally, more complex frameworks can be applied, such as having separate secondary models for buy and sell signals.
- It allows machine learning models to be built on top of white-box models rooted in economic theory, addressing concerns about interpretability.
- By separating side prediction from size prediction, overfitting issues are alleviated, as the primary and secondary models can use appropriate features independently.
- The decoupling of side prediction from size prediction offers greater flexibility. Separate secondary models can be built exclusively for long and short positions, enabling better adaptation to market conditions and the construction of more suitable models.
- Meta-labeling improves risk-return metrics by focusing on getting the most important predictions correct, preventing low returns from high accuracy on small bets and low accuracy on large bets.
- Forget Gate
The forget gate determines the degree to which the previous memory cell state ($c_{t-1}$) is retained in the current cell state.
- Input Gate
The input gate manages the incorporation of new information into the memory cell. It utilizes the current input ($x_t$) and the previous hidden state ($h_{t-1}$) to decide how much of the candidate memory is written to the cell.
- Output Gate
The output gate regulates the final output from the memory cell. It takes the current input ($x_t$) and the previous hidden state ($h_{t-1}$) to decide how much of the cell state is exposed as the hidden state ($h_t$).
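For reference, the standard LSTM gate equations are ($\sigma$ is the logistic sigmoid, $\odot$ element-wise multiplication, $x_t$ the current input, $h_{t-1}$ the previous hidden state, and $c_t$ the memory cell state):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$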
The selected dataset consists of price and volume data for Apple’s stock at a minute-by-minute granularity. The data spans from October 2013 to December 2019, comprising a total of 610,980 entries.
Category | Feature | Description |
---|---|---|
Price/Volume | Open | Opening price in the one-minute time frame |
Price/Volume | High | Highest price in the one-minute time frame |
Price/Volume | Low | Lowest price in the one-minute time frame |
Price/Volume | Close | Closing price in the one-minute time frame |
Price/Volume | Volume | Total trading volume in the one-minute time frame |
Price/Volume | Return | Percentage change of price in the one-minute time frame |
Overlap Studies | EMA(5 Min) | Exponential moving average of close price (5 minutes) |
Overlap Studies | EMA(10 Min) | Exponential moving average of close price (10 minutes) |
Overlap Studies | EMA(30 Min) | Exponential moving average of close price (30 minutes) |
Overlap Studies | SMA(5 Min) | Simple moving average of close price (5 minutes) |
Overlap Studies | SMA(10 Min) | Simple moving average of close price (10 minutes) |
Overlap Studies | SMA(30 Min) | Simple moving average of close price (30 minutes) |
Momentum Indicators | CCI | Commodity Channel Index |
Momentum Indicators | MACD | Moving Average Convergence/Divergence |
Momentum Indicators | STOCHRSI | Stochastic Relative Strength Index |
Momentum Indicators | WILLR | Williams' %R |
Volume Indicators | AD | Chaikin Accumulation/Distribution Line |
Volume Indicators | ADOSC | Chaikin Accumulation/Distribution Oscillator |
Volatility Indicators | ATR | Average True Range |
Cycle Indicators | HT_INPHASE | Inphase phasor components of Hilbert transform |
Cycle Indicators | HT_QUADRATURE | Quadrature phasor components of Hilbert transform |
Statistical Function | VAR(30 Min) | Variance of the closing price (30 minutes) |
TA-lib is used for calculating technical indicators.
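A minimal sketch of how this feature set can be derived with TA-Lib's function API; the DataFrame layout and column names are assumptions, and unspecified indicator parameters fall back to TA-Lib defaults rather than the dissertation's exact settings:

```python
import pandas as pd
import talib

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Append the indicators from the table above to a minute-bar DataFrame
    with columns open, high, low, close, volume."""
    o = df.copy()
    high, low, close, volume = (
        o[c].to_numpy(dtype=float) for c in ("high", "low", "close", "volume")
    )

    o["return"] = o["close"].pct_change()
    for w in (5, 10, 30):                                         # overlap studies
        o[f"ema_{w}"] = talib.EMA(close, timeperiod=w)
        o[f"sma_{w}"] = talib.SMA(close, timeperiod=w)

    o["cci"] = talib.CCI(high, low, close)                        # momentum
    o["macd"], _, _ = talib.MACD(close)
    o["stochrsi_k"], o["stochrsi_d"] = talib.STOCHRSI(close)
    o["willr"] = talib.WILLR(high, low, close)

    o["ad"] = talib.AD(high, low, close, volume)                  # volume
    o["adosc"] = talib.ADOSC(high, low, close, volume)
    o["atr"] = talib.ATR(high, low, close)                        # volatility
    o["ht_inphase"], o["ht_quadrature"] = talib.HT_PHASOR(close)  # cycle
    o["var_30"] = talib.VAR(close, timeperiod=30)                 # statistics
    return o
```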
As mentioned earlier, the secondary model encompasses a more extensive range of inputs compared to the primary model. It incorporates four distinct types of features, namely the primary model’s original inputs and outputs, evaluation data, and additional features.
1. Incorporating Primary Model’s Prediction
The primary model’s prediction of a stock’s upward or downward movement. This is a natural inclusion, as the secondary model’s purpose is to assess the credibility of raw trading signals generated by the primary model.
2. Utilizing Primary Model’s Original Inputs
The primary model’s original inputs, specifically the technical indicators. The rationale behind this is that the secondary model may be able to extract further relevant information from these original features.
3. Leveraging Evaluation Data
The third type of features utilized by the secondary model consists of evaluation data, specifically three key performance metrics calculated from recent data: accuracy, precision, and cross-entropy, derived from the primary model’s performance over the last thirty minutes. The primary model generates stock direction forecasts and raw trading signals at each minute; the rolling metrics computed from its recent predictions and the realised outcomes are then passed to the secondary model as evaluation features.
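A sketch of how these rolling evaluation features can be computed; the 30-minute window follows the text, while the variable names and the one-bar shift to avoid look-ahead are assumptions:

```python
import numpy as np
import pandas as pd

def rolling_evaluation_features(y_true: pd.Series, p_up: pd.Series,
                                window: int = 30) -> pd.DataFrame:
    """Rolling accuracy, precision, and cross-entropy of the primary model.
    y_true: realised direction (1 up, 0 down); p_up: predicted probability of an up move."""
    y_pred = (p_up > 0.5).astype(int)
    correct = (y_pred == y_true).astype(float)

    eps = 1e-7
    p = p_up.clip(eps, 1 - eps)
    cross_entropy = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    tp = ((y_pred == 1) & (y_true == 1)).astype(float)
    pred_pos = (y_pred == 1).astype(float)

    feats = pd.DataFrame({
        "roll_accuracy": correct.rolling(window).mean(),
        "roll_precision": tp.rolling(window).sum() / pred_pos.rolling(window).sum(),
        "roll_cross_entropy": cross_entropy.rolling(window).mean(),
    })
    # Shift by one bar so the features at time t only use information up to t-1.
    return feats.shift(1)
```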
4. Additional Features for Secondary Model
The fourth type of additional features used by the secondary model aims to reflect the overall market conditions and the nature of stock characteristics. This allows us to observe the adaptability of the underlying strategy to changing market dynamics.
For the market conditions, S&P 500’s minute closing prices and returns are included. Significant changes in these values may indicate shifts in the overall market sentiment, which could impact the performance of the primary model.
For the distributional characteristics of the stock prices, the skewness and kurtosis of the stock traded are included. These statistical measures reflect the statistical distribution of the stock prices. Drastic changes in these moments may indicate a failure of the primary strategy to adapt to the changing dynamics of the traded stock.
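A sketch of these additional features; the 30-minute rolling window and the use of closing prices (rather than returns) for the moments are assumptions:

```python
import pandas as pd

def additional_features(stock_close: pd.Series, spx_close: pd.Series,
                        window: int = 30) -> pd.DataFrame:
    """Market-environment and distribution features: S&P 500 level and return,
    plus rolling skewness and kurtosis of the traded stock's closing price."""
    return pd.DataFrame({
        "spx_close": spx_close,
        "spx_return": spx_close.pct_change(),
        "roll_skew": stock_close.rolling(window).skew(),
        "roll_kurt": stock_close.rolling(window).kurt(),
    })
```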
To avoid information leakage, we apply min-max scaling only after partitioning the data into training and test sets. This maintains the integrity of the data by ensuring that future information doesn't influence past data points.
Labels
We apply an LSTM to both the primary and secondary models. In meta-labeling, the two models have separate target variables.
The primary model predicts the stock's movement (up or down) in the next minute. These predictions form the raw trading signals, which are then transformed into buy, sell, or hold signals using a threshold mechanism discussed in a later section.
The secondary model predicts whether the primary model's signals align with subsequent market behaviour, i.e. it makes a binary prediction of whether a raw signal will be profitable. A buy signal followed by a rising close, or a sell signal followed by a falling close, counts as a profit. The class predicted by the secondary model is the meta-label.
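A sketch of the two target variables under these definitions; the column and function names are illustrative:

```python
import numpy as np
import pandas as pd

def make_labels(close: pd.Series, raw_signal: pd.Series) -> pd.DataFrame:
    """Primary label: 1 if the close rises over the next minute, else 0.
    Meta-label: 1 if the raw signal (+1 buy, -1 sell, 0 no trade) points the
    same way as the realised next-minute move, else 0."""
    next_ret = close.shift(-1) / close - 1.0
    primary_label = (next_ret > 0).astype(int)
    profitable = np.sign(next_ret) == raw_signal
    meta_label = (profitable & (raw_signal != 0)).astype(int)
    # Rows without a trade signal are typically dropped before training the secondary model.
    return pd.DataFrame({"primary_label": primary_label, "meta_label": meta_label})
```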
Stacked LSTM
For minute-level stock prediction, we compare single-layer LSTM with two-layer LSTM architectures. Generally, the two-layer LSTM performs better across various hyperparameter combinations.
Hence, we employ the two-layer LSTM for both primary and secondary models. It strikes a balance between efficiency and performance for our objectives. Note that a more rigorous approach using cross-validation could provide a stronger architecture assessment, but our resource limitations prevent its use here.
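A minimal sketch of such a two-layer (stacked) LSTM classifier in Keras; the library choice, layer sizes, and optimiser settings here are illustrative rather than the dissertation's exact configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_stacked_lstm(n_timesteps: int, n_features: int) -> keras.Model:
    """Two stacked LSTM layers followed by a sigmoid output for binary classification."""
    model = keras.Sequential([
        keras.Input(shape=(n_timesteps, n_features)),
        layers.LSTM(64, return_sequences=True),  # first layer passes the full sequence on
        layers.LSTM(32),                         # second layer returns only the final hidden state
        layers.Dense(1, activation="sigmoid"),   # predicted probability (up move / correct signal)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```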
Unlike traditional k-fold cross-validation, time series data requires preserving chronological order. Expanding window cross-validation is therefore adopted, progressively widening the training window over time: as illustrated in the figure, each split trains on all data up to a point and validates on the data points that follow. This approach captures temporal dependencies effectively.
After training the primary model, we obtain its predictions on the test set of each split. These four test sets also follow chronological order. The secondary model joins the data from the four test sets, performs expanding window cross-validation again, and re-partitions it into smaller training and test sets. The final backtest is performed on the test portion of this second-level split.
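A sketch of expanding window cross-validation using scikit-learn's `TimeSeriesSplit`, which grows the training window at each step; the four splits match the description above, and the array names are assumptions:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def expanding_window_splits(X: np.ndarray, y: np.ndarray, n_splits: int = 4):
    """Yield (fold, train, test) tuples; the training window always starts at the
    first observation and grows each fold, while the test block follows it in time."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        yield fold, (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```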
If the predicted probability of a stock going up exceeds a certain threshold, a signal to open a long position is triggered. If the probability falls below another threshold, a signal to open a short position is triggered. Otherwise, a signal to close the position is activated.
Formally, the threshold mechanism maps the predicted probability to a long, short, or flat signal via an upper and a lower threshold (a code sketch is given below).
By adjusting the gap between these thresholds, denoted $\tau$, the trade-off between trading recall and precision can be controlled: a larger $\tau$ yields fewer but more selective trades.
There is no universally optimal value for $\tau$; the appropriate choice depends on the characteristics of the strategy, so the experiments report results for a range of values.
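A sketch of the threshold mechanism; the concrete threshold values used here are illustrative, not the dissertation's:

```python
import pandas as pd

def probability_to_signal(p_up: pd.Series, tau_long: float = 0.55,
                          tau_short: float = 0.45) -> pd.Series:
    """Map the primary model's predicted probability of an upward move to a raw
    trading signal: +1 long, -1 short, 0 no position. Widening the gap between
    the two thresholds lowers recall but can raise precision."""
    signal = pd.Series(0, index=p_up.index)
    signal[p_up > tau_long] = 1
    signal[p_up < tau_short] = -1
    return signal
```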
The final trading unit is determined by comparing the credibility output of the secondary model against a fixed threshold.
The trading unit is set to one full unit when the secondary model's probability exceeds the threshold, and to zero otherwise (the all-or-nothing rule).
The second method utilizes an empirical cumulative distribution function (ECDF) to determine the final trading unit.
The output of the secondary model (a probability between 0 and 1) is mapped through an empirical cumulative distribution function of past model outputs to a position size between 0 and 1, so that higher-confidence signals receive larger positions.
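A sketch of both position-sizing rules; the 0.5 cut-off and fitting the ECDF on training-set outputs are assumptions:

```python
import numpy as np
import pandas as pd

def all_or_nothing(p_meta: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Invest one full unit when the secondary model's confidence exceeds the
    threshold, and nothing otherwise."""
    return (p_meta > threshold).astype(float)

def ecdf_size(p_meta: pd.Series, p_meta_train: pd.Series) -> pd.Series:
    """Map confidence through the empirical CDF of training-set confidences, so
    that more confident signals receive proportionally larger positions."""
    sorted_train = np.sort(p_meta_train.to_numpy())
    ranks = np.searchsorted(sorted_train, p_meta.to_numpy(), side="right")
    return pd.Series(ranks / len(sorted_train), index=p_meta.index)

# The final position combines side and size, e.g.:
# position = raw_signal * ecdf_size(p_meta_test, p_meta_train)
```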
In the expanding window cross-validation, the primary model achieved accuracies of 55.09%, 54.26%, 53.74%, 52.22%, and 51.85% as the length of the training set increased.
This is not very high, but given that the data is at minute frequency, every percentage point of edge provides many trading opportunities. It demonstrates the LSTM's ability to mine information from technical indicators derived purely from price and volume market data. Additionally, the gradual decrease in accuracy may indicate the need to retrain the model on up-to-date data (online learning).
To assess the performance of the primary model at different trading thresholds ($\tau$), Experiment 1 backtests the strategy generated directly from the raw trading signals, without the secondary model.
In Experiment 2, we explore the impact of meta-labeling using the all-or-nothing trading method. Here, the secondary model filters out false positives from the raw trading signals, and the predicted credibility from the secondary model is employed to generate the final traded signals. We report the results in three aspects: the LSTM model's performance metrics, metrics related to the trading confusion matrix, and strategy-related metrics.
As shown in the figure, the accuracy of the secondary LSTM model varies between 53.30% and 87.80% as $\tau$ changes.
Firstly, as $\tau$ increases, fewer raw trading signals are generated, so the secondary model is trained and evaluated on fewer, more selective samples.
Secondly, in the presence of a larger $\tau$, the raw signals that remain are those about which the primary model is more confident, which may explain the higher accuracies at the upper end of the range.
To evaluate the performance, we refer to the trading confusion matrix. It is important to note that the terms "precision" and "recall" in this context differ from their usage in the previous subsection.
In the preceding subsection, the binary classification task involved determining whether the LSTM component of the secondary model correctly predicted adjustments for the raw trading signal. However, in this subsection, the binary classification task is focused on determining whether the combined trading signal from both the primary and secondary models will lead to profitable trades. To maintain clarity, we shall now refer to these metrics as "trading precision" and "trading recall" since they directly relate to the trading strategy.
Analyzing the change of recall, we observe that when the secondary model rejects certain raw trading signals, the trading recall naturally decreases, resulting in missed opportunities. On the other hand, the lower figure shows a significant increase in trading precision. Even with the simplicity of the all-or-nothing logic, the final strategy's probability of profit per trade improves. While meta-labelling does appear to manage the trade-off between precision and recall, its precise impact on risk-return metrics requires further investigation in subsequent analyses.
Analyzing the NAV curves, we observe that the impact of the all-or-nothing method of meta-labeling varies across thresholds ($\tau$).
In conclusion, meta-labeling generally improves risk-return metrics by filtering out false positive investment signals and managing the precision-recall trade-off. Nonetheless, it's important to note that in specific scenarios, the all-or-nothing method of meta-labeling may lead to suboptimal investments. Furthermore, this approach results in significant variations in investment styles across different thresholds, which can pose challenges in selecting appropriate investment styles.
Comparing the NAV curves with those in Experiment 1, we observe that the overall trend remains largely consistent with the strategy generated by raw trading signals, with no significant changes in style. Analyzing the various strategy metrics and comparing them with those from Experiment 1, we find an improvement in the Sharpe ratio and maximum drawdown for all values of $\tau$.
In conclusion, meta-labeling with the ECDF method enhances the strategy metrics without introducing drastic changes to the trading style. The improvements achieved with the ECDF method are more subtle and exhibit greater stability compared to the all-or-nothing method used in Experiment 2.
- Lopez de Prado, Marcos (2018). *Advances in Financial Machine Learning*.
- Joubert, Jacques Francois (2022). “Meta-Labeling: Theory and Framework”.
- Meyer, Michael, Barziy, Illya, and Joubert, Jacques Francois (2023). “Meta-Labeling: Calibration and Position Sizing”.
- Meyer, Michael, Joubert, Jacques Francois, and Alfeus, Mesias (2022). “Meta-Labeling Architecture”.
- Thumm, Dennis, Barucca, Paolo, and Joubert, Jacques Francois (2022). “Ensemble Meta-Labeling”.