Performance Validation

Performance validation of Synnax competitions.

One of the benefits of Synnax's unique crowdsourced, decentralized consensus mechanism is its ability to generate more accurate financial status predictions than any single model can achieve. This is evidenced by the results of our ongoing Kaggle competition.

Individual Model Submissions

| Model Number | Individual R-squared Score |
| --- | --- |
| sub_1 | 0.8358 |
| sub_2 | 0.7724 |
| sub_3 | 0.7892 |
| sub_4 | 0.8234 |
| sub_5 | 0.8114 |
| sub_6 | 0.6459 |

Ensemble Result

| Average Ensemble | Ensemble R-squared Score |
| --- | --- |
| mean(sub_1, sub_2, ... sub_6) | 0.8615 |

Compared to the individual submissions, the ensemble method produces a score roughly 3% higher than the highest-scoring submission (sub_1) and roughly 33% higher than the lowest-scoring submission (sub_6). Below we describe the methodology in more detail.

Wisdom of the Crowd

Data scientists (DS) develop their own approaches to data preprocessing, feature engineering, and model selection. As such, no two approaches to predicting financial components will be identical. Some DS may use linear models with specific data preparation and scaling, others may opt for tree-based algorithms or gradient boosting models, and still others may build neural networks. Each approach will find different dependencies in the data and predict based on a different set of assumptions (model weights).

Average Ensemble

The best models are selected based on their validation scores, and their predictions are averaged. This way, two models with individual accuracy scores of 0.75 and 0.77, respectively, may complement each other and produce a higher joint accuracy (e.g., 0.82). Dozens of models will make an even more robust prediction.
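A minimal sketch of the idea, using synthetic data and an illustrative pool of scikit-learn models rather than the actual competition submissions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the competition data (illustrative only).
X, y = make_regression(n_samples=1000, n_features=20, noise=25.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# A small pool of diverse models, each trained independently.
models = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=200, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

preds = []
for model in models:
    model.fit(X_train, y_train)
    p = model.predict(X_val)
    preds.append(p)
    print(type(model).__name__, "R^2:", round(r2_score(y_val, p), 4))

# Simple average ensemble: every member contributes equally.
ensemble_pred = np.mean(preds, axis=0)
print("Average ensemble R^2:", round(r2_score(y_val, ensemble_pred), 4))
```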

A limitation of this approach is that each model contributes equally to the final prediction made by the ensemble. The approach also requires that every ensemble member perform better than random chance, even though some models are known to perform much better or much worse than others.

Weighted Average Ensemble

A weighted ensemble is an extension of a model-averaging ensemble in which the contribution of each member to the final prediction is weighted by the performance of that model. The model weights are small positive values whose sum equals one, allowing each weight to indicate the percentage of trust in, or expected performance from, its model.

There is no analytical solution for finding the weights (we cannot calculate them precisely); instead, they can be estimated using the holdout validation dataset. Weights can be assigned by ranking the individual models and giving the highest weight to the highest-scoring model. Alternatively, a more exhaustive approach is to grid-search the weight values: define a coarse grid of weights from 0.0 to 1.0 in steps of 0.1, then generate all possible vectors of those values. Generating all possible combinations is called a Cartesian product, which can be implemented in Python using the itertools.product() function from the standard library.
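A minimal sketch of that coarse grid, assuming a three-member ensemble for illustration:

```python
from itertools import product

n_members = 3                                   # assumed ensemble size, for illustration
grid = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0

# Cartesian product: every possible weight vector over the coarse grid.
# The all-zero vector is dropped because it cannot be normalized later.
weight_vectors = [w for w in product(grid, repeat=n_members) if sum(w) > 0]

print(len(weight_vectors))   # 11**3 - 1 = 1330 candidate vectors
```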

Each weight vector generated by the Cartesian product is then enumerated, normalized, and evaluated by scoring a prediction; the best vector is kept for the final weighted-average ensemble. A weighted average prediction involves first assigning a fixed weight coefficient to each ensemble member. This could be a real value between 0 and 1, representing a percentage of the weight, or an integer starting at 1, representing the number of votes to give each model. One can draw a parallel between this process and neural network training: weights evolve to minimize the loss.
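A sketch of that search loop, assuming hypothetical arrays `val_preds` (one column of holdout predictions per ensemble member) and `y_val` (the holdout targets):

```python
import numpy as np
from itertools import product
from sklearn.metrics import r2_score

def normalize(weights):
    """Scale a weight vector so it sums to one."""
    total = sum(weights)
    return [w / total for w in weights]

def grid_search_weights(val_preds, y_val, step=0.1):
    """Enumerate coarse weight vectors, score each normalized vector on the
    holdout set, and return the best one. `val_preds` has shape
    (n_samples, n_members)."""
    n_members = val_preds.shape[1]
    grid = [round(step * i, 1) for i in range(11)]
    best_weights, best_score = None, -np.inf
    for candidate in product(grid, repeat=n_members):
        if sum(candidate) == 0:
            continue  # the all-zero vector cannot be normalized
        weights = normalize(candidate)
        # Weighted average of the member predictions.
        y_hat = np.average(val_preds, axis=1, weights=weights)
        score = r2_score(y_val, y_hat)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

# Hypothetical usage:
# best_weights, best_score = grid_search_weights(val_preds, y_val)
```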

For example, we may have fixed weights of 0.84, 0.87, and 0.75 for the three ensemble members. These weights are used to calculate the weighted average by multiplying each prediction by its model's weight to give a weighted sum, then dividing that sum by the sum of the weights. For example:

$$
\begin{align*}
\hat{y} &= \frac{(97.2 \times 0.84) + (100.0 \times 0.87) + (95.8 \times 0.75)}{0.84 + 0.87 + 0.75} \\
&= \frac{81.648 + 87 + 71.85}{0.84 + 0.87 + 0.75} \\
&= \frac{240.498}{2.46} \\
&= 97.763
\end{align*}
$$
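The same arithmetic can be reproduced with numpy's built-in weighted average, which divides the weighted sum by the sum of the weights:

```python
import numpy as np

predictions = [97.2, 100.0, 95.8]   # one prediction per ensemble member
weights = [0.84, 0.87, 0.75]        # fixed weights from the example above

# np.average computes the weighted sum and divides by the sum of the weights.
y_hat = np.average(predictions, weights=weights)
print(round(y_hat, 3))   # 97.763
```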

The same approach can be used to calculate the weighted sum of votes for each crisp class label, or the weighted sum of probabilities for each class label, in a classification problem. Weighted average ensembles can be implemented manually, although this is not required: the voting ensembles in the scikit-learn library achieve the same effect.
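For a regression task like the one above, a brief sketch using scikit-learn's VotingRegressor, which accepts per-member weights, might look as follows (the models and weights are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, illustrative only.
X, y = make_regression(n_samples=1000, n_features=20, noise=25.0, random_state=0)

# Weighted average ensemble: each member's prediction is weighted before averaging.
ensemble = VotingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("forest", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("boosting", GradientBoostingRegressor(random_state=0)),
    ],
    weights=[0.84, 0.87, 0.75],   # e.g. validation scores reused as weights
)

print(cross_val_score(ensemble, X, y, scoring="r2").mean())
```

For classification, VotingClassifier plays the same role: voting="hard" weights the crisp class votes, while voting="soft" weights the predicted class probabilities.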

Unless the holdout validation dataset is large and representative, a weighted ensemble has a higher tendency to overfit than a simple averaging ensemble. The volume of available validation data therefore determines whether the simple averaging ensemble or its weighted variant should be used in each specific case. Either way, ensembling addresses the problem of biased credit ratings derived from the subjective evaluation of a single technique or a single analyst and, provided the models are reasonably independent, improves the quality of the produced predictions.
