# Datasets

{% hint style="info" %}
We prepared a few boilerplate code examples to get you up and running with your first submission and to give you ideas on how to improve your score:

[basic example](https://www.kaggle.com/code/danilz/datathon-2-boilerplate) – make your first submission incorporating macroeconomic features;

[advanced example 1](https://www.kaggle.com/code/danilz/datathon-2-advanced-hyperparameter-tuning) – hyper-parameter tuning for a higher score;

[advanced example 2](https://www.kaggle.com/code/danilz/datathon-2-advanced-nan-impute-mean-target-enc) – NaN imputation with gradient boosting and mean-target-encoding of categorical variables.

Feel free to use them as a starting point and tinker with them to get better results!
{% endhint %}

### Train Dataset

The train dataset contains financial data points for 13,180 publicly traded companies, based on their quarterly and annual financial reports. The dataset is compiled from the 5 latest quarterly reports and the 4 latest annual reports, and reflects financial components extracted from the corresponding balance sheets and income statements.

### Test Dataset

The test dataset contains 3,296 companies with their corresponding features in the same format as the train dataset.

### Macro Dataset

Two separate files (one to match each of the train and test datasets) are provided, featuring 1,554 columns of macroeconomic indicators collected relative to the dates of each company's quarterly reports. For example, the feature `'Federal Government Current Expenditures_Q_0_min_180_days'` reflects the minimum value of Federal Government Current Expenditures over the 180 days prior to Q\_0. Column names reflect the actual indicators (they are not obfuscated), and the data in the columns is normalized. `company_id` is the key for matching rows in `macro_train.csv` and `macro_test.csv` with `X_train.csv` and `X_test.csv`, respectively. Try different indicators to enrich your training and testing data.
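As a minimal sketch of the `company_id` join described above, the snippet below merges a subset of macro columns into the training features with pandas. The two small data frames are toy stand-ins for `X_train.csv` and `macro_train.csv`, and the column `Q_1_Revenue` is a hypothetical name used only for illustration; check **data\_dictionary.txt** for the real ones.

```python
import pandas as pd

# Toy stand-ins for X_train.csv and macro_train.csv (the real files are much wider).
X_train = pd.DataFrame({
    "company_id": [101, 102, 103],
    "Q_1_Revenue": [1.0, 2.0, 3.0],  # hypothetical feature name
})
macro_train = pd.DataFrame({
    "company_id": [103, 101, 102],  # row order need not match X_train
    "Federal Government Current Expenditures_Q_0_min_180_days": [0.3, 0.1, 0.2],
})

# Select only the macro indicators you want to experiment with.
macro_cols = [c for c in macro_train.columns if c.startswith("Federal Government")]

# A left join keeps every company in X_train; validate raises if a key is duplicated.
X_enriched = X_train.merge(
    macro_train[["company_id"] + macro_cols],
    on="company_id",
    how="left",
    validate="one_to_one",
)
print(X_enriched)
```

The same pattern applies to `X_test.csv` and `macro_test.csv`. Selecting a subset of indicators before merging keeps the feature matrix manageable, since the macro files are wide.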

### The feature ordering in train/test data is as follows:

Columns starting with Q\_($$n$$) (where $$n$$ is the number of the quarter) contain the companies' quarterly reported financial components.

Columns starting with Y\_($$n$$) (where $$n$$ is the number of the annual report) contain the companies' annually reported financial components.

Other columns contain metadata for each company.

Columns starting with:

* `Q_0` are only present in **targets\_train.csv** and contain financial components for the latest (closest to today) quarter
* `Q_1` are part of **X\_train.csv** and **X\_test.csv** and contain financial components of the quarter immediately before `Q_0`
* `Q_4` are part of **X\_train.csv** and **X\_test.csv** and contain financial components of the furthest reported quarter, 4 quarters before `Q_0`
* `Y_0` are part of **X\_train.csv** and **X\_test.csv** and contain financial components from the latest annual report
* `Y_3` are part of **X\_train.csv** and **X\_test.csv** and contain financial components from the furthest annual report, 3 years before `Y_0`

Each quarter and each year (except for `Q_0`) contains 143 financial components; please refer to **data\_dictionary.txt** for details.
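The prefix convention above makes it straightforward to group columns by reporting period. The helper below is a rough sketch that buckets column names by their leading `Q_n` or `Y_n` label and treats everything else as metadata. The function name `group_columns_by_period` and the example column names are hypothetical; the exact text that follows each prefix in the real files is an assumption here.

```python
import re
from collections import defaultdict


def group_columns_by_period(columns):
    """Map period labels such as 'Q_1' or 'Y_0' to the columns that start with them."""
    pattern = re.compile(r"^([QY]_\d+)")
    groups = defaultdict(list)
    for col in columns:
        match = pattern.match(col)
        # Columns without a Q_n / Y_n prefix are treated as per-company metadata.
        key = match.group(1) if match else "meta"
        groups[key].append(col)
    return dict(groups)


# Hypothetical column names, for illustration only.
columns = [
    "company_id",
    "Q_1_Revenue",
    "Q_1_Assets",
    "Q_4_Revenue",
    "Y_0_Revenue",
    "Y_3_Revenue",
]
groups = group_columns_by_period(columns)
print(groups)
```

With real data you could pass `X_train.columns` instead of the toy list, for example to build quarter-over-quarter differences from the `Q_1` and `Q_4` groups.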

There are 17 targets (**targets\_train.csv**), which represent the latest financial data points for each company. Participants need to train one or more models that map the historical financial performance of the companies (**X\_train.csv**) to their latest financial indicators.
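To illustrate the shape of the task (many feature columns in, 17 target columns out), here is a minimal sketch of a multi-target linear baseline fitted with NumPy least squares. It runs on synthetic data, not the competition files; the feature count, the 80/20 split, and the use of mean squared error are assumptions made for the example and do not describe the official metric or a recommended model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 companies, 10 features, 17 targets.
n_companies, n_features, n_targets = 200, 10, 17
X = rng.normal(size=(n_companies, n_features))
true_W = rng.normal(size=(n_features, n_targets))
Y = X @ true_W + 0.01 * rng.normal(size=(n_companies, n_targets))

# Hold out the last 20% of rows for validation.
split = int(0.8 * n_companies)
X_fit, X_val = X[:split], X[split:]
Y_fit, Y_val = Y[:split], Y[split:]


def add_bias(A):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([A, np.ones((A.shape[0], 1))])


# One least-squares solve fits all 17 targets at once (one weight column per target).
W, *_ = np.linalg.lstsq(add_bias(X_fit), Y_fit, rcond=None)
Y_pred = add_bias(X_val) @ W

val_mse = float(np.mean((Y_pred - Y_val) ** 2))
print(Y_pred.shape, val_mse)
```

The real features contain NaN values (see the advanced examples linked at the top), so they would need imputation before a solver like this can be applied.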

### Files

* **X\_train.csv** - training features
* **targets\_train.csv** - training targets
* **X\_test.csv** - testing features
* **sample\_submission.csv** - a sample submission file in the correct format
* **macro\_train.csv** - macroeconomic data for companies in **X\_train.csv**
* **macro\_test.csv** - macroeconomic data for companies in **X\_test.csv**
* **data\_dictionary.txt** - detailed description of the data points

The data archive can be downloaded [**here**](https://datathon-2-files.synnax.ai/dataset.zip).
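Once the archive is extracted, the CSV files listed above can be loaded in one pass. The sketch below assumes all files sit directly in a single directory; the helper name `load_dataset` and the directory name in the usage comment are hypothetical, and the archive's internal layout may differ.

```python
from pathlib import Path

import pandas as pd

# File names as listed in the Files section.
DATASET_FILES = {
    "X_train": "X_train.csv",
    "targets_train": "targets_train.csv",
    "X_test": "X_test.csv",
    "sample_submission": "sample_submission.csv",
    "macro_train": "macro_train.csv",
    "macro_test": "macro_test.csv",
}


def load_dataset(data_dir):
    """Read every CSV of the dataset into a dict of DataFrames keyed by short name."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / fname) for name, fname in DATASET_FILES.items()}


# Usage (hypothetical directory name):
# frames = load_dataset("dataset")
# print(frames["X_train"].shape)
```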

Please refer to **data\_dictionary.txt** for detailed column descriptions.
