CatBoost: Hype or Not?
One of the things I encountered while experimenting with different libraries and ML frameworks on Kaggle was a library called CatBoost. So naturally, I wanted to know what it is about, and what actually makes it different from, and often better performing than, more mainstream libraries such as XGBoost and LightGBM. That is why I decided to write this page: it is organized in parts, each representing a checkpoint on my path to understanding the gem that is CatBoost. Without further ado, let's dig into the matter!
The problem with XGBoost (& co.)
To begin with, we need to understand how gradient boosting algorithms are designed, and how they are implemented in these libraries.
How It Should Work
- Initialize the model with a constant (e.g., the mean of the targets on the full training set).
- For each boosting round $t = 1$ to $T$:
  - For each training example $i$:
    - Train a temporary version of the current model without using example $i$.
    - Compute the prediction for $x_i$ using this "leave-one-out" model: $F_{t-1}^{-i}(x_i)$, i.e., the round-$(t-1)$ model trained without example $i$.
    - Compute the residual: $residual_i = y_i - F_{t-1}^{-i}(x_i)$
  - Train a new weak learner (e.g., a tree) on all $(x_i, residual_i)$ pairs.
  - Add the new learner to the ensemble with a learning rate.
- During prediction (test time):
  - Simply apply the final model: \(F_T(x) = \text{initial pred} + \eta \times \sum \text{tree outputs}\)
  - No access to true targets, as it should be.
How It Actually Works
- Initialize the model (same as above).
- For each boosting round $t = 1$ to $T$:
  - Compute the residuals for all examples at once: $residual_i = y_i - F_{t-1}(x_i)$
    - But $F_{t-1}(x_i)$ comes from a model trained using example $i$.
  - Train a new weak learner on the $(x_i, residual_i)$ pairs.
  - Update the model: \(F_t(x_i) = F_{t-1}(x_i) + \eta \times h_t(x_i)\)
  - Repeat.
- During prediction (test time):
  - Apply the model normally: $F_T(x)$
  - No targets used, which is the correct behavior.
Problem: During training, each $y_i$ influences its own prediction path — but this doesn’t happen at test time.
As we can see, classical boosting implementations use the entire training dataset when computing each example's residual, which introduces a bias during training. This is exactly the issue CatBoost was developed to overcome.
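To make that bias concrete, here is a minimal sketch of the "how it actually works" loop, using scikit-learn's DecisionTreeRegressor as an illustrative weak learner (this is not what XGBoost or LightGBM do internally, just the same idea stripped down): each residual is computed from a model that was itself fitted on the very example it is evaluated on.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)

eta, T = 0.1, 10                      # learning rate and number of boosting rounds
F = np.full(len(y), y.mean())         # step 0: constant prediction
trees = []

for t in range(T):
    # The residuals are computed with F, which was fitted using every example,
    # including the one being predicted -- this is the source of the training bias.
    residuals = y - F
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += eta * tree.predict(X)        # each example's own target shaped this update
    trees.append(tree)

# At test time no targets are available, so training and test conditions differ.
```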
CatBoost: What is it about?
CatBoost is a gradient boosting algorithm designed to avoid overfitting, especially on datasets with categorical features, by preventing the model from accidentally using target information too early during training. It doesn't simply resolve the problem we mentioned above, computing predictions without bias; it goes further with categorical encoding and a technique called ordered boosting.
The problem with categorical feature representation
The core problem with categorical features in machine learning, and in gradient boosting in particular, is how to turn them into numbers the model can actually use.
One of the first techniques you may learn in your machine learning journey is one-hot encoding, that is:
For a categorical feature with k distinct values, we create k binary columns — each representing one category. A sample gets 1 in the column corresponding to its category, and 0 everywhere else.
Example
| Color | Red | Green | Blue |
|---|---|---|---|
| Green | 0 | 1 | 0 |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
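For reference, the same encoding can be produced in a couple of lines with pandas; this is just a sketch, with the column names taken from the table above:

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Green", "Red", "Blue"]})

# One binary column per distinct category
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(one_hot)
#    Color_Blue  Color_Green  Color_Red
# 0           0            1          0
# 1           0            0          1
# 2           1            0          0
```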
But the problem with this technique is that, with many labels, the dimensionality of our data suddenly explodes, sparsity along with it, and the result is generally inefficient for tree-based models. One way to avoid these problems, which I recently discovered and which is used in this context, is Target Statistics (TS): replace each categorical value $c$ with a statistic derived from the target variable $y$ over all training samples where the category equals $c$. The most common statistic is the conditional mean:
\[TS(c) = E[y \mid category = c]\]

| Color | TS = P(y=1 \| Color) |
|---|---|
| Green | 0.85 |
| Red | 0.63 |
| Blue | 0.11 |
This encoding preserves predictive signal (categories associated with higher target values receive higher numerical representations) while collapsing dimensionality from 3 to 1 in our example.
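As a quick sketch, the greedy version of this statistic is just a group-wise mean of the target; the values below are made up for illustration and do not match the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Green", "Green", "Red", "Red", "Blue"],
    "y":     [1,       1,       1,     0,     0],
})

# Greedy TS: replace each category by the mean target over ALL rows of that category
ts = df.groupby("Color")["y"].mean()
df["Color_TS"] = df["Color"].map(ts)
print(df)   # Green -> 1.0, Red -> 0.5, Blue -> 0.0
```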
The leakage problem with TS
If you compute TS in a greedy way (use all data, including the current example):
\[\hat x_{ik}= \frac{\sum_{j=1}^n 1\{x_{ij} = x_{ik}\} \cdot y_j + a p}{\sum_{j=1}^n 1\{x_{ij} = x_{ik}\} + a}\]

where $a$ is a smoothing parameter, $p$ is a prior, and $y_j$ is the target value of example $j$. Then the target of example $k$, $y_k$, appears in its own encoding.
Now let's look at a concrete situation. When predicting whether a customer buys (1) or not (0), using a naive target encoding for "country" can leak information. For example, if both US customers bought (1, 1), encoding "USA" as 1.0 includes each customer's own label, so the feature essentially "knows" the answer before the model learns it. This creates bias during training. At test time, new US customers still get encoded as 1.0, but that value no longer includes their personal outcome, breaking the consistency between training and testing. Even in this tiny example, you can see the problem: each row's feature encoding is contaminated by its own target. CatBoost's ordered target statistics fix this by calculating averages using only data from before each row, ensuring no self-leakage.
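Here is a small sketch of that exact situation, with made-up data: the greedy encoding for each training row already contains that row's own label, while a new test customer from the same country gets the same value without ever having contributed to it.

```python
import pandas as pd

train = pd.DataFrame({
    "country": ["USA", "USA", "FR", "FR"],
    "bought":  [1,      1,     0,    1],
})

# Greedy target encoding: every row, including row i itself, contributes to the mean
greedy = train.groupby("country")["bought"].mean()
train["country_TS"] = train["country"].map(greedy)
print(train)   # both USA rows are encoded as 1.0 -- their own labels are included

# A new US customer at test time gets the same 1.0, but their outcome was never part
# of that statistic, so the feature "means" something different than during training.
test = pd.DataFrame({"country": ["USA"]})
test["country_TS"] = test["country"].map(greedy)
print(test)
```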
The Fix: Ordered Target Statistics
CatBoost computes ordered TS using the same "ordering principle" as in ordered boosting:
- Generate a random permutation $\sigma$ of the training examples.
- For each example, compute its TS using only the "past" examples in that permutation:

\[\hat x_{ik}=\frac{\sum_{j: \sigma(j) < \sigma(k)} 1\{x_{ij} = x_{ik}\} \cdot y_j + a p}{\sum_{j: \sigma(j) < \sigma(k)} 1\{x_{ij} = x_{ik}\} + a}\]

- For test examples, use all training data.
This guarantees no leakage at any step of training. But you may wonder why permutations enter the ordering logic at all. They absolutely have a purpose here: if we only excluded the current sample (leave-one-out), we would still be letting a row "peek" at information from other rows that, in reality, would not yet exist at prediction time. By imposing a random permutation and restricting each row to its past, CatBoost ensures that encodings are computed the same way they would be at test time: no self-leakage, no future leakage, and efficient to compute in a single pass.
Imagine you don’t use permutations, and instead compute leave-one-out (LOO) encodings:
\[\hat x^{LOO}_{ik}= \frac{\sum_{j \neq k } 1\{x_{ij} = x_{ik}\} \cdot y_j + a p}{\sum_{j \neq k} 1\{x_{ij} = x_{ik}\} + a}\]

This does exclude $y_k$, so there is no self-leakage. But it still uses all other rows, no matter whether they come "before" or "after" row $k$.
At test time, when you see a new sample, you only have the training set you built the model on — you don’t get to “peek” at labels of future samples that you haven’t seen yet.
So using all $j\neq k$ during training gives each sample an encoding that has too much information compared to what would be available if that sample really came in online. That’s what the paper means by future data leakage.
To keep it simple, we are trying to simulate a data stream, placing ourselves in a real-world situation, which ultimately eliminates both the extra complexity and the potential bias coming from a sort of fake clairvoyance.
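Below is a minimal sketch of ordered TS following the formula above. The `ordered_ts` helper is my own illustrative function, not CatBoost's API: a random permutation is drawn, and each row's encoding uses only the rows that precede it in that permutation, plus the prior-smoothing term.

```python
import numpy as np
import pandas as pd

def ordered_ts(categories, targets, a=1.0, prior=None, seed=0):
    """Ordered target statistics: each row only sees its 'past' in a random permutation."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    prior = targets.mean() if prior is None else prior
    perm = np.random.default_rng(seed).permutation(len(targets))

    sums, counts = {}, {}               # running target sum / count per category
    encoded = np.empty(len(targets))
    for idx in perm:                    # walk the data in permutation order
        c = categories[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + a * prior) / (n + a)   # only "past" rows contribute
        sums[c] = s + targets[idx]                 # the row joins the past afterwards
        counts[c] = n + 1
    return encoded

df = pd.DataFrame({"country": ["USA", "USA", "FR", "FR"], "bought": [1, 1, 0, 1]})
df["country_TS"] = ordered_ts(df["country"], df["bought"])
print(df)   # no row's encoding contains its own label, nor any "future" label
```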
Another missing ordering
Now that we have seen that the use of an order in TS fixes both the bias and complexity issues, you might think: "What about the boosting itself?" Well, you're absolutely right! After all, residuals in classical gradient boosting suffer from the very same issue; they are computed using models that have already "seen" the target of the point they're supposed to predict.
This is exactly where ordered boosting comes in: CatBoost extends the ordering principle from feature encoding to the core training loop itself, ensuring that every residual is computed in a way that faithfully mimics test-time conditions.
The ordered solution
CatBoost introduces ordered boosting:
- Generate a random permutation of training samples.
- For each sample $k$, compute its residual using only the model trained on the prefix of samples that appear before $k$ in that permutation.
- This way, row $k$ never leaks its own label into its residual.
Residuals are then defined as: \(r_{k,t}= y_k - F_{t-1}^{\text{prefix}(k)}(x_k)\) where $F_{t-1}^{\text{prefix}(k)}$ is the model built from all rows before $k$ in the permutation.
Across multiple permutations, randomness ensures balance: no example always "arrives first" or always "arrives last"; each permutation gives slightly different encodings and residuals. This reduces variance and stabilizes training.
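Conceptually, and very inefficiently, ordered boosting looks like the sketch below: one "prefix model" per position in the permutation, with each row's residual taken from the model that has never seen it. This is a deliberately naive reading of the idea, not CatBoost's optimized implementation, which avoids the O(n) duplication with a small set of supporting models.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] + 0.1 * rng.normal(size=60)
n = len(y)

perm = rng.permutation(n)                 # sigma: the random ordering
Xp, yp = X[perm], y[perm]                 # data laid out in permutation order

eta, T = 0.1, 5
# M[i][j] = prediction for row j by the prefix model built from the first i rows only
M = np.zeros((n + 1, n))

for t in range(T):
    # The residual of row i uses only the model built from rows 0..i-1 (its "past")
    residuals = yp - np.array([M[i][i] for i in range(n)])
    for i in range(1, n + 1):
        # Update the i-th prefix model with a learner fitted on its own prefix
        tree = DecisionTreeRegressor(max_depth=2).fit(Xp[:i], residuals[:i])
        M[i] += eta * tree.predict(Xp)

# The final model corresponds to the full prefix M[n]; maintaining n models per round
# is exactly the cost that CatBoost's real implementation optimizes away.
```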
The actual implementation and the enhancements
In this section, we'll explain how CatBoost achieves this efficiency, and how it differs from classical boosting libraries:
Multiple permutations vs single dataset pass (already analyzed)
Feature combinations (crosses) vs manual feature engineering
CatBoost doesn’t just use single categorical features, but it also forms combinations of categories (e.g., “color + weight”). This allows the model to capture richer interactions between categorical variables.
- Classical boosting: relies on the user to provide feature crosses (e.g. “country × device type”).
- CatBoost: automatically creates combinations of categorical features and applies ordered TS to them.
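In practice you don't implement any of this yourself; you tell CatBoost which columns are categorical and it handles ordered TS and feature combinations internally. A minimal usage sketch with made-up data follows; to the best of my understanding, `max_ctr_complexity` caps how many categorical features get combined, but check the current CatBoost docs before relying on it.

```python
import pandas as pd
from catboost import CatBoostClassifier

# Hypothetical dataset: two categorical columns and one numeric column
df = pd.DataFrame({
    "country": ["USA", "USA", "FR", "FR", "DE", "DE"],
    "device":  ["mobile", "desktop", "mobile", "mobile", "desktop", "desktop"],
    "age":     [23, 45, 31, 52, 40, 28],
    "bought":  [1, 1, 0, 1, 0, 1],
})

model = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=4,
    max_ctr_complexity=2,   # allow pairwise combinations such as country x device
    verbose=False,
)

# Passing cat_features is what triggers ordered TS instead of manual encoding
model.fit(df[["country", "device", "age"]], df["bought"],
          cat_features=["country", "device"])
```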
Oblivious trees vs standard decision trees
CatBoost uses oblivious (symmetric) trees: the same split is applied at every node of a given depth, instead of the classical approach where trees are grown greedily and their shapes can vary (unbalanced, with different split structures).
Example
For classical decision trees, we will get something like this:
```
Age < 30 ?
├─ Yes
│   Income < 50 k ?
│   ├─ Yes → “Low”
│   └─ No  → “Medium”
└─ No
    HasChildren ?
    ├─ Yes → “Medium”
    └─ No  → “High”
```
We see that different conditions are tested at nodes of the same depth. An oblivious tree, by contrast, behaves symmetrically, for example:
```
Age < 30 ?
├─ Yes
│   Income < 50 k ?
│   ├─ Yes → Leaf1
│   └─ No  → Leaf2
└─ No
    Income < 50 k ?
    ├─ Yes → Leaf3
    └─ No  → Leaf4
```
Every path applies the same sequence of splits. This makes the tree much faster to evaluate, because the structure has a very simple encoding: the answer at each depth can be represented by a single 0 or 1, so a sample's leaf is just a $d$-bit index.
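That bit-per-depth encoding is easy to show in code. Here is a sketch of evaluating an oblivious tree of depth $d$: answer the same $d$ questions for every sample, read the answers as a binary number, and use it as the leaf index. The features mirror the example above, and the leaf values are made up.

```python
import numpy as np

# One (feature, threshold) split per depth -- the same split for every node at that depth
splits = [("age", 30), ("income", 50_000)]
leaf_values = np.array([0.2, 0.5, 0.4, 0.9])   # 2**depth leaves, illustrative values

def oblivious_predict(sample, splits, leaf_values):
    """Evaluate an oblivious tree: the leaf index is just a d-bit number."""
    index = 0
    for feature, threshold in splits:
        bit = int(sample[feature] >= threshold)   # 0 or 1 answer at this depth
        index = (index << 1) | bit
    return leaf_values[index]

print(oblivious_predict({"age": 25, "income": 60_000}, splits, leaf_values))  # leaf 0b01
print(oblivious_predict({"age": 40, "income": 20_000}, splits, leaf_values))  # leaf 0b10
```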
There is also less overfitting, thanks to ordered TS, which helps with rare categories and consequently makes the model more robust to noise, and thanks to the simpler tree structure and encoding, which lead to more efficient training and more realistic inference.
Conclusion
CatBoost was created to fix a subtle but fundamental problem in gradient boosting: prediction shift caused by hidden target leakage. Classical systems like XGBoost and LightGBM work really well, but when categorical features are encoded greedily or when residuals are computed with models that "peek" at their own labels, the result is a mismatch between training and test conditions. This bias quietly hurts generalization. That's where CatBoost comes in: by ordering both the target statistics and the boosting procedure, it removes the leakage at its source. The result is a boosting system that is not only fast and practical, but also mathematically principled: it trains under the same conditions it will face at test time. That alignment is why CatBoost consistently delivers strong accuracy, especially on datasets with many categorical features. CatBoost is simply gradient boosting without the hidden leaks.
This case study was based on the paper published by the Yandex researchers who created CatBoost. I hope this deep dive did a good job of explaining the different aspects of the algorithm and the problems it was designed to solve. I'm preparing new articles in parallel that are quite interesting, so stay tuned till next time, and thanks for reading!