From 0.76 to 0.78: A 48-Hour Hackathon Journey in Data Science

Hey everyone!

This post is a walkthrough of my experience at the HiParis Hackathon 2025, a grueling 48-hour data science marathon where we tackled the PISA Math Score Prediction challenge. What started as a promising first submission spiraled into a battle against hardware limitations, stubborn models, and the eternal quest for that extra percentage point.

Hackathon Vibes
November 30, 2025

48 hours of intense coding, debugging, and learning. Sleep-deprived but incredibly rewarding. There's something special about solving problems under time pressure with a team, and even about the frustration of being stuck in third place!

The First Submit

We were handed a dataset of 1.17 million student records from PISA with 307 features, and one task: predict Math Scores. After some quick data exploration and feature preprocessing, we trained a finetuned CatBoost model.

First submission: R² = 0.76

A solid start! CatBoost handled the categorical features and missing values like a champ. But we knew we could do better.
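For reference, here's a minimal sketch of what that kind of baseline looks like. The file name, target column, and hyperparameters are illustrative, not our exact setup:

```python
# Minimal CatBoost baseline sketch (file/column names and hyperparameters are illustrative).
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("train.csv")            # ~1.17M rows, 307 features (assumed file name)
X = df.drop(columns=["MathScore"])       # assumed target column name
y = df["MathScore"]

# CatBoost handles categorical features and NaNs natively; we just list the
# categorical columns (it does want strings, not NaN, in those columns).
cat_features = X.select_dtypes(include="object").columns.tolist()
X[cat_features] = X[cat_features].fillna("missing")

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = CatBoostRegressor(
    iterations=2000,
    learning_rate=0.05,
    depth=8,
    loss_function="RMSE",
    cat_features=cat_features,
    verbose=200,
)
model.fit(X_tr, y_tr, eval_set=(X_val, y_val), early_stopping_rounds=100)
print("Validation R2:", r2_score(y_val, model.predict(X_val)))
```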

The Compute Nightmare

Here’s where things got… challenging. My trusty laptop with its 8th gen i5 was struggling hard. Training on 1.17 million rows? Forget it.

“No problem,” we thought, “let’s use Google Colab!”

The kernel crashed. Not enough RAM to even load the full dataset.

Hardware Hell
November 30, 2025

We explored several solutions: stream learning, batch learning, chunked processing… All seemed promising in theory but added complexity we didn’t have time for during a hackathon.

The solution? Kaggle Notebooks. With 32GB of RAM, we could finally load and explore the full dataset properly. Sometimes the simplest solution is just finding better hardware!
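For the curious, the chunked/downcasting route we considered (before simply switching hardware) would have looked roughly like this; the file name, chunk size, and dtype choices are assumptions:

```python
# Sketch of chunked loading with dtype downcasting to shrink memory (illustrative;
# not the pipeline we shipped, since we moved to Kaggle's 32GB machines instead).
import numpy as np
import pandas as pd

def load_downcast(path: str, chunksize: int = 200_000) -> pd.DataFrame:
    """Read the CSV in chunks and downcast numeric columns to cut memory roughly in half."""
    chunks = []
    for chunk in pd.read_csv(path, chunksize=chunksize, low_memory=False):
        for col in chunk.select_dtypes(include="float64").columns:
            chunk[col] = chunk[col].astype(np.float32)
        for col in chunk.select_dtypes(include="int64").columns:
            chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)

df = load_downcast("train.csv")  # assumed file name
print(f"{df.memory_usage(deep=True).sum() / 1e9:.2f} GB in memory")
```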

Model Benchmarking

With proper compute available, we started benchmarking different models:

  • CatBoost: Still our champion: fast, handles NaN natively, great performance
  • LightGBM: Promising on paper, but painfully slow even on GPU with this data volume
  • XGBoost: Decent but no improvement over CatBoost
  • Linear Models (Ridge, Lasso): Intuition suggested a linear relationship might exist between subject scores, but performance was poor. The data had too many non-linearities and interactions

The frustrating part? LightGBM performed poorly when trained on partial data, and we couldn’t afford to train it on the full dataset. CatBoost remained king.
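The benchmarking loop itself was nothing fancy; a sketch is below. To keep it simple (and so Ridge and XGBoost can run without extra categorical encoding), the sketch uses a numeric subsample; the sample size and naive -999 imputation are assumptions:

```python
# Quick-and-dirty model comparison on a numeric subsample (sizes and settings illustrative).
import pandas as pd
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")                    # assumed file name
sample = df.sample(n=200_000, random_state=42)   # subsample: full-data runs were too slow
X_num = sample.select_dtypes(include="number").drop(columns=["MathScore"]).fillna(-999)
y = sample["MathScore"]

models = {
    "CatBoost": CatBoostRegressor(iterations=500, verbose=0),
    "LightGBM": LGBMRegressor(n_estimators=500),
    "XGBoost": XGBRegressor(n_estimators=500, tree_method="hist"),
    "Ridge": Ridge(alpha=1.0),
}

for name, model in models.items():
    scores = cross_val_score(model, X_num, y, cv=3, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```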

The Negative Values Problem

Then we hit a wall: our model was predicting negative scores. Math scores can’t be negative!

That’s when I got the idea for a two-stage approach:

  1. Classification: Is MathScore = 0? (handling the 38% of zeros in the dataset)
  2. Regression: Predict the actual score for non-zero cases

This elegantly solved the negative prediction issue and improved our overall performance.
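The training side of the idea fits in a few lines. Here's a hedged sketch (a single regressor is shown for brevity, and the hyperparameters are illustrative):

```python
# Two-stage training sketch: (1) zero vs non-zero classifier, (2) regressor on non-zero rows.
from catboost import CatBoostClassifier, CatBoostRegressor

def fit_two_stage(X, y, cat_features):
    # Stage 1: does this student have MathScore == 0?
    clf = CatBoostClassifier(iterations=1000, eval_metric="AUC",
                             cat_features=cat_features, verbose=0)
    clf.fit(X, (y == 0).astype(int))

    # Stage 2: regression only on the rows with a non-zero score.
    mask = y > 0
    reg = CatBoostRegressor(iterations=2000, loss_function="RMSE",
                            cat_features=cat_features, verbose=0)
    reg.fit(X[mask], y[mask])
    return clf, reg
```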

The Two-Stage Idea
November 30, 2025

Sometimes the best solutions come from stepping back and rethinking the problem structure. Instead of forcing a single model to handle everything, we split the task into what it really was: a classification problem followed by a regression problem.

The Feature Engineering Rabbit Hole

We tried everything to squeeze out more performance:

  • Aggregate features (mean, std, max, min of related columns; see the sketch after this list)
  • Interaction features
  • Log transforms
  • Target normalization
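
The aggregate features boiled down to something like this, grouping related columns by a shared prefix (the prefixes below are hypothetical placeholders for the actual PISA column groups):

```python
# Aggregate-feature sketch: summary statistics over groups of related numeric columns.
# Prefixes are hypothetical; in the real data we grouped related questionnaire columns.
import pandas as pd

def add_aggregates(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    cols = [c for c in df.select_dtypes("number").columns if c.startswith(prefix)]
    if cols:
        df[f"{prefix}_mean"] = df[cols].mean(axis=1)
        df[f"{prefix}_std"] = df[cols].std(axis=1)
        df[f"{prefix}_max"] = df[cols].max(axis=1)
        df[f"{prefix}_min"] = df[cols].min(axis=1)
    return df

df = pd.read_csv("train.csv")            # assumed file name
for prefix in ["Science", "Reading"]:    # illustrative domain prefixes
    df = add_aggregates(df, prefix)
```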

And then we went deeper, exploring advanced techniques like Nyström Kernel PCA for feature extraction. The idea was to run Nyström KPCA on the “Science” columns to get dense features representing “Science Ability,” then repeat for other domains. Mathematically elegant. In practice? It performed worse than our finetuned CatBoost baseline. (I’ll do a dedicated blog post on Nyström kernels later. The theory deserves its own deep dive!)
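For illustration, one way to approximate the idea with scikit-learn is a Nystroem kernel map followed by PCA on top. This is a sketch of the concept rather than our exact code; the column prefix, kernel, and component counts are assumptions:

```python
# Approximate "kernel PCA" via Nystroem + PCA on the Science-related columns
# (conceptual sketch; column prefix, kernel, and dimensions are assumptions).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")                                    # assumed file name
science_cols = [c for c in df.select_dtypes("number").columns
                if c.startswith("Science")]                      # hypothetical prefix

nystrom_kpca = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=256, random_state=42),   # landmark kernel map
    PCA(n_components=16),                                        # compress to dense factors
)
science_ability = nystrom_kpca.fit_transform(df[science_cols])   # "Science Ability" features
```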

The Frustrating Ceiling
November 30, 2025

Despite all our efforts (two-stage architecture, feature engineering, advanced kernel methods, hyperparameter tuning), we kept hitting the same wall: 0.78 R². Third place. So close, yet so far.

Final Architecture

After 48 hours, our final pipeline looked like this:

  1. Data Cleaning: Remove leaky columns, handle missing values
  2. Stage 1 - Classifier: CatBoost to predict zero vs non-zero (AUC: 0.997)
  3. Stage 2 - Regressor: Ensemble of finetuned CatBoost models (R²: ~0.65 on non-zeros)
  4. Combination: Use the classifier to route predictions (sketched below)
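
The routing step (item 4 above) is the simple part. Building on the two-stage sketch from earlier, it looks roughly like this (the 0.5 threshold is illustrative):

```python
# Prediction-time routing: the classifier decides which rows are zeros,
# the regressor fills in the rest. Threshold is illustrative.
import numpy as np

def predict_two_stage(clf, reg, X, threshold: float = 0.5):
    p_zero = clf.predict_proba(X)[:, 1]   # P(MathScore == 0) from stage 1
    preds = reg.predict(X)                # stage-2 regression estimate
    preds = np.clip(preds, 0, None)       # scores can't be negative
    preds[p_zero >= threshold] = 0.0      # route confident zeros to exactly 0
    return preds
```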

What I Learned

1. Hardware matters. Don't fight your tools; find better ones (Kaggle’s 32GB RAM saved us).

2. Simple models, well-tuned, beat complex models poorly implemented. CatBoost outperformed all our fancy kernel methods.

3. Problem decomposition is powerful. The two-stage approach solved multiple issues at once.

4. Know when to stop. Sometimes 0.78 is your ceiling with the available features. Accepting this saves hours of frustration.

Final Thoughts
November 30, 2025

Would I do it again? Absolutely. 48 hours of sleep deprivation, frustrating debugging sessions, and hitting walls, but also incredible learning, teamwork, and the satisfaction of building something that actually works. See you at the next hackathon!

Check Out the Code

The complete implementation, including data preprocessing, the two-stage model, and all our experiments, is available on GitHub:

HiParis 2025 Repository

Feel free to explore, fork, and reach out if you have questions!

This post is licensed under CC BY 4.0 by the author.