The Causal Inference Series : Part I
Introduction
Sometimes, numbers tell a story that feels clear, until you look a little closer.
Take this example: two hospitals are treating patients for the same illness. Hospital A reports an 80% recovery rate, while Hospital B trails behind at 60%. Naturally, you’d assume Hospital A is doing a better job. But here’s the twist: Hospital A mainly handles mild cases, while Hospital B treats more severe ones. When you break the data down by case severity, Hospital B actually has better recovery rates for both mild and severe cases. The aggregate numbers were hiding the truth.
This kind of statistical surprise is known as Simpson’s Paradox: a trend appears in aggregate data but disappears or reverses when you look at the subgroups.
A famous real-world example happened in 1973 at UC Berkeley. At the time, their graduate school admissions data seemed to show discrimination against women: 44% of male applicants were admitted, compared to just 35% of female applicants. But when researchers dug deeper, the story flipped. Most departments were actually admitting women at higher rates than men. The paradox? Women were applying more often to highly competitive departments with lower acceptance rates overall.
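To make the reversal concrete, here is a minimal sketch in Python with made-up numbers (illustrative only, not the actual 1973 Berkeley figures): two departments, one easy and one competitive, where women are admitted at higher rates in each department yet at a lower rate overall.

```python
import pandas as pd

# Synthetic, illustrative admissions data (NOT the real 1973 Berkeley figures).
# Department A is easy to get into; Department B is highly competitive.
data = pd.DataFrame({
    "gender":     ["M", "M", "F", "F"],
    "department": ["A", "B", "A", "B"],
    "applicants": [800, 200, 200, 800],
    "admitted":   [480,  40, 140, 200],
})

# Aggregate admission rates: men appear to be favored overall.
overall = data.groupby("gender").sum(numeric_only=True)
print(overall["admitted"] / overall["applicants"])
# M: (480 + 40) / 1000 = 0.52,   F: (140 + 200) / 1000 = 0.34

# Per-department rates: women do better in every department.
per_dept = data.set_index(["department", "gender"])
print(per_dept["admitted"] / per_dept["applicants"])
# Dept A -> M: 0.60, F: 0.70;   Dept B -> M: 0.20, F: 0.25
```

The per-department rates and the aggregate rates point in opposite directions, which is exactly Simpson’s Paradox: women apply mostly to the competitive department, and that choice, not the admission decisions, drives the aggregate gap.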
These examples illustrate the fundamental challenge: observing that two things occur together does not, by itself, tell us whether one causes the other.
So in this series, we’ll explore how concepts like Simpson’s Paradox lead us into the deeper world of causal reasoning. By the end, we’ll move beyond surface-level data and into the mathematical foundations of what it means to say that one thing causes another.
Correlation vs Causation
Correlation (or association): Two variables $X$ and $Y$ display a statistical relationship, e.g. as $X$ increases, $Y$ tends to increase (positive correlation), or tends to decrease (negative correlation).
Causation: A change in $X$ brings about (directly or indirectly) a change in $Y$; we write $X \to Y$.
A famous admonition is: “correlation does not imply causation.” That is, the fact that $X$ and $Y$ are correlated is not sufficient to conclude $X$ causes $Y$.
How correlations can arise without causation
When you see a correlation, several possible underlying “stories” could explain it; not all of them involve $X \to Y$. Here are common scenarios:
Reverse causation (or bidirectional causation): $Y \to X$, or both directions.
- E.g. more wealth might increase education opportunities, but also more education might lead to higher wealth.
Confounding / common cause: A third variable $Z$ influences both $X$ and $Y$; the resulting correlation between $X$ and $Y$ is spurious as far as $X \to Y$ is concerned.
- Example: Ice cream sales and drowning incidents correlate (both rise in summer). The hidden confounder is ambient temperature (see the short simulation after this list).
Coincidence / randomness: The correlation is simply by chance, especially if one looks at many variable pairs and picks those that “look interesting.”
- Even if $\rho = 0.9$ (a high correlation), that does not guarantee a causal link. Correlations may arise spuriously in large data sets.
Mediation / indirect causation: $X$ causes a mediator $M$, which in turn causes $Y$. So $X$ is upstream of $Y$, but not via a direct link.
- In this case one can speak of a causal chain $X \to M \to Y$.
Measurement artifacts, selection bias, or data issues: The observed correlation is introduced by how data is collected, missing data, aggregation, or measurement errors.
Simultaneity or feedback: $X$ and $Y$ respond to each other or to a common equilibrium (e.g. supply and demand).
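To see how a common cause manufactures a correlation, here is a small simulation of the ice-cream-and-drowning story (the coefficients are arbitrary, chosen only for illustration): temperature $Z$ drives both variables, there is no arrow from $X$ to $Y$, and yet the two are strongly correlated until we hold the confounder roughly fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Common cause Z (temperature) drives both X (ice cream sales)
# and Y (drowning incidents); there is no direct arrow X -> Y.
z = rng.normal(size=n)                      # temperature
x = 2.0 * z + rng.normal(size=n)            # ice cream sales
y = 1.5 * z + rng.normal(size=n)            # drowning incidents

print(np.corrcoef(x, y)[0, 1])              # strong positive correlation (~0.74)

# Holding the confounder (approximately) fixed removes the association:
hot_days = np.abs(z - 1.0) < 0.1            # a narrow slice of temperatures
print(np.corrcoef(x[hot_days], y[hot_days])[0, 1])  # close to 0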
Pearl’s Causal Hierarchy (Ladder of Causation)
Pearl proposes three levels of causal reasoning, each strictly more expressive than the previous.
| Level | Name / Role | Question Form | What It Captures | What You Need / What You Can Do |
|---|---|---|---|---|
| Level 1 | Association (Observing / Seeing) | “What is $P(Y \mid X)$?” | Statistical association / correlation / predictive relation | You only need the joint (or conditional) distributions of observed variables. You can do prediction, pattern recognition, statistical inference. |
| Level 2 | Intervention (Doing / Experimenting) | “What is $P(Y \mid \mathrm{do}(X))$?” | Causal effect of forcing $X$ to a value | You need a causal model (or assumptions like no unobserved confounding) that lets you reason about interventions (the “do-operator”). Enables policy evaluation, A/B testing, decision-making. |
| Level 3 | Counterfactuals (Imagining / Retrospective) | “What would $Y$ have been if $X$ had been different (for this same unit)?” | Individual-level causal reasoning, “what if” for the actual case | You need structural (mechanistic) causal models (structural equations, latent factors). Enables explanations, personalized treatment effects, “why did this instance happen?” |
Moving up the ladder requires stronger assumptions and different methods. Most AI operates only at Level 1. We’re going to explore these notions later in the series. For now, let’s move on to the core notions we’ll use to model causality.
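As a preview of the gap between Level 1 and Level 2, here is a toy structural model (the probabilities are assumed, purely for illustration) in which conditioning on $X = 1$ and intervening with $\mathrm{do}(X = 1)$ give different answers, because a confounder $Z$ also drives $Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(intervene_x=None):
    """Toy structural model: Z -> X, Z -> Y, X -> Y.
    If intervene_x is given, X is set by fiat (the do-operator),
    cutting the arrow Z -> X."""
    z = rng.binomial(1, 0.5, n)                        # confounder
    if intervene_x is None:
        x = rng.binomial(1, 0.2 + 0.6 * z)             # X listens to Z
    else:
        x = np.full(n, intervene_x)                    # do(X = x)
    y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)       # Y listens to X and Z
    return x, y

# Level 1: condition on observing X = 1 (confounded by Z)
x, y = simulate()
print("P(Y=1 | X=1)      ~", y[x == 1].mean())    # about 0.72

# Level 2: intervene and force X = 1 for everyone
_, y_do = simulate(intervene_x=1)
print("P(Y=1 | do(X=1))  ~", y_do.mean())         # about 0.60
```

The two numbers differ because observing $X = 1$ also tells us something about $Z$, while forcing $X = 1$ does not.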
Fundamental Causal Concepts
Treatment and Outcome
Treatment $T$: The intervention or exposure of interest. Can be binary (drug vs. placebo), multi-valued (low/medium/high dose), or continuous (price).
Outcome $Y$: The variable we care about measuring. Must be well-defined and measurable.
Unit $i$: The entity receiving the treatment; it could be a person, a company, a city, or any other observational unit.
Confounding
Confounder $C$: A variable that affects both the treatment and the outcome, creating a misleading association. Example: ice cream sales correlate with drowning deaths; temperature is a confounder, since hot weather increases both ice cream consumption and swimming (which raises drowning risk).
Notation: $C \to T$ and $C \to Y$.
Causal Effect
Individual Treatment Effect (ITE): The difference between outcomes under treatment vs. control for the same unit: \(\mathrm{ITE}_i = Y_{1,i} - Y_{0,i}\)
where $Y_{1,i}$ is the outcome if treated and $Y_{0,i}$ is the outcome if not treated.
The problem is that we can never observe both potential outcomes for the same unit; this is the Fundamental Problem of Causal Inference.
Observable difference: \(\mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0]\). This is simply the comparison of treated vs. untreated groups in observational data.
Observable vs. Causal quantity:
- Observable: $\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]$ (comparing the treated group to the control group)
- Causal: $\mathbb{E}[Y_{1}] - \mathbb{E}[Y_{0}]$ (comparing the same units under different treatments)
The causal parameter $\mathbb{E}[Y_1] - \mathbb{E}[Y_0]$ equals the observable difference only under special conditions (for example, when treatment is randomly assigned), which we’ll explore in upcoming posts.
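A simulation makes the gap between the two quantities tangible. In simulated data we can generate both potential outcomes $Y_{0,i}$ and $Y_{1,i}$ for every unit (something impossible with real data) and compare the true causal quantity with the naive observed difference. The setup below is a sketch with assumed numbers: a constant treatment effect of $+5$ and a hidden “ability” variable that makes weaker units more likely to be treated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# In a simulation we can write down BOTH potential outcomes for every unit.
ability = rng.normal(size=n)                    # hidden confounder
y0 = 50 + 10 * ability + rng.normal(size=n)     # outcome without treatment
y1 = y0 + 5                                     # treatment adds +5 for everyone

# Treatment is NOT random: low-ability units are more likely to be treated.
p_treat = 1 / (1 + np.exp(2 * ability))
t = rng.binomial(1, p_treat)

y = np.where(t == 1, y1, y0)                    # only one outcome is ever observed

print("Causal quantity   E[Y1] - E[Y0]       :", (y1 - y0).mean())               # ~ +5
print("Observed diff     E[Y|T=1] - E[Y|T=0] :", y[t == 1].mean() - y[t == 0].mean())  # negative
```

The true effect is $+5$ for every unit, yet the naive comparison comes out strongly negative, because the treated group started from a much lower baseline.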
The Selection Bias Problem
Why doesn’t a simple comparison work? Consider a study examining whether attending tutoring sessions $T$ improves exam scores $Y$. Observational data shows that students who attended tutoring scored $5$ points lower on average than those who didn’t: $\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0] = -5$. Does tutoring harm performance? No! The students who attended tutoring were struggling students with weaker backgrounds, who sought extra help precisely because they expected to perform poorly. The true causal effect $\mathbb{E}[Y_{1}] - \mathbb{E}[Y_{0}]$ might actually be $+10$ points of improvement. The observed $-5$ conflates the genuine treatment effect with selection bias: students who choose tutoring are already about $15$ points behind those who don’t, before any tutoring takes place. This is why simply comparing treated and untreated groups in observational data can be misleading when looking for cause-and-effect relationships.
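One way to see where these numbers come from is the standard decomposition of the observed difference into a causal effect plus selection bias (written here with the example’s illustrative values, and assuming the $+10$ effect is the same for everyone):

$$
\underbrace{\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]}_{-5}
=
\underbrace{\mathbb{E}[Y_1 - Y_0 \mid T=1]}_{+10 \ \text{(effect on the treated)}}
+
\underbrace{\mathbb{E}[Y_0 \mid T=1] - \mathbb{E}[Y_0 \mid T=0]}_{-15 \ \text{(selection bias)}}
$$

The treated group would have scored about $15$ points lower even without tutoring, and that baseline gap swamps the genuine $+10$ effect.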
What’s Next
In this post, we’ve established why correlation differs from causation and introduced the basic framework for thinking causally. In the next post, we’ll formalize these intuitions with the Potential Outcomes Framework, a way to clearly define cause-and-effect and think about “what if” scenarios.