Demystifying Linear Regression: From Basic Equations to Loss Functions
I’ve always found linear regression to be the perfect starting point for anyone venturing into machine learning. It’s simple enough to grasp intuitively, yet rich enough to introduce fundamental concepts that echo throughout more advanced models. In this post, I want to walk through linear regression with a focus on loss functions - the beating heart of how these models learn.
The Basic Linear Model
At its core, linear regression is about finding a straight line (or hyperplane in higher dimensions) that best fits our data. The equation is refreshingly simple:
\[\hat{y} = w^\intercal x + b\]
What’s happening here?
- $\hat{y}$ is our prediction - what we’re trying to get right
- $x$ represents our input features - the information we have available
- $w$ is the weights vector - how much each feature influences our prediction
- $b$ is the bias term - allowing our line to intersect the y-axis anywhere
I like to think of the weights as “importance factors.” They tell us how much each feature pushes or pulls our prediction. The bias lets our model make reasonable predictions even when all features are zero.
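To make this concrete, here’s a minimal sketch of the prediction step in NumPy. The feature values, weights, and bias below are made up purely for illustration:

```python
import numpy as np

# Hypothetical single observation with three features
x = np.array([120.0, 15.0, 2.5])   # input features
w = np.array([0.8, -0.3, 1.2])     # weights: how much each feature matters
b = 4.0                            # bias: the prediction when all features are zero

y_hat = w @ x + b                  # w^T x + b
print(y_hat)                       # 98.5
```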
Our goal is straightforward: find the values of $w$ and $b$ that make our predictions as close as possible to reality. But this raises a question - what exactly do we mean by “close”? This is where loss functions enter the picture.
Core Assumptions of Linear Regression
Before diving into loss functions, I should mention the assumptions that underpin linear regression. These aren’t just theoretical concerns - they directly affect how our model performs in practice:
Linearity: We’re assuming the relationship between inputs and outputs is actually linear. Nature doesn’t always comply with this neat assumption.
Independence: Each observation in our dataset should be independent of others. Like flipping a fair coin multiple times - one flip shouldn’t affect the next.
Homoscedasticity: A fancy word for “constant variance.” The errors in our predictions should be similarly spread out across all input values.
Normality of Errors: The errors (differences between predictions and actual values) should follow a normal distribution.
No Perfect Multicollinearity: Our features shouldn’t be perfect copies of each other, otherwise we can’t distinguish their individual effects.
I’ve found that understanding these assumptions helps not just in theory, but in diagnosing what’s going wrong when models underperform.
Loss Functions: Quantifying Prediction Error
So how do we measure the “wrongness” of our predictions? This is where loss functions come in. They give us a numerical score for how bad our predictions are - and our goal is to minimize this score.
1. Squared Error Loss (L² Loss)
The most common approach is to square the differences between predictions and actual values:
\[L(y, \hat{y}) = (y - \hat{y})^2\]
For a whole dataset, we average these squared errors to get the Mean Squared Error (MSE):
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
When we minimize this L² loss, we’re essentially finding the conditional mean of y given X - or $E[y \mid X]$ in statistical notation.
Let’s see why this works: If we have a single point and want to minimize $(y - \mu)^2$, we take the derivative with respect to μ and set it to zero:
\[\frac{d}{d\mu} (y - \mu)^2 = -2(y - \mu) = 0 \implies \mu = y\]
For multiple points, this extends to the sample mean. When we’re doing linear regression with $\hat{y} = w^\intercal x + b$, minimizing the squared errors gives us the famous Normal Equation:
\[\beta = (X^\intercal X)^{-1} X^\intercal y\]
This equation gives us the optimal parameters in one elegant step (assuming $X^\intercal X$ is invertible). If we append a column of ones to $X$, the bias $b$ is estimated as part of $\beta$ alongside the weights.
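As a sketch, here’s the Normal Equation solved with NumPy on synthetic data. Appending a column of ones to $X$ folds the bias into $\beta$, and solving the linear system is numerically safer than explicitly inverting $X^\intercal X$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, 3 features
true_w, true_b = np.array([1.5, -2.0, 0.7]), 3.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=100)

X_aug = np.hstack([X, np.ones((100, 1))])        # column of ones absorbs the bias
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)   # solves (X^T X) beta = X^T y
print(beta)                                      # approximately [1.5, -2.0, 0.7, 3.0]
```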
What This Means in Practice
Outliers have outsized influence: Squaring the errors means large errors are penalized heavily. One extreme outlier can significantly pull your regression line.
It assumes Gaussian errors: minimizing L² loss is equivalent to maximum likelihood estimation when the errors are normally distributed. If your errors follow this pattern, L² is the statistically efficient choice.
It’s smooth and differentiable: Making it easier to optimize with techniques like gradient descent.
I typically reach for L² loss when I believe my data has well-behaved errors and when I care equally about over-predictions and under-predictions.
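To see the outlier effect in numbers, here’s a tiny sketch with made-up residuals; a single extreme error dominates the entire average once it’s squared:

```python
import numpy as np

residuals = np.array([0.5, -0.3, 0.2, 0.4])       # typical, well-behaved errors
with_outlier = np.append(residuals, 10.0)         # add one extreme error

print(np.mean(residuals ** 2))      # 0.135  -- modest MSE
print(np.mean(with_outlier ** 2))   # 20.108 -- the single outlier dominates
```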
2. Absolute Error Loss (L¹ Loss)
An alternative is to use the absolute difference between predictions and actual values:
\[L(y, \hat{y}) = |y - \hat{y}|\]
The corresponding aggregate metric is the Mean Absolute Error (MAE):
\[\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|\]
When we minimize L¹ loss, we’re finding the conditional median of our output variable. This makes intuitive sense: the median is the value that minimizes the sum of absolute deviations.
Unlike with squared error, there’s no simple closed-form solution like the Normal Equation. We need to use specialized optimization algorithms.
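As a rough sketch of what such an algorithm can look like, here’s plain subgradient descent on the MAE, with np.sign standing in for the derivative of the absolute value (which is undefined at zero). In practice you’d more likely reach for a dedicated solver, for example scikit-learn’s QuantileRegressor with quantile=0.5:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.laplace(scale=0.3, size=200)

X_aug = np.hstack([X, np.ones((200, 1))])   # fold the bias into beta
beta = np.zeros(3)
lr = 0.01
for _ in range(2000):
    residuals = y - X_aug @ beta
    grad = -X_aug.T @ np.sign(residuals) / len(y)   # subgradient of mean |residual|
    beta -= lr * grad

print(beta)   # roughly [2.0, -1.0, 0.5], up to noise and step size
```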
What This Means in Practice
More robust to outliers: Since we don’t square the errors, extreme values don’t have the same overwhelming influence.
It assumes Laplace-distributed errors: minimizing L¹ loss is equivalent to maximum likelihood estimation under a Laplace error distribution, which has heavier tails than the normal distribution.
Not differentiable at zero: Creating some challenges for optimization.
I’ve found L¹ loss particularly useful when working with financial data or other domains where outliers are common and meaningful.
3. Huber Loss: The Best of Both Worlds?
Huber loss is an elegant compromise between L¹ and L² loss. It behaves like squared error for small errors and like absolute error for large ones:
\[L(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2, & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}\]
The parameter δ controls where the transition happens between the quadratic and linear regions.
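A direct translation of this piecewise definition into NumPy might look like the following sketch (δ = 1.0 here is just a placeholder; it’s the parameter you tune):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    residual = np.abs(y_true - y_pred)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.where(residual <= delta, quadratic, linear)

print(huber_loss(np.array([0.0, 0.0]), np.array([0.5, 3.0])))   # [0.125 2.5]
```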
What This Means in Practice
Adaptive robustness: You get the efficiency of L² for most points, but outliers are handled more like L¹.
Fully differentiable: Unlike L¹ loss, Huber loss is smooth everywhere.
Requires tuning: The transition parameter δ needs to be set appropriately for your data.
I’ve found Huber loss to be a practical choice for many real-world datasets where I suspect there might be some outliers, but I don’t want to completely abandon the nice properties of squared error.
4. Log-Likelihood (Cross-Entropy) - A Probabilistic View
While primarily used for classification, log-likelihood offers a valuable probabilistic perspective. For binary classification:
\[L(y, \hat{p}) = -\left[ y \log(\hat{p}) + (1 - y) \log(1 - \hat{p}) \right]\]
Here, $\hat{p}$ is the predicted probability that y equals 1.
Minimizing this loss is equivalent to finding the Maximum Likelihood Estimate for our parameters - the values that make our observed data most probable.
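A minimal sketch of the binary cross-entropy computation, using made-up labels and predicted probabilities (the clip guards against taking the log of zero):

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])                 # hypothetical binary labels
p_hat = np.array([0.9, 0.2, 0.6, 0.8])     # hypothetical predicted probabilities
print(binary_cross_entropy(y, p_hat))      # ~0.266
```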
Which Loss Function Should You Choose?
After working with these loss functions across various projects, I’ve developed some rules of thumb:
| Loss Function | When to Use It |
|---|---|
| Squared Error (L²) | When you believe errors are normally distributed and outliers are rare or should have high influence |
| Absolute Error (L¹) | When outliers are common or when you want your model to predict medians rather than means |
| Huber | When you want robustness to outliers but still prefer the nice mathematical properties of L² loss for most points |
| Log-Likelihood | When you need probabilistic outputs, especially for classification |
Remember that your choice of loss function directly impacts what your model learns to predict - means, medians, or something in between.
What’s Next?
This exploration of linear regression and loss functions lays the groundwork for many more advanced topics. In future posts, I’ll dive deeper into the concept of robustness - a critical consideration as we scale to more complex models and messier real-world data.
Loss functions are more than just technical details - they encode our definition of what makes a good prediction. Choose wisely, and your models will better align with what you’re truly trying to accomplish.
What loss functions have you found most useful in your work? I’d love to hear about your experiences in the comments.