Why you should read this

As someone who did a Bachelor's in Mathematics, I was first introduced to L¹ and L² as measures of distance… now they seem to be measures of error. Where did we go wrong? But jokes aside, there seems to be a misconception that L¹ and L² serve the same function, and while that may sometimes be true, each norm shapes its models in drastically different ways.

In this article we’ll travel from plain old points on a line all the way to L∞, stopping to see why L¹ and L² matter, how they differ, and where the L∞ norm shows up in AI.

Our Agenda:

  • When to use L¹ versus L² loss
  • How L¹ and L² regularization pull a model toward sparsity or smooth shrinkage
  • Why the tiniest algebraic difference blurs GAN images — or leaves them razor-sharp
  • How to generalize distance to Lᵖ space and what the L∞ norm represents

A Brief Note on Mathematical Abstraction

You might have had a conversation (perhaps a confusing one) where the term mathematical abstraction popped up, and you might have left that conversation feeling a little more confused about what mathematicians are really doing. Abstraction refers to extracting the underlying patterns and properties of a concept and generalizing it so it has wider application. This might sound complicated, but take a look at this trivial example:

A point in 1-D is x = x₁; in 2-D: x = (x₁, x₂); in 3-D: x = (x₁, x₂, x₃). Now I don’t know about you, but I can’t visualize 42 dimensions; still, the same pattern tells me a point in 42 dimensions would be x = (x₁, …, x₄₂).

This might seem trivial, but this concept of abstraction is key to getting to L∞, where instead of a point we abstract distance. From now on let’s work with x = (x₁, x₂, x₃, …, xₙ), otherwise known by its formal title: x ∈ ℝⁿ. And any vector is v = x − y = (x₁ − y₁, x₂ − y₂, …, xₙ − yₙ).
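If notation isn’t your thing, here is the same idea as a quick numpy sketch (the numbers and the choice of four dimensions are arbitrary):

import numpy as np

# A point in R^4; the exact same code works unchanged in R^42
x = np.array([1.0, 4.0, 2.0, 7.0])
y = np.array([3.0, 1.0, 5.0, 2.0])

# The vector from y to x: (x1 - y1, ..., xn - yn)
v = x - y
print(v)   # [-2.  3. -3.  5.]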

The “Normal” Norms: L¹ and L²

The key takeaway is simple but powerful: because the L¹ and L² norms behave differently in a few crucial ways, you can combine them in one objective to juggle two competing goals. In regularization, the L¹ and L² terms inside the loss function help strike the best spot on the bias-variance spectrum, yielding a model that is both accurate and generalizable. In GANs, the L¹ pixel loss is paired with an adversarial loss so the generator makes images that (i) look realistic and (ii) match the intended output. Tiny distinctions between the two losses explain why Lasso performs feature selection and why swapping L¹ out for L² in a GAN often produces blurry images.

Code on GitHub

L¹ vs. L² Loss — Similarities and Differences

  • If your data may contain many outliers or heavy-tailed noise, you usually reach for L¹ loss (mean absolute error, MAE).
  • If you care most about overall squared error and have reasonably clean data, L² loss (mean squared error, MSE) is fine, and it is easier to optimize because it is smooth.

Because MAE treats each error proportionally, models trained with L¹ sit nearer the median observation, which is exactly why L¹ loss keeps texture detail in GANs, whereas MSE’s quadratic penalty nudges the model toward a mean value that looks smeared.
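Here is a minimal numpy sketch of that behaviour, using made-up numbers with one outlier. The constant prediction that minimizes MSE is the mean, while the one that minimizes MAE is the median, so the outlier drags the L² solution much further than the L¹ one:

import numpy as np

y_true = np.array([1.0, 1.2, 0.9, 1.1, 15.0])          # one large outlier

pred_mean = np.full_like(y_true, y_true.mean())        # where L2 (MSE) pulls a constant model
pred_median = np.full_like(y_true, np.median(y_true))  # where L1 (MAE) pulls a constant model

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

print("mean prediction:  ", pred_mean[0], "MAE:", mae(y_true, pred_mean), "MSE:", mse(y_true, pred_mean))
print("median prediction:", pred_median[0], "MAE:", mae(y_true, pred_median), "MSE:", mse(y_true, pred_median))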

L¹ Regularization (Lasso)

Optimization and regularization pull in opposite directions: optimization tries to fit the training set perfectly, while regularization deliberately sacrifices a little training accuracy to gain generalization. Adding an L¹ penalty α∥w∥₁ promotes sparsity: many coefficients collapse all the way to zero. A bigger α means harsher feature pruning, simpler models, and less noise from irrelevant inputs. With Lasso, you get built-in feature selection because the ∥w∥₁ term literally turns small weights off, whereas L² merely shrinks them.
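Concretely, in scikit-learn’s parameterization (the one used by the code below), the penalized objective being minimized is roughly:

$$\min_{w}\ \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1$$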

L² Regularization (Ridge)

Change the regularization term to α∥w∥₂² and you have Ridge regression. Ridge shrinks weights toward zero without usually hitting exactly zero. That discourages any single feature from dominating while still keeping every feature in play, which is handy when you believe all inputs matter but you want to curb overfitting.

Both Lasso and Ridge improve generalization. With Lasso, once a weight hits zero, the optimizer feels no strong reason to leave; it’s like standing still on flat ground, so zeros naturally “stick.” In more technical terms, the two penalties mold the coefficient space differently: Lasso’s diamond-shaped constraint set zeroes out coordinates, while Ridge’s spherical set simply squeezes them. Don’t worry if that didn’t land; there is quite a lot of theory beyond the scope of this article, but if it interests you, this reading on Lₚ space should help.

But back to the point. Notice how, when we train both models on the same data, Lasso removes some input features by setting their coefficients exactly to zero.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data: 30 features, only 5 of which are informative
X, y = make_regression(n_samples=100, n_features=30, n_informative=5, noise=10)

# Lasso (L1 penalty) zeroes out many coefficients
model = Lasso(alpha=0.1).fit(X, y)
print("Lasso nonzero coeffs:", (model.coef_ != 0).sum())

# Ridge (L2 penalty) shrinks coefficients but keeps them nonzero
model = Ridge(alpha=0.1).fit(X, y)
print("Ridge nonzero coeffs:", (model.coef_ != 0).sum())

Notice how, if we increase α to 10, many more features are eliminated. This can be quite dangerous, as we could be discarding informative features.

model = Lasso(alpha=10).fit(X, y)
print("Lasso nonzero coeffs:", (model.coef_ != 0).sum())

model = Ridge(alpha=10).fit(X, y)
print("Ridge nonzero coeffs:", (model.coef_ != 0).sum())

L¹ Loss in Generative Adversarial Networks (GANs)

GANs pit two networks against each other: a Generator G (the “forger”) against a Discriminator D (the “detective”). To make G produce convincing and faithful images, many image-to-image GANs use a hybrid loss of roughly this form:

L(G) = L_adv(G, D) + λ · ∥y − G(x)∥₁

where

  • x — input image (e.g., a sketch)
  • y — real target image (e.g., a photo)
  • λ — balance knob between realism and fidelity

Swap the pixel loss to L² and you square pixel errors; large residuals dominate the objective, so G plays it safe by predicting the mean of all plausible textures, and the result is smoother, blurrier outputs. With L¹, every pixel error counts the same, so G gravitates to the median texture patch and keeps sharp boundaries.
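A rough numpy sketch of the two pixel losses and the hybrid objective. Everything here is a stand-in (random arrays instead of real images, a placeholder adversarial term, an illustrative λ), not a working GAN:

import numpy as np

rng = np.random.default_rng(0)
real = rng.random((64, 64, 3))                   # stand-in for the real target image y
fake = real + rng.normal(0.0, 0.1, real.shape)   # stand-in for the generator output G(x)

l1_pixel = np.mean(np.abs(real - fake))    # linear penalty: every pixel error counts the same
l2_pixel = np.mean((real - fake) ** 2)     # quadratic penalty: large residuals dominate

adv = 0.7     # placeholder for the adversarial term, normally computed from D's output
lam = 100.0   # the balance knob between realism and fidelity (value is illustrative)
hybrid = adv + lam * l1_pixel
print(l1_pixel, l2_pixel, hybrid)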

Why tiny differences matter

  • In regression, the kink in the L¹ penalty’s derivative at zero lets Lasso zero out weak predictors, whereas Ridge only nudges them.
  • In vision, the linear penalty of L¹ keeps high-frequency detail that L² blurs away.
  • In both cases you can blend L¹ and L² to trade off robustness, sparsity, and smooth optimization, which is exactly the balancing act at the heart of modern machine-learning objectives.

Generalizing Distance to Lᵖ

Before we reach L∞, we need to talk about the four rules every norm must satisfy:

  • Non-negativity — A distance can’t be negative; nobody says “I’m –10 m from the pool.”
  • Positive definiteness — The distance is zero only for the zero vector, where no displacement has happened.
  • Absolute homogeneity (scalability) — Scaling a vector by α scales its length by |α|: doubling the vector doubles its length.
  • Triangle inequality — Going straight is never longer than taking a detour through another point: ∥x + y∥ ≤ ∥x∥ + ∥y∥.

At the beginning of this article, the mathematical abstraction we performed was quite straightforward. But now, as we look at the following norms, we’re doing something similar at a deeper level. There’s a clear pattern: the exponent inside the sum increases by one each time, and the order of the root outside the sum increases with it. We’re also checking whether this more abstract notion of distance still satisfies the core properties mentioned above. It does. So what we’ve done is successfully abstract the concept of distance into Lᵖ space:

∥v∥ₚ = (|v₁|ᵖ + |v₂|ᵖ + … + |vₙ|ᵖ)^(1/p)

This gives a single family of distances, the Lᵖ norms. Taking the limit as p → ∞ squeezes that family all the way to the L∞ norm.

The L∞ Norm

The L∞ norm goes by many names (supremum norm, max norm, uniform norm, Chebyshev norm), but all of them are characterized by the following limit:

∥v∥∞ = lim (p→∞) (|v₁|ᵖ + |v₂|ᵖ + … + |vₙ|ᵖ)^(1/p)
By generalizing our norm to Lᵖ space, in two lines of code we can write a function that calculates distance in any norm imaginable. Quite useful.

def Lp_norm(v, p):
    # Sum |x|^p over the coordinates, then take the p-th root
    return sum(abs(x)**p for x in v) ** (1/p)

We can now think about how our measure of distance changes as p increases. Looking at the graphs below, we see that our measure of distance monotonically decreases and approaches a very specific value: the largest absolute coordinate in the vector, represented by the black dashed line.

Convergence of Lp norm to largest absolute coordinate.
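Even without the plot, the same convergence is easy to check numerically (this reuses the Lp_norm helper above; the vector is arbitrary):

v = [3.0, -7.0, 1.5, 4.0]
for p in [1, 2, 3, 4, 10, 100]:
    print(f"p = {p:>3}: {Lp_norm(v, p):.4f}")
print("largest |coordinate|:", max(abs(x) for x in v))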

In fact, it does not merely approach the largest absolute coordinate of our vector; in the limit it equals it: lim (p→∞) ∥v∥ₚ = max(|v₁|, …, |vₙ|).

The max-norm shows up any time you need a uniform guarantee or worst-case control. In less technical terms, if no individual coordinate may go beyond a certain threshold, then the L∞ norm is the one to use. If you want to set a hard cap on every coordinate of your vector, this is also your go-to norm.

This is not just a quirk of theory; it is quite useful and shows up in a plethora of different contexts:

  • Maximum absolute error — bound every prediction so none drifts too far.
  • Max-abs feature scaling — squashes each feature into [−1, 1] without distorting sparsity.
  • Max-norm weight constraints — keep all parameters inside an axis-aligned box.
  • Adversarial robustness — restrict each pixel perturbation to an ε-cube (an L∞​ ball).
  • Chebyshev distance in k-NN and grid searches — fastest way to measure “king’s-move” steps.
  • Robust regression / Chebyshev-center portfolio problems — linear programs that minimize the worst residual.
  • Fairness caps — limit the largest per-group violation, not just the average.
  • Bounding-box collision tests — wrap objects in axis-aligned boxes for quick overlap checks.

With our more abstract notion of distance, all sorts of interesting questions come to the fore. We can consider p values that are not integers, say p = π (as you will see in the graphs above). We can also consider p ∈ (0, 1), say p = 0.3; would that still satisfy the four rules we said every norm must obey?
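As a teaser, here is a quick numerical check of that last question using the Lp_norm function from earlier: with p = 0.3 the triangle inequality breaks, which is why exponents below 1 do not give a true norm.

x = [1.0, 0.0]
y = [0.0, 1.0]
p = 0.3

lhs = Lp_norm([a + b for a, b in zip(x, y)], p)   # "length" of x + y
rhs = Lp_norm(x, p) + Lp_norm(y, p)               # sum of the individual "lengths"
print(lhs, "<=", rhs, "?", lhs <= rhs)            # ~10.08 <= 2.0 ? False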

Conclusion

Abstracting the idea of distance can feel unwieldy, even needlessly theoretical, but distilling it to its core properties frees us to ask questions that would otherwise be impossible to frame. Doing so reveals new norms with concrete, real-world uses. It’s tempting to treat all distance measures as interchangeable, yet small algebraic differences give each norm distinct properties that shape the models built on them. From the bias-variance trade-off in regression to the choice between crisp or blurry images in GANs, it matters how you measure distance.


Let’s connect on LinkedIn!

Follow me on X (Twitter)

Code on GitHub
