
Neural Tangent Kernel: A Theoretical View of Deep Learning

Back in 2018, a paper by Arthur Jacot, Franck Gabriel, and Clément Hongler delivered a striking theoretical result to the deep learning community. Their work on Neural Tangent Kernels (NTKs) showed that neural networks, under certain conditions, behave exactly like classical kernel methods, a connection that was both surprising and profound. In this blog, we’ll explore the different aspects of this concept and some of its applications in modern ML.

This viewpoint connects modern deep nets to classical Gaussian processes (at initialization) and kernel ridge regression (during training), offering simple, linear training dynamics and convergence guarantees in a regime often called lazy learning.

My Thoughts on NTK
September 02, 2025

NTK theory made neural networks feel like classical approximation schemes, closer in spirit to expressing a function via a Taylor or Fourier series. In the infinite‑width limit the network behaves like a kernel machine with a fixed kernel that approximates the target, which explains why its complicated, nonconvex optimization reduces to linear, kernel-based dynamics. What fascinates me most is how this theory explains why wide networks work so well in practice, even though the infinite-width assumption seems unrealistic.

Why infinite width?

Analyzing highly nonconvex, finite networks is hard, but taking the width to infinity yields tractable limits where the function distribution is Gaussian at initialization and training dynamics become linear in function space. The limit does not claim real nets are infinite; it serves as an asymptotic lens that often predicts trends for wide—but finite—nets.

From networks to Gaussian processes

At random initialization with suitable scaling, an infinitely wide fully connected network defines a Gaussian process (NNGP), meaning any finite collection of outputs is jointly Gaussian with a kernel computable layer‑by‑layer. This result, developed and popularized by Lee et al. (2017/2018), provides closed‑form predictive means and uncertainties before any training takes place.
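
Concretely, those closed-form predictions are just standard GP regression with the NNGP kernel plugged in. Below is a minimal sketch, assuming some `kernel(A, B)` routine (a placeholder here) that returns the NNGP kernel matrix between two batches of inputs:

```python
import numpy as np

def nngp_posterior(kernel, x_train, y_train, x_test, noise=1e-3):
    # `kernel(A, B)` is a placeholder for an NNGP kernel routine returning the
    # matrix of kernel values between the rows of A and the rows of B.
    k_tt = kernel(x_train, x_train)                   # (n, n) train block
    k_st = kernel(x_test, x_train)                    # (m, n) test/train block
    k_ss = kernel(x_test, x_test)                     # (m, m) test block
    reg = k_tt + noise * np.eye(k_tt.shape[0])        # observation noise / jitter
    mean = k_st @ np.linalg.solve(reg, y_train)       # predictive mean
    cov = k_ss - k_st @ np.linalg.solve(reg, k_st.T)  # predictive covariance
    return mean, cov
```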


Enter the NTK

Jacot, Gabriel, and Hongler (2018) defined the Neural Tangent Kernel for a network \(f(x;\theta)\) as the input‑pair kernel formed by parameter‑gradient inner products, \(K_{\text{NTK}}(x,x')=\langle \nabla_\theta f(x;\theta), \nabla_\theta f(x';\theta)\rangle\). Under gradient flow and squared loss, the function evolves as a linear ODE in function space, \(\frac{df}{dt} = -K_{\text{NTK}}(f-y)\), so training reduces to kernel gradient descent with the NTK.
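
To make the definition concrete, here is a minimal sketch that computes the empirical (finite-width) NTK of a tiny two-layer tanh network by stacking the parameter gradients for each input; the architecture, width, and scaling are illustrative choices rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
width, d = 512, 3                                  # hidden width, input dimension
W1 = rng.normal(size=(width, d)) / np.sqrt(d)      # 1/sqrt(fan-in) initialization
w2 = rng.normal(size=width) / np.sqrt(width)

def grad_f(x):
    # Gradient of f(x) = w2 . tanh(W1 x) with respect to all parameters, flattened.
    h = np.tanh(W1 @ x)
    g_W1 = np.outer(w2 * (1.0 - h**2), x)           # df/dW1
    return np.concatenate([g_W1.ravel(), h])        # h is df/dw2

def empirical_ntk(X):
    # K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
    G = np.stack([grad_f(x) for x in X])
    return G @ G.T

X = rng.normal(size=(8, d))
K = empirical_ntk(X)                               # (8, 8) NTK Gram matrix
```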

The constant‑kernel miracle

In the infinite‑width limit, the NTK converges at initialization to a deterministic kernel and remains constant throughout training, turning the nonlinear network training into linear dynamics entirely governed by that fixed kernel. Consequently, fitting a wide net by gradient descent matches kernel ridge regression with the NTK, giving a closed‑form predictor \(f(x)=K(x,X)(K(X,X)+\lambda I)^{-1}y\) in the regularized case.
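
In code, that equivalence is ordinary kernel ridge regression with the NTK as the kernel. A minimal sketch, assuming an `ntk(A, B)` routine (a placeholder; the empirical Gram computation above is easily extended to two batches, or an exact infinite-width kernel can be used):

```python
import numpy as np

def ntk_ridge_predict(ntk, x_train, y_train, x_test, lam=1e-3):
    # `ntk(A, B)` is any routine returning the NTK matrix between the rows of A
    # and the rows of B (empirical, as above, or an exact infinite-width kernel).
    K_tt = ntk(x_train, x_train)
    K_st = ntk(x_test, x_train)
    alpha = np.linalg.solve(K_tt + lam * np.eye(K_tt.shape[0]), y_train)
    return K_st @ alpha              # f(x) = K(x, X) (K(X, X) + lam I)^{-1} y
```

Letting \(\lambda \to 0\) gives the minimum-norm interpolant that reappears in the double-descent discussion below.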

Convergence and spectral bias

Because the limiting NTK is positive‑semidefinite (and typically positive‑definite on finite datasets), training under gradient flow converges exponentially with rates set by the kernel eigenvalues, and large‑eigenvalue directions are learned first. This yields a natural explanation for early stopping as regularization and a principled spectral view of which target components are learned quickly or slowly.
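
Here is a small sketch of that spectral picture, using the fact that the linearized dynamics \(\frac{df}{dt}=-K(f-y)\) diagonalize in the eigenbasis of the NTK Gram matrix \(K\) on the training set:

```python
import numpy as np

def residual_under_gradient_flow(K, y, f0, t):
    # Solve df/dt = -K (f - y) in closed form by diagonalizing K = U diag(lam) U^T:
    # the residual component along eigenvector u_i decays as exp(-lam_i * t),
    # so large-eigenvalue directions are fit first.
    lam, U = np.linalg.eigh(K)
    r0 = U.T @ (f0 - y)              # initial residual in the NTK eigenbasis
    return U @ (np.exp(-lam * t) * r0)
```

Stopping at time \(t\) means the component along eigenvector \(u_i\) has only reached a fraction \(1-e^{-\lambda_i t}\) of its final value, which is exactly the early-stopping-as-regularization effect described above.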


Computing NTKs in practice

For fully connected nets, both NNGP and NTK kernels admit simple recursions over layers that propagate covariance and derivative terms. Efficient software (e.g., Neural Tangents) and theory for finite‑width corrections make these kernels practical for experimentation and scalable approximations.
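
For reference, here is a minimal sketch of that recursion for a fully connected ReLU network, in the form popularized by Arora et al. (2019); note that normalization conventions vary slightly across papers:

```python
import numpy as np

def relu_ntk(X, depth):
    # NNGP (sigma) and NTK (theta) Gram matrices for a fully connected ReLU net,
    # following the layerwise recursion popularized by Arora et al. (2019).
    sigma = X @ X.T / X.shape[1]     # layer-0 covariance (one normalization convention)
    theta = sigma.copy()
    for _ in range(depth):
        d = np.sqrt(np.diag(sigma))
        cos = np.clip(sigma / np.outer(d, d), -1.0, 1.0)
        ang = np.arccos(cos)
        # Gaussian expectations of ReLU and its derivative, with the c = 2
        # normalization that keeps the variances of order one across layers.
        sigma_next = np.outer(d, d) * (np.sin(ang) + (np.pi - ang) * cos) / np.pi
        sigma_dot = (np.pi - ang) / np.pi
        theta = sigma_next + sigma_dot * theta
        sigma = sigma_next
    return sigma, theta
```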

Convolutional and other architectures

Convolutional NTKs (CNTKs) extend the theory to CNNs and can be computed exactly and efficiently, yielding strong small‑data performance benchmarks without feature learning. Recurrent NTKs (RNTKs) bring the kernel view to sequence models, preserving weight sharing through time and enabling analysis of long‑range dependencies under proper initialization.


Lazy learning vs. feature learning

In the NTK (infinite‑width) regime, features are effectively frozen at initialization and the network trains like a linear model on those fixed features, a phenomenon termed lazy training (Chizat et al., 2018). Empirically, many high‑performing finite networks benefit from feature learning beyond the lazy regime, explaining observed gaps between finite nets and their NTK counterparts.
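
A compact way to state what “lazy” means: the trained network stays close to its first-order Taylor expansion in the parameters around initialization. A minimal JAX sketch, where the two-layer model is just an illustrative stand-in:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    # A tiny two-layer network; stands in for any differentiable model.
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

def f_linearized(params, params0, x):
    # First-order Taylor expansion of f around params0. In the lazy regime,
    # training the full network stays close to training this model, which is
    # linear in the parameters with features grad_theta f(x; theta0) frozen
    # at initialization.
    deltas = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    f0, df = jax.jvp(lambda p: f(p, x), (params0,), (deltas,))
    return f0 + df
```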

Where width helps—and hurts

As width grows, training stability improves and dynamics align with the NTK, but performance can plateau if the model fails to learn new features beyond its initial representation. Balanced designs aim to be wide enough for stable optimization yet not so wide as to suppress beneficial feature evolution.

The Lazy vs Feature Learning Dilemma
November 09, 2025

This tension between lazy learning and feature learning is at the heart of modern deep learning architecture design. The NTK tells us that infinite-width networks get stuck in lazy mode, but finite networks can escape this limitation and learn richer representations.

I’ve spent countless hours debugging training instabilities in narrow networks, only to realize that making them wider often magically fixes the problem. NTK theory explains why - wider networks stay closer to the stable, predictable dynamics of the infinite-width limit.

The key insight? We want networks that are wide enough to be stable, but not so wide that they can’t adapt their features to the data. It’s a delicate balance that modern architecture design (like the GPT series) seems to have figured out empirically.


Generalization through the kernel lens

With a fixed NTK, generalization reduces to classical kernel learning, where the target is decomposed along kernel eigenfunctions, and learning rates follow eigenvalues \(\{\lambda_i\}\). This lens explains fast learning of dominant modes, slow learning of fine details, and how early stopping controls high‑frequency fitting that can overfit noise.


Double descent and interpolation

As capacity grows into the interpolating regime, test error can follow a “double descent” curve, and the NTK solution corresponds to a minimum‑norm interpolant that can generalize surprisingly well. This reconciles classical bias–variance intuition with modern overparameterization by highlighting the role of inductive bias encoded in the NTK.


Depth, initialization, and stability

Depth interacts with width and initialization, and NTK constancy can break when depth and width scale together unless the network operates in specific phases (e.g., ordered vs. chaotic). Recent analyses quantify how initialization and depth‑to‑width ratios affect NTK dispersion, guiding stable regimes and revealing when NTK predictions hold.


Practical notes

  • Training dynamics: Select learning rates mindful of the NTK spectrum (see the sketch after this list), and use early stopping to regularize high‑frequency components that learn slowly.
  • Architecture: Convolutions, skip connections, and recurrences imprint distinct inductive biases into the NTK, which can be analyzed to match data symmetries.
  • Computation: Exact kernels cost \(O(n^2)\) memory and \(O(n^3)\) time in sample size \(n\), so use random features, sketches, or near input‑sparsity‑time methods for scale.
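
As a rough guide for the first bullet above: in the linearized regime the residual update under full-batch gradient descent is \(r \leftarrow (I-\eta K)r\), so the step size must stay below \(2/\lambda_{\max}(K)\). A minimal sketch, assuming the loss \(\tfrac{1}{2}\lVert f-y\rVert^2\) (an extra \(1/n\) averaging factor would rescale the bound):

```python
import numpy as np

def max_stable_lr(K):
    # Linearized-regime stability: the residual evolves as r <- (I - lr * K) r,
    # which contracts only when |1 - lr * lam_i| < 1 for every NTK eigenvalue,
    # i.e. lr < 2 / lambda_max(K).
    lam_max = np.linalg.eigvalsh(K).max()
    return 2.0 / lam_max

# Usage with some NTK Gram matrix K on the training set:
#   lr = 0.5 * max_stable_lr(K)   # back off from the stability boundary
```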

Practical Takeaways for ML Engineers
October 16, 2025

As someone who’s implemented NTK-based methods in production, the computational complexity is the biggest practical hurdle. The \(O(n^3)\) scaling makes exact NTK computation impractical for datasets larger than a few thousand samples.

The key breakthrough has been approximate methods - random features, structured sketches, and finite-width corrections. The Neural Tangents library has been a game-changer here, providing efficient implementations that actually scale.

My advice? Start with NTK analysis on small datasets to understand your architecture’s inductive bias, then use the insights to guide hyperparameter selection. The theory gives you a north star for understanding why certain architectures work better than others.
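
For what it’s worth, here is roughly what that small-dataset workflow looks like with the Neural Tangents stax API; the widths and data shapes are placeholders, so check the library’s documentation for current signatures:

```python
import jax
import jax.numpy as jnp
from neural_tangents import stax

# Infinite-width NNGP/NTK kernels for a small fully connected architecture.
# The hidden widths below only matter for the finite-width apply_fn; the
# kernels themselves correspond to the infinite-width limit.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

key_train, key_test = jax.random.split(jax.random.PRNGKey(0))
x_train = jax.random.normal(key_train, (100, 16))   # stand-in data
x_test = jax.random.normal(key_test, (10, 16))

k_train_train = kernel_fn(x_train, x_train, 'ntk')  # exact NTK Gram matrix
k_test_train = kernel_fn(x_test, x_train, 'ntk')    # cross-kernel for prediction
# These matrices can be dropped into the ridge-regression sketch from earlier.
```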


Key references

  • Jacot et al. (2018): Introduced NTK, proved constancy at infinite width, and derived linear training dynamics and spectral convergence insights.
  • Lee et al. (2018): Established deep NNGPs, giving GP priors for infinitely wide nets at initialization.
  • Chizat & Bach (2018): Clarified lazy training, showing linearized behavior and its limits for practical performance.
  • Arora et al. (2019): Exact and efficient computation for CNTK and strong small‑data benchmarks.
  • Belkin et al. (2019): Formalized double descent, relevant to interpolating NTK solutions and modern overparameterization.

Takeaway

NTKs expose a linear, kernel‑based core to the training of infinitely wide networks, yielding simple dynamics, closed‑form predictors, and convergence guarantees in the lazy regime. The most powerful practical systems, however, appear to benefit from controlled departures into finite‑width feature learning, where architecture and optimization coax richer representations beyond the fixed NTK’s inductive bias.


References

This theoretical foundation was established by Arthur Jacot, Franck Gabriel, and Clément Hongler in their groundbreaking 2018 paper on Neural Tangent Kernel: Convergence and Generalization in Neural Networks.

The connection to Gaussian processes was explored by Jaehoon Lee and colleagues in their 2018 work on Deep Neural Networks as Gaussian Processes.

The analysis of lazy training dynamics was developed by Lénaïc Chizat, Edouard Oyallon, and Francis Bach in their 2018 paper On Lazy Training in Differentiable Programming.

Convolutional neural tangent kernels and their exact computation were introduced by Sanjeev Arora and colleagues in their 2019 work On Exact Computation with an Infinitely Wide Neural Net.

The double descent phenomenon was formalized by Mikhail Belkin and colleagues in their 2019 paper on Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off.

This post is licensed under CC BY 4.0 by the author.