La connoissance de certains principes supplée facilement à la connoissance de certains faits. ("The knowledge of certain principles easily compensates for the lack of knowledge of certain facts.") (Claude Adrien Helvétius)
During my undergraduate studies, which I did in Electrical Engineering at the Technion in Israel, I was always appalled that such an important concept as convolution [1] just landed out of nowhere. This seemingly arbitrary definition disturbed the otherwise beautiful picture of the signal processing world like a grain of sand in one’s eye. How nice would it be to have the convolution emerge from first principles rather than have it postulated! As I will show in this post, such a first principle is the notion of translational invariance, or symmetry.
Let me start with the formula taught in basic signal processing courses defining the discrete convolution [2] of two n-dimensional vectors x and w:
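The formula itself appears as an image in the original post; one standard way to write it (a reconstruction, using the indexing convention described in the next paragraph) is:

```latex
(x \ast w)_i \;=\; \sum_{j=0}^{n-1} x_j \, w_{(i-j) \bmod n}, \qquad i = 0, 1, \dots, n-1.
```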
Here, for convenience, I assume that all the indices run from zero to n−1 and are modulo n; it is convenient to think of vectors as defined on a circle. Writing the above formula as a matrix-vector multiplication leads to a very special matrix that is called circulant:
A circulant matrix has a multi-diagonal structure, with the elements on each diagonal having the same value. It can be formed by stacking together shifted (modulo n) versions of a vector w [3]; for this reason, I use the notation C(w) referring to the circulant matrix formed by the vector w. Since any convolution x∗w can be equivalently represented as the matrix-vector product C(w)x, I will use the two terms interchangeably.
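As a minimal numerical sketch of this equivalence (the `circulant` and `circular_conv` helpers below are hypothetical names, not part of the original post), assuming NumPy:

```python
import numpy as np

def circulant(w):
    """Form C(w): column k is w cyclically shifted (rolled) down by k positions."""
    n = len(w)
    return np.column_stack([np.roll(w, k) for k in range(n)])

def circular_conv(x, w):
    """Direct evaluation of (x * w)_i = sum_j x_j * w_{(i - j) mod n}."""
    n = len(x)
    return np.array([sum(x[j] * w[(i - j) % n] for j in range(n)) for i in range(n)])

rng = np.random.default_rng(0)
x, w = rng.standard_normal(8), rng.standard_normal(8)

# Convolution and multiplication by the circulant matrix give the same result.
assert np.allclose(circulant(w) @ x, circular_conv(x, w))
```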
One of the first things we are taught in linear algebra is that matrix multiplication is non-commutative, i.e., in general, AB≠BA. However, circulant matrices are a very special exception:
Circulant matrices commute,
or in other words, C(w)C(u)=C(u)C(w). This is true for any circulant matrices, i.e., for any choice of u and w. Equivalently, we can say that the convolution is a commutative operation, x∗w=w∗x.
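A quick numerical check of both statements, continuing the sketch above:

```python
# Circulant matrices commute: C(w) C(u) = C(u) C(w),
# which is the matrix form of the commutativity of convolution, x * w = w * x.
u = rng.standard_normal(8)
assert np.allclose(circulant(w) @ circulant(u), circulant(u) @ circulant(w))
assert np.allclose(circular_conv(x, w), circular_conv(w, x))
```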
A particular choice of w=[0,1,0…,0] yields a special circulant matrix that shifts vectors to the right by one position. This matrix is called the (right) shift operator [4] and denoted by S. The transpose of the right shift operator is the left shift operator. Obviously, shifting left and then right (or vice versa) does not do anything, which means S is an orthogonal matrix:
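The equation is shown as an image in the original; written out here for n = 4 (with w = [0, 1, 0, 0] as above), the shift operator and its orthogonality read:

```latex
S = C([0,1,0,0]) =
\begin{pmatrix}
0 & 0 & 0 & 1\\
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0
\end{pmatrix},
\qquad
S^\top S \;=\; S\, S^\top \;=\; I .
```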
Circulant matrices can be characterised by their commutativity property. It turns out that commutativity with the shift operator alone is sufficient (Lemma 3.1 in [5]):
A matrix is circulant if and only if it commutes with shift.
The first direction of this “if and only if” statement leads to a very important property called translation or shift equivariance [6]: the convolution’s commutativity with shift implies that it does not matter whether we first shift a vector and then convolve it, or first convolve and then shift — the result will be the same.
The second direction allows us to define convolution as the shift-equivariant linear operation: in order to commute with shift, a matrix must have the circulant structure. This is exactly what we aspired to from the beginning, to have the convolution emerge from the first principles of translational symmetry [7]. Instead of being given a formula of the convolution and proving its shift equivariance property, as it is typically done in signal processing books, we can start from the requirement of shift equivariance and arrive at the formula of the convolution as the only possible linear operation satisfying it.
Illustration of shift equivariance as the interchangeability of shift and blur operations.
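Numerically, shift equivariance can be checked with the same hypothetical helpers as before:

```python
# Shift equivariance: convolving a shifted vector equals shifting the convolved vector,
# i.e. C(w) S x = S C(w) x.
S = circulant(np.eye(8)[1])   # w = [0, 1, 0, ..., 0] yields the right shift operator
assert np.allclose(circulant(w) @ (S @ x), S @ (circulant(w) @ x))
```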
Another important fact taught in signal processing courses is the connection between the convolution and the Fourier transform [8]. Here as well, the Fourier transform lands out of the blue, and then one is shown that it diagonalises the convolution operation, allowing one to perform the convolution of two vectors in the frequency domain as the element-wise product of their Fourier transforms. Nobody ever explains where these sines and cosines come from and what is so special about them.
In order to get to the bottom of it, recall a fact from linear algebra:
Commuting matrices are jointly diagonalisable.
In other words, two matrices satisfying AB=BA will have the same eigenvectors (but possibly different eigenvalues) [9]. Since all circulant matrices commute, we can pick one of them and compute its eigenvectors — the above theorem assures that these will be the eigenvectors of all circulant matrices as well.
It is convenient to pick the shift operator S. Since S is an orthogonal matrix, we expect its eigenvectors to be orthogonal [10]. A simple calculation (see Section 4.1 in [5]) leads to the conclusion that
The shift operator is diagonalised by the Fourier transform.
I hope that at this point you have your second “aha” moment for today: this is where the sines and cosines come from! They are the eigenvectors of the shift operator; I denote them as columns of the matrix Φ. Note that the eigenvectors are complex, so we need to take complex conjugation when transposing Φ. Multiplication (from the left) by Φ* is called the Fourier transform, and by Φ the inverse Fourier transform.
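To make this concrete, here is a small check that the DFT matrix, a plausible explicit form of Φ (the 1/√n normalisation is my choice to make it unitary), diagonalises the shift operator, reusing the helpers above:

```python
# Columns of Phi are the discrete Fourier modes phi_k[l] = exp(2*pi*1j*k*l/n) / sqrt(n).
n = 8
l, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Phi = np.exp(2j * np.pi * k * l / n) / np.sqrt(n)

S = circulant(np.eye(n)[1])        # shift operator
Lam = Phi.conj().T @ S @ Phi       # change of basis: the result should be diagonal
assert np.allclose(Lam, np.diag(np.diag(Lam)))
```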
Since all circulant matrices are jointly diagonalisable, they are also diagonalised by the Fourier transform [11]. They differ only in their eigenvalues. The last missing bit is the realisation that
The eigenvalues of C(w) are the Fourier transform of w.
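Again as a small check with the same Phi and helpers (numpy's fft uses the e^(−2πikl/n) sign convention, which matches the eigenvalues obtained here):

```python
# The diagonal of Phi* C(w) Phi holds the eigenvalues of C(w);
# they coincide with the discrete Fourier transform of w.
eigvals = np.diag(Phi.conj().T @ circulant(w) @ Phi)
assert np.allclose(eigvals, np.fft.fft(w))
```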
We can now put all the pieces of the puzzle into a statement known as the Convolution Theorem: the convolution x∗w can be computed either as the circulant matrix C(w) applied to x in the original system of coordinates (sometimes this is called the “spatial domain” convolution), or in the Fourier basis (“spectral domain”) by first computing the Fourier transform Φ*x, multiplying it element-wise by the Fourier transform of w [12], and then computing the inverse Fourier transform.
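The whole statement can be verified in a couple of lines (a sketch using numpy's FFT rather than the explicit Φ above):

```python
# Convolution theorem: spatial-domain result vs. spectral-domain result.
spatial  = circulant(w) @ x                              # "spatial domain"
spectral = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x))    # "spectral domain"
assert np.allclose(spatial, spectral)
```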
Because Φ has a special redundant structure, the products Φ*x and Φx can be computed with 𝒪(n log n) complexity using a [Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (FFT) algorithm.
Why is such a definition of convolution a big deal, and why should it be taught this way? I will repeat here the quote from Helvétius I opened this post with: “The knowledge of certain principles easily compensates for the lack of knowledge of certain facts”. In the case of convolution, its derivation from first principles allows an easy generalisation to other domains. In the next post, I will show how to define convolution on graphs, in order to produce a key building block of graph deep learning architectures.