## Maryam Mirzakhani and Dynamics on Moduli Space

1. A pioneer on many frontiers

Many remarkable breakthroughs were acknowledged at this year’s International Congress of Mathematicians, but perhaps the most significant one for the intellectual community was not mathematical at all: for the first time in history, the Congress bestowed the highest award in mathematics — the Fields medal — on a female mathematician, Maryam Mirzakhani.

This was, in fact, only one of several historic landmarks this year. Mirzakhani is also the first Iranian to receive the Fields medal, Artur Avila is the first Brazilian, and Manjul Bhargava is the first recipient of Indian descent. But the conspicuous dearth of women in mathematics is, I think, the most glaring example of intellectual oppression that exists in any academic community.

It’s obviously a personal tragedy that so many women are pushed out of mathematics, and it’s also clear that the intellectual community is losing out because of it. Throughout history, female mathematicians who persisted through society’s myriad hurdles have made great contributions to mathematics. Names like Sophie Germain, Emmy Noether, and Sofia Kovalevskaya adorn some of the most important theorems from a time when women were not even permitted to participate officially in higher mathematical education, nor to hold full research positions.

There’s still a gaping gender disparity in mathematics, but perhaps the awarding of the Fields medal to Mirzakhani signals the beginning of change. In the future, girls will be able to look up to her as a role model. Minorities will feel encouraged that the circles of influence in mathematics, which have been tightly locked by pride for so long, are finally open to them. Beginning researchers will study her work and be inspired by what can be accomplished in the face of adversity.

On that note, last weekend I set aside some time to try and learn about Mirzakhani’s mathematics. I’ve been aware of this area, complex dynamics, ever since I took my freshman math class from Mirzakhani’s thesis advisor, but every time I looked at it in the past it seemed impenetrable to me. This time I received another refreshing lesson in humility, but eventually I started to understand something of what was going on. I will try to explain here one of the threads of the story. The goal is just to put her work into context, in a more comprehensible way than reading the literature.

There are already several great resources documenting Mirzakhani’s life and work. In particular, I thought this was a fantastic article. Here is also an interesting interview from a few years ago. Curt McMullen has given a very nice overview of her mathematical contributions.

But before we begin the mathematics, I wanted to highlight some pieces from the articles that stood out to me. I really like how Mirzakhani describes, “doing mathematics for me is like being on a long hike with no trail and no end in sight!” I’ve been asked a few times this summer alone to explain what pure mathematics is about, and I still find it impossible to explain. In most areas of research (especially here in industry), you start with problems that have clear practical motivations. But mathematics is different; mathematics is a hike. The purpose is not to go from point A to point B; the purpose is to enjoy what’s in between, and to savor things along the way that you didn’t anticipate when you started out. As Mirzakhani says, “the beauty of mathematics only shows itself to more patient followers.”

Another chord that resonated with me was McMullen’s comment that what distinguishes Mirzakhani from other Olympiad champions is her “ability to generate her own vision.” It seems like so much of my mathematical experience has been about performing certain party tricks, like following a tortuous lecture, recalling an intricate proof from memory, or producing the solution to a clever puzzle. The importance of vision and imagination is underemphasized in education.

Finally, I want to highlight this quote:

To her dismay, Mirzakhani did poorly in her mathematics class that year. Her math teacher didn’t think she was particularly talented, which undermined her confidence. “At that age, it’s so important what others see in you,” Mirzakhani said. “I lost my interest in math.”

In my time at Harvard, I saw many talented students quit mathematics. I thought about it several times myself, and I still do. To be fair, there are many good reasons not to want to pursue mathematics in academia, but I don’t think discouragement should be one of them. I often hear people say that mathematics is “beautiful,” but I can testify that the beauty is hard to see when you’re demoralized.

At one such time, in my senior year, a senior faculty member sent me a line of encouragement. It was the only time such a thing happened, and it felt so meaningful. This brand of kindness costs almost nothing, but it can make such an impact on a young person’s life. It’s certainly easier to give than a Fields medal. What would the gender gap look like if we handed out encouragement more freely? I don’t know, but maybe we should find out.

2. What is moduli space?

Much of Mirzakhani’s breakthrough work concerns the geometry of “moduli space.” In her thesis she introduced an approach to counting simple closed geodesics on hyperbolic surfaces by relating them to volumes in moduli space, and more recently she has proved rigidity properties of orbits in moduli spaces. That’s cool if you’re used to thinking about such things; otherwise it’s more like cool story bro, what is this moduli space thingy anyway?

Very roughly speaking, a moduli space is a geometric object that parametrizes a certain class of things of interest. You can imagine a moduli space as a kind of map, in which each point of the map represents an entity. This might sound like a crazy thing if you haven’t seen it before, but it’s actually pretty familiar.

Suppose, as a crude example, that you were interested in modeling a pair of particles, perhaps gas particles, in space. Each particle has a position in ${\mathbb{R}^3}$, so the set of possible configurations for the pair of particles is parametrized by ${\mathbb{R}^6}$. Suppose you wanted to know more, like the velocities of each particle. Then there are three dimensions for each particle’s velocity, so the ensemble of the two particles and their velocities would be parametrized by ${\mathbb{R}^{12}}$. If you added an additional condition, say that the speed of one of the particles should be some fixed constant, then that constrains its velocity vector to lie on a sphere, and the resulting parameter space would be some sort of cylindrical object within ${\mathbb{R}^{12}}$. Every point of this cylinder thing would represent a different particle system.

The previous example wasn’t really a moduli space in any mathematical sense, but it illustrates the idea. We were interested in certain systems of gas particles, and we parametrized them by a geometric object, each of whose points represents a different possible system.

Figure 1. Surfaces of low genus.

The most famous moduli space is ${\mathcal{M}_g}$, the “moduli space of genus ${g}$ curves.” (Technically, this only applies if ${g \geq 2}$.) This moduli space was “discovered” and studied by Riemann, and — in classic mathematical fashion — it took another century for mathematicians to say what a moduli space actually is. To be fair, the technical setup needed to give a proper definition is rather sophisticated, but at a first encounter it’s enough to know that a moduli space is a space that naturally parametrizes things, and you understand the moduli space in terms of what it parametrizes.

In this case, ${\mathcal{M}_g}$ parametrizes “complex structures on a genus ${g}$ surface.” (A complex curve is a real surface, hence the confusing interchange of terminology.) Given a topological surface, there are many ways to turn it into a complex manifold, i.e. to give an atlas of charts between pieces of the surface and the complex plane, and these are what is parametrized by ${\mathcal{M}_g}$. A different way to think about the same thing is that there are many different complex curves having the same underlying topology, and ${\mathcal{M}_g}$ parametrizes them. The moduli spaces that we’ll be looking at later aren’t quite the same as ${\mathcal{M}_g}$, but they are very related.

Example. Why is moduli space useful? I can’t resist giving an example close to my own research interests. Mathematicians like to know how to solve equations, and a particularly interesting class of equations takes the form

$\displaystyle y^2 = x^3 + ax + b.$

The modern point of view is that you should think of the set of solutions as forming a geometric object, which in this case is called an elliptic curve. Then we can think interchangeably about solutions to the equation and points of the elliptic curve. It turns out that the set of complex-valued solutions to such an equation looks topologically like a genus 1 surface.

Number theorists are interested in knowing whether such an equation has a rational solution, i.e. a solution ${(x,y)}$ such that ${x}$ and ${y}$ are actually rational numbers. An even more refined question is whether or not it has a particular type of rational solution, say a “torsion point of order ${N}$.” (This just means the usual notion of torsion element in a group, but the technical definition is unimportant.)

Luckily, there is a moduli space parametrizing the set of elliptic curves together with a torsion point of order ${N}$! Crucially, this moduli space is itself a curve, which we call ${X_1(N)}$, and the parametrization is refined in the sense that if a point on ${X_1(N)}$ has rational coordinates, then the elliptic curve plus torsion point that it parametrizes is also defined over the rational numbers.

So the punchline is that if we want to know whether any elliptic curve (and there are infinitely many) has a rational torsion point of order ${N}$, then we just have to figure out whether the particular curve ${X_1(N)}$ has a rational point. Witness the power of moduli space: it has reduced an infinite calculation to a finite one!
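As a toy illustration (using exact rational arithmetic; the curve ${y^2 = x^3 + 1}$ and the point ${(2,3)}$ are hand-picked for the example, and this is of course not ${X_1(N)}$ itself), one can verify directly that ${(2,3)}$ is a rational torsion point of order 6:

```python
from fractions import Fraction

# Affine point arithmetic on y^2 = x^3 + a*x + b over the rationals.
# None represents the point at infinity (the group identity).
a, b = Fraction(0), Fraction(1)          # the curve y^2 = x^3 + 1

def add(P, Q):
    """Standard chord-and-tangent addition of rational points."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y1 == -y2:           # P + (-P) = point at infinity
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) / (2 * y1)   # tangent slope
    else:
        lam = (y2 - y1) / (x2 - x1)          # chord slope
    x3 = lam * lam - x1 - x2
    y3 = lam * (x1 - x3) - y1
    return (x3, y3)

P = (Fraction(2), Fraction(3))           # a rational point on the curve
assert P[1] ** 2 == P[0] ** 3 + a * P[0] + b

# Compute multiples of P until we hit the identity: P has order 6.
Q, order = P, 1
while Q is not None:
    Q = add(Q, P)
    order += 1
assert order == 6
```

Exact `Fraction` arithmetic is used so that equality tests are honest, with no floating-point fuzz.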

To be fair, this gain comes at a price. The price is that even though ${X_1(N)}$ is a curve and can in principle be described by one equation, it can be very hard to say what this equation is. So you can wind up in the bizarre situation of having to show that there is no solution to an equation, when you don’t even know the equation. As silly as it sounds, this is actually possible (in favorable circumstances). The advantage of the geometric approach is that it can furnish enough tools to reason indirectly about such questions.

It’s perhaps worth remarking that moduli spaces can quickly become very complicated, even if the objects that they describe are quite simple. We saw hints of this already when discussing the parametrization of gas particles. There are moduli spaces called Hilbert schemes, which parametrize certain types of subvarieties of projective space. In principle the Hilbert schemes can be described in terms of equations, but even Hilbert schemes parametrizing fairly simple configurations can be so complicated that writing down their equations would require more characters than there are particles in the known universe. So understanding the geometry of moduli space is very difficult in general, and one has to settle for reasoning indirectly and achieving only partial understanding.

3. Billiards in Polygons

3.1. Billiards

Billiards are a source of many interesting problems in dynamics. The setup is very simple: you have some billiard table, which is an enclosed plane region, and a billiard ball inside it. You shoot off the billiard ball in some direction and it travels at constant speed, reflecting (elastically) off the walls subject to the usual rule: angle of incidence equals angle of reflection. Figure 2 depicts some billiards paths in a rectangle.

Figure 2. Billiards on a rectangular table.
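To make the reflection rule concrete, here is a minimal event-driven simulation of a billiard in the unit square (the starting point, direction, and tolerances are arbitrary illustrative choices). A slope-1 trajectory returns to its initial position and velocity after time 2, so it traces a closed path:

```python
import math

def billiard(pos, vel, t_total):
    """Trace a billiard in the unit square [0,1]^2: advance in straight
    segments, reflecting elastically whenever a wall is reached."""
    x, y = pos
    vx, vy = vel
    t = 0.0
    while t < t_total - 1e-12:
        # time until each coordinate next hits a wall
        tx = ((1 - x) / vx) if vx > 0 else (x / -vx) if vx < 0 else math.inf
        ty = ((1 - y) / vy) if vy > 0 else (y / -vy) if vy < 0 else math.inf
        dt = min(tx, ty, t_total - t)
        x, y, t = x + vx * dt, y + vy * dt, t + dt
        if dt == tx:
            vx = -vx          # angle of incidence = angle of reflection
        if dt == ty:
            vy = -vy
    return (x, y), (vx, vy)

# A slope-1 trajectory closes up: the motion in each coordinate has
# period 2, so after time 2 the ball is back where it started.
p, v = billiard((0.3, 0.5), (1.0, 1.0), 2.0)
assert abs(p[0] - 0.3) < 1e-9 and abs(p[1] - 0.5) < 1e-9
assert v == (1.0, 1.0)
```

The simulation is event-driven (it jumps straight to the next wall collision) rather than time-stepped, which keeps the reflections exact up to floating-point roundoff.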

So why is this interesting at all? It turns out that billiards can arise in modeling physical systems. Consider a simple example in which you have two particles confined to the unit interval ${[0,1]}$, moving around with velocities ${(u_1, u_2)}$. Assume that they collide with each other or the walls elastically, i.e. preserving momentum and energy. So in a collision, if their pre-collision velocities are ${(u_1^{\text{old}},u_2^{\text{old}})}$ and their post-collision velocities are ${(u_1^{\text{new}}, u_2^{\text{new}})}$, then conservation of momentum says that

$\displaystyle m_1 u_1^{\text{old}} + m_2 u_2^{\text{old}} = m_1 u_1^{\text{new}} + m_2 u_2^{\text{new}}$

and conservation of energy says that

$\displaystyle \frac{1}{2} m_1 (u_1^{\text{old}})^2 + \frac{1}{2} m_2 (u_2^{\text{old}} )^2 = \frac{1}{2} m_1 (u_1^{\text{new}} )^2 + \frac{1}{2} m_2 (u_2^{\text{new}} )^2.$

If we rescale our coordinates appropriately, the second condition just reads that the initial and final velocity vectors have the same magnitude, and the first condition says that the inner product of the velocity vector with ${(\sqrt{m_1}, \sqrt{m_2})}$ is constant. But the inner product measures the angle between the two vectors, so if you unwind what this all means then you will see that the collision rules for velocity are precisely the same as the reflection rules for a billiard wall in direction ${(\sqrt{m_1}, \sqrt{m_2})}$! Similar considerations apply for collisions with the walls of the interval at ${0}$ and ${1}$.
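This claim is easy to sanity-check numerically (the masses and velocities below are arbitrary): apply the standard 1D elastic-collision formulas, pass to the rescaled velocities ${v_i = \sqrt{m_i}\, u_i}$, and verify that the speed ${|v|}$ and the component of ${v}$ along ${(\sqrt{m_1}, \sqrt{m_2})}$ are both unchanged, which is exactly the billiard reflection rule:

```python
import math

m1, m2 = 2.0, 3.0                      # arbitrary masses (illustrative)
u1, u2 = 1.5, -0.5                     # velocities just before a collision

# Standard 1D elastic-collision formulas (conserve momentum and energy).
u1_new = ((m1 - m2) * u1 + 2 * m2 * u2) / (m1 + m2)
u2_new = ((m2 - m1) * u2 + 2 * m1 * u1) / (m1 + m2)

# In the rescaled coordinates v_i = sqrt(m_i) * u_i, the collision
# looks like a billiard reflection: the speed |v| is unchanged (energy),
# and the component of v along (sqrt(m1), sqrt(m2)) is unchanged
# (that component is precisely the total momentum).
v_old = (math.sqrt(m1) * u1, math.sqrt(m2) * u2)
v_new = (math.sqrt(m1) * u1_new, math.sqrt(m2) * u2_new)

speed = lambda v: math.hypot(v[0], v[1])
along_wall = lambda v: v[0] * math.sqrt(m1) + v[1] * math.sqrt(m2)

assert abs(speed(v_old) - speed(v_new)) < 1e-12            # energy
assert abs(along_wall(v_old) - along_wall(v_new)) < 1e-12  # momentum
```

Norm preserved plus one component preserved forces the normal component to flip sign, which is the reflection law.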

The conclusion is that the parameter space for the velocities of the system looks like a triangle, and in this parameter space a time evolution of the system corresponds to a billiard path! In other words, billiards can be interpreted as describing how a physical system evolves in time. If you thought that was meta, then you better buckle up.

There are many questions we can ask about billiards, but here are two basic ones, which are nevertheless extremely interesting and difficult.

1. What can you say about closed paths, that is, billiard trajectories that come back to themselves? These would correspond to systems whose time evolution is periodic. If there are closed paths, one could go further and ask if there are infinitely many of them. If so, then how are they distributed, i.e. how does the quantity

$\displaystyle \# \{\text{closed paths of length (or period)} \leq L \}$

grow with ${L}$?

2. What can you say about dense orbits, that is, billiard trajectories that run around chaotically and come arbitrarily close to every possible point? (Or phrased in terms of the physical model, can the system evolve in such a way that it essentially attains every possible state?) When you have dense orbits, you can ask the more refined question: are they equidistributed in the sense that the amount of time that they spend in a certain region of the billiard table is proportional to its area? (Does the system spend a “fair” amount of time in each state?)

3.2. Deforming billiard tables

The questions just posed are impossibly difficult in general. We will restrict our attention to billiard tables that are polygons, but the questions are completely open even for the class of triangles. However, they are tractable in some special cases, and that allows us to say something in more generality. How? The very interesting observation here is that even if you are interested in a question about a specific setting, it may still be useful to consider the question for other settings.

You may have already used this trick without thinking about it. Perhaps at some point you forgot why a matrix satisfies its own characteristic polynomial. This is obvious for diagonal matrices, though, and the diagonalizable matrices are dense in the space of all matrices (at least over the complex numbers); since evaluating the characteristic polynomial of ${A}$ at ${A}$ itself depends continuously on ${A}$, the identity must hold for all matrices!
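For ${2 \times 2}$ matrices the Cayley–Hamilton identity reads ${A^2 - \text{tr}(A) A + \det(A) I = 0}$, and it can be checked in exact arithmetic (the matrix below is an arbitrary example):

```python
from fractions import Fraction as F

# Exact check of the Cayley-Hamilton theorem for a 2x2 matrix:
# the characteristic polynomial is  t^2 - tr(A)*t + det(A),
# so A should satisfy  A^2 - tr(A)*A + det(A)*I = 0.
A = [[F(2), F(5)],
     [F(-1), F(3)]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A2 = matmul(A, A)

I = [[F(1), F(0)], [F(0), F(1)]]
result = [[A2[i][j] - tr * A[i][j] + det * I[i][j] for j in range(2)]
          for i in range(2)]
assert result == [[0, 0], [0, 0]]   # A satisfies its characteristic polynomial
```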

The point is that if the question is hard for your given polygon, perhaps you can deform it to a “nicer” polygon for which it’s easier to answer. Of course, you have to take care to perform the deformation in a way that preserves all the necessary structure involved.

In our case, we’re interested in questions about billiard trajectories, or straight lines, in our polygon, so we had better deform through transformations that preserve lines, i.e. through the group of linear transformations, ${\text{GL}_2(\mathbb{R})}$. And if we want to further study questions of equidistribution, then we had better restrict ourselves to linear transformations that preserve area, i.e. ${\text{SL}_2(\mathbb{R})}$.

The question is now: if we start with a polygon and deform it through ${\text{SL}_2(\mathbb{R})}$, then what other kinds of polygons will we get? We’ll shortly see how to answer this question using moduli spaces.

4. Connections with moduli space

4.1. The Unfolding Trick

For special classes of polygons, there is a useful “unfolding trick” that relates the polygons with genus ${g}$ surfaces. Specifically, we’ll consider the rational polygons, whose angles are all rational multiples of ${\pi}$.

The idea, which you might have seen already, is that instead of reflecting the billiard trajectory, you can keep the billiard trajectory a straight line and instead reflect the table. (See Figure 3.)

Figure 3. Unfolding a rational polygon.

The rationality of the angles implies that the edges will eventually match up perfectly after only finitely many such reflections. If you then glue the corresponding edges, you will obtain a compact, genus ${g}$ surface (possibly with singular points, which we’ll ignore). The billiard path in the polygon is then equivalent to a path in the surface (see Figure 4). So we have translated questions about polygons into questions about surfaces, which opens us up to use the mathematical theory of surfaces. In order to see how this is useful, we should point out that the unfolding procedure gives more data than just a surface.

Figure 4. Unfolding and gluing to a surface.
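The simplest (one-dimensional) version of the unfolding trick can even be tested in code: a particle bouncing elastically between walls at ${0}$ and ${1}$ occupies exactly the same positions as straight-line motion “folded” back into the interval by repeated reflections. A sketch (the starting data, step count, and sample times are arbitrary):

```python
# The unfolding trick in its simplest (1-dimensional) form: a particle
# bouncing between the walls 0 and 1 traces the same positions as a
# particle moving in a straight line, folded back by reflections.

def folded(x0, v, t):
    """Straight-line position x0 + v*t, folded into [0,1]: reduce the
    coordinate mod 2, then reflect (s -> 2 - s) if it exceeds 1."""
    s = (x0 + v * t) % 2.0
    return s if s <= 1.0 else 2.0 - s

def simulated(x0, v, t, steps=100000):
    """Naive time-stepped simulation with elastic reflection at 0 and 1."""
    x, dt = x0, t / steps
    for _ in range(steps):
        x += v * dt
        if x > 1.0:
            x, v = 2.0 - x, -v     # reflect off the right wall
        elif x < 0.0:
            x, v = -x, -v          # reflect off the left wall
    return x

for t in [0.3, 1.7, 4.9]:
    assert abs(folded(0.2, 1.0, t) - simulated(0.2, 1.0, t)) < 1e-6
```

The “unfolded” formula replaces the bouncing motion by a straight line in the doubled interval, which is exactly what gluing reflected copies of the table does for a rational polygon.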

Whenever we have a plane polygon, we can regard it as a subset of ${\mathbb{C}}$, from which it acquires a natural complex structure. When we glue this polygon up to a surface, we get in this way a particular complex structure on the surface. So associated to each rational polygon we have not only a surface, but a surface plus a complex structure. As mentioned earlier, there is a moduli space ${\mathcal{M}_g}$ parameterizing complex structures on a surface, so the unfolding trick takes a rational polygon and produces a point of ${\mathcal{M}_g}$.

In fact, because of the particular way that the complex structure was constructed, we get, in addition to the complex structure, the data of a “flat structure on the surface.” You see, the plane is a flat object and so it makes sense to talk about the notion of the “family of lines of angle ${\theta}$” laminating the plane, and hence the polygon. On a curved surface this doesn’t necessarily make sense, but because we glued up our surface from a polygon, we can transfer this “family of lines of angle ${\theta}$” to the surface. If you wanted to be more fancy, you could describe this in terms of Thurston’s notion of measured foliations.

Here is an equivalent way of capturing the same data. On ${\mathbb{C}}$ there is the natural holomorphic ${1}$-form ${dz}$. This restricts to a holomorphic ${1}$-form on the polygon. Now it doesn’t necessarily descend to the surface when we glue up the polygon because we might have glued edges in an orientation-reversing way, but its square is a “quadratic differential” ${dz^2}$, which does always descend to the surface. (A quadratic differential is just a section of ${\text{Sym}^2 (T^* X)}$, the symmetric square of the holomorphic cotangent bundle on ${X}$.)

The reason that these are the same data is not obvious, but not too deep either: if you have a quadratic differential, you can essentially integrate it to get flat coordinates; conversely, if you have flat coordinates then you can form ${dz^2}$ in a canonical fashion.

So the point is that if we start with a rational polygon, then we obtain a surface enriched with a complex structure and a quadratic differential ${q}$. Now the key fact is that there is a moduli space of such objects! It is stratified by the vanishing datum of the quadratic differential. More precisely, if we let ${\alpha = (\alpha_1, \ldots, \alpha_k)}$ denote the unordered tuple of zero multiplicities of the quadratic differential, then there is a moduli space ${\mathcal{H}(\alpha)}$ parametrizing pairs ${(X, q)}$ where ${X}$ is a surface equipped with complex structure, and ${q}$ is a quadratic differential whose zeros have multiplicities ${\alpha_1, \ldots, \alpha_k}$. Each rational polygon corresponds to a point of this moduli space ${\mathcal{H}(\alpha)}$.

Now let’s return to the question that motivated all this discussion. If we start with a (rational) polygon and deform it through ${\text{SL}_2(\mathbb{R})}$, then what other kinds of polygons will we get? Well, the ${\text{SL}_2(\mathbb{R})}$ action extends to the moduli space ${\mathcal{H}(\alpha)}$ in a natural way: for a pair ${(X, q)}$, you pick a (not necessarily rational) plane polygon with parallel opposite edges representing this data, and act on it by ${\text{SL}_2(\mathbb{R})}$. So in these terms, our question becomes: what are the orbits of a point ${x \in \mathcal{H}(\alpha)}$ under this ${\text{SL}_2(\mathbb{R})}$ action? The orbits consist of billiard tables having the same billiard path behavior, since they are related by linear transformations.

The whole group ${\text{SL}_2(\mathbb{R})}$ is quite a handful, so let’s start by considering some simpler subgroups. There are a few distinguished one-parameter subgroups of ${\text{SL}_2(\mathbb{R})}$:

$\displaystyle g_t = \begin{pmatrix} e^t & 0 \\ 0 & e^{-t} \end{pmatrix} \quad \text{and} \quad h_t = \begin{pmatrix} 1 & t \\ 0 & 1 \end{pmatrix}.$

Here we think of ${t \in \mathbb{R}}$ as indexing time, and the matrices as a time-flow of linear transformations. The first is called the geodesic (or Teichmüller) flow, and the second is called the horocycle flow. We aim to study what we find when we start these flows at a point ${x \in \mathcal{H}(\alpha)}$.

Now you see we’ve done something very meta. We started out asking about the time evolution of a mechanical system, and saw how it could be interpreted as billiards in a polygon. Then, to study billiards in a polygon, we proposed to deform the polygon through a family of polygons, in such a way that the properties of the billiards are preserved. But each polygon represents a point in this moduli space ${\mathcal{H}(\alpha)}$, so we have translated a question about billiard flows in all polygons into a question about flows in the moduli space ${\mathcal{H}(\alpha)}$. It’s like moduli within moduli! (Cue Inception soundtrack.)

4.2. Geodesic flow

At last, we can start describing some of the actual work of Mirzakhani. We have seen how understanding the game of billiards ties into understanding the geometry of the moduli space $\mathcal{H}(\alpha)$, especially its ${\text{SL}_2(\mathbb{R})}$-orbits. In a recent series of papers Mirzakhani does just that.

Let’s focus on the geodesic flow first. This is so named because ${\mathcal{H}(\alpha)}$ comes equipped with a natural metric, called the Teichmüller metric, and the orbits of the flow trace out what are called geodesics for this metric. So we are interested in geodesics on ${\mathcal{H}(\alpha)}$.

The geometry of ${\mathcal{H}(\alpha)}$ is quite complicated, but there are some helpful analogies that we can use as crutches. For compact hyperbolic (negatively curved) manifolds, these questions have been studied and the answers are known (after much work).

Theorem 1 (Margulis) If ${X}$ is a compact hyperbolic manifold, then

$\displaystyle \# \{ \text{closed geodesics on } X \text{ of length } \leq R \} \sim \frac{e^{hR}}{hR},$

where ${h}$ is a quantity called the entropy of the geodesic flow.

${\mathcal{H}(\alpha)}$ is trickier because it is not strictly negatively curved – its curvature is ${\leq 0}$ everywhere but not strictly negative – and it is not compact. However, it does have finite volume. So it is in some sense analogous to a compact hyperbolic manifold, and we might conjecture that the results are similar. In a paper with Eskin and Rafi, Mirzakhani confirmed this conjecture. Most of the results are rather technical even to state, but here is a special case of the simplest one.

Theorem 2 (Eskin, Mirzakhani, Rafi) If ${\mathcal{H}(\alpha)}$ is connected, then

$\displaystyle \# \{ \text{closed geodesics on } \mathcal{H}(\alpha) \text{ of length } \leq R \} \sim \frac{e^{hR}}{hR},$

where ${h = \frac{1}{2} (1+ \dim_{\mathbb{R}}\mathcal{H}(\alpha))}$.

So how do you prove this? The (very, very) rough idea of the proof goes back to work of Eskin, Margulis, Masur, etc. As we mentioned, things are known for compact hyperbolic manifolds, but in the non-compact case the problem is that the geodesics might sort of escape off to infinity. It turns out that there is a kind of “force” pushing the geodesics away from infinity, but you have to quantify this somehow.

Here is the spirit of it. A non-compact hyperbolic space will have a “thin part” stretching off to infinity, which is what makes it non-compact, so imagine dividing your space into a “compact part” and a “thin part”. You want to construct a function ${G}$ on your space with the special property that at any point ${x}$ in the thin part of the space, its average value at nearby points satisfies a kind of “exponentially subharmonic” inequality of the form

$\displaystyle \frac{1}{\text{vol}(B(x))} \int_{B(x)} G(y) \, dy \leq e^{-\tau} G(x).$

The idea is that if the right inequality is satisfied, then ${G}$ quantifies a kind of force that pushes the path away from the thin part, which implies that most geodesics spend their time in the compact part, and then you can mimic the argument for compact hyperbolic spaces. Of course, the art is all in a careful choice of the function ${G}$.

4.3. Horocycle flow

The geodesic flow can be nasty. You can obtain geodesics that have crazy Cantor-like behavior, which are dense but not equidistributed, and which “diverge” off to infinity. But it turns out that when you throw in the unipotent action from the horocycle flow as well, many of the pathologies disappear.

A very rough heuristic explanation is that the chaotic behavior stems from exponential instability, which is hinted at by the form of the geodesic flow. Now, you might complain that we artificially made the geodesic flow exponential. But the point is that the geodesic flow is obtained by exponentiating the Lie algebra element ${ t \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}}$ while the horocycle flow is obtained by exponentiating the element ${ t \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}}$. That is the nice thing about nilpotent matrices: even when you exponentiate them, their size grows only polynomially!
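This contrast between exponential and polynomial growth is easy to see concretely by computing both matrix exponentials from the power series (a sketch; the truncation length and the value of ${t}$ are arbitrary):

```python
import math

def mat_exp(M, terms=60):
    """Matrix exponential of a 2x2 matrix via the (truncated) power series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for n in range(1, terms):
        term = [[sum(term[i][k] * M[k][j] for k in range(2)) / n
                 for j in range(2)] for i in range(2)]
        result = [[result[i][j] + term[i][j] for j in range(2)]
                  for i in range(2)]
    return result

t = 5.0
g = mat_exp([[t, 0.0], [0.0, -t]])    # geodesic flow: exp of diag(t, -t)
h = mat_exp([[0.0, t], [0.0, 0.0]])   # horocycle flow: exp of a nilpotent

# exp(t * diag(1, -1)) = diag(e^t, e^{-t}): entries grow exponentially in t.
assert abs(g[0][0] - math.exp(t)) < 1e-6

# For a nilpotent N with N^2 = 0 the series terminates: exp(tN) = I + tN,
# so the entries grow only polynomially (here: linearly) in t.
assert abs(h[0][0] - 1.0) < 1e-9 and abs(h[0][1] - t) < 1e-9
```

For the nilpotent matrix the series stops after two terms, which is exactly why the horocycle flow is so much tamer.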

Here the analogy is that moduli space behaves somewhat like a homogeneous space, which is a space possessing a transitive group action. What that means is that the space has so many symmetries that every point looks locally the same as every other point. For instance, ${\mathbb{R}}$ is homogeneous because you can translate any point to any other point, and the sphere ${S^2}$ is homogeneous because you can rotate any point to any other point. A cube has some symmetries, but not all points look alike; for instance, a corner point looks different from a point in the middle of a face.

There are celebrated results of Marina Ratner studying the closures of unipotently generated orbits on a particularly prominent class of homogeneous spaces, namely those arising as a quotient of a Lie group by a lattice. The first is a rigidity result, asserting that the closure is itself a homogeneous space.

Theorem 3 (Ratner’s orbit closure theorem) Let ${G}$ be a Lie group, let ${\Gamma \leq G}$ be a lattice, let ${X = G/ \Gamma}$ be the corresponding homogeneous space, and let ${U \subset G}$ be a connected subgroup generated by unipotent elements. Then for any ${x \in X}$, the orbit closure ${\overline{Ux}}$ is a homogeneous space of finite volume, i.e. ${\overline{Ux} = Hx}$ for some closed subgroup ${H \leq G}$.

The second of Ratner’s theorems is an equidistribution result, which we won’t state here. Although these theorems are deep and difficult to prove, the fact that they apply to Lie groups means that they are in some sense “just linear algebra.” The amazing thing is that Mirzakhani and Eskin were able to establish analogous results for ${\mathcal{H}(\alpha)}$, suggesting that the moduli space is somehow analogous to homogeneous spaces, even though they “look” nothing alike! Moduli space is in fact totally inhomogeneous, meaning that no two points look locally alike. But even though it is, in this sense, the complete opposite of a homogeneous space, it turns out to have a similar rigidity property.

Theorem 4 (Eskin, Mirzakhani) If ${x \in \mathcal{H}(\alpha)}$, then the orbit closure ${\overline{\text{SL}_2(\mathbb{R})\, x}}$ is a submanifold of ${\mathcal{H}(\alpha)}$.

The full theorem is slightly more precise; see the paper for details. This suggests that the ${\text{SL}_2(\mathbb{R})}$ orbits, which we can think of as families of polygons having similar billiards properties, tile the moduli space ${\mathcal{H}(\alpha)}$ very nicely.

## PDE and a glimpse of Martin Hairer’s work

The newest quartet of Fields medalists were just announced last week at the ICM 2014 in Seoul. The Fields medal is widely considered to be the highest prize in mathematics, and the community always seems to be very excited to hear about its new recipients. Although I’m wary of paying too much attention to these sorts of prizes in general, one thing I like very much about the process is that it inspires experts to take cutting-edge mathematical research and try to explain it to non-experts. I think that other graduate students will concur that this is an extremely difficult thing to do, so I’ve been impressed by the quality of some expositions I’ve seen.

While I was browsing Gowers’s blog post on the laudatio for Martin Hairer, one of this year’s Fields medalists, I encountered some elements that I found exciting simply because they connected with ideas that I had actually encountered, and rather enjoyed, in a recent course on PDE. So I thought that I would try to explain these ideas; whether or not they really are related to Hairer’s work in a more than superficial way is something I don’t know, since I don’t understand his work, but at least they will clarify some of the jokes (that’s the most important part, right?).

Since I am extremely far from being an analyst myself, I am sure that everything I have to say will seem very basic to people who work in this area, but it may be useful for people like myself who come from a more algebraic background. As usual, I’ll try to keep things non-technical – almost all of the technical details that I know are written up in these lecture notes (though they are currently optimized for faithfulness to the lectures and not readability, which are quite different in this case).

1. Partial differential equations

Hairer’s work is in a subject called “stochastic partial differential equations.” That’s a rather intimidating extension of the already difficult subject of (deterministic) partial differential equations, so let’s take baby steps first.

A partial differential equation (PDE) is just a functional equation involving partial derivatives: for a given ${F}$, it asks for functions ${u(x_0, \ldots, x_n)}$ satisfying the relation

$\displaystyle F\left(x_0, x_1, \ldots, x_n, \frac{\partial u}{ \partial x_0}, \ldots, \frac{\partial u}{\partial x_n}, \ldots, \frac{ \partial^2 u}{\partial x_i \partial x_j}, \ldots \right) = 0.$

The main source for PDEs is from physics, where they come up in modeling all sorts of phenomena. There ${u}$ describes the state of the physical system, and ${F}$ is some kind of local constraint.

Let’s take a look at some of the main examples.

1. The Poisson equation describes the electric potential in the presence of a charge density ${\rho}$:

$\displaystyle \Delta u := \sum_{j=1}^{\ell} \frac{\partial^2 u }{\partial x_j^2} = \rho.$

2. The wave equation models the displacement of a wave in space:

$\displaystyle - \frac{\partial^2 u}{\partial t^2} + \sum_{j=1}^{\ell} \frac{\partial^2 u }{\partial x_j^2} = 0.$

3. The heat equation models the diffusion of heat in space:

$\displaystyle - \frac{\partial u}{\partial t} + \sum_{j=1}^{\ell} \frac{\partial^2 u }{\partial x_j^2} = 0.$

4. The Schrödinger equation governs the time evolution of the quantum wavefunction of a system:

$\displaystyle i\frac{\partial u}{\partial t} + \sum_{j=1}^{\ell} \frac{\partial^2 u }{\partial x_j^2} = 0.$
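These model equations are also easy to experiment with numerically. As a toy illustration (my own sketch, not taken from any laudatio; the grid sizes are arbitrary choices), here is an explicit finite-difference solver for the one-dimensional heat equation, stepped with forward Euler under the standard stability restriction ${dt \leq dx^2/2}$:

```python
import numpy as np

# Toy explicit finite-difference solver for the 1D heat equation
#   du/dt = d^2u/dx^2  on [0, 1], with u = 0 at both endpoints.
# Grid sizes are illustrative only.
nx, dx = 101, 0.01
dt = 0.4 * dx**2          # satisfies the stability bound dt <= dx^2 / 2
x = np.linspace(0.0, 1.0, nx)
u = np.sin(np.pi * x)     # initial temperature profile

for _ in range(1000):
    lap = (u[:-2] - 2 * u[1:-1] + u[2:]) / dx**2   # discrete Laplacian
    u[1:-1] += dt * lap                             # forward Euler step
    u[0] = u[-1] = 0.0                              # Dirichlet boundary values

# For this initial profile the exact solution is sin(pi x) e^{-pi^2 t},
# so we can compare against it directly.
t = 1000 * dt
exact = np.sin(np.pi * x) * np.exp(-np.pi**2 * t)
print(np.max(np.abs(u - exact)))
```

The printed error should be tiny (well under ${10^{-2}}$), reflecting the second-order accuracy of the centered difference in space.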

An early important idea in the theory of partial differential equations, which was first understood and clarified by Hadamard, is that of well-posedness. You see, when we ask for solutions to a PDE we implicitly mean solutions in some specific space, which could be for instance the space of differentiable functions, smooth functions, or even analytic functions.

Ideally we want to work in a space where the solution to the PDE exists and is unique. (There is one more condition to well-posedness, which is that the solution should somehow depend “continuously on the initial data” in an appropriate sense.) This obsession with existence and uniqueness has been the subject of several jokes; my favorite is this one:

A mathematician and an engineer are staying in a hotel. The engineer wakes up to find his room on fire. He spots a fire extinguisher, uses it to put out the fire, and goes back to bed.

The mathematician wakes up to find his room on fire. He spots a fire extinguisher, exclaims “a solution exists!” and goes back to bed.

All joking aside, though, existence and uniqueness are quite important if we think of our solution as modeling a physical situation. That is because real physical systems tend to evolve in a specific, determined way from their initial conditions, and so any “real” solution should somehow be unique.

This creates some interesting room for interpretation as far as “solving” the PDE goes. The most natural interpretation is to seek a solution that possesses at least as much regularity as demanded by the equation, and then to impose additional regularity until one obtains uniqueness. For instance, if the equation involves at most second derivatives, as all our examples above do, then the most natural problem would be to ask for a twice-differentiable function that satisfies the equation. This is the notion of “classical solution.” However, this already turns out to be impossible for many interesting cases.

Therefore, one has to expand the solution concept beyond the familiar realm of functions. The broadest sense of solution is that of the “weak solutions,” which solve the equation as “distributions.” Such solutions need not be differentiable in the usual sense, but they possess derivatives as distributions, which are not functions but retain some of the key properties of functions.

It may be useful to motivate the notion of distribution by borrowing some ideas from physics (I should credit this explanation to Strichartz). The classical notion of a function is something that takes in an input and outputs some value. For instance, we could let ${\varphi(x)}$ be a function associating some physical quantity like temperature or energy to a particular point ${x}$ in space. We might think that we can measure ${\varphi(x)}$ by holding up a thermometer to position ${x}$, but this is really just an idealization. In physical reality, any thermometer we use will have a definite size and hence imprecision, so what we are really measuring is some sort of average temperature in a localized region. That is, instead of measuring ${\varphi(x)}$ we are measuring something like

$\displaystyle \int \varphi(x) \rho(x) \, dx$

where ${\rho(x)}$ can be interpreted as some averaging function.

The idea of a “generalized function,” or distribution, is to entirely drop the notion that a “function” has to have a definite value at each point. A distribution provides a means of “measuring” a function, i.e. to a function it associates a value, but it does not need to have any intrinsic values itself.

Formally, now: we define the space of test functions ${\mathcal{D}({\mathbb R}^n)}$ to be the vector space of smooth, compactly supported functions on ${{\mathbb R}^n}$. The space of distributions is then the continuous dual space ${\mathcal{D}'({\mathbb R}^n)}$. The test functions themselves embed into the distributions: any test function ${\varphi}$ realizes a distribution, which measures another test function ${\phi}$ as

$\displaystyle \int_{{\mathbb R}^n} \phi(x) \varphi(x) \, dx.$

Finite-dimensional vector spaces are isomorphic to their duals, but for infinite-dimensional spaces like this one the dual space is genuinely bigger. The theory of distributions allows us to make sense of objects like the “Dirac ${\delta}$ function,” which is not really a function at all: it is the perfectly legitimate distribution ${\varphi \mapsto \varphi(0)}$, but it is not represented by any actual function.

Now we can say what it means for a function to have a “distributional derivative”: it should be a distribution that measures test functions in the same way that a classical derivative would. In other words, for a test function ${\varphi}$, if ${f}$ were differentiable then by integration by parts we would have

$\displaystyle \int \frac{\partial f}{\partial x} \cdot \varphi \, dx = - \int f \cdot \frac{\partial \varphi}{\partial x} \, dx.$

Even if ${\frac{\partial f}{\partial x}}$ does not exist classically, it may be that there is a distribution ${A}$ such that ${A(\varphi) = - \int f(x) \frac{\partial \varphi}{\partial x}}$, so that it measures like ${\frac{\partial f}{\partial x}}$. In this case we would say that ${A}$ is the distributional derivative of ${f}$. And if ${A}$ is actually represented by some function ${g}$, then we would say that ${g}$ is the weak derivative of ${f}$.
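A standard concrete example (my addition, not from the discussion above): the weak derivative of ${f(x) = |x|}$ is the sign function, even though ${|x|}$ is not classically differentiable at ${0}$. The integration-by-parts identity can be checked numerically against a smooth bump test function:

```python
import numpy as np

# Numerically verify that sign(x) is the weak derivative of |x|:
#   integral |x| * phi'(x) dx  ==  - integral sign(x) * phi(x) dx
# for a smooth test function supported in (-0.5, 1.1).  The off-center
# support is deliberate, so that neither integral vanishes by symmetry.
def phi(x):
    # smooth bump: exp(-1/(1-y^2)) in the rescaled variable y = (x-0.3)/0.8
    y = (x - 0.3) / 0.8
    inside = np.abs(y) < 1
    out = np.zeros_like(x)
    out[inside] = np.exp(-1.0 / (1.0 - y[inside] ** 2))
    return out

x = np.linspace(-1.5, 1.5, 300001)
dx = x[1] - x[0]
phi_prime = np.gradient(phi(x), dx)   # numerical derivative of the test function

lhs = dx * np.sum(np.abs(x) * phi_prime)
rhs = -dx * np.sum(np.sign(x) * phi(x))
print(lhs, rhs)   # the two integrals agree up to discretization error
```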

Now you can see how you might cook up a function ${u}$ that is not even itself differentiable (classically), but deserves to be called a solution to some partial differential equation: the derivative exists as a distribution, and so then the equation asks for an equality of distributions. This just means that the two distributions measure up against all test functions in the same way, but there is no point-wise equality in the usual sense. However, this introduces a wrinkle into some partial differential equations, since distributions can be differentiated but not generally multiplied (or operated on in other ways that functions can be). So one winds up with some PDE that simply “do not make sense.”

At last, we can explain one of the great jokes of the ICM laudatios, concerning the progression of PDE theory.

1. Newton introduced the differential calculus, and initiated the study of differential equations. For mathematicians, the task became: you are given a differential equation, perhaps modeling some physical system, and you want to find its solution.
2. Poincaré appreciated that in many cases of interest, one cannot write down a formula for the solution to a differential equation, and so one has to settle for a less explicit, often qualitative, description. The task instead becomes: you are given a differential equation, and you want to say something about its solution. For example, one might try to prove energy estimates or ${L^1}$ bounds on the solution, or investigate the behavior as time goes to infinity.
3. In the 1960s, Smale and Thom realized that often the physicists did not themselves know the correct equations that they wished to solve (at least from a mathematical perspective). For instance, the equations might not really make sense, for the reasons mentioned above. Therefore, the task then becomes: you are not given a differential equation, and you want to say something about its solution.

Now, Hairer’s work concerns an exotic type of PDE called stochastic partial differential equations. Here the equations are not deterministic, but involve random variables, which introduces all kinds of complications. The particular example that Hairer is famous for solving is called the Kardar-Parisi-Zhang (KPZ) equation:

$\displaystyle \frac{\partial h }{\partial t} = \frac{\partial^2 h}{\partial x^2} + \left( \frac{\partial h}{\partial x} \right)^2 - C + \xi.$

This equation describes the time evolution of a boundary between two substances. Hairer gives the example of the shape of a piece of paper as it burns: it starts out straight, but becomes irregular and bumpy in a way that is sort of random, but also sort of structured.

The ${\xi}$ is a “white noise” term, and the solution ${h}$ is expected to look like Brownian motion, so the derivative doesn’t make sense classically, and has to be interpreted as a distribution. Unfortunately, one cannot multiply distributions in general, so the term ${ \left( \frac{\partial h}{\partial x} \right)^2 }$ doesn’t really make sense.

It is a great credit to physicists that they can manipulate completely nonsensical mathematical objects, like this KPZ equation, and somehow still obtain meaningful answers and predictions. Nonetheless, a mathematician would like to perform mathematics properly, and the great achievement of Hairer was to formulate a theory of “regularity structures” that does make sense out of the KPZ equation and also other kinds of stochastic PDE previously thought to be similarly nonsensical.

2. Elliptic regularity

Perhaps the most famous phenomenon in the basic theory of PDE is elliptic regularity. It describes the “miracle” by which solutions to certain classes of PDE, which are called elliptic, automatically possess more regularity than prescribed by the equation itself.

If you have studied complex analysis, then you have already witnessed this “miracle”: a function that is once complex-differentiable, or “holomorphic” is automatically infinitely differentiable, and even analytic. This basic fact lends complex analysis its characteristic rigidity.

The underlying reason is that a holomorphic function can be thought of as a solution to a system of PDE, the Cauchy-Riemann equations, which are elliptic. So the collapsing of all the notions of differentiability, smoothness, and analyticity in complex function theory is a special case of elliptic regularity.

The Poisson equation $\displaystyle \Delta u = \rho$ is a prototypical example of an elliptic equation. The special case where ${\rho=0}$, which is called Laplace’s equation, defines the famous class of harmonic functions, which have great significance in analysis. As you may know, harmonic functions satisfy a sequence of properties that are familiar from complex analysis: the mean value property, the maximum principle, etc. These may be generalized, to some extent, to all solutions of elliptic equations.

So what is an elliptic PDE anyway? There is a partial classification of PDE into four broad categories: elliptic, hyperbolic, parabolic, and dispersive, which are modeled on the four main equations from physics above. This classification is neither formal nor comprehensive, but we can formalize it in a few cases of interest. A “second-order semi-linear” PDE takes the form

$\displaystyle \sum_{i,j} a_{ij}(x) \frac{\partial^2 u }{\partial x_i \partial x_j} + \sum_i b_i(x) \frac{\partial u}{\partial x_i} + c(x) = 0.$

(The “second-order” refers to the fact that the PDE involves derivatives only up to second order, and the “semi-linear” refers to the fact that the coefficients ${a_{ij}}$ of the highest-order derivatives do not depend on the solution ${u}$.)

By the equality of mixed partial derivatives for ${C^2}$ functions, there is no loss of generality in assuming that ${a_{ij}(x) = a_{ji}(x)}$ for all ${i,j}$. Now we perform a common mathematical maneuver, which is to throw away the lower-order terms and look only at the expression

$\displaystyle \sum_{i,j} a_{ij}(x) \frac{\partial }{\partial x_i }\frac{\partial }{\partial x_j} .$

This is called the principal symbol of the equation. The idea is that the qualitative nature of the PDE is controlled by the highest-order terms. This is analogous to how the qualitative behavior of a polynomial is controlled by its highest-order terms, i.e. its degree. Indeed, you may be familiar with the fact that the Fourier transform turns derivatives into multiplication, so that in the Fourier world this corresponds precisely to taking highest-order terms in a traditional sense. This reflects the fact that in PDE the behavior of the solution is often tied to the level of oscillation, which is precisely what is measured by the derivatives.

So anyway, we can take this expression and interpret it as quadratic form on the cotangent bundle. In this case, that’s just a fancy way of saying to look at the symmetric matrix ${[a_{ij}(x)]}$. This is classified up to conjugacy by its signature, i.e. its number of positive and negative eigenvalues. We say that the PDE is elliptic if it has signature ${(n,0)}$, hyperbolic if it has signature ${(n-1, 1)}$, and parabolic if it has signature ${(n-1,0)}$. This comes from the classification of plane conics: we have ellipses (${x^2+y^2 \rightarrow}$ signature ${(2,0)}$), hyperbolas (${x^2-y^2 \rightarrow}$ signature ${(1,1)}$), and parabolas (${y-x^2 \rightarrow}$ signature ${(1,0)}$). In other words, an elliptic equation has a positive-definite matrix ${[a_{ij}(x)]}$ at each point ${x}$.
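Since the classification is just a statement about eigenvalues, it is mechanical to check. The sketch below (my own toy code) classifies a constant-coefficient second-order operator from the signature of its symmetrized coefficient matrix, and recovers the expected labels for the Laplace, wave, and heat operators:

```python
import numpy as np

def classify(a):
    """Classify a constant-coefficient second-order operator by the
    signature of its symmetrized coefficient matrix a = [a_ij]."""
    a = np.asarray(a, dtype=float)
    a = (a + a.T) / 2                      # mixed partials commute, so symmetrize
    eig = np.linalg.eigvalsh(a)
    pos = np.sum(eig > 1e-12)
    neg = np.sum(eig < -1e-12)
    n = a.shape[0]
    if pos == n or neg == n:
        return "elliptic"                  # definite: signature (n, 0)
    if (pos, neg) in [(n - 1, 1), (1, n - 1)]:
        return "hyperbolic"                # signature (n-1, 1)
    if (pos == n - 1 and neg == 0) or (neg == n - 1 and pos == 0):
        return "parabolic"                 # degenerate: signature (n-1, 0)
    return "other"

# Principal symbols of the model equations, in coordinates (t, x):
laplace = np.eye(2)              # u_xx + u_yy            -> elliptic
wave = np.diag([-1.0, 1.0])      # -u_tt + u_xx           -> hyperbolic
heat = np.diag([0.0, 1.0])       # -u_t + u_xx (2nd-order part degenerate in t)
print(classify(laplace), classify(wave), classify(heat))
```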

Now here is one possible formulation of elliptic regularity (some technical conditions omitted).

Theorem 1 (Elliptic Regularity) Let ${u \colon {\mathbb R}^n \rightarrow {\mathbb R}}$ be a ${C^2}$ function satisfying an elliptic PDE

$\displaystyle \sum_{i,j} a_{ij}(x) \frac{\partial^2 u }{\partial x_i \partial x_j} + \sum_i b_i(x) \frac{\partial u}{\partial x_i} + c(x) = 0.$

where ${a_{ij}, b_i, c}$ are smooth. Then also ${u \in C^{\infty}({\mathbb R}^n)}$.

So why is this true? There is a heuristic that “singularities flow along the characteristics” of the PDE. The characteristics are described by the zeros of the principal symbol, so elliptic equations have no characteristics, hence one expects no propagation of singularities.

Another heuristic to mention, which I already alluded to above, is that lack of regularity is linked to oscillation. Indeed, recall that the regularity of a function is related to the rate of decay of its Fourier transform, which describes its oscillation. To make this slightly more precise, consider the differential operator described by the principal symbol:

$\displaystyle \mathcal{P} = \sum_{|\alpha| = k} a_{\alpha} \partial_x^{\alpha}$

where we assume for simplicity that the ${a_{\alpha}}$ are constant. Then ${\sigma_{\mathcal{P}} (\xi) = \sum_{|\alpha| = k} a_{\alpha} \xi^{\alpha}}$. If this has a nontrivial zero ${\xi_0}$, let’s consider how it operates on the plane wave ${\psi = e^{i \lambda x \cdot \xi_0}}$ as ${\lambda}$ gets very large. You can just compute that

$\displaystyle \mathcal{P}(e^{i \lambda x \cdot \xi_0})= \sigma_{\mathcal{P}}(\lambda \xi_0) e^{i \lambda x \cdot \xi_0}= 0.$

Therefore, we can get as large oscillations as we want by letting ${\lambda \rightarrow \infty}$. So the absence of characteristics prohibits this sort of violent oscillation.

Now, how does one prove an elliptic-regularity type theorem? Generally speaking, there are three steps.

1. First one introduces function spaces that provide a means to measure regularity by integrals rather than by pointwise derivatives. These are the Sobolev spaces, which are Banach spaces (and Hilbert spaces when the integrability exponent is ${p = 2}$). Perhaps the simplest way to define them is as the subspace of ${L^p}$ consisting of functions whose weak (i.e. distributional) derivatives also lie in ${L^p}$.

The power of these Sobolev spaces is in providing a natural framework for understanding regularity. The Sobolev embedding theorems provide a link between the integral regularity measured by the Sobolev spaces and the classical sense of regularity. In addition, it is important to understand the geometric relationships between these spaces; for instance, certain embeddings between Sobolev spaces are compact (the Rellich-Kondrachov theorem). So this step is basically a lot of functional analysis.

2. Next, one establishes a priori estimates on the derivatives, assuming their existence. This is usually the trickiest step, requiring a careful analysis of the structure of the particular PDE in question. Consider the prototype, the Poisson equation:

$\displaystyle \Delta u = f.$

Squaring and integrating by parts, we find that

$\displaystyle \int f^2 = \int \sum_{i,j} \frac{\partial^2 u}{\partial x_i^2} \frac{\partial^2 u}{\partial x_j^2} = \int \sum_{i,j} \left( \frac{\partial^2 u}{ \partial x_i \partial x_j} \right)^2 .$

This shows that we can bound the norm of all the second-order derivatives of ${u}$ uniformly in terms of the data ${f}$. We only assume that ${u \in C^2}$, but just suppose, for the moment, that ${u}$ were three times differentiable. Then just by differentiating the original Poisson equation we would have the PDE

$\displaystyle \Delta \left( \frac{\partial u}{\partial x_i} \right) = \frac{\partial f}{\partial x_i}.$

(In general, one obtains a different PDE for the derivative, but crucially it is still elliptic). By the same argument as above, we would obtain control over the third derivatives of ${u}$ in terms of the first derivatives of ${f}$. Now, we don’t really have any such bound since we don’t know that the third derivative of ${u}$ exists. But we have shown that if it did exist, then we would have a certain amount of control over it. And in practice, this usually turns out to be enough: the saying is that “existence follows from the estimate.” Carrying that out rigorously is the content of the third step, which is usually the most technical but also relatively straightforward.

3. The final step typically consists of converting the a priori estimates into a rigorous proof. There are a couple of general approaches possible. The first is to “regularize” the functions and equation involved, approximating them by smooth substitutes so that all the estimates become valid. Then one wants to slowly remove the regularization and show that these nice properties are preserved in the limit. Here again, the uniformity estimates prove crucial. A second approach is to approximate the process of differentiation, instead of the functions. For instance, one can introduce a “discrete derivative” or “difference quotient” and use the a priori estimates to show that in the limit where the discrete derivative becomes the usual infinitesimal derivative, all the objects are sufficiently well-behaved to converge to sensible classical objects.

Carrying out this program is actually quite lengthy and technical, as you might imagine. For details, see the lecture notes.
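As a small sanity check of the integration-by-parts identity from step 2 (my own numerical experiment, with an arbitrarily chosen rapidly decaying function standing in for a compactly supported one), one can compare ${\int (\Delta u)^2}$ with ${\int \sum_{i,j} (\partial^2 u / \partial x_i \partial x_j)^2}$ for a Gaussian:

```python
import numpy as np

# Numerically check  integral (Laplacian u)^2 = integral sum_{i,j} (u_{x_i x_j})^2
# for the rapidly decaying function u(x, y) = exp(-x^2 - y^2) on a large box.
n = 801
x = np.linspace(-6, 6, n)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.exp(-X**2 - Y**2)

# second derivatives via repeated centered differences
uxx = np.gradient(np.gradient(u, dx, axis=0), dx, axis=0)
uyy = np.gradient(np.gradient(u, dx, axis=1), dx, axis=1)
uxy = np.gradient(np.gradient(u, dx, axis=0), dx, axis=1)

lhs = np.sum((uxx + uyy) ** 2) * dx**2              # integral of (Laplacian u)^2
rhs = np.sum(uxx**2 + 2 * uxy**2 + uyy**2) * dx**2  # integral of sum of squares
print(lhs, rhs)   # the two agree up to discretization error
```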

3. Transport equations and Entropy

Elliptic regularity may appear to be a wonderful thing. Our instinct is to think of regularity as being “nice,” so it may at first seem very fortunate that solutions to PDEs are naturally very regular. However, it turns out that this is actually somewhat limiting if we want our solutions to describe interesting physical phenomena, which often result from a lack of regularity.

For example, did you ever wonder why matter can take several different states: solid, liquid, gas? Statistical mechanics answers this question in a rather remarkable way. It assigns to a physical system a single function, called the partition function, which encodes all of its fundamental properties like energy, entropy, heat, etc. Interesting phenomena like phase transitions correspond to discontinuities in the partition function (or its derivatives).

Just as a side remark, it is an amazing testament to the power of concentration of measure that even though describing the motion of as few as three particles directly with Newton’s laws is extremely difficult, when you scale up to a physical system with billions of particles then you suddenly find such an elegant mathematical theory.

Similarly, we might try to write down PDEs that model “singular” events, like a traffic jam or sudden eruption. Then we don’t want the solutions to be regular – we expect to find interesting information in the lack of regularity.

One prominent example is Burgers’ equation, which comes from fluid dynamics in modeling traffic flow:

$\displaystyle \frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = 0.$

This has the interesting property that it does not admit global classical solutions: for typical initial data, one can prove that any classical solution blows up in finite time. Physically, this models the fact that transport equations can develop “shocks,” which are fundamentally singular events. So then we must look in the realm of weak solutions for functions that exhibit shockwave behavior, but it turns out that there are infinitely many different weak solutions (even for the same initial data). So in order to achieve well-posedness, we must work in some class of solutions interpolating between the classical and the weak. How can we find this?

This question is answered by a beautiful perturbative argument. The idea is to introduce a second order term, which is sometimes called a “viscosity term” because it corresponds physically to adding some viscosity to the system. Let us instead consider the family of PDE:

$\displaystyle \frac{\partial u_{\epsilon}}{\partial t} + u_{\epsilon} \frac{\partial u_{\epsilon}}{\partial x} - \epsilon \frac{\partial^2 u_{\epsilon}}{\partial x^2} = 0.$

You might at first think that this is a family of elliptic PDE, but it is technically parabolic for ${\epsilon > 0}$ because there is a time variable as well. Nonetheless, one can use arguments like the ones I explained for elliptic regularity to show that certain types of parabolic PDE also have nice regularity properties. (Apologies that this makes the whole previous section seem somewhat misleading, but what goes around comes around.) In particular, that means that there will exist a unique classical solution to the equations in this family, for ${\epsilon > 0}$. As ${\epsilon \rightarrow 0}$, the equation “converges” to the one we want to solve, so one might hope that these solutions ${u_{\epsilon}}$ also converge to a solution to Burgers’ equation. If that is the case, then it makes sense that this limiting solution is what we should consider the unique “physical” solution.
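The viscosity approximation is easy to play with numerically. The sketch below (my own, with illustrative parameter choices) integrates the viscous Burgers equation for a smallish ${\epsilon}$ and watches a smooth initial wave steepen into a sharp transition layer, the viscous stand-in for a shock:

```python
import numpy as np

# Viscous Burgers' equation  u_t + u u_x = eps * u_xx  on [0, 2*pi], periodic.
# For small eps the smooth initial wave steepens into a thin (but still smooth)
# transition layer approximating the inviscid shock.  Parameters are illustrative.
nx = 400
x = np.linspace(0, 2 * np.pi, nx, endpoint=False)
dx = x[1] - x[0]
eps = 0.05
dt = 0.2 * min(dx, dx**2 / eps)   # crude stability restriction for forward Euler
u = np.sin(x) + 1.5               # smooth initial data, maximum slope 1

t = 0.0
while t < 2.0:                    # wave breaking for the inviscid equation is at t = 1
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)        # centered first derivative
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2  # centered second derivative
    u = u + dt * (-u * ux + eps * uxx)                      # forward Euler step
    t += dt

# The maximum slope has grown far beyond that of the initial data.
print(np.max(np.abs(np.gradient(u, dx))))
```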

In practice, it tends to be extremely difficult to show that the solutions converge. In fact it’s basically impossible to do this at any level of generality, but for specific equations like Burgers’ one can make an argument. What is more tractable is to show that if the solutions converge, then the limit has nice properties. For instance, it is almost immediate that the limit will be a weak solution: if ${\varphi}$ is a test function, then using integration by parts we have

$\displaystyle 0 = \int \left( \frac{\partial u_{\epsilon}}{\partial t} + u_{\epsilon} \frac{\partial u_{\epsilon}}{\partial x} - \epsilon \frac{\partial^2 u_{\epsilon}}{\partial x^2} \right) \varphi = - \int \left( u_{\epsilon} \frac{\partial \varphi}{\partial t} + \frac{1}{2} u_{\epsilon}^2 \frac{\partial \varphi}{\partial x} + \epsilon u_{\epsilon} \frac{\partial^2 \varphi}{\partial x^2} \right).$

One can then take the limit as ${\epsilon \rightarrow 0}$ to conclude that

$\displaystyle \int u \frac{\partial \varphi}{\partial t} + \frac{1}{2} u^2 \frac{\partial \varphi}{\partial x} = 0$

which is the defining condition of a weak solution.

It takes quite a bit of work to digest this idea into a solution concept that is well-posed for the transport equation, but the answer turns out to be the notion of entropy solution. Roughly speaking, this says that all entropy functions of the solution should increase (or at least not decrease) in time. A classical solution preserves the entropy perfectly, whereas a weak solution can increase it or decrease it. The precise formulation is somewhat technical: it says that for all pairs of smooth functions ${(\Phi, \Psi)}$ such that ${\Phi}$ is convex and ${\Phi'(u) f'(u) = \Psi'(u)}$, where ${f}$ is the flux of the equation (for Burgers’ equation, ${f(u) = u^2/2}$), we have for all positive test functions ${\varphi}$ that

$\displaystyle \int \int \Phi(u) \frac{\partial \varphi}{\partial t} + \Psi(u) \frac{\partial \varphi}{\partial x} \, dt dx \geq 0.$
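To make this concrete in the case of Burgers’ equation, where the flux is ${f(u) = u^2/2}$: taking the convex entropy ${\Phi(u) = u^2}$, the compatibility condition determines the entropy flux by

$\displaystyle \Psi'(u) = \Phi'(u) f'(u) = (2u)(u) = 2u^2, \qquad \text{so} \qquad \Psi(u) = \frac{2}{3} u^3,$

and ${(\Phi, \Psi) = (u^2, \tfrac{2}{3} u^3)}$ is a valid entropy pair to test against.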

It is a celebrated theorem of Kruzhkov that entropy solutions exist and are unique. The uniqueness is fairly involved but not so difficult; a proof is outlined in my notes. The existence is somewhat deeper still, requiring semigroup methods. And then after this theoretical preparation, it is quite a bit more work to explain how to actually compute entropy solutions. This material is discussed in a book of Denis Serre, but I should warn that he seems to have inherited some of the terseness of his famous uncle.

4. Hairer and the KPZ equation

Hairer introduced a regularization procedure for the KPZ equation that sounds similar in spirit (to the basic extent that I understand it) to the “viscosity approximation” that I described for transport equations. In his case, the process can be viewed as assuming that the random noise in the equation occurs at a small but not infinitesimal scale, which leads to “smoothed out” equations that make sense. The monumental technical task that Hairer carried out was in showing that the solutions to these regularized equations converge properly, which I imagine was quite difficult, since it was already difficult for the simple example of the Burgers equation.

That is, the original equation was

$\displaystyle \frac{\partial h }{\partial t} = \frac{\partial^2 h}{\partial x^2} + \left( \frac{\partial h}{\partial x} \right)^2 - C + \xi$

and Hairer considers the family of equations

$\displaystyle \frac{\partial h_{\epsilon} }{\partial t} = \frac{\partial^2 h_{\epsilon}}{\partial x^2} + \left( \frac{\partial h_{\epsilon}}{\partial x} \right)^2 - C_{\epsilon} + \xi_{\epsilon}$

where ${\xi_{\epsilon}}$ is smooth and converges in an appropriate sense to ${\xi}$, and ${C_{\epsilon}}$ is a well-chosen renormalization constant. By “smoothing out the noise” in this way, he obtains a family of PDE admitting sensible solutions. Since the equations in this family “converge” to the KPZ equation, one hopes the same for the solutions. This is really the technically difficult step, but after a lot of work he is able to show that the family of solutions to these perturbed SPDE does indeed converge, to a function that is then interpreted as the solution to the original KPZ equation, and has the properties predicted by physicists.
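The phrase “smoothing out the noise at scale ${\epsilon}$” can be illustrated in a cartoon way (this is only a caricature of the idea, certainly not Hairer’s actual construction) by convolving a discretized white noise with a mollifier of width ${\epsilon}$:

```python
import numpy as np

# Cartoon of regularizing noise at scale eps: convolve a discretized white
# noise xi with a Gaussian mollifier of width eps.  Purely illustrative.
rng = np.random.default_rng(0)
n = 10000
dx = 1.0 / n
xi = rng.standard_normal(n) / np.sqrt(dx)   # discrete white noise on [0, 1]

def mollify(xi, eps):
    x = np.arange(-5 * eps, 5 * eps + dx, dx)
    kernel = np.exp(-x**2 / (2 * eps**2))
    kernel /= kernel.sum() * dx              # unit mass; concentrates as eps -> 0
    return np.convolve(xi, kernel, mode="same") * dx

xi_eps = mollify(xi, eps=0.01)
# The mollified noise is much tamer than the raw noise.
print(np.std(xi), np.std(xi_eps))
```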

I’ve seen mentioned a number of key insights of Hairer, but I don’t understand the technicalities well enough to really say what they mean. For instance, proving convergence involves a Taylor-like expansion of the solutions. One of the difficulties with proving convergence to a limit seems to be that if one picks a fixed “universal” basis, such as the basis of polynomials that is used for Taylor expansion, then the roughness of the solutions means that the approximation is not good enough. Instead, Hairer defined a custom basis for the equation that was designed to approximate the solutions well. I think that the objects he defines are really more subtle than a basis in the usual sense, but they have a similar kind of role. In any case, it’s nice to see that his ideas at least sound connected to things that I can actually appreciate.


## My #LiveTheWage Challenge

I read an article yesterday about an American politician trying to subsist on a minimum wage budget for a week. This instance is part of a wider challenge for Americans to experience what it’s like to live on our federal minimum wage, which is a measly $7.25 per hour. That totals to about $290 a week, which is supposed to allow only $77 for food and transportation. The idea is to teach the privileged the meaning of America’s gaping economic inequality.

Guilty as I am of being a privileged brat, I thought I should try it for myself. It didn’t seem so bad, since I start out with several unfair advantages: I ride a bike to work every day, and I also have a free bus pass, so that already eliminates all transportation costs. And I’m used to budgeting my food, just because I’ve always admired the stories of how my immigrant parents clawed their way from poverty to the land of opportunity.

My usual breakfast is a packet of oatmeal from Trader Joe’s, which at $2.95 for a box of ten figures to only about 30 cents per day. Add that to the frozen meal I bring for lunch, which usually costs between $2 and $3, and I have almost $7 to play with for dinner. I’ve been dining on a salad and soup, plus eggs for an economical and healthy protein source. The salad is actually the budget killer because I like to put nuts and fresh fruits in it, both of which are relatively expensive. But heck, it shouldn’t be too hard to deal with for one week, right?

I’ll spare you the sob story about tightened belts and empty stomachs and McDonalds dollar menus. The #livethewage hashtag is already saturated with those, anyway. No … my LiveTheWage challenge was torpedoed basically before it even began.

I woke up this morning to find my teeth clenched and my right eyelid clamped shut. My eye felt like it had a splinter in it. It took me fifteen minutes to coax it open, so that I could see the network of inflamed blood vessels inside.
On my 2.5 mile bike ride to work, I had to stop on the sidewalk twice while it flared up and cascaded tears. So it didn’t take me long to decide that I needed to see an eye doctor.

The problem was that I didn’t have health insurance. I’d always taken health care for granted, since I’d been covered by my parents’ company plans up until I went to England, which has free universal healthcare anyway. Then I was going to be covered by my grad school starting in the fall, and I figured I could survive for three months without issues. After all I’m just a kid, damnit; I eat sort of healthily and put in my regular aerobic workouts. Isn’t that supposed to be enough?

It took the eye doctor maybe fifteen minutes to decide that I was suffering a strong autoimmune reaction, probably as a result of severe allergies (which I had never experienced before) but also possibly from a bacterial infection. So probably nothing serious – a couple of eyedrops and I immediately started to feel better. To be safe, he prescribed antibiotics and a checkup the next day.

After those fifteen minutes I was slapped with a $180 bill. I winced, but it wasn’t entirely unexpected. When the receptionist heard that I’d be paying out of pocket, she invoked a discount down to $128.

Then I had to go pick up a prescription at the pharmacy, where I sheepishly told the pharmacy technician that I didn’t have health insurance. Back in the good old insured days this part was free, but now it totaled a rather hefty $208. When she saw the price tag, she exclaimed and ran back to her computer. After twiddling a few parameters, which I didn’t understand, she managed to bring it down to $140. And then she asked tenderly if I would like to go back to the doctor before buying to see if he could prescribe a cheaper medication instead. I actually hesitated and considered doing just that. The number was so astronomical relative to the terms that I had been used to thinking in.
After my careful comparisons of miso soup prices between two nearby grocery stores, it hit me: factor in tomorrow’s appointment, and six weeks’ worth of food budget could be gone just because I had looked at some damned flowers the wrong way.

Then I took a breath and put things into perspective. I valued being able to see a lot more than a few hundred dollars. And I was lucky enough to have a well-paying internship and supportive parents who could always bail me out – lucky enough that my intended week of frugality amounted to nothing more than a stunt.

When I started this challenge I expected to be just another one in the crowd, lamenting the horrors of cheap food. But what I’ve taken away from my failed little stunt is not to think of money just as a ticket to luxury, as the difference between Chicken McNuggets and sushi. To see minimum wage only in terms of the comfort level it provides is already a kind of blindness on the part of the privileged. It also has to do with the sacrifice of things that we don’t even think about on a weekly basis. Among the millions of people out there to whom $77 a week is more than just a one-time stunt, probably at least one is another kid seriously considering risking his sight for a few hundred dollars. Maybe for just a moment today, I saw from his eyes.

## Düsseldorf and Cologne

Düsseldorf

You might have noticed by now that I do a good chunk of my observing with my stomach. When I visited Germany last June, there was one thing that stood out most…

… the beer. A year ago I didn’t even drink beer, but now I sometimes find myself daydreaming about that cool, crisp German brew.

The most interesting thing was actually how the beer was served. It was like tap water in America: every time my glass dipped below half full, the waiter brought a refill like clockwork. I’m sure those who have witnessed my drinking feats will testify to the stoutness of my constitution, but even I have my limits, and I eventually had to beg him to stop.

As for the food itself, I quickly realized the formula. The typical meal was a hunk of meat, plus some bratwurst (sausages) and sides of mashed potatoes and sauerkraut. This picture came from dinner in a packed sports bar in which every table got the same meal.

I’ve been somewhat puzzled by how enthusiastic some of my Eastern European friends are about plain hunks of meat, but I now understand much better.

I found this tidbit amusing: at dinner another night, our German host ordered a baked potato filled with butter and sour cream, assuring us that it was a very special local delicacy. He was very surprised to hear that this dish existed in America.

After seeing London, Cambridge, Edinburgh, Amsterdam, etc., I had cultivated an impression that European cities were all charming and traditional. Düsseldorf, which is decidedly not a tourist city, disproved that notion. The steel buildings, concrete, and roaring cars all reminded me of those impersonal American cities that I knew so well. At the same time there was something refreshingly authentic about the experience, in that it was largely unadulterated by the enthusiastic tourist cameras that had followed me around everywhere on my previous journeys. I enjoyed the feeling that I was somehow genuinely tasting the daily culture, with its industrial bustle and businesslike demeanor.

One interesting feature was the Altstadt, or “Old Town.” This was a section of the old city center set aside for historical purposes. Düsseldorf proudly proclaims its Altstadt “the longest bar in the world” since it consists almost entirely of bars and pubs. I suppose that’s partly why the beer association is so strong in my mind. There were also some other quaint buildings, like a small chapel and a painting store.

Another feature was the Rheinturm – the “tower on the Rhine.” The name is sort of self-explanatory. I’ve noticed that whenever people have a tall tower, they also have a tendency to put some sort of revolving restaurant at the top. This one had that plus a nice cafe and bar, from which I snapped a few shots of the city.

The weather was dreary on the first day there (as you can see!), but when it cleared up a couple days later we embarked on a river cruise along the Rhine. It was a relaxing way to enjoy a lovely day and soak in some of Düsseldorf’s history.

Cologne

My second destination was Cologne, one of Germany’s largest and most culturally rich cities. It was just a short train ride from Düsseldorf.

I was quite surprised to find myself surrounded by luxurious leather seats when I first embarked on the train. Then again, I had heard that public transportation in Germany was far better than what I was used to in America, so I thought it might be the standard. But when a stewardess came down the aisle offering a basket of complimentary gummy bears, I knew it was too good to be true – I had accidentally settled down in the business class of the train. Still, I was very impressed that the Germans used gummy bears as snacks (on this and some other occasions) instead of peanuts or pretzels; it was the mark of a very advanced civilization.

I had heard that the cathedral of Cologne was a must-see, so when I got off at the train station I sought out an information services receptionist and asked how I could find it. In response the man just stared at me in shock, and I soon understood why. As soon as we stepped out of the station, it was very difficult to find an angle of view that didn’t include the towering edifice.

I couldn’t believe how enormous it was, eclipsing all of the more modern buildings surrounding it. And to think that it was built hundreds of years ago! It was simply mind-boggling.

Next stop: the chocolate museum of Cologne, home of the Lindt chocolate factory. I enjoyed learning about the history and culture of chocolate, although the quantity of free samples was slightly disappointing!

The museum began with exhibits concerning how chocolate is grown and transported. Then came some historical morsels: it was an especially important piece of Mesoamerican culture, which was compelling to me because of my interest in the Aztec and Mayan civilizations.

Finally we got to see modern aspects of chocolate processing and production. And there were free samples, but not enough!

The museum had a section where you could order chocolate with your choice from a wide selection of toppings/ingredients, as in an ice cream store. Some very interesting chocolate flavors resulted.

Finally, we stopped by the botanical gardens and zoo of Cologne.

The gardens were quite pretty. There wasn’t such a variety of plants, but the flowers were arranged in intricate patterns. I only really had time at the zoo to see the reptiles. There was probably nothing too idiosyncratic about the exhibit, but I still got a kick out of it.

## Iceland

Ah, Iceland. The week I spent in this wintry wonderland will always occupy a special place in my memories. The whole experience was almost surreal, with so much fantastical visual input that my brain had a hard time accepting it. You know the feeling when you’re dreaming and you can’t quite decide if something is real or imagined? That was Iceland for me. Everywhere I looked I saw this vast, completely still landscape – no animals, no wind – set against picture-perfect mountains, and I felt completely disoriented, as if I had somehow been trapped in a world inhabited by desktop screensavers.

There is almost too much to talk about. This was my first time hosteling, and every time I walked in the room, I found new roommates. I learned that among eight random people, there will always be one impressive snorer. On the last day I went to a bar and was educated in some local Icelandic culture. Apparently everybody takes antidepressants in the winter to cope with the two hours of sunlight, and parties for days straight in the perpetually luminescent summers.

But I don’t want this story to get too long, so here is my selection of six things I did in Iceland.

1. The Golden Circle

The Golden Circle tour is a showcase for Iceland’s natural attractions. The entire country seems to consist of rivers, mountains, and vast expanses of empty fields. And waterfalls. Outside of Reykjavik, the population of waterfalls seems to be higher than that of any animal species. We passed by the odd farmhouse here and there on our bus rides, but we probably saw more waterfalls than human establishments. They weren’t as huge as Niagara, but they possessed a less extravagant kind of beauty. Here is a waterfall at one of our early pit stops.

Our first major stop was in the valley Haukadalur, home to the geysers Geysir and Strokkur. Geysir was inactive, but Strokkur was geothermally active and “erupted” every few minutes. The eruption consisted of steam billowing out – it wasn’t terribly dramatic, but still quite a spectacle. I also enjoyed looking at the inactive geyser because it presented an intriguing symphony of colors: clay red, green, and deep blue.

Next stop: the national waterfall, Gullfoss. It was by far the biggest waterfall that I saw that week, and I particularly appreciated that we could inch up so close for a great view. (Not that there was anybody there to stop us; it was almost completely unadulterated by tourism institutions.)

Again I was entranced by the unexpected liveliness of the colors. Around the waterfall, the water had a beautiful jade hue. You can maybe see just a hint of it at the bottom right of this picture, but my plebeian camera did a very crude job of capturing the sights (throughout the entire trip, in fact).

The last stop was at the national park Þingvellir, which was a charming valley filled with shrubbery and little ponds, and ringed by rock walls. After starting out on the footpaths, my friend and I decided that it would be more fun and quicker to explore by scaling the mountains. As we looked down at our tour group from the overhanging cliffs at the top, we realized that we were right about the “more fun” and wrong about the “quicker.” So we ended up having to climb back down and catch up, but it was just fine.

2. Glacier Hiking

Perhaps my favorite activity of the week was hiking on the Icelandic glacier, Sólheimajökull. I’m not much of a hiker by any standards, but then this was no ordinary hike. Something about the glacier’s being unmoored tickles the imagination, as if it were a totally different world from the one solidly anchored to the earth that we know. Adding to the glacier’s aura of transience was the knowledge that, slowly but surely, it was dying out. At the edge we were able to see how it had shrunk a few meters in just the past couple years.

My mental association for a glacier is basically a floating ice cube, but you wouldn’t necessarily have known from a cursory look at this glacier that it was any different from the snowy mountains surrounding it. There were layers of black sand that looked like mountain rock, which made me skeptical that I was even standing on a glacier at all, but after peeling past those layers I started to find just ice and packed snow. It was an interesting role reversal: the mountains were rock with little flakes of ice; the glacier was ice with little flakes of rock.

We suited up with an unexpected amount of safety gear, including spikes, helmets, and ice picks. The deceptively innocent blanket of white snow concealed some hidden perils, like this deep pothole.

We also found a tiny glacier stream. I amused myself, and the other hikers, by pretending to be a polar bear and drinking from the stream.

Iceland’s weather is fickle. Coming from someone who has now spent a significant amount of time in England and Seattle, that’s saying something. It was pouring while we were getting into our boots and other snow gear. It was dry and sunny during the hike. And on the way back, we were bombarded by a blizzard. I’d never been afraid of hail before, but these hailstones hurt! The one thought that kept running through my mind was that it would have been a good time to film a live action video of “Let it go” from Frozen. Brains are weird things.

The wind rises before the storm.

3. Caving

And the third course of our Iceland sampler was … caving!

Iceland has many underground caves cut out by volcanic eruptions. Our guide was actually part of a society that discovers and explores these caves. Only a few have their locations released to the public to become tourist attractions.

You could see very little in the caves with the naked eye. Our flashlights cut narrow walking paths in front of us, but all around everything was black. Surprisingly, however, a flash camera would reveal a distinctive red hue in the rocks around us.

There was another very surprising feature of the caves. In the parts that had been melted out by volcanic lava, the rock had actually melted to form an eerie, glassy surface. The ceilings were pocked with tiny stalactite-like formations that had cooled from dripping rock. I found this hard to process because I had somehow never entertained the possibility of rock melting. Metal, sure, but isn’t rock like the kind of stuff you put over a fire to smother it?

4. Snorkeling

When I saw a snorkeling option listed on the Iceland Guided Tours website, my mind immediately jumped to my wonderful snorkeling experience in Bermuda a few years earlier. I smiled at the memory of dozens of species of marine life darting between vivid reefs in those pristine Bermudan waters and, imagining myself enjoying a similar idyllic adventure in Iceland, signed up immediately.

Soon after I arrived in Iceland, I began to wonder if my mental image of relaxing among fish in warm waters might have been slightly misguided. On several occasions over the first two days, a guide mentioned that such-and-such a pond or lake had just thawed. So by the time that snorkeling came up on the fourth day, I was feeling slightly anxious.

To my relief, I learned upon arriving at the snorkeling site that we were going to be well equipped for the ordeal. I had naively brought my swimming trunks, thinking that I would have to jump in bare-chested or with a wetsuit, but it turned out that I would be fitted in what can best be described as a heavy-duty spacesuit. After spending an hour wriggling into a thermal undersuit, a thermal hood, a gigantic drysuit, gloves, and flippers, I was ready for battle. Unfortunately, for reasons that you can imagine, I was unable to take any pictures during this adventure.

The lake simply looked like a shroud of blackness from above water, but from below it was a beautiful mural of color and light, especially if you looked up at the sun. I enjoyed crawling around the edges of the lake and inspecting the rocks and their visual contrast with the water. For someone as near-sighted as me, looking at something very closely without glasses is an unusual and striking visual experience. I also amused myself by drinking copiously from the lake, which I would end up regretting on the bus ride back.

But the snorkeling wasn’t all fun. As time went on, I became increasingly aware of distress signals sent from my hands, which were not encased in the protective drysuit. In the process of becoming numb, they somehow managed to continue transmitting biting, intense pain. So I was rather relieved when I stumbled out of the lake after half an hour or so, the last minutes of which had seemed to stretch on forever. It was a satisfying experience, but not a wholly pleasurable one.

5. An Icelandic dinner

Iceland has many enchanting qualities, but its food pricing is not one of them. When I first arrived I entertained delusions of going out to enjoy authentic Icelandic cuisine. However, after familiarizing myself with the cost of food I decided that I could best avoid bankruptcy by subsisting on rice, canned vegetables, bread, and peanut butter. Still, while standing in line at the grocery store I couldn’t help but bristle at the indignity of having flown thousands of miles across the Atlantic to purchase a tiny jar of peanut butter for the equivalent of $6. My thoughts wandered involuntarily to the prospect of a career in smuggling condiments.

By this time on the penultimate day, I was satisfied that I had enough money saved in my bank account to survive a nice Icelandic meal. So my travel buddy and I headed out to the Grillmarkadurinn, or Market Grill. And it did not disappoint. The ambience was excellent, though perhaps a bit too romantic for the situation. Our table was lit by what might be described as a volcano rock chandelier.

The meal was prefaced by a delicious starter of bread with butter and volcanic rock salt. Yes, you read that correctly.

This was delicious for the first few slices of bread, but after that I started to really taste the rock, which was a little too strong. Still, I was very pleased to be able to add “ate Icelandic volcano” next to “drank Icelandic lake” and “consumed Icelandic glacier” on my rapidly growing résumé of impressive feats.

Next up was an appetizer consisting of three mini-sliders made from lobster, reindeer, and whale. You might have guessed that I was mainly interested in the latter two. I hadn’t even realized that reindeer was a real animal (Santa’s other helpers, the elves, are still fictional as far as I know). I felt rather guilty at eating these creatures, but the portions of meat were smaller than my thumb.

You might be wondering what the reindeer and whale tasted like. I’m not sure what words to put to it. Both pieces of meat looked like beef, but I remember the reindeer being a lot softer and chewier than I expected. The whale was less so, being more like fish.

That’s…reindeer.

For the main course, I had chosen grilled salmon with cherry glaze and a broccoli and carrot salad. I just love salmon, and my anticipation for Icelandic salmon had been building over the past four days as people kept mentioning how salmon was the big thing in Iceland. What would it be like?!

This salmon was done to perfection. There is a very delicate stage in between salmon’s (delicious) sashimi state and (also delicious) cooked state, when it briefly phase transitions into the absolute paragon of deliciousness. The lusciously tender pieces dissolved on my tongue like butter, leaving behind that characteristic hearty salmon flavor. I suppose that my hands were so shaky with excitement that I could only take this blurry picture.

6. The Blue Lagoon

The Blue Lagoon is a natural geothermal spa in Iceland. As soon as I read those two words I knew I had to see it for myself. The idea of bathing in the very essence of the earth is just so compelling; it’s no surprise that thousands of people flock to the Blue Lagoon every year to enjoy its mineral-rich waters.

On the bus ride there I was already squirming to catch sight of the lagoon. That puddle coming up seems a little bluish – is it the Blue Lagoon? What about that little pond? Are we there yet? There was no question when we did finally arrive: you’ll know the Lagoon when you see it. A brilliant sapphire against Iceland’s otherwise simple landscape, it is surely one of the island’s most idiosyncratic gems.

The rest of the day was spent lazily loafing around in the warm, relaxing waters. The Blue Lagoon was like a giant hot tub, but it felt more wholesome. There were tubs of silica mud lying around, which people could apply to their skin for its nourishing and restorative effects. I went for several coats. The mud itself was cool and soothing, and after washing it off I thought my skin might have felt softer and more hale, but it might very well have just been my imagination.

The Lagoon featured a few other interesting novelties. There was a giant geyser-like thing in the middle that spewed steam, and I enjoyed standing in it and pretending that I was floating around in a cloud. A very warm, comfortable cloud. Unfortunately it also smelled of sulfur, so I had to take a break for fresh air every now and then.

There were also saunas in a corner of the Lagoon, but I didn’t spend much time in them. Why use a man-made sauna when there’s a natural, super-awesome earth sauna?

Looking back, I still find it amusing that I spent five straight hours soaking in a giant geothermal hot tub, a stupid grin plastered to my face with silica mud. But I don’t regret a second of it.

## Bounded gaps between primes: a retrospective, II

In this post, I’ll finish my story about bounded gaps between primes. Two of my reasons for undertaking this particular essay were to prepare myself to read the Polymath project and to have a chance to learn from Professor Tim Gowers, who was the essay advisor. In fact, I had originally intended just to write about my experience with these two things, but I decided that it wouldn’t make any sense without first summarizing the mathematical story as in Part I.

Although this is still about the same problem, it has a different goal. Instead of describing specific mathematics, I’ll be discussing the experience of looking at research in its raw, unpolished form. It shouldn’t depend much on understanding the mathematics behind the Maynard-Tao theory, so you’ll be OK if you didn’t read the previous post.

Unfortunately, there is precious little material out there for easing students like myself into the process of doing research. Maybe this will help. I realize that I’m ill-qualified to give any sort of advice, and I wondered for a while if I should even make this public. But in the end, this isn’t about giving advice; it’s just about recounting an experience, and hopefully that will be at least slightly useful or entertaining even coming from me.

1. The Polymath8b Project

For those who are unfamiliar with the Polymath, let me give a brief introduction. Mathematical research has traditionally been a solitary endeavor, with papers rarely featuring more than two authors. In 2009 Tim Gowers (who also set my essay topic!) raised the possibility of massively collaborative mathematics, in which an online community of mathematicians would pool their efforts on a single problem. On his blog he hosted the first Polymath Project devoted to finding a combinatorial proof of something called the Hales-Jewett Theorem. Since then there have been several more Polymath projects, with the most recent being Polymath 8b, which is devoted to improving on the results of the Maynard-Tao method.

The Polymath 8b forums preserve (online) conversations between many mathematicians, including Tao and Maynard themselves, as they make further strides towards the Twin Prime Conjecture. You can see the appeal for a neophyte like myself: how often do you get to peer inside the thoughts of leading experts in the process of discovery? I was hoping to glean some insights by opening up the hood of the research vehicle. I didn’t go through everything carefully (in fact, Polymath 8b is still ongoing), but I can describe a couple of interesting things I noticed.

1.1. General observations

One aspect of the Polymath that differs greatly from texts and papers is that it’s composed of much more guesswork and heuristics than formal argument. Published math tends to be a barrage of Definitions, Lemmas, Propositions, Theorems, and especially proofs, and I think that this gives students the somewhat misleading impression that in math one only makes very measured, rigorous steps. In contrast, the reasoning of the Polymath featured lots of hand-waving, cartwheeling, and other acrobatic logic. To be clear, everything was rigorously justified in the end, but I was struck by how often a thread would be built on a “perturbative” analysis or estimate for a new idea. In my earlier undergraduate years I was uncomfortable with my physics classes because the arguments were often mathematically unsound, with technical difficulties conveniently swept under the rug. But the meat of the Polymath had a similar flavor.

Of course, it’s a lot easier to take a wrong turn after leaving the realm of rigorous mathematics, so one has to tread carefully. My takeaway is that the ability to reason informally but accurately is very useful (and subtle), because writing down formal proofs is an inefficient and clunky way of getting at the truth.

Let me illustrate with an example. One question that I asked myself while reading Maynard’s paper was the following. The key to Maynard and Tao’s enormous improvement over Goldston-Pintz-Yildirim is a better choice of weights, specifically the generalization of univariate weights of Selberg form to multivariate weights (still of Selberg form). But there are many sieves other than Selberg’s: Brun’s sieve, the Large Sieve, the combinatorial sieve, etc. What if one models the weights on one of these alternatives instead? Or on sieve weights of a completely new form?

My instinctive approach to answering this question would have been to spend a long time slowly and painfully understanding these other sieves, and then blindly trying some new possibilities. But there is a more efficient way. Terence Tao actually considered the same question, and gave a heuristic argument for why a different sieve would be unlikely to lead to significant improvements.

I discussed in Part I that sieve theory works with sums of the form

$\displaystyle \sum_{n \in [N,2N]} w(n)$

and these ${w(n)}$ are called the sieve weights. Selberg chooses weights of the form ${w(n) = \left(\sum_{d \mid n} \lambda_d \right)^2}$ and then optimizes with respect to the ${\lambda_d}$. We are wondering if we can do better with some different choice of weights. Any such choice is probably going to look like

$\displaystyle w(n) = \sum_{\substack{d_1, \ldots, d_k \\ d_i \mid n + h_i}} \lambda_{d_1, \ldots, d_k}.$

This isn’t completely general, but it encompasses essentially all known sieves, and philosophically we expect that a sieve has to work by combining data from the divisors in a way like this. In order to control the error terms that will come up, we will also need to impose some restriction looking like ${d_1 \ldots d_k \leq N^{O(1)}}$. (In the Maynard-Tao sieve, the requirement is essentially ${d_1 \ldots d_k \leq N^{\theta/2}}$ where ${\theta}$ is the level of distribution.)
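To make “Selberg form” concrete, here is a toy single-variable Python sketch (my own illustration; the function names and the choice ${\lambda_d = \mu(d)}$ below are hypothetical stand-ins, not the optimized weights from GPY or Maynard-Tao):

```python
def mobius(d):
    """Moebius function mu(d): 0 if d has a squared prime factor,
    otherwise (-1)^(number of distinct prime factors)."""
    if d == 1:
        return 1
    count, p = 0, 2
    while p * p <= d:
        if d % p == 0:
            d //= p
            if d % p == 0:
                return 0  # squared prime factor
            count += 1
        else:
            p += 1
    if d > 1:
        count += 1  # leftover prime factor
    return (-1) ** count


def selberg_weight(n, D):
    """Toy weight of Selberg form: w(n) = (sum_{d | n, d <= D} lambda_d)^2,
    with the illustrative (unoptimized) choice lambda_d = mu(d).
    Squaring guarantees w(n) >= 0, and w(n) = 1 whenever every prime
    factor of n exceeds D -- the basic features a sieve upper bound needs."""
    s = sum(mobius(d) for d in range(1, D + 1) if n % d == 0)
    return s * s
```

The real content of the method lies in choosing the ${\lambda_d}$ (or, for Maynard-Tao, the multivariate ${\lambda_{d_1, \ldots, d_k}}$) to optimize the resulting sieve sums, rather than fixing them in advance as done here.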

For a fixed value of ${n}$, this means that we expect ${w(n)}$ to depend essentially only on the prime factors of ${n+h_i}$ that are less than ${N^{O(1/k)}}$, since we imagine that having a very large divisor (compared to average) takes up too much space for the factorization to make much contribution. But if this is the case, then the numbers whose only prime factors are very large look the same as prime numbers to the sieve. More precisely, we say that ${n}$ is “${N^{O(1/k)}}$-rough” if its prime factors are all larger than ${N^{O(1/k)}}$, and this reasoning suggests that ${w(n)}$ is the same for all the ${N^{O(1/k)}}$-rough numbers.

Now it turns out that the prime numbers in ${[N,2N]}$ form a proportion ${O(1/k)}$ of the ${N^{O(1/k)}}$-rough numbers in this interval, and this is bad for us. To see why, recall that in the Goldston-Pintz-Yildirim framework we consider sums

$\displaystyle S_1 = \sum_{n \in [N,2N]} w(n) \qquad \text{and} \qquad S_2^{(\ell)} = \sum_{n \in [N,2N]} w(n) \chi(n+h_{\ell})$

and we win if ${S_2^{(\ell)}/S_1}$ is large relative to ${1/k}$. But if ${w(n)}$ is indeed approximately constant across ${N^{O(1/k)}}$-rough numbers and the primes are only a proportion ${O(1/k)}$ of these, then just by looking at the form of the sums we see that ${S_2^{(\ell)}/S_1}$ is at most ${O(1/k)}$. So we can’t do much better with any such choice of weights.
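The claim that the primes make up only a proportion ${O(1/k)}$ of the rough numbers is easy to sanity-check numerically. Here is a quick sketch; the parameters ${N = 10^5}$ and ${k = 5}$ are my own illustrative choices, not values from the papers:

```python
def smallest_prime_factor(n):
    """Smallest prime factor of n (returns n itself when n is prime)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n


N, k = 10**5, 5
z = N ** (1 / k)  # "roughness" cutoff N^(1/k); here z = 10

# n is z-rough if all of its prime factors exceed z.
rough = [n for n in range(N, 2 * N) if smallest_prime_factor(n) > z]
primes = [n for n in rough if smallest_prime_factor(n) == n]

ratio = len(primes) / len(rough)  # proportion of rough numbers that are prime
```

By Mertens’ theorem this ratio should come out around ${e^{\gamma}/k \approx 0.36}$ for these parameters, consistent with the ${O(1/k)}$ heuristic.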

Now, there are a couple of interesting features of this analysis. We had a fuzzy idea for improving the Maynard-Tao sieve, and we were able to formulate a precise question and give a rather hand-wavy heuristic argument to answer it. That was the first interesting thing; the second is that the analysis is wrong! Even the current Selberg sieve achieves a ratio of ${S_2^{(\ell)}/S_1 \approx \log k/k}$, beating our predicted ${O(1/k)}$ performance. The fault in our reasoning was that there is actually enough concentration of measure for well-chosen weights to distinguish between prime numbers and very rough numbers. In fact, the multivariate weights that are the key ingredient in the Maynard-Tao proof were already (essentially) written down by Selberg years ago, and he believed that they would not lead to much improvement. I don’t know his exact justification, but I imagine that it was probably similar to the argument I just described.

So in principle, we still haven’t (formally) ruled out the possibility of making huge gains just by choosing weights of a different form. Still, going through the preceding analysis sheds some light on the inner workings of the Goldston-Pintz-Yildirim mechanism, and on what we would need to do to make it more efficient. At the end of the day, even though we haven’t proven a formal limitation on the method, I think we have given a reasonably convincing case that there isn’t much more room to squeeze out of it.

1.2. The Polymath’s Results

Having just spent a lot of time talking about what Polymath 8b did not formally prove, let me say a bit about what it did establish, which represents the state of the art in our current understanding of small gaps between primes. (I don’t claim to describe all of its results; in particular I am omitting any discussion of the asymptotics on ${M_k}$.)

We are interested in proving bounds for ${\liminf_{n \rightarrow \infty} p_{n+1}-p_n}$. The Maynard-Tao argument gives a bound in terms of two components: (1) the level of distribution ${\theta}$ which we have already discussed, and (2) a functional optimization problem. The details of this optimization problem are not important, but just to be less vague I’ll say what it is: maximize the quantity

$\displaystyle \frac{ \int_0^1 \ldots \int_0^1 \left( \int_0^1 f(t_1, \ldots, t_k ) \, dt_1 \right)^2 dt_2 \ldots dt_k } { \int_0^1 \ldots \int_0^1 f(t_1, \ldots, t_k)^2 \, dt_1 \ldots dt_k } \ \ \ \ \ (1)$

over functions ${f: [0, \infty)^k \rightarrow [0, \infty)}$ supported on the unit simplex

$\displaystyle \{ (t_1, \ldots, t_k) \colon t_1 + \ldots + t_k \leq 1 \}.$
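For concreteness, here is a toy numerical sketch for ${k = 2}$ (my own illustration) that estimates the two integrals appearing in the ratio (1) for the hypothetical trial function ${f(t_1, t_2) = 1 - t_1 - t_2}$, which is supported on the unit simplex:

```python
# Toy estimate (k = 2, an illustrative trial function of my own choosing)
# of the two integrals in the functional optimization problem:
#   I = double integral of f(t1, t2)^2 over the square,
#   J = integral over t2 of (integral over t1 of f)^2,
# for f(t1, t2) = 1 - t1 - t2 supported on the simplex t1 + t2 <= 1.

def f(t1, t2):
    s = 1.0 - t1 - t2
    return s if s > 0.0 else 0.0  # vanishes off the unit simplex


n = 400          # midpoint-rule grid resolution
h = 1.0 / n

# I: integrate f^2 over [0,1]^2 (equivalently, over the simplex).
I = sum(f((i + 0.5) * h, (j + 0.5) * h) ** 2
        for i in range(n) for j in range(n)) * h * h

# J: integrate the square of the inner t1-integral over t2.
J = 0.0
for j in range(n):
    t2 = (j + 0.5) * h
    inner = sum(f((i + 0.5) * h, t2) for i in range(n)) * h
    J += inner ** 2 * h

# Exact values for this f: I = 1/12 and J = 1/20.
```

For this ${f}$ the exact values are ${1/12}$ and ${1/20}$; the actual records come from optimizing the ratio over large finite-dimensional families of such functions, which is where quadratic programming enters the picture.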

Polymath focused on proving bounds in two settings. The unconditional bounds are those that do not depend on any unproven conjectures, i.e. those which we can prove completely right now, and the conditional bounds are those that can be proven using widely believed (but potentially very difficult) conjectures. In practice this basically means that the unconditional bounds use level of distribution ${\theta = \frac{1}{2}}$ from the Bombieri-Vinogradov Theorem, and the conditional bounds use ${\theta = 1}$ coming from the (generalized) Elliott-Halberstam conjecture.

The current state of affairs for these bounds is depicted below.

| | Unconditional | Conditional |
| --- | --- | --- |
| Pre-2005 | $\infty$ | $\infty$ |
| GPY | $\infty$ | 16 |
| Zhang | 70000000 | 16 |
| Polymath8a | 4680 | 16 |
| Maynard | 600 | 12 |
| Polymath8b | 246 | 6 |

1. The unconditional bound. Maynard’s unconditional bound of ${600}$ is achieved by solving the optimization problem (1) in a certain finite-dimensional space, which basically turns it into a quadratic programming exercise. It turns out that one can already get a bound of ${272}$ unconditionally without introducing new theory, simply by using a bigger quadratic program. However, the bound ${246}$ is known to be impossible to achieve without modifying Maynard’s optimization problem, so it requires new ideas. The key observation is the following. Maynard’s argument depends on bounding the quantity ${\rho = \sum_{\ell=1}^k S_2^{(\ell)} / S_1}$ from below. He does this by establishing asymptotics for ${S_1}$ and ${S_2^{(\ell)}}$, and the level of distribution is needed to prevent the error terms from ruining the asymptotics. But if we are really just interested in bounding ${\rho}$, then we don’t need asymptotics; we just need bounds in the right direction (a lower bound on the numerator and an upper bound on the denominator). So in fact one can allow the error terms to overflow a bit, as long as they overflow in the right direction. In technical terms, this means that we can allow the parameters to go ${\epsilon}$ above what we can control, at a small sacrifice elsewhere; this tradeoff gives a little more wiggle room that can be optimized to gain a few inches. On Polymath 8b this is called the “${\epsilon}$-trick,” and it leads (after a lot of computation) to the bound of ${246}$.
2. The conditional bound. Polymath 8b was unable to improve on Maynard’s conditional bound of ${12}$ just using the Elliott-Halberstam conjecture. However, it was able to prove a bound of ${6}$ assuming something called the generalized Elliott-Halberstam conjecture, which concerns the equidistribution of general types of arithmetic functions; the (normal) Elliott-Halberstam conjecture is essentially the special case of prime-counting functions. So the generalized version is definitely harder, but perhaps not much harder, as the generalized analogue of Bombieri-Vinogradov has been proven. The reason why the generalized conjecture is useful goes roughly as follows. The level of distribution is needed to control the irregularity of prime counting functions in the ${S_2}$ sums. The analogous counting that goes on in ${S_1}$ is actually very regular, so some room is “wasted” in ${S_1}$ because we are forced to be so careful about ${S_2}$. Conversely, there is some “waste” in ${S_2}$ coming from choices that are needed for ${S_1}$. The generalized Elliott-Halberstam conjecture lets us remove this wastefulness by aligning the resources needed for ${S_1}$ and ${S_2}$. The technical effect is that we can somewhat expand the space of functions one is optimizing the quantity (1) over, which gives the extra inches needed.

1.3. The Parity Problem

The enormous progress mathematicians have made in the past year may make it seem like the Twin Prime Conjecture is now tantalizingly close. If we have already brought the known bound on ${\liminf p_{n+1}- p_n}$ down from ${\infty}$ to ${246}$ in one year, perhaps some enterprising person, maybe even you, can close the measly remaining ${244}$?

The truth is that there is a fundamental barrier to questions like the Twin Prime Conjecture, called the “parity problem,” that we are no closer to surmounting now than before. The bound of ${6}$ is important because it is the best result that the parity problem allows sieve-theoretic means to achieve, and we expect the jump from ${6}$ to ${4}$ to be very difficult.

You can think of sieve theory as being a general type of tool, say a hammer, that we can use to try and crack our questions. The Maynard-Tao-GPY sieve is a very nice model of sledgehammer, which can crack some really tough nuts. The results it can produce depend on an input, the level of distribution, which measures the strength of the hammer wielder. The stronger the better, but there is a limit to what a sledgehammer can accomplish even with the strongest wielder. The parity problem says that the Twin Prime Conjecture is too tough for any hammer to crack.

This statement cannot be proved formally; it is just a heuristic (but one that has a fair bit of empirical, theoretical, and philosophical evidence). It’s always possible that a sieve could be the finishing move in a proof of Twin Prime Conjecture, just like a hammer could be the finishing blow in cracking any nut: you could always smash the nut with an intergalactic laser cannon, and then tape the pieces together with scotch tape and break that with a hammer.

The parity problem expresses the general phenomenon that sieves can’t really distinguish numbers with an odd number of prime factors from those with an even number of prime factors. They only “see” sets that have a balanced mix. The technical basis for this principle is the belief that the Möbius function is pseudorandom. It isn’t hard to understand the connection, but I want to avoid all technicalities, so you can look at the last section of my essay if you are curious. The point is that a set of primes, say of size ${A}$, consists only of things with an odd number of prime factors. Sieves can only see balanced sets, so the imbalanced set that they can see could be a shadow of something as big as a balanced set of size ${2A}$, or as small as a balanced set of size ${0}$. That is why the parity problem dictates that sieve methods can’t establish nontrivial lower bounds, or upper bounds closer than within a factor of ${2}$ of the truth.

To illustrate the philosophical flavor of the problem, it may be helpful to consider a completely different issue. Another famous and old problem in number theory is Goldbach’s conjecture, which predicts that every even number is a sum of two prime numbers. The “weak” Goldbach conjecture that every odd number is a sum of three prime numbers has now been proved, and it has been known for a while that it is true for all but finitely many numbers. Why is the weak version so much easier?

The best known approach to Goldbach’s conjecture is to use something called the “circle method,” which gives a formula for the number of ways to write ${n}$ as a sum of two prime numbers or three prime numbers. Unfortunately this formula is not very transparent, so you can’t just look and see if it’s zero. But you can estimate its size, and in the case of three prime numbers you can prove that it quickly grows large, so it can’t be zero. In the case of two prime numbers this doesn’t work because the answer shouldn’t be large – it should be too small to be detected by the analytic techniques that we possess.

As I said, this is a different obstruction from the parity problem, but it also illustrates how we can believe that a question can be beyond a certain method. The Twin Prime Conjecture is like this beautiful star in space that we would like to reach but we don’t know how. Maynard and Tao have handed us a telescope from which it looks tantalizingly close, but in reality we are probably no closer than before.

2. A conversation with Gowers

Now onto my second noble motivation for writing the essay: I wanted to meet Professor Gowers, who set the topic. That played out rather differently from what I expected. I did eventually get to meet him … more than a month after I wrote and submitted the final version. During my last couple of weeks in Cambridge, I finally plucked up the courage to request a conversation. At that point I would probably have been glad just to shine his shoes, but he agreed to chat about math in his office.

For those who don’t know why I am talking like a teenaged fangirl, Gowers is something of a celebrity in the mathematics world. The obvious reason is that he won the Fields medal, which is commonly described as the “Nobel Prize of mathematics.” But he has been a leader of the community in other ways as well, such as in starting the Polymath projects and the Elsevier journal boycott.

The journey to Gowers’s office was itself a bit of an adventure. His Trinity college office is not actually located in Trinity College, as you might guess, but in a small passageway next to the Round Church of Cambridge. As I squeezed past this 11th century church, which is so ancient that the ground it’s built on is about six inches below everything else around, I felt more like I was descending into a top-secret MI6 lair than an academic office. And Gowers himself cuts a rather impressive figure, especially for a mathematician. As we talked, he in that awesome British accent and me in my usual bumbling way, I couldn’t shake the impression that he was playing some distinguished 00 agent in a Bond film and I was the American comic relief.

I won’t attempt to recount all the disorganized questions I asked, but one thing I was particularly keen to talk about was his experience of doing research. There were some unexpected things I learned from the Polymath, as I’ve already noted above, and I wanted to know how they fit with his own work. After all, the Polymath wasn’t exactly the most representative research project, and I still sort of believed that Fields medalists conceive their mystical ideas while meditating under a waterfall, or something like that. But the picture that Gowers painted could not have been more different.

Another great thing about talking to Gowers is that he’s thought a good deal about “meta-mathematical” questions, like how one should think about math. My impression of math, cultivated from my dismal experience of the Olympiads and by reading stuff from the inscrutable “French school”, was that the essence of math is about pulling magical rabbits from magical hats. In contrast, Gowers presented an analogy of the researcher as a kind of detective, searching for the answer to a particular problem (say, as a sequence of formal symbols constituting a proof). There are too many possible suspects to interrogate carefully, so one has to use heuristics to narrow down the set of leads that one pursues. The little informal argument I described above might be an example of such a heuristic; the parity problem is another. The point is that one has to develop a bag of tricks and techniques to efficiently navigate this space of ideas, which would otherwise be too massive and complex. And at the end, one presents a formal proof to the court, but the proof is just the bureaucratic seal on the search for truth.

Of course, I pressed him to give some examples of tricks and heuristics. If you’ve been doing math for a while then I expect that they won’t be so surprising: they are all things that you probably already do at some instinctive level, but perhaps don’t consciously think about. For instance, here are some of the general examples he mentioned.

• When trying to prove a statement like ${A \implies B}$, you often conjecture some intermediate statement ${C}$ that you know implies ${B}$, and you wonder if ${A \implies C}$. At that point it’s useful to have lots of examples and toy models on hand to test your conjecture. Sometimes the conjecture even comes about by observing the behavior in your examples.
• Another trick when trying to understand an implication ${A \implies B}$ is to twiddle the hypotheses and conclusions. Does your current strategy use all the hypotheses? Does it prove something stronger? You can maybe add some helper hypotheses for now and figure out how to remove them later, or prove a slightly weaker conclusion first and then try to boost it. Another thing that Gowers mentioned, which was unfamiliar to me, was that you can sometimes understand better how to prove ${A \implies B}$ if you try to prove something stronger first. I’ll have to try that out at some point.
• When you’re in the dark you can try to reason by analogy. There are lots of surprisingly effective analogies in mathematics, or general theories that are well encapsulated by special models.

He also mentioned some more subject-specific technical heuristics. In analysis, for instance, one can often guess a good estimate for a sum or integral even before one can prove it; I described how some estimates arise from random models in Part I. In functional analysis it often takes a painful, technical slog to justify the estimate, but in practice it almost always works. There were also meta-mathematical heuristics. If you have an idea for a problem, you might ask yourself if it’s something that people would have tried already – if it’s a famous hard problem, then something too simple will probably not work.

Probably none of that was new to you, but even if you already use all of these heuristics it may be useful to pay conscious attention to them. Professional athletes train by drilling “obvious” skills, anyway, and this doesn’t seem so different. Being but an infant to the research world right now, I can only speculate.

3. Concluding thoughts

The research community seems to operate with a “sink or swim” mentality. Somewhat perversely, our education places all its emphasis on inculcating technical material and none on teaching the “tricks of the trade” for how to think. Those who naturally employ the right skills filter through the system; the others are just left behind. And people like me get the impression that math is only for the magicians, who possess some talent for producing miraculous ideas from out of nowhere.

I think that my experience, as documented here, partially dispels this perception. If any of you are still reading, then I hope you walk away feeling hopeful that a lot of the secret sauce consists of concrete ingredients that you can learn and practice, and that everybody else struggles and makes mistakes as well, whether or not they show it.

## Bounded gaps between primes: a retrospective, I

One of my projects over this past year was writing a mathematical “essay” (basically the analogue of a master’s thesis) on the recent progress towards proving bounded gaps between primes. You may have already heard about this result as it has received quite a bit of media attention, largely because of its fantastical backstory.

I just spent some time reflecting on what I learned in this process, and I thought it might be useful to write some of it down. This turned out to be lengthier than I anticipated, so I broke it up into two parts. This first part, posted here, is just a summary of the mathematical story. I’ve given only a vague sketch of the main ideas; those who wish to see the technical details can take a look at the full essay, which is posted in my notes page here. The second part will be much less mathematical: I’ll recount some general research things that I learned from the experience, especially by reading the Polymath and talking to my advisor.

I wrote this first part to be the kind of thing that I would have liked to read in order to understand the ideas involved, without the headache of digging through and processing the literature. For me that means as little actual math as possible, but I don’t know if that will work for anybody else. In the best case, this post will make for light reading that conveys some sense of the subject to people who are foreign to it. In the worst case, it will cause non-experts to grimace at the excess of technicality, and others to grimace at the lack thereof. (If you like, e-mail me some feedback to help me figure out the balance.)

1. Introduction to the Problem

1.1. The Twin Prime Conjecture

One fundamental theme in mathematics is that of primality. Recall that prime numbers are those natural numbers that cannot be written as the product of two smaller natural numbers. They are the basic atoms out of which all numbers are built.

For number theorists, the prime numbers are inherently beautiful and mysterious gems, manifesting simultaneously a great deal of structure and a great deal of randomness. But they have significance beyond their aesthetic appeal; for instance, prime numbers lead to the notion of prime ideals, which form a robust bridge between algebra and geometry by which one interprets primes as points of a geometric space. This simple observation is now the cornerstone of the entire subject of algebraic geometry. Prime numbers even enjoy the rare distinction of having important practical applications, mainly to cryptography. This is all just to say that primes are extremely interesting objects, both for their own sake and for the deeper ideas that they suggest.

It’s no surprise then that the distribution of the prime numbers is one of the most prominent mysteries in mathematics. How densely are they distributed? How regularly are they spaced? These basic questions have seen a lot of attention, but have also proved extremely elusive.

We do know a fair bit about the density of the primes. We know that if you look at all the positive integers up to ${ X}$, you’ll find about ${X/\log X}$ primes. More precisely, if ${\pi(X)}$ denotes the number of primes up to ${X}$, then ${\pi(X)}$ is well approximated by the function ${ \int_2^X dt/\log t}$ – that is the assertion of the Prime Number Theorem. We can only prove that this approximation is modestly good, but it is conjectured to be very good: the content of the famous Riemann Hypothesis is that the error term is about square-root in magnitude. (This isn’t the usual formulation of Riemann’s hypothesis, but it is equivalent.)

It may be worthwhile to pause and reflect on the significance of “square-root error.” If you flip a fair coin many times, say 1000 for concreteness, how often do you expect to see heads? Clearly you should flip heads about half the time (if the coin is indeed fair), so in our example you should expect about 500 heads. But you probably won’t get exactly 500 heads – there’s always some random “noise” to the results. It turns out that we can expect to be within a “square-root error” of parity. For 1000 flips, the deviation from parity should be on the order of magnitude of 30 or so. There is a general result called the Central Limit Theorem, which says that this is the kind of behavior we always witness when aggregating many independent random events, no matter what they are. The moral is that in quite general circumstances, a square-root error is the best that we can hope for: it is the result of perfectly random noise. Any tighter error is the kind of thing that will bring the IRS knocking on your door.
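This square-root behavior is easy to see empirically. Here is a toy simulation of my own (not part of the original discussion) that flips 1000 fair coins many times and records the typical deviation from 500 heads:

```python
import random

# Simulate repeated runs of 1000 fair coin flips and record how far
# the number of heads strays from the expected 500.
random.seed(0)
n_flips, n_trials = 1000, 2000
deviations = []
for _ in range(n_trials):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    deviations.append(abs(heads - n_flips // 2))

avg_dev = sum(deviations) / n_trials
# The standard deviation of the head count is sqrt(1000 * 1/2 * 1/2),
# about 15.8, so typical deviations are a few tens: square-root scale,
# not linear scale.
print(avg_dev)
```

The typical deviation comes out in the low tens, in line with the square-root prediction; deviations of, say, 200 essentially never occur.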

Now let’s return to our discussion of prime numbers. The density of the primes looks like what we would get if primality were random and the number ${ n}$ had about a ${1 / \log n}$ probability of being prime. In other words, it looks almost as if the property of ${ n}$ being prime is encoded by a Bernoulli random variable with parameter ${1 / \log n}$. Obviously this is a ridiculous statement, since everything is completely deterministic, but what if we just play along with it? If we further assumed that our random variables were independent, then the Central Limit Theorem would tell us that there should be about ${ \int_2^X dt/\log t}$ primes up to ${ X}$ and would give the hypothesized square-root error term. In other words, the content of the Riemann hypothesis is that we can predict the distribution of prime numbers up to what is essentially random noise. So our probabilistic heuristic has led us quite naturally to the Riemann hypothesis. The lesson here is that it is often very useful to model sequences of numbers by random variables.
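As a quick empirical check (a script of my own, not from the essay), we can compare the actual prime count ${\pi(X)}$ against the logarithmic integral ${\int_2^X dt/\log t}$:

```python
import math

def prime_count(X):
    """Count the primes up to X with a simple sieve of Eratosthenes."""
    sieve = [True] * (X + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, math.isqrt(X) + 1):
        if sieve[p]:
            for m in range(p * p, X + 1, p):
                sieve[m] = False
    return sum(sieve)

def li(X, steps=20_000):
    """Approximate the integral of dt/log t from 2 to X (midpoint rule)."""
    h = (X - 2) / steps
    return sum(h / math.log(2 + (i + 0.5) * h) for i in range(steps))

X = 100_000
# pi(10^5) = 9592, while the integral is roughly 9630: within about 0.4%.
print(prime_count(X), round(li(X)))
```

Already at ${X = 10^5}$ the approximation is good to a fraction of a percent, and the Riemann hypothesis predicts the error stays around square-root size forever.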

The spacing between prime numbers has proven to be more difficult to analyze. If ${p_n}$ denotes the ${n}$th prime, then the Prime Number Theorem tells us that the “average” gap between ${p_n}$ and ${p_{n+1}}$ is about ${\log n}$. So the primes are generally growing sparser and sparser apart, but we could still conceivably have coincidences every now and then in which they are much closer than expected. How often does this happen?

We can at least try to guess at the answer using the kind of heuristics we described above. Here you see an immediate flaw: the events [${n}$ is prime] and [${n+1}$ is prime] are obviously not independent, since ${n}$ or ${n+1}$ is even. So we obviously can’t expect multiple pairs of primes that are separated by a distance of ${ 1}$. But there is no such obvious dependence between ${ n}$ and ${ n+2}$ both being prime, so we can press on in our fantasy world and model their primality as the realization of two independent Bernoulli random variables. If we do so, then we see that there’s about a ${ 1/(\log n)^2}$ chance of ${ n}$ and ${ n+2}$ being simultaneously prime; when this happens, ${n}$ and ${n+2}$ are called twin primes. This heuristic already suggests the Twin Prime Conjecture that there are infinitely many twin primes.
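We can even test the heuristic numerically. The toy script below is my own illustration; note that the honest Hardy-Littlewood prediction carries an extra constant factor of about ${1.32}$ (the twin prime constant ${2C_2}$) correcting for the dependence between ${n}$ and ${n+2}$ modulo small primes, so the naive independence model undercounts slightly.

```python
import math

def twin_pairs(X):
    """Count pairs of primes (n, n+2) with n + 2 <= X."""
    sieve = [True] * (X + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, math.isqrt(X) + 1):
        if sieve[p]:
            for m in range(p * p, X + 1, p):
                sieve[m] = False
    return sum(1 for n in range(2, X - 1) if sieve[n] and sieve[n + 2])

def naive_heuristic(X):
    """Sum of 1/(log n)^2: the prediction if n and n+2 were independent."""
    return sum(1 / math.log(n) ** 2 for n in range(3, X))

X = 100_000
actual, predicted = twin_pairs(X), naive_heuristic(X)
# The naive heuristic gets the order of magnitude right; the ratio
# actual/predicted hovers near the twin prime constant 2*C_2 ~ 1.32.
print(actual, round(predicted))
```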

We could easily have stated the Twin Prime Conjecture earlier, without undergoing this lengthy discussion, but I think it’s valuable to emphasize that the conjecture is more than an isolated prediction; it fits in naturally with underlying themes that link together the main questions of the field.

1.2. Historical overview

The Twin Prime Conjecture seems to be at least several centuries old, but we had made very little progress towards it until very recently. It is useful to think in terms of the quantity ${\liminf_{n \rightarrow \infty} p_{n+1}-p_n}$, which is by definition the smallest gap between primes that occurs infinitely often (or ${\infty}$, if no such finite gap exists).

Conjecture (Twin Prime Conjecture) We have

$\displaystyle \liminf_{n \rightarrow \infty} p_{n+1} - p_n =2.$

Despite contributions from Erdös, Selberg, Davenport, Bombieri, and many others, we still had no plausible strategy for even showing that ${\liminf_{n \rightarrow \infty} p_{n+1}-p_n < \infty}$ until the turn of the millennium, when Goldston, Pintz, and Yildirim released a landmark paper showing that bounded prime gaps follow from something called the Elliott-Halberstam conjecture.

Theorem 1 (Goldston-Pintz-Yildirim, 2009) If the Elliott-Halberstam conjecture is true then

$\displaystyle \liminf_{n \rightarrow \infty} p_{n+1} - p_n \leq 16.$

A precise formulation of the Elliott-Halberstam conjecture is given in my essay. Here it’s enough to understand that the conjecture is concerned with prime equidistribution; it predicts that the primes are distributed as evenly among different residue classes as is plausible. The Elliott-Halberstam conjecture appears to be quite difficult, but it has a similar feel to the Riemann hypothesis and is probably not much more difficult (of course, that’s not saying much).

The Goldston-Pintz-Yildirim argument also came tantalizingly close to unconditionally establishing that

$\displaystyle \liminf_{n \rightarrow \infty} p_{n+1} - p_n < \infty.$

However, at the turn of the decade many mathematicians still believed that the argument could not be extended to close that extra distance, and that finite prime gaps were still far beyond our reach to prove.

Then in the spring of 2013, the mathematical world was shaken unexpectedly, and by the unlikeliest of sources. Yitang Zhang was a virtual unknown in the mathematics community. After receiving his Ph.D. from Purdue in 1991 (in a subject distinct from number theory), Zhang failed to obtain an academic job. He all but disappeared from the academic community, pursuing odd jobs as an accountant and even a Subway worker to support himself. He finally secured a position as a teacher (not even a researcher!) at the University of New Hampshire in 1999. From then, he worked on several prominent problems in analytic number theory in secrecy and isolation, until he shocked the world in May 2013 with the first-ever proof of bounded prime gaps:

Theorem 2 (Zhang, 2013) We have

$\displaystyle \liminf_{n \rightarrow \infty} p_{n+1}-p_n \leq 70,000,000.$

In fact, I was present on that May afternoon at Harvard when Zhang first publicly announced his proof. I witnessed a spectacle I had never before seen in four years of walking by the seminar room: the whole of the department – topologists, algebraists, geometers, and of course number theorists – packed into a single talk, united in admiration for this impossible success story in the face of adversity. Zhang, with no number theory training and no research support, had not only conquered this seemingly impossible challenge, but he had done so by improving technically on the work of experts, in ways that the experts had thought impossible.

Mathematicians quickly converged to discuss Zhang’s work. Most prominently, Terence Tao gathered an online community of mathematicians to sharpen and optimize the various components of Zhang’s argument. After several months, this massively collaborative online project, dubbed “Polymath 8,” pared down Zhang’s bound from ${70,000,000}$ to about ${5000}$. They also found many simplifications to Zhang’s proof, but it was still a difficult and intricate argument.

In November 2013, another huge breakthrough rocked the mathematics community. James Maynard, a postdoctoral researcher with a newly minted Ph.D., found a way to dramatically improve the apparatus used by Goldston-Pintz-Yildirim. Whereas Zhang had made technical improvements to the existing arguments, Maynard discovered a fundamental alteration that made them much more efficient. Maynard’s work was completely independent of Zhang’s, and led to a much shorter and easier proof of a stronger result. After centuries of feeble efforts, mathematicians made two huge strides in the span of a few months!

Theorem 3 (Maynard, 2013) We have

$\displaystyle \liminf_{n \rightarrow \infty} p_{n+1}-p_n \leq 600.$

In fact, Maynard’s work says much more: it automatically proves that there are multiple primes among any “admissible” system of linear forms. As a consequence, it establishes not only infinitely many bounded gaps between pairs of primes, but infinitely many bounded intervals containing arbitrarily long strings of consecutive primes!

Maynard’s results were simultaneously and independently discovered by Tao, who then initiated another communal project, Polymath 8b, devoted to improving the result further. Through collaboration from a wide pool of contributors, Polymath 8b improved on several facets of Maynard’s results, with the following consequence in particular.

Theorem 4 (Polymath 8b, 2014) We have

$\displaystyle \liminf_{n \rightarrow \infty} p_{n+1}-p_n \leq 246.$

This is, at present, the best bound anybody can prove.

2. A primer on sieve theory

2.1. The sieve of Eratosthenes-Legendre

The main technical apparatus (currently) for counting twin primes is something called sieve theory. The basic idea of sieve theory is to take a complicated set, like the set of prime numbers, and try to find its size by measuring it in terms of simpler sets. The process is analogous to that of trying to measure the area of some complicated plane region by comparing it with a collection of rectangles and other “simple” shapes, which you might have done in geometry class. But in our situation, simplicity is measured in an “arithmetic” sense rather than a geometric sense.

For concreteness, let’s consider an example. Suppose we want to count how many prime numbers there are in some interval, say ${[N, 2N]}$ where ${N}$ is pretty large. Prime numbers seem pretty complicated, but their complement, the composite numbers, seem a little less daunting. We can start listing off composite numbers organized by their divisors. For instance, there are about ${N/2}$ integers in ${[N,2N]}$ that are divisible by ${2}$. There are also about ${N/3}$ integers in ${[N,2N]}$ that are divisible by ${3}$. We can combine this information to figure out (approximately) how many integers in ${[N,2N]}$ are divisible by ${2}$ or ${3}$, but we have to be a little careful: the answer is about ${N/2+N/3-N/6}$, where the extra subtracted term comes from the fact that the number of integers divisible by ${6}$ will be double-counted. We can continue on in this way.
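The count and the inclusion-exclusion estimate can be compared directly in code (a small illustration of my own):

```python
# Count the integers in [N, 2N] divisible by 2 or 3, two ways.
N = 10_000
interval = range(N, 2 * N + 1)

direct = sum(1 for n in interval if n % 2 == 0 or n % 3 == 0)

# Inclusion-exclusion: multiples of 6 are counted by both the N/2 and
# the N/3 terms, so we subtract them back out once.
estimate = N / 2 + N / 3 - N / 6

print(direct, estimate)  # the two agree up to a tiny boundary error
```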

You probably recognize this counting procedure as the “principle of inclusion-exclusion.” Let’s think about what the general procedure looks like. We begin with a list of small primes ${p_1, p_2, \ldots}$. We want to sift out the composite numbers in our interval ${[N,2N]}$, and we know that about ${N/p_i}$ of them will be divisible by ${p_i}$, so we subtract out

$\displaystyle N - \frac{N}{p_1} - \frac{N}{p_2} - \ldots - \frac{N}{p_i} - \ldots.$

However, we have overcounted: the numbers divisible by ${p_1}$ and ${p_2}$ are subtracted out twice (which is once more than we want), so we have to add them back in once. So then we add back in

$\displaystyle \frac{N}{p_1 p_2} + \ldots + \frac{N}{p_i p_j} + \ldots .$

Now the numbers divisible by three of the primes are not counted with the correct multiplicity, so we must put in more correction terms, and so on. There is a nice, clean way of packaging this argument, which is by introducing the Möbius function ${\mu(n)}$:

$\displaystyle \mu(n) := \begin{cases} 0 & p^2 \mid n \text{ for some prime } p, \\ (-1)^k & n = p_1 \ldots p_k \text{ for distinct primes } p_1, \ldots, p_k.\end{cases}$

The key property of the Möbius function is the identity

$\displaystyle \sum_{d \mid n} \mu(d) := \begin{cases} 1 & n=1, \\ 0 & \text{otherwise.} \end{cases}$

This identity encapsulates the inclusion-exclusion principle counting procedure that we described earlier. We strike out the number ${n}$ once for each prime divisor, then put it back in once for each pair of prime divisors, and so on. At the end, ${n}$ will be struck out exactly once, which cancels with the term coming from the divisor ${d=1}$ in the sum above.
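The identity is easy to verify by brute force (a quick check of my own):

```python
def mobius(n):
    """Compute mu(n) by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:       # p^2 divided the original n
                return 0
            result = -result     # one more distinct prime factor
        p += 1
    if n > 1:                    # a leftover prime factor
        result = -result
    return result

def divisor_mu_sum(n):
    """Sum of mu(d) over the divisors d of n."""
    return sum(mobius(d) for d in range(1, n + 1) if n % d == 0)

print([divisor_mu_sum(n) for n in range(1, 13)])
# → [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```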

Using this identity, we see that

$\displaystyle \# \{ n \in [N,2N] \colon n \text{ coprime to } m \} = \sum_{N \leq n \leq 2N} [(m,n) = 1] = \sum_{N \leq n \leq 2N} \sum_{d \mid (m,n)} \mu(d).$

So we have used the Möbius function to create “sieve weights”

$\displaystyle w(n) := \sum_{d \mid (m,n)} \mu(d)$

that “filter out” numbers that are not prime. In principle, this will lead to an exact formula for the number of primes in our interval ${[N,2N]}$. However, to digest this formula we need to make certain approximations, and it turns out that we cannot control the error terms that accumulate from all of the approximations. The point is that the error terms are sums of values of the Möbius function, which oscillates wildly from ${-1}$ to ${1}$. If this oscillation is sufficiently random then the errors will cancel to a very large degree, and all will be well. In practice that always seems to be the case, but it is extremely difficult to prove that there is any appreciable degree of cancellation at all.
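To make the construction concrete, here is a small sketch of my own (with an arbitrary choice of modulus ${m}$) of these sieve weights in action; by the Möbius identity, summing the weights reproduces the exact count of integers coprime to ${m}$:

```python
from math import gcd

def mobius(n):
    """Compute mu(n) by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result

def sieve_weight(n, m):
    """w(n) = sum of mu(d) over d dividing gcd(m, n)."""
    g = gcd(m, n)
    return sum(mobius(d) for d in range(1, g + 1) if g % d == 0)

# With m = 2*3*5*7, the weight is 1 exactly when n has no prime factor
# below 11, so summing the weights sifts [N, 2N] by the primes 2, 3, 5, 7.
N, m = 1000, 2 * 3 * 5 * 7
sieved = sum(sieve_weight(n, m) for n in range(N, 2 * N + 1))
direct = sum(1 for n in range(N, 2 * N + 1) if gcd(n, m) == 1)
print(sieved, direct)  # the two counts agree exactly
```

The trouble described above only appears once we approximate the individual terms rather than summing them exactly, as one must when ${m}$ is a product of many primes.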

2.2. Selberg’s sieve

The modern perspective of sieve theory replaces the approach of measuring sets by physically breaking them down into simpler sets with that of approximating the indicator function of a set with simpler functions. (Approximation is not necessarily in the usual sense of functional analysis; for instance ${1/2}$ is a good approximation to the indicator function of the even integers.) This point of view emphasizing functions over objects is a characteristic motif of twentieth-century mathematics. To continue with our previous analogy, this is like developing the theory of integration in order to handle our questions about areas.

In sieve theory, there are many variations of this theme that improve upon the method of Eratosthenes-Legendre. For the question of small prime gaps, the most successful has been Selberg’s sieve. To motivate this, it is helpful to reflect on what went wrong in the sieve of Eratosthenes-Legendre. The problem was that the weights ${\mu(n)}$ fluctuate violently between ${-1}$ and ${1}$, and there was a lot of cancellation in our expressions that we couldn’t codify because this function is so wild. Philosophically, the method is too precise: it gives an exact formula for the quantity we are seeking to describe, which is itself somewhat wild and “random,” so this randomness propagates to the sieve weights. Said differently, the unpredictable behavior of the Möbius function reflects the inherently unpredictable behavior of the primes. There is a kind of uncertainty principle at work here: the tighter you try to pin down the behavior of the primes, the wilder the error term will be.

Somewhat paradoxically then, we can gain ground by deliberately writing down a worse expression for the quantity of interest, which then allows us to use nicer weights. Intuitively, we want to be working with functions that are “smoother” than the Möbius function. In the argument above, we summed weights

$\displaystyle \sum_{N \leq n \leq 2N} w(n)$

satisfying ${w(n) = 0}$ or ${1}$, and ${w(n) = 1}$ if ${n}$ is prime; this gave an upper bound on the number of primes in ${[N,2N]}$. Actually, for the purpose of obtaining an upper bound, it suffices to choose ${w(n)}$ to be any sequence satisfying ${w(n) \geq 0}$ always, and ${w(n) \geq 1}$ if ${n}$ is prime. Selberg’s sieve simply refers to a specific choice of the form

$\displaystyle w(n) = \left( \sum_{d \mid n} \lambda_d \right)^2$

where the ${\lambda_d}$ are a sequence of real numbers chosen to optimize the sieve, and ${\lambda_1 = 1}$. This choice obviously satisfies the required conditions.
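Here is a toy illustration of my own (the ${\lambda_d}$ below are an arbitrary hypothetical choice, not the optimized ones) of why weights of this shape satisfy the required conditions:

```python
# Selberg-style weights w(n) = (sum over d | n, d <= D of lambda_d)^2,
# with lambda_1 = 1 and the remaining lambda_d chosen arbitrarily here
# (the real method optimizes them via a quadratic form).
D = 10
lam = {1: 1.0}
for d in range(2, D + 1):
    lam[d] = -1.0 / d          # illustrative values only, not the optimum

def w(n):
    s = sum(lam[d] for d in lam if n % d == 0)
    return s * s               # a square, hence automatically >= 0

# For a prime p > D, the only divisor of p among 1..D is 1, so
# w(p) = lambda_1^2 = 1: every large prime gets full weight.
primes = [11, 13, 17, 19, 23]
print([w(p) for p in primes])                 # all equal to 1.0
print(all(w(n) >= 0 for n in range(2, 200)))  # True
```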

The advantage of choosing weights in this form is that the resulting sum can be interpreted as a quadratic form in ${\lambda_d}$, and the theory of such forms is quite well understood. In particular, one can explicitly diagonalize the quadratic form using simple linear transformations, which makes it easy to optimize for applications. This method yields, fairly easily, some nice consequences: it can provide a better upper bound on the density of the prime numbers, and also a good upper bound on the density of the twin primes. A general feature of sieve theory is that it is good for proving decent upper bounds (within a constant factor of the truth), but miserable at proving nontrivial lower bounds. That is why it is so much harder to prove that there are many small prime gaps, and that is what we turn to next.

3. Bounded Gaps Between Primes

3.1. The Goldston-Pintz-Yildirim sieve

In 2005 Goldston, Pintz, and Yildirim announced groundbreaking work towards proving that there are many primes in bounded gaps. Their basic idea is easy to explain. We would ideally like to construct sieve weights that isolate the twin primes, but that’s too hard to do directly. So we instead adopt the following indirect approach.

The Prime Number Theorem gives a tool that takes as input the integers ${n}$ and filters out those that are not prime. And it is trivial to modify this into another tool that filters out everything but those ${n}$ such that ${n+h}$ is prime. I like to think in terms of the physical analogy of water (representing the integers) pouring through two filters that allow through only prime ${n}$ or prime ${n+h}$.

Now, what we really want to know is whether some number passes through both filters. We can’t detect this directly, but we would know it were true if so much stuff made it through the filters that there had to be some overlap. Unfortunately that is not the case, because the primes are quite sparse. But we can try to add another sieve on top of our two filters, one which has the effect of increasing the relative density of the primes. And this is precisely what Goldston, Pintz, and Yildirim do; their second sieve turns out to have roughly the effect of filtering out numbers with more than a few prime factors.

Let’s now turn this intuitive picture into mathematics. Suppose we want to show that there will be multiple primes in the set ${\{n+h_1, n+h_2, \ldots, n+h_k\}}$ for infinitely many values of ${n}$. Let ${\chi}$ denote the indicator function of the primes:

$\displaystyle \chi(n) = \begin{cases} 1 & \text{if }n \text{ is prime}, \\ 0 & \text{otherwise}. \end{cases}$

For an appropriate choice of weights ${w(n) \geq 0}$ we define the sum

$\displaystyle S_1 = \sum_{N \leq n \leq 2N} w(n),$

which we think of as the result of filtering ${[N,2N]}$ through the sieve weights ${w(n)}$. Then for each ${\ell}$ we consider the sum

$\displaystyle \qquad S_2^{(\ell)} = \sum_{N \leq n \leq 2N} w(n) \chi(n+h_{\ell})$

which is the result of layering the sieve weights ${w(n)}$ with another filter that takes out everything except the prime numbers (such a thing is prescribed by the Prime Number Theorem). If ${S_2^{(\ell)}}$ is pretty large compared to ${S_1}$ for all large ${N}$, which is to say that adding this extra prime filter didn’t make a big difference, then we can deduce something about primes in our interval ${[N, 2N]}$ (e.g. that there actually are some such primes). And if this holds across different values of ${\ell}$, then we can even deduce things about when there are multiple primes among ${\{ n + h_1, \ldots, n+h_k\}}$. For instance, consider the quantity

$\displaystyle \rho := \frac{ \sum_{\ell} S_2^{(\ell)} }{S_1}.$

If ${\rho > 1}$, then

$\displaystyle 0< -S_1 + \sum_{\ell} S_2^{(\ell)} = \sum_{N \leq n \leq 2N} w(n) \left( -1 + \sum_{\ell=1}^k \chi(n+h_{\ell}) \right).$

Since ${w(n) \geq 0}$, this forces ${ \sum_{\ell=1}^k \chi(n+h_{\ell}) > 1}$ for at least one ${n}$, which shows that there are at least two primes among ${\{n+h_1, \ldots, n+h_k\}}$. The challenge, then, is to choose good weights. Goldston-Pintz-Yildirim follow Selberg in selecting ${w(n)}$ to be of the form

$\displaystyle w(n) = \left( \sum_{d \mid (n+h_1) \ldots (n+h_k)} \lambda_d \right)^2.$

This takes care of the positivity requirement automatically, and leaves us with a pure optimization problem. However, the optimization problem is now much more difficult than it was for Selberg’s sieve, because we are now working with two different quadratic forms ${S_1}$ and ${S_2^{(\ell)}}$ that cannot be simultaneously diagonalized. It seems intractable to maximize their ratio in an infinite-dimensional space of functions, so instead we simply guess a specific functional form and tune parameters. It turns out that a good guess is to take ${\lambda_d = \mu(d) F \left( \frac{\log d }{\log R} \right)}$ for some large ${R}$ and some function ${F}$ supported in ${[0,1]}$.
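As a sanity check on this setup, here is a toy Python computation of the sums ${S_1}$ and ${S_2^{(\ell)}}$ with weights ${\lambda_d = \mu(d) F(\log d / \log R)}$. The choices ${F(x) = (1-x)^3}$, the shifts ${H = (0,4,6)}$, and the tiny parameters ${N, R}$ are illustrative assumptions; at this scale ${\rho}$ will fall well short of ${1}$, but everything is mechanically computable once ${F}$ is fixed.

```python
import math

def prime_flags(m):
    """Boolean sieve of Eratosthenes up to m."""
    flags = [True] * (m + 1)
    flags[0] = flags[1] = False
    for i in range(2, math.isqrt(m) + 1):
        if flags[i]:
            flags[i * i :: i] = [False] * len(flags[i * i :: i])
    return flags

def mobius_up_to(m):
    """Mobius function mu(1..m) by sieving over primes."""
    mu = [1] * (m + 1)
    flags = prime_flags(m)
    for p in range(2, m + 1):
        if flags[p]:
            for k in range(p, m + 1, p):
                mu[k] *= -1
            for k in range(p * p, m + 1, p * p):
                mu[k] = 0
    return mu

H = (0, 4, 6)          # an admissible triple of shifts h_1, h_2, h_3
N, R = 500, 25
mu = mobius_up_to(R)

def F(x):
    # Illustrative choice, assumed for this sketch (not the optimized F).
    return (1 - x) ** 3

lam = {d: mu[d] * F(math.log(d) / math.log(R))
       for d in range(1, R + 1) if mu[d] != 0}

def w(n):
    # d divides (n+h_1)...(n+h_k) iff the product of residues vanishes mod d.
    s = sum(l for d, l in lam.items()
            if math.prod((n + h) % d for h in H) % d == 0)
    return s * s

flags = prime_flags(2 * N + max(H))
S1 = sum(w(n) for n in range(N, 2 * N + 1))
S2 = [sum(w(n) for n in range(N, 2 * N + 1) if flags[n + h]) for h in H]
rho = sum(S2) / S1
```

Each ${S_2^{(\ell)}}$ is a sub-sum of ${S_1}$ with nonnegative terms, so ${\rho \leq k}$ automatically; the entire game is pushing it above ${1}$.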

With this choice, one can compute asymptotics for ${S_2^{(\ell)}/S_1}$ purely in terms of ${F}$. This is the main technical challenge in all the arguments, although it appears to be somewhat standard. There are a couple of ways of going about it. The method used by Goldston-Pintz-Yildirim relies on Perron’s formula, which gives a systematic way of extracting coefficient sums from ${L}$-functions. A somewhat slicker Fourier-theoretic formulation of this approach was introduced by Green and Tao, and used by Tao in his own work on bounded gaps between primes. The second way, used by Maynard, goes back to the original combinatorial method of Selberg. Here one explicitly diagonalizes the quadratic forms (separately) using arithmetic identities, and then applies general results on sums of multiplicative functions.

So what is the result of the computation? Recall that we were interested in the quantity ${\rho}$, and especially whether or not it could be made bigger than ${1}$. With the above choices, ${\rho}$ turns out to be governed by an important parameter ${\theta}$ called the “level of distribution.” The point is that we make some approximations in computing the sums ${S_1}$ and ${S_2^{(\ell)}}$, meaning that we are essentially working with only approximate filters. The level of distribution dictates the amount by which our approximate filters deviate from the exact ones, which sets the limit on how far we can sieve before the error terms grow out of control. In other words, eventually our imperfect filters will accumulate too much junk to give meaningful results, and the level of distribution describes when this will happen.

The Bombieri-Vinogradov Theorem asserts that the primes have level of distribution ${\theta = 1/2-\epsilon}$ for any ${\epsilon>0}$. It turns out that with any such ${\theta}$, ${\rho}$ is just barely too small. For any fixed ${\theta > \frac{1}{2}}$, which is just out of reach of all known methods, ${\rho}$ is large enough to ensure bounded gaps between primes. The Elliott-Halberstam Conjecture asserts that ${\theta}$ can be taken as large as ${1-\epsilon}$ for any ${\epsilon > 0}$.

3.2. The work of Yitang Zhang

We just said that one can deduce from the method of Goldston, Pintz, and Yildirim that ${\liminf_{n \rightarrow \infty} (p_{n+1}-p_n) < \infty}$ if the level of distribution satisfies ${\theta > \frac{1}{2}}$. Unfortunately, the best result currently available is ${\theta = \frac{1}{2}-\epsilon}$ for any positive ${\epsilon}$. We are just barely out of reach!

There are some good reasons why the ${\theta = \frac{1}{2}}$ barrier is difficult. To explain them, we should elaborate on what the level of distribution really means. Fix some modulus ${q}$, and imagine that we are interested in how evenly the primes are distributed among the residue classes mod ${q}$. Define ${\pi(X;q,a)}$ to be the number of primes ${p \leq X}$ which are congruent to ${a}$ modulo ${q}$. If ${\gcd(a,q)>1}$, then obviously ${\pi(X;q,a) \leq 1}$, since any prime in this class must divide ${q}$. But otherwise, we might guess by symmetry that each reduced residue class mod ${q}$ should see roughly the same number of primes, i.e. ${\pi(X;q,a) \approx \pi(X;q,b)}$ as long as ${\gcd(a,q) = \gcd(b,q)=1}$. And indeed, a generalized version of the Prime Number Theorem tells us that this is true in an asymptotic sense. The important question, though, is: how large is the deviation from uniformity?
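This equidistribution is easy to watch numerically. In the Python sketch below, the choices ${X = 10^5}$ and ${q = 7}$ are arbitrary illustrations: the class ${a = 0}$ contains only the prime ${7}$ itself, while the six reduced classes split the remaining primes almost evenly.

```python
import math

def prime_flags(m):
    """Boolean sieve of Eratosthenes up to m."""
    flags = [True] * (m + 1)
    flags[0] = flags[1] = False
    for i in range(2, math.isqrt(m) + 1):
        if flags[i]:
            flags[i * i :: i] = [False] * len(flags[i * i :: i])
    return flags

X, q = 100_000, 7
flags = prime_flags(X)

counts = [0] * q                      # counts[a] = pi(X; q, a)
for p in range(2, X + 1):
    if flags[p]:
        counts[p % q] += 1

# Classes a with gcd(a, q) > 1 see at most one prime (a divisor of q);
# the reduced classes share pi(X) almost evenly.
reduced = [c for a, c in enumerate(counts) if math.gcd(a, q) == 1]
```

Running this, the six reduced counts cluster tightly around ${\pi(X)/\varphi(q) \approx 9592/6}$, with fluctuations that look like square-root noise.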

If one models this problem with random variables, much as we did before in discussing the Riemann hypothesis, then one arrives at the guess of a square-root error. This is the Generalized Riemann Hypothesis, and it is far beyond the reach of current methods. Fortunately, in our case we aren’t concerned with the error for an individual ${q}$, but the average error over all ${q}$. In general averaging makes problems like these much easier, and the Bombieri-Vinogradov Theorem asserts that we can average up to ${N^{1/2-\epsilon}}$ without accumulating too much error. This is the best that one could deduce directly from the Generalized Riemann Hypothesis, so one can interpret the Bombieri-Vinogradov Theorem as saying that the Generalized Riemann Hypothesis is true on average. In particular, anything beyond Bombieri-Vinogradov is beyond the (immediate) reach of GRH (that is to say that it wouldn’t be a direct consequence of GRH, not that it is necessarily deeper). The Elliott-Halberstam conjecture predicts that we can average all the way up to $N^{1-\epsilon}$ without a problem, expressing the belief that the error terms for individual moduli behave like random noise and also that they cancel across the different moduli like random noise.
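One can also eyeball the averaged error numerically. The statistic below is a simplified proxy, an assumption of this sketch: the actual Bombieri-Vinogradov Theorem measures the error with the Chebyshev function ${\psi}$ and quantifies the savings in powers of ${\log X}$, while here we just average the worst-case deviation ${\max_a |\pi(X;q,a) - \pi(X)/\varphi(q)|}$ over moduli ${q}$ up to roughly ${X^{1/2}}$, and observe that it is tiny compared to ${\pi(X)}$.

```python
import math

def prime_flags(m):
    """Boolean sieve of Eratosthenes up to m."""
    flags = [True] * (m + 1)
    flags[0] = flags[1] = False
    for i in range(2, math.isqrt(m) + 1):
        if flags[i]:
            flags[i * i :: i] = [False] * len(flags[i * i :: i])
    return flags

X = 20_000
flags = prime_flags(X)
primes = [p for p in range(2, X + 1) if flags[p]]
piX = len(primes)

def max_error(q):
    """Worst deviation over reduced residues a of pi(X;q,a) from piX/phi(q)."""
    counts = [0] * q
    for p in primes:
        counts[p % q] += 1
    reduced = [a for a in range(1, q) if math.gcd(a, q) == 1]
    expected = piX / len(reduced)
    return max(abs(counts[a] - expected) for a in reduced)

Q = math.isqrt(X)                     # average moduli up to roughly X^{1/2}
avg = sum(max_error(q) for q in range(3, Q + 1)) / (Q - 2)
```

Even this crude average sits orders of magnitude below ${\pi(X)}$, which is the qualitative content of "GRH on average."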

In the 1980s, Bombieri, Friedlander, and Iwaniec explored ways of “extending” the Bombieri-Vinogradov Theorem that would still be useful for applications. Under a weaker notion of “error,” they were able to prove levels of distribution ${\theta > \frac{1}{2}}$. Unfortunately, nobody found a way to exploit their work in the sieves of interest. Later, Motohashi and Pintz outlined a program for how a different weakened version of “level of distribution” could be implemented in the Goldston-Pintz-Yildirim sieve.

Yitang Zhang’s work is a natural completion of these two threads. Following the Motohashi-Pintz program, Zhang showed that to apply the Goldston-Pintz-Yildirim sieve it suffices to restrict attention to “smooth” moduli ${q}$, meaning those whose prime factors are all small. Zhang then adapted the methods of Bombieri, Friedlander, and Iwaniec to show that the level of distribution for smooth moduli is greater than ${\frac{1}{2}}$. This is highly technical work, relying on the very deep Weil conjectures from algebraic geometry. It would be impressive coming from any mathematician, but from someone working in virtual isolation, who could not even secure a research position, it was nothing short of astounding.

3.3. The Maynard-Tao sieve

Just a few months after Zhang’s groundbreaking work, James Maynard announced results eclipsing everything that had come before. We won’t try to formulate his theorems in full strength here, but he is able to prove uniform existence results for primes being represented by any system of “admissible” linear forms. His work leads not only to much tighter numerical bounds than Zhang’s, but also implies, for instance, that there are infinitely many bounded intervals containing an arbitrarily large number of primes. And unlike Zhang’s argument, Maynard’s is completely elementary!

After fighting so hard for every inch of progress, Maynard’s solution is almost anticlimactic. Maynard and Terence Tao independently discovered that a simple modification to the Goldston-Pintz-Yildirim method led to a vastly improved sieve. Recall from earlier that the Goldston-Pintz-Yildirim sieve was determined by weights ${w(n)}$ defined as

$\displaystyle w(n) = \left( \sum_{d \mid (n+h_1) \ldots (n+h_k)} \lambda_d \right)^2. \ \ \ \ \ (2)$

Maynard and Tao propose the more general definition

$\displaystyle w(n) = \left( \sum_{\substack{d_1, \ldots, d_k \\ d_i \mid n+h_i \forall i }} \lambda_{d_1, \ldots, d_k} \right)^2. \ \ \ \ \ (3)$

The difference is essentially just that the Maynard-Tao sieve weights are parametrized by a function of several variables, whereas the Goldston-Pintz-Yildirim weights are parametrized by a function of a single variable.
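To see what changes computationally, here is a toy Python sketch of weights of the form (3). The multivariable choice of ${\lambda_{d_1,\ldots,d_k}}$ below (supported on pairwise-coprime squarefree tuples with ${d_1 \cdots d_k \leq R}$) is an illustrative assumption, not Maynard's optimized choice, as are the tiny parameters; the structural point is that the weight now depends on the whole tuple ${(d_1, \ldots, d_k)}$ rather than only on a single divisor.

```python
import math
from itertools import product

def prime_flags(m):
    """Boolean sieve of Eratosthenes up to m."""
    flags = [True] * (m + 1)
    flags[0] = flags[1] = False
    for i in range(2, math.isqrt(m) + 1):
        if flags[i]:
            flags[i * i :: i] = [False] * len(flags[i * i :: i])
    return flags

def mobius_up_to(m):
    """Mobius function mu(1..m) by sieving over primes."""
    mu = [1] * (m + 1)
    flags = prime_flags(m)
    for p in range(2, m + 1):
        if flags[p]:
            for k in range(p, m + 1, p):
                mu[k] *= -1
            for k in range(p * p, m + 1, p * p):
                mu[k] = 0
    return mu

H = (0, 4, 6)          # an admissible triple of shifts
N, R = 300, 20
mu = mobius_up_to(R)

def lam(ds):
    """Illustrative multivariable weight (an assumption, not Maynard's optimum),
    supported on pairwise-coprime squarefree tuples with d_1*...*d_k <= R."""
    if math.prod(ds) > R:
        return 0.0
    for i in range(len(ds)):
        for j in range(i + 1, len(ds)):
            if math.gcd(ds[i], ds[j]) > 1:
                return 0.0
    t = sum(math.log(d) for d in ds) / math.log(R)
    return math.prod(mu[d] for d in ds) * (1 - t) ** (len(ds) + 1)

def divisors_upto(m, bound):
    """Squarefree divisors of m up to bound."""
    return [d for d in range(1, bound + 1) if mu[d] != 0 and m % d == 0]

def w(n):
    s = sum(lam(ds)
            for ds in product(*(divisors_upto(n + h, R) for h in H)))
    return s * s

flags = prime_flags(2 * N + max(H))
S1 = sum(w(n) for n in range(N, 2 * N + 1))
S2 = [sum(w(n) for n in range(N, 2 * N + 1) if flags[n + h]) for h in H]
rho = sum(S2) / S1
```

At these toy parameters ${\rho}$ is of course far from its optimized value; the gain in the real argument comes from choosing ${\lambda_{d_1,\ldots,d_k}}$ via a genuinely multivariable function ${F}$ and letting ${k}$ grow.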

This generalization of the sieve weights is the key new ingredient. After selecting these weights, one analyzes sums ${S_1, S_2^{(\ell)}}$ much as before. The calculation is similar in outline, and although the technical details are more challenging they are, in the grand scheme of things, just a calculation. It is probably fair to say that there are many mathematicians who could have worked out the result if they were told that it could be done simply replacing (2) with (3).

The punchline is that the choice of weights (3) affords enough additional flexibility to make the key quantity ${\rho}$ we defined earlier as large as we want, if we are willing to take ${k}$ to be large enough. This means that we can bound ${\liminf_{n \rightarrow \infty} p_{n+m}-p_n}$ by a finite constant for any ${m}$. For any fixed ${m}$, optimizing this finite bound essentially reduces to a variational problem, in which one wants to maximize some functional on a certain space of functions. The details are fairly involved and painful, and are the focus of my essay.
