Hessian Ascent for the SK Model II
Optimizing the SK model via free probability and convex duality
This is the second blog post in a 3part series on a recent result [JSS24] by David, Jonathan and me. The first blog post can be found here.
A quick recap of what we accomplished in the first blog post:
 We briefly went over the various representations of the Parisi formula, and then derived the Hessian ascent algorithm from the generalized TAP free energy.
 We then motivated the two main statements we need to prove to demonstrate the success of our algorithm. We concluded by developing a primal theory for the Parisi PDE and AC SDE, which will be quintessential at various points in the main arguments.
Table of Contents
Proof Sketch
Analyzing the TAPcorrected Hessian via free probability
We now begin sketching the proof overview for the first main statement. As stated in the previous post, the goal is to create a matrix \(Q(t_j, \sigma_j)\) for every step \(j \in [K]\) with time \(t_j = j\eta\) that, with high probability,
 Projects into the topeigenspace of \(\nabla^2 \mathsf{TAP} :=2\beta A_{sym}  D(t_j,\sigma_j)\), and
 Has “wellbehaved” diagonal entries, so that we can assert convergence of the empirical distribution of the coordinates of every iterate of the algorithm to the primal version of the AC SDE in [2.2].
We proceed in chronological fashion, first giving a construction of \(Q\) and showing it projects into the topeigenspace, and then proving that this construction leads to diagonal entries with the desired behavior.
Projecting into the topeigenspace
To project into the toppart of the spectrum of \(\nabla^2 \mathsf{TAP}\), we will create an operator whose eigenspectrum is subtracted from the (idealized) operator norm of \(\nabla^2 \mathsf{TAP}\) and then inverted. Intuitively, this should make the large eigenvalues more important than the small eigenvalues, while preserving the eigenvectors. Sans normalization and a projection away from the current iterate \(\sigma_j\), therefore, a candidate choice is,
\[\begin{equation} P(D)^2 := b\left(b^2\mathsf{Id} + (\widetilde{a(D)}  (2\beta A_{sym}  D))^2\right)^{1}\,, \end{equation}\]where \(\widetilde{a(D)}\) is (approximately) the maximum value of the spectrum of \(\sqrt{2}\beta S  D(j\eta,\sigma_j)\), and \(S\) is an operator whose empirical eigenspectrum is the semicircular law. As we shall see, \(b\) will be chosen to be sufficiently small (but nonzero). It is not hard to see that as \(b \to 0\), the eigenvalues farthest from \(a(D)\) will receive the smallest weight, and the ones closest to \(\widetilde{a(D)}\) will receive the largest. Two quick points:
 This choice of \(\widetilde{a(D)}\) implicitly suggests that we will compare \(2\beta A_{sym}\) with \(\sqrt{2}\beta S\); indeed, this is one of the key parts in the proof strategy.
 We will use Gaussian concentration to compare \(P^2(D)\) with its expected value (under the Gaussian randomness), and show that this deviation is desirably small.
 We will then show that the expected value \(\mathbb{E}[P(D)^2]\) is close to the “idealized” value of
 The scalar \(b\) will be chosen to be sufficiently small so that it does not cause \(P(D)^2\) to deviate too much from projecting into the desired topeigenspace. However, it will be nonzero all the same, so that the operator \(P(D)^2\) continues to be welldefined.
Let us now write a brief and informal statement that summarizes two (of the three) key qualities the operator \(P(D)^2\)^{1} will have. In the statement below, we will assume that \(\frac{2\beta^2}{n}\mathsf{Tr}[D^{2}] = 1\).
\[\begin{align*} \text{[Idealized operator norm]:}& \quad\quad \mathsf{maxSpec}\left(\sqrt{2}\beta S  D\right) = \frac{2\beta^2}{n}\mathsf{Tr}[D^{1}]\,. \\ \text{[Largeoverlap with topeigenspace]:} & \quad\quad \frac{1}{n}\left\langle P(D)^2,(\widetilde{a(D)}(2\beta A_{sym}  D))^2\right\rangle \le \frac{O_\beta(\mathsf{Tr}[D^{4}])}{n^{1.03}}\,. \end{align*}\]The notation \(O_\beta(\cdot)\) suppresses the dependencies on \(\beta = O(1/\epsilon)\) for terms that decay sufficiently fast in \(n\). For technical reasons, \(\widetilde{a(D)}\) is not exactly \(\mathsf{maxSpec}\left(\sqrt{2}\beta S  D\right)\), but a small perturbation around it whereby,
\[\begin{equation} \left \widetilde{a(D)}  \frac{2\beta^2}{n}\mathsf{Tr}[D^{1}] \right \le \frac{O_\beta\left(\mathsf{Tr}[D^{3}]\right)}{n^{1.02}}\,. \end{equation}\]By putting the two statements above together with the observation that \(P(D)^2\) projects into the topeigenspace of \(\sqrt{2}\beta S  D\), one can reasonably believe that
\[\begin{equation} \frac{1}{n} \P(D)  \Pi_{\delta'}\left(2\beta A_{sym}  D\right)\_F^2 \le \mathsf{small} \,, \end{equation}\]where \(\Pi_{\delta'}\left(\sqrt{2}\beta S  D\right)\) is a projector into the top\(\delta' n\) dimensional eigenspace of the spectrum of \(2\beta A_{sym}  D\).
We will now give overviews of the proofs for the two statements above. Giving the actual detailed proofs would require introducing various definitions, technical facts and auxiliary lemmata about noncommutative \(\mathcal{L}^p\) spaces, the CauchyStieljtes transform (denoted for the spectral measure of an operator \(A\) as \(g_A(z)\)), the \(R\)transform and analytic subordination. Consequently, we will try to sketch the main ideas with minimal possible reference to the underlying formalism. That being said, it is simply impossible to explain even the ideas without invoking some basic terminology^{1}. Therefore, those readers that are satisfied with the two statements above may freely skip the rest of this subsection without any loss of ability to following the remaining two subsections in this post.
Idealized Operator Norm: We locate the maximum of the bulk spectrum of \(\sqrt{2}\beta S  D\). Let us lay down the strategy to do so:

Those familiar with resolvent formalism will immediately notice that \(P(D)^2\) is nothing but the (negative of the) imaginary part of the resolvent^{2} \(g_{2\beta A_{sym}  D}\), evaluated at a purely imaginary point \(ib\), which is a slight perturbation^{3} from \(0\). First, we compute the explicit form of a function^{4} which reduces the resolvent of the sum of two (free) matrices \(X\) and \(Y\) to the resolvent of just one matrix \(X\) evaluated at a shifted argument. This is a standard (and firsthalf of a remarkable) fact borrowed from [Bia98], and says that:
\[\begin{equation} \exists! f: \mathbb{H} \to \mathbb{H}, \text{s.t.},\,\text{i) }f(z) = z + O(1)\text{ for z large enough, and ii) }g_{X+Y}(z) = g_X(f(z))\,, \end{equation}\]
where \(\mathbb{H}\) denotes the upperhalf plane.

Using the explicit form of \(f^{1}\) for our specific case with \(X = D\) and \(Y = \sqrt{2}\beta S\) we will conclude that, under a normalization assumption about the trace of \(D^{2}\), the function \(f^{1}\) is strictly monotone in a certain region of the complex plane and, therefore, invertible.

At this point, we will reason about the analytic behavior of the resolvent of \(\sqrt{2}\beta SD\) at the shifted argument. In particular, we will demonstrate that the resolvent is analytic in \(\left(2\beta^2 g_{D}(0), \infty\right)\) to obtain an upper bound and then do some approximation to obtain a matching lower bound.
We begin by obtaining an explicit form of \(f^{1}\) and inferring basic analytic properties about (its injectivity and domain).
Note that the \(R\)transform is defined as \(r_X(z) = g^{1}_X(z)  1/z\), and that for of a sum of two free operators, it simplifies to \(r_{X+Y}(z) = r_X(z) + r_Y(z)\). With this, one quickly obtains that
for \(z\) in a neighborhood around \(0\).
Combining the above functional equation with the fact that the \(R\)transform of a semicircular operator with variance \(2\beta^2\) is \(2\beta^2 z\), and that \(g^{1}_{D}(z)\) will be in a neighborhood of \(\infty\), one can deduce that,
where we define \(w = g^{1}_{D}(z)\). Since \(w\) is sufficiently large^{5}, one can apply the magical fact of [Bia98] and conclude that, for \(w = f(z)\),
\[\begin{equation} g_{\sqrt{2}\beta S  D}(w + 2\beta^2g_{D}(w)) = g_{ D}(w) = g_{D}(f(z)) = g_{\sqrt{2}\beta S  D}(z)\, .\end{equation}\]Then, using the injectivity of \(g_{\sqrt{2}\beta S  D}\) itself, it straightforwardly follows that:
 \(f^{1}(z) = z + 2\beta^2 g_{D}(z)\), and
 \(f(\cdot)\) is injective.

In fact, with some more effort, one can conclude that,
\[\begin{equation} z \in \mathsf{dom}(f^{1}) \iff \mathsf{Im}\left(z + 2\beta^2 g_{D}(z)\right) > 0\,.\end{equation}\]
Consequently, the domain of \(f^{1}\) is a subset of the upperhalf (complex) plane where the imaginary part of \(2\beta^2 g_{D}(z)\) is “countered” by the imaginary part of \(z\) itself. Under the assumption that \(\frac{2\beta^2}{n}\mathsf{Tr}[D^{2}] = 1\), one may strengthen this observation via some simple algebra and conclude that \(\mathsf{sgn}\left(\mathsf{Im}(z)\right) = \mathsf{sgn}\left(\mathsf{Im}(f^{1}(z))\right)\). Combining this fact with basic properties about the resolvent allows one to conclude that \(f^{1}\) is injective on \(A = \{z = x + iy \mid x \in (0,\infty), y \in \mathbb{R} \}\).
At this point, the upper bound follows by using the injectivity of \(f^{1}\) in the desired region to conlcude that \(g_{D}\circ f^{1}\) agree on the region of injectivity, and that \(g_{D}\circ f^{1} = g_{\sqrt{2}\beta S  D}\) is analytic on the region \(f^{1}(A)\). It is a standard property of the CauchyStieljtes transform that \(f^{1}(\mathsf{Re}(A)) = \left(\frac{2\beta^2}{n}\mathsf{Tr}[D^{1}], \infty\right)\) cannot, then, be in the support of the spectral measure of \(\sqrt{2}\beta S  D\).
To obtain the lower bound, we do a Taylor expansion of \(f^{1}\) near \(0\), observing that the first derivative is \(0\) and the second derivative is strictly greater than \(0\). Using these observations, in conjunction with an approximation argument then demonstrates that the upper and lower values of \(g_{\sqrt{2}\beta S  D}\) do not agree on \((f^{1}(0)\delta, f^{1}(0))\) for a sufficiently small \(\delta\). This immediately yields that it is not analytic in the neighborhood. This closes the argument.
Largeoverlap with topeigenspace: We now focus on formally showing that the matrix \(P(D)^2\) will have large overlap with the topeigenspace of the shifted Hessian.
We will formally choose the distortion parameter \(b\), and explicitly compute the overlap between \(P(D)^2\) and the subspace spanned by the eigenvectors of \(2\beta A_{sym}  D\) that are small. Our goal will be to show that there is a valid choice of \(b\) where this overlap is desirably minimal, which immediately implies that \(P^2(D)\) must have large overlap with the topeigenspace. It is easy to see that, by the definition of \(P(D)^2\),
\[\begin{equation} \frac{1}{n}\left\langle P(D)^2, (\widetilde{a(D)}(2\beta A_{sym}  D))^2\right\rangle = \frac{1}{n}\mathsf{Tr}\left[b(b^2 + (\widetilde{a(D)}(2\beta A_{sym}  D))^2)^{1}(\widetilde{a(D)}(2\beta A_{sym}D))^{1}\right]\,. \end{equation}\]At this point one can use the fact that we will choose \(b > 0\) to assert that the above can be bounded as
\[\begin{equation} \frac{1}{n}\left\langle P(D)^2, (\widetilde{a(D)}(2\beta A_{sym}  D))^2\right\rangle \le \frac{b}{n}\mathsf{Tr}\left[\mathsf{Id}_n\right] = b\,, \end{equation}\]and, therefore, it now remains to define and bound \(b\). Recall from the calculation for the idealized operator norm that \(z \in \mathsf{dom}(f^{1}) \iff \mathsf{Im}(z + 2\beta^2 g_{D}(z)) > 0\). We will choose \(b = \mathsf{Im}(z + 2\beta^2 g_{D}(z))\) with \(z = ic = i\,\text{small}_n\). Then,
\[\begin{equation} \mathsf{Im}(ic + 2\beta^2 g_{D}(ic)) = c + 2\beta^2\mathsf{Im}(g_{D}(ic)) = c + \frac{2\beta^2}{n} \left(b\mathsf{Tr\left[(b^2\mathsf{Id}_n + D^2)^{1}\right]}\right)\, , \end{equation}\]which follows with some straightforward algebra and the definition of the resolvent. At this point, adding and subtracting \(D^{2}\) inside the trace and invoking the resolvent identity, one obtains,
\[\begin{equation} \mathsf{Im}(ic + 2\beta^2 g_{D}(ic)) = c\left(1\frac{2\beta^2}{n}\mathsf{Tr}\left[D^{2}\right] + \frac{2\beta^2}{n}\mathsf{Tr}\left[(b^2\mathsf{Id}_n + D^2)^{1}D^{2}\right]\right)\,. \end{equation}\]Using the assumption of the trace of \(\frac{2\beta^2}{n}D^{2}\) being normalized to \(1\) and the fact that \(b > 0\) immediately yields,
\[\begin{equation} b = \mathsf{Im}(ic + 2\beta^2 g_{D}(ic)) \le \frac{2\beta^2 c^{3}}{n}\mathsf{Tr}\left[D^{4}\right]\,.\end{equation}\]A choice of \(c = \beta n^{.01}\), along with the fact that our diagonal matrix coming from the entropic correction will be positivedefinite and satisfying a bound \(D^{1} \preceq (1 + O(\epsilon))\mathsf{Id}_n\) uniformly, will then yield the desired result.
Approximating the diagonal of the projector
We have shown that \(P(D)^2\) has desirably large overlap with the topeigenspace of the shifted Hessian, and that the idealized operator norm is \(\approx \frac{2\beta^2}{n}\mathsf{Tr}\left[D^{1}\right]\) uniformly over the choices of \(D\). We now give an overview of how to infer that the diagonal entries of \(P(D)^2\) behave desirably. This is essential to arranging the fact that the dynamics of the algorithm converge to the primal AuffingerChen SDE; without asserting this convergence, it is unclear how to prove that the energy achieved by the final iterate output by the algorithm is sufficiently large.
We do this via two steps:

Using the secondhalf of the remarkable result of [Bia98] to show that the conditional expectation of a free sum can be studied by looking at the algebra of one of the components. Specifically, the result tells us that the same subordination function \(f\) used above also preserves the projection of the free sum \(X + Y\) onto the diagonal algebra when \(X\) is an element of the same algebra. Namely,
\[\begin{equation} \mathsf{diag}[(z  X + Y)^{1}] = (f(z)  X)^{1} \,.\end{equation}\]The equation above will allow us to reduce studying \(\mathsf{diag}\left(\tilde{z}  (\sqrt{2}\beta S  D)\right)^{1}\) to simply studying \((f(\tilde{z}) + D)^{1}\).

Comparing the “idealized” freeprobability analog \(\mathsf{diag}(\tilde{z}  \sqrt{2}\beta S  D)^{1} = \left(f(\tilde{z})+ D\right)^{1}\) with the “true” randommatrix analog \(\mathsf{diag}\left(\tilde{z}  \left(2\beta A_{sym}  D\right)\right)^{1}\). This is done by using concentration of measure for Lipschitz functions of Gaussians to show that,
\[\begin{equation} \left\ \mathsf{diag}\left(\tilde{z}  \left(2\beta A_{sym}  D\right)\right)^{1}  \mathbb{E}\left[\mathsf{diag}\left(\tilde{z}  \left(2\beta A_{sym}  D\right)\right)^{1}\right] \right\_2 \le \text{small}_1\,, \end{equation}\]and then using a free interpolation, which may be regarded as a noncommutative version of Gaussian interpolation that is ubiquituous in spinglass theory, to obtain the fact that
\[\begin{equation} \left\ \left(f(\tilde{z})+ D\right)^{1}  \mathbb{E}\left[\mathsf{diag}\left(\tilde{z}  \left(2\beta A_{sym}  D\right)\right)^{1}\right] \right\_2 \le \text{small}_2\,. \end{equation}\]The free interpolation used to prove the above statement is the most technically involved part of section4 (and likely the paper). It extends an idea of Collins, Guionnet and Parraud [CGP22] which attempts to understand the quantitative strength of fluctuations between the expected value of \(\frac{1}{n}\mathsf{Tr}\left[f(X_n)\right]\) and \(\tau\left(f(S)\right)\); \(X_n\) is a finite random matrix, \(f\) is a reasonably “nice” function (not necessarily a polynomial), \(S\) is an idealized operator in a noncommutative probability space, and \(\tau\) is the tracial operator in this space. The free interpolation looks at
\[\begin{equation} h(t) := \mathbb{E}\left[\frac{1}{n}\mathsf{Tr}\left[(zI_{nk}  2\beta[\sqrt{1t}A_{sym} + \sqrt{t}B_{sym}]+ D\otimes I_k)^{1}(D' \otimes I_k)\right]\right]\,, \end{equation}\]with
\[\begin{equation} h(0) = \mathbb{E}\left[\frac{1}{n}\mathsf{Tr}\left[(zI_n 2\beta A_{sym} + D)^{1}(D' \otimes I_k)\right]\right]\,, \end{equation}\]and
\[\begin{equation} h(1) = \mathbb{E}\left[\frac{1}{n}\mathsf{Tr}\left[(zI_{nk}  \beta(B + B^{\mathsf{T}}) + D)^{1}(D' \otimes I_k)\right]\right]\,. \end{equation}\]Here, \(h(1)\) becomes a semicircular element as the limit \(k \to \infty\) is taken at the end of the argument and almostsurely convergence is invoked. The element \(h(0)\) is the quantity we want to bound, and since we know the limiting spectrum of \(h(1)\), it remains to show the timederivative of \(h(t)\) is sufficiently small and apply the fundamental theorem of calculus. We apply a lemma about the resolvent to expand the derivative \(\frac{d}{dt} h(t)\); we then evaluate the “easy” term using Gaussian integrationbyparts, and use various Lipschitz estimates on the resolvent \(G(z,t) := \left(zI_{nk}  2\beta(\sqrt{1t}\,A_{sym} + \sqrt{t}\,B_{sym}) + D \otimes I_k\right)^{1}\) and the conjugatedresolvent \(h(t)\) in conjunction with the Poincare inequality to estimate the “hard” term.
The above line of reasoning, with a choice of \(\tilde{z} = f^{1}(z) = z + 2\beta^2 g_{D}(z)\) where \(z = a + ib = \widetilde{a(D)} + ic\), leads to the conclusion that
\[\begin{equation} \ \mathsf{diag}\left(\tilde{z}  \left(2\beta A_{sym}  D\right)\right)^{1}  (z + D)^{1}\_2 \le \text{small}_1 + \text{small}_2 \,,\end{equation}\]at which point taking the imaginary component of the operators immediately yields the desired result,
\[\begin{equation} \text{[Diagonal entries]} \quad\quad \left\ \mathsf{diag}(P(D)^2)  c(c^2 + D^2)^{1}\right\_2 \le \text{small}_1 + \text{small}_2\,.\end{equation}\]Empirical distribution of the coordinates of the iterates
With a clear construction of the projection operator \(P(D)^2\), and proofs that it projects into the topeigenspace with desirable diagonal behavior, we are ready to demonstrate that the empirical distribution of the coordinates of the sequence of iterates
\[\begin{equation} \left(\sigma_1,\dots,\sigma_K\right) \sim \left(\mathcal{N}\left(0, P(D_1)^2\right), \dots, \mathcal{N}\left(0,P(D_k)^2\right)\right)\, , \end{equation}\]will approximate, at every step (with high probability), the distribution of the primal AuffingerChen SDE \(dY^\gamma_t = \frac{\sqrt{2}\beta}{\partial_{2,2}\Lambda_\gamma(t, Y^\gamma_t)}dW_t\). We will set an external “clock” where \(t = j\eta\) for some sufficiently small \(\eta(\epsilon)\) and \(j \in [K]\).
To demonstrate that the empirical distribution converges to the desired SDE, we will use an inductive argument and follow a strategy of, once again, using the Lipschitzness of the underlying SDE to show concentration of the empirical distribution around its expectation. We will then compare the expected empirical distribution of the coordinates under the inductive hypothesis with a discretization of a step of the primal AC SDE.
To start, we renormalize \(P(D)^2\) so that its (normalized) trace is \(1 \pm o_n(1)\); we also project \(P(D)^2\) orthogonal to the current iterate \(\sigma_i\). This normalization will be helpful in the convergence computations, and the projection will be particularly useful in the energy analysis that follows. The final covariance matrices \(\{Q_i(D)^2\}_{i \in [K]}\) are then given as,
\[\begin{equation} Q_i(D)^2 := 2\beta n^{.01}\Pi_{\sigma_i^{\perp}}P(D)^2\Pi_{\sigma_i^{\perp}}\,, \end{equation}\]where \(D := \left(\frac{2\beta^2}{n}\sum_{j=1}^n \partial_{2,2}\Lambda_\gamma\left(i\eta,(\sigma_i)_j\right)^{2}\right)^{1/2}\mathsf{diag}\left[\partial_{2,2}\Lambda_\gamma(i\eta,,\sigma_i)\right]\). It is straightforward to verify with the new definition and using the three conclusions of the prior section that:

The normalized trace is,
\[\begin{equation} \frac{1}{n}\mathsf{Tr}\left[Q_i(D)^2\right] = 1 \pm O_\beta(n^{.01})\,.\end{equation}\] 
The diagonal of \(Q_i(D)^2\) is close to \(D^{2}\),
\[\begin{equation} \left\ \mathsf{diag}\left[Q_i(D)^2\right]  2\beta^2 D^{2}\right\_2 \le O_\beta\left(n^{.49}\right)\,.\end{equation}\]
In fact, we can also assert that the diagonal of \(Q_i(D)^2\) has \(2\)norm that is of \(O_\beta(\sqrt{n})\) and that the operator norm of the covariance matrix is \(O(n^{.04})\). As stated above we will proceed in two steps:

Consider a fixed iterate \(\sigma_i\) with desirably small Wasserstein distance to the distribution of \(Y_{i\eta}\). Then, we first show that,
\[\begin{equation} d_{W,2}\left(\mathbb{E}\left[\mathsf{emp}(\sigma_{i+1})\right], Y^\gamma_{(i+1)\eta}\right)^2 \le d_{W,2}\left(\mathsf{emp}(\sigma_{i}), Y^\gamma_{i\eta}\right)^2 + \text{small}_{\beta}\left(\eta,\epsilon\right)\,,\end{equation}\]To show this, we first combine a standard estimate using Gronwall’s inequality, the Lipschitzness of the driving function \(\frac{\sqrt{2}\beta}{\partial_{2,2}\Lambda_\gamma(t,y)}\), and an application of Itoisometry to conclude that,
\[\begin{equation} \left\Y^\gamma_{(i+1)\eta}  \left(Y^\gamma_{i\eta} + \frac{\sqrt{2}\beta}{\partial_{2,2}\Lambda_\gamma(t,Y^\gamma_t)\vert_{t=i\eta}}\left(W_{(i+1)\eta}  W_{i\eta}\right)\right)\right\^2_2 \le O_\beta(\eta^2)\,,\end{equation}\]allowing us to approximate the primal AC SDE in small, discretized time increments with sufficiently small quantitative error. At this point, the inductive hypothesis kicks in to accumulate the errors from the previous step, provided we can bound the action of (Lipschitz) test functions on the iterate \(\sigma_{i+1} = \sigma_i + \sqrt{\eta}Q_i(D)z_i\) where \(z_i \sim \mathcal{N}(0,\mathsf{Id}_n)\).
They key point here is to notice that \(\sqrt{\eta}(Q_i(D)z_i)_j\) can be replaced by \(\left(Q^2_i(D)\right)_{j,j}^{1/2} W\) with \(W \sim \mathcal{N}(0,\eta)\) while retaining equality in distribution. This allows us to study the normalized counting measure on the coordinates \([n]\). Letting \(\Sigma\) denote the random variable that takes value \((\sigma_i)_j\) on \(j \in [n]\), and \(\Pi\) be one that takes value \(\left(Q^2_i(D)\right)_{j,j}^{1/2} W\) on \(j \in [n]\), a simple rewrite then implies that,
\[\begin{equation} \mathbb{E}[\mathsf{emp}(\sigma_{i+1})] \overset{d}{=} \Sigma + \Pi W\, , \end{equation}\]and consequently, optimally coupling \(\Sigma\) and \(Y^{\gamma}_{i\eta}\),
\[\begin{equation} \\Sigma  Y^\gamma_{i\eta}\_2 = d_{W,2}\left(\mathsf{emp}(\sigma_i), Y^\gamma_i\right)\,. \end{equation}\]At this point, we use the definition of the \(W_2\) distance in conjunction with a triangle inequality to conclude
\[\begin{align*} d_{W,2}\left(\mathbb{E}\left[\mathsf{emp}(\sigma_{i+1})\right], Y^\gamma_{(i+1)\eta}\right) &\le \(\Sigma + \Pi W) Y^\gamma_{(i+1)\eta}\_2 \\ &\le \ \Sigma  Y^\gamma_{i\eta}\_2 + \ \Pi W  (Y^\gamma_{(i+1)\eta}  Y^\gamma_{i\eta})\_2\,. \end{align*}\]Now, we apply the inductive hypothesis, and use the discretized primal AC SDE estimate above in conjunction with the properties of \(Q_i(D)^2\) mentioned above to conclude that
\[\begin{equation} d_{W,2}\left(\mathbb{E}\left[\mathsf{emp}(\sigma_{i+1})\right], Y^\gamma_{(i+1)\eta}\right) \le \text{small}_{\beta}(\eta,\gamma) + o_n(1)\,. \end{equation}\] 
Having shown this, we use concentration of measure arguments to argue that the empirical distribution doesn’t fluctuate a lot. Namely,
\[\begin{equation} d_{W,2}\left(\mathsf{emp}\left(\sigma_{i+1}\right), \mathbb{E}\left[\mathsf{emp}\left(\sigma_{i+1}\right)\right]\right) \le o_n(1)\,.\end{equation}\]To do this we use the fact that we can upper bound, with high probability over the randomness of the iterate, the maximum value of every coordinate. Then, since the operator norm of the covariance is bounded, we can apply Lipschitz test functions on a single updated iterate and project the underlying space into a large enough cube which contains the current iterate. On doing this, one can use concentration for Lipschitz functions to bound the behavior of \(1\)Lipschitz functions, and make the strength of the concentration fight the size of a net over all Lipschitz functions.
From the ingredients above, the amount of energy gained in each step follows tediously (but conceptually relatively straightforwardly) by doing a Taylor expansion of the modified objective. The critical observation is to note that the secondorder space term is made to interact with the firstorder time term in the expansion. In doing the analysis, various terms pop up. Each term needs careful handling and is estimated by using the primal Parisi PDE expansion on the firstorder time derivative, followed by invocations of various Gaussian concentration arguments, estimates using the convergence to the primal AC SDE, and the “largeeigenspace” alignment of the entropy corrected Hessian. The complete details are relegated to [Section 6, JSS24] in the paper.
FOOTNOTES

I will introduce the relevant complexanalytic basics and the relevant freeprobability notions in a separate blogpost. In the very same blogpost, I will provide sufficient background so that a motivated reader can actually follow most of the proofs in detail (except for the proof which reasons about the projection of \(P(D)^2\) onto the diagonal subalgebra, which I will introduce in yet another post). For the impatient reader, [Section 4.1, JSS24] should also act as an intense, but reasonably accessible proxy. ↩ ↩^{2}

See here for a definition of the resolvent, and a statement of the famous resolvent identity. ↩

Since, for a compactly supported measure, which we expect the empirical eigenspectrum of \(2\beta A_{sym}  D\) to converge to, the Cauchy transform (which is equivalent to the resolvent) is invertible everywhere except at poles, this is a lossless way of studying analytic properties of the spectrum. ↩

A function such as this is known in the literature as a subordination function. In fact, this function can be extended and shown to be analytic on \(\bar{\mathbb{H}}\). It allows one to, in some sense, “separate” the support of the spectrum of the two freely independent operators by pushing the domain of one inside the resolvent to be significantly farther than the other. ↩

We work in the free product of the algebras generated by a semicircular operator \(S\) and an element \(D \in \text{M}_n(\mathbb{C})\) with \(0 < \ D\_{\mathsf{op}} \le O(1)\). ↩