1 Introduction
Many modern learning tasks involve sampling from a high-dimensional and large-scale distribution, which calls for algorithms that are scalable in both the dimension and the data size. One approach [welling2011bayesian] that has found wide success is to discretize the Langevin Dynamics:
(1.1) $\mathrm{d}x_t = -\nabla V(x_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,$
where $e^{-V}$ represents a target distribution and $W_t$ is a $d$-dimensional Brownian motion. Such a framework has inspired numerous first-order sampling algorithms [ahn2012bayesian, chen2014stochastic, ding2014bayesian, durmus2016stochastic, lan2016sampling, liu2016stochastic, patterson2013stochastic, simsekli2016stochastic], and the convergence rates are by now well-understood for unconstrained and log-concave distributions [cheng2017convergence, dalalyan2017user, durmus2018analysis].
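To make the discretization concrete, here is a minimal sketch of the resulting unadjusted Langevin algorithm; the function names and the standard-Gaussian test target are ours, purely for illustration:

```python
import numpy as np

def ula(grad_V, x0, step, n_iters, rng):
    """Unadjusted Langevin algorithm: the Euler-Maruyama discretization of (1.1),
    x_{k+1} = x_k - step * grad_V(x_k) + sqrt(2 * step) * xi_k."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_iters):
        xi = rng.standard_normal(x.shape)          # i.i.d. standard Gaussian noise
        x = x - step * grad_V(x) + np.sqrt(2.0 * step) * xi
        samples.append(x.copy())
    return np.array(samples)

# Illustration: target proportional to e^{-V} with V(x) = ||x||^2 / 2
# (standard Gaussian), so grad_V(x) = x.
rng = np.random.default_rng(0)
samples = ula(lambda x: x, x0=np.zeros(2), step=0.1, n_iters=5000, rng=rng)
```

After a burn-in period, the empirical moments of the iterates approximate those of the target, up to a discretization bias controlled by the step-size.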
However, applying (1.1) to sampling from constrained distributions (i.e., when the target has a bounded convex domain) remains a difficult challenge. From the theoretical perspective, there are only two existing algorithms [brosse17sampling, bubeck2015sampling] that possess non-asymptotic guarantees, and their rates are significantly worse than in the unconstrained scenario under the same assumptions; cf. Table 1. Furthermore, many important constrained distributions are inherently non-log-concave. A prominent instance is the Dirichlet posterior, which, in spite of the presence of several tailor-made first-order algorithms [lan2016sampling, patterson2013stochastic], still lacks a non-asymptotic guarantee.
In this paper, we aim to bridge these two gaps at the same time. For general constrained distributions with a strongly convex potential $V$, we prove the existence of a first-order algorithm that achieves the same convergence rates as if there were no constraint at all, suggesting that the state of the art for constrained sampling can be brought down to match the unconstrained rates. When specialized to the important case of the simplex constraint, we provide the first non-asymptotic guarantees for Dirichlet posteriors, both for the deterministic and for the stochastic version of our algorithms; cf. Examples 1 and 2 for the involved parameters.
Our framework combines ideas from the Mirror Descent [beck2003mirror, nemirovsky1983problem] algorithm for optimization and the theory of Optimal Transport [villani2008optimal]. Concretely, for constrained sampling problems, we propose to use the mirror map to transform the target into an unconstrained distribution, whereby many existing methods apply. Optimal Transport theory then comes in handy to relate the convergence rates between the original and transformed problems. For simplex constraints, we use the entropic mirror map to design practical first-order algorithms that possess rigorous guarantees and are amenable to mini-batch extensions.
The rest of the paper is organized as follows. We briefly review the notion of push-forward measures in Section 2. In Section 3, we propose the Mirrored Langevin Dynamics and prove its convergence rates for constrained sampling problems. Mini-batch extensions are derived in Section 4. Finally, in Section 5, we provide synthetic and real-world experiments to demonstrate the empirical efficiency of our algorithms.
1.1 Related Work
First-Order Sampling Schemes with Langevin Dynamics: There exists a large body of literature on (stochastic) first-order sampling schemes derived from Langevin Dynamics or its variants [ahn2012bayesian, brosse17sampling, bubeck2015sampling, chen2015convergence, cheng2017convergence, cheng2017underdamped, dalalyan2017user, durmus2018analysis, dwivedi2018log, luu2017sampling, patterson2013stochastic, welling2011bayesian]. However, to our knowledge, this work is the first to consider mirror descent extensions of the Langevin Dynamics.
The authors in [ma2015complete] proposed a formalism that can, in principle, incorporate any variant of Langevin Dynamics for a given target distribution. The Mirrored Langevin Dynamics, however, targets the push-forward measure (see Section 3.1), and hence our framework is not covered by [ma2015complete].
For Dirichlet posteriors, a variable transformation similar to our entropic mirror map appears in [patterson2013stochastic] (see the "reduced-natural parametrization" therein). The dynamics in [patterson2013stochastic] is nonetheless drastically different from ours, as there is a position-dependent matrix multiplying the Brownian motion, whereas our dynamics has no such feature; see (3.2).
Mirror Descent-Type Dynamics for Stochastic Optimization: Although there is some existing work on mirror descent-type dynamics for stochastic optimization [krichene2017acceleration, mertikopoulos2018convergence, raginsky2012continuous, xu2018accelerated], we are unaware of any prior result on sampling.
Convergence Rates for Sampling from Dirichlet Posteriors: The work [dai2016provable] proposed a zeroth-order method that achieves convergence in relative entropy for Dirichlet posteriors, but at a substantially higher per-iteration computational cost. Our method achieves the same rate with a much cheaper per-iteration complexity.
2 Preliminaries
2.1 Notation
In this paper, all Lipschitzness and strong convexity are with respect to the Euclidean norm $\|\cdot\|$. We use $C^k$ to denote $k$-times differentiable functions with continuous $k$-th derivative. The Fenchel dual [rockafellar2015convex] of a function $h$ is denoted by $h^*$. Given two mappings $f$ and $g$ of proper dimensions, we denote their composite map by $f \circ g$. For a probability measure $\mu$, we write $X \sim \mu$ to mean that $X$ is a random variable whose probability law is $\mu$.
2.2 Push-Forward and Optimal Transport
Let $\mu$ be a probability measure with support $\mathcal{X}$, and let $h$ be a convex function on $\mathcal{X}$. Throughout the paper we assume:
Assumption 1.
$h$ is closed, proper, and $C^2$ on $\mathcal{X}$.
Assumption 2.
All measures have finite second moments.
Assumption 3.
All measures vanish on sets with Hausdorff dimension [mandelbrot1983fractal] at most $d-1$.
The gradient map $\nabla h$ induces a new probability measure $\nu$ through $\nu(E) \triangleq \mu\big((\nabla h)^{-1}(E)\big)$ for every Borel set $E$ on $\mathbb{R}^d$. We say that $\nu$ is the push-forward measure of $\mu$ under $\nabla h$, and we denote it by $\nu = \nabla h \# \mu$. If $\mu = e^{-V}$ and $\nu = e^{-W}$, we will sometimes abuse the notation by writing $e^{-W} = \nabla h \# e^{-V}$.
If $e^{-W} = \nabla h \# e^{-V}$, the triplet $(V, W, h)$ must satisfy the Monge-Ampère equation:
(2.1) $e^{-V(x)} = e^{-W(\nabla h(x))} \det \nabla^2 h(x).$
Using $\nabla h^* = (\nabla h)^{-1}$ and $\det \nabla^2 h^*(y) = \big(\det \nabla^2 h(\nabla h^*(y))\big)^{-1}$, we see that (2.1) is equivalent to
(2.2) $e^{-W(y)} = e^{-V(\nabla h^*(y))} \det \nabla^2 h^*(y),$
which implies $W(y) = V\big(\nabla h^*(y)\big) - \log \det \nabla^2 h^*(y)$.
3 Mirrored Langevin Dynamics
This section demonstrates a framework for transforming constrained sampling problems into unconstrained ones. We then focus on applications to sampling from strongly log-concave distributions and simplex-constrained distributions, even though the framework itself is more general.
3.1 Motivation and Algorithm
We begin by briefly recalling the mirror descent (MD) algorithm for optimization. In order to minimize a function $f$ over a bounded domain, say the probability simplex, MD uses a mirror map $h$ to transform the primal variable $x$ into the dual space via $y = \nabla h(x)$, and then performs gradient updates in the dual: $y_{k+1} = \nabla h(x_k) - \beta\,\nabla f(x_k)$, $x_{k+1} = \nabla h^*(y_{k+1})$, for some step-size $\beta$. The mirror map is chosen to adapt to the geometry of the constraint set, which can often lead to faster convergence [nemirovsky1983problem] or, more pivotal to this work, an unconstrained optimization problem [beck2003mirror].
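For example, the MD recursion with the entropic mirror map on the probability simplex reduces to multiplicative weights; the helper below is our own illustrative sketch, with a linear test objective:

```python
import numpy as np

def entropic_md(grad_f, x0, step, n_iters):
    """Mirror descent with the entropic mirror map on the simplex:
    dual step y = grad_h(x) - step * grad_f(x), then x = grad_h*(y) (a softmax)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        y = np.log(x) - step * grad_f(x)   # gradient update in the dual space
        z = np.exp(y - y.max())            # map back to the primal: softmax
        x = z / z.sum()
    return x

# Illustration: minimize f(x) = <c, x> over the simplex; the minimizer puts
# all mass on the coordinate with the smallest cost (index 1 here).
c = np.array([3.0, 1.0, 2.0])
x_star = entropic_md(lambda x: c, x0=np.ones(3) / 3, step=0.5, n_iters=200)
```

Note how the primal iterates remain feasible at every step even though the gradient updates themselves are entirely unconstrained.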
Inspired by the MD framework, we would like to use the mirror map idea to remove the constraint for sampling problems. Toward this end, we first establish a simple fact [villani2003topics]:
Theorem 1.
Let $h$ satisfy Assumption 1. Suppose that $X \sim e^{-V}$ and $e^{-W} = \nabla h \# e^{-V}$. Then $\nabla h(X) \sim e^{-W}$ and, conversely, $\nabla h^*(Y) \sim e^{-V}$ for any $Y \sim e^{-W}$.
Proof.
Since $\nabla h$ is one-to-one, $X \in (\nabla h)^{-1}(E)$ if and only if $\nabla h(X) \in E$, and hence, for any Borel set $E$, we have $\mathbb{P}\big(\nabla h(X) \in E\big) = \mathbb{P}\big(X \in (\nabla h)^{-1}(E)\big) = \big(\nabla h \# e^{-V}\big)(E)$. ∎
In the context of sampling, Theorem 1 suggests the following simple procedure: For any target distribution $e^{-V}$ with support $\mathcal{X}$, we choose a mirror map $h$ on $\mathcal{X}$ satisfying Assumption 1, and we consider the dual distribution $e^{-W}$ associated with $e^{-V}$ and $h$:
(3.1) $e^{-W} \triangleq \nabla h \# e^{-V}.$
Theorem 1 dictates that if we are able to draw a sample $Y$ from $e^{-W}$, then $\nabla h^*(Y)$ immediately gives a sample for the desired distribution $e^{-V}$. Furthermore, suppose for the moment that $\nabla h(\mathcal{X}) = \mathbb{R}^d$, so that $e^{-W}$ is unconstrained. Then we can simply exploit the classical Langevin Dynamics (1.1) to efficiently take samples from $e^{-W}$.
The above reasoning leads us to set up the Mirrored Langevin Dynamics (MLD):
(3.2) $\mathrm{d}y_t = -\nabla W(y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \qquad x_t = \nabla h^*(y_t).$
Notice that the stationary distribution of $y_t$ in MLD is $e^{-W}$, since $y_t$ is nothing but the Langevin Dynamics (1.1) with potential $W$. As a result, we have $x_t = \nabla h^*(y_t) \to \nabla h^* \# e^{-W} = e^{-V}$.
Using (2.1), we can equivalently write the term $\nabla W(y_t)$ in (3.2) as $\nabla W(y_t) = \big(\nabla^2 h(x_t)\big)^{-1}\big(\nabla V(x_t) - \nabla \log \det \nabla^2 h(x_t)\big)$.
In order to arrive at a practical algorithm, we then discretize the MLD, giving rise to the following equivalent iterations:
(3.3) $y_{k+1} = y_k - \beta_k\,\nabla W(y_k) + \sqrt{2\beta_k}\,\xi_k, \qquad x_{k+1} = \nabla h^*(y_{k+1}),$
or, equivalently, $y_{k+1} = \nabla h(x_k) - \beta_k \big(\nabla^2 h(x_k)\big)^{-1}\big(\nabla V(x_k) - \nabla \log \det \nabla^2 h(x_k)\big) + \sqrt{2\beta_k}\,\xi_k$, $x_{k+1} = \nabla h^*(y_{k+1})$,
where in both cases $y_0 = \nabla h(x_0)$, the $\xi_k$'s are i.i.d. standard Gaussian vectors, and the $\beta_k$'s are step-sizes. The first formulation in (3.3) is useful when $W$ has a tractable form, while the second one can be computed using solely the information of $V$ and $h$.
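A minimal sketch of the first formulation in (3.3), assuming oracles for $\nabla W$ and $\nabla h^*$; as a sanity check we take $h(x) = \|x\|^2/2$, for which $\nabla h^*$ is the identity, $W = V$, and MLD collapses to the unadjusted Langevin algorithm (all names below are ours):

```python
import numpy as np

def mld_step(y, grad_W, nabla_h_star, step, rng):
    """One iteration of the first formulation in (3.3): a Langevin step on the
    dual potential W, then a pull-back to the primal space via the mirror map."""
    xi = rng.standard_normal(y.shape)              # i.i.d. standard Gaussian
    y_next = y - step * grad_W(y) + np.sqrt(2.0 * step) * xi
    x_next = nabla_h_star(y_next)
    return y_next, x_next

# Sanity check with h(x) = ||x||^2 / 2: nabla h* is the identity, W = V,
# and MLD reduces to ULA. Target: standard Gaussian, so grad_W(y) = y.
rng = np.random.default_rng(1)
y = np.zeros(3)
xs = []
for _ in range(4000):
    y, x = mld_step(y, grad_W=lambda u: u, nabla_h_star=lambda u: u, step=0.1, rng=rng)
    xs.append(x)
xs = np.array(xs)
```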
Next, we turn to the convergence of the discretized MLD. Since $y_t$ in (3.2) is the classical Langevin Dynamics, and since we have assumed that $e^{-W}$ is unconstrained, it is typically not difficult to prove the convergence of the iterates $y_k$ to $e^{-W}$. However, what we ultimately care about is the guarantee on the primal distribution $e^{-V}$. The purpose of the next theorem is to fill the gap between primal and dual convergence.
We consider the three most common metrics for evaluating approximate sampling schemes, namely the 2-Wasserstein distance $\mathcal{W}_2$, the total variation $\mathrm{TV}$, and the relative entropy $D(\cdot\,\|\,\cdot)$.
Theorem 2 (Convergence in the dual implies convergence in the primal).
Let $h$ satisfy Assumption 1, let $e^{-W} = \nabla h \# e^{-V}$, and let $\hat{\nu}$ be any distribution with $\hat{\mu} \triangleq \nabla h^* \# \hat{\nu}$. Then $\mathrm{TV}(\hat{\mu}, e^{-V}) = \mathrm{TV}(\hat{\nu}, e^{-W})$ and $D(\hat{\mu}\,\|\,e^{-V}) = D(\hat{\nu}\,\|\,e^{-W})$. If, furthermore, $h$ is $\rho$-strongly convex, then $\mathcal{W}_2(\hat{\mu}, e^{-V}) \le \rho^{-1}\,\mathcal{W}_2(\hat{\nu}, e^{-W})$.
Proof.
See Appendix A. ∎
3.2 Applications to Sampling from Constrained Distributions
We now consider applications of MLD. For strongly log-concave distributions with general constraints, we prove rates matching those of the unconstrained case; see Section 3.2.1. In Section 3.2.2, we consider the important case where the constraint is a probability simplex.
3.2.1 Sampling from a strongly log-concave distribution with constraint
As alluded to in the introduction, the existing convergence rates for constrained distributions are significantly worse than their unconstrained counterparts; see Table 1 for a comparison.
Assumption              Algorithm
unknown    unknown      MYULA [brosse17sampling]
unknown    unknown      PLMC [bubeck2015sampling]
                        MLD (this work)
                        Langevin Dynamics, unconstrained [cheng2017convergence, dalalyan2017theoretical, durmus2018analysis]
The main result of this subsection is the existence of a “good” mirror map for arbitrary constraint, with which the dual distribution becomes unconstrained:
Theorem 3 (Existence of a good mirror map for MLD).
Let $e^{-V}$ be a probability measure with bounded convex support, such that $V$ is strongly convex and smooth, and $e^{-V}$ is bounded away from $0$ in the interior of the support. Then there exists a mirror map $h$ such that the discretized MLD (3.3) yields the same convergence rates as in the unconstrained setting.
Proof.
See Appendix B. ∎
Remark 1.
We remark that Theorem 3 is only an existential result, not an actual algorithm. Practical algorithms are considered in the next subsection.
3.2.2 Sampling Algorithms on the Simplex
We apply the discretized MLD (3.3) to the task of sampling from distributions on the probability simplex $\Delta_d \triangleq \{x \in \mathbb{R}^d : x_i \ge 0,\ \sum_{i=1}^{d} x_i \le 1\}$, which is instrumental in many fields of machine learning and statistics.
On a simplex, the most natural choice of mirror map $h$ is the entropic mirror map [beck2003mirror], which is well-known to be 1-strongly convex:
(3.4) $h(x) = \sum_{i=1}^{d} x_i \log x_i + \Big(1 - \sum_{i=1}^{d} x_i\Big)\log\Big(1 - \sum_{i=1}^{d} x_i\Big).$
In this case, the associated dual distribution can be computed explicitly.
Lemma 1 (Sampling on a simplex with entropic mirror map).
Let $e^{-V}$ be the target distribution on $\Delta_d$, let $h$ be the entropic mirror map (3.4), and let $e^{-W} = \nabla h \# e^{-V}$. Then the potential $W$ of the push-forward measure admits the expression
(3.5) $W(y) = V\big(\nabla h^*(y)\big) + (d+1)\,h^*(y) - \sum_{i=1}^{d} y_i,$
where $h^*(y) = \log\big(1 + \sum_{i=1}^{d} e^{y_i}\big)$ is the Fenchel dual of $h$, which is strictly convex and has 1-Lipschitz gradient.
Proof.
See Appendix C. ∎
Crucially, we have $\nabla h(\Delta_d) = \mathbb{R}^d$, so that the Langevin Dynamics for $e^{-W}$ is unconstrained.
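For concreteness, the forward and inverse gradient maps of the entropic mirror map, written in the $d$ free coordinates, can be sketched as follows (function names are ours; overflow handling for large dual variables is discussed in Section 5.2.2):

```python
import numpy as np

def grad_h(x):
    """(grad h(x))_i = log x_i - log x_{d+1}, where x_{d+1} = 1 - sum(x)."""
    return np.log(x) - np.log(1.0 - x.sum())

def grad_h_star(y):
    """(grad h*(y))_i = e^{y_i} / (1 + sum_j e^{y_j}): maps all of R^d onto
    the interior of the simplex, and inverts grad_h."""
    z = np.exp(y)          # plain evaluation; overflow is addressed in Section 5.2.2
    return z / (1.0 + z.sum())

# Round trip: grad h* inverts grad h on the interior of the simplex.
x = np.array([0.2, 0.3, 0.4])      # the implicit fourth coordinate is 0.1
y = grad_h(x)
x_back = grad_h_star(y)
```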
Based on Lemma 1, we now present the surprising case of the non-log-concave Dirichlet posteriors, a distribution of central importance in topic modeling [blei2003latent], for which the dual distribution becomes strictly log-concave.
Example 1 (Dirichlet Posteriors).
Given parameters $\alpha_1, \dots, \alpha_{d+1} > 0$ and observations $n_1, \dots, n_{d+1}$, where $n_i$ is the number of appearances of category $i$, the probability density function of the Dirichlet posterior is
(3.6) $p(x) = \frac{1}{Z}\prod_{i=1}^{d+1} x_i^{\,\alpha_i + n_i - 1},$
where $Z$ is a normalizing constant and $x_{d+1} \triangleq 1 - \sum_{i=1}^{d} x_i$. The corresponding potential is $V(x) = -\sum_{i=1}^{d+1} (\alpha_i + n_i - 1)\log x_i$.
The interesting regime of the Dirichlet posterior is when it is sparse, meaning that the majority of the $n_i$'s are zero and a few $n_i$'s are large. It is also common in practice to set all the $\alpha_i$'s to a common small value. Evidently, $V$ is neither convex nor concave in this case, and no existing non-asymptotic rate can be applied. However, plugging $V$ into (3.5) gives
(3.7) $W(y) = \Big(\sum_{i=1}^{d+1}(\alpha_i + n_i)\Big)\log\Big(1 + \sum_{j=1}^{d} e^{y_j}\Big) - \sum_{i=1}^{d} (\alpha_i + n_i)\,y_i,$
which, magically, becomes strictly convex with Lipschitz gradient, no matter what the observations and parameters are! In view of Theorem 2 and Corollary 7 of [durmus2018analysis], one can then apply (3.3) to obtain a non-asymptotic convergence guarantee in relative entropy, with a rate depending on the initial Wasserstein distance to the target. ∎
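To illustrate how benign the dual potential is, the sketch below runs plain gradient descent on a potential of the form (3.7) with hypothetical counts (our function names); since the dual is smooth and strictly convex, the iterates converge, and the recovered primal point is $x_i = (\alpha_i + n_i)/\sum_j (\alpha_j + n_j)$, the mean of the posterior:

```python
import numpy as np

def softmax1(y):
    """x_i = e^{y_i} / (1 + sum_j e^{y_j}): the map grad h* back onto the simplex."""
    m = max(y.max(), 0.0)                 # max-shift for numerical stability
    z = np.exp(y - m)
    return z / (np.exp(-m) + z.sum())

def dirichlet_dual_grad(y, counts):
    """grad_i W(y) = (sum_j counts_j) * x_i(y) - counts_i for i = 1..d,
    where counts_j = n_j + alpha_j over all d + 1 categories."""
    return counts.sum() * softmax1(y) - counts[:-1]

# Hypothetical counts n_i + alpha_i for d + 1 = 3 categories.
counts = np.array([5.0, 3.0, 2.0])
y = np.zeros(2)
for _ in range(500):                      # plain gradient descent on W
    y -= 0.05 * dirichlet_dual_grad(y, counts)
x = softmax1(y)                           # recovered primal point
```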
4 Stochastic Mirrored Langevin Dynamics
We have thus far only considered deterministic methods based on exact gradients. In practice, however, evaluating gradients typically involves one pass over the full data, which can be time-consuming in large-scale applications. In this section, we turn our attention to the mini-batch setting, where one can use a small subset of the data to form stochastic gradients.
Toward this end, we assume:
Assumption 4 (Primal Decomposability).
The target distribution admits a decomposable structure: $V = \sum_{i=1}^{N} V_i$ for some functions $V_i$.
The above assumption is often met in machine learning applications, where each $V_i$ represents one data point. If there is an additional prior term (that is, $V = \sum_{i=1}^{N} V_i + V_0$ for some prior potential $V_0$), then one can redefine $V_i \leftarrow V_i + V_0/N$ so that Assumption 4 still holds.
Consider the following common scheme for obtaining stochastic gradients. Given a batch-size $s$, we randomly pick a mini-batch $B$ from $\{1, 2, \dots, N\}$ with $|B| = s$, and form an unbiased estimate of $\nabla V$ by computing
(4.1) $\widehat{\nabla V}(x) \triangleq \frac{N}{s}\sum_{i \in B} \nabla V_i(x).$
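The estimator (4.1) can be sketched directly; the quadratic per-datum potentials $V_i(x) = \|x - a_i\|^2/2$ below are a hypothetical example of ours:

```python
import numpy as np

def minibatch_grad(grad_V_i, x, N, batch):
    """Unbiased mini-batch estimate of grad V = sum_i grad V_i:
    (N / |B|) * sum over i in B of grad V_i(x), as in (4.1)."""
    return (N / len(batch)) * sum(grad_V_i(i, x) for i in batch)

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))          # ten data points a_i
grad_V_i = lambda i, x: x - A[i]          # gradient of V_i(x) = ||x - a_i||^2 / 2
x = np.zeros(3)
full = sum(grad_V_i(i, x) for i in range(10))

# Unbiasedness: averaging the estimator over all singleton batches
# recovers the full gradient exactly.
est_mean = np.mean([minibatch_grad(grad_V_i, x, 10, [i]) for i in range(10)], axis=0)
```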
The following lemma asserts that exactly the same procedure can be carried out in the dual.
Lemma 2.
Assume that $h$ is 1-strongly convex. For $i = 1, 2, \dots, N$, let $W_i$ be such that
(4.2) $W_i(y) \triangleq V_i\big(\nabla h^*(y)\big) - \tfrac{1}{N}\log \det \nabla^2 h^*(y).$
Define $\widehat{\nabla W}(y) \triangleq \frac{N}{s}\sum_{i \in B} \nabla W_i(y)$, where $B$ is chosen as in (4.1). Then:
1. Primal decomposability implies dual decomposability: there is a constant $c$ such that $W = \sum_{i=1}^{N} W_i + c$.
2. For each $i$, the gradient $\nabla W_i$ depends only on $V_i$ and the mirror map $h$.
3. The gradient estimate is unbiased: $\mathbb{E}\,\widehat{\nabla W}(y) = \nabla W(y)$.
4. The dual stochastic gradient is more accurate: $\mathbb{E}\,\|\widehat{\nabla W}(y) - \nabla W(y)\|^2 \le \mathbb{E}\,\|\widehat{\nabla V}(x) - \nabla V(x)\|^2$ for $y = \nabla h(x)$.
Proof.
See Appendix D. ∎
Lemma 2 furnishes a template for the mini-batch extension of MLD. The pseudocode is detailed in Algorithm 1, whose convergence rate is given by the next theorem.
Theorem 4.
Let $e^{-V}$ be a distribution satisfying Assumption 4, and let $h$ be a 1-strongly convex mirror map. Let $\sigma^2$ be the variance of the stochastic gradient $\widehat{\nabla V}$ in (4.1). Suppose that the corresponding dual distribution satisfies $0 \preceq \nabla^2 W \preceq L\,I$. Then, applying SMLD with a small enough constant step-size yields the following guarantee in relative entropy:²
(4.3)
² Our guarantee is given on an iterate chosen uniformly at random from the trajectory, instead of the final iterate. In practice, we observe that the final iterate always gives the best performance, and we will ignore this minor difference in the theorem statement.
Proof.
See Appendix E. ∎
Example 2 (SMLD for Dirichlet Posteriors).
For the case of Dirichlet posteriors, we have seen in (3.7) that the corresponding dual distribution satisfies $0 \prec \nabla^2 W \preceq (A + N)\,I$, where $A \triangleq \sum_{i=1}^{d+1} \alpha_i$ and $N \triangleq \sum_{i=1}^{d+1} n_i$. Furthermore, it is easy to see that the stochastic gradient can be efficiently computed (see Appendix F):
(4.4) $\widehat{\nabla W}(y)_i = (A + N)\,\big(\nabla h^*(y)\big)_i - \alpha_i - \frac{N}{s}\,n_i(B), \qquad i = 1, \dots, d,$
where $n_i(B)$ is the number of observations of category $i$ in the mini-batch $B$. As a result, Theorem 4 states that SMLD achieves a non-asymptotic guarantee in relative entropy with a constant step-size.
5 Experiments
We conduct experiments with a twofold purpose. First, we use low-dimensional synthetic data, where we can evaluate the total variation error by comparing histograms, to verify the convergence rates in our theory. Second, we demonstrate that SMLD, modulo a necessary modification for resolving numerical issues, outperforms state-of-the-art first-order methods on the Latent Dirichlet Allocation (LDA) application with the Wikipedia corpus.
5.1 Synthetic Experiment for Dirichlet Posterior
We implement the deterministic MLD for sampling from an 11-dimensional Dirichlet posterior (3.6) with sparse observation counts, which aims to capture the sparse nature of real observations in topic modeling. We set the $\alpha_i$'s to a common value for all $i$.
As a baseline comparison, we include the Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) [patterson2013stochastic] with the expanded-mean parametrization. SGRLD is a tailor-made first-order scheme for simplex constraints, and it remains one of the state-of-the-art algorithms for LDA. For a fair comparison, we use deterministic gradients for SGRLD.
We perform a grid search over the constant step-size for both algorithms, and we keep the best three step-sizes for MLD and the best five for SGRLD. For each iteration, we build an empirical distribution by running independent trials, and we compute its total variation with respect to the histogram generated by the true distribution.
Figure 1(a) reports the total variation error along the first dimension, where we can see that MLD outperforms SGRLD by a substantial margin. As dictated by our theory, all the MLD curves decay at the predicted rate until they saturate at the discretization error level. In contrast, SGRLD lacks non-asymptotic guarantees, and there is no clear convergence rate we can infer from Figure 1(a).
5.2 Latent Dirichlet Allocation with Wikipedia Corpus
An influential framework for topic modeling is the Latent Dirichlet Allocation (LDA) [blei2003latent], which, given a text collection, requires inferring the posterior word distributions without knowing the exact topic for each word. The full model description is standard but somewhat convoluted; we refer to the classic [blei2003latent] for details.
Each topic in LDA determines a word distribution, and suppose there are $K$ topics in total. The variable of interest therefore lives in a Cartesian product of $K$ probability simplices, and we propose to use the mirror map $x \mapsto \sum_{k=1}^{K} h(x_k)$, where $h$ is the entropic mirror map (3.4), for SMLD. It is easy to see that all of our computations for Dirichlet posteriors generalize to this setting.
5.2.1 Experimental Setup
We implement the SMLD for LDA on the Wikipedia corpus, and we compare the performance against SGRLD [patterson2013stochastic]. In order to keep the comparison fair, we adopt exactly the same setting as in [patterson2013stochastic], including the model parameters, the batch-size, the Gibbs sampler steps, etc. See Sections 4 and 5 in [patterson2013stochastic] for omitted details.
Another state-of-the-art first-order algorithm for LDA is the SGRHMC in [ma2015complete], which we do not implement, since one of its tuning choices is not specified in [ma2015complete]. Instead, we repeat the same experimental setting as [ma2015complete] and directly compare our results against the ones reported there. See Appendix G for the comparison against SGRHMC.
5.2.2 A Numerical Trick and the SMLD-approximate Algorithm
A major drawback of the SMLD in practice is that the stochastic gradients (4.4) involve exponential functions, which are unstable for large-scale problems. For instance, in python, np.exp(800) = inf, whereas the relevant variable regime in this experiment extends to 1600. To resolve such numerical issues, we appeal to a linear approximation.³ Admittedly, our theory no longer holds under such numerical tricks, and we shall not claim that our algorithm is provably convergent for LDA. Instead, the contribution of MLD here is to identify the dual dynamics associated with (3.7), which would have been otherwise difficult to perceive. We name the resulting algorithm "SMLD-approximate" to indicate its heuristic nature.
³ One can also use a higher-order Taylor approximation, or add a small threshold to prevent the iterates from going to the boundary. In practice, we observe that these variants do not make a huge impact on the performance.
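For reference, the overflow itself (np.exp(800) = inf) can also be sidestepped exactly for softmax-type expressions with the standard max-shift trick; we note this as an alternative sketch, not the linear approximation adopted above:

```python
import numpy as np

def stable_softmax1(y):
    """Numerically stable evaluation of e^{y_i} / (1 + sum_j e^{y_j}):
    shifting by max(y, 0) keeps every exponent non-positive, so no overflow
    occurs even for entries as large as the ~1600 arising in this experiment."""
    m = max(y.max(), 0.0)
    z = np.exp(y - m)
    return z / (np.exp(-m) + z.sum())

x = stable_softmax1(np.array([1600.0, 800.0, 0.0]))
```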
5.2.3 Results
Figure 1(b) reports the perplexity on held-out test documents, with the five best step-sizes we found via grid search for SMLD-approximate. For SGRLD, we use the best step-sizes reported in [patterson2013stochastic].
From the figure, we can see a clear improvement of SMLD-approximate over SGRLD, both in terms of convergence speed and the saturation level. One plausible explanation for this phenomenon is that our MLD, as a simple unconstrained Langevin Dynamics, is less sensitive to discretization. The underlying dynamics for SGRLD, on the other hand, is a more sophisticated Riemannian diffusion, which requires a finer discretization than MLD to achieve the same level of approximation to the original continuous-time dynamics, and this remains true in the presence of noisy gradients and our numerical heuristics.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594, time-data).
References
Appendix A Proof of Theorem 2
We first focus on the convergence in total variation and relative entropy, since they are in fact quite simple. The proof for the 2-Wasserstein distance requires a bit more work.
A.1 Total Variation and Relative Entropy
Since $h$ is strictly convex, $\nabla h$ is one-to-one, and hence $\mathrm{TV}(\hat{\mu}, e^{-V}) = \mathrm{TV}(\nabla h \# \hat{\mu}, e^{-W})$.
On the other hand, it is well-known that applying a one-to-one mapping to distributions leaves the relative entropy intact. Alternatively, we may also simply write (letting $\hat{\nu} = \nabla h \# \hat{\mu}$):
$D(\hat{\nu}\,\|\,e^{-W}) = \int \log\frac{\mathrm{d}\hat{\nu}}{\mathrm{d}e^{-W}}\,\mathrm{d}\hat{\nu} = \int \log\frac{\mathrm{d}\hat{\mu}}{\mathrm{d}e^{-V}}\,\mathrm{d}\hat{\mu} = D(\hat{\mu}\,\|\,e^{-V}),$
where the middle equality is the change of variables $y = \nabla h(x)$. The "in particular" part follows from noticing that $\nabla h \# e^{-V} = e^{-W}$ and $\nabla h^* \# e^{-W} = e^{-V}$.
A.2 2-Wasserstein Distance
Now, let $h$ be strongly convex. The most important ingredient of the proof is Lemma 3 below, which is conceptually clean. Unfortunately, for the sake of rigor, we must deal with certain intricate regularity issues in Optimal Transport theory. If the reader wishes, she/he can simply assume that the quantities (A.1) and (A.2) below are well-defined, which is satisfied by any practical mirror map, and skip the technical parts about well-definedness.
For the moment, assume that $h$ is $C^2$; the general case is given at the end. Every convex $h$ generates a Bregman divergence via $D_h(x, x') \triangleq h(x) - h(x') - \langle\nabla h(x'), x - x'\rangle$. The following key lemma allows us to relate guarantees between the primal and dual spaces. It can be seen as a generalization of the classical duality relation (A.4) to the space of probability measures.
Lemma 3 (Duality of Wasserstein Distances).
Before proving the lemma, let us see that the 2-Wasserstein relation in Theorem 2 is a simple corollary of Lemma 3. Since $h$ is $\rho$-strongly convex, it is classical that, for any $x$ and $x'$,
(A.4) $\frac{\rho}{2}\,\|x - x'\|^2 \le D_h(x, x') \le \frac{1}{2\rho}\,\|\nabla h(x) - \nabla h(x')\|^2.$
Using Lemma 3 and these two inequalities, we conclude $\mathcal{W}_2(\hat{\mu}, e^{-V}) \le \rho^{-1}\,\mathcal{W}_2(\hat{\nu}, e^{-W})$. It hence remains to prove Lemma 3.
A.2.1 Proof of Lemma 3 when $h$ is $C^2$
We first prove that (A.2) is well-defined by verifying the sufficient conditions in Theorem 3.6 of [de2014monge]. Specifically, we will verify (C0)-(C2) on p. 554 of [de2014monge] when the transport cost is the Bregman divergence $D_h$.
Since $h$ is strongly convex, $\nabla h$ is injective, and hence $\nabla h^*$ is also injective, which implies that $h^*$ is strictly convex. On the other hand, the strong convexity of $h$ implies that $\nabla h^*$ is Lipschitz, and hence $h^*$ is globally upper bounded by a quadratic function.
We now show that the conditions (C0)-(C2) are satisfied. Since $h$ is $C^2$ and $h^*$ is upper bounded by a quadratic function, the condition (C0) is trivially satisfied. On the other hand, since $h^*$ is strictly convex, a simple calculation reveals that, for any $x$, the mapping $x' \mapsto \nabla_x D_h(x, x')$ is injective, which is (C1). Similarly, for any $x'$, the mapping $x \mapsto \nabla_{x'} D_h(x, x')$ is also injective, which is (C2). By Theorem 3.6 in [de2014monge], (A.2) is well-defined.
We now turn to (A.3), which will automatically establish the well-definedness of (A.1). We first need the following equivalent characterization of the push-forward $\hat{\nu} = \nabla h \# \hat{\mu}$ [villani2008optimal]:
(A.5) $\int \phi\,\mathrm{d}\hat{\nu} = \int \phi \circ \nabla h\,\mathrm{d}\hat{\mu}$
for all measurable . Using (A.5) in the definition of , we get
where the infimum is over all such that . Using the classical duality and , we may further write
(A.6) 
where the infimum is again over all such that . In view of (A.6), the proof would be complete if we can show that if and only if .
For any two maps $T$ and $S$ of proper dimensions, we claim that
(A.7) $(T \circ S) \# \mu = T \# (S \# \mu).$
Indeed, for any Borel set $E$, we have, by definition of the push-forward,
$\big(T \# (S \# \mu)\big)(E) = (S \# \mu)\big(T^{-1}(E)\big) = \mu\big(S^{-1}(T^{-1}(E))\big).$
On the other hand, recursively applying the definition of the push-forward to $(T \circ S) \# \mu$ gives
$\big((T \circ S) \# \mu\big)(E) = \mu\big((T \circ S)^{-1}(E)\big) = \mu\big(S^{-1}(T^{-1}(E))\big),$
which establishes (A.7).
A.2.2 When $h$ is only $C^1$
Appendix B Proof of Theorem 3
In previous sections, we were given a target distribution $e^{-V}$ and a mirror map $h$, and we derived the induced distribution $e^{-W}$ through the Monge-Ampère equation (2.1). The high-level idea of this proof is to reverse the direction: we start with two good distributions $e^{-V}$ and $e^{-W}$, and we invoke deep results in Optimal Transport to deduce the existence of a good mirror map $h$.
First, notice that if $e^{-V}$ has a bounded domain, then the strong convexity of $V$ implies that $e^{-V}$ is bounded above. Along with the assumption that $e^{-V}$ is bounded away from $0$ in the interior, we see that $e^{-V}$ is bounded away from $0$ and $\infty$ in the interior of its support.
Let $\gamma_d$ be the standard $d$-dimensional Gaussian measure. By Brenier's polarization theorem [brenier1987decomposition, brenier1991polar] and Assumptions 2, 3, there exists a convex function $h$ whose gradient $\nabla h$ solves the optimal transportation problem from $e^{-V}$ to $\gamma_d$. Caffarelli's regularity theorem [caffarelli1990localization, caffarelli1992regularity, caffarelli2000monotonicity] then implies that the Brenier map $\nabla h$ is continuously differentiable. Finally, a slightly stronger form of Caffarelli's contraction theorem [kolesnikov2011mass] asserts:
(B.1) $\nabla^2 h \succeq c\,I$ for some constant $c > 0$,
which implies that $h$ is $c$-strongly convex.
Let us consider the discretized MLD (3.3) corresponding to the mirror map $h$ above. The SDE governing the dual variable $y_t$ is simply the Ornstein-Uhlenbeck process [oksendal2003stochastic]:
(B.2) $\mathrm{d}y_t = -y_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t.$
Invoking Theorem 3 of [cheng2017convergence] for the iterates of (3.3) applied to (B.2), we obtain a convergence guarantee on the dual distribution, which in turn implies a guarantee on the primal distribution; Theorem 2 then completes the proof.
Appendix C Proof of Lemma 1
Straightforward calculations in convex analysis show that
(C.1) $\nabla^2 h(x) = \mathrm{diag}\Big(\frac{1}{x_1}, \dots, \frac{1}{x_d}\Big) + \frac{1}{1 - \sum_{i=1}^{d} x_i}\,\mathbf{1}\mathbf{1}^{\top} \succeq I,$
which proves that $h$ is 1-strongly convex.
Let