MCMC, simulated annealing, travelling salesman
Rejection sampling
The basic idea is that if we know how to sample on a bigger set, then we can sample on a smaller set. Another idea is that if we know how to sample from density \(q\), then we can also sample uniformly from area under the graph \((x,q(x))\), because for \((X,Y)\) sampled uniformly on that area we have \(X\sim q\) and \(Y\mid x\sim \text{ Unif} [0,q(x)]\). The application is this. If we have \(p\), which we don’t know how to sample from and \(q\), which we know how to sample from, then for some \(M\) \[p<Mq\]Therefore, we sample uniformly from area under \(Mq\). If we sample something outside area of \(p\), we reject it until we get something in the area. Then all the points we did accept are distributed uniformly in area of \(p\). But that means their \(X\) coordinates have distribution \(X\sim p\). Which we wanted. The closer two areas are, the less rejections we have. All good. Except it suffers from curse of dimensionality.
Importance sampling
This is useful for computing expectation Again, we can’t sample from \(p\), but can evaluate from \(p\).
Being able to evaluate a density up to a constant but not sample from it is very common in machine learning. For example, positive valued neural networks can be thought of that way.
But if we have \(q\) we know how to sample from, we can do this. \[\mathbf{E}_{X\sim p}[f(X)]=\int f(x)p(x)dx=\int f(x)\frac{p(x)}{q(x)}q(x) = \mathbf{E}_{X\sim q}[f(X)\frac{p(X)}{q(X)}]\] The usual way it works is that \(\int p= Z\), so it’s not normalized. But \(Z\) can be found by taking \(f=1\).
Also suffers from curse of dimensionality.
MCMC
The basic idea is that Markov chain converge in distribution to their stationary distribution, that is \(\pi = \pi T\) for transition kernel \(T\) and measure \(\pi\). Caveat: Chain needs to be irreducible and aperiodic to guarantee that.
Metropolis-Hastings
MH is based on the fact that, if Markov chain is reversible for \(\pi\), then \(\pi\) is stationary distribution. We construct kernel that is reversible using \(\pi\) and proposed transition distribution \(q\) (this is used to sample new states). We define acceptance rate \(\alpha(x,y)= \text{ min }(1,\frac{\pi(y) q(x\mid y)}{\pi(x) \mid q(y|x)})\). Since this only relies on ratio of \(\pi\), any pesky normalization constant will cancel out.
- Initialize at some starting point \(x_0\).
- For \(t \in \{0,\ldots,T-1\}\):
- Propose \(x' \sim q(\cdot \mid x_t)\).
- Sample \(U \sim \mathrm{Unif}[0,1]\).
- Accept if \(U < \alpha(x_t, x')\) and set \(x_{t+1}=x'\).
- Otherwise set \(x_{t+1}=x_t\).
Gibbs sampling
Gibbs sampling is another approach to MCMC. We can notice that if \(\pi T_1 =\pi\) and \(\pi T_2=\pi\), then \(\pi T_1\circ T_2 = \pi\). In other words, stationary distribution is invariant under compositions. So we might sample from \(\pi(\cdot \mid z_{-i})\), that is, conditional distribution where all components of \(z\) but \(i\)-th are fixed. Turns out this satisfies the property above. So we sample from each component from these conditional distribution and always accept new sampled coordinate.
Simulated annealing and travelling salesman
One pretty application of Metropolis-Hastings is to approximately solve travelling salesman. Suppose \(x\) is an order in which we visit cities and \(H(x)\) is distance travelled on that trajectory between cities. Then \(\mu_T(x) = \exp(\frac{-H(x)}{T})\) will converge to uniform measure on set of minimizers if \(T\to 0\). So we run Metropolis-Hastings on this Markov chain and measure, with some transition kernel and gradually lower the temperature \(T\) to \(0\). Some pretty pictures.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning
- https://judelo.github.io/pages/m2-mm-algorithmes-stochastiques/