A further step is Langevin dynamics, where the system carries damped momentum and the noise is injected into the momentum. This is used in molecular dynamics simulations, and it also serves as a Bayesian MCMC sampler.
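In SDE form, with position $x$, velocity $v$, potential $U$ (the negative log-density up to a constant), friction $\gamma$, and temperature $T$, the standard way to write this is

$$dx_t = v_t\,dt, \qquad dv_t = -\nabla U(x_t)\,dt - \gamma v_t\,dt + \sqrt{2\gamma T}\,dW_t$$

Note that the Brownian term $dW_t$ appears only in the velocity equation; the noise reaches $x$ indirectly, through $v$.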
Oddly, most mentions of Langevin dynamics in relation to AI that I've seen omit the momentum, even though gradient descent with momentum is widely used in AI. To confuse matters further, "stochastic" in this context refers to approximating the gradient using a sub-sample of the data at each step, as in stochastic gradient Langevin dynamics (SGLD). You can apply both forms of stochasticity at once if you want to!
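Here's a minimal sketch of one step that uses both kinds of noise; this is essentially the SGLD update of Welling & Teh (2011), and `grad_log_post_minibatch` is a hypothetical helper that returns an unbiased, full-dataset-scaled gradient estimate of the log-posterior from a mini-batch:

```python
import numpy as np

def sgld_step(theta, data, step_size, batch_size, rng):
    # Stochasticity #1: sub-sample the data to estimate the gradient.
    idx = rng.choice(len(data), size=batch_size, replace=False)
    grad = grad_log_post_minibatch(theta, data[idx], n_total=len(data))  # hypothetical helper
    # Stochasticity #2: inject Gaussian noise, scaled so that the chain
    # targets the posterior as step_size -> 0.
    noise = rng.standard_normal(theta.shape)
    return theta + 0.5 * step_size * grad + np.sqrt(step_size) * noise
```

The `sqrt(step_size)` scaling on the injected noise, versus the `step_size` scaling on the gradient, is what distinguishes this from plain SGD with noise added.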
The momentum analogue for Langevin is known as underdamped Langevin, which, if you optimize the discretization scheme hard enough, converges faster than ordinary (overdamped) Langevin; a sketch of a simple discretization is below. As for your question, your guess is as good as mine, but I would guess that the nonconvexity of AI applications causes problems. Sampling is a hard enough problem already in the log-concave setting…
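For reference, a crude Euler-Maruyama-style discretization of the underdamped dynamics looks like the following (the faster-convergence results come from more careful schemes, e.g. randomized midpoint; `grad_U` is a hypothetical callable returning the gradient of the potential):

```python
import numpy as np

def underdamped_step(x, v, grad_U, h, gamma, temp, rng):
    # The velocity update gets both the drift and ALL of the noise;
    # the position update is then deterministic given the new velocity.
    v = v + h * (-grad_U(x) - gamma * v) \
        + np.sqrt(2.0 * gamma * temp * h) * rng.standard_normal(v.shape)
    x = x + h * v
    return x, v
```

In the high-friction limit (after a time rescaling) this reduces to the ordinary overdamped Langevin dynamics, which is the usual way the two regimes are related.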