So what is EM? Well, i know its Expectation and Maximization... Well, firstly it's a way to do MLE. Actually it is said to be able to do MAP as well. EM promises to find a local minimum but cannot always find global minimum if the distribution is multi-modal. I just realized the derivation from our course advisor Carlos Guestrin very useful... I must be misunderstanding him for a long tim.
In short, we want to:
arg max(theta) (Data Likelihood), or
arg max(theta) (Log P(data))
And assume we have some observable variable notated by X and some hidden variable notated by Z, then a two-step EM procedure is defined as:
First, set initial theta randomly. Then, for each iteration:
Q(t+1)(z|xj) = P(Z|Xj, theta(t))
arg max(theta) [sigma(j) sigma(z) Q(t+1)(z|xj) logP(z,xj|theta)]
Several points here. First, after E-step, the Q(t+1)(z|xj) is settled, and we can consider it as a constant in M-step; Second, i think we really need some examples to get a clearer idea of what those formula means. I'll give two examples here, but i'll just explain the variable, rather than go through the whole calculations.
Example1 Gaussian Mixture Model (GMM)
The most popular example.
Assume that we have 2 (can be >2) gaussian distribution, each with mu (central point) and sigma (variance). Assume each point is generated by the following steps: 1. select which gaussian distribution it comes from (by another variables tau1, and tau2=1-tau1) and 2. place the point according to corresponding distribution. Then we want to estimate all above parameters (mu, sigma, tau, which constitutes theta). X here are each data points, and Z is which Gaussian distribution each data point belongs to, i.e., P(Z=1|x1), P(Z=2|x1), P(Z=1|x2)...
So first step is just choose random values of theta as initial values.
P(Z|xj, theta) for each x and z value. Btw, this step is always simple because we can usually use Bayesian formula here.
M-step: Now we know all the P(Z|xj, theta) values. Just maximize the function by corresponding mu, sigma and tau.
The most simple scenario is that, we can directly calculate the partial derivative of each parameter, and set it to zero to find the value of parameter that minimize the function.
From section3. Two things i want to mention here:
1. Z here is a combination of (Zx,Zy). So the E-step actually calculates P(Zx,Zy|xi,yi,ri) for each rating i and each Zx, Zy.
2. At M-step, we need to use Lagrange multiplier to estimate parameter.
That's probably it. I'll keep updating if i find anything more interesting.