OK, i admit that it's the Nth time that I studied EM algorithm all over again. But I don't quite blame myself when N<=10. I follows

wikipedia page but didn't finish reading that. So were some other materials like

this. Two most unbearable things related: different set of notations, and theory without concrete and suitable examples.

So what is EM? Well, i know its Expectation and Maximization... Well, firstly it's a way to do

MLE. Actually it is said to be able to do MAP as well. EM promises to find a local minimum but cannot always find global minimum if the distribution is multi-modal. I just realized the

derivation from our course advisor Carlos Guestrin very useful... I must be misunderstanding him for a long tim.

In short, we want to:

arg max(theta) (Data Likelihood), or

arg max(theta) (Log P(data))

And assume we have some observable variable notated by X and some hidden variable notated by Z, then a two-step EM procedure is defined as:

First, set initial theta randomly. Then, for each iteration:

**E-step: Calculate**

**Q(t+1)(z|xj) = P(Z|Xj, theta(t))**

**M-step:**

**arg max(theta) [sigma(j) sigma(z) Q(t+1)(z|xj) logP(z,xj|theta)]**

Several points here. First, after E-step, the Q(t+1)(z|xj) is settled, and we can consider it as a constant in M-step; Second, i think we really need some examples to get a clearer idea of what those formula means. I'll give two examples here, but i'll just explain the variable, rather than go through the whole calculations.

**Example1 Gaussian Mixture Model (GMM)**

The most popular example.

Assume that we have 2 (can be >2) gaussian distribution, each with **mu** (central point) and **sigma** (variance). Assume each point is generated by the following steps: 1. select which gaussian distribution it comes from (by another variables **tau1**, and **tau2**=1-**tau1**) and 2. place the point according to corresponding distribution. Then we want to estimate all above parameters (**mu, sigma, tau, **which constitutes **theta**). X here are each data points, and Z is which Gaussian distribution each data point belongs to, i.e., P(Z=1|x1), P(Z=2|x1), P(Z=1|x2)...

So first step is just choose random values of **theta** as initial values.

E-step: Calculate

**P(Z|xj, theta) **for each x and z value. Btw, this step is always simple because we can usually use Bayesian formula here.

M-step: Now we know all the P(Z|xj, theta) values. Just maximize the function by corresponding mu, sigma and tau.

The most simple scenario is that, we can directly calculate the

partial derivative of each parameter, and set it to zero to find the value of parameter that minimize the function.

From section3. Two things i want to mention here:

1. Z here is a combination of (Zx,Zy). So the E-step actually calculates P(Zx,Zy|xi,yi,ri) for each rating i and each Zx, Zy.

That's probably it. I'll keep updating if i find anything more interesting.