Sunday, June 28, 2009

EM, etc.

OK, I admit that this is the Nth time I've studied the EM algorithm all over again. But I don't blame myself too much while N <= 10. I followed the Wikipedia page but didn't finish reading it, and the same goes for some other materials like this one. The two most unbearable things: every source uses a different set of notation, and the theory comes without concrete, suitable examples.

So what is EM? Well, I know it's Expectation and Maximization... First of all, it's a way to do MLE. It is also said to work for MAP estimation. EM promises to find a local maximum of the likelihood, but it cannot always find the global maximum if the distribution is multi-modal. I just realized the derivation from our course instructor Carlos Guestrin is very useful... I must have been misunderstanding him for a long time.

In short, we want to:
arg max(theta) P(Data | theta), or equivalently

arg max(theta) log P(Data | theta)

And assume we have some observable variables denoted by X and some hidden variables denoted by Z; then the two-step EM procedure is defined as:

First, set initial theta randomly. Then, for each iteration:

E-step: Calculate
Q(t+1)(z|xj) = P(z|xj, theta(t))

M-step: Update
theta(t+1) = arg max(theta) [sigma(j) sigma(z) Q(t+1)(z|xj) log P(z, xj | theta)]

Several points here. First, after the E-step, Q(t+1)(z|xj) is settled, and we can treat it as a constant in the M-step. Second, I think we really need some examples to get a clearer idea of what those formulas mean. I'll give two examples here, but I'll just explain the variables rather than go through the whole calculation.

Example 1: Gaussian Mixture Model (GMM)
The most popular example.

Assume that we have 2 (can be >2) Gaussian distributions, each with a mu (center) and sigma (variance). Assume each point is generated by the following steps: 1. select which Gaussian distribution it comes from (governed by another pair of variables, tau1 and tau2 = 1 - tau1), and 2. draw the point from the chosen distribution. We then want to estimate all the parameters above (mu, sigma, tau, which together constitute theta). X here is the data points, and Z is which Gaussian each data point belongs to, i.e., P(Z=1|x1), P(Z=2|x1), P(Z=1|x2)...

So the first step is just to choose random values of theta as initial values.

E-step: Calculate
P(Z|xj, theta(t)) for each x and each z value. By the way, this step is usually simple because we can apply Bayes' rule here.
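Written out with Bayes' rule (standard GMM notation, not from the post; N denotes the Gaussian density):

```latex
Q^{(t+1)}(Z = k \mid x_j)
  = P(Z = k \mid x_j, \theta^{(t)})
  = \frac{\tau_k^{(t)}\,\mathcal{N}\!\bigl(x_j \mid \mu_k^{(t)}, (\sigma_k^{(t)})^2\bigr)}
         {\sum_{l}\tau_l^{(t)}\,\mathcal{N}\!\bigl(x_j \mid \mu_l^{(t)}, (\sigma_l^{(t)})^2\bigr)}
```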

M-step: Now we know all the P(Z|xj, theta(t)) values. Just maximize the expected log-likelihood with respect to mu, sigma and tau.

In the simplest scenario, we can directly take the partial derivative with respect to each parameter and set it to zero to find the value of the parameter that maximizes the function.
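The whole GMM procedure above fits in a few lines of numpy. This is a minimal sketch, not from the post: a 1-D mixture of two Gaussians, with the synthetic data, initial values, and iteration count all my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from two Gaussians; these "true" parameters
# (tau = 0.3/0.7, mu = -2/3, sigma = 0.5/1.0) are what EM should recover.
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Initial theta = (tau, mu, sigma), chosen arbitrarily.
tau = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: Q(z|x_j) = P(z | x_j, theta) by Bayes' rule.
    w = tau * normal_pdf(x[:, None], mu, sigma)   # shape (n, 2)
    q = w / w.sum(axis=1, keepdims=True)

    # M-step: closed-form updates obtained by setting partial
    # derivatives of the expected log-likelihood to zero.
    n_k = q.sum(axis=0)
    tau = n_k / len(x)
    mu = (q * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(np.sort(mu))   # should be close to (-2, 3)
```

Note that the M-step here treats q as a constant, exactly as the first point above says: the responsibilities are frozen after the E-step.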

Example 2, from section 3 of the material mentioned above. Two things I want to mention here:

1. Z here is a combination (Zx, Zy). So the E-step actually calculates P(Zx, Zy | xi, yi, ri) for each rating i and each pair (Zx, Zy).

2. In the M-step, we need to use a Lagrange multiplier to estimate the parameters.
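To make point 2 concrete: for any multinomial parameter such as mixture weights tau_k that must sum to 1, the M-step is a constrained maximization. A generic sketch of the standard derivation (not specific to the model in section 3):

```latex
\max_{\tau}\ \sum_{j=1}^{n}\sum_{k} Q(k \mid x_j)\,\log\tau_k
\quad \text{s.t.}\quad \sum_{k}\tau_k = 1.

% Lagrangian and stationarity condition:
\mathcal{L} = \sum_{j}\sum_{k} Q(k \mid x_j)\log\tau_k
            + \lambda\Bigl(1 - \sum_{k}\tau_k\Bigr),
\qquad
\frac{\partial\mathcal{L}}{\partial\tau_k}
  = \frac{\sum_{j} Q(k \mid x_j)}{\tau_k} - \lambda = 0.

% Solving for \tau_k and enforcing the constraint gives \lambda = n, hence
\tau_k = \frac{1}{n}\sum_{j=1}^{n} Q(k \mid x_j).
```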

That's probably it. I'll keep updating if I find anything more interesting.

Sunday, June 7, 2009

Hash & VC6 Release Mode

The wise 符老大 raised the Debug vs. Release point. It turns out that building in release mode cuts the memory usage above roughly in half. In other words, it is still about twice the theoretical value.
Let's look at an example: an ordinary AVL tree node, used for hashing. Each node is two ints and two pointers (also 4 bytes each), so 5M nodes should take about 80MB of memory, rather than the 320MB seen in Debug mode.
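The 80MB figure checks out; a quick sketch of the arithmetic, assuming 4-byte ints and 4-byte pointers on a 32-bit build:

```python
# Back-of-envelope check of the node-size arithmetic above,
# assuming a 32-bit build where int and pointer are both 4 bytes.
ints, pointers = 2, 2
bytes_per_node = ints * 4 + pointers * 4      # 16 bytes per node
total_bytes = 5_000_000 * bytes_per_node      # 80,000,000 bytes
print(bytes_per_node, total_bytes / 10**6)    # 16 bytes/node, 80.0 MB
```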
With the C++ STL's built-in tree-based map (implemented as an RB tree, i.e. red-black tree; Wikipedia has a detailed article), memory usage is about 1.25x that of the hand-written data structure (though I haven't checked whether its time complexity is lower).
One possible cause is internal fragmentation from frequently new-ing many small class members. I haven't verified whether this is the case; if it is, one could allocate a large block of memory up front and manage it manually.
The release-mode setting in VC6 is under Build -> Set Active Configuration.
* To do: analyze hash vs. BST on large-scale data. My advisor says hashing is simply unusable on large data, while 符老大's view seems to be that hashing actually has the advantage at large data volumes. Try it out.

64-bit integer under VC6

using __int64 and unsigned __int64

Generally they work fine (just like ordinary int / unsigned int). The only problem I've encountered so far is that you cannot use cout (cin) to output (input) them, which means you need to handle them separately...

Word: backspace doesn't work, but delete works

For example, if you highlight some text and want to delete it, but the backspace key doesn't work (so you can only use, e.g., the delete key), then:

Word 2003:
Tools -> Options -> check "Typing replaces selected text"

Word 2007:
Office Button -> Word Options -> Advanced -> check "Typing replaces selected text"