Tuesday, September 8, 2009

First taste of a Facebook application

I decided to write a Facebook application. At the very least, that should increase my odds of working for Facebook, shouldn't it?

Or at least it might help me get a job at xiaonei.com...

OK, back to the point. In principle you just follow the instructions on Facebook, but several things kept tripping me up along the way. A few lessons:

* Use a host that supports PHP 5 (or maybe PHP 4 is good enough?). I used GoDaddy's economy hosting, whose default configuration (IIS 6) doesn't support PHP at all. Fortunately it offers a free (!!) upgrade to IIS 7, which works with PHP 5. That confused me for quite a while.

* Be careful with the settings:
- Callback URL: make sure it ends with a trailing "/".
- Start with an IFrame; that way you can see results even from a single plain HTML page. Then switch to FBML carefully.
- When using FBML, create an index.php and put your code in that file. Right now I'm using the example provided by Facebook ("Get started quickly with some example code"), which is PHP code (FBML also works).

Gotta move on. Add oil!

Sunday, July 26, 2009

MapReduce... Part I

I want to summarize my recent experience with the MapReduce framework.

Actually, I will start from the point where you can already run a standalone version of the "wordcount" application. Getting it to work on the cluster is more than just copying the standalone code over and running it.

Also, I won't cover how to configure MapReduce on the cluster... this time someone else has set it up for us users.

Let's begin now.

Tip#1
This one is not really about MapReduce, but about the difference between Windows and Linux.

Well, as I found out earlier, Cygwin is a cool tool: you can simulate a Linux environment from Windows, AND something more. For example, after setting up a Java environment under Windows, you can use that same Java directly from the Linux-like environment as well, without installing it again.

Well... is it really that good?

However, as it turned out later, another programming tool, gcc, yields different results when compiling under Cygwin versus real Linux.

Maybe Java really is the more platform-independent language.

Tip#2
There are several reasons why a program that runs fine under the standalone MapReduce framework can fail on the cluster (fully distributed mode).

Here is one:
You should not assume that you can smoothly pass parameters to the map and reduce functions through, e.g., class member variables. For example:

public class WordCount {
    int screwyou;    // instance field of the outer class

    public static class Map ...
    {
        public void map()
        {
            // ... you might hope to read screwyou here, but on the cluster this
            // code runs in a separate task JVM, usually on another machine
        }
    }

    public static class Reduce ...
    {
        public void reduce()
        {
            // ... same problem here
        }
    }

    public static void main(String[] args) throws Exception {
        int screwyou = 1;    // note: this is a local variable in the client JVM,
                             // not even the field declared above
        ...
    }
}

Then you cannot assume that the map/reduce functions can access the screwyou variable (because you're screwed): on the cluster, each map and reduce task runs in its own JVM, often on a different machine, so values set in the client JVM simply aren't there. Instead, you need some other mechanism to pass values around. Actually I haven't fully understood how to do that yet, but the general idea is to use:

public void configure(JobConf job) {}

inside your Mapper and Reducer classes to read values back out of the job configuration. You can look at RandomWriter.java to see an example of this.
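For what it's worth, here is a minimal sketch of how I understand this is supposed to work, using the old org.apache.hadoop.mapred API that was current at the time. The class name PassParamSketch and the configuration key "wordcount.screwyou" are just names I made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PassParamSketch {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private int screwyou;

        // Called once per task, on the node where the task actually runs,
        // with the JobConf that was built in main().
        public void configure(JobConf job) {
            screwyou = job.getInt("wordcount.screwyou", 0);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // screwyou now holds the value set in main(), even on the cluster
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassParamSketch.class);
        conf.setInt("wordcount.screwyou", 1);   // ship the value with the job
        conf.setMapperClass(Map.class);
        // ... set output key/value classes, input/output paths, reducer, etc.
        // as usual, then:
        JobClient.runJob(conf);
    }
}

The key point is that values stored in the JobConf travel with the job, so configure() can read them back in whichever JVM the task ends up running in.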


to be continued...

Sunday, June 28, 2009

EM, etc.

OK, I admit this is the Nth time I have studied the EM algorithm all over again, but I don't really blame myself as long as N <= 10. I followed the Wikipedia page but didn't finish reading it, and the same goes for some other materials like this one. The two most unbearable things: every source uses a different set of notation, and the theory comes without concrete, suitable examples.

So what is EM? Well, I know it stands for Expectation and Maximization... First of all, it's a way to do MLE (and it is said to handle MAP estimation as well). EM promises to find a local maximum of the likelihood, but it cannot always find the global maximum if the likelihood is multi-modal. I just realized that the derivation from our course advisor Carlos Guestrin is very useful... I must have been misunderstanding him for a long time.

In short, we want to find:

argmax_theta P(data | theta)  (the data likelihood), or equivalently

argmax_theta log P(data | theta)

Assume we have observed variables denoted by X and hidden variables denoted by Z. Then the two-step EM procedure is defined as follows:

First, set initial theta randomly. Then, for each iteration:

E-step: calculate
Q(t+1)(z|xj) = P(z|xj, theta(t))

M-step:
theta(t+1) = argmax_theta [ sum(j) sum(z) Q(t+1)(z|xj) log P(z, xj | theta) ]

Several points here. First, after the E-step, Q(t+1)(z|xj) is fixed, so we can treat it as a constant in the M-step. Second, I think we really need some examples to get a clearer idea of what those formulas mean. I'll give two examples below, but I'll just explain the variables rather than go through the whole calculation.
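Before moving on to them, one side note (this is just the standard textbook justification, not anything specific to the course notes): the reason maximizing that M-step objective makes sense is that, for any fixed Q, it lower-bounds the log-likelihood. In LaTeX notation:

\log P(x_j \mid \theta)
  = \log \sum_z Q(z \mid x_j)\,\frac{P(z, x_j \mid \theta)}{Q(z \mid x_j)}
  \ge \sum_z Q(z \mid x_j)\,\log \frac{P(z, x_j \mid \theta)}{Q(z \mid x_j)}

by Jensen's inequality, with equality exactly when Q(z|xj) = P(z|xj, theta), which is precisely what the E-step sets Q to. The extra -sum(z) Q log Q term in the bound does not depend on theta, so maximizing the bound over theta is the same as the M-step above. This is why each EM iteration can only increase (or leave unchanged) the log-likelihood, and why it converges to a local maximum.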

Example1 Gaussian Mixture Model (GMM)
The most popular example.

Assume we have 2 (it could be more) Gaussian distributions, each with its own mu (mean) and sigma (variance). Assume each point is generated in two steps: 1. select which Gaussian it comes from (with mixing weights tau1 and tau2 = 1 - tau1), and 2. place the point according to that distribution. We then want to estimate all of the above parameters (the mus, sigmas and taus, which together constitute theta). X here is the set of data points, and Z indicates which Gaussian each data point belongs to, i.e., we care about P(Z=1|x1), P(Z=2|x1), P(Z=1|x2), ...

So the first step is just to choose random values of theta as the initialization.

E-step: calculate
P(Z|xj, theta(t)) for each xj and each value of Z. By the way, this step is usually simple because we can just apply Bayes' rule here.

M-step: now we know all the P(Z|xj, theta(t)) values (the Q's from the E-step). Just maximize the objective with respect to the corresponding mu, sigma and tau.

In the simplest scenario, we can directly take the partial derivative with respect to each parameter and set it to zero to find the parameter value that maximizes the objective.
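For reference, for a mixture of K one-dimensional Gaussians these derivatives give the standard closed-form updates (the textbook result, just rewritten in the notation above, writing r_{jk} for Q(t+1)(z=k|xj) and n for the number of points):

r_{jk} = \frac{\tau_k\, \mathcal{N}(x_j \mid \mu_k, \sigma_k^2)}
              {\sum_{k'} \tau_{k'}\, \mathcal{N}(x_j \mid \mu_{k'}, \sigma_{k'}^2)}
\qquad \text{(E-step)}

\tau_k = \frac{1}{n}\sum_j r_{jk}, \qquad
\mu_k = \frac{\sum_j r_{jk}\, x_j}{\sum_j r_{jk}}, \qquad
\sigma_k^2 = \frac{\sum_j r_{jk}\,(x_j - \mu_k)^2}{\sum_j r_{jk}}
\qquad \text{(M-step)}

The mu and sigma updates come straight from setting the partial derivatives to zero; the tau update additionally needs the constraint sum(k) tau_k = 1, which is where the Lagrange multiplier mentioned in the second example comes in.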

Example2 is from Section 3 of the other material linked above. Two things I want to mention here:

1. Z here is a combination (Zx, Zy). So the E-step actually calculates P(Zx, Zy | xi, yi, ri) for each rating i and each value of (Zx, Zy).

2. In the M-step, we need to use a Lagrange multiplier to estimate the parameters (a quick sketch of that kind of constrained step is below).
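For completeness, here is the generic form of that Lagrange-multiplier step, for any set of mixture-weight-style parameters that must sum to 1 (this is the standard calculation, not specific to that model). With c_k = sum(j) r_{jk} collecting the posterior weights from the E-step, the M-step for the weights is:

\max_{\tau}\ \sum_k c_k \log \tau_k
\quad \text{s.t.} \quad \sum_k \tau_k = 1

Form the Lagrangian L(\tau, \lambda) = \sum_k c_k \log \tau_k + \lambda\,(1 - \sum_k \tau_k). Setting \partial L / \partial \tau_k = c_k/\tau_k - \lambda = 0 gives \tau_k = c_k/\lambda, and the constraint forces \lambda = \sum_k c_k, so

\tau_k = \frac{c_k}{\sum_{k'} c_{k'}}

i.e., each weight is just the normalized posterior mass assigned to that component.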

That's probably it. I'll keep updating if I find anything more interesting.

Sunday, June 7, 2009

Hash & VC6 Release Mode

In principle, under VC6 an int is 4 bytes, so allocating an array of 5M ints should take 20 MB of memory. That is indeed the case.
For a class, however, although in theory the memory footprint is just the class's member variables, in practice, with large amounts of data, the memory used is roughly 4 times the theoretical value.
The wise Boss Fu suggested looking at the Debug vs. Release distinction. It turns out that building in release mode cuts the memory above roughly in half, which is still about twice the theoretical value.
Take an example: an ordinary AVL tree node, used for hashing. Each node holds two ints and two pointers (also 4 bytes each), so 5M nodes should take about 80 MB of memory, rather than the 320 MB seen in Debug mode.
Using the STL hash that ships with C++ (which actually uses an RB tree, i.e. a red-black tree; Wikipedia has a detailed description), the memory usage is about 1.25 times that of the hand-written data structure (though I haven't checked whether its time complexity is lower).
One possible cause is that frequently calling new on a large number of small objects leads to internal fragmentation. I haven't verified whether that's really the case; if it is, one could allocate memory in big blocks up front and manage it from there.
VC6's release mode is set under Build -> Set Active Configuration.
* To do: analyze hash vs. BST for large-scale data. My advisor says hashing is basically unusable on large-scale data; Boss Fu's view seems to be that hashing actually has the advantage when the data volume is large. Worth trying.

64-bit integer under VC6

using __int64 and unsigned __int64

Generally they work fine (just like ordinary int / unsigned int); the only problem I've encountered so far is that you cannot use cout << (or cin >>) to output (or input) them, which means you need to handle their I/O separately (e.g., with printf/scanf and the MSVC-specific %I64d format specifier)...

Word: backspace doesn't work, but delete does

For example, if you highlight some text and want to delete it, but the Backspace key doesn't work (so you can only use, e.g., the Delete key), then:

Word 2003:
Tools -> Options -> check "Typing replaces selected text"

Word 2007:
Office Button -> Word Options -> Advanced -> check "Typing replaces selected text"