tag:blogger.com,1999:blog-53761070830730346972024-03-08T05:31:00.771-08:00csberniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-5376107083073034697.post-57520365226305006812009-09-08T20:20:00.000-07:002009-09-08T20:28:58.520-07:00First taste on the facebook applicationI decided to write a facebook application. At least, that will increase my odds to work for facebook, isn't it?<div><br /></div><div>Or, at least it helps me to work for xiaonei.com...</div><div><br /></div><div>OK, come back to the point. Just follow the instruction on facebook. But several things remain to stop me from doing that. Several lessons:</div><div><br /></div><div>* Use a host that supports php5 (or, maybe php4 is good enough?). I used godaddy's economical hosting, whose default configuration (IIS6) doesn't support php at all. Fortunately it provides a free (!!) update to IIS7 which works for php5. That confused me quite a while.</div><div><br /></div><div>* Be careful for the settings:</div><div> - Call back URL. Make sure to end with a "/" at last.</div><div> - Use a IFrame to start with, and you can see the results for even a single HTML source page. Then be careful to change to FBML.</div><div> - When using FBML, write a index.php and write code to that file. Now I'm using the example provided by facebook (<span class="Apple-style-span" style="font-family: 'lucida grande'; font-size: 11px; ">Get started quickly with some <a onclick="return show_example_code("125687791913", "bernie_1st_test", "489d6cad2ec06ec533ac597409d92625", "4d36120b610e173aada4432c1245f4a0", "http:\/\/www.facebook.com\/developers\/editapp.php?app_id=125687791913", "http:\/\/www.berniefu.net\/guess\/", "http:\/\/apps.facebook.com\/berniefu\/");" style="cursor: pointer; color: rgb(59, 89, 152); text-decoration: none; ">example code</a><span class="Apple-style-span" style="font-family: Georgia; font-size: 16px; ">) which is a php code (FBML also works).</span></span></div><div><br /></div><div>Gotta move on. Add oil!</div><div> </div>berniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.com0tag:blogger.com,1999:blog-5376107083073034697.post-16885018847331286192009-07-26T14:59:00.001-07:002009-07-26T14:59:31.557-07:00MapReduce.. PartI<span class="Apple-style-span" style="font-family: 'Times New Roman'; "><div style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 3px; padding-right: 3px; padding-bottom: 3px; padding-left: 3px; width: auto; font: normal normal normal 100%/normal Georgia, serif; text-align: left; "><div>I want to summarize my recent experience using MapReduce Framework.<div><br /></div><div>Actually I will start after if you can successfully run a single-alone version of "wordcount" application. And it is more than just putting everything of single-alone code to the cluster and run it.<br /></div><div><br /></div><div>Moreover, I'll not introduce how to configure MapReduce on the cluster... Actually this time someone else has set it up for us users. </div><div><br /></div><div>Let's begin now.</div><div><br /></div><div><b>Tip#1</b></div><div>This one is related not to MapReduce, but the difference between Windows and Linux.</div><div><br /></div><div>Well, as i found out earlier, <b>cygwin</b> is a cool tool, because you can simulate a Linux environment from windows, <b>AND</b> something more: for example, after I setting up Java environment under windows, you can directly use that Java under linux as well, without the need to install again. </div><div><br /></div><div>Well... really good?</div><div><br /></div><div>However, as I turned out later then, another programming tool, <b>gcc</b>, yeilds <b>different</b> results when compiling under <b>cygwin</b> and <b>real Linux</b>. </div><div><br /></div><div>Maybe Java indeed is more a platform-independent programming language.</div><div><br /></div><div><b>Tip#2</b></div><div>There are several reasons why programs running fine under single-alone MapReduce framework cannot work on the cluster (fully distributed MapReduce mode).</div><div><br /></div><div>Here is one:</div><div>You should not assume that, you can smoothly pass parameters to Map and Reduce functions, from, e.g., class member variables. For example:</div><div><div><br /></div><div>public class WordCount { </div><div> int screwyou;</div><div> public static class Map ...</div><div> {</div><div> public void map()</div><div> {</div><div> ...</div><div> }</div><div> }</div><div><br /></div><div><div> public static class Reduce ...</div><div> {</div><div> public void reduce()</div><div> {</div><div> ...</div><div> }</div><div> }</div><div><br /></div></div><div> public static void main(String[] args) throws Exception { </div><div> int screwyou=1;</div><div> ...</div><div> } </div><div>} </div></div><div><br /></div><div>Then you cannot assume that map/reduce function can access to screwyou variable (because you're screwed). Instead, you need some other mechanism to pass variable. Actually I haven't understood very well how to do that, but the general idea is, use:</div><div><br /></div><div><div>public void configure(JobConf job) {}</div><div><br /></div><div>function inside map and reduce to pass values. You can look at RandomWriter.java to see the example there.</div><div><br /></div><div><br /></div><div>to be continued...</div><div><br /></div></div></div></div></span>berniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.com0tag:blogger.com,1999:blog-5376107083073034697.post-65078956098902081542009-06-28T12:38:00.000-07:002009-06-28T13:43:01.490-07:00EM and etc.OK, i admit that it's the Nth time that I studied EM algorithm all over again. But I don't quite blame myself when N<=10. I follows <a href="http://en.wikipedia.org/wiki/Em_algorithm">wikipedia</a> page but didn't finish reading that. So were some other materials like <a href="http://www.cc.gatech.edu/~dellaert/em-paper.pdf">this</a>. Two most unbearable things related: different set of notations, and theory without concrete and suitable examples.<div><br /></div><div>So what is EM? Well, i know its Expectation and Maximization... Well, firstly it's a way to do <a href="http://en.wikipedia.org/wiki/Maximum_likelihood">MLE</a>. Actually it is said to be able to do MAP as well. EM promises to find a local minimum but cannot always find global minimum if the distribution is multi-modal. I just realized the <a href="http://www.cs.cmu.edu/~guestrin/Class/10701/slides/em-pca-annotated.pdf">derivation</a> from our course advisor Carlos Guestrin very useful... I must be misunderstanding him for a long tim.</div><div><br /></div><div>In short, we want to:</div><div>arg max(theta) (Data Likelihood), or</div><div><br /></div><div>arg max(theta) (Log P(data))</div><div><br /></div><div>And assume we have some observable variable notated by X and some hidden variable notated by Z, then a two-step EM procedure is defined as:</div><div><br /></div><div>First, set initial theta randomly. Then, for each iteration:</div><div><br /></div><div><b>E-step: Calculate</b></div><div><b>Q(t+1)(z|xj) = P(Z|Xj, theta(t))</b></div><div><b><br /></b></div><div><b>M-step:</b></div><div><b>arg max(theta) [sigma(j) sigma(z) Q(t+1)(z|xj) logP(z,xj|theta)]</b></div><div><br /></div><div>Several points here. First, after E-step, the Q(t+1)(z|xj) is settled, and we can consider it as a constant in M-step; Second, i think we really need some examples to get a clearer idea of what those formula means. I'll give two examples here, but i'll just explain the variable, rather than go through the whole calculations.</div><div><br /></div><div><b>Example1 Gaussian Mixture Model (GMM)</b></div><div>The most popular example. </div><div><br /></div><div>Assume that we have 2 (can be >2) gaussian distribution, each with <b>mu</b> (central point) and <b>sigma</b> (variance). Assume each point is generated by the following steps: 1. select which gaussian distribution it comes from (by another variables <b>tau1</b>, and <b>tau2</b>=1-<b>tau1</b>) and 2. place the point according to corresponding distribution. Then we want to estimate all above parameters (<b>mu, sigma, tau, </b>which constitutes <b>theta</b>). X here are each data points, and Z is which Gaussian distribution each data point belongs to, i.e., P(Z=1|x1), P(Z=2|x1), P(Z=1|x2)...</div><div><br /></div><div>So first step is just choose random values of <b>theta</b> as initial values. </div><div><br /></div><div>E-step: Calculate</div><div><b>P(Z|xj, theta) </b>for each x and z value. Btw, this step is always simple because we can usually use Bayesian formula here.</div><div><br /></div><div>M-step: Now we know all the <span class="Apple-style-span" style="font-weight: bold; ">P(Z|xj, theta) <span class="Apple-style-span" style="font-weight: normal; ">values. Just maximize the function by corresponding mu, sigma and tau.</span></span></div><div><br /></div><div>The most simple scenario is that, we can directly calculate the <a href="http://en.wikipedia.org/wiki/Partial_derivative">partial derivative</a> of each parameter, and set it to zero to find the value of parameter that minimize the function.</div><div><br /></div><div><div><b>Example2 <a href="http://www.cs.purdue.edu/homes/lsi/ICML_2003.pdf">Flexible Mixture Model for Collaborative Filtering</a></b></div><div>From section3. Two things i want to mention here:</div><div><br /></div><div>1. Z here is a combination of (Zx,Zy). So the E-step actually calculates P(Zx,Zy|xi,yi,ri) for each rating i and each Zx, Zy. </div><div><br /></div><div>2. At M-step, we need to use <a href="http://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multiplier</a> to estimate parameter. </div><div><br /></div><div>That's probably it. I'll keep updating if i find anything more interesting.</div></div>berniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.com0tag:blogger.com,1999:blog-5376107083073034697.post-91323021555878293402009-06-07T12:33:00.000-07:002009-06-07T12:34:10.465-07:00Hash & VC6 Release Mode<span class="Apple-style-span" style="font-family: Tahoma; line-height: 17px; color: rgb(68, 68, 68); font-size: 13px; "><div style="line-height: 17px; ">道理说,对于VC6,一个int是4字节,因此开一个5M的数组需要20MB内存。事实也是如此</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">不过对于一个class,虽然理论上的内存占用只有类的成员变量,但是事实说明,对于大量数据,占用的内存是理论值的4倍左右</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">英明的符老大提出了Debug & Release的观点。事实说明,采用release mode,能把上面的内存减少大概一半。也就是说,还是理论值的大概两倍</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">我们看一个例子:普通的AVL Tree node,做hash用。每个节点是两个int和两个指针(也是4字节),所以5M个节点大概要占用80MB内存,而不是Debug模式下的320MB</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">而采用C++自带的STL hash(采用RB tree,红黑树,wiki有详细介绍),内存占用大概是自己编写的数据结构的1.25倍(但是没有研究过是否时间复杂度低)</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">一个可能的原因是,频繁new新的小的但是数量多的class member,导致内碎片。还没有证明是否是这样,如果是,就可以直接<strong style="line-height: 17px; font-weight: bold; ">先</strong>分配大块内存,再进行处理</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">VC6的release mode设置在 Build -> Set Action Configuration</div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">* To do: 分析一下大规模数据的hash Vs BST。我老板说,大规模数据下hash根本没法用。听符老大的观点似乎hash在大数据量的时候反而更有优势。做一做</div></span>berniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.com0tag:blogger.com,1999:blog-5376107083073034697.post-6552219487592610632009-06-07T12:31:00.000-07:002009-06-07T12:33:33.051-07:0064-bit integer under VC6<span class="Apple-style-span" style="font-family: Tahoma; line-height: 17px; color: rgb(68, 68, 68); font-size: 13px; "><div style="line-height: 17px; ">see <a href="http://www.cnitblog.com/cockerel/archive/2006/08/16/15356.aspx" style="line-height: 17px; font-weight: inherit; text-decoration: none; color: rgb(0, 102, 167); cursor: pointer; ">http://www.cnitblog.com/cockerel/archive/2006/08/16/15356.aspx</a></div><div style="line-height: 17px; "> </div><div style="line-height: 17px; ">using __int64 and unsigned __int64</div><div style="line-height: 17px; "><br /></div><div style="line-height: 17px; ">generally it works fine (as you're using ordinary int / unsigned int), and the only problem i encountered until now is that, you cannot use cout<< (cin) to output (input) them, which means that you need to handle them seperately...</div></span>berniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.com0tag:blogger.com,1999:blog-5376107083073034697.post-80096178985452921502009-06-07T12:24:00.000-07:002009-06-07T12:28:59.937-07:00Word, backspace doesn't work, but delete worksFor example, if you highlight a text, want to delete it, but backspace key doens't work (so you can only use e.g. delete key), then:<div><br /></div><div>Word 2003:</div><div>Tools -> Options -> check "Typing replaces selected text"</div><div><br /></div><div>Word 2007:</div><div>Start -> Word Options -> check "Typing replaces selected text"</div>berniefu@gmail.comhttp://www.blogger.com/profile/06278204524366522378noreply@blogger.com0