Sunday, July 26, 2009

MapReduce... Part I

I want to summarize my recent experience using the MapReduce framework.

Actually I will start from the point where you can already run the standalone version of the "wordcount" application successfully. Getting it onto the cluster takes more than just copying the standalone code over and running it.

Also, I won't cover how to configure MapReduce on the cluster... in my case, someone else had already set it up for us users.

Let's begin now.

Tip#1
This one is not really about MapReduce, but about a difference between Windows and Linux.

Well, as I found out earlier, Cygwin is a cool tool, because it lets you simulate a Linux environment on Windows, AND something more: for example, after setting up a Java environment under Windows, you can use that same Java directly inside Cygwin's Linux-like environment as well, without installing it again.

Sounds really good, right?

However, as it turned out later, another programming tool, gcc, yields different results when compiling under Cygwin and under real Linux.

Maybe Java really is the more platform-independent programming language.

Tip#2
There are several reasons why a program that runs fine under the standalone MapReduce framework can fail on the cluster (fully distributed MapReduce mode).

Here is one:
You should not assume that you can smoothly pass parameters to the Map and Reduce functions through, e.g., class member variables. For example:

public class WordCount {
    // set in main(), but do NOT expect map()/reduce() to see it on the cluster
    static int screwyou;

    public static class Map ... {
        public void map(...) {
            // runs in a separate JVM on some worker node,
            // where screwyou still has its default value
            ...
        }
    }

    public static class Reduce ... {
        public void reduce(...) {
            ...
        }
    }

    public static void main(String[] args) throws Exception {
        screwyou = 1;  // only visible in the JVM that runs main()
        ...
    }
}

Then you cannot assume that the map/reduce functions can access the screwyou variable (because you're screwed): on a real cluster they run in separate JVMs on other machines, so they never see the value assigned in main(). Instead, you need some other mechanism to pass the value in. Actually I haven't understood this very well yet, but the general idea is to use:

public void configure(JobConf job) {}

inside your Mapper and Reducer classes to read values out of the job configuration. You can look at RandomWriter.java in the Hadoop examples to see how it's done there.
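
To make this concrete, here is a minimal sketch of how I understand it, using the old org.apache.hadoop.mapred API. The property name "wordcount.screwyou" is just something I made up for illustration; any unique string works.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private int screwyou;  // per-task copy, filled in by configure()

        @Override
        public void configure(JobConf job) {
            // called once in each task's JVM before any map() call;
            // "wordcount.screwyou" is a made-up property name
            screwyou = job.getInt("wordcount.screwyou", 0);
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // ... screwyou now holds the value set in main() ...
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);
        // the value travels to every task through the job configuration
        conf.setInt("wordcount.screwyou", 1);
        // ... set input/output paths, then JobClient.runJob(conf) ...
    }
}

By the way, this also explains why the member-variable trick can appear to work in standalone mode: there, everything runs in one JVM, so the bug only shows up once you move to the cluster.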


to be continued...