Setting Up a Hadoop MapReduce Development Environment

Developing MR programs generally requires a JDK, Eclipse, and a Hadoop cluster. Plenty of blog posts already cover this, but I still want to organize and record the whole process properly.

I. Building a Hadoop cluster and MR development environment on Windows 7

Software and versions to install:

OS: Windows 7

Shell support: Cygwin

JDK: 1.6.0_38

Hadoop: 0.20.2

Eclipse: Juno Service Release 1

Installation and environment variable setup:

1) Install Cygwin

Download the latest installer from the official site: http://cygwin.com/setup.exe

During installation, select the openssh and openssl packages.
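
If you prefer a scripted install, setup.exe can also preselect packages from the command line; a sketch (-q runs unattended, -P names the packages to install):

setup.exe -q -P openssh,openssl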

2) Configure Cygwin

Set the Cygwin environment variables:

Add D:\cygwin\bin;D:\cygwin\usr\sbin;D:\cygwin\usr\i686-pc-cygwin\bin to the PATH variable.

3) Passwordless SSH setup

wuliufu@wuliufu-PC ~
$ ssh-host-config

*** Info: Generating /etc/ssh_host_key
*** Info: Generating /etc/ssh_host_rsa_key
*** Info: Generating /etc/ssh_host_dsa_key
*** Info: Generating /etc/ssh_host_ecdsa_key
*** Info: Creating default /etc/ssh_config file
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) no
*** Info: Updating /etc/sshd_config file

*** Info: Sshd service is already installed.

*** Info: Host configuration finished. Have fun!

Open Services from the Control Panel:

Control Panel\All Control Panel Items\Administrative Tools\Services

You should find the CYGWIN sshd service there; start it.
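
Alternatively, start it from an elevated Windows command prompt (sshd is the service name Cygwin registers; "CYGWIN sshd" is just the display name):

net start sshd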

Note: on Windows 7 the sshd service may fail to start with a message like "the service started and then stopped". In that case, try the following settings:

Right-click CYGWIN sshd → Properties → Log On → This account → Browse → Advanced, select Administrator and click OK, then enter the account's password back on the This account fields.

If the Administrator account is disabled, go to Control Panel\All Control Panel Items\Administrative Tools\Local Security Policy → Local Policies → Security Options, select "Accounts: Administrator account status" on the right, and enable it.

Then restart sshd. If it still won't start, try rerunning ssh-host-config with the following yes/no answers:

wuliufu@wuliufu-PC ~
$ ssh-host-config

*** Query: Overwrite existing /etc/ssh_config file? (yes/no) yes
*** Info: Creating default /etc/ssh_config file
*** Query: Overwrite existing /etc/sshd_config file? (yes/no) yes
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) yes
*** Info: Note that creating a new user requires that the current account have
*** Info: Administrator privileges.  Should this script attempt to create a
*** Query: new local account 'sshd'? (yes/no) yes
*** Info: Updating /etc/sshd_config file

*** Info: Sshd service is already installed.

*** Info: Host configuration finished. Have fun!

Then restart sshd once more; at this point mine started successfully.

Configure passwordless SSH login:

$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/wuliufu/.ssh/id_rsa):
Created directory '/home/wuliufu/.ssh'.
Your identification has been saved in /home/wuliufu/.ssh/id_rsa.
Your public key has been saved in /home/wuliufu/.ssh/id_rsa.pub.
The key fingerprint is:
1c:c7:f2:e1:11:76:0f:a8:66:44:f3:30:4b:98:08:86 wuliufu@wuliufu-PC
The key's randomart image is:
+--[ RSA 2048]----+
| .o. . +* o.o    |
|E.  . o..O.o o   |
|       .+.*   .  |
|       .+* o     |
|       oS o      |
|                 |
|                 |
|                 |
|                 |
+-----------------+
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is be:be:31:a7:83:28:66:82:f7:25:33:4c:98:79:4d:47.
Are you sure you want to continue connecting (yes/no)? yes
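
Once the host key is accepted, a second login should drop you straight into a shell with no password prompt. A quick check (the output will be your own username):

$ ssh localhost whoami
wuliufu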

4) Install the JDK and Eclipse and set their environment variables (omitted)

5) Install Hadoop

Download: http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz

Put the tarball in D:\cygwin\home\wuliufu, then unpack it:

wuliufu@wuliufu-PC ~
$ tar -zxvf hadoop-0.20.2.tar.gz
$ ln -s ~/hadoop-0.20.2 ~/hadoop

6) Configure Hadoop

For now, just set a few core properties as follows; see the documentation for the other properties.

$ cd ~/hadoop/conf
vi hadoop-env.sh
# Set JAVA_HOME and HADOOP_HOME; add assignments like the following
export JAVA_HOME="/cygdrive/d/Program Files/Java/jdk1.6.0_38"
export HADOOP_HOME=/home/wuliufu/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin
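
Note: the space in "Program Files" tends to trip up Hadoop's shell scripts even when quoted. If that happens, a common workaround is the DOS 8.3 short name instead (cygpath -d "/cygdrive/d/Program Files" shows the actual short form; Progra~1 below is the usual default):

export JAVA_HOME=/cygdrive/d/Progra~1/Java/jdk1.6.0_38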

vi core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/wuliufu/hadoop/hadoop-root</value>
  </property>
</configuration>

vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/wuliufu/hadoop/data/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/wuliufu/hadoop/data/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

vi mapred-site.xml
<configuration>
	<property>
		<name>mapred.job.tracker</name>
		<value>localhost:9001</value>
	</property>
</configuration>

7) Format and start Hadoop

1. Format the NameNode:

$ hadoop namenode -format
cygwin warning:
  MS-DOS style path detected: D:\cygwin\home\wuliufu\hadoop-0.20.2/build/native
  Preferred POSIX equivalent is: /home/wuliufu/hadoop-0.20.2/build/native
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
13/04/23 22:42:47 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = wuliufu-PC/192.168.1.100
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
Re-format filesystem in \home\wuliufu\hadoop\hadoop-root\dfs\name ? (Y or N) y
Format aborted in \home\wuliufu\hadoop\hadoop-root\dfs\name
13/04/23 22:42:55 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at wuliufu-PC/192.168.1.100
************************************************************/

Note: the format above was aborted because the prompt only accepts an uppercase Y; a lowercase y aborts. Rerun hadoop namenode -format and answer Y.

2. Start the daemons:

$ cd hadoop/bin
$ ./start-all.sh
starting namenode, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-namenode-wuliufu-PC.out
localhost: starting datanode, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-datanode-wuliufu-PC.out
localhost: starting secondarynamenode, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-secondarynamenode-wuliufu-PC.out
starting jobtracker, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-jobtracker-wuliufu-PC.out
localhost: starting tasktracker, logging to /home/wuliufu/hadoop/logs/hadoop-wuliufu-tasktracker-wuliufu-PC.out
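
A quick sanity check is jps, which lists the running Java processes; all five daemons should show up (PIDs below are illustrative):

$ jps
2512 NameNode
3420 DataNode
1984 SecondaryNameNode
2736 JobTracker
3164 TaskTracker
4052 Jps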

Building the Eclipse plugin (applies to Eclipse SDK 3.3+):

Run the build under Cygwin; for details see:

http://wliufu.iteye.com/blog/1851164

I have also attached the plugin I built against the current Eclipse.

For building against CDH3u4, see: http://yzyzero.iteye.com/blog/1845396

Configuring the Eclipse Hadoop MapReduce environment

1. Copy the plugin built in the previous step, hadoop-0.20.2-eclipse-plugin.jar, into Eclipse's plugins directory and restart Eclipse.

Open Window → Preferences, click Hadoop Map/Reduce on the left, and set the Hadoop installation directory on the right, e.g.:

D:\cygwin\home\wuliufu\hadoop-0.20.2

2. Open Window → Show View → Other, search for "map", select Map/Reduce Locations, and click OK.

You should now see the Map/Reduce Locations view.

In the top-right corner of that view there is a blue elephant icon; click it to create a new location.

Fill in the details; the parameters must match the Hadoop configuration above:

Map/Reduce Master corresponds to the mapred.job.tracker value in mapred-site.xml (localhost:9001).

DFS Master corresponds to the fs.default.name value in core-site.xml (localhost:9000).

Confirm and close the dialog.

Click Open Perspective in Eclipse's top-right corner and switch to the Map/Reduce perspective.

The left panel will now show a DFS Locations tree. If you can expand it and browse HDFS there, the plugin is connected to the Hadoop cluster.

Now for a simple MR program.

In Eclipse, click File → New → Other, choose Map/Reduce Project, and give it any name, say wordcount.

Copy WordCount.java from hadoop-0.20.2 into the project (D:\cygwin\home\wuliufu\hadoop-0.20.2\src\examples\org\apache\hadoop\examples\WordCount.java). A condensed version is shown below.
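
For reference, the bundled example (written against the new org.apache.hadoop.mapreduce API) looks roughly like this; the source file above has the exact 0.20.2 version:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}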

Back in Cygwin, create a file word.txt (any passage of English text will do), then upload it to HDFS:

wuliufu@wuliufu-PC ~
$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - wuliufu-pc\wuliufu supergroup          0 2013-04-23 22:44 /home
drwxr-xr-x   - wuliufu-pc\wuliufu supergroup          0 2013-04-24 00:42 /tmp

wuliufu@wuliufu-PC ~
$ hadoop fs -copyFromLocal ./word.txt /tmp/
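
To confirm the upload from the shell, list the directory; word.txt should appear:

$ hadoop fs -ls /tmp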

In Eclipse, right-click the Hadoop location (the elephant icon) under DFS Locations and refresh; the uploaded file should appear.

Next, let's run WordCount.

It needs to be passed two arguments: the input path and the output directory.

Right-click the class and open Run Configurations... to set the program arguments, as in the example below.
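
For example, on the Arguments tab the two program arguments would be the HDFS paths used here (matching the upload above and the /tmp/out output seen in the log below):

hdfs://localhost:9000/tmp/word.txt hdfs://localhost:9000/tmp/out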

Then right-click → Run As → Run on Hadoop.

The console will print a log like the following:

13/04/24 00:49:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
13/04/24 00:49:46 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/04/24 00:49:47 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 00:49:48 INFO mapred.JobClient: Running job: job_local_0001
13/04/24 00:49:48 INFO input.FileInputFormat: Total input paths to process : 1
13/04/24 00:49:48 INFO mapred.MapTask: io.sort.mb = 100
13/04/24 00:49:49 INFO mapred.MapTask: data buffer = 79691776/99614720
13/04/24 00:49:49 INFO mapred.MapTask: record buffer = 262144/327680
13/04/24 00:49:49 INFO mapred.JobClient:  map 0% reduce 0%
13/04/24 00:49:49 INFO mapred.MapTask: Starting flush of map output
13/04/24 00:49:49 INFO mapred.MapTask: Finished spill 0
13/04/24 00:49:49 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/04/24 00:49:49 INFO mapred.LocalJobRunner: 
13/04/24 00:49:49 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
13/04/24 00:49:49 INFO mapred.LocalJobRunner: 
13/04/24 00:49:49 INFO mapred.Merger: Merging 1 sorted segments
13/04/24 00:49:49 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 870 bytes
13/04/24 00:49:49 INFO mapred.LocalJobRunner: 
13/04/24 00:49:50 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/04/24 00:49:50 INFO mapred.LocalJobRunner: 
13/04/24 00:49:50 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/04/24 00:49:50 INFO mapred.JobClient:  map 100% reduce 0%
13/04/24 00:49:50 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/tmp/out
13/04/24 00:49:50 INFO mapred.LocalJobRunner: reduce > reduce
13/04/24 00:49:50 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
13/04/24 00:49:51 INFO mapred.JobClient:  map 100% reduce 100%
13/04/24 00:49:51 INFO mapred.JobClient: Job complete: job_local_0001
13/04/24 00:49:51 INFO mapred.JobClient: Counters: 14
13/04/24 00:49:51 INFO mapred.JobClient:   FileSystemCounters
13/04/24 00:49:51 INFO mapred.JobClient:     FILE_BYTES_READ=34718
13/04/24 00:49:51 INFO mapred.JobClient:     HDFS_BYTES_READ=1108
13/04/24 00:49:51 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=70010
13/04/24 00:49:51 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=604
13/04/24 00:49:51 INFO mapred.JobClient:   Map-Reduce Framework
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce input groups=66
13/04/24 00:49:51 INFO mapred.JobClient:     Combine output records=66
13/04/24 00:49:51 INFO mapred.JobClient:     Map input records=1
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce output records=66
13/04/24 00:49:51 INFO mapred.JobClient:     Spilled Records=132
13/04/24 00:49:51 INFO mapred.JobClient:     Map output bytes=903
13/04/24 00:49:51 INFO mapred.JobClient:     Combine input records=87
13/04/24 00:49:51 INFO mapred.JobClient:     Map output records=87
13/04/24 00:49:51 INFO mapred.JobClient:     Reduce input records=66

Refresh DFS Locations again.

Open part-r-00000 under the output directory; this is the final result.
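
You can also view it from Cygwin (the path matches the output directory used above):

$ hadoop fs -cat /tmp/out/part-r-00000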

And that's the basic workflow.

Time for bed...

II. Building a Hadoop cluster and MR development environment on Linux

III. Testing with MRUnit

IV. Debugging MR programs
