Hadoop can run on a single node in so-called pseudo-distributed mode, in which each Hadoop daemon runs as a separate Java process. This article configures Hadoop's pseudo-distributed mode with an automated script. The test environment is CentOS 6.3 in VMware with Hadoop 1.2.1; other versions have not been tested.
Pseudo-distributed configuration script

The script configures core-site.xml, hdfs-site.xml, and mapred-site.xml, and sets up passwordless SSH login. [1]
#!/bin/bash
# Usage:   Hadoop pseudo-distributed configuration
# History:
#   20140426  annhe  basic functionality

# Check that the user is root
if [ $(id -u) != "0" ]; then
    printf "Error: you must be root to run this script!\n"
    exit 1
fi

# Sync the clock
rm -rf /etc/localtime
ln -s /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
#yum install -y ntp
ntpdate -u pool.ntp.org &>/dev/null
echo -e "Time: `date` \n"

# Assumes a single NIC; multi-NIC setups are not handled
ip=`ifconfig eth0 |grep "inet addr" |awk '{print $2}' |cut -d: -f2`

# Pseudo-distributed configuration
function pseudodistributed ()
{
    cd /etc/hadoop/

    # Restore any earlier backups
    mv core-site.xml.bak core-site.xml
    mv hdfs-site.xml.bak hdfs-site.xml
    mv mapred-site.xml.bak mapred-site.xml

    # Back up the current files
    mv core-site.xml core-site.xml.bak
    mv hdfs-site.xml hdfs-site.xml.bak
    mv mapred-site.xml mapred-site.xml.bak

    # Write the new core-site.xml
cat > core-site.xml <<eof
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://$ip:9000</value>
    </property>
</configuration>
eof

    # Write the new hdfs-site.xml
cat > hdfs-site.xml <<eof
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
eof

    # Write the new mapred-site.xml
cat > mapred-site.xml <<eof
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>$ip:9001</value>
    </property>
</configuration>
eof
}

# Set up passwordless SSH login
function passphraselessssh ()
{
    # Do not regenerate the private key if one already exists
    [ ! -f ~/.ssh/id_dsa ] && ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/authorized_keys |grep "`cat ~/.ssh/id_dsa.pub`" &>/dev/null && r=0 || r=1
    # Append the public key only if it is not already there
    [ $r -eq 1 ] && cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    chmod 644 ~/.ssh/authorized_keys
}

# Format HDFS and start the daemons
function execute ()
{
    # Format a new distributed filesystem
    hadoop namenode -format
    # Start the Hadoop daemons
    start-all.sh
    echo -e "\n========================================================================"
    echo "hadoop log dir : $HADOOP_LOG_DIR"
    echo "NameNode - http://$ip:50070/"
    echo "JobTracker - http://$ip:50030/"
    echo -e "\n========================================================================="
}

pseudodistributed 2>&1 | tee -a pseudo.log
passphraselessssh 2>&1 | tee -a pseudo.log
execute 2>&1 | tee -a pseudo.log
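A minimal usage sketch, assuming the script is saved as pseudo.sh (the ssh and jps checks are optional sanity checks, not part of the script itself):

chmod +x pseudo.sh
./pseudo.sh            # must be run as root; output is also appended to pseudo.log
ssh localhost exit     # should log in without prompting for a password
jps                    # expect NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker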
Script test results

[root@hadoop hadoop]# ./pseudo.sh
14/04/26 23:52:30 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/216.34.94.184
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:27:42 PDT 2013
STARTUP_MSG:   java = 1.7.0_51
************************************************************/
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) y
Format aborted in /tmp/hadoop-root/dfs/name
14/04/26 23:52:40 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/216.34.94.184
************************************************************/
starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-hadoop.out
localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /var/log/hadoop/root/hadoop-root-secondarynamenode-hadoop.out
starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-hadoop.out

========================================================================
hadoop log dir : /var/log/hadoop/root
NameNode - http://192.168.60.128:50070/
JobTracker - http://192.168.60.128:50030/
=========================================================================
Access the NameNode and JobTracker web interfaces from a browser on the host machine.

[Figure: the NameNode web interface in a browser]

[Figure: the JobTracker web interface in a browser]
Running a test program

Copy the input files to the distributed filesystem:
$ hadoop fs -put input input
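For reference, a two-file input consistent with the WordCount output shown later can be created along these lines; the file names and contents are assumptions inferred from the counters below (2 map input records; hello 2, world 1, hadoop 1):

mkdir input
echo "hello world" > input/file1     # assumed contents
echo "hello hadoop" > input/file2    # assumed contents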
View HDFS through the web interface:

[Figure: browsing the HDFS filesystem through the NameNode web interface]
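The same check can be done from the command line; a quick sketch:

hadoop fs -ls input     # list the files just uploaded to HDFS
hadoop fsck / -files    # filesystem health report, file by file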
Run the example program:
[root@hadoop hadoop]# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount input output
Check the job's progress through the JobTracker web interface:

[Figure: WordCount job status]
Execution output:
[root@hadoop hadoop]# hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount input out2
14/04/27 03:34:56 INFO input.FileInputFormat: Total input paths to process : 2
14/04/27 03:34:56 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/04/27 03:34:56 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/27 03:34:57 INFO mapred.JobClient: Running job: job_201404270333_0001
14/04/27 03:34:58 INFO mapred.JobClient:  map 0% reduce 0%
14/04/27 03:35:49 INFO mapred.JobClient:  map 100% reduce 0%
14/04/27 03:36:16 INFO mapred.JobClient:  map 100% reduce 100%
14/04/27 03:36:19 INFO mapred.JobClient: Job complete: job_201404270333_0001
14/04/27 03:36:19 INFO mapred.JobClient: Counters: 29
14/04/27 03:36:19 INFO mapred.JobClient:   Job Counters
14/04/27 03:36:19 INFO mapred.JobClient:     Launched reduce tasks=1
14/04/27 03:36:19 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=72895
14/04/27 03:36:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/27 03:36:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/04/27 03:36:19 INFO mapred.JobClient:     Launched map tasks=2
14/04/27 03:36:19 INFO mapred.JobClient:     Data-local map tasks=2
14/04/27 03:36:19 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=24880
14/04/27 03:36:19 INFO mapred.JobClient:   File Output Format Counters
14/04/27 03:36:19 INFO mapred.JobClient:     Bytes Written=25
14/04/27 03:36:19 INFO mapred.JobClient:   FileSystemCounters
14/04/27 03:36:19 INFO mapred.JobClient:     FILE_BYTES_READ=55
14/04/27 03:36:19 INFO mapred.JobClient:     HDFS_BYTES_READ=260
14/04/27 03:36:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=164041
14/04/27 03:36:19 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=25
14/04/27 03:36:19 INFO mapred.JobClient:   File Input Format Counters
14/04/27 03:36:19 INFO mapred.JobClient:     Bytes Read=25
14/04/27 03:36:19 INFO mapred.JobClient:   Map-Reduce Framework
14/04/27 03:36:19 INFO mapred.JobClient:     Map output materialized bytes=61
14/04/27 03:36:19 INFO mapred.JobClient:     Map input records=2
14/04/27 03:36:19 INFO mapred.JobClient:     Reduce shuffle bytes=61
14/04/27 03:36:19 INFO mapred.JobClient:     Spilled Records=8
14/04/27 03:36:19 INFO mapred.JobClient:     Map output bytes=41
14/04/27 03:36:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=414441472
14/04/27 03:36:19 INFO mapred.JobClient:     CPU time spent (ms)=2910
14/04/27 03:36:19 INFO mapred.JobClient:     Combine input records=4
14/04/27 03:36:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=235
14/04/27 03:36:19 INFO mapred.JobClient:     Reduce input records=4
14/04/27 03:36:19 INFO mapred.JobClient:     Reduce input groups=3
14/04/27 03:36:19 INFO mapred.JobClient:     Combine output records=4
14/04/27 03:36:19 INFO mapred.JobClient:     Physical memory (bytes) snapshot=353439744
14/04/27 03:36:19 INFO mapred.JobClient:     Reduce output records=3
14/04/27 03:36:19 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2195972096
14/04/27 03:36:19 INFO mapred.JobClient:     Map output records=4
View the results:
[root@hadoop hadoop]# hadoop fs -cat out2/*
hadoop  1
hello   2
world   1
The files can also be copied from the distributed filesystem to the local filesystem for viewing:
[root@hadoop hadoop]# hadoop fs -get out2 out4
[root@hadoop hadoop]# cat out4/*
cat: out4/_logs: is a directory
hadoop  1
hello   2
world   1
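Alternatively, hadoop fs -getmerge should collapse the part files under out2 into a single local file, skipping subdirectories such as _logs; a small sketch, with out2.txt as an arbitrary local name:

hadoop fs -getmerge out2 out2.txt   # concatenate the output files into one local file
cat out2.txt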
When everything is done, stop the daemons:
[root@hadoop hadoop]# stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Problems encountered

The host machine cannot reach the web interfaces

iptables is enabled, so the relevant ports have to be opened; in a test environment you can also simply turn iptables off.
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50075 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
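To apply the rules on CentOS 6, assuming they were written to /etc/sysconfig/iptables as above, or to disable the firewall entirely on a throwaway test VM:

service iptables restart    # reload the rules from /etc/sysconfig/iptables
# or, test environments only:
service iptables stop
chkconfig iptables off      # keep it off across reboots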
"Browse the filesystem" redirects to the wrong address

Clicking "Browse the filesystem" on the NameNode web interface redirects to localhost:50075. [2][3]
Fix: in core-site.xml, change hdfs://localhost:9000 to the VM's IP address. (The script above has already been changed to configure the IP automatically.)
From what several rounds of changes showed, a hostname also works here, as long as the machine you browse from can resolve it; so in a public network environment with a DNS server, setting a domain name should be possible too.
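If you would rather patch an existing core-site.xml than rerun the script, a one-liner along these lines works (192.168.60.128 is just the example VM address from above; substitute your own):

sed -i 's#hdfs://localhost:9000#hdfs://192.168.60.128:9000#' /etc/hadoop/core-site.xml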
Execution hangs during the reduce phase

Add the hostname's IP mapping to /etc/hosts [4][5]. (The Hadoop installation script has since been updated to configure this automatically.) A scripted version of the fix follows the snippet below.
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   hadoop    # add this line
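A sketch of automating this, assuming the machine's hostname (here "hadoop") is what the daemons need to resolve:

grep -q "$(hostname)" /etc/hosts || echo "127.0.0.1 $(hostname)" >> /etc/hosts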
References

[1]. Hadoop official documentation. http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
[2]. Stack Overflow. http://stackoverflow.com/questions/15254492/wrong-redirect-from-hadoop-hdfs-namenode-to-localhost50075
[3]. ITeye. http://yymmiinngg.iteye.com/blog/706909
[4]. Stack Overflow. http://stackoverflow.com/questions/10165549/hadoop-wordcount-example-stuck-at-map-100-reduce-0
[5]. Li Jun's blog. http://www.colorlight.cn/archives/32
This article is released under a CC license; when republishing, please credit the source with a link.
Permalink: http://www.annhe.net/article-2682.html