Nutch 1.8 + Hadoop 1.2 + Solr 4.3 Distributed Cluster Configuration

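Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler. That, at least, is how Baidu Baike describes it, but the description stopped fitting after Nutch 1.2: since then Nutch has focused solely on crawling, and full-text search has been handed over entirely to Lucene, Solr, and ES. Because these projects are close relatives, it is very easy to build a full-text index from the data Nutch has crawled.
Now to the point. The latest Nutch release is 2.2.1. The 2.x line uses Gora to support several storage backends, while the latest 1.x release, 1.8, supports only HDFS storage. I am still using Nutch 1.8 here. Why the 1.x series? It comes down to my Hadoop environment: Nutch 2.x uses Hadoop 2.x, although if you do not mind the trouble you can adjust the jar configuration and run Nutch 2.x on a Hadoop 1.x cluster. Nutch 1.x, by contrast, runs on Hadoop 1.x with no extra effort. The components used in this Nutch + Hadoop + Solr cluster test are listed below: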
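No.  Name           Role
1    Nutch 1.8      Crawls the data; supports distributed operation
2    Hadoop 1.2.0   Runs the crawl in parallel with MapReduce and stores the data in HDFS; Nutch jobs are submitted to the Hadoop cluster; supports distributed operation
3    Solr 4.3.1     Handles retrieval: search and queries over the crawled data; supports distributed operation for large data sets
4    IK 4.3         Tokenizes page content and titles to support full-text search
5    CentOS 6.5     The Linux system that runs Nutch, Hadoop, and the other applications
6    Tomcat 7.0     Application server that provides the container Solr runs in
7    JDK 1.7        Provides the Java runtime environment
8    Ant 1.9        Builds Nutch and other sources
9    One humble software engineer   The protagonist
Now let's get started.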
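1. First make sure your Ant environment is set up correctly. It is best to do everything on Linux, since problems are more likely on Windows. After downloading the Nutch source, change into the Nutch root directory, run ant, and wait for the build to finish. The build produces a runtime directory containing the Nutch launch scripts in two flavors: local mode and deploy (distributed cluster) mode.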
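As a quick sanity check after the build, the sketch below looks for the launch scripts in both runtime flavors. The function name check_nutch_runtime is hypothetical, and the assumption that each flavor ships bin/nutch and bin/crawl reflects the usual layout of a Nutch 1.8 build; adjust it if your tree differs.

```shell
# Report, for each runtime flavor, whether the launch scripts exist.
# check_nutch_runtime is a hypothetical helper, not part of Nutch.
check_nutch_runtime() {
  nutch_root="$1"
  missing=0
  for mode in local deploy; do
    if [ -x "$nutch_root/runtime/$mode/bin/nutch" ] &&
       [ -x "$nutch_root/runtime/$mode/bin/crawl" ]; then
      echo "$mode: ok"
    else
      echo "$mode: missing"
      missing=1
    fi
  done
  # Non-zero exit status if any flavor is incomplete.
  return "$missing"
}

# Typical use after building (the path is the one used later in this post):
#   cd /root/apache-nutch-1.8 && ant
#   check_nutch_runtime /root/apache-nutch-1.8
```

If either line prints "missing", the ant build most likely did not complete, and its output is the first place to look.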
2. Configure nutch-site.xml by adding the following:
<property>
  <name>http.agent.name</name>
  <value>mynutch</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>mynutch,*</value>
  <description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <value>plugins</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
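3. On the Hadoop cluster, create a urls folder and a mydir folder. The former stores the seed URL file; the latter holds the data once it has been crawled.
hadoop fs -mkdir urls                     # create the folder
hadoop fs -put <local path> <HDFS path>   # upload the seed file to HDFS
hadoop fs -ls /                           # list the contents of a path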
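The seed step can be sketched end to end as follows. The helper names write_seeds and upload_seeds, the file name seed.txt, and the example URL are all assumptions made for illustration; only the hadoop fs subcommands come from the steps above.

```shell
# write_seeds FILE URL... : write one seed URL per line into FILE.
write_seeds() {
  seeds_file="$1"
  shift
  : > "$seeds_file"                 # truncate or create the file
  for url in "$@"; do
    printf '%s\n' "$url" >> "$seeds_file"
  done
}

# upload_seeds FILE : create urls/ and mydir/ in the HDFS home directory,
# upload FILE into urls/, then list the result. Needs a running cluster.
upload_seeds() {
  hadoop fs -mkdir urls &&
  hadoop fs -mkdir mydir &&
  hadoop fs -put "$1" urls/ &&
  hadoop fs -ls urls
}

# Example (hypothetical seed URL):
#   write_seeds seed.txt http://nutch.apache.org/
#   upload_seeds seed.txt
```

Relative HDFS paths such as urls resolve under the current user's HDFS home directory, which is why the later crawl command can refer to urls and mydir without a leading slash.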
4. Set up the Hadoop cluster and its HADOOP_HOME environment variable. This is important: at run time Nutch submits its jobs based on the Hadoop environment variables.
export HADOOP_HOME=/root/hadoop1.2
export PATH=$HADOOP_HOME/bin:$PATH
ANT_HOME=/root/apache-ant-1.9.2
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL
export JAVA_HOME=/root/jdk1.7
export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Once this is configured, use the which hadoop command to check that the setup is correct:
[root@master bin]# which hadoop
/root/hadoop1.2/bin/hadoop
[root@master bin]#
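The same check can be scripted so that it fails loudly before a crawl is submitted. This is a minimal sketch; check_hadoop_env is a hypothetical helper name, and the only thing it relies on is the HADOOP_HOME and PATH setup described above.

```shell
# Verify that HADOOP_HOME is set and that the hadoop found on PATH
# is the one inside HADOOP_HOME. Hypothetical helper, not part of Hadoop.
check_hadoop_env() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set"
    return 1
  fi
  if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "no executable at $HADOOP_HOME/bin/hadoop"
    return 1
  fi
  resolved=$(command -v hadoop) || { echo "hadoop is not on PATH"; return 1; }
  if [ "$resolved" != "$HADOOP_HOME/bin/hadoop" ]; then
    echo "PATH resolves hadoop to $resolved"
    return 1
  fi
  echo "ok: $resolved"
}

# Example:
#   check_hadoop_env || exit 1
```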
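5. Configure the Solr service. Copy the schema.xml file from Nutch's conf directory into Solr, overwriting Solr's original schema.xml, and add the IK analyzer. The content is as follows: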
idcontent
6. With the configuration done, change into Nutch's /root/apache-nutch-1.8/runtime/deploy/bin directory and run the following command:
./crawl urls mydir http://192.168.211.36:9001/solr/   2
This starts the crawl job on the cluster.
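[Screenshot: MapReduce jobs running during the crawl]
Once the crawl finishes, we can go to Solr and look at the crawled content.
[Screenshot: crawled documents in Solr]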
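The indexed document count can also be checked from the command line. The sketch below assumes that Solr answers on the standard select handler directly under the base URL used in the crawl command and that the JSON response contains a numFound field; the helper names solr_select_url and parse_num_found are hypothetical.

```shell
# Build a match-all query URL that returns only the hit count.
# ${1%/} strips one trailing slash from the base URL, if present.
solr_select_url() {
  printf '%s/select?q=*:*&rows=0&wt=json\n' "${1%/}"
}

# Extract the numFound value from a Solr JSON response on stdin.
parse_num_found() {
  sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}

# Example (needs a reachable Solr instance):
#   curl -s "$(solr_select_url http://192.168.211.36:9001/solr/)" | parse_num_found
```

A count of zero after a crawl that reported success usually means the indexing step never reached Solr, so the Solr URL passed to the crawl script is worth rechecking.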
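With that, a simple crawl-and-search system is done. It is quite painless, and every piece used is an open-source project from the Lucene family.
Summary: I ran into a few fairly typical errors during configuration, recorded below: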
java.lang.Exception: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 11 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 16 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 19 more
Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
    at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:123)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:74)
    ... 24 more
2013-09-05 20:40:49,329 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1393)) - map 0% reduce 0%
2013-09-05 20:40:49,332 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1448)) - Job complete: job_local1315110785_0001
2013-09-05 20:40:49,332 INFO mapred.JobClient (Counters.java:log(585)) - Counters: 0
2013-09-05 20:40:49,333 INFO mapred.JobClient (JobClient.java:runJob(1356)) - Job failed: NA
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
==========================================================================
Solution: add the following configuration to nutch-site.xml.
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
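Because this error comes from a missing plugin.folders entry, it can help to test for the property before starting a crawl. The sketch below is a plain text search, not an XML parser, and has_property is a hypothetical helper name; it assumes the property name sits on one line as <name>...</name>, as in the listings in this post.

```shell
# has_property FILE NAME : succeed if FILE contains <name>NAME</name>.
# grep -F treats the pattern as a fixed string, so the dots in
# property names are not interpreted as regex wildcards.
has_property() {
  grep -qF "<name>$2</name>" "$1"
}

# Example:
#   has_property conf/nutch-site.xml plugin.folders ||
#     echo "plugin.folders is missing from nutch-site.xml"
```

As far as I know, the deploy runtime packages the configuration into the .job file, so after editing nutch-site.xml it is safest to run ant again before resubmitting the crawl.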
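When running the crawl shell command, I found that invoking it as bin/crawl urls mydir http://192.168.211.36:9001/solr/   2 sometimes fails because certain directories on HDFS cannot be accessed correctly. I therefore recommend running it from inside the bin directory instead:
./crawl urls mydir http://192.168.211.36:9001/solr/   2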
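To avoid starting the script from the wrong directory by accident, the invocation can be wrapped as below. This is a sketch: run_crawl is a hypothetical wrapper, and the argument labels assume the four arguments are, in order, the seed directory, the crawl directory, the Solr URL, and the number of crawl rounds, which matches how the command above is used.

```shell
# run_crawl BIN_DIR SEED_DIR CRAWL_DIR SOLR_URL ROUNDS
# Changes into BIN_DIR inside a subshell (the caller's working directory
# is left untouched) and runs ./crawl with the remaining arguments.
run_crawl() {
  if [ "$#" -ne 5 ]; then
    echo "usage: run_crawl <bin_dir> <seed_dir> <crawl_dir> <solr_url> <rounds>" >&2
    return 2
  fi
  ( cd "$1" && shift && ./crawl "$@" )
}

# Example, using the paths and URL from this post:
#   run_crawl /root/apache-nutch-1.8/runtime/deploy/bin \
#     urls mydir http://192.168.211.36:9001/solr/ 2
```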
http://itindex.net/detail/49582-nutch1.8-hadoop1.2-solr4.3