Part 1: nutch 1.2
Part 2: nutch 1.5.1
Part 3: nutch 2.0
Part 4: Configuring SSH
Part 5: Installing a Hadoop cluster (pseudo-distributed mode) and running Nutch
Part 6: Installing a Hadoop cluster (fully distributed mode) and running Nutch
Part 7: Monitoring the Hadoop and HBase clusters with Ganglia
Part 8: Configuring Snappy compression for Hadoop
Part 9: Configuring Lzo compression for Hadoop
Part 10: Configuring a ZooKeeper cluster to run HBase
Part 11: Configuring an HBase cluster to run nutch-2.1 (Region Servers may crash due to memory issues)
Part 12: Configuring an Accumulo cluster to run nutch-2.1 (gora has a BUG)
Part 13: Configuring a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)
Part 14: Configuring a standalone MySQL server to run nutch-2.1
Part 15: nutch 2.1 with DataFileAvroStore as the data store
Part 16: nutch 2.1 with AvroStore as the data store
Part 17: Configuring SOLR
Part 18: Monitoring with Nagios
Part 19: Configuring Splunk
Part 20: Configuring Pig
Part 21: Configuring Hive
Part 22: Configuring a Hadoop 2.x cluster
Part 1: nutch 1.2
The steps are largely the same as in Part 2. In step 5 (Configure the build path), two extra operations are needed: in the Package Explorer on the left, right-click the nutch1.2 folder > Build Path > Configure Build Path... > select the Source tab > change Default output folder from nutch1.2/bin to nutch1.2/_bin; then right-click the bin folder under nutch1.2 > Team > Revert.
The original post marked the differences from Part 2 with colors (yellow for version-number differences, red for items that do not exist in 1.2, green for items that differ); the differing items are:
1. Add JARs... > nutch1.2 > lib > select all the .jar files > OK
2. crawl-urlfilter.txt
3. Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt
4. Edit crawl-urlfilter.txt so that it reads:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.
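These two rules can be sanity-checked from the shell before running a crawl. The sketch below substitutes example.com for the MY.DOMAIN.NAME placeholder (an assumption for illustration only); note that the dots inside the pattern are unescaped regex metacharacters, so the filter is slightly looser than it looks.

```shell
# Accept pattern from crawl-urlfilter.txt, with MY.DOMAIN.NAME replaced
# by example.com for this demonstration.
pattern='^http://([a-z0-9]*\.)*example.com/'

# Return success if the URL matches the accept rule.
matches() { echo "$1" | grep -qE "$pattern"; }

matches "http://www.example.com/index.html" && echo "accepted"
matches "http://evil.org/index.html" || echo "rejected"
```

In the real file you may want to escape the dots (example\.com) for an exact-domain match.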
5. cd /home/ysc/workspace/nutch1.2
nutch 1.2 is a complete search engine, while nutch 1.5.1 is only a crawler. nutch 1.2 can either submit its index to SOLR or generate a LUCENE index directly; nutch 1.5.1 can only submit its index to SOLR:
1. cd /home/ysc
2. wget http://mirrors.tuna.tsinghua.edu.cn/apache/tomcat/tomcat-7/v7.0.29/bin/apache-tomcat-7.0.29.tar.gz
3. tar -xvf apache-tomcat-7.0.29.tar.gz
4. In the Package Explorer, right-click the build.xml file under the nutch1.2 folder > Run As > Ant Build... > check the war target > Run
5. cd /home/ysc/workspace/nutch1.2/build
6. unzip nutch-1.2.war -d nutch-1.2
7. cp -r nutch-1.2 /home/ysc/apache-tomcat-7.0.29/webapps
8. vi /home/ysc/apache-tomcat-7.0.29/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml
Add the following configuration:
<property>
<name>searcher.dir</name>
<value>/home/ysc/workspace/nutch1.2/data</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
9. vi /home/ysc/apache-tomcat-7.0.29/conf/server.xml
Change
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"/>
to
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="utf-8"/>
10. cd /home/ysc/apache-tomcat-7.0.29/bin
11. ./startup.sh
12. Visit: http://localhost:8080/nutch-1.2/
For more nutch 1.2 bug fixes and related material, see my resources published on CSDN: http://download.csdn.net/user/yangshangchuan
Part 2: nutch 1.5.1
1. Download and unpack Eclipse (the IDE)
Download from http://www.eclipse.org/downloads/ and choose Eclipse IDE for Java EE Developers
2. Install the Subclipse plugin (an SVN client)
Update site: http://subclipse.tigris.org/update_1.8.x
3. Install the IvyDE plugin (downloads the dependency JARs)
Update site: http://www.apache.org/dist/ant/ivyde/updatesite/
4. Check out the code
File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location > URL: https://svn.apache.org/repos/asf/nutch/tags/release-1.5.1/ > select the URL > Finish
In the New Project wizard, choose Java Project > Next, enter Project name: nutch1.5.1 > Finish
5. Configure the build path
In the Package Explorer on the left, right-click the nutch1.5.1 folder > Build Path > Configure Build Path...
> Select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources (for the plugins, also select the src/java and src/test folders under each plugin directory below src/plugin) > OK
Switch to the Libraries tab >
Add Class Folder... > select nutch1.5.1/conf > OK
Add JARs... > select the jar files under the lib directory of each plugin directory below src/plugin > OK
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
Switch to the Order and Export tab >
select conf > Top
6. Run ANT
In the Package Explorer, right-click the build.xml file under the nutch1.5.1 folder > Run As > Ant Build
Right-click the nutch1.5.1 folder > Refresh
Right-click the nutch1.5.1 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build > OK
7. Edit the configuration files nutch-site.xml and regex-urlfilter.txt
Rename nutch-site.xml.template to nutch-site.xml
Rename regex-urlfilter.txt.template to regex-urlfilter.txt
Right-click the nutch1.5.1 folder > Refresh
Add the following properties to nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>nutch</value>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
Edit regex-urlfilter.txt, replacing
# accept anything else
+.
with:
+^http://([a-z0-9]*\.)*news.163.com/
-.
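Nutch applies such rules first-match-wins: it walks the file top to bottom, and the sign of the first matching pattern decides. A minimal shell re-implementation of that logic (an illustrative sketch, not Nutch's actual RegexURLFilter class):

```shell
# First-match-wins evaluation of the two rules above.
filter_url() {
  url=$1
  while IFS= read -r rule; do
    pat=${rule#?}                       # rule text without its +/- prefix
    if echo "$url" | grep -qE "$pat"; then
      case $rule in
        +*) echo accept ;;
        -*) echo reject ;;
      esac
      return
    fi
  done <<'EOF'
+^http://([a-z0-9]*\.)*news.163.com/
-.
EOF
  echo reject                           # no rule matched: drop the URL
}

filter_url "http://news.163.com/world/"   # accept
filter_url "http://www.example.com/"      # reject (falls through to -.)
```

Because `-.` matches any URL, every URL outside news.163.com is rejected by that final rule.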
8. Develop and debug
In the Package Explorer, right-click the nutch1.5.1 folder > New > Folder > Folder name: urls
In the newly created urls directory, create a text file named url with the content: http://news.163.com
Open the org.apache.nutch.crawl.Crawl.java class under src/java, right-click > Run As > Run Configurations > Arguments > enter in the Program arguments box: urls -dir data -depth 3 > Run
To debug, set breakpoints where needed and run Debug As > Java Application
9. Inspect the results
Inspect the segments directory:
Open the org.apache.nutch.segment.SegmentReader.java class under src/java
Right-click > Run As > Java Application; the console prints the command usage
Right-click > Run As > Run Configurations > Arguments > enter in the Program arguments box: -dump data/segments/* data/segments/dump
Open the file data/segments/dump/dump in a text editor to inspect what is stored in the segments
Inspect the crawldb directory:
Open the org.apache.nutch.crawl.CrawlDbReader.java class under src/java
Right-click > Run As > Java Application; the console prints the command usage
Right-click > Run As > Run Configurations > Arguments > enter in the Program arguments box: data/crawldb -stats
The console prints the crawldb statistics
Inspect the linkdb directory:
Open the org.apache.nutch.crawl.LinkDbReader.java class under src/java
Right-click > Run As > Java Application; the console prints the command usage
Right-click > Run As > Run Configurations > Arguments > enter in the Program arguments box: data/linkdb -dump data/linkdb_dump
Open the file data/linkdb_dump/part-00000 in a text editor to inspect what is stored in the linkdb
10. Whole-web crawling, step by step
In the Package Explorer, right-click the build.xml file under the nutch1.5.1 folder > Run As > Ant Build
cd /home/ysc/workspace/nutch1.5.1/runtime/local
# prepare the URL list
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/url
# inject the URLs
bin/nutch inject crawl/crawldb dmoz
# generate a fetch list
bin/nutch generate crawl/crawldb crawl/segments
# first crawl round
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
# fetch the pages
bin/nutch fetch $s1
# parse the pages
bin/nutch parse $s1
# update the URL status
bin/nutch updatedb crawl/crawldb $s1
# second crawl round
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
# third crawl round
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3
# build the inverted link database
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
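The three hand-unrolled rounds above all follow one pattern, so they can be collapsed into a loop. The sketch below assumes it runs from runtime/local with crawl/crawldb already injected; the crawl_rounds function is shown but not executed here, since it needs a Nutch checkout. The second half demonstrates why `ls -d crawl/segments/2* | tail -1` always picks up the segment just generated.

```shell
# One generate/fetch/parse/updatedb round per loop iteration.
crawl_rounds() {
  for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg=$(ls -d crawl/segments/2* | tail -1)   # newest segment
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
    bin/nutch updatedb crawl/crawldb "$seg"
  done
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
}

# Segment directories are named by timestamp (e.g. 20121203123456), so
# lexicographic order equals chronological order and `tail -1` is the newest.
tmp=$(mktemp -d)
mkdir -p "$tmp/crawl/segments/20121201000000" "$tmp/crawl/segments/20121203123456"
newest=$(ls -d "$tmp"/crawl/segments/2* | tail -1)
echo "$newest"
```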
11. Indexing and search
cd /home/ysc/
wget http://mirror.bjtu.edu.cn/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz
tar -xvf apache-solr-3.6.1.tgz
cd apache-solr-3.6.1/example
NUTCH_RUNTIME_HOME=/home/ysc/workspace/nutch1.5.1/runtime/local
APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
To store page content in the index, change the following line in schema.xml
<field name="content" type="text" stored="false" indexed="true"/>
to
<field name="content" type="text" stored="true" indexed="true"/>
Edit ${APACHE_SOLR_HOME}/example/solr/conf/solrconfig.xml, replacing every <str name="df">text</str> with <str name="df">content</str>
In ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml, change <schema name="nutch" version="1.5.1"> to <schema name="nutch" version="1.5">
# start the SOLR server
java -jar start.jar
http://127.0.0.1:8983/solr/admin/
http://127.0.0.1:8983/solr/admin/stats.jsp
cd /home/ysc/workspace/nutch1.5.1/runtime/local
# submit the index
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Run a complete crawl:
bin/nutch crawl urls -dir data -depth 2 -topN 100 -solr http://127.0.0.1:8983/solr/
Page through all indexed documents with:
http://127.0.0.1:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
Documents whose title contains "网易":
http://127.0.0.1:8983/solr/select/?q=title%3A%E7%BD%91%E6%98%93&version=2.2&start=0&rows=10&indent=on
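The q parameter in these URLs is percent-encoded UTF-8: %3A is the ":" in title:, and %E7%BD%91%E6%98%93 is the two characters 网易. This is easy to verify with printf octal escapes (which POSIX printf supports):

```shell
# Decode the six UTF-8 bytes back to text:
# e7 bd 91 = 网 and e6 98 93 = 易 (in octal: 347 275 221 and 346 230 223).
decoded=$(printf '\347\275\221\346\230\223')
echo "title:$decoded"
```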
12. Inspecting the index
cd /home/ysc/
wget http://luke.googlecode.com/files/lukeall-3.5.0.jar
java -jar lukeall-3.5.0.jar
Path: /home/ysc/apache-solr-3.6.1/example/solr/data
13. Configuring Chinese word segmentation for SOLR
cd /home/ysc/
wget http://mmseg4j.googlecode.com/files/mmseg4j-1.8.5.zip
unzip mmseg4j-1.8.5.zip -d mmseg4j-1.8.5
APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
mkdir $APACHE_SOLR_HOME/example/solr/lib
mkdir $APACHE_SOLR_HOME/example/solr/dic
cp mmseg4j-1.8.5/mmseg4j-all-1.8.5.jar $APACHE_SOLR_HOME/example/solr/lib
cp mmseg4j-1.8.5/data/*.dic $APACHE_SOLR_HOME/example/solr/dic
In the ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml file, replace
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
and
<tokenizer class="solr.StandardTokenizerFactory"/>
with
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/apache-solr-3.6.1/example/solr/dic"/>
# restart the SOLR server
java -jar start.jar
# rebuild the index; this demonstrates how to do it in the development environment
Open the org.apache.nutch.indexer.solr.SolrIndexer.java class under src/java
Right-click > Run As > Java Application; the console prints the command usage
Right-click > Run As > Run Configurations > Arguments > enter in the Program arguments box: http://127.0.0.1:8983/solr/ data/crawldb -linkdb data/linkdb data/segments/*
Reopen the index with luke and you will see that segmentation has taken effect
Part 3: nutch 2.0
nutch 2.0 follows the same steps as nutch 1.5.1 in Part 2, but before step 8 (Develop and debug) the following configuration is required:
In the Package Explorer, right-click the nutch2.0 folder > New > Folder > Folder name: data, then pick one of the following data stores:
1. Using MySQL as the data store
1) Add the following to nutch2.0/conf/nutch-site.xml:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
2) In the nutch2.0/conf/gora.properties file, change
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
to
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch2
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ROOT
3) Enable the mysql-connector-java dependency in nutch2.0/ivy/ivy.xml
4) sudo apt-get install mysql-server
2. Using HBase as the data store
1) Add the following to nutch2.0/conf/nutch-site.xml:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
2) Enable the gora-hbase dependency in nutch2.0/ivy/ivy.xml
3) cd /home/ysc
4) wget http://mirror.bit.edu.cn/apache/hbase/hbase-0.90.5/hbase-0.90.5.tar.gz
5) tar -xvf hbase-0.90.5.tar.gz
6) vi hbase-0.90.5/conf/hbase-site.xml
Add the following configuration:
<property>
<name>hbase.rootdir</name>
<value>file:///home/ysc/hbase-0.90.5-database</value>
</property>
7) hbase-0.90.5/bin/start-hbase.sh
8) Add /home/ysc/hbase-0.90.5/hbase-0.90.5.jar to the Eclipse build path
Part 4: Configuring SSH
Three machines: devcluster01, devcluster02, devcluster03. On each of them, do the following:
1. sudo vi /etc/hosts
Add the following entries:
192.168.1.1 devcluster01
192.168.1.2 devcluster02
192.168.1.3 devcluster03
2. Install the SSH server:
sudo apt-get install openssh-server
3. Generate a key pair (press Enter at each prompt):
ssh-keygen -t rsa
This command creates a .ssh directory in the user's home directory containing two files: id_rsa, the RSA private key, which must be kept safe and never disclosed; and id_rsa.pub, the matching public key, which may be shared freely.
4. cp .ssh/id_rsa.pub .ssh/authorized_keys
Merge the contents of /home/ysc/.ssh/authorized_keys from all three machines into a single file, then replace /home/ysc/.ssh/authorized_keys on every machine with the merged file.
When executing on devcluster01, the two commands below target hosts 02 and 03;
when executing on devcluster02, they target 01 and 03;
when executing on devcluster03, they target 01 and 02.
5. ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster02
6. ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster03
These two commands append the .ssh/id_rsa.pub public key to the .ssh/authorized_keys file in the target user's home directory on the remote host.
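What ssh-copy-id does on the remote side amounts to appending the public key to ~/.ssh/authorized_keys (newer versions also skip keys that are already present). A local simulation of that append step, using stand-in paths and a fake key (both assumptions for illustration):

```shell
# Simulate ssh-copy-id's append step with plain local files.
tmp=$(mktemp -d)
echo "ssh-rsa AAAAfakekey ysc@devcluster01" > "$tmp/id_rsa.pub"
touch "$tmp/authorized_keys"

# Append the key only if that exact line is not already in the file.
append_key() {
  grep -qxF "$(cat "$1")" "$2" || cat "$1" >> "$2"
}

append_key "$tmp/id_rsa.pub" "$tmp/authorized_keys"
append_key "$tmp/id_rsa.pub" "$tmp/authorized_keys"   # second run is a no-op
wc -l < "$tmp/authorized_keys"
```

The idempotence matters in this setup because the merged authorized_keys file is copied to all three machines and keys must not be duplicated.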
Part 5: Installing a Hadoop cluster (pseudo-distributed mode) and running Nutch
The steps are largely the same as in Part 6, except that only one machine, devcluster01, is needed: use devcluster01 wherever a host name appears (the original post highlighted these in yellow), and skip step 11.
Part 6: Installing a Hadoop cluster (fully distributed mode) and running Nutch
Three machines: devcluster01, devcluster02, devcluster03 (set each hostname in /etc/hostname)
Log in to devcluster01 as user ysc:
1. cd /home/ysc
2. wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1-bin.tar.gz
3. tar -xvf hadoop-1.1.1-bin.tar.gz
4. cd hadoop-1.1.1
5. vi conf/masters
Replace the contents with:
devcluster01
6. vi conf/slaves
Replace the contents with:
devcluster02
devcluster03
7. vi conf/core-site.xml
Add the configuration:
<property>
<name>fs.default.name</name>
<value>hdfs://devcluster01:9000</value>
<description>
Where to find the Hadoop Filesystem through the network.
Note 9000 is not the default port.
(This is slightly changed from previous versions, which didn't have "hdfs")
</description>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
Edit conf/hadoop-policy.xml
8. vi conf/hdfs-site.xml
Add the configuration:
<property>
<name>dfs.name.dir</name>
<value>/home/ysc/dfs/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/ysc/dfs/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>671088640</value>
<description>The default block size for new files.</description>
</property>
9. vi conf/mapred-site.xml
Add the configuration:
<property>
<name>mapred.job.tracker</name>
<value>devcluster01:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
Note 9001 is not the default port.
</description>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
<description>If true, then multiple instances of some reduce tasks
may be executed in parallel.</description>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
<description>If true, then multiple instances of some map tasks
may be executed in parallel.</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2000m</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
<description>
the number of cores of the host
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
<description>
Set mapred.map.tasks based on the number of slave hosts; the best number is the number of slave hosts plus the number of cores per host.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>4</value>
<description>
Set mapred.reduce.tasks based on the number of slave hosts; the best number is the number of slave hosts plus the number of cores per host.
</description>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/ysc/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/ysc/mapreduce/local</value>
</property>
10. vi conf/hadoop-env.sh
Append:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HADOOP_HEAPSIZE=2000
# replace the default garbage collector: under heavy multithreading the default collector incurs more waiting
export HADOOP_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
11. Copy the Hadoop files
scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster02:/home/ysc/hadoop-1.1.1
scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster03:/home/ysc/hadoop-1.1.1
12. sudo vi /etc/profile
Append the following, then reboot:
export PATH=/home/ysc/hadoop-1.1.1/bin:$PATH
13. Format the NameNode and start the cluster
hadoop namenode -format
start-all.sh
14. cd /home/ysc/workspace/nutch1.5.1/runtime/deploy
mkdir urls
echo http://news.163.com > urls/url
hadoop dfs -put urls urls
bin/nutch crawl urls -dir data -depth 2 -topN 100
15. Visit http://localhost:50030 for the JobTracker status, http://localhost:50060 for the TaskTracker status, and http://localhost:50070 for the NameNode and the overall state of the distributed file system, including browsing its files and logs
16. Stop the cluster with stop-all.sh
17. If the NameNode and the SecondaryNameNode run on different machines, add the following to conf/hdfs-site.xml on the SecondaryNameNode:
<property>
<name>dfs.http.address</name>
<value>namenode:50070</value>
</property>
Part 7: Monitoring the Hadoop and HBase clusters with Ganglia
1. Server side (installed on the master, devcluster01)
1) ssh devcluster01
2) addgroup ganglia
adduser --ingroup ganglia ganglia
3) sudo apt-get install ganglia-monitor ganglia-webfront gmetad
// Note: on Ubuntu 10.04 the ganglia-webfront package is named ganglia-webfrontend
// If the install fails, run sudo apt-get update; if the update fails, remove the failing path
4) vi /etc/ganglia/gmond.conf
Find setuid = yes and change it to setuid = no;
then find name in the cluster block and change it to name = "hadoop-cluster";
5) sudo apt-get install rrdtool
6) vi /etc/ganglia/gmetad.conf
Add data sources for the other two monitored nodes:
data_source "hadoop-cluster" devcluster01:8649 devcluster02:8649 devcluster03:8649
gridname "Hadoop"
2. Data-source side (installed on all slaves)
1) ssh devcluster02
addgroup ganglia
adduser --ingroup ganglia ganglia
sudo apt-get install ganglia-monitor
2) ssh devcluster03
addgroup ganglia
adduser --ingroup ganglia ganglia
sudo apt-get install ganglia-monitor
3) ssh devcluster01
scp /etc/ganglia/gmond.conf devcluster02:/etc/ganglia/gmond.conf
scp /etc/ganglia/gmond.conf devcluster03:/etc/ganglia/gmond.conf
3. Configure the web frontend
1) ssh devcluster01
2) sudo ln -s /usr/share/ganglia-webfrontend /var/www/ganglia
3) vi /etc/apache2/apache2.conf
Add:
ServerName devcluster01
4. Restart the services
1) ssh devcluster02
sudo /etc/init.d/ganglia-monitor restart
ssh devcluster03
sudo /etc/init.d/ganglia-monitor restart
2) ssh devcluster01
sudo /etc/init.d/ganglia-monitor restart
sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart
5. Access the page
http://devcluster01/ganglia
6. Integrating Hadoop
1) ssh devcluster01
2) cd /home/ysc/hadoop-1.1.1
3) vi conf/hadoop-metrics2.properties
# versions after 0.20 use GangliaSink31
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# default for supportsparse is false
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
# multicast IP address; this is the default value, use it on all nodes (only the multicast address 239.2.11.71 works)
namenode.sink.ganglia.servers=239.2.11.71:8649
datanode.sink.ganglia.servers=239.2.11.71:8649
jobtracker.sink.ganglia.servers=239.2.11.71:8649
tasktracker.sink.ganglia.servers=239.2.11.71:8649
maptask.sink.ganglia.servers=239.2.11.71:8649
reducetask.sink.ganglia.servers=239.2.11.71:8649
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.2.11.71:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
4) scp conf/hadoop-metrics2.properties root@devcluster02:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
5) scp conf/hadoop-metrics2.properties root@devcluster03:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
6) stop-all.sh
7) start-all.sh
7. Integrating HBase
1) ssh devcluster01
2) cd /home/ysc/hbase-0.92.2
3) vi conf/hadoop-metrics.properties (only the multicast address 239.2.11.71 works)
hbase.extendedperiod = 3600
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=239.2.11.71:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=239.2.11.71:8649
4) scp conf/hadoop-metrics.properties root@devcluster02:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
5) scp conf/hadoop-metrics.properties root@devcluster03:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
6) stop-hbase.sh
7) start-hbase.sh
Part 8: Configuring Snappy compression for Hadoop
1. wget http://snappy.googlecode.com/files/snappy-1.0.5.tar.gz
2. tar -xzvf snappy-1.0.5.tar.gz
3. cd snappy-1.0.5
4. ./configure
5. make
6. make install
7. scp /usr/local/lib/libsnappy* devcluster01:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* devcluster02:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* devcluster03:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
8. vi /etc/profile
Append:
export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
9. Edit mapred-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the map outputs are compressed, how should they be
compressed?
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
Part 9: Configuring Lzo compression for Hadoop
1. wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
2. tar -zxvf lzo-2.06.tar.gz
3. cd lzo-2.06
4. ./configure --enable-shared
5. make
6. make install
7. scp /usr/local/lib/liblzo2.* devcluster01:/lib/x86_64-linux-gnu
scp /usr/local/lib/liblzo2.* devcluster02:/lib/x86_64-linux-gnu
scp /usr/local/lib/liblzo2.* devcluster03:/lib/x86_64-linux-gnu
8. wget http://hadoop-gpl-compression.apache-extras.org.codespot.com/files/hadoop-gpl-compression-0.1.0-rc0.tar.gz
9. tar -xzvf hadoop-gpl-compression-0.1.0-rc0.tar.gz
10. cd hadoop-gpl-compression-0.1.0
11. cp lib/native/Linux-amd64-64/* /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
12. cp hadoop-gpl-compression-0.1.0.jar /home/ysc/hadoop-1.1.1/lib/ (the Hadoop cluster version must match the version the compression library was built against)
13. scp -r /home/ysc/hadoop-1.1.1/lib devcluster02:/home/ysc/hadoop-1.1.1/
scp -r /home/ysc/hadoop-1.1.1/lib devcluster03:/home/ysc/hadoop-1.1.1/
14. vi /etc/profile
Append:
export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
15. Edit core-site.xml
<property>
<name>io.compression.codecs</name>
<value>com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
<description>A list of the compression codec classes that can be used
for compression/decompression.</description>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<description>Number of minutes between trash checkpoints.
If zero, the trash feature is disabled.
</description>
</property>
16. Edit mapred-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>If the map outputs are compressed, how should they be
compressed?
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
Part 10: Configuring a ZooKeeper cluster to run HBase
1. ssh devcluster01
2. cd /home/ysc
3. wget http://mirror.bjtu.edu.cn/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz
4. tar -zxvf zookeeper-3.4.5.tar.gz
5. cd zookeeper-3.4.5
6. cp conf/zoo_sample.cfg conf/zoo.cfg
7. vi conf/zoo.cfg
Change: dataDir=/home/ysc/zookeeper
Add:
server.1=devcluster01:2888:3888
server.2=devcluster02:2888:3888
server.3=devcluster03:2888:3888
maxClientCnxns=100
8. scp -r zookeeper-3.4.5 devcluster01:/home/ysc
scp -r zookeeper-3.4.5 devcluster02:/home/ysc
scp -r zookeeper-3.4.5 devcluster03:/home/ysc
9. On each of the three machines, run:
ssh devcluster01
mkdir /home/ysc/zookeeper (note: dataDir is ZooKeeper's data directory and must be created manually)
echo 1 > /home/ysc/zookeeper/myid
ssh devcluster02
mkdir /home/ysc/zookeeper
echo 2 > /home/ysc/zookeeper/myid
ssh devcluster03
mkdir /home/ysc/zookeeper
echo 3 > /home/ysc/zookeeper/myid
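The per-host commands in step 9 follow one rule: server N in zoo.cfg (server.N=host:2888:3888) must have the matching number in that host's myid file, or the ensemble will not form. The same work expressed as a loop, simulated locally with per-host directories (on a real cluster the mkdir/echo pair runs over ssh on each machine):

```shell
# Write each host's myid so that it matches its server.N line in zoo.cfg.
tmp=$(mktemp -d)
i=1
for host in devcluster01 devcluster02 devcluster03; do
  mkdir -p "$tmp/$host/zookeeper"    # dataDir must exist before startup
  echo "$i" > "$tmp/$host/zookeeper/myid"
  i=$((i + 1))
done
cat "$tmp/devcluster03/zookeeper/myid"
```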
10. On each of the three machines, run:
cd /home/ysc/zookeeper-3.4.5
bin/zkServer.sh start
bin/zkCli.sh -server devcluster01:2181
bin/zkServer.sh status
Part 11: Configuring an HBase cluster to run nutch-2.1 (Region Servers may crash due to memory issues)
1. nutch-2.1 uses gora-0.2.1, and gora-0.2.1 uses hbase-0.90.4; hbase-0.90.4 is incompatible with hadoop-1.1.1, and hbase-0.94.4 is incompatible with gora-0.2.1, but hbase-0.92.2 works. HBase also requires the system clocks to be synchronized, with a skew of no more than 30 s.
sudo apt-get install ntp
sudo ntpdate -u 210.72.145.44
2. HBase is a database and uses many file handles at the same time. The default limit of 1024 on most Linux systems is not enough. The nproc limit for the hbase user must also be raised; if it is too low, an OutOfMemoryError can occur under load.
vi /etc/security/limits.conf
Add:
ysc soft nproc 32000
ysc hard nproc 32000
ysc soft nofile 32768
ysc hard nofile 32768
vi /etc/pam.d/common-session
Add:
session required pam_limits.so
3. Log in to the master, then download and unpack HBase
ssh devcluster01
cd /home/ysc
wget http://apache.etoak.com/hbase/hbase-0.92.2/hbase-0.92.2.tar.gz
tar -zxvf hbase-0.92.2.tar.gz
cd hbase-0.92.2
4. Edit the configuration file hbase-env.sh
vi conf/hbase-env.sh
Append:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HBASE_MANAGES_ZK=false
export HBASE_HEAPSIZE=10000
# replace the default garbage collector: under heavy multithreading the default collector incurs more waiting
export HBASE_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
5. Edit the configuration file hbase-site.xml
vi conf/hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>hdfs://devcluster01:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>devcluster01,devcluster02,devcluster03</value>
</property>
<property>
<name>hfile.block.cache.size</name>
<value>0.25</value>
<description>
Percentage of maximum heap (-Xmx setting) to allocate to block cache
used by HFile/StoreFile. Default of 0.25 means allocate 25%.
Set to 0 to disable but it's not recommended.
</description>
</property>
<property>
<name>hbase.regionserver.global.memstore.upperLimit</name>
<value>0.4</value>
<description>Maximum size of all memstores in a region server before new
updates are blocked and flushes are forced. Defaults to 40% of heap
</description>
</property>
<property>
<name>hbase.regionserver.global.memstore.lowerLimit</name>
<value>0.35</value>
<description>When memstores are being forced to flush to make room in
memory, keep flushing until we hit this mark. Defaults to 35% of heap.
This value equal to hbase.regionserver.global.memstore.upperLimit causes
the minimum possible flushing to occur when updates are blocked due to
memstore limiting.
</description>
</property>
<property>
<name>hbase.hregion.majorcompaction</name>
<value>0</value>
<description>The time (in milliseconds) between 'major' compactions of all
HStoreFiles in a region. Default: 1 day.
Set to 0 to disable automated major compactions.
</description>
</property>
6. Edit the configuration file regionservers
vi conf/regionservers
devcluster01
devcluster02
devcluster03
7. Because HBase runs on top of Hadoop, the hadoop*.jar used by HBase must match the one used by Hadoop. Replace the hadoop*.jar under HBase's lib directory with the one from the Hadoop installation to prevent version conflicts.
cp /home/ysc/hadoop-1.1.1/hadoop-core-1.1.1.jar /home/ysc/hbase-0.92.2/lib
rm /home/ysc/hbase-0.92.2/lib/hadoop-core-1.0.3.jar
8. Copy the files to the regionservers
scp -r /home/ysc/hbase-0.92.2 devcluster01:/home/ysc
scp -r /home/ysc/hbase-0.92.2 devcluster02:/home/ysc
scp -r /home/ysc/hbase-0.92.2 devcluster03:/home/ysc
9. Start Hadoop and create the HBase directory
hadoop fs -mkdir /hbase
10. Managing the HBase cluster:
Start the initial HBase cluster:
bin/start-hbase.sh
Stop the HBase cluster:
bin/stop-hbase.sh
Start additional backup masters (up to 9 backups, 10 masters in total):
bin/local-master-backup.sh start 1
bin/local-master-backup.sh start 2 3
Start more regionservers (up to 99 additional regionservers, 100 in total):
bin/local-regionservers.sh start 1
bin/local-regionservers.sh start 2 3 4 5
Stop a backup master:
cat /tmp/hbase-ysc-1-master.pid |xargs kill -9
Stop an individual regionserver:
bin/local-regionservers.sh stop 1
Use the HBase shell:
bin/hbase shell
11. Web interfaces
http://devcluster01:60010
http://devcluster01:60030
12. To run nutch 2.1, method 1:
cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
cd /home/ysc/nutch-2.1
ant
cd runtime/deploy
unzip -d apache-nutch-2.1 apache-nutch-2.1.job
rm apache-nutch-2.1.job
cd apache-nutch-2.1
rm lib/hbase-0.90.4.jar
cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar lib
zip -r ../apache-nutch-2.1.job ./*
cd ..
rm -r apache-nutch-2.1
13. To run nutch 2.1, method 2:
cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
cd /home/ysc/nutch-2.1
cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar lib
ant
cd runtime/deploy
zip -d apache-nutch-2.1.job lib/hbase-0.90.4.jar
Enabling snappy compression:
1. vi conf/gora-hbase-mapping.xml
Add the attribute compression="SNAPPY" to the family element
2. mkdir /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
3. cp /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/* /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
4. vi /home/ysc/hbase-0.92.2/conf/hbase-site.xml
Add:
<property>
<name>hbase.regionserver.codecs</name>
<value>snappy</value>
</property>
Part 12: Configuring an Accumulo cluster to run nutch-2.1 (gora has a BUG)
1. wget http://apache.etoak.com/accumulo/1.4.2/accumulo-1.4.2-dist.tar.gz
2. tar -xzvf accumulo-1.4.2-dist.tar.gz
3. cd accumulo-1.4.2
4. cp conf/examples/3GB/standalone/* conf
5. vi conf/accumulo-env.sh
export HADOOP_HOME=/home/ysc/cluster3
export ZOOKEEPER_HOME=/home/ysc/zookeeper-3.4.5
export JAVA_HOME=/home/jdk1.7.0_01
export ACCUMULO_HOME=/home/ysc/accumulo-1.4.2
6. vi conf/slaves
devcluster01
devcluster02
devcluster03
7. vi conf/masters
devcluster01
8. vi conf/accumulo-site.xml
<property>
<name>instance.zookeeper.host</name>
<value>host6:2181,host8:2181</value>
<description>comma separated list of zookeeper servers</description>
</property>
<property>
<name>logger.dir.walog</name>
<value>walogs</value>
<description>The directory used to store write-ahead logs on the local filesystem. It is possible to specify a comma-separated list of directories.</description>
</property>
<property>
<name>instance.secret</name>
<value>ysc</value>
<description>A secret unique to a given instance that all servers must know in order to communicate with one another.
Change it before initialization. To change it later use ./bin/accumulo org.apache.accumulo.server.util.ChangeSecret [oldpasswd] [newpasswd],
and then update this file.
</description>
</property>
<property>
<name>tserver.memory.maps.max</name>
<value>3G</value>
</property>
<property>
<name>tserver.cache.data.size</name>
<value>50M</value>
</property>
<property>
<name>tserver.cache.index.size</name>
<value>512M</value>
</property>
<property>
<name>trace.password</name>
<!--
change this to the root user's password, and/or change the user below
-->
<value>ysc</value>
</property>
<property>
<name>trace.user</name>
<value>root</value>
</property>
9、bin/accumulo init
10、bin/start-all.sh
11、bin/stop-all.sh
12、web访问:http://devcluster01:50095/
修改nutch2.1:
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
gora.datastore.accumulo.mock=false
gora.datastore.accumulo.instance=accumulo
gora.datastore.accumulo.zookeepers=host6,host8
gora.datastore.accumulo.user=root
gora.datastore.accumulo.password=ysc
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.accumulo.store.AccumuloStore</value>
</property>
4、vi ivy/ivy.xml
增加:
<dependency org="org.apache.gora" name="gora-accumulo" rev="0.2.1" conf="*->default" />
5、升级accumulo
cp /home/ysc/accumulo-1.4.2/lib/accumulo-core-1.4.2.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/accumulo-1.4.2/lib/accumulo-start-1.4.2.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/accumulo-1.4.2/lib/cloudtrace-1.4.2.jar /home/ysc/nutch-2.1/lib
6、ant
7、cd runtime/deploy
8、删除旧jar
zip -d apache-nutch-2.1.job lib/accumulo-core-1.4.0.jar
zip -d apache-nutch-2.1.job lib/accumulo-start-1.4.0.jar
zip -d apache-nutch-2.1.job lib/cloudtrace-1.4.0.jar
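上面用 zip -d 从 .job 包（本质是 zip 文件）里删掉旧版 jar。下面是一个假设性的 Python 示意脚本，用标准库 zipfile 演示"重建一个不含指定旧 jar 的包"的思路；其中的文件名、目录均为演示用的假设值，并非实际的 nutch job 包：

```python
import zipfile, os, tempfile

def strip_jars(job_path, stale_jars, out_path):
    """把 job 包（zip 格式）中位于 stale_jars 集合里的旧 jar 去掉，写出新包。"""
    with zipfile.ZipFile(job_path) as src, zipfile.ZipFile(out_path, "w") as dst:
        for info in src.infolist():
            if info.filename not in stale_jars:
                dst.writestr(info, src.read(info.filename))

# 构造一个演示用的假 job 包（内容为假设值）
tmp = tempfile.mkdtemp()
job = os.path.join(tmp, "demo.job")
with zipfile.ZipFile(job, "w") as z:
    z.writestr("lib/accumulo-core-1.4.0.jar", b"old")
    z.writestr("lib/accumulo-core-1.4.2.jar", b"new")

out = os.path.join(tmp, "demo-clean.job")
strip_jars(job, {"lib/accumulo-core-1.4.0.jar"}, out)
with zipfile.ZipFile(out) as z:
    names = z.namelist()
print(names)  # 只剩新版 jar
```

实际操作中直接用 zip -d 更简单，这里只是说明 job 包内部的结构与替换逻辑。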
十三、配置Cassandra 集群以运行nutch-2.1(Cassandra 采用去中心化结构)
1、vi /etc/hosts(注意:需要登录到每一台机器上面,将localhost解析到实际地址)
192.168.1.1 localhost
2、wget http://labs.mop.com/apache-mirror/cassandra/1.2.0/apache-cassandra-1.2.0-bin.tar.gz
3、tar -xzvf apache-cassandra-1.2.0-bin.tar.gz
4、cd apache-cassandra-1.2.0
5、vi conf/cassandra-env.sh
增加:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
6、vi conf/log4j-server.properties
修改:
log4j.appender.R.File=/home/ysc/cassandra/system.log
7、vi conf/cassandra.yaml
修改:
cluster_name: 'Cassandra Cluster'
data_file_directories:
- /home/ysc/cassandra/data
commitlog_directory: /home/ysc/cassandra/commitlog
saved_caches_directory: /home/ysc/cassandra/saved_caches
- seeds: "192.168.1.1"
listen_address: 192.168.1.1
rpc_address: 192.168.1.1
thrift_framed_transport_size_in_mb: 1023
thrift_max_message_length_in_mb: 1024
8、vi bin/stop-server
增加:
user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
9、复制cassandra到其他节点:
cd ..
scp -r apache-cassandra-1.2.0 devcluster02:/home/ysc
scp -r apache-cassandra-1.2.0 devcluster03:/home/ysc
分别在devcluster02和devcluster03上面修改:
vi conf/cassandra.yaml
listen_address: 192.168.1.2
rpc_address: 192.168.1.2
vi conf/cassandra.yaml
listen_address: 192.168.1.3
rpc_address: 192.168.1.3
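上面需要逐台登录到每个节点去改 listen_address 和 rpc_address。节点多时，可以按下面这个 Python 草稿的思路批量生成各节点需要改写的两行配置（节点 IP 取自文中的三台机器，函数名为演示用的假设）：

```python
def node_overrides(ips):
    """为每个节点生成 cassandra.yaml 中需要按节点改写的两行配置。"""
    return {ip: "listen_address: %s\nrpc_address: %s" % (ip, ip) for ip in ips}

overrides = node_overrides(["192.168.1.1", "192.168.1.2", "192.168.1.3"])
print(overrides["192.168.1.2"])
```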
10、分别在3个节点上面运行
bin/cassandra
bin/cassandra -f（参数 -f 的作用是让 Cassandra 以前台程序方式运行，便于调试和观察日志信息；实际生产环境中不需要这个参数，Cassandra 会以 daemon 方式运行）
11、bin/nodetool -host devcluster01 ring
bin/nodetool -host devcluster01 info
12、bin/stop-server
13、bin/cassandra-cli
修改nutch2.1:
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.cassandrastore.servers=host2:9160,host6:9160,host8:9160
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>
4、vi ivy/ivy.xml
增加:
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2.1" conf="*->default" />
5、升级cassandra
cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-1.2.0.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-thrift-1.2.0.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/apache-cassandra-1.2.0/lib/jline-1.0.jar /home/ysc/nutch-2.1/lib
6、ant
7、cd runtime/deploy
8、删除旧jar
zip -d apache-nutch-2.1.job lib/cassandra-thrift-1.1.2.jar
zip -d apache-nutch-2.1.job lib/jline-0.9.1.jar
十四、配置MySQL 单机服务器以运行nutch-2.1
1、apt-get install mysql-server mysql-client
2、vi /etc/mysql/my.cnf
修改:
bind-address = 221.194.43.2
在[client]下增加:
default-character-set=utf8
在[mysqld]下增加：
character-set-server=utf8
（注意：MySQL 5.5 及以上版本的 [mysqld] 段不再支持 default-character-set，写入该项会导致 mysqld 无法启动，应使用 character-set-server）
3、mysql -uroot -pysc
SHOW VARIABLES LIKE '%character%';
4、service mysql restart
5、mysql -uroot -pysc
GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "ysc";
6、vi conf/gora-sql-mapping.xml
修改字段的长度
<primarykey column="id" length="333"/>
<field name="content" column="content" />
<field name="text" column="text" length="19892"/>
7、启动nutch之后登陆mysql
ALTER TABLE webpage MODIFY COLUMN content MEDIUMBLOB;
ALTER TABLE webpage MODIFY COLUMN text MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN title MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN reprUrl MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN baseUrl MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN typ MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN inlinks MEDIUMBLOB;
ALTER TABLE webpage MODIFY COLUMN outlinks MEDIUMBLOB;
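上面这批 ALTER 语句格式完全一致，只是列名和目标类型不同。下面的 Python 草稿按"列名 → 目标类型"清单批量拼出同样的 SQL（只演示字符串拼接，不实际连接数据库，函数名为假设）：

```python
def alter_statements(table, columns):
    """columns 为 (列名, 目标类型) 列表，返回对应的 ALTER TABLE 语句列表。"""
    return ["ALTER TABLE %s MODIFY COLUMN %s %s;" % (table, c, t) for c, t in columns]

cols = [("content", "MEDIUMBLOB"), ("text", "MEDIUMTEXT"), ("title", "MEDIUMTEXT")]
sqls = alter_statements("webpage", cols)
print("\n".join(sqls))
```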
修改nutch2.1:
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://host2:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ysc
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
<property>
<name>encodingdetector.charset.min.confidence</name>
<value>1</value>
<description>An integer between 0 and 100 indicating the minimum confidence value
for charset auto-detection. Any negative value disables auto-detection.
</description>
</property>
4、vi ivy/ivy.xml
增加:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
十五、nutch2.1 使用DataFileAvroStore作为数据源
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.datafileavrostore.output.path=datafileavrostore
gora.datafileavrostore.input.path=datafileavrostore
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.avro.store.DataFileAvroStore</value>
</property>
<property>
<name>encodingdetector.charset.min.confidence</name>
<value>1</value>
<description>An integer between 0 and 100 indicating the minimum confidence value
for charset auto-detection. Any negative value disables auto-detection.
</description>
</property>
十六、nutch2.1 使用AvroStore作为数据源
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.avrostore.codec.type=BINARY
gora.avrostore.input.path=avrostore
gora.avrostore.output.path=avrostore
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.avro.store.AvroStore</value>
</property>
<property>
<name>encodingdetector.charset.min.confidence</name>
<value>1</value>
<description>An integer between 0 and 100 indicating the minimum confidence value
for charset auto-detection. Any negative value disables auto-detection.
</description>
</property>
十七、配置SOLR
配置tomcat:
1、wget http://www.fayea.com/apache-mirror/tomcat/tomcat-7/v7.0.35/bin/apache-tomcat-7.0.35.tar.gz
2、tar -xzvf apache-tomcat-7.0.35.tar.gz
3、cd apache-tomcat-7.0.35
4、vi conf/server.xml
增加URIEncoding="UTF-8":
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8"/>
5、mkdir conf/Catalina
6、mkdir conf/Catalina/localhost
7、vi conf/Catalina/localhost/solr.xml
增加:
<Context path="/solr">
<Environment name="solr/home" type="java.lang.String" value="/home/ysc/solr/configuration/" override="false"/>
</Context>
8、cd ..
下载SOLR:
1、wget http://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/4.1.0/solr-4.1.0.tgz
2、tar -xzvf solr-4.1.0.tgz
复制资源:
1、mkdir /home/ysc/solr
2、cp -r solr-4.1.0/example/solr /home/ysc/solr/configuration
3、unzip solr-4.1.0/example/webapps/solr.war -d /home/ysc/apache-tomcat-7.0.35/webapps/solr
配置nutch:
1、复制schema:
cp /home/ysc/nutch-1.6/conf/schema-solr4.xml /home/ysc/solr/configuration/collection1/conf/schema.xml
2、vi /home/ysc/solr/configuration/collection1/conf/schema.xml
在<fields>下增加:
<field name="_version_" type="long" indexed="true" stored="true"/>
配置中文分词:
1、wget http://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
2、unzip mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
3、cp mmseg4j-1.9.1-SNAPSHOT/dist/* /home/ysc/apache-tomcat-7.0.35/webapps/solr/WEB-INF/lib
4、unzip mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT.jar -d mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT
5、mkdir /home/ysc/dic
6、cp mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT/data/* /home/ysc/dic
7、vi /home/ysc/solr/configuration/collection1/conf/schema.xml
将文件中的
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
和
<tokenizer class="solr.StandardTokenizerFactory"/>
替换为
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/dic"/>
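上面手工把 schema.xml 中两种 tokenizer 都换成 mmseg4j。这本质上是两次文本替换，下面的 Python 片段演示这一替换思路（schema 内容为简化的示例，函数名为假设；词典路径沿用文中的 /home/ysc/dic）：

```python
def swap_tokenizer(schema_text, dic_path):
    """把 schema 文本中的 Whitespace/Standard tokenizer 都替换为 mmseg4j。"""
    new = ('<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" '
           'mode="complex" dicPath="%s"/>' % dic_path)
    for old in ('<tokenizer class="solr.WhitespaceTokenizerFactory"/>',
                '<tokenizer class="solr.StandardTokenizerFactory"/>'):
        schema_text = schema_text.replace(old, new)
    return schema_text

demo = '<analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>'
result = swap_tokenizer(demo, "/home/ysc/dic")
print(result)
```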
配置tomcat本地库:
1、wget http://apache.spd.co.il/apr/apr-1.4.6.tar.gz
2、tar -xzvf apr-1.4.6.tar.gz
3、cd apr-1.4.6
4、./configure
5、make
6、make install
1、wget http://mirror.bjtu.edu.cn/apache/apr/apr-util-1.5.1.tar.gz
2、tar -xzvf apr-util-1.5.1.tar.gz
3、cd apr-util-1.5.1
4、./configure --with-apr=/usr/local/apr
5、make
6、make install
1、wget http://mirror.bjtu.edu.cn/apache//tomcat/tomcat-connectors/native/1.1.24/source/tomcat-native-1.1.24-src.tar.gz
2、tar -zxvf tomcat-native-1.1.24-src.tar.gz
3、cd tomcat-native-1.1.24-src/jni/native
4、./configure --with-apr=/usr/local/apr \
--with-java-home=/home/ysc/jdk1.7.0_01 \
--with-ssl=no \
--prefix=/home/ysc/apache-tomcat-7.0.35
5、make
6、make install
7、vi /etc/profile
增加:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/ysc/apache-tomcat-7.0.35/lib:/usr/local/apr/lib
8、source /etc/profile
启动tomcat:
cd apache-tomcat-7.0.35
bin/catalina.sh start
http://devcluster01:8080/solr/
十八、Nagios监控
服务端:
1、apt-get install apache2 nagios3 nagios-nrpe-plugin
输入密码:nagiosadmin
2、apt-get install nagios3-doc
3、vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg
define hostgroup {
hostgroup_name nagios-servers
alias nagios servers
members devcluster01,devcluster02,devcluster03
}
4、cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster01_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster01_nagios2.cfg
替换:
g/localhost/s//devcluster01/g
g/127.0.0.1/s//192.168.1.1/g
5、cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster02_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster02_nagios2.cfg
替换:
g/localhost/s//devcluster02/g
g/127.0.0.1/s//192.168.1.2/g
6、cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster03_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster03_nagios2.cfg
替换:
g/localhost/s//devcluster03/g
g/127.0.0.1/s//192.168.1.3/g
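第 4～6 步反复执行"复制模板 + 在 vi 中全局替换主机名和 IP"。下面的 Python 草稿用同样的两次全局替换逻辑批量生成各主机的配置（模板内容为简化示例，主机清单取自文中三台机器）：

```python
def render_host_cfg(template, hostname, ip):
    """等价于 vi 中的 g/localhost/s//主机名/g 与 g/127.0.0.1/s//IP/g 两次全局替换。"""
    return template.replace("localhost", hostname).replace("127.0.0.1", ip)

template = "host_name localhost\naddress 127.0.0.1"
hosts = [("devcluster01", "192.168.1.1"),
         ("devcluster02", "192.168.1.2"),
         ("devcluster03", "192.168.1.3")]
cfgs = {h: render_host_cfg(template, h, ip) for h, ip in hosts}
print(cfgs["devcluster02"])
```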
7、vi /etc/nagios3/conf.d/services_nagios2.cfg
将hostgroup_name改为nagios-servers
增加:
# check that web services are running
define service {
hostgroup_name nagios-servers
service_description HTTP
check_command check_http
use generic-service
notification_interval 0 ; set > 0 if you want to be renotified
}
# check that ssh services are running
define service {
hostgroup_name nagios-servers
service_description SSH
check_command check_ssh
use generic-service
notification_interval 0 ; set > 0 if you want to be renotified
}
8、vi /etc/nagios3/conf.d/extinfo_nagios2.cfg
将hostgroup_name改为nagios-servers
增加:
define hostextinfo{
hostgroup_name nagios-servers
notes nagios-servers
# notes_url http://webserver.localhost.localdomain/hostinfo.pl?host=netware1
icon_image base/debian.png
icon_image_alt Debian GNU/Linux
vrml_image debian.png
statusmap_image base/debian.gd2
}
9、sudo /etc/init.d/nagios3 restart
10、访问http://devcluster01/nagios3/
用户名:nagiosadmin密码:nagiosadmin
监控端:
1、apt-get install nagios-nrpe-server
2、vi /etc/nagios/nrpe.cfg
替换:
g/127.0.0.1/s//192.168.1.1/g
3、sudo /etc/init.d/nagios-nrpe-server restart
十九、配置Splunk
1、wget http://download.splunk.com/releases/5.0.2/splunk/linux/splunk-5.0.2-149561-Linux-x86_64.tgz
2、tar -zxvf splunk-5.0.2-149561-Linux-x86_64.tgz
3、cd splunk
4、bin/splunk start --answer-yes --no-prompt --accept-license
5、访问http://devcluster01:8000
用户名:admin 密码:changeme
6、添加数据 -> 从 UDP 端口 -> UDP 端口 *: 1688 -> 来源类型 从列表 log4j -> 保存
7、配置hadoop
vi /home/ysc/hadoop-1.1.1/conf/log4j.properties
修改:
log4j.rootLogger=${hadoop.root.logger}, EventCounter, SYSLOG
增加:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
8、配置hbase
vi /home/ysc/hbase-0.92.2/conf/log4j.properties
修改:
log4j.rootLogger=${hbase.root.logger},SYSLOG
增加:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
9、配置nutch
vi /home/lanke/ysc/nutch-2.1-hbase/conf/log4j.properties
修改:
log4j.rootLogger=INFO,DRFA,SYSLOG
增加:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
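hadoop、hbase、nutch 三处 log4j.properties 里增加的 SyslogAppender 配置完全相同，只有 rootLogger 一行不同。下面的 Python 草稿把这段公共配置抽出来，按各自的 rootLogger 生成完整片段（SyslogHost 沿用文中的 host6:1688，函数名为假设）：

```python
SYSLOG_LINES = """log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true"""

def with_syslog(root_logger_line):
    """在原 rootLogger 行末尾追加 SYSLOG，再拼上公共 appender 配置。"""
    return root_logger_line.rstrip() + ",SYSLOG\n" + SYSLOG_LINES

out = with_syslog("log4j.rootLogger=INFO,DRFA")
print(out.splitlines()[0])
```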
10、启动hadoop和hbase
start-all.sh
start-hbase.sh
二十、配置Pig
1、wget http://labs.mop.com/apache-mirror/pig/pig-0.11.0/pig-0.11.0.tar.gz
2、tar -xzvf pig-0.11.0.tar.gz
3、cd pig-0.11.0
4、vi /etc/profile
增加:
export PIG_HOME=/home/ysc/pig-0.11.0
export PATH=$PIG_HOME/bin:$PATH
5、source /etc/profile
6、cp conf/log4j.properties.template conf/log4j.properties
7、vi conf/log4j.properties
8、pig
二十一、配置Hive
1、wget http://mirrors.cnnic.cn/apache/hive/hive-0.10.0/hive-0.10.0.tar.gz
2、tar -xzvf hive-0.10.0.tar.gz
3、cd hive-0.10.0
4、vi /etc/profile
增加:
export HIVE_HOME=/home/ysc/hive-0.10.0
export PATH=$HIVE_HOME/bin:$PATH
5、source /etc/profile
6、cp conf/hive-log4j.properties.template conf/hive-log4j.properties
7、vi conf/hive-log4j.properties
替换:
log4j.appender.EventCounter=org.apache.hadoop.metrics.jvm.EventCounter
为:
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
二十二、配置Hadoop2.x集群
1、wget http://labs.mop.com/apache-mirror/hadoop/common/hadoop-2.0.2-alpha/hadoop-2.0.2-alpha.tar.gz
2、tar -xzvf hadoop-2.0.2-alpha.tar.gz
3、cd hadoop-2.0.2-alpha
4、vi etc/hadoop/hadoop-env.sh
追加:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HADOOP_HEAPSIZE=2000
5、vi etc/hadoop/core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://devcluster01:9000</value>
<description>
Where to find the Hadoop Filesystem through the network.
Note 9000 is not the default port.
(This is slightly changed from previous versions which didn't have "hdfs")
</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>The size of buffer for use in sequence files.
The size of this buffer should probably be a multiple of hardware
page size (4096 on Intel x86), and it determines how much data is
buffered during read and write operations.</description>
</property>
6、vi etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.reduce.input.buffer.percent</name>
<value>1</value>
<description>The percentage of memory- relative to the maximum heap size- to
retain map outputs during the reduce. When the shuffle is concluded, any
remaining map outputs in memory must consume less than this threshold before
the reduce can begin.
</description>
</property>
<property>
<name>mapred.job.shuffle.input.buffer.percent</name>
<value>1</value>
<description>The percentage of memory to be allocated from the maximum heap
size to storing map outputs during the shuffle.
</description>
</property>
<property>
<name>mapred.inmem.merge.threshold</name>
<value>0</value>
<description>The threshold, in terms of the number of files
for the in-memory merge process. When we accumulate threshold number of files
we initiate the in-memory merge and spill to disk. A value of 0 or less
means no threshold: the merge is then triggered solely by the ramfs's
memory consumption.
</description>
</property>
<property>
<name>io.sort.factor</name>
<value>100</value>
<description>The number of streams to merge at once while sorting
files. This determines the number of open file handles.</description>
</property>
<property>
<name>io.sort.mb</name>
<value>240</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the map outputs are compressed, how should they be
compressed?
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2000m</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>5</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>15</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>5</value>
<description>
Set the number of map tasks according to the slave hosts; a reasonable value scales with the number of slave hosts and the number of CPU cores per host.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>15</value>
<description>
Set the number of reduce tasks according to the slave hosts; a reasonable value scales with the number of slave hosts and the number of CPU cores per host.
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/ysc/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/ysc/mapreduce/local</value>
</property>
<property>
<name>mapreduce.job.counters.max</name>
<value>12000</value>
<description>Limit on the number of counters allowed per job.
</description>
</property>
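上面 mapred-site.xml 里的内存类参数之间存在隐含约束：io.sort.mb 的排序缓冲必须能装进 mapred.child.java.opts 指定的子进程堆，而每个节点的 map 槽位数乘以子进程堆就是该节点 map 任务的堆内存总开销。下面的 Python 草稿对文中的取值做一个粗略的一致性检查（函数名为假设）：

```python
def check_mapred(child_heap_mb, io_sort_mb, map_slots):
    """粗查两点：io.sort.mb 是否小于子进程堆；所有 map 槽位的堆内存总量。"""
    fits = io_sort_mb < child_heap_mb
    total_heap = map_slots * child_heap_mb
    return fits, total_heap

# 数值取自文中配置：-Xmx2000m、io.sort.mb=240、每节点 5 个 map 槽位
fits, total_heap = check_mapred(child_heap_mb=2000, io_sort_mb=240, map_slots=5)
print(fits, total_heap)  # 240 < 2000；5 个槽位共需约 10000MB 堆
```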
7、vi etc/hadoop/yarn-site.xml
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>devcluster01:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>devcluster01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>devcluster01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>devcluster01:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>devcluster01:8088</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name> <value>/home/ysc/h2/data/1/yarn/local,/home/ysc/h2/data/2/yarn/local,/home/ysc/h2/data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name> <value>/home/ysc/h2/data/1/yarn/logs,/home/ysc/h2/data/2/yarn/logs,/home/ysc/h2/data/3/yarn/logs</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/home/ysc/h2/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>devcluster01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>devcluster01:19888</value>
</property>
8、vi etc/hadoop/hdfs-site.xml
<property>
<name>dfs.permissions.superusergroup</name>
<value>root</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/ysc/dfs/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/ysc/dfs/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.block.size</name>
<value>671088640</value>
<description>The default block size for new files.</description>
</property>
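dfs.block.size 决定一个文件在 HDFS 上被切成多少块。下面用 Python 粗略算一下给定文件大小对应的块数（这里以 671088640 字节即 640MB 的块为例；块大小取其他值时按同样方法计算，函数名为假设）：

```python
import math

def block_count(file_size_bytes, block_size_bytes):
    """HDFS 按块切分文件，最后不足一块的部分也单独占一个块记录。"""
    return math.ceil(file_size_bytes / block_size_bytes)

# 假设一个 10GB 的文件，块大小 640MB
n = block_count(10 * 1024**3, 671088640)
print(n)
```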
9、启动hadoop
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
10、访问管理页面
http://devcluster01:8088
http://devcluster01:50070
参考我的这篇博文:http://yangshangchuan.iteye.com/blog/1839784
nutch1和nutch2最大的不同在于存储层,nutch1使用文件系统,主要是HDFS,nutch2使用多种数据库,主要是HBASE。
二、nutch1.5.1
三、nutch2.0
四、配置SSH
五、安装Hadoop Cluster(伪分布式运行模式)并运行Nutch
六、安装Hadoop Cluster(分布式运行模式)并运行Nutch
七、配置Ganglia监控Hadoop集群和HBase集群
八、Hadoop配置Snappy压缩
九、Hadoop配置Lzo压缩
十、配置zookeeper集群以运行hbase
十一、配置Hbase集群以运行nutch-2.1(Region Servers会因为内存的问题宕机)
十二、配置Accumulo集群以运行nutch-2.1(gora存在BUG)
十三、配置Cassandra 集群以运行nutch-2.1(Cassandra 采用去中心化结构)
十四、配置MySQL 单机服务器以运行nutch-2.1
十五、nutch2.1 使用DataFileAvroStore作为数据源
十六、nutch2.1 使用AvroStore作为数据源
十七、配置SOLR
十八、Nagios监控
十九、配置Splunk
二十、配置Pig
二十一、配置Hive
二十二、配置Hadoop2.x集群
一、nutch1.2
步骤和二大同小异,在步骤 5、配置构建路径 中需要多两个操作:在左部Package Explorer的 nutch1.2文件夹上单击右键 > Build Path > Configure Build Path... > 选中Source选项 > Default output folder:修改nutch1.2/bin为nutch1.2/_bin,在左部Package Explorer的 nutch1.2文件夹下的bin文件夹上单击右键 > Team > 还原
二中黄色背景部分是版本号的差异,红色部分是1.2版本没有的,绿色部分是不一样的地方,如下:
1、Add JARs... > nutch1.2 > lib ,选中所有的.jar文件 > OK
2、nutch1.2 使用 crawl-urlfilter.txt（对应 nutch1.5.1 中的 regex-urlfilter.txt）
3、将crawl -urlfilter.txt.template改名为crawl -urlfilter.txt
4、修改crawl-urlfilter.txt,将
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.
5、cd /home/ysc/workspace/nutch1.2
nutch1.2是一个完整的搜索引擎,nutch1.5.1只是一个爬虫。nutch1.2可以把索引提交给SOLR,也可以直接生成LUCENE索引,nutch1.5.1则只能把索引提交给SOLR:
1、cd /home/ysc
2、wget http://mirrors.tuna.tsinghua.edu.cn/apache/tomcat/tomcat-7/v7.0.29/bin/apache-tomcat-7.0.29.tar.gz
3、tar -xvf apache-tomcat-7.0.29.tar.gz
4、在左部Package Explorer的 nutch1.2文件夹下的build.xml文件上单击右键 > Run As > Ant Build... > 选中war target > Run
5、cd /home/ysc/workspace/nutch1.2/build
6、unzip nutch-1.2.war -d nutch-1.2
7、cp -r nutch-1.2 /home/ysc/apache-tomcat-7.0.29/webapps
8、vi /home/ysc/apache-tomcat-7.0.29/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml
加入以下配置:
<property>
<name>searcher.dir</name>
<value>/home/ysc/workspace/nutch1.2/data</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
9、vi /home/ysc/apache-tomcat-7.0.29/conf/server.xml
将
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"/>
改为
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="utf-8"/>
10、cd /home/ysc/apache-tomcat-7.0.29/bin
11、./startup.sh
12、访问:http://localhost:8080/nutch-1.2/
关于nutch1.2更多的BUG修复及资料,请参看我在CSDN发布的资源:http://download.csdn.net/user/yangshangchuan
二、nutch1.5.1
1、下载并解压eclipse(集成开发环境)
下载地址:http://www.eclipse.org/downloads/,下载Eclipse IDE for Java EE Developers
2、安装Subclipse插件(SVN客户端)
插件地址:http://subclipse.tigris.org/update_1.8.x,
3、安装IvyDE插件(下载依赖Jar)
插件地址:http://www.apache.org/dist/ant/ivyde/updatesite/
4、签出代码
File > New > Project > SVN > 从SVN 检出项目
创建新的资源库位置 > URL:https://svn.apache.org/repos/asf/nutch/tags/release-1.5.1/ > 选中URL > Finish
弹出New Project向导,选择Java Project > Next,输入Project name:nutch1.5.1 > Finish
5、配置构建路径
在左部Package Explorer的 nutch1.5.1文件夹上单击右键 > Build Path > Configure Build Path...
> 选中Source选项 > 选择src > Remove > Add Folder... > 选择src/bin, src/java, src/test 和 src/testresources(对于插件,需要选中src/plugin目录下的每一个插件目录下的src/java , src/test文件夹) > OK
切换到Libraries选项 >
Add Class Folder... > 选中nutch1.5.1/conf > OK
Add JARs... > 需要选中src/plugin目录下的每一个插件目录下的lib目录下的jar文件 > OK
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
切换到Order and Export选项>
选中conf > Top
6、执行ANT
在左部Package Explorer的 nutch1.5.1文件夹下的build.xml文件上单击右键 > Run As > Ant Build
在左部Package Explorer的 nutch1.5.1文件夹上单击右键 > Refresh
在左部Package Explorer的 nutch1.5.1文件夹上单击右键 > Build Path > Configure Build Path... > 选中Libraries选项 > Add Class Folder... > 选中build > OK
7、修改配置文件nutch-site.xml 和regex-urlfilter.txt
将nutch-site.xml.template改名为nutch-site.xml
将regex-urlfilter.txt.template改名为regex-urlfilter.txt
在左部Package Explorer的 nutch1.5.1文件夹上单击右键 > Refresh
将如下配置项加入文件nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>nutch</value>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
修改regex-urlfilter.txt,将
# accept anything else
+.
替换为:
+^http://([a-z0-9]*\.)*news.163.com/
-.
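regex-urlfilter.txt 按行自上而下匹配：+ 开头表示接受，- 开头表示拒绝，URL 以第一条命中的规则为准。下面的 Python 草稿实现这一"首条命中"语义，并用上面两条规则做验证（这只是对过滤语义的示意，并非 Nutch 的实际实现）：

```python
import re

# 规则顺序与上文一致：先放行 news.163.com，再拒绝其余一切
RULES = [("+", r"^http://([a-z0-9]*\.)*news.163.com/"),
         ("-", r".")]

def accept(url):
    """逐条规则匹配，返回首条命中规则的正负；无规则命中时默认丢弃。"""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(accept("http://news.163.com/world/"))  # 放行
print(accept("http://www.sohu.com/"))        # 被第二条规则拒绝
```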
8、开发调试
在左部Package Explorer的 nutch1.5.1文件夹上单击右键 > New > Folder > Folder name: urls
在刚新建的urls目录下新建一个文本文件url,文本内容为:http://news.163.com
打开src/java下的org.apache.nutch.crawl.Crawl.java类,单击右键Run As > Run Configurations > Arguments > 在Program arguments输入框中输入: urls -dir data -depth 3 > Run
在需要调试的地方打上断点Debug As > Java Applicaton
9、查看结果
查看segments目录:
打开src/java下的org.apache.nutch.segment.SegmentReader.java类
单击右键Run As > Java Applicaton,控制台会输出该命令的使用方法
单击右键Run As > Run Configurations > Arguments > 在Program arguments输入框中输入: -dump data/segments/* data/segments/dump
用文本编辑器打开文件data/segments/dump/dump查看segments中存储的信息
查看crawldb目录:
打开src/java下的org.apache.nutch.crawl.CrawlDbReader.java类
单击右键Run As > Java Applicaton,控制台会输出该命令的使用方法
单击右键Run As > Run Configurations > Arguments > 在Program arguments输入框中输入: data/crawldb -stats
控制台会输出 crawldb统计信息
查看linkdb目录:
打开src/java下的org.apache.nutch.crawl.LinkDbReader.java类
单击右键Run As > Java Applicaton,控制台会输出该命令的使用方法
单击右键Run As > Run Configurations > Arguments > 在Program arguments输入框中输入: data/linkdb -dump data/linkdb_dump
用文本编辑器打开文件data/linkdb_dump/part-00000查看linkdb中存储的信息
10、全网分步骤抓取
在左部Package Explorer的 nutch1.5.1文件夹下的build.xml文件上单击右键 > Run As > Ant Build
cd /home/ysc/workspace/nutch1.5.1/runtime/local
#准备URL列表
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/url
#注入URL
bin/nutch inject crawl/crawldb dmoz
#生成抓取列表
bin/nutch generate crawl/crawldb crawl/segments
#第一次抓取
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
#抓取网页
bin/nutch fetch $s1
#解析网页
bin/nutch parse $s1
#更新URL状态
bin/nutch updatedb crawl/crawldb $s1
#第二次抓取
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
#第三次抓取
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3
#生成反向链接库
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
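上面的分步抓取本质上是"generate → fetch → parse → updatedb"循环若干轮，首尾分别是 inject 和 invertlinks。下面的 Python 草稿按轮数生成这一命令序列（只生成命令字符串不实际执行；$s 代表每轮最新的 segment 目录，函数名为假设）：

```python
def crawl_commands(rounds, topn=1000):
    """生成分步抓取的完整命令清单：inject + rounds 轮循环 + invertlinks。"""
    cmds = ["bin/nutch inject crawl/crawldb dmoz"]
    for i in range(rounds):
        gen = "bin/nutch generate crawl/crawldb crawl/segments"
        if i > 0:  # 与原文一致：第一轮 generate 不加 -topN
            gen += " -topN %d" % topn
        cmds += [gen,
                 "bin/nutch fetch $s",
                 "bin/nutch parse $s",
                 "bin/nutch updatedb crawl/crawldb $s"]
    cmds.append("bin/nutch invertlinks crawl/linkdb -dir crawl/segments")
    return cmds

cmds = crawl_commands(3)
print(len(cmds))  # 1 + 3*4 + 1 = 14
```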
11、索引和搜索
cd /home/ysc/
wget http://mirror.bjtu.edu.cn/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz
tar -xvf apache-solr-3.6.1.tgz
cd apache-solr-3.6.1/example
NUTCH_RUNTIME_HOME=/home/ysc/workspace/nutch1.5.1/runtime/local
APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
如果需要把网页内容存储到索引中,则修改 schema.xml文件中的
<field name="content" type="text" stored="false" indexed="true"/>
为
<field name="content" type="text" stored="true" indexed="true"/>
修改${APACHE_SOLR_HOME}/example/solr/conf/solrconfig.xml,将里面的<str name="df">text</str>都替换为<str name="df">content</str>
把${APACHE_SOLR_HOME}/example/solr/conf/schema.xml中的 <schema name="nutch" version="1.5.1">修改为<schema name="nutch" version="1.5">
#启动SOLR服务器
java -jar start.jar
http://127.0.0.1:8983/solr/admin/
http://127.0.0.1:8983/solr/admin/stats.jsp
cd /home/ysc/workspace/nutch1.5.1/runtime/local
#提交索引
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
执行完整crawl:
bin/nutch crawl urls -dir data -depth 2 -topN 100 -solr http://127.0.0.1:8983/solr/
使用以下命令分页查看所有索引的文档:
http://127.0.0.1:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
标题包含“网易”的文档:
http://127.0.0.1:8983/solr/select/?q=title%3A%E7%BD%91%E6%98%93&version=2.2&start=0&rows=10&indent=on
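上面查询串里的 %E7%BD%91%E6%98%93 就是"网易"二字的 UTF-8 百分号编码。下面的 Python 片段用标准库复现这一编码，并拼出同样形式的 select 查询 URL：

```python
from urllib.parse import quote, urlencode

q = "title:" + "网易"
params = urlencode({"q": q, "version": "2.2", "start": 0, "rows": 10, "indent": "on"})
url = "http://127.0.0.1:8983/solr/select/?" + params
print(quote("网易"))  # %E7%BD%91%E6%98%93
```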
12、查看索引信息
cd /home/ysc/
wget http://luke.googlecode.com/files/lukeall-3.5.0.jar
java -jar lukeall-3.5.0.jar
Path: /home/ysc/apache-solr-3.6.1/example/solr/data
13、配置SOLR的中文分词
cd /home/ysc/
wget http://mmseg4j.googlecode.com/files/mmseg4j-1.8.5.zip
unzip mmseg4j-1.8.5.zip -d mmseg4j-1.8.5
APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
mkdir $APACHE_SOLR_HOME/example/solr/lib
mkdir $APACHE_SOLR_HOME/example/solr/dic
cp mmseg4j-1.8.5/mmseg4j-all-1.8.5.jar $APACHE_SOLR_HOME/example/solr/lib
cp mmseg4j-1.8.5/data/*.dic $APACHE_SOLR_HOME/example/solr/dic
将${APACHE_SOLR_HOME}/example/solr/conf/schema.xml文件中的
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
和
<tokenizer class="solr.StandardTokenizerFactory"/>
替换为
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/apache-solr-3.6.1/example/solr/dic"/>
#重新启动SOLR服务器
java -jar start.jar
#重建索引,演示在开发环境中如何操作
打开src/java下的org.apache.nutch.indexer.solr.SolrIndexer.java类
单击右键Run As > Java Applicaton,控制台会输出该命令的使用方法
单击右键Run As > Run Configurations > Arguments > 在Program arguments输入框中输入: http://127.0.0.1:8983/solr/ data/crawldb -linkdb data/linkdb data/segments/*
使用luke重新打开索引就会发现分词起作用了
三、nutch2.0
nutch2.0和二中的nutch1.5.1的步骤相同,但在8、开发调试之前需要做以下配置:
在左部Package Explorer的 nutch2.0文件夹上单击右键 > New > Folder > Folder name: data并指定数据存储方式,选如下之一:
1、使用mysql作为数据存储
1)、在nutch2.0/conf/nutch-site.xml中加入如下配置:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
2)、将nutch2.0/conf/gora.properties文件中的
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
修改为
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch2
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ROOT
3)、打开nutch2.0/ivy/ivy.xml中的mysql-connector-java依赖
4)、sudo apt-get install mysql-server
2、使用hbase作为数据存储
1)、在nutch2.0/conf/nutch-site.xml中加入如下配置:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
2)、打开nutch2.0/ivy/ivy.xml中的gora-hbase依赖
3)、cd /home/ysc
4)、wget http://mirror.bit.edu.cn/apache/hbase/hbase-0.90.5/hbase-0.90.5.tar.gz
5)、tar -xvf hbase-0.90.5.tar.gz
6)、vi hbase-0.90.5/conf/hbase-site.xml
加入以下配置:
<property>
<name>hbase.rootdir</name>
<value>file:///home/ysc/hbase-0.90.5-database</value>
</property>
7)、hbase-0.90.5/bin/start-hbase.sh
8)、将/home/ysc/hbase-0.90.5/hbase-0.90.5.jar加入开发环境eclipse的build path
四、配置SSH
三台机器 devcluster01, devcluster02, devcluster03,分别在每一台机器上面执行如下操作:
1、sudo vi /etc/hosts
加入以下配置:
192.168.1.1 devcluster01
192.168.1.2 devcluster02
192.168.1.3 devcluster03
2、安装SSH服务:
sudo apt-get install openssh-server
3、(有提示的时候回车键确认)
ssh-keygen -t rsa
该命令会在用户主目录下创建 .ssh 目录，并在其中生成一对基于 RSA 算法的密钥文件：id_rsa 为私钥文件，要妥善保管、不要泄漏；id_rsa.pub 为公钥文件，与 id_rsa 成对，可以公开。
4、cp .ssh/id_rsa.pub .ssh/authorized_keys
把 三台机器 devcluster01, devcluster02, devcluster03 的文件/home/ysc/.ssh/authorized_keys的内容复制出来合并成一个文件并替换每一台机器上的/home/ysc/.ssh/authorized_keys文件
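"把各机器的 authorized_keys 内容合并成一个文件"就是对公钥行做去重并集。下面的 Python 草稿演示这一合并逻辑（其中的公钥内容为演示用的假设值，函数名也是假设）：

```python
def merge_authorized_keys(files_contents):
    """按出现顺序合并多份 authorized_keys 文本，去掉重复的公钥行和空行。"""
    seen, merged = set(), []
    for content in files_contents:
        for line in content.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    return "\n".join(merged)

# 三台机器各自 authorized_keys 的内容（示例值，第三份与第一份重复）
merged = merge_authorized_keys(["ssh-rsa AAA ysc@devcluster01",
                                "ssh-rsa BBB ysc@devcluster02",
                                "ssh-rsa AAA ysc@devcluster01"])
print(merged)
```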
在devcluster01上面执行时,以下两条命令的主机为02和03
在devcluster02上面执行时,以下两条命令的主机为01和03
在devcluster03上面执行时,以下两条命令的主机为01和02
5、ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster02
6、ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster03
以上两条命令实际上是将 .ssh/id_rsa.pub 公钥文件追加到远程主机上对应用户主目录下的 .ssh/authorized_keys 文件中。
五、安装Hadoop Cluster(伪分布式运行模式)并运行Nutch
步骤和六大同小异，只需要1台机器 devcluster01，所以黄色背景部分全部设置为devcluster01，不需要第11步
六、安装Hadoop Cluster(分布式运行模式)并运行Nutch
三台机器 devcluster01, devcluster02, devcluster03(vi /etc/hostname)
使用用户ysc登陆 devcluster01:
1、cd /home/ysc
2、wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1-bin.tar.gz
3、tar -xvf hadoop-1.1.1-bin.tar.gz
4、cd hadoop-1.1.1
5、vi conf/masters
替换内容为 :
devcluster01
6、vi conf/slaves
替换内容为 :
devcluster02
devcluster03
7、vi conf/core-site.xml
加入配置:
<property>
<name>fs.default.name</name>
<value>hdfs://devcluster01:9000</value>
<description>
Where to find the Hadoop Filesystem through the network.
Note 9000 is not the default port.
(This is slightly changed from previous versions which didn't have "hdfs")
</description>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
编辑conf/hadoop-policy.xml
8、vi conf/hdfs-site.xml
加入配置:
<property>
<name>dfs.name.dir</name>
<value>/home/ysc/dfs/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/ysc/dfs/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>671088640</value>
<description>The default block size for new files.</description>
</property>
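上面 dfs.block.size 配置的 671088640 字节即 640MB,可以用 shell 核算一下(示意):

```shell
# 核算 dfs.block.size:640MB 换算成字节,应等于配置中的 671088640
bytes=$(expr 640 \* 1024 \* 1024)
echo "$bytes"
```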
9、vi conf/mapred-site.xml
加入配置:
<property>
<name>mapred.job.tracker</name>
<value>devcluster01:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
Note 9001 is not the default port.
</description>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
<description>If true, then multiple instances of some reduce tasks
may be executed in parallel.</description>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
<description>If true, then multiple instances of some map tasks
may be executed in parallel.</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2000m</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
<description>
the core number of host
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
<description>
define mapred.map tasks to be the number of slave hosts. The best number is the number of slave hosts plus the number of cores per host
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>4</value>
<description>
define mapred.reduce tasks to be the number of slave hosts. The best number is the number of slave hosts plus the number of cores per host
</description>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/ysc/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/ysc/mapreduce/local</value>
</property>
10、vi conf/hadoop-env.sh
追加:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HADOOP_HEAPSIZE=2000
#替换掉默认的垃圾回收器,因为默认的垃圾回收器在多线程环境下会有更多的wait等待
export HADOOP_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
11、复制HADOOP文件
scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster02:/home/ysc/hadoop-1.1.1
scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster03:/home/ysc/hadoop-1.1.1
12、sudo vi /etc/profile
追加并重启系统:
export PATH=/home/ysc/hadoop-1.1.1/bin:$PATH
13、格式化名称节点并启动集群
hadoop namenode -format
start-all.sh
14、cd /home/ysc/workspace/nutch1.5.1/runtime/deploy
mkdir urls
echo http://news.163.com > urls/url
hadoop dfs -put urls urls
bin/nutch crawl urls -dir data -depth 2 -topN 100
15、访问 http://localhost:50030 可以查看 JobTracker 的运行状态。访问 http://localhost:50060 可以查看 TaskTracker 的运行状态。访问 http://localhost:50070 可以查看 NameNode 以及整个分布式文件系统的状态,浏览分布式文件系统中的文件以及 log 等
16、通过stop-all.sh停止集群
17、如果NameNode和SecondaryNameNode不在同一台机器上,则在SecondaryNameNode的conf/hdfs-site.xml文件中加入配置:
<property>
<name>dfs.http.address</name>
<value>namenode:50070</value>
</property>
七、配置Ganglia监控Hadoop集群和HBase集群
1、服务器端(安装到master devcluster01上)
1)、ssh devcluster01
2)、addgroup ganglia
adduser --ingroup ganglia ganglia
3)、sudo apt-get install ganglia-monitor ganglia-webfront gmetad
//补充:在Ubuntu10.04上,ganglia-webfront这个package名字叫ganglia-webfrontend
//如果install出错,则运行sudo apt-get update,如果update出错,则删除出错路径
4)、vi /etc/ganglia/gmond.conf
先找到 setuid = yes,改成 setuid = no;
再找到 cluster 块中的 name,改成 name = "hadoop-cluster";
5)、sudo apt-get install rrdtool
6)、vi /etc/ganglia/gmetad.conf
在这个配置文件中增加一些datasource,即其他2个被监控的节点,增加以下内容:
data_source "hadoop-cluster" devcluster01:8649 devcluster02:8649 devcluster03:8649
gridname "Hadoop"
2、数据源端(安装到所有slaves上)
1)、ssh devcluster02
addgroup ganglia
adduser --ingroup ganglia ganglia
sudo apt-get install ganglia-monitor
2)、ssh devcluster03
addgroup ganglia
adduser --ingroup ganglia ganglia
sudo apt-get install ganglia-monitor
3)、ssh devcluster01
scp /etc/ganglia/gmond.conf devcluster02:/etc/ganglia/gmond.conf
scp /etc/ganglia/gmond.conf devcluster03:/etc/ganglia/gmond.conf
3、配置WEB
1)、ssh devcluster01
2)、sudo ln -s /usr/share/ganglia-webfrontend /var/www/ganglia
3)、vi /etc/apache2/apache2.conf
添加:
ServerName devcluster01
4、重启服务
1)、ssh devcluster02
sudo /etc/init.d/ganglia-monitor restart
ssh devcluster03
sudo /etc/init.d/ganglia-monitor restart
2)、ssh devcluster01
sudo /etc/init.d/ganglia-monitor restart
sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart
5、访问页面
http://devcluster01/ganglia
6、集成hadoop
1)、ssh devcluster01
2)、cd /home/ysc/hadoop-1.1.1
3)、vi conf/hadoop-metrics2.properties
# 大于0.20以后的版本用ganglia31
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# default for supportsparse is false
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
#组播IP地址,这是缺省值,统一设该值(只能用组播地址239.2.11.71)
namenode.sink.ganglia.servers=239.2.11.71:8649
datanode.sink.ganglia.servers=239.2.11.71:8649
jobtracker.sink.ganglia.servers=239.2.11.71:8649
tasktracker.sink.ganglia.servers=239.2.11.71:8649
maptask.sink.ganglia.servers=239.2.11.71:8649
reducetask.sink.ganglia.servers=239.2.11.71:8649
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.2.11.71:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
4)、scp conf/hadoop-metrics2.properties root@devcluster02:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
5)、scp conf/hadoop-metrics2.properties root@devcluster03:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
6)、stop-all.sh
7)、start-all.sh
7、集成hbase
1)、ssh devcluster01
2)、cd /home/ysc/hbase-0.92.2
3)、vi conf/hadoop-metrics.properties(只能用组播地址239.2.11.71)
hbase.extendedperiod = 3600
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=239.2.11.71:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=239.2.11.71:8649
4)、scp conf/hadoop-metrics.properties root@devcluster02:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
5)、scp conf/hadoop-metrics.properties root@devcluster03:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
6)、stop-hbase.sh
7)、start-hbase.sh
八、Hadoop配置Snappy压缩
1、wget http://snappy.googlecode.com/files/snappy-1.0.5.tar.gz
2、tar -xzvf snappy-1.0.5.tar.gz
3、cd snappy-1.0.5
4、./configure
5、make
6、make install
7、scp /usr/local/lib/libsnappy* devcluster01:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* devcluster02:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* devcluster03:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
8、vi /etc/profile
追加:
export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
9、修改mapred-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the map outputs are compressed, how should they be
compressed?
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
九、Hadoop配置Lzo压缩
1、wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
2、tar -zxvf lzo-2.06.tar.gz
3、cd lzo-2.06
4、./configure --enable-shared
5、make
6、make install
7、scp /usr/local/lib/liblzo2.* devcluster01:/lib/x86_64-linux-gnu
scp /usr/local/lib/liblzo2.* devcluster02:/lib/x86_64-linux-gnu
scp /usr/local/lib/liblzo2.* devcluster03:/lib/x86_64-linux-gnu
8、wget http://hadoop-gpl-compression.apache-extras.org.codespot.com/files/hadoop-gpl-compression-0.1.0-rc0.tar.gz
9、tar -xzvf hadoop-gpl-compression-0.1.0-rc0.tar.gz
10、cd hadoop-gpl-compression-0.1.0
11、cp lib/native/Linux-amd64-64/* /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
12、cp hadoop-gpl-compression-0.1.0.jar /home/ysc/hadoop-1.1.1/lib/(这里hadoop集群的版本要和compression使用的版本一致)
13、scp -r /home/ysc/hadoop-1.1.1/lib devcluster02:/home/ysc/hadoop-1.1.1/
scp -r /home/ysc/hadoop-1.1.1/lib devcluster03:/home/ysc/hadoop-1.1.1/
14、vi /etc/profile
追加:
export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
15、修改core-site.xml
<property>
<name>io.compression.codecs</name>
<value>com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
<description>A list of the compression codec classes that can be used
for compression/decompression.</description>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<description>Number of minutes between trash checkpoints.
If zero, the trash feature is disabled.
</description>
</property>
16、修改mapred-site.xml
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>If the map outputs are compressed, how should they be
compressed?
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
十、配置zookeeper集群以运行hbase
1、ssh devcluster01
2、cd /home/ysc
3、wget http://mirror.bjtu.edu.cn/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz
4、tar -zxvf zookeeper-3.4.5.tar.gz
5、cd zookeeper-3.4.5
6、cp conf/zoo_sample.cfg conf/zoo.cfg
7、vi conf/zoo.cfg
修改:dataDir=/home/ysc/zookeeper
添加:
server.1=devcluster01:2888:3888
server.2=devcluster02:2888:3888
server.3=devcluster03:2888:3888
maxClientCnxns=100
8、scp -r zookeeper-3.4.5 devcluster01:/home/ysc
scp -r zookeeper-3.4.5 devcluster02:/home/ysc
scp -r zookeeper-3.4.5 devcluster03:/home/ysc
9、分别在三台机器上面执行:
ssh devcluster01
mkdir /home/ysc/zookeeper(注:dataDir是zookeeper的数据目录,需要手动创建)
echo 1 > /home/ysc/zookeeper/myid
ssh devcluster02
mkdir /home/ysc/zookeeper
echo 2 > /home/ysc/zookeeper/myid
ssh devcluster03
mkdir /home/ysc/zookeeper
echo 3 > /home/ysc/zookeeper/myid
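第9步在每台机器上写 myid 的操作,也可以根据 zoo.cfg 里的 server.N 行自动推导出本机编号。下面是一个在本地临时目录里模拟的示意脚本(其中 host 变量假设为本机主机名):

```shell
# 示意:从 zoo.cfg 的 server.N=主机名:... 行推导出本机的 myid
set -e
dir=$(mktemp -d)
cat > "$dir/zoo.cfg" <<'EOF'
server.1=devcluster01:2888:3888
server.2=devcluster02:2888:3888
server.3=devcluster03:2888:3888
EOF
host=devcluster02   # 假设本机主机名是 devcluster02
# 找到本机对应的 server.N 行,取出编号 N 写入 myid
id=$(grep "=$host:" "$dir/zoo.cfg" | sed 's/^server\.\([0-9]*\)=.*/\1/')
echo "$id" > "$dir/myid"
cat "$dir/myid"
```

这样三台机器可以共用同一段脚本,只要主机名与 zoo.cfg 中的 server.N 行一一对应即可。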
10、分别在三台机器上面执行:
cd /home/ysc/zookeeper-3.4.5
bin/zkServer.sh start
bin/zkCli.sh -server devcluster01:2181
bin/zkServer.sh status
十一、配置Hbase集群以运行nutch-2.1(Region Servers会因为内存的问题宕机)
1、nutch-2.1使用gora-0.2.1,gora-0.2.1使用hbase-0.90.4;hbase-0.90.4和hadoop-1.1.1不兼容,hbase-0.94.4和gora-0.2.1不兼容,hbase-0.92.2没问题。hbase要求集群各节点系统时间同步,误差要在30s以内。
sudo apt-get install ntp
sudo ntpdate -u 210.72.145.44
2、HBase是数据库,会在同一时间打开很多文件句柄,大多数Linux系统默认的1024不能满足需要。还需要修改运行hbase的用户的nproc限制,该值过低时,在压力下会造成OutOfMemoryError异常。
vi /etc/security/limits.conf
添加:
ysc soft nproc 32000
ysc hard nproc 32000
ysc soft nofile 32768
ysc hard nofile 32768
vi /etc/pam.d/common-session
添加:
session required pam_limits.so
3、登陆master,下载并解压hbase
ssh devcluster01
cd /home/ysc
wget http://apache.etoak.com/hbase/hbase-0.92.2/hbase-0.92.2.tar.gz
tar -zxvf hbase-0.92.2.tar.gz
cd hbase-0.92.2
4、修改配置文件hbase-env.sh
vi conf/hbase-env.sh
追加:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HBASE_MANAGES_ZK=false
export HBASE_HEAPSIZE=10000
#替换掉默认的垃圾回收器,因为默认的垃圾回收器在多线程环境下会有更多的wait等待
export HBASE_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
5、修改配置文件hbase-site.xml
vi conf/hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>hdfs://devcluster01:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>devcluster01,devcluster02,devcluster03</value>
</property>
<property>
<name>hfile.block.cache.size</name>
<value>0.25</value>
<description>
Percentage of maximum heap (-Xmx setting) to allocate to block cache
used by HFile/StoreFile. Default of 0.25 means allocate 25%.
Set to 0 to disable but it's not recommended.
</description>
</property>
<property>
<name>hbase.regionserver.global.memstore.upperLimit</name>
<value>0.4</value>
<description>Maximum size of all memstores in a region server before new
updates are blocked and flushes are forced. Defaults to 40% of heap
</description>
</property>
<property>
<name>hbase.regionserver.global.memstore.lowerLimit</name>
<value>0.35</value>
<description>When memstores are being forced to flush to make room in
memory, keep flushing until we hit this mark. Defaults to 35% of heap.
This value equal to hbase.regionserver.global.memstore.upperLimit causes
the minimum possible flushing to occur when updates are blocked due to
memstore limiting.
</description>
</property>
<property>
<name>hbase.hregion.majorcompaction</name>
<value>0</value>
<description>The time (in milliseconds) between 'major' compactions of all
HStoreFiles in a region. Default: 1 day.
Set to 0 to disable automated major compactions.
</description>
</property>
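hfile.block.cache.size 与 hbase.regionserver.global.memstore.upperLimit 之和按惯例不应超过 0.8,否则堆内存容易吃紧(HBase 较新版本会在启动时做类似校验);可以用下面的小脚本自查,数值取自上面的配置,仅为示意:

```shell
# 检查块缓存与 memstore 上限之和是否不超过 0.8(数值来自上面的 hbase-site.xml)
cache=0.25
upper=0.4
ok=$(awk -v c="$cache" -v u="$upper" 'BEGIN{ if (c + u <= 0.8) print "OK"; else print "TOO_HIGH" }')
echo "$ok"
```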
6、修改配置文件regionservers
vi conf/regionservers
devcluster01
devcluster02
devcluster03
7、因为HBase建立在Hadoop之上,Hadoop使用的hadoop*.jar和HBase使用的必须一致,所以要将HBase lib目录下的hadoop*.jar替换成Hadoop里面的那个,防止版本冲突。
cp /home/ysc/hadoop-1.1.1/hadoop-core-1.1.1.jar /home/ysc/hbase-0.92.2/lib
rm /home/ysc/hbase-0.92.2/lib/hadoop-core-1.0.3.jar
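第7步的替换逻辑可以写成一个小脚本:比较两边 hadoop-core 的版本号,不一致时才替换。这里在本地临时目录里模拟 Hadoop 和 HBase 的 lib 目录,仅为示意:

```shell
# 示意:比较 Hadoop 与 HBase lib 中 hadoop-core 的版本号,不一致时替换
set -e
d=$(mktemp -d)
mkdir -p "$d/hadoop" "$d/hbase/lib"
touch "$d/hadoop/hadoop-core-1.1.1.jar" "$d/hbase/lib/hadoop-core-1.0.3.jar"
# 从文件名中取出版本号
hv=$(ls "$d/hadoop" | sed 's/hadoop-core-\(.*\)\.jar/\1/')
bv=$(ls "$d/hbase/lib" | grep hadoop-core | sed 's/hadoop-core-\(.*\)\.jar/\1/')
if [ "$hv" != "$bv" ]; then
  cp "$d/hadoop/hadoop-core-$hv.jar" "$d/hbase/lib/"
  rm "$d/hbase/lib/hadoop-core-$bv.jar"
fi
ls "$d/hbase/lib"
```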
8、复制文件到regionservers
scp -r /home/ysc/hbase-0.92.2 devcluster01:/home/ysc
scp -r /home/ysc/hbase-0.92.2 devcluster02:/home/ysc
scp -r /home/ysc/hbase-0.92.2 devcluster03:/home/ysc
9、启动hadoop并创建目录
hadoop fs -mkdir /hbase
10、管理HBase集群:
启动初始 HBase 集群:
bin/start-hbase.sh
停止HBase 集群:
bin/stop-hbase.sh
启动额外备份主服务器,可以启动到 9 个备份服务器 (总数10 个):
bin/local-master-backup.sh start 1
bin/local-master-backup.sh start 2 3
启动更多 regionservers, 支持到 99 个额外regionservers (总100个):
bin/local-regionservers.sh start 1
bin/local-regionservers.sh start 2 3 4 5
停止备份主服务器:
cat /tmp/hbase-ysc-1-master.pid |xargs kill -9
停止单独 regionserver:
bin/local-regionservers.sh stop 1
使用HBase命令行模式:
bin/hbase shell
11、web界面
http://devcluster01:60010
http://devcluster01:60030
12、如运行nutch2.1则方法一:
cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
cd /home/ysc/nutch-2.1
ant
cd runtime/deploy
unzip -d apache-nutch-2.1 apache-nutch-2.1.job
rm apache-nutch-2.1.job
cd apache-nutch-2.1
rm lib/hbase-0.90.4.jar
cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar lib
zip -r ../apache-nutch-2.1.job ./*
cd ..
rm -r apache-nutch-2.1
13、如运行nutch2.1则方法二:
cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
cd /home/ysc/nutch-2.1
cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar lib
ant
cd runtime/deploy
zip -d apache-nutch-2.1.job lib/hbase-0.90.4.jar
启用snappy压缩:
1、vi conf/gora-hbase-mapping.xml
在family上面添加属性:compression="SNAPPY"
2、mkdir /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
3、cp /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/* /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
4、vi /home/ysc/hbase-0.92.2/conf/hbase-site.xml
增加:
<property>
<name>hbase.regionserver.codecs</name>
<value>snappy</value>
</property>
十二、配置Accumulo集群以运行nutch-2.1(gora存在BUG)
1、wget http://apache.etoak.com/accumulo/1.4.2/accumulo-1.4.2-dist.tar.gz
2、tar -xzvf accumulo-1.4.2-dist.tar.gz
3、cd accumulo-1.4.2
4、cp conf/examples/3GB/standalone/* conf
5、vi conf/accumulo-env.sh
export HADOOP_HOME=/home/ysc/cluster3
export ZOOKEEPER_HOME=/home/ysc/zookeeper-3.4.5
export JAVA_HOME=/home/jdk1.7.0_01
export ACCUMULO_HOME=/home/ysc/accumulo-1.4.2
6、vi conf/slaves
devcluster01
devcluster02
devcluster03
7、vi conf/masters
devcluster01
8、vi conf/accumulo-site.xml
<property>
<name>instance.zookeeper.host</name>
<value>host6:2181,host8:2181</value>
<description>comma separated list of zookeeper servers</description>
</property>
<property>
<name>logger.dir.walog</name>
<value>walogs</value>
<description>The directory used to store write-ahead logs on the local filesystem. It is possible to specify a comma-separated list of directories.</description>
</property>
<property>
<name>instance.secret</name>
<value>ysc</value>
<description>A secret unique to a given instance that all servers must know in order to communicate with one another.
Change it before initialization. To change it later use ./bin/accumulo org.apache.accumulo.server.util.ChangeSecret [oldpasswd] [newpasswd],
and then update this file.
</description>
</property>
<property>
<name>tserver.memory.maps.max</name>
<value>3G</value>
</property>
<property>
<name>tserver.cache.data.size</name>
<value>50M</value>
</property>
<property>
<name>tserver.cache.index.size</name>
<value>512M</value>
</property>
<property>
<name>trace.password</name>
<!--
change this to the root user's password, and/or change the user below
-->
<value>ysc</value>
</property>
<property>
<name>trace.user</name>
<value>root</value>
</property>
9、bin/accumulo init
10、bin/start-all.sh
11、bin/stop-all.sh
12、web访问:http://devcluster01:50095/
修改nutch2.1:
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
gora.datastore.accumulo.mock=false
gora.datastore.accumulo.instance=accumulo
gora.datastore.accumulo.zookeepers=host6,host8
gora.datastore.accumulo.user=root
gora.datastore.accumulo.password=ysc
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.accumulo.store.AccumuloStore</value>
</property>
4、vi ivy/ivy.xml
增加:
<dependency org="org.apache.gora" name="gora-accumulo" rev="0.2.1" conf="*->default" />
5、升级accumulo
cp /home/ysc/accumulo-1.4.2/lib/accumulo-core-1.4.2.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/accumulo-1.4.2/lib/accumulo-start-1.4.2.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/accumulo-1.4.2/lib/cloudtrace-1.4.2.jar /home/ysc/nutch-2.1/lib
6、ant
7、cd runtime/deploy
8、删除旧jar
zip -d apache-nutch-2.1.job lib/accumulo-core-1.4.0.jar
zip -d apache-nutch-2.1.job lib/accumulo-start-1.4.0.jar
zip -d apache-nutch-2.1.job lib/cloudtrace-1.4.0.jar
十三、配置Cassandra 集群以运行nutch-2.1(Cassandra 采用去中心化结构)
1、vi /etc/hosts(注意:需要登录到每一台机器上面,将localhost解析到实际地址)
192.168.1.1 localhost
2、wget http://labs.mop.com/apache-mirror/cassandra/1.2.0/apache-cassandra-1.2.0-bin.tar.gz
3、tar -xzvf apache-cassandra-1.2.0-bin.tar.gz
4、cd apache-cassandra-1.2.0
5、vi conf/cassandra-env.sh
增加:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
6、vi conf/log4j-server.properties
修改:
log4j.appender.R.File=/home/ysc/cassandra/system.log
7、vi conf/cassandra.yaml
修改:
cluster_name: 'Cassandra Cluster'
data_file_directories:
- /home/ysc/cassandra/data
commitlog_directory: /home/ysc/cassandra/commitlog
saved_caches_directory: /home/ysc/cassandra/saved_caches
- seeds: "192.168.1.1"
listen_address: 192.168.1.1
rpc_address: 192.168.1.1
thrift_framed_transport_size_in_mb: 1023
thrift_max_message_length_in_mb: 1024
8、vi bin/stop-server
增加:
user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
9、复制cassandra到其他节点:
cd ..
scp -r apache-cassandra-1.2.0 devcluster02:/home/ysc
scp -r apache-cassandra-1.2.0 devcluster03:/home/ysc
分别在devcluster02和devcluster03上面修改:
vi conf/cassandra.yaml
listen_address: 192.168.1.2
rpc_address: 192.168.1.2
vi conf/cassandra.yaml
listen_address: 192.168.1.3
rpc_address: 192.168.1.3
10、分别在3个节点上面运行
bin/cassandra
bin/cassandra -f(参数 -f 的作用是让 Cassandra 以前台程序方式运行,这样有利于调试和观察日志信息;在实际生产环境中不需要这个参数,即让 Cassandra 以 daemon 方式运行)
11、bin/nodetool -host devcluster01 ring
bin/nodetool -host devcluster01 info
12、bin/stop-server
13、bin/cassandra-cli
修改nutch2.1:
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.cassandrastore.servers=host2:9160,host6:9160,host8:9160
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.cassandra.store.CassandraStore</value>
</property>
4、vi ivy/ivy.xml
增加:
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2.1" conf="*->default" />
5、升级cassandra
cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-1.2.0.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-thrift-1.2.0.jar /home/ysc/nutch-2.1/lib
cp /home/ysc/apache-cassandra-1.2.0/lib/jline-1.0.jar /home/ysc/nutch-2.1/lib
6、ant
7、cd runtime/deploy
8、删除旧jar
zip -d apache-nutch-2.1.job lib/cassandra-thrift-1.1.2.jar
zip -d apache-nutch-2.1.job lib/jline-0.9.1.jar
十四、配置MySQL 单机服务器以运行nutch-2.1
1、apt-get install mysql-server mysql-client
2、vi /etc/mysql/my.cnf
修改:
bind-address = 221.194.43.2
在[client]下增加:
default-character-set=utf8
在[mysqld]下增加(MySQL 5.5及以上版本的[mysqld]段不再支持default-character-set,应改用character-set-server):
character-set-server=utf8
3、mysql -uroot -pysc
SHOW VARIABLES LIKE '%character%';
4、service mysql restart
5、mysql -uroot -pysc
GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "ysc";
6、vi conf/gora-sql-mapping.xml
修改字段的长度
<primarykey column="id" length="333"/>
<field name="content" column="content" />
<field name="text" column="text" length="19892"/>
7、启动nutch之后登陆mysql
ALTER TABLE webpage MODIFY COLUMN content MEDIUMBLOB;
ALTER TABLE webpage MODIFY COLUMN text MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN title MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN reprUrl MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN baseUrl MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN typ MEDIUMTEXT;
ALTER TABLE webpage MODIFY COLUMN inlinks MEDIUMBLOB;
ALTER TABLE webpage MODIFY COLUMN outlinks MEDIUMBLOB;
修改nutch2.1:
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://host2:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ysc
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
<property>
<name>encodingdetector.charset.min.confidence</name>
<value>1</value>
<description>An integer between 0-100 indicating minimum confidence value
for charset auto-detection. Any negative value disables auto-detection.
</description>
</property>
4、vi ivy/ivy.xml
增加:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
十五、nutch2.1 使用DataFileAvroStore作为数据源
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.datafileavrostore.output.path=datafileavrostore
gora.datafileavrostore.input.path=datafileavrostore
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.avro.store.DataFileAvroStore</value>
</property>
<property>
<name>encodingdetector.charset.min.confidence</name>
<value>1</value>
<description>An integer between 0-100 indicating minimum confidence value
for charset auto-detection. Any negative value disables auto-detection.
</description>
</property>
十六、nutch2.1 使用AvroStore作为数据源
1、cd /home/ysc/nutch-2.1
2、vi conf/gora.properties
增加:
gora.avrostore.codec.type=BINARY
gora.avrostore.input.path=avrostore
gora.avrostore.output.path=avrostore
3、vi conf/nutch-site.xml
增加:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.avro.store.AvroStore</value>
</property>
<property>
<name>encodingdetector.charset.min.confidence</name>
<value>1</value>
<description>An integer between 0-100 indicating minimum confidence value
for charset auto-detection. Any negative value disables auto-detection.
</description>
</property>
十七、配置SOLR
配置tomcat:
1、wget http://www.fayea.com/apache-mirror/tomcat/tomcat-7/v7.0.35/bin/apache-tomcat-7.0.35.tar.gz
2、tar -xzvf apache-tomcat-7.0.35.tar.gz
3、cd apache-tomcat-7.0.35
4、vi conf/server.xml
增加URIEncoding="UTF-8":
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8"/>
5、mkdir conf/Catalina
6、mkdir conf/Catalina/localhost
7、vi conf/Catalina/localhost/solr.xml
增加:
<Context path="/solr">
<Environment name="solr/home" type="java.lang.String" value="/home/ysc/solr/configuration/" override="false"/>
</Context>
8、cd ..
下载SOLR:
1、wget http://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/4.1.0/solr-4.1.0.tgz
2、tar -xzvf solr-4.1.0.tgz
复制资源:
1、mkdir /home/ysc/solr
2、cp -r solr-4.1.0/example/solr /home/ysc/solr/configuration
3、unzip solr-4.1.0/example/webapps/solr.war -d /home/ysc/apache-tomcat-7.0.35/webapps/solr
配置nutch:
1、复制schema:
cp /home/ysc/nutch-1.6/conf/schema-solr4.xml /home/ysc/solr/configuration/collection1/conf/schema.xml
2、vi /home/ysc/solr/configuration/collection1/conf/schema.xml
在<fields>下增加:
<field name="_version_" type="long" indexed="true" stored="true"/>
配置中文分词:
1、wget http://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
2、unzip mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
3、cp mmseg4j-1.9.1-SNAPSHOT/dist/* /home/ysc/apache-tomcat-7.0.35/webapps/solr/WEB-INF/lib
4、unzip mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT.jar -d mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT
5、mkdir /home/ysc/dic
6、cp mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT/data/* /home/ysc/dic
7、vi /home/ysc/solr/configuration/collection1/conf/schema.xml
将文件中的
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
和
<tokenizer class="solr.StandardTokenizerFactory"/>
替换为
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/dic"/>
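第7步的两处替换可以用一条 GNU sed 完成,下面在本地模拟的片段文件上演示(仅为示意,实际对 schema.xml 操作前建议先备份):

```shell
# 示意:用 GNU sed 把两种 tokenizer 统一替换为 MMSegTokenizerFactory
set -e
f=$(mktemp)
cat > "$f" <<'EOF'
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
EOF
sed -i 's#<tokenizer class="solr\.\(Whitespace\|Standard\)TokenizerFactory"/>#<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/dic"/>#' "$f"
# 两行 tokenizer 都应被替换
grep -c 'MMSegTokenizerFactory' "$f"
```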
配置tomcat本地库:
1、wget http://apache.spd.co.il/apr/apr-1.4.6.tar.gz
2、tar -xzvf apr-1.4.6.tar.gz
3、cd apr-1.4.6
4、./configure
5、make
6、make install
1、wget http://mirror.bjtu.edu.cn/apache/apr/apr-util-1.5.1.tar.gz
2、tar -xzvf apr-util-1.5.1.tar.gz
3、cd apr-util-1.5.1
4、./configure --with-apr=/usr/local/apr
5、make
6、make install
1、wget http://mirror.bjtu.edu.cn/apache//tomcat/tomcat-connectors/native/1.1.24/source/tomcat-native-1.1.24-src.tar.gz
2、tar -zxvf tomcat-native-1.1.24-src.tar.gz
3、cd tomcat-native-1.1.24-src/jni/native
4、./configure --with-apr=/usr/local/apr \
--with-java-home=/home/ysc/jdk1.7.0_01 \
--with-ssl=no \
--prefix=/home/ysc/apache-tomcat-7.0.35
5、make
6、make install
7、vi /etc/profile
增加:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/ysc/apache-tomcat-7.0.35/lib:/usr/local/apr/lib
8、source /etc/profile
启动tomcat:
cd apache-tomcat-7.0.35
bin/catalina.sh start
http://devcluster01:8080/solr/
十八、Nagios监控
服务端:
1、apt-get install apache2 nagios3 nagios-nrpe-plugin
输入密码:nagiosadmin
2、apt-get install nagios3-doc
3、vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg
define hostgroup {
hostgroup_name nagios-servers
alias nagios servers
members devcluster01,devcluster02,devcluster03
}
4、cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster01_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster01_nagios2.cfg
替换:
g/localhost/s//devcluster01/g
g/127.0.0.1/s//192.168.1.1/g
5、cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster02_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster02_nagios2.cfg
替换:
g/localhost/s//devcluster02/g
g/127.0.0.1/s//192.168.1.2/g
6、cp /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster03_nagios2.cfg
vi /etc/nagios3/conf.d/devcluster03_nagios2.cfg
替换:
g/localhost/s//devcluster03/g
g/127.0.0.1/s//192.168.1.3/g
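上面第4~6步里的 vi 替换命令,等价于对复制出来的文件执行一条 sed,下面以 devcluster01 为例在本地临时文件上演示(仅为示意):

```shell
# 示意:用 sed 一步完成 localhost→devcluster01、127.0.0.1→192.168.1.1 的替换
set -e
f=$(mktemp)
printf 'host_name localhost\naddress 127.0.0.1\n' > "$f"
sed -i -e 's/localhost/devcluster01/g' -e 's/127\.0\.0\.1/192.168.1.1/g' "$f"
cat "$f"
```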
7、vi /etc/nagios3/conf.d/services_nagios2.cfg
将hostgroup_name改为nagios-servers
增加:
# check that web services are running
define service {
hostgroup_name nagios-servers
service_description HTTP
check_command check_http
use generic-service
notification_interval 0 ; set > 0 if you want to be renotified
}
# check that ssh services are running
define service {
hostgroup_name nagios-servers
service_description SSH
check_command check_ssh
use generic-service
notification_interval 0 ; set > 0 if you want to be renotified
}
8、vi /etc/nagios3/conf.d/extinfo_nagios2.cfg
将hostgroup_name改为nagios-servers
增加:
define hostextinfo{
hostgroup_name nagios-servers
notes nagios-servers
# notes_url http://webserver.localhost.localdomain/hostinfo.pl?host=netware1
icon_image base/debian.png
icon_image_alt Debian GNU/Linux
vrml_image debian.png
statusmap_image base/debian.gd2
}
9、sudo /etc/init.d/nagios3 restart
10、访问http://devcluster01/nagios3/
用户名:nagiosadmin密码:nagiosadmin
监控端:
1、apt-get install nagios-nrpe-server
2、vi /etc/nagios/nrpe.cfg
替换:
g/127.0.0.1/s//192.168.1.1/g
3、sudo /etc/init.d/nagios-nrpe-server restart
十九、配置Splunk
1、wget http://download.splunk.com/releases/5.0.2/splunk/linux/splunk-5.0.2-149561-Linux-x86_64.tgz
2、tar -zxvf splunk-5.0.2-149561-Linux-x86_64.tgz
3、cd splunk
4、bin/splunk start --answer-yes --no-prompt --accept-license
5、访问http://devcluster01:8000
用户名:admin 密码:changeme
6、添加数据 -> 从 UDP 端口 -> UDP 端口 *: 1688 -> 来源类型 从列表 log4j -> 保存
7、配置hadoop
vi /home/ysc/hadoop-1.1.1/conf/log4j.properties
修改:
log4j.rootLogger=${hadoop.root.logger}, EventCounter, SYSLOG
增加:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
8、配置hbase
vi /home/ysc/hbase-0.92.2/conf/log4j.properties
修改:
log4j.rootLogger=${hbase.root.logger},SYSLOG
增加:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
9. Configure nutch
vi /home/lanke/ysc/nutch-2.1-hbase/conf/log4j.properties
Change:
log4j.rootLogger=INFO,DRFA,SYSLOG
Add:
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n
log4j.appender.SYSLOG.SyslogHost=host6:1688
log4j.appender.SYSLOG.threshold=INFO
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.FacilityPrinting=true
10. Start hadoop and hbase
start-all.sh
start-hbase.sh
20. Configure Pig
1. wget http://labs.mop.com/apache-mirror/pig/pig-0.11.0/pig-0.11.0.tar.gz
2. tar -xzvf pig-0.11.0.tar.gz
3. cd pig-0.11.0
4. vi /etc/profile
Add:
export PIG_HOME=/home/ysc/pig-0.11.0
export PATH=$PIG_HOME/bin:$PATH
5. source /etc/profile
6. cp conf/log4j.properties.template conf/log4j.properties
7. vi conf/log4j.properties
8. pig
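The PATH change made in /etc/profile above can be sanity-checked before launching pig. A quick sketch (the paths are the tutorial's; adjust to your install location):

```shell
# Re-apply the two export lines and confirm Pig's bin directory leads PATH
export PIG_HOME=/home/ysc/pig-0.11.0
export PATH=$PIG_HOME/bin:$PATH
echo "$PATH" | cut -d: -f1   # /home/ysc/pig-0.11.0/bin
```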
21. Configure Hive
1. wget http://mirrors.cnnic.cn/apache/hive/hive-0.10.0/hive-0.10.0.tar.gz
2. tar -xzvf hive-0.10.0.tar.gz
3. cd hive-0.10.0
4. vi /etc/profile
Add:
export HIVE_HOME=/home/ysc/hive-0.10.0
export PATH=$HIVE_HOME/bin:$PATH
5. source /etc/profile
6. cp conf/hive-log4j.properties.template conf/hive-log4j.properties
7. vi conf/hive-log4j.properties
Replace:
log4j.appender.EventCounter=org.apache.hadoop.metrics.jvm.EventCounter
with:
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
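If you prefer not to edit the file by hand, the step-7 replacement can be scripted with sed. A sketch against a scratch copy of the line (in practice, point the sed command at conf/hive-log4j.properties):

```shell
# Recreate the line to be fixed, then apply the EventCounter class rename
echo "log4j.appender.EventCounter=org.apache.hadoop.metrics.jvm.EventCounter" > /tmp/hive-log4j.demo
sed -i 's/org\.apache\.hadoop\.metrics\.jvm\.EventCounter/org.apache.hadoop.log.metrics.EventCounter/' /tmp/hive-log4j.demo
cat /tmp/hive-log4j.demo
```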
22. Configure a Hadoop 2.x cluster
1. wget http://labs.mop.com/apache-mirror/hadoop/common/hadoop-2.0.2-alpha/hadoop-2.0.2-alpha.tar.gz
2. tar -xzvf hadoop-2.0.2-alpha.tar.gz
3. cd hadoop-2.0.2-alpha
4. vi etc/hadoop/hadoop-env.sh
Append:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
export HADOOP_HEAPSIZE=2000
5. vi etc/hadoop/core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://devcluster01:9000</value>
<description>
Where to find the Hadoop Filesystem through the network.
Note 9000 is not the default port.
(This is slightly changed from previous versions, which didn't have "hdfs")
</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>The size of buffer for use in sequence files.
The size of this buffer should probably be a multiple of hardware
page size (4096 on Intel x86), and it determines how much data is
buffered during read and write operations.</description>
</property>
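As the description notes, io.file.buffer.size should be a multiple of the hardware page size; the value above is exactly 32 pages of 4096 bytes:

```shell
# 131072 bytes is a whole number of 4096-byte x86 pages
expr 131072 / 4096   # prints 32
```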
6. vi etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.reduce.input.buffer.percent</name>
<value>1</value>
<description>The percentage of memory, relative to the maximum heap size, to
retain map outputs during the reduce. When the shuffle is concluded, any
remaining map outputs in memory must consume less than this threshold before
the reduce can begin.
</description>
</property>
<property>
<name>mapred.job.shuffle.input.buffer.percent</name>
<value>1</value>
<description>The percentage of memory to be allocated from the maximum heap
size to storing map outputs during the shuffle.
</description>
</property>
<property>
<name>mapred.inmem.merge.threshold</name>
<value>0</value>
<description>The threshold, in terms of the number of files,
for the in-memory merge process. When we accumulate this number of files
we initiate the in-memory merge and spill to disk. A value of 0 or less
indicates that we don't want any threshold and instead depend only on
the ramfs's memory consumption to trigger the merge.
</description>
</property>
<property>
<name>io.sort.factor</name>
<value>100</value>
<description>The number of streams to merge at once while sorting
files. This determines the number of open file handles.</description>
</property>
<property>
<name>io.sort.mb</name>
<value>240</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the map outputs are compressed, how should they be
compressed?
</description>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to be compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2000m</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
<description>Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>5</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>15</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>5</value>
<description>
Define the number of map tasks based on the number of slave hosts; the best value is the number of slave hosts times the number of cores per host.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>15</value>
<description>
Define the number of reduce tasks based on the number of slave hosts; the best value is the number of slave hosts times the number of cores per host.
</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/ysc/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/ysc/mapreduce/local</value>
</property>
<property>
<name>mapreduce.job.counters.max</name>
<value>12000</value>
<description>Limit on the number of counters allowed per job.
</description>
</property>
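The task counts above line up if the cluster has 3 slave nodes (an assumption; the node count is not stated in this section) with 5 task slots each, so 15 tasks fill every map and every reduce slot:

```shell
# 3 slave nodes (assumed) x 5 slots per tasktracker = 15 concurrent tasks
expr 3 \* 5   # prints 15
```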
7. vi etc/hadoop/yarn-site.xml
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>devcluster01:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>devcluster01:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>devcluster01:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>devcluster01:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>devcluster01:8088</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name> <value>/home/ysc/h2/data/1/yarn/local,/home/ysc/h2/data/2/yarn/local,/home/ysc/h2/data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name> <value>/home/ysc/h2/data/1/yarn/logs,/home/ysc/h2/data/2/yarn/logs,/home/ysc/h2/data/3/yarn/logs</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/home/ysc/h2/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>devcluster01:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>devcluster01:19888</value>
</property>
8. vi etc/hadoop/hdfs-site.xml
<property>
<name>dfs.permissions.superusergroup</name>
<value>root</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/ysc/dfs/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/ysc/dfs/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.block.size</name>
<value>6710886400</value>
<description>The default block size for new files.</description>
</property>
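Note that the dfs.block.size value above is unusually large; converting it to megabytes shows 6400 MB, whereas the stock default of this Hadoop era (67108864) is 64 MB, so double-check this is really intended:

```shell
# dfs.block.size from the config above, converted to MB (1048576 bytes per MB)
expr 6710886400 / 1048576   # prints 6400
```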
9. Start hadoop
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
10. Visit the management pages
http://devcluster01:8088
http://devcluster01:50070
Comments
#4
yangshangchuan
2014-02-25
NikolaiBalance wrote:
Hi, I followed your steps in section 2 (nutch1.5.1) and hit an error at step 8 (development and debugging). My environment is Eclipse on Windows 7.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator1042821926\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
May I ask what is causing this?
See my post on this: http://yangshangchuan.iteye.com/blog/1839784
#3
NikolaiBalance
2014-02-25
Hi, I followed your steps in section 2 (nutch1.5.1) and hit an error at step 8 (development and debugging). My environment is Eclipse on Windows 7.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator1042821926\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
May I ask what is causing this?
#2
yangshangchuan
2014-02-19
liujze wrote:
Your article is very detailed. I have a question: what are the differences between nutch1 and nutch2?
The biggest difference between nutch1 and nutch2 is the storage layer: nutch1 uses a file system (mainly HDFS), while nutch2 uses a variety of databases (mainly HBASE).
#1
liujze
2014-02-17
Your article is very detailed. I have a question: what are the differences between nutch1 and nutch2?