Versions: Hadoop 2.2.0, Mahout 0.9.
This post tests Mahout's org.apache.mahout.cf.taste.hadoop.item.RecommenderJob on Hadoop 2.
First, a caveat: if you download the official Hadoop 2.2.0 and Mahout 0.9 releases and invoke a Mahout algorithm, it fails. The typical error looks like this:
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
	at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
	at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
	at org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob.run(PreparePreferenceMatrixJob.java:73)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
This happens because the stock Mahout 0.9 release only supports Hadoop 1: in Hadoop 1, org.apache.hadoop.mapreduce.JobContext was a class, but in Hadoop 2 it became an interface, so Mahout jars compiled against Hadoop 1 fail at runtime with the IncompatibleClassChangeError above. The fix is described at https://issues.apache.org/jira/browse/MAHOUT-1329; essentially, you modify Mahout's pom files to change the Hadoop dependency.
You can download the patched source package (http://download.csdn.net/detail/fansy1990/7165957) and build Mahout yourself with: mvn clean install -Dhadoop2 -Dhadoop.2.version=2.2.0 -DskipTests. Alternatively, download the pre-built jars directly (http://download.csdn.net/detail/fansy1990/7166017 and http://download.csdn.net/detail/fansy1990/7166055).
Next, set up the Eclipse environment following this article: http://blog.csdn.net/fansy1990/article/details/22896249. Once the environment is ready, add the Mahout jars downloaded above to the Java project.
Then write the following Java code. The first class centralizes the Hadoop 2/YARN configuration; the second is a JUnit test that drives RecommenderJob:
package fz.hadoop2.util;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class Hadoop2Util {

	private static Configuration conf = null;

	private static final String YARN_RESOURCE = "node31:8032";
	private static final String DEFAULT_FS = "hdfs://node31:9000";

	// Lazily builds a YARN-aware Configuration pointing at the cluster.
	public static Configuration getConf() {
		if (conf == null) {
			conf = new YarnConfiguration();
			conf.set("fs.defaultFS", DEFAULT_FS);
			conf.set("mapreduce.framework.name", "yarn");
			conf.set("yarn.resourcemanager.address", YARN_RESOURCE);
		}
		return conf;
	}
}
package fz.mahout.recommendations;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import fz.hadoop2.util.Hadoop2Util;

/**
 * Tests mahout org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
 * environment:
 *   mahout0.9
 *   hadoop2.2
 * @author fansy
 */
public class RecommenderJobTest {

	Configuration conf = null;

	@Before
	public void setUp() {
		conf = Hadoop2Util.getConf();
		System.out.println("Begin to test...");
	}

	@Test
	public void testMain() throws Exception {
		String[] args = {
			"-i", "hdfs://node31:9000/input/user.csv",     // input preferences
			"-o", "hdfs://node31:9000/output/rec001",      // recommendation output
			"-n", "3",                                     // recommendations per user
			"-b", "false",                                 // not boolean data: use the ratings
			"-s", "SIMILARITY_EUCLIDEAN_DISTANCE",         // item-item similarity measure
			"--maxPrefsPerUser", "7",
			"--minPrefsPerUser", "2",
			"--maxPrefsInItemSimilarity", "7",
			"--outputPathForSimilarityMatrix", "hdfs://node31:9000/output/matrix/rec001",
			"--tempDir", "hdfs://node31:9000/output/temp/rec001"
		};
		ToolRunner.run(conf, new RecommenderJob(), args);
	}

	@After
	public void cleanUp() {
	}
}
After downloading the Mahout jars, you also need to copy them into Hadoop 2's lib directory (e.g. share/hadoop/mapreduce/lib; it does not have to be this exact path, any of Hadoop's lib directories on the classpath will do). Then run RecommenderJobTest.
The input file (each line is a userID,itemID,preference triple):
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
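To make the input format concrete, here is a standalone sketch (plain Java, not the Mahout API; the class name UserCsvSketch and the 1/(1+d) similarity form are illustrative assumptions) that parses the triples above and computes a Euclidean-style similarity between items 101 and 102 from their co-rating users:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Standalone sketch (plain Java, not the Mahout API) of how RecommenderJob
// reads the input: each line is a userID,itemID,preference triple.
public class UserCsvSketch {

    // The same 21 triples as user.csv above.
    static final String CSV =
          "1,101,5.0\n1,102,3.0\n1,103,2.5\n"
        + "2,101,2.0\n2,102,2.5\n2,103,5.0\n2,104,2.0\n"
        + "3,101,2.5\n3,104,4.0\n3,105,4.5\n3,107,5.0\n"
        + "4,101,5.0\n4,103,3.0\n4,104,4.5\n4,106,4.0\n"
        + "5,101,4.0\n5,102,3.0\n5,103,2.0\n5,104,4.0\n5,105,3.5\n5,106,4.0";

    // userID -> (itemID -> preference)
    static Map<Long, Map<Long, Float>> parse(String csv) {
        Map<Long, Map<Long, Float>> prefs = new HashMap<>();
        for (String line : csv.split("\n")) {
            String[] f = line.split(",");
            prefs.computeIfAbsent(Long.valueOf(f[0]), u -> new HashMap<>())
                 .put(Long.valueOf(f[1]), Float.valueOf(f[2]));
        }
        return prefs;
    }

    // Illustrative Euclidean-style item-item similarity over co-rating users,
    // using the common 1/(1+d) form. Mahout's SIMILARITY_EUCLIDEAN_DISTANCE is
    // the same idea, but its exact formula differs in details.
    static double euclideanSimilarity(Map<Long, Map<Long, Float>> prefs,
                                      long itemA, long itemB) {
        double sumSq = 0.0;
        for (Map<Long, Float> user : prefs.values()) {
            Float a = user.get(itemA);
            Float b = user.get(itemB);
            if (a != null && b != null) {
                sumSq += (a - b) * (a - b);
            }
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Float>> prefs = parse(CSV);
        Set<Long> items = new HashSet<>();
        int n = 0;
        for (Map<Long, Float> p : prefs.values()) {
            items.addAll(p.keySet());
            n += p.size();
        }
        // prints: 5 users, 7 items, 21 preferences
        System.out.println(prefs.size() + " users, " + items.size()
                + " items, " + n + " preferences");
        System.out.printf("sim(101,102) = %.3f%n",
                euclideanSimilarity(prefs, 101L, 102L));
    }
}
```

Note that the preference counts (5 users, 21 preferences) line up with the "Reduce input groups=5" and "Reduce input records=21" counters in the job log below.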
The output file:
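The output itself is not reproduced here, but RecommenderJob writes plain text, one line per user, in the form userID&lt;TAB&gt;[itemID:score,itemID:score,...]. A small parsing sketch under that assumption (the sample line is hypothetical, not actual output from this run):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Parses one line of RecommenderJob's text output: userID<TAB>[item:score,...].
// The output format is assumed from Mahout's text output; verify against your run.
public class RecOutputParser {

    static Map<Long, Float> parseLine(String line) {
        String[] parts = line.split("\t");
        String list = parts[1].substring(1, parts[1].length() - 1); // strip [ and ]
        Map<Long, Float> recs = new LinkedHashMap<>();
        for (String pair : list.split(",")) {
            String[] kv = pair.split(":");
            recs.put(Long.parseLong(kv[0]), Float.parseFloat(kv[1]));
        }
        return recs;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; real item IDs and scores depend on the job run.
        Map<Long, Float> recs = parseLine("1\t[104:3.5,106:3.3]");
        System.out.println(recs); // prints: {104=3.5, 106=3.3}
    }
}
```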
The log of the final MapReduce job:
2014-04-09 13:03:09,301 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
2014-04-09 13:03:09,301 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts
2014-04-09 13:03:09,302 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
2014-04-09 13:03:09,302 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
2014-04-09 13:03:09,317 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(56)) - Connecting to ResourceManager at node31/192.168.0.31:8032
2014-04-09 13:03:09,460 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 1
2014-04-09 13:03:09,515 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(394)) - number of splits:1
2014-04-09 13:03:09,531 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-04-09 13:03:09,547 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(477)) - Submitting tokens for job: job_1396479318893_0015
2014-04-09 13:03:09,602 INFO [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(174)) - Submitted application application_1396479318893_0015 to ResourceManager at node31/192.168.0.31:8032
2014-04-09 13:03:09,604 INFO [main] mapreduce.Job (Job.java:submit(1272)) - The url to track the job: http://node31:8088/proxy/application_1396479318893_0015/
2014-04-09 13:03:09,604 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1317)) - Running job: job_1396479318893_0015
2014-04-09 13:03:24,170 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1338)) - Job job_1396479318893_0015 running in uber mode : false
2014-04-09 13:03:24,170 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) - map 0% reduce 0%
2014-04-09 13:03:32,299 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) - map 100% reduce 0%
2014-04-09 13:03:41,373 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) - map 100% reduce 100%
2014-04-09 13:03:42,404 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1356)) - Job job_1396479318893_0015 completed successfully
2014-04-09 13:03:42,485 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1363)) - Counters: 43
	File System Counters
		FILE: Number of bytes read=306
		FILE: Number of bytes written=163713
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=890
		HDFS: Number of bytes written=192
		HDFS: Number of read operations=10
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=5798
		Total time spent by all reduces in occupied slots (ms)=6179
	Map-Reduce Framework
		Map input records=7
		Map output records=21
		Map output bytes=927
		Map output materialized bytes=298
		Input split bytes=131
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=298
		Reduce input records=21
		Reduce output records=5
		Spilled Records=42
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=112
		CPU time spent (ms)=1560
		Physical memory (bytes) snapshot=346509312
		Virtual memory (bytes) snapshot=1685782528
		Total committed heap usage (bytes)=152834048
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=572
	File Output Format Counters
		Bytes Written=192
Note: only this one collaborative-filtering job has been tested here. The other algorithms were not tested, so it is still possible that some of them have problems on this version pairing.