//http://distributed-agility.blogspot.com/2010/01/hadoop-0201-example-inverted-line-index.html
//https://portal.futuregrid.org/manual/hadoop-wordcount
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/**
* LineIndexer Creates an inverted index over all the words in a document corpus, mapping each observed word to a list
* of filename@offset locations where it occurs.
*/
public class LineIndexer extends Configured implements Tool {

    // where to put the data in hdfs when we're done
    private static final String OUTPUT_PATH = "output";

    // where to read the data from.
    private static final String INPUT_PATH = "input";

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new LineIndexer(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "Line Indexer 1");
        job.setJarByClass(LineIndexer.class);
        job.setMapperClass(LineIndexMapper.class);
        job.setReducerClass(LineIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
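The driver above references LineIndexMapper and LineIndexReducer, which belong to the exercise but are not reproduced in this post. For context, here is a minimal sketch of what they could look like, based only on the javadoc description (each word mapped to its filename@offset locations) and the Text/Text output types set in run(); the bodies are my own assumption, not the original exercise code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a mapper: for every word on a line, emit (word, "filename@offset").
// With the default TextInputFormat, the map input key is the byte offset of the line.
public class LineIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text location = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The file name comes from the input split that produced this record.
        FileSplit split = (FileSplit) context.getInputSplit();
        location.set(split.getPath().getName() + "@" + key.get());
        for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
                word.set(token);
                context.write(word, location);
            }
        }
    }
}

// Sketch of a reducer: concatenate all filename@offset locations seen for a word.
public class LineIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder locations = new StringBuilder();
        boolean first = true;
        for (Text value : values) {
            if (!first) {
                locations.append(", ");
            }
            locations.append(value.toString());
            first = false;
        }
        context.write(key, new Text(locations.toString()));
    }
}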
After updating the code, make sure to generate a new jar, remove anything under the "output" directory in HDFS (since the program does not clean it up; see the command after the build output below), and execute the new version.
training@training-vm:~/git/exercises/shakespeare$ ant jar
Buildfile: build.xml
compile:
[javac] Compiling 4 source files to /home/training/git/exercises/shakespeare/bin
jar:
[jar] Building jar: /home/training/git/exercises/shakespeare/indexer.jar
BUILD SUCCESSFUL
Total time: 1 second
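Clearing the old results in HDFS before re-running can be done with a command like the following (the path assumes the OUTPUT_PATH constant used above):

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -rmr output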
I have added two ASCII books to the input directory: the works of Leonardo da Vinci and the first volume of "The Outline of Science".
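Assuming the plain-text files were first downloaded to the local filesystem, they can be copied into the HDFS input directory with something like:

training@training-vm:~/git/exercises/shakespeare$ hadoop fs -put leornardo-davinci-all.txt the-outline-of-science-vol1.txt input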
training@training-vm:~/git/exercises/shakespeare$ hadoop fs -ls input
Found 3 items
-rw-r--r-- 1 training supergroup 5342761 2009-12-30 11:57 /user/training/input/all-shakespeare
-rw-r--r-- 1 training supergroup 1427769 2010-01-04 17:42 /user/training/input/leornardo-davinci-all.txt
-rw-r--r-- 1 training supergroup 674762 2010-01-04 17:42 /user/training/input/the-outline-of-science-vol1.txt
The execution and output of running this example are shown below.
training@training-vm:~/git/exercises/shakespeare$ hadoop jar indexer.jar index.LineIndexer
10/01/04 21:11:55 INFO input.FileInputFormat: Total input paths to process : 3
10/01/04 21:11:56 INFO mapred.JobClient: Running job: job_200912301017_0017
10/01/04 21:11:57 INFO mapred.JobClient: map 0% reduce 0%