
Notes on Apress Pro Hadoop (in progress...)


The electronic edition can be downloaded from http://caibinbupt.iteye.com/blog/418846


Getting started with Hadoop Core

    Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer fit on
a single cost-effective computer.

 

    A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as what is supported
by the fastest machines available, and usually the only limiting factor is your budget.


   An alternative solution is to build a high-availability cluster.

 

 

MapReduce Model:

• Map: An initial ingestion and transformation step, in which individual input records can be processed in parallel.
• Reduce: An aggregation or summarization step, in which all associated records must be processed together by a single entity.
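
As a rough illustration of the two phases, here is a minimal single-machine sketch in plain Java. It does not use Hadoop at all; the class name `MapReduceModelSketch` and the word-count task are assumptions chosen for these notes, not something taken from the book.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of the map and reduce phases (no Hadoop involved). */
final class MapReduceModelSketch {

    /** Map: turn one input record (a line) into zero or more (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Reduce: summarize all values that share one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map on every line, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> wordCount(List<String> lines) {
        // Stand-in for the shuffle: group map output by key, keys kept sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

Each `map` call sees a single record, so the calls could run in parallel; each `reduce` call needs every value for its key, which is why the grouping step sits between the two phases.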

 

 

 

 

The example MapReduce application is a specialized web crawler that received large sets of URLs as input. The job had several steps:

   1. Ingest the URLs.

   2. Normalize the URLs.

   3. Eliminate duplicate URLs.

   4. Filter all URLs.

   5. Fetch the URLs.

   6. Fingerprint the content items.

   7. Update the recently seen set.

   8. Prepare the work list for the next application.
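
The normalize and de-duplicate steps of the crawler job can be sketched on a single machine as follows. These notes do not record which normalization rules the crawler used, so the rules here (lower-case scheme and host, resolve dot segments, drop the fragment) are assumptions, and `UrlDedupSketch` is a hypothetical name. The sketch also assumes absolute URLs that have a host part.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Single-machine sketch of URL normalization and de-duplication (assumed rules). */
final class UrlDedupSketch {

    /** Assumes an absolute URL with a host; user info is ignored in this sketch. */
    static String normalize(String url) {
        // URI.normalize() resolves "." and ".." path segments.
        URI u = URI.create(url.trim()).normalize();
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        // Treat an empty path as "/", and leave out the fragment.
        sb.append(u.getRawPath().isEmpty() ? "/" : u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();
    }

    /** Normalizes every URL and keeps the first occurrence of each result. */
    static List<String> normalizeAndDedupe(List<String> urls) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            seen.add(normalize(url));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(normalizeAndDedupe(Arrays.asList(
                "HTTP://Example.COM/a/../b#frag",
                "http://example.com/b",
                "http://example.com")));
    }
}
```

In a MapReduce job the same idea would more likely be split across the phases: a map step emits the normalized URL as the key, and the grouping by key before the reduce step collapses the duplicates.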

 

The Hadoop-based application ran faster and worked well.

 

 


 Introducing Hadoop

    Hadoop is a top-level Apache project that provides and supports the development of open source software supplying a framework for developing highly scalable distributed computing applications.

    The two fundamental pieces of Hadoop are the MapReduce framework and the Hadoop Distributed File System (HDFS).

     The MapReduce framework requires a shared file system such as HDFS, S3, NFS, or GFS, but HDFS is the best suited.

 

Introducing MapReduce

 

    A MapReduce job requires the following:

    1. The locations in the distributed file system of the input.

    2. The location in the distributed file system for the output.

    3. The input format.

    4. The output format.

    5. The class containing the map function.

    6. Optionally, the class containing the reduce function.

    7. The JAR file(s) containing the above classes.

 

If a job does not need a reduce function, no reducer class has to be specified. The framework partitions the input, then schedules and executes map tasks across the cluster. If requested, it sorts the results of the map tasks and runs the reduce tasks with the map output. The final output is moved to the output directory, and the job status is reported to the user.

 

Managing MapReduce:

   Two processes manage jobs:

    TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.

    JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

Note: one nice feature is that you can add TaskTracker nodes to the cluster while a job is running and have the job spread to the new nodes.

 

 Introducing HDFS

 

 

 HDFS is designed for MapReduce jobs that read input in large chunks and write large chunks of output. For reliability, file data is mirrored to multiple storage nodes; this is referred to as replication in Hadoop.

 

Installing Hadoop

    The prerequisites:

    1. Fedora 8

    2. JDK 1.6

    3. Hadoop 0.19 or later

 Go to the Hadoop download site at http://www.apache.org/dyn/closer.cgi/hadoop/core/. Find the .gz file, download it, and unpack it with tar. Then run export HADOOP_HOME=[yourdirectory] and export PATH=${HADOOP_HOME}/bin:${PATH}.

    Finally, check that everything is installed correctly.

Running examples and tests

      This part demonstrates all the examples... :)

 

 Chapter 2: The basics of a MapReduce job

the chapter

 

 

 

 

 The user is responsible for handling the job setup, specifying the input locations, specifying ...

 

Here is a simple example:

package com.apress.hadoopbook.examples.ch2;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.log4j.Logger;

/** A very simple MapReduce example that reads textual input where
 * each record is a single line, and sorts all of the input lines into
 * a single output file.
 *
 * The records are parsed into Key and Value using the first TAB
 * character as a separator. If there is no TAB character the entire
 * line is the Key.
 *
 * @author Jason Venner
 */
public class MapReduceIntro {
    protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

    /**
     * Configure and run the MapReduceIntro job.
     *
     * @param args
     *            Not used.
     */
    public static void main(final String[] args) {
        try {
            /** Construct the job conf object that will be used to submit this job
             * to the Hadoop framework. Ensure that the jar or directory that
             * contains MapReduceIntroConfig.class is made available to all of the
             * TaskTracker nodes that will run maps or reduces for this job.
             */
            final JobConf conf = new JobConf(MapReduceIntro.class);

            /**
             * Take care of some housekeeping to ensure that this simple example
             * job will run.
             */
            MapReduceIntroConfig.exampleHouseKeeping(conf,
                    MapReduceIntroConfig.getInputDirectory(),
                    MapReduceIntroConfig.getOutputDirectory());

            /**
             * This section is the actual job configuration portion.
             *
             * Configure the inputDirectory and the type of input. In this case
             * we are stating that the input is text, and each record is a
             * single line, and the first TAB is the separator between the key
             * and the value of the record.
             */
            conf.setInputFormat(KeyValueTextInputFormat.class);
            FileInputFormat.setInputPaths(conf,
                    MapReduceIntroConfig.getInputDirectory());

            /** Inform the framework that the mapper class will be the
             * {@link IdentityMapper}. This class simply passes the
             * input Key Value pairs directly to its output, which in
             * our case will be the shuffle.
             */
            conf.setMapperClass(IdentityMapper.class);

            /** Configure the output of the job to go to the output
             * directory. Inform the framework that the Output Key
             * and Value classes will be {@link Text} and the output
             * file format will be {@link TextOutputFormat}. The
             * TextOutputFormat class produces a record of
             * output for each Key,Value pair, with the following
             * format. Formatter.format( "%s\t%s%n", key.toString(),
             * value.toString() );.
             *
             * In addition indicate to the framework that there will be
             * 1 reduce. This results in all input keys being placed
             * into the same, single, partition, and the final output
             * being a single sorted file.
             */
            FileOutputFormat.setOutputPath(conf,
                    MapReduceIntroConfig.getOutputDirectory());
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);

            /** Inform the framework that the reducer class will be the {@link
             * IdentityReducer}. This class simply writes an output key,
             * value record for each value in the key, value set it receives as
             * input. The value ordering is arbitrary.
             */
            conf.setReducerClass(IdentityReducer.class);
            logger.info("Launching the job.");

            /** Send the job configuration to the framework and request that the
             * job be run.
             */
            final RunningJob job = JobClient.runJob(conf);
            logger.info("The job has completed.");
            if (!job.isSuccessful()) {
                logger.error("The job failed.");
                System.exit(1);
            }
            logger.info("The job completed successfully.");
            System.exit(0);
        } catch (final IOException e) {
            logger.error("The job has failed due to an IO Exception", e);
            e.printStackTrace();
        }
    }
}

 

IdentityMapper:

The framework makes one call to your map function for each record in your input.

IdentityReducer:

The framework calls the reduce function once for each unique key.
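
To make the two call counts concrete, here is a small plain-Java sketch that mimics an identity map and an identity reduce and counts the calls. It is only a model under simple assumptions (records as `String[]{key, value}` pairs, a `TreeMap` standing in for the shuffle); `CallCountSketch` is a hypothetical name and nothing here is Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts how often map and reduce would be called for a set of records. */
final class CallCountSketch {

    /** Returns {number of map calls, number of reduce calls}. */
    static int[] countCalls(List<String[]> records) {
        int mapCalls = 0;
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String[] record : records) {
            // Identity map: one call per input record, pair passed on unchanged.
            mapCalls++;
            shuffle.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        int reduceCalls = 0;
        for (String key : shuffle.keySet()) {
            // Identity reduce: one call per unique key, covering all of its values.
            reduceCalls++;
        }
        return new int[] {mapCalls, reduceCalls};
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"b", "1"});
        records.add(new String[] {"a", "2"});
        records.add(new String[] {"b", "3"});
        int[] calls = countCalls(records);
        System.out.println(calls[0] + " map calls, " + calls[1] + " reduce calls");
    }
}
```

With three records and two distinct keys, the sketch reports three map calls and two reduce calls.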

 

If you require the output of your job to be sorted, the reducer function must pass the key
objects to the output.collect() method unchanged. The reduce phase is, however, free to
output any number of records, including zero records, with the same key and different values.
This particular constraint is also why the map tasks may be multithreaded, while the reduce
tasks are explicitly only single-threaded.

 

Specifying the input formats:

 

    KeyValueTextInputFormat, TextInputFormat, NLineInputFormat, MultiFileInputFormat, SequenceFileInputFormat

 

KeyValueTextInputFormat and SequenceFileInputFormat are the most commonly used input formats.
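
The way KeyValueTextInputFormat splits a line, as described in the comments of the example job above (first TAB separates key and value, no TAB means the whole line is the key), can be sketched in plain Java. This is an illustration, not the Hadoop implementation; `KeyValueLineSketch` is a hypothetical name, and returning an empty value when there is no TAB is my understanding of the behavior rather than something stated in these notes.

```java
import java.util.Arrays;

/** Sketch of splitting one text line into a key and a value at the first TAB. */
final class KeyValueLineSketch {

    /** Returns {key, value}; with no TAB the whole line is the key (value assumed empty). */
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};
        }
        // Only the first TAB separates; later TABs stay inside the value.
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("42\thello world")));
        System.out.println(Arrays.toString(split("no separator here")));
    }
}
```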

 

Setting the output format:

 

Configuring the reduce phase:

        Five pieces of information are required:

    The number of reduce tasks;

    The class supplying the reduce method;

    The input key and value types for the reduce task;

    The output key and value types for the reduce task;

    The output file type for the reduce task output.

 

 

Creating a custom mapper and reducer

    As you have seen, your first Hadoop job produced sorted output, but the sort order was not suitable. Let's work out what is required to sort correctly, using a custom mapper.

 

Creating a custom mapper:

You must change your configuration and provide a custom class. This is done with two calls on the JobConf object:

    conf.setOutputKeyClass(xxx.class): informs the framework of the output key type;

    conf.setMapperClass(TransformKeysToLongMapper.class)

 

The mapper class declaration must also state the key and value types:

 

/** Transform the input Text, Text key value
* pairs into LongWritable, Text key/value pairs.
*/
public class TransformKeysToLongMapper
extends MapReduceBase implements Mapper<Text, Text, LongWritable, Text> {

...

}
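
Why convert the keys at all? Presumably because text keys sort character by character, so numeric keys come out in the wrong order. The plain-Java sketch below shows the difference using `String` and `Long` as stand-ins for Hadoop's `Text` and `LongWritable`; `KeySortSketch` is a hypothetical name, and the sketch assumes every key parses as a long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Contrasts sorting numeric keys as text with sorting them as longs. */
final class KeySortSketch {

    /** Sorts the keys as strings, the way text keys are compared. */
    static List<String> sortAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    /** Parses each key to a long first, then sorts numerically. */
    static List<Long> sortAsLong(List<String> keys) {
        List<Long> parsed = new ArrayList<>();
        for (String key : keys) {
            parsed.add(Long.parseLong(key.trim()));
        }
        Collections.sort(parsed);
        return parsed;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "100");
        System.out.println(sortAsText(keys)); // [10, 100, 9]
        System.out.println(sortAsLong(keys)); // [9, 10, 100]
    }
}
```

A real mapper would also have to decide what to do with keys that are not numbers; here `Long.parseLong` simply throws a `NumberFormatException`.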

 

Creating a custom reducer:

    After your work with the custom mapper in the preceding sections, creating a custom reducer will seem familiar.

 

So add the following single line:

conf.setReducerClass(MergeValuesToCSVReducer.class);

 

public class MergeValuesToCSVReducer<K, V>
extends MapReduceBase implements Reducer<K, V, K, Text> {

...

}
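
The body of the reducer is elided above. Judging only by its name, its core is to join all values of one key into a single comma-separated string. A minimal plain-Java sketch of that step follows; `CsvMergeSketch` is a hypothetical name, the values are plain strings rather than Hadoop types, and no CSV quoting or escaping is attempted.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.StringJoiner;

/** Sketch of merging all values for one key into a comma-separated string. */
final class CsvMergeSketch {

    /** Joins the values with commas; an empty iterator gives an empty string. */
    static String mergeToCsv(Iterator<String> values) {
        StringJoiner joiner = new StringJoiner(",");
        while (values.hasNext()) {
            joiner.add(values.next());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeToCsv(Arrays.asList("a", "b", "c").iterator()));
    }
}
```

The iterator parameter mirrors the fact that a reduce call receives the values of a key as a stream to walk through once, not as a collection it can index.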

 

Why do the mapper and reducer extend MapReduceBase?

 

The class provides basic implementations of two additional methods that the framework requires of a mapper or reducer:

 

/** Default implementation that does nothing. */
public void close() throws IOException {
}
/** Default implementation that does nothing. */
public void configure(JobConf job) {
}

 

The configure method is the way to get access to the JobConf object.

The close method is the place to release resources or do other cleanup.

 

The makeup of a cluster

   In the context of Hadoop, a node/machine running the TaskTracker or DataNode server is considered a slave node. It is common to have nodes that run both the TaskTracker and
DataNode servers. The Hadoop server processes on the slave nodes are controlled by their respective masters, the JobTracker and NameNode servers.

 

 
