Memory Usage
By default, Hadoop allocates 1000 MB (1 GB) of memory to each daemon (NameNode, JobTracker, DataNode, SecondaryNameNode, TaskTracker) it runs. This is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. In addition, the tasktracker launches separate child JVMs to run map and reduce tasks in, so these need to be factored into the total memory footprint of a worker machine.
The maximum number of map tasks that can run on a tasktracker at one time is controlled by the mapred.tasktracker.map.tasks.maximum property, which defaults to 2 tasks. There is a corresponding property for reduce tasks, mapred.tasktracker.reduce.tasks.maximum, which also defaults to 2 tasks. The tasktracker is then said to have 2 map slots and 2 reduce slots.
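These slot counts live in the tasktracker's mapred-site.xml. Here is a minimal sketch of what raising both counts might look like; the value of 4 is purely illustrative, not a recommendation:

```xml
<!-- mapred-site.xml on a tasktracker node (MapReduce 1); illustrative values only -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- 4 map slots instead of the default 2 -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value> <!-- 4 reduce slots instead of the default 2 -->
  </property>
</configuration>
```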
MapReduce 1 has long been criticised for its static, slot-based memory model. Rather than using a true resource management system, a MapReduce cluster is divided into a fixed number of map and reduce slots based on a static configuration, so slots are wasted whenever the cluster workload does not fit that configuration. Furthermore, the slot-based model makes it hard for non-MapReduce applications to be scheduled appropriately. This problem has been addressed by YARN, as well as by third-party systems such as Facebook's Corona.
Returning to MapReduce 1: the memory given to each JVM running a task can be changed by setting the mapred.child.java.opts property. The default setting is -Xmx200m, which gives each task 200 MB of memory. Avoid marking this setting as final, so that users can supply extra JVM options here, for example to enable verbose GC logging when debugging garbage collection. With the defaults, a worker machine therefore uses 2800 MB of memory. The table below illustrates the typical memory usage of classic (MapReduce 1) Hadoop.
| JVM | Default memory used (MB) | Memory used for 8 processors, 400 MB per child (MB) |
|---|---|---|
| DataNode | 1000 | 1000 |
| TaskTracker | 1000 | 1000 |
| TaskTracker child map task | 2 × 200 | 7 × 400 |
| TaskTracker child reduce task | 2 × 200 | 7 × 400 |
| Total | 2800 | 7600 |
Why are there 7 map tasks and 7 reduce tasks when the machine has only 8 processors? The reason is that MapReduce jobs are normally I/O-bound, so it makes sense to have more tasks than processors to get better CPU utilization. The amount of oversubscription depends on the CPU utilization of the jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks than processors, counting both map and reduce tasks.
In the table above, we had 8 processors, so we set both mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to 7 (not 8, because the datanode and tasktracker daemons each need roughly one processor to themselves). If we also increase the memory available to each child task to 400 MB, the total memory usage comes to 7600 MB.
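Putting that 8-processor example into configuration form, a sketch of the relevant mapred-site.xml entries might look like the following. The GC-logging flags are an optional illustration of why it is useful to leave mapred.child.java.opts non-final:

```xml
<!-- mapred-site.xml: illustrative settings for the 8-processor worker example -->
<configuration>
  <!-- 7 map slots and 7 reduce slots, leaving headroom for the datanode and tasktracker daemons -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
  </property>
  <!-- 400 MB per child JVM; the verbose GC flags are optional extras for debugging -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m -verbose:gc -XX:+PrintGCDetails</value>
  </property>
</configuration>
```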
Whether this Java memory allocation fits into 8 GB of physical memory depends on the other processes running on the machine. If you are running Streaming or Pipes programs, this allocation is probably inappropriate, since it doesn't leave enough memory for the users' Streaming or Pipes processes to run. This is something to avoid, because it leads to processes being swapped out, which causes severe performance degradation.
Task memory limits
On a shared cluster, it shouldn't be possible for one user's errant MapReduce program (one with a memory leak, for example) to bring down nodes in the cluster. Hadoop provides several ways to address this problem.
- The naive way is to lock down mapred.child.java.opts by marking it as final, preventing users from requesting too much memory for their tasks. This is often inappropriate: there are legitimate reasons to allow some jobs to use more memory, or to pass extra JVM options (to profile a MapReduce program, for example), so it is not always an acceptable solution. Furthermore, even locking down mapred.child.java.opts doesn't solve the problem, because tasks can spawn new processes whose memory usage is not constrained; Streaming and Pipes jobs do exactly that.
- Impose limits via the Linux ulimit mechanism, either at the operating-system level (in limits.conf, typically found in /etc/security) or by setting mapred.child.ulimit in the Hadoop configuration. The value is specified in kilobytes and should be comfortably larger than the JVM memory set by mapred.child.java.opts; otherwise, the child JVM might not start. A configuration sketch covering this and the previous approach follows the list.
- Use Hadoop's task memory monitoring feature. The idea is that an administrator sets a range of allowed virtual memory limits for tasks on the cluster, and users specify the maximum memory requirements for their jobs in the job configuration. If a user doesn't specify memory requirements for a job, the defaults are used (mapred.job.map.memory.mb and mapred.job.reduce.memory.mb, both of which default to -1).
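As a rough sketch of the first two approaches above, the following mapred-site.xml fragment marks the child JVM options as final so jobs cannot override them, and sets an OS-level ulimit for child processes. The specific values are assumptions for illustration, not recommendations:

```xml
<!-- mapred-site.xml: illustrative only -->
<configuration>
  <!-- Approach 1: lock the child JVM options so jobs cannot override them -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
    <final>true</final>
  </property>
  <!-- Approach 2: cap child processes at ~1 GB (1048576 KB), comfortably above the 200 MB heap -->
  <property>
    <name>mapred.child.ulimit</name>
    <value>1048576</value>
  </property>
</configuration>
```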
The task memory monitoring approach has a couple of advantages over the ulimit one.
- It enforces the memory usage of the whole task process tree, including spawned processes.
- It enables memory-aware scheduling, where tasks are scheduled on tasktrackers that have enough free memory to run them.
To enable task memory monitoring, you need to set all six of the properties below. The default values are all -1, which means the feature is disabled. A sample configuration is sketched after the list.
- mapred.cluster.map.memory.mb: The amount of virtual memory that defines a map slot. Map tasks that require more than this amount of memory will use more than one map slot.
- mapred.cluster.reduce.memory.mb: The amount of virtual memory that defines a reduce slot. Reduce tasks that require more than this amount of memory will use more than one reduce slot.
- mapred.job.map.memory.mb: The amount of virtual memory that a map task requires to run. If a map task exceeds this limit, it may be terminated and marked as failed.
- mapred.job.reduce.memory.mb: The amount of virtual memory that a reduce task requires to run. If a reduce task exceeds this limit, it may be terminated and marked as failed.
- mapred.cluster.max.map.memory.mb: The maximum value that a user can set mapred.job.map.memory.mb to.
- mapred.cluster.max.reduce.memory.mb: The maximum value that a user can set mapred.job.reduce.memory.mb to.
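Here is a hedged sample of what the cluster-side configuration might look like, assuming 2 GB slots and a 4 GB per-task ceiling; the values are illustrative, not recommendations. Users would then set mapred.job.map.memory.mb and mapred.job.reduce.memory.mb in their own job configuration, within these limits:

```xml
<!-- mapred-site.xml: illustrative values only -->
<configuration>
  <!-- Virtual memory that defines one map slot and one reduce slot (2 GB each) -->
  <property>
    <name>mapred.cluster.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapred.cluster.reduce.memory.mb</name>
    <value>2048</value>
  </property>
  <!-- Cluster-wide defaults for per-task memory requirements -->
  <property>
    <name>mapred.job.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapred.job.reduce.memory.mb</name>
    <value>2048</value>
  </property>
  <!-- Upper bounds on what a user may request per task (4 GB) -->
  <property>
    <name>mapred.cluster.max.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapred.cluster.max.reduce.memory.mb</name>
    <value>4096</value>
  </property>
</configuration>
```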