
Major Changes in Hadoop 3.x (Compared with Hadoop 2.x)


       Someone asked me today what the major changes in Hadoop 3.x are. I looked them up on the official site (http://hadoop.apache.org/docs/r3.0.0/index.html) and give a brief translated summary below:

  • 1. The minimum required JDK version is raised to 1.8 (Java 8)
  • 2. HDFS supports erasure coding

       Compared with replication, erasure coding is a more space-efficient way of durably storing data. A standard encoding such as Reed-Solomon (10,4) has a 1.4x space overhead, whereas standard 3x HDFS replication has a 3x overhead. Because erasure coding adds overhead during reconstruction and mostly performs remote reads, it has traditionally been used for cold, infrequently accessed data. Users should consider the network and CPU overhead of erasure coding when deploying this feature. For more on HDFS erasure coding, see http://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html.
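
As a rough, hedged sketch of how this is used from the HDFS Java API (the NameNode address, the directory path, and the choice of the built-in RS-10-4-1024k policy below are assumptions for the example, and the policy has to be enabled on the cluster before it can be set):

```java
// Minimal sketch: applying an HDFS erasure coding policy from a Java client.
// "hdfs://namenode:9820" and "/cold-data" are placeholders; the RS-10-4-1024k
// policy must already be enabled on the cluster.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EnableErasureCoding {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9820"), conf);
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        Path coldData = new Path("/cold-data");
        dfs.mkdirs(coldData);

        // Files written under this directory are stored as RS(10,4) stripes:
        // 10 data blocks + 4 parity blocks = 14/10 = 1.4x space overhead,
        // versus 3x for standard replication.
        dfs.setErasureCodingPolicy(coldData, "RS-10-4-1024k");

        System.out.println("Policy on " + coldData + ": "
                + dfs.getErasureCodingPolicy(coldData));
        dfs.close();
    }
}
```
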
  • 3. YARN Timeline Service updated to v.2

        This release introduces an early preview of YARN Timeline Service v.2, which addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and give feedback, with the goal of making it a ready replacement for Timeline Service v.1.x. It should only be used in a test capacity. For more on YARN Timeline Service v.2, see http://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html

  • 4. The shell scripts have been rewritten; for example, all scripts now use hadoop-env.sh as their base script

        The Hadoop shell scripts have been rewritten to fix many long-standing bugs and add some new features. Most changes preserve compatibility, but some may break existing installations. The incompatible changes are documented in HADOOP-9902. See the Unix Shell Guide documentation for more details. Even experienced users are encouraged to read it, since it describes much of the new functionality, particularly the parts related to extensibility.

  • 5. Shaded client jars: new hadoop-client-api and hadoop-client-runtime artifacts, built with the Maven shade plugin, bundle Hadoop's dependencies into shaded jars

        In Hadoop 2.x, the hadoop-client Maven artifact pulls all of Hadoop's transitive dependencies onto the Hadoop application's classpath, which can cause conflicts between the classes the application depends on and the classes Hadoop depends on. This problem is addressed in HADOOP-11804.

  • 6. Support for opportunistic containers and distributed scheduling; for example, an application's containers can be dispatched to a node even when no resources are available at the moment of scheduling

        A new Opportunistic container type has been introduced (the pre-existing container type is now called Guaranteed). Opportunistic containers can use resources on a node that have been allocated but are not actually in use, and they have a lower priority than Guaranteed containers.
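
For illustration only, an ApplicationMaster might request an Opportunistic container roughly as follows; the resource sizes, priority, and the exact ContainerRequest constructor overload used here are assumptions based on the YARN client API in Hadoop 3.x, not something prescribed by the article.

```java
// Sketch: requesting an OPPORTUNISTIC container from an ApplicationMaster.
// The ContainerRequest overload taking an ExecutionTypeRequest is assumed;
// resource sizes and priority are illustrative placeholders.
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class OpportunisticRequest {
    public static void addRequest(AMRMClient<AMRMClient.ContainerRequest> amrmClient) {
        Resource capability = Resource.newInstance(1024 /* MB */, 1 /* vcores */);
        Priority priority = Priority.newInstance(10);

        // GUARANTEED is the default execution type; OPPORTUNISTIC containers
        // may be queued at the NodeManager and are preempted first when
        // Guaranteed containers need the resources.
        AMRMClient.ContainerRequest request = new AMRMClient.ContainerRequest(
                capability, null /* nodes */, null /* racks */, priority,
                0L /* allocationRequestId */, true /* relaxLocality */,
                null /* nodeLabelsExpression */,
                ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true));

        amrmClient.addContainerRequest(request);
    }
}
```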

  • 7. MapReduce task-level native optimization

        MapReduce has added a native implementation of the map output collector. For shuffle-intensive jobs this can improve performance by 30% or more. See MAPREDUCE-2841 for more details.

  • 8. Support for more than two NameNodes

        The initial implementation of HDFS NameNode high availability provided a single active NameNode and a single standby NameNode. By replicating edits to a quorum of three JournalNodes, that architecture can tolerate the failure of any one node in the system. However, some deployments require a higher degree of fault tolerance. This new feature enables that by allowing users to run multiple standby NameNodes. For instance, with three NameNodes and five JournalNodes, the cluster can tolerate the failure of two nodes rather than just one. The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.
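
As a hedged sketch of what the relevant settings look like for a three-NameNode layout (the nameservice ID, host names, and journal quorum below are made up, and in practice these properties live in hdfs-site.xml / core-site.xml rather than being set in code):

```java
// Sketch of the HA-related keys for a three-NameNode, five-JournalNode setup.
// "mycluster", nn1-nn3, and jn1-jn5 are placeholders for real names.
import org.apache.hadoop.conf.Configuration;

public class ThreeNameNodeHaConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        // Three NameNodes instead of the previous limit of two.
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:9820");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:9820");
        conf.set("dfs.namenode.rpc-address.mycluster.nn3", "nn3.example.com:9820");
        // Five JournalNodes let the edit-log quorum tolerate two failures.
        conf.set("dfs.namenode.shared.edits.dir",
                "qjournal://jn1:8485;jn2:8485;jn3:8485;jn4:8485;jn5:8485/mycluster");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```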

  • 9. The default ports of several services have changed

        Previously, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000), which meant a service could fail to start because of a port conflict with another application. These conflict-prone ports have now been moved out of the ephemeral range; the change affects the NameNode, Secondary NameNode, DataNode, and KMS. The official documentation has been updated accordingly; see HDFS-9427 and HADOOP-12811 for details.

       
The port changes are as follows:

Daemon               Port             Old port   New port
NameNode             RPC              8020       9820
NameNode             HTTP web UI      50070      9870
NameNode             HTTPS web UI     50470      9871
Secondary NameNode   HTTP web UI      50090      9868
Secondary NameNode   HTTPS web UI     50091      9869
DataNode             IPC              50020      9867
DataNode             Data transfer    50010      9866
DataNode             HTTP web UI      50075      9864
DataNode             HTTPS web UI     50475      9865

 

  • 10. Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
  • 11. A new intra-DataNode balancer

        A single DataNode manages multiple disks, and during normal write operation the disks fill up evenly. However, adding or replacing disks can lead to significant skew within a DataNode, which the existing HDFS balancer cannot handle (it deals with skew between DataNodes, not within one). The new intra-DataNode balancing functionality handles this case and is invoked via the hdfs diskbalancer CLI. See the HDFS Commands Guide for more information.

  • 12. Reworked daemon and task heap management

        A series of changes have been made to heap management for the Hadoop daemons as well as MapReduce tasks.

HADOOP-10950 introduces new methods for configuring daemon heap sizes; notably, the heap can now be auto-tuned based on the memory size of the host, and HADOOP_HEAPSIZE has been deprecated.
MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes: the desired heap size no longer needs to be specified both in the task configuration and as a Java option. Existing configurations that already specify both are not affected by this change.
  • 13. DynamoDB-backed metadata store (S3Guard) for the S3A filesystem client

        HADOOP-13345 adds an optional feature to the S3A client for Amazon S3 storage: the ability to use a DynamoDB table as a fast, consistent store of file and directory metadata.
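
A rough sketch of enabling this from client code follows; the bucket name, table name, and region are placeholders, the hadoop-aws module is assumed to be on the classpath, and normally these keys would be set in core-site.xml rather than in code.

```java
// Sketch: turning on the DynamoDB metadata store (S3Guard) for an S3A client.
// Bucket name, table name, and region are illustrative placeholders.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3GuardExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.metadatastore.impl",
                "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore");
        conf.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table");
        conf.set("fs.s3a.s3guard.ddb.region", "us-east-1");
        conf.setBoolean("fs.s3a.s3guard.ddb.table.create", true);

        // Listings now consult the DynamoDB table, so newly created entries are
        // visible immediately instead of being subject to S3's eventual consistency.
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/data/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```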

  • 14. HDFS Router-Based Federation

        HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. It is similar to the existing ViewFs and HDFS Federation functionality, except that the mount table is maintained server-side by the routing layer rather than on the client, which simplifies access to a federated cluster for existing HDFS clients. See HDFS-10467 for details.

  • 15. A REST API for modifying Capacity Scheduler queue configuration

        OrgQueue extends the Capacity Scheduler with a REST API that provides a programmatic way to change queue configurations. This enables automation of queue configuration management by administrators in the queue's administer_queue ACL. See YARN-5734 for details.

  • 16. Beyond the traditional CPU and memory, YARN resources can now include user-defined resource types, such as GPUs

        The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, a cluster administrator can define resources such as GPUs, software licenses, or locally-attached storage, and YARN tasks can then be scheduled based on the availability of these resources. See YARN-3926 for details.

For reference, the original English overview from the Apache Hadoop 3.0.0 release documentation is reproduced below.

Apache Hadoop 3.0.0

Apache Hadoop 3.0.0 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).

This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready.

Overview

Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.

Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.

More details are available in the HDFS Erasure Coding documentation.

YARN Timeline Service v.2

We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and provide feedback and suggestions for making it a ready replacement for Timeline Service v.1.x. It should be used only in a test capacity.

More details are available in the YARN Timeline Service v.2 documentation.

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902.

More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.

Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

HADOOP-11804 adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. This avoids leaking Hadoop’s dependencies onto the application’s classpath.

Support for Opportunistic Containers and Distributed Scheduling.

A notion of ExecutionType has been introduced, whereby Applications can now request for containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at an NM even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to be available for it to start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.

Opportunistic containers are by default allocated by the central RM, but support has also been added to allow opportunistic containers to be allocated by a distributed scheduler which is implemented as an AMRMProtocol interceptor.

Please see documentation for more details.

MapReduce task-level native optimization

MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.

See the release notes for MAPREDUCE-2841 for more detail.

Support for more than 2 NameNodes.

The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.

The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.

Default ports of multiple services have been changed.

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors

Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

Intra-datanode balancer

A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.

Reworked daemon and task heap management

A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.

HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.

MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.

S3Guard: Consistency and Metadata Caching for the S3A filesystem client

HADOOP-13345 adds an optional feature to the S3A client of Amazon S3 storage: the ability to use a DynamoDB table as a fast and consistent store of file and directory metadata.

See S3Guard for more details.

HDFS Router-Based Federation

HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.

See HDFS-10467 and the HDFS Router-based Federation documentation for more details.

API-based configuration of Capacity Scheduler queue configuration

The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.

See YARN-5734 and the Capacity Scheduler documentation for more information.

 

YARN Resource Types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.

See YARN-3926 and the YARN resource model documentation for more information.
