
HDFS data flow: writing a file


The client creates the file by calling create() on DistributedFileSystem (step 1 in
Figure 3-3). DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem’s namespace, with no blocks associated with it (step 2). The
namenode performs various checks to make sure that the file doesn’t already exist and
that the client has the right permissions to create it. If these checks pass, the namenode
makes a record of the new file; otherwise, file creation fails and the client receives an
IOException. DistributedFileSystem returns an FSDataOutputStream for the client to
start writing data to. Just as in the read case, FSDataOutputStream wraps a
DFSOutputStream, which handles communication with the datanodes and namenode.
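
The namenode-side checks can be illustrated with a toy model. This is only a sketch
under stated assumptions: ToyNamenode, writable_dirs, and the paths are hypothetical
names invented for illustration, and the real namenode's permission model is far richer
than a set of writable directories.

```python
class ToyNamenode:
    """Hypothetical in-memory stand-in for the namenode's namespace."""

    def __init__(self, writable_dirs):
        self.writable_dirs = set(writable_dirs)  # directories the client may write to
        self.files = {}                          # path -> list of block IDs

    def create(self, path):
        parent = path.rsplit("/", 1)[0] or "/"
        # Check 1: the file must not already exist.
        if path in self.files:
            raise IOError(f"{path} already exists")
        # Check 2: the client must be allowed to create files in the parent directory.
        if parent not in self.writable_dirs:
            raise IOError(f"permission denied: {parent}")
        # Record the new file with no blocks associated with it yet.
        self.files[path] = []
        return path


namenode = ToyNamenode(writable_dirs={"/user/alice"})
namenode.create("/user/alice/log.txt")
```

Calling create() a second time with the same path, or with a path outside the writable
directories, raises IOError in this model, mirroring the failure case described above.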


As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes
to an internal queue called the data queue. The data queue is consumed by the
DataStreamer, which is responsible for asking the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas. The list of datanodes forms a
pipeline; here we assume the replication level is 3, so there are three nodes in the
pipeline. The DataStreamer streams the packets to the first datanode in the pipeline,
which stores each packet and forwards it to the second datanode in the pipeline. Similarly,
the second datanode stores the packet and forwards it to the third (and last) datanode in
the pipeline (step 4).
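
The store-and-forward behaviour of the pipeline can be sketched in a few lines. This is a
toy model, not Hadoop code: ToyDatanode, split_into_packets, and the 8-byte packet size
are assumptions chosen for illustration, and real packets travel over the network rather
than through direct method calls.

```python
def split_into_packets(data: bytes, packet_size: int):
    """Cut the client's data into fixed-size packets (the last one may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


class ToyDatanode:
    """Hypothetical datanode that stores a packet, then forwards it downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = []

    def receive(self, packet):
        self.stored.append(packet)           # store the packet locally
        if self.downstream is not None:
            self.downstream.receive(packet)  # forward to the next node in the pipeline


# Build a three-node pipeline: dn1 -> dn2 -> dn3.
dn3 = ToyDatanode("dn3")
dn2 = ToyDatanode("dn2", downstream=dn3)
dn1 = ToyDatanode("dn1", downstream=dn2)

# The client only ever talks to the first datanode.
for packet in split_into_packets(b"hello hdfs pipeline", packet_size=8):
    dn1.receive(packet)
```

After the loop, all three toy datanodes hold the same sequence of packets, even though
the client sent each packet only once.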

 

DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack
queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
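
The interplay of the two queues can be sketched as follows. The names send_next and
on_acks, the packet labels, and the datanode names are hypothetical; the sketch only
shows the rule that a packet leaves the ack queue once every node in the pipeline has
acknowledged it.

```python
from collections import deque

data_queue = deque(["p1", "p2", "p3"])  # packets waiting to be sent
ack_queue = deque()                     # packets sent but not yet fully acknowledged
pipeline = ["dn1", "dn2", "dn3"]


def send_next():
    """Move the next packet from the data queue to the ack queue and return it."""
    packet = data_queue.popleft()
    ack_queue.append(packet)
    return packet


def on_acks(packet, acked_by):
    """Drop the packet from the ack queue only if every pipeline node acked it."""
    if set(acked_by) >= set(pipeline) and ack_queue and ack_queue[0] == packet:
        ack_queue.popleft()
        return True
    return False


sent = send_next()
partial = on_acks(sent, ["dn1", "dn2"])        # one ack missing: packet stays queued
full = on_acks(sent, ["dn1", "dn2", "dn3"])    # all nodes acked: packet is removed
```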


If a datanode fails while data is being written to it, the following actions are taken,
all of which are transparent to the client writing the data. First, the pipeline is closed,
and any packets in the ack queue are added to the front of the data queue so that
datanodes that are downstream from the failed node will not miss any packets. The current
block on the good datanodes is given a new identity, which is communicated to the
namenode, so that the partial block on the failed datanode will be deleted if the failed
datanode recovers later on. The failed datanode is removed from the pipeline, and the
remainder of the block’s data is written to the two good datanodes in the pipeline. The
namenode notices that the block is under-replicated, and it arranges for a further replica
to be created on another node. Subsequent blocks are then treated as normal.
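
The queue handling during recovery can be sketched like this. It is a simplified model
under stated assumptions: recover_pipeline and the packet and node names are
hypothetical, and the step that gives the block a new identity is left out.

```python
from collections import deque


def recover_pipeline(data_queue, ack_queue, pipeline, failed_node):
    """Requeue unacknowledged packets and return the pipeline without the failed node."""
    # Put unacknowledged packets back at the FRONT of the data queue, preserving
    # their original order, so that no downstream datanode misses a packet.
    data_queue.extendleft(reversed(ack_queue))
    ack_queue.clear()
    # Continue writing with the remaining good datanodes.
    return [node for node in pipeline if node != failed_node]


data_queue = deque(["p4", "p5"])   # not yet sent
ack_queue = deque(["p2", "p3"])    # sent, but not acknowledged by every node
pipeline = recover_pipeline(data_queue, ack_queue, ["dn1", "dn2", "dn3"], "dn2")
```

In this example the data queue ends up as p2, p3, p4, p5, and the pipeline shrinks to
dn1 and dn3.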

 

It’s possible, but unlikely, that multiple datanodes fail while a block is being written.
As long as dfs.replication.min replicas (default one; the property is named
dfs.namenode.replication.min in Hadoop 2 and later) are written, the write will succeed,
and the block will be asynchronously replicated across the cluster until its target
replication factor is reached (dfs.replication, which defaults to three).
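
The rule can be written as a small decision function. The function name write_outcome
and its return strings are made up for illustration; only the two default values come
from the text above.

```python
def write_outcome(replicas_written, min_replicas=1, target_replicas=3):
    """Classify a block write by how many replicas were actually written.

    min_replicas mirrors dfs.replication.min and target_replicas mirrors
    dfs.replication; both defaults follow the values quoted in the text.
    """
    if replicas_written < min_replicas:
        return "failed"
    if replicas_written < target_replicas:
        # The write succeeds; the remaining replicas are created asynchronously.
        return "succeeded, under-replicated"
    return "succeeded"


outcomes = [write_outcome(n) for n in range(4)]
```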

 

 

When the client has finished writing data, it calls close() on the stream (step 6). This
action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete (step
7). The namenode already knows which blocks the file is made up of (because the
DataStreamer asked it for block allocations), so it only has to wait for the blocks to be
minimally replicated before returning successfully.

 

 

 

 
