Is HDFS an append-only file system? Then how do people modify the files stored on HDFS?

 

HDFS is append-only, yes. The short answer to your question is that to modify any portion of a file that is already written, one must rewrite the entire file and replace the old file.

"Even for a single byte?" Yes, even for a single byte.
"Really?!" Yep, really. 
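To make "rewrite the entire file" concrete, here is a minimal sketch of the workflow. It uses the local filesystem purely as a stand-in for HDFS (a real job would go through an HDFS client, but the shape of the operation is the same): read everything back, change the byte in memory, write a brand-new file, and swap it in.

```python
import os

def modify_byte(path, offset, new_byte):
    """Change one byte by rewriting the whole file -- the only option
    in an append-only system, since there are no in-place writes."""
    with open(path, "rb") as f:
        data = bytearray(f.read())   # 1. read the entire file back
    data[offset] = new_byte          # 2. apply the change in memory
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)                # 3. write a brand-new file
    os.replace(tmp, path)            # 4. swap it in place of the old one

# Demo: "modify" a single byte of a five-byte file.
with open("demo.bin", "wb") as f:
    f.write(b"hello")
modify_byte("demo.bin", 0, ord("H"))
with open("demo.bin", "rb") as f:
    print(f.read())  # b'Hello'
os.remove("demo.bin")
```

Note that steps 1–3 touch every byte of the file no matter how small the change is, which is exactly the inefficiency discussed next.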

"Isn't that horribly inefficient?"
Yep, but it usually doesn't matter, because large data processing applications are typically built around the idea that things don't change piecemeal like this. Let's take a few examples.

Let's say you were to build a typical fact table in Apache Hive or Cloudera Impala to store transactional data from a point-of-sale system or an e-commerce website. Users make purchases, in which case we simply add new records to the table. This works just fine in an append-only system. The question is: what happens if someone cancels an order or wants to update the quantity of an item purchased? You might be tempted to update the existing record. In fact, what you should probably do, to preserve the series of actions, is append an adjustment or delta record that indicates a modification occurred to a previous transaction. To do this, we'd use a schema something like this (I'm going to ignore some details):

CREATE TABLE order_item_transactions (
  transaction_id bigint, -- unique for each record
  order_id bigint,       -- non-unique
  version int,
  product_id bigint,
  quantity int
)

When we update an order item, we use a new transaction_id, but the same order_id. We bump the version (or use an epoch timestamp as the version) to indicate that the latter record takes precedence over the former. This is extremely common in data warehousing. We may also choose to build a derived table (effectively a materialized view) that is the latest version of all orders where order_id is unique. Something equivalent to:

CREATE TABLE latest_order_items AS
SELECT transaction_id, order_id, version, product_id, quantity
FROM (
  SELECT t.*,
         row_number() OVER (PARTITION BY order_id ORDER BY version DESC) AS rn
  FROM order_item_transactions t
) ranked
WHERE rn = 1
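The same pattern is easy to demonstrate outside Hive. The sketch below uses Python's built-in sqlite3 purely as a stand-in for the warehouse (the table and column names follow the schema above, and the order/product values are made up for illustration): we append a delta record rather than updating in place, then pick the winning version per order with a window function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_item_transactions (
        transaction_id INTEGER,  -- unique for each record
        order_id INTEGER,        -- non-unique
        version INTEGER,
        product_id INTEGER,
        quantity INTEGER
    )
""")
# Original purchase, then an append-only adjustment: same order_id,
# new transaction_id, higher version, corrected quantity.
conn.executemany(
    "INSERT INTO order_item_transactions VALUES (?, ?, ?, ?, ?)",
    [(1, 100, 1, 42, 3),   # customer buys 3 units
     (2, 100, 2, 42, 1)],  # later corrects the order to 1 unit
)

# Latest version of each order: rank rows per order_id by version.
latest = conn.execute("""
    SELECT order_id, transaction_id, version, quantity
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY order_id ORDER BY version DESC
               ) AS rn
        FROM order_item_transactions
    )
    WHERE rn = 1
""").fetchall()
print(latest)  # [(100, 2, 2, 1)]
```

Both rows stay in the base table, so the full history survives for auditing while the derived view shows only the current state.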

HBase, which does need to modify and delete records, uses exactly this technique. During a "compaction," it removes old versions of records.

A side benefit of this, and the reason it's so popular in data warehousing, is that even though you frequently need only the most recent version of a record, you also want a full log of changes for auditing purposes. Since Hadoop typically deals with batch data processing and long-term storage applications, being append-only isn't as much of a limitation as you'd expect.

 

The MapR Distribution for Apache Hadoop supports random reads/writes, thus eliminating this problem.

Eric Sammer is correct that many use cases involve adding data rather than changing it, but there are many benefits to supporting random writes.

First, it's a prerequisite for supporting standard interfaces such as NFS. The NFS protocol has no concept of opening or closing a file, so the only way to support it is with an underlying storage system that can handle random writes. Furthermore, the vast majority of tools in use today were not designed to work with an append-only system (the last such systems were CD-ROM and FTP, both about 20 years old now), so they commonly write at random offsets, and even when they don't, the requests can get reordered on the host or on the network.

Second, random write support enables innovation and capabilities that would otherwise not be possible. For example, MapR addressed major HBase limitations (see MapR M7) by taking advantage of the underlying platform's capabilities, including random writes. Apache HBase was designed to work around the limitations of HDFS, and that comes at a high cost (e.g., MapR M7 eliminates the compactions that affect all stock HBase users).

References

http://www.quora.com/Is-HDFS-an-append-only-file-system-Then-how-do-people-modify-the-files-stored-on-HDFS
