HDFS is append-only, yes. The short answer to your question is that, to modify any portion of a file that has already been written, one must rewrite the entire file and replace the old one.
"Even for a single byte?" Yes, even for a single byte.
"Really?!" Yep, really.
"Isn't that horribly inefficient?"
Yep, but it usually doesn't matter, because large-scale data processing applications are typically built around the idea that data doesn't change piecemeal like this. Let's take a few examples.
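To make the cost concrete before the examples: with a plain (pre-ACID) Hive table stored on HDFS, "updating" even a single value means rewriting the whole table or partition, typically with INSERT OVERWRITE. A minimal sketch (the prices table, its columns, and the values are made up for illustration):

-- Rewrites every file backing the table just to change one row's price.
INSERT OVERWRITE TABLE prices
SELECT item_id,
       CASE WHEN item_id = 42 THEN 9.99 ELSE price END AS price
FROM prices;

Hive stages the SELECT output in a scratch directory and then swaps it in, so the table's files are replaced wholesale rather than edited in place.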
Let's say you were to build a typical fact table in Apache Hive or Cloudera Impala to store transactional data from point of sale or an ecommerce website. Users make purchases, in which case we simply add new records to the table. This works just fine in an append-only system. The question is, what happens if someone cancels an order or wants to update the quantity of an item purchased? You might be tempted to update the existing record. In fact, what you should probably do, to preserve the series of actions, is append an adjustment or delta record indicating that a modification occurred to a previous transaction. To do this, we'd use a schema something like this (I'm going to ignore some details):
CREATE TABLE order_item_transactions (
  transaction_id bigint,  -- unique for each record
  order_id bigint,        -- non-unique
  version int,
  product_id bigint,
  quantity int
);
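For concreteness, two appended rows might look like this (a sketch; the IDs and quantities are made up):

-- Original order item: 3 units of product 77 on order 500.
INSERT INTO TABLE order_item_transactions VALUES (1001, 500, 1, 77, 3);
-- Later adjustment: same order_id, new transaction_id, higher version, quantity corrected to 2.
INSERT INTO TABLE order_item_transactions VALUES (1002, 500, 2, 77, 2);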
When we update an order item, we use a new transaction_id but the same order_id. We bump the version (or use an epoch timestamp as the version) to indicate that the later record takes precedence over the earlier one. This is extremely common in data warehousing. We may also choose to build a derived table (effectively a materialized view) that holds only the latest version of each order, so order_id is unique. Something equivalent to:
CREATE TABLE latest_order_items AS SELECT order_id, first(version), first(transaction_id), ... FROM order_item_transactions GROUP BY order_id ORDER BY version DESC
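The first()/ORDER BY combination above is shorthand for "pick the highest-version row per order"; written out in Hive, one deterministic way to build that derived table is with a window function (same table names as above, remaining columns spelled out):

CREATE TABLE latest_order_items AS
SELECT order_id, version, transaction_id, product_id, quantity
FROM (
  SELECT t.*,
         row_number() OVER (PARTITION BY order_id ORDER BY version DESC) AS rn
  FROM order_item_transactions t
) ranked
WHERE rn = 1;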
HBase, which does need to modify data, uses this same technique for updates and deletes. During a "compaction," it removes old versions of records.
A side benefit of this, and the reason it's so popular in data warehousing, is that even though you frequently need only the most recent version of a record, you also want a full log of the changes made, for auditing purposes. Since Hadoop typically deals with batch data processing and long-term storage applications, being append-only isn't as much of a limitation as you'd expect.
The MapR Distribution for Apache Hadoop supports random reads/writes, thus eliminating this problem.
Eric Sammer is correct that many use cases involve adding data rather than changing it, but there are many benefits to supporting random writes as well.
First, it's a prerequisite for supporting standard interfaces such as NFS. The NFS protocol doesn't have a concept of opening or closing a file, so the only way it can be supported is with an underlying storage system that supports random writes. Furthermore, the vast majority of tools in use today were not designed to work with an append-only system (the last such systems were CD-ROM and FTP, both 20 years old now), so they commonly write at random offsets (and even when they don't, the requests can get re-ordered on the host or on the network).
Second, random write support enables innovation and capabilities that would otherwise not be possible. For example, MapR addressed the major HBase limitations (see MapR M7) by taking advantage of the underlying capabilities, including random write support. Apache HBase was designed to work around the limitations of HDFS, and that comes at a high cost (e.g., MapR M7 eliminates compactions, which affect all stock HBase users).
References
http://www.quora.com/Is-HDFS-an-append-only-file-system-Then-how-do-people-modify-the-files-stored-on-HDFS