The HDFS design document introduces file deletes and undeletes as follows:
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
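To make the trash mechanism concrete, here is a minimal Java sketch of moving a file into the trash programmatically instead of removing it outright. It is an illustration under assumptions, not the shell's implementation: the path is made up, a running cluster reachable through the default configuration is assumed, and in practice the trash lives in a per-user .Trash directory under the user's home rather than a literal /trash.

// A minimal sketch: send a file to the trash instead of deleting it outright.
// Assumes the default configuration points at a running HDFS cluster;
// /user/foo/data.txt is a made-up example path.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();    // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/foo/data.txt");  // hypothetical file

    // moveToTrash renames the file under the trash directory, so it stays
    // recoverable until the trash interval expires; it returns false when
    // the trash feature is disabled (fs.trash.interval = 0).
    Trash trash = new Trash(conf);
    if (fs.exists(file) && trash.moveToTrash(file)) {
      System.out.println("moved to trash: " + file);
    }

    // Calling fs.delete(file, false) directly would skip the trash and
    // remove the namespace entry right away.
  }
}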
After a file is deleted, its entry is removed from the NameNode's namespace and the blocks that belonged to it become obsolete. When the NameNode later receives a block report from a DataNode that holds those blocks, it diffs the old and new block lists and produces three lists: blocks to add, blocks to remove, and blocks to invalidate (a simplified sketch of this diff appears after the FSNamesystem code below). The DataNode side of the exchange starts in offerService:
//DataNode.java
/**
 * Main loop for the DataNode.  Runs until shutdown,
 * forever calling remote NameNode functions.
 */
public void offerService() throws Exception {
  ... ...
  // send block report
  if (startTime - lastBlockReport > blockReportInterval) {
    //
    // Send latest blockinfo report if timer has expired.
    // Get back a list of local block(s) that are obsolete
    // and can be safely GC'ed.
    //
    long brStartTime = now();
    Block[] bReport = data.getBlockReport();
    DatanodeCommand cmd = namenode.blockReport(dnRegistration,
            BlockListAsLongs.convertToArrayLongs(bReport));
    long brTime = now() - brStartTime;
    myMetrics.blockReports.inc(brTime);
    LOG.info("BlockReport of " + bReport.length +
        " blocks got processed in " + brTime + " msecs");
    //
    // If we have sent the first block report, then wait a random
    // time before we start the periodic block reports.
    //
    if (resetBlockReportTime) {
      lastBlockReport = startTime - R.nextInt((int)(blockReportInterval));
      resetBlockReportTime = false;
    } else {
      /* say the last block report was at 8:20:14. The current report
       * should have started around 9:20:14 (default 1 hour interval).
       * If current time is :
       *   1) normal like 9:20:18, next report should be at 10:20:14
       *   2) unexpected like 11:35:43, next report should be at 12:20:14
       */
      lastBlockReport += (now() - lastBlockReport) /
          blockReportInterval * blockReportInterval;
    }
    processCommand(cmd);
  }
  ... ...
}
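The lastBlockReport update near the end is just integer arithmetic that snaps the schedule back onto its original phase instead of letting it drift. A tiny standalone sketch of the same computation, using the times from the comment in the code (values are relative to the 8:20:14 report, so they are illustrative only):

// Standalone illustration of the scheduling arithmetic in offerService().
// With a 1-hour interval and the last report at 8:20:14, a report that runs
// late at 11:35:43 advances lastBlockReport by three whole intervals, to
// 11:20:14, so the next report is due at 12:20:14.
public class BlockReportSchedule {
  public static void main(String[] args) {
    long interval = 3600000L;          // 1 hour in ms
    long lastBlockReport = 0L;         // treat 8:20:14 as t0
    long now = 11729000L;              // 11:35:43 is 3h 15m 29s after t0

    // integer division drops the partial interval, keeping the schedule
    // anchored at ...:20:14 rather than drifting to ...:35:43
    lastBlockReport += (now - lastBlockReport) / interval * interval;

    System.out.println("lastBlockReport = t0 + "
        + lastBlockReport / 1000 + " s");              // 10800 s, i.e. 11:20:14
    System.out.println("next report due = t0 + "
        + (lastBlockReport + interval) / 1000 + " s"); // 14400 s, i.e. 12:20:14
  }
}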
The DataNode invokes blockReport over RPC; on the NameNode side the method of the same name receives the report, processes it, and returns the commands the DataNode should carry out.
//NameNode.java
public DatanodeCommand blockReport(DatanodeRegistration nodeReg,
                                   long[] blocks) throws IOException {
  verifyRequest(nodeReg);
  BlockListAsLongs blist = new BlockListAsLongs(blocks);
  stateChangeLog.debug("*BLOCK* NameNode.blockReport: "
           + "from " + nodeReg.getName() + " "
           + blist.getNumberOfBlocks() + " blocks");

  namesystem.processReport(nodeReg, blist);
  if (getFSImage().isUpgradeFinalized())
    return DatanodeCommand.FINALIZE;
  return null;
}
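The call crosses the wire through the DatanodeProtocol RPC interface, with the DataNode as the caller and the NameNode as the implementor. Below is a trimmed sketch of the part of that contract exercised here, matching the signature in the code above; the real interface also carries heartbeats, block-received notifications, error reports and a versionID, and the import paths assume the 0.20-era package layout on the classpath.

// Trimmed sketch of the DataNode <-> NameNode RPC contract used above; only
// the block-report call is shown, so the name is deliberately not the real one.
import org.apache.hadoop.hdfs.server.protocol.DatanodeCommand;
import org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration;

public interface DatanodeProtocolSketch {
  // The DataNode ships every block it stores, packed into a long[]; the
  // NameNode answers with a command such as FINALIZE or an invalidate list.
  DatanodeCommand blockReport(DatanodeRegistration registration,
                              long[] blocks) throws java.io.IOException;
}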
The detailed work, updating the block maps and deciding which blocks to remove or invalidate, is delegated to FSNamesystem:
//FSNamesystem.java
/**
 * The given node is reporting all its blocks.  Use this info to
 * update the (machine-->blocklist) and (block-->machinelist) tables.
 */
public synchronized void processReport(DatanodeID nodeID,
                                        BlockListAsLongs newReport
                                        ) throws IOException {
  long startTime = now();
  if (NameNode.stateChangeLog.isDebugEnabled()) {
    NameNode.stateChangeLog.debug("BLOCK* NameSystem.processReport: "
                           + "from " + nodeID.getName() + " "
                           + newReport.getNumberOfBlocks() + " blocks");
  }
  DatanodeDescriptor node = getDatanode(nodeID);
  if (node == null) {
    throw new IOException("ProcessReport from unregisterted node: "
                          + nodeID.getName());
  }

  // Check if this datanode should actually be shutdown instead.
  if (shouldNodeShutdown(node)) {
    setDatanodeDead(node);
    throw new DisallowedDatanodeException(node);
  }

  //
  // Modify the (block-->datanode) map, according to the difference
  // between the old and new block report.
  //
  Collection<Block> toAdd = new LinkedList<Block>();
  Collection<Block> toRemove = new LinkedList<Block>();
  Collection<Block> toInvalidate = new LinkedList<Block>();
  node.reportDiff(blocksMap, newReport, toAdd, toRemove, toInvalidate);

  for (Block b : toRemove) {
    removeStoredBlock(b, node);
  }
  for (Block b : toAdd) {
    addStoredBlock(b, node, null);
  }
  for (Block b : toInvalidate) {
    NameNode.stateChangeLog.info("BLOCK* NameSystem.processReport: block "
        + b + " on " + node.getName() + " size " + b.getNumBytes()
        + " does not belong to any file.");
    addToInvalidates(b, node);
  }
  NameNode.getNameNodeMetrics().blockReport.inc((int) (now() - startTime));
}
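node.reportDiff is where the three lists come from. The following is a simplified, illustrative version of that three-way diff written over plain collections; the names are stand-ins, and the real implementation iterates the node's block list in place and is considerably more involved.

import java.util.List;
import java.util.Set;

// Illustrative three-way diff between what the NameNode already records for a
// DataNode and what that node just reported. Simplified stand-in, not the
// actual DatanodeDescriptor.reportDiff code.
class ReportDiffSketch {
  static void diff(Set<Long> storedOnNode,    // blocks the NameNode thinks the node holds
                   Set<Long> belongsToAFile,  // blocks still referenced by some file
                   Set<Long> reported,        // blocks the node just reported
                   List<Long> toAdd, List<Long> toRemove, List<Long> toInvalidate) {
    for (Long b : reported) {
      if (!belongsToAFile.contains(b)) {
        toInvalidate.add(b);                  // its file is gone, replica is obsolete
      } else if (!storedOnNode.contains(b)) {
        toAdd.add(b);                         // newly reported replica, record it
      }
    }
    for (Long b : storedOnNode) {
      if (!reported.contains(b)) {
        toRemove.add(b);                      // replica disappeared from the node
      }
    }
  }
}

Roughly speaking, blocks that end up on toInvalidate are queued per DataNode by addToInvalidates; on a later heartbeat the NameNode hands them back to that node as a DNA_INVALIDATE block command, which is exactly what the DataNode handles next.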
Back on the DataNode side, let us see how the returned command is processed:
// DataNode.java
switch(cmd.getAction()) {
  ... ...
  case DatanodeProtocol.DNA_INVALIDATE:
    //
    // Some local block(s) are obsolete and can be
    // safely garbage-collected.
    //
    // (bcmd is the received DatanodeCommand cast to BlockCommand;
    //  the cast is elided in this excerpt)
    Block toDelete[] = bcmd.getBlocks();
    try {
      if (blockScanner != null) {
        blockScanner.deleteBlocks(toDelete);
      }
      data.invalidate(toDelete);
    } catch(IOException e) {
      checkDiskError();
      throw e;
    }
    myMetrics.blocksRemoved.inc(toDelete.length);
    break;
  ... ...
}
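data.invalidate is the dataset layer dropping the replica from local storage. Roughly speaking, that comes down to deleting the block file and its companion checksum file and updating the node's bookkeeping; here is a purely illustrative sketch, not the actual FSDataset code, where the blk_<id> and blk_<id>_<generationStamp>.meta names follow the on-disk convention of that era and the directory is hypothetical.

import java.io.File;

// Purely illustrative: invalidating a block ultimately removes the block file
// and its companion .meta checksum file from a DataNode volume. The real
// FSDataset also fixes up its in-memory volume map and disk-usage accounting.
class InvalidateSketch {
  static void invalidate(File volumeDir, long blockId, long generationStamp) {
    File blockFile = new File(volumeDir, "blk_" + blockId);
    File metaFile  = new File(volumeDir, "blk_" + blockId + "_" + generationStamp + ".meta");
    if (blockFile.delete() && metaFile.delete()) {
      System.out.println("freed block " + blockId);
    }
  }
}

Only at this point does the space counted against the deleted file actually return to the cluster, which is why the design document warns about the delay between deleting a file and seeing free space grow.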