The HDFS design document describes how file deletes and undeletes work:
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
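The expiry rule quoted above ("more than 6 hours old") can be pictured with a small check. This is only a sketch of the policy as the design document words it, not code taken from HDFS; the class name TrashExpirySketch, the method isExpired and the constant DEFAULT_RETENTION are hypothetical names invented for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the /trash expiry rule; not part of HDFS.
public class TrashExpirySketch {
    // The design document quotes a default of 6 hours.
    public static final Duration DEFAULT_RETENTION = Duration.ofHours(6);

    // A file is eligible for permanent deletion once it has been in
    // /trash for strictly longer than the retention period.
    public static boolean isExpired(Instant deletedAt, Instant now, Duration retention) {
        return Duration.between(deletedAt, now).compareTo(retention) > 0;
    }

    public static void main(String[] args) {
        Instant deletedAt = Instant.parse("2014-01-01T00:00:00Z");
        // Still restorable after 5 hours, eligible for removal after 7.
        System.out.println(isExpired(deletedAt, deletedAt.plus(Duration.ofHours(5)), DEFAULT_RETENTION));
        System.out.println(isExpired(deletedAt, deletedAt.plus(Duration.ofHours(7)), DEFAULT_RETENTION));
    }
}
```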
After a file is deleted, its entry is removed from the NameNode's namespace and its blocks are marked obsolete. Later, when the NameNode receives a BlockReport from a DataNode that holds those blocks, it diffs the reported blocks against its own records and generates three block lists (toAdd, toRemove and toInvalidate in the code further down). The DataNode side sends the report from its main loop:
//DataNode.java
/**
 * Main loop for the DataNode.  Runs until shutdown,
 * forever calling remote NameNode functions.
 */
public void offerService() throws Exception {
  ... ...
  // send block report
  if (startTime - lastBlockReport > blockReportInterval) {
    //
    // Send latest blockinfo report if timer has expired.
    // Get back a list of local block(s) that are obsolete
    // and can be safely GC'ed.
    //
    long brStartTime = now();
    Block[] bReport = data.getBlockReport();
    DatanodeCommand cmd = namenode.blockReport(dnRegistration,
        BlockListAsLongs.convertToArrayLongs(bReport));
    long brTime = now() - brStartTime;
    myMetrics.blockReports.inc(brTime);
    LOG.info("BlockReport of " + bReport.length +
        " blocks got processed in " + brTime + " msecs");
    //
    // If we have sent the first block report, then wait a random
    // time before we start the periodic block reports.
    //
    if (resetBlockReportTime) {
      lastBlockReport = startTime - R.nextInt((int)(blockReportInterval));
      resetBlockReportTime = false;
    } else {
      /* say the last block report was at 8:20:14. The current report
       * should have started around 9:20:14 (default 1 hour interval).
       * If current time is :
       *   1) normal like 9:20:18, next report should be at 10:20:14
       *   2) unexpected like 11:35:43, next report should be at 12:20:14
       */
      lastBlockReport += (now() - lastBlockReport) /
                         blockReportInterval * blockReportInterval;
    }
    processCommand(cmd);
  }
  ... ...
}
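The integer arithmetic in the else-branch above is easy to misread, so here is a minimal standalone sketch that replays the two cases from the comment. The class BlockReportSchedule and its helpers toMillis and format are hypothetical names for illustration; only the one-line formula is taken from offerService().

```java
// Hypothetical standalone replay of the lastBlockReport update in offerService().
public class BlockReportSchedule {
    // Same formula as the DataNode code: advance by a whole number of intervals.
    public static long nextLastBlockReport(long lastBlockReport, long now, long interval) {
        return lastBlockReport + (now - lastBlockReport) / interval * interval;
    }

    // Helper: time of day to milliseconds since midnight.
    public static long toMillis(int h, int m, int s) {
        return ((h * 60L + m) * 60L + s) * 1000L;
    }

    // Helper: milliseconds since midnight back to H:MM:SS.
    public static String format(long millis) {
        long secs = millis / 1000L;
        return String.format("%d:%02d:%02d", secs / 3600, (secs / 60) % 60, secs % 60);
    }

    public static void main(String[] args) {
        long interval = 60L * 60L * 1000L;   // default 1 hour interval
        long last = toMillis(8, 20, 14);
        // Case 1: report runs on time at 9:20:18.
        System.out.println(format(nextLastBlockReport(last, toMillis(9, 20, 18), interval)));
        // Case 2: report runs late at 11:35:43.
        System.out.println(format(nextLastBlockReport(last, toMillis(11, 35, 43), interval)));
    }
}
```

In the first case lastBlockReport moves to 9:20:14 and in the second to 11:20:14, so the next reports fall due at 10:20:14 and 12:20:14, as the source comment says. Because the division truncates, a late report never shifts the schedule off its original phase.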
The DataNode calls blockReport, which goes over RPC to the NameNode method of the same name. The NameNode processes the report and returns a command telling the DataNode what to do next.
//NameNode.java
public DatanodeCommand blockReport(DatanodeRegistration nodeReg,
                                   long[] blocks) throws IOException {
  verifyRequest(nodeReg);
  BlockListAsLongs blist = new BlockListAsLongs(blocks);
  stateChangeLog.debug("*BLOCK* NameNode.blockReport: "
      +"from "+nodeReg.getName()+" "+blist.getNumberOfBlocks() +" blocks");
  namesystem.processReport(nodeReg, blist);
  if (getFSImage().isUpgradeFinalized())
    return DatanodeCommand.FINALIZE;
  return null;
}
The detailed bookkeeping, such as removing stale entries and the other housekeeping, is delegated to FSNamesystem.
//FSNamesystem.java
/**
 * The given node is reporting all its blocks.  Use this info to
 * update the (machine-->blocklist) and (block-->machinelist) tables.
 */
public synchronized void processReport(DatanodeID nodeID,
                                       BlockListAsLongs newReport
                                      ) throws IOException {
  long startTime = now();
  if (NameNode.stateChangeLog.isDebugEnabled()) {
    NameNode.stateChangeLog.debug("BLOCK* NameSystem.processReport: "
        + "from " + nodeID.getName()+" " +
        newReport.getNumberOfBlocks()+" blocks");
  }
  DatanodeDescriptor node = getDatanode(nodeID);
  if (node == null) {
    throw new IOException("ProcessReport from unregisterted node: "
                          + nodeID.getName());
  }

  // Check if this datanode should actually be shutdown instead.
  if (shouldNodeShutdown(node)) {
    setDatanodeDead(node);
    throw new DisallowedDatanodeException(node);
  }

  //
  // Modify the (block-->datanode) map, according to the difference
  // between the old and new block report.
  //
  Collection<Block> toAdd = new LinkedList<Block>();
  Collection<Block> toRemove = new LinkedList<Block>();
  Collection<Block> toInvalidate = new LinkedList<Block>();
  node.reportDiff(blocksMap, newReport, toAdd, toRemove, toInvalidate);

  for (Block b : toRemove) {
    removeStoredBlock(b, node);
  }
  for (Block b : toAdd) {
    addStoredBlock(b, node, null);
  }
  for (Block b : toInvalidate) {
    NameNode.stateChangeLog.info("BLOCK* NameSystem.processReport: block "
        + b + " on " + node.getName() + " size " + b.getNumBytes()
        + " does not belong to any file.");
    addToInvalidates(b, node);
  }
  NameNode.getNameNodeMetrics().blockReport.inc((int) (now() - startTime));
}
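To make the three lists concrete, here is a simplified sketch of the diff using plain sets of block IDs. It is my own reading of what reportDiff is meant to produce, not the real DatanodeDescriptor.reportDiff code; the class ReportDiffSketch and the parameter names blocksOfFiles, storedForNode and reported are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Simplified, hypothetical model of the block report diff; not Hadoop code.
public class ReportDiffSketch {
    public final Set<Long> toAdd = new TreeSet<>();        // reported, valid, not yet recorded for this node
    public final Set<Long> toRemove = new TreeSet<>();     // recorded for this node, but no longer reported
    public final Set<Long> toInvalidate = new TreeSet<>(); // reported, but belongs to no file

    // blocksOfFiles: every block that still belongs to some file in the namespace
    // storedForNode: blocks the NameNode currently believes this DataNode holds
    // reported:      blocks listed in the DataNode's block report
    public static ReportDiffSketch diff(Set<Long> blocksOfFiles,
                                        Set<Long> storedForNode,
                                        Set<Long> reported) {
        ReportDiffSketch d = new ReportDiffSketch();
        for (Long b : reported) {
            if (!blocksOfFiles.contains(b)) {
                d.toInvalidate.add(b);   // e.g. block of a deleted file
            } else if (!storedForNode.contains(b)) {
                d.toAdd.add(b);          // new replica the NameNode did not know about
            }
        }
        for (Long b : storedForNode) {
            if (!reported.contains(b)) {
                d.toRemove.add(b);       // replica disappeared from the DataNode
            }
        }
        return d;
    }
}
```

For example, with blocksOfFiles = {1, 2, 3}, storedForNode = {1, 2} and reported = {2, 3, 9}, the sketch yields toAdd = {3}, toRemove = {1} and toInvalidate = {9}. Block 9 plays the role of a block whose file was deleted: it is the one that ends up in addToInvalidates.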
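Back on the DataNode side, let us see how a command from the NameNode is processed. Note that the blockReport method shown above returns only DatanodeCommand.FINALIZE or null; as far as I can tell from this version of the source, the blocks queued by addToInvalidates reach the DataNode later as a DNA_INVALIDATE command in a heartbeat reply, and they go through the same switch: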
// DataNode.java
switch(cmd.getAction()) {
... ...
case DatanodeProtocol.DNA_INVALIDATE:
  //
  // Some local block(s) are obsolete and can be
  // safely garbage-collected.
  //
  Block toDelete[] = bcmd.getBlocks();
  try {
    if (blockScanner != null) {
      blockScanner.deleteBlocks(toDelete);
    }
    data.invalidate(toDelete);
  } catch(IOException e) {
    checkDiskError();
    throw e;
  }
  myMetrics.blocksRemoved.inc(toDelete.length);
  break;
... ...
}