`
standalone
  • 浏览: 614178 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

How Does HDFS Deletes Files?

阅读更多

In the HDFS design document, it introduces deletes and undeletes in HDFS.

 

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash . A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash , the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.

 

After a file is deleted, its corresponding entry is removed from namenode's namespace and corresponding blocks are also marked to be obsolete. Then when namenode receives a  BlockReport from datanode who owns the block, a block list diff is done by generating 3 block list.

 

//DataNode.java
  /**
   * Main loop for the DataNode.  Runs until shutdown,
   * forever calling remote NameNode functions.
   */
  public void offerService() throws Exception {
        ... ...

        // send block report
        if (startTime - lastBlockReport > blockReportInterval) {
          //
          // Send latest blockinfo report if timer has expired.
          // Get back a list of local block(s) that are obsolete
          // and can be safely GC'ed.
          //
          long brStartTime = now();
          Block[] bReport = data.getBlockReport();
          DatanodeCommand cmd = namenode.blockReport(dnRegistration,
                  BlockListAsLongs.convertToArrayLongs(bReport));
          long brTime = now() - brStartTime;
          myMetrics.blockReports.inc(brTime);
          LOG.info("BlockReport of " + bReport.length +
              " blocks got processed in " + brTime + " msecs");
          //
          // If we have sent the first block report, then wait a random
          // time before we start the periodic block reports.
          //
          if (resetBlockReportTime) {
            lastBlockReport = startTime - R.nextInt((int)(blockReportInterval));
            resetBlockReportTime = false;
          } else {
            /* say the last block report was at 8:20:14. The current report 
             * should have started around 9:20:14 (default 1 hour interval). 
             * If current time is :
             *   1) normal like 9:20:18, next report should be at 10:20:14
             *   2) unexpected like 11:35:43, next report should be at 12:20:14
             */
            lastBlockReport += (now() - lastBlockReport) / 
                               blockReportInterval * blockReportInterval;
          }
          processCommand(cmd);
        }
         
        ... ...
 }

 

DataNode invokes the method blockReport and through RPC at namenode side the same name method of NameNode is invoked, it handles the block report and sends back commands data nodes should do.

 

//NameNode.java  
  public DatanodeCommand blockReport(DatanodeRegistration nodeReg,
                                     long[] blocks) throws IOException {
    verifyRequest(nodeReg);
    BlockListAsLongs blist = new BlockListAsLongs(blocks);
    stateChangeLog.debug("*BLOCK* NameNode.blockReport: "
           +"from "+nodeReg.getName()+" "+blist.getNumberOfBlocks() +" blocks");

    namesystem.processReport(nodeReg, blist);
    if (getFSImage().isUpgradeFinalized())
      return DatanodeCommand.FINALIZE;
    return null;
  }

    Detail file name removing and other trivial things are delegated to FSNamesystem.

//FSNamesystem.java
 /**
   * The given node is reporting all its blocks.  Use this info to 
   * update the (machine-->blocklist) and (block-->machinelist) tables.
   */
  public synchronized void processReport(DatanodeID nodeID, 
                                         BlockListAsLongs newReport
                                        ) throws IOException {
    long startTime = now();
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("BLOCK* NameSystem.processReport: "
                             + "from " + nodeID.getName()+" " + 
                             newReport.getNumberOfBlocks()+" blocks");
    }
    DatanodeDescriptor node = getDatanode(nodeID);
    if (node == null) {
      throw new IOException("ProcessReport from unregisterted node: "
                            + nodeID.getName());
    }

    // Check if this datanode should actually be shutdown instead.
    if (shouldNodeShutdown(node)) {
      setDatanodeDead(node);
      throw new DisallowedDatanodeException(node);
    }
    
    //
    // Modify the (block-->datanode) map, according to the difference
    // between the old and new block report.
    //
    Collection<Block> toAdd = new LinkedList<Block>();
    Collection<Block> toRemove = new LinkedList<Block>();
    Collection<Block> toInvalidate = new LinkedList<Block>();
    node.reportDiff(blocksMap, newReport, toAdd, toRemove, toInvalidate);
        
    for (Block b : toRemove) {
      removeStoredBlock(b, node);
    }
    for (Block b : toAdd) {
      addStoredBlock(b, node, null);
    }
    for (Block b : toInvalidate) {
      NameNode.stateChangeLog.info("BLOCK* NameSystem.processReport: block " 
          + b + " on " + node.getName() + " size " + b.getNumBytes()
          + " does not belong to any file.");
      addToInvalidates(b, node);
    }
    NameNode.getNameNodeMetrics().blockReport.inc((int) (now() - startTime));
  }

 

   Back to DataNode side, let us see how the returned cmd is processed:

 

// DataNode.java
switch(cmd.getAction()) {
    ... ...
    case DatanodeProtocol.DNA_INVALIDATE:
      //
      // Some local block(s) are obsolete and can be 
      // safely garbage-collected.
      //
      Block toDelete[] = bcmd.getBlocks();
      try {
        if (blockScanner != null) {
          blockScanner.deleteBlocks(toDelete);
        }
        data.invalidate(toDelete);
      } catch(IOException e) {
        checkDiskError();
        throw e;
      }
      myMetrics.blocksRemoved.inc(toDelete.length);
      break;
      ... ...
}
分享到:
评论

相关推荐

    hadoop的web上传、下载、更新、删除和文件追加

    本文将深入探讨Hadoop的Web接口功能,包括文件的上传、下载、更新、删除以及追加操作,帮助用户更方便地通过Web界面管理Hadoop分布式文件系统(HDFS)。 一、Hadoop Web接口概述 Hadoop的Web接口,也称为Hadoop ...

    Hadoop介绍,HDFS和MapReduce工作原理

    Hadoop介绍,HDFS和MapReduce工作原理

    HDFS Comics HDFS 漫画

    HDFS是Hadoop分布式计算的存储基础。HDFS具有高容错性,可以部署在通用硬件设备上,适合数据密集型应用,并且提供对数据读写的高吞 吐量。HDFS能 够提供对数据的可扩展访问,通过简单地往集群里添加节点就可以解决...

    向hdfs上传Excel文件.doc

    ### 向HDFS上传Excel文件 #### 背景 在大数据处理场景中,经常需要将Excel文件上传到Hadoop分布式文件系统(HDFS)中进行进一步的数据处理或分析。然而,由于HDFS本身并不直接支持Excel文件格式,通常的做法是先将...

    分布式文件系统hdfs,HDFS的优势是什么?

    分布式文件系统HDFS(Hadoop Distributed File System)是Hadoop生态系统中的一部分,旨在运行于大规模数据集的分布式环境中,具有高度容错性和高度可用性。它的设计目标是能够管理超大规模的数据集,支持高吞吐量...

    实验项目 实战 HDFS 实验报告

    实验项目名为“实战 HDFS”,旨在深入理解和熟练运用Hadoop分布式文件系统(HDFS)。HDFS是Apache Hadoop的核心组件,它为大数据处理提供高容错性、高吞吐量的存储解决方案。实验目的是通过一系列操作,让学生全面...

    HDFS管理工具HDFS Explorer下载地址、使用方法.docx

    **HDFS管理工具HDFS Explorer** HDFS Explorer是一款专为Windows平台设计的HDFS文件管理系统,它使得用户能够像操作本地文件系统一样便捷地管理和浏览Hadoop分布式文件系统(HDFS)。尽管官方已经停止更新此软件,...

    HDFS实例基本操作

    Hadoop分布式文件系统(HDFS)是Apache Hadoop项目的核心组件之一,它为大数据处理提供了可靠的、可扩展的分布式存储解决方案。在这个“HDFS实例基本操作”中,我们将深入探讨如何在已经安装好的HDFS环境中执行基本...

    HDFS文件系统基本文件命令、编程读写HDFS

    HDFS 文件系统基本文件命令、编程读写 HDFS HDFS(Hadoop Distributed File System)是一种分布式文件系统,用于存储和管理大规模数据。它是 Hadoop 云计算平台的核心组件之一,提供了高效、可靠、可扩展的数据存储...

    hdfs源码分析整理

    hdfs源码分析整理 在分布式文件系统中,HDFS(Hadoop Distributed File System)扮演着核心角色,而HDFS的源码分析则是深入了解HDFS架构和实现机理的关键。本文将对HDFS源码进行详细的分析和整理,涵盖了HDFS的目录...

    大数据实验二-HDFS编程实践

    ### 大数据实验二-HDFS编程实践 #### 实验内容概览 本次实验的主要目标是通过对HDFS(Hadoop Distributed File System)的操作实践,加深学生对HDFS在Hadoop架构中的作用及其基本操作的理解。实验内容包括两大部分...

    14、HDFS 透明加密KMS

    【HDFS 透明加密KMS】是Hadoop分布式文件系统(HDFS)提供的一种安全特性,用于保护存储在HDFS中的数据,确保数据在传输和存储时的安全性。HDFS透明加密通过端到端的方式实现了数据的加密和解密,无需修改用户的应用...

    HDFS文件的查看

    hdfs文件的查看 hdfs fs -cat /文件名

    HDFS shell

    例如,`hdfs fsck / -files -blocks -locations` 对根目录进行文件系统完整性检查。 这些命令使得运维人员和开发者可以高效地管理存储在HDFS上的大量数据。无论是进行日志分析、数据备份、数据迁移,还是集群维护,...

    基于spring-boot和hdfs的网盘.zip

    标题中的“基于spring-boot和hdfs的网盘.zip”表明这是一个使用Spring Boot框架构建的网盘应用,它集成了Hadoop分布式文件系统(HDFS)。这个应用可能允许用户存储、检索和管理他们的文件在分布式环境中的存储。让...

    hdfs-over-ftp安装包及说明

    【标题】"hdfs-over-ftp安装包及说明"涉及的核心技术是将FTP(File Transfer Protocol)服务与HDFS(Hadoop Distributed File System)相结合,允许用户通过FTP协议访问和操作HDFS上的数据。这个标题暗示了我们将在...

    HDFS基本命令.docx

    HDFS基本命令 HDFS(Hadoop Distributed File System)是一种分布式文件系统,提供了对大规模数据的存储和管理能力。在HDFS中,基本命令是最基础也是最常用的命令,掌握这些命令是使用HDFS的基础。本节我们将详细...

    hdfs-java-api

    HDFS Java API 详解 HDFS(Hadoop Distributed File System)是 Hadoop 项目中的一部分,是一个分布式文件系统。HDFS Java API 是一组 Java 类库,提供了一组接口来操作 HDFS。下面我们将对 HDFS Java API 进行详细...

    大数据技术基础实验报告-HDFS常用操作命令.doc

    在大数据技术领域,Hadoop 分布式文件系统(HDFS)是核心组件之一,它为大规模数据存储提供了可扩展和高容错性的解决方案。本实验报告主要关注HDFS的常用操作命令,这些命令是管理员和数据分析师日常工作中不可或缺...

    Hdfs基本操作1

    6. 查看文件详细信息:使用命令 `hdfs fsck 文件名 -files -blocks -locations` 查看一个文件的详细信息。 7. 检测块丢失:使用命令 `hdfs fsck -list-corruptfileblocks` 检测块丢失的情况。 8. datanode 加载失败...

Global site tag (gtag.js) - Google Analytics