The Hadoop File System

nlslzf

http://blogger.org.cn/blog/more.asp?name=bg1011&id=30853
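The heart of Hadoop's file system layer is the FileSystem class, together with its two subclasses, LocalFileSystem and DistributedFileSystem. This post starts with FileSystem.
FileSystem is an abstract class that provides a set of interfaces for operating on files and directories, plus a number of helper methods. They fall into three groups: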
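1. open, create, delete, rename and so on: non-abstract. Some of them return an FSDataInputStream or FSDataOutputStream, so the file is handled as a stream.
2. openRaw, createRaw, renameRaw, deleteRaw and so on: abstract. Some of them return an FSInputStream, which supports random access.
3. lock, release, copyFromLocalFile, moveFromLocalFile, copyToLocalFile and so on: abstract convenience methods whose purpose is clear from their names.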
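The split between groups 1 and 2 can be pictured as a template-method arrangement: a concrete public method wraps whatever the abstract "Raw" method returns. The following is only a minimal sketch of that idea, not Hadoop's code. SketchFileSystem, EchoFileSystem and OpenRawSketch are hypothetical names, and a plain DataInputStream stands in for FSDataInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class OpenRawSketch {
    // Reads a whole "file" through the public, non-abstract open() method.
    static String readAll(SketchFileSystem fs, String path) {
        try (DataInputStream in = fs.open(path)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(new EchoFileSystem(), "/tmp/demo.txt"));
    }
}

// Hypothetical stand-in for FileSystem: open() is concrete,
// openRaw() is the only method a backend has to supply.
abstract class SketchFileSystem {
    // Concrete: wraps the raw stream in a data stream for the caller.
    public DataInputStream open(String path) throws IOException {
        return new DataInputStream(openRaw(path));
    }

    // Abstract: each backend decides how the bytes are actually fetched.
    protected abstract InputStream openRaw(String path) throws IOException;
}

// Toy backend (an assumption for illustration): the "content" of a file
// is simply the bytes of its own path string.
class EchoFileSystem extends SketchFileSystem {
    @Override
    protected InputStream openRaw(String path) {
        return new ByteArrayInputStream(path.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this shape, cross-cutting behavior such as checksum handling can live once in the concrete methods while each backend only implements the raw operations.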
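One point deserves special mention: in Hadoop's file system every file has a checksum, stored in a companion crc file. Parts of the FileSystem code therefore treat this specially, rename being one example.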
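To make the consequence for rename concrete, here is a small sketch using only the JDK. It assumes a companion file named ".<name>.crc" beside the data file and an 8-byte CRC32 value as its content; both the naming scheme and the on-disk format are assumptions for illustration, and CrcRenameSketch is a hypothetical class, not part of Hadoop.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

class CrcRenameSketch {
    // Assumed naming scheme: ".<name>.crc" in the same directory.
    static Path crcPath(Path file) {
        return file.resolveSibling("." + file.getFileName() + ".crc");
    }

    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Writes the data file and its checksum companion.
    static void write(Path file, byte[] data) {
        try {
            Files.write(file, data);
            byte[] crcBytes = ByteBuffer.allocate(Long.BYTES).putLong(checksum(data)).array();
            Files.write(crcPath(file), crcBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A rename has to move two files, otherwise the checksum is orphaned.
    static void rename(Path src, Path dst) {
        try {
            Files.move(src, dst);
            Files.move(crcPath(src), crcPath(dst));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Recomputes the checksum and compares it with the stored one.
    static boolean verify(Path file) {
        try {
            long stored = ByteBuffer.wrap(Files.readAllBytes(crcPath(file))).getLong();
            return stored == checksum(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write, rename, then check that the checksum travelled with the file.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("crc-sketch");
            Path src = dir.resolve("a.txt");
            Path dst = dir.resolve("b.txt");
            write(src, "hello".getBytes(StandardCharsets.UTF_8));
            rename(src, dst);
            return verify(dst) && !Files.exists(crcPath(src)) && Files.exists(crcPath(dst));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("checksum still valid after rename: " + demo());
    }
}
```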
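LocalFileSystem and DistributedFileSystem are meant to be transparent to the user, so I will not analyze them in depth here; I will only describe them together with FSDataInputStream and FSInputStream.
Looking at the getFileCacheHints method of the two subclasses, LocalFileSystem uses 'localhost' as the host name. My tentative guess is that both file systems exchange data over a network, one across the Internet and the other within an intranet (although, as described next, the LocalFileSystem streams themselves work on local files through FileChannel).
LocalFileSystem contains two inner classes, LocalFSFileInputStream and LocalFSFileOutputStream, and the code shows that they operate through FileChannel. In addition, the lock and release methods use a TreeMap to keep track of files and their corresponding locks.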
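The two mechanisms just mentioned, positional reads through a FileChannel and a TreeMap from file to lock, can be sketched with the JDK alone. This is an illustrative approximation rather than the LocalFileSystem source: LocalChannelSketch and its method names are hypothetical, and the choice of java.nio FileLock as the lock object is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.TreeMap;

class LocalChannelSketch {
    // Path string -> lock currently held on that file.
    private final TreeMap<String, FileLock> locks = new TreeMap<>();

    // Random access: read `length` bytes starting at `position`.
    static String readAt(Path file, long position, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            while (buf.hasRemaining()) {
                int n = ch.read(buf, position + buf.position());
                if (n < 0) break; // reached end of file
            }
            return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Takes an exclusive lock and remembers it in the map.
    synchronized void lock(Path file) {
        String key = file.toString();
        if (locks.containsKey(key)) throw new IllegalStateException("already locked: " + key);
        try {
            FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock held = ch.tryLock();
            if (held == null) {
                ch.close();
                throw new IllegalStateException("locked by another process: " + key);
            }
            locks.put(key, held);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Releases the lock and closes the channel that backs it.
    synchronized void release(Path file) {
        FileLock held = locks.remove(file.toString());
        if (held == null) return;
        try {
            held.release();
            held.channel().close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized boolean isLocked(Path file) {
        return locks.containsKey(file.toString());
    }

    // Helper for the demo: creates a temp file holding `text`.
    static Path tempFileWith(String text) {
        try {
            Path f = Files.createTempFile("local-sketch", ".txt");
            Files.write(f, text.getBytes(StandardCharsets.UTF_8));
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path f = tempFileWith("hello hadoop");
        System.out.println(readAt(f, 6, 6));
        LocalChannelSketch fs = new LocalChannelSketch();
        fs.lock(f);
        System.out.println("locked: " + fs.isLocked(f));
        fs.release(f);
    }
}
```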
DistributedFileSystem has less code than LocalFileSystem, but it is more complex. It uses a DFSClient to carry out operations on the distributed file system:
    public DistributedFileSystem(InetSocketAddress namenode, Configuration conf) throws IOException
    {
      super(conf);
      this.dfs = new DFSClient(namenode, conf);
      this.name = namenode.getHostName() + ":" + namenode.getPort();
    }
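The DFSClient class takes an InetSocketAddress and a Configuration as input and encapsulates the details of network transfer. The vast majority of methods in DistributedFileSystem simply call DFSClient, so it is only a wrapper. The rest of this post focuses on DFSClient.
DFSClient communicates over the network mainly through RPC rather than using Socket directly inside the class. To understand the transfer details, look at the three classes in the org.apache.hadoop.ipc package.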
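The general shape of "call an interface, let an RPC layer do the transport" can be illustrated with a JDK dynamic proxy. This is only a sketch of the technique: NameService, InMemoryNameService and RpcProxySketch are hypothetical, and the network hop is replaced by an in-process reflective call, so nothing here reflects the actual wire format of org.apache.hadoop.ipc.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RpcProxySketch {
    // Hypothetical protocol interface shared by client and server.
    interface NameService {
        boolean mkdirs(String path);
        boolean exists(String path);
    }

    // Hypothetical server-side implementation.
    static class InMemoryNameService implements NameService {
        private final Set<String> dirs = new HashSet<>();
        public boolean mkdirs(String path) { return dirs.add(path); }
        public boolean exists(String path) { return dirs.contains(path); }
    }

    // Builds a client stub. Every call on the stub reaches the handler, which
    // in a real RPC layer would serialize the method name and arguments, send
    // them over a connection and read the result back. Here the "wire" is a
    // reflective call on the server object, and wireLog records what was sent.
    static NameService connect(NameService server, List<String> wireLog) {
        InvocationHandler handler = (proxy, method, args) -> {
            wireLog.add(method.getName());
            Method target = NameService.class.getMethod(method.getName(), method.getParameterTypes());
            return target.invoke(server, args);
        };
        return (NameService) Proxy.newProxyInstance(
                NameService.class.getClassLoader(),
                new Class<?>[] { NameService.class },
                handler);
    }

    public static void main(String[] args) {
        List<String> wireLog = new ArrayList<>();
        NameService client = connect(new InMemoryNameService(), wireLog);
        client.mkdirs("/user/demo");
        System.out.println("exists: " + client.exists("/user/demo"));
        System.out.println("calls sent: " + wireLog);
    }
}
```

The caller only ever sees the interface, which is why a class like DFSClient can stay free of socket handling.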
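Paths inside DFSClient are mostly of type UTF8 rather than String. DistributedFileSystem converts between the two with getPath and getDFSPath, which keeps the path format standard and the transmitted data consistent.
Most methods in DFSClient also delegate directly to namenode, a field of type ClientProtocol, so the focus here is on the remaining methods.
The LeaseChecker inner class is a daemon thread that periodically calls renewLease on the namenode. Its comment explains: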
Client programs can cause stateful changes in the NameNode that affect other clients. A client may obtain a file and neither abandon nor complete it. A client might hold a series of locks that prevent other clients from proceeding. Clearly, it would be bad if a client held a bunch of locks that it never gave up. This can happen easily if the client dies unexpectedly. So, the NameNode will revoke the locks and live file-creates for clients that it thinks have died. A client tells the NameNode that it is still alive by periodically calling renewLease(). If a certain amount of time passes since the last call to renewLease(), the NameNode assumes the client has died.
In other words, it acts as a heartbeat check on the client: if the client dies, the NameNode releases the locks it was holding.
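The lease mechanism described in that comment can be sketched in a few lines. This is a simplified model, not the DFSClient or NameNode code: LeaseSketch, LeaseTable and startLeaseChecker are hypothetical names, the expiry and interval values are arbitrary, and the clock is injected so that expiry can be reasoned about without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

class LeaseSketch {
    // Server side: remembers when each client last renewed its lease.
    static class LeaseTable {
        private final Map<String, Long> lastRenewal = new ConcurrentHashMap<>();
        private final long expiryMillis;
        private final LongSupplier clock;

        LeaseTable(long expiryMillis, LongSupplier clock) {
            this.expiryMillis = expiryMillis;
            this.clock = clock;
        }

        void renewLease(String client) {
            lastRenewal.put(client, clock.getAsLong());
        }

        // A client is presumed dead once it has been silent for expiryMillis.
        boolean holdsLease(String client) {
            Long last = lastRenewal.get(client);
            return last != null && clock.getAsLong() - last < expiryMillis;
        }
    }

    // Client side: a daemon thread that renews the lease at a fixed interval
    // until it is interrupted.
    static Thread startLeaseChecker(LeaseTable server, String client, long intervalMillis) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                server.renewLease(client);
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    return; // simulate the client going away
                }
            }
        }, "lease-checker");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        LeaseTable nameNode = new LeaseTable(200, System::currentTimeMillis);
        Thread checker = startLeaseChecker(nameNode, "client-1", 50);
        Thread.sleep(300);
        System.out.println("while renewing: " + nameNode.holdsLease("client-1"));
        checker.interrupt();
        checker.join();
        Thread.sleep(300);
        System.out.println("after the client stopped: " + nameNode.holdsLease("client-1"));
    }
}
```

Making the renewing thread a daemon means it never keeps the JVM alive on its own, so a client that exits simply stops renewing and its lease lapses.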
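DFSInputStream and DFSOutputStream are more complex than their counterparts in LocalFileSystem. They also work through ClientProtocol, and they use data structures from the org.apache.hadoop.dfs package such as DataNode and Block; those details are not analyzed here.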

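This concludes part (1) of the FileSystem analysis. Personally, I think its encapsulation is done quite well, and since being split out of the Nutch project it has become clearer than before.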

2010-03-05 21:02