
[Repost] KFS, a File System That Clones GFS


KFS (Kosmos Distributed File System) is an open-source distributed file system similar to GFS and to HDFS in Hadoop.

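PS: Open-source counterparts of Google's three cornerstones (GFS, Bigtable, MapReduce). GFS: KFS (reportedly created by university classmates of Google's founders) and HDFS (a Hadoop subproject). Bigtable: HBase (a Hadoop subproject) and Hypertable (split off from the HBase project team and implemented in C++). MapReduce: Hadoop (an Apache project implemented in Java; its founder is now building it out full-time at Yahoo, where it already runs parallel computations on more than 2,000 nodes).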

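Anand Rajaraman and Venky Harinarayan, two university classmates (both Indian) of Google's two co-founders, founded a new search engine, Kosmix, which recently donated KFS, a file system that clones GFS. The Hadoop and Hypertable projects have also begun to support KFS as their underlying storage. KFS is written in C++, but its client supports C++, Java, and Python. So what features does KFS offer?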

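  1. Storage expansion (new chunkservers can be added, and the system detects them automatically)
  2. Availability (replication keeps files available)
  3. Load balancing (the system periodically checks disk utilization on the chunkservers and rebalances it; HDFS does not support this yet)
  4. Data integrity (data is verified when it is read; if verification fails, another replica is used to overwrite the current copy)
  5. FUSE support (HDFS also has tools that support FUSE)
  6. Leases (these keep data cached by the client consistent with the files in the file system)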

Advanced features not supported by HDFS:

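  1. Multiple writes to the same file and append are supported, unlike HDFS, which supports write-once/read-many and has no append (append is being added, but it has run into many problems).
  2. Files are visible immediately: when an application creates a file, the file name takes effect in the system right away. In HDFS, by contrast, a file becomes visible only when its output stream is closed, so if the application hits an exception before closing the stream, the data is lost.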

Official website: http://kosmosfs.sourceforge.net/

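An open-source project from the startup vertical search engine http://www.kosmix.com/, and yet another open-source distributed file system similar to Google's GFS. It can be used in network applications that need to process large volumes of data, such as image storage, search engines, grid computing, and data mining. It is written in C++ and integrates well with Hadoop, so it can make full use of Hadoop's existing functionality.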

Introduction

Applications that process large volumes of data (such as search engines, grid computing applications, and data mining applications) require a backend infrastructure for storing data. Such infrastructure is required to support applications whose workload can be characterized as:

  • Primarily write-once/read-many workloads
  • A few million large files, where each file is on the order of a few tens of MB to a few tens of GB in size
  • Mostly sequential access

We have developed the Kosmos Distributed File System (KFS), a high performance distributed file system to meet this infrastructure need.

The system consists of 3 components:

  1. Meta-data server: a single meta-data server that provides a global namespace.
  2. Block server: files are split into blocks, or chunks, and stored on block servers. Block servers are also known as chunkservers. Chunkservers store the chunks as files in the underlying file system (such as XFS on Linux).
  3. Client library: provides the file system API that allows applications to interface with KFS. To use KFS, applications need to be modified and relinked with the KFS client library.
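The way these three components cooperate on a read can be sketched roughly as follows. This is only an illustrative toy model, not the KFS implementation: the `MetaServer` class, the `lookup` and `chunk_ranges` names, and the 64 MB chunk size are all assumptions made for the example (the text above does not state a chunk size).

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size; not stated in the article


@dataclass
class MetaServer:
    """Toy stand-in for the single meta-data server (hypothetical API).

    files:     path -> ordered list of chunk ids
    locations: chunk id -> list of chunkserver names holding a replica
    """
    files: dict = field(default_factory=dict)
    locations: dict = field(default_factory=dict)

    def lookup(self, path, offset):
        """Return (chunk_id, servers) for the chunk covering `offset`."""
        chunk_ids = self.files[path]
        index = offset // CHUNK_SIZE
        if index >= len(chunk_ids):
            raise ValueError("offset past end of file")
        chunk_id = chunk_ids[index]
        return chunk_id, self.locations[chunk_id]


def chunk_ranges(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range into (chunk_index, offset_in_chunk, nbytes) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size
        within = offset % chunk_size
        nbytes = min(chunk_size - within, end - offset)
        pieces.append((index, within, nbytes))
        offset += nbytes
    return pieces
```

In this sketch the client library would call `chunk_ranges` to break a read into per-chunk pieces, ask the meta-data server where each chunk lives, and then fetch the bytes directly from a chunkserver, so file data never passes through the meta-data server.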

KFS is implemented in C++. It is built using standard system components such as TCP sockets, aio (for disk I/O), STL, and boost libraries. It has been tested on 64-bit x86 architectures running Linux FC5.

While KFS can be accessed natively from C++ applications, support is also provided for Java applications. JNI glue code is included in the release to allow Java applications to access the KFS client library APIs.

Features

  • Incremental scalability: New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
  • Availability: Replication is used to provide availability in the face of chunkserver failures. Typically, files are replicated 3-way.
  • Per file degree of replication: The degree of replication is configurable on a per file basis, up to a maximum of 64.
  • Re-replication: Whenever the degree of replication for a file drops below the configured amount (for example, due to an extended chunkserver outage), the metaserver forces the affected blocks to be re-replicated on the remaining chunkservers. Re-replication is done in the background without overwhelming the system.
  • Re-balancing: Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.
  • Data integrity: To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
  • File writes: The system follows the standard model. When an application creates a file, the filename becomes part of the filesystem namespace. For performance, writes are cached at the KFS client library. Periodically, the cache is flushed and data is pushed out to the chunkservers. Also, applications can force data to be flushed to the chunkservers. In either case, once data is flushed to the server, it is available for reading.
  • Leases: The KFS client library uses caching to improve performance. Leases are used to support cache consistency.
  • Chunk versioning: Versioning is used to detect stale chunks.
  • Client side fail-over: The client library is resilient to chunkserver failures. During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail over to another chunkserver and continue the read. This fail-over is transparent to the application.
  • Language support: The KFS client library can be accessed from C++, Java, and Python.
  • FUSE support on Linux: Mounting KFS via FUSE allows existing Linux utilities (such as ls) to interface with KFS.
  • Tools: A shell binary is included in the set of tools. This allows users to navigate the filesystem tree using utilities such as cp, ls, mkdir, rmdir, rm, and mv. Tools to monitor the chunk/meta-servers are also provided.
  • Deploy scripts: To simplify launching KFS servers, a set of scripts is provided to (1) install KFS binaries on a set of nodes and (2) start/stop KFS servers on a set of nodes.
  • Job placement support: The KFS client library exports an API to determine the location of a byte range of a file. Job placement systems built on top of KFS can leverage this API to schedule jobs appropriately.
  • Local read optimization: When applications are run on the same nodes as chunkservers, the KFS client library contains an optimization for reading data locally. That is, if the chunk is stored on the same node as the one on which the application is executing, data is read from the local node.
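
The data integrity and client side fail-over features listed above can be illustrated with a small in-memory sketch. It is not KFS code: the `Replica` class, the `ChunkServerDown` exception, the `read_chunk` function, and the choice of CRC32 as the checksum are all assumptions for the example, and where exactly the check runs in KFS is not specified in the text.

```python
import zlib


class ChunkServerDown(Exception):
    """Raised by a replica that cannot be reached (hypothetical)."""


class Replica:
    """Toy in-memory chunkserver holding one copy of a chunk plus its checksum."""

    def __init__(self, name, data, up=True):
        self.name = name
        self.data = data
        # CRC32 is an assumed checksum; the article does not name the algorithm.
        self.checksum = zlib.crc32(data)
        self.up = up

    def read(self):
        if not self.up:
            raise ChunkServerDown(self.name)
        return self.data, self.checksum


def read_chunk(replicas):
    """Try replicas in order, skipping unreachable or corrupted copies.

    Returns (data, bad), where `bad` names the replicas whose checksum did
    not match and which would need to be recovered by re-replication.
    """
    bad = []
    for replica in replicas:
        try:
            data, expected = replica.read()
        except ChunkServerDown:
            continue  # fail over to the next replica
        if zlib.crc32(data) != expected:
            bad.append(replica.name)  # corrupted copy: report it, keep looking
            continue
        return data, bad
    raise IOError("no healthy replica available")
```

The caller only sees the returned bytes, which mirrors the claim that fail-over is transparent to the application; the list of bad replicas is what a background recovery step would act on.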

KFS with Hadoop

KFS has been integrated with Hadoop using Hadoop’s filesystem interfaces. This allows existing Hadoop applications to use KFS seamlessly. The integration code has been submitted as a patch to Hadoop-JIRA-1963 (this will enable distribution of the integration code with Hadoop). In addition, the code as well as instructions will also be available for download from the KFS project page shortly. As part of the integration, there is job placement support for Hadoop. That is, the Hadoop Map/Reduce job placement system can schedule jobs on the nodes where the chunks are stored.
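
As a rough illustration of what locality-aware job placement means here, the sketch below assigns one task per chunk and prefers a node that already stores that chunk. It is a simplified greedy policy written for this example only; the `place_tasks` function and its inputs are hypothetical and do not reflect the actual Hadoop scheduler or the KFS location API.

```python
def place_tasks(chunk_locations, free_slots):
    """Assign each chunk's task to a node, preferring nodes that hold the chunk.

    chunk_locations: dict of chunk id -> list of node names holding a replica
    free_slots:      dict of node name -> number of free task slots
    Returns a dict of chunk id -> (node, is_local). `free_slots` is not mutated.
    """
    slots = dict(free_slots)
    placement = {}
    for chunk_id, nodes in chunk_locations.items():
        # Nodes that both store the chunk and still have capacity.
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # No local capacity: fall back to any node with a free slot.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            node = max(candidates, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        placement[chunk_id] = (node, is_local)
    return placement
```

A task placed with `is_local` set can read its input from the local chunkserver, which is the case the local read optimization above is meant to speed up; the remaining tasks read over the network.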

