A post about KFS vs. HDFS
October 02, 2007
Advantages of Kosmix's KFS vs. HDFS
I was excited to learn last week that my friends at Kosmix have decided to open source a project long in the works: the Kosmix Distributed File System, or KFS (see the official blog post). A number of people have commented on this release, including Ethan Stock of zVents, who plans to use KFS along with their HyperTable clone of BigTable, and Rich Skrenta, who gives an excellent list of features of KFS.
Now, as a dumb product manager, my biggest questions were about KFS vs. HDFS, the distributed file system built by the Hadoop project. Powerset already makes extensive use of the Hadoop stack, including HDFS. So I asked Sriram Rao, the lead engineer of KFS, if he could explain to me what the difference is between HDFS and KFS. Here are some of his answers, which I think give more insight into why Kosmix chose to build KFS.
-
So why did Kosmix build KFS instead of using HDFS? Apparently, KFS and HDFS were developed in parallel. The implementation work ran from 2006 to 2007, and Kosmix now feels it's in a releasable state. One of the reasons to stick with KFS over HDFS is that HDFS is written in Java while Kosmix's back-end is written in C++, and they were worried about the speed of calling through JNI.
-
File writing - HDFS files are written once and read many times. When writing a file, you have to write from the start to the end, and that is it. In KFS, by contrast, you can write to a file as many times as you want, write anywhere in the file (i.e., seek and write), and append to an existing file. I've heard that Yahoo is working to lift this limitation in HDFS, but it isn't implemented yet.
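To make the difference in write semantics concrete, here is a minimal Python sketch. It is a toy in-memory model, not the KFS or HDFS client API: the class names `WriteOnceFile` and `RandomWriteFile` and all of their methods are made up for illustration.

```python
import io


class WriteOnceFile:
    """Toy model of write-once semantics: bytes go in start to end, then the file is closed."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self._closed = False

    def write(self, data: bytes) -> None:
        # No offset parameter: the only legal position is the current end of the file.
        if self._closed:
            raise IOError("write-once file is closed and cannot be written again")
        self._buf += data

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        return bytes(self._buf)


class RandomWriteFile:
    """Toy model of seek-and-write semantics: overwrite anywhere, append any time."""

    def __init__(self) -> None:
        self._buf = io.BytesIO()

    def write_at(self, offset: int, data: bytes) -> None:
        # Seek to an arbitrary position and overwrite in place.
        self._buf.seek(offset)
        self._buf.write(data)

    def append(self, data: bytes) -> None:
        # Appending is just a write positioned at the current end of the file.
        self._buf.seek(0, io.SEEK_END)
        self._buf.write(data)

    def read(self) -> bytes:
        return self._buf.getvalue()
```

With the second model, `append(b"hello world")`, then `write_at(0, b"J")`, then `append(b"!")` leaves the file holding `b"Jello world!"`. The first model has no way to express the middle step, and it rejects any write after `close()`.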
-
Data integrity - Currently, with HDFS, after you write to a file, the data becomes “visible” to other apps only when the application closes the file. So, if the process were to crash before closing, the data written is lost. With KFS, the data becomes visible as soon as it gets pushed out to the chunkservers. For performance, clients cache data; when the cache is full or when the application chooses, the data gets flushed out.
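The two visibility rules can be sketched with a toy model. Everything here is hypothetical (the `ChunkStore` and `BufferedWriter` names, the `cache_size` parameter, the `visible_on_close_only` switch); it only mirrors the behavior described above and is not either system's real client code.

```python
class ChunkStore:
    """Stands in for the chunkservers: whatever lands here is visible to other readers."""

    def __init__(self) -> None:
        self.visible = bytearray()


class BufferedWriter:
    """Client-side write cache with two visibility policies (illustrative only)."""

    def __init__(self, store: ChunkStore, cache_size: int, visible_on_close_only: bool) -> None:
        self._store = store
        self._cache = bytearray()
        self._cache_size = cache_size
        self._close_only = visible_on_close_only

    def write(self, data: bytes) -> None:
        self._cache += data
        # Flush-on-full policy: push the cache out once it reaches its size limit.
        if not self._close_only and len(self._cache) >= self._cache_size:
            self._push()

    def flush(self) -> None:
        # The application may also choose to flush early under the flushing policy.
        if not self._close_only:
            self._push()

    def close(self) -> None:
        # Under both policies, closing publishes whatever is still cached.
        self._push()

    def _push(self) -> None:
        self._store.visible += self._cache
        self._cache.clear()


# Simulate a crash by simply never calling close().
close_only_store = ChunkStore()
w1 = BufferedWriter(close_only_store, cache_size=8, visible_on_close_only=True)
w1.write(b"0123456789")  # stays in the client cache; nothing is visible

flushing_store = ChunkStore()
w2 = BufferedWriter(flushing_store, cache_size=8, visible_on_close_only=False)
w2.write(b"0123456789")  # cache limit reached, so these bytes are pushed out
w2.write(b"ab")          # still cached; a crash now would lose only these two bytes
```

In this sketch, a crash under the close-only policy loses everything written so far, while under the flushing policy it loses only what is still sitting in the client cache.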
-
Data rebalancing - KFS has rudimentary support for automatic rebalancing. When you add new nodes, or when space utilization changes among nodes, the system may migrate chunks from over-utilized nodes to under-utilized nodes. HDFS doesn’t have such support now.
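As a rough illustration of what migrating chunks from over-utilized to under-utilized nodes can look like, here is a greedy Python sketch. It is not KFS's actual algorithm, which the answers above don't describe. It assumes equally sized chunks and equal-capacity nodes, ignores replica placement, and uses the hypothetical names `plan_rebalance` and `max_skew`.

```python
from typing import Dict, List, Tuple


def plan_rebalance(
    chunks_by_node: Dict[str, List[str]], max_skew: int = 1
) -> List[Tuple[str, str, str]]:
    """Greedily move chunks from the fullest node to the emptiest one.

    chunks_by_node maps a node name to the chunk ids it stores and is updated in place.
    Returns the planned migrations as (chunk_id, source_node, destination_node).
    """
    if max_skew < 1:
        # A skew of 0 is unreachable when chunks don't divide evenly across nodes.
        raise ValueError("max_skew must be at least 1")

    moves: List[Tuple[str, str, str]] = []
    while True:
        src = max(chunks_by_node, key=lambda n: len(chunks_by_node[n]))
        dst = min(chunks_by_node, key=lambda n: len(chunks_by_node[n]))
        # Stop once the fullest and emptiest nodes are within the allowed skew.
        if len(chunks_by_node[src]) - len(chunks_by_node[dst]) <= max_skew:
            return moves
        chunk = chunks_by_node[src].pop()
        chunks_by_node[dst].append(chunk)
        moves.append((chunk, src, dst))


# A freshly added, empty node ("new") joins a lopsided two-node cluster.
cluster = {"a": ["c1", "c2", "c3", "c4"], "b": ["c5"], "new": []}
migrations = plan_rebalance(cluster)
```

After the call, no node holds more than one chunk more than any other. A real system would also have to weigh disk capacity, network cost, and where the other replicas of each chunk live.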
Hopefully I transcribed these accurately! Definitely check out the KFS project; the more people contributing, the better. Powerset will be evaluating KFS in the coming weeks to see if it has any features that would put us ahead of where we are with HDFS.