Original:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
Translation:
The problem with large numbers of small files in HDFS
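A small file is one that is much smaller than the HDFS block size (64 MB by default). If you are storing small files in HDFS, you almost certainly have a great many of them (otherwise you would not be using Hadoop),
and the problem is that HDFS cannot handle large numbers of small files efficiently.
Every file, directory and block in HDFS is represented as an object in the namenode's memory, and each object takes up roughly 150 bytes. So 10 million files,
each occupying one block, amount to 20 million objects (one per file plus one per block) and consume about 3 GB of namenode memory. Scaling much beyond that exceeds what current hardware can support.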
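The 3 GB figure can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name and its parameters are made up for illustration, and 150 bytes per object is the rule of thumb quoted above, not a measured value.

```python
# Rule-of-thumb estimate of the namenode memory used by HDFS metadata.
BYTES_PER_OBJECT = 150  # approximate size of one file, directory or block object


def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate memory as one object per file, per block and per directory."""
    num_objects = num_files + num_files * blocks_per_file + num_dirs
    return num_objects * BYTES_PER_OBJECT


estimate = namenode_memory_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB")  # 3.0 GB
```

The estimate grows linearly with the number of files, so the file count, not the total data volume, is what drives namenode memory.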
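Furthermore, HDFS was not built for efficient access to small files: it is designed primarily for streaming access to large files. Reading many small files usually means many seeks and
a lot of hopping from datanode to datanode to retrieve each file, which is a very inefficient access pattern.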
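The problem with small files in MapReduce
A map task usually processes one block of input at a time (with the default FileInputFormat). If the files are very small and there are many of them, each map task processes very little input,
and the job spawns a large number of map tasks, each of which carries its own bookkeeping overhead. Compare a single 1 GB file, which splits into 16 blocks at the default 64 MB block size, with 1 GB of data stored as roughly 10,000 files of 100 KB each.
In the second case every small file gets its own map task, and the job can run tens or even hundreds of times slower than in the first.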
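The difference in task counts can be worked out directly. The sketch below assumes the simplest model (one map task per block, and no block spanning two files); `map_tasks` is a hypothetical helper, not a Hadoop API.

```python
import math

KB = 1024
MB = 1024 * KB


def map_tasks(file_sizes, block_size=64 * MB):
    """Count map tasks, assuming one task per block and no block shared between files."""
    return sum(math.ceil(size / block_size) for size in file_sizes)


one_big_file = [1024 * MB]               # a single 1 GB file
many_small_files = [100 * KB] * 10_000   # about 1 GB in 100 KB pieces

print(map_tasks(one_big_file))      # 16
print(map_tasks(many_small_files))  # 10000
```

Roughly the same amount of data ends up handled by 625 times as many tasks, each paying its own startup and bookkeeping cost.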
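Hadoop has a couple of features that ease this problem. Task JVM reuse lets several map tasks run in one JVM, which saves some JVM startup cost
(set the mapred.job.reuse.jvm.num.tasks property; the default is 1, and -1 means no limit). The other option is MultiFileInputSplit, which lets a single map process more than one split.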
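The idea of handing several small files to one map can be illustrated with a simple greedy grouping. This is a simplified model of the concept only: it is not how Hadoop's MultiFileInputSplit selects files, and the function and file names are hypothetical.

```python
def group_into_splits(files, max_split_bytes):
    """Greedily pack (name, size) pairs, in order, into groups of at most max_split_bytes."""
    splits, current, current_bytes = [], [], 0
    for name, size in files:
        # Start a new group when adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


KB, MB = 1024, 1024 * 1024
files = [(f"img-{i:05d}.png", 100 * KB) for i in range(10_000)]
splits = group_into_splits(files, max_split_bytes=64 * MB)

print(len(splits))     # 16
print(len(splits[0]))  # 655
```

With a 64 MB limit the 10,000 small files collapse into 16 groups, which is the same number of map tasks as the single 1 GB file would need.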
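Why are so many small files produced?
There are at least two cases in which large numbers of small files are produced:
1. The small files are pieces of one larger logical file. Because HDFS only recently gained support for appending to files, the usual way to store unbounded files (log files, for example) has been to write the data into HDFS in many chunks.
2. The files are inherently small, for example a large collection of small images. Each image is a separate file, and there is no effective way to merge them into one large file.
These two cases call for different solutions. In the first case, where the file is made up of many records, the problem can be avoided by calling HDFS's sync() method periodically (in combination with append) so that data keeps going into one large file. Alternatively, you can write a program to merge the small files (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).
In the second case, some kind of container is needed to group the files. Hadoop offers a few options: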
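* HAR files
Hadoop Archives (HAR files) were introduced in version 0.18.0 to ease the pressure that large numbers of small files put on namenode memory. A HAR file works by building a layered filesystem on top of HDFS. It is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into the archive. For a client, nothing changes when using a HAR: all of the original files remain visible and accessible (through a har:// URL). On the HDFS side, however, the number of files is reduced.
Reading a file through a HAR is no more efficient than reading it directly from HDFS, and may in fact be slightly slower, because each access to a file in a HAR requires reading two levels of index files as well as the file data itself (see the diagram in the original article). And although HAR files can be used as input to a MapReduce job, there is no special mechanism that lets maps treat the files packed in a HAR as one HDFS file. It should be possible to build an input format that exploits HAR files to make MapReduce more efficient, but nobody has written one yet. Note that MultiFileInputSplit, even with the improvement in HADOOP-4565 (choose files in a split that are node local), still needs a seek per small file.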
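* Sequence Files
The usual response to "the small files problem" is: use a SequenceFile. The idea is to use the filename as the key and the file contents as the value, and in practice this works very well. Going back to the 10,000 files of 100 KB each, you can write a program that puts them all into a single SequenceFile and then process that SequenceFile in a streaming fashion (directly or using MapReduce). In addition, SequenceFiles are splittable, so MapReduce can break them into chunks and process each chunk independently. Unlike HARs, they also support compression. Block compression is the best choice in most cases, because it compresses several records together rather than one record at a time.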
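The advantage of compressing records together can be seen with any general-purpose codec. The sketch below uses Python's zlib purely as a stand-in: it does not reproduce SequenceFile's actual record or block layout, and the log-like records are invented for illustration.

```python
import zlib

# Hypothetical log-like records that share a lot of structure.
records = [
    f"2012-05-20 16:02:{i % 60:02d} INFO task_{i:06d} finished ok\n".encode()
    for i in range(1000)
]

# Record compression: each record is compressed on its own.
per_record = sum(len(zlib.compress(r)) for r in records)

# Block compression: many records are compressed together.
per_block = len(zlib.compress(b"".join(records)))

print(per_record, per_block)
```

Compressing the records as one block lets the codec exploit the redundancy between records, so the block total comes out smaller than the sum of the individually compressed records.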
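Converting a large number of existing small files into SequenceFiles can be slow. However, it is entirely possible to create a collection of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile; tools like this are very useful.) Going further, if possible it is best to design your data pipeline to write data directly into a SequenceFile.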
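The "filename as key, contents as value" idea can be sketched with a toy container. This is not the real SequenceFile on-disk format (which the text above does not describe), only a minimal length-prefixed layout; `pack`, `unpack` and the in-memory buffer standing in for an HDFS file are assumptions made for illustration.

```python
import io
import struct

HEADER = ">II"  # big-endian key length and value length


def pack(files, out):
    """Append each (name, data) pair to out as a length-prefixed key and value."""
    for name, data in files:
        key = name.encode("utf-8")
        out.write(struct.pack(HEADER, len(key), len(data)))
        out.write(key)
        out.write(data)


def unpack(inp):
    """Yield (name, data) pairs in the order they were written."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = inp.read(header_size)
        if len(header) < header_size:
            return  # end of container
        key_len, value_len = struct.unpack(HEADER, header)
        name = inp.read(key_len).decode("utf-8")
        data = inp.read(value_len)
        yield name, data


small_files = [("a.txt", b"alpha"), ("b.txt", b"bravo"), ("c.txt", b"")]
container = io.BytesIO()
pack(small_files, container)
container.seek(0)
print(list(unpack(container)) == small_files)  # True
```

Many small files become one large sequential stream, which is the access pattern HDFS is designed for; a reader walks the container front to back instead of opening each file separately.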