今天将临时表里面的数据按照天分区插入到线上的表中去,出现了Hive创建的文件数大于100000个的情况,我的SQL如下:
hive> insert overwrite table test partition(dt) > select * from table_tmp;
table_tmp表里面一共有570多G的数据,一共可以分成76个分区,SQL运行的时候创建了2163个Mapper,0个Reducers。程序运行到一般左右的时候出现了以下的异常:
[Fatal Error] total number of created files now is 100385, which exceeds 100000. Killing the job.
并最终导致了SQL的运行失败。这个错误的原因是因为Hive对创建文件的总数有限制(hive.exec.max.created.files
),默认是100000个,而这个SQL在运行的时候每个Map都会创建76个文件,对应了每个分区,所以这个SQL总共会创建2163 * 76 = 164388个文件,运行中肯定会出现上述的异常。为了能够成功地运行上述的SQL,最简单的方法就是加大hive.exec.max.created.files
参数的设置。但是这有个问题,这会导致在hadoop中产生大量的小文件,因为table_tmp表的数据就570多G,那么平均每个文件的大小=570多G / 164388 = 3.550624133148405MB,可想而知,十万多个这么小的小文件对Hadoop来说是多么不好。那么有没有好的办法呢?有!
我们可以将dt相同的数据放到同一个Reduce处理,这样最多也就产生76个文件,将dt相同的数据放到同一个Reduce可以使用DISTRIBUTE BY dt
实现,所以修改之后的SQL如下:
hive> insert overwrite table test partition(dt) > select * from table_tmp > DISTRIBUTE BY dt;
修改完之后的SQL运行良好,并没有出现上面的异常信息,但是这里也有个问题,因为这76个分区的数据分布很不均匀,有些Reduce的数据有30多G,而有些Reduce只有几K,直接导致了这个SQL运行的速度很慢!
能不能将570G的数据均匀的分配给Reduce呢?可以!我们可以使用DISTRIBUTE BY rand()
将数据随机分配给Reduce,这样可以使得每个Reduce处理的数据大体一致。我设定每个Reduce处理5G的数据,对于570G的数据总共会起110左右的Reduces,修改的SQL如下:
hive> set hive.exec.reducers.bytes.per.reducer=5120000000; hive> insert overwrite table test partition(dt) > select * from iteblog_tmp > DISTRIBUTE BY rand();
这个SQL运行的时间很不错,而且生产的文件数量为Reduce的个数*分区的个数,不到1W个文件。
相关推荐
The boot.ini option /3GB was created for those cases where systems actually support greater than 2 GB of physical memory and an application can make use of it This capability allows memory intensive ...
which is created. The archive must not already exist. File names may specify a path, which is stored. If there are no file names on the command line, then PAQ6 prompts for them, reading until the ...
标题 "this exceeds GitHub's file size limit of 100.00 MB" 提醒我们,当单个文件超过 100MB 时,GitHub 将无法接受这样的提交。描述中提供了几种应对策略,这里我们将详细讨论这些方法。 首先,Git Large File ...
This book is a collection of in-depth tutorials, that will guide you through some fun and practical projects. Along the way, you’ll pick up lots of useful development tips. It contains: How to ...
File size exceeds the maximum allowed limit(解决方案).md
An isolation level determines the degree to which data is isolated for use by one process and guarded against interference from other processes. Prior to SQL Server 7.0, REPEATABLE READ and ...
The INTEGER data type converts numbers.... If the number to be returned exceeds the capacity of a signed integer for the system, Oracle Database returns an "overflow on conversion" error.
The digital section is internally operated at +2.5 V, irrespective of the value of VDD, by an on board regulator which steps down VDD to +2.5 V, when VDD exceeds +2.5 V. The AD9833 has a power-down ...
The purpose of the toolkit is to facilitate creating experimental Bayes nets that analyze sequences of events. The toolkit provides code to help with the following: (a) creating Bayes nets. There are ...
separated by commas, for the plan achieving the maximum number of fish expected to be caught (you should print the entire plan on one line even if it exceeds 80 characters). This is followed by a ...
Recognizing arbitrary multi-character text in unconstrained natural ...operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.
tree cover: >= min canopy cover & area (light green): Tile error: Number of pixels requested from Image.reproject exceeds the maximum allowed (2^31). 问题解析 这里的问题就是原始的数据文件全球一张图的...
(3)Software quality assurance is now an of software engineering. (4)Assessment of software quality still relies on . (5)We are not yet capable of quantifying . (6)At each stage ...
You can significantly minimize the number of header files you need to include in your own header files by using forward declarations. For example, if your header file uses the File class in ways that ...
The syntax of the file is extremely simple. Whitespace and lines ; beginning with a semicolon are silently ignored (as you probably guessed). ; Section headers (e.g. [Foo]) are also silently ignored,...
already exceeds the number of processors in PCs, and this trend is expected to continue. According to forecasts, the size of embedded software will also increase at a large rate. Another kind of Moore...
based digital predistorter (DPD) outperforms rectified linear unit (ReLU) activation by up to 2 dB even when the number of layers of the network is increased. When the number of coefficients exceeds ...
Grading hundreds of thousands of Graduate Entrance Exams is a hard work. It is even harder to design a process to make the results as fair as possible. One way is to assign each exam problem to 3 ...
- "The smallest sector in the pie chart represents tourism, which only takes up 5% of the total market share." 2. 饼状图的常用套句: - "The pie chart provides an insight into the distribution of ...