When a job is in sorting or merging phase, Hadoop leverage RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level implementation that are fast because they don't require the byte form of the object to be unmarshalled to Object form for the comparision. When writing your own Writable, it may be tempting to implement the WritableComparable interface for it's easy to implemente this interface without knowing the layout of the custom Writable layout in memory. Unfortunately, it requres Object unmarshalling from byte form which lead to inefficiency of comparisions.
In this blog post, I'll show you how to implement your custom RawComparator to avoid the inefficiencies. But by comparision, I'll implement the WritableComparable interface first, then implement RawComparator with the same custom object.
Suppose you have a custom Writable called Person, in order to make it comparable, you implement the WritableComparable like this:
import org.apache.hadoop.io.WritableComparable; import java.io.*; public class Person implements WritableComparable<Person> { private String firstName; private String lastName; public Person() { } public String getFirstName() { return firstName; } public void setFirstName(String firstName) { this.firstName = firstName; } public String getLastName() { return lastName; } public void setLastName(String lastName) { this.lastName = lastName; } @Override public void readFields(DataInput in) throws IOException { this.lastName = in.readUTF(); this.firstName = in.readUTF(); } @Override public void write(DataOutput out) throws IOException { out.writeUTF(lastName); out.writeUTF(firstName); } @Override public int compareTo(Person other) { int cmp = this.lastName.compareTo(other.lastName); if (cmp != 0) { return cmp; } return this.firstName.compareTo(other.firstName); } public void set(String lastName, String firstName) { this.lastName = lastName; this.firstName = firstName; } }
The trouble with this Comparator is that MapReduce store your intermediary map output data in byte form, and every time it needs to sort your data, it has to unmarshall it into Writable form to perform the comparison, this unmarshalling is expensive because it recreates your objects for comparison purposes.
To write a byte-level Comparator for the Person class, we have to implement the RawComparator interface. Let's revisit the Person class and look at how to do this. In the Person class, we store the two fields, firstname and last name, as string, and used the DataOutput's writableUTF method to write them out.
@Override public void write(DataOutput out) throws IOException { out.writeUTF(lastName); out.writeUTF(firstName); }
If you're going to read the javadoc of writeUTF(String str, DataOut out), you will see below statement:
* First, two bytes are written to out as if by the <code>writeShort</code>
* method giving the number of bytes to follow. This value is the number of
* bytes actually written out, not the length of the string. Following the
* length, each character of the string is output, in sequence, using the
* modified UTF-8 encoding for the character. If no exception is thrown, the
* counter <code>written</code> is incremented by the total number of
* bytes written to the output stream. This will be at least two
* plus the length of <code>str</code>, and at most two plus
* thrice the length of <code>str</code>.
This simply means that the writeUTF method writes two bytes containing the length of the string, followed by the byte form of the string.
Assume that you want to perform a lexicographical comparison that includes both the last and the first name, you can not do this with the entire byte array because the string lengths are also encoded in the array. Instead, the comparator needs to be smart enough to skip over the string lengths, as below code shown:
import org.apache.hadoop.io.WritableComparator; public class PersonBinaryComparator extends WritableComparator { protected PersonBinaryComparator() { super(Person.class, true); } @Override public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) { // Compare last name int lastNameResult = compare(b1, s1, b2, s2); // If last name is identical, return the result of comparison if (lastNameResult != 0) { return lastNameResult; } // Read the size of of the last name from the byte array int b1l1 = readUnsignedShort(b1, s1); int b2l1 = readUnsignedShort(b2, s2); // Return the comparison result on the first name return compare(b1, s1 + b1l1 + 2, b2, s2 + b2l1 + 2); } // Compare string in byte form public static int compare(byte[] b1, int s1, byte[] b2, int s2) { // Read the size of the UTF-8 string in byte array int b1l1 = readUnsignedShort(b1, s1); int b2l1 = readUnsignedShort(b2, s2); // Perform lexicographical comparison of the UTF-8 binary data // with the WritableComparator.compareBytes(...) method return compareBytes(b1, s1 + 2, b1l1, b2, s2 + 2, b2l1); } // Read two bytes public static int readUnsignedShort(byte[] b, int offset) { int ch1 = b[offset]; int ch2 = b[offset + 1]; return (ch1 << 8) + (ch2); } }
Final note: Using the writableUTF is limited because it can only support string that contain less than 65525 (two bytes) characters. If you need to work with a larger string, you should look at using Hadoop's Text class, which can support much larget strings. The implementation of Text's comparator is similar to what we completed in this blog post.
相关推荐
《批量归一化:通过减少内部协变量漂移加速深度网络训练》是深度学习领域的一篇重要论文,由Sergey Ioffe和Christian Szegedy于2015年提出。这篇论文的核心思想是引入批量归一化(Batch Normalization,BN)技术,...
在本文中,Sergey Ioffe和Christian Szegedy两位作者提出了Batch Normalization(批量归一化)技术,主要解决深度神经网络训练过程中,随着前层参数变化导致各层输入分布改变的问题。这种现象被称为内部协变量偏移...
Accelerating MATLAB Performance aims to correct this perception by describing multiple ways to greatly improve MATLAB program speed. Packed with thousands of helpful tips, it leaves no stone unturned,...
本书《MATLAB加速方法大全 Accelerating MATLAB Performance》是一本专注于提高MATLAB程序性能的实用参考书籍。它通过深入探讨MATLAB代码的性能分析、资源使用情况以及多种加速技巧,旨在改变人们对于MATLAB运行缓慢...
Accelerating MATLAB with CUDA.pdf Accelerating MATLAB with CUDA.pdf Accelerating MATLAB with CUDA.pdf Accelerating MATLAB with CUDA.pdf
Accelerating MATLAB Performance: 1001 Tips to Speed Up MATLAB Programs by Yair M. Altman English | 2014 | ISBN: 1482211300 | 785 pages | PDF | 145 MB Features Demonstrates how to improve MATLAB® ...
本篇文章《通过动力学共轭变换加快周期轨道展开的收敛速度》由高昂、谢建波和兰岳恒三位作者共同撰写,发表在《中国科技论文在线》。文章重点介绍了动力学共轭变换在加速周期轨道展开(cycle expansions)收敛速度...
《Catalyst: Accelerating Perl Web Application Development》是一本专为Perl开发者设计的书籍,旨在帮助读者快速掌握Perl MVC框架Catalyst的使用,从而高效地构建Web应用程序。MVC(Model-View-Controller)是一种...
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks 对图卷积网络进行加速的批量稀疏矩阵乘法 作者的ppt的pdf版本
### 加速MATLAB使用CUDA技术 #### 概述 MATLAB是一种强大的工具,广泛应用于原型设计与数据分析领域。为了进一步挖掘高性能计算潜力,MATLAB可以通过扩展MEX文件的方式利用最新的NVIDIA图形处理器(GPU)的强大...
这个概念由Sergey Ioffe和Christian Szegedy在2015年的论文《Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift》中首次提出。 ### 内部协变量漂移问题 在深度神经...
本文件《accelerating-sparsity-ampere-architecture.pdf》深入探讨了Sparsity的原理、性能提升的实现以及面临的主要挑战。 在Sparsity原理的介绍中,首先回顾了Sparsity的概念。生物学启发了这一研究领域,比如...
Accelerating DFA Construction by Parallelizing Subset Construction
Matlab 加速 技巧,一本非常实用的书,帮助你合理运用Matlab的函数库,加速代码运行
Dark Energy and Accelerating Universe .pdf
在当前的移动网络发展中,"Accelerating Service Creation and Deployment" 是一个关键的议题,它意味着通过创新技术提高服务创建和部署的速度与效率。随着科技的进步,未来的移动网络将变得可编程,这为服务创新...