Accelerating Comparison by Providing RawComparator

sunwinner

Category: Hadoop

When a job is in the sorting or merging phase, Hadoop uses a RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level implementations that are fast because they don't require the byte form of the object to be unmarshalled into object form for the comparison. When writing your own Writable, it may be tempting to implement only the WritableComparable interface, because that is easy to do without knowing how the custom Writable is laid out in its serialized form. Unfortunately, it requires unmarshalling objects from their byte form, which makes comparisons inefficient.

In this blog post, I'll show you how to implement your own RawComparator to avoid these inefficiencies. For comparison, I'll implement the WritableComparable interface first, then implement a RawComparator for the same custom object.

Suppose you have a custom Writable called Person. To make it comparable, you implement WritableComparable like this:

import org.apache.hadoop.io.WritableComparable;

import java.io.*;

public class Person implements WritableComparable<Person> {

    private String firstName;
    private String lastName;

    public Person() {
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.lastName = in.readUTF();
        this.firstName = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

    @Override
    public int compareTo(Person other) {
        int cmp = this.lastName.compareTo(other.lastName);
        if (cmp != 0) {
            return cmp;
        }
        return this.firstName.compareTo(other.firstName);
    }

    public void set(String lastName, String firstName) {
        this.lastName = lastName;
        this.firstName = firstName;
    }
}

The trouble with this approach is that MapReduce stores your intermediate map output data in byte form, and every time it needs to sort your data, it has to unmarshall it into Writable form to perform the comparison. This unmarshalling is expensive because it recreates your objects for comparison purposes.

To write a byte-level comparator for the Person class, we have to implement the RawComparator interface. Let's revisit the Person class and look at how to do this. In the Person class, we store the two fields, first name and last name, as strings, and use DataOutput's writeUTF method to write them out.

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(lastName);
        out.writeUTF(firstName);
    }

If you read the Javadoc of writeUTF(String str, DataOutput out) in java.io.DataOutputStream, you will see the following statement:

* First, two bytes are written to out as if by the writeShort
* method giving the number of bytes to follow. This value is the number of
* bytes actually written out, not the length of the string. Following the
* length, each character of the string is output, in sequence, using the
* modified UTF-8 encoding for the character. If no exception is thrown, the
* counter written is incremented by the total number of
* bytes written to the output stream. This will be at least two
* plus the length of str, and at most two plus
* thrice the length of str.

This simply means that the writeUTF method writes two bytes containing the encoded length of the string in bytes (not its length in characters), followed by the byte form of the string in modified UTF-8.
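To make that layout concrete, here is a small sketch that runs without Hadoop. The class name WriteUtfLayout and its helper methods are made up for illustration; only DataOutputStream.writeUTF comes from the JDK. It serializes two strings and prints the total size next to the value stored in the two-byte prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper class, used only to inspect what writeUTF produces.
class WriteUtfLayout {

    // Serializes a string with DataOutputStream.writeUTF and returns the raw bytes.
    static byte[] encode(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the two-byte, big-endian length prefix.
    // Each byte is masked with 0xff because Java bytes are signed.
    static int lengthPrefix(byte[] bytes) {
        return ((bytes[0] & 0xff) << 8) | (bytes[1] & 0xff);
    }

    public static void main(String[] args) {
        // Five ASCII characters: 2 prefix bytes + 5 payload bytes.
        byte[] ascii = encode("Smith");
        System.out.println(ascii.length + " bytes, prefix = " + lengthPrefix(ascii));

        // Six characters, but U+00FC needs two bytes, so the payload is 7 bytes.
        byte[] accented = encode("M\u00fcller");
        System.out.println(accented.length + " bytes, prefix = " + lengthPrefix(accented));
    }
}
```

The second string shows why the prefix has to be read rather than derived from the character count: the prefix counts encoded bytes.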

Assume that you want to perform a lexicographical comparison that includes both the last and the first name. You cannot do this by comparing the entire byte array, because the string lengths are also encoded in the array. Instead, the comparator needs to be smart enough to skip over the string lengths, as the code below shows:
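Before looking at the Hadoop comparator, a tiny standalone sketch may help show why the length prefix gets in the way. The class LengthPrefixPitfall and its methods are hypothetical names, and the hand-written loop only stands in for an unsigned byte-wise comparison; no Hadoop classes are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: compares writeUTF output with and without the prefix.
class LengthPrefixPitfall {

    // Serializes a string with writeUTF (two-byte length, then the payload).
    static byte[] encode(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Unsigned lexicographic comparison of b1[s1, s1+l1) and b2[s2, s2+l2).
    static int compareUnsigned(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Compares the whole arrays, length prefix included.
    static int naive(String a, String b) {
        byte[] x = encode(a);
        byte[] y = encode(b);
        return compareUnsigned(x, 0, x.length, y, 0, y.length);
    }

    // Skips the two-byte prefix and compares only the string payloads.
    static int skippingPrefix(String a, String b) {
        byte[] x = encode(a);
        byte[] y = encode(b);
        return compareUnsigned(x, 2, x.length - 2, y, 2, y.length - 2);
    }

    public static void main(String[] args) {
        // "ab" is 00 02 61 62 and "b" is 00 01 62: the prefix bytes decide the
        // naive comparison, so "ab" wrongly sorts after "b".
        System.out.println("naive:    " + Integer.signum(naive("ab", "b")));
        System.out.println("skipping: " + Integer.signum(skippingPrefix("ab", "b")));
        System.out.println("String:   " + Integer.signum("ab".compareTo("b")));
    }
}
```

With the prefix included, the longer string always loses to the shorter one as soon as the lengths differ, regardless of the characters.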

import org.apache.hadoop.io.WritableComparator;

public class PersonBinaryComparator extends WritableComparator {
    protected PersonBinaryComparator() {
        super(Person.class, true);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
                       int l2) {
        
        // Compare last name
        int lastNameResult = compare(b1, s1, b2, s2);

        // If the last names differ, return the result of that comparison
        if (lastNameResult != 0) {
            return lastNameResult;
        }

        // Read the encoded size of the last name from the byte array
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Return the comparison result on the first name
        return compare(b1, s1 + b1l1 + 2, b2, s2 + b2l1 + 2);
    }

    // Compare string in byte form
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // Read the size of the UTF-8 string in byte array
        int b1l1 = readUnsignedShort(b1, s1);
        int b2l1 = readUnsignedShort(b2, s2);

        // Perform lexicographical comparison of the UTF-8 binary data
        // with the WritableComparator.compareBytes(...) method
        return compareBytes(b1, s1 + 2, b1l1, b2, s2 + 2, b2l1);
    }

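    // Read two bytes as an unsigned big-endian short.
    // Java bytes are signed, so each one is masked with 0xff before combining.
    public static int readUnsignedShort(byte[] b, int offset) {
        int ch1 = b[offset] & 0xff;
        int ch2 = b[offset + 1] & 0xff;
        return (ch1 << 8) + ch2;
    }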
}
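If you want to check the byte-level logic without a Hadoop cluster, the following is a rough standalone sketch of the same idea. Everything in it is an assumption made for illustration: the class RawPersonCompareSketch, its serialize helper (which mirrors Person.write by writing the last name, then the first name), and a hand-written compareBytes loop that stands in for WritableComparator.compareBytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, Hadoop-free re-creation of the comparator's logic for experiments.
class RawPersonCompareSketch {

    // Mirrors Person.write: last name first, then first name, both via writeUTF.
    static byte[] serialize(String lastName, String firstName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(lastName);
            out.writeUTF(firstName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads a big-endian unsigned short; masking avoids sign extension.
    static int readUnsignedShort(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 8) | (b[offset + 1] & 0xff);
    }

    // Stand-in for an unsigned lexicographic byte comparison.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Compares the writeUTF-encoded strings that start at s1 and s2.
    static int compareUtf(byte[] b1, int s1, byte[] b2, int s2) {
        int n1 = readUnsignedShort(b1, s1);
        int n2 = readUnsignedShort(b2, s2);
        return compareBytes(b1, s1 + 2, n1, b2, s2 + 2, n2);
    }

    // Last name first; on a tie, skip prefix and payload and compare first names.
    static int compare(byte[] p1, byte[] p2) {
        int cmp = compareUtf(p1, 0, p2, 0);
        if (cmp != 0) {
            return cmp;
        }
        int next1 = 2 + readUnsignedShort(p1, 0);
        int next2 = 2 + readUnsignedShort(p2, 0);
        return compareUtf(p1, next1, p2, next2);
    }

    public static void main(String[] args) {
        byte[] alice = serialize("Smith", "Alice");
        byte[] bob = serialize("Smith", "Bob");
        byte[] jones = serialize("Jones", "Zed");
        System.out.println(Integer.signum(compare(alice, bob)));   // same last name
        System.out.println(Integer.signum(compare(jones, alice))); // last names differ
    }
}
```

A last name whose encoded form is 128 bytes or longer is a useful test case here, because that is where an unmasked byte read would produce a wrong length.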

Final note: writeUTF is limited because it can only encode strings whose modified UTF-8 form is at most 65535 bytes, the largest value that fits in the two-byte length prefix. If you need to work with larger strings, you should look at using Hadoop's Text class, which can support much larger strings. The implementation of Text's comparator is similar to what we completed in this blog post.
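The limit is easy to observe with plain JDK classes. In this sketch, WriteUtfLimit and its fits method are hypothetical names; the only assumption about the JDK is that DataOutputStream.writeUTF throws UTFDataFormatException when the encoded form exceeds 65535 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical probe for the size limit of writeUTF.
class WriteUtfLimit {

    // Returns true if writeUTF accepts a string of n ASCII characters
    // (one encoded byte per character).
    static boolean fits(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        String s = new String(chars);
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // The encoded string does not fit in the two-byte length prefix.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("65535 bytes fits: " + fits(65535));
        System.out.println("65536 bytes fits: " + fits(65536));
    }
}
```

Because non-ASCII characters take two or three bytes each, a string with far fewer than 65535 characters can still hit the limit.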

2013-07-27 21:25