论坛首页 Web前端技术论坛

Hadoop的SemiJoin

浏览 2389 次
精华帖 (0) :: 良好帖 (0) :: 新手帖 (0) :: 隐藏帖 (0)
作者 正文
   发表时间:2014-04-25  
散仙,在前两篇博客里,写了关于Hadoop的Map侧join
和Reduce的join,今天我们就来在看另外一种比较中立的Join。

SemiJoin,一般称为半链接,其原理是在Map侧过滤掉了一些不需要join的数据,从而大大减少了reduce的shffule时间,因为我们知道,如果仅仅使用Reduce侧连接,那么如果一份数据中,存在大量的无效数据,而这些数据,在join中,并不需要,但是因为没有做过预处理,所以这些数据,直到真正的执行reduce函数时,才被定义为无效数据,而这时候,前面已经执行过shuffle和merge和sort,所以这部分无效的数据,就浪费了大量的网络IO和磁盘IO,所以在整体来讲,这是一种降低性能的表现,如果存在的无效数据越多,那么这种趋势,就越明显。

之所以会出现半连接,这其实也是reduce侧连接的一个变种,只不过我们在Map侧,过滤掉了一些无效的数据,所以减少了reduce过程的shuffle时间,所以能获取一个性能的提升。

具体的原理也是利用DistributedCache将小表的的分发到各个节点上,在Map过程的setup函数里,读取缓存里面的文件,只将小表的链接键存储在hashset里,在map函数执行时,对每一条数据,进行判断,如果这条数据的链接键为空或者在hashset里面不存在,那么则认为这条数据,是无效的数据,所以这条数据,并不会被partition分区后写入磁盘,参与reduce阶段的shuffle和sort,所以在一定程序上,提升了join性能。需要注意的是如果
小表的key依然非常巨大,可能会导致我们的程序出现OOM的情况,那么这时候我们就需要考虑其他的链接方式了。

测试数据如下:
模拟小表数据:
1,三劫散仙,13575468248
2,凤舞九天,18965235874
3,忙忙碌碌,15986854789
4,少林寺方丈,15698745862


模拟大表数据:
3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07
5,E,100,2013-09-09
6,H,200,2014-01-10

代码如下:

<pre name="code" class="java">package com.semijoin;

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/***
*
* Hadoop1.2的版本
*
* hadoop的半链接
*
* SemiJoin实现
*
* @author qindongliang
*
*    大数据交流群:376932160
*  搜索技术交流群:324714439
*
*
*
* **/
public class Semjoin {



/**
*
*
* 自定义一个输出实体
*
* **/
private static class CombineEntity implements WritableComparable<CombineEntity>{


private Text joinKey;//连接key
private Text flag;//文件来源标志
private Text secondPart;//除了键外的其他部分的数据


public CombineEntity() {
// TODO Auto-generated constructor stub
this.joinKey=new Text();
this.flag=new Text();
this.secondPart=new Text();
}

public Text getJoinKey() {
return joinKey;
}

public void setJoinKey(Text joinKey) {
this.joinKey = joinKey;
}

public Text getFlag() {
return flag;
}

public void setFlag(Text flag) {
this.flag = flag;
}

public Text getSecondPart() {
return secondPart;
}

public void setSecondPart(Text secondPart) {
this.secondPart = secondPart;
}

@Override
public void readFields(DataInput in) throws IOException {
this.joinKey.readFields(in);
this.flag.readFields(in);
this.secondPart.readFields(in);

}

@Override
public void write(DataOutput out) throws IOException {
this.joinKey.write(out);
this.flag.write(out);
this.secondPart.write(out);

}

@Override
public int compareTo(CombineEntity o) {
// TODO Auto-generated method stub
return this.joinKey.compareTo(o.joinKey);
}



}




private static class JMapper extends Mapper<LongWritable, Text, Text, CombineEntity>{

private CombineEntity combine=new CombineEntity();
private Text flag=new Text();
private  Text joinKey=new Text();
private Text secondPart=new Text();
/**
* 存储小表的key
*
*
* */
private HashSet<String> joinKeySet=new HashSet<String>();


@Override
protected void setup(Context context)throws IOException, InterruptedException {

//读取文件流
BufferedReader br=null;
String temp;
// 获取DistributedCached里面 的共享文件
Path path[]=DistributedCache.getLocalCacheFiles(context.getConfiguration());

for(Path p:path){

if(p.getName().endsWith("a.txt")){
br=new BufferedReader(new FileReader(p.toString()));
//List<String> list=Files.readAllLines(Paths.get(p.getName()), Charset.forName("UTF-8"));

while((temp=br.readLine())!=null){
String ss[]=temp.split(",");
//map.put(ss[0], ss[1]+"\t"+ss[2]);//放入hash表中
joinKeySet.add(ss[0]);//加入小表的key
}
}
}


}



@Override
protected void map(LongWritable key, Text value,Context context)
throws IOException, InterruptedException {


   //获得文件输入路径
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();

            if(pathName.endsWith("a.txt")){
           
            String  valueItems[]=value.toString().split(",");
           
           
            /**
            * 在这里过滤必须要的连接字符
            *
            * */
            if(joinKeySet.contains(valueItems[0])){
            //设置标志位
                flag.set("0");  
                //设置链接键
                joinKey.set(valueItems[0]);         
                //设置第二部分
                secondPart.set(valueItems[1]+"\t"+valueItems[2]);
               
                //封装实体
                combine.setFlag(flag);//标志位
                combine.setJoinKey(joinKey);//链接键
                combine.setSecondPart(secondPart);//其他部分
               
                //写出
                context.write(combine.getJoinKey(), combine);
            }else{
            System.out.println("a.txt里");
            System.out.println("在小表中无此记录,执行过滤掉!");
            for(String v:valueItems){
            System.out.print(v+"   ");
            }
           
            return ;
           
            }
           
           
           
            }else if(pathName.endsWith("b.txt")){
            String  valueItems[]=value.toString().split(",");
           
            /**
            *
            * 判断是否在集合中
            *
            * */
            if(joinKeySet.contains(valueItems[0])){
           

                //设置标志位
                flag.set("1");  
               
                //设置链接键
                joinKey.set(valueItems[0]);
         
                //设置第二部分注意不同的文件的列数不一样
                secondPart.set(valueItems[1]+"\t"+valueItems[2]+"\t"+valueItems[3]);
               
                //封装实体
                combine.setFlag(flag);//标志位
                combine.setJoinKey(joinKey);//链接键
                combine.setSecondPart(secondPart);//其他部分
               
                //写出
                context.write(combine.getJoinKey(), combine);
           
           
            }else{           
            //执行过滤 ......
            System.out.println("b.txt里");
            System.out.println("在小表中无此记录,执行过滤掉!");
            for(String v:valueItems){
            System.out.print(v+"   ");
            }
           
            return ;
           
           
            }
           
           
           
           
            }
           
           





}

}


private static class JReduce extends Reducer<Text, CombineEntity, Text, Text>{


//存储一个分组中左表信息
private List<Text> leftTable=new ArrayList<Text>();
//存储一个分组中右表信息
private List<Text> rightTable=new ArrayList<Text>();

private Text secondPart=null;

private Text output=new Text();



//一个分组调用一次
@Override
protected void reduce(Text key, Iterable<CombineEntity> values,Context context)
throws IOException, InterruptedException {
leftTable.clear();//清空分组数据
rightTable.clear();//清空分组数据


/**
  * 将不同文件的数据,分别放在不同的集合
  * 中,注意数据量过大时,会出现
  * OOM的异常
  *
  * **/

for(CombineEntity ce:values){

this.secondPart=new Text(ce.getSecondPart().toString());


//左表

if(ce.getFlag().toString().trim().equals("0")){
leftTable.add(secondPart);

}else if(ce.getFlag().toString().trim().equals("1")){

rightTable.add(secondPart);

}




}

//=====================
for(Text left:leftTable){

for(Text right:rightTable){

output.set(left+"\t"+right);//连接左右数据
context.write(key, output);//输出
}

}




}

}








public static void main(String[] args)throws Exception {




//Job job=new Job(conf,"myjoin");
JobConf conf=new JobConf(Semjoin.class);
   conf.set("mapred.job.tracker","192.168.75.130:9001");
    conf.setJar("tt.jar");
 
 
//小表共享
String bpath="hdfs://192.168.75.130:9000/root/dist/a.txt";
//添加到共享cache里
DistributedCache.addCacheFile(new URI(bpath), conf);

  Job job=new Job(conf, "aaaaa");
job.setJarByClass(Semjoin.class);
System.out.println("模式:  "+conf.get("mapred.job.tracker"));;


//设置Map和Reduce自定义类
job.setMapperClass(JMapper.class);
job.setReducerClass(JReduce.class);

//设置Map端输出
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(CombineEntity.class);

//设置Reduce端的输出
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);


job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);


FileSystem fs=FileSystem.get(conf);

Path op=new Path("hdfs://192.168.75.130:9000/root/outputjoindbnew4");

if(fs.exists(op)){
fs.delete(op, true);
System.out.println("存在此输出路径,已删除!!!");
}


  FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.75.130:9000/root/inputjoindb"));
  FileOutputFormat.setOutputPath(job, op);
  
  System.exit(job.waitForCompletion(true)?0:1);
 
 












}




}
</pre>
运行日志如下:
<pre name="code" class="java">模式:  192.168.75.130:9001
存在此输出路径,已删除!!!
WARN - JobClient.copyAndConfigureFiles(746) | Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
INFO - FileInputFormat.listStatus(237) | Total input paths to process : 2
WARN - NativeCodeLoader.<clinit>(52) | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN - LoadSnappy.<clinit>(46) | Snappy native library not loaded
INFO - JobClient.monitorAndPrintJob(1380) | Running job: job_201404260312_0002
INFO - JobClient.monitorAndPrintJob(1393) |  map 0% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) |  map 50% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) |  map 100% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) |  map 100% reduce 33%
INFO - JobClient.monitorAndPrintJob(1393) |  map 100% reduce 100%
INFO - JobClient.monitorAndPrintJob(1448) | Job complete: job_201404260312_0002
INFO - Counters.log(585) | Counters: 29
INFO - Counters.log(587) |   Job Counters
INFO - Counters.log(589) |     Launched reduce tasks=1
INFO - Counters.log(589) |     SLOTS_MILLIS_MAPS=12445
INFO - Counters.log(589) |     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO - Counters.log(589) |     Total time spent by all maps waiting after reserving slots (ms)=0
INFO - Counters.log(589) |     Launched map tasks=2
INFO - Counters.log(589) |     Data-local map tasks=2
INFO - Counters.log(589) |     SLOTS_MILLIS_REDUCES=9801
INFO - Counters.log(587) |   File Output Format Counters
INFO - Counters.log(589) |     Bytes Written=172
INFO - Counters.log(587) |   FileSystemCounters
INFO - Counters.log(589) |     FILE_BYTES_READ=237
INFO - Counters.log(589) |     HDFS_BYTES_READ=455
INFO - Counters.log(589) |     FILE_BYTES_WRITTEN=169503
INFO - Counters.log(589) |     HDFS_BYTES_WRITTEN=172
INFO - Counters.log(587) |   File Input Format Counters
INFO - Counters.log(589) |     Bytes Read=227
INFO - Counters.log(587) |   Map-Reduce Framework
INFO - Counters.log(589) |     Map output materialized bytes=243
INFO - Counters.log(589) |     Map input records=10
INFO - Counters.log(589) |     Reduce shuffle bytes=243
INFO - Counters.log(589) |     Spilled Records=16
INFO - Counters.log(589) |     Map output bytes=215
INFO - Counters.log(589) |     Total committed heap usage (bytes)=336338944
INFO - Counters.log(589) |     CPU time spent (ms)=1770
INFO - Counters.log(589) |     Combine input records=0
INFO - Counters.log(589) |     SPLIT_RAW_BYTES=228
INFO - Counters.log(589) |     Reduce input records=8
INFO - Counters.log(589) |     Reduce input groups=4
INFO - Counters.log(589) |     Combine output records=0
INFO - Counters.log(589) |     Physical memory (bytes) snapshot=442564608
INFO - Counters.log(589) |     Reduce output records=4
INFO - Counters.log(589) |     Virtual memory (bytes) snapshot=2184306688
INFO - Counters.log(589) |     Map output records=8
</pre>
在map侧过滤的数据,在50030中查看的截图如下:



运行结果如下所示:
<pre name="code" class="java">1 三劫散仙 13575468248 B 89 2013-02-05
2 凤舞九天 18965235874 C 69 2013-03-09
3 忙忙碌碌 15986854789 A 99 2013-03-05
3 忙忙碌碌 15986854789 D 56 2013-06-07
</pre>

至此,这个半链接就完成了,结果正确,在hadoop的几种join方式里,只有在Map侧的链接比较高效,但也需要根据具体的实际情况,进行选择。

  • 大小: 119.9 KB
论坛首页 Web前端技术版

跳转论坛:
Global site tag (gtag.js) - Google Analytics