ghost1392

Getting Started with Hadoop Programming

Blog categories:

Java
Big Data

java hadoop

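I. Integrating with Eclipse

1. Download the Hadoop development package

A. Official site: http://hadoop.apache.org/#Download+Hadoop

Update 2016-03-25: the version is now 2.6.0; the configuration steps are the same.

B. My shared directory: \\${shared-drive}\share\yzc\Hadoop-dev

C. Download the Eclipse plugin (hadoop-eclipse-plugin-x.x.x.jar)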

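2. Configure Eclipse for Hadoop development

A. Unpack, install, and add the plugin:

Unzip the Hadoop and Eclipse zip packages into a tidy directory layout, then copy the hadoop-eclipse-plugin-x.x.x.jar plugin into the Eclipse plugin directory, D:\eclipse\plugins\.

[Screenshot: the plugin JAR in D:\eclipse\plugins\]

Note: after unpacking Hadoop, copy D:\hadoop-2.5.1\bin\hadoop.dll to C:\Windows\System32\.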

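B. Configure the Hadoop environment variables:

On the desktop, right-click [My Computer] »» [Properties] »» [Advanced system settings] in the left navigation »» [Environment Variables (N)...] on the [Advanced] tab »» [System variables (S)]. Add or edit the following variables:

HADOOP_HOME » D:\hadoop-2.5.1 ---- the directory where Hadoop was unpacked

Path » %HADOOP_HOME%\bin;...... ---- edit Path and put the Hadoop bin entry at the very front, terminated by ";"

Notes on configuring system environment variables:

(A) If you do not have permission to change them, ask someone in IT (xxx) to do it with administrator rights, or set them through the idesk platform;

(B) After changing the environment variables, restart the machine for them to take effect.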
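To confirm that the variables above are visible to a Java process (for example one launched from Eclipse), a small check like the following can help. It is only a sketch: the class name HadoopEnvCheck and its messages are made up for illustration, and it checks nothing beyond the presence of HADOOP_HOME and its bin directory.

```java
import java.io.File;

/** Sketch: sanity-check the HADOOP_HOME setting described above (hypothetical helper). */
class HadoopEnvCheck {

    /** Returns a description of the problem, or null when the layout looks usable. */
    static String check(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            return "HADOOP_HOME is not set";
        }
        // %HADOOP_HOME%\bin is the directory that was added to Path.
        File bin = new File(hadoopHome, "bin");
        if (!bin.isDirectory()) {
            return "No bin directory under " + hadoopHome;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = check(System.getenv("HADOOP_HOME"));
        System.out.println(problem == null ? "HADOOP_HOME looks usable" : problem);
    }
}
```

If the program reports that HADOOP_HOME is not set even though it was configured, the process was probably started before the change took effect.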

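C. Start Eclipse and configure Hadoop HDFS:

In the Eclipse menu bar, go to [Window] »» [Preferences] »» [Hadoop Map/Reduce] »» [Browse...] and select the directory where Hadoop was unpacked.

[Screenshot: the Hadoop Map/Reduce preferences page]

Configure Hadoop HDFS: open the [Map/Reduce Locations] view, right-click in the empty area, and choose [New Hadoop location...] to open the configuration dialog. Enter the NameNode host and the relevant port numbers, then click [Finish] to save.

[Screenshot: the New Hadoop location dialog]

This assumes you know your Hadoop cluster environment and can find the NameNode configuration files. Taking our development environment as an example, the Hadoop configuration directory /etc/hadoop/conf/ contains the following files: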

Config file        Port     Property                    Description
core-site.xml      8020     fs.defaultFS                NameNode RPC port, used for HDFS file operations
hdfs-site.xml      50020    dfs.datanode.ipc.address    DataNode IPC server address and port
mapred-site.xml
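The host and port to type into the location dialog can be read off the fs.defaultFS value in core-site.xml. The sketch below splits such a value with java.net.URI. The class name NameNodeAddress and the host namenode.example.com are placeholders, and falling back to 8020 when the URI carries no port is an assumption that matches the value listed above.

```java
import java.net.URI;

/** Sketch: split an fs.defaultFS value into the host and port the Eclipse dialog asks for. */
class NameNodeAddress {
    final String host;
    final int port;

    NameNodeAddress(String host, int port) {
        this.host = host;
        this.port = port;
    }

    static NameNodeAddress fromDefaultFs(String defaultFs) {
        URI uri = URI.create(defaultFs);
        if (!"hdfs".equals(uri.getScheme()) || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an hdfs:// URI: " + defaultFs);
        }
        // Assumption: use 8020 when the URI does not name a port.
        int port = uri.getPort() == -1 ? 8020 : uri.getPort();
        return new NameNodeAddress(uri.getHost(), port);
    }

    public static void main(String[] args) {
        // Placeholder value; use the one from your own core-site.xml.
        NameNodeAddress nn = fromDefaultFs("hdfs://namenode.example.com:8020");
        System.out.println("Host: " + nn.host + ", Port: " + nn.port);
    }
}
```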

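Keep the default values for all ports in the dialog and click [Finish]. Then switch to the [Map/Reduce] perspective: under DFS Locations in the [Project Explorer] on the left you will see the HDFS directory tree of the NameNode you just configured, and the context menu offers the corresponding file operations.

[Screenshot: DFS Locations tree in Project Explorer]

File operations offered by the context menu:

[Screenshot: context menu with file operations]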

II. Developing Map/Reduce programs

1. Create a Maven project and add the dependencies

pom.xml dependency JARs

2. Write the Map/Reduce Job program

Map/Reduce Job code example

package com.xxx.analyse.job;

import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.xxx.analyse.jdbc.DBConnPool;
import com.xxx.analyse.nginx.SysNodeCache;
import com.xxx.analyse.nginx.TrackNodeBatchInsertDB;
import com.xxx.analyse.parser.NginxLogParser;
import com.xxx.analyse.po.NginxLog;
import com.xxx.analyse.po.TrackNode;
import com.xxx.analyse.util.Converter;
import com.xxx.analyse.util.DateUtil;
import com.xxx.analyse.util.HdfsFileOuter;
import com.xxx.analyse.util.HdfsFileUtil;
import com.xxx.analyse.util.RedisUtil;
 
/**
 * @Title: User-track traffic statistics: PV and UV
 * @Description: Mapper & Reducer
 * @Team: Technology Dept. 1, Java development team
 * @Author Andy-ZhichengYuan
 * @Date 2015-11-25
 * @Version V1.0
 */
public class TrackNodeJob {
    /** Redis cache key for t_bi_nodev_dlog table data (TrackNode objects). */
    public static final String RedisKey = "bi_nodev_dlog";
    /** Configuration key that carries the log date. */
    private static final String DayLogDate = "TrackNodeJobDate";
    /** Prefix of the per-node Redis keys. */
    private static final String RedisNodeID = "NodeID_";

    /** Mapper for the user-track traffic statistics. */
    public static class TrackNodeMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text PreNodePvKey = new Text();
        private Text PreNodeUvKey = new Text();
        private Text NodePvKey = new Text();
        private Text NodeUvKey = new Text();
        private final static IntWritable PreNodePv = new IntWritable(1);
        private final static IntWritable PreNodeUv = new IntWritable(1);
        private final static IntWritable NodePv = new IntWritable(1);
        private final static IntWritable NodeUv = new IntWritable(1);

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            NginxLog log = NginxLogParser.parseNginxLogAll(line);
            if (log != null && log.isValid()) {
                // Columns: iAutoID,iPreNode,iPrePV,iPreUV,iNode,iPV,iUV,iAvgSec,iCreateTime
                Long iPreNode = SysNodeCache.getNodeIdByURL(log.getRequest().getsReferUrl());
                Long iNode = SysNodeCache.getNodeIdByURL(log.getRequest().getsPageUrl());
                // TODO handle the case where the previous node and the current node are the same

                // Stay time of the current node: accumulated in the Redis map here. When the
                // statistics are finally stored in the DB, the keys of this map are used to
                // look up the cached nodes.
                Long staySecond = log.getRequest().getStaySeconds();
                if (staySecond != null && staySecond.longValue() > 0) {
                    // Note: this read-modify-write is not atomic across concurrent map tasks.
                    Map<String, String> map = RedisUtil.getMapRedisCacheInfo(RedisKey);
                    if (map == null) {
                        map = new ConcurrentHashMap<String, String>();
                    } else if (map.containsKey("" + iNode)) {
                        staySecond += Converter.parseLong(map.get("" + iNode));
                    }
                    map.put("" + iNode, "" + staySecond);
                    RedisUtil.setMapRedisCacheInfo(RedisKey, map, RedisUtil.DAY);
                }

                // PV keys
                PreNodePvKey.set(iPreNode + "|" + iNode);
                context.write(PreNodePvKey, PreNodePv);
                NodePvKey.set("" + iNode);
                context.write(NodePvKey, NodePv);

                // UV keys (carry the visitor GUID)
                PreNodeUvKey.set(iPreNode + "|" + iNode + ":" + log.getsGuid());
                context.write(PreNodeUvKey, PreNodeUv);
                NodeUvKey.set(iNode + ":" + log.getsGuid());
                context.write(NodeUvKey, NodeUv);
            } // invalid log lines are skipped
            line = null;
            log = null;
        }
    }

    ////////////////////////////////////////////////////////////////////////////////////
    /** Combiner: merges the Map output before it reaches Reduce; in essence it is Reduce logic. */
    public static class TrackNodeCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            String mapKey = key.toString();
            if (StringUtils.contains(mapKey, ":")) { // UV key: cut off the GUID, keep the trailing ":"
                mapKey = (StringUtils.substringBeforeLast(mapKey, ":") + ":");
                result.set(sum);
                context.write(new Text(mapKey), result);
            } else { // PV key: pass through unchanged
                result.set(sum);
                context.write(key, result);
            }
        }
    }

    ///////////////////////////////////////////////////////////////////////////////////
    /** Reducer for the user-track traffic statistics. */
    public static class TrackNodeReducer extends Reducer<Text, IntWritable, TrackNode, Text> {
        private Long iDate = null;

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            String mapKey = key.toString();
            int iType = this.getKeyType(mapKey); // key type, see getKeyType
            String iNode = this.getNodeID(iType, mapKey);
            if (iType == 1 || iType == 3) { // UV: only count the values
                Iterator<IntWritable> itr = values.iterator();
                while (itr.hasNext()) {
                    itr.next();
                    sum++;
                }
            } else { // PV: add up the values
                for (IntWritable val : values) {
                    sum += val.get();
                }
            }
            if (iDate == null || iDate.longValue() == 0) {
                Configuration conf = context.getConfiguration();
                String yesterday = DateUtil.getAgoBackDate(-1);
                Long iDateDef = DateUtil.getDateSec(yesterday); // in seconds
                iDate = conf.getLong(DayLogDate, iDateDef);
            }
            // Cache the IDs of all nodes seen today as lookup keys for the DB import.
            // Disabled: the Mapper already caches them while accumulating the stay time.
//            if( RedisUtil.exists(RedisKey) ) {
//                Set<String> keySet = RedisUtil.getSetRedisCacheInfo(RedisKey);
//                keySet.add( iNode );
//                RedisUtil.setSetRedisCacheInfo(RedisKey, keySet, RedisUtil.HOUR);
//            } else {
//                Set<String> keySet = new HashSet<String>();
//                keySet.add(iNode);
//                RedisUtil.setSetRedisCacheInfo(RedisKey, keySet, RedisUtil.HOUR);
//            }
            TrackNode tNode = RedisUtil.getJson2ObjectCacheInfo(RedisNodeID + iNode, TrackNode.class);
            if (tNode != null) {
                // Columns: iAutoID,iPreNode,iPrePV,iPreUV,iNode,iPV,iUV,iAvgSec,iCreateTime
//                tNode.setiNode( Converter.parseLong(iNode) );
//                tNode.setiCreateTime( iDate );// timestamp
                switch (iType) {
                case 1: // iPreNode UV
                    if (tNode.getPreNode() == 0) {
                        tNode.setPreNode(Converter.parseLong(this.getPreNodeID(mapKey)));
                    }
                    Long iPreUv = tNode.getPreUV();
                    tNode.setPreUV(iPreUv + sum);
                    break;
                case 2: // iPreNode PV
                    if (tNode.getPreNode() == 0) {
                        tNode.setPreNode(Converter.parseLong(this.getPreNodeID(mapKey)));
                    }
                    Long iPrePv = tNode.getPrePV();
                    tNode.setPrePV(iPrePv + sum);
                    break;
                case 3: // iNode UV
                    Long iUv = tNode.getUv();
                    tNode.setUv(iUv + sum);
                    break;
                default: // 0: iNode PV
                    Long iPv = tNode.getPv();
                    tNode.setPv(iPv + sum);
                    break;
                }
                RedisUtil.setObject2JsonCacheInfo(RedisNodeID + iNode, tNode, RedisUtil.DAY);
            } else { // not cached yet
                tNode = new TrackNode();
                // Columns: iAutoID,iPreNode,iPrePV,iPreUV,iNode,iPV,iUV,iAvgSec,iCreateTime
                tNode.setNode(Converter.parseLong(iNode));
                tNode.setCreateTime(iDate); // timestamp
                switch (iType) {
                case 1: // iPreNode UV
                    tNode.setPreNode(Converter.parseLong(this.getPreNodeID(mapKey)));
                    tNode.setPreUV(sum);
                    break;
                case 2: // iPreNode PV
                    tNode.setPreNode(Converter.parseLong(this.getPreNodeID(mapKey)));
                    tNode.setPrePV(sum);
                    break;
                case 3: // iNode UV
                    tNode.setUv(sum);
                    break;
                default: // 0: iNode PV
                    tNode.setPv(sum);
                    break;
                }
                RedisUtil.setObject2JsonCacheInfo(RedisNodeID + iNode, tNode, RedisUtil.DAY);
            }
        }

        /**
         * Determines the key type from the key:
         * <p>1 = UV key: iPreNode +"|"+ iNode +":"+ GUID</p>
         * <p>2 = PV key: iPreNode +"|"+ iNode</p>
         * <p>3 = UV key: iNode +":"+ GUID</p>
         * <p>0 = PV key: iNode (default)</p>
         */
        public int getKeyType(String mapKey) {
            boolean hasPreNode = StringUtils.contains(mapKey, "|");
            boolean hasGuid = StringUtils.contains(mapKey, ":");
            if (hasPreNode && hasGuid) { // iPreNode UV
                return 1;
            }
            if (hasPreNode) { // iPreNode PV
                return 2;
            }
            if (hasGuid) { // iNode UV
                return 3;
            }
            return 0; // iNode PV
        }

        /** Returns the ID of the current node for the given key type. */
        public String getNodeID(int iType, String mapKey) {
            switch (iType) {
            case 1: // iPreNode UV
                return StringUtils.substringBetween(mapKey, "|", ":");
            case 2: // iPreNode PV
                return StringUtils.substringAfterLast(mapKey, "|");
            case 3: // iNode UV
                return StringUtils.substringBefore(mapKey, ":");
            default: // 0: iNode PV
                return mapKey;
            }
        }

        /** Returns the ID of the upstream node if the key contains one; otherwise null. */
        public String getPreNodeID(String mapKey) {
            if (StringUtils.contains(mapKey, "|")) {
                return StringUtils.substringBefore(mapKey, "|");
            }
            return null;
        }
    }

    //////////////////////////////////////////////////////////////////////////////////////////////
    /** Runs the TrackNodeJob MapReduce job; returns 0 on success and 1 on failure. */
    public static int runTrackNode2File(String input, String output, Long iDate) {
        int exit = 1; // stays 1 if the job throws before completing
        try {
            Configuration conf = new Configuration();
            conf.setLong(DayLogDate, iDate); // job-wide constant: the log date
            Job job = Job.getInstance(conf, "t_bi_nodev_dlog");
            job.setJarByClass(TrackNodeJob.class);
            job.setMapperClass(TrackNodeMapper.class);
            job.setCombinerClass(TrackNodeCombiner.class);
            job.setReducerClass(TrackNodeReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
//            FileInputFormat.addInputPath(job, new Path(input));
            FileInputFormat.setInputPaths(job, new Path(input));
            // Output: the results are written to memory (Redis); remove any stale output directory first
            HdfsFileUtil.delete(conf, output, true);
            FileOutputFormat.setOutputPath(job, new Path(output));
            exit = job.waitForCompletion(true) ? 0 : 1;
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return exit;
    }

    /** Stores the PV and UV results of TrackNodeJob in the database. */
    public static void storeTrackNode2DB(String output) {
        Connection conn = null;
        HdfsFileOuter outer = null;
        try {
//            Set<String> keySet = RedisUtil.getSetRedisCacheInfo(RedisKey);
//            if(keySet != null && !keySet.isEmpty()) {
            Map<String, String> map = RedisUtil.getMapRedisCacheInfo(RedisKey);
            if (map != null && !map.isEmpty()) {
                conn = DBConnPool.getInstance().getConn();
                TrackNodeBatchInsertDB inserter = new TrackNodeBatchInsertDB(conn, 10);
                outer = new HdfsFileOuter(output, true, 100);
                outer.open();
                String iNode = null;
                Long iAvgSec = null;
                TrackNode node = null;
                Iterator<String> itr = map.keySet().iterator();
                while (itr.hasNext()) {
                    iNode = itr.next();
                    iAvgSec = Converter.parseLong(map.get(iNode));
                    node = RedisUtil.getJson2ObjectCacheInfo(RedisNodeID + iNode, TrackNode.class);
                    if (node == null) {
                        System.out.println("Node not cached: " + iNode);
                        continue;
                    }
                    // Average stay in seconds = total stay time / PV; skip when PV is missing or zero
                    Long pv = node.getPv();
                    if (pv != null && pv.longValue() > 0) {
                        node.setAvgSec(iAvgSec / pv);
                    }
                    outer.write(node.toString());
                    outer.flush(false);
                    inserter.addBatch(node);
                    inserter.commit(false);
                    RedisUtil.delRedisCacheInfo(RedisNodeID + iNode); // remove from Redis
                }
                inserter.commit(true);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (outer != null) {
                outer.flush(true);
                outer.close();
            }
            if (conn != null) {
                try {
                    conn.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /** Clears the cached data. */
    public static void clear() {
        RedisUtil.delRedisCacheInfo(RedisKey); // remove from Redis
    }
}
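The job above encodes four kinds of statistics into the map key (iNode, iPreNode|iNode, iNode:GUID, iPreNode|iNode:GUID). The following standalone sketch mirrors the key parsing of getKeyType, getNodeID and getPreNodeID with plain String methods so it can be tried without Hadoop or commons-lang on the classpath. The class name TrackKeySketch and the sample keys are illustrative only.

```java
/** Sketch: stdlib-only version of the key parsing used by the Reducer above. */
class TrackKeySketch {

    /** 1 = preNode UV, 2 = preNode PV, 3 = node UV, 0 = node PV. */
    static int keyType(String key) {
        boolean hasPreNode = key.contains("|");
        boolean hasGuid = key.contains(":");
        if (hasPreNode && hasGuid) return 1;
        if (hasPreNode) return 2;
        if (hasGuid) return 3;
        return 0;
    }

    /** The current node ID: the part after '|' (if any) and before ':' (if any). */
    static String nodeId(String key) {
        int start = key.lastIndexOf('|') + 1; // 0 when there is no '|'
        int end = key.indexOf(':', start);
        return end < 0 ? key.substring(start) : key.substring(start, end);
    }

    /** The upstream node ID, or null when the key has no '|'. */
    static String preNodeId(String key) {
        int bar = key.indexOf('|');
        return bar < 0 ? null : key.substring(0, bar);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"100", "7|100", "100:guid-a", "7|100:guid-a", "100:"}) {
            System.out.println(key + " -> type " + keyType(key)
                    + ", node " + nodeId(key) + ", preNode " + preNodeId(key));
        }
    }
}
```

The last sample key, "100:", is the shape a UV key has after the Combiner has cut off the GUID; it still parses as type 3 with node ID 100.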

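I would also like to share how I de-duplicate when computing UV.

Understanding MR comes first. Leaving the internals aside, what matters most at this point is the processing flow.

MR processes data one record at a time:

M: if you read a file from HDFS, each call to map() receives one line of it. How you process that line is up to you; it is mostly string handling.

R also works record by record. Each M record is a line read from the file; each R record comes from the output of M.

So whatever map() wants to hand to R after processing must be passed on with context.write(key, value).

Note on the key and value passed to write: all values that share the same key are grouped into a single record for R.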
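The grouping rule can be imitated in plain Java without a cluster. The sketch below is not Hadoop code: ShuffleSketch, kv and group are made-up names, and a TreeMap merely stands in for the sort-and-group step that the framework performs between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: how (key, value) pairs written by map() are grouped by key for reduce(). */
class ShuffleSketch {

    /** Convenience factory for one map output record. */
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, Integer.valueOf(value));
    }

    /** Collects the values of equal keys into one list, with keys in sorted order. */
    static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>()).add(record.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
                kv("100:guid-a", 1), kv("101:guid-b", 1), kv("100:guid-a", 1));
        // Each entry printed here corresponds to one reduce() call: a key and all its values.
        System.out.println(group(mapOutput));
    }
}
```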

Suppose M writes the following output. The key is built from the parsed nginx access log as node ID:GUID. Each distinct key is listed once here; as the next table shows, some of them are written more than once.

Key                                     Value
100:3aefa00eaa07def13a15a5be3a79a39e    1
100:ef5a4c4a838f4c46c0c3a83642ddd297    1
100:837e6a6a77a3db3304ffab75835f4177    1
100:6a36a270c57af55f768d14d6bcc34dc7    1
101:3aefa00eaa07def13a15a5be3a79a39e    1
101:837e6a6a77a3db3304ffab75835f4177    1
101:8cca950662f1d052fd1e05c886d841db    1
101:9628e18b945f4c3c8d2355583fb2495b    1

The R logic runs twice. The first run is the Combiner, whose input is:

Key                                     Values    Sum
100:3aefa00eaa07def13a15a5be3a79a39e    1         1
100:ef5a4c4a838f4c46c0c3a83642ddd297    1         1
100:837e6a6a77a3db3304ffab75835f4177    1         1
100:6a36a270c57af55f768d14d6bcc34dc7    1         1
101:3aefa00eaa07def13a15a5be3a79a39e    1         1
101:837e6a6a77a3db3304ffab75835f4177    1,1,1     3
101:8cca950662f1d052fd1e05c886d841db    1,1,1     3
101:9628e18b945f4c3c8d2355583fb2495b    1,1       2

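After summing, at the context.write(key, value) step the Combiner cuts the GUID off every key that contains one. The records then aggregate on the truncated key, so the next Reduce receives input in the following form:

Key     Values
100:    1,1,1,1
101:    1,3,3,2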
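The Combiner step can be replayed in plain Java. This is a simulation under the assumption that a single Combiner pass sees every record of a key; CombinerSketch and its method names are made up, and the short GUIDs a to d stand in for the real ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: sum the values per key, then cut the GUID off the key and regroup. */
class CombinerSketch {

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** "101:837e..." becomes "101:"; keys without ':' are returned unchanged. */
    static String truncate(String key) {
        int colon = key.lastIndexOf(':');
        return colon < 0 ? key : key.substring(0, colon + 1);
    }

    /** One Combiner pass over grouped map output, regrouped by the truncated key. */
    static SortedMap<String, List<Integer>> combine(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, List<Integer>> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            out.computeIfAbsent(truncate(e.getKey()), k -> new ArrayList<>()).add(sum(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("101:a", Arrays.asList(1));
        grouped.put("101:b", Arrays.asList(1, 1, 1));
        grouped.put("101:c", Arrays.asList(1, 1, 1));
        grouped.put("101:d", Arrays.asList(1, 1));
        System.out.println(combine(grouped)); // one value per distinct GUID under key "101:"
    }
}
```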

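The second Reduce is then simple: it converts the node ID and counts the values by iterating over them, which gives the UV.

The UV output is:

Key    UV (number of values)
100    4
101    4

The PV output is:

Key    PV (sum of values)
100    4
101    9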

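Summary: the R logic runs twice, and the two runs aggregate on different keys. Iterating over the values and adding them up gives the PV; iterating over them and only counting, that is, taking the number of values, gives the UV.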
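The two numbers can be checked against the tables above with a few lines of Java. The sketch takes the Combiner output from the example as given; PvUvSketch is a made-up name. It also assumes that the Combiner has seen all records of a GUID exactly once. Hadoop treats a combiner as an optional optimization that may run zero or more times per map task, so that assumption is worth verifying on inputs that span several splits.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: the final Reduce step, PV as the sum of the values and UV as their count. */
class PvUvSketch {

    /** PV: add up the per-GUID visit counts. */
    static long pv(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** UV: count the values; each one stands for one distinct GUID. */
    static long uv(List<Integer> values) {
        long count = 0;
        for (Integer ignored : values) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Combiner output from the example above.
        Map<String, List<Integer>> reduceInput = new LinkedHashMap<>();
        reduceInput.put("100:", Arrays.asList(1, 1, 1, 1));
        reduceInput.put("101:", Arrays.asList(1, 3, 3, 2));
        for (Map.Entry<String, List<Integer>> e : reduceInput.entrySet()) {
            System.out.println(e.getKey() + " UV=" + uv(e.getValue()) + " PV=" + pv(e.getValue()));
        }
    }
}
```

In the job itself PV is taken from the keys without a GUID rather than from these lists; the sums happen to agree because both count the same visits.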

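The following diagram of the MR computation makes this clear:

[Figure: diagram of the MapReduce computation (image attachment)]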


2017-05-29 20:34

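Comment #1, ghost1392, 2017-05-29

One more thing to share: the path we took on UV de-duplication in our big-data work:
1. A Hadoop MR job that records GUIDs in a local Set or Map cache to detect duplicates (the UV came out higher than the actual value);
2. A Hadoop MR job that records GUIDs in a Redis cache to detect duplicates (should work);
3. A Hadoop MR job that relies on an understanding of how MR computes and processes data and takes the iteration count as the UV (works fine);
4. Write the parsed nginx access log out as structured text, import it into a Hive table with sqoop2, then query the UV through Impala using SQL's de-duplication function (parse once, analyze flexibly).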
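The comment does not say why approach 1 overcounts. One plausible cause, stated here as an assumption, is that every task keeps its own in-memory Set, so a GUID that appears in the input of two tasks is counted once by each. The sketch below illustrates that effect with plain collections; LocalSetDedupSketch and the sample GUIDs are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: per-task Sets versus one global Set when counting distinct GUIDs. */
class LocalSetDedupSketch {

    /** Each task de-duplicates only its own input; the per-task counts are then added up. */
    static int perTaskUv(List<List<String>> guidsPerTask) {
        int total = 0;
        for (List<String> task : guidsPerTask) {
            total += new HashSet<>(task).size();
        }
        return total;
    }

    /** De-duplication across all tasks, which is what UV actually means. */
    static int globalUv(List<List<String>> guidsPerTask) {
        Set<String> all = new HashSet<>();
        for (List<String> task : guidsPerTask) {
            all.addAll(task);
        }
        return all.size();
    }

    public static void main(String[] args) {
        List<List<String>> tasks = Arrays.asList(
                Arrays.asList("guid-a", "guid-b", "guid-a"),
                Arrays.asList("guid-b", "guid-c"));
        System.out.println("per-task UV = " + perTaskUv(tasks) + ", global UV = " + globalUv(tasks));
    }
}
```

A shared store such as Redis (approach 2) or a single SQL de-duplication query (approach 4) avoids the problem because all GUIDs are compared in one place.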
