Guiding questions
1. Which query statement does Hive use to compute the statistics?
2. Why are external tables recommended in production?
3. What role does the DataWritable class play in the Hadoop MapReduce version?
4. Why does the MapReduce version need a custom Writable at all?
5. How is the mobile-traffic tally actually implemented?
6. How do the Hive and MapReduce approaches to this task compare?
1. Mobile Traffic Statistics with Hive
Many companies use Hive to process their data.
Hive is a member of the Hadoop family: a framework that parses SQL-like statements and wraps the common MapReduce jobs, letting you query tables stored on HDFS as if you were writing plain SQL.
Hive tables come in two kinds: internal (managed) tables and external tables.
When Hive creates a managed table, it moves the data into the path its warehouse points to; when it creates an external table, it only records where the data lives and leaves the files in place.
On DROP TABLE, a managed table's metadata and data are deleted together, while an external table loses only its metadata and keeps the data. External tables are therefore safer, allow a more flexible data layout, and make it easier to share the source data.
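For concreteness, here is a minimal external-table sketch (the table name and location path are illustrative, not from the original post); dropping it deletes only the metastore entry and leaves the files under location untouched:
- create external table flow_ext(msisdn string, upPayLoad bigint, downPayLoad bigint)
- row format delimited fields terminated by '\t'
- location '/data/flow/';
- -- drop table flow_ext;   -- the files under /data/flow/ survive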
Beyond managed and external tables, Hive also has partitions, which avoid full-table scans and speed up retrieval; a later article will cover them in depth, but a minimal sketch follows.
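A partitioned-table sketch (again with made-up names): a query that filters on the partition column reads only the matching partition directories instead of the whole table.
- create table flow_by_day(msisdn string, upPayLoad bigint, downPayLoad bigint)
- partitioned by (dt string)
- row format delimited fields terminated by '\t';
- -- scans only the dt='2013-03-13' directory, not the full table:
- -- select sum(upPayLoad) from flow_by_day where dt = '2013-03-13';
With that background, on to the main event: computing mobile traffic with Hive.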
Raw data (tab-separated; some lines omit the host and site-type fields):
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 2 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 156 2936 200
Steps:
- # With Hive configured, start the Hive CLI with the hive command; startup lazy-loads, so it can be slow
- hive
- # List the databases
- hive> show databases;
- OK
- default
- hive
- Time taken: 3.389 seconds
- # Switch databases with use; the hive database was created earlier
- use hive;
- # Create the table (a managed/internal table here). When a managed table loads a file that is already on HDFS, the file is cut (moved) away from its source path.
- # External tables leave the data in place, which is why external tables are the recommendation for production (see the sketch above).
- create table ll(reportTime string,msisdn string,apmac string,acmac string,host string,siteType string,upPackNum bigint,downPackNum bigint,upPayLoad bigint,downPayLoad bigint,httpStatus string)row format delimited fields terminated by '\t';
- # Load the data from HDFS; to load from the local Linux filesystem instead, add the local keyword
- load data inpath '/HTTP_20130313143750.dat' into table ll;
- # After the load, the file is gone from its original HDFS path and sits under the table's warehouse directory (see the check after this list)
- # Run the SQL-like aggregation: group by phone number and sum the four traffic columns
- select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
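To verify the cut behaviour noted in the comments above, both paths can be listed from inside the Hive CLI (a sketch; the warehouse path assumes the default hive.metastore.warehouse.dir and the hive database created above):
- # the original location no longer holds the file
- dfs -ls /HTTP_20130313143750.dat;
- # it now lives under the table's warehouse directory
- dfs -ls /user/hive/warehouse/hive.db/ll/;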
Running the aggregation produces:
- hive> select msisdn,sum(uppacknum),sum(downpacknum),sum(uppayload),sum(downpayload) from ll group by msisdn;
- Total MapReduce jobs = 1
- Launching Job 1 out of 1
- Number of reduce tasks not specified. Estimated from input data size: 1
- In order to change the average load for a reducer (in bytes):
- set hive.exec.reducers.bytes.per.reducer=<number>
- In order to limit the maximum number of reducers:
- set hive.exec.reducers.max=<number>
- In order to set a constant number of reducers:
- set mapred.reduce.tasks=<number>
- Starting Job = job_201307160252_0006, Tracking URL = http://hadoop0:50030/jobdetails.jsp?jobid=job_201307160252_0006
- Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hadoop0:9001 -kill job_201307160252_0006
- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
- 2013-07-17 19:51:42,599 Stage-1 map = 0%, reduce = 0%
- 2013-07-17 19:52:40,474 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:41,690 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:42,693 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:43,698 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:44,702 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:45,707 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:46,712 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:47,715 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:48,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:49,758 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:50,763 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 48.5 sec
- 2013-07-17 19:52:51,772 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
- 2013-07-17 19:52:52,775 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
- 2013-07-17 19:52:53,779 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.0 sec
- MapReduce Total cumulative CPU time: 50 seconds 0 msec
- Ended Job = job_201307160252_0006
- MapReduce Jobs Launched:
- Job 0: Map: 1 Reduce: 1 Cumulative CPU: 50.0 sec HDFS Read: 2787075 HDFS Write: 16518 SUCCESS
- Total MapReduce CPU Time Spent: 50 seconds 0 msec
- OK
- 13402169727 171 108 11286 130230
- 13415807477 2067 1683 169668 1994181
- 13416127574 1501 1094 161963 802756
- 13416171820 113 99 10630 32120
- 13417106524 160 128 18688 13088
- 13418002498 240 256 22136 86896
- 13418090588 456 351 98934 67470
- 13418117364 264 152 29436 49966
- 13418173218 37680 48348 2261286 73159722
- 13418666750 22432 26482 1395648 39735552
- 13420637670 20 20 1480 1480
- ......
- Time taken: 75.24 seconds
2. Mobile Traffic Statistics with Hadoop MapReduce
First, a custom Writable. Map output values must implement Hadoop's Writable interface so the framework can serialize them across the shuffle; DataWritable bundles the four traffic counters into a single value.
- package cn.maoxiangyi.hadoop.wordcount;
- import java.io.DataInput;
- import java.io.DataOutput;
- import java.io.IOException;
- import org.apache.hadoop.io.Writable;
- // Custom value type: Hadoop serializes intermediate records through the
- // Writable contract (write/readFields) rather than Java serialization
- public class DataWritable implements Writable {
- private int upPackNum;
- private int downPackNum;
- private int upPayLoad;
- private int downPayLoad;
- public DataWritable() {
- super();
- }
- public DataWritable(int upPackNum, int downPackNum, int upPayLoad,
- int downPayLoad) {
- super();
- this.upPackNum = upPackNum;
- this.downPackNum = downPackNum;
- this.upPayLoad = upPayLoad;
- this.downPayLoad = downPayLoad;
- }
- @Override
- public void write(DataOutput out) throws IOException {
- out.writeInt(upPackNum);
- out.writeInt(downPackNum);
- out.writeInt(upPayLoad);
- out.writeInt(downPayLoad);
- }
- // Read fields back in exactly the order write() emitted them
- @Override
- public void readFields(DataInput in) throws IOException {
- upPackNum = in.readInt();
- downPackNum = in.readInt();
- upPayLoad = in.readInt();
- downPayLoad =in.readInt();
- }
- public int getUpPackNum() {
- return upPackNum;
- }
- public void setUpPackNum(int upPackNum) {
- this.upPackNum = upPackNum;
- }
- public int getDownPackNum() {
- return downPackNum;
- }
- public void setDownPackNum(int downPackNum) {
- this.downPackNum = downPackNum;
- }
- public int getUpPayLoad() {
- return upPayLoad;
- }
- public void setUpPayLoad(int upPayLoad) {
- this.upPayLoad = upPayLoad;
- }
- public int getDownPayLoad() {
- return downPayLoad;
- }
- public void setDownPayLoad(int downPayLoad) {
- this.downPayLoad = downPayLoad;
- }
- @Override
- public String toString() {
- return " " + upPackNum + " "
- + downPackNum + " " + upPayLoad + " "
- + downPayLoad;
- }
- }
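A quick way to sanity-check the write/readFields pairing without a cluster (a standalone sketch, not part of the original post; the class name is made up):
- package cn.maoxiangyi.hadoop.wordcount;
- import java.io.ByteArrayInputStream;
- import java.io.ByteArrayOutputStream;
- import java.io.DataInputStream;
- import java.io.DataOutputStream;
- import java.io.IOException;
- public class DataWritableRoundTrip {
- public static void main(String[] args) throws IOException {
- DataWritable before = new DataWritable(24, 27, 2481, 24681);
- // Serialize the way the framework does during the shuffle
- ByteArrayOutputStream buffer = new ByteArrayOutputStream();
- before.write(new DataOutputStream(buffer));
- // Deserialize into a fresh instance and print it back
- DataWritable after = new DataWritable();
- after.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
- System.out.println(after); // prints the four counters: 24 27 2481 24681
- }
- }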
The MapReduce driver, mapper, and reducer:
- package cn.maoxiangyi.hadoop.wordcount;
- import java.io.IOException;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.Reducer;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- public class DataTotalMapReduce {
- public static void main(String[] args) throws Exception {
- Configuration configuration = new Configuration();
- Job job = new Job(configuration);
- job.setJarByClass(DataTotalMapReduce.class);
- job.setMapperClass(DataTotalMapper.class);
- job.setReducerClass(DataTotalReducer.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(DataWritable.class);
- // Reuse the reducer as the combiner: the sums are associative and the
- // reducer's input and output types match (Text, DataWritable)
- job.setCombinerClass(DataTotalReducer.class);
- // Input file and output directory are hardcoded to the cluster from the post;
- // the output directory must not exist before the run
- Path inputDir = new Path("hdfs://hadoop0:9000/HTTP_20130313143750.dat");
- FileInputFormat.addInputPath(job, inputDir);
- Path outputDir = new Path("hdfs://hadoop0:9000/dataTotal");
- FileOutputFormat.setOutputPath(job, outputDir);
- job.waitForCompletion(true);
- }
- }
- /**
-  * Sample input records (tab-separated), one per line:
-  *
-  * 1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
-  * 1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
-  * 1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
-  */
- class DataTotalMapper extends Mapper<LongWritable, Text, Text, DataWritable> {
- @Override
- protected void map(LongWritable key, Text value, Context context)
- throws IOException, InterruptedException {
- String lineStr = value.toString();
- String[] strArr = lineStr.split("\t");
- // Field layout mirrors the Hive DDL above:
- // [1] msisdn, [6]/[7] up/down packet counts, [8]/[9] up/down payload bytes
- String phone = strArr[1];
- String upPackNum = strArr[6];
- String downPackNum = strArr[7];
- String upPayLoad = strArr[8];
- String downPayLoad = strArr[9];
- // Emit one record per line, keyed by phone number
- context.write(new Text(phone),
- new DataWritable(Integer.parseInt(upPackNum),
- Integer.parseInt(downPackNum),
- Integer.parseInt(upPayLoad),
- Integer.parseInt(downPayLoad)));
- }
- }
- class DataTotalReducer extends Reducer<Text, DataWritable, Text, DataWritable> {
- @Override
- protected void reduce(Text k2, Iterable<DataWritable> v2, Context context)
- throws IOException, InterruptedException {
- // Sum the four counters over all records for this phone number
- int upPackNumSum = 0;
- int downPackNumSum = 0;
- int upPayLoadSum = 0;
- int downPayLoadSum = 0;
- for (DataWritable dataWritable : v2) {
- upPackNumSum += dataWritable.getUpPackNum();
- downPackNumSum += dataWritable.getDownPackNum();
- upPayLoadSum += dataWritable.getUpPayLoad();
- downPayLoadSum += dataWritable.getDownPayLoad();
- }
- context.write(k2, new DataWritable(upPackNumSum, downPackNumSum, upPayLoadSum, downPayLoadSum));
- }
- }
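To submit the job (a sketch; the jar name flow-stats.jar is a made-up assumption, so package the classes however your build does):
- # submit the job; the /dataTotal output directory must not already exist
- hadoop jar flow-stats.jar cn.maoxiangyi.hadoop.wordcount.DataTotalMapReduce
- # inspect the reducer output
- hadoop fs -cat /dataTotal/part-r-00000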
Excerpt of the results (they match the Hive output above; one line of HQL did the work of roughly a hundred lines of hand-written Java):
- 13402169727 171 108 11286 130230
- 13415807477 2067 1683 169668 1994181
- 13416127574 1501 1094 161963 802756
- 13416171820 113 99 10630 32120
- 13417106524 160 128 18688 13088
- 13418002498 240 256 22136 86896
- 13418090588 456 351 98934 67470
- 13418117364 264 152 29436 49966
- 13418173218 37680 48348 2261286 73159722
- 13418666750 22432 26482 1395648 39735552
- 13420637670 20 20 1480 1480
- 13422149173 40 32 4000 3704
- 13422311151 465 535 33050 661790
- 13424077835 84 72 15612 9948
- 13424084200 765 690 60930 765675
- 13428887537 43892 44830 2925330 65047620
- 13430219372 454 352 33792 192876
- 13430234524 27852 39056 1767220 52076614
- 13430237899 1293 1165 166346 808613
- 13430258776 4681 4783 350511 6609423
- 13430266620 10544 9377 11600817 5728002
- 13432023893 40 0 2400 0
Source: http://www.aboutyun.com/forum.php?highlight=hive&mod=viewthread&tid=7455