`
wbj0110
  • 浏览: 1617809 次
  • 性别: Icon_minigender_1
  • 来自: 上海
文章分类
社区版块
存档分类
最新评论

Using Hive for Data Analysis

    博客分类:
  • Hive
阅读更多

概述

Hive is designed to enable easy data summarization and ad-hoc analysis of large volumes of data. It uses a query language called Hive-QL which is similar to SQL.

In this tutorial, we will explore the following:

  1. Load a data file into a Hive table
  2. Create a table using RCFormat
  3. Query tables
  4. Managed tables vs external tables
  5. ORC format
  6. PARTITIONED a Table
  7. Bucketing a Table

Prerequisites

A working HDP cluster – the easiest way to have a HDP cluster is to download the Hortonworks Sandbox

Step 1. Let’s load a data file into a Hive table.

First of all, download data file from here click here and name the file as TwitterData.txt . You can copy the downloaded file into hdfs folder, /user/hadoop using hdfs fs -put command (see this tutorial) or the Hue Interface.

As the file is small, you can simply open it, copy and create a local file in the sandbox manually as well.

We will use the Web UI here.

Open http://localhost:8000 in your browser.

Now, click on File Browser logo and you will see the following screen. You could load to /user hdfs folder or /user/hue folder. Please choose your hdfs path.

Now click on Upload option and select file TwitterData.txt from your computer.

Here is a sample syntax to create a table and load datafile into the table.

Make sure you are using the correct path for your data file.

Let’s create this table and load data.
Type “hive” at the command prompt after you ssh in to the Sandbox.

    CREATE TABLE TwitterExampletextexample(
        tweetId BIGINT, username STRING,
        txt STRING, CreatedAt STRING,
        profileLocation STRING,
        favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
    COMMENT 'This is the Twitter streaming data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    LOAD  DATA  INPATH  '/user/Twitterdata.txt' OVERWRITE INTO TABLE TwitterExampletextexample;

Here is the log that you can refer for exact steps.

Please run select * from this table to see the data.

Step 2. Let’s create a table using RCfile format

Record Columnar(RC) format determines how to store relational tables on distributed computer clusters. With this format, you can get the advantages of a columnar format over row format of a record.

Here is a sample Create RC file format table syntax:
    CREATE TABLE TwitterExampleRCtable(
        tweetId INT, username BIGINT,
        txt STRING, CreatedAt STRING,
        profileLocation STRING COMMENT 'Location of user',
        favc INT,retweet STRING,retcount INT,followerscount INT)
    COMMENT 'This is the Twitter streaming data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS RCFILE;

Here is the step on how to Load Data into the RC Table. Please execute and see the results.

INSERT OVERWRITE TABLE TwitterExampleRCtable select * from  TwitterExampletextexample;

Here are the logs of the exact steps.

Step 3. Let’s query the table we just created.

Let’s find top 10 countries who tweeted most using TwitterExampleRCtable.

Select profileLocation, COUNT(txt) as count1 FROM TwitterExampleRCtable GROUP BY profileLocation ORDER BY count1 desc limit 10;

Please see the folloiwng log and the results:

Step 4. Let’s look at Managed tables vs External tables

Managed tables are created by default with CREATE TABLE statements, whereas External tables are used when you want your tables to point to data file in place.

Here is the syntax for creating these tables.

Managed:
    CREATE TABLE ManagedExample(
        tweetId BIGINT, username STRING,
        txt STRING, CreatedAt STRING,
        profileLocation STRING,
        favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
    COMMENT 'This is the Twitter streaming data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
External:
    CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
        tweetId BIGINT, username STRING,
        txt STRING, CreatedAt STRING,
        profileLocation STRING,
        favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
    COMMENT 'This is the Twitter streaming data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    location '/user/Twitterdata.txt';

Also, when you drop a Managed table, it deletes the data, and it also deletes the metadata.

When you drop an External table, it only deletes the metadata.

As a next step, you could describe the above tables as below and compare the output with managed vs. external tables.

describe formatted ManagedExample;
describe formatted ExternalExample;

Step 5. Hive ORC File format.

Optimized Row Columnar (ORC) File format is used as it further compresses data files. It could result in a small performance loss in writing, but there will be huge performance gain in reading.

Let’s try it out. Please see that the table is stored as ORC.

    CREATE TABLE ORCFileFormatExample(
        tweetId INT, username BIGINT,
        txt STRING, CreatedAt STRING,
        profileLocation STRING COMMENT 'Location of user',
        favc INT,retweet STRING,retcount INT,followerscount INT)
    COMMENT 'This is the Twitter streaming data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS ORC tblproperties ("orc.compress"="GLIB");

Step 6. Let’s create a PARTITIONED Table and load data into.

Partitions are horizontal slices of data which allow large sets of data to be segmented into more manageable blocks.
Here is the sample syntax to create a partitioned table and load data into partitions.

CREATE TABLE PARTITIONEDExample(
tweetId INT, username BIGINT, txt STRING,favc INT,retweet STRING,retcount INT,followerscount INT) COMMENT 'This is the Twitter streaming data' PARTITIONED BY(CreatedAt STRING, profileLocation STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

FROM twitterexampletextexample
INSERT OVERWRITE TABLE PARTITIONEDExample PARTITION (CreatedAt="26 04:50:56 UTC 2014",profileLocation="Chicago") SELECT tweetId,username,txt,favc,retweet,retcount,followerscount where profileLocation='Chicago' limit 100;

Here is the log from creating a table with ORC file format and a Partitioned table for your reference.

Step 7. Let’s create a table with Buckets.

Bucketing is a technique that allows to cluster or segment large sets of data to optimize query performance.

Here is an example for creating a table with buckets and load data into it.

    CREATE TABLE BucketingExample(
        tweetId INT, username BIGINT,
        txt STRING,CreatedAt STRING,favc INT,retweet STRING,retcount INT,                           followerscount INT)
    COMMENT 'This is the Twitter streaming data'
    PARTITIONED BY( profileLocation STRING)
    CLUSTERED BY(tweetId) INTO 2 BUCKETS
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    set hive.enforce.bucketing = true; 
    FROM twitterexampletextexample
    INSERT OVERWRITE TABLE BucketingExample PARTITION (profileLocation="Chicago")    SELECT tweetId,username,txt,CreatedAt,favc,retweet,retcount,followerscount       where profileLocation='Chicago' limit 100;

enter image description here

You can go to hdfs folder and see the directory structure behind these Hive tables that you have just created. That could help you to design your tables and file distributions which is very important in designing your warehouse.

Hope, this was helpful and simple enough to give you a glimpse of the Hive world.

http://zh.hortonworks.com/hadoop-tutorial/using-hive-data-analysis/

 
分享到:
评论

相关推荐

    Data Analytics with Hadoop: An Introduction for Data Scientists

    from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data ...

    Scala: Guide for Data Science Professionals

    "Scala: Guide for Data ... It contains a diverse set of recipes that cover the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala.

    Sams.Teach.Yourself.Big.Data.Analytics.with.Microsoft.HDInsight

    Consuming HDInsight data from Microsoft BI Tools over Hive ODBC Driver - Using HDInsight with Microsoft BI and Power BI to simplify data integration, analysis, and reporting Using PIG for big data ...

    Data Analytics with Hadoop(O'Reilly,2016)

    from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data ...

    Big Data Made Easy - A Working Guide To The Complete Hadoop Toolset

    Extract, Transform, Load (ETL) processes are fundamental in preparing data for analysis. This chapter discusses: - **Pentaho**: An open-source data integration tool that supports ETL processes. - **...

    Scala:Applied Machine Learning

    Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters Apply key ...

    Hadoop.Application.Architectures.1491900083

    Architecture examples for clickstream analysis, fraud detection, and data warehousing Table of Contents Part I. Architectural Considerations For Hadoop Applications Chapter 1. Data Modeling In Hadoop ...

    Windows Registry Forensics

    Approaches to live response and analysis are included, and tools and techniques for postmortem analysis are discussed at length. Tools and techniques will be presented that take the analyst beyond ...

    Hadoop.Essentials.1784396680

    MapReduce, HDFS, and other Hadoop ecosystem components, you will soon learn many exciting topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also ...

    impala-2.8

    - **Interactive Query Processing**: It supports ad-hoc queries, making it ideal for exploratory data analysis. - **Scalability**: Impala can scale out to handle large volumes of data across multiple ...

    Pro Spark Streaming(Apress,2016)

    Real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR Streaming machine learning, predictive analytics, and recommendations Meshing batch processing with stream processing via the ...

    hadoop_the_definitive_guide_3nd_edition

    Data Storage and Analysis 3 Comparison with Other Systems 4 RDBMS 4 Grid Computing 6 Volunteer Computing 8 A Brief History of Hadoop 9 Apache Hadoop and the Hadoop Ecosystem 12 Hadoop Releases 13 What...

    基于Hadoop技术的大数据就业岗位数据分析.docx

    the hiring salary, quantity, detailed information, such as professional skill requirements for the candidates for your reference and decision analysis. Key words: Big data; Job recruitment; ...

    twitter-Sentiment-Analysis-using-hadoop:一个项目,在该项目中,人们可以获取和阅读推文,并向谁展示分析力,例如谁是最有影响力的

    该项目名为“twitter-Sentiment-Analysis-using-hadoop”,旨在通过Hadoop生态系统进行大规模的数据处理,以分析Twitter上的用户情绪。这个项目的核心目标是提取有价值的信息,例如识别最有影响力的Twitter用户,这...

    Spark与Mysql的交互

    背景  Spark在对目标数据进行计算后,RDD格式的数据一般都会存在HDFS,Hive,HBase中,另一方面,对于非RDD格式的数据,可能会存放在像...  - Using Apache Spark and MySQL for Data Analysis  - spark 1.3.0 将d

Global site tag (gtag.js) - Google Analytics