Overview
Hive is designed to enable easy data summarization and ad-hoc analysis of large volumes of data. It uses a query language called HiveQL, which is similar to SQL.
In this tutorial, we will explore the following:
- Load a data file into a Hive table
- Create a table using the RCFile format
- Query tables
- Compare managed tables and external tables
- Use the ORC file format
- Partition a table
- Bucket a table
Prerequisites
A working HDP cluster. The easiest way to get one is to download the Hortonworks Sandbox.
Step 1. Let’s load a data file into a Hive table.
First, download the data file and name it TwitterData.txt. You can copy the downloaded file into the HDFS folder /user/hadoop using the hdfs dfs -put command (see this tutorial), or use the Hue interface.
As the file is small, you can simply open it, copy and create a local file in the sandbox manually as well.
We will use the Web UI here.
Open http://localhost:8000 in your browser.
Now, click on the File Browser icon and you will see the following screen. You can upload to the /user HDFS folder or the /user/hue folder; choose your HDFS path.
Now click on Upload option and select file TwitterData.txt from your computer.
Here is sample syntax to create a table and load the data file into it.
Make sure you use the correct path for your data file.
Let's create this table and load the data.
After you SSH into the Sandbox, type "hive" at the command prompt to start the Hive shell.
CREATE TABLE TwitterExampletextexample(
  tweetId BIGINT, username STRING,
  txt STRING, CreatedAt STRING,
  profileLocation STRING,
  favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/TwitterData.txt' OVERWRITE INTO TABLE TwitterExampletextexample;
Here is the log that you can refer to for the exact steps.
Run SELECT * FROM TwitterExampletextexample; to see the data.
Step 2. Let’s create a table using the RCFile format
The Record Columnar (RC) file format determines how relational tables are stored on a distributed cluster. It gives you the advantages of a columnar layout over a row-oriented layout for a record.
Here is a sample Create RC file format table syntax:
CREATE TABLE TwitterExampleRCtable(
  tweetId BIGINT, username STRING,
  txt STRING, CreatedAt STRING,
  profileLocation STRING COMMENT 'Location of user',
  favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
Here is how to load data into the RC table. Execute it and check the results.
INSERT OVERWRITE TABLE TwitterExampleRCtable select * from TwitterExampletextexample;
Here are the logs of the exact steps.
Step 3. Let’s query the table we just created.
Let’s find the top 10 locations that tweeted the most, using TwitterExampleRCtable.
SELECT profileLocation, COUNT(txt) AS count1 FROM TwitterExampleRCtable GROUP BY profileLocation ORDER BY count1 DESC LIMIT 10;
Please see the following log and the results:
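In real Twitter data, many rows have an empty profileLocation, so the top result is often a blank string. As a hedged variant (assuming missing locations appear as empty strings in this dataset), you can filter those rows out:

```sql
-- Top 10 non-empty locations by tweet count.
-- Assumes missing locations are stored as empty strings in this file.
SELECT profileLocation, COUNT(txt) AS count1
FROM TwitterExampleRCtable
WHERE profileLocation IS NOT NULL AND profileLocation != ''
GROUP BY profileLocation
ORDER BY count1 DESC
LIMIT 10;
```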
Step 4. Let’s look at Managed tables vs External tables
Managed tables are created by default by CREATE TABLE statements, whereas external tables are used when you want the table to point to data files in place.
Here is the syntax for creating these tables.
Managed:
CREATE TABLE ManagedExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
External:
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hadoop';
Note that LOCATION must point to a directory (here, the folder holding TwitterData.txt), not to the file itself.
When you drop a managed table, Hive deletes both the data and the metadata.
When you drop an external table, Hive deletes only the metadata; the data files stay in place.
As a next step, describe both tables as below and compare the output for managed vs. external tables.
describe formatted ManagedExample;
describe formatted ExternalExample;
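To see the difference in practice, you can drop both tables and check HDFS afterwards. This is a sketch; it assumes the tables were created as above, and note that it deletes the managed table's data for real:

```sql
-- Dropping a managed table removes both the metadata and the data files.
DROP TABLE ManagedExample;

-- Dropping an external table removes only the metadata;
-- the files under its LOCATION remain in HDFS.
DROP TABLE ExternalExample;
```

After dropping ExternalExample, the underlying files are still visible with hdfs dfs -ls, while the managed table's directory is gone from the Hive warehouse.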
Step 5. The Hive ORC file format.
The Optimized Row Columnar (ORC) file format is used because it further compresses data files. It can cost a small performance loss on writes, but typically gives a large performance gain on reads.
Let’s try it out. Note that the table is stored as ORC.
CREATE TABLE ORCFileFormatExample(
  tweetId BIGINT, username STRING,
  txt STRING, CreatedAt STRING,
  profileLocation STRING COMMENT 'Location of user',
  favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORC tblproperties ("orc.compress"="ZLIB");
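The ORC table above is created empty. As with the RCFile table, you can populate it from the text table; this sketch assumes the text table from Step 1 exists and that its columns line up with the ORC table's:

```sql
-- Hive rewrites the rows into compressed ORC files during this insert.
INSERT OVERWRITE TABLE ORCFileFormatExample
SELECT * FROM TwitterExampletextexample;
```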
Step 6. Let’s create a partitioned table and load data into it.
Partitions are horizontal slices of data which allow large sets of data to be segmented into more manageable blocks.
Here is the sample syntax to create a partitioned table and load data into partitions.
CREATE TABLE PARTITIONEDExample(
  tweetId BIGINT, username STRING, txt STRING,
  favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY (CreatedAt STRING, profileLocation STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
FROM twitterexampletextexample
INSERT OVERWRITE TABLE PARTITIONEDExample
PARTITION (CreatedAt="26 04:50:56 UTC 2014", profileLocation="Chicago")
SELECT tweetId, username, txt, favc, retweet, retcount, followerscount
WHERE profileLocation='Chicago' LIMIT 100;
Here is the log from creating a table with ORC file format and a Partitioned table for your reference.
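Once the partitioned insert finishes, you can confirm that Hive created the partition:

```sql
-- Each row of the output corresponds to one partition directory
-- under the table's location in HDFS.
SHOW PARTITIONS PARTITIONEDExample;
```

Each partition is a subdirectory under the table's warehouse directory, which is what lets Hive skip irrelevant data (partition pruning) when a query filters on the partition columns.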
Step 7. Let’s create a table with Buckets.
Bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance.
Here is an example for creating a table with buckets and load data into it.
CREATE TABLE BucketingExample(
  tweetId BIGINT, username STRING,
  txt STRING, CreatedAt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY (profileLocation STRING)
CLUSTERED BY (tweetId) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Before running the insert, enable bucketing enforcement so that Hive writes the declared number of buckets:
set hive.enforce.bucketing = true;
FROM twitterexampletextexample
INSERT OVERWRITE TABLE BucketingExample
PARTITION (profileLocation="Chicago")
SELECT tweetId, username, txt, CreatedAt, favc, retweet, retcount, followerscount
WHERE profileLocation='Chicago' LIMIT 100;
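With two buckets on tweetId, Hive writes each partition as two files, and you can sample a single bucket instead of scanning everything. A hedged example using TABLESAMPLE on the bucketing column:

```sql
-- Reads only bucket 1 of 2, i.e. roughly half the rows of the partition.
SELECT COUNT(*)
FROM BucketingExample TABLESAMPLE (BUCKET 1 OUT OF 2 ON tweetId)
WHERE profileLocation='Chicago';
```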
You can browse the HDFS directories behind the Hive tables you have just created (by default under /apps/hive/warehouse on HDP) to see how partitions and buckets map to folders and files. Understanding this layout helps you design your tables and file distribution, which is very important when designing your warehouse.
We hope this was helpful and simple enough to give you a glimpse of the Hive world.
http://zh.hortonworks.com/hadoop-tutorial/using-hive-data-analysis/