1. Package preparation
hadoop-2.5.0-cdh5.3.0.tar.gz
zookeeper-3.4.5-cdh5.3.0.tar.gz
hive-0.13.1-cdh5.3.0.tar.gz
jdk1.7
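2. Environment preparation
1) Passwordless SSH
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
2) Hostname and IP mapping
vi /etc/sysconfig/network
vim /etc/hosts
3) Clock synchronization
4) Disable the firewall
3. Configuration files
1) Hadoop Core
Core settings, such as I/O configuration common to HDFS and MapReduce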
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
2) Hadoop Env: runtime environment variables
Omitted
3) Hdfs-Site
Settings for the Hadoop daemons, including the NameNode, secondary NameNode and DataNode
<configuration>
<property>
<!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
<name>dfs.name.dir</name>
<value>/opt/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hdfs/data</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/opt/hadoop/hdfs/namesecondary</value>
</property>
</configuration>
4) Mapred-site
JobTracker and TaskTracker configuration
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/opt/hadoop/mapred/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/opt/hadoop/mapred/system</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>7</value>
</property>
</configuration>
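4. HDFS
1) Format
bin/hdfs namenode -format
[Note] Formatting the NameNode more than once can leave the DataNode unable to start, because its
stored cluster ID no longer matches. To recover, delete the directories configured by dfs.name.dir
and dfs.data.dir (hdfs-site.xml above) and format again.
2) Start the NameNode and DataNode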
sbin/start-dfs.sh
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory
(defaults to $HADOOP_HOME/logs)
or
$ ./hadoop-daemon.sh start namenode
$ ./hadoop-daemon.sh start secondarynamenode
$ ./hadoop-daemon.sh start datanode
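5. Relationship among NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker
[Note] On a heavily loaded cluster, the JobTracker consumes a lot of memory and CPU while running
MapReduce jobs, so it is run on a dedicated node.
Relationship between the TaskTracker and map/reduce tasks
Problems:
1) Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
Cause 1: a 32-bit native library is used on a 64-bit system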
6. HDFS
Property | Default value | Description |
dfs.namenode.secondary.http-address | 0.0.0.0:50090 | The secondary namenode http server address and port. |
dfs.namenode.secondary.https-address | 0.0.0.0:50091 | The secondary namenode HTTPS server address and port. |
dfs.datanode.address | 0.0.0.0:50010 | The datanode server address and port for data transfer. |
dfs.datanode.http.address | 0.0.0.0:50075 | The datanode http server address and port. |
dfs.datanode.ipc.address | 0.0.0.0:50020 | The datanode ipc server address and port. |
dfs.namenode.http-address | 0.0.0.0:50070 | The address and the base port where the dfs namenode web ui will listen on. |
dfs.namenode.name.dir | file://${hadoop.tmp.dir}/dfs/name | Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
dfs.namenode.edits.dir | ${dfs.namenode.name.dir} | Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.namenode.name.dir |
dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permission allows. |
dfs.namenode.checkpoint.dir | file://${hadoop.tmp.dir}/dfs/namesecondary | Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy. |
dfs.namenode.checkpoint.edits.dir | ${dfs.namenode.checkpoint.dir} | Determines where on the local filesystem the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories then the edits is replicated in all of the directories for redundancy. Default value is same as dfs.namenode.checkpoint.dir |
Equivalent: bin/hdfs dfs == bin/hadoop fs
List the files under the root directory of HDFS
$>bin/hdfs dfs -ls /
or $>bin/hadoop fs -ls /
Create the directory user under the root directory
$>bin/hdfs dfs -mkdir /user
Upload local files to HDFS
$>bin/hdfs dfs -put ../etc/hadoop /user/
or use
[-copyFromLocal [-f] [-p] <localsrc> ... <dst>] to upload
Download from HDFS to the local filesystem
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>] to download
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>] to download
Delete files
$>bin/hdfs dfs -rm /user/hadoop/*
View the contents of an HDFS file
$>bin/hdfs dfs -cat /user/core-site.xml
//=================HDFS Configuration object: parameter loading=================
1. Classpath resource files
loaded in-order from the classpath:
1) core-default.xml: Read-only defaults for hadoop.
2) core-site.xml: Site-specific configuration for a given hadoop installation.
2. Final parameters
Configuration parameters may be declared final. Once a resource
declares a value final,
no subsequently-loaded resource can alter that value.
For example, one might define a final parameter with:
<property>
<name>dfs.hosts.include</name>
<value>/etc/hadoop/conf/hosts.include</value>
<final>true</final>
</property>
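The effect of a final parameter can be illustrated with a small JDK-only sketch. It is not Hadoop's Configuration class: the class name FinalParameterSketch and its load method are hypothetical, and each call to load stands in for one property read from a resource, in loading order.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model of "final" semantics; NOT org.apache.hadoop.conf.Configuration.
public class FinalParameterSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finals = new HashSet<>();

    /** Applies one property from a resource; returns false if it was ignored. */
    public boolean load(String name, String value, boolean isFinal) {
        if (finals.contains(name)) {
            return false; // an earlier resource declared this name final
        }
        values.put(name, value);
        if (isFinal) {
            finals.add(name);
        }
        return true;
    }

    public String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        FinalParameterSketch conf = new FinalParameterSketch();
        // First resource declares the value final.
        conf.load("dfs.hosts.include", "/etc/hadoop/conf/hosts.include", true);
        // A later resource tries to override it and is ignored.
        boolean applied = conf.load("dfs.hosts.include", "/tmp/other.include", false);
        System.out.println("override applied: " + applied);
        System.out.println("value: " + conf.get("dfs.hosts.include"));
    }
}
```

The later value is dropped silently in this sketch; how a real deployment reports such an ignored override is outside what these notes cover.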
3. Variable expansion
Value strings are first processed for variable expansion. The available properties are:
1) Other properties defined in this Configuration; and, if a name is undefined here
2) Properties in System.getProperties().
For example, if a configuration resource contains the following property definitions:
<property>
<name>basedir</name>
<value>/user/${user.name}</value>
</property>
<property>
<name>tempdir</name>
<value>${basedir}/tmp</value>
</property>
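The lookup order described above (properties of this configuration first, then System.getProperties()) can be sketched with plain JDK classes. This is a simplified illustration, not Hadoop's implementation: the class name, the regular expression and the depth limit of 20 are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified ${name} expansion; NOT Hadoop's Configuration code.
public class VariableExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}$\\s]+)\\}");
    private static final int MAX_DEPTH = 20; // assumed guard against circular references

    /** Expands ${name} from props first, then from JVM system properties. */
    public static String expand(String value, Map<String, String> props) {
        String current = value;
        for (int i = 0; i < MAX_DEPTH; i++) {
            Matcher m = VAR.matcher(current);
            if (!m.find()) {
                return current; // nothing left to expand
            }
            String name = m.group(1);
            String replacement = props.get(name);
            if (replacement == null) {
                replacement = System.getProperty(name);
            }
            if (replacement == null) {
                return current; // simplification: stop at the first unresolved variable
            }
            current = current.substring(0, m.start()) + replacement
                    + current.substring(m.end());
        }
        throw new IllegalStateException("Too many nested substitutions in: " + value);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("basedir", "/user/${user.name}");
        props.put("tempdir", "${basedir}/tmp");
        // user.name is not in props, so it is taken from the JVM system properties.
        System.out.println(expand(props.get("tempdir"), props));
    }
}
```

With the two properties from the example, tempdir resolves through basedir and then through the JVM's user.name, so the printed path depends on the account running the program.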
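//===================HDFS interfaces=======================
Common ways to interact with HDFS:
1. HTTP (does not depend on a specific HDFS version)
HftpFileSystem/HsftpFileSystem
2. FTP (not yet implemented)
3. Thrift
4. Java (the command-line interpreter uses FileSystem)
5. C
Java interface:
1. Reading data with a Hadoop URL
This approach has a serious limitation: URL.setURLStreamHandlerFactory can be called only once per JVM. There is no
way to stop a third-party component from calling it first, and if one does, data cannot be read from Hadoop this way.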
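import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class App {
    static {
        // May be called only once per JVM
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }
    public static void main(String[] args) {
        InputStream in = null;
        try {
            in = new URL("hdfs://192.168.121.200:9001/user/core-site.xml").openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the stream whether or not the copy succeeded
            IOUtils.closeStream(in);
        }
    }
}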
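2. Reading data with the FileSystem API
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemReadHadoopApp {
    public static void main(String[] args) {
        // See the Configuration description above
        Configuration conf = new Configuration();
        InputStream in = null;
        try {
            // Uses the default filesystem from core-site.xml on the classpath
            FileSystem _fs = FileSystem.get(conf);
            // Uncomment to skip checksum verification; one checksum is kept for every
            // io.bytes.per.checksum bytes of data (512 by default)
            // _fs.setVerifyChecksum(false);
            in = _fs.open(new Path(new URI("hdfs://192.168.121.200:9001/user/core-site.xml")));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the stream whether or not the copy succeeded
            IOUtils.closeStream(in);
        }
    }
}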
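7. Hadoop I/O
7.1 Checksum files for data integrity: checksums are kept in a .filename.crc file in the same directory as the
data file. One checksum is stored for every io.bytes.per.checksum bytes of data (default 512).
Ways to turn off checksum verification:
1) In code, call setVerifyChecksum(false) on the FileSystem
2) On the command line, pass -ignoreCrc to -get or the equivalent -copyToLocal
Local filesystem implementations: LocalFileSystem verifies checksums, RawLocalFileSystem does not. To use the latter:
1) FileSystem _fs2 = new RawLocalFileSystem();
_fs2.initialize(null, conf);
2) Set the global property fs.file.impl to org.apache.hadoop.fs.RawLocalFileSystem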
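The idea of keeping one checksum per fixed-size chunk can be sketched with the JDK's CRC32 class. This only illustrates the principle: it does not reproduce the on-disk layout of a .crc file, and the class and method names are hypothetical. The chunk size of 512 mirrors the io.bytes.per.checksum default mentioned above.

```java
import java.util.zip.CRC32;

// Per-chunk CRC-32 sketch; NOT the format Hadoop writes to .crc files.
public class ChunkChecksumSketch {

    /** Returns one CRC-32 value per bytesPerChecksum-sized chunk of data. */
    public static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum differs, or -1 if all match. */
    public static int firstCorruptChunk(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = checksums(data, bytesPerChecksum);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300]; // three chunks: 512 + 512 + 276 bytes
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        long[] stored = checksums(data, 512);
        data[600] ^= 0x01; // flip one bit inside the second chunk
        System.out.println("first corrupt chunk: " + firstCorruptChunk(data, stored, 512));
    }
}
```

A smaller chunk size localizes corruption more precisely at the cost of storing more checksums.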
7.2 Compression
1) Compression ratio: bzip2 > gzip > lzo
2) Compression speed: lzo > gzip > bzip2
3) Decompression speed: lzo > gzip > bzip2
7.2.1 The CompressionCodec interface (compression and decompression)
Hadoop codec implementations:
No. | Compression format | Hadoop CompressionCodec |
1 | DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
2 | gzip | org.apache.hadoop.io.compress.GzipCodec |
3 | bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
4 | lzo | com.hadoop.compression.lzo.LzopCodec |
CompressionCodec interface methods:
1) Compress: CompressionOutputStream createOutputStream(OutputStream out)
2) Decompress: CompressionInputStream createInputStream(InputStream in)
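Both methods wrap an existing stream, which is the same pattern the JDK uses in java.util.zip. The sketch below uses GZIPOutputStream and GZIPInputStream as stand-ins; they are JDK classes, not Hadoop codecs, and the class name StreamCompressionSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Stream-wrapping compression with JDK classes; NOT Hadoop's CompressionCodec.
public class StreamCompressionSketch {

    /** Writes raw bytes through a compressing stream wrapped around a sink. */
    public static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Reads bytes back through a decompressing stream wrapped around a source. */
    public static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[10000]; // highly repetitive input compresses well
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, decompress(packed)));
    }
}
```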
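CompressionCodecFactory infers the CompressionCodec from the file extension:
CompressionCodecFactory looks the codec up in the list defined by the io.compression.codecs property.
By default Hadoop provides all of its codecs, so this property needs changing only when a custom codec is required.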
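Inferring a codec from the file extension amounts to a suffix lookup. The sketch below is a JDK-only illustration, not CompressionCodecFactory itself: the class name is hypothetical, and the extensions (.deflate, .gz, .bz2, .lzo) are the usual defaults for these codecs, stated here as an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Suffix-to-codec lookup sketch; NOT Hadoop's CompressionCodecFactory.
public class CodecByExtensionSketch {
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        // Extensions are assumed defaults; values are codec class names.
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    }

    /** Returns the codec class name for a file name, or null if no suffix matches. */
    public static String codecClassFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null; // treated as uncompressed
    }

    public static void main(String[] args) {
        System.out.println(codecClassFor("/user/logs/access.log.gz"));
        System.out.println(codecClassFor("/user/core-site.xml"));
    }
}
```

A file with no recognized suffix yields null here, which a caller would treat as uncompressed data.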
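For performance, it is best to use the native libraries for compression and decompression. Setting hadoop.native.lib=false
disables the native libraries; their location can be set through java.library.path (in the Hadoop configuration files)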
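Choosing a compression format:
1. Store the files uncompressed
2. Use a format that supports splitting, such as bzip2
3. Split the file into chunks in the application
4. Use a sequence file
5. Use an Avro data file
MapReduce output compression:
mapred.output.compress=true
mapred.output.compression.codec=<codec implementation class>
For SequenceFile output, the compression type can also be set:
mapred.output.compression.type=RECORD
Map output compression:
mapred.compress.map.output=true
mapred.map.output.compression.codec=<codec implementation class>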
8. YARN on a Single Node
8.1 Configuration
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
8.2 Startup
$ ./yarn-daemon.sh start resourcemanager
$ ./yarn-daemon.sh start nodemanager
Console logging configuration:
hadoop-env.sh
export HADOOP_ROOT_LOGGER=DEBUG,console
fs.default.name
- For MRv1:
<property> <name>fs.default.name</name> <value>hdfs://mycluster</value> </property>
- For YARN:
<property> <name>fs.defaultFS</name> <value>hdfs://mycluster</value> </property>
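Log directory (kept outside the Hadoop installation):
export HADOOP_LOG_DIR=/var/log/hadoop
Memory configuration:
By default Hadoop allocates 1 GB (1000 MB) of memory to each daemon; the value is controlled by the
HADOOP_HEAPSIZE setting in hadoop-env.sh: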
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
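The tasktracker launches separate child JVMs to run map and reduce tasks; task memory is controlled by
mapred.child.java.opts and defaults to 200 MB
The maximum number of map tasks a tasktracker runs at once is controlled by
mapred.tasktracker.map.tasks.maximum, default 2
Likewise, the maximum number of reduce tasks a tasktracker runs at once is controlled by
mapred.tasktracker.reduce.tasks.maximum, default 2
How many tasks a tasktracker can run at once depends on how many processors the machine has. Because MapReduce jobs
are usually I/O-bound (most of the cost is I/O), a rule of thumb is to keep the ratio of tasks (map and reduce
combined) to processors between 1 and 2
The namenode, secondary namenode and jobtracker daemons each get 1 GB of memory by default. As a rule of thumb, allow
1 GB per million blocks. Take a 200-node cluster with 4 TB of disk per node, a 128 MB block size and a replication
factor of 3: it holds about 2 million blocks (or more): 200 * 4000000 MB / (128 MB * 3). In this example the
namenode and secondary namenode are therefore each given 2 GB (the secondary namenode needs the same memory as the namenode)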
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
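The namenode sizing rule of thumb from the memory notes above can be worked through as a short calculation. The sketch uses only the figures given in these notes (200 nodes, 4,000,000 MB per node, 128 MB blocks, replication 3, 1 GB per million blocks); the class and method names are hypothetical.

```java
// Worked arithmetic for the namenode heap rule of thumb.
public class NameNodeHeapSketch {

    /** Estimated block count: total raw disk divided by (block size x replication). */
    public static long estimateBlocks(int nodes, long diskPerNodeMb, int blockSizeMb, int replication) {
        return (long) nodes * diskPerNodeMb / ((long) blockSizeMb * replication);
    }

    /** Heap in GB under the rule of thumb of 1 GB per million blocks. */
    public static double heapGb(long blocks) {
        return blocks / 1_000_000.0;
    }

    public static void main(String[] args) {
        long blocks = estimateBlocks(200, 4_000_000L, 128, 3);
        System.out.println("estimated blocks: " + blocks);
        System.out.printf("suggested heap: about %.2f GB%n", heapGb(blocks));
    }
}
```

The estimate comes out slightly above 2 million blocks, which is why the example settles on roughly 2 GB for the namenode and the secondary namenode.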
Create a user directory and grant ownership
#./hadoop fs -mkdir /user/oy
#./hadoop fs -ls /user
-rw-r--r-- 3 root supergroup 1063 2017-03-17 07:24 /user/core-site.xml
drwxr-xr-x - root supergroup 0 2017-03-17 07:14 /user/hadoop
drwxr-xr-x - root supergroup 0 2017-06-16 19:39 /user/oy
#./hadoop fs [-chown [-R] [OWNER][:[GROUP]] PATH...]
#./hadoop fs -chown oy:oy /user/oy
//=======================Hadoop environment variables===========================
On startup, Hadoop reads the variables set in hadoop-config.sh:
Location of the file: hadoop-2.6.0-cdh5.4.0/libexec/hadoop-config.sh
//======================Hadoop problems===============================
Problems
1)NativeCodeLoader
NativeCodeLoader: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
2) Formatting
17/06/19 06:50:43 WARN common.Util: Path /opt/hadoop/hfiles/hdfs/name should be
specified as a URI in configuration files. Please update hdfs configuration.
17/06/19 06:50:43 WARN common.Util: Path /opt/hadoop/hfiles/hdfs/name should be
specified as a URI in configuration files. Please update hdfs configuration.
Add the file: prefix to the paths:
<property>
<!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
<name>dfs.name.dir</name>
<value>file:/opt/hadoop/hfiles/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:/opt/hadoop/hfiles/hdfs/data</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:/opt/hadoop/hfiles/hdfs/namesecondary</value>
</property>
Formatting using clusterid: CID-009ca18e-cedc-4f93-b40e-c1c1107b4164
17/06/19 06:50:43 INFO namenode.FSNamesystem: No KeyProvider found.
17/06/19 06:50:43 DEBUG crypto.OpensslCipher: Failed to load OpenSSL Cipher.
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl(Native Method)
at org.apache.hadoop.crypto.OpensslCipher.<clinit>(OpensslCipher.java:84)
at org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec.<init>(OpensslAesCtrCryptoCodec.java:50)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
at org.apache.hadoop.crypto.CryptoCodec.getInstance(CryptoCodec.java:68)
at org.apache.hadoop.crypto.CryptoCodec.getInstance(CryptoCodec.java:101)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:802)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:778)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:980)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1425)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1550)
3) YARN job hangs
[root@hadoop1 conf]# jps
2776 DataNode
5722 RunJar
6427 Jps
3397 ResourceManager
5478 RunJar
2943 SecondaryNameNode
2669 NameNode
0: jdbc:hive2://localhost:10000/testdb> select count(*) from t1;
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool
interface and execute your application with ToolRunner to remedy this.
INFO : Starting Job = job_1498049139033_0001, Tracking URL =
http://hadoop1:8088/proxy/application_1498049139033_0001/
INFO : Kill Command = /opt/hadoop/hadoop-2.6.0-cdh5.4.0/bin/hadoop job -kill job_1498049139033_0001
No NodeManager is running, so the job hangs
Kill the YARN job:
[root@hadoop1 bin]# yarn application -kill application_1498049139033_0001
//================================HIVE===========================
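hive.metastore.local is a configuration property that is no longer used from version 0.10 onward.
1) Hive local mode
If hive.metastore.uris is empty, Hive runs in local mode
Start Hive from the CLI; the metastore, hiveserver and hiveserver2 do not need to be started
2) Hive remote mode
// Configure the connection to the metastore server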
<property>
<name>hive.metastore.uris</name>
<value>thrift://ip:port</value>
</property>
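3) Hive ports
hive --service metastore & ---> listens on port 9083 by default
hive --service hiveserver2 & ----> Thrift listens on port 10000 (default)
4) hive cli (deprecated in newer versions)
hive cli depends on Hadoop
After downloading the Hive package, configure conf/hive-env.sh: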
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/opt/hadoop/hadoop-2.6.0-cdh5.4.0
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/opt/hadoop/hive-1.1.0-cdh5.4.0/conf
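Local (embedded) mode
The hive.metastore.uris property is empty
5) hive beeline (similar to hive cli)
Works in both local embedded mode and remote mode
Depends on hiveserver2, which must be started first; it listens on port 10000 by default, and a custom port can be set.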
hive --service hiveserver2 -p 11000 &
Connect to hiveserver2:
jdbc:hive2://ip:port/dbname
For example:
[root@hadoop1 bin]# ./beeline
17/06/21 06:28:52 DEBUG util.VersionInfo: version: 2.6.0-cdh5.4.0
Beeline version 1.1.0-cdh5.4.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/testdb
scan complete in 12ms
Connecting to jdbc:hive2://localhost:10000/testdb
Enter username for jdbc:hive2://localhost:10000/testdb:
Enter password for jdbc:hive2://localhost:10000/testdb:
Connected to: Apache Hive (version 1.1.0-cdh5.4.0)
Driver: Hive JDBC (version 1.1.0-cdh5.4.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/testdb> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| t1 |
+-----------+--+
1 row selected (2.72 seconds)
0: jdbc:hive2://localhost:10000/testdb>
Check the port:
[root@hadoop1 bin]# netstat -anp | grep 10000
6) Initializing the Hive metastore with PostgreSQL
[root@hadoop1 bin]# ./schematool -dbType postgres -initSchema
The initialization scripts for each database type (oracle/mysql/postgres/derby/mssql) are located in
$HIVE_HOME/scripts/metastore/upgrade/
7)FATAL: no pg_hba.conf entry for host "192.168.110.166",
user "jcbk", database "jcbk", SSL off
Edit PostgreSQL\9.5\data\pg_hba.conf in the PostgreSQL installation:
# TYPE DATABASE USER ADDRESS METHOD
# IPv4 local connections:
host all all 127.0.0.1/32 md5
host all all 192.168.110.166/32 trust
# IPv6 local connections:
host all all ::1/128 md5
# Allow replication connections from localhost, by a user with the
# replication privilege.
#host replication postgres 127.0.0.1/32 md5
#host replication postgres ::1/128 md5
Edit PostgreSQL\9.5\data\postgresql.conf in the PostgreSQL installation so that it listens for requests from all hosts:
listen_addresses = '*'
8)Unable to open a test connection to the given database
Caused by: java.sql.SQLException: Unable to open a test connection to the
given database. JDBC url = jdbc:postgresql://localhost:5432/jcbk, username
= jcbk. Terminating connection pool (set lazyInit to true if you expect to start
your database after your app). Original Exception: ------
org.postgresql.util.PSQLException: Connection refused. Check that the hostname
and port are correct and that the postmaster is accepting TCP/IP connections.
Cause: the IP address is incorrect
9) Failed to get schema version.
Metastore connection URL: jdbc:postgresql://localhost:5432/jcbk?createDatabaseIfNotExist=true
Metastore Connection Driver : org.postgresql.Driver
Metastore connection User: jcbk
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
*** schemaTool failed ***
Cause: the database schema scripts have not been run
10)MissingTableException
Caused by: org.datanucleus.store.rdbms.exceptions.MissingTableException:
Required table missing : ""VERSION"" in Catalog "" Schema "". DataNucleus
requires this table to perform its persistence operations. Either your MetaData
is incorrect, or you need to enable "datanucleus.autoCreateTables"
Modify hive-site.xml as follows:
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
11)Failed to start database 'metastore_db'
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with
class loader sun.misc.Launcher$AppClassLoader@2bb0bf9a, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 61 more
Caused by: java.sql.SQLException: Database at /opt/hadoop/hive-1.1.0-cdh5.4.0/bin/metastore_db
has an incompatible format with the current version of the software. The database
was created by or upgraded by version 10.11.
Delete $HIVE_HOME/bin/metastore_db
12) Could not connect to meta store using any of the URIs
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/
CDH-5.3.0-1.cdh5.3.0.p0.30/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Caused by: MetaException(message:Could not connect to meta store using any of the
URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException:
java.net.ConnectException: Connection refused
The metastore (hive server, port=9083) is not running
#./hive --service metastore &
13) Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
the Hive properties to implicitly create or alter the existing schema are disabled by default.
Hive will not attempt to change the metastore schema implicitly. When you execute a
Hive query against an old schema, it will fail to access the metastore;
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:367)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:633)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Add the following to the Hive configuration file hive-site.xml:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
14)HiveServer2 Security Configuration
HiveServer2 supports authentication of the Thrift client using either of these methods:
- Kerberos authentication
<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/_HOST@YOUR-REALM.COM</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/etc/hive/conf/hive.keytab</value>
</property>
jdbc:hive2://node1:10000/default;principal=hive/HiveServer2Host@YOUR-REALM.CO
- LDAP authentication
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
<property>
<name>hive.server2.authentication.ldap.url</name>
<value>LDAP_URL</value>
</property>
<property>
<name>hive.server2.authentication.ldap.baseDN</name>
<value>LDAP_BaseDN</value>
</property>
jdbc:hive2://node1:10000/default;user=LDAP_Userid;password=LDAP_Password
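The two connection strings above share one shape: jdbc:hive2://host:port/db followed by ;key=value session variables. A small JDK-only sketch of composing them follows. The class and method names are hypothetical, the principal hive/node1@EXAMPLE.COM is a made-up placeholder, and no Hive driver is loaded or contacted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds HiveServer2 JDBC URL strings only; it does not open a connection.
public class HiveJdbcUrlSketch {

    /** Base URL plus ;key=value session variables, in insertion order. */
    public static String url(String host, int port, String db, Map<String, String> sessionVars) {
        StringBuilder sb = new StringBuilder("jdbc:hive2://")
                .append(host).append(':').append(port).append('/').append(db);
        for (Map.Entry<String, String> e : sessionVars.entrySet()) {
            sb.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Kerberos form: the server principal is passed as a session variable. */
    public static String kerberos(String host, int port, String db, String principal) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("principal", principal);
        return url(host, port, db, vars);
    }

    /** LDAP form: user and password are passed as session variables. */
    public static String ldap(String host, int port, String db, String user, String password) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("user", user);
        vars.put("password", password);
        return url(host, port, db, vars);
    }

    public static void main(String[] args) {
        // hive/node1@EXAMPLE.COM is a hypothetical principal used for illustration.
        System.out.println(kerberos("node1", 10000, "default", "hive/node1@EXAMPLE.COM"));
        System.out.println(ldap("node1", 10000, "default", "LDAP_Userid", "LDAP_Password"));
    }
}
```

Embedding a password in the URL keeps the example short; the notes do not say how credentials should be supplied in practice.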
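15) Version mismatch between the Hive JDBC driver and the server
Could not establish connection to jdbc:hive2://192.168.121.200:10000/default:
Required field 'client_protocol'
Solution: replace the jline jar in the following Hadoop directory with the jline version from $HIVE_HOME/lib
E:\openjar\hadoop\hadoop\hadoop-2.5.0-cdh5.3.0\share\hadoop\yarn\lib
16) Using the beeline command line on Windows to access Hive on Linux
Steps:
(1) Download and install Hadoop, and set the HADOOP_HOME environment variable
(2) Download and install the JDK, and set the JAVA_HOME environment variable
(3) Download and install Hive. Because $HIVE_HOME/bin contains no beeline.cmd, create
the file beeline.cmd in the bin directory and copy the content below into it.
(4) If the jline versions in Hadoop and Hive conflict, standardize on the jline version from Hive
(5) When connecting to the Hive server remotely with the beeline JDBC client, if the location path of an external
table being created requires permissions, supply a user name and password when connecting to Hive; otherwise the external table cannot be created
(6) If the files behind a Hive external table require permissions, supply a user name and password when connecting to Hive to query the table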
// ===============================beeline.cmd=================================
@echo off
@rem Licensed to the Apache Software Foundation (ASF) under one or more
@rem contributor license agreements. See the NOTICE file distributed with
@rem this work for additional information regarding copyright ownership.
@rem The ASF licenses this file to You under the Apache License, Version 2.0
@rem (the "License"); you may not use this file except in compliance with
@rem the License. You may obtain a copy of the License at
@rem
@rem http://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
SetLocal EnableDelayedExpansion
pushd %CD%\..
if not defined HIVE_HOME (
set HIVE_HOME=%CD%
)
popd
if "%HADOOP_BIN_PATH:~-1%" == "\" (
set HADOOP_BIN_PATH=%HADOOP_BIN_PATH:~0,-1%
)
if not defined JAVA_HOME (
echo Error: JAVA_HOME is not set.
goto :eof
)
@rem get the hadoop envrionment
if not exist %HADOOP_HOME%\libexec\hadoop-config.cmd (
@echo +================================================================+
@echo ^| Error: HADOOP_HOME is not set correctly ^|
@echo +----------------------------------------------------------------+
@echo ^| Please set your HADOOP_HOME variable to the absolute path of ^|
@echo ^| the directory that contains \libexec\hadoop-config.cmd ^|
@echo +================================================================+
exit /b 1
)
@rem supress the HADOOP_HOME warnings in 1.x.x
set HADOOP_HOME_WARN_SUPPRESS=true
call %HADOOP_HOME%\libexec\hadoop-config.cmd
@rem include only the beeline client jar and its dependencies
pushd %HIVE_HOME%\lib
for /f %%a IN ('dir /b hive-beeline-**.jar') do (
set CLASSPATH=%CLASSPATH%;%HIVE_HOME%\lib\%%a
)
for /f %%a IN ('dir /b super-csv-**.jar') do (
set CLASSPATH=%CLASSPATH%;%HIVE_HOME%\lib\%%a
)
for /f %%a IN ('dir /b jline-**.jar') do (
set CLASSPATH=%CLASSPATH%;%HIVE_HOME%\lib\%%a
)
for /f %%a IN ('dir /b hive-jdbc-**-standalone.jar') do (
set CLASSPATH=%CLASSPATH%;%HIVE_HOME%\lib\%%a
)
popd
call %JAVA_HOME%\bin\java %JAVA_HEAP_MAX% %HADOOP_OPTS% -classpath %CLASSPATH% org.apache.hive.beeline.BeeLine %*
endlocal
//===========================================================================
17) Connecting to HDFS remotely from the Windows command line
(1) Download Hadoop and set the HADOOP_HOME environment variable
(2) In Hadoop's core-site.xml, set the ip:port of the remote HDFS namenode
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.121.200:8020</value>
</property>
(3) Download winutils.exe and place it in $HADOOP_HOME/bin to resolve the following error:
17/06/22 16:09:14 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable
E:\openjar\hadoop\hadoop\hadoop-2.5.0-cdh5.3.0\bin\winutils.exe in the Hadoop binaries.
(4) Uploading files from Windows fails with a "block" exception; uploading to the remote HDFS server works from Linux
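18) Hdfs client : Permission denied
1. Add HADOOP_USER_NAME to the system environment variables or the Java JVM properties. Its value depends on your setup: it is the Linux user name that will run on Hadoop. (Restart Eclipse after the change, otherwise it may not take effect.)
2. Rename the current system account to hadoop
3. Use the HDFS command line to change the permissions of the target directory: hadoop fs -chmod 777 /user, where /user is the path files are uploaded to and varies by case. For example, if the upload path is hdfs://namenode/user/xxx.doc, this change is sufficient; if the upload path is hdfs://namenode/java/xxx.doc, use hadoop fs -chmod 777 /java or hadoop fs -chmod 777 /. For the former, the java directory must first be created in HDFS; the latter changes the permissions of the root directory.
4. In the HDFS configuration file, set dfs.permissions to false
19) HDFS authentication modes
Hadoop supports two authentication modes, selected with the hadoop.security.authentication property:
- simple
In simple mode, the user's identity is the login user of the client operating system. On Unix-like systems, the HDFS user name is the name printed by the whoami command.
- kerberos
In kerberos mode, the HDFS user's identity is determined by Kerberos credentials. Kerberos authentication is more secure but more complex to configure, so it is used less often.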
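How a client in simple mode might arrive at its user name can be sketched as a three-step lookup. The order used here (HADOOP_USER_NAME environment variable, then a JVM property of the same name, then the OS login user) is an assumption for illustration, and the class is hypothetical rather than Hadoop's own login code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative user-name lookup for simple authentication; NOT Hadoop's login code.
public class EffectiveUserSketch {
    static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";

    /** Environment variable first, then JVM property, then the OS login user. */
    public static String resolve(Map<String, String> env, Properties sysProps) {
        String user = env.get(HADOOP_USER_NAME);
        if (user == null || user.isEmpty()) {
            user = sysProps.getProperty(HADOOP_USER_NAME);
        }
        if (user == null || user.isEmpty()) {
            user = sysProps.getProperty("user.name"); // what whoami would report
        }
        return user;
    }

    public static void main(String[] args) {
        // Real process state: result depends on the machine running this.
        System.out.println("effective user: " + resolve(System.getenv(), System.getProperties()));

        // Simulated override, as in fix 1 for "Permission denied" above.
        Map<String, String> env = new HashMap<>();
        env.put(HADOOP_USER_NAME, "hadoop");
        System.out.println("with override: " + resolve(env, System.getProperties()));
    }
}
```

Because the name is taken on trust from the client, simple mode gives no real protection against a user who sets the variable to someone else's name.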
20) Using Hive with Existing Files on S3
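Overview
Install the JDK and set the JAVA_HOME environment variable; the next step is to configure the $HIVE_OPTS parameters.
Updating the configuration
First set the following parameters. This can be done via HIVE_OPTS, configuration files ($HIVE_HOME/conf/hive-site.xml), or via the Hive CLI’s SET command.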
Here are the configuration parameters:
Name | Value |
fs.s3n.awsAccessKeyId | Your S3 access key |
fs.s3n.awsSecretAccessKey | Your S3 secret access key |
Creating a Hive table over S3
CREATE EXTERNAL TABLE mydata (key STRING, value INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '=' LOCATION 's3n://mys3bucket/';
Note: don’t forget the trailing slash in the LOCATION clause!
Here we’ve created a Hive table named mydata that has two columns: a key and a value. The FIELDS TERMINATED clause tells Hive that the two columns are separated by the ‘=’ character in the data files. The LOCATION clause points to our external data in mys3bucket.
Now, we can query the data:
SELECT * FROM mydata ORDER BY key;
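20) Hive file storage formats
1.textfile
textfile is the default format; storage: row-oriented; high disk overhead and high parsing overhead; Hive cannot merge or split compressed text files; query efficiency is the lowest and load speed the highest
2.sequencefile
Binary file, serialized as <key,value> pairs; storage: row-oriented; splittable and compressible, usually with block compression; its advantage is compatibility with MapFile in the Hadoop API. Highest storage consumption; compressed files can be split and merged; good query efficiency; data must be loaded by converting from text files
3.rcfile
Storage: data is divided into row groups and each group is stored by column; fast compression and fast column access; reading a record touches as few blocks as possible; reading the required columns needs only the header definition of each row group; for full-table reads, performance may show no clear advantage over sequencefile; smallest storage footprint, highest query efficiency, must be loaded by converting from text files, slowest load speed
4.orc
Storage: data is divided into row groups and each group is stored by column; fast compression and fast column access; more efficient than rcfile, of which it is an improved version
5. Custom formats
Users can define their own input and output formats by implementing InputFormat and OutputFormat.
21) hiveserver and hiveserver2
Since Hive 1.0, hiveserver has been replaced by hiveserver2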
[root@hadoop1 bin]# ./hive --service hiveserver --help
17/07/05 23:25:36 DEBUG util.VersionInfo: version: 2.6.0-cdh5.4.0
Starting Hive Thrift Server
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hive.service.HiveServer
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)