
[Hadoop] Sqoop 1.4.2 Documentation (Part 1): Data Import

1. Sqoop Help
$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.


You can get help either with sqoop help (tool-name) or with sqoop (tool-name) --help.
For example: sqoop help import or sqoop import --help.

2. Sqoop Aliases
For example, sqoop import --help is equivalent to sqoop-import --help; that is, sqoop-import is an alias for sqoop import.

3. sqoop-import
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>     Specify JDBC connect string
   --connect-manager <jdbc-uri>     Specify connection manager class to use
   --driver <class-name>    Manually specify JDBC driver class to use
   --hadoop-home <dir>      Override $HADOOP_HOME
   --help                   Print usage instructions
-P                          Read password from console
   --password <password>    Set authentication password
   --username <username>    Set authentication username
   --verbose                Print more information while working

[...]

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


Generic options must be set before the common arguments. -conf, -fs, -jt, and -D all configure the underlying Hadoop services; for example, -D mapred.job.name=<job_name> sets the job name. If it is not specified, the job is named after the jar file being used.
For example:
User: hdfs
Job Name: cn_opda_a_phonoalbumshoushou_json_120901.jar
Job File: hdfs://vm-nba01.in.dx:9000/home/hdfs/tmp/mapred/staging/hdfs/.staging/job_201210171559_0391/job.xml
Submit Host: vm-nba01.in.dx
Submit Host Address: 10.18.102.101
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Tue Oct 23 15:18:41 CST 2012
Finished at: Tue Oct 23 15:23:20 CST 2012
Finished in: 4mins, 39sec
Job Cleanup: Successful
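To give the job a more readable name than the default jar-based one, pass -D mapred.job.name before the tool-specific arguments. A minimal sketch, with placeholder connection values:
$ sqoop import -D mapred.job.name=test_import \
    --connect jdbc:mysql://localhost/db --username foo --table TEST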


The -files, -libjars, and -archives options are not Sqoop-specific, because they are already supported by Hadoop's own commands; see the help for hadoop job:
[work@vm-nba01 ~]$ hadoop job
Usage: JobClient <command> <args>
	[-submit <job-file>]
	[-status <job-id>]
	[-counter <job-id> <group-name> <counter-name>]
	[-kill <job-id>]
	[-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
	[-events <job-id> <from-event-#> <#-of-events>]
	[-history <jobOutputDir>]
	[-list [all]]
	[-list-active-trackers]
	[-list-blacklisted-trackers]
	[-list-attempt-ids <job-id> <task-type> <task-state>]

	[-kill-task <task-id>]
	[-fail-task <task-id>]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
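As an illustration, -libjars can be used to ship additional jars (for example a JDBC driver) to the cluster; a minimal sketch with a placeholder jar path:
$ sqoop import -libjars /path/to/mysql-connector-java.jar \
    --connect jdbc:mysql://localhost/db --username foo --table TEST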


4. Sqoop Options Files
Example:
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

The same options can be placed in an options file (import.txt) and passed to Sqoop:
$ sqoop --options-file /users/homer/work/import.txt --table TEST

The parameters in import.txt must be listed one per line, as follows:
#
# Options file for Sqoop import
#

# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo

#
# Remaining options should be specified in the command line.
#
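With this options file in place, the short form is equivalent to the full command shown at the start of this section:
$ sqoop --options-file /users/homer/work/import.txt --table TEST
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST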


Here is an example of using Sqoop to connect to a database and import its data into HDFS:
sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345

This example requires the MySQL driver jar to be on your classpath; otherwise, specify the driver class explicitly:
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...


Control arguments for sqoop-import:
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                         Specify JDBC connect
                                                string
   --connection-manager <class-name>            Specify connection manager
                                                class name
   --connection-param-file <properties-file>    Specify connection
                                                parameters file
   --driver <class-name>                        Manually specify JDBC
                                                driver class to use
   --hadoop-home <dir>                          Override $HADOOP_HOME
   --help                                       Print usage instructions
-P                                              Read password from console
   --password <password>                        Set authentication
                                                password
   --username <username>                        Set authentication
                                                username
   --verbose                                    Print more information
                                                while working

Import control arguments:
   --append                        Imports data in append mode
   --as-avrodatafile               Imports data to Avro data files
   --as-sequencefile               Imports data to SequenceFiles
   --as-textfile                   Imports data as plain text (default)
   --boundary-query <statement>    Set boundary query for retrieving max
                                   and min value of the primary key
   --columns <col,col,col...>      Columns to import from table
   --compression-codec <codec>     Compression codec to use for import
   --direct                        Use direct import fast path
   --direct-split-size <n>         Split the input stream every 'n' bytes
                                   when importing in direct mode
-e,--query <statement>             Import results of SQL 'statement'
   --fetch-size <n>                Set number 'n' of rows to fetch from
                                   the database when more rows are needed
   --inline-lob-limit <n>          Set the maximum size for an inline LOB
-m,--num-mappers <n>               Use 'n' map tasks to import in parallel
   --split-by <column-name>        Column of the table used to split work
                                   units
   --table <table-name>            Table to read
   --target-dir <dir>              HDFS plain table destination
   --warehouse-dir <dir>           HDFS parent for table destination
   --where <where clause>          WHERE clause to use during import
-z,--compress                      Enable compression

Incremental import arguments:
   --check-column <column>        Source column to check for incremental
                                  change
   --incremental <import-type>    Define an incremental import of type
                                  'append' or 'lastmodified'
   --last-value <value>           Last imported value in the incremental
                                  check column

Output line formatting arguments:
   --enclosed-by <char>               Sets a required field enclosing
                                      character
   --escaped-by <char>                Sets the escape character
   --fields-terminated-by <char>      Sets the field separator character
   --lines-terminated-by <char>       Sets the end-of-line character
   --mysql-delimiters                 Uses MySQL's default delimiter set:
                                      fields: ,  lines: \n  escaped-by: \
                                      optionally-enclosed-by: '
   --optionally-enclosed-by <char>    Sets a field enclosing character

Input parsing arguments:
   --input-enclosed-by <char>               Sets a required field encloser
   --input-escaped-by <char>                Sets the input escape
                                            character
   --input-fields-terminated-by <char>      Sets the input field separator
   --input-lines-terminated-by <char>       Sets the input end-of-line
                                            char
   --input-optionally-enclosed-by <char>    Sets a field enclosing
                                            character

Hive arguments:
   --create-hive-table                         Fail if the target hive
                                               table exists
   --hive-delims-replacement <arg>             Replace Hive record \0x01
                                               and row delimiters (\n\r)
                                               from imported string fields
                                               with user-defined string
   --hive-drop-import-delims                   Drop Hive record \0x01 and
                                               row delimiters (\n\r) from
                                               imported string fields
   --hive-home <dir>                           Override $HIVE_HOME
   --hive-import                               Import tables into Hive
                                               (Uses Hive's default
                                               delimiters if none are
                                               set.)
   --hive-overwrite                            Overwrite existing data in
                                               the Hive table
   --hive-partition-key <partition-key>        Sets the partition key to
                                               use when importing to hive
   --hive-partition-value <partition-value>    Sets the partition value to
                                               use when importing to hive
   --hive-table <table-name>                   Sets the table name to use
                                               when importing to hive
   --map-column-hive <arg>                     Override mapping for
                                               specific column to hive
                                               types.

HBase arguments:
   --column-family <family>    Sets the target column family for the
                               import
   --hbase-create-table        If specified, create missing HBase tables
   --hbase-row-key <col>       Specifies which input column to use as the
                               row key
   --hbase-table <table>       Import to <table> in HBase

Code generation arguments:
   --bindir <dir>                        Output directory for compiled
                                         objects
   --class-name <name>                   Sets the generated class name.
                                         This overrides --package-name.
                                         When combined with --jar-file,
                                         sets the input class.
   --input-null-non-string <null-str>    Input null non-string
                                         representation
   --input-null-string <null-str>        Input null string representation
   --jar-file <file>                     Disable code generation; use
                                         specified jar
   --map-column-java <arg>               Override mapping for specific
                                         columns to java types
   --null-non-string <null-str>          Null non-string representation
   --null-string <null-str>              Null string representation
   --outdir <dir>                        Output directory for generated
                                         code
   --package-name <name>                 Put auto-generated classes in
                                         this package

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


5. Using a Query Result as the Import Source

Example: --split-by splits the import into work units on a.id, and --target-dir is the destination directory into which the query results are written.
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Example: -m 1 runs the import with a single map task, so the query is executed only once and its results are imported as they are read.
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults


During the import you can also control the size of the split files and the character set used, for example with --direct-split-size, --warehouse-dir, and --default-character-set.

For example:
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1
$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...


Sqoop supports Java and Hive type mappings, so you can override how individual columns are mapped on import, for example:
Argument	 Description
--map-column-java <mapping>	 Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping>	 Override mapping from SQL to Hive type for configured columns.
$ sqoop import ... --map-column-java id=String,value=Integer
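Similarly, a minimal sketch of overriding a Hive column type with --map-column-hive (the column names here are hypothetical):
$ sqoop import ... --hive-import --map-column-hive id=STRING,value=INT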


6. Sqoop Incremental Import
Argument	 Description
--check-column (col)	 Specifies the column to be examined when determining which rows to import.
--incremental (mode)	 Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)	 Specifies the maximum value of the check column from the previous import.


With an incremental import you can import only the rows that have been appended to an existing table, or rows that have changed since the last import.
An incremental import requires the arguments listed above; a sketch of such an import is shown below.
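A minimal sketch of an append-mode incremental import, assuming an auto-increment id column and placeholder connection values:
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST \
    --check-column id --incremental append --last-value 0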

You can also choose the file format of the data imported into Hive. There are two options:
1. --as-textfile
With this option, the data stored in Hive is plain, uncompressed text.
2. -z, --compress, or --compression-codec
Any of these three forms causes the imported data in Hive to be compressed (--compression-codec additionally lets you pick the codec).
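For example, a sketch of both variants, assuming Hadoop's built-in GzipCodec for the compressed case:
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST \
    --hive-import --as-textfile
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST \
    --hive-import --compression-codec org.apache.hadoop.io.compress.GzipCodec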

7. Importing Data into Hive and HBase
Once you have worked through the first six sections, importing into Hive or HBase is straightforward: the only extra step is telling Sqoop where the imported data should go. An HBase sketch follows the Hive examples below.

Hive examples:
sqoop import --verbose --connect jdbc:mysql://10.18.102.133/tjss_opda --username tongji --password dx_tj --table opda_start_120604 --hive-import --hive-table opda_start_120604_incr  --hive-overwrite --direct

sqoop job --create opda_start_120604 -- import --connect jdbc:mysql://10.18.102.133/tjss_opda --username tongji --password dx_tj --table opda_start_120604 --hive-import --hive-table opda_start_120604_incr --check-column id --incremental append --last-value 0 -m 8 --hive-overwrite --hive-delims-replacement="\t"
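An HBase import is configured with the HBase arguments listed earlier; a minimal sketch reusing the connection values above, with a hypothetical column family name (info):
sqoop import --connect jdbc:mysql://10.18.102.133/tjss_opda --username tongji --password dx_tj \
    --table opda_start_120604 --hbase-table opda_start_120604 --column-family info \
    --hbase-row-key id --hbase-create-table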


Notes:

1. Sqoop runs on the Hadoop cluster, so do not use localhost in the JDBC URL; the same restriction applies to --direct.
2. Sqoop also has a metastore. It does not appear to be thread-safe unless the metastore service is running, and it can only use HSQLDB (no other databases are supported yet), whose reliability is questionable.
3. Passwords are stored in the metastore in plain text, so storing passwords there is not recommended.
4. Sqoop has a bug: tables whose names contain special characters currently cause problems.
5. Tables that Sqoop imports into Hive can only be TEXTFILE, although a compression format can be chosen.
6. Sqoop can import into an existing empty Hive table, but it does so with LOAD DATA, so the target table's schema is effectively ignored.
7. If --hive-overwrite is not given, a Hive import leaves its Hadoop staging directory behind, which can interfere with the next run of the same task.
8. Remember to add --hive-delims-replacement; otherwise delimiter characters inside field values easily produce corrupted rows.
9. Hive dynamic partitioning has a bug when too many partitions are created at once; you must add DISTRIBUTE BY.
10. Hive is case-insensitive; prefer all-lowercase names to avoid latent bugs.
11. Sqoop requires the current working directory to be writable at runtime (for code generation).
12. Any JDBC-compliant database can be imported, as long as its JDBC driver is available.


In addition to the Hive- and HBase-related Sqoop arguments, an import may also use these dedicated arguments:
Input parsing arguments:
   --input-enclosed-by <char>               Sets a required field encloser
   --input-escaped-by <char>                Sets the input escape
                                            character
   --input-fields-terminated-by <char>      Sets the input field separator
   --input-lines-terminated-by <char>       Sets the input end-of-line
                                            char
   --input-optionally-enclosed-by <char>    Sets a field enclosing
                                            character


You may find these arguments useful as well. Also note that if the import command does not specify a target table or directory, the imported data is written under the current user's working directory on HDFS; a sketch with an explicit destination follows.
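A minimal sketch that sets an explicit destination with --target-dir (the path is a placeholder):
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST \
    --target-dir /user/foo/TEST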


8. Importing All Tables and Their Data with Sqoop

For example (the remaining arguments are the same as those shown in sqoop help):
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp


Verify the result:
$ hadoop fs -ls
Found 4 items
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES