joerong666

Q&A from the Impala introduction blog post

Original blog post:

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

  • SONAL / OCTOBER 25, 2012 / 11:44 AM

Very excited to see Impala. The Dremel paper outlines efficient columnar storage for nested data. How does Impala achieve its speeds if data is not to be loaded into the system?

Thanks
Sonal

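Summary: The Dremel paper describes columnar storage as an efficient way to store nested data. If the data is not loaded into the system, how does Impala achieve its speed?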

To address Sonal’s question:

The performance advantage you will see with Impala will always depend on the storage format of the data, among other things. Impala tries hard to be fast on ASCII-encoded data (text files and SequenceFiles), but of course the parsing overhead will always show up as a performance penalty compared to something like ColumnIO or Trevni. Impala will also support Trevni in the GA release, as mentioned in the blog post.

Regarding data loading: we are working on background conversion into Trevni, in a way that enables a logical table to be backed by a mix of data formats. New data would show up in, say, SequenceFile format and eventually get converted into the more efficient Trevni columnar format, but all of the data would be queryable at all times, regardless of format.

Marcel

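Summary: Impala's performance advantage always depends on the storage format of the data. Impala tries to process ASCII-encoded data quickly, but the parsing overhead will always cost performance compared to ColumnIO or Trevni. Impala will support Trevni in the GA release.

On data loading: the team is working on background conversion into Trevni, so that a single logical table can be backed by a mix of formats. New data arrives in SequenceFile format and is eventually converted into the more efficient Trevni columnar format, yet all of the data remains queryable at all times, regardless of format.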
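
The mixed-format idea in Marcel's answer can be illustrated with a small sketch. This is not Impala's implementation: the format names (`text`, `columnar`), the in-memory "files", and the functions `scan_text`, `scan_columnar`, `scan_table`, and `convert_to_columnar` are all hypothetical stand-ins chosen for illustration. The point is only that a scan can dispatch on each file's format, so a background conversion changes how data is stored without changing what a query sees.

```python
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, str]


def scan_text(payload: str) -> Iterator[Row]:
    """Row-oriented text: one 'id,name' record per line; every field is parsed."""
    for line in payload.splitlines():
        id_str, name = line.split(",", 1)
        yield int(id_str), name


def scan_columnar(payload: Dict[str, list]) -> Iterator[Row]:
    """Column-oriented: values are already typed and stored per column."""
    yield from zip(payload["id"], payload["name"])


# Hypothetical registry mapping a format name to its reader.
READERS = {"text": scan_text, "columnar": scan_columnar}


def scan_table(files: List[Tuple[str, object]]) -> Iterator[Row]:
    """Scan a logical table whose files may be in different formats."""
    for fmt, payload in files:
        yield from READERS[fmt](payload)


def convert_to_columnar(payload: str) -> Dict[str, list]:
    """Stand-in for the background conversion step."""
    rows = list(scan_text(payload))
    return {"id": [r[0] for r in rows], "name": [r[1] for r in rows]}


table = [
    ("columnar", {"id": [1, 2], "name": ["a", "b"]}),  # older data, already converted
    ("text", "3,c\n4,d"),                               # newly arrived data
]
before = list(scan_table(table))

# Convert the text file in place; the logical table's contents do not change.
table[1] = ("columnar", convert_to_columnar(table[1][1]))
after = list(scan_table(table))
```

Here `before` and `after` hold the same rows, which is the property the answer describes: the data stays queryable throughout, whatever format each file is currently in.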

  • ALEX B / NOVEMBER 22, 2012 / 8:25 AM

Can you please comment on how Impala compares to Hadapt in terms of architecture? As far as I understand, in the case of Hadapt (and I could be wrong, of course) some transformation of the data to PostgreSQL is needed. That does not seem to be the case with Impala (at least in the current implementation)?

Thanks,
Alex

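Summary: How does Impala compare to Hadapt architecturally? Hadapt appears to require some transformation of the data into PostgreSQL, while Impala does not seem to need this.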

Regarding Alex’s question:

That’s correct, Impala does read data directly from HDFS and HBase. Impala also relies on Apache Hive’s metastore for the mapping of files into tables, which means you can re-use your schema definitions if you’re already querying Hadoop through Hive.

Hadapt runs a PostgreSQL instance on each data node, and appears to require some form of data movement (and duplication of data storage) between Postgres and HDFS, but for the specifics of that architecture I would recommend consulting the Hadapt website.

Marcel

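Summary: Impala reads data directly from HDFS and HBase, and it relies on the Hive metastore to map files to tables. If you already query Hadoop data through Hive, you can reuse your schema definitions.

Hadapt runs a PostgreSQL instance on each data node and appears to require some form of data movement (and data duplication) between PostgreSQL and HDFS. For the details of that architecture, consult the Hadapt website.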

Great stuff! We have tried it, and Impala shows about a 2x speedup vs. Hive for a simple query on our test dataset.

Could Marcel explain more about the main reasons that make Impala faster?
1. About columnar storage: it seems that Hive can also benefit from columnar storage compared with text files.
2. About distributed scalable aggregation algorithms: are there any details or examples of the algorithms?
3. About joins: if the dataset cannot fit into memory, how does Impala stay fast if it has to use disk?
4. About main memory as a cache for table data: is it a cache in Impala for recently accessed data?

Thanks!
Kang

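Summary: We tried Impala, and for a simple query on our test dataset it was about twice as fast as Hive.

Could Marcel explain the main reasons for Impala's speed?

  1. Columnar storage: Hive can also benefit from columnar storage compared with text files.
  2. Distributed scalable aggregation algorithms: are details and examples available?
  3. Joins: if the dataset does not fit into memory, how does Impala stay fast when it uses disk?
  4. Main memory as a cache for table data: does Impala cache recently accessed data?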

Regarding Kang’s questions:

1. Yes, the Trevni columnar storage format will be an open and general purpose storage format that will be available for any of the Hadoop processing frameworks, including Hive, MapReduce, and Pig.

However, we expect to see greater performance gains from Trevni in Impala compared to what you’d see in Hive. The reason is that in a disk-based system, Impala is often I/O-bound, and a columnar format will reduce the total I/O volume, often by a substantial amount. Hive is often CPU-bound and will therefore benefit much less from a reduction in I/O volume.

2. At the moment, Impala does a simple 2-stage aggregation: pre-aggregation is done by all executing backends, followed by a single, central merge aggregation step in the coordinator. In an upcoming release Impala will also support repartitioning aggregation, where the result of the pre-aggregation step is hash-partitioned across all executing backends, so that the total merge aggregation work is also distributed.

3. Impala currently has the limitation that the right-hand side table of a join needs to fit into the memory of every executing backend. In the GA release, this will be relaxed, so that the right-hand side table will only have to fit into the *aggregate* memory of all executing backends. Disk-based join algorithms won’t be available until after the GA release.

4. Impala does not maintain its own cache; instead, it relies on the OS buffer cache in order to keep frequently-accessed data in memory.
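
The two aggregation strategies described in answer 2 can be sketched for a `SUM ... GROUP BY` style query. This is a toy model under stated assumptions, not Impala's code: backends are plain Python lists, the function names (`pre_aggregate`, `central_merge`, `repartition_merge`) are hypothetical, and `zlib.crc32` is used only as a convenient deterministic hash for routing keys.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pre_aggregate(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Stage 1: each backend sums the values of its own rows per group key."""
    partial: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def central_merge(partials: List[Dict[str, int]]) -> Dict[str, int]:
    """Stage 2 (simple plan): one coordinator merges every partial result."""
    total: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


def repartition_merge(partials: List[Dict[str, int]],
                      num_backends: int) -> List[Dict[str, int]]:
    """Stage 2 (repartitioning plan): partial results are hash-partitioned
    by group key, so each backend merges only the keys routed to it."""
    inboxes: List[List[Tuple[str, int]]] = [[] for _ in range(num_backends)]
    for partial in partials:
        for key, value in partial.items():
            target = zlib.crc32(key.encode()) % num_backends
            inboxes[target].append((key, value))
    # Merging partial sums is itself a sum, so the same routine is reused.
    return [pre_aggregate(inbox) for inbox in inboxes]


# Rows (group key, value) as seen by three backends.
fragments = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7)],
]
partials = [pre_aggregate(rows) for rows in fragments]

centralized = central_merge(partials)
merged_per_backend = repartition_merge(partials, num_backends=3)
repartitioned = {k: v for part in merged_per_backend for k, v in part.items()}
```

Both plans produce the same totals. The difference is where the merge work happens: in the first plan a single node sees every partial result, while in the second each group key is merged on exactly one backend and the coordinator only has to concatenate the pieces.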

Marcel
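
The reasoning in answer 1 (a columnar format helps an I/O-bound engine more than a CPU-bound one) can be made concrete with a back-of-the-envelope calculation. Everything numeric here is an assumption chosen for illustration: a table of ten 8-byte columns, a query that touches two of them, and a deliberately crude timing model in which I/O and CPU overlap fully so the slower of the two determines query time. The function names are hypothetical.

```python
from typing import Dict, List


def bytes_scanned(num_rows: int, column_widths: Dict[str, int],
                  needed: List[str], columnar: bool) -> int:
    """Bytes read from disk: a row format reads every column of every row,
    a columnar format reads only the columns the query references."""
    if columnar:
        return num_rows * sum(column_widths[c] for c in needed)
    return num_rows * sum(column_widths.values())


def query_time(io_seconds: float, cpu_seconds: float) -> float:
    """Toy model: I/O and CPU overlap fully, so the slower one dominates."""
    return max(io_seconds, cpu_seconds)


widths = {f"c{i}": 8 for i in range(10)}  # assumed schema: ten 8-byte columns
rows = 1_000_000
needed = ["c0", "c1"]                     # the query touches two columns

row_bytes = bytes_scanned(rows, widths, needed, columnar=False)
col_bytes = bytes_scanned(rows, widths, needed, columnar=True)
reduction = row_bytes / col_bytes

# Assumed baseline costs on the row format, in seconds.
io_bound_speedup = query_time(10.0, 2.0) / query_time(10.0 / reduction, 2.0)
cpu_bound_speedup = query_time(10.0, 30.0) / query_time(10.0 / reduction, 30.0)
```

Under these assumptions the columnar layout cuts I/O by 5x. The I/O-bound engine gets the full 5x, while the CPU-bound engine sees no change because its CPU cost still dominates. Real systems sit between these extremes, but the direction of the effect matches the answer.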

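Summary:

  1. The Trevni columnar storage format will be an open, general-purpose format available to all Hadoop processing frameworks, including Hive, MapReduce, and Pig. However, Trevni is expected to bring larger gains to Impala than to Hive. In a disk-based system Impala is often I/O-bound, and a columnar format reduces total I/O volume, often substantially. Hive is often CPU-bound and therefore benefits much less from reduced I/O.
  2. Impala currently uses a simple two-stage aggregation: pre-aggregation runs on all executing backends, followed by a single, central merge aggregation step in the coordinator. An upcoming release will also support repartitioning aggregation, in which the pre-aggregation results are hash-partitioned across all executing backends so that the merge work is distributed as well.
  3. Impala currently requires the right-hand side table of a join to fit into the memory of every executing backend. In the GA release this will be relaxed so that the right-hand side table only has to fit into the aggregate memory of all executing backends. Disk-based join algorithms will not be available until after the GA release.
  4. Impala does not maintain its own cache. It relies on the OS buffer cache to keep frequently accessed data in memory.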
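
The join limitation in answer 3 can be illustrated by comparing two in-memory hash join plans. The thread only states the memory requirement (every backend versus the aggregate of all backends). Modeling these as a broadcast join and a hash-partitioned join is an assumption made here for illustration, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[int, str]


def hash_join(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Build a hash table on the right side, then probe it with the left side."""
    build = defaultdict(list)
    for key, right_value in right:
        build[key].append(right_value)
    return [(key, lv, rv) for key, lv in left for rv in build.get(key, [])]


def route(key: int, num_backends: int) -> int:
    """Deterministic hash used to assign a join key to a backend."""
    return zlib.crc32(str(key).encode()) % num_backends


def broadcast_join(left_fragments: List[List[Row]], right: List[Row]):
    """Every backend holds the whole right-hand table in memory."""
    out = []
    for fragment in left_fragments:
        out.extend(hash_join(fragment, right))
    return out, [len(right)] * len(left_fragments)


def partitioned_join(left_fragments: List[List[Row]], right: List[Row]):
    """Both sides are hash-partitioned on the join key, so each backend
    holds only its share of the right-hand table."""
    n = len(left_fragments)
    left_parts: List[List[Row]] = [[] for _ in range(n)]
    right_parts: List[List[Row]] = [[] for _ in range(n)]
    for fragment in left_fragments:
        for row in fragment:
            left_parts[route(row[0], n)].append(row)
    for row in right:
        right_parts[route(row[0], n)].append(row)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(hash_join(lp, rp))
    return out, [len(rp) for rp in right_parts]


left_fragments = [[(1, "l1"), (2, "l2")], [(3, "l3"), (1, "l4")]]
right = [(1, "r1"), (2, "r2"), (3, "r3"), (4, "r4")]

b_out, b_build_rows = broadcast_join(left_fragments, right)
p_out, p_build_rows = partitioned_join(left_fragments, right)
```

Both plans return the same joined rows. In the broadcast plan each backend's build side is the full right-hand table, whereas in the partitioned plan the build sides add up to one copy of it across the cluster, at the cost of also shuffling the left-hand rows.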