Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real

wbj0110


After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.

When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness on GitHub right now.

Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) The first beta drop includes support for text files and SequenceFiles; SequenceFiles can be compressed as Snappy, GZIP, and BZIP (with Snappy recommended for maximum performance). Support for additional formats including Avro, RCFile, LZO text files, and the Parquet columnar format is planned for the production drop.
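To make the kind of query described above concrete, here is a minimal sketch of a SELECT with a JOIN and aggregate functions. It is an illustration only: Python's built-in `sqlite3` module stands in for the query engine so the example runs anywhere, and the `customers` and `orders` tables and their columns are hypothetical. The SQL itself is deliberately generic; it has not been checked against the Impala beta.

```python
import sqlite3

# An in-memory SQLite database stands in for the real table store.
# This is an assumption made purely so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);
""")

# SELECT + JOIN + aggregate functions, grouped by a column of the joined table.
QUERY = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
"""

rows = conn.execute(QUERY).fetchall()
for region, order_count, total_amount in rows:
    print(region, order_count, total_amount)
```

The point of the sketch is the shape of the query, not the engine: the same style of statement is what a Hive user would already be writing.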

To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration. (See FAQ below for more details.) Note that this performance improvement has been confirmed by several large companies that have tested Impala on real-world workloads for several months now.

A high-level architectural view is below:

There are many advantages to this approach over alternative approaches for querying Hadoop data, including:

- Thanks to local processing on data nodes, network bottlenecks are avoided.
- A single, open, and unified metadata store can be utilized.
- Costly data format conversion is unnecessary and thus no overhead is incurred.
- All data is immediately query-able, with no delays for ETL.
- All hardware is utilized for Impala queries as well as for MapReduce.
- Only a single machine pool is needed to scale.

We encourage you to read the documentation for further technical details.

Finally, we’d like to answer some questions that we anticipate will be popular:

Is Impala open source?
Yes, Impala is 100% open source (Apache License). You can review the code for yourself on GitHub today.

How is Impala different from Dremel?
The first and principal difference is that Impala is open source and available for everyone to use, whereas Dremel is proprietary to Google.

Technically, Dremel achieves interactive response times over very large data sets through the use of two techniques:

- A novel columnar storage format for nested relational data (data with nested structures)
- Distributed scalable aggregation algorithms, which allow the results of a query to be computed on thousands of machines in parallel

The latter is borrowed from techniques developed for parallel DBMSs, which also inspired the creation of Impala. Unlike Dremel as described in the 2010 paper, which could only handle single-table queries, Impala already supports the full set of join operators that are one of the factors that make SQL so popular.
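The distributed aggregation idea mentioned above can be sketched in a few lines. This is a generic two-phase aggregation, not the algorithm from the Dremel paper or Impala's implementation: each node first aggregates its own local rows, and a coordinator then merges the small partial results. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Phase 1: one node computes (sum, count) per key over its local rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: (total, count) for key, (total, count) in acc.items()}


def merge_partials(partials):
    """Phase 2: a coordinator merges the per-node partial results."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {
        key: {"sum": total, "count": count, "avg": total / count}
        for key, (total, count) in merged.items()
    }


# Each inner list is the data held locally by one (hypothetical) node.
node_data = [
    [("east", 10.0), ("west", 7.5)],
    [("east", 5.0)],
    [("east", 2.5), ("west", 2.5)],
]

result = merge_partials(partial_aggregate(rows) for rows in node_data)
print(result)
```

Only the per-key partial sums and counts cross the network, which is why this pattern scales to many machines: the data shipped to the coordinator grows with the number of distinct keys, not with the number of rows.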

To realize the full performance benefits demonstrated by Dremel, Hadoop will shortly have an efficient columnar binary storage format called Parquet. Unlike Dremel, however, Impala supports a range of popular file formats. This lets users run Impala on their existing data without having to “load” or transform it. It also lets users decide whether they want to optimize for flexibility or for pure performance.

To sum it up, Impala plus Parquet will achieve the query performance described in the Dremel paper, but surpass what is described there in SQL functionality.

How much faster are Impala queries than Hive ones, really?
The precise amount of performance improvement is highly dependent on a number of factors:

- Hardware configuration: Impala is generally able to take full advantage of hardware resources and specifically generates less CPU load than Hive, which often translates into higher observed aggregate I/O bandwidth than with Hive. Impala of course cannot go faster than the hardware permits, so any hardware bottlenecks will limit the observed speedup. For purely I/O bound queries, we typically see performance gains in the range of 3-4x.
- Complexity of the query: Queries that require multiple MapReduce phases in Hive or require reduce-side joins will see a higher speedup than, say, simple single-table aggregation queries. For queries with at least one join, we have seen performance gains of 7-45x.
- Availability of main memory as a cache for table data: If the data accessed through the query comes out of the cache, the speedup will be more dramatic thanks to Impala’s superior efficiency. In those scenarios, we have seen speedups of 20x-90x over Hive even on simple aggregation queries.
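As a rough way to read the ranges above, the arithmetic can be sketched as follows. The speedup ranges are the ones quoted in this post; the 30-minute Hive runtime and the function name are hypothetical, and the result is only a back-of-the-envelope estimate, not a prediction for any particular cluster.

```python
def estimated_runtime_range(hive_seconds, min_speedup, max_speedup):
    """Return (best_case, worst_case) runtimes in seconds for a speedup range."""
    if hive_seconds < 0 or min_speedup <= 0 or max_speedup < min_speedup:
        raise ValueError("invalid runtime or speedup range")
    return hive_seconds / max_speedup, hive_seconds / min_speedup


# Speedup ranges quoted in the post: (minimum, maximum).
SPEEDUPS = {
    "io_bound": (3, 4),
    "with_join": (7, 45),
    "cached": (20, 90),
}

hive_seconds = 1800.0  # hypothetical Hive query that takes 30 minutes

estimates = {
    name: estimated_runtime_range(hive_seconds, low, high)
    for name, (low, high) in SPEEDUPS.items()
}

for name, (best, worst) in estimates.items():
    print(f"{name}: between {best:.0f}s and {worst:.0f}s")
```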

Is Impala a replacement for MapReduce or Hive – or for traditional data warehouse infrastructure, for that matter?
No. There will continue to be many viable use cases for MapReduce and Hive (for example, for long-running data transformation workloads) as well as traditional data warehouse frameworks (for example, for complex analytics on limited, structured data sets). Impala is a complement to those approaches, supporting use cases where users need to interact with very large data sets, across all data silos, to get focused result sets quickly.

Does the Impala Beta Release have any technical limitations?
As mentioned previously, supported file formats in the first beta drop include text files and SequenceFiles, with many other formats to be supported in the upcoming production release. Furthermore, currently all joins are done in a memory space no larger than that of the smallest node in the cluster; in production, joins will be done in aggregate memory. Lastly, no UDFs are possible at this time.
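The join memory limitation can be illustrated with a small sketch. It is a simplification under stated assumptions: the node memory sizes, the 40 GB join working set, and both function names are hypothetical, and the model ignores everything else that competes for memory on a node. It only contrasts the two budgets described above: the smallest node's memory in the beta versus aggregate cluster memory in production.

```python
def join_memory_budget(node_memory_gb, aggregate=False):
    """Memory available to a join under the simplified model.

    aggregate=False: limited by the smallest node (beta behaviour as described).
    aggregate=True: sum of all nodes' memory (planned production behaviour).
    """
    if not node_memory_gb:
        raise ValueError("cluster must have at least one node")
    return sum(node_memory_gb) if aggregate else min(node_memory_gb)


def join_fits(join_size_gb, node_memory_gb, aggregate=False):
    """True if the join's in-memory working set fits within the budget."""
    return join_size_gb <= join_memory_budget(node_memory_gb, aggregate)


cluster = [48, 64, 64, 32]  # hypothetical per-node memory in GB
join_size_gb = 40           # hypothetical working set of one join

fits_in_beta = join_fits(join_size_gb, cluster)
fits_in_production = join_fits(join_size_gb, cluster, aggregate=True)
print(fits_in_beta, fits_in_production)
```

In this hypothetical cluster the 32 GB node sets the beta limit, so a 40 GB join would not fit even though the cluster holds 208 GB in total.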

What are the technical requirements for the Impala Beta Release?
You will need to have CDH4.1 installed on RHEL/CentOS 6.2. We highly recommend the use of Cloudera Manager (Free or Enterprise Edition) to deploy and manage Impala because it takes care of distributed deployment and monitoring details automatically.

What is the support policy for the Impala Beta Release?
If you are an existing Cloudera customer with a bug, you may raise a Customer Support ticket and we will attempt to resolve it on a best-effort basis. If you are not an existing Cloudera customer, you may use our public JIRA instance or the impala-user mailing list, which will be monitored by Cloudera employees.

When will Impala be generally available for production use?
A production drop is planned for the first quarter of 2013. Customers may obtain commercial support in the form of a Cloudera Enterprise RTQ subscription at that time.

We hope that you take the opportunity to review the Impala source code, explore the beta release, download and install the VM, or any combination of the above. Your feedback in all cases is appreciated; we need your help to make Impala even better.

We will bring you further updates about Impala as we get closer to production availability. (Update: Read about Impala 1.0.)

Impala resources:
- Impala source code
- Impala downloads (Beta Release and VM)
- Impala documentation
- Public JIRA
- Impala mailing list
- Free Impala training (Screencast)

(Added 10/30/2012) Third-party articles about Impala:
- GigaOm: Real-time query for Hadoop democratizes access to big data analytics (Oct. 22, 2012)
- Wired: Man Busts Out of Google, Rebuilds Top-Secret Query Machine (Oct. 24, 2012)
- InformationWeek: Cloudera Debuts Real-Time Hadoop Query (Oct. 24, 2012)
- GigaOm: Cloudera Makes SQL a First-Class Citizen on Hadoop (Oct. 24, 2012)
- ZDNet: Cloudera’s Impala Brings Hadoop to SQL and BI (Oct. 25, 2012)
- Wired: Marcel Kornacker Profile (Oct. 29, 2012)
- Dr. Dobbs: Cloudera Impala – Processing Petabytes at The Speed Of Thought (Oct. 29, 2012)

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

2014-06-25 16:31