
Apache Drill Could Power Faster Through Data

 

The proposed “Drill” project could help speed up Apache Hadoop.


Actual drills aren’t very useful when it comes to data analysis. But the Drill under proposal could help speed things up.

Given the burgeoning interest in Hadoop and data analytics in general, it’s unsurprising that IT vendors and developers would turn to ways to speed up the process of sorting and gaining insights from data. Enter “Drill,” a new open-source project proposed via the Apache Software Foundation’s incubator wing.

“There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers),” read the proposal submitted for the project. “This need was identified by Google and addressed internally with a system called Dremel.”

Over the past few years, several open-source frameworks have emerged to help data analysts and IT departments with scalable batch processing. Of these, Apache Hadoop has become the favorite of many organizations needing to crunch massive amounts of data. But in the eyes of Drill’s creators, Hadoop’s design prevents it from achieving “the sub-second latency needed for interactive data analysis and exploration.”

Drill, they added, “is intended to address this need.”

Drill’s architecture centers on four components: support for a variety of query languages and programming models, including DrQL (used by Dremel and Google BigQuery), Mongo Query Language, and Plume; a low-latency distributed execution engine capable of efficiently querying petabytes of data on 10,000 servers; a layer supporting both schema-less formats such as JSON and schema-based formats such as Protocol Buffers/Dremel; and a layer supporting various data sources, with an initial focus on “Hadoop as a data source.”
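To make the idea of querying schema-less nested data more concrete, here is a minimal Python sketch. It is not Drill code and does not reflect how Drill or Dremel are implemented; the sample records and the helper names `flatten` and `select` are hypothetical, chosen only to illustrate reading a field by its path from JSON records that do not share a declared schema.

```python
import json
from typing import Any, Iterator, List, Tuple


def flatten(record: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) pairs for every leaf in a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            child = f"{prefix}.{key}" if prefix else key
            yield from flatten(value, child)
    elif isinstance(record, list):
        # Repeated fields: every element is reported under the same path.
        for item in record:
            yield from flatten(item, prefix)
    else:
        yield prefix, record


def select(records: List[Any], path: str) -> List[Any]:
    """Collect all leaf values stored at `path` across the records."""
    return [value for rec in records for p, value in flatten(rec) if p == path]


# Hypothetical sample data: the two records have different shapes,
# and no schema is declared up front.
RAW = [
    '{"user": {"name": "ann", "langs": ["en", "fr"]}, "clicks": 3}',
    '{"user": {"name": "bob"}, "clicks": 5, "referrer": "search"}',
]
RECORDS = [json.loads(line) for line in RAW]

if __name__ == "__main__":
    print(select(RECORDS, "user.name"))
    print(select(RECORDS, "user.langs"))
    print(select(RECORDS, "referrer"))
```

A field that is missing from a record simply contributes no values, which is the behavior a schema-less layer has to tolerate. A schema-based format such as Protocol Buffers would instead declare these paths ahead of time.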

Encryption on the wire is not among the project’s initial goals, though Drill is expected to support it eventually.

“Significant work” has apparently been done to identify Drill’s initial requirements and system architecture, with implementation of those four components offered as the next step. There is a growing need for tools capable of large-dataset analysis (look at the buzz around Hadoop), but Drill’s creators acknowledge that any project of this scope carries inherent risks. Vendors changing their data-analytics strategies could doom the project, although that scenario seems unlikely given the aforementioned interest.

The proposal seeks to downplay other potential dangers, including excessive reliance on salaried developers (“we are confident that the project will continue even if no salaried developers contribute to the project”) and relationships with other Apache projects (“we look forward to collaborating with those communities, as well as other Apache communities”). Initial contributors to the project include employees of MapR Technologies, Drawn to Scale, and Concurrent, with mentors from MapR Technologies, Lucid Imagination, and Nokia.

 

http://slashdot.org/topic/bi/apache-drill-could-power-faster-through-data/
