Hadoop is everywhere. For better or worse, it has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. Want to be big data, enterprise analytics, or BI compliant? You had better play well with Hadoop.
It’s therefore far from controversial to say that Hadoop is firmly planted in the enterprise as the big data standard and will likely remain entrenched for at least another decade. But, building on some previous discussion, I’m going to go out on a limb and ask, “Is the enterprise buying into a technology whose best day has already passed?”
First, there were Google File System and Google MapReduce
To study this question we need to return to Hadoop’s inspiration – Google’s MapReduce. Confronted with a data explosion, Google engineers Jeff Dean and Sanjay Ghemawat architected (and published!) two seminal systems: the Google File System (GFS) and Google MapReduce (GMR). The former was a brilliantly pragmatic solution to exabyte-scale data management using commodity hardware. The latter was an equally brilliant implementation of a long-standing design pattern applied to massively parallel processing of said data on said commodity machines.
GMR’s brilliance was to make big data processing approachable to Google’s typical user/developer and to make it fast and fault tolerant. Simply put, it boiled data processing at scale down to the bare essentials and took care of everything else. GFS and GMR became the core of the processing engine used to crawl, analyze, and rank web pages into the giant inverted index that we all use daily at google.com. This was clearly a major advantage for Google.
Enter reverse engineering in the open source world, and, voila, Apache Hadoop — composed of the Hadoop Distributed File System and Hadoop MapReduce — was born in the image of GFS and GMR. Yes, Hadoop is developing into an ecosystem of projects that touch nearly all parts of data management and processing. But, at its core, it is a MapReduce system. Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you.
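To make that shape concrete, here is a minimal word-count sketch in Python written in the style of Hadoop Streaming, where the mapper and reducer are ordinary programs that exchange tab-separated key/value lines and the framework performs the sort/shuffle between them. The file name and invocation in the docstring are illustrative, not a prescribed layout.

```python
#!/usr/bin/env python
"""Word count in the classic map/reduce shape (Hadoop Streaming style).

Local simulation of the two phases:
    cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
Under Hadoop Streaming the framework handles the sort/shuffle between steps.
"""
import sys


def do_map(stream):
    # Map: emit one "<word>\t1" record per word seen in the input.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")


def do_reduce(stream):
    # Reduce: input arrives sorted by key, so identical words are adjacent
    # and can be summed with a single running counter.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    {"map": do_map, "reduce": do_reduce}[sys.argv[1]](sys.stdin)
```

Everything beyond these two small functions — partitioning, scheduling, retries, data locality — is what Hadoop (like GMR before it) takes care of for you.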
Then Google evolved. Can Hadoop catch up?
Most interesting to me, however, is that GMR no longer holds such prominence in the Google stack. Just as the enterprise is locking into MapReduce, Google seems to be moving past it. In fact, many of the technologies I’m going to discuss below aren’t even new; they date back to the second half of the last decade, mere years after the seminal GMR paper was in print.
Here are technologies that I hope will ultimately seed the post-Hadoop era. While many Apache projects and commercial Hadoop distributions are actively trying to address some of the issues below via technologies and features such as HBase, Hive and Next-Generation MapReduce (aka YARN), it is my opinion that it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google’s technology. (A more technical exposition with published benchmarks is available at http://www.slideshare.net/mlmilleratmit/gluecon-miller-horizon.)
Percolator for incremental indexing and analysis of frequently changing datasets. Hadoop is a big machine. Once you get it up to speed, it’s great at crunching your data. Get the disks spinning forward as fast as you can. However, each time you want to analyze the data (say, after adding, modifying, or deleting data) you have to stream over the entire dataset. If your dataset is always growing, this means your analysis time also grows without bound.
So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, “[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!
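As a toy illustration of the difference in access pattern (not of Percolator’s actual API), the sketch below contrasts a full rescan, whose cost grows with the whole corpus, with an observer-style indexer whose work is proportional only to the documents that changed. The in-memory dict stands in for Percolator’s Bigtable-backed state, and all names here are invented for the example.

```python
"""Toy contrast: batch reprocessing vs. Percolator-style incremental updates.

Percolator keeps its state in Bigtable and runs 'observers' when a column
changes; this sketch only illustrates the access pattern, not Google's API.
"""


def full_rescan(corpus):
    # MapReduce-style: every run touches every document, so cost grows
    # with the total size of the corpus, not with the size of the change.
    index = {}
    for doc_id, text in corpus.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


class IncrementalIndexer:
    """Percolator-style: state persists between updates, and each change
    triggers work proportional only to the changed document."""

    def __init__(self):
        self.index = {}       # word -> set of doc_ids (the secondary index)
        self.doc_words = {}   # doc_id -> words currently indexed for it

    def on_document_changed(self, doc_id, text):
        # Remove the document's old postings, then insert the new ones.
        for word in self.doc_words.get(doc_id, set()):
            self.index[word].discard(doc_id)
        words = set(text.split())
        for word in words:
            self.index.setdefault(word, set()).add(doc_id)
        self.doc_words[doc_id] = words


# Usage sketch:
#   indexer = IncrementalIndexer()
#   indexer.on_document_changed("doc-1", "hadoop is everywhere")
#   indexer.on_document_changed("doc-1", "percolator updates incrementally")
```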
Coming from the Large Hadron Collider (an ever-growing big data corpus), this topic is near and dear to my heart. Some datasets simply never stop growing. It is why we baked a similar approach deep into the Cloudant data layer service, it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.
Dremel for ad hoc analytics. Google and the Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall through Pig and Hive, many interface layers have been built. Yet, for all of the SQL-like familiarity, they ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration.
In stark contrast, many BI/analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Not only is writing map and reduce workflows prohibitive for many analysts, but waiting minutes for jobs to start and hours for workflows to complete is not conducive to the interactive experience. Therefore, Google invented Dremel (now exposed as the BigQuery product) as a purpose-built tool to allow analysts to scan over petabytes of data in seconds to answer ad hoc queries and, presumably, power compelling visualizations.
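The contrast in developer effort is easy to see: an ad hoc question is a single declarative statement in a Dremel/BigQuery-style SQL dialect, versus an authored, scheduled map/shuffle/reduce workflow. In the sketch below the table and column names are invented, and Python’s built-in sqlite3 stands in for the columnar engine purely so the snippet runs locally; Dremel would execute a query of this shape over petabytes in seconds.

```python
"""An ad hoc aggregation expressed as one declarative query.

sqlite3 is only a local stand-in for a Dremel/BigQuery-class engine;
the schema and data are invented for illustration.
"""
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pageviews (url TEXT, referrer_domain TEXT, latency_ms REAL)"
)
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [
        ("/a", "example.com", 120.0),
        ("/b", "example.com", 95.0),
        ("/a", "other.org", 200.0),
    ],
)

# The entire analysis is one statement -- no job graph to author, schedule,
# or wait on before the first results appear.
query = """
    SELECT referrer_domain,
           COUNT(*)        AS visits,
           AVG(latency_ms) AS avg_latency_ms
    FROM pageviews
    GROUP BY referrer_domain
    ORDER BY visits DESC
    LIMIT 10
"""
for row in conn.execute(query):
    print(row)
```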
Google’s Dremel paper says it is “capable of running aggregation queries over trillions of rows in seconds,” and the same paper notes that running identical queries in standard MapReduce is approximately 100 times slower than in Dremel. Most impressive, however, is real world data from production systems at Google, where the vast majority of Dremel queries complete in less than 10 seconds, a time well below the typical latencies of even beginning execution of a MapReduce workflow and its associated jobs.
Interestingly, I’m not aware of any compelling open source alternatives to Dremel at the time of this writing and consider this a fantastic BI/analytics opportunity.
Pregel for analyzing graph data. Google MapReduce was purpose-built for crawling and analyzing the world’s largest graph data structure – the internet. However, certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other graph data structures. For example, calculation of the single-source shortest path (SSSP) through a graph requires copying the graph forward to future MapReduce passes, an amazingly inefficient approach and simply untenable at scale.
Therefore, Google built Pregel, a large bulk synchronous processing application for petabyte-scale graph processing on distributed commodity machines. The results are impressive. In contrast to Hadoop, which often causes exponential data amplification in graph processing, Pregel is able to naturally and efficiently execute graph algorithms such as SSSP or PageRank in dramatically shorter time and with significantly less complicated code. Most stunning is the published data demonstrating processing on billions of nodes with trillions of edges in mere minutes, with a near linear scaling of execution time with graph size.
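For a feel of the vertex-centric model, here is a toy single-process simulation of Pregel-style supersteps computing SSSP: in each superstep a vertex consumes incoming distance messages, updates its value if a shorter path arrived, and messages its neighbors, and computation halts when no messages remain. This is a sketch of the model, not Pregel’s or Giraph’s actual API, and the example graph is invented; the real systems partition vertices across workers and exchange messages over the network between supersteps.

```python
"""Toy superstep loop for single-source shortest paths (SSSP) in the
vertex-centric ("think like a vertex") model popularized by Pregel."""
import math

# Weighted, directed adjacency lists (invented example graph).
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0)],
    "c": [],
}


def sssp(graph, source):
    distance = {v: math.inf for v in graph}
    messages = {source: [0.0]}           # superstep 0: the source hears "0"

    while messages:                       # halt when no vertex receives mail
        next_messages = {}
        for vertex, incoming in messages.items():
            best = min(incoming)
            if best < distance[vertex]:   # shorter path found: update, notify
                distance[vertex] = best
                for neighbor, weight in graph[vertex]:
                    next_messages.setdefault(neighbor, []).append(best + weight)
        messages = next_messages
    return distance


print(sssp(graph, "a"))  # {'a': 0.0, 'b': 1.0, 'c': 3.0}
```

The state of the graph stays resident across supersteps, which is exactly what iterated MapReduce passes cannot offer without copying the graph forward on every pass.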
At the time of writing, the only viable option in the open source world is Giraph, an early Apache incubator project that leverages HDFS and Zookeeper. There’s another project called Golden Orb available on GitHub.
In summary, Hadoop is an incredible tool for large-scale data processing on clusters of commodity hardware. But if you’re trying to process dynamic datasets, run ad hoc analytics, or analyze graph data structures, Google’s own actions clearly demonstrate better alternatives to the MapReduce paradigm. Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data. I would be shocked if they don’t have a similar impact on IT as Google’s original big three of GFS, GMR, and BigTable have had.
Mike Miller (@mlmilleratmit) is chief scientist and co-founder at Cloudant, and Affiliate Professor of Particle Physics at University of Washington.
Feature image courtesy of Shutterstock user Jason Prince; evolution of the wheel image courtesy of Shutterstock user James Steidl.