zhuzhiguosnail
What Factors Justify the Use of Apache Hadoop?


Relational database authors and advocates have two criticisms of Hadoop: first, that most users have little need for Big Data; second, that MapReduce is more complex than traditional SQL queries.

Both of these criticisms are valid.

In a post entitled “Terabytes is not big data, petabytes is,” Henrik Ingo argued that the gigabytes and terabytes I referenced as Big Data did not justify that term. He is correct. Further, it is true that the number of enterprises worldwide with petabyte scale data management challenges is limited.

MapReduce, for its part, is in fact challenging. Challenging enough that two separate projects, Hive and Pig, add higher-level query languages (the SQL-like HiveQL and the dataflow-oriented Pig Latin) as a complement to the core Hadoop MapReduce functionality. Besides being more accessible, SQL is a far more common skill: people who know it are an order of magnitude easier to find.
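To make the complexity gap concrete, here is a minimal sketch of the MapReduce programming model in plain Python. It is not Hadoop code and uses no Hadoop API; the function names and the sample rows are hypothetical, chosen only to show the map, shuffle and reduce steps that a single SQL `GROUP BY` hides.

```python
from collections import defaultdict

# Roughly what this SQL would express in one line (table name is hypothetical):
#   SELECT language, COUNT(*) FROM projects GROUP BY language;

def map_phase(records):
    """Emit one (key, 1) pair per input record, as a mapper would."""
    for language, _project in records:
        yield language, 1

def shuffle_phase(pairs):
    """Group emitted values by key; a framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical sample data: (language, project) rows.
records = [("java", "a"), ("python", "b"), ("java", "c")]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

Even in this toy form, the programmer has to think in three phases and in key/value pairs, which is the burden that front ends such as Hive and Pig aim to remove.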

Hadoop supporters, meanwhile, counter both of those concerns.

It was Hadoop sponsor Cloudera, in fact, that originally coined the term “Medium Data” as an acknowledgement that data complexity was not purely a function of volume. As Bradford Cross put it:

Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times…The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.

Big Data, like NoSQL, has become a liability as a term in most contexts. Setting aside the lack of a consistent definition, the term is of little utility because it is single-dimensional. Larger datasets do present unique computational challenges, but the structure, workload, accessibility and even location of the data may prove equally challenging.

We use Hadoop at RedMonk, for example, to attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
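The idea of querying files as they are, rather than loading them into tables first, can be illustrated with a small Python sketch. This is not how Hadoop does it internally and it is not RedMonk's actual pipeline; the sample data, field names and helper functions are all assumptions made up for the example. The point is only that each format is parsed at read time into the same record shape, and the query runs over the combined stream.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import chain

# Hypothetical raw inputs, standing in for files on disk.
CSV_TEXT = "name,stars\nalpha,10\nbeta,3\n"
XML_TEXT = "<repos><repo name='gamma' stars='7'/><repo name='delta' stars='1'/></repos>"

def read_csv(text):
    """Parse CSV rows into dict records at read time."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"name": row["name"], "stars": int(row["stars"])}

def read_xml(text):
    """Parse XML elements into the same record shape at read time."""
    for node in ET.fromstring(text).iter("repo"):
        yield {"name": node.get("name"), "stars": int(node.get("stars"))}

def popular(records, minimum):
    """A simple 'query': names of records with at least `minimum` stars."""
    return [r["name"] for r in records if r["stars"] >= minimum]

# No intermediate database: both sources are read and queried in one pass.
result = popular(chain(read_csv(CSV_TEXT), read_xml(XML_TEXT)), 5)
print(result)
```

The structure is imposed when the data is read, not when it is stored, so adding a new source format means writing one more reader rather than designing a schema and a load job.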

There are a variety of options for data mining at the scale we practice it. From the basic grep to the Perl CPAN modules Henrik points to, there are many tools that would provide us with similar capabilities. Why Hadoop? Because the ecosystem is growing, the documentation is generally excellent, our datasets are unstructured and, yes, because of its ability to attack Big Data. While our datasets, at least individually, do not constitute Big Data, they are growing rapidly.

Nor have we had to learn MapReduce. The Hadoop ecosystem at present is rich enough already that we have a variety of front end options available, from visual spreadsheet metaphors (Big Sheets) to SQL-style queries (Hive) with a web UI (Beeswax). No Java necessary.

Brian Aker’s comparison of MapReduce to an SUV in Henrik’s piece is apt and, curiously, it is apt whether you’re a supporter of Hadoop or not. Brian’s obviously correct that a majority of users will use a minority of its capabilities. Much like SUVs and their owners.

The overkill of an SUV comes with obvious costs in fuel and size; the downside to Hadoop usage is less apparent. Its single-node performance is merely adequate, and the front ends are immature relative to the tooling available in the relational database world, but the build-out around the core is improving by the day.

When is Hadoop justified? For petabyte workloads, certainly. But the versatility of the tool makes it appropriate for a variety of workloads beyond so-called big data. It’s not going to replace your database, but your database isn’t likely to replace Hadoop either.

Different tools for different jobs, as ever.

Disclosure: Cloudera is a RedMonk customer.

 
