`

Hadoop vs. Spark: The New Age of Big Data

阅读更多
Hadoop vs. Spark: The New Age of Big Data

Posted February 5, 2016
By Ken Hess


In the question of Hadoop vs. Spark, the most accurate view is that designers intended Hadoop and Spark to work together on the same team.


A direct comparison of Hadoop and Spark is difficult because they do many of the same things, but are also non-overlapping in some areas.

For example, Spark has no file management and therefor must rely on Hadoop’s Distributed File System (HDFS) or some other solution. It is wiser to compare Hadoop MapReduce to Spark, because they’re more comparable as data processing engines.

As data science has matured over the past few years, so has the need for a different approach to data and its “bigness.” There are business applications where Hadoop outperforms the newcomer Spark, but Spark has its place in the big data space because of its speed and its ease of use. This analysis examines a common set of attributes for each platform including performance, fault tolerance, cost, ease of use, data processing, compatibility, and security.

The most important thing to remember about Hadoop and Spark is that their use is not an either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other and that makes their pairing an extremely powerful solution for a variety of big data applications.


Hadoop Defined

Hadoop is an Apache.org project that is a software library and a framework that allows for distributed processing of large data sets (big data) across computer clusters using simple programming models. Hadoop can scale from single computer systems up to thousands of commodity systems that offer local storage and compute power. Hadoop, in essence, is the ubiquitous 800-lb big data gorilla in the big data analytics space.

Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:

· Hadoop Common

· Hadoop Distributed File System (HDFS)

· Hadoop YARN

· Hadoop MapReduce

Although the above four modules comprise Hadoop’s core, there are several other modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend Hadoop’s power and reach into big data applications and large data set processing.

Many companies that use big data sets and analytics use Hadoop. It has become the de facto standard in big data applications. Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. The result of the desire to crawl and search the web was Hadoop’s HDFS and its distributed processing engine, MapReduce.

Hadoop is useful to companies when data sets become so large or so complex that their current solutions cannot effectively process the information in what the data users consider being a reasonable amount of time.

MapReduce is an excellent text processing engine and rightly so since crawling and searching the web (its first job) are both text-based tasks.


Spark Defined

The Apache Spark developers bill it as “a fast and general engine for large-scale data processing.” By comparison, and sticking with the analogy, if Hadoop’s Big Data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.

Although critics of Spark’s in-memory processing admit that Spark is very fast (Up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it runs up to ten times faster on disk. Spark can also perform batch processing, however, it really excels at streaming workloads, interactive queries, and machine-based learning.

Spark’s big claim to fame is its real-time data processing capability as compared to MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules. In fact, on Hadoop’s project page, Spark is listed as a module.

Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run as a Hadoop module and as a standalone solution makes it tricky to directly compare and contrast. However, as time goes on, some big data scientists expect Spark to diverge and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.

Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its own distributed filesystem, but can use HDFS.

Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which is covered in more detail under the Fault Tolerance section.


Performance

There’s no lack of information on the Internet about how fast Spark is compared to MapReduce. The problem with comparing the two is that they perform processing differently, which is covered in the Data Processing section. The reason that Spark is so fast is that it processes everything in memory. Yes, it can also use disk for data that doesn’t all fit into memory.

Spark’s in-memory processing delivers near real-time analytics for data from marketing campaigns, machine learning, Internet of Things sensors, log monitoring, security analytics, and social media sites. MapReduce alternatively uses batch processing and was really never built for blinding speed. It was originally setup to continuously gather information from websites and there were no requirements for this data in or near real-time.


Ease of Use

Spark is well known for its performance, but it’s also somewhat well known for its ease of use in that it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL. Spark SQL is very similar to SQL 92, so there’s almost no learning curve required in order to use it.

Spark also has an interactive mode so that developers and users alike can have immediate feedback for queries and other actions. MapReduce has no interactive mode, but add-ons such as Hive and Pig make working with MapReduce a little easier for adopters.


Costs

Both MapReduce and Spark are Apache projects, which means that they’re open source and free software products. While there’s no cost for the software, there are costs associated with running either platform in personnel and in hardware. Both products are designed to run on commodity hardware, such as low cost, so-called white box server systems.

MapReduce and Spark run on the same hardware, so where’s the cost differences between the two solutions? MapReduce uses standard amounts of memory because its processing is disk-based, so a company will have to purchase faster disks and a lot of disk space to run MapReduce. MapReduce also requires more systems to distribute the disk I/O over multiple systems.

Spark requires a lot of memory, but can deal with a standard amount of disk that runs at standard speeds. Some users have complained about temporary files and their cleanup. Typically these temporary files are kept for seven days to speed up any processing on the same data sets. Disk space is a relatively inexpensive commodity and since Spark does not use disk I/O for processing, the disk space used can be leveraged SAN or NAS.

It is true, however that Spark systems cost more because of the large amounts of RAM required to run everything in memory. But what’s also true is that Spark’s technology reduces the number of required systems. So, you have significantly fewer systems that cost more. There’s probably a point at which Spark actually reduces costs per unit of computation even with the additional RAM requirement.

To illustrate, “Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.” This feat won Spark the 2014 Daytona GraySort Benchmark.


Compatibility

MapReduce and Spark are compatible with each other and Spark shares all MapReduce’s compatibilities for data sources, file formats, and business intelligence tools via JDBC and ODBC.


Data Processing

MapReduce is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster and so on. Spark performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operation on the data, and then writes it back to the cluster.

Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.


Fault Tolerance

For fault tolerance, MapReduce and Spark resolve the problem from two different directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a heartbeat is missed then the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This method is effective in providing fault tolerance, however it can significantly increase the completion times for operations that have even a single failure.

Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.

An RDD possesses five main properties:

·  A list of partitions

·  A function for computing each split

·  A list of dependencies on other RDDs

·  Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

·  Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

RDDs can be persistent in order to cache a dataset in memory across operations. This allows future actions to be much faster, by as much as ten times. Spark’s cache is fault-tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by using the original transformations.


Scalability

By definition, both MapReduce and Spark are scalable using the HDFS. So how big can a Hadoop cluster grow?

Yahoo reportedly has a 42,000 node Hadoop cluster, so perhaps the sky really is the limit. The largest known Spark cluster is 8,000 nodes, but as big data grows, it’s expected that cluster sizes will increase to maintain throughput expectations.


Security

Hadoop supports Kerberos authentication, which is somewhat painful to manage. However, third party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication. Those same third party vendors also offer data encrypt for in-flight and data at rest.

Hadoop’s Distributed File System supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions.

Spark’s security is a bit sparse by currently only supporting authentication via shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN giving it the capability of using Kerberos authentication.



Hadoop vs. Spark Summary

Upon first glance, it seems that using Spark would be the default choice for any big data application. However, that’s not the case. MapReduce has made inroads into the big data market for businesses that need huge datasets brought under control by commodity systems. Spark’s speed, agility, and relative ease of use are perfect complements to MapReduce’s low cost of operation.

The truth is that Spark and MapReduce have a symbiotic relationship with each other. Hadoop provides features that Spark does not possess, such as a distributed file system and Spark provides real-time, in-memory processing for those data sets that require it. The perfect big data scenario is exactly as the designers intended—for Hadoop and Spark to work together on the same team.

Photo courtesy of Shutterstock.




Tags: Hadoop, big data, Spark



1 Comments

By Matt Pouttu-Clarke   February 06 2016 19:35 PST

When speaking about Spark I am shocked that the severe resourcing issues created by Scala aren't mentioned. I have worked with Spark and seen companies literally have to acquire other companies just to get competent Scala resources. Apache Beam from Google was recently open sourced (written in Java by Google full time staff). Shouldn't we let the hype go with Spark and Scala and start talking about Apache Beam instead? Please see my blog for coverage from someone who has actually been there and done that with Spark: https://mpouttuclarke.wordpress.com/2016/01/04/why-i-tried-apache-spark-and-moved-on/ Thanks, Matt









-

Reference:
Hadoop vs. Spark: The New Age of Big Data
http://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html


-
分享到:
评论

相关推荐

    Spark: The Definitive Guide: Big Data Processing Made Simple 英文.pdf版

    《Spark: The Definitive Guide: Big Data Processing Made Simple》是大数据处理领域的经典著作,由Databricks的创始人之一Michael Armbrust等专家撰写。这本书深入浅出地介绍了Apache Spark的核心概念、架构以及...

    hadoop1.0 Failed to set permissions of path 解决方案

    ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-admin \mapred\local\ttprivate to 0700 at org.apache...

    spark-3.1.3-bin-hadoop3.2.tgz

    在这个特定的压缩包"spark-3.1.3-bin-hadoop3.2.tgz"中,我们得到了Spark的3.1.3版本,它已经预编译为与Hadoop 3.2兼容。这个版本的Spark不仅提供了源码,还包含了预编译的二进制文件,使得在Linux环境下快速部署和...

    spark-3.1.2.tgz & spark-3.1.2-bin-hadoop2.7.tgz.rar

    Spark-3.1.2.tgz和Spark-3.1.2-bin-hadoop2.7.tgz是两个不同格式的Spark发行版,分别以tar.gz和rar压缩格式提供。 1. Spark核心概念: - RDD(弹性分布式数据集):Spark的基础数据结构,是不可变、分区的数据集合...

    hadoop java.lang.UnsatisfiedLinkError

    解决方案:Exceptionin thread "main" java.lang.UnsatisfiedLinkError:org.apache.hadoop.util.NativeCrc32.nativeCo

    spark-3.2.1-bin-hadoop2.7.tgz

    这个名为"spark-3.2.1-bin-hadoop2.7.tgz"的压缩包是Spark的一个特定版本,即3.2.1,与Hadoop 2.7版本兼容。在Linux环境下,这样的打包方式方便用户下载、安装和运行Spark。 Spark的核心设计理念是快速数据处理,...

    spark-3.3.3-bin-hadoop3.tgz

    Spark 3.3.3是Apache Spark的一个重要版本,它是一个快速、通用且可扩展的大数据处理框架。这个版本特别针对Hadoop 3.x进行了优化,使得它能够充分利用Hadoop生态系统中的新特性和性能改进。在本文中,我们将深入...

    Hadoop from the beginning: The basics

    The author has explored every component of Hadoop. Prior to that, the author helps you understand how to setup Hadoop on your Linux platform. The Hadoop HDFS has been explored in detail. You will ...

    spark-assembly-1.5.2-hadoop2.6.0.jar

    《Spark编程核心组件:spark-assembly-1.5.2-hadoop2.6.0.jar详解》 在大数据处理领域,Spark以其高效、易用和灵活性脱颖而出,成为了许多开发者的首选框架。Spark-assembly-1.5.2-hadoop2.6.0.jar是Spark中的一个...

    winutils.exe:解决hadoop在windows运行出现的bug

    如果出现如下bug:“Could not locate executable null\bin\winutils.exe in the Hadoop binaries”,则下载该文件,放入hadoop的bin文件夹下,并设置环境变量HADOOP_HOME:F:\hadoop2.2.0即可。

    spark-3.4.1-bin-hadoop3.tgz - Spark 3.4.1 安装包(内置了Hadoop 3)

    文件名: spark-3.4.1-bin-hadoop3.tgz 这是 Apache Spark 3.4.1 版本的二进制文件,专为与 Hadoop 3 配合使用而设计。Spark 是一种快速、通用的集群计算系统,用于大规模数据处理。这个文件包含了所有必要的组件,...

    外网无法访问HDFS org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block

    报错 org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block 2、百度结果 参考 https://blog.csdn.net/xiaozhaoshigedasb/article/details/88999595  防火墙记得关掉; 查看DataNode是否启动;...

    spark-3.0.0-bin-hadoop2.7.tgz

    Spark-3.0.0-bin-hadoop2.7.tgz 是Spark 3.0.0版本的预编译二进制包,其中包含了针对Hadoop 2.7版本的兼容性构建。这个版本的发布对于数据科学家和大数据工程师来说至关重要,因为它提供了许多性能优化和新功能。 1...

    spark-2.3.4-bin-hadoop2.7.tgz

    在本案例中,我们关注的是Spark的2.3.4版本,它预编译为与Hadoop 2.7兼容的版本,打包成"spark-2.3.4-bin-hadoop2.7.tgz"的压缩文件。这个压缩包包含了运行Spark所需的所有组件,包括Java库、Python库(pyspark)、...

    spark-3.1.2-bin-hadoop3.2.tgz

    Spark 3.1.2是Apache Spark的一个重要版本,它为大数据处理提供了高效、可扩展的框架。这个版本是针对Scala 2.12编译的,并且与Hadoop 3.2兼容,这意味着它可以充分利用Hadoop生态系统的最新功能。在Linux环境下,...

    spark-3.2.0-bin-hadoop3.2.tgz

    这个压缩包"spark-3.2.0-bin-hadoop3.2.tgz"包含了Spark 3.2.0版本的二进制文件,以及针对Hadoop 3.2的兼容构建。 Spark的核心组件包括:Spark Core、Spark SQL、Spark Streaming、MLlib(机器学习库)和GraphX(图...

    spark-2.4.7-bin-hadoop2.7.tgz

    Spark 2.4.7是Apache Spark的一个稳定版本,它为大数据处理提供了高效、易用且可扩展的框架。这个版本兼容Hadoop 2.7,这意味着它可以在使用Hadoop 2.7作为数据存储和资源管理的环境中无缝运行。Spark的核心特性包括...

    spark-2.1.1-bin-hadoop2.7.tgz.7z

    这个特定的压缩包"spark-2.1.1-bin-hadoop2.7.tgz.7z"是为Linux系统设计的,它包含了Spark 2.1.1版本,并且已经与Hadoop 2.7.2版本进行了预编译集成,这意味着它可以无缝地与Hadoop生态系统交互。 Hadoop 2.7.2是一...

    spark-2.4.0-bin-hadoop2.7.tgz

    与hadoop2.7版本的集成,意味着Spark可以很好地兼容Hadoop生态系统,包括HDFS(Hadoop分布式文件系统)和YARN(资源调度器)。 在"spark-2.4.0-bin-hadoop2.7.tgz"这个压缩包中,主要包含以下几个部分: 1. **...

Global site tag (gtag.js) - Google Analytics