
APACHE SPARK: RDD, DATAFRAME OR DATASET?

 

Source: http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/

 


There Are Now 3 Apache Spark APIs. Here’s How to Choose the Right One

See Apache Spark 2.0 API Improvements: RDD, DataFrame, DataSet and SQL here.

Apache Spark is evolving at a rapid pace, including changes and additions to core APIs. One of the most disruptive areas of change is around the representation of data sets. Spark 1.0 used the RDD API, but in the past twelve months two new, alternative and incompatible APIs have been introduced. Spark 1.3 introduced the radically different DataFrame API, and the recently released Spark 1.6 introduces a preview of the new Dataset API.

Many existing Spark developers will be wondering whether to jump from RDDs directly to the Dataset API, or whether to first move to the DataFrame API. Newcomers to Spark will have to choose which API to start learning with.

This article provides an overview of each of these APIs and outlines the strengths and weaknesses of each one. A companion GitHub repository provides working examples that are a good starting point for experimentation with the approaches outlined in this article.

RDD API

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release. This interface and its Java equivalent, JavaRDD, will be familiar to any developers who have worked through the standard Spark tutorials. From a developer’s perspective, an RDD is simply a set of Java or Scala objects representing data.

The RDD API provides many transformation methods, such as map(), filter(), and reduce(), for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods only define the operations to be performed; the transformations are not executed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

Example of RDD transformations and actions

Scala:
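
A minimal sketch of the pattern, assuming sc is an existing SparkContext (the variable names and input file are illustrative; the companion repository contains the full working examples):

// transformations only describe the computation; nothing is executed yet
val lines = sc.textFile("people.txt")
val lineLengths = lines.map(line => line.length)
val longLineLengths = lineLengths.filter(length => length > 80)

// the action triggers execution of the whole pipeline
val result = longLineLengths.collect()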

 

 

Java:
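
The same pipeline sketched in Java, where sc is a JavaSparkContext and Java 8 lambdas provide the transformation functions:

// transformations only describe the computation; nothing is executed yet
JavaRDD<String> lines = sc.textFile("people.txt");
JavaRDD<Integer> lineLengths = lines.map(line -> line.length());
JavaRDD<Integer> longLineLengths = lineLengths.filter(length -> length > 80);

// the action triggers execution of the whole pipeline
List<Integer> result = longLineLengths.collect();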

 

The main advantage of RDDs is that they are simple and well understood because they deal with concrete classes, providing a familiar object-oriented programming style with compile-time type-safety. For example, given an RDD containing instances of Person, we can filter by age by referencing the age attribute of each Person object:

Example: Filter by attribute with RDD

Scala:
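
A minimal sketch, assuming a case class Person(name: String, age: Int) and an existing RDD[Person] named personRdd:

// person.age is a field of the concrete Person class, so this is checked at compile time
val adults = personRdd.filter(person => person.age > 21)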

 

 

Java:
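
The Java equivalent, assuming a Person bean with a getAge() accessor and a JavaRDD<Person> named personRdd:

// compile-time checked access to the Person class
JavaRDD<Person> adults = personRdd.filter(person -> person.getAge() > 21);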

 

The main disadvantage of RDDs is that they don’t perform particularly well. Whenever Spark needs to distribute the data within the cluster, or write the data to disk, it does so using Java serialization by default (although it is possible to use Kryo as a faster alternative in most cases). The overhead of serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes (each serialized object contains the class structure as well as the values). There is also the overhead of garbage collection that results from creating and destroying individual objects.
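
For reference, switching from Java serialization to Kryo is a configuration change; a minimal sketch, assuming the Person class from the example above (registering classes is optional, but it avoids writing fully-qualified class names into every serialized object):

val conf = new SparkConf()
  .setAppName("rdd-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Person]))
val sc = new SparkContext(conf)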

DataFrame API

Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative, which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and pass only data between nodes, in a much more efficient way than using Java serialization. There are also advantages when performing computations in a single process, as Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection costs associated with constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialization to encode the data.

The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans, but not natural for the majority of developers. The query plan can be built from SQL expressions in strings or from a more functional approach using a fluent-style API.

Example: Filter by attribute with DataFrame

Note that these examples have the same syntax in both Java and Scala.

SQL style:
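
A minimal sketch, assuming df is a DataFrame with an age column:

// the predicate is an opaque string as far as the compiler is concerned
df.filter("age > 21")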

 

 

Expression builder style:
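
The same filter expressed with the fluent API:

// col("age") is resolved against the schema at runtime, not at compile time
df.filter(df.col("age").gt(21))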

 

Because the code refers to data attributes by name, it is not possible for the compiler to catch any errors. If attribute names are incorrect, the error will only be detected at runtime, when the query plan is created.

Another downside of the DataFrame API is that it is very Scala-centric and, while it does support Java, the support is limited. For example, when creating a DataFrame from an existing RDD of Java objects, Spark’s Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.
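
To illustrate the case-class route, a minimal Scala sketch (names are illustrative, and sqlContext is an existing SQLContext):

// a case class implements scala.Product, so Catalyst can infer the schema from its fields
case class Person(name: String, age: Int)

import sqlContext.implicits._
val peopleDF = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 17))).toDF()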

Dataset API

The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type-safety of the RDD API, but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.

Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant. In writing the examples to accompany this article, we ran into errors when trying to create a Dataset in Java from a list of Java objects that were not fully bean-compliant.


Example: Creating Dataset from a list of objects

Scala:
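
A minimal sketch, assuming the same Person case class and an existing SQLContext:

import sqlContext.implicits._

// the encoder for Person is derived implicitly from the case class
val ds = sqlContext.createDataset(Seq(Person("Ann", 30), Person("Bob", 17)))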

 

 

Java:
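
A Java sketch of the same idea; as noted above, Person must be fully bean-compliant (a public no-argument constructor plus getters and setters), and the encoder is passed explicitly:

Person person = new Person();
person.setName("Ann");
person.setAge(30);

// Encoders.bean derives an encoder from the bean's properties
Dataset<Person> ds = sqlContext.createDataset(Arrays.asList(person), Encoders.bean(Person.class));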

 

Transformations with the Dataset API look very much like the RDD API and deal with the Person class rather than an abstraction of a row.

Example: Filter by attribute with Dataset

Scala:
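
A minimal sketch, continuing with the Person Dataset from the previous example:

// reads like the RDD version, but builds a query plan behind the scenes
val adults = ds.filter(person => person.age > 21)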

 

 

Java:
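
A Java sketch using the FilterFunction interface from org.apache.spark.api.java.function:

// the cast selects the Java-friendly filter overload
Dataset<Person> adults = ds.filter((FilterFunction<Person>) person -> person.getAge() > 21);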

 

Despite the similarity with RDD code, this code is building a query plan rather than dealing with individual objects, and if age is the only attribute accessed, then the rest of the object’s data will not be read from off-heap storage.


Conclusions

If you are developing primarily in Java then it is worth considering a move to Scala before adopting the DataFrame or Dataset APIs. Although there is an effort to support Java, Spark is written in Scala and the code often makes assumptions that make it hard (but not impossible) to deal with Java objects.

If you are developing in Scala and need your code to go into production with Spark 1.6.0 then the DataFrame API is clearly the most stable option available and currently offers the best performance.

However, the Dataset API preview looks very promising and provides a more natural way to code. Given the rapid evolution of Spark, it is likely that this API will mature very quickly through 2016 and become the de facto API for developing new applications.

See Apache Spark 2.0 API Improvements: RDD, DataFrame, DataSet and SQL here.
