What is Spark?
"Apache Spark™ is a fast and general engine for large-scale data processing. ... Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk." (from the Apache Spark website)
Whether or not these figures hold in practice, I think a few key concepts and components support the claim:
a. It programs against Resilient Distributed Datasets (RDDs), a model that differs considerably from earlier approaches such as MapReduce. Spark applies a number of optimizations (e.g. for iterative workloads and data locality) to spread work across the workers in a cluster, and it is especially good at reusing data between computations.
RDD: A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. [1]
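To make the definition concrete, here is a minimal sketch in plain Python. It is not the Spark API: `ToyRDD`, `from_list`, and `lose_partition` are hypothetical names invented for illustration. It only shows the idea that a partition can be rebuilt from the recipe that produced it rather than from a stored copy.

```python
class ToyRDD:
    """Hypothetical toy model of an RDD: read-only, partitioned, rebuildable."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute      # lineage: partition index -> records
        self._materialized = {}      # partitions currently held in memory

    @classmethod
    def from_list(cls, data, num_partitions):
        # Split the input round-robin into immutable chunks.
        chunks = [tuple(data[i::num_partitions]) for i in range(num_partitions)]
        return cls(num_partitions, lambda i: chunks[i])

    def map(self, fn):
        # A new dataset that remembers its parent and the function to apply.
        return ToyRDD(
            self.num_partitions,
            lambda i: tuple(fn(x) for x in self.partition(i)),
        )

    def partition(self, i):
        # Compute the partition on demand; reuse it while it is in memory.
        if i not in self._materialized:
            self._materialized[i] = tuple(self._compute(i))
        return self._materialized[i]

    def lose_partition(self, i):
        # Simulate a machine failure that drops one partition.
        self._materialized.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)        # the partition is gone...
after = squares.collect()        # ...and is rebuilt from its parent
```

The lost partition is recomputed from `numbers`, so `after` matches `before` without any replica of the data having been kept.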
b. It uses memory as much as possible. Most intermediate results stay in memory rather than on disk, so Spark avoids much of the disk I/O and the serialization/deserialization cost.
In fact, we already use many tools for similar purposes, such as memcached and Redis.
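The saving can be illustrated with a small plain-Python sketch. This is an analogy under stated assumptions, not Spark code: `load_from_disk` is a hypothetical stand-in that deserializes a pickled blob and counts how often it is called. In Spark itself the comparable step is asking for a dataset to be kept in memory with `cache()` or `persist()`.

```python
import pickle

load_count = 0


def load_from_disk(blob):
    """Hypothetical stand-in for a disk read: deserialize and count the call."""
    global load_count
    load_count += 1
    return pickle.loads(blob)


blob = pickle.dumps(list(range(1000)))

# Without caching: every pass pays the read + deserialize cost again.
totals_uncached = [sum(load_from_disk(blob)) for _ in range(3)]
uncached_loads = load_count

# With caching: load once, keep the records in memory, reuse them.
load_count = 0
records = load_from_disk(blob)
totals_cached = [sum(records) for _ in range(3)]
cached_loads = load_count
```

Both variants produce the same totals, but the cached one deserializes the data once instead of once per pass.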
c. It emphasizes parallelism.
d. It reduces JVM management overhead. For example, one executor runs many tasks, instead of one container per task as on YARN.
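Point d can be sketched by analogy in plain Python. This is only an illustration: threads stand in for JVM processes, and `task` and `run_and_record` are hypothetical names. The first loop starts a fresh worker for every task; the second hands all tasks to one long-lived pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def task(n):
    """Return the name of the worker that ran the task and the task's result."""
    return threading.current_thread().name, n * n


# One worker per task: every task pays the worker start-up cost.
per_task_workers = []


def run_and_record(n):
    per_task_workers.append(task(n)[0])


for n in range(8):
    worker = threading.Thread(target=run_and_record, args=(n,))
    worker.start()
    worker.join()

# One long-lived pool: a few workers are started once and reused.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(task, range(8)))

pool_workers = {name for name, _ in results}
squares = [value for _, value in results]
```

Eight tasks need eight short-lived workers in the first variant, while the pool never uses more than two.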
Architecture
(The core component serves as the platform for the other components.)
Usages of Spark
1. Iterative algorithms, e.g. machine learning, clustering.
2. Interactive analytics, e.g. querying a large amount of data loaded from disk into memory to reduce I/O latency.
3. Batch processing.
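The first usage is where in-memory reuse pays off most, because an iterative algorithm passes over the same data many times. The following is a minimal single-machine sketch in plain Python, not Spark code; `kmeans_1d`, the sample points, and the starting centroids are all made up for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy 1-D k-means: each iteration is one full pass over the same points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids


# The points are loaded once and stay in memory for every iteration.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids = kmeans_1d(points, centroids=[0.0, 10.0])
```

With the points held in memory, ten iterations cost ten in-memory scans rather than ten reads from disk.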
Programming languages
Most of the source code is written in Scala (I think many of its functions and ideas are inspired by Scala), but you can also write Spark programs in Java or Python.
Flexible integrations
Spark integrates with many popular frameworks, e.g. Hadoop, HBase, and Mesos.
ref:
[1] some papers