Article List
Going through the code block below, we can figure out some conclusions:
val barr1 = sc.broadcast(arr1) //-broadcast an array with 1M int elements
//-this is an embedded broadcast wrapped by the rdd below, so this data
val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr1 ...
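For context, here is a minimal self-contained sketch of that pattern, loosely after Spark's BroadcastTest example; the 1M-element Int array and the partition count `slices` are assumptions taken from the excerpt.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of the pattern above (after Spark's BroadcastTest example).
// Assumptions: a 1M-element Int array and 10 partitions, as in the excerpt.
object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch"))
    val slices = 10
    val arr1 = (0 until 1000000).toArray      // ~4MB of Int payload

    val barr1 = sc.broadcast(arr1)            // shipped once per executor, not per task
    // Each task reads the broadcast through barr1.value; the closure only
    // captures the small Broadcast handle, not the array itself.
    val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr1.value.length)
    observedSizes.collect().foreach(println)  // prints 1000000 ten times

    sc.stop()
  }
}
```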
access pattern in spark storage
[1]
So far, we have seen how Spark uses the JVM's memory and what execution slots on a cluster are. We have not yet covered the details of tasks; that will come in another article. Basically, a task is Spark's unit of work, which ...
spark-hive on spark
- blog category:
- spark
Overall design
The overall design idea of Hive on Spark is to reuse Hive's logical layer as much as possible, and, starting from physical plan generation, to provide a complete set of Spark-specific implementations, such as SparkCompiler and SparkTask, so that Hive queries can be executed as Spark jobs. The main design principles are as follows.
Minimize modifications to Hive's existing code. This is the biggest difference from the earlier Shark approach: Shark changed Hive so heavily that it could not be accepted by the Hive community. Hive on Spark changes as little of Hive's code as possible, so it does not affect Hive's current support for MapReduce and Tez. At the same time, Hive on Spark guarantees that existing MapReduc ...
In summation, the choice of when to use RDD or DataFrame and/or Dataset seems obvious. While the former offers you low-level functionality and control, the latter allows custom view and structure, offers high-level and domain specific operations, saves space, and executes at superior speeds.
...
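To make the trade-off concrete, here is an illustrative sketch (not from the excerpt) of the same aggregation written twice, once against the low-level RDD API and once against the higher-level DataFrame API; it assumes Spark 1.6-era APIs (SQLContext), consistent with the 1.4.x examples elsewhere on this page.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddVsDataFrame"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

    // RDD: full low-level control, but Spark only sees opaque functions
    // and generic objects, so there is nothing for an optimizer to do.
    val byKeyRdd = sc.parallelize(pairs).reduceByKey(_ + _)

    // DataFrame: declarative column operations that the Catalyst optimizer
    // can analyze, plus a compact binary row format that saves space.
    val byKeyDf = sc.parallelize(pairs).toDF("key", "value")
      .groupBy("key").sum("value")

    byKeyRdd.collect().foreach(println)
    byKeyDf.show()
    sc.stop()
  }
}
```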
Yep, from [1] we know that Spark divides job execution into two steps: a. launching executors, and b. the driver assigning tasks to those executors. So how the master assigns executors to workers is very important!
For standalone mode, when we dive into the source in Master#receiveWithLo ...
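The actual logic lives in Master.scala; the sketch below is a deliberately simplified model of the "spread out" allocation policy (round-robin, one core at a time, across workers with free cores), with invented names, not Spark's real code.

```scala
// Simplified model of the standalone master's spreadOutApps behavior:
// grant one core at a time, round-robin across usable workers, until the
// app's demand is met or no worker has free cores left.
case class WorkerSlot(id: String, var freeCores: Int)

def spreadOut(workers: Seq[WorkerSlot], coresWanted: Int): Map[String, Int] = {
  val assigned = scala.collection.mutable.Map[String, Int]().withDefaultValue(0)
  var toAssign = coresWanted
  var pos = 0
  val usable = workers.filter(_.freeCores > 0).toArray
  while (toAssign > 0 && usable.exists(_.freeCores > 0)) {
    val w = usable(pos % usable.length)
    if (w.freeCores > 0) {        // skip workers that are already full
      w.freeCores -= 1
      assigned(w.id) += 1
      toAssign -= 1
    }
    pos += 1
  }
  assigned.toMap                   // workerId -> cores granted on it
}

// e.g. spreadOut(Seq(WorkerSlot("w1", 4), WorkerSlot("w2", 4)), 6)
//      => Map(w1 -> 3, w2 -> 3), rather than packing w1 full first
```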
The code below is all from Spark's examples, except for some comments added by me.
val lines = ctx.textFile(args(0), 1)
//-1. generate links as <src, target> pairs
var links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1)) //-pair of ...
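For context, here is a compact runnable version of the same program, after Spark's SparkPageRank example, of which the excerpt above is the beginning; the input format (one "src dst" edge per line) and the 10 iterations are the example's own conventions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val ctx = new SparkContext(new SparkConf().setAppName("PageRankSketch"))
    val lines = ctx.textFile(args(0), 1)

    // <src, Iterable[target]> adjacency lists, cached because they are
    // re-read on every iteration
    val links = lines.map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1))
    }.distinct().groupByKey().cache()

    var ranks = links.mapValues(_ => 1.0)       // every page starts at rank 1.0

    for (_ <- 1 to 10) {                        // 10 iterations, as in the example
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        urls.map(url => (url, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach { case (url, rank) => println(s"$url has rank: $rank") }
    ctx.stop()
  }
}
```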
As in other big data technologies, checkpointing is a well-known solution for keeping a snapshot of data to speed up failover, i.e. restoring to the most recent checkpointed state of the data, so you do not need to recompute the RDD for the job.
In fact, the checkpoint op will cut down the relationships ...
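A minimal sketch of the API involved; the HDFS checkpoint path here is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointSketch"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // reliable storage; path is a placeholder

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
    rdd.cache()        // recommended: avoids recomputing the RDD when it is materialized
    rdd.checkpoint()   // marks the RDD; the data is written at the next action
    rdd.count()        // triggers the job and the checkpoint write

    // After checkpointing, the lineage is truncated: the RDD now reads from
    // the checkpoint files instead of replaying its parent dependencies.
    println(rdd.toDebugString)
    sc.stop()
  }
}
```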
There are several nice techniques in Spark, e.g. on the user API side. Here we will dive in and check how Spark implements them.
1. abstract (functions in RDD)

| group | function | feature | principle |
|-------|----------|---------|-----------|
| 1 | first() | retrieve the first element in this RDD; if it has more than one partition, the ... | |
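From the Spark source (RDD.scala), first() is a thin wrapper over take(1), which runs a job on partition 0 first and only scans later partitions if earlier ones did not yield enough elements. A sketch of the idea, written as a standalone helper rather than Spark's actual method:

```scala
// Sketch of how RDD.first() works: delegate to take(1), which scans
// partition 0 first and only touches further partitions when needed.
def first[T](rdd: org.apache.spark.rdd.RDD[T]): T = rdd.take(1) match {
  case Array(t) => t
  case _        => throw new UnsupportedOperationException("empty collection")
}
```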
There are several component entities that run as daemons in Spark (standalone); knowing what they do and how they work is indeed necessary.
The Akka message flow is similar to TCP.
note:
register driver = RequestSubmitDriver
register app = RegisterApplication, which is sent by the AppClient to the master at startup ...
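An illustrative, simplified model (not Spark's actual source) of the two registration messages named above, in Akka style: the client sends a message and the master pattern-matches on it in its receive loop. The String payloads stand in for Spark's real DriverDescription/ApplicationDescription objects.

```scala
// Simplified stand-ins for the real DeployMessages case classes.
case class RequestSubmitDriver(driverDescription: String)
case class RegisterApplication(appDescription: String)

class MasterModel {
  def receive(msg: Any): Unit = msg match {
    case RequestSubmitDriver(desc) =>
      println(s"master: scheduling driver for $desc")   // real master replies with a submit response
    case RegisterApplication(desc) =>
      println(s"master: registering app $desc")         // real master replies RegisteredApplication
  }
}
```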
As the official statements say, Spark is a computation framework, i.e. you can use it anywhere that supplies a platform to run on (e.g. YARN, Mesos).
So under such a cluster manager, none of Spark's own daemons are necessary to run the app; feel free to stop all of them.
hadoop@xx:~/spark/spark-1.4.1- ...
Similar to the previous article, this one is focused on cluster mode.
1. issue command
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master spark://gzsw-02:6066 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt
note: 1) th ...
1. startup command
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode client --master spark://gzsw-02:7077 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt
note: 1) the master is the cluster manager, as shown on the Spark master UI page, ...
Yep, just as you guessed, there are many deploy modes in Spark, e.g. standalone, YARN, Mesos, etc. Going one step further, the standalone mode can be divided into standalone and cluster (local) modes: the former is a real cluster mode in which the master and workers all run on individual nodes, while the latter ...
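As a rough illustration, the deploy target is largely selected by the master URL passed to the application (or to spark-submit); the host names and ports below are placeholders following the commands above, and the YARN/Mesos URLs use the Spark 1.x style.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrls {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MasterUrls")
      .setMaster("spark://gzsw-02:7077")   // standalone cluster
      // .setMaster("local[*]")            // local mode: everything in one JVM
      // .setMaster("yarn-client")         // YARN (Spark 1.x style URL)
      // .setMaster("mesos://host:5050")   // Mesos
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}
```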
1. overview in wordcount
- memory tips:
Job > Stage > RDD > Dependency
RDDs are linked by Dependencies.
2. terms
- an RDD is associated through a Dependency, i.e. a Dependency is a wrapper of an RDD.
A Stage contains its corresponding RDD; a Dependency contains the parent RDD as well.
- a Stage is a wrapper of same ...
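These terms can be inspected directly on a word count; the sketch below (input path is a placeholder) prints the Dependency objects wrapping an RDD's parents and the lineage string, where a shuffle dependency such as the one introduced by reduceByKey marks a stage boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageSketch"))
    val counts = sc.textFile("input.txt")          // path is a placeholder
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)                          // ShuffleDependency => new stage

    // Each RDD exposes the Dependency objects wrapping its parent RDDs.
    counts.dependencies.foreach(d => println(d.getClass.getSimpleName))
    // toDebugString prints the lineage; indentation shifts at shuffle boundaries.
    println(counts.toDebugString)
    sc.stop()
  }
}
```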
1. data flow overview
note:
- the arrows here mean: a bold line is a data line 'without sender and receiver meanings', indicating only the data's 'from-to'
- there are two ways to retrieve a task result: a direct result, and an indirect result (when it exceeds the Akka frame size)
2. actors in spark
3. several components communicated through E ...
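To illustrate the direct/indirect split: in Spark 1.x, a task result smaller than the Akka frame size is sent back to the driver inline (DirectTaskResult), while a larger one is stored in the block manager and only a handle is sent back (IndirectTaskResult). The sketch below sets the frame size explicitly; the 128MB value is just an example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TaskResultSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TaskResultSketch")
      .set("spark.akka.frameSize", "128")   // MB; ceiling for inline task results
    val sc = new SparkContext(conf)

    // collect() pulls each partition's contents back to the driver as a task
    // result; a big enough partition pushes the result over the frame size
    // and onto the indirect (block-manager) path.
    val big = sc.parallelize(1 to 10000000, 2).map(_.toLong)
    println(big.collect().length)
    sc.stop()
  }
}
```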