Article List
Going through the code block below, we can figure out some conclusions:
val barr1 = sc.broadcast(arr1) //-broadcast an array with 1M int elements
//-this is an embedded broadcast wrapped by the rdd below, so this data
val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr1 ...
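For context, here is a minimal self-contained sketch of that pattern, loosely after Spark's BroadcastTest example; the 1M-element Int array and the partition count `slices` are assumptions taken from the excerpt.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of the pattern above (after Spark's BroadcastTest example).
// Assumptions: a 1M-element Int array and 10 partitions, as in the excerpt.
object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch"))
    val slices = 10
    val arr1 = (0 until 1000000).toArray      // ~4MB of Int payload

    val barr1 = sc.broadcast(arr1)            // shipped once per executor, not per task
    // Each task reads the broadcast through barr1.value; the closure only
    // captures the small Broadcast handle, not the array itself.
    val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr1.value.length)
    observedSizes.collect().foreach(println)  // prints 1000000 ten times

    sc.stop()
  }
}
```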
access pattern in spark storage
[1]
So far, we have seen how Spark uses the JVM's memory and what execution slots on a cluster are. We have not yet covered the details of tasks; that will come in another article. Basically, a task is Spark's unit of work, which ...
spark-hive on spark
- blog category:
- spark
Overall design
The overall design idea of Hive on Spark is to reuse Hive's logical layer as much as possible, and, starting from physical plan generation, to provide a complete set of Spark-specific implementations, such as SparkCompiler and SparkTask, so that Hive queries can be executed as Spark jobs. The main design principles are as follows.
Minimize modifications to Hive's existing code. This is the biggest difference from the earlier Shark approach: Shark changed Hive so heavily that it could not be accepted by the Hive community. Hive on Spark changes as little of Hive's code as possible, so it does not affect Hive's current support for MapReduce and Tez. At the same time, Hive on Spark guarantees that existing MapReduc ...
In summation, the choice of when to use RDD or DataFrame and/or Dataset seems obvious. While the former offers you low-level functionality and control, the latter allows custom view and structure, offers high-level and domain specific operations, saves space, and executes at superior speeds.
...
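To make the trade-off concrete, here is an illustrative sketch (not from the excerpt) of the same aggregation written twice, once against the low-level RDD API and once against the higher-level DataFrame API; it assumes Spark 1.6-era APIs (SQLContext), consistent with the 1.4.x examples elsewhere on this page.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddVsDataFrame"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

    // RDD: full low-level control, but Spark only sees opaque functions
    // and generic objects, so there is nothing for an optimizer to do.
    val byKeyRdd = sc.parallelize(pairs).reduceByKey(_ + _)

    // DataFrame: declarative column operations that the Catalyst optimizer
    // can analyze, plus a compact binary row format that saves space.
    val byKeyDf = sc.parallelize(pairs).toDF("key", "value")
      .groupBy("key").sum("value")

    byKeyRdd.collect().foreach(println)
    byKeyDf.show()
    sc.stop()
  }
}
```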
Yep, from [1] we know that Spark divides job execution into two steps: a. launching executors, and b. the driver assigning tasks to those executors. So how the master assigns executors to workers is very important!
For standalone mode, when we dive into the source in Master#receiveWithLo ...
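The actual logic lives in Master.scala; the sketch below is a deliberately simplified model of the "spread out" allocation policy (round-robin, one core at a time, across workers with free cores), with invented names, not Spark's real code.

```scala
// Simplified model of the standalone master's spreadOutApps behavior:
// grant one core at a time, round-robin across usable workers, until the
// app's demand is met or no worker has free cores left.
case class WorkerSlot(id: String, var freeCores: Int)

def spreadOut(workers: Seq[WorkerSlot], coresWanted: Int): Map[String, Int] = {
  val assigned = scala.collection.mutable.Map[String, Int]().withDefaultValue(0)
  var toAssign = coresWanted
  var pos = 0
  val usable = workers.filter(_.freeCores > 0).toArray
  while (toAssign > 0 && usable.exists(_.freeCores > 0)) {
    val w = usable(pos % usable.length)
    if (w.freeCores > 0) {        // skip workers that are already full
      w.freeCores -= 1
      assigned(w.id) += 1
      toAssign -= 1
    }
    pos += 1
  }
  assigned.toMap                   // workerId -> cores granted on it
}

// e.g. spreadOut(Seq(WorkerSlot("w1", 4), WorkerSlot("w2", 4)), 6)
//      => Map(w1 -> 3, w2 -> 3), rather than packing w1 full first
```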
The code below is all from Spark's examples, except for some comments added by me.
val lines = ctx.textFile(args(0), 1)
//-1. generate links as <src, target> pairs
var links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1)) //-pair of ...
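For context, here is a compact runnable version of the same program, after Spark's SparkPageRank example, of which the excerpt above is the beginning; the input format (one "src dst" edge per line) and the 10 iterations are the example's own conventions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val ctx = new SparkContext(new SparkConf().setAppName("PageRankSketch"))
    val lines = ctx.textFile(args(0), 1)

    // <src, Iterable[target]> adjacency lists, cached because they are
    // re-read on every iteration
    val links = lines.map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1))
    }.distinct().groupByKey().cache()

    var ranks = links.mapValues(_ => 1.0)       // every page starts at rank 1.0

    for (_ <- 1 to 10) {                        // 10 iterations, as in the example
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        urls.map(url => (url, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach { case (url, rank) => println(s"$url has rank: $rank") }
    ctx.stop()
  }
}
```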
As in other big data technologies, checkpointing is a well-known solution for keeping a snapshot of data to speed up failover, i.e. restoring to the most recent checkpointed state of the data, so you do not need to recompute the RDD for the job.
In fact, the checkpoint op will cut down the relationships ...
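A minimal sketch of the API involved; the HDFS checkpoint path here is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointSketch"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // reliable storage; path is a placeholder

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
    rdd.cache()        // recommended: avoids recomputing the RDD when it is materialized
    rdd.checkpoint()   // marks the RDD; the data is written at the next action
    rdd.count()        // triggers the job and the checkpoint write

    // After checkpointing, the lineage is truncated: the RDD now reads from
    // the checkpoint files instead of replaying its parent dependencies.
    println(rdd.toDebugString)
    sc.stop()
  }
}
```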
There are several nice techniques in Spark, e.g. on the user API side. Here we will dive in and check how Spark implements them.
1. abstract (functions in RDD)

| group | function | feature | principle |
|-------|----------|---------|-----------|
| 1 | first() | retrieve the first element in this RDD; if it has more than one partition, the ... | |
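From the Spark source (RDD.scala), first() is a thin wrapper over take(1), which runs a job on partition 0 first and only scans later partitions if earlier ones did not yield enough elements. A sketch of the idea, written as a standalone helper rather than Spark's actual method:

```scala
// Sketch of how RDD.first() works: delegate to take(1), which scans
// partition 0 first and only touches further partitions when needed.
def first[T](rdd: org.apache.spark.rdd.RDD[T]): T = rdd.take(1) match {
  case Array(t) => t
  case _        => throw new UnsupportedOperationException("empty collection")
}
```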
There are several component entities that run as daemons in Spark (standalone); knowing what they do and how they work is indeed necessary.
The Akka message flow is similar to TCP.
note:
register driver = RequestSubmitDriver
register app = RegisterApplication, which is sent by the AppClient to the master at startup ...
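An illustrative, simplified model (not Spark's actual source) of the two registration messages named above, in Akka style: the client sends a message and the master pattern-matches on it in its receive loop. The String payloads stand in for Spark's real DriverDescription/ApplicationDescription objects.

```scala
// Simplified stand-ins for the real DeployMessages case classes.
case class RequestSubmitDriver(driverDescription: String)
case class RegisterApplication(appDescription: String)

class MasterModel {
  def receive(msg: Any): Unit = msg match {
    case RequestSubmitDriver(desc) =>
      println(s"master: scheduling driver for $desc")   // real master replies with a submit response
    case RegisterApplication(desc) =>
      println(s"master: registering app $desc")         // real master replies RegisteredApplication
  }
}
```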
As the official statements say, Spark is a computation framework, i.e. you can use it anywhere that supplies a platform to run on (e.g. YARN, Mesos).
So under such a cluster manager, none of Spark's own daemons are necessary to run the app; feel free to stop all of them.
hadoop@xx:~/spark/spark-1.4.1- ...
Similar to the previous article, this one is focused on cluster mode.
1. issue command
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master spark://gzsw-02:6066 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt
note: 1) th ...
1. startup command
./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode client --master spark://gzsw-02:7077 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt
note: 1) the master is the cluster manager, as shown on the Spark master UI page, ...
Yep, just as you guessed, there are many deploy modes in Spark, e.g. standalone, YARN, Mesos, etc. Going one step further, the standalone mode can be divided into standalone and cluster (local) modes: the former is a real cluster mode in which the master and workers all run on individual nodes, while the latter ...
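As a rough illustration, the deploy target is largely selected by the master URL passed to the application (or to spark-submit); the host names and ports below are placeholders following the commands above, and the YARN/Mesos URLs use the Spark 1.x style.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrls {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MasterUrls")
      .setMaster("spark://gzsw-02:7077")   // standalone cluster
      // .setMaster("local[*]")            // local mode: everything in one JVM
      // .setMaster("yarn-client")         // YARN (Spark 1.x style URL)
      // .setMaster("mesos://host:5050")   // Mesos
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}
```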
1. overview in wordcount
- memory tips:
Job > Stage > RDD > Dependency
RDDs are linked by Dependencies.
2. terms
- an RDD is associated through a Dependency, i.e. a Dependency is a wrapper of an RDD.
A Stage contains its corresponding RDD; a Dependency contains the parent RDD as well.
- a Stage is a wrapper of same ...
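These terms can be inspected directly on a word count; the sketch below (input path is a placeholder) prints the Dependency objects wrapping an RDD's parents and the lineage string, where a shuffle dependency such as the one introduced by reduceByKey marks a stage boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageSketch"))
    val counts = sc.textFile("input.txt")          // path is a placeholder
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)                          // ShuffleDependency => new stage

    // Each RDD exposes the Dependency objects wrapping its parent RDDs.
    counts.dependencies.foreach(d => println(d.getClass.getSimpleName))
    // toDebugString prints the lineage; indentation shifts at shuffle boundaries.
    println(counts.toDebugString)
    sc.stop()
  }
}
```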
1. data flow overview
note:
- the arrows here mean: a bold line is a data line 'without sender and receiver meanings', indicating only the data's 'from-to'
- there are two ways to retrieve a task result: a direct result, and an indirect result (when it exceeds the Akka frame size)
2. actors in spark
3. several components communicated through E ...
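To illustrate the direct/indirect split: in Spark 1.x, a task result smaller than the Akka frame size is sent back to the driver inline (DirectTaskResult), while a larger one is stored in the block manager and only a handle is sent back (IndirectTaskResult). The sketch below sets the frame size explicitly; the 128MB value is just an example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TaskResultSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TaskResultSketch")
      .set("spark.akka.frameSize", "128")   // MB; ceiling for inline task results
    val sc = new SparkContext(conf)

    // collect() pulls each partition's contents back to the driver as a task
    // result; a big enough partition pushes the result over the frame size
    // and onto the indirect (block-manager) path.
    val big = sc.parallelize(1 to 10000000, 2).map(_.toLong)
    println(big.collect().length)
    sc.stop()
  }
}
```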