Window Operations in Spark Streaming

m635674608

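A window operation in Spark Streaming can be understood as processing, at regular intervals, the data that arrived within a given span of time.

Forgive my clumsy wording. Here is the diagram, since a picture is worth a thousand words:

Sliding windows are widely used in monitoring and statistics scenarios. For example, every 2 seconds you count the requests or exceptions seen in the last 3 seconds, then take action based on that count.

As shown in the figure:

1. The red rectangle is a window; a window holds the data stream received over a span of time.

2. Each "time" in the figure is one time unit. In the official example, the window size is 3 time units, and the window slides once every 2 time units.

So a window-based operation needs two parameters: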

window length - The duration of the window (3 in the figure)
slide interval - The interval at which the window-based operation is performed (2 in the figure).

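1. Window length: I think of it as a container for the data that arrived within a span of time.

2. Slide interval: you can think of it as something like a cron expression, i.e. how often the operation fires.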
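
To make the two parameters concrete, here is a minimal sketch in plain Scala (no Spark needed) that lists which time units each window covers. The object and function names are made up for illustration, and the numbering follows the figure (units counted from 1), not Spark's internal scheduling.

```scala
object WindowSketch {
  // For each moment the window fires, return (fire time, time units covered).
  // windowLength and slideInterval are expressed in time units.
  def windows(totalUnits: Int, windowLength: Int, slideInterval: Int): Seq[(Int, Seq[Int])] =
    (windowLength to totalUnits by slideInterval).map { end =>
      end -> ((end - windowLength + 1) to end).toList
    }

  def main(args: Array[String]): Unit =
    // Same numbers as the figure: window length 3, slide interval 2.
    windows(totalUnits = 5, windowLength = 3, slideInterval = 2).foreach {
      case (end, units) =>
        println(s"window at time $end covers units ${units.mkString(", ")}")
    }
}
```

With a window length of 3 and a slide interval of 2, consecutive windows share one time unit (unit 3 here), so the same data is processed twice.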

Here is an example:

Take the well-known word count again: every 10 seconds, count the words received over the past 30 seconds.

    import org.apache.spark.streaming.Seconds

    // Reduce the last 30 seconds of data, every 10 seconds
    val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

Here `pairs` is a DStream of key-value pairs of the form (word, 1), produced by a map over the words.

`reduceByKeyAndWindow` is similar to `reduceByKey` on an RDD: it applies the given function to the values that share a key.

Here it groups the values by key and sums them over the window.
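
For context, a complete program around that one-liner might look like the sketch below. The socket source on localhost:9999, the 10-second batch interval, the `local[2]` master, and the `tokenize` helper are all assumptions made for illustration and are not from the original post. Window length and slide interval must both be multiples of the batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// On Spark versions before 1.3 you would also need:
// import org.apache.spark.streaming.StreamingContext._

object WindowedWordCount {
  // Hypothetical helper: split a line into non-empty words.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing (assumed setup).
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    val ssc = new StreamingContext(conf, Seconds(10)) // batch interval (assumed)

    val lines = ssc.socketTextStream("localhost", 9999) // assumed input source
    val pairs = lines.flatMap(tokenize).map(word => (word, 1))

    // Every 10 seconds, sum the counts seen in the last 30 seconds.
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedWordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it, you could feed the socket with something like `nc -lk 9999` and type a few lines of text.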

The API is pasted below, for reference only:

Transformation	Meaning

window(windowLength, slideInterval)	Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval)	Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval)	Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])	When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional `numTasks` argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])	A more efficient version of the above `reduceByKeyAndWindow()` where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enter the sliding window, and "inverse reducing" the old data that leave the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable to only "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter `invFunc`). Like in `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument.
countByValueAndWindow(windowLength, slideInterval, [numTasks])	When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in `reduceByKeyAndWindow`, the number of reduce tasks is configurable through an optional argument.
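
The "inverse reduce" idea in the `invFunc` variant can be checked with a small plain-Scala sketch. It is not Spark's implementation; all names here are hypothetical. It only shows that adding the batches entering the window and subtracting the ones leaving it gives the same result as recomputing the whole window.

```scala
object IncrementalWindowSketch {
  type Counts = Map[String, Int]

  // Merge two count maps with f, treating a missing key as 0.
  def combine(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).map(k => k -> f(a.getOrElse(k, 0), b.getOrElse(k, 0))).toMap

  // Recompute a window from scratch by summing all of its batches.
  def fullReduce(batches: Seq[Counts]): Counts =
    batches.foldLeft(Map.empty[String, Int])((acc, b) => combine(acc, b)(_ + _))

  // Slide the window: "reduce" the entering batches (+) and
  // "inverse reduce" the leaving batches (-), then drop keys that fell to 0.
  def slide(previous: Counts, entering: Seq[Counts], leaving: Seq[Counts]): Counts = {
    val added = combine(previous, fullReduce(entering))(_ + _)
    combine(added, fullReduce(leaving))(_ - _).filter { case (_, n) => n != 0 }
  }
}
```

In Spark itself, the call would pass both functions, for example `pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))`, and this variant requires checkpointing to be enabled on the streaming context.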

Output Operations

When an output operator is called, it triggers the computation of a stream. Currently the following output operators are defined:

Output Operation	Meaning

print()	Prints first ten elements of every batch of data in a DStream on the driver.
foreachRDD(func)	The fundamental output operator. Applies a function, func, to each RDD generated from the stream. This function should have side effects, such as printing output, saving the RDD to external files, or writing it over the network to an external system.
saveAsObjectFiles(prefix, [suffix])	Save this DStream's contents as a `SequenceFile` of serialized objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsTextFiles(prefix, [suffix])	Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix])	Save this DStream's contents as a Hadoop file. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
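
The naming rule "prefix-TIME_IN_MS[.suffix]" shared by the three save operations can be illustrated with a tiny helper. This is a hypothetical function written only to mirror the rule as stated in the table, not Spark's own code.

```scala
object BatchFileNameSketch {
  // Build "prefix-TIME_IN_MS" and append ".suffix" only when a non-empty suffix is given.
  def batchFileName(prefix: String, timeMs: Long, suffix: Option[String] = None): String = {
    val base = s"$prefix-$timeMs"
    suffix.filter(_.nonEmpty).fold(base)(s => s"$base.$s")
  }
}
```

For instance, `batchFileName("counts", 1000L, Some("txt"))` yields `counts-1000.txt`, so each batch interval gets its own distinct output name.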

Original article; please credit the source when reposting: http://blog.csdn.net/oopsoom/article/details/23776477

2015-10-12 10:01