“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store (at least in a fast-access place like the memory as opposed to the hard disk). Variants of trees, hash tables, etc. are not sketching structures, they just facilitate access to the data, but they still store the data itself. However, the concept of hashing is closely related to most sketching ideas as we will see.
The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error. The best part is that the probability of an error can be quantified and the programmer can trade off the expected error rate with the amount of resources (storage, time) afforded. At the limit of this trade-off (when no error is allowed) these sketching structures collapse into traditional data structures.
Sketching data structures are somewhat counter-intuitive, but they can be useful in many real applications. I look at two such structures mostly for my own benefit: As I try to understand them, I write down my notes. Perhaps someone else will find them useful. Links to further information can be found in the end. Leave comments if you know of other sketching data structures that you found useful or if you have some favorite elegant and unusual data structure.
The Count-Min (CM) sketch is less known than the Bloom filter, but it is somewhat similar (especially to the counting variants of the Bloom filter). The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example when counting accesses from different IP addresses to a server). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can be slightly off sometimes, but mostly on the small values. Again, the parameters of the data structure can be chosen such as to obtain a desired accuracy.
CM works as follows: we have k different hash functions and k different tables which are indexed by the outputs of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially we have all fields set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment all the corresponding k fields in the different tables (given by the hash values of the element). If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k elements.
To obtain the count of an element, we take the minimum of the k fields that correspond to that element (as given by the hashes). This makes intuitive sense. Out of the k values, probably some have been incremented on other elements also (if there were collisions on the hash values). However, if not all k fields have been returned by the hash functions on other elements, the minimum will give the correct value. See illustration for an example on counting hits from IP addresses:
In this example the scenario could be that we want to notice if an IP address is responsible for a lot of traffic (to further investigate if there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, simultaneously we check if the minimum is above some threshold and we do some costly operation if it is (which might be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is a small number, we don’t have to do anything (this holds for the presented simple variant that does not allow decreases). As the example shows, CM sketch is most useful for detecting “heavy hitters” in a stream.
It is interesting to note that if we take the CM data structure and make the counters such that they saturate at 1, we obtain the Bloom filter.
Refrences
http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
http://research.neustar.biz/2011/09/08/our-approach/
http://lkozma.net/blog/sketching-data-structures/
http://research.neustar.biz/tag/count-min-sketch/
相关推荐
提出的并行Count-Min Sketch算法将输入的数据流等分为子线程,每个子线程使用原始的Count-Min Sketch算法处理子流。每个本地Count-Min Sketch中计数器频率增量超过预定义阈值的,会被发送到一个合并线程,该线程能够...
Count-Min Sketch是一种数据结构,常用于实时大数据流分析中的频数估计。它是一个轻量级的、空间效率高的工具,能对大量数据流中的元素出现次数进行近似计数,而不会占用过多的内存。在标题"countminsketch:YACMSI -...
Count-Min Sketch是一种概率数据结构,常用于在线计算大规模数据流中的频率估计。在C语言中实现Count-Min Sketch,可以高效地处理大数据分析任务,尤其是那些内存受限或实时性要求高的场景。以下是对Count-Min ...
Count-Min-Log是一种用于近似计数稀疏事件的数据结构,它是在Count-Min Sketch的基础上进行优化,以降低内存需求。Count-Min Sketch最初由Cormode和Muthukrishnan在2004年提出,用于估计流中的元素出现次数。它利用...
为了解决这一问题,冯文峰等人构造了一种名为“多层Count-Min概要数据结构”(Hierarchical Count-Min Sketch)的算法,简称HCM。 HCM结构通过在多层数据域上定义一系列两两相互独立的异或哈希函数族,将数据流元组...
Count-Min 草图 Count-Min 草图的 Java 实现,用于在指定错误范围内查找数据流中的事件频率。 更多信息 参考: [1] [2] [3] [4] 问题 欢迎修复错误或改进! 请 fork 项目并在 github 上发送 pull request。...
CountMin Sketch算法是一种在大数据分析领域广泛使用的概率数据结构,主要应用于近似计数和频率估算。这个算法设计的目标是在空间效率和准确性之间找到一个良好的平衡,尤其适用于处理大规模、高维度的数据流。在...
Count-Min Sketch是一种在计算机科学领域,特别是在数据流分析、网络流量监测以及大数据处理中广泛使用的数据结构。它被设计用来近似地估计大规模数据集中的元素频率或计数,而不需要存储所有元素。这种技术的主要...
开源项目-seiflotfy-cmts.zip是一个聚焦于数据统计与概览的开源项目,主要采用了Count-Min-Tree Sketch(计最小树结构)这一技术。Count-Min-Tree Sketch是一种用于近似计算高维大数据流中元素出现频率的高效算法。...
Count-Min Sketch(简称CMS)是一种用于数据流分析的高效、空间高效的概率数据结构,用于近似计算数据流中的元素计数。它在大数据处理、网络流量分析、搜索引擎优化等领域有广泛应用。本篇文章将深入探讨Count-Min ...
艺术家配对评估器 - Count-Min Sketch 关于 文本文件“Artist_lists_small.txt”包含来自的 1000 位用户最喜欢的音乐艺术家。 每行是一个最多 50 位艺术家的列表,格式如下: Radiohead,Pulp,Morrissey,Delays,...
这些数据结构包括Theta Sketch、Count-Min Sketch、Quantile Sketch等。Theta Sketch用于估算不重复元素的数量,它通过合并多个草图可以得到更准确的结果,适合处理流式数据。Count-Min Sketch则用于估算元素的频率...
这些算法通常包括Count-Min Sketch、Count-Sketch、Bloom Filter和Top-K Sketch等。例如,Count-Min Sketch用于估算数据流中的元素频次,Bloom Filter则用于快速判断一个元素是否可能存在于数据集中,而Top-K Sketch...
资源分类:Python库 所属语言:Python 资源全名:count_min_sketch-1.0.0.tar.gz 资源来源:官方 安装方法:https://lanzao.blog.csdn.net/article/details/101784059
最后,使用`count_min_sketch-master`这个压缩包,你将获得一个已经实现好的CountMinSketch GenServer源代码。通过阅读和理解这个源代码,你可以更好地了解如何在Elixir中应用CountMinSketch,并将其集成到你的数据...
因此,基数估计器如HyperLogLog、LogCount或Count-Min Sketch等算法应运而生。这些算法利用概率统计原理,在牺牲一定精度的同时,以极低的空间复杂度提供基数估算。 HyperLogLog算法是基数估计的代表之一,它通过...
2. **Count-Min Sketch**: Count-Min Sketch是另一种概率数据结构,用于估算流数据中的元素频率。它使用二维数组和多个哈希函数,可以快速近似计算元素出现的次数。在大数据实时分析中,Count-Min Sketch被广泛用来...
- Skizze中的概率数据结构主要包括布隆过滤器(Bloom Filter)、Count-Min Sketch和HyperLogLog等。这些数据结构利用统计学原理来近似地存储和查询数据,降低了内存占用,同时也允许一定程度的错误容忍。 2. **...
1. Count-min草图(Count-Min Sketch): Count-min草图是一种概率数据结构,用于估计数据流中的元素频率。它通过使用多维数组和随机哈希函数来近似计数,能够在空间效率极高的情况下提供一定的误差保证。在处理...