Finding Top K Frequent Elements in a Large Dataset


Problem: Consider the problem of finding the top K most frequent numbers in a file of N numbers.

Now suppose that N is so large that the data cannot fit in the memory M available to the program.

How do we find the top K frequent elements under the assumption that K < M < N?


Deterministic solution
The idea is similar to solving the problem for small N (N < M) (see here). For large N, divide the data into chunks of size <= M and solve each chunk separately.

To divide the data, consider a uniform hash function H that takes a key and returns an integer from the set {1, 2, ..., ceil(N/M)}. This gives about ceil(N/M) chunks with roughly M numbers per chunk, and every occurrence of a given number falls in the same chunk. Since the number of unique numbers (U) is less than N, we expect each chunk to contain fewer than M unique numbers.

Now for each chunk we compute the frequency of the numbers it contains. This frequency computation fits in memory, and we can maintain a MIN-HEAP of size K that is updated directly (following the steps presented here). As a result, only two reads of the dataset and one disk write are required to get the top K frequent items. The complexity of the algorithm is O(N log K).
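
A minimal Python sketch of this two-pass approach follows; the chunk file naming, Python's built-in hash standing in for H, and the single global heap are illustrative assumptions rather than the exact procedure from the linked post.

import heapq
from collections import Counter

def top_k_frequent(path, K, num_chunks):
    # Pass 1: hash every number into one of num_chunks chunk files so that
    # all occurrences of the same number land in the same chunk.
    chunks = [open("chunk_%d.tmp" % c, "w") for c in range(num_chunks)]
    with open(path) as f:
        for line in f:
            x = int(line)
            chunks[hash(x) % num_chunks].write("%d\n" % x)
    for cf in chunks:
        cf.close()

    # Pass 2: count each chunk in memory and feed the counts into a
    # global min-heap of size K keyed by frequency.
    heap = []                               # entries are (frequency, number)
    for c in range(num_chunks):
        with open("chunk_%d.tmp" % c) as cf:
            counts = Counter(int(line) for line in cf)
        for x, freq in counts.items():
            if len(heap) < K:
                heapq.heappush(heap, (freq, x))
            elif freq > heap[0][0]:
                heapq.heapreplace(heap, (freq, x))
    return sorted(heap, reverse=True)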

Alternatively, for a small K (K < M^2/N), we can compute the top K per chunk in O(K log M) and combine the per-chunk top K lists (N*K/M < M candidates in total) in memory to get the overall top K. The total complexity of the algorithm is then O(N + M log M).

Alternate method
We assume that the disk has a huge capacity, which allows us to build a disk-based hash table. Imagine that we create a very large file with holes and divide it into B blocks, each with the capacity to hold more than N/B numbers and their integer counts. Using a uniform hash function H that takes an arbitrary number x and returns H(x), the block number from the set {0, 1, ..., B-1}, we can store the numbers and their frequencies in this disk map. H gives us the offset to seek to within the file, and we write each number (and its frequency) sequentially into the correct block (or into a slot within the block chosen via another hash function H1).
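
As a rough illustration of the seek arithmetic on such a sparse file, here is a Python sketch; the fixed-width record format, the block capacity, and the secondary hash H1 are assumptions, and slot collisions are simply ignored.

import struct

RECORD = struct.Struct("qq")      # a (number, count) pair as two signed 64-bit ints
SLOTS_PER_BLOCK = 1 << 16         # assumed block capacity, chosen larger than N/B

def record_offset(x, B):
    # H(x) selects the block; a secondary hash H1 selects the slot inside it.
    block = hash(x) % B
    slot = hash((x, "H1")) % SLOTS_PER_BLOCK
    return (block * SLOTS_PER_BLOCK + slot) * RECORD.size

def disk_increment(f, x, B, delta=1):
    # Read-modify-write the record for x at its offset in the file (opened "r+b").
    f.seek(record_offset(x, B))
    data = f.read(RECORD.size)
    count = RECORD.unpack(data)[1] if len(data) == RECORD.size else 0
    f.seek(record_offset(x, B))
    f.write(RECORD.pack(x, count + delta))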

Now we proceed as usual: we start counting the numbers in memory using a hash table (as in the former approach). When we encounter a new number that cannot be kept in memory, we purge some entries from the hash table and write the purged entries to the disk map. To decide what to purge, we maintain an array that counts the frequency of the keys falling in each block. Keys that belong to the most infrequent blocks can be purged first (or blocks that are least recently used, or that lead to the fewest disk accesses, etc.).

The following pseudocode gives a very basic outline of this approach:
BlockFreq = array(B)          # per-block key frequency, used to choose purge victims
NumberFreq = hashtable(M)     # in-memory counts
diskwrite = 0
for i = 1:N
   x = A[i]
   BlockFreq[H(x)] += 1

   # x is already counted in memory
   if NumberFreq.haskey(x)
      NumberFreq[x] += 1
      continue
   end

   # x is new and the in-memory table still has room
   if NumberFreq.hasspace()
      NumberFreq[x] = 1
      continue
   end

   # x is new but memory is full: count it in the disk-based map instead
   if DiskMap.haskey(x)
      DiskMap[x] += 1
   else
      DiskMap[x] = 1
   end

   # every few writes to the disk map, purge cold keys from memory
   if diskwrite == 10
      purge(NumberFreq, BlockFreq)
      diskwrite = 0
   else
      diskwrite += 1
   end
end
Here purge is a procedure that evicts some set of keys from NumberFreq based on BlockFreq. Note that this code omits several key details of the process, so the idea presented here is quite crude.
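
As a rough sketch of one possible purge policy (evict keys that fall in the least frequent blocks), assuming the same hypothetical names as above and a plain dictionary standing in for the disk map:

def purge(number_freq, block_freq, H, disk_map, fraction=0.25):
    # Rank blocks from least to most frequent and treat the colder half as victims.
    cold_blocks = sorted(range(len(block_freq)), key=lambda b: block_freq[b])
    cold = set(cold_blocks[:len(cold_blocks) // 2])
    # Evict keys that hash into cold blocks, up to the requested fraction of memory.
    target = max(1, int(len(number_freq) * fraction))
    victims = [x for x in number_freq if H(x) in cold]
    for x in victims[:target]:
        disk_map[x] = disk_map.get(x, 0) + number_freq.pop(x)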

Single pass probabilistic solution
The deterministic solution is quite efficient as it requires only two disk reads of the dataset, but the bottleneck can be the disk writes during the initial chunk formation. We can reduce that bottleneck by using a data structure related to Bloom filters.

So consider that we have B uniform hash functions H1, H2, ..., HB, each mapping a key to the range {1, 2, ..., R}. Now imagine an array C of size B x R (< M) that counts how many times each key is seen. For each number x that we read from the dataset, compute Hi(x) and increment C[i, Hi(x)] by 1 for every i. We thus maintain B counts of x in different buckets of size R. The true count of x is at most min(C[1, H1(x)], ..., C[B, HB(x)]).
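
A compact Python sketch of this counting structure (essentially a Count-Min sketch); the seeded hash functions are an assumption made for illustration.

import random

class CountMin:
    def __init__(self, B, R, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(B)]  # one salt per hash Hi
        self.R = R
        self.C = [[0] * R for _ in range(B)]                  # the B x R counter array

    def add(self, x):
        for i, salt in enumerate(self.salts):
            self.C[i][hash((salt, x)) % self.R] += 1

    def estimate(self, x):
        # The true count of x is at most this minimum over the B rows.
        return min(self.C[i][hash((salt, x)) % self.R]
                   for i, salt in enumerate(self.salts))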

Now if the query is to get all elements with frequency greater than some threshold, we can use these counters to retrieve all such numbers (with some false positives, which can be filtered out using another pass over the dataset). This saves a complete write of the data to disk (see the paper Computing Iceberg Queries Efficiently).
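
For example, a hedged sketch of such a threshold query with a verification pass, reusing the CountMin class sketched above (the one-number-per-line file format is an assumption):

def heavy_items(path, sketch, threshold):
    # Pass 1: build the sketch and collect everything it claims is above the
    # threshold. This can include false positives but no false negatives.
    candidates = set()
    with open(path) as f:
        for line in f:
            x = int(line)
            sketch.add(x)
            if sketch.estimate(x) >= threshold:
                candidates.add(x)
    # Pass 2: exact counts over the (small) candidate set remove the false positives.
    exact = dict.fromkeys(candidates, 0)
    with open(path) as f:
        for line in f:
            x = int(line)
            if x in exact:
                exact[x] += 1
    return [x for x, c in exact.items() if c >= threshold]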

But in our case we are interested in finding the top K frequent numbers. The following modification can be used to estimate the frequency of each number:
MH = MIN-HEAP(K)
for i = 1:N
   x = A[i]
   for b = 1:B
      C[b,Hb(x)] += Sb(x)
   end

   if contains(MH,x)
      increment count of x in the heap
   else 
      f = median(C[i,Hi(x)] * Si(x), \forall i)
      if f > min(MH)
         remove-min(MH)
         insert(MH, (x,f))
      end
   end
end
The Sb functions are {-1,+1} hash functions, and this data structure is called a CountSketch. More details of the method are available in the paper Finding Frequent Items in Data Streams.
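
A self-contained Python sketch of the CountSketch-plus-candidates idea described above; the seeded hash and sign functions are assumptions, and a plain dictionary of K candidates replaces the min-heap for clarity (a real implementation would keep a heap keyed by the estimated frequency).

import heapq
import random

class CountSketchTopK:
    def __init__(self, B, R, K, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(B)]
        self.R, self.K = R, K
        self.C = [[0] * R for _ in range(B)]   # B x R array of signed counters
        self.tracked = {}                      # current top-K candidates -> estimate

    def _bucket(self, salt, x):
        return hash((salt, x)) % self.R

    def _sign(self, salt, x):
        return 1 if hash((x, salt)) & 1 else -1

    def estimate(self, x):
        # Median of the signed counters is the CountSketch estimate of freq(x).
        vals = sorted(self.C[i][self._bucket(s, x)] * self._sign(s, x)
                      for i, s in enumerate(self.salts))
        return vals[len(vals) // 2]

    def add(self, x):
        for i, s in enumerate(self.salts):
            self.C[i][self._bucket(s, x)] += self._sign(s, x)
        f = self.estimate(x)
        if x in self.tracked or len(self.tracked) < self.K:
            self.tracked[x] = f
        else:
            # Replace the weakest current candidate if x looks more frequent.
            xmin = min(self.tracked, key=self.tracked.get)
            if f > self.tracked[xmin]:
                del self.tracked[xmin]
                self.tracked[x] = f

    def top_k(self):
        return heapq.nlargest(self.K, self.tracked.items(), key=lambda kv: kv[1])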

Note that the above algorithm takes a single pass over the dataset (and requires no disk writes), but it is not guaranteed to return the exact top K frequent items: it can make mistakes for some of the less frequent items. In practice, the choice of hash functions is critical for the performance of the algorithm.

Reference:

http://n1b-algo.blogspot.com/2012/07/finding-top-k-elements-in-large-dataset.html 
