`

Finding Top K Frequent Items

 
阅读更多

 

Problem: Consider a file containing N numbers. For e.x. {2,3,4,3,1,78,78,3} is a file containing N=8 numbers. Assume that N is small enough to fit in the computer's memory M. So in this case N << M. Now we need to find top K frequent numbers in the file. For the above example, output should be {3,78} for K = 2.

Running Time: Time complexity should be either O(N), O(N log K), O(N + K log N) 

Solution
Finding top K frequent elements is a classical database and data streaming problem and there are several solutions to it ranging from deterministic to probabilistic. I'll illustrate the three obvious ones here. 

The first step is to count how many times each number appears in the file. If the file is pre-sorted then we need a single scan over the file. 
Function COUNTS:
   A = load_file_in_array

   last = A[1]
   ctr = 1
   for i = 2:N
      if last == A[i]
         ctr += 1
      else
         CountMap{last} = ctr;
         last = A[i]
         ctr = 1
      end 
   end

   // required for the last element of the array
   CountMap{last} = ctr
end
Note that CountMap is a hashmap that stores counts of all the elements. The procedure COUNTS is quite efficient if the file is pre-sorted. Now if the file is not pre-sorted then sorting increases the time complexity to O(N log N). In that case, we can do better by directly using hashmap to maintain current count of each number without sorting the file as follows:
Function EFFICIENT_COUNTS:
   A = load_file_in_array
   
   for i = 1:N
      CountMap{A[i]} += 1
   end
end
The above procedure obtains counts of each element in a single scan of the file. Hence it runs in O(N) time. So now we have all the numbers along with their frequencies in an array (can be extracted by enumerating all keys of the CountMap or by another scan of the file!!). So for the example we have in the problem statement, we get the following tuple: {(2,1), (3,3), (4,1), (1,1), (78,2)}.

The problem that remains is to find the most frequent K numbers from the array. The naive approach would be to sort numbers on their frequencies and pick the top K. This would take O(U log U) time where U (=5) is the number of unique elements in the array. If we consider U = O(N) then the time complexity of this sorting is O(N log N). We can do better than that as follows:

Approach 1: O(N) time
Use selection algorithm to find the Kth most frequent number (on the second element of the tuple) using the Selection Algorithm in O(U) time. The Kth most frequent element partitions the array in two parts: first part containing top K most frequent elements and second part containing bottom U-K-1 frequent elements. So we get the top K most frequent elements in no particular order in O(N) time (assuming U = O(N)). They can be sorted in O(K log K) if needed. Note that although this approach runs in O(N) time, the constants hidden in the O-notation can be large. So in practice this approach can be slower than the two approaches described below. 

Approach 2: O(N log K) time
Pick first K tuples and put them on MIN-HEAP, where a tuple (x,y) is less than a tuple (a,b) if y is less than b. The time complexity to make the min heap of size K is O(K). 

Then for the remaining U - K elements, pick them one by one. If the picked element is lesser than the minimum on the heap, discard that element. Otherwise remove the min element from the head and insert the selected element in the heap. This ensures that heap contains only K elements. This delete-insert operation is O(log K) for each element. 

Once we are done picking all the elements, the elements that finally remain in the min-heap are the top K frequent items which can be popped in O(K log K) time. The overall cost of this approach is O(K + (U-K) log K + K log K) = O(K + U log K). Since K < U and U = O(N), we get the time complexity of O(N log K).
public int[] kMostFrequentItems(int[] A, int k) {
	Map<Integer, Integer> map = new HashMap<>();
	for(int num:A) {
		Integer cnt = map.get(num);
		if(cnt == null) {
			cnt = 0;
		}
		map.put(num, cnt+1);
	}
	Queue<Integer> pq = new PriorityQueue<>(k+1, new Comparator<Integer>(){
		@Override
		public int compare(Integer o1, Integer o2) {
			return map.get(o1) - map.get(o2);
		}
	});
	
	for(Integer key:map.keySet()) {
		pq.offer(key);
		if(pq.size() > k) {
			pq.poll();
		}
	}
	
	int size = Math.min(k, pq.size());
	int[] items = new int[size];
	for(int i=size-1; i>=0; i--) {
		items[i] = pq.poll();
	}
	return items;
}
 
Approach 3: O(N + K log N) time
This approach is similar to approach 2 but the main difference is that we make a MAX-HEAP of all the U elements. So the first step is to make the max heap of all the elements in O(U). Then remove the maximum element from the heap K times in O(K log U) time. The K removed elements are the desired most frequent elements. The time complexity of this method is O(U + K log U) and by setting U = O(N) we get O(N + K log N). 

Let us stop for a moment and contrast approach 2 from 3. For simplicity let T2 = K + N log K be the time complexity of approach 2 and T3 = N + K log N be the time complexity of the third approach. Figure below plots the ratio T2/T3 for N=100 and for different values of K. We observe that approach 3 is considerably better for small values of K whereas approach 2 is better for large values of K. Though actual difference depends on the constants involved we can still see the merit of one approach over another. 

Reference:

http://n1b-algo.blogspot.com/2012/07/finding-top-k-frequent-items.html 

分享到:
评论

相关推荐

    Finding Frequent Items in Data Streams-计算机科学

    Finding Frequent Items in Data StreamsMoses Charikar?1, Kevin Chen??2, and Martin Farach-Colton31 Princeton University moses@cs.princeton.edu2 UC Berkeley kevinc@cs.berkeley.edu3 Rutgers University ...

    Finding Frequent Items in Data Streams - PLVDB - 2008-计算机科学

    Finding Frequent Items in Data StreamsGraham Cormode Marios Hadjieleftheriou AT&T Labs–Research, Florham Park, NJ {graham,marioh}@research.att.comABSTRACT The frequent items problem is to process a ...

    Finding Top-k Shortest Simple Paths with Diversity.pptx

    Finding Top-k Shortest Paths with Diversity 在计算机科学和信息技术领域中,寻找 Top-k 最短简单路径问题是一个经典的问题。该问题的应用非常广泛,例如在路线规划、交通导航、物流等领域都有重要的实践价值。本...

    Finding Top-k Min-Cost Connected Trees in Databases2007

    标题 "Finding Top-k Min-Cost Connected Trees in Databases 2007" 指出了这篇论文的核心研究问题,即在数据库中寻找最小成本的前k个连通树。从标题可以推导出以下几点: 1. **数据库**:数据库技术是现代信息系统...

    论文研究-An Improved Method for Solving Equal Weights Sub-Paths Problem in K Shortest Path Algorithm.pdf

    在研究论文《An Improved Method for Solving Equal Weights Sub-Paths Problem in K Shortest Path Algorithm》中,作者李玉萍、王大江等介绍了一种针对K最短路径算法(KSP)的改进方法。该方法专门针对在大型网络...

    大数据之数据挖掘课程:海量数据集挖掘 03-LSH Finding Similar Items 共59页.pdf

    03-LSH Finding Similar Items:Locality Sensitive Hashing 04-LSH theory of Locality Sensitive Hashing 05-聚类算法 clustering 06-降维技术 Dimensionality Reduction:SVD&CUR 07-推荐系统 Recommender ...

    Finding Similar Items-计算机科学

    Chapter 3Finding Similar ItemsA fundamental data-mining problem is to examine data for “similar” items. We shall take up applications in Section 3.1, but an example would be looking at a collection ...

    finding a majority among n votes.pdf

    finding a majority among n votes.pdffinding a majority among n votes.pdffinding a majority among n votes.pdffinding a majority among n votes.pdfv

    Learning the Kernel and Finding Performance Problems with KFI

    标题与描述:“学习内核与使用KFI查找性能问题” 在深入探讨KFI(Kernel Function Instrumentation)如何帮助我们理解Linux内核并检测性能瓶颈之前,让我们先对这个主题有一个全面的理解。 ### KFI:内核函数仪器...

    edge_finding_matlab边缘寻找_

    "edge_finding_matlab边缘寻找_"这个标题所指的是使用MATLAB进行图像边缘检测的过程。下面我们将深入探讨这一主题。 边缘检测的主要目的是将图像从二维空间转化为一维边缘描述,从而简化图像并突出重要的特征。...

    kuka_SeamTech_Finding中文说明书--视觉探寻.pdf

    SeamTech Finding是库卡系统中一款集成视觉探寻功能的软件,用于工业机器人在进行焊接、装配等任务时,通过视觉系统找到焊接缝隙等特征的位置。接下来,将详细介绍SeamTech Finding软件的主要知识点。 ### 1. ...

    论文翻译:On Finding Socially Tenuous Groups for Online Social Networks

    论文翻译:On Finding Socially Tenuous Groups for Online Social Networks。现有的寻找社会群体的研究主要集中在社交网络中的密集子图。然而,寻找社会脆弱的群体也有许多重要的应用。在本文中,我们引入了K三角的...

    Matlab安装Error finding installer class解决方法

    ### Matlab安装Error finding installer class解决方法 #### 问题背景及表现 在安装特定版本的Matlab(例如R2009b、R2010b等)时,可能会遇到一个名为“Error finding installer class”的错误。这个错误通常出现在...

    Finding Metric Structure in Information Theoretic Clustering

    《Finding Metric Structure in Information Theoretic Clustering》一文由Kamalika Chaudhuri和Andrew McGregor共同撰写,深入探讨了使用Kullback-Leibler(简称KL)散度作为距离度量进行聚类的问题。 ### 1. KL...

    finding-lungs-in-ct-data.zip

    "finding-lungs-in-ct-data.zip" 文件集合显然与CT图像处理相关,特别是关注肺部的分析。这个压缩包包含了几个关键文件,它们提供了用于肺部分割和体积识别的数据和结果。 首先,`lung_stats.csv` 文件很可能是记录...

    Finding People in Images and Videos

    Finding People in Images and Videos,dalal大神的毕业论文130多页,我打印出一部分看过,理解hog很有用,光流部分还没看。还有另一个人毕业论文,放这里,怕自己以后找不到

    root-finding.rar_ROOT_root finding_root-finding

    标题中的"root-finding.rar_ROOT_root finding"暗示了这是一个关于根查找算法的资料包,而描述提到了三种方法:二分法、最近邻法和错点法。 **二分法**(Bisection Method)是最基础且最直观的根查找算法之一。它...

    FINDING STRUCTURE WITH RANDOMNESS.pdf

    3. 对于那些因太大而不能完全加载到快速内存中的矩阵,随机化技术仅需要常数次遍历数据,而传统算法需要O(k)次遍历。实际上,有时可以仅用一次数据遍历就能完成矩阵近似。 文章还讨论了低秩矩阵近似的模型问题,即...

    A Unified Method for Finding Impossible Differentials of

    本文介绍了一种系统化的方法——统一不可能差分查找方法(Unified Impossible Differential finding method, UID-method),用于高效地寻找各种块密码结构中的不可能差分。这种方法相较于先前由Kim等人提出的U-...

Global site tag (gtag.js) - Google Analytics