`

Median of Integer Stream

 
阅读更多

Given a stream of unsorted integers, find the median element in sorted order at any given time. So, we will be receiving a continuous stream of numbers in some random order and we don’t know the stream length in advance. Write a function that finds the median of the already received numbers efficiently at any time. We will be asked to find the median multiple times. Just to recall, median is the middle element in an odd length sorted array, and in the even case it’s the average of the middle elements.

 

This is a data structure question. We will insert the received numbers into such a data structure that we’ll be able to find the median very efficiently. Let’s analyse the possible options.

 

We can insert the integers to an unsorted array, so we’ll just append the numbers to the array one by one as we receive. Insertion complexity is O(1) but finding the median will take O(N) time, if we use the Median of Medians algorithm that I described in my previous post. However, our goal is to find the median most efficiently, we don’t care that much about insertion performance. But this algorithm does the exact opposite, so unsorted array is not a feasible solution.

 

What about using a sorted array? We can find the position to insert the received number in O(logN) time using binary search. And at any time if we’re asked for the median we can just return the middle element if the array length is odd, or the average of middle elements if the length is even. This can be done in O(1) time, which is exactly what we’re looking for. But there’s a major drawback of using a sorted array. To keep the array sorted after inserting an element, we may need to shift the elements to the right, which will take O(N) time. So, even if finding the position to insert the number takes O(logN) time, the overall insertion complexity is O(N) due to shifting. But finding the median is still extremely efficient, constant time. However, linear time insertion is pretty inefficient and we would prefer a better performance.

 

Let’s try linked lists. First unsorted linked list. Insertion is O(1), we can insert either to the head or tail but we suffer from the same problem of unsorted array. Finding the median is O(N). What if we keep the linked list sorted? We can find the median in O(1) time if we keep track of the middle elements. Insertion to a particular location is also O(1) in any linked list, so it seems great thus far. But, finding the right location to insert is not O(logN) as in sorted array, it’s instead O(N) because we can’t perform binary search in a linked list even if it is sorted. So, using a sorted linked list doesn’t worth the effort, insertion is O(N) and finding median is O(1), same as the sorted array. In sorted array insertion is linear due to shifting, here it’s linear because we can’t do binary search in a linked list. This is a very fundamental data structure knowledge that we should keep at the top of our heads all the time.

 

Using a stack or queue wouldn’t help as well. Insertion would be O(1) but finding the median would be O(N), very inefficient.

 

What if we use trees? Let’s use a binary search tree with additional information at each node, number of children on the left and right subtrees. We also keep the number of total nodes in the tree. Using this additional information we can find the median in O(logN) time, taking the appropriate branch in the tree based on number of children on the left and right of the current node. However, the insertion complexity is O(N) because a standard binary search tree can degenerate into a linked list if we happen to receive the numbers in sorted order.

 

So, let’s use a balanced binary search tree to avoid worst case behaviour of standard binary search trees. In a height balanced binary search tree (i.e. AVL tree) the balance factor is the difference between the heights of left and right subtrees. A node with balance factor 0, +1, or -1 is considered to be balanced. However, in our tree the balance factor won’t be height, it is the number of nodes in the left subtree minus the number of nodes in the right subtree. And only the nodes with balance factor of +1 or 0 are considered to be balanced. So, the number of nodes on the left subtree is either equal to or 1 more than the number of nodes on the right subtree, but not less. If we ensure this balance factor on every node in the tree, then the root of the tree is the median, if the number of elements is odd. In the even case, the median is the average of the root and its inorder successor, which is the leftmost descendent of its right subtree. So, complexity of insertion maintaining balance condition is O(logN) and find median operation is O(1) assuming we calculate the inorder successor of the root at every insertion if the number of nodes is even. Insertion and balancing is very similar to AVL trees. Instead of updating the heights, we update the number of nodes information.

 

Balanced binary search trees seem to be the most optimal solution, insertion is O(logN) and find median is O(1). Can we do better? We can achieve the same complexity with a simpler and more elegant solution. We will use 2 heaps simultaneously, a max-heap and a min-heap with 2 requirements. The first requirement is that the max-heap contains the smallest half of the numbers and min-heap contains the largest half. So, the numbers in max-heap are always less than or equal to the numbers in min-heap. Let’s call this the order requirement. The second requirement is that, the number of elements in max-heap is either equal to or 1 more than the number of elements in the min-heap. So, if we received 2N elements (even) up to now, max-heap and min-heap will both contain N elements. Otherwise, if we have received 2N+1 elements (odd), max-heap will contain N+1 and min-heap N. Let’s call this the size requirement.

 

The heaps are constructed considering the two requirements above. Then once we’re asked for the median, if the total number of received elements is odd, the median is the root of the max-heap. If it’s even, then the median is the average of the roots of the max-heap and min-heap. Let’s now analyse why this approach works, and how we construct the heaps.

 

We will have two methods, insert a new received number to the heaps and find median. The insertion procedure takes the two requirements into account, and it’s executed every time we receive a new element. We take two different approaches depending on whether the total number of elements is even or odd before insertion.

 

Let’s first analyze the size requirement during insertion. In both cases we insert the new element to the max-heap, but perform different actions afterwards. In the first case, if the total number of elements in the heaps is even before insertion, then there are N elements both in max-heap and min-heap because of the size requirement. After inserting the new element to the max-heap, it contains N+1 elements but this doesn’t violate the size requirement. Max-heap can contain 1 more element than min-heap. In the second case, if the number of elements is odd before insertion, then there are N+1 elements in max-heap and N in min-heap. After we insert the new element to the max-heap, it contains N+2 elements. But this violates the size constraint, max-heap can contain at most 1 more element than min-heap. So we pop an element from max-heap and push it to min-heap. The details will be described soon.

 

Now let’s analyse the order requirement. This requirement forces every element in the max-heap to be less than or equal to all the elements in min-heap. So the max-heap contains the smaller half of the numbers and the min-heap contains the larger half. Note that by design the root of the max-heap is the maximum of the lower half, and root of the min-heap is the minimum of the upper half. Keeping these in mind, we again take two different actions depending on whether the total number of elements is even or odd before insertion. In the even case we just inserted the new element to the max-heap. If the new element is less than all the elements in the min-heap, then the order constraint is satisfied and we’re done. We can perform this check by comparing the new element to the root of the min-heap in O(1) time since the root of the min-heap is the minimum. But if the new element is larger than the root of min-heap then we should exchange those elements to satisfy the order requirement. Note that in this case the root of the max-heap is the new element. So we pop the the root of min-heap and insert it to max-heap. Also pop the root of max-heap and insert it to min-heap. In second case, where the total number of elements before insertion is odd, we inserted the new element to max-heap, then we popped an element and pushed it to the min-heap. To satisfy the order constraint, we pop the maximum element of the max-heap, the root, and insert it to the min-heap. Insertion complexity is O(logN), which is the insertion complexity of a heap.

 

That is exactly how the insertion procedure works. We ensured that both size and order requirements are satisfied during insertion. Find median function works as follows. At any time we will be queried for the median element. If the total number of elements at that time is odd, then the median is the root of the max-heap. Let’s visualize this with an example. Assume that we have received 7 elements up to now, so the median is the 4th number in sorted order. Currently, max-heap contains 4 smallest elements and min-heap contains 3 largest because of the requirements described above. And since the root of the max-heap is the maximum of the smallest four elements, it’s the 4th element in sorted order, which is the median. Else if the total number of elements is even, then the median is the average of the roots of max-heap and min-heap. Let’s say we have 8 elements, so the median is the average of 4th and 5th elements in sorted order. Currently, both the max-heap and min-heap contain 4 numbers. Root of the max-heap is the maximum of the smallest numbers, which is 4th in sorted order. And root of the min-heap is the minimum of the largest numbers, which is 5th in sorted order. So, the median is the average of the roots. In both cases we can find the median in O(1) time because we only access the roots of the heaps, neither insertion nor removal is performed. Therefore, overall this solution provides O(1) find heap and O(logN) insert.

 

A code is worth a thousand words, here is the code of the 2-heaps solution. As you can see, it’s much less complicated than it’s described. We can use the heapq module in python, which provides an implementation of min-heap only. But we need a max-heap as well, so we can make a min-heap behave like a max-heap by multiplying the number to be inserted by -1 and then inserting. So, every time we insert or access an element from the max-heap, we multiply the value by -1 to get the original number:

public class MedianOfIntegerStream {

	private int count;
	private Queue<Integer> minHeap = new PriorityQueue<>();
	private Queue<Integer> maxHeap = new PriorityQueue<>(new Comparator<Integer>(){
		@Override
		public int compare(Integer o1, Integer o2) {
			return o2-o1;
		}
	});
	
	public double addNumberToStream(int num) {
		maxHeap.offer(num);
		if(count % 2 == 0) {
			if(!minHeap.isEmpty() && maxHeap.peek() > minHeap.peek()) {
				int maxHeapRoot = maxHeap.poll();
				int minHeapRoot = minHeap.poll();
				maxHeap.offer(minHeapRoot);
				minHeap.offer(maxHeapRoot);
			}
		} else {
			minHeap.offer(maxHeap.poll());
		}
		count++;
		return getMedian();
	}
	
	public double getMedian() {
		if(count % 2 == 0) {
			return (maxHeap.peek() + minHeap.peek()) / 2.0;
		} else {
			return maxHeap.peek();
		}
	}
	
	public static void main(String[] args) {
		MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();
        streamMedian.addNumberToStream(10);
        streamMedian.addNumberToStream(50);
        streamMedian.addNumberToStream(-50);
        streamMedian.addNumberToStream(3);
        streamMedian.addNumberToStream(16);
	}
}

 

From:

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

 

分享到:
评论

相关推荐

    LeetCode4 Median of Two Sorted Arrays

    There are two sorted arrays nums1 and nums2 of size m and n respectively. ...Find the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)). Java AC版本

    Analysis of Median of Medians Algorithm

    本文是对选择算法或者称为“中位数中位数”算法的详细分析,该算法是文本中未详尽阐述的。首先,我们要明确算法的几个关键点:S表示要从中选择的n个不同元素的集合,g为向下取整的n/5,代表分组的数量,m是算法找到...

    An Introduction to Least Median of Squares

    最小中值二乘(Least Median of Squares,简称LMS)是一种用于估计回归模型参数的稳健统计方法。该方法特别适用于数据集中存在异常值(离群点)时,仍能保持较好的回归性能,是传统最小二乘法(Ordinary Least ...

    题目1004:Median

    Given an increasing sequence S of N integers, the median is the number at the middle position. For example, the median of S1={11, 12, 13, 14} is 12, and the median of S2={9, 10, 15, 16, 17} is 15. The...

    median-of-two-sorted-arrays

    Find the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)). Example 1: nums1 = [1, 3] nums2 = [2] The median is 2.0 Example 2: nums1 = [1, 2] nums2 = [3, 4...

    js代码-4. Median of Two Sorted Arrays

    在JavaScript编程领域,"Median of Two Sorted Arrays"(两个排序数组的中位数)是一个经典的算法问题。这个问题的目的是在给定两个已排序的数组中找到它们合并后的中位数,但无需实际合并这两个数组。这是一个典型...

    Select-Median.rar_median select_medianselect_select median

    “Median of Medians”算法是一个更为稳定的选择方法,它确保了每次都能选出一个接近中位数的枢轴,从而避免了最坏情况的发生。该算法首先将数组分为若干组,每组5个元素(如果数组长度不能被5整除,则最后一组可能...

    MedianFlow.rar

    **MedianFlow追踪算法详解** MedianFlow是一种经典的运动目标追踪算法,由Matthias Kaltenbach在2009年提出。这个算法的核心是基于中值流的概念,它结合了光流法和卡尔曼滤波器的优点,适用于处理复杂的背景、光照...

    Median-filtering.rar_median filter

    标题"Median-filtering.rar_median filter"提到了"中值滤波"这一关键词,这是一项在图像处理领域广泛应用的技术。"Median filter"通常是指中值滤波器,它是一种非线性的滤波方法,主要用于去除图像中的噪声,尤其是...

    c语言-leetcode 0004-median-of-two-sorted-array.zip

    c c语言_leetcode 0004_median_of_two_sorted_array.zip

    java-leetcode题解之004-Median-of-Two-Sorted-Arrays

    java入门 java_leetcode题解之004_Median_of_Two_Sorted_Arrays

    K-Median_聚类_

    "K-Median_聚类_" 指的是使用K-Median算法进行聚类的过程。K-Median是K-Means算法的一种变体,主要区别在于它使用中位数而非平均值来代表每个簇的中心,这使得它对异常值更具鲁棒性。 K-Median算法的基本步骤如下:...

    a new directional weighted median filter

    描述中提到的"a new directional weighted median filter for removal of random-valued impulse noise"进一步细化了这个滤波器的应用场景,即专门用于去除随机值的脉冲噪声。脉冲噪声是一种常见的图像噪声,它表现...

    3x3_Median_test.zip_3X3_median vhdl_median3x3_vhdl median

    【3x3_Median_test.zip_3X3_median vhdl_median3x3_vhdl median】这个标题揭示了我们要讨论的核心内容:一个3x3中值滤波器的VHDL实现。VHDL(Very High Speed Integrated Circuit Hardware Description Language)是...

    ada.rar_Adaptive median_Median Algorithm_median

    标题中的“ada.rar_Adaptive median_Median Algorithm_median”暗示了我们正在探讨一种名为“自适应中值滤波器”的算法,通常用于图像处理领域。在图像处理中,中值滤波器是一种非线性的滤波方法,它通过将图像窗口...

    Estimate of Median Household Income Group Series-数据集

    estimate-of-median-household-income-for-alameda-county-ca_metadata.json estimate-of-median-household-income-for-bergen-county-nj_metadata.json estimate-of-median-household-income-for-cook-county-il_...

    Calculating Mean, Median, and Standard Deviation of Data

    Matlab Tutorial - 33 - Calculating Mean, Median, and Standard Deviation of Data in a Vector

    中值滤波median_filter

    中值滤波median_filtermedian_filtermedian_filtermedian_filtermedian_filter

    Median - MetaTrader 5脚本.zip

    **MetaTrader 5脚本——Median** MetaTrader 5(MT5)是一个广泛使用的交易平台,专为金融市场的交易者设计,支持多种金融工具,包括外汇、股票、期货等。它提供了丰富的技术分析工具和自动化交易功能,使得交易者...

    median_polish.zip_median

    "median_polish.zip_median"文件包就提供了一个这样的解决方案,它基于Tukey的中位数平滑法(Median Polish),这是一种用于拟合加性模型的稳健方法。 Tukey的中位数平滑法,也称为中位数平滑或中位数打磨,是由...

Global site tag (gtag.js) - Google Analytics