- 浏览: 538796 次
- 性别:
- 来自: 杭州
文章分类
最新评论
-
GGGGeek:
看完了博主的博文,如果没猜错的话应该是浙大吧?很多优秀的人因为 ...
转《D君的故事》 以时刻警示自己 -
游牧民族:
楼主写的不错,学习了,最近对爬虫比较感兴趣,也写了些爬虫相关的 ...
通用爬虫框架及heritrix爬虫介绍 -
jimmee:
jerome_s 写道ice 你怎么看? 粗略的看了一下ice ...
MessagePack, Protocol Buffers和Thrift序列化框架原理和比较说明 -
jerome_s:
ice 你怎么看?
MessagePack, Protocol Buffers和Thrift序列化框架原理和比较说明 -
jimmee:
nk_tocean 写道照着做了,但是不行啊,还是乱码.先确认 ...
hive编写udf处理非utf-8数据
I came across an interesting programming puzzle today, and I'd like to share a couple of variants on it.
To start with, let's say we have an array A that contains some numbers. For simplicity, let's say that the array is 1-based. How can we efficiently find out if any of those numbers are duplicates?
The easiest approach is to make a hash table to keep track of the numbers we've seen so far. Something like this:
Given an array A of length n,
let h ← (new hash table)
for 1 <= i <= n:
if A[i] is present in h: return A[i]
set h[A[i]] ← True
return <no duplicates>
Now, suppose that the array is of length n and only contains positive integers less than n. We can be sure (by the pigeonhole principle) that there is at least one duplicate. The bounds on the values mean that we can be a little more efficient and store the already-seen numbers in an array rather than a hash table:
let s ← array [1..n]
initialize s to all zeroes
for 1 <= i <= n:
if s[A[i]] > 0: return A[i]
set s[A[i] ← 1
Now, suppose that we have to do this in constant space; no more creating hash tables and arrays on the fly. Is if still possible? If you add the constraint that there be exactly one duplicate value, this is easy.
Since there are n values from 1 to n-1 with one duplicate, we can use the fact that the sum of the integers from 1 to m is (m*(m+1))/2, take the sum from 1 to n-1, subtract that from the actual sum of numbers in the array, and the result will be the duplicated number:
let s ← 0
for 1 <= i <= n:
s ← s + A[i]
return s - n*(n-1)/2
But suppose there are no guarantees about the number of duplicates, and the only assurance you have is that the values are between 0 and n, exclusive. Is it still possible to find a duplicated value in O(n) time and O(1) space? Feel free to stop here and try to work things out yourself, if you want.
As you might have guessed from the length of this entry, there is an approach that will work. It's something of a radical departure from the previous approaches, and it relies rather heavily on the particular bounds we have on the values in the array.
First, there's the solution-by-cheating. If you're working in a language with bounded integers, you can exploit that knowledge to meet the letter of the law in the "constant space" requirement.
Let's say we're using a language and platform where integers are 32 bits, signed. The maximum value for an array element would therefore be 231-1, while the minimum value would still be 1. We only need one bit of information per value (seen/not seen), so, at 8 bits per byte, an array 223 bytes (8MB) long would be enough to track any set of integers the environment is capable of passing. This algorithm is therefore O(n) time and O(1) space on a system with 32-bit signed integers:
let s ← array [1..2^31-1] of bit
for 1 <= i <= n:
if s[A[i]] = 1: return A[i]
set s[A[i]] ← 1
Of course, it relies on implementation details that might change (a similar array on a 64-bit system would require 32 petabytes of memory) and it might be nice to have an algorithm that works regardless of any artifical bounds on your integers. Fortunately, there are a couple more approaches we can take.
Recall that the array is 1-based, so its elements are numbered from 1 to n. The values are positive integers less than n, so they can range anywhere from 1 to n-1. Because the values are a subset of the array bounds, we can treat those values as indices into the array itself.
If we're allowed to modify the array, we can just go through it reordering the elements, trying to put each value into the corresponding element (so the value 1 goes into the first element, 2 goes into the second element, and so on). When we find a collision, that's the duplicate.
for 1 <= i <= n:
while A[i] ≠ i:
if A[A[i]] = A[i]: return A[i]
swap(A[A[i]], A[i])
If the array is immutable, though, the solution is a little more involved. Given the view of each value as an index into the array, we can treat the array as a directed graph, where each element is a node and the value of that element is an edge pointing at the next node. Since every node points to another node, there is at least one cycle in the graph. We can also conclude that every node leads to a cycle, even if it is not directly part of one.
(Side note on terminology, for those not immediately familiar: A graph is simply a collection of nodes and edges. Each edge connects exactly two nodes. In a directed graph, each edge also has a direction; it goes from one node to another. If you start at a particular node, follow (one of) its edges to the next node, continue doing that for a certain amount of time, and eventually come back to the starting node, that's a cycle. If you start at a particular node, follow its edges out in the same way, and end up in a cycle that does not contain the starting node, all of the nodes and edges you crossed before getting to the cycle are called the tail. The first node that you reach that is in the cycle is defined to be the beginning of the cycle. The symbols λ (lambda) and μ (mu) represent the lengths of the tail and cycle, respectively. Technically, the directed graph we're dealing with here is also a sequence, since every node has at most one edge leading away from it (though it might have more than one leading into it).)
Since the values in the array are all less than n, no element in the array can point to the nth element. Therefore, the last element cannot be part of a cycle. Since every element must lead to a cycle, it must be part of a tail. If we follow the sequence from this point, we will eventually run into a cycle, and the last element in the tail will be a duplicate value.
So, how do we find the beginning of the cycle? The easiest approach is to use Floyd's cycle-finding algorithm. It works roughly like this: Start at the beginning of the sequence. Keep track of two values (call them ai and aj). At each step of the algorithm, move ai one step along the sequence, but move aj two steps. Stop when ai = aj.
At this point, the difference between i and j must be a multiple of the cycle length. If the algorithm took m steps to get to this point, then i = m and j = 2m, so m is a multiple of the cycle length. i has also traveled λ steps along the tail and m-λ steps along the cycle.
If we now start another value (call it ak) at the beginning of the sequence and move both it and ai simultaneously and at the same rate, after λ steps, ak will have traveled the entire tail and will have arrived at the beginning of the cycle. At the same time, ai will have moved a total of m steps along the cycle and, since m is a multiple of μ, will also have arrived at the beginning of the cycle. Thus, if you stop as soon as ai = ak, you will have found the beginning of the cycle and the end of the tail.
So. Applying all of that to our problem means that we start at the last element of the array, have two variables following the pointers in the array, one going twice as fast as the other, until they are equal to each other. At that point, start one of them back at the beginning and run them at the same speed until they again equal each other. Their value will then be the duplicate value.
let i ← n, j ← n
do: i ← A[i], j ← A[A[j]]; until i = j
set j ← n
do: i ← A[i], j ← A[j]; until i = j
return i
This algorithm also lends itself nicely to functional programming, since it doesn't actually require any side effects, so here's an implementation in Haskell:
import Data.Array
findDup a = findDup' n $ findCycle (a!n) (a!(a!n))
where n = snd $ bounds a
findCycle i j | i == j = i
| otherwise = findCycle (a!i) (a!(a!j))
findDup' i j | i == j = i
| otherwise = findDup' (a!i) (a!j)
转载自:http://aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html
To start with, let's say we have an array A that contains some numbers. For simplicity, let's say that the array is 1-based. How can we efficiently find out if any of those numbers are duplicates?
The easiest approach is to make a hash table to keep track of the numbers we've seen so far. Something like this:
Given an array A of length n,
let h ← (new hash table)
for 1 <= i <= n:
if A[i] is present in h: return A[i]
set h[A[i]] ← True
return <no duplicates>
Now, suppose that the array is of length n and only contains positive integers less than n. We can be sure (by the pigeonhole principle) that there is at least one duplicate. The bounds on the values mean that we can be a little more efficient and store the already-seen numbers in an array rather than a hash table:
let s ← array [1..n]
initialize s to all zeroes
for 1 <= i <= n:
if s[A[i]] > 0: return A[i]
set s[A[i] ← 1
Now, suppose that we have to do this in constant space; no more creating hash tables and arrays on the fly. Is if still possible? If you add the constraint that there be exactly one duplicate value, this is easy.
Since there are n values from 1 to n-1 with one duplicate, we can use the fact that the sum of the integers from 1 to m is (m*(m+1))/2, take the sum from 1 to n-1, subtract that from the actual sum of numbers in the array, and the result will be the duplicated number:
let s ← 0
for 1 <= i <= n:
s ← s + A[i]
return s - n*(n-1)/2
But suppose there are no guarantees about the number of duplicates, and the only assurance you have is that the values are between 0 and n, exclusive. Is it still possible to find a duplicated value in O(n) time and O(1) space? Feel free to stop here and try to work things out yourself, if you want.
As you might have guessed from the length of this entry, there is an approach that will work. It's something of a radical departure from the previous approaches, and it relies rather heavily on the particular bounds we have on the values in the array.
First, there's the solution-by-cheating. If you're working in a language with bounded integers, you can exploit that knowledge to meet the letter of the law in the "constant space" requirement.
Let's say we're using a language and platform where integers are 32 bits, signed. The maximum value for an array element would therefore be 231-1, while the minimum value would still be 1. We only need one bit of information per value (seen/not seen), so, at 8 bits per byte, an array 223 bytes (8MB) long would be enough to track any set of integers the environment is capable of passing. This algorithm is therefore O(n) time and O(1) space on a system with 32-bit signed integers:
let s ← array [1..2^31-1] of bit
for 1 <= i <= n:
if s[A[i]] = 1: return A[i]
set s[A[i]] ← 1
Of course, it relies on implementation details that might change (a similar array on a 64-bit system would require 32 petabytes of memory) and it might be nice to have an algorithm that works regardless of any artifical bounds on your integers. Fortunately, there are a couple more approaches we can take.
Recall that the array is 1-based, so its elements are numbered from 1 to n. The values are positive integers less than n, so they can range anywhere from 1 to n-1. Because the values are a subset of the array bounds, we can treat those values as indices into the array itself.
If we're allowed to modify the array, we can just go through it reordering the elements, trying to put each value into the corresponding element (so the value 1 goes into the first element, 2 goes into the second element, and so on). When we find a collision, that's the duplicate.
for 1 <= i <= n:
while A[i] ≠ i:
if A[A[i]] = A[i]: return A[i]
swap(A[A[i]], A[i])
If the array is immutable, though, the solution is a little more involved. Given the view of each value as an index into the array, we can treat the array as a directed graph, where each element is a node and the value of that element is an edge pointing at the next node. Since every node points to another node, there is at least one cycle in the graph. We can also conclude that every node leads to a cycle, even if it is not directly part of one.
(Side note on terminology, for those not immediately familiar: A graph is simply a collection of nodes and edges. Each edge connects exactly two nodes. In a directed graph, each edge also has a direction; it goes from one node to another. If you start at a particular node, follow (one of) its edges to the next node, continue doing that for a certain amount of time, and eventually come back to the starting node, that's a cycle. If you start at a particular node, follow its edges out in the same way, and end up in a cycle that does not contain the starting node, all of the nodes and edges you crossed before getting to the cycle are called the tail. The first node that you reach that is in the cycle is defined to be the beginning of the cycle. The symbols λ (lambda) and μ (mu) represent the lengths of the tail and cycle, respectively. Technically, the directed graph we're dealing with here is also a sequence, since every node has at most one edge leading away from it (though it might have more than one leading into it).)
Since the values in the array are all less than n, no element in the array can point to the nth element. Therefore, the last element cannot be part of a cycle. Since every element must lead to a cycle, it must be part of a tail. If we follow the sequence from this point, we will eventually run into a cycle, and the last element in the tail will be a duplicate value.
So, how do we find the beginning of the cycle? The easiest approach is to use Floyd's cycle-finding algorithm. It works roughly like this: Start at the beginning of the sequence. Keep track of two values (call them ai and aj). At each step of the algorithm, move ai one step along the sequence, but move aj two steps. Stop when ai = aj.
At this point, the difference between i and j must be a multiple of the cycle length. If the algorithm took m steps to get to this point, then i = m and j = 2m, so m is a multiple of the cycle length. i has also traveled λ steps along the tail and m-λ steps along the cycle.
If we now start another value (call it ak) at the beginning of the sequence and move both it and ai simultaneously and at the same rate, after λ steps, ak will have traveled the entire tail and will have arrived at the beginning of the cycle. At the same time, ai will have moved a total of m steps along the cycle and, since m is a multiple of μ, will also have arrived at the beginning of the cycle. Thus, if you stop as soon as ai = ak, you will have found the beginning of the cycle and the end of the tail.
So. Applying all of that to our problem means that we start at the last element of the array, have two variables following the pointers in the array, one going twice as fast as the other, until they are equal to each other. At that point, start one of them back at the beginning and run them at the same speed until they again equal each other. Their value will then be the duplicate value.
let i ← n, j ← n
do: i ← A[i], j ← A[A[j]]; until i = j
set j ← n
do: i ← A[i], j ← A[j]; until i = j
return i
This algorithm also lends itself nicely to functional programming, since it doesn't actually require any side effects, so here's an implementation in Haskell:
import Data.Array
findDup a = findDup' n $ findCycle (a!n) (a!(a!n))
where n = snd $ bounds a
findCycle i j | i == j = i
| otherwise = findCycle (a!i) (a!(a!j))
findDup' i j | i == j = i
| otherwise = findDup' (a!i) (a!j)
转载自:http://aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html
发表评论
-
moses安装记录
2016-12-11 11:00 01. moses的安装说明地址(注意,一定要安装好相应的bo ... -
翻译算法
2016-12-09 07:50 0翻译的概率T=argsMaxP(T|S)=P(T)P(S|T ... -
JPEG 简易文档 V2.15【转载】
2016-04-10 16:35 631JPEG 简易文档 V2.15 ------------- ... -
Java 并发之 ConcurrentSkipListMap 简述
2015-09-20 20:24 1123JCIP 提到了在 Java 6 中引入了两个新的并发集合类 ... -
最简单的平衡树(红-黑树)的实现
2015-09-04 08:04 1186在二叉搜索树(BST)的基础上,要实现一颗平衡树,可以使用 ... -
lucene索引创建的理解思路
2014-06-29 23:12 1481虽然lucene4很早就出来,但是这里仍然以lucene3. ... -
lucene的拼写检查的实现原理
2014-06-08 18:19 13101. 建索引时, 使用ngram的方式创建索引 Sp ... -
字符串相似算法-(3) NGram Distance
2014-06-08 17:54 4893就是N-Gram version of edit dista ... -
字符串相似算法-(2) Levenshtein distance
2014-06-08 16:32 2219编辑距离概念描述: ... -
字符串相似算法-(1) Jaro-Winkler Distance
2014-06-08 12:05 6751Jaro-Winkler Distance 算法 ... -
Lucene的数字范围搜索 (Numeric Range Query)原理
2014-04-05 16:08 135470. 全文索引的核心就是倒排索引. 1. 若 ... -
UDT协议-基于UDP的可靠数据传输协议的实现分析(7)-流量和拥塞控制
2014-04-02 20:53 4141流量控制 对于一个带宽1Gbps, RTT为100ms的网络 ... -
UDT协议-基于UDP的可靠数据传输协议的实现分析(6)-链接的建立和关闭
2014-04-01 22:47 20421. 模式有client/server mode(客户端, ... -
协议-基于UDP的可靠数据传输协议的实现分析(5)-可靠性怎么保证
2014-03-31 23:08 3777发送方的处理:1) 包发送确认后,由于还没有收到确认,先缓存 ... -
UDT协议-基于UDP的可靠数据传输协议的实现分析(4)-发送和接收的算法
2014-03-30 10:09 71470. 计时器udt有四种计时器: ACK, NAK, E ... -
UDT协议-基于UDP的可靠数据传输协议的实现分析(3)-包结构说明
2014-03-29 17:24 3194udt的包结构1. 数据包,基本结构如下: 0 ... -
UDT协议-基于UDP的可靠数据传输协议的实现分析(2)-为什么要用udt
2014-03-28 08:00 37580. AIMD算法的简单回顾 (1) 慢开始 ... -
UDT协议-基于UDP的可靠数据传输协议的实现分析(1)-准备工作
2014-03-27 12:52 38121. 协议实现方案: Yunhong Gu提出的rfc的草 ... -
mapreduce的一些算法设计,优化等(2)
2014-01-28 15:50 14671. 反序(order inversion ... -
mapreduce的一些算法设计,优化等(1)
2014-01-27 17:15 2067本系列是根据书籍《Da ...
相关推荐
《Finding Metric Structure in Information Theoretic Clustering》一文由Kamalika Chaudhuri和Andrew McGregor共同撰写,深入探讨了使用Kullback-Leibler(简称KL)散度作为距离度量进行聚类的问题。 ### 1. KL...
"finding-lungs-in-ct-data.zip" 文件集合显然与CT图像处理相关,特别是关注肺部的分析。这个压缩包包含了几个关键文件,它们提供了用于肺部分割和体积识别的数据和结果。 首先,`lung_stats.csv` 文件很可能是记录...
Finding People in Images and Videos,dalal大神的毕业论文130多页,我打印出一部分看过,理解hog很有用,光流部分还没看。还有另一个人毕业论文,放这里,怕自己以后找不到
Finding lungs in CT 是基于肺部 CT 影像分割处理的数据集,其包含一系列 CT 影像中对肺部影像的分割,并以此识别和估计肺部容积量。 该数据集包含 4 名患者的数据,以 nifti 格式的图像和分段肺面罩为主,由 ...
在这篇文章中,作者Steve Gregory提出了一种基于标签传播技术来寻找大型网络中重叠社区结构的算法。社区结构是复杂网络中一个非常重要的特性,指的是网络中节点倾向于聚集在具有密集内部连接但稀疏外部连接的不同...
本文介绍的标题是“使用矩阵的特征向量寻找网络中的社团结构”,主要围绕着newman算法及其在解决网络社团划分问题中的应用进行阐述。社团划分问题是一个研究网络中具有密集内部连接和稀疏外部连接的顶点组的重要领域...
Fast Duplicate File Finder is a powerful utility for finding duplicate files in a folder and all its sub folders. The duplicate remover has the following features: Find DUPLICATE files or find ...
( Wiley Taming the Big Data Tidal Wave, Finding Opportunities in Huge Data Streams with Advanced Analytics (2012).pdf )
《寻找图像和视频中的人物——Navneet Dalal》是Navneet Dalal博士的一篇学术论文,共计150页,深入探讨了计算机视觉领域中的人脸与人体识别技术。这篇论文涵盖了从静态图像到动态视频的人员检测、识别和追踪的广泛...
### 寻找完整MD5预映像比穷举搜索更快 #### 摘要与背景 本文介绍了一种针对完整MD5散列函数的有效预映像攻击方法,该方法复杂度为\(2^{116.9}\),可以生成一个伪预映像,并且在复杂度为\(2^{123.4}\)的情况下生成...
Finding Frequent Items in Data StreamsMoses Charikar?1, Kevin Chen??2, and Martin Farach-Colton31 Princeton University moses@cs.princeton.edu2 UC Berkeley kevinc@cs.berkeley.edu3 Rutgers University ...
标题与描述:“寻找完整SHA-1中的碰撞” ...SHA-1(Secure Hash Algorithm 1),作为1995年由美国国家标准与技术研究院(NIST)发布的一种联邦信息处理标准,自问世以来便被广泛应用于各种政府及行业安全标准中,尤其...
「威胁情报」D2T1 - Finding 0days in Embedded Systems with Code Coverage Guided Fuzzing - Dr Quynh and Kai Jern Lau - APT 安全漏洞 无线安全 WEB应用防火墙 安全人才 信息安全
### 求线段交最优算法详解 #### 引言 在计算几何领域,寻找一组线段中的交点是一项基础而重要的任务。本篇文章将详细介绍由Ivan J. Balaban提出的一种新算法,该算法能够有效地找出给定线段集合中的所有相交对,...
an array and uses it to pathfinding. The second class, Path(path.cs) is used by third class to find the path. The third class Node, is used by Path class. The fourth class, PathFinding...
本文《Finding Structure in Audio for Music Information Retrieval》探讨了音乐信息检索领域中对音频信号进行结构化分析的方法和技术。该文由Bryan Pardo撰写,发表于2006年5月的IEEE Signal Processing Magazine...
本压缩包“Finding_position_in_Java_ME.zip”似乎包含了实现这一功能的相关代码资源。 标题中的"Finding_position_in_Java_ME"暗示了我们关注的重点是如何在Java ME环境中获取和处理地理位置信息。这通常涉及到...
NULL 博文链接:https://irwenqiang.iteye.com/blog/1279041
### 寻找微小人脸:探索规模不变性、图像分辨率与上下文推理 #### 摘要 在物体识别领域已经取得了巨大的进步,但检测小型物体仍然是一个未解决的挑战。本文探讨了寻找小型人脸问题的三个方面:规模不变性的角色、...