- 浏览: 1312025 次
- 性别:
- 来自: 江苏
最新评论
-
honey_fansy:
的确,不要自己的支持就说完美支持,我的就不行,别说我的不是fi ...
无js实现text-overflow: ellipsis; 完美支持Firefox -
fanchengfei:
事件长微博,欢迎转发:http://weibo.com/332 ...
《在路上 …》 写代码也需要一点演技 – python2.6 的 class decorator -
blued:
没有报错,但排版效果一点都没有 咋回事。请指教
python排版工具 -
szxiaoli:
耍人呀,效果在哪儿呀
滑动效果 -
accaolei:
这个能监到控子目录吗?,我测试了一下,发现子目录里的文件监控不 ...
windows监控目录改动
Python语言: 临时自用代码@代码发芽网
#coding:utf-8
# Bloom filters in Python
# Adam Langley <agl@imperialviolet.org>
# 给CountedBloom加了一个max_count 张沈鹏 <zsp007@gmail.com>
# Bloom-Filter算法简介
# http://www.googlechinablog.com/2007/07/bloom-filter.html
# http://zh.wikipedia.org/wiki/%E5%B8%83%E9%9A%86%E8%BF%87%E6%BB%A4%E5%99%A8
# 这个计算器可以帮你求最佳的参数
# http://www.cc.gatech.edu/~manolios/bloom-filters/calculator.html
# CountedBloom 的 buckets 参数对应于计算器的m,也就是"m denotes the number of bits in the Bloom filter"
import array
import struct
mixarray = array.array ('B', '\x00' * 256)
# The mixarray is based on RC4 and is used as diffusion in the hashing function
def mixarray_init (mixarray):
for i in range (256):
mixarray[i] = i
k = 7
for j in range (4):
for i in range (256):
s = mixarray[i]
k = (k + s) % 256
mixarray[i] = mixarray[k]
mixarray[k] = s
mixarray_init (mixarray)
class Bloom (object):
'''Bloom filters provide a fast and compact way of checking set membership. They do this by introducing a risk of a
false positive (but there are no false negatives).
For more information see http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html'''
def __init__ (self, bytes, hashes, data = None):
'''@bytes is the size of the bloom filter in 8-bit bytes and @hashes is the number of hash functions to use. Consult the
web page linked above for values to use. If in doubt, bytes = num_elements and hashes = 4'''
self.hashes = hashes
self.bytes = bytes
if data == None:
self.a = self._make_array (bytes)
else:
assert len (data) == bytes
self.a = data
def init_from_counted (self, cnt):
'''Set the contents of this filter from the contents of the counted filter @cnt. You have to match sizes'''
if self.bytes * 8 != (len (cnt.a) * 2):
raise ValueError ('Filters are not the same size')
for i in xrange (len (cnt.a)):
b = cnt.a[i]
b1 = (b & 0xf0) >> 4
b2 = (b & 0x0f)
if b1:
self.a[(i * 2) // 8] |= self.bitmask[(i * 2) % 8]
if b2:
self.a[(i * 2 + 1) // 8] |= self.bitmask[(i * 2 + 1) % 8]
def _make_array (self, size):
a = array.array ('B')
# stupidly, there's no good way that I can see of resizing an array without allocing a huge string to do so
# thus I use this, slightly odd, method:
blocklen = 256
arrayblock = array.array ('B', '\x00' * blocklen)
todo = size
while (todo >= blocklen):
a.extend (arrayblock)
todo -= blocklen
if todo:
a.extend (array.array ('B', '\x00' * todo))
# now a is of the right length
return a
def _hashfunc (self, n, val):
'''Apply the nth hash function'''
global mixarray
b = [ord(x) for x in struct.pack ('I', val)]
c = array.array ('B', [0, 0, 0, 0])
for i in range (4):
c[i] = mixarray[(b[i] + n) % 256]
return struct.unpack ('I', c.tostring())[0]
bitmask = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]
def insert (self, val):
for i in range (self.hashes):
n = self._hashfunc (i, val) % (self.bytes *
self.a[n // 8] |= self.bitmask[n % 8]
def __contains__ (self, val):
for i in range (self.hashes):
n = self._hashfunc (i, val) % (self.bytes *
if not self.a[n // 8] & self.bitmask[n % 8]:
return 0
return 1
MAX_COUNT = 15
class CountedBloom (Bloom):
'''Just like a Bloom filter, but provides counting (e.g. you can delete as well). This uses 4 bits per bucket, so is
generally four times larger than the same non-counted bloom filter.'''
def __init__ (self, buckets, hashes):
'''Please note that @buckets must be even. Also note that with a Bloom object you give the number of *bytes* and each byte is 8 buckets. Here you're giving the number of buckets.'''
assert buckets % 2 == 0
self.hashes = hashes
self.buckets = buckets
self.a = self._make_array (buckets // 2)
def insert (self, val):
masks = [(0x0f, 0xf0), (0xf0, 0x0f)]
shifts = [4, 0 ]
for i in range (self.hashes):
n = self._hashfunc (i, val) % self.buckets
byte = n // 2
bucket = n % 2
(notmask, mask) = masks[bucket]
shift = shifts[bucket]
bval = ((self.a[byte] & mask) >> shift)
if bval < MAX_COUNT: # we shouldn't increment it if it's at the maximum
bval += 1
self.a[byte] = (self.a[byte] & notmask) | (bval << shift)
def __contains__ (self, val):
masks = [(0x0f, 0xf0), (0xf0, 0x0f)]
shifts = [4, 0]
for i in range (self.hashes):
n = self._hashfunc (i, val) % self.buckets
byte = n // 2
bucket = n % 2
(notmask, mask) = masks[bucket]
shift = shifts[bucket]
bval = ((self.a[byte] & mask) >> shift)
if bval == 0:
return 0
return 1
def max_count(self, val):
masks = [(0x0f, 0xf0), (0xf0, 0x0f)]
shifts = [4, 0]
count_val = MAX_COUNT
for i in range (self.hashes):
n = self._hashfunc (i, val) % self.buckets
byte = n // 2
bucket = n % 2
(notmask, mask) = masks[bucket]
shift = shifts[bucket]
bval = ((self.a[byte] & mask) >> shift)
if bval < MAX_COUNT:
if bval == 0:
return 0
else:
count_val = bval
return count_val
def __delitem__ (self, val):
masks = [(0x0f, 0xf0), (0xf0, 0x0f)]
shifts = [4, 0]
for i in range (self.hashes):
n = self._hashfunc (i, val) % self.buckets
byte = n // 2
bucket = n % 2
(notmask, mask) = masks[bucket]
shift = shifts[bucket]
bval = ((self.a[byte] & mask) >> shift)
if bval < MAX_COUNT: # we shouldn't decrement it if it's at the maximum
bval -= 1
self.a[byte] = (self.a[byte] & notmask) | (bval << shift)
__all__ = ['Bloom']
if __name__ == '__main__':
print 'Testing bloom filter: there should be no assertion failures'
a = Bloom (3, 4)
a.insert (45)
print a.a
a.insert (17)
print a.a
a.insert (12)
print a.a
assert 45 in a
assert 45 in a
assert not 33 in a
assert 45 in a
assert 17 in a
assert 12 in a
c = 0
for x in range (255):
if x in a:
c += 1
print c
print float(c)/255
a = CountedBloom (24, 4)
a.insert (45)
print a.a
a.insert (17)
print a.a
a.insert (12)
a.insert (12)
print "a.max_count(12)", a.max_count(12)
a.insert ("张沈鹏")
a.insert ("张沈鹏")
a.insert ("张沈鹏")
print "a.max_count(zsp)", a.max_count(12)
print a.a
assert 45 in a
assert 45 in a
assert not 33 in a
assert 45 in a
assert 17 in a
assert 12 in a
c = 0
for x in range (255):
if x in a:
c += 1
print c
print float(c)/255
del a[45]
assert not 45 in a
a2 = Bloom (3, 4)
a2.init_from_counted (a)
print a2.a
assert 17 in a2
assert 12 in a2
assert not 45 in a
- pybloom.zip (2.4 KB)
- 下载次数: 32
发表评论
-
关于"Google限制Python"事件我的看法
2009-11-17 15:11 8426本来做一个勤勤恳恳的 ... -
python排版工具
2009-10-15 14:22 3524http://pypi.python.org/pypi/pyt ... -
Fast Asynchronous Python Web Server (Fapws is short)
2009-08-15 12:12 1873http://github.com/william-os4y/ ... -
python奇技淫巧
2009-07-23 22:27 2527http://wiki.python.org/moin/By ... -
跨平台 获取系统信息的python库 http://support.hyperic.com/disp
2009-06-12 11:49 3663The Sigar API provides a portab ... -
频繁集统计 python 封装
2009-05-29 15:49 2676封装的是附件这篇paper的count 因为对比发现这个的综合 ... -
libsvm (python封装) 学习笔记 1
2009-05-19 14:28 42532009-05-19 14:10:38 #!/usr/bin ... -
lcs.py 最长公共子串算法
2009-05-05 15:50 2994感觉用来匹配相似文件比最短编辑距离更靠谱,最短编辑应该是用来纠 ... -
lrucache.py 最近最少使用算法
2009-05-04 13:23 2928lrucache.py 最近最少使用算法 2009-05-04 ... -
史上最快 异步消息队列zeromq 简介
2009-04-30 21:40 27301是的,我喜欢Z开头的东西. http://www.zer ... -
相似单词
2009-03-18 00:54 1786给你一个单词a,如果通过交换单词中字母的顺序可以得到另外的单词 ... -
is_cn_char
2009-03-14 13:39 1360unicode码 def is_cn_char(i): ... -
写一个python的urldecode
2009-03-03 10:57 5138from urllib import unquote def ... -
今天才发现python的sort有个key参数,我好圡...
2009-02-28 20:59 3141>>> a=range(10) >& ... -
发一个山寨版python的Orm
2009-02-24 23:49 2256发一个山寨版的Orm 大概用法见 http://docs. ... -
pyrex学习笔记
2009-02-24 03:36 17120. easy_install pyrex 1.写pyrex ... -
python的一个有趣的细节
2009-02-24 02:00 1385python3.0一个有趣的细节 2009-02-24 01: ... -
python备玩候选者
2009-02-24 00:34 1716* 张沈鹏 推荐网址当然要有一个部署的东西 Exs ... -
python读取mp3 ID3信息
2009-02-18 16:57 2665pyid3不好用,常常有不认识的. mutagen不错,不过 ... -
又写了一个python的route模块
2009-01-14 01:18 2115是的 我很无聊
相关推荐
4. **False Negatives的避免**:在LDFD-BloomFilter的实现中,可能会使用多个独立的Bloom Filters,每个代表不同时间窗口。这样,即使在某个窗口内出现假阳性,通过多窗口的交叉验证,也能降低漏检LDFs的概率。 5. ...
脚本: 依存关系: tensorflow1 (仅测试1.14.0) tensorflow_hub 麻木tqdm 杜松子酒#pip install gin-config numpy_ringbuffer#点安装numpy_ringbuffer peloton_bloomfilters从大集团(在我们
1. **快速**:LevelDB利用了一系列优化技术,如Bloom filters和数据分层存储,实现了高效的读写性能。它将数据分层存储在磁盘上,通过最小化磁盘I/O来提高速度。 2. **内存管理**:LevelDB将数据缓存在内存中,当...
Redis还支持一些其他特性,如布隆过滤器(Bloom Filters)、HyperLogLog(用于统计唯一元素数量)和地理空间索引(Geospatial Indexing),这些在大型分布式系统中尤其有用。此外,Redis Cluster是其内置的分布式...
- **ZOrdering和Bloom Filters**: 提高查询性能和数据过滤效率。 - **Metadata Versioning**: 支持元数据版本控制,确保向后兼容性。 - **Column Statistics**: 统计信息帮助优化查询计划。 要使用这个ORC库,...
2. **Bloom Filters**:LevelDB使用Bloom Filter来减少不必要的磁盘I/O操作,通过概率性地判断一个元素是否存在于集合中,以节省空间和提高性能。 3. **压缩技术**:为了优化存储效率,LevelDB内置了数据压缩机制,...
对于结构化数据,可以使用更高级的压缩算法,如Bloom Filters或Count-Min Sketch,它们在空间效率和查询速度上有所优化。 7. **脚本精简**: 在存储库中,"精简脚本"可能指的是优化和简化代码,以提高运行效率和可...
9. **布隆过滤器(Bloom Filters)** Redis通过`BF.*`系列命令提供了布隆过滤器功能,用于可能存在的元素检测,降低空间占用,但有一定的误判率。 10. **HyperLogLog** HyperLogLog是Redis中的一种稀疏数据结构,...
7. **布隆过滤器(Bloom Filters)**:如果涉及此功能,会测试其在空间效率和误判率方面的表现。 8. **HyperLogLog**:测试这个稀疏数据结构在估算基数时的准确性。 9. **限流(Rate Limiting)**:检查 Redis ...