`
13146489
  • 浏览: 251304 次
  • 性别: Icon_minigender_1
  • 来自: 成都
社区版块
存档分类
最新评论

redis autocomplete

 
阅读更多
原文地址http://antirez.com/post/autocomplete-with-redis.html
Hello! This isn't just a blog post, it is actually the Redis weekly update number eight, but we can't tell this to Google: naming posts like Redis Weekly update #... will not play well with Google indexing, so I'm starting to use blog post titles more predictive of what people will actually type into a search engine, in this specific case autocomplete redis, probably.

Too bad that I'm forced to play the game of the SE0 L00z3r, but it's for a good reason. I guess this will also work well when articles are submitted in places like Hacker News, Programming Reddit and so forth. Readers will have a simpler way to tell if they want to actually click or not on the link. Ok... end of the off topic.

I hope the topic of this post will be very interesting for many Redis users, and for many non Redis users that experimented the pain of modeling auto complete for web or mobile applications using a database that is not designed to make this a simple task.

Auto complete is a cool and useful feature for users, but it can be non trivial to implement it with good performances when the set of completions is big.
Simple auto complete

In the simplest auto completion schema you have:
A set of N words, that are in some way notable words that you want to auto complete. This can be for instance the names of the biggest cities in the world in a subscription form, or the most frequent search terms in a search service. We'll try to complete a list of female names. You can find the raw file name here: female-names.txt.
A prefix string, that is, what your user is writing in the input field.
Given the prefix, you want to extract a few words that start with the specified prefix.

But in what order? In many applications (like the list of cities in a web form) this order can just be lexicographic, and this is more simple and memory efficient to model in Redis, so we'll start with it, trying to evolve our approach to reach more interesting results.

So for instance if the prefix is mar, what I want to see in the completion list is mara, marabel, marcela, and so forth.

Since we want a scalable schema, we also require that this entries are returned in O(log(N)) time in the average case. A O(N) approach is not viable for us. We'll see the space complexity, that is another very important meter, in a moment.
A little known feature of sorted sets

The order of elements into a sorted set is, of course, given by the score. However this is not the only ordering parameter since sorted set elements may have the same score. Elements with the same score are sorted lexicographically, as you can see in this example:
redis> zadd zset 0 foo
(integer) 1
redis> zadd zset 0 bar
(integer) 1
redis> zadd zset 0 x
(integer) 1
redis> zadd zset 0 z
(integer) 1
redis> zadd zset 0 a
(integer) 1
redis> zrange zset 0 -1
1. "a"
2. "bar"
3. "foo"
4. "x"
5. "z"
This is a very important feature of sorted sets, for instance it guarantees O(log(N)) operations even when large clusters of elements have the same score. And this will be the foundation for our code completion strategy.

Well... that's not enough, but fortunately there is another cool feature of sorted sets, that is the ZRANK command. With this command given an element you can know the position (that is, the index) of an element in the sorted set. So what should I do if I want to know what words follow "bar" in my sorted set? (note that indexes are zero based)
redis> zrank zset bar
(integer) 1
redis> zrange zset 2 -1
1. "foo"
2. "x"
3. "z"
Hey that seems pretty close to what we want in our first completion strategy... but we need a final trick.
Completion in lexicographical ordering

So what we are able to do is looking up a word in the sorted set and obtain a list of words lexicographically following this word. This is a bit different from what we actually need, as we want to complete words just having a prefix of the word (or sentence, and in general any string).

We can have all this paying a price, that is: more space.

What we do is the following: instead of adding only full words to the sorted set, we add also all the possible prefixes of any given word. We also need a way to distinguish actual words from prefixes. So this is what we do:
For every word, like "bar", we add all the prefixes: "b", "ba", "bar".
For every word we finally add the word itself but with a trailing "*" character,
so that we can tell this is an actual word (of course you can use a different marker, even binary to avoid collisions). So we also add "bar*". If I add the word "foo", "bar", and "foobar" using the above rules, this is what I obtain:
redis> zrange zset 0 -1
1. "b"
2. "ba"
3. "bar"
4. "bar*"
5. "f"
6. "fo"
7. "foo"
8. "foo*"
9. "foob"
10. "fooba"
11. "foobar"
12. "foobar*"
Now we are very close to perform the completion!

If the user is writing "fo" we do the following:
redis> zrank zset fo
(integer) 5
redis> zrange zset 6 -1
1. "foo"
2. "foo*"
3. "foob"
4. "fooba"
5. "foobar"
6. "foobar*"
We just need to get all the items marked with the trailing "*', so we'll be able to show "foo" and "foobar" as possible completions.
Let's try it with a bigger dictionary

We'll use a list of 4960 female names, female-names.txt and a Ruby script to check how well our approach is working.

# compl1.rb - Redis autocomplete example
# download female-names.txt from http://antirez.com/misc/female-names.txt

require 'rubygems'
require 'redis'

r = Redis.new

# Create the completion sorted set
if !r.exists(:compl)
    puts "Loading entries in the Redis DB\n"
    File.new('female-names.txt').each_line{|n|
        n.strip!
        (1..(n.length)).each{|l|
            prefix = n[0...l]
            r.zadd(:compl,0,prefix)
        }
        r.zadd(:compl,0,n+"*")
    }
else
    puts "NOT loading entries, there is already a 'compl' key\n"
end

# Complete the string "mar"

def complete(r,prefix,count)
    results = []
    rangelen = 50 # This is not random, try to get replies < MTU size
    start = r.zrank(:compl,prefix)
    return [] if !start
    while results.length != count
        range = r.zrange(:compl,start,start+rangelen-1)
        start += rangelen
        break if !range or range.length == 0
        range.each {|entry|
            minlen = [entry.length,prefix.length].min
            if entry[0...minlen] != prefix[0...minlen]
                count = results.count
                break
            end
            if entry[-1..-1] == "*" and results.length != count
                results << entry[0...-1]
            end
        }
    end
    return results
end

complete(r,"marcell",50).each{|res|
    puts res
}

view rawcompl1.rbThis Gist brought to you by GitHub.


What we do in the script is adding all the words using our set of rules, then using ZRANGE in small ranges of 50 elements to get as much words as we like. This seems all cool, but what is the space and time complexity of our approach?
The big O

Both ZRANK and ZRANGE (with a fixed range of 50 elements) are O(log(N)) operations. So we really don't have problems about handling huge word lists composed of millions of elements.

Let's check the memory requirements: in the worst case, for every word, we need to add M+1 elements in the sorted set, where M is the length of the word. So for N words, we need N*(Ma+1) elements where Ma is the average length of our words. Fortunately in the practice there are a lot of collisions among prefixes, so this is going to be better. For the 4960 female names we needed 14798 elements in the sorted set. Still it's better to do your math using the N*(Ma+1) figure just to be safe. It seems like that the memory requirement is perfectly viable even with millions of words to complete.
Query prediciton

Our current schema completed things in lexicographically order. Sometimes this is what you want, like in words of an online dictionary. Other times what you want to do is, instead, much similar to what Google is doing in its Instant Search. Given a prefix you want to find the first N words having this prefix ordered by frequency, that is, if I'm typing "ne" I want to see things like "netflix", "news", "new york times", that is what people searched most in the latest weeks, out of all the strings starting by "ne".

This can't be done easily with a simple "range query" of the kind we did earlier. A given prefix can have sub prefixes with different "top strings" and we need to be able to update our data at runtime in a non blocking way. Our approach is using a different sorted set for every prefix.

We also need to create a schema where it is simple to build and update our sorted sets incrementally. The direct solution to this problem using Redis is the following:
Every time an user enters a query, like "news", we compute all the prefixes: "n", "ne", "new", "news". We use every prefix as the key name for a sorted set, executing the a ZINCRBY <prefix> 1 news for each key.
In order to complete a given prefix, we just need to perform a ZRANGE <prefix> 0 4, to show the first 5 items.
There is a problem with this approach: most completion systems will only show the top five or so items, but in order to compute what this top items are, we need to take a longer list. In theory we would need to take the whole list of all the words searched for every possible prefix...

Fortunately stream algorithms can help us. We know that statistically even taking just, for instance, 300 items per prefix, we'll be able to have a very good approximation of the top five items, since if a query is frequent enough, it will eventually win over the other queries. So this is what we actually need to do every time we receive a search for e given string. For every prefix of the search string:
If our sorted set for this prefix is still not at the max number of elements (300 in the example), just do ZINCRBY, that is, add the element with score 1.
Instead if the max number of items was already reached, remove the element with lower score, and add this new one with a score that is the lower score remaining, plus 1.


If you feel like this is too complex, you can use a variation of this algorithm that will have more or less the same performances, that is:

Just ZINCRBY the current element, then sample three random elements and remove the one with lower score.


I guarantee you that this works well for completion ;) But actually this algorithms have different guarantees that are related to the distribution of the input. For instance if all the strings have very very close frequencies it will have an hard time to output reliable data, but this is not the case, we know that the search is the kind of problem where a small subset of search strings accounts for a very big percentage of all the searches.

But if you are curious, make sure to take at look at this google search.
Clean up stage

Clearly because of the long tail nature of searches, we'll end with a lot of sorted sets related to prefix of words very rarely used. This sorted sets will have maybe just a few items, all with score 1. There is no interest for us in this data... and it's a good idea to clean up this keys, otherwise our approach will use too much (wasted) memory.

There are two ways to do this. If you are using Redis 2.0 the simplest thing to do is to use an associated key for every sorted set with the time of last update. Then using RANDOMKEY you can get a random element from time to time with a background worker, and check if it's better to delete it since it was updated a few days ago the last time.

If you are using 2.2-alpha (Redis master branch) you have a better option, that is, every time you update a sorted set just set a new EXPIRE. Redis will care to remove expired keys for you. Hint: in Redis 2.2 it is now possible to both write against expiring keys AND to update expire times without troubles.

I'm sure it's possible to do much better than what I described in this comments, but I hope this will get you started in the right direction if you plan to use Redis to implement completion. Please use the comments if you want to share your experience or ideas. Thanks for reading!
分享到:
评论

相关推荐

    基于redis的自动补全autocomplete-redis.zip

    autocomplete-redis 是基于redis的自动补全,他会自动索引你要自动补全的句子,然后根据你的输入返回包含这个输入的句子。这儿有一个完整的演示实例: http://ohbooklist.com/redis/ ,我们索引了3.7万本书的名字。 ...

    autocompletex:lix剂的redis自动完成

    Redis是一种高性能的键值存储系统,而自动完成(autocomplete)通常用于提升用户输入效率,常见于搜索引擎、命令行工具或在线文本输入框,它根据用户输入的部分文字预测并提供可能的完整选项。 描述虽然简短,但...

    autocomplete:这是词输入自动完成依赖redis

    标题中的“autocomplete”指的是词输入自动完成功能,这是一种常见的用户界面特性,常见于搜索引擎、文本编辑器和各种在线表单中。它通过预测并显示用户可能想要输入的词汇,提高了输入效率。在这个场景中,该功能是...

    autoComplete[Mootool 版本]

    - 为了性能考虑,可能需要实现缓存机制,例如使用Redis或Memcached,避免频繁访问数据库。 3. **交互设计**: - 自动补全组件通常会在用户停止输入一段时间(如300毫秒)后发送请求,避免过于频繁的网络交互。 -...

    Redis源码漂流记(二)-搭建Redis调试环境.doc

    你需要安装 C/C++ 扩展以获取语法高亮和错误检查,Code Runner 提供编译和运行环境,C/C++ Snippets 可以加速编码,EPITECH C/C++ Headers 用于添加头部信息,Include Autocomplete 则提供头文件的自动补全功能。...

    开源项目-augurysys-autocomplete.zip

    开源项目-augurysys-autocomplete.zip,github.com/augurysys/autocomplete - a Library for building auto-complete services with Golang and Redis

    海象:用于Redis的轻量级Python实用程序

    walrus不需要redis-py新的库,而是将其子类扩展了流行的redis-py客户端,从而可以将其用作替代产品。 除了redis-py所有功能之外,walrus还增加了对一些较新命令的支持,包括对流和使用者组的全面支持。 海象包括: ...

    oc_autocomplete_redis:使用 redis 在 OpenClinica 中实现 CRF 的自动完成

    为了提升用户输入数据的效率和准确性,"oc_autocomplete_redis"项目利用Redis作为后台缓存,结合NodeJS后端处理和jQuery前端交互,实现了在OpenClinica中CRF字段的自动完成功能。这一技术解决方案主要涉及以下知识点...

    autocomplete

    同时,为了性能考虑,可能还需要实现缓存策略,比如使用Redis或Memcached来存储最近的查询结果。 JavaScript在前端负责用户交互和显示提示。它可以监听输入事件,一旦检测到用户输入变化,就向服务器发送Ajax请求。...

    autocomplete-redis-golang-api:使用Redis和Golang实时按需搜索

    在键入API时进行搜索(Golang + Redis) 特征 指数 不同索引中的单独内容类型(例如:用户名,文章标题...) 可以在同一字符串中索引多个术语 将有效负载/文档附加到索引的字符串 评分机制: 通过索引同一文档多次...

    redis-search4j:自动从code.google.compredis-search4j导出

    3.支持Suggest前缀、拼音查找(AutoComplete 功能) 4.支持单个或多个分词搜索 5.可根据字段进行结果排序 环境 1.jdk 1.6+ 2.redis 2.2+ 依赖包 1.Jedis-2.1.0 2.commons-pool-1.6.jar 3.IKAnalyzer-3.2.8.jar 4....

    Laravel开发-redcard

    Redis 是一个高性能的键值存储系统,常被用于缓存、会话管理以及像自动完成这样的实时数据检索任务。 首先,让我们了解自动完成的基本概念。自动完成,也称为预测输入或智能提示,是一种用户界面技术,它根据用户...

    autocomplete-java:java的自动补全工具

    自动完成-java java的自动补全工具#TODO 我将实现更多的自动完成算法: 三元搜索树----完成基于redis ... #Usage 可以直接使用 KeywordsAutoComplete.java,也可以使用 TernarySearchTree.java 构建自己的服务。 #...

    仿Baidu的搜索下拉提示框源码(ASP.NET)

    为了提供快速的搜索建议,可能使用了缓存技术,如ASP.NET缓存或Redis,将热门搜索词缓存起来,减少数据库查询的次数。 通过这个项目,开发者不仅可以学习到如何在ASP.NET中实现动态Web功能,还能了解AJAX、jQuery...

    lambda-function-autocomplete-handler:使用AWS Lambda响应typeaheadautocomplete操作的示例

    设置 npm install npm install serverless -g...redis db由一个lambda填充,该lambda导入由hybris导出到s3存储桶的json文件。 此无服务器功能是只读的。 如何调试? 如何部署? 最初,您可以生成一个程序包。 从长远

    几种CodeValueWeb输入的解决方案[代码].docx

    例如,使用AutoComplete控件或jQuery插件,用户输入时动态从服务器检索匹配的Code和Name,这样既实现了键盘操作,又减少了服务器负担。 综上所述,处理Code Value Web输入的方法多样,应根据项目需求、性能要求和...

    类似百度的自动匹配下拉控件

    - jQuery的`autocomplete`插件可以轻松实现这一功能,只需配置数据源和一些回调函数。 - 如果使用Vue或React等现代前端框架,可以自定义组件,利用虚拟DOM来提高性能,同时结合`debounce`或`throttle`函数优化输入...

    google.rar_google site

    2. **搜索提示**(Autocomplete):这是一种用户体验优化技术,根据用户输入的前几个字符快速提供可能的搜索建议。实现搜索提示需要实时处理大量的数据,并且需要高效的缓存策略以提供快速响应。开发者需要理解如何...

    java版自动补全

    对于自动补全,可能有一个专门的`AutoCompleteServlet`,它监听特定的URL路径,如`/autocomplete`。 3. **请求处理**:当用户输入一部分文本并发送请求到服务器时,Servlet的`doGet`或`doPost`方法会被调用。在这里...

    Chat:使用 socket.io 聊天应用程序

    还可以优化性能,比如使用 Redis 或其他中间件进行消息队列处理,提高高并发场景下的性能。 总结,Socket.IO 提供了一个简单而强大的方式来构建实时的 Web 应用程序,特别是聊天应用。通过结合 Node.js 服务器和 ...

Global site tag (gtag.js) - Google Analytics