A discussion on searching a dictionary of several hundred thousand words: opinions and suggestions welcome!

tomyth

Tags: Android, searching several hundred thousand words, trie, database

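Most of the work on my YourDict dictionary app is now done. English lookups take less than 1 s, which is acceptable, but Chinese dictionaries have become a problem: a Chinese dictionary easily runs to more than 600,000 entries, and lookup time in my app has shot up to over 10 seconds. This has been bothering me a lot. While thinking about how to solve it, I realized that this is a classic computer science problem, except that this time it is on Android, where memory is extremely scarce, so approaches such as building a tree have to be handled with great care. I hope you can suggest a practical solution!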

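Below is the question I posted to a Google group. So far there are basically two options: a database or a trie. At first I thought Kris's database suggestion was a good one, but Christopher's point about databases performing poorly for real-time lookups is also giving me a headache. Trying both would be too much work. Does anyone have a good approach? If you have actually used a trie, please tell me how it performs and what to watch out for when implementing it. Let's all learn from this together!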

felix

Dec 20, 1:26 PM
Hi!
I'm working on a dictionary app on Android.
I need to search a list of words (about 500-600 thousand words) in a file
to find a given word.
It takes about 10-20 seconds to look up a word. How can I improve
the search speed?
Thanks to all!

Kristopher Micinski
Dec 20, 1:49 PM
Ah, what a classic question in computer science :-)
To really get the answer to this question, you're going to have to
learn a little bit about data structures. Wait... How is it taking
you *20 seconds* to find the word!? That's absurd! Really? You're
doing string comparisons over 500 strings and it's taking you 20
seconds!?

Anyway, there are two solutions: you might just try using a database
(not a bad idea, actually), or you might use a hash table (look up
"HashTable"). If you want to check for bogus words before searching
(okay, so this is a bit of a stretch and probably not useful, but I
think it deserves a mention), you can look at using a Bloom filter...
Obviously there are tons of other data structures you can use too.

Kris

P.S. Did I mention that you should probably be using a database? For
Android, it's probably going to be the best acceptable solution that is
fairly extensible. I'm sure somebody might bring up the possible
badness of having it out on the SD card somewhere, but even this isn't
so bad, especially compared to 20 seconds!
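
A minimal sketch of the hash-table option mentioned in this post (not code from the thread), assuming only exact-match lookup is needed and the whole word list fits in memory. The class name `WordSet` is hypothetical; `java.util.HashSet` plays the role of the hash table.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not from the thread): exact-match lookup via a hash table.
class WordSet {
    private final Set<String> words;

    WordSet(List<String> wordList) {
        // Built once up front; each later lookup is O(1) on average.
        this.words = new HashSet<>(wordList);
    }

    // True only if the exact word was in the list passed to the constructor.
    boolean contains(String word) {
        return words.contains(word);
    }
}
```

Built from the words "zebra", "zucchini" and "zulu", `contains("zulu")` is true and `contains("zuch")` is false. The trade-off is heap: every word stays resident as a `String`, which may be a real cost for several hundred thousand entries on a memory-constrained device.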


Kristopher Micinski
Dec 20, 1:49 PM
OH! Very sorry! I didn't see the "thousand" after the 500!!!
Kris

On Tue, Dec 20, 2011 at 12:49 AM, Kristopher Micinski


Jim Graham
Dec 20, 2:27 PM

On Mon, Dec 19, 2011 at 09:26:11PM -0800, felix wrote:
> Hi!
> I'm working on a dictionary app on Android.
> I need to search a list of words (about 500-600 thousand words) in a file
> to find a given word.
> It takes about 10-20 seconds to look up a word. How can I improve
> the search speed?

Well, along with Kris's solutions, here's another (that you could use
with his, or on its own if it's enough):
Use whatever works best for you (regexp or simply grabbing the first
char directly from the string) and get the first character (or first
two, or ... and so on) and split your data accordingly. That way,
instead of searching through the WHOLE LIST for zulu, you'd only search
words starting with 'z' (or "zu", etc.). It would no doubt work better
combined with Kris's ideas.
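
The splitting idea in this post can be sketched roughly as follows (an illustration, not code from the thread), assuming the data is split on the first character only and each bucket is still scanned linearly. The class `FirstLetterIndex` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits the word list into buckets keyed by first character.
class FirstLetterIndex {
    private final Map<Character, List<String>> buckets = new HashMap<>();

    FirstLetterIndex(List<String> wordList) {
        for (String word : wordList) {
            if (word.isEmpty()) continue; // nothing to key on
            buckets.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
        }
    }

    // Linear scan, but only over the words that share the query's first character.
    boolean contains(String word) {
        if (word.isEmpty()) return false;
        List<String> bucket = buckets.get(word.charAt(0));
        return bucket != null && bucket.contains(word);
    }

    // Number of words a lookup for this query would have to scan.
    int candidates(String word) {
        if (word.isEmpty()) return 0;
        List<String> bucket = buckets.get(word.charAt(0));
        return bucket == null ? 0 : bucket.size();
    }
}
```

With the words "apple", "zebra" and "zulu", a lookup for "zulu" scans 2 candidates instead of 3. Keying on the first two characters shrinks the buckets further, and repeating that at every character position leads to the trie brought up later in the thread.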

Later,
   --jim

--
THE SCORE: ME: 2 CANCER: 0
73 DE N5IAL (/4)        MiSTie #49997 < Running FreeBSD 7.0 >
spooky1...@gmail.com ICBM/Hurricane: 30.44406N 86.59909W

      "'Wrong' is one of those concepts that depends on witnesses."
     --Catbert: Evil Director of Human Resources (Dilbert, 05Nov09)

Android Apps Listing at http://www.jstrack.org/barcodes.html


Kristopher Micinski
Dec 20, 2:30 PM

Jim, this is a good solution, but I would argue that if he does indeed
"read about data structures" (which I nebulously proposed he do), he
might stumble upon a trie:
http://en.wikipedia.org/wiki/Trie

Which is basically what you propose.

(I'm not trying to be condescending here, I'm really trying to point
the OP to another data structure he could consider.)

kris
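
For readers who have not met a trie, here is a minimal sketch (not code from the thread), assuming one `HashMap` of children per node. The class `WordTrie` and its method names are hypothetical, and a production version for several hundred thousand words would likely need a more compact node layout.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie: one node per character along each inserted word.
class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if an inserted word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        node.isWord = true;
    }

    // Exact match: the path must exist and end on a word marker.
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    // Prefix match: the path only has to exist.
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Walks the trie along s; returns null as soon as a character is missing.
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

After inserting "zulu" and "zucchini", `contains("zulu")` is true, `contains("zu")` is false, and `hasPrefix("zu")` is true. Lookup cost depends on the length of the query, not on the number of stored words; the price is one node object (and here one map) per stored character position.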


Jim Graham
Dec 20, 2:35 PM

On Tue, Dec 20, 2011 at 01:30:10AM -0500, Kristopher Micinski wrote:
> On Tue, Dec 20, 2011 at 1:27 AM, Jim Graham <spooky1...@gmail.com> wrote:
> > On Mon, Dec 19, 2011 at 09:26:11PM -0800, felix wrote:
> http://en.wikipedia.org/wiki/Trie

Wow...I never knew that had a name. :-)
Later,
   --jim


Kristopher Micinski
Dec 20, 2:36 PM

On Tue, Dec 20, 2011 at 1:35 AM, Jim Graham <spooky1...@gmail.com> wrote:
> On Tue, Dec 20, 2011 at 01:30:10AM -0500, Kristopher Micinski wrote:
>> On Tue, Dec 20, 2011 at 1:27 AM, Jim Graham <spooky1...@gmail.com> wrote:
>> > On Mon, Dec 19, 2011 at 09:26:11PM -0800, felix wrote:
>> http://en.wikipedia.org/wiki/Trie

> Wow...I never knew that had a name. :-)

> Later,
>   --jim

"A common application of a trie is storing a dictionary, such as one
found on a mobile telephone. "
:-)

Kris

P.s., (promise I didn't write that.)


felix
Dec 20, 2:46 PM
Thanks a lot! I think I'll give the database a try! :)
On Dec 20, 1:49 PM, Kristopher Micinski <krismicin...@gmail.com>
wrote:


felix
Dec 20, 2:47 PM
I've considered a trie, but it consumes a lot of memory to construct...
On Dec 20, 2:35 PM, Jim Graham <spooky1...@gmail.com> wrote:


Kristopher Micinski
Dec 20, 2:51 PM
But you only have to construct it once.
Many data structures with good lookup performance take time to set up.

Kris

P.s., However, databases are highly evolved, and do all of this very
efficiently, so the whole argument is somewhat silly, as if you just
use one you'll be fine.

2011/12/20 felix <guofuchu...@gmail.com>:


Christopher Van Kirk
Dec 20, 2:54 PM
A conventional database isn't going to do better than a Trie, I think.
On 12/20/2011 2:46 PM, felix wrote:


Kristopher Micinski
Dec 20, 3:08 PM
Right,
But it does have the advantage that the technology on Android is
already there, so he doesn't have to write the implementation himself,
or grab one and learn to use it off the web.

kris

2011/12/20 Christopher Van Kirk <christopher.vank...@gmail.com>:


Christopher Van Kirk
Dec 20, 3:11 PM
Then again a Trie isn't really that hard to write.
On 12/20/2011 3:08 PM, Kristopher Micinski wrote:


Kristopher Micinski
Dec 20, 3:16 PM
Right,
but getting the huge thing into the right format, storing that
statically, etc., vs. preloading the app with a database: which sounds
easier? I just think the database sounds like the better way to go on
this one, and I'm biased toward not reinventing the wheel, but the OP is
obviously free to use whatever.

kris

On Tue, Dec 20, 2011 at 2:11 AM, Christopher Van Kirk


martypantsROK
Dec 20, 5:10 PM
Don't forget there is more than data structures involved here.
The search method could be improved too. As Jim suggested, breaking
things down with an index (search for zulu beginning in the z section)
helps, and it could be sped up even more: compare the last letter of the
string first. By checking that 4th character "u" first you've
eliminated 3 other character comparisons and can skip on to the next word.
That way, similar words like zuch or zucchini won't slow you down by
matching the first two characters. This works even better for longer words.
Marty
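
One way to read this suggestion, sketched under assumptions (not code from the thread): compare each candidate with the query starting at the query's last character. The length check is an addition of this sketch, and the class `LastCharFirst` and its methods are hypothetical.

```java
// Hypothetical helper: string comparison that starts at the last character.
class LastCharFirst {
    // True if candidate equals query; characters are compared from the end backward,
    // so words sharing a prefix with the query are rejected on the first comparison.
    static boolean matches(String candidate, String query) {
        if (candidate.length() != query.length()) return false; // assumption: cheap pre-check
        for (int i = query.length() - 1; i >= 0; i--) {
            if (candidate.charAt(i) != query.charAt(i)) return false;
        }
        return true;
    }

    // Linear scan over the bucket; returns the index of the match or -1.
    static int indexOf(String[] words, String query) {
        for (int i = 0; i < words.length; i++) {
            if (matches(words[i], query)) return i;
        }
        return -1;
    }
}
```

Searching the bucket {"zuch", "zucchini", "zulu"} for "zulu" returns index 2. Note that this still visits every candidate in the bucket, so it trims the cost per comparison rather than the number of candidates examined.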

On Dec 20, 4:16 pm, Kristopher Micinski <krismicin...@gmail.com>
wrote:


Solution 9420
Dec 20, 5:40 PM
Hi,
I'm the author of 9420 Thai Keyboard, which incorporates an English word
suggestion feature as well.
I have around 200K words and can look one up in 80 ms on average.
Assuming the word DB is static, I've done the following:
1. Pre-sort the words in a file.
2. Pre-index the words.
3. Use a binary search tree algorithm.

You'll have to be a bit careful about the size of the index file, and
optimize memory usage heavily to avoid delays from Java garbage
collection as well.
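
The poster does not describe the file or index layout, so the following is only a sketch of the general idea, assuming that the approach amounts to binary search over pre-sorted words and that the words are held in an in-memory array. The class `SortedWordList` and the helper `maxProbes` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: binary search over a sorted word array.
class SortedWordList {
    private final String[] sorted;

    SortedWordList(String[] words) {
        sorted = words.clone();
        Arrays.sort(sorted); // with a static word list this step can be done ahead of time
    }

    boolean contains(String word) {
        // Arrays.binarySearch returns a non-negative index only when the key is found.
        return Arrays.binarySearch(sorted, word) >= 0;
    }

    // Worst-case number of probes for n sorted entries: floor(log2 n) + 1 (0 for n = 0).
    static int maxProbes(int n) {
        return 32 - Integer.numberOfLeadingZeros(n);
    }
}
```

For 600,000 entries `maxProbes` gives 20, and for 200,000 it gives 18, which is why a sorted list answers a lookup in a handful of comparisons where a linear scan may need hundreds of thousands.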

Cheers,
Solution 9420...

www.solution9420.com

On Dec 20, 12:26 am, felix <guofuchu...@gmail.com> wrote:


Kristopher Micinski
Dec 20, 5:43 PM

On Tue, Dec 20, 2011 at 4:10 AM, martypantsROK <martyg...@gmail.com> wrote:
> Don't forget there is more than data structures involved here.
> The search method could be improved too. As Jim suggested, breaking
> things down with an index (search for zulu beginning in the z section)
> helps, and it could be sped up even more: compare the last letter of the
> string first. By checking that 4th character "u" first you've
> eliminated 3 other character comparisons and can skip on to the next word.
> That way, similar words like zuch or zucchini won't slow you down by
> matching the first two characters. This works even better for longer words.
> Marty

I guess my point in all of this is that this searching is highly tied
to your data structure. Good algorithms only work with good data
structures to back them, and there are many indexing and optimization
techniques you can use to get more efficiency. My point is that,
since you can argue all day over these things, getting more and more
complicated data structures and searching algorithms (each becoming
more and more context dependent), most of the time for this
application using a database will suffice. If you use a database,
whose indexing method is already going to be pretty good, and find it
doesn't suit your needs, *then* you can switch over to using something
fancier, though I highly doubt you'd need anything much fancier than a
trie in this case.

SQLite uses B+ trees for tables. While this isn't *amazing*
(especially compared to what you'll see with a trie), it's still going
to be massively better (where massively = logarithmic) than just
linear search. Along with this, it looks like "Solution 9420" shared
his advice... And don't forget about the Bloom filter (this won't
actually help you that much unless you're doing a bunch of queries in
a row, most of which might not be in the database, but I wanted to
bring it up again anyway).
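
A rough sketch of the Bloom filter idea mentioned here (not code from the thread): a bit array that can say "definitely not present" without touching the real word store. The class `WordBloomFilter`, the double-hashing scheme, and the FNV-1a-style second hash are all choices of this sketch.

```java
import java.util.BitSet;

// Hypothetical Bloom filter: no false negatives, occasional false positives.
class WordBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per word

    WordBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) throw new IllegalArgumentException("size and hashCount must be positive");
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String word) {
        int h1 = word.hashCode();
        int h2 = secondHash(word);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // false means the word was never added; true means it may have been added.
    boolean mightContain(String word) {
        int h1 = word.hashCode();
        int h2 = secondHash(word);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(h1, h2, i))) return false;
        }
        return true;
    }

    // Double hashing: the i-th position is derived from two base hashes.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, size);
    }

    // FNV-1a-style hash over the UTF-16 code units of the word.
    private static int secondHash(String word) {
        int h = 0x811C9DC5;
        for (int i = 0; i < word.length(); i++) {
            h ^= word.charAt(i);
            h *= 0x01000193;
        }
        return h | 1; // force odd so the stride is never zero
    }
}
```

A word that was added always reports `true`; a `false` answer lets the app skip the expensive lookup entirely. As the post says, this only pays off when many queries are for words that are not in the dictionary.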

kris


Christopher Van Kirk
Dec 20, 5:59 PM
Three points.
1) Building the searching functionality twice is far more expensive than
building it once, no matter what approach you use. Be sure that the
performance of the DB approach is acceptable before you go and build it
that way.

2) It can be quite challenging to get decent performance out of a
database for something like this, depending on the functionality
required. If, for example, you need real-time narrowing down of words, a
database is going to be very slow (e.g. as you type letters, you get an
alphabetized list of what's in the db).

3) There's probably an open source Trie out there somewhere that you can
just use.

Directed at the OP, of course.

Cheers...
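
To make the "real-time narrowing" requirement in point 2 concrete, here is a minimal sketch (not code from the thread) of as-you-type suggestions over a sorted in-memory set. It assumes the words fit in memory; the class `PrefixSuggester` and its methods are hypothetical, and `java.util.TreeSet` stands in for whatever sorted structure is actually used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: alphabetized suggestions for a typed prefix.
class PrefixSuggester {
    private final TreeSet<String> words = new TreeSet<>();

    void add(String word) {
        words.add(word);
    }

    // Returns up to `limit` words starting with `prefix`, in sorted order.
    List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        // All words with this prefix sit contiguously, starting at the first word >= prefix.
        for (String w : words.tailSet(prefix, true)) {
            if (out.size() >= limit || !w.startsWith(prefix)) break;
            out.add(w);
        }
        return out;
    }
}
```

With "apple", "zebra", "zoo", "zucchini" and "zulu" stored, `suggest("zu", 10)` returns [zucchini, zulu] and `suggest("z", 2)` returns [zebra, zoo]. Each keystroke costs one logarithmic seek plus a short scan, which is the same access pattern a trie or any sorted index would have to serve.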

On 12/20/2011 5:43 PM, Kristopher Micinski wrote:


Kristopher Micinski
Dec 20, 6:04 PM
On Tue, Dec 20, 2011 at 4:59 AM, Christopher Van Kirk
<christopher.vank...@gmail.com> wrote:
> Three points.
> 1) Building the searching functionality twice is far more expensive than
> building it once, no matter what approach you use. Be sure that the
> performance of the DB approach is acceptable before you go and build it that
> way.

Okay.

> 2) It can be quite challenging to get decent performance out of a database
> for something like this, depending on the functionality required. If, for
> example, you need real-time narrowing down of words, a database is going to
> be very slow (e.g. as you type letters, you get an alphabetized list of
> what's in the db).

True..

> 3) There's probably an open source Trie out there somewhere that you can
> just use.

Right, which is what I suggested in the first place if he goes this direction..
http://wikipedia-clustering.speedblue.org/trieJava.php

kris


2011-12-21 11:28
Category: Mobile development
