`
tomyth
  • 浏览: 844 次
  • 性别: Icon_minigender_1
最近访客 更多访客>>
文章分类
社区版块
存档分类
最新评论

关于几十万词汇量词典检索的讨论,希望大家见仁见智,帮我提出些意见吧!

阅读更多
目前YourDict词典程序大部分工作已经完成,英文检索速度小于1s,还算可以能接受,但汉语词典检索成了一个问题,汉语词典动辄60几万词的词量,让我程序检索时间骤然上升到了10几秒。。。让我很苦恼,在思考解决方法的过程中发现这也算是计算机科学的一个经典问题了,只不过这回是在Android,这样一个内存环境极度匮乏的情况下,建树等方式需要极端小心。。。希望大家能给我提出一个切实可行的方案!

如下是我在google群组里提出的问题,现在基本是两个方案database,和trie。。。一开始我觉得kris的数据库建议不错,但Christopher提出的关于数据库实时效率差的问题也让我有些头疼。。。如果都试的话工作量太大。。。大家有什么好的方案。。。希望实践过trie树的童鞋能告诉我一下他的效率和编程时的注意事项,也希望大家共同进步!








felix

查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午1时26分
Hi!
I'm working on a dict app on android,
I need to search a list of words(about 500-600 thousand words) in file
to find the word.
It took me about 10-20 seconds to search the word. How can I improve
the search speed?
Thanks to all!

    回复     回复作者      转发 








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午1时49分
Ah, what a classic question in computer science :-)
To really get the answer to this question, you're going to have to
learn a little bit about data structures.  Wait... How is it taking
you *20 seconds* to find the word!?  That's absurd!  Really?  You're
doing string comparisons over 500 strings and it's taking you 20
seconds!?

Anyway, there are two solutions, you might just try using a database,
(not a bad idea, actually), or you might use a hash table (lookup
"HashTable"), if you want to check for bogus words before searching
(okay so this is a bit of a stretch and probably not useful but I
think it deserves a mention) you can look at using a bloom filter...
Obviously there are tons of other data structures you can use too.

Kris

P.s., (did I mention that you should probably be using a database, as,
for Android, it's probably going the best acceptable solution that is
fairly extensible.  I'm sure somebody might bring up the possible
badness of having it out on the SD card somewhere, but even this isn't
so bad, especially compared to 20 seconds!)



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午1时49分
OH!  Very sorry!  I didn't see the 500, thousand!!!
Kris

On Tue, Dec 20, 2011 at 12:49 AM, Kristopher Micinski



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Jim Graham 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时27分



On Mon, Dec 19, 2011 at 09:26:11PM -0800, felix wrote:
> Hi!
> I'm working on a dict app on android,
> I need to search a list of words(about 500-600 thousand words) in file
> to find the word.
> It took me about 10-20 seconds to search the word. How can I improve
> the search speed?

Well, along with Kris's solutions, here's another (that you could use
with his, or on its own if it's enough):
Use whatever works best for you (regexp or simply grabbing the first
char directly from the string) and get the first character (or first
two, or ... and so on) and split your data accordingly.  That way,
instead of searching through the WHOLE LIST for zulu, you'd only search
words starting with 'z' (or "zu", etc.).  It would no doubt work better
combined with Kris's ideas.

Later,
   --jim

--
THE SCORE:  ME:  2  CANCER:  0
73 DE N5IAL (/4)        MiSTie #49997  < Running FreeBSD 7.0 >
spooky1...@gmail.com ICBM/Hurricane: 30.44406N 86.59909W

      "'Wrong' is one of those concepts that depends on witnesses."
     --Catbert:  Evil Director of Human Resources (Dilbert, 05Nov09)

Android Apps Listing at http://www.jstrack.org/barcodes.html

    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时30分





- 显示引用的文字 -



Jim, this is a good solution, but I would argue that if he does indeed
"read about data structures" (which I nebulously proposed he do), he
might stumble upon a trie:
http://en.wikipedia.org/wiki/Trie

Which is basically what you propose.

(I'm not trying to be condescending here, I'm really trying to point
the OP to another data structure he could consider.)

kris

    回复     回复作者      转发       

举报垃圾内容








Jim Graham 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时35分



On Tue, Dec 20, 2011 at 01:30:10AM -0500, Kristopher Micinski wrote:
> On Tue, Dec 20, 2011 at 1:27 AM, Jim Graham <spooky1...@gmail.com> wrote:
> > On Mon, Dec 19, 2011 at 09:26:11PM -0800, felix wrote:
> http://en.wikipedia.org/wiki/Trie

Wow...I never knew that had a name.  :-)
Later,
   --jim

--
THE SCORE:  ME:  2  CANCER:  0
73 DE N5IAL (/4)        MiSTie #49997  < Running FreeBSD 7.0 >
spooky1...@gmail.com ICBM/Hurricane: 30.44406N 86.59909W

      "'Wrong' is one of those concepts that depends on witnesses."
     --Catbert:  Evil Director of Human Resources (Dilbert, 05Nov09)

Android Apps Listing at http://www.jstrack.org/barcodes.html

    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时36分



On Tue, Dec 20, 2011 at 1:35 AM, Jim Graham <spooky1...@gmail.com> wrote:
> On Tue, Dec 20, 2011 at 01:30:10AM -0500, Kristopher Micinski wrote:
>> On Tue, Dec 20, 2011 at 1:27 AM, Jim Graham <spooky1...@gmail.com> wrote:
>> > On Mon, Dec 19, 2011 at 09:26:11PM -0800, felix wrote:
>> http://en.wikipedia.org/wiki/Trie

> Wow...I never knew that had a name.  :-)

> Later,
>   --jim

"A common application of a trie is storing a dictionary, such as one
found on a mobile telephone. "
:-)

Kris

P.s., (promise I didn't write that.)

    回复     回复作者      转发       

举报垃圾内容








felix 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时46分
Thanks a lot! I think I'll give database a try!:)
On 12月20日, 下午1时49分, Kristopher Micinski <krismicin...@gmail.com>
wrote:



- 显示引用的文字 -



    回复     回复作者      转发 








felix 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时47分
I've considered trie. But it consumes a lot of memory to construct...
On 12月20日, 下午2时35分, Jim Graham <spooky1...@gmail.com> wrote:



- 显示引用的文字 -



    回复     回复作者      转发 








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时51分
But you only have to construct it once.
Many data structures with good lookup perf will take time to set up

Kris

P.s., However, databases are highly evolved, and do all of this very
efficiently, so the whole argument is somewhat silly, as if you just
use one you'll be fine.

2011/12/20 felix <guofuchu...@gmail.com>:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Christopher Van Kirk 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午2时54分
A conventional database isn't going to do better than a Trie, I think.
On 12/20/2011 2:46 PM, felix wrote:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午3时08分
Right,
But it does have the advantage that the technology on Android is
already there, so he doesn't have to write the implementation himself,
or grab one and learn to use it off the web.

kris

2011/12/20 Christopher Van Kirk <christopher.vank...@gmail.com>:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Christopher Van Kirk 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午3时11分
Then again a Trie isn't really that hard to write.
On 12/20/2011 3:08 PM, Kristopher Micinski wrote:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午3时16分
Right,
but getting the huge thing in the right format, storing that
statically, etc.., vs preloading the app with a database, which sounds
easier?  I just think the database sounds like the better way to go on
this one, and I'm biased to not reinventing the wheel, but the OP is
obviously free to use whatever..

kris

On Tue, Dec 20, 2011 at 2:11 AM, Christopher Van Kirk



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








martypantsROK 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午5时10分
Don't forget there are more than data structures involved here.
The method searching could be improved. As Jim suggested, breaking
things down with an index (search for zulu beginning in the z section)
could be sped up even more.  Search for the last letter in the string
first. By searching for that 4th character "u" first you've
eliminated
3 other characters and can skip on to the next word. That way,
similar
words like zuch or zucchini won't slow you down matching the first two
characters. Works even better for longer words.
Marty

On Dec 20, 4:16 pm, Kristopher Micinski <krismicin...@gmail.com>
wrote:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Solution 9420 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午5时40分
Hi,
I'm the auther of 9420 Thai Keyboard which incorporates English word
suggestion feature as well.
I've around 200K words and can look up in average of 80 mSec.
Assuming the word DB is static, I've done the following...
1. pre-sorted your word in file.
2. pre-index your words.
3. Use binary search tree algorithm.

You'll have to a bit careful the size of the index file, and very
optimized on memory usage to avoid the delay from JAVA gabage
collection as well.

Cheers,
Solution 9420...

www.solution9420.com

On Dec 20, 12:26 am, felix <guofuchu...@gmail.com> wrote:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午5时43分



On Tue, Dec 20, 2011 at 4:10 AM, martypantsROK <martyg...@gmail.com> wrote:
> Don't forget there are more than data structures involved here.
> The method searching could be improved. As Jim suggested, breaking
> things down with an index (search for zulu beginning in the z section)
> could be sped up even more.  Search for the last letter in the string
> first. By searching for that 4th character "u" first you've
> eliminated
> 3 other characters and can skip on to the next word. That way,
> similar
> words like zuch or zucchini won't slow you down matching the first two
> characters. Works even better for longer words.
> Marty

I guess my point in all of this is that this searching is highly tied
to your data structure.  Good algorithms only work with good data
structures to back them.  And there are many indexing and optimization
techniques you can use to get more efficiency.  My point is, that
since you can argue all day over these things getting more and more
complicated data structures and searching algorithms (each becoming
more and more context dependent), most of the time for this
application using a database will suffice.  If you use a database,
whose indexing method is already going to be pretty good, and find it
doesn't suit your needs, *then* you can switch over to using something
fancier, though I highly doubt you'd need anything much fancier than a
trie in this case.
SQLite is using B+ trees for tables, while this isn't *amazing*
(especially compared to what you'll see with a trie), it's still going
to be massively better (where massively = logarithmic), than just
linear search.  Along with this, it looks like "Solutin 9420" shared
his advice...  And don't forget about the bloom filter, (this won't
actually help you that much unless you're doing a bunch of queries in
a row, most of which might not be int he database, but I wanted to
bring it up again anyway..)

kris

    回复     回复作者      转发       

举报垃圾内容








Christopher Van Kirk 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午5时59分
Three points.
1) Building the searching functionality twice is far more expensive than
building it once, no matter what approach you use. Be sure that the
performance of the DB approach is acceptable before you go and build it
that way.

2) It can be quite challenging to get decent performance out of a
database for something like this, depending on the functionality
required. If, for example, you need real-time narrowing down of words, a
database is going to be very slow (e.g. as you type letters, you get an
alphabetized list of what's in the db).

3) There's probably an open source Trie out there somewhere that you can
just use.

Directed at the OP, of course.

Cheers...

On 12/20/2011 5:43 PM, Kristopher Micinski wrote:



- 显示引用的文字 -



    回复     回复作者      转发       

举报垃圾内容








Kristopher Micinski 
查看个人资料   翻译成中文(简体) 更多选项 12月20日, 下午6时04分
On Tue, Dec 20, 2011 at 4:59 AM, Christopher Van Kirk


<christopher.vank...@gmail.com> wrote:
> Three points.
> 1) Building the searching functionality twice is far more expensive than
> building it once, no matter what approach you use. Be sure that the
> performance of the DB approach is acceptable before you go and build it that
> way.

Okay.


> 2) It can be quite challenging to get decent performance out of a database
> for something like this, depending on the functionality required. If, for
> example, you need real-time narrowing down of words, a database is going to
> be very slow (e.g. as you type letters, you get an alphabetized list of
> what's in the db).

True..


> 3) There's probably an open source Trie out there somewhere that you can
> just use.

Right, which is what I suggested in the first place if he goes this direction..
http://wikipedia-clustering.speedblue.org/trieJava.php

kris

    回复     回复作者      转发       

举报垃圾内容

分享到:
评论

相关推荐

    十万词汇英汉词典词库JSON格式

    十万词汇英汉词典词库JSON格式,用心整理的一个词库,用JSON格式存储,直接可供JAVASCRIPT编程使用,适合制作在线或离线词典,或者背单词应用,注意只包含简要释义,不包含音标注音,更加精简节省存储空间。

    词典检索系统

    这种概念在词典检索中具有特殊意义,因为它可以帮助用户找到与原始词汇相关但形式不同的词汇,从而扩大搜索范围,增加信息的获取量。 词典检索系统的构建通常包含以下几个关键组成部分: 1. 数据库:这是词典检索...

    汉语词汇词典

    《汉语词汇词典》是一个专为汉语处理设计的资源,主要功能是进行分词操作。在自然语言处理(NLP)领域,分词是预处理阶段的重要步骤,它将连续的汉字序列切分成一个个有意义的词汇单元,为后续的文本分析、信息检索...

    日汉手机词典 7.4万词汇量

    《日汉手机词典》是一款专为Android 1.6及以上版本系统设计的离线词典应用,具备7.4万个词汇量,为用户提供了便捷的日语与汉语之间的翻译服务。这款应用程序的一大特点是无需网络连接,可以在任何时间、任何地点进行...

    记忆宝1.72版 [多文件版]50万词汇量的英汉词典.rar

    《记忆宝1.72版 [多文件版]50万词汇量的英汉词典》是一款专为英语学习者设计的词汇工具软件,旨在帮助用户高效记忆和查询大量英语词汇。这款词典包含了50万个词汇条目,涵盖了日常生活、学术研究、专业领域等各个...

    词典变位词检索系统

    词典变位词检索系统是一种专门用于查找词汇的不同变形或变位形式的软件工具,尤其在语言学习和处理中有着广泛的应用。这类系统通常由计算机程序实现,利用算法和技术来处理词汇的形态变化,帮助用户快速找到他们正在...

    十天英语词汇量突破一万

    【标题】"十天英语词汇量突破一万"的教程旨在帮助学习者在短短十天内大幅提高英语词汇量,达到10000个单词的目标。这个教程强调的是毅力和高效的背诵方法,要求每天投入6小时的学习时间。 【描述】提到的方法是通过...

    黑莓贝贝词典-词汇量大,查询方便

    这款词典以其丰富的词汇量和便捷的查询功能,深受广大用户喜爱,尤其是对语言学习者来说,它更是不可或缺的助手。 首先,我们要明确黑莓贝贝词典的核心特点——“词汇量大”。这表明该词典包含了大量词汇,涵盖了...

    JAVA英汉词典(超大词汇量版)

    《JAVA英汉词典(超大词汇量版)》是一款专为Java编程语言设计的双语词典软件,旨在帮助开发者、学习者更好地理解和运用Java技术。这款词典具有丰富的词汇量,覆盖了Java语言的基础概念、核心API、框架、开发工具以及...

    十天内词汇量突破20000.zip

    标题和描述中的“十天内词汇量突破20000”显然与英语学习和词汇扩展有关,这可能是一个特别设计的学习计划或者方法,旨在帮助用户在短短十天内显著提高他们的词汇量,达到20000个单词的认知水平。在英语学习中,拥有...

    英语 十天内词汇量突破20000

    英语 十天内词汇量突破20000英语 十天内词汇量突破20000英语 十天内词汇量突破20000英语 十天内词汇量突破20000英语 十天内词汇量突破20000英语 十天内词汇量突破20000英语 十天内词汇量突破20000英语 十天内词汇量...

    十万词英汉词典词库sqlite数据库

    《十万词英汉词典词库sqlite数据库》是一款专为英语学习者和开发者设计的数据库资源,它将十万词汇量的英汉双解词典整合到一个SQLite数据库中。SQLite是一种轻量级的、无服务器的、自包含的、开源的关系型数据库管理...

    专业词汇电子词典---词博4.0.zip

    1. **词汇量大**:词博4.0可能包含数十万乃至百万级别的专业词汇,确保用户能找到几乎所有需要的术语。 2. **多领域覆盖**:覆盖了多个专业领域,满足不同用户的需求,无论是科研人员、医生、律师还是工程师,都能...

    十万英语词汇读音音标词库JSON格式

    十万英语词汇读音音标词库JSON格式,格式规范,无特殊符号,标准英式音标,适合制作大型英汉词典读音显示

    电信设备-信息检索装置以及词典检索方法.zip

    在电信行业中,信息检索装置和词典检索方法是至关重要的技术组成部分,特别是在处理大量数据和信息的现代通信系统中。这些技术旨在高效、准确地获取和解析所需信息,以支持决策制定、故障诊断、客户服务等多个方面。...

    贝贝词典v2.8 超大容量词汇

    贝贝词典v2.8是一款专为手机用户设计的高效、轻量级词汇查询应用,其最大特色在于其超大的词汇量,为用户提供丰富的语言学习资源。这款应用在内存占用率和存储空间上进行了精心优化,确保用户在享受海量词汇服务的...

    国家教育研究院双语词汇双向词典.7z

    《国家教育研究院双语词汇双向词典》是专为学习者设计的一款强大的英语学习资源,尤其对于使用欧路词典的用户来说,它是一个不可或缺的工具。这款词典结合了国内权威教育研究机构的专业知识,提供了丰富的双语词汇和...

    英语十天内词汇量突破20000带翻译

    【标题】:“英语十天内词汇量突破20000带翻译” 【描述】:“英语十天内词汇量突破20000带翻译,帮助通过博士考试的哦” 【标签】:“英语十天内词汇量突破20000带翻译” 【正文】: 在英语学习中,词汇量的...

    大连理工情感词典,程度副词典,否定词典,停用词典

    大连理工提供的情感词典、程度副词典、否定词典和停用词典是进行情感分析的重要资源,这些词典对于理解和处理中文文本的情感色彩至关重要。 1. **情感词典**:情感词典是情感分析的基础,它包含大量带有正向或负向...

Global site tag (gtag.js) - Google Analytics