In this article, IT技术学习网 explains what stopwords are in MySQL full-text indexes (in Chinese the term is rendered as 停止词, or sometimes 停止字).
stopword
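In a full-text index, a word that is considered too common or too low in value is ignored, both when the search index is built and when a search query is processed. InnoDB and MyISAM each have their own stopword list and their own settings that control it.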
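As a rough illustration of this behaviour, the sketch below builds a small InnoDB table and searches it once for a stopword and once for an ordinary word. The table name articles, the index name ft_body and the sample rows are made up for this example, and the sketch assumes the default InnoDB stopword list is in effect.

```sql
-- Hypothetical demo table with a FULLTEXT index on the body column
CREATE TABLE articles (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT,
  FULLTEXT KEY ft_body (body)
) ENGINE = InnoDB;

INSERT INTO articles (body) VALUES
  ('How to tune the database'),
  ('MySQL tutorial for beginners');

-- 'the' is in the default InnoDB stopword list, so it is neither indexed
-- nor used as a search term; under default settings this should match no rows
SELECT id, body
FROM articles
WHERE MATCH(body) AGAINST('the' IN NATURAL LANGUAGE MODE);

-- 'database' is not a stopword, so the first row should match
SELECT id, body
FROM articles
WHERE MATCH(body) AGAINST('database' IN NATURAL LANGUAGE MODE);
```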
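For full-text queries, the stopword list is loaded and searched using the server character set and collation (the character_set_server and collation_server system variables). If the stopword file, or the columns being indexed or searched, use a different character set or collation, stopword lookups may produce false hits or misses.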
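Whether stopword lookups are case-sensitive depends on the server collation. For example, lookups are case-insensitive under latin1_swedish_ci, and case-sensitive under latin1_general_cs or latin1_bin.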
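To check which character set and collation apply to stopword lookups on a given server, and whether an indexed column uses something different, queries along these lines can help. This is only a sketch: the schema name test and the table name articles are placeholders, not names taken from this article.

```sql
-- Server-level character set and collation used for stopword lookups
SELECT @@character_set_server AS server_charset,
       @@collation_server     AS server_collation;

-- Character set and collation of the text columns in a (hypothetical) table;
-- a mismatch with the server values above is worth investigating
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'test'
  AND TABLE_NAME   = 'articles'
  AND CHARACTER_SET_NAME IS NOT NULL;
```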
InnoDB index stopwords
The default InnoDB stopword list is short. To see it, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)
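The default list can be replaced with a custom one through the innodb_ft_server_stopword_table system variable. The following is a minimal sketch, assuming a schema named test already exists and that the session is allowed to set global variables. The table names my_stopwords and notes and the sample words are invented for illustration.

```sql
-- A stopword table must be an InnoDB table with a single VARCHAR column named "value"
CREATE TABLE test.my_stopwords (value VARCHAR(30)) ENGINE = InnoDB;
INSERT INTO test.my_stopwords (value) VALUES ('mysql'), ('tutorial');

-- Point the server at the custom list, written as 'db_name/table_name'
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- The stopword list is read when a FULLTEXT index is created or rebuilt,
-- so the index is created only after the variable has been set
CREATE TABLE test.notes (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ENGINE = InnoDB;

CREATE FULLTEXT INDEX ft_body ON test.notes (body);
```

FULLTEXT indexes that already existed before the variable was changed keep using the list they were built with until they are dropped and recreated.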
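MyISAM index stopwords
The MyISAM stopword list is different from the InnoDB one. The default MyISAM list is compiled into the MySQL server from its source code. To override it, set the ft_stopword_file system variable to the path of a stopword file.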
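A possible workflow for overriding the MyISAM list is sketched below. The file path and the table name articles_myisam are placeholders. Because ft_stopword_file is read at server startup, the setting itself belongs in the option file and is shown here only as a comment.

```sql
-- Show where the MyISAM stopword list currently comes from;
-- a server using the compiled-in list reports (built-in)
SHOW VARIABLES LIKE 'ft_stopword_file';

-- In the option file (hypothetical path), followed by a server restart:
--   [mysqld]
--   ft_stopword_file=/etc/mysql/my_stopwords.txt

-- After the restart, rebuild the FULLTEXT indexes of each affected MyISAM
-- table so that they are regenerated with the new stopword list
REPAIR TABLE articles_myisam QUICK;
```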
The default MyISAM stopword list can be found in the storage/myisam/ft_static.c file of the MySQL source tree:
- a's able about above according
- accordingly across actually after afterwards
- again against ain't all allow
- allows almost alone along already
- also although always am among
- amongst an and another any
- anybody anyhow anyone anything anyway
- anyways anywhere apart appear appreciate
- appropriate are aren't around as
- aside ask asking associated at
- available away awfully be became
- because become becomes becoming been
- before beforehand behind being believe
- below beside besides best better
- between beyond both brief but
- by c'mon c's came can
- can't cannot cant cause causes
- certain certainly changes clearly co
- com come comes concerning consequently
- consider considering contain containing contains
- corresponding could couldn't course currently
- definitely described despite did didn't
- different do does doesn't doing
- don't done down downwards during
- each edu eg eight either
- else elsewhere enough entirely especially
- et etc even ever every
- everybody everyone everything everywhere ex
- exactly example except far few
- fifth first five followed following
- follows for former formerly forth
- four from further furthermore get
- gets getting given gives go
- goes going gone got gotten
- greetings had hadn't happens hardly
- has hasn't have haven't having
- he he's hello help hence
- her here here's hereafter hereby
- herein hereupon hers herself hi
- him himself his hither hopefully
- how howbeit however i'd i'll
- i'm i've ie if ignored
- immediate in inasmuch inc indeed
- indicate indicated indicates inner insofar
- instead into inward is isn't
- it it'd it'll it's its
- itself just keep keeps kept
- know known knows last lately
- later latter latterly least less
- lest let let's like liked
- likely little look looking looks
- ltd mainly many may maybe
- me mean meanwhile merely might
- more moreover most mostly much
- must my myself name namely
- nd near nearly necessary need
- needs neither never nevertheless new
- next nine no nobody non
- none noone nor normally not
- nothing novel now nowhere obviously
- of off often oh ok
- okay old on once one
- ones only onto or other
- others otherwise ought our ours
- ourselves out outside over overall
- own particular particularly per perhaps
- placed please plus possible presumably
- probably provides que quite qv
- rather rd re really reasonably
- regarding regardless regards relatively respectively
- right said same saw say
- saying says second secondly see
- seeing seem seemed seeming seems
- seen self selves sensible sent
- serious seriously seven several shall
- she should shouldn't since six
- so some somebody somehow someone
- something sometime sometimes somewhat somewhere
- soon sorry specified specify specifying
- still sub such sup sure
- t's take taken tell tends
- th than thank thanks thanx
- that that's thats the their
- theirs them themselves then thence
- there there's thereafter thereby therefore
- therein theres thereupon these they
- they'd they'll they're they've think
- third this thorough thoroughly those
- though three through throughout thru
- thus to together too took
- toward towards tried tries truly
- try trying twice two un
- under unfortunately unless unlikely until
- unto up upon us use
- used useful uses using usually
- value various very via viz
- vs want wants was wasn't
- way we we'd we'll we're
- we've welcome well went were
- weren't what what's whatever when
- whence whenever where where's whereafter
- whereas whereby wherein whereupon wherever
- whether which while whither who
- who's whoever whole whom whose
- why will willing wish with
- within without won't wonder would
- wouldn't yes yet you you'd
- you'll you're you've your yours
- yourself yourselves zero