pxczy

MySQL full-text indexes: stopwords

 

This article explains what stopwords are in MySQL full-text indexes.

stopword

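In a full-text index, a word that is considered too common or too trivial is ignored by both the search index and search queries. InnoDB and MyISAM each have their own set of settings that control their stopwords.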
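As a rough illustration of a stopword being ignored, the sketch below builds a small InnoDB table and searches for a word that is on the default list. The table name `articles` and its rows are made up for this example, and the server is assumed to use the default InnoDB stopword settings.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE articles (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT,
  FULLTEXT (body)
) ENGINE=InnoDB;

INSERT INTO articles (body) VALUES
  ('The database tutorial'),
  ('Where to start with full-text search');

-- 'the' is a default stopword, so it is neither indexed nor searched;
-- with default settings this query should return no rows.
SELECT id, body FROM articles
WHERE MATCH(body) AGAINST('the' IN NATURAL LANGUAGE MODE);

-- 'tutorial' is not a stopword, so this should match the first row.
SELECT id, body FROM articles
WHERE MATCH(body) AGAINST('tutorial' IN NATURAL LANGUAGE MODE);
```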

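Stopword lists are loaded and searched using the server character set and collation (the character_set_server and collation_server system variables). If the stopword file or the indexed columns use a different character set or collation, stopword lookups may produce false hits or misses.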

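Whether stopword lookups are case-sensitive depends on the server collation. For example, lookups are case-insensitive under latin1_swedish_ci, and case-sensitive under latin1_general_cs or latin1_bin.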
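To see which character set and collation apply on a given server, the two system variables named above can be inspected directly. This is only a quick check, not a full diagnosis of a mismatch.

```sql
-- Server-level character set and collation used for stopword lookups.
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

-- The collation suffix indicates the comparison behaviour:
--   _ci  = case-insensitive
--   _cs  = case-sensitive
--   _bin = binary (compares by code value, hence case-sensitive)
```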

InnoDB stopwords

The default InnoDB stopword list is short. To view it, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.

      mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
      +-------+
      | value |
      +-------+
      | a     |
      | about |
      | an    |
      | are   |
      | as    |
      | at    |
      | be    |
      | by    |
      | com   |
      | de    |
      | en    |
      | for   |
      | from  |
      | how   |
      | i     |
      | in    |
      | is    |
      | it    |
      | la    |
      | of    |
      | on    |
      | or    |
      | that  |
      | the   |
      | this  |
      | to    |
      | was   |
      | what  |
      | when  |
      | where |
      | who   |
      | will  |
      | with  |
      | und   |
      | the   |
      | www   |
      +-------+
      36 rows in set (0.00 sec)
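The InnoDB settings mentioned earlier include a way to substitute your own list for this default one. The following is a minimal sketch, assuming a database named `test` and the hypothetical table names `my_stopwords` and `notes`; the stopword table must be an InnoDB table with a single VARCHAR column named `value`.

```sql
-- Hypothetical custom stopword table (one VARCHAR column named `value`).
CREATE TABLE test.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
INSERT INTO test.my_stopwords (value) VALUES ('mysql'), ('database');

-- Point InnoDB at the custom list, using the 'db_name/table_name' form.
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- The list is applied when a FULLTEXT index is built, so create
-- (or rebuild) the index after changing the setting.
CREATE TABLE test.notes (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX idx_body ON test.notes (body);
```

FULLTEXT indexes that already exist keep the stopword list they were built with until they are dropped and recreated.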

MyISAM stopwords

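The MyISAM stopword list differs from the InnoDB one. The default MyISAM list is compiled into the MySQL source code. To override it, set the ft_stopword_file system variable to the path of a stopword file.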
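A hedged sketch of checking and applying this setting follows. The file path and the table name `myisam_articles` are placeholders, and the sketch assumes the variable is set in the server option file, because ft_stopword_file is read at startup rather than changed at runtime.

```sql
-- Show the stopword file in use; '(built-in)' means the compiled-in list.
SHOW VARIABLES LIKE 'ft_stopword_file';

-- ft_stopword_file is set in the option file and needs a restart, e.g.:
--   [mysqld]
--   ft_stopword_file = /path/to/my_stopwords.txt   (placeholder path)

-- After the restart, rebuild existing MyISAM FULLTEXT indexes so they
-- reflect the new list. `myisam_articles` is a hypothetical table.
REPAIR TABLE myisam_articles QUICK;
```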

The default MyISAM stopword list can be found in the storage/myisam/ft_static.c file of the MySQL source tree:

      a's able about above according
      accordingly across actually after afterwards
      again against ain't all allow
      allows almost alone along already
      also although always am among
      amongst an and another any
      anybody anyhow anyone anything anyway
      anyways anywhere apart appear appreciate
      appropriate are aren't around as
      aside ask asking associated at
      available away awfully be became
      because become becomes becoming been
      before beforehand behind being believe
      below beside besides best better
      between beyond both brief but
      by c'mon c's came can
      can't cannot cant cause causes
      certain certainly changes clearly co
      com come comes concerning consequently
      consider considering contain containing contains
      corresponding could couldn't course currently
      definitely described despite did didn't
      different do does doesn't doing
      don't done down downwards during
      each edu eg eight either
      else elsewhere enough entirely especially
      et etc even ever every
      everybody everyone everything everywhere ex
      exactly example except far few
      fifth first five followed following
      follows for former formerly forth
      four from further furthermore get
      gets getting given gives go
      goes going gone got gotten
      greetings had hadn't happens hardly
      has hasn't have haven't having
      he he's hello help hence
      her here here's hereafter hereby
      herein hereupon hers herself hi
      him himself his hither hopefully
      how howbeit however i'd i'll
      i'm i've ie if ignored
      immediate in inasmuch inc indeed
      indicate indicated indicates inner insofar
      instead into inward is isn't
      it it'd it'll it's its
      itself just keep keeps kept
      know known knows last lately
      later latter latterly least less
      lest let let's like liked
      likely little look looking looks
      ltd mainly many may maybe
      me mean meanwhile merely might
      more moreover most mostly much
      must my myself name namely
      nd near nearly necessary need
      needs neither never nevertheless new
      next nine no nobody non
      none noone nor normally not
      nothing novel now nowhere obviously
      of off often oh ok
      okay old on once one
      ones only onto or other
      others otherwise ought our ours
      ourselves out outside over overall
      own particular particularly per perhaps
      placed please plus possible presumably
      probably provides que quite qv
      rather rd re really reasonably
      regarding regardless regards relatively respectively
      right said same saw say
      saying says second secondly see
      seeing seem seemed seeming seems
      seen self selves sensible sent
      serious seriously seven several shall
      she should shouldn't since six
      so some somebody somehow someone
      something sometime sometimes somewhat somewhere
      soon sorry specified specify specifying
      still sub such sup sure
      t's take taken tell tends
      th than thank thanks thanx
      that that's thats the their
      theirs them themselves then thence
      there there's thereafter thereby therefore
      therein theres thereupon these they
      they'd they'll they're they've think
      third this thorough thoroughly those
      though three through throughout thru
      thus to together too took
      toward towards tried tries truly
      try trying twice two un
      under unfortunately unless unlikely until
      unto up upon us use
      used useful uses using usually
      value various very via viz
      vs want wants was wasn't
      way we we'd we'll we're
      we've welcome well went were
      weren't what what's whatever when
      whence whenever where where's whereafter
      whereas whereby wherein whereupon wherever
      whether which while whither who
      who's whoever whole whom whose
      why will willing wish with
      within without won't wonder would
      wouldn't yes yet you you'd
      you'll you're you've your yours
      yourself yourselves zero


