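(Repost) Chinese word segmentation belongs to the field of natural language processing. Given a sentence, a person can use their own knowledge to tell which character sequences are words and which are not, but how can a computer be made to do the same? The procedure that does this is a word segmentation algorithm.
A few months ago I found a Chinese lexicon file online (a few hundred KB) and decided to write a word segmentation program. I have not studied Chinese word segmentation, so this is written from my own guesswork. If any experts in the area are reading, I would welcome your comments.
I. Lexicon
The lexicon contains a little over 50,000 words (it can be found through Google, and any similar lexicon will do). An excerpt: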
地区 82
重要 81
新华社 80
技术 80
会议 80
自己 79
干部 78
职工 78
群众 77
没有 77
今天 76
同志 76
部门 75
加强 75
组织 75
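The first column is the word and the second column is its weight. The segmentation algorithm I wrote does not currently use the weight.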
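Reading such a file into (word, weight) pairs could look like the sketch below. The class name `LexiconLoader`, the method `Parse`, and the file layout are assumptions made for illustration: the code only assumes that each line ends with the weight as a run of ASCII digits, with or without whitespace before it. It is not the original author's loader.

```csharp
using System;
using System.Collections.Generic;

public static class LexiconLoader
{
    // Hypothetical helper: turns lines such as "地区 82" into (word, weight) pairs.
    // Assumption: the weight is the trailing run of ASCII digits on each line.
    public static List<KeyValuePair<string, int>> Parse(IEnumerable<string> lines)
    {
        var entries = new List<KeyValuePair<string, int>>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();

            // Walk backwards over the trailing digits to find where the weight starts.
            int split = line.Length;
            while (split > 0 && line[split - 1] >= '0' && line[split - 1] <= '9')
            {
                split--;
            }

            string word = line.Substring(0, split).Trim();

            // Skip blank lines, lines with no word, and lines with no weight.
            if (word.Length == 0 || split == line.Length)
            {
                continue;
            }

            int weight = int.Parse(line.Substring(split));
            entries.Add(new KeyValuePair<string, int>(word, weight));
        }
        return entries;
    }
}
```

With a real file, the lines could be supplied by `System.IO.File.ReadLines(path)`; the encoding of the lexicon file is not stated in the text and would need to be checked.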
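II. Design
Brief description of the algorithm:
Scan the string S from front to back and, for each character reached, look for the longest match in the lexicon. For example, suppose S = "我是中华人民共和国公民" and the lexicon contains "中华人民共和国", "中华", "公民", "人民", "共和国", and so on. When the scan reaches the character "中", take 1, 2, 3, ... characters starting from "中" ("中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公") and look each one up. The longest match in the lexicon is "中华人民共和国", so the string is cut there and the scanner advances to the character "公".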
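The scan described above can be sketched as follows. This is a minimal illustration, not the article's program: the class `ForwardMaxMatch`, the method `Segment`, the `maxWordLength` parameter, and the use of a plain `HashSet<string>` as the lexicon are all assumptions. A character that begins no lexicon word is emitted as a single-character token, which the text does not specify.

```csharp
using System;
using System.Collections.Generic;

public static class ForwardMaxMatch
{
    // At each position, try candidates of length 1, 2, 3, ... and keep the longest
    // one found in the lexicon. Cut there and continue after the cut.
    // Assumption: a character that starts no lexicon word becomes its own token.
    public static List<string> Segment(string text, ISet<string> lexicon, int maxWordLength)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int bestLength = 1; // fallback when nothing matches
            int limit = Math.Min(maxWordLength, text.Length - pos);
            for (int len = 1; len <= limit; len++)
            {
                if (lexicon.Contains(text.Substring(pos, len)))
                {
                    bestLength = len; // a longer match overrides a shorter one
                }
            }
            tokens.Add(text.Substring(pos, bestLength));
            pos += bestLength;
        }
        return tokens;
    }
}
```

For the example in the text, calling `Segment("我是中华人民共和国公民", lexicon, 7)` with the five listed words in `lexicon` yields the tokens 我, 是, 中华人民共和国, 公民. This version always tries every length up to `maxWordLength`, which is simple but does unnecessary lookups.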
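Data structure:
The choice of data structure has a large effect on performance. I use a Hashtable, _rootTable, to hold the lexicon. Each entry is a (key, insertion count) pair. For every word of N characters, its prefixes of length 1, 2, 3, ..., N are inserted into _rootTable as keys. When the same key is inserted again, its value is incremented.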
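The construction of such a prefix table can be sketched as below. The names `PrefixCountTable` and `Build` are hypothetical, and a generic `Dictionary<string, int>` stands in for the author's `Hashtable`; the insertion rule itself follows the description above.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixCountTable
{
    // For each word of N characters, insert its prefixes of length 1..N as keys.
    // The value counts how many times the key has been inserted.
    public static Dictionary<string, int> Build(IEnumerable<string> words)
    {
        var table = new Dictionary<string, int>();
        foreach (string word in words)
        {
            for (int len = 1; len <= word.Length; len++)
            {
                string prefix = word.Substring(0, len);
                int count;
                table.TryGetValue(prefix, out count); // count is 0 when the key is absent
                table[prefix] = count + 1;
            }
        }
        return table;
    }
}
```

Built from "中华", "中华人民共和国" and "人民", the table maps "中" and "中华" to 2, and "中华人" and "人民" to 1. One likely benefit of this layout is that a scan can stop extending a candidate as soon as it is not a key, since no lexicon word starts with it. Note that a key being present only shows that it is a prefix of some word ("中华人" is a key here but not a word), so deciding where a complete word ends needs some additional check.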
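III. Program
The program is listed below. It includes elements such as the weight and the insertion count that the current algorithm does not use; they could serve as the basis for a more effective segmentation algorithm.
ChineseWordUnit.cs // struct: a (word, weight) pair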
public struct ChineseWordUnit
{
    private string _word;
    private int _power;

    /// <summary>
    /// The Chinese word that this unit represents.
    /// </summary>
    public string Word
    {
        get
        {
            return _word;
        }
    }

    /// <summary>
    /// The weight of this Chinese word.
    /// </summary>
    public int Power
    {
        get
        {
            return _power;
        }
    }

    /// <summary>
    /// Initializes the struct.
    /// </summary>
    /// <param name="word">The Chinese word</param>
    /// <param name="power">The weight of the word</param>
    public ChineseWordUnit(string word, int power)
    {
        this._word = word;
        this._power = power;
    }
}
ChineseWordsHashCountSet.cs // lexicon container
/// <summary>
/// Dictionary class that records how many times a string appears at the front of the words in the Chinese lexicon. For example, the string "中" appears at the front of "中国", so one occurrence is recorded in the dictionary.
/// </summary>
public class ChineseWordsHashCountSet
{
    /// <summary>
    /// Hashtable recording how many times a string occurs in Chinese words. The key is a given string; the value is the number of times that string occurs in Chinese words.
    // ... (the listing is cut off at this point in the source)