01jiangwei01

Translation of the two configuration files of the regain search tool

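I have been working for almost two years now. Today my manager dug out last year's regain search project and asked me to tidy it up and get it running as quickly as possible. I remember that when I first touched it I was fresh out of school; I tinkered with it for a month without success and finally handed it over to the manager. Now that it is back in my hands and I have the time, I have translated the configuration files myself:
There are really two main configuration files: CrawlerConfiguration.xml (used when building the index) and SearchConfiguration.xml (used when searching the index).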

Download: http://regain.sourceforge.net/download.php
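
The two files meet at the index directory. As a minimal sketch that uses only tags and values from the listings in this post: the crawler stores the index in the directory named in its `<searchIndex>` section, and the search mask reads the index from the directory named in its `<index>` entry, so the two `<dir>` values would normally point at the same place. `${SEARCHDIR}` is the placeholder used in this post's files; how it gets resolved depends on the deployment and is not shown here.

```xml
<!-- Excerpt from CrawlerConfiguration.xml: where the crawler stores the index -->
<searchIndex>
  <dir>${SEARCHDIR}searchindex</dir>
</searchIndex>

<!-- Excerpt from SearchConfiguration.xml: where the search mask looks for the index -->
<indexList>
  <index name="main" default="true" isparent="true">
    <dir>${SEARCHDIR}searchindex</dir>
  </index>
</indexList>
```

The two excerpts belong to two separate files; they are shown together here only to make the shared path visible.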
CrawlerConfiguration.xml
<?xml version="1.0" encoding="GBK"?>

<!DOCTYPE configuration [
  <!ENTITY amp "&#x26;">
  <!ENTITY lt "&#x3C;">
  <!ENTITY minus "&#45;">
]>

<!--
 | Configuration for the regain crawler (for creating a search index)
 | Translation: Configuration file for the regain crawler; it is used to create the search index.
 | You can find a detailed description of all configuration tags here:
 | http://regain.murfman.de/wiki/en/index.php/CrawlerConfiguration.xml
 | Translation: A detailed description of every tag in this configuration is available at the URL above.
 | You can find more configuration examples in the CrawlerConfiguration_examples.xml.
 | Translation: More examples can be found in the file CrawlerConfiguration_examples.xml.
 +-->
<configuration>

<!--
 | Enter your HTTP proxy settings here (Look at the preferences of your browser)
 | Translation: Enter your HTTP proxy settings here; you can look them up in your browser's preferences.
 +-->
  <proxy>
  <!--
  <host>proxy</host>
  <port>3128</port>
  <user>HansWurst</user>
  <password>gkxy23</password>
  -->
  </proxy>


<!--
 | The list of URLs where the spidering will start.
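 | Translation: The list of URLs at which spidering starts.
 | Enter the start page of your web site or a file system folder here.
 | Translation: Enter the start page of your web site, or a file system folder, from which spidering should begin.
 | NOTE: The examples are in a comment. Thus, if you add your path in one of
 |       them, then don't forget to uncomment them.
 | Translation: Note that the examples are commented out, so if you put your own path into one of them, remember to uncomment it.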
 +-->
  <startlist>
  <!-- Directory parsing -->
  <!--
  <start parse="true" index="false">file://c:/Eigene Dateien</start>
  Set the location of the documents to be indexed.
  Translation: Set the location where the files are stored.
  file://E:/eclipse 3.2/workspace/SIS/WebRoot/FileDepository  ${SEARCHDIR}
  -->
 <start index="false" parse="true">file://${WORKDIR}FileDepository</start>
  <!-- HTML parsing -->
  <!--
  <start parse="true" index="true">http://www.mydomain.de/some/path/</start>
  -->
  </startlist>


<!--
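 | The whitelist containing prefixes a URL must have to be processed
 | Translation: The whitelist contains the prefixes a URL must have in order to be processed.
 | Enter the domain of your web site here.
 | Translation: Enter the address of your web site here.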
 +-->
  <whitelist>
       <prefix>file://</prefix>
  </whitelist>


<!--
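 | The blacklist containing prefixes a URL must NOT have to be processed
 | Translation: The blacklist contains the prefixes a URL must not have if it is to be processed.
 | Enter sub directories you don't want to be indexed here.
 | Translation: Enter the locations you do not want indexed here.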
 +-->
  <blacklist>
  <!--
  <prefix>http://www.mydomain.de/some/dynamic/content/</prefix>
  <regex>/backup/[^/]*$</regex>
  -->
  </blacklist>


<!--
 | ==================================================================================
 | That's all you have to configure! The rest of this file is advanced configuration.
 | Translation: That is everything you need to configure; the rest of this file is advanced configuration.
 | ==================================================================================
 +-->

<!--
 | The preferences for the search index.
 | Translation: Settings for the search index.
 +-->
  <searchIndex>
  <!-- 
  The directory where the index should be located ${SEARCHDIR}
  Translation: The directory in which the index should be stored.
  -->
  <dir>${SEARCHDIR}searchindex</dir>
  <!--
   | Specifies the analyzer type to use.
   | Translation: Specifies which analyzer type to use.
   | You may specify the class name of the analyzer or you use one of the
   | following aliases: 
   |  * english: For the english language
   |    (alias for org.apache.lucene.analysis.standard.StandardAnalyzer)
   |  * german: For the german language
   |    (alias for org.apache.lucene.analysis.de.GermanAnalyzer)
   | Translation: You may give the class name of the analyzer or pick one of the aliases below.
   |   english: for English, an alias for org.apache.lucene.analysis.standard.StandardAnalyzer
   |   german: for German, an alias for org.apache.lucene.analysis.de.GermanAnalyzer
   +-->
    <analyzerType>english</analyzerType>
    <!-- 
		<analyzerType>german</analyzerType>
	    <analyzerType>chinese</analyzerType>
	    <analyzerType>paoding</analyzerType>
    
     -->

  <!--
   | Contains all words that should not be indexed.
   | Separate the words by a blank.
   | Translation: Contains all words that should not be indexed; separate the words with blanks.
   +-->
    <stopwordList>
    einer eine eines einem einen der die das dass daß du er sie es was wer wie
    wir und oder ohne mit am im in aus auf ist sein war wird ihr ihre ihres als
    für von mit dich dir mich mir mein sein kein durch wegen wird
    </stopwordList>
  <!-- italian:
  <stopwordList>
    di a da in con su per tra fra io tu egli ella essa noi voi essi loro che cui
    se e né anche inoltre neanche o ovvero oppure ma però eppure anzi invece
    bensì tuttavia quindi dunque perciò pertanto cioè infatti ossia non come
    mentre perché quando mio mia miei mie tuo tua tuoi tue suo sua suoi sue
    nostro nostre nostri nostre vostro vostre vostri vostre il lo la i gli le un
    uno una degli delle alcuno alcuna alcune qualcuno qualcuna nessuno nessuna
    molto molte molti molte poco parecchio assai
  </stopwordList>
  -->

  <!--
   | Contains all words that should not be changed by an analyser when indexed.
   | Separate the words by a blank.
   | Translation: Contains all words that the analyzer should leave unchanged when indexing; separate the words with blanks.
   +-->
    <exclusionList></exclusionList>

  <!--
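   | The names of the fields of which to prefetch the distinct values.
   | Separate the field names by a blank.
   | Translation: The names of the fields whose distinct values should be prefetched; separate the names with blanks.
   | Put in the names of the fields you use a search:input_fieldlist tag for.
   | The values shown in the list will then be extracted by the crawler and not
   | by the search mask, which prevents a slow first loading of a page for huge
   | indexes.
   | Translation: List the fields for which you use a search:input_fieldlist tag. The values shown in that list
   | are then extracted by the crawler rather than by the search mask, which keeps the first page load from
   | being slow on very large indexes.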
   +-->
    <valuePrefetchFields>mimetype</valuePrefetchFields>
  
  <!--
   | Specifies whether the whole content should be stored in the index for the
   | purpose of a content preview
   | Translation: Specifies whether the full content should be stored in the index so that a content preview can be shown.
   +-->
    <storeContentForPreview>true</storeContentForPreview>
    
  </searchIndex>


<!--
 | The preparators in the order they should be applied. Preparators that aren't listed
 | here will be applied after the listed ones.
 | Translation: The preparators are applied in the order listed here; preparators that are not listed are applied after the listed ones.
 | You can use this list...
 |   ... to define the priority (= order) of the preparators
 |   ... to disable preparators
 |   ... to configure preparators
 | Translation: You can use this list to:
 |   ... define the priority (= order) of the preparators
 |   ... disable preparators
 |   ... configure preparators
 +-->
  <preparatorList>
  <!--
   | Enable this preparator if you want to use the text extractor of
   | Microsoft Windows. This preparator is able to read tons of file formats.
   | Translation: Enable this preparator if you want to use the text extractor of Microsoft Windows; it can read a great many file formats.
   | NOTE: Under Windows 2000 you have to make sure that reg.exe is installed
   |       (It's part of the "Support Tools").
   |       For details see: http://support.microsoft.com/kb/301423
   | Translation: Note that under Windows 2000 you must make sure reg.exe is installed (it is part of the "Support Tools").
   | For details see http://support.microsoft.com/kb/301423
   +-->
    <preparator enabled="false">
      <class>.IfilterPreparator</class>
    </preparator>

  <!--
   | Enable this preparator if you want to use MS Excel for indexing your Excel
   | documents.
   | Translation: Enable this preparator if you want MS Excel to be used for indexing your Excel documents.
   +-->
    <preparator enabled="false">
      <class>.JacobMsExcelPreparator</class>
    </preparator>
  
  <!--
   | Enable this preparator if you want to use MS Word for indexing your Word
   | documents.
   | Translation: Enable this preparator if you want MS Word to be used for indexing your Word documents.
   +-->
    <preparator enabled="false">
      <class>.JacobMsWordPreparator</class>
    </preparator>
  
  <!--
   | Enable this preparator if you want to use MS Powerpoint for indexing your
   | Powerpoint documents.
   | Translation: Enable this preparator if you want MS Powerpoint to be used for indexing your Powerpoint documents.
   +-->
    <preparator enabled="false">
      <class>.JacobMsPowerPointPreparator</class>
    </preparator>

  <!--
   | This tells regain that it should first try the SimpleRtfPreparator for RTF
   | files. Only if this one fails the SwingRtfPreparator is used
   | (which is much slower).
   | Translation: This tells regain to try the SimpleRtfPreparator first for RTF files; the SwingRtfPreparator,
   | which is much slower, is used only if the SimpleRtfPreparator fails.
   +-->
    <preparator>
      <class>.SimpleRtfPreparator</class>
    </preparator>
    <preparator>
      <class>.SwingRtfPreparator</class>
    </preparator>

  <!--
   | This preparator may be used if you have an external program that can
   | extract text. It's disabled by default.
   | Translation: This preparator can be used if you have an external program that extracts text; it is disabled by default.
   +-->
    <preparator enabled="false">
      <class>.ExternalPreparator</class>
      <config>
        <section name="command">
          <param name="urlPattern">\.ps$</param>
          <param name="commandLine">ps2ascii ${filename}</param>
          <param name="checkExitCode">false</param>
        </section>
      </config>
    </preparator>

  <!-- 
  CatchAll-preparator on basis of EmptyPreparator
  Translation: A catch-all preparator built on the EmptyPreparator.
  -->
    <preparator priority="-10">
      <class>.EmptyPreparator</class>
      <urlPattern>.*</urlPattern>
    </preparator>
  </preparatorList>


<!--
 | The index may be extended with auxiliary fields. These are fields that have
 | been generated from the URL of a document.
 | Translation: The index can be extended with auxiliary fields, which are generated from a document's URL.
 | Example: If you have a directory with a sub directory for every project,
 | then you may create a field with the project's name.
 | Translation: For example, if you have a directory with one sub directory per project, you can create a field holding the project's name.
 | The following tag will create a field "project" with the value "otto23"
 | from the URL "file://c:/projects/otto23/docs/Spez.doc":
 | Translation: The following tag creates a field named "project" with the value "otto23"
 |   from the URL "file://c:/projects/otto23/docs/Spez.doc".
 |   <auxiliaryField name="project" regexGroup="1">
 |     <regex>^file://c:/projects/([^/]*)</regex>
 |   </auxiliaryField>
 |
 | URLs that don't match will get no "project" field.
 | Translation: URLs that do not match get no "project" field.
 | Having done this you may search for "Offer project:otto23" and you will get
 | only hits from this project directory.
 | Translation: Once this is set up, you can search for "Offer project:otto23" and get hits from this project directory only.
 +-->
  <auxiliaryFieldList>
  <!--
   Don't change these two fields. But you may add your own. 
  Translation: Do not change these two fields, but you may add your own.
  -->
    <auxiliaryField name="extension" regexGroup="1" toLowercase="true">
      <regex>\.([^\.]*)$</regex>
    </auxiliaryField>
    <auxiliaryField name="location" regexGroup="1" store="false" tokenize="true">
      <regex>^(.*)$</regex>
    </auxiliaryField>
    <auxiliaryField name="mimetype" regexGroup="1" >
      <regex>^()$</regex>
    </auxiliaryField>
  </auxiliaryFieldList>


<!-- The regular expressions that identify URLs in HTML. -->
<!-- This configuration part is no longer necessary -->
<!--htmlParserPatternList>
  <pattern parse="true" index="true" regexGroup="1">="([^"]*(/|htm|html|jsp|php\d?|asp))"</pattern>
  <pattern parse="false" index="false" regexGroup="1">="([^"]*\.(js|css|jpg|gif|png))"</pattern>
  <pattern parse="false" index="true" regexGroup="1">="([^"]*\.[^\."]{3})"</pattern>
</htmlParserPatternList-->
</configuration>
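
As an illustration of the basic sections above, here is a hedged sketch of a file system crawl assembled from the file's own commented examples. The `c:/projects` path and the `project` field come from the comment on auxiliary fields, and the `/backup/` rule from the commented blacklist entry. None of this was part of the author's running setup, and the remaining sections are assumed to stay as listed.

```xml
<configuration>
  <!-- Start at the projects folder; attributes as in the file's directory example -->
  <startlist>
    <start parse="true" index="false">file://c:/projects</start>
  </startlist>

  <!-- Only URLs that begin with this prefix are processed -->
  <whitelist>
    <prefix>file://c:/projects</prefix>
  </whitelist>

  <!-- Skip URLs whose last path segment sits directly inside a folder named backup -->
  <blacklist>
    <regex>/backup/[^/]*$</regex>
  </blacklist>

  <!-- searchIndex, preparatorList and the other sections are omitted here; keep them as in the listing -->

  <auxiliaryFieldList>
    <!-- The extension, location and mimetype fields from the listing stay in this list as well -->
    <auxiliaryField name="project" regexGroup="1">
      <regex>^file://c:/projects/([^/]*)</regex>
    </auxiliaryField>
  </auxiliaryFieldList>
</configuration>
```

With this in place, a query such as `Offer project:otto23` would, according to the file's own comment, return only hits from the `c:/projects/otto23` directory.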


Below is SearchConfiguration.xml:
<?xml version="1.0" encoding="GBK"?>

<!DOCTYPE configuration [
  <!ENTITY amp "&#x26;">
  <!ENTITY lt "&#x3C;">
]>

<!--
 | Configuration for the regain search mask.
 | Translation: The configuration file for the regain search mask.
 |
 | Normally you only have to specify the directory where the search index is
 | located. You do this in the <dir> tag of the <index name="main"> (line 74).
 | Translation: Normally you only need to specify the directory containing the search index; in this file you do so
 | in the <dir> tag under <index name="main">.

 | You can find a detailed description of all configuration tags here:
 | Translation: A detailed description of all configuration tags is available at the URL below.
 | http://regain.murfman.de/wiki/en/index.php/SearchConfiguration.xml

 +-->
<configuration>

  <!-- The search indexes -->
  <indexList>
    <!--
     | All settings defined in this section are applied to all indexes unless
     | they redefine the setting.
     | Translation: Every setting defined in this section applies to all indexes unless an index redefines it.
     +-->
    <defaultSettings>
    <!--
         1. <defaultSettings>: The cascaded default settings
         2. <index>: The settings for one index.
    	 
   	-->
      <!--
       | The regular expression that identifies URLs that should be opened in
       | a new window.
       | Translation: The regular expression identifying URLs that should be opened in a new window.
       +-->
      <openInNewWindowRegex>.(pdf|rtf|doc|xls|ppt)$</openInNewWindowRegex>
      
      <!--
       | Specifies whether the file-to-http-bridge should be used for file-URLs.
       | Translation: Specifies whether the file-to-http-bridge is used for file URLs.
       | Mozilla browsers have a security mechanism that blocks loading file-URLs
       | from pages loaded via http. To be able to load files from the search
       | results, regain offers the file-to-http-bridge that provides all files that
       | are listed in the index via http.
       | Translation: Mozilla browsers have a security mechanism that blocks loading file URLs from pages that were
       | themselves loaded via http. So that files can still be opened from the search results, regain offers the
       | file-to-http-bridge, which serves every file listed in the index via http.
       +-->
      <useFileToHttpBridge>true</useFileToHttpBridge>
      
      <!--
       | The index fields to search by default.
       | Translation: The index fields that are searched by default.
       | NOTE: The user may search in other fields also using the
       | "field:"-operator. Read the lucene query syntax for details:
       | http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
       | Translation: Note that the user can also search in other fields with the "field:" operator; see the lucene
       | query syntax for details: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
       +-->
      <searchFieldList>content title headlines location filename</searchFieldList>
      <!--
       | The SearchAccessController to use.
       | Translation: The SearchAccessController to use.
       | This is a part of the access control system that ensures that only those
       | documents are shown in the search results that the user is allowed to
       | read.
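       | Translation: This is part of the access control system, which ensures that only documents the user is allowed to read appear in the search results.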
       | If you specify a SearchAccessController, don't forget to specify the
       | CrawlerAccessController counterpart in the CrawlerConfiguration.xml!
       | Translation: If you specify a SearchAccessController, be sure to also specify its CrawlerAccessController
       | counterpart in CrawlerConfiguration.xml.
       +-->
      <!--
      <searchAccessController>
        <class jar="myAccess.jar">mypackage.MySearchAccessController</class>
        <config>
          <param name="bla">blubb</param>
        </config>
      </searchAccessController>
      -->
      <!--
       |
       | Specifies whether the search terms should be highlighted within the
       | search results (summary, title)
       | Translation: Specifies whether the search terms are highlighted in the search results (summary, title).
       +-->
      <Highlighting>true</Highlighting>
      
    </defaultSettings>
    
    <!-- The search index 'main' -->
    <index name="main" default="true" isparent="true">
      <!-- 
      The directory where the index is located 
      Translation: The directory where the index is stored.
      -->
      <dir>${SEARCHDIR}searchindex</dir>
    </index>
    <!-- 
     | A child index of 'main' 
     | Translation: A child index of 'main'.
     +-->
    <!-- 
    <index name="main1" default="true" isparent="false" parent="main">
      <dir>searchindex_1</dir>
    </index>
    -->
    
    <!-- The search index 'example' (an example) -->
    <index name="example">
      <!-- The directory where the index is located -->
      <dir>c:\Temp\searchindex_example</dir>
      
      <rewriteRules>
        <rule prefix="file://c:/example/www-data" replacement="http://www.mydomain.de"/>
      </rewriteRules>
    </index>
  </indexList>

</configuration>
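
One detail worth noting in `<openInNewWindowRegex>`: the leading `.` is unescaped, so as a regular expression it stands for any character rather than a literal dot. Assuming the pattern is applied as a search within the URL, a URL ending in `mydoc` would therefore match as well. If only real file extensions should open a new window, a stricter variant might look like the sketch below. This is a suggestion, not the value used by the author.

```xml
<defaultSettings>
  <!-- Escaped dot: only URLs ending in a literal ".pdf", ".rtf", ".doc", ".xls" or ".ppt" match -->
  <openInNewWindowRegex>\.(pdf|rtf|doc|xls|ppt)$</openInNewWindowRegex>
</defaultSettings>
```

The other entries of `<defaultSettings>` from the listing are left out of this fragment and would stay unchanged.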

