edwardpro

Lucene 2.4 Changes Study: Preface


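I upgraded Lucene to 2.4 over the past few days and found that a lot has changed. The most obvious differences:

1. Hits is gone (deprecated; see LUCENE-1290 under API Changes below).

2. The copy function has changed.

I am sure there are many more, and reading the changelog will show them, so I am curious to go through the source and see what changed.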

  • Changes in backwards compatibility policy   (1)
    1. LUCENE-1340: In a minor change to Lucene's backward compatibility policy, we are now allowing the Fieldable interface to have changes, within reason, and made on a case-by-case basis. If an application implements its own Fieldable, please be aware of this. Otherwise, no need to be concerned. This is in effect for all 2.X releases, starting with 2.4. Also note that, in all likelihood, Fieldable will be changed in 3.0.
  • Changes in runtime behavior   (4)
    1. LUCENE-1151: Fix StandardAnalyzer to not mis-identify host names (eg lucene.apache.org) as an ACRONYM. To get back to the pre-2.4 backwards compatible, but buggy, behavior, you can either call StandardAnalyzer.setDefaultReplaceInvalidAcronym(false) (static method), or, set system property org.apache.lucene.analysis.standard.StandardAnalyzer.replaceInvalidAcronym to "false" on JVM startup. All StandardAnalyzer instances created after that will then show the pre-2.4 behavior. Alternatively, you can call setReplaceInvalidAcronym(false) to change the behavior per instance of StandardAnalyzer. This backwards compatibility will be removed in 3.0 (hardwiring the value to true).
      (Mike McCandless)
    2. LUCENE-1044: IndexWriter with autoCommit=true now commits (such that a reader can see the changes) far less often than it used to. Previously, every flush was also a commit. You can always force a commit by calling IndexWriter.commit(). Furthermore, in 3.0, autoCommit will be hardwired to false (IndexWriter constructors that take an autoCommit argument have been deprecated).
      (Mike McCandless)
    3. LUCENE-1335: IndexWriter.addIndexes(Directory[]) and addIndexesNoOptimize no longer allow the same Directory instance to be passed in more than once. Internally, IndexWriter uses Directory and segment name to uniquely identify segments, so adding the same Directory more than once was causing duplicates which led to problems
      (Mike McCandless)
    4. LUCENE-1396: Improve PhraseQuery.toString() so that gaps in the positions are indicated with a ? and multiple terms at the same position are joined with a |.
      (Andrzej Bialecki via Mike McCandless)
  • API Changes   (26)
    1. LUCENE-1084: Changed all IndexWriter constructors to take an explicit parameter for maximum field size. Deprecated all the pre-existing constructors; these will be removed in release 3.0. NOTE: these new constructors set autoCommit to false.
      (Steven Rowe via Mike McCandless)
    2. LUCENE-584: Changed Filter API to return a DocIdSet instead of a java.util.BitSet. This allows using more efficient data structures for Filters and makes them more flexible. This deprecates Filter.bits(), so all filters that implement this outside the Lucene code base will need to be adapted. See also the javadocs of the Filter class.
      (Paul Elschot, Michael Busch)
    3. LUCENE-1044: Added IndexWriter.commit() which flushes any buffered adds/deletes and then commits a new segments file so readers will see the changes. Deprecate IndexWriter.flush() in favor of IndexWriter.commit().
      (Mike McCandless)
    4. LUCENE-325: Added IndexWriter.expungeDeletes methods, which consult the MergePolicy to find merges necessary to merge away all deletes from the index. This should be a somewhat lower cost operation than optimize.
      (John Wang via Mike McCandless)
    5. LUCENE-1233: Return empty array instead of null when no fields match the specified name in these methods in Document: getFieldables, getFields, getValues, getBinaryValues.
      (Stefan Trcek via Mike McCandless)
    6. LUCENE-1234: Make BoostingSpanScorer protected.
      (Andi Vajda via Grant Ingersoll)
    7. LUCENE-510: The index now stores strings as true UTF-8 bytes (previously it was Java's modified UTF-8). If any text, either stored fields or a token, has illegal UTF-16 surrogate characters, these characters are now silently replaced with the Unicode replacement character U+FFFD. This is a change to the index file format.
      (Marvin Humphrey via Mike McCandless)
    8. LUCENE-852: Let the SpellChecker caller specify IndexWriter mergeFactor and RAM buffer size.
      (Otis Gospodnetic)
    9. LUCENE-1290: Deprecate org.apache.lucene.search.Hits, Hit and HitIterator and remove all references to these classes from the core. Also update demos and tutorials.
      (Michael Busch)
    10. LUCENE-1288: Add getVersion() and getGeneration() to IndexCommit. getVersion() returns the same value that IndexReader.getVersion() returns when the reader is opened on the same commit.
      (Jason Rutherglen via Mike McCandless)
    11. LUCENE-1311: Added IndexReader.listCommits(Directory) static method to list all commits in a Directory, plus IndexReader.open methods that accept an IndexCommit and open the index as of that commit. These methods are only useful if you implement a custom DeletionPolicy that keeps more than the last commit around.
      (Jason Rutherglen via Mike McCandless)
    12. LUCENE-1325: Added IndexCommit.isOptimized().
      (Shalin Shekhar Mangar via Mike McCandless)
    13. LUCENE-1324: Added TokenFilter.reset().
      (Shai Erera via Mike McCandless)
    14. LUCENE-1340: Added Fieldable.omitTf() method to skip indexing term frequency, positions and payloads. This saves index space, and indexing/searching time.
      (Eks Dev via Mike McCandless)
    15. LUCENE-1219: Add basic reuse API to Fieldable for binary fields: getBinaryValue/Offset/Length(); currently only lazy fields reuse the provided byte[] result to getBinaryValue.
      (Eks Dev via Mike McCandless)
    16. LUCENE-1334: Add new constructor for Term: Term(String fieldName) which defaults term text to "".
      (DM Smith via Mike McCandless)
    17. LUCENE-1333: Added Token.reinit(*) APIs to re-initialize (reuse) a Token. Also added term() method to return a String, with a performance penalty clearly documented. Also implemented hashCode() and equals() in Token, and fixed all core and contrib analyzers to use the re-use APIs.
      (DM Smith via Mike McCandless)
    18. LUCENE-1329: Add optional readOnly boolean when opening an IndexReader. A readOnly reader is not allowed to make changes (deletions, norms) to the index; in exchange, the isDeleted method, often a bottleneck when searching with many threads, is not synchronized. The default for readOnly is still false, but in 3.0 the default will become true.
      (Jason Rutherglen via Mike McCandless)
    19. LUCENE-1367: Add IndexCommit.isDeleted().
      (Shalin Shekhar Mangar via Mike McCandless)
    20. LUCENE-1061: Factored out all "new XXXQuery(...)" in QueryParser.java into protected methods newXXXQuery(...) so that subclasses can create their own subclasses of each Query type.
      (John Wang via Mike McCandless)
    21. LUCENE-753: Added new Directory implementation org.apache.lucene.store.NIOFSDirectory, which uses java.nio's FileChannel to do file reads. On most non-Windows platforms, with many threads sharing a single searcher, this may yield sizable improvement to query throughput when compared to FSDirectory, which only allows a single thread to read from an open file at a time.
      (Jason Rutherglen via Mike McCandless)
    22. LUCENE-1371: Added convenience method TopDocs Searcher.search(Query query, int n).
      (Mike McCandless)
    23. LUCENE-1356: Allow easy extensions of TopDocCollector by turning constructor and fields from package to protected.
      (Shai Erera via Doron Cohen)
    24. LUCENE-1375: Added convenience method IndexCommit.getTimestamp, which is equivalent to getDirectory().fileModified(getSegmentsFileName()).
      (Mike McCandless)
    25. LUCENE-1366: Rename Field.Index options to be more accurate: TOKENIZED becomes ANALYZED; UN_TOKENIZED becomes NOT_ANALYZED; NO_NORMS becomes NOT_ANALYZED_NO_NORMS and a new ANALYZED_NO_NORMS is added.
      (Mike McCandless)
    26. LUCENE-1131: Added numDeletedDocs method to IndexReader
      (Otis Gospodnetic)
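
To make the LUCENE-510 entry above more concrete, here is a minimal sketch in plain Java. It is not Lucene's code: the class and method names (Utf8Demo, standardUtf8Length, modifiedUtf8Length, replaceUnpairedSurrogates) are invented for illustration. It compares the byte length of a string in standard UTF-8 with its length in Java's modified UTF-8, and shows one possible way to swap unpaired surrogates for U+FFFD.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only; none of these names exist in Lucene.
final class Utf8Demo {

    // Number of bytes the string occupies in standard ("true") UTF-8.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in Java's modified UTF-8, as written by
    // DataOutputStream.writeUTF. writeUTF prepends a 2-byte length,
    // which is subtracted here.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size() - 2;
    }

    // Replace every surrogate that is not part of a valid high/low pair
    // with the Unicode replacement character U+FFFD.
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                sb.append(c).append(s.charAt(i + 1)); // valid pair, keep both
                i++;
            } else if (Character.isSurrogate(c)) {
                sb.append('\uFFFD'); // lone surrogate
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One supplementary code point (U+1D11E), stored as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(standardUtf8Length(clef)); // prints 4
        System.out.println(modifiedUtf8Length(clef)); // prints 6

        // The NUL character is the other case where the two encodings differ.
        String nul = String.valueOf((char) 0);
        System.out.println(standardUtf8Length(nul)); // prints 1
        System.out.println(modifiedUtf8Length(nul)); // prints 2
    }
}
```

The two encodings agree for most text; they differ only for NUL and for characters outside the Basic Multilingual Plane, which is why the changelog flags this as an index file format change.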
  • Bug fixes   (16)
    1. LUCENE-1134: Fixed BooleanQuery.rewrite to only optimize a single clause query if minNumShouldMatch<=0.
      (Shai Erera via Michael Busch)
    2. LUCENE-1169: Fixed bug in IndexSearcher.search(): searching with a filter might miss some hits because scorer.skipTo() is called without checking if the scorer is already at the right position. scorer.skipTo(scorer.doc()) is not a NOOP, it behaves as scorer.next().
      (Eks Dev, Michael Busch)
    3. LUCENE-1182: Added scorePayload to SimilarityDelegator
      (Andi Vajda via Grant Ingersoll)
    4. LUCENE-1213: MultiFieldQueryParser was ignoring slop in case of a single field phrase.
      (Trejkaz via Doron Cohen)
    5. LUCENE-1228: IndexWriter.commit() was not updating the index version and as a result IndexReader.reopen() failed to sense index changes.
      (Doron Cohen)
    6. LUCENE-1267: Added numDocs() and maxDoc() to IndexWriter; deprecated docCount().
      (Mike McCandless)
    7. LUCENE-1274: Added new prepareCommit() method to IndexWriter, which does phase 1 of a 2-phase commit (commit() does phase 2). This is needed when you want to update an index as part of a transaction involving external resources (eg a database). Also deprecated abort(), renaming it to rollback().
      (Mike McCandless)
    8. LUCENE-1003: Stop RussianAnalyzer from removing numbers.
      (TUSUR OpenTeam, Dmitry Lihachev via Otis Gospodnetic)
    9. LUCENE-1152: SpellChecker fix around clearIndex and indexDictionary methods, plus removal of IndexReader reference.
      (Naveen Belkale via Otis Gospodnetic)
    10. LUCENE-1046: Removed dead code in SpellChecker
      (Daniel Naber via Otis Gospodnetic)
    11. LUCENE-1189: Fixed the QueryParser to handle escaped characters within quoted terms correctly.
      (Tomer Gabel via Michael Busch)
    12. LUCENE-1299: Fixed NPE in SpellChecker when IndexReader is not null and field is
      (Grant Ingersoll)
    13. LUCENE-1303: Fixed BoostingTermQuery's explanation to be marked as a Match depending only upon the non-payload score part, regardless of the effect of the payload on the score. Prior to this, score of a query containing a BTQ differed from its explanation.
      (Doron Cohen)
    14. LUCENE-1310: Fixed SloppyPhraseScorer to work also for terms repeating more than twice in the query.
      (Doron Cohen)
    15. LUCENE-1351: ISOLatin1AccentFilter now cleans additional ligatures
      (Cedrik Lime via Grant Ingersoll)
    16. LUCENE-1383: Work around a nasty "leak" in Java's builtin ThreadLocal, to prevent Lucene from causing unexpected OutOfMemoryError in certain situations (notably J2EE applications).
      (Chris Lu via Mike McCandless)
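
The two-phase commit mentioned in LUCENE-1274 above can be sketched without Lucene at all. The code below is an assumption-laden illustration: TwoPhaseResource, TwoPhaseCoordinator and RecordingResource are hypothetical types whose method names merely mirror prepareCommit(), commit() and rollback() from the changelog. It shows the general pattern of preparing every participant first and committing only if all of them succeed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical coordinator; not part of Lucene.
final class TwoPhaseCoordinator {

    // Phase 1: ask every participant to prepare. Phase 2: commit all of them.
    // If any prepare fails, every participant is rolled back instead.
    static boolean commitAll(List<TwoPhaseResource> resources) {
        try {
            for (TwoPhaseResource r : resources) {
                r.prepareCommit();
            }
        } catch (Exception e) {
            for (TwoPhaseResource r : resources) {
                r.rollback();
            }
            return false;
        }
        for (TwoPhaseResource r : resources) {
            r.commit();
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<TwoPhaseResource> resources = Arrays.<TwoPhaseResource>asList(
                new RecordingResource("index", false, log),
                new RecordingResource("db", false, log));
        System.out.println(commitAll(resources)); // prints true
        System.out.println(log); // prints [index:prepare, db:prepare, index:commit, db:commit]
    }
}

// Hypothetical participant interface (an index writer, a database, ...).
interface TwoPhaseResource {
    void prepareCommit() throws Exception; // phase 1: the step that may fail
    void commit();                         // phase 2: make changes visible
    void rollback();                       // discard pending changes
}

// Test double that records the calls it receives.
final class RecordingResource implements TwoPhaseResource {
    private final String name;
    private final boolean failOnPrepare;
    private final List<String> log;

    RecordingResource(String name, boolean failOnPrepare, List<String> log) {
        this.name = name;
        this.failOnPrepare = failOnPrepare;
        this.log = log;
    }

    @Override
    public void prepareCommit() throws Exception {
        if (failOnPrepare) {
            throw new Exception(name + " cannot prepare");
        }
        log.add(name + ":prepare");
    }

    @Override
    public void commit() {
        log.add(name + ":commit");
    }

    @Override
    public void rollback() {
        log.add(name + ":rollback");
    }
}
```

With a real IndexWriter the same shape would apply, with the writer playing the role of one participant next to the external resource.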
  • New features   (20)
    1. LUCENE-1137: Added Token.set/getFlags() accessors for passing more information about a Token through the analysis process. The flag is not indexed/stored and is thus only used by analysis.
    2. LUCENE-1147: Add -segment option to CheckIndex tool so you can check only a specific segment or segments in your index.
      (Mike McCandless)
    3. LUCENE-1045: Reopened this issue to add support for short and bytes.
    4. LUCENE-584: Added new data structures to o.a.l.util, such as OpenBitSet and SortedVIntList. These extend DocIdSet and can directly be used for Filters with the new Filter API. Also changed the core Filters to use OpenBitSet instead of java.util.BitSet.
      (Paul Elschot, Michael Busch)
    5. LUCENE-494: Added QueryAutoStopWordAnalyzer to allow for the automatic removal, from a query, of frequently occurring terms. This Analyzer is not intended for use during indexing.
      (Mark Harwood via Grant Ingersoll)
    6. LUCENE-1044: Change Lucene to properly "sync" files after committing, to ensure on a machine or OS crash or power cut, even with cached writes, the index remains consistent. Also added explicit commit() method to IndexWriter to force a commit without having to close.
      (Mike McCandless)
    7. LUCENE-997: Add search timeout (partial) support. A TimeLimitedCollector was added to allow limiting search time. It is a partial solution since timeout is checked only when collecting a hit, and therefore a search for rare words in a huge index might not stop within the specified time.
      (Sean Timm via Doron Cohen)
    8. LUCENE-1184: Allow SnapshotDeletionPolicy to be re-used across close/re-open of IndexWriter while still protecting an open snapshot
      (Tim Brennan via Mike McCandless)
    9. LUCENE-1194: Added IndexWriter.deleteDocuments(Query) to delete documents matching the specified query. Also added static unlock and isLocked methods (deprecating the ones in IndexReader).
      (Mike McCandless)
    10. LUCENE-1201: Add IndexReader.getIndexCommit() method.
      (Tim Brennan via Mike McCandless)
    11. LUCENE-550: Added InstantiatedIndex implementation. Experimental Index store similar to MemoryIndex but allows for multiple documents in memory.
      (Karl Wettin via Grant Ingersoll)
    12. LUCENE-400: Added word based n-gram filter (in contrib/analyzers) called ShingleFilter and an Analyzer wrapper that wraps another Analyzer's token stream with a ShingleFilter
      (Sebastian Kirsch, Steve Rowe via Grant Ingersoll)
    13. LUCENE-1166: Decomposition tokenfilter for languages like German and Swedish
      (Thomas Peuss via Grant Ingersoll)
    14. LUCENE-1187: ChainedFilter and BooleanFilter now work with new Filter API and DocIdSetIterator-based filters. Backwards-compatibility with old BitSet-based filters is ensured.
      (Paul Elschot via Michael Busch)
    15. LUCENE-1295: Added new method to MoreLikeThis for retrieving interesting terms and made retrieveTerms(int) public.
      (Grant Ingersoll)
    16. LUCENE-1298: MoreLikeThis can now accept a custom Similarity
      (Grant Ingersoll)
    17. LUCENE-1297: Allow other string distance measures for the SpellChecker
      (Thomas Morton via Otis Gospodnetic)
    18. LUCENE-1001: Provide access to Payloads via Spans. All existing Span Query implementations in Lucene implement.
      (Mark Miller, Grant Ingersoll)
    19. LUCENE-1354: Provide programmatic access to CheckIndex
      (Grant Ingersoll, Mike McCandless)
    20. LUCENE-1279: Add support for Collators to RangeFilter/Query and Query Parser.
      (Steve Rowe via Grant Ingersoll)
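
The LUCENE-997 entry above calls the search timeout "partial" because the clock is consulted only when a hit is collected. A small sketch may make that clearer. It does not use Lucene's TimeLimitedCollector: SimpleCollector, TimeLimitedSketch and TimeExceeded are hypothetical names, and the injectable clock is an assumption added here so the behavior can be exercised deterministically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical wrapper that enforces a time budget around another collector.
final class TimeLimitedSketch implements SimpleCollector {
    private final SimpleCollector delegate;
    private final LongSupplier clockMillis;
    private final long deadlineMillis;

    TimeLimitedSketch(SimpleCollector delegate, long timeAllowedMillis,
                      LongSupplier clockMillis) {
        this.delegate = delegate;
        this.clockMillis = clockMillis;
        this.deadlineMillis = clockMillis.getAsLong() + timeAllowedMillis;
    }

    // The deadline is checked here and nowhere else. If the search spends a
    // long time without producing a hit, this method is never called and the
    // timeout cannot fire, which is the limitation the changelog describes.
    @Override
    public void collect(int doc, float score) {
        if (clockMillis.getAsLong() > deadlineMillis) {
            throw new TimeExceeded("time budget exceeded at doc " + doc);
        }
        delegate.collect(doc, score);
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        SimpleCollector limited = new TimeLimitedSketch(
                (doc, score) -> hits.add(doc), 1000L, System::currentTimeMillis);
        limited.collect(7, 1.0f); // well inside the one-second budget
        System.out.println(hits);
    }
}

// Hypothetical minimal hit callback; not Lucene's HitCollector class.
interface SimpleCollector {
    void collect(int doc, float score);
}

// Thrown when a hit arrives after the deadline.
final class TimeExceeded extends RuntimeException {
    TimeExceeded(String message) {
        super(message);
    }
}
```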
  • Optimizations   (6)
    1. LUCENE-705: When building a compound file, use RandomAccessFile.setLength() to tell the OS/filesystem to pre-allocate space for the file. This may improve fragmentation in how the CFS file is stored, and allows us to detect an upcoming disk full situation before actually filling up the disk.
      (Mike McCandless)
    2. LUCENE-1120: Speed up merging of term vectors by bulk-copying the raw bytes for each contiguous range of non-deleted documents.
      (Mike McCandless)
    3. LUCENE-1185: Avoid checking if the TermBuffer 'scratch' in SegmentTermEnum is null for every call of scanTo().
      (Christian Kohlschuetter via Michael Busch)
    4. LUCENE-1217: Internal to Field.java, use isBinary instead of runtime type checking for possible speedup of binaryValue().
      (Eks Dev via Mike McCandless)
    5. LUCENE-1183: Optimized TRStringDistance class (in contrib/spell) that uses less memory than the previous version.
      (Cédrik LIME via Otis Gospodnetic)
    6. LUCENE-1195: Improve term lookup performance by adding a LRU cache to the TermInfosReader. In performance experiments the speedup was about 25% on average on mid-size indexes with ~500,000 documents for queries with 3 terms and about 7% on larger indexes with ~4.3M documents.
      (Michael Busch)
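
LUCENE-1195 above speeds up term lookup with an LRU cache. As a rough illustration of the idea only, and not the cache that TermInfosReader actually uses, a least-recently-used cache can be built on java.util.LinkedHashMap in access order. The class name LruCache and the capacity value are assumptions made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache; not Lucene's implementation.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least recently
        // accessed to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // evicts "b", the least recently used entry
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

This class is not thread-safe; a cache shared between search threads would need external synchronization or a per-thread instance.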
Comments
#1 fuwang 2008-11-24
Looks like 2.4 changes too much; I will keep using 2.3 for now.
