[Summary] Solr Search Service
1, Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary results.
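Since everything goes over HTTP, any client can talk to Solr. Below is a minimal SolrJ round-trip sketch; the SolrJ 5.x-era `HttpSolrClient` constructor, the URL, and the core name `collection1` are illustrative assumptions, not something these notes prescribe.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrRoundTrip {
    public static void main(String[] args) throws Exception {
        // Assumed URL and core name; adjust to your installation.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        // "Indexing": put a document in over HTTP.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book-1");
        doc.addField("title", "Solr in a Nutshell");
        client.add(doc);
        client.commit(); // make the document visible to searches

        // Query via HTTP GET; the response arrives as JSON/XML/binary.
        QueryResponse rsp = client.query(new SolrQuery("title:solr"));
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("id") + " -> " + d.getFieldValue("title"));
        }
        client.close();
    }
}
```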
2, Solr Administration User Interface
- Logging
- Cloud Screens
- Core Admin
- Java Properties
- Thread Dump
- Core-Specific Tools
  - Analysis Screen
  - Dataimport Screen
  - Documents Screen
  - Files Screen
  - Ping
  - Plugins & Stats Screen
  - Query Screen
  - Replication Screen
  - Schema Browser Screen
  - Segments Info
3, Documents, Fields, Schema Design
- Field Properties (see the Lucene-level sketch at the end of this section)
  - indexed
  - stored
  - docValues
  - sortMissingFirst / sortMissingLast
  - multiValued
  - omitNorms
  - omitTermFreqAndPositions
  - omitPositions
  - termVectors / termPositions / termOffsets / termPayloads
  - required
- Field Types
  - BinaryField
  - BoolField
  - CollationField
  - CurrencyField
  - DateRangeField
  - ExternalFileField
  - EnumField
  - LatLonType
  - PointType
  - TextField
  - StrField
  - TrieField
  - TrieInt/Long/FloatField
  - UUIDField
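The field properties above are index-time options that Solr ultimately passes down to Lucene. As a rough illustration, a sketch against the Lucene 5.x `FieldType` API (the version and the sample field are assumptions), the same knobs appear directly on Lucene's `FieldType`:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class FieldFlags {
    public static void main(String[] args) {
        FieldType t = new FieldType();
        t.setStored(true);                   // stored: the original value is retrievable
        t.setTokenized(true);                // run the field value through an analyzer
        t.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); // indexed; cf. omitTermFreqAndPositions
        t.setOmitNorms(false);               // omitNorms would drop length normalization data
        t.setStoreTermVectors(true);         // termVectors (needed e.g. by the FastVector Highlighter)
        t.setStoreTermVectorPositions(true); // termPositions
        t.setStoreTermVectorOffsets(true);   // termOffsets
        t.freeze();                          // make the type immutable before use

        Document doc = new Document();
        doc.add(new Field("title", "A Clash of Kings", t));
        System.out.println(doc);
    }
}
```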
4, Analyzers, Tokenizers and Filters
- Analyzers
  - An analyzer examines the text of fields and generates a token stream
  - Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file
  - Examples
    - WhitespaceAnalyzer
    - SimpleAnalyzer
    - StopAnalyzer
    - StandardAnalyzer
- Tokenizers
  - The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text
  - An analyzer is aware of the field it is configured for, but a tokenizer is not
  - Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream)
  - You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>
  - Examples
    - WhitespaceTokenizer
    - KeywordTokenizer
    - LetterTokenizer
    - StandardTokenizer
- Filters
  - Like tokenizers, filters consume input and produce a stream of tokens
  - Filters also derive from org.apache.lucene.analysis.TokenStream
  - Unlike tokenizers, a filter's input is another TokenStream. The job of a filter is usually easier than that of a tokenizer, since in most cases a filter looks at each token in the stream sequentially and decides whether to pass it along, replace it or discard it
  - A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although this is less common
  - One hypothetical use for such a filter might be to normalize state names that would be tokenized as two words. For example, the single token "california" would be replaced with "CA", while the token pair "rhode" followed by "island" would become the single token "RI"
  - Examples (exercised in the sketch after this section)
    - LowerCaseFilter
    - StopFilter
    - PorterStemFilter
    - ASCIIFoldingFilter
    - StandardFilter
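A schema.xml analyzer chain (one tokenizer plus filters) can also be built programmatically. Below is a sketch using Lucene's `CustomAnalyzer`, available since Lucene 5.0; the factory names `standard`, `lowercase`, `stop`, and `porterstem` are the SPI names of the corresponding factories, so treat the exact spellings as an assumption for your Lucene version.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisChainDemo {
    public static void main(String[] args) throws Exception {
        // Equivalent to <analyzer><tokenizer .../><filter .../>...</analyzer> in schema.xml:
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")     // StandardTokenizer: break text into tokens
                .addTokenFilter("lowercase")   // LowerCaseFilter
                .addTokenFilter("stop")        // StopFilter with the default stopword set
                .addTokenFilter("porterstem")  // PorterStemFilter
                .build();

        // Run the chain and print each token that survives the filters.
        try (TokenStream ts = analyzer.tokenStream("body", "The Riders were riding near Rhode Island")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // e.g. "rider", "ride", ...
            }
            ts.end();
        }
    }
}
```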
5, Indexing
- The three most common ways of loading data into a Solr index
  - Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats
  - Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated
  - Writing a custom Java application to ingest data through Solr's Java Client API
- Uploading Data with Index Handlers
  - Index Handlers are Request Handlers designed to add, delete and update documents to the index
  - In addition to having plugins for importing rich documents using Tika or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON
- Uploading Data with Solr Cell using Apache Tika
  - Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself
  - Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing (see the SolrJ sketch at the end of this section)
  - When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell
- Uploading Structured Data Store Data with the Data Import Handler
  - The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it
  - In addition to relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields
- Detecting Languages During Indexing
  - Solr can identify languages and map text to language-specific fields during indexing using the langid UpdateRequestProcessor. Solr supports two implementations of this feature
    - Tika's language detection feature
    - LangDetect language detection
- UIMA Integration
  - UIMA (the Apache Unstructured Information Management Architecture) lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations
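For reference, a hedged SolrJ sketch of an extract request against Solr Cell's `/update/extract` handler; the file name, the literal `id`, the `uprefix` value and the core URL are illustrative assumptions.

```java
import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class SolrCellUpload {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        // Send a binary file to the ExtractingRequestHandler; Tika parses it server-side.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("quarterly-report.pdf"), "application/pdf");
        req.setParam("literal.id", "report-2024-q1"); // supply the unique key as a literal
        req.setParam("uprefix", "ignored_");          // route unknown Tika fields to a dynamic field
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        client.request(req);
        client.close();
    }
}
```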
6, Searching
- The search query is processed by a request handler
  - Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication
  - To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query
- Input to a query parser can include
  - search strings, that is, terms to search for in the index
  - parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results
  - parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application's schema
- Search parameters may also specify a query filter
- Query Syntax and Parsing
  - The Standard Query Parser
    - Solr's default query parser, also known as the "lucene" parser
    - The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries
    - The largest disadvantage is that it's very intolerant of syntax errors, as compared with something like the DisMax query parser, which is designed to throw as few errors as possible
  - The DisMax Query Parser
    - The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field
    - Additional options enable users to influence the score based on rules specific to each use case, independent of user input
  - The Extended DisMax Query Parser
    - The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to supporting all the DisMax query parser parameters, it supports the full Lucene query syntax
- Other Parsers
  - Block Join Query Parsers
  - Boost Query Parser
  - Collapsing Query Parser
  - Complex Phrase Query Parser
  - Field Query Parser
  - Function Query Parser
  - Function Range Query Parser
  - Join Query Parser
  - Lucene Query Parser
  - Max Score Query Parser
  - More Like This Query Parser
- Query (Lucene query classes)
  - TermQuery
  - TermRangeQuery
  - NumericRangeQuery
  - PrefixQuery
  - BooleanQuery
  - PhraseQuery
  - WildcardQuery
  - FuzzyQuery
  - MatchAllDocsQuery
- Faceting
  - Faceting is the arrangement of search results into categories based on indexed terms
  - Searchers are presented with the indexed terms, along with numerical counts of how many matching documents were found for each term
  - Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for (see the SolrJ sketch at the end of this section)
- Highlighting
  - Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response
  - There are three highlighting implementations available
    - Standard Highlighter
      - The Standard Highlighter is the swiss-army knife of the highlighters. It has the most sophisticated and fine-grained query representation of the three highlighters
    - FastVector Highlighter
      - The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind
      - It tends to work better for more languages than the Standard Highlighter because it supports Unicode breakiterators. On the other hand, its query representation is less advanced than the Standard Highlighter's; for example, it will not work well with the surround parser. This highlighter is a good choice for large documents and highlighting text in a variety of languages
    - Postings Highlighter
      - The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field. This is a much more compact and efficient structure than term vectors, but is not appropriate for huge numbers of query terms (e.g. wildcard queries)
      - Like the FastVector Highlighter, it supports Unicode algorithms for dividing up the document
- Spell Checking
- Query Re-Ranking
- Suggester
- MoreLikeThis
- Pagination of Results
- Result Grouping
- Spatial Search
- The Term Vector Component: For each document in the response, the TermVectorComponent can return the term vector, the term frequency, inverse document frequency, position, and offset information
- The Stats Component: The Stats component returns simple statistics for numeric, string, and date fields within the document set
- Response Writers
  - CSVResponseWriter
  - JSONResponseWriter
  - VelocityResponseWriter
  - XMLResponseWriter
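Tying the searching pieces together, here is a hedged SolrJ sketch that issues one query through eDisMax with a filter query, a facet on a field, and standard highlighting. The field names (`title`, `body`, `inStock`, `category`) and the core URL are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("apache solr");
        q.set("defType", "edismax");      // pick the eDisMax query parser
        q.set("qf", "title^2 body");      // search two fields, boosting title
        q.addFilterQuery("inStock:true"); // a query filter: restricts results, cached separately
        q.setFacet(true);
        q.addFacetField("category");      // facet counts per indexed term of "category"
        q.setHighlight(true);
        q.addHighlightField("body");      // return matching fragments of "body"
        q.setRows(10);

        QueryResponse rsp = client.query(q);

        for (FacetField ff : rsp.getFacetFields()) {
            for (FacetField.Count c : ff.getValues()) {
                System.out.println(ff.getName() + ": " + c.getName() + " (" + c.getCount() + ")");
            }
        }
        // docId -> (field -> highlighted snippets)
        Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
        System.out.println(hl);
        client.close();
    }
}
```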
7, The Well-Configured Solr Instance
- Configuring solrconfig.xml
  - request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
  - listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm up caches
  - the Request Dispatcher for managing HTTP communications
  - the Admin Web interface
  - parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
- Solr Cores and solr.xml
  - solr.xml has evolved from configuring a single Solr core, to supporting multiple Solr cores, and finally to defining parameters for SolrCloud
8, SolrCloud
- Concepts
  - Collection: a complete logical index in a SolrCloud cluster. It is usually divided into one or more shards, all of which use the same config set. If there is more than one shard, the index is distributed; SolrCloud lets you refer to it by the collection name, without having to deal with the shard-related parameters that distributed search would otherwise require
  - Config Set: the set of configuration files a Solr core needs in order to serve requests. Each config set has a name. At minimum it contains solrconfig.xml (SolrConfigXml) and schema.xml (SchemaXml); depending on what those two files reference, it may need to include other files as well. Config sets are stored in ZooKeeper; they can be re-uploaded or updated with the upconfig command, and the bootstrap_confdir startup parameter can be used to initialize or update them
  - Core: a Solr core. A Solr instance contains one or more cores, each of which independently provides indexing and query capabilities; each core corresponds to an index, or to one shard of a collection. Cores exist to increase management flexibility and allow resource sharing. The difference in SolrCloud is that a core's configuration lives in ZooKeeper, whereas a traditional Solr core reads its configuration from a directory on disk
  - Leader: the shard replica that has won an election. Each shard has several replicas, which hold an election to choose a leader. Elections can happen at any time, but normally they are triggered only when a Solr instance fails. When documents are indexed, SolrCloud routes them to the leader of the corresponding shard, and the leader distributes them to all of the shard's replicas
  - Replica: one copy of a shard. Each replica lives in one Solr core. For example, a collection named "test" created with numShards=1 and replicationFactor=2 will have two replicas, and therefore two cores, each on a different machine or Solr instance; one will be named test_shard1_replica1 and the other test_shard1_replica2. One of them will be elected leader
  - Shard: a logical slice of a collection. Each shard consists of one or more replicas, and an election determines which replica is the leader
  - Zookeeper: ZooKeeper provides distributed locking, which SolrCloud requires, and handles leader elections. Solr can run with an embedded ZooKeeper, but a standalone ensemble is recommended, ideally with three or more hosts (see the connection sketch at the end of this section)
- Features
  - Central configuration for the entire cluster
  - Automatic load balancing and fail-over for queries
  - ZooKeeper integration for cluster coordination and configuration
- Nodes and Cores
  - In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data
  - A Solr core is basically an index of the text and fields found in documents
  - A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria
  - When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it
- Clusters
  - A cluster is a set of Solr nodes managed by ZooKeeper as a single unit
  - When you have a cluster, you can always make requests to the cluster, and if the request is acknowledged you can be sure it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made, and the cluster can be expanded or contracted
- Leaders and Replicas
  - The concept of a leader is similar to that of master in traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader
  - However, with SolrCloud you don't simply have one master and one or more "slaves"; instead you likely have distributed your search and index traffic to multiple machines
  - When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index
  - A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for data that represents each state, or different categories that are likely to be searched independently, but are often combined
  - Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results
- ZooKeeper provides failover and load balancing
  - One of the advantages of using SolrCloud is the ability to distribute requests among various shards that may or may not contain the data that you're looking for. You have the option of searching over all of your data or just parts of it
- Configuring the ShardHandlerFactory
  - You can directly configure aspects of the concurrency and thread pooling used within distributed search in Solr. This allows for finer-grained control, and you can tune it to target your own specific requirements. The default configuration favors throughput over latency
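In SolrJ, a client talks to the cluster through the ZooKeeper ensemble rather than to a fixed node, which is how routing to leaders and failover come for free. A hedged sketch using the SolrJ 5.x-era `CloudSolrClient` constructor; the ensemble addresses and the collection name "test" (from the Replica example above) are assumptions.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudDemo {
    public static void main(String[] args) throws Exception {
        // Connect via the ZooKeeper ensemble, not via any single Solr node.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("test");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "cloud-doc-1");
        client.add(doc);  // routed to the shard leader, which forwards to the replicas
        client.commit();

        System.out.println(client.query(new SolrQuery("*:*")).getResults().getNumFound());
        client.close();
    }
}
```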
9, Chinese Tokenizers (Word Segmenters)
- mmseg4j
  - mmseg4j is a Chinese tokenizer implementing Chih-Hao Tsai's MMSeg algorithm
  - The MMSeg algorithm offers two segmentation modes, Simple and Complex, both based on forward maximum matching; Complex adds four filtering rules for disambiguation
- paoding
  - Paoding's Knives is a Chinese tokenizer with very high efficiency and extensibility; it is built around a "knife" metaphor, uses a fully object-oriented design, and is conceptually advanced
  - High efficiency: on a PIII machine with 1 GB of memory, it can accurately segment one million Chinese characters per second
  - It segments text using an unlimited number of dictionary files, so vocabulary can be defined and categorized
  - It can analyze unknown words reasonably
- ictclas4j
  - The ictclas4j Chinese word-segmentation system is an open-source Java project by sinboy, built on FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun at the Chinese Academy of Sciences
- IKAnalyzer
  - A Chinese word-segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms
  - Uses its own "forward iterative finest-granularity segmentation" algorithm, with a throughput of 600,000 characters per second
  - Uses a multi-subprocessor analysis model that handles Latin letters (IP addresses, e-mail addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), Chinese words (including personal and place names), and more
  - Support for mixed Chinese/English text is weak and awkward to handle, requiring a second query; on the other hand, it offers optimized dictionary storage with support for personal entries and a smaller memory footprint
  - Supports user-defined dictionary extensions
  - Provides IKQueryParser, a query analyzer optimized for Lucene full-text retrieval; it uses an ambiguity-analysis algorithm to optimize the arrangement and combination of query keywords, which can greatly improve the hit rate of Lucene searches
- ansj
  - A Java implementation of ICTCLAS that essentially rewrites all of the data structures and algorithms; the dictionary is the one provided with the open-source version of ICTCLAS, with some manual optimization
  - In-memory Chinese segmentation runs at roughly one million characters per second (faster than ICTCLAS)
  - Segmentation while reading from files runs at roughly 300,000 characters per second
  - Accuracy exceeds 96%
  - Currently implements Chinese word segmentation, Chinese personal-name recognition, and user-defined dictionaries
  - Applicable to natural language processing and similar tasks, and suited to projects with demanding segmentation-quality requirements
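Each of these segmenters exposes a Lucene `Analyzer`, so it can be wired into Solr via a `<fieldType>` in schema.xml or exercised directly. A sketch using mmseg4j's Complex mode; the class name `com.chenlb.mmseg4j.analysis.ComplexAnalyzer` is an assumption based on the mmseg4j distribution, so substitute your segmenter's analyzer class.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
// Assumed to come from the mmseg4j-analysis jar; check your version's package layout.
import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;

public class ChineseTokenizeDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new ComplexAnalyzer(); // MMSeg Complex mode: forward maximum matching plus 4 rules
        try (TokenStream ts = analyzer.tokenStream("content", "研究生命起源")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // expect multi-character words, not single characters
            }
            ts.end();
        }
    }
}
```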
10, Solr Performance Factors
- Schema Design Considerations
  - indexed fields: the number of indexed fields increases index size, memory use during indexing, and segment merge time
- Configuration Considerations
  - mergeFactor: roughly determines the number of index segments; higher values speed up indexing but slow down searches
### Apache Solr 企业搜索引擎教程知识点总结 #### 1. Apache Solr 概述 - **Solr**:Apache Solr 是一款高度可扩展且高性能的企业级搜索平台,由Apache软件基金会维护。它是一个开源搜索服务器,使用Java语言编写...