
Solr: Indexing 2


Field types for structured nontext fields

In general, Solr provides a number of built-in field types for structured data, such as numbers, dates, and geo location fields.

 

String fields

Solr provides the string field type for fields that contain structured values that shouldn’t be altered in any way. For example, the lang field contains a standard ISO-639-1 language code used to identify the language of the tweet, such as en.

 

<fieldType name="string" class="solr.StrField"
sortMissingLast="true" omitNorms="true"/>
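A field that uses this type could be declared as shown below. This is only a sketch: the text names the lang field but doesn't show its definition, so the indexed and stored values here are assumptions.

```xml
<!-- Assumed definition of the lang field; attribute values are illustrative -->
<field name="lang" type="string" indexed="true" stored="true"/>
```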

 

 

Date fields

A common approach to searching on date fields is to allow users to specify a date range.

 

<field name="timestamp" type="tdate" indexed="true" stored="true" />

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
precisionStep="6" positionIncrementGap="0"/>

In general, Solr expects your dates to be in the ISO-8601 date/time format (yyyy-MM-ddTHH:mm:ssZ), where the trailing Z denotes the UTC time zone.

 

DATE GRANULARITY 
<field name="timestamp">2012-05-22T09:30:22Z/HOUR</field>
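The /HOUR suffix asks Solr to round the value down to the start of the hour before indexing it, so the timestamp above is indexed as 2012-05-22T09:00:00Z. Coarser units work the same way. The /DAY variant below is an illustration and isn't taken from the original example.

```xml
<!-- Rounds down to the start of the day (UTC): indexed as 2012-05-22T00:00:00Z -->
<field name="timestamp">2012-05-22T09:30:22Z/DAY</field>
```

Coarser granularity means fewer unique terms in the index, at the cost of losing the finer detail of the original value.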

 

 

Numeric fields

 

<field name="favorites_count" type="int" indexed="true" stored="true" />

<fieldType name="int" class="solr.TrieIntField"
precisionStep="0" positionIncrementGap="0"/> 

 

Because we don’t need to support range queries on this field, we chose precisionStep="0", which works best for sorting without incurring the additional storage costs of a nonzero precision step, which indexes extra terms to make range queries faster. Also, note that you shouldn’t index a numeric field that you need to sort as a string field, because Solr will do a lexical sort instead of a numeric sort if the underlying type is string-based (for example, "10" sorts before "9").
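For contrast, a numeric field that does need fast range queries would use a nonzero precisionStep. The sketch below assumes a hypothetical retweet_count field. The tint type with precisionStep="8" follows the convention of the Solr 4 example schema rather than anything shown in these notes.

```xml
<!-- Hypothetical field that needs range queries such as [100 TO 500] -->
<field name="retweet_count" type="tint" indexed="true" stored="true" />

<!-- A nonzero precisionStep indexes extra terms per value to speed up range queries -->
<fieldType name="tint" class="solr.TrieIntField"
precisionStep="8" positionIncrementGap="0"/>
```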

 

Advanced field type attributes

Solr supports optional attributes for field types to enable advanced behavior.



 

-------------------------------------------------------------------------------------------------------------------------------------

Sending documents to Solr for indexing



 

 

 

Importing documents into Solr

  • HTTP POST
  • Data Import Handler (DIH)
  • ExtractingRequestHandler, aka Solr Cell
  • Nutch
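As a minimal sketch of the HTTP POST option, the XML message below adds one document using the fields defined earlier. The id field and all values are assumptions for illustration. With a default Solr 4 install, the body would typically be posted to a URL such as http://localhost:8983/solr/collection1/update with a Content-Type of text/xml. The host, port, and core name depend on your setup.

```xml
<!-- Request body for an HTTP POST to the core's /update handler -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="lang">en</field>
    <field name="timestamp">2012-05-22T09:30:22Z</field>
    <field name="favorites_count">10</field>
  </doc>
</add>
```

The document isn't searchable until a commit happens, either explicitly or through autocommit.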

-------------------------------------------------------------------------------------------------------------------------------------

Update handler

In general, the update handler processes all updates to your index as well as commit and optimize requests. Table 5.7 provides an overview of common request types supported by the update handler.
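The table itself isn't reproduced in these notes. As a rough illustration of the kinds of requests involved, these are the XML messages the update handler accepts for deletes, commits, and optimizes. The id value and the query are made up for the example.

```xml
<!-- Each element below is a separate request body -->

<!-- Delete one document by its unique key -->
<delete><id>1</id></delete>

<!-- Delete every document that matches a query -->
<delete><query>lang:en</query></delete>

<!-- Hard commit: flush pending updates and open a new searcher -->
<commit/>

<!-- Merge index segments; this can be expensive on a large index -->
<optimize/>
```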



 



 

 

 

Committing documents to the index

 

  • NORMAL/HARD COMMIT: A normal or hard commit is one in which Solr flushes all uncommitted documents to disk and refreshes an internal component called a searcher so that the newly committed documents can be searched.
  • SOFT COMMIT: A soft commit is a new feature in Solr 4 to support near real-time (NRT) searching. For now, you can think of a soft commit as a mechanism to make documents searchable in near real-time by skipping the costly aspects of hard commits, such as flushing to durable storage. As soft commits are less expensive, you can issue a soft commit every second to make newly indexed documents searchable within about a second of adding them to Solr. But keep in mind that you still need to do a hard commit at some point to ensure that documents are eventually flushed to durable storage.

AUTOCOMMIT

For either normal or soft commits, you can configure Solr to automatically commit documents using one of three strategies:

  1. Commit each document within a specified time.
  2. Commit all documents once a user-specified threshold of uncommitted documents is reached.
  3. Commit all documents on a regular time interval, such as every ten minutes.

 

When performing an autocommit, the normal behavior is to open a new searcher. But Solr lets you disable this behavior by specifying <openSearcher>false</openSearcher>. In this case, the documents will be flushed to disk, but won’t be visible in search results until a new searcher is opened. Solr provides this option to help minimize the size of its transaction log of uncommitted updates (see the next section) and to avoid opening too many searchers during a large indexing process.
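The threshold and interval strategies can be sketched in solrconfig.xml as follows. The numbers are illustrative, not recommendations: a hard commit after 10,000 pending documents or ten minutes, plus a soft commit every second.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit after 10,000 pending docs or 10 minutes, whichever comes first -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>600000</maxTime> <!-- milliseconds -->
    <!-- Flush to disk without opening a new searcher -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every second for near real-time visibility -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

The first strategy, committing each document within a specified time, is requested per update rather than in solrconfig.xml, for example by adding a commitWithin attribute (in milliseconds) to the add message: <add commitWithin="5000">.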



 



 

Transaction log

Solr uses a transaction log to ensure that updates accepted by Solr are saved on durable storage until they’re committed to the index. Imagine the scenario in which your client application sends a commit every 10,000 documents. If Solr crashes after the client sends documents to be indexed but before your client sends the commit, then without a transaction log, these uncommitted documents will be lost. Specifically, the transaction log serves three key purposes:

  1. It is used to support real-time gets and atomic updates.
  2. It decouples write durability from the commit process.
  3. It supports synchronizing replicas with shard leaders in SolrCloud.

 

<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>

Every update request is logged to the transaction log. The transaction log continues to grow until you issue a commit. During a commit, the active transaction log is processed and then a new transaction log file is opened.


 

With the transaction log, your main concern is balancing the trade-off between the length of your transaction log (that is, the number of uncommitted updates) and how frequently you want to issue a hard commit. If your transaction log grows large, a restart may take a long time to replay the updates, delaying your recovery process.

 

Atomic updates

You can update existing documents in Solr by sending a new version of the document. But unlike a database in which you can update a specific column in a row, with Solr you must update the entire document. Behind the scenes, Solr deletes the existing document and creates a new one; this occurs whether you change one field or all fields.

Atomic updates are a new feature in Solr that allows you to send updates to only the fields you want to change.
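A minimal sketch of such a request in XML is shown below. It sets the retweet_count_ti field on the document with id=1. The value 100 is an assumption for illustration.

```xml
<add>
  <doc>
    <!-- Identifies the existing document to update -->
    <field name="id">1</field>
    <!-- update="set" replaces (or adds) only this field's value -->
    <field name="retweet_count_ti" update="set">100</field>
  </doc>
</add>
```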



 

Behind the scenes, Solr locates the existing document with id=1, retrieves all stored fields from the index, deletes the existing document, and creates a new document from all existing fields plus the new retweet_count_ti field. It follows that all fields must be stored for this to work because the client application is only sending the id field and the new field. All other fields must be pulled from the existing document.

 

-----------------------------------------------------------------------------------------------------------------------------------

Index management

Most of the index-related settings in Solr are for expert use only. This means that the default settings are appropriate for most Solr installations and that you should be cautious when changing them.

 

Index storage

When documents are committed to the index, they’re written to durable storage using a component called a directory. The directory component provides the following key benefits to Solr:

  • Hides details of reading from and writing to durable storage, such as writing to a file on disk or using JDBC to store documents in a database.
  • Implements a storage-specific locking mechanism to prevent index corruption, such as OS-level locking for filesystem-based storage.
  • Insulates Solr from JVM and OS peculiarities.
  • Enables extending the behavior of a base directory implementation to support specific use cases like NRT search.

By default, Solr uses a directory implementation that stores data to the local filesystem in the data directory for a core.

The location of the data directory is controlled by the <dataDir> element in solrconfig.xml:

<dataDir>${solr.data.dir:}</dataDir>

The solr.data.dir property defaults to data but can be overridden in solr.xml for each core, for example:

<core loadOnStartup="true" instanceDir="collection1/"
transient="false" name="collection1"
dataDir="/usr/local/solr-data/collection1"/>

 

Here are some basic pointers to keep in mind:

  • Each core shouldn’t have to compete for the disk with other processes. If you have multiple cores on the same server, it’s a good idea to use separate physical disks for each index.
  • Use high-quality, fast disks or, even better, consider using solid state drives (SSDs) if your budget allows.
  • Spend quality time with your system administrators to discuss RAID options for your servers.
  • The amount of memory (RAM) you leave available to your OS for filesystem caching can also have a sizable impact on your disk I/O needs.

The default directory implementation used by Solr is solr.NRTCachingDirectoryFactory, which is configured with the <directoryFactory> element in solrconfig.xml:

<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
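Because the class is read from the solr.directoryFactory property, a different implementation can be selected without editing the file, for example by starting the JVM with -Dsolr.directoryFactory=solr.MMapDirectoryFactory. The hard-coded equivalent is sketched below. The choice of MMapDirectoryFactory is only an illustration, not a recommendation from the text.

```xml
<!-- Illustrative override: memory-mapped files instead of the NRT caching default -->
<directoryFactory name="DirectoryFactory"
class="solr.MMapDirectoryFactory"/>
```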

 

 

 

Segment merging

A segment is a self-contained, read-only subset of a full Lucene index; once a segment is flushed to durable storage, it’s never altered. When new documents are added to your index, they’re written to a new segment. Consequently, there can be many active segments in your index. Each query must read data from all segments to get a complete result set. At some point, having too many small segments can negatively impact query performance. Combining many smaller segments into fewer larger segments is commonly known as segment merging.
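As a sketch of where merging is tuned, Solr 4.x-era configurations expose a mergeFactor setting under <indexConfig> in solrconfig.xml. This is an assumption about the Solr version in use. Later releases configure the merge policy differently, and, as noted above, the defaults suit most installations.

```xml
<indexConfig>
  <!-- Roughly: how many similar-sized segments accumulate before they are merged -->
  <mergeFactor>10</mergeFactor>
</indexConfig>
```

A lower value merges more aggressively, which favors query speed over indexing throughput. A higher value does the opposite.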

 
