`

Introduce to Inforamtion Retrieval读书笔记(2)

阅读更多

The term vocabulary and postings lists

Inverted index construction step:

1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.

 

2.1 Document delineation and character sequence decoding

Encoding Problems: how to auto-dectect encoding:

Text Format Problems: docs pdf xml html and so on.

Sequence Problems : Arabic(阿拉伯语), where text takes on some two dimensional and mixed order characteristics.

 

Choosing a document unit : A precision/recall tradeoff,large document units can be alleviated by use of explicit or implicit proximity search

 

2.2 Determining the vocabulary of terms

token :tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time
throwing away certain characters, such as punctuation。

Difference between token and type:

token not exactly the same word sequence,is a instance

type is exactly the same work sequence,is a class

like the difference of OOP's class and instance.

Tokenization are language-specific : Language identification based on clas-
IDENTIFICATION sifiers that use short character subsequences as features is highly effective;

most languages have distinctive signature patterns

中文分词:最大正向/反向匹配。

专有名词识别,ip url,邮箱、电话号码识别。

2.2.2 Dropping common terms: stop words

Stop words: extremely common words has little value in helping select documents matching a user need.

How to Collect:

The general COLLECTION strategy for determining a stop list is to sort the terms by collection frequency and then to take the most frequent terms, often hand-filtered for their semantic
content relative to the domain of the documents being indexed, as a stop list.

2.2.3 Normalization (equivalence classing of terms)

Token normalization is the process of canonicalizing TOKEN tokens so that matches
 occur despite superficial differences in the character sequences of the tokens.

不同写法:anti-discriminatory and antidiscriminatory

同义词:car and automobile

Accents and diacritics

Capitalization/case-folding

2.2.4 Stemming and lemmatization

The goal of both stemming and lemmatization is to reduce inflectional
forms and sometimes derivationally related forms of a word to a common
base form。

eg.:

am, are, is ⇒be
car, cars, car’s, cars’⇒car

Some common algorithm for stemming English:

Porter stemmer、Lovins stemmer、Paice stemmer

 

2.3 Faster postings list intersection via skip pointers

Postings lists intersection with skip pointers:

INTERSECTWITHSKIPS(p1, p2)
1 answer ← ()
2 while p1 != NIL and p2 != NIL
3 do if docID(p1) = docID(p2)
4 then ADD(answer, docID(p1))
5 p1 ← next(p1)
6 p2 ← next(p2)
7 else if docID(p1) < docID(p2)
8 then if hasSkip(p1) and (docID(skip(p1) ≤ docID(p2)))
9 then while hasSkip(p1) and (docID(skip(p1) ≤ docID(p2)))
10 do p1 ← skip(p1)
11 else p1 ← next(p1)
12 else if hasSkip(p2) and (docID(skip(p2) ≤ docID(p1)))
13 then while hasSkip(p2) and (docID(skip(p2) ≤ docID(p1)))
14 do p2 ← skip(p2)
15 else p2 ← next(p2)
16 return answer

 2.4 Positional postings and phrase queries

Biword indexes : One approach to handling phrases is to consider every pair of consecutive
terms in a document as a phrase.(Not a standard solution)

Biword Extension:

The concept of a biword index can be extended to longer sequences of
words, and if the index includes variable length word sequences, it is generally
referred to as a phrase index

 

Positional indexes :(most commonly employed)

store postings of the form docID: <position1, position2, . . . >

 

An algorithm for proximity intersection of postings lists p1 and p2:

 

POSITIONALINTERSECT(p1, p2, k)
1 answer ← ()
2 while p1 != NIL and p2 != NIL
3 do if docID(p1) = docID(p2)
4 then l ← ()
5 pp1 ← positions(p1)
6 pp2 ← positions(p2)
7 while pp1 != NIL
8 do while pp2 != NIL
9 do if |pos(pp1) − pos(pp2)| > k
10 then break
11 else ADD(l, pos(pp2))
12 pp2 ← next(pp2)
13 while l != () and |l[0] − pos(pp1)| > k
14 do DELETE(l[0])
15 for each ps ∈ l
16 do ADD(answer, hdocID(p1), pos(pp1), psi)
17 pp1 ← next(pp1)
18 p1 ← next(p1)
19 p2 ← next(p2)
20 else if docID(p1) < docID(p2)
21 then p1 ← next(p1)
22 else p2 ← next(p2)
23 return answer
 

Combination schemes :

Combination of biword indexes and positional indexes。

 

0
0
分享到:
评论

相关推荐

    Introduce to Algorithms, A Creative Approach

    Introduce to Algorithms, A Creative Approach .英文版

    introduce to NS2

    ### NS2介绍与应用 #### 一、NS2概述 NS2(Network Simulator 2)是一种开源的事件驱动网络模拟器,专为计算机通信网络的研究而设计。自1989年问世以来,NS2在业界、学术界及政府机构中获得了极大的关注。通过多年...

    introduce to D3D Game Programming code

    2. **顶点和索引缓冲区**:这是3D图形的基础,用于存储模型的数据,如位置、颜色、纹理坐标等。了解如何创建、填充和使用这些缓冲区对于构建3D场景至关重要。 3. **渲染管线**:Direct3D使用渲染管线处理图形数据,...

    introduce to linux.html

    2. 跨平台:Linux不仅能在x86架构上运行,还支持ARM、MIPS等多种处理器架构,覆盖了从嵌入式设备到超级计算机的广泛应用。 3. 安全稳定:Linux的多用户、多任务环境保证了系统安全,且经过严格的权限控制和审核机制...

    Introduce to Java Programming 8th

    《Introduction to Java Programming》第8版是一本深入浅出的Java学习教材,为初学者提供了全面的Java编程知识。这本书涵盖了从基础语法到高级特性的全方位教学,旨在帮助读者掌握Java编程的核心技能。 在本书中,...

    Introduction to Linear Algebra

    从网上看到Introduction to Linear Algebra这本书(Algebra_Gilbert Strang),介绍的清晰易懂且很到位。就搜集了一些。包括第三版、第四版和讲课笔记。听说第三版比较经典,暂时只上传了第三版和笔记。虽然是英文的...

    Introduction to Lens Design

    Introduction to Lens Design

    Introduce to EIB Control Network and Protocol

    2. 数据链路层:定义了EIB网络中设备之间如何进行数据的封装、传输和接收,包括数据帧的结构、寻址方式、错误检测和校验等。数据链路层确保了数据在物理媒介上传输的准确性。 3. 网络层:负责整个EIB网络的逻辑结构...

    team introduce.key

    team introduce.key

    Introduction to linear optimization

    在“Introduction to Linear Optimization”一书中,对线性规划的几何理解进行了介绍,提供了线性规划问题的理论基础。 2. 单纯形方法(The Simplex Method):这是求解线性规划问题的经典算法之一,由乔治·丹齐格...

    Introduction to lens design.pdf

    2. 光学材料:光学材料是决定光学系统性能的关键因素之一。学习者需要了解不同类型的玻璃、塑料等材料的光学和物理特性,如折射率、色散率、热膨胀系数、光学均匀性和透光率等。 3. 光学元件:在光学系统中,各种...

    Introduce To Design Pattern(设计模式介绍)

    这是对设计模式的简单简绍,希望对大家有用

    [Introduction.to.Tornado(2012.3)].Michael.Dory.文字版.pdf

    《Introduction to Tornado》是Michael Dory、Adam Parrish和Brendan Berg三人所著,是一本关于Tornado网络服务器的入门与应用概览书籍。Tornado是一个开源的网络框架,最初是为FriendFeed公司的实时服务设计的,...

    Introduciton to Linear Algebra

    Gilbert Strang's textbooks have changed the entire approach to learning linear algebra -- away from abstract vector spaces to specific examples of the four fundamental subspaces: the column space and ...

    Introduction to Computing Systems From bits &amp; gates to C &amp; beyond

    计算机系统概论 英文版 作者: [美] Yale N. Patt'Introduction to Computing Systems: From bits & gates ... To understand the computer, the authors introduce the LC-3 and provide the LC-3 Simulator to give st

    introduce to HHT

    在"Introduction to HHT"这个压缩包文件中,很可能是包含了一些关于HHT理论的详细讲解,可能包括EMD的实例分析、希尔伯特变换的数学描述,以及如何运用HHT解决实际问题的案例。通过学习这些内容,读者可以深入了解...

    IntroduceIntroduceIntroduceIntroduce

    在当今职场竞争激烈的环境下,个人的自我介绍对于面试官来说,往往是了解应聘者的第一窗口。HHJ,一位26岁的年轻人,在面试中如何做好自我介绍,显得尤为重要。他将自己描述为勤奋、开放、随和、有责任感的人,这样...

    Introduce_to_Algorithms_3th.pdf

    - **书名**:《算法导论》第三版(Introduction to Algorithms Third Edition) - **作者**: - Thomas H. Cormen - Charles E. Leiserson - Ronald L. Rivest - Clifford Stein - **出版社**:MIT Press - **...

Global site tag (gtag.js) - Google Analytics