`

Introduce to Inforamtion Retrieval读书笔记(1)

阅读更多

很好的一本书,介绍的非常全面,看了很久了,还没有看完,刚看完前十章,发现前面看的都忘的差不多了,还是回来记一下吧。

Boolean Retrieval

一、information retrieval定义:

学院派定义:

Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).

Category :

Category By Scale :

web search、domain-specific search、personal information retrieval

Basic need:

1、To process large document collections quickly.

2、To allow more flexible matching operations

3、To allow ranked retrieval

Simple idea:

term-document incidence matrix use binary logical OR AND NOT...:110100 AND 110111 AND 101111 = 100100

What is Boolean Retrival:

The Boolean retrieval model is a model for information BOOLEAN RETRIEVAL retrieval in which we
MODEL can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
Such queries effectively view each document as a set of words.

What's the boolean retrival query like:

(Calpurnia AND Brutus) AND Caesar

how to assess IR system

Precision : What fraction of the returned results are relevant to the information
need?
Recall : What fraction of the relevant documents in the collection were returned
by the system?

vector space model: Easy to rank

Term-document matrix: not scalable

Inverted index: dictionary and posting list.

 

How Build Inverted index :

1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .

3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index,
consisting of a dictionary and postings.

 

Processing Boolean queries:

AND operation:

intersect two posting list:

INTERSECT(p1, p2)
1 answer ← ()
2 while p1 != NIL and p2 != NIL
3 do if docID(p1) = docID(p2)
4 then ADD(answer, docID(p1))
5 p1 ← next(p1)
6 p2 ← next(p2)
7 else if docID(p1) < docID(p2)
8 then p1 ← next(p1)
9 else p2 ← next(p2)
10 return answer

  mulitiple term AND operation:

    Process terms in order of increasing document frequency:

      if we start by intersecting the two smallest postings lists, then all intermediate resultsmust be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work

INTERSECT(ht1, . . . , tni)
1 terms ← SORTBYINCREASINGFREQUENCY(ht1, . . . , tni)
2 result ← postings( f irst(terms))
3 terms ← rest(terms)
4 while terms != NIL and result != NIL
5 do result ← INTERSECT(result, postings( f irst(terms)))
6 terms ← rest(terms)
7 return result

OR operation:

    The idea is 归并排序中的n路归并,similarily with AND operation。

The extended Boolean model versus ranked retrieval:

Proximity operator:

A proximity operator is a way of specifying that two terms in a query must occur in a document close to each other, where closeness may be measured
by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.

 

Addition to do:

1. We would like to better determine the set of terms in the dictionary and
to provide retrieval that is tolerant to spelling mistakes and inconsistent
choice of words.
2. It is often useful to search for compounds or phrases that denote a concept
such as “operating system”. As the Westlaw examples show, we might also
wish to do proximity queries such as Gates NEAR Microsoft. To answer
such queries, the index has to be augmented to capture the proximities of
terms in documents.
3. A Boolean model only records term presence or absence, but often we
would like to accumulate evidence, givingmoreweight to documents that
have a term several times as opposed to ones that contain it only once. To
be able to do this we need the term frequency information TERM FREQUENCY (the number of
times a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly
we wish to have an effective method to order (or “rank”) the returned
results. This requires having a mechanism for determining a document
score which encapsulates how good a match a document is for a query.

 

0
0
分享到:
评论

相关推荐

    Introduce to Algorithms, A Creative Approach

    Introduce to Algorithms, A Creative Approach .英文版

    introduce to NS2

    1. **开放源代码**:这使得NS2能够被全球的研究者和开发者不断改进和完善。 2. **高度可扩展性**:用户可以根据自己的需求开发新的模块或修改现有模块。 3. **详细的结果分析工具**:支持对模拟结果进行深入分析,...

    introduce to D3D Game Programming code

    1. **Direct3D基础知识**:这部分内容通常会涵盖Direct3D的基本概念,包括设备创建、上下文管理、窗口设置和初始化过程。学习者需要理解Direct3D设备的生命周期以及如何与Windows应用程序集成。 2. **顶点和索引...

    introduce to linux.html

    1. 开源自由:Linux遵循GNU通用公共许可证(GPL),其源代码可供任何人查看、修改和分发,这促进了社区的协作和创新。 2. 跨平台:Linux不仅能在x86架构上运行,还支持ARM、MIPS等多种处理器架构,覆盖了从嵌入式...

    Introduce to Java Programming 8th

    《Introduction to Java Programming》第8版是一本深入浅出的Java学习教材,为初学者提供了全面的Java编程知识。这本书涵盖了从基础语法到高级特性的全方位教学,旨在帮助读者掌握Java编程的核心技能。 在本书中,...

    Introduction to Linear Algebra

    从网上看到Introduction to Linear Algebra这本书(Algebra_Gilbert Strang),介绍的清晰易懂且很到位。就搜集了一些。包括第三版、第四版和讲课笔记。听说第三版比较经典,暂时只上传了第三版和笔记。虽然是英文的...

    Introduction to Lens Design

    Introduction to Lens Design

    Introduce to EIB Control Network and Protocol

    EIB协会(EIBA,European Installation Bus Association)提供了一个标准化的参考手册系列,为想要深入了解和应用EIB技术的人员提供了详尽的指导和规范(例如手册第1卷和第3卷)。通过这些规范文档,制造商可以确保...

    team introduce.key

    team introduce.key

    Introduction to linear optimization

    1. 线性规划(Linear Programming):线性优化通常与线性规划同义,是研究线性约束和线性目标函数的问题。在“Introduction to Linear Optimization”一书中,对线性规划的几何理解进行了介绍,提供了线性规划问题的...

    Introduction to lens design.pdf

    1. 基础理论:光学设计的理论基础包括光线追踪原理、费马原理、斯涅尔定律等,这些都是分析光线如何在不同介质和光学元件中传播的基础。除此之外,还需掌握光学成像的基本公式,比如高斯光学中的透镜公式和牛顿公式...

    Introduce To Design Pattern(设计模式介绍)

    这是对设计模式的简单简绍,希望对大家有用

    [Introduction.to.Tornado(2012.3)].Michael.Dory.文字版.pdf

    《Introduction to Tornado》是Michael Dory、Adam Parrish和Brendan Berg三人所著,是一本关于Tornado网络服务器的入门与应用概览书籍。Tornado是一个开源的网络框架,最初是为FriendFeed公司的实时服务设计的,...

    Introduciton to Linear Algebra

    Gilbert Strang's textbooks have changed the entire approach to learning linear algebra -- away from abstract vector spaces to specific examples of the four fundamental subspaces: the column space and ...

    introduce to HHT

    在"Introduction to HHT"这个压缩包文件中,很可能是包含了一些关于HHT理论的详细讲解,可能包括EMD的实例分析、希尔伯特变换的数学描述,以及如何运用HHT解决实际问题的案例。通过学习这些内容,读者可以深入了解...

    Introduction to Computing Systems From bits &amp; gates to C &amp; beyond

    计算机系统概论 英文版 作者: [美] Yale N. Patt'Introduction to Computing Systems: From bits & gates ... To understand the computer, the authors introduce the LC-3 and provide the LC-3 Simulator to give st

    IntroduceIntroduceIntroduceIntroduce

    在当今职场竞争激烈的环境下,个人的自我介绍对于面试官来说,往往是了解应聘者的第一窗口。HHJ,一位26岁的年轻人,在面试中如何做好自我介绍,显得尤为重要。他将自己描述为勤奋、开放、随和、有责任感的人,这样...

    Introduce_to_Algorithms_3th.pdf

    - **书名**:《算法导论》第三版(Introduction to Algorithms Third Edition) - **作者**: - Thomas H. Cormen - Charles E. Leiserson - Ronald L. Rivest - Clifford Stein - **出版社**:MIT Press - **...

Global site tag (gtag.js) - Google Analytics