Librarian's Ultimate Guide to Search Engines


Librarians were the ultimate search guides before search was reinvented with the web. They are trusted, credible sources for historical information, and pioneers and innovators in the taxonomy of information. Librarians witness, search for, find, organize and catalog knowledge. Online research and the power of the web have put information at everyone's fingertips, but the taxonomies and standards used for search will shape how people learn, online and off, for years to come. Below are some of the things librarians understand about search, and that anyone doing online research can benefit from.

Brief Recent History Of Search Engines

While there are many search engines, about 80-90% of the search market belongs to just a few: Google, Yahoo, and MSN, approximately in that order of decreasing use. There are a few other engines that are relatively popular but some are white-labelled versions of the above. If you want to see a chart of approximate web traffic figures for these engines, use alexaholic.com. Alexaholic uses Alexa.com but will let you view multiple traffic charts simultaneously. These will give you a relative comparison of which is more popular.
More in depth history of search engines, and search glossary.

For example, traffic to Google, Yahoo, and MSN has been relatively equal over the past 3 months. If you plot Alexa traffic figures over the past 5 years, though, you'll see how incredibly fast Google's popularity has increased, right up through early 2006. Ask.com's traffic line is at the very bottom of the chart.

Web2.0 Search Engines

These are the new breed, some labelled as web2.0 applications. The definition of web 2.0 is still fairly broad, but it's easy to see the award winning stuff. They're the tip of the iceberg of advanced search applications for what is known as the semantic web. They literally add another dimension to searching. Some offer visual search using an initial image that you select or even draw. Others let you search by color or meta tags of audio files. A few of these engines include Like, Princeton Shape, SystemOne Retrievr, Mnemomap, Casual, KWMap, Ujiko, Webbrain. [See Information Aesthetics for writeups about most of these.]

Most of these new engines are works in progress that need a few generations of revisions. A few are truly brilliant, all of them innovative. Some use meta level concepts such as synonym matching, color or shape similarity, thematic concepts, semantics. There's even a device that carves rivers, canyons and valleys into foam based on search engine queries.

All of them appear to improve the search experience, but mostly for advanced users who are familiar with unusual search paradigms. If you're interested, visit some to get a sense of them. The rest of this article focuses on traditional text-based search engines.

Text-Based Search Engines - Overview

Text-based search engines are the mainstay of the web. They've come and gone, and will continue to do so. Many left the public web and focused on corporate Intranets (private webs; also part of the "invisible web"). It took Google, however, to successfully monetize a public search engine.

Because Google and other engines store a list of all the search queries that users perform, there is a vast pool of information that can be data-mined from the queries. For example, CNBC TV ran a segment about how a murderer was convicted of killing his wife, based on the overwhelming evidence of gruesome search queries. They were able to trace the queries back to his personal computer at home. Using pattern analysis and other evidence, they convicted him.

Information is essentially cheap on the Internet. It's what you make of it, data-mining for patterns, that can be valuable. Though the average person does not care about that. They are looking for something specific, and usually, they get more search results than they care to look at. That's where power search techniques come in. They are not very complicated, and use a fairly simple syntax to give you the power to cull the search results down to what you are really looking for - most of the time, anyway.

Glossary: Search Engine + Related

Before discussing ways to refine search queries, let's have a look at a few terms either specifically related to search engines, or related to topics in this article.

Anchor Text
Whenever you see a hyperlink on a web page, the visible, clickable words of the link are referred to as the anchor text.

Blog/ Weblog
A blog (aka weblog) is a special website that has been structured with articles (blog posts) in reverse chronological order. Blog posts are also organized into page groups and monthly archives. They have a structural advantage in search engines, though they often result in false search results.

Bot/ Spider
A search engine bot or spider is a special automated web application that indexes web pages for a search engine.

Cache
Some engines store the full-text of an indexed web page. Whenever the page is updated, the engine's cache will also be updated, eventually. So you can view a cached page from another site by using the "cache:" operator, without leaving the search engine.

Invisible Web
The Invisible Web consists of web sites that are difficult or impossible to find, either because they are not indexed in a search engine or because they require a password.

Query Strings
This simply means the actual text that you enter in a search query, including letters, digits, punctuation, and any special operator characters.

SEM/ SEO
Search engine marketing/ Search engine optimization

Semantic Web
The Semantic Web is a project to derive consistent meaning from websites through advanced search engines. Most web content is designed for humans. When you search for something, you don't always get what you had in mind. The semantic web aims to improve that by allowing search bots to extract meaning from semantically organized information. Sir Tim Berners-Lee, inventor of the World Wide Web, gives his road map for the semantic web (Sept 1998).

SERPs
SERP means Search Engine Results Page - those pages that result when you do a search query.

Stop Words
Stop words are any words, such as "the", "and", "a", "or", that add little value in being part of a search query string. Most engines do not store these when indexing web pages.

Tags
Tags refer to a topic category classification, primarily for weblog sites. So if you write a blog post about food, it might have tags such as "recipe", "italian", "mushrooms", "pasta". Tags are applied by the author of a post.

TLD
TLD means Top Level Domain and refers to the final part of the name of a web domain. For example, http://www.msn.com/ is a URL. The TLD is the ".com" part. The "msn" part is known as the second-level domain.

URL
URL means Uniform Resource Locator and essentially means the web address of a specific web page.

Web Feeds
Web feeds are a special form of web content that organizes new content from a website or blog into the form of headlines and excerpts. Web feeds make it easy to syndicate content online, as well to subscribe to such content for frequent browsing using a "web feed reader". (See "Bloglines" in the final section of this article.)
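As a rough illustration of the headline-and-excerpt structure described under Web Feeds, the following Python sketch reads a tiny RSS 2.0 document with the standard library. The feed content, the example.org addresses, and the function name read_headlines are made up for illustration; real feeds vary, and Atom feeds use different element names.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Library Blog</title>
    <item>
      <title>New cataloguing standard</title>
      <link>http://example.org/posts/1</link>
      <description>A short excerpt about cataloguing.</description>
    </item>
    <item>
      <title>Search tips for researchers</title>
      <link>http://example.org/posts/2</link>
      <description>A short excerpt about search operators.</description>
    </item>
  </channel>
</rss>"""


def read_headlines(feed_xml):
    """Return (title, link, excerpt) tuples for each item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        entries.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return entries


headlines = read_headlines(SAMPLE_FEED)
for title, link, excerpt in headlines:
    print(f"{title} <{link}>: {excerpt}")
```

A feed reader does essentially this for every subscribed site, then shows only the items it has not seen before.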


Refining Search Queries

All text-based search engines work on a query string supplied by the user. But most of the time, the results returned number in the hundreds or even millions of pages, making it difficult to find what you want. To reduce the number of results, we need to refine our search strings. To do that, we use special query operators that are derived mostly from Boolean logic, plus a few specialized operators.

All search engines use a fairly common set of advanced query operators (AQOs). However, not all engines process AQOs the same way. So if you do use advanced operators, you will want to play around with them in your favorite search engine to learn how they're handled. The operator descriptions below are generalized; not all engines will support them in exactly the way described.


General Query Operators
These include using double quotes to force results that include a specific text string, brackets "()", Booleans (AND, OR, NOT), and "+" or "-" (plus/ minus). Plus typically means include a term, and minus means exclude a term. For example:

  • library taxonomy is usually the same as +library +taxonomy, which is the same as library AND taxonomy. Both words have to appear in the results, but order and proximity may vary. If you want adjacent words (i.e., that specific string), use double quotes: "library taxonomy". Some engines offer a near operator as well, which controls proximity within a certain number of words, say ten.
  • Plural forms are usually automatically offered, as are some verb forms of a root word, unless double quotes are used.
  • The OR operator might be inclusive or exclusive, depending on the engine. For example, library OR taxonomy usually means either or both, but could mean one or the other only (exclusive or), which is like applying the next form in both directions: library without taxonomy, or taxonomy without library.
  • library NOT taxonomy means return only those web pages with just the word library, never with taxonomy. This is the same, in most engines, as +library -taxonomy.
  • Brackets help arrange processing order in complex queries. For example: (EMF OR "electro magnetic fields") AND health means that the SERPs must have the word health and either of the terms EMF or "electro magnetic fields". You can add a bit more leeway in some engines by using (EMF OR (electro magnetic fields)) AND health. This lets the order and proximity of the words electro magnetic fields be more flexible in the results.
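To make the effect of these operators concrete, here is a toy Python sketch that applies plus, minus, and double-quoted phrases to a handful of made-up documents. It is only an illustration under simplifying assumptions: the documents and function names are invented, matching is case-insensitive on whole words, OR and brackets are not handled, and real engines rely on indexes and word-form matching rather than scanning text.

```python
import re
import shlex

# Made-up miniature "index": document id -> text.
DOCS = {
    1: "Library taxonomy standards for catalogers",
    2: "A public library opens a new reading room",
    3: "Taxonomy of search engines",
    4: "The taxonomy used by this library is old",
}


def words(text):
    """Lowercase a text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(doc_words, phrase_words):
    """True if phrase_words occur adjacently, in order, in doc_words."""
    n = len(phrase_words)
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))


def search(query, docs=DOCS):
    """Return ids of documents matching a query of +terms, -terms and "phrases"."""
    required, excluded = [], []
    # shlex.split keeps a double-quoted phrase together as one token.
    for token in shlex.split(query):
        if token.startswith("-"):
            excluded.append(words(token[1:]))
        else:
            required.append(words(token.lstrip("+")))
    hits = []
    for doc_id, text in docs.items():
        doc_words = words(text)
        if (all(contains_phrase(doc_words, t) for t in required)
                and not any(contains_phrase(doc_words, t) for t in excluded)):
            hits.append(doc_id)
    return hits


print(search("library taxonomy"))      # both words, any order
print(search('"library taxonomy"'))    # adjacent words only
print(search("+library -taxonomy"))    # library without taxonomy
```

With this sample data the three queries return documents 1 and 4, then document 1 alone, then document 2 alone, which mirrors how each operator narrows the results.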


Site Operators
These are powerful operators that most engines have but which are not always well-known. While there is a common set of operators, a few engines have their own variations. Here is an amalgamated list. A few references are included after this section, if you are interested in finding out more. All of them consist of a predefined keyword and a colon, ":", character, which are then followed by a word or URL or domain name, etc. There should be no spaces on either side of the colon.

allinanchor:, inanchor: - Use allinanchor: to specify one or more words that must all be in anchor text. (See definition of anchor text in Glossary above.) Use inanchor to specify one word in anchor text and one or more words in the rest of the document body.
Example: allinanchor:librarian

allintitle:, intitle: - Use allintitle: to specify one or more words that must all be in the title of a web page. Use intitle: to check for a single word in the title, and one or more words in the document body.
Example: allintitle:librarians

allinurl:, inurl: - Use allinurl: to specify one or more words to be checked in the URL of a web page. Use inurl: to check one word in the URL and one or more words in the document body.
Example: allinurl:librarians

cache: - Varies by engine, but it typically shows the last cached version of a page.
Example: cache:http://lii.org

define: - Returns definitions of a specific word, from various sources.
Example: define:librarian

domain:, site: - Use with a domain name to limit searches to pages on that site.
Example: site:stanford.edu

filetype: - Use with a media file type (e.g., PDF) to limit SERPs to that type of document.
Example: library filetype:xls

info: - Provides engine-specific info about a particular URL or its parent site.
Example: info:becomealibrarian.org

link:, linkto: - Use this to find websites linking to a specific URL or domain.
Example: link:www.librarian.net

related: - Engines determine topic similarity of web pages on different sites. This operator, when used with a URL, will return pages from other sites that are similar.
Example: related:lii.org

There are actually many more specialized operators, some of which are covered in the references below. They are not absolutely necessary, but are useful for power users.
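The keyword-colon-value shape of these operators can be sketched in a few lines of Python. The snippet below is a simplified illustration, not how any particular engine parses queries: the function name split_query is hypothetical, the keyword set is copied from the list above, and each operator is assumed to take a single word as its value (the "all" variants can take several words in practice).

```python
# Operator keywords taken from the list above; engines differ in which they accept.
KNOWN_OPERATORS = {
    "allinanchor", "inanchor", "allintitle", "intitle", "allinurl", "inurl",
    "cache", "define", "domain", "site", "filetype", "info", "link", "linkto",
    "related",
}


def split_query(query):
    """Split a query into ({operator: value}, [free terms])."""
    operators, terms = {}, []
    for token in query.split():
        keyword, sep, value = token.partition(":")
        # Only treat the token as an operator when the keyword is known and a
        # value follows the colon directly (no space on either side).
        if sep and value and keyword.lower() in KNOWN_OPERATORS:
            operators[keyword.lower()] = value
        else:
            terms.append(token)
    return operators, terms


print(split_query("library filetype:xls"))
print(split_query("cache:http://lii.org"))
print(split_query("allintitle: librarians"))
```

The last call shows why the spacing rule matters: with a space after the colon, this sketch no longer sees an operator at all and treats both pieces as ordinary search terms.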


Miscellaneous Operators
Some engines offer additional operator functionality by allowing you to click on a checkbox. Some such features include domain exclusion, choice of site TLDs, and date published range. In some engines, you can specify year range by using something like 2000..2006.
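As a small sketch of what a range such as 2000..2006 expresses, the Python below parses the two-dot notation and checks whether a year falls inside it. The function names are hypothetical and the exact syntax an engine accepts may differ; this only illustrates the idea of an inclusive range.

```python
import re


def parse_year_range(token):
    """Return (start, end) for a token like '2000..2006', or None if it does not match."""
    match = re.fullmatch(r"(\d{4})\.\.(\d{4})", token)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    # Accept the years in either order.
    return (start, end) if start <= end else (end, start)


def in_range(year, token):
    """True if year falls inside the inclusive range described by token."""
    bounds = parse_year_range(token)
    return bounds is not None and bounds[0] <= year <= bounds[1]


print(parse_year_range("2000..2006"))
print(in_range(2003, "2000..2006"))
```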


Additional References
Here are a few links to pages about advanced queries.


Google



While Google is not the oldest existing engine around, it is the most popular, especially among web-savvy users. They have a whole host of features, and they're always adding more. They just added support for 17 new languages. While Google is more selective about what websites and web pages they index, advanced users tend to favor this engine over others. Here is Google's full list of advanced operators for search queries.



Yahoo



Yahoo was originally a human-approved directory that you paid to have your site listed in. They still do that, but they added YahooSearch to compete with Google, which displaced many other engines that either are no longer around or have shifted their focus away from the public web.

Yahoo may show far fewer results for some search keywords than Google, though more complex phrases often show significantly different results. Their advanced features include the ability to search specific TLDs (eg., .gov, .edu, .org, .com) and specific content (Creative Commons, adult, non-adult, subscription content, language results).



MSN



MSN Search (now called Live Search) is Microsoft's baby. Microsoft long dominated desktop computer software but has lagged behind in the Internet race, if its flat share price is any indication. For some reason, the average query in MSN tends to produce more results than in Google, though this conclusion is based on a very small sample of queries over a year. MSN almost certainly uses different criteria to index web pages, and Google is selective, as mentioned earlier. MSN's advanced features include language choice of search interface, language results, and safe search, amongst others. They also allow you to search for images, video and maps, within news or academic sites only, and in web feeds. There is a new QnA (Questions and Answers) beta feature, at the time of this writing, which lets you ask a question that a member of the community may answer for you, sometimes better than a search engine could.



Other Textual Search Engines

Below are some of the other text-based search engines, each of which enjoys a varying degree of mild popularity. Not every engine is included below (alphabetical order), but the following should give you a light overview of your options. Defunct engines are not mentioned, and this list is by no means comprehensive.


AllTheWeb


AllTheWeb is a search engine and information portal. Content is divided into web, news, pictures, video, and audio. For example, a search for "trees" under the video category produces results of only video files that have the text string "trees" in the file name. If you were looking for MP3 files of legendary blues master Robert Johnson, you could use the audio tab to get all 2,666 results. (Is that a joke? Who knows. Johnson was said to have made a deal with the devil to be the best blues guitarist ever, and is reputed to have 3 graves.) AllTheWeb has a number of advanced operators in their query language. A look under the hood reveals that AllTheWeb is just YahooSearch white-labelled.
**Alltheweb utilizes the Yahoo database


AltaVista


Altavista was one of the early challengers for the search engine throne, appearing probably around 1994-95. They were at one time one of the fastest search engines around, based on pure computing horsepower, and were somewhat popular, briefly, until Google appeared. They appear at present to be a white-labelled YahooSearch, with advanced features that are standard.
**Altavista utilizes the Yahoo database


Ask + Excite


Ask.com is part of a group of engines and "information retrieval products" owned by IAC Search & Media. This group includes excite.com, which was once extremely popular when it debuted around 1995, a few years before Google. Ask.com was once called AskJeeves, and was white-labelled by a number of web portals that regular readers visited daily.

Ask also offers some non-standard search functionality (added courtesy of Gary Price)

1) Definition of non-alphanumeric searches
We have started to slowly offer non-alphanumeric searches

2) Zip Code search
Notice the box to help you select the proper state and to see all the Zips for a specific city.

3) Blog and feed search
Note the pull-down boxes to subscribe to a feed (even using a competitor's reader) or post the item with one click using digg, Reddit, etc.

4) New, event listings.
Part of the new AskCity service.




Blogsearch (Google)


Google Blogsearch works in the same way as regular Google, but the results are dedicated to weblog sites only. That does not mean blogs are not included in regular Google, but they don't show as prominently there. This way, if you are specifically looking for topics discussed in blogs only, it's easier to find them, since the many millions of web pages from regular websites have been filtered out.



Hotbot (Lycos)


Lycos has enjoyed some popularity and even a loyal following. You can find a summary of advanced features there.



Indeed



Indeed is a job search engine, in case you are looking for a new job in the Library Sciences field. They also have a version for Canadian jobs.



Information



Information appears to be a specialized engine that also categorizes content into the groups web search, encyclopedia, blogs, articles, groups.



Librarian's Internet Index


The publicly-funded LII, or Librarian's Internet Index, is more of an information portal than a search engine. Each week, hand-selected websites adhering to some current theme are added to the Index, and their content can be searched in the LII. There's also a free newsletter that you can subscribe to. New entries can be subscribed to via the web feed.


Northern Light


Northern Light focuses on offering searches of a wide variety of business content as well as industry journals.



Technorati


Like Google Blogsearch, Technorati is dedicated to weblogs only. However, it's far more than just a search engine and includes many features specifically of use to bloggers. Technorati, amongst other features, lets you know what is popular in a number of blog categories and content types (text, video), as well as in topics. It's also easy to determine what other weblogs are linked to a specific weblog. Finally, in addition to searching indexed blog posts, you can also search through blog post tags and other blog directories.



Digg
Digg is a new form of search engine based on social community as the driving force for relevance. The technique is fairly new, but seems to be catching on with increasing popularity. Digg is currently the 2nd highest trafficked site in the "tech" category according to several resources.

Other Librarian Search Resources

Additional Librarian search resources added courtesy of Gary Price's insight.

Miscellaneous Resources

To round out the discussion, here are a couple of other resources that may be of interest to librarians or anyone doing regular research.


API/ SDKs For White-Labelling Custom Engines
Several search engines offer APIs (Application Programmer Interfaces) and SDKs (Software Development Kits) that allow you to embed their functionality into your own web applications. Thus, you could very easily use, say, a customized Google to build a special librarian's search engine, which would index a select set of websites and weblogs pertaining to library sciences.
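No specific API is described here, so the sketch below deliberately avoids one. Instead it approximates the "select set of websites" idea using only the site: operator, OR, and brackets covered earlier, and builds a single query string in Python. The function name restricted_query and the site shortlist (borrowed from examples in this article) are assumptions for illustration, and whether a given engine accepts this combination of operators will vary.

```python
from urllib.parse import quote_plus

# Hypothetical shortlist of library-related sites, borrowed from examples in this article.
LIBRARY_SITES = ["lii.org", "www.librarian.net", "becomealibrarian.org"]


def restricted_query(terms, sites=LIBRARY_SITES):
    """Build a query string that limits `terms` to the given sites."""
    site_clause = " OR ".join(f"site:{site}" for site in sites)
    return f"({site_clause}) {terms}"


query = restricted_query('"library taxonomy"')
print(query)
# URL-encoded form, suitable as a parameter value in a search URL.
print(quote_plus(query))
```

A real custom engine built on a vendor's API or SDK would normally configure the site list on the vendor's side rather than in every query; this is only a stand-in that needs no account or key.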


Bloglines Web Feed Subscription Tool


If you plan to browse dozens or even hundreds of websites and weblogs on a daily or otherwise regular basis, one of the best tools for this is Bloglines. Professional bloggers and online researchers have been known to use this tool to monitor new articles/ blog posts from as many as 1,000 sites. The drawback is that only websites/ weblogs that publish a web feed can be tracked in this manner. Bloglines is owned by the same company as Ask.com and Excite.


Meta Search Engines
There are a couple of search tools, such as Dogpile and Metacrawler, that take your query and submit it to several engines simultaneously, returning to you aggregate SERPs.
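How Dogpile or Metacrawler actually combine results is not described here, so the following Python sketch simply assumes one plausible approach: take the top result from each engine in turn, then the second, and so on, skipping pages already seen. The result lists and example.org addresses are made up, and the function name aggregate is hypothetical.

```python
from itertools import zip_longest


def aggregate(result_lists):
    """Interleave ranked URL lists from several engines, dropping duplicates."""
    seen, merged = set(), []
    # zip_longest walks rank 1 of every engine, then rank 2, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up result lists standing in for three engines' answers to one query.
engine_a = ["http://example.org/a", "http://example.org/b"]
engine_b = ["http://example.org/b", "http://example.org/c", "http://example.org/d"]
engine_c = ["http://example.org/a"]

merged_results = aggregate([engine_a, engine_b, engine_c])
print(merged_results)
```

Interleaving by rank keeps each engine's best results near the top, while the duplicate check ensures a page found by several engines appears only once.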


Search Tutorials
The Learning Site has a six-part tutorial on web searching, sleuthing and sifting through information on the Internet.


Web Search Start Point
Accesscom.com has a somewhat out of date list of 200+ categorized hyperlinks, in case you are looking for something but don't know where to start, as they put it.


Query Views
Ever wonder what other people are searching for? Metacrawler's SearchSpy gives you a scrolling, near-realtime list of actual search query strings. There are two versions: unexposed and exposed, with the latter being unfiltered - that is, with possible adult content.