Search Engines: Information Retrieval in Practice. W. Bruce Croft, University of Massachusetts, Amherst. Donald Metzler, Yahoo! Research. Trevor Strohman. This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Extensive lecture slides (in PDF and PPT format). The supplements are available at pocboarentivi.gq.
Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler, and Trevor Strohman. About the book. Written by a leader in the field of information retrieval, Search Engines: Information Retrieval in Practice is designed to give undergraduate.
1. Search Engines and Information Retrieval (pdf, ppt)
2. Architecture of a Search Engine (pdf, ppt)
3. Crawls and Feeds (pdf, ppt)
4. Processing Text (pdf, ppt)
The document parser uses knowledge of the syntax of the markup language to identify the structure. This does not tell us. Both document and query text must be transformed into tokens in the same manner so that they can be easily compared. HTML is the default language used for specifying the structure of web pages. XML has much more flexibility and is used as a data interchange format for many applications. In many cases. Tokenizing the text is an important first step in this process.
Some applications. The structured data consists of document metadata and other information extracted from the documents. In some languages. Stemming Stemming is another word-level transformation. Tags and other control sequences must be treated appropriately when tokenizing. Similar to stopping. Stopping The stopping component has the simple task of removing common words from the stream of tokens that become index terms. Aggressive stemming can cause search problems. Some stopword lists used in research contain hundreds of words.
It may not be appropriate. The task of the stemming component or stemmer is to group words that are derived from a common stem. The most common words are typically function words that help form sentence structure but contribute little on their own to the description of the topics covered by the text. Because they are so common. Despite these potential advantages. Some search applications use more conservative stemming. Depending on the retrieval model that is used as the basis of the ranking.
To avoid this. By replacing each member of a group with one designated word for example. Other types of documents. This may be as simple as words in bold or words in headings.
Classifier The classifier component identifies class-related metadata for documents or parts of documents. In contrast. Extraction means that this information is recorded in the document data store. Anchor text. Link extraction and analysis Links and the corresponding anchor text in web pages can readily be identified and extracted during document parsing. Information extraction Information extraction is used to identify index terms that are more complex than single words.
Research in this area has focused on techniques for extracting features with specific semantic content. This covers a range of functions that are often described separately. Extracting syntactic features such as noun phrases. Classification techniques assign predefined class labels to documents. Link analysis provides the search engine with a rating of the popularity. The weighting component calculates weights using the document statistics and stores them in lookup tables.
There are many variations of these weights. The idf weight is called inverse document frequency because it gives high weights to terms that occur in very few documents. Weighting Index term weights reflect the relative importance of words in documents.
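The idf idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the formula log(N / n) is one common variant (the text notes there are many), and the function names and the tiny document collection are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(N / n)} where N is the number of documents
    and n is the number of documents containing the term (assumed variant)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}


def tf_idf(tokens, idf):
    """Weight each term of one document by its count times its idf."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items()}


# Hypothetical three-document collection, already tokenized.
docs = [
    ["tropical", "fish", "tank"],
    ["fish", "food"],
    ["fish", "aquarium", "fish"],
]
idf = idf_weights(docs)
weights = tf_idf(docs[2], idf)
```

In this toy collection "fish" occurs in every document, so its idf is zero, while "aquarium" occurs in only one document and receives the highest weight.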
The document statistics are stored in lookup tables. One of the most common types used in older retrieval models is known as tf.idf weighting. The actual data required is determined by the retrieval model and associated ranking algorithm. The types of data generally required are the counts of index term occurrences (both words and more complex features) in individual documents. These document groups can be used in a variety of ways during ranking or user interaction. This information is used by the ranking component to compute scores for documents.
The specific form of a weight is determined by the retrieval model. Weights could be calculated as part of the query process. Clustering techniques are used to group related documents without predefined categories.
User Interaction. By distributing the indexes for a subset of the documents (document distribution). Peer-to-peer search involves a less organized form of distribution where each node in a network maintains its own indexes and collection of documents. The format of the inverted indexes is designed for fast query processing and depends to some extent on the ranking algorithm used.
Index distribution The index distribution component distributes indexes across multiple computers and potentially across multiple sites on a network. An operator is a command in the query language that is used to indicate text that should be treated in a special way. The simplest query languages. Distributing the indexes for a subset of terms (term distribution) can also support parallel processing of queries. Inversion The inversion component is the core of the indexing process. Its task is to change the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes.
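The inversion step can be illustrated with a small Python sketch. It assumes the document-term stream arrives as (document id, list of terms) pairs and that a posting holds a document id and a term count; both choices, and the name `invert`, are hypothetical simplifications rather than the format of any particular engine.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn (doc_id, [terms]) pairs into {term: [(doc_id, count), ...]}."""
    index = defaultdict(dict)
    for doc_id, terms in doc_terms:
        for term in terms:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Keep each posting list sorted by document id.
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Hypothetical stream coming out of text transformation.
stream = [
    (1, ["fish", "tank"]),
    (2, ["fish", "fish", "food"]),
]
index = invert(stream)
```

The input is organized by document, the output by term, which is the change of direction the inversion component is responsible for.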
In general. These techniques often leverage the extensive query logs collected for web applications. Boolean query languages have a long history in information retrieval. In both cases. Because the ranking algorithms for most web search engines are designed for keyword queries.
A keyword is simply a word that is important for specifying the topic of a query. Other query languages include these and other operators in a probabilistic framework designed to allow specification of features related to both document structure and content. Query transformation The query transformation component includes a range of techniques that are designed to improve the initial query.
Spell checking and query suggestion are query transformation techniques that produce similar output. The operators in this language include Boolean AND. One of the challenges for search engine design is to give good results for a range of queries.
Query expansion techniques. A typical web query. An example of an operator in a simple query language is the use of quotes to indicate that the enclosed words should occur as a phrase in the document. More complex query languages are available. The simplest processing involves some of the same text transformation techniques used on document text.
For other search engines. In applications that involve documents in multiple languages. This may include tasks such as generating snippets to summarize the retrieved documents. The designers of some search engines explicitly state the retrieval model they use. Ranking (scoring, optimization, distribution). In Chapter 7. Relevance feedback is a technique that expands queries based on term occurrences in documents that are identified as relevant by the user.
The features and weights used in a ranking algorithm. The term weights depend on the particular retrieval model being used.
Results output The results output component is responsible for constructing the display of ranked documents coming from the ranking component. This is the task of the performance optimization component. Performance optimization Performance optimization involves the design of ranking algorithms and the associated indexes to decrease response time and increase query throughput.
Given a particular form of document scoring. Caching is another form of distribution where indexes or even ranked document lists from previous queries are left in local memory. In this document-at-a-time scoring. Distribution Given some form of index distribution.
This is referred to as term-at-a-time scoring. The operation of the broker depends on the form of index distribution. Unsafe optimizations.
A query broker decides how to allocate queries to processors in a network and is responsible for assembling the final ranked list for the query. If the query or index term is popular. Safe optimizations guarantee that the scores calculated will be the same as the scores without optimization.
The document scores must be calculated and compared very rapidly in order to determine the ranked order of the documents that are given to the results output component. Another alternative is to access all the indexes for the query terms simultaneously. Query logs can be used for spell checking. A variety of performance measures are used. Performance analysis The performance analysis component involves monitoring and improving overall system performance.
Measures that emphasize the quality of the top-ranked documents. This is a critical part of improving a search engine and selecting values for parameters that are appropriate for the application. A variety of evaluation measures are commonly used. Ranking analysis Given either log data or explicit relevance judgments for a large number of query. The equivalent for performance analysis is simulations.
For ranking analysis. This means that logs of user clicks on documents (clickthrough data) and information such as the dwell time (time spent looking at a document) can be used to evaluate and train ranking algorithms. Documents in a result list that are clicked on and browsed tend to be relevant. Another system overview for an earlier general-purpose search engine (Inquery) is found in Callan et al. A comprehensive description of the Lucene architecture and components can be found in Hatcher and Gospodnetic. Now you know the names and the basic functions of the components of a search engine.
References and Further Reading Detailed references on the techniques and models mentioned in the component descriptions will be given in the appropriate chapters. Describe which low-level components are used to answer this type of query and the sequence in which they are used.
There are a few general references for search architectures. Exercises 2. Each chapter describes. The classic research paper on web search engine architecture. A more-like-this query occurs when the user can click on a particular document in the result list and tell the search engine to find documents that are similar to this one. Find some examples of the search engine components described in this chapter in the Galago code.
A database textbook. There are some similarities at the high level. Document filtering is an application that stores a large number of queries or user profiles and compares these profiles to every incoming document on a feed. Although we focus heavily on the technology that makes search engines work. Even useful documents can become less useful over time. Spider Man 2 3. On the other hand. The frustration of finding out-of-date web pages and links in a search result list is.
Every time a search engine adds another document. This is especially true of news and financial information where. Web search engines. The title of this section implies the question. Every document answers at least one question i. In other words. The owners of that site may not want you to copy some of the data. There are at least tens of billions of pages on the Internet. This instantly solves one of the major problems of getting information to search.
Even if you know that you want to copy all the pages from www. Every time a user adds a new blog post or uploads a photo. Unlike some of the other sources of text we will consider later. Along the way. By the end of this chapter you will have a solid grasp on how to get document data into a search engine. Some of the data you want to copy may be available only by typing a request into a form.
Another problem is that web pages are usually not under the control of the people building the search engine database. We will discuss strategies for storing documents and keeping those documents up-to-date. Even if the number of pages in existence today could be measured exactly.
Most organizations do not have enough storage space to store even a large fraction of the Web. The biggest problem is the sheer scale of the Web. Finding and downloading web pages automatically is called crawling. Once the connection is established.
In the figure. Web pages are stored on web servers. A uniform resource locator (URL). By convention. Any URL used to describe a web page has three parts: A port is just a 16-bit number that identifies a particular service.
This IP address is a number that is typically 32 bits long. The hostname follows. The program then attempts to connect to a server computer with that IP address. A POST request might be used when you click a button to download something or to edit a web page. If the client wants more pages.
A client can also fetch web pages using POST requests. This convention is useful if you are running a web crawler. GET requests are used for retrieving data that already exists on the server. Just imagine how the request queue works in practice.
Reasonable web crawlers do not fetch more than one page at a time from a particular web server. At any one time. Fetching hundreds of pages at once is good for the person running the web crawler.
To avoid this problem. When a web page like www. Notice that the web crawler spends a lot of its time waiting for responses: During this waiting time. If the crawler finds a new URL that it has not seen before.
In addition. The crawler will then attempt to fetch all of those pages at once. If the web server for www. This kind of behavior from web crawlers tends to make web server administrators very angry.
The crawler starts fetching pages from the request queue. These seeds are added to a URL request queue. This process continues until the crawler either runs out of disk space to store pages or runs out of useful links to add to the request queue. This allows web servers to spend the bulk of their time processing real user requests. The web crawler has two jobs: Once a page is downloaded.
The crawler starts with a set of seeds. The frontier may be a standard queue. If a crawler used only a single thread. To support this. The final block of the example is an optional Sitemap: Assume that the frontier has been initialized. The User-agent: Figure 3. The second block indicates that a crawler named FavoredCrawler gets its own set of rules: Following this line are Allow and Disallow rules that dictate which resources the crawler is allowed to access.
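The block structure described above can be tried out with Python's standard `urllib.robotparser` module. The robots.txt content, the host www.example.com, and the crawler name SomeOtherBot below are hypothetical; only FavoredCrawler comes from the text. Note one assumption about the parser: Python applies the first rule that matches, so the Allow line is listed before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a default block, a block for FavoredCrawler,
# and an optional Sitemap line at the end.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /confidential/
Allow: /other/public/
Disallow: /other/

User-agent: FavoredCrawler
Disallow:

Sitemap: http://www.example.com/sitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def permits_crawl(agent, url):
    """Return True if the rules above let this crawler fetch the URL."""
    return parser.can_fetch(agent, url)


favored = permits_crawl("FavoredCrawler", "http://www.example.com/private/report.html")
other = permits_crawl("SomeOtherBot", "http://www.example.com/private/report.html")
```

An empty Disallow line means nothing is disallowed, which is why the favored crawler may fetch pages that other crawlers may not.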
The web crawler needs to have URLs from at least 3. The file is split into blocks of commands that start with a User-agent: FavoredCrawler Disallow: Suppose a web crawler can fetch pages each second. Since many URLs will come from the same servers. To keep an accurate view of the Web. These URLs are added to the frontier. If it can be crawled. Once the text has been retrieved. In a real crawler. This is the most expensive part of the loop. The opposite of a fresh copy is a stale copy. In permitsCrawl.
A simple crawling thread implementation with a few URLs that act as seeds for the crawl. The document text is then parsed so that other URLs can be found. The crawling thread first retrieves a website from the frontier. When all this is finished.
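The crawling loop described here can be sketched as a single-threaded Python function. To stay runnable without network access it crawls a hypothetical in-memory "web" (the `FAKE_WEB` dictionary); the regular expression for links and the `permits_crawl` placeholder are simplifying assumptions, and a real crawler would also add politeness delays and run many threads.

```python
import re
from collections import deque

# Hypothetical stand-in for the Web: URL -> page text.
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/1">one</a> '
                         '<a href="http://b.example/">b</a>',
    "http://a.example/1": "no links here",
    "http://b.example/": '<a href="http://a.example/">back</a>',
}

HREF = re.compile(r'href="([^"]+)"')


def permits_crawl(url):
    """Placeholder; a real crawler would consult the site's robots.txt."""
    return True


def crawl(seeds, fetch=FAKE_WEB.get):
    """Fetch pages breadth-first from the frontier until it is empty."""
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already added to the frontier
    store = {}                # crawled pages, keyed by URL
    while frontier:
        url = frontier.popleft()
        if not permits_crawl(url):
            continue
        text = fetch(url)     # the expensive step in a real crawler
        if text is None:
            continue
        store[url] = text
        for link in HREF.findall(text):   # parse out other URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


pages = crawl(["http://a.example/"])
```

Here the frontier is a standard first-in, first-out queue, one of the options the text mentions.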
It does little good to continuously check sites that are rarely updated. Some of them. It simply is not possible to check every page every minute.
Notice that the date is also sent along with the response. A HEAD request reduces the cost of checking on a page. The HEAD request returns only header information about the page. The Last-Modified value indicates the last time the page content was changed. Even within a page type there can be huge variations in the modification rate. Over time.
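A hedged sketch of how a crawler might use the Last-Modified value follows. `fetch_last_modified` shows how a HEAD request can be issued with the standard library but is not called here, so the example runs offline; the function names, dates, and the rule "recrawl when the header is missing" are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url):
    """Send a HEAD request and return the Last-Modified header, or None."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Last-Modified")


def needs_recrawl(last_modified, crawled_at):
    """True if the page changed after our stored copy was fetched."""
    if last_modified is None:
        return True  # no information, so assume the copy may be stale
    return parsedate_to_datetime(last_modified) > crawled_at


# Hypothetical time at which our copy was crawled.
crawled_at = datetime(2024, 1, 10, tzinfo=timezone.utc)
stale = needs_recrawl("Mon, 15 Jan 2024 08:00:00 GMT", crawled_at)
fresh = needs_recrawl("Wed, 03 Jan 2024 12:00:00 GMT", crawled_at)
```

Because only headers are transferred, this check costs far less than fetching the page again.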
Not only would that attract more negative reactions from web server administrators. Suppose that http: Unless your crawler continually polls http: Freshness is then the fraction of the crawled pages that are currently fresh. They will look at http: Notice that if you want to optimize for freshness. Under the age metric. Age is a better metric to use. Of course. Age and freshness of a single page over time Under the freshness metric. In the top part of the figure.
We can calculate the expected age of a page t days after it was last crawled: If it will never be fresh. This gives us a formula to plug into the P(page changed at time x) expression: That is.
As the days go by. This positive second derivative means that the older a page gets. By the end of the week. We multiply that by the probability that the page actually changed at time x. Notice that the second derivative of the Age function is always positive. This means that if your crawler crawls each page once a week. Optimizing this metric will never result in the conclusion that optimizing for freshness does. This is because the page is unlikely to have changed in the first day. Notice how the expected age starts at zero.
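The behavior of expected age can be checked numerically. The sketch below assumes page changes follow a Poisson process with mean rate `rate` changes per day, so that P(page changed at time x) is rate * exp(-rate * x); the expected age is then the integral of that probability times (t - x) for x from 0 to t, which has the closed form t - (1 - exp(-rate * t)) / rate. The rate of one change per week is an assumed value chosen to match the once-a-week crawl mentioned above.

```python
import math


def expected_age(rate, t):
    """Closed form of the integral of rate*exp(-rate*x)*(t - x), x in [0, t]."""
    return t - (1.0 - math.exp(-rate * t)) / rate


def expected_age_numeric(rate, t, steps=100_000):
    """Midpoint-rule evaluation of the same integral, as a cross-check."""
    dx = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += rate * math.exp(-rate * x) * (t - x) * dx
    return total


RATE = 1 / 7  # assumed: the page changes about once a week
age_after_week = expected_age(RATE, 7)
age_check = expected_age_numeric(RATE, 7)
```

Under these assumptions the second derivative with respect to t is rate * exp(-rate * t), which is positive for every t, consistent with the observation that the cost of not recrawling grows as a page gets older.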
Studies have shown that. Some studies have estimated that the deep Web is over a hundred. The anchor text in the outgoing links is an important clue of topicality. A focused crawler attempts to download only those pages that are about a particular topic.
Focused crawlers rely on the fact that pages about a topic tend to have links to other pages on the same topic. A less expensive approach is focused. Anchor text data and page link topicality data can be combined together in order to determine which pages should be crawled next.
If it is. In practice. If built correctly. As links from a particular web page are visited. If this were perfectly true.
The computational cost of running a vertical search will also be much less than a full web search. The most accurate way to get web pages for this kind of engine would be to crawl a full copy of the Web and then throw out all unrelated pages. Chapter 9 will introduce text classifiers. This strategy requires a huge amount of disk space and bandwidth.
Static pages are files stored on a web server and displayed in a web browser unmodified. These sites generally want to block access from crawlers. Web administrators of sites with form results and scripted pages often want their sites to be indexed.
Sometimes people make a distinction between static pages and dynamic pages. You are shown flight information only after submitting this trip information. Other websites have static pages that are impossible to crawl because they can be accessed only through web forms. The site owner can usually modify the pages slightly so that links are generated by code on the server instead of by code in the browser.
Many websites have dynamically generated web pages that are easy to crawl. Even though you might want to use a search engine to find flight timetables. Typically it is assumed that static pages are easy to crawl. If a link is not in the raw HTML source of the web page. Most sites that are a part of the deep Web fall into three broad categories: Although this is technically possible. Even with good guesses. In section 3. Adding a million links to the front page of such a site is clearly infeasible.
An example sitemap file A robots. Sitemaps solve both of these problems. Another option is to let the crawler guess what to enter into forms. The first entry also includes a priority tag with a value of 0.
If the sites that are being crawled are close. A sitemap allows crawlers to find this hidden content. A simple web crawler will not attempt to enter anything into a form although some advanced crawlers do.
With an average web page size of 20K. This helps reduce the number of requests that the crawler sends to a website without sacrificing page freshness. As throughput drops and latency rises. Decreased throughput and increased latency work together to make each page request take longer.
There may not be any links on the website to these pages. One reason to use multiple computers is to put the crawler closer to the sites it crawls. The first entry includes a lastmod tag. The changefreq tag gives the crawler a hint about when to check a page again for changes. Suppose these are two product pages. Look at the second and third URLs in the sitemap. Long-distance network connections tend to have lower throughput (fewer bytes copied per second) and higher latency (bytes take longer to cross the network).
Why would a web server administrator go to the trouble to create a sitemap? One reason is that it tells search engines about pages it might not otherwise find. This tells crawlers that this page is more important than other pages on this site. The changefreq tag indicates how often this resource is likely to change.
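A crawler could read the sitemap tags discussed here with the standard `xml.etree.ElementTree` module. The sitemap below is a hypothetical example (URLs and values are made up); the namespace is the one defined by the sitemaps.org protocol, and the fallback priority of 0.5 is that protocol's default.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap with two entries.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/items?item=truck</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per url entry with loc, lastmod, changefreq, priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    return entries


entries = parse_sitemap(SITEMAP_XML)
```

A crawler might then order its frontier by priority and use lastmod and changefreq to decide when a stored copy needs to be checked again.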
In the discussion of page freshness. Why would a single crawling computer not be enough? We will consider three reasons. The sitemap also exposes modification times.
Each one contains a URL in a loc tag. If the sites are farther away. Spreading crawling duties among many computers reduces this bookkeeping load. Crawling a large portion of the Web is too much work for a single computer to handle. A crawler has to remember all of the URLs it has already crawled. A distributed crawler is much like a crawler on a single computer. This assigns all the URLs for a particular host to a single crawler. Multiplying 50 by ms. Another reason for multiple crawling computers is to reduce the number of sites the crawler has to remember.
When a crawler sees a new URL. These URLs are gathered in batches. It is easier to maintain that kind of delay by using the same crawling computers for all URLs for the same host. Yet another reason is that crawling can use a lot of computing resources. Although this may promote imbalance since some hosts have more pages than others.
The distributed crawler uses a hash function to assign URLs to crawling computers. The hash function should be computed on just the host part of each URL. These URLs must be easy to access. By assigning domain.
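As a minimal sketch of this assignment scheme, the function below hashes only the host part of a URL and reduces the result modulo the number of crawling computers. The choice of SHA-256, the function name, and the example URLs are assumptions; any stable hash function would serve.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index using a hash of its host only."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Hypothetical URLs: two on the same host, one on another host.
a = assign_crawler("http://www.example.com/index.html", 8)
b = assign_crawler("http://www.example.com/products/truck.html", 8)
c = assign_crawler("http://news.example.org/today", 8)
```

Because the path is ignored, every URL from one host lands on the same crawling computer, which makes per-host politeness delays easier to enforce.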
The data structure for this lookup needs to be in RAM. This means that five connections will be needed to transfer 50 pages in one second. This is less true on a desktop system. There are unique challenges in crawling desktop data. In some ways. In desktop search applications. On a desktop computer. Remote file systems from file servers usually do not provide this kind of change notification. This means. The first concerns update speed.
Many of the problems of web crawling change when we look at desktop data. In web crawling. A desktop crawler instead may need to read documents into memory and send them directly to the indexer. Since websites are meant to be viewed with web browsers. Crawling the file system every second is impractical. In companies and organizations. In this section.
This information can be searched using a desktop search tool. Disk space is another concern. We will discuss indexing more in Chapter 5. With a web crawler. This feed contains two articles: RSS has at least three definitions: Really Simple Syndication. This is like a telephone. Most information that is time-sensitive is published. A push feed alerts the subscriber to new documents. The proliferation of standards is the result of an idea that gained popularity too quickly for developers to agree on a single standard.
We will focus primarily on pull feeds in this section. Notice that each entry contains a time indicating when it was published. News articles. News feeds from commercial news agencies are often push feeds. Since each published document has an associated time. The file access permissions of each file must be recorded along with the crawled data. We can distinguish two kinds of document feeds. RSS also has a number of slightly incompatible implementations.
A document feed is particularly interesting for crawlers. This is especially important when we consider crawling shared network file systems. The most common format for pull feeds is called RSS. RDF Site Summary.
A pull feed requires the subscriber to check periodically for new documents. Mark your calendars and check for cheap flights. Feeds give a natural structure to data. RSS feeds are accessed just like a traditional web page. From a crawling perspective. This gives a crawler an indication of how often this feed file should be crawled.
Microsoft Word. Standard text file formats include raw text. These formats are easy to parse. In addition to all of these formats. It is not uncommon for a commercial search engine to support more than a hundred file types. You can see this on any major web search engine.
For some document types. There are tens of other less common word processors with their own file formats. Feeds are easy to parse and contain detailed time information. The most common way to handle a new file format is to use a conversion tool that converts the document content into a tagged text format such as HTML or XML. Search for a PDF document. Most importantly. Note that not all encodings even agree on English.
ASCII encodes letters. The text that you see on this page is a series of little pictures we call letters or glyphs.
For English. Until recently. This scheme is fine for the English alphabet of 26 letters. A character encoding is a mapping between bits and glyphs. The computer industry has moved slowly in handling complicated character sets such as Chinese and Arabic. Accurate conversion of formatting information allows the indexer to extract these important features.
Numbers above are mapped to glyphs in the target language. The first values of each encoding are reserved for typical English characters. As we will see later. Other languages. This is critical for obsolete file formats. The Chinese language.
The proliferation of encodings comes from a need for compatibility and to save space.
In binary. Jumping to the twentieth character in a UTF string is easy: Unicode was developed. Because of this. UTF-8 encoding Table 3. By contrast. Unicode is a single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages.
The left columns represent ranges of decimal values. This solves the problem of using multiple languages in a single file. The x characters represent binary digits. The second row of the table tells us.
It turns out that there are many ways to translate Unicode numbers to glyphs! Some of the most popular include UTF It makes sense to keep copies of the documents around instead of trying to fetch them again the next time you want to build an index. Most other kinds of search engines need to store documents somewhere. Other kinds of extraction are possible. The bold binary digits are the same as the digits from the table. Keeping old documents allows you to use HEAD requests in your crawler to save on bandwidth.
Crawling for documents can be expensive in terms of both CPU and network load. Notice that if information extraction is used in the search application. The final encoding is CF80 in hexadecimal. The most pervasive kind of information extraction happens in web search engines. We now discuss some of the basic requirements for a document storage system.
In desktop search. The high 5 bits of the character go in the first byte. The simplest document storage is no document storage.
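The two-byte case can be worked through in Python. The sketch assumes the character in question is U+03C0 (Greek small letter pi, decimal 960), whose UTF-8 encoding is CF80 in hexadecimal; the function name is hypothetical, and the result is compared against Python's built-in encoder.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte form covers code points 128 to 2047 only")
    first = 0b11000000 | (code_point >> 6)           # 110xxxxx: high 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low 6 bits
    return bytes([first, second])


encoded = encode_two_byte(0x3C0)      # assumed example character: pi
builtin = "\u03c0".encode("utf-8")    # Python's own UTF-8 encoder
```

Both values are the two bytes CF and 80: the high 5 bits of the code point fill the first byte after the 110 prefix, and the low 6 bits fill the second byte after the 10 prefix.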
By not storing the intermediate converted documents. These snippets of text give the user an idea of what is inside the retrieved document without actually needing to click on a link. As the crawling process runs. Even if snippets are not necessary. Fast access to the document text is required in order to build document snippets for each search result.
Compared to a full relational database. We want a data store such that we can request the content of a document based on its URL.
Databases also tend to come with useful import and analysis tools that can make it easier to manage the document collection. Using a hash function on the URL gives us a number we can use to find the data. For many applications. Most databases also run as a network server. For larger installations. Database vendors also tend to expect that database servers will use the most expensive disk systems. Many companies that run web search engines are reluctant to talk about their internal technologies.
One problem is the sheer volume of document data. Once the document location has been narrowed down to a single file. This could support. For small installations. The easiest way to handle this kind of lookup is with hashing.
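A toy version of hash-based lookup by URL is sketched below. Each in-memory bucket stands in for one of the large files a real store would use; the class name, the bucket count, and the use of SHA-256 are all assumptions for illustration.

```python
import hashlib


class DocumentStore:
    """Toy URL-keyed store; each bucket stands in for one large file."""

    def __init__(self, num_buckets=16):
        self.buckets = [dict() for _ in range(num_buckets)]

    def _bucket(self, url):
        # The hash of the URL gives a number used to pick the bucket.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        return self.buckets[int.from_bytes(digest[:8], "big") % len(self.buckets)]

    def put(self, url, content):
        self._bucket(url)[url] = content

    def get(self, url):
        return self._bucket(url).get(url)


store = DocumentStore()
store.put("http://www.example.com/", "<html>home</html>")
home = store.get("http://www.example.com/")
missing = store.get("http://www.example.com/none")
```

The hash narrows the search to one bucket; a second lookup inside the bucket then finds the document itself.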
We discuss an alternative to a relational database at the end of this section that addresses some of these concerns. Even a document that seems long to a person is small by modern computer standards. A good size choice might be in the hundreds of megabytes. By storing documents close together. A better solution is to store many documents in a single file. TREC Text. This is why storing each document in its own file is not a very good idea.
Compression techniques exploit this redundancy to make files smaller without losing any of the content. In each format. The Galago search engine includes parsers for three compound document formats: This space savings reduces the cost of storing a lot of documents.
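The effect of redundancy on compression can be seen with Python's standard `zlib` module. This is only an illustration: the repeated HTML line is a made-up, highly redundant input, and zlib is an assumed choice rather than the method any particular search engine uses.

```python
import zlib

# Hypothetical, highly redundant document text.
text = ("<p>Tropical fish include fish found in tropical environments.</p>\n" * 200).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Fraction of the original size that the compressed form occupies.
ratio = len(compressed) / len(text)
```

The restored bytes are identical to the original, so no content is lost, while the compressed form is much smaller because the input repeats itself.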
We will cover compression as it is used for document indexing in Chapter 5. While research continues into text compression. That is far bigger than the average web page. Even though large files make sense for data transfer from disk.
At the beginning of the document. Compression works best with large blocks of data. The alternative is to create an entirely new document store by merging the new. If you want random access to the data. The HTML code in the figure will render in the web browser as a link. An example link with anchor text Another important reason to support update is to handle anchor text.
If the document data does not change very much. When it is time to index the document. A simple way to approach this is to use a data store that supports update.
Small blocks reduce compression ratios (the amount of space saved) but improve request latency. When a document is found that contains anchor text.
Anchor text is an important feature because it provides a concise summary of what the target page is about. Most compression methods do not allow random access. SQL allows users to write complex and computationally expensive queries. In the next few paragraphs. BigTable is a distributed database system originally built for the task of storing web pages. Because some of these queries could take a very long time to complete. The table is split into small pieces.
BigTable is a working system in use internally at Google. A BigTable instance really is a big table. BigTable is the most well known of these systems (Chang et al.). Most relational databases store their data in files that are constantly modified.
Once file data is written to a BigTable file. The tablets. Most of the engineering in BigTable involves failure recovery. The URL. Periodically the files are merged together to reduce the total number of disk files. This also helps in failure recovery. BigTable stores its data in immutable (unchangeable) files. To allow for table updates. In BigTable. The row has many columns. The combination of a row key. Any changes to a BigTable tablet are recorded to a transaction log.
In relational database systems. There is no query language. If any tablet server crashes.
Performance and correctness measures
The evaluation of an information retrieval system is the process of assessing how well a system meets the information needs of its users. In general, measurement considers a collection of documents to be searched and a search query.
Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall.
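Precision and recall can be computed directly from a retrieved set and a set of relevant documents. The sketch below uses the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved); the document ids are hypothetical, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three known to be relevant.
precision, recall = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d3", "d7"],
)
```

Two of the four retrieved documents are relevant, giving a precision of 0.5, and two of the three relevant documents were found, giving a recall of two thirds.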
All measures assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice, queries may be ill-posed and there may be different shades of relevancy. Timeline. Before the s Joseph Marie Jacquard invents the Jacquard loom, the first machine to use punched cards to control a sequence of operations. That same year, Kent and colleagues published a paper in American Documentation describing the precision and recall measures as well as detailing a proposed "framework" for evaluating an IR system which included statistical sampling methods for determining the number of relevant documents not retrieved.
Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation. See: Cyril W. Cranfield Collection of Aeronautics, Cranfield, England, Kent published Information Analysis and Retrieval. Alvin Weinberg. Joseph Becker and Robert M.
Hayes published text on information retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories.
New York, Wiley Project Intrex at MIT. The design of indexes for search engines is one of the major topics in this book.
Collection size N is After all words have been processed.
Models with immanent term interdependencies allow a representation of interdependencies between terms.