Course Syllabus for CS 371R
Information Retrieval and Web Search


Chapter numbers refer to the text: Introduction to Information Retrieval
  1. Introduction: Chapter 1.

    Goals and history of IR. The impact of the web on IR.

  2. Basic IR Models: Chapters 1 & 6.

    Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.

  3. Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval: Chapters 2 & 6.

    Simple tokenizing, stop-word removal, and stemming; inverted indices; efficient processing with sparse vectors; Java implementation.

  4. Experimental Evaluation of IR: Chapter 8.

    Performance metrics: recall, precision, and F-measure; evaluation on benchmark text collections.

  5. Query Operations and Languages: Chapters 9 & 3.

    Relevance feedback; query expansion; query languages.

  6. Text Representation: Section 5.1 and Chapter 10.

    Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).

  7. Web Search: Chapters 19, 20, & 21.

    Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g., hubs and authorities, Google PageRank); shopping agents.

  8. Text Categorization: Chapters 13 & 14.

    Categorization algorithms: Rocchio, nearest neighbor, and naive Bayes. Applications to information filtering and organization.

  9. Language-Model-Based Retrieval: Chapter 12.

    Using naive Bayes text classification for ad hoc retrieval. Improved smoothing for document retrieval.

  10. Text Clustering: Chapters 16 & 17.

    Clustering algorithms: agglomerative clustering; k-means; expectation maximization (EM). Applications to web search and information organization.

  11. Recommender Systems: Read this paper by Herlocker et al.

    Collaborative filtering and content-based recommendation of documents and products.

  12. Information Extraction and Integration

    Extracting data from text; semantic web; collecting and integrating specialized information on the web.