Project 4
CS 371R: Information Retrieval and Web Search
Evaluating Embeddings From Deep Language Models


Due: 11:59pm, November 29, 2023

This project explores using embeddings from an LLM to support standard document retrieval. First, you will produce document and query embeddings from a Python-based LLM and store them in directories of files containing the precomputed dense vectors. You will then use existing Java code, which augments the course IR code to support retrieval over precomputed dense vectors, to experimentally evaluate the embeddings you produced. Finally, you will implement and test a "hybrid" approach that combines this dense-retrieval approach with the existing VSR system and experimentally evaluate whether it improves on purely sparse or purely dense retrieval.

Generating Deep Embeddings

Overview

We will evaluate deep embeddings generated by a pre-trained transformer language model on the Cystic Fibrosis (CF) dataset introduced in Project 2. We will use the HuggingFace Transformers library with the PyTorch deep learning framework to generate the embeddings. Specifically, we will be using the SPECTER2 transformer model, a language model (LM) trained on scientific paper abstracts. SPECTER2 is first trained on over 6M triplets of scientific paper citations and is then trained with additional task-specific adapter modules. You can read more about the basic SPECTER approach in this paper. When paired with adapters, the pre-trained LM can generate task-specific embeddings for scientific tasks. Given the combined title and abstract of a scientific paper, or a short textual query, the model can generate effective embeddings for downstream applications. Read through the model card on HuggingFace to understand the model better.

Python Environment

You can install the Python packages within a conda environment on the UTCS lab machines. To install conda, follow the instructions here. Then set up an environment and install the packages as follows:
    conda create -n irproj python=3.10
    conda activate irproj
    pip install 'transformers[torch]'
    pip install -U adapter-transformers
Although conda is recommended, you may also install the Python packages directly without using conda.

Python Code

In this project, along with the SPECTER2 Base Model, we will use the Proximity Adapter to embed documents and the Adhoc Query Adapter to embed queries. All necessary models and tokenizers have already been downloaded and can be found in /u/mooney/ir-code/models/specter2_2023/. Do not download the models from the HuggingFace website; load them directly from the above path. We will compare two variants of SPECTER2 embeddings: one with just the base model and one with adapter modules attached to the base model. Each of the two cases is described below. The steps to generate embeddings are as follows (a minimal sketch appears after the list):
  1. Load the model and tokenizer from the appropriate paths as described above.
  2. Tokenize the input text using the tokenizer.
  3. Pass the tokenized input through the model to generate embeddings.
  4. Save the embeddings in the appropriate format.
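For concreteness, here is a minimal sketch of these steps, assuming the adapter-transformers fork installed above (where AutoAdapterModel can be imported from transformers). The model and adapter subdirectory names under /u/mooney/ir-code/models/specter2_2023/ are only illustrative; substitute the actual paths you find there:

    # Minimal sketch (not the required template code); paths below are illustrative.
    import torch
    from transformers import AutoTokenizer, AutoAdapterModel  # adapter-transformers fork

    MODEL_DIR = "/u/mooney/ir-code/models/specter2_2023/base"          # hypothetical subdirectory
    ADAPTER_DIR = "/u/mooney/ir-code/models/specter2_2023/proximity"   # hypothetical subdirectory

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoAdapterModel.from_pretrained(MODEL_DIR)

    # For the adapter variant, attach and activate the proximity (or adhoc query) adapter;
    # for the base variant, skip this step.
    model.load_adapter(ADAPTER_DIR, load_as="proximity", set_active=True)

    # SPECTER-style input: title and abstract joined by the tokenizer's separator token.
    text = "Some paper title" + tokenizer.sep_token + "Some paper abstract ..."
    inputs = tokenizer(text, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)

    # The [CLS] token's final hidden state is the 768-dim embedding.
    embedding = output.last_hidden_state[:, 0, :].squeeze(0)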
To generate embeddings, copy the Python files in /u/mooney/ir-code/deep_embedder. Make appropriate changes to the generate_embeddings.py template file and run the script as follows: NOTE: Do not modify test_embedder.py. This file will be run as is by the autograder (you may, however, change paths such as models-folder while debugging your code). You can fill in the generate_embeddings.py template file and implement the functions as you see fit. **You will submit only generate_embeddings.py**.

When saving embeddings, save them as space-separated 768-dimensional vectors in the following format:
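For example, a minimal sketch of writing one such embedding (the tensor produced in the sketch above) as a single line of space-separated values; the output directory and file name are only illustrative, and each file name should match the corresponding corpus document's file name:

    # Hypothetical example: write a 768-dim tensor as one line of space-separated reals.
    with open("embeddings-base/00001", "w") as f:
        f.write(" ".join(str(v) for v in embedding.tolist()))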

Sample embeddings for doc1 and query1 for both base and adapter versions are below:

You can use these to validate your implementation. You don't have to match them exactly, but the values shouldn't be too different.

Deep Retrieval

The course code in /u/mooney/ir-code/ir/ has been augmented with a new class, ir.vsr.DeepRetriever, a simple document retriever that uses precomputed dense document embeddings. It takes a directory of precomputed document embeddings whose files have the same names as those in the original corpus, where each file contains a simple space-separated list of real values representing the document embedding. Another new class used by the deep retriever is DeepDocumentReference, which stores a pointer to a file along with its dense vector and that vector's precomputed length (L2 norm). The 'retrieve' method returns a list of ranked retrievals for a query represented as a DeepDocumentReference, comparing the dense vectors using either Euclidean distance or cosine similarity (if the 'cosine' flag is used). No indexing is used to improve efficiency; a query is compared to every document in the corpus, and all of the scored results are included in the array of Retrievals. Ideally, some form of approximate nearest-neighbor search, such as locality-sensitive hashing, would be used; however, for the limited number of small documents and queries in the CF corpus, a brute-force approach is tractable.
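Conceptually (sketched here in Python for clarity, not the actual Java implementation), the brute-force ranking amounts to the following:

    # Conceptual sketch of brute-force dense retrieval; vectors are lists of floats.
    import math

    def score(query_vec, doc_vec, cosine=True):
        if cosine:
            dot = sum(q * d for q, d in zip(query_vec, doc_vec))
            return dot / (math.sqrt(sum(q * q for q in query_vec)) *
                          math.sqrt(sum(d * d for d in doc_vec)))
        # Negate Euclidean distance so that a larger score is always better.
        return -math.sqrt(sum((q - d) ** 2 for q, d in zip(query_vec, doc_vec)))

    def retrieve(query_vec, doc_vecs, cosine=True):
        """Compare the query to every document; return (doc_id, score) pairs, best first."""
        scored = [(doc_id, score(query_vec, vec, cosine)) for doc_id, vec in doc_vecs.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)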

'Deep' versions of the Experiment and ExperimentRated classes used in Project 2 are also provided as ir.eval.DeepExperiment and ir.eval.DeepExperimentRated, which produce precision-recall and NDCG plots evaluating the DeepRetriever. These use the normal 'queries' and 'queries-rated' files used by the normal experiment code but also take a directory of query embeddings, with one embedding file (a list of real values) per query. The query embedding directory should contain files named Q1,...,Qn giving the embeddings of the queries in the order they appear in the original 'queries' file. The files are lexicographically sorted by name, and this order must correspond to the order in the queryFile, so file numbers should have leading '0's as needed to sort properly, i.e. Q001, Q002, ..., Q099, Q100. A sample DeepExperimentRated trace is here.
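For example, zero-padded file names that sort correctly can be generated as in this small sketch (the directory name and the query_embeddings list are hypothetical):

    # Hypothetical example: write query embeddings to Q001, Q002, ..., Q100 so that
    # lexicographic order matches the order of the queries in the queryFile.
    for i, query_embedding in enumerate(query_embeddings, start=1):
        with open(f"query-embeddings/Q{i:03d}", "w") as f:
            f.write(" ".join(str(v) for v in query_embedding))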

Hybrid Retrieval

Disappointingly, my initial results (shown below) were slightly worse than the baseline VSR system, except for slightly improved precision at high recall values. Therefore, I hypothesized that combining the normal VSR approach with the deep-learning approach in a "hybrid" method might work best. A simple hybrid approach ranks retrieved documents by a weighted linear combination of the dense-vector deep cosine similarity and the normal VSR cosine similarity, i.e. λD + (1 - λ)S, where D is the deep dense cosine similarity and S is the sparse TF/IDF BOW cosine similarity.
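As a conceptual sketch of this combination (in Python for clarity; your actual implementation is the Java HybridRetriever described below, and treating documents the VSR retriever does not return as having sparse similarity 0 is an assumption of the sketch):

    # Conceptual sketch of the hybrid score lambda*D + (1 - lambda)*S, with both
    # similarities being cosine values in [0, 1].
    def hybrid_score(deep_cosine, vsr_cosine, lam):
        return lam * deep_cosine + (1.0 - lam) * vsr_cosine

    def hybrid_rank(deep_scores, vsr_scores, lam):
        """deep_scores / vsr_scores: dicts mapping doc id -> cosine similarity.
        Documents missing from one retriever's results are assumed to score 0 there."""
        doc_ids = set(deep_scores) | set(vsr_scores)
        combined = {d: hybrid_score(deep_scores.get(d, 0.0), vsr_scores.get(d, 0.0), lam)
                    for d in doc_ids}
        return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)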

Implement and evaluate such a simple hybrid approach by writing the following classes: ir.vsr.HybridRetriever, ir.eval.HybridExperiment, and ir.eval.HybridExperimentRated. The HybridRetriever should combine a DeepRetriever with a normal InvertedIndex to produce a simple weighted linear combination of their results. The evaluation code can be created fairly easily by combining code from the deep and original versions of Experiment and ExperimentRated. The main methods for the Hybrid Experiment classes should take the following args:

Command args:  [DIR] [EMBEDDIR] [QUERIES] [QUERYDIR] [LAMBDA] [OUTFILE] where:
DIR is the name of the directory whose files should be indexed.
EMBEDDIR is the name of the directory whose files contain embeddings of the documents in DIR
QUERIES is a file of queries paired with relevant docs (see queryFile).
QUERYDIR is the name of the directory where the query embeddings are stored in files Q1...Qn
LAMBDA is the weight [0,1] to put on the deep cosine similarity, with (1-LAMBDA) on the VSR cosine similarity
OUTFILE is the name of the file in which to put the output. The plot data for the recall-precision curve is
       stored in this file, and a gnuplot file for the graph has the same name with a ".gplot" extension
For your experiments with HybridRetriever, use the SPECTER2 Base embeddings and cosine similarity for the DeepRetriever (so that it uses the same general metric as InvertedIndex, constrained to be between 0 and 1).

Evaluation

Once you have generated the embeddings, you can use the ir.eval.DeepExperimentRated class to evaluate them. The commands to be run are as follows: In addition, try the '-cosine' version of DeepRetriever on the Base and Adapter models, and for the Hybrid model try alternative values of the λ hyperparameter, including 0.3, 0.5, 0.7, 0.8, and 0.9. Note that hybrid should use cosine by default, so don't specify it as an argument.

A list of commands that you should test your code on is in this file: commands.sh. You can use this blueprint to run your code and generate results.

Results

You can use these gplot files to generate the plots for your report: all-deep.gplot, all-deep.ndcg.gplot, all.gplot, all.ndcg.gplot. See commands.sh for the commands to generate the plots. Sample PR and NDCG plots for the base model and the adapter model are here:

For your final results, generate one set of basic deep-retrieval PR and NDCG results including VSR, SPECTER2-Base, SPECTER2-Adapt, SPECTER2-Base(cosine), and SPECTER2-Adapt(cosine). Generate another set of Hybrid PR and NDCG results including VSR, SPECTER2-Base(cosine), and Hybrid using SPECTER2-Base(cosine) combined with VSR for the alternative λ values (0.3, 0.5, 0.7, 0.8, 0.9).

Report

Your report should summarize and analyze the results of your experiments. Present the results in well-organized graphs (you can shrink the graphs and insert them into your report). Include at least the 4 graphs (2 PR curves and 2 NDCG graphs) for the combinations of results specified above. Also answer at least the following questions. (Answer them explicitly, i.e. in your report write "Q1. How...?" and then give the answer underneath, so that we do not have to search for your answers.):
  1. How does using the SPECTER2 embeddings compare to the VSR baseline in terms of retrieval accuracy (both PR and NDCG)?
  2. Is there a difference with and without using the adapters?
  3. Why do you think classic VSR might still be out-performing these modern deep-learning methods (designed specifically for scientific documents) on this particular scientific corpus?
  4. How does using Euclidean distance vs. cosine similarity affect the results for the two deep models?
  5. Does the hybrid solution improve over both individual methods? Why or why not?
  6. What seems to be the best value of the hyperparameter λ?
The report does not have to stay within the usual 2-page limit, since the graphs take up a lot of space.

Submission

You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 4:

Please make sure that your code compiles and runs on the UTCS lab machines (you can run the EmbeddingGenerator on your local laptops).


The submitted files on Gradescope should look like this:


We will not be using the Gradescope autograder for this project since we cannot load the deep models on their servers. We will instead test your code using autograders on the UTCS machines. The report counts for 35% of the grade for this project, so please ensure you answer all the questions and follow the rubric on Gradescope.

Grading Criteria