Project 4
CS 371R: Information Retrieval and Web Search
Evaluating Embeddings From Deep Language Models
Due: 11:59pm, November 29, 2023
This project will explore using embeddings from an LLM to support
standard document retrieval. First, you will produce document and
query embeddings from a Python-based LLM and store them in directories
of files containing precomputed dense vectors. You will then use existing
Java code that augments the existing course IR code to support retrieval
using pre-computed dense vectors and experimentally evaluate the
embeddings you produced. Finally, you will implement and test a
"hybrid" approach that combines this dense-retrieval approach with the
existing VSR system and experimentally evaluate whether it improves
the results compared to purely sparse or dense retrieval alone.
Generating Deep Embeddings
Overview
We will evaluate deep embeddings generated from a
pre-trained transformer language model on the Cystic Fibrosis (CF)
dataset introduced in Project 2. We will use
the HuggingFace
Transformers library in
the PyTorch deep learning framework
to generate the embeddings. Specifically, we will be using
the SPECTER2
transformer model which is a language model (LM) trained on scientific
paper abstracts. SPECTER2 is first trained on over 6M triplets of
scientific paper citations, after which it is trained with additional
task-specific adapter modules. You can read more about the basic SPECTER
approach in this paper.
The pre-trained LM is capable of
generating task-specific embeddings for scientific tasks when paired
with adapters. Given the combined title and abstract of a
scientific paper, or a short textual query, the model can be used to
generate effective embeddings for downstream applications.
Read through
the model
card on HuggingFace to understand the model better.
Python Environment
You can choose to install the Python packages within a conda environment on the UTCS lab machines. To install conda, follow the instructions here. Then you can set up an environment and install the packages as follows:
conda create -n irproj python=3.10
conda activate irproj
pip install 'transformers[torch]'
pip install -U adapter-transformers
Although conda is recommended, you may also install the Python packages directly without using conda.
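To check that the packages installed correctly before moving on, you can run a quick import test inside the activated environment (a minimal sketch; the exact versions printed will depend on your install):

import torch
import transformers
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())  # embeddings can also be generated on CPU, just more slowly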
Python Code
In this project, along with the SPECTER2 Base Model, we shall be using the Proximity Adapter to embed documents and the Adhoc Query Adapter to embed queries.
All necessary models and tokenizers have already been downloaded and can be found in /u/mooney/ir-code/models/specter2_2023/. Do not download the models from the HuggingFace website; load them directly from the above path.
We will compare two variants of SPECTER2 embeddings, one with just the base model and one with adapter modules attached to the base model.
Each of the two cases is described below.
- SPECTER2 Base Model - This is the SPECTER2 base model trained on the SciRepEval training tasks. Load the base model from
/u/mooney/ir-code/models/specter2_2023/base
- SPECTER2 with Adapter Modules - This is the SPECTER2 model with task-specific adapter modules attached to the base model. After loading the base model, load the adapter modules as follows:
- Queries: Load the adapter from
/u/mooney/ir-code/models/specter2_2023/adhoc_query
- Docs: Load the adapter from
/u/mooney/ir-code/models/specter2_2023/proximity
The steps to be followed to generate embeddings are as follows (a minimal code sketch appears after the list):
- Load the model and tokenizer from the appropriate paths as described above.
- Tokenize the input text using the tokenizer.
- Pass the tokenized input through the model to generate embeddings.
- Save the embeddings in the appropriate format.
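As a rough guide, here is a sketch of these steps for a single document using the adapter-transformers API installed above. The model paths are the ones given earlier; the title/abstract variables, batching, and how you split this into functions inside generate_embeddings.py are up to you, so treat the names below as illustrative assumptions rather than required code.

import torch
from transformers import AutoTokenizer, AutoAdapterModel  # AutoAdapterModel is provided by adapter-transformers

MODEL_DIR = "/u/mooney/ir-code/models/specter2_2023"

tokenizer = AutoTokenizer.from_pretrained(f"{MODEL_DIR}/base")
model = AutoAdapterModel.from_pretrained(f"{MODEL_DIR}/base")

# For the adapter variant, also attach and activate the task-specific adapter
# ("proximity" for documents, "adhoc_query" for queries), e.g.:
# model.load_adapter(f"{MODEL_DIR}/proximity", load_as="proximity", set_active=True)

# SPECTER2 expects title + sep_token + abstract for documents (queries are just the query text).
# 'title' and 'abstract' here come from parsing the CF document (not shown).
text = title + tokenizer.sep_token + abstract
inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs)
embedding = output.last_hidden_state[:, 0, :].squeeze(0)  # 768-dim [CLS] embedding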
For generating embeddings, copy the Python files in /u/mooney/ir-code/deep_embedder. Make appropriate changes to the generate_embeddings.py template file and run the script as follows:
- Base embeddings:
python test_embedder.py
- Adapter embeddings:
python test_embedder.py --use-adapter
NOTE: Do not modify test_embedder.py. This file will be run as-is by the autograder (you may, however, change the paths for models-folder, etc. while debugging your code). You can fill in the generate_embeddings.py template file and code up the functions as you see fit. **You will submit only generate_embeddings.py**.
While saving embeddings, save them as space-separated 768-dim vectors in the following format (a saving sketch appears after the list):
- Base:
- Docs:
embeddings/specter2_base/docs/RN-00001...RN-01239
- Queries:
embeddings/specter2_base/queries/Q001...Q100
- Adapter:
- Docs:
embeddings/specter2_adapter/docs/RN-00001...RN-01239
- Queries:
embeddings/specter2_adapter/queries/Q001...Q100
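For example, one way to write an embedding in this format (assuming a PyTorch tensor and a hypothetical helper name) is sketched below; the number of decimal places is not prescribed, as long as each file is a single space-separated 768-dim vector:

import os

def save_embedding(embedding, out_path):
    # Write one 768-dim embedding as a single line of space-separated values.
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w") as f:
        f.write(" ".join(str(x) for x in embedding.tolist()))

# Documents use the corpus file names; queries use zero-padded ids, e.g.:
# save_embedding(doc_vec, "embeddings/specter2_base/docs/RN-00001")
# save_embedding(query_vec, "embeddings/specter2_base/queries/Q001")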
Sample embeddings for doc1 and query1 for both the base and adapter versions are below.
You can use these to validate your implementation. You don't have to match them exactly, but the values should not differ much.
Deep Retrieval
The course code in /u/mooney/ir-code/ir/ has been augmented with a new
class ir.vsr.DeepRetriever for a simple document retriever that uses
precomputed dense document embeddings. It takes a directory of
pre-computed document embeddings that have the same file names as the
original corpus, where each file contains a simple space-separated list
of real values representing the document embedding. Another new class
used by the deep retriever is DeepDocumentReference which stores a
pointer to a file and stores its dense vector and the precomputed
vector length (L2 norm) for this vector. The 'retrieve' method returns
a list of ranked retrievals for a query represented as a
DeepDocumentReference, using either Euclidean distance or cosine
similarity (if the 'cosine' flag is used) to compare the dense vectors.
No indexing is used to improve efficiency; a query is compared to
every document in the corpus, and all of the scored results are
included in the array of Retrievals. Ideally, some form of
approximate nearest neighbor search, such as locality-sensitive
hashing, would be used; however, for the limited number of small documents
and queries in the CF corpus, a brute-force approach is tractable.
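To make the brute-force scoring concrete, here is a small Python illustration of what DeepRetriever does conceptually (the actual retriever is the provided Java class; the paths and function names here are only for illustration):

import math, os

def load_vector(path):
    with open(path) as f:
        return [float(x) for x in f.read().split()]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_path, doc_dir):
    # Compare the query embedding against every document embedding and rank by score.
    query = load_vector(query_path)
    scores = []
    for name in sorted(os.listdir(doc_dir)):
        doc = load_vector(os.path.join(doc_dir, name))
        scores.append((name, cosine(query, doc)))
    return sorted(scores, key=lambda s: s[1], reverse=True)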
'Deep' versions of the Experiment and ExperimentRated classes used in Project 2
are also provided as ir.eval.DeepExperiment and ir.eval.DeepExperimentRated,
which produce precision-recall and NDCG plots evaluating the
DeepRetriever. These use the normal 'queries' and 'queries-rated'
files used by the normal experiment code but also take a directory of
query embeddings, with one embedding file (a list of real values) per
query. The query-embedding directory should have files named
Q1...Qn giving the embeddings of the queries in the order they appear in
the original 'queries' file. The files will be lexicographically
sorted by name, and this order should correspond to the order in the
queryFile, so file numbers should have leading '0's as needed to sort
properly, i.e. Q001, Q002, ..., Q099, Q100.
A sample DeepExperimentRated trace is here.
Hybrid Retrieval
Disappointingly, my initial results (shown below) were slightly worse
than the baseline VSR system, except for slightly improved
precision at high recall values. Therefore, I hypothesized that
combining the normal VSR approach with the deep-learning approach in a
"hybrid" method might work best. A simple hybrid approach
ranks retrieved documents by a weighted linear combination of
the dense-vector deep cosine similarity and the normal VSR cosine
similarity, i.e. λ D + (1 - λ) S, where D is the deep
dense cosine similarity and S is the sparse TF/IDF BOW cosine similarity.
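In other words, if both retrievers produce a score per document, the hybrid score is just a per-document interpolation. A small illustration follows (in Python rather than the required Java classes, with hypothetical dictionaries of scores; how you actually merge the two rankings in HybridRetriever is up to you):

# deep_scores and vsr_scores map document name -> cosine similarity in [0, 1].
def hybrid_scores(deep_scores, vsr_scores, lam):
    names = set(deep_scores) | set(vsr_scores)
    # A document missing from one ranking simply contributes 0 for that component.
    return {n: lam * deep_scores.get(n, 0.0) + (1 - lam) * vsr_scores.get(n, 0.0)
            for n in names}

# Rank by combined score (deep_scores and vsr_scores are assumed precomputed).
ranked = sorted(hybrid_scores(deep_scores, vsr_scores, 0.5).items(),
                key=lambda x: x[1], reverse=True)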
Implement and evaluate such a simple hybrid approach by writing the
following classes: ir.vsr.HybridRetriever, ir.eval.HybridExperiment,
and ir.eval.HybridExperimentRated.
The HybridRetriever should combine a DeepRetriever with a normal InvertedIndex to
produce a simple weighted linear combination of the results. The evaluation code can be fairly
easily generated by properly combining code from the deep and original
versions of Experiment and ExperimentRated. The main methods for the Hybrid Experiment classes should take the
following args:
Command args: [DIR] [EMBEDDIR] [QUERIES] [QUERYDIR] [LAMBDA] [OUTFILE] where:
DIR is the name of the directory whose files should be indexed.
EMBEDDIR is the name of the directory whose files contain embeddings of the documents in DIR.
QUERIES is a file of queries paired with relevant docs (see queryFile).
QUERYDIR is the name of the directory where the query embeddings are stored in files Q1...Qn.
LAMBDA is the weight [0,1] to put on the deep cosine similarity, with (1-LAMBDA) on the VSR cosine similarity.
OUTFILE is the name of the file in which to put the output. The plot data for the recall-precision curve is
stored in this file, and a gnuplot file for the graph has the same name with a ".gplot" extension.
For your experiments with HybridRetriever use the SPECTER2 Base embeddings and cosine similarity for the DeepRetriever (so it uses the same general metric as InvertedIndex constrained to be between 0 and 1).
Evaluation
Once you generate the embeddings, you can use the ir.eval.DeepExperimentRated
class to evaluate the embeddings. The commands to be run are as follows:
- VSR Model:
java ir.eval.ExperimentRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated results/vsr
- Base Model:
java ir.eval.DeepExperimentRated embeddings/specter2_base/docs /u/mooney/ir-code/queries/cf/queries-rated embeddings/specter2_base/queries results/specter2_base
- Adapter Model:
java ir.eval.DeepExperimentRated embeddings/specter2_adapter/docs /u/mooney/ir-code/queries/cf/queries-rated embeddings/specter2_adapter/queries results/specter2_adapter
- Hybrid Model:
java ir.eval.HybridExperimentRated /u/mooney/ir-code/corpora/cf/ embeddings/specter2_base/docs /u/mooney/ir-code/queries/cf/queries-rated embeddings/specter2_base/queries 0.5 results/hybrid05
In addition, try the '-cosine' version of DeepRetriever on the Base and
Adapter models, and for the Hybrid model try alternative values of the λ hyperparameter: 0.3, 0.5, 0.7, 0.8, 0.9. Note that Hybrid should use cosine by default, so don't specify it as an argument.
A list of commands that you should test your code on are in this file: commands.sh. You can use this blueprint to run your code and generate results.
Results
You can use these gplot files to generate the plots for your report: all-deep.gplot, all-deep.ndcg.gplot, all.gplot, all.ndcg.gplot
You can check commands.sh for the commands to generate the plots.
You can check the sample plots for the base model and adapter model for PR and NDCG here:
For your final results, generate one set of basic deep retrieval PR and NDCG results including VSR, SPECTER2-Base, SPECTER2-Adapt, SPECTER2-Base(cosine), and SPECTER2-Adapt(cosine). Generate another set of Hybrid PR and NDCG results including VSR, SPECTER2-Base(cosine), and Hybrid using SPECTER2-Base(cosine) combined with VSR using alternative λ values (0.3, 0.5, 0.7, 0.8, 0.9).
Report
Your report should summarize and analyze
the results of your experiments. Present the results in well
organized graphs (you can shrink the graphs and insert them into your
report). Include at least the 4 graphs (2 PR curves and 2 NDCG graphs)
for the combinations of results specified above. Also answer at least
the following questions (You should explicitly answer these
questions, i.e. in your report put "Q1. How..?" and then give the
answer underneath so that we do not have to search for your
answers.):
- How does using the SPECTER2 embeddings compare to the VSR baseline in terms of retrieval accuracy (both PR and NDCG)?
- Is there a difference with and without using the adapters?
- Why do you think classic VSR might still be out-performing these modern deep-learning methods (designed specifically for scientific documents) on this particular scientific corpus?
- How does using Euclidean distance vs. cosine similarity affect the results using the two deep models?
- Does the hybrid solution improve over both methods? Why or why not?
- What seems to be the best value of the hyperparameter λ?
The report does not have to fit within the usual 2-page limit because the graphs take up a lot of space.
Submission
You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on
submitting projects on the course homepage. Along with that, follow these specific instructions for Project 4:
- Populate the functions in generate_embeddings.py to generate SPECTER2 embeddings for the base and adapter versions.
- Create at least the following new classes described above:
- ir.vsr.HybridRetriever
- ir.eval.HybridExperiment
- ir.eval.HybridExperimentRated
- For this assignment, you need to submit the following files:
- code/ - A folder containing all your code (generate_embeddings.py, modified *.java and *.class files). Please do not modify the original Java files; extend each class and override the appropriate methods.
- report.pdf - A PDF report of your experiments as described above, with the plots referenced in the instructions.
- results/ - A folder containing the data files used to generate your plots with the following contents:
vsr, vsr.ndcg
specter2_base, specter2_base.ndcg
specter2_adapter, specter2_adapter.ndcg
hybrid, hybrid.ndcg
Make sure that these files match the output of the code that you submit.
The code folder should have at least the following contents:
Name
---------------------------------------
generate_embeddings.py
ir/vsr/HybridRetriever.java
ir/eval/HybridExperiment.java
ir/eval/HybridExperimentRated.java
The results folder should have these contents:
Name
---------------------------------------
vsr vsr.ndcg
specter2_base specter2_base.ndcg
specter2_adapter specter2_adapter.ndcg
specter2_base_cos specter2_base_cos.ndcg
specter2_adapter_cos specter2_adapter_cos.ndcg
hybrid03 hybrid03.ndcg
hybrid05 hybrid05.ndcg
hybrid07 hybrid07.ndcg
hybrid08 hybrid08.ndcg
hybrid09 hybrid09.ndcg
Please make sure that your code compiles and runs on the UTCS lab machines (you can run the EmbeddingGenerator on your local laptops).
The submitted files on Gradescope should look like this:
We will not be using the Gradescope autograder for this project since we cannot load the deep models on their servers. We will be testing your code using autograders on the UTCS machines. The report counts for 35% of the grade for this project, so please ensure you answer all of the questions and follow the rubric on Gradescope.
Grading Criteria
- 10%: Your program compiles successfully and functions normally without throwing exceptions.
- 10%: Your implementation is efficient. For example, it shouldn't change the overall time complexity. Also, it shouldn't significantly increase the average time it takes to respond to a query.
- 35%: Working code that correctly implements embedding generation and hybrid retrieval.
- 10%: Good programming style with necessary comments, intuitive variable/function names, and appropriate indentation.
- 35%: Quality of report, clear presentation of results, good analysis & discussion, and your answers to the questions above.