Project 4
CS 371R: Information Retrieval and Web Search
Evaluating Embeddings From Deep Language Models


Due: 11:59pm, November 29, 2023

This project explores using embeddings from an LLM to support standard document retrieval. First, you will produce document and query embeddings from a Python-based LLM and store them in directories of files containing the precomputed dense vectors. You will then use existing Java code, which augments the course IR code to support retrieval over precomputed dense vectors, to experimentally evaluate the embeddings you produced. Finally, you will implement and test a "hybrid" approach that combines this dense-retrieval approach with the existing VSR system and experimentally evaluate whether it improves on purely sparse or purely dense retrieval.

Generating Deep Embeddings

Overview

We will evaluate deep embeddings generated by a pre-trained transformer language model on the Cystic Fibrosis (CF) dataset introduced in Project 2. We will use the HuggingFace Transformers library with the PyTorch deep learning framework to generate the embeddings. Specifically, we will be using the SPECTER2 transformer model, a language model (LM) trained on scientific paper abstracts. SPECTER2 is first trained on over 6M triplets of scientific paper citations and is then trained with additional task-specific adapter modules. You can read more about the basic SPECTER approach in this paper. When paired with adapters, the pre-trained LM can generate task-specific embeddings for scientific tasks. Given the combined title and abstract of a scientific paper, or a short textual query, the model can generate effective embeddings for downstream applications. Read through the model card on HuggingFace to understand the model better.

Python Environment

You can install the Python packages within a conda environment on the UTCS lab machines. To install conda, follow the instructions here. Then set up an environment and install the packages as follows:
    conda create -n irproj python=3.10
    conda activate irproj
    pip install 'transformers[torch]'
    pip install -U adapter-transformers
Although conda is recommended, you may also install the Python packages directly without using conda.

Python Code

In this project, along with the SPECTER2 Base Model, we will use the Proximity Adapter to embed documents and the Adhoc Query Adapter to embed queries. All necessary models and tokenizers have already been downloaded and can be found in /u/mooney/ir-code/models/specter2_2023/. Do not download the models from the HuggingFace website; load them directly from the above path. We will compare two variants of SPECTER2 embeddings: one with just the base model and one with adapter modules attached to the base model. Each of the two cases is described below. The steps to generate embeddings are as follows (a minimal sketch appears after the list):
  1. Load the model and tokenizer from the appropriate paths as described above.
  2. Tokenize the input text using the tokenizer.
  3. Pass the tokenized input through the model to generate embeddings.
  4. Save the embeddings in the appropriate format.
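For concreteness, here is a minimal sketch of these steps, assuming the adapter-transformers fork installed above (where AutoAdapterModel can be imported from transformers). The model and adapter subdirectory names under /u/mooney/ir-code/models/specter2_2023/ are only illustrative; substitute the actual paths you find there:

    # Minimal sketch (not the required template code); paths below are illustrative.
    import torch
    from transformers import AutoTokenizer, AutoAdapterModel  # adapter-transformers fork

    MODEL_DIR = "/u/mooney/ir-code/models/specter2_2023/base"          # hypothetical subdirectory
    ADAPTER_DIR = "/u/mooney/ir-code/models/specter2_2023/proximity"   # hypothetical subdirectory

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoAdapterModel.from_pretrained(MODEL_DIR)

    # For the adapter variant, attach and activate the proximity (or adhoc query) adapter;
    # for the base variant, skip this step.
    model.load_adapter(ADAPTER_DIR, load_as="proximity", set_active=True)

    # SPECTER-style input: title and abstract joined by the tokenizer's separator token.
    text = "Some paper title" + tokenizer.sep_token + "Some paper abstract ..."
    inputs = tokenizer(text, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)

    # The [CLS] token's final hidden state is the 768-dim embedding.
    embedding = output.last_hidden_state[:, 0, :].squeeze(0)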
To generate embeddings, copy the Python files in /u/mooney/ir-code/deep_embedder. Make appropriate changes to the generate_embeddings.py template file and run the script as follows: NOTE: Do not modify test_embedder.py. This file will be run as is by the autograder (you may, however, change paths such as models-folder while debugging your code). You can fill in the generate_embeddings.py template file and implement the functions as you see fit. **You will submit only generate_embeddings.py**.

When saving embeddings, save them as space-separated 768-dimensional vectors in the following format:
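For example, a minimal sketch of writing one such embedding (the tensor produced in the sketch above) as a single line of space-separated values; the output directory and file name are only illustrative, and each file name should match the corresponding corpus document's file name:

    # Hypothetical example: write a 768-dim tensor as one line of space-separated reals.
    with open("embeddings-base/00001", "w") as f:
        f.write(" ".join(str(v) for v in embedding.tolist()))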

Sample embeddings for doc1 and query1 for both base and adapter versions are below:

You can use these to validate your implementation. You don't have to match them exactly, but the values shouldn't be too different.

Deep Retrieval

The course code in /u/mooney/ir-code/ir/ has been augmented with a new class, ir.vsr.DeepRetriever, a simple document retriever that uses precomputed dense document embeddings. It takes a directory of precomputed document embeddings whose files have the same names as those in the original corpus, where each file contains a simple space-separated list of real values representing the document embedding. Another new class used by the deep retriever is DeepDocumentReference, which stores a pointer to a file along with its dense vector and that vector's precomputed length (L2 norm). The 'retrieve' method returns a list of ranked retrievals for a query represented as a DeepDocumentReference, comparing the dense vectors using either Euclidean distance or cosine similarity (if the 'cosine' flag is used). No indexing is used to improve efficiency; a query is compared to every document in the corpus, and all of the scored results are included in the array of Retrievals. Ideally, some form of approximate nearest-neighbor search, such as locality-sensitive hashing, would be used; however, for the limited number of small documents and queries in the CF corpus, a brute-force approach is tractable.
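Conceptually (sketched here in Python for clarity, not the actual Java implementation), the brute-force ranking amounts to the following:

    # Conceptual sketch of brute-force dense retrieval; vectors are lists of floats.
    import math

    def score(query_vec, doc_vec, cosine=True):
        if cosine:
            dot = sum(q * d for q, d in zip(query_vec, doc_vec))
            return dot / (math.sqrt(sum(q * q for q in query_vec)) *
                          math.sqrt(sum(d * d for d in doc_vec)))
        # Negate Euclidean distance so that a larger score is always better.
        return -math.sqrt(sum((q - d) ** 2 for q, d in zip(query_vec, doc_vec)))

    def retrieve(query_vec, doc_vecs, cosine=True):
        """Compare the query to every document; return (doc_id, score) pairs, best first."""
        scored = [(doc_id, score(query_vec, vec, cosine)) for doc_id, vec in doc_vecs.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)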

'Deep' versions of the Experiment and ExperimentRated classes used in Project 2 are also provided as ir.eval.DeepExperiment and ir.eval.DeepExperimentRated, which produce precision-recall and NDCG plots evaluating the DeepRetriever. These use the normal 'queries' and 'queries-rated' files used by the normal experiment code but also take a directory of query embeddings, with one embedding file (a list of real values) per query. The query embedding directory should contain files named Q1,...,Qn giving the embeddings of the queries in the order they appear in the original 'queries' file. The files are lexicographically sorted by name, and this order must correspond to the order in the queryFile, so file numbers should have leading '0's as needed to sort properly, i.e. Q001, Q002, ..., Q099, Q100. A sample DeepExperimentRated trace is here.
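For example, zero-padded file names that sort correctly can be generated as in this small sketch (the directory name and the query_embeddings list are hypothetical):

    # Hypothetical example: write query embeddings to Q001, Q002, ..., Q100 so that
    # lexicographic order matches the order of the queries in the queryFile.
    for i, query_embedding in enumerate(query_embeddings, start=1):
        with open(f"query-embeddings/Q{i:03d}", "w") as f:
            f.write(" ".join(str(v) for v in query_embedding))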

Hybrid Retrieval

Disappointingly, my initial results (shown below) were slightly worse than the baseline VSR system, except for slightly improved precision at high recall values. Therefore, I hypothesized that combining the normal VSR approach with the deep-learning approach in a "hybrid" method might work best. A simple hybrid approach ranks retrieved documents by a weighted linear combination of the dense-vector deep cosine similarity and the normal VSR cosine similarity, i.e. λD + (1 - λ)S, where D is the deep dense cosine similarity and S is the sparse TF/IDF BOW cosine similarity.
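As a conceptual sketch of this combination (in Python for clarity; your actual implementation is the Java HybridRetriever described below, and treating documents the VSR retriever does not return as having sparse similarity 0 is an assumption of the sketch):

    # Conceptual sketch of the hybrid score lambda*D + (1 - lambda)*S, with both
    # similarities being cosine values in [0, 1].
    def hybrid_score(deep_cosine, vsr_cosine, lam):
        return lam * deep_cosine + (1.0 - lam) * vsr_cosine

    def hybrid_rank(deep_scores, vsr_scores, lam):
        """deep_scores / vsr_scores: dicts mapping doc id -> cosine similarity.
        Documents missing from one retriever's results are assumed to score 0 there."""
        doc_ids = set(deep_scores) | set(vsr_scores)
        combined = {d: hybrid_score(deep_scores.get(d, 0.0), vsr_scores.get(d, 0.0), lam)
                    for d in doc_ids}
        return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)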

Implement and evaluate such a simple hybrid approach by writing the following classes: ir.vsr.HybridRetriever, ir.eval.HybridExperiment, and ir.eval.HybridExperimentRated. The HybridRetriever should combine a DeepRetriever with a normal InvertedIndex to produce a simple weighted linear combination of their results. The evaluation code can be created fairly easily by combining code from the deep and original versions of Experiment and ExperimentRated. The main methods for the Hybrid Experiment classes should take the following args:

Command args:  [DIR] [EMBEDDIR] [QUERIES] [QUERYDIR] [LAMBDA] [OUTFILE] where:
DIR is the name of the directory whose files should be indexed.
EMBEDDIR is the name of the directory whose files contain embeddings of the documents in DIR
QUERIES is a file of queries paired with relevant docs (see queryFile).
QUERYDIR is the name of the directory where the query embeddings are stored in files Q1...Qn
LAMBDA is the weight [0,1] to put on the deep cosine similarity, with (1-LAMBDA) on the VSR cosine similarity
OUTFILE is the name of the file in which to put the output. The plot data for the recall-precision curve is
       stored in this file, and a gnuplot file for the graph has the same name with a ".gplot" extension
For your experiments with HybridRetriever, use the SPECTER2 Base embeddings and cosine similarity for the DeepRetriever (so that it uses the same general metric as InvertedIndex, constrained to be between 0 and 1).

Evaluation

Once you have generated the embeddings, you can use the ir.eval.DeepExperimentRated class to evaluate them. The commands to be run are as follows: In addition, try the '-cosine' version of DeepRetriever on the Base and Adapter models, and for the Hybrid model try alternative values of the λ hyperparameter, including 0.3, 0.5, 0.7, 0.8, and 0.9. Note that hybrid should use cosine by default, so don't specify it as an argument.

A list of commands that you should test your code on is in this file: commands.sh. You can use this blueprint to run your code and generate results.

Results

You can use these gplot files to generate the plots for your report: all-deep.gplot, all-deep.ndcg.gplot, all.gplot, all.ndcg.gplot. See commands.sh for the commands to generate the plots. Sample PR and NDCG plots for the base model and the adapter model are here:

For your final results, generate one set of basic deep-retrieval PR and NDCG results including VSR, SPECTER2-Base, SPECTER2-Adapt, SPECTER2-Base(cosine), and SPECTER2-Adapt(cosine). Generate another set of Hybrid PR and NDCG results including VSR, SPECTER2-Base(cosine), and Hybrid using SPECTER2-Base(cosine) combined with VSR for the alternative λ values (0.3, 0.5, 0.7, 0.8, 0.9).

Report

Your report should summarize and analyze the results of your experiments. Present the results in well-organized graphs (you can shrink the graphs and insert them into your report). Include at least the 4 graphs (2 PR curves and 2 NDCG graphs) for the combinations of results specified above. Also answer at least the following questions. (Answer them explicitly, i.e. in your report write "Q1. How...?" and then give the answer underneath, so that we do not have to search for your answers.):
  1. How does using the SPECTER2 embeddings compare to the VSR baseline in terms of retrieval accuracy (both PR and NDCG)?
  2. Is there a difference with and without using the adapters?
  3. Why do you think classic VSR might still be out-performing these modern deep-learning methods (designed specifically for scientific documents) on this particular scientific corpus?
  4. How does using Euclidean distance vs. cosine similarity affect the results for the two deep models?
  5. Does the hybrid solution improve over both individual methods? Why or why not?
  6. What seems to be the best value of the hyperparameter λ?
The report does not have to stay within the usual 2-page limit, since the graphs take up a lot of space.

Submission

You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 4:

Please make sure that your code compiles and runs on the UTCS lab machines (you can run the EmbeddingGenerator on your local laptops).


The submitted files on Gradescope should look like this:


We will not be using the Gradescope autograder for this project since we cannot load the deep models on their servers. We will instead test your code using autograders on the UTCS machines. The report counts for 35% of the grade for this project, so please ensure you answer all the questions and follow the rubric on Gradescope.

Grading Criteria