This repository implements the retrieval pipeline from the paper *STAR: A Simple Training-free Approach for Recommendations using Large Language Models*. It demonstrates how a training-free recommendation system can be built from:
- LLM embeddings for semantic similarity
- User interaction patterns for collaborative signals
- Temporal decay for recent history weighting
The embeddings are the foundation of semantic similarity:

```python
class ItemEmbeddingGenerator:
    def create_embedding_input(self, item_data: Dict) -> TextEmbeddingInput:
        # Creates rich text prompts including:
        # - Full item description
        # - Title
        # - Category hierarchy
        # - Brand (if not ASIN-like)
        # - Price and sales rank
```
Key implementation details:

- Uses Vertex AI's `text-embedding-005` model (768 dimensions)
- Excludes IDs/URLs to avoid trivial matching
- Preserves complete metadata structure
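For illustration, here is a minimal sketch of how such a prompt could be assembled from Amazon metadata fields (the helper name `build_prompt_text` is hypothetical, and the exact formatting used by `ItemEmbeddingGenerator` may differ):

```python
import re
from typing import Dict

def build_prompt_text(item_data: Dict) -> str:
    """Assemble a plain-text embedding prompt from item metadata (illustrative sketch)."""
    parts = []
    if item_data.get("description"):
        parts.append(f"Description: {item_data['description']}")
    if item_data.get("title"):
        parts.append(f"Title: {item_data['title']}")
    if item_data.get("categories"):
        # 'categories' is a list of hierarchy paths, e.g. [["Beauty", "Skin Care", "Face"]]
        parts.append("Category: " + " > ".join(item_data["categories"][0]))
    brand = item_data.get("brand", "")
    if brand and not re.fullmatch(r"[A-Z0-9]{10}", brand):  # skip ASIN-like brand strings
        parts.append(f"Brand: {brand}")
    if item_data.get("price") is not None:
        parts.append(f"Price: {item_data['price']}")
    if item_data.get("salesRank"):
        parts.append(f"Sales rank: {item_data['salesRank']}")
    return "\n".join(parts)
```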
The core scoring logic combines three components:
- Semantic Matrix (R_s):

  ```python
  from scipy.spatial.distance import cdist
  import numpy as np

  # Compute cosine similarities between normalized embeddings
  semantic_matrix = 1 - cdist(embeddings_array, embeddings_array, metric='cosine')
  np.fill_diagonal(semantic_matrix, 0)  # Zero out self-similarities
  ```
- Collaborative Matrix (R_c) (`collaborative_relationships.py`):

  ```python
  # Normalize each user's interactions by the square root of their total activity
  user_activity = np.sum(interaction_matrix, axis=0)
  normalized = interaction_matrix / np.sqrt(user_activity)
  collaborative_matrix = normalized @ normalized.T
  ```

  An illustrative construction of `interaction_matrix` is sketched after this list.
- Scoring Formula:

  ```python
  # Score one candidate item (cand_idx) against the user's n most recent items
  n = len(user_history)
  score = 0.0
  for t, (hist_item, rating) in enumerate(zip(reversed(user_history), reversed(ratings))):
      hist_idx = item_to_idx[hist_item]  # map item ID to matrix index
      sem_sim = semantic_matrix[cand_idx, hist_idx]
      collab_sim = collaborative_matrix[cand_idx, hist_idx]
      combined_sim = semantic_weight * sem_sim + (1 - semantic_weight) * collab_sim
      score += (1 / n) * rating * (temporal_decay ** t) * combined_sim
  ```
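A hedged sketch of wrapping that scoring loop to rank a candidate pool for one user (the helper `score_candidate` and the variables `candidate_items` and `item_to_idx` are illustrative, not the repository's API):

```python
def score_candidate(cand_idx, user_history, ratings, item_to_idx,
                    semantic_matrix, collaborative_matrix,
                    semantic_weight=0.5, temporal_decay=0.7):
    """Apply the scoring formula above to a single candidate item."""
    n = len(user_history)
    score = 0.0
    for t, (hist_item, rating) in enumerate(zip(reversed(user_history), reversed(ratings))):
        hist_idx = item_to_idx[hist_item]
        sem_sim = semantic_matrix[cand_idx, hist_idx]
        collab_sim = collaborative_matrix[cand_idx, hist_idx]
        combined = semantic_weight * sem_sim + (1 - semantic_weight) * collab_sim
        score += (1 / n) * rating * (temporal_decay ** t) * combined
    return score

# Rank a pool of candidate items and keep the top 10
scores = {cand: score_candidate(item_to_idx[cand], user_history, ratings, item_to_idx,
                                semantic_matrix, collaborative_matrix)
          for cand in candidate_items}
top_10 = sorted(scores, key=scores.get, reverse=True)[:10]
```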
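For context, here is an illustrative construction of the `interaction_matrix` used by the collaborative matrix above (the names `user_histories` and `item_to_idx` are assumptions; the actual code lives in `collaborative_relationships.py`):

```python
import numpy as np

def build_interaction_matrix(user_histories: dict, item_to_idx: dict) -> np.ndarray:
    """Binary item-by-user matrix: entry (i, u) = 1 if user u interacted with item i."""
    user_to_idx = {user: j for j, user in enumerate(user_histories)}
    matrix = np.zeros((len(item_to_idx), len(user_histories)), dtype=np.float32)
    for user, items in user_histories.items():
        for item in items:
            if item in item_to_idx:
                matrix[item_to_idx[item], user_to_idx[user]] = 1.0
    return matrix
```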
The code strictly maintains temporal order (`temporal_utils.py`):
- Sorts reviews by timestamp
- Handles duplicate timestamps
- Ensures test items are truly last in sequence
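A minimal sketch of that ordering (illustrative; `temporal_utils.py` may break timestamp ties differently, here falling back to the item ID for a deterministic order):

```python
def sort_reviews_chronologically(reviews: list) -> list:
    """Sort one user's reviews by Unix timestamp, breaking ties by item ID."""
    return sorted(reviews, key=lambda r: (r["unixReviewTime"], r["asin"]))
```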
The evaluation (`evaluation_metrics.py`) matches the paper's setup:
- Leave-last-out evaluation
- Uses the full dataset for evaluation
- Metrics: Hits@5/10, NDCG@5/10
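Under leave-last-out with a single held-out positive, both metrics reduce to simple functions of the positive item's 1-based rank among the scored candidates (a sketch; the function names are illustrative):

```python
import math

def hit_at_k(rank: int, k: int) -> float:
    """1 if the held-out item's rank is within the top k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """With a single relevant item, ideal DCG is 1, so NDCG@k = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```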
- Download the Stanford SNAP 5-core Amazon datasets using `download_data.py`. For example:

  ```bash
  poetry run python download_data.py --category beauty
  ```

  This downloads `reviews_Beauty_5.json.gz` and `meta_Beauty.json.gz` into the `data/` folder.

- Check the data with `check_data.py`:

  ```bash
  poetry run python check_data.py
  ```

  This prints the first few lines and verifies that the JSON parses.

  Note: Files such as `reviews_Beauty_5.json.gz` are already 5-core datasets. The code still enforces ≥5 interactions, but typically no users or items are removed since the data is already filtered.
- Install Python dependencies via Poetry:

  ```bash
  poetry install
  ```

- Run the main pipeline:

  ```bash
  poetry run python src/main.py
  ```

  This:

  - Loads reviews and metadata,
  - Sorts each user's reviews by timestamp (fixing potential out-of-order entries),
  - Creates or loads item embeddings,
  - Computes the semantic and collaborative matrices,
  - Splits data into train/val/test in a leave-last-out manner (sketched after this list),
  - Runs evaluation with 99 negative samples for each user's test item,
  - Prints final Hits@K and NDCG@K metrics.
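The leave-last-out split mentioned above amounts to the following per-user slicing (a sketch assuming each user's items are already in chronological order; the handling of very short histories is an assumption, and `main.py` may organize this differently):

```python
def leave_last_out_split(item_sequence: list):
    """Per user: last item -> test, second-to-last -> validation, the rest -> train."""
    if len(item_sequence) < 3:
        return item_sequence, None, None  # too short to hold out both val and test items
    return item_sequence[:-2], item_sequence[-2], item_sequence[-1]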
- Data Quality Matters
  - Use `DataQualityChecker` to verify metadata richness
  - Check for duplicate timestamps
  - Verify chronological ordering
- Embedding Generation
  - Include all relevant metadata for rich embeddings
  - Avoid ID/URL information that could make matching trivial
  - Use consistent field ordering in prompts
- Matrix Computation
  - Normalize embeddings before computing similarity
  - Normalize by user activity for the collaborative matrix
  - Zero out diagonal elements (self-similarity)
- Common Issues
  - Future-item leakage in negative sampling
  - Timestamp ordering issues
  - Inadequate metadata in prompts
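To avoid the first issue, negatives should be drawn only from items the user never interacted with at any point in their history, e.g. (a sketch; the sampling in `evaluation_metrics.py` may differ in detail):

```python
import random

def sample_negatives(all_items, user_items, n_negatives=99, seed=42):
    """Sample negatives from items the user has never interacted with (past or future)."""
    seen = set(user_items)
    pool = [item for item in all_items if item not in seen]
    return random.Random(seed).sample(pool, n_negatives)
```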
```python
# Retrieval parameters (star_retrieval.py)
semantic_weight = 0.5  # Weight between semantic and collaborative similarity
temporal_decay = 0.7   # Decay factor for older history items
history_length = 3     # Number of most recent items to use

# Evaluation parameters (evaluation_metrics.py)
k_values = [5, 10]     # Top-k cutoffs for metrics
```
The code provides detailed statistics:
Semantic Matrix Statistics:
- mean_sim: Average semantic similarity
- sparsity: Fraction of zero elements
- min/max_sim: Similarity range
Collaborative Matrix Statistics:
- mean_nonzero: Average co-occurrence strength
- sparsity: Interaction density
These help diagnose if the embeddings or collaborative signals are working as expected.
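These summaries can be recomputed directly from the matrices, roughly as follows (an illustrative helper, not the repository's exact reporting code):

```python
import numpy as np

def matrix_stats(matrix: np.ndarray) -> dict:
    """Summary statistics for sanity-checking a similarity matrix."""
    nonzero = matrix[matrix != 0]
    return {
        "mean_sim": float(matrix.mean()),
        "mean_nonzero": float(nonzero.mean()) if nonzero.size else 0.0,
        "sparsity": float((matrix == 0).mean()),
        "min_sim": float(matrix.min()),
        "max_sim": float(matrix.max()),
    }
```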
Final Results:
Results for the Beauty dataset:

| Metric  | Score  |
|---------|--------|
| hit@10  | 0.0923 |
| hit@5   | 0.0632 |
| ndcg@10 | 0.0521 |
| ndcg@5  | 0.0428 |
See beauty_results.md for the results on the Beauty dataset.
See Application Data Specification for how to prepare your own data.
```bibtex
@article{lee2024star,
  title={STAR: A Simple Training-free Approach for Recommendations using Large Language Models},
  author={Lee, Dong-Ho and Kraft, Adam and Jin, Long and Mehta, Nikhil and Xu, Taibai and Hong, Lichan and Chi, Ed H. and Yi, Xinyang},
  journal={arXiv preprint arXiv:2410.16458},
  year={2024}
}
```