An empirical research study that investigates how Large Language Models (LLMs) behave when deployed as recommender systems in the mobile app domain. The study systematically evaluates multiple LLM providers (OpenAI, Google Gemini, and Mistral) to understand their recommendation patterns, consistency, and behavior when generating app rankings for specific features and categories.
This empirical investigation examines LLM behavior in mobile app recommendation scenarios across different AI-powered categories and features. The study focuses on:
- Multi-LLM Behavioral Analysis: Comparing recommendation patterns across OpenAI GPT-4, Google Gemini, and Mistral models
- Feature-Based Recommendation Studies: Analyzing how LLMs generate app recommendations for specific app features (e.g., "Photo effects", "Go Live", "Collaborate with others")
- Category-Based Behavioral Analysis: Examining LLM behavior when evaluating apps within AI-powered categories (e.g., "AI-powered entertainment", "AI-powered productivity")
- Consistency Measurement: Quantifying ranking consistency both within and across different LLM models
- Behavioral Visualization: Creating comprehensive visualizations of LLM recommendation patterns and criteria
```
llm-recommender-system/
├── code/                              # Main source code
│   ├── llm/                           # LLM integration modules
│   │   ├── google/                    # Google Gemini implementation
│   │   ├── mistral/                   # Mistral AI implementation
│   │   ├── openai/                    # OpenAI implementation
│   │   ├── create_assistant.py        # Abstract assistant creation
│   │   └── use_assistant.py           # Assistant usage utilities
│   ├── consistency/                   # Ranking consistency analysis
│   │   ├── app_consistency.py         # App ranking consistency
│   │   ├── app_internal_consistency.py
│   │   └── ranking_criteria_consistency.py
│   ├── correlation/                   # Correlation analysis tools
│   ├── data-processor/                # Data processing utilities
│   └── visualization/                 # Visualization modules
│       ├── criteria_visualization.py
│       └── source_visualization.py
├── data/                              # Data directory
│   ├── input/                         # Input data and configurations
│   │   ├── prompts/                   # System and user prompts
│   │   ├── schema/                    # JSON schemas for responses
│   │   └── use-case/                  # Categories and features data
│   ├── output/                        # Generated outputs
│   │   ├── category/                  # Category-based results
│   │   ├── features/                  # Feature-based results
│   │   ├── evaluation/                # Evaluation metrics
│   │   └── search/                    # Search results
│   └── assistants/                    # Stored assistant IDs
├── experiments-*.py                   # Experiment runner scripts
└── hot-fix.py                         # Utility scripts
```
- OpenAI GPT-4: Advanced reasoning and ranking capabilities
- Google Gemini: Web search integration for real-time data
- Mistral AI: Cost-effective alternative with strong performance
- Multi-Model Comparison: Evaluate consistency across different LLMs
- Feature-Specific Rankings: Generate recommendations for specific app features
- Category-Based Analysis: Analyze apps within AI-powered categories
- Consistency Metrics (see the sketch after this list):
- Rank-Biased Overlap (RBO)
- Jaccard Similarity
- Internal consistency (within model)
- External consistency (across models)
- Semantic Clustering: Group similar ranking criteria using embeddings
- Active Learning: Interactive threshold optimization for criteria deduplication
- Visualization: Heatmaps, dendrograms, and comprehensive charts
- Data Processing: Automated merging and cleaning of recommendation data
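
The two ranking-agreement measures listed above are standard list-comparison metrics. The following is a minimal, self-contained sketch of how they can be computed for two ranked app lists; it is illustrative only and not necessarily the implementation in `code/consistency/` (function names and the example rankings are hypothetical):

```python
def jaccard_similarity(list_a, list_b):
    """Set overlap of two recommendation lists, ignoring rank order."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def rank_biased_overlap(list_a, list_b, p=0.9):
    """Extrapolated Rank-Biased Overlap (Webber et al., 2010).
    Simplified variant that assumes lists of similar depth;
    p controls how top-weighted the measure is."""
    k = max(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    overlap_sum, overlap = 0.0, 0
    for d in range(1, k + 1):
        if d <= len(list_a):
            seen_a.add(list_a[d - 1])
        if d <= len(list_b):
            seen_b.add(list_b[d - 1])
        overlap = len(seen_a & seen_b)
        overlap_sum += (overlap / d) * (p ** (d - 1))
    # Extrapolate the agreement observed at depth k to the unseen tail.
    return (1 - p) * overlap_sum + (overlap / k) * (p ** k)

ranking_gpt = ["Snapseed", "PicsArt", "VSCO"]
ranking_gemini = ["PicsArt", "Snapseed", "Canva"]
print(jaccard_similarity(ranking_gpt, ranking_gemini))   # 0.5
print(rank_biased_overlap(ranking_gpt, ranking_gemini))  # top-weighted agreement
```

Internal consistency applies these metrics to repeated trials of the same model; external consistency applies them to pairs of models.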
The empirical study examines LLM behavior when generating recommendations for 16 specific app features:
- Broadcast messages to multiple contacts
- Send files
- Watch streams
- Go Live
- Play playlist on shuffle mode
- Access to podcasts
- Build photo collage
- Photo effects
- Access to movies
- Rate movies
- Keeping up with friends
- Play games
- Collaborate with others
- Write notes
- Search for offer on item
- List items for sale
- Python 3.8+
- Required API keys for LLM providers
- Clone the repository:

```bash
git clone <repository-url>
cd llm-recommender-system
```

- Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables:
```bash
# Create .env file with your API keys
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
MISTRAL_API_KEY=your_mistral_key
```
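
The experiment scripts read these keys from the environment. Assuming they are loaded with python-dotenv (an assumption; the actual modules may read them differently), a minimal sketch looks like this:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the current working directory

keys = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    "GOOGLE_API_KEY": os.getenv("GOOGLE_API_KEY"),
    "MISTRAL_API_KEY": os.getenv("MISTRAL_API_KEY"),
}
missing = [name for name, value in keys.items() if not value]
if missing:
    raise RuntimeError(f"Missing API keys in .env: {', '.join(missing)}")
```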
```bash
# Run experiments for all LLM providers
python experiments-gemini-rq1.py
python experiments-mistral-rq1.py
python experiments-openai-rq1.py
```
```bash
# Run category-based experiments
python experiments-gemini-rq3.py
python experiments-mistral-rq3.py
python experiments-openai-rq3.py
```
```bash
python -m code.llm.google.search_gemini_rq1 \
  --output ./data/output/features/rq1/gemini/k20_Photo_effects \
  --k 20 \
  --search "Photo effects" \
  --n 10 \
  --model "gemini-2.0-flash" \
  --system-prompt "data/input/prompts/system-prompt-output-rq1.txt"
```
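
Under the hood, the Gemini module pairs a system prompt with Google Search grounding so the model can draw on live app data. A sketch of what such a call might look like with the google-genai SDK is shown below; it is illustrative only, and the actual `search_gemini_rq1` implementation may differ:

```python
import os
from google import genai          # pip install google-genai
from google.genai import types

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

with open("data/input/prompts/system-prompt-output-rq1.txt") as f:
    system_prompt = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Rank the top 20 mobile apps offering the feature 'Photo effects'.",
    config=types.GenerateContentConfig(
        system_instruction=system_prompt,
        # Google Search grounding gives the model access to live web data.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```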
```bash
python -m code.llm.openai.search_openai_rq1 \
  --output ./data/output/features/rq1/openai/k20_Photo_effects \
  --k 20 \
  --search "Photo effects" \
  --n 10 \
  --model "gpt-4o" \
  --system-prompt "data/input/prompts/system-prompt-output-rq1.txt"
```
```bash
python -m code.llm.mistral.search_mistral_rq1 \
  --output ./data/output/features/rq1/mistral/k20_Photo_effects \
  --k 20 \
  --search "Photo effects" \
  --n 10 \
  --model "mistral-large-latest" \
  --system-prompt "data/input/prompts/system-prompt-output-rq1.txt"
```
```bash
python -m code.consistency.app_consistency \
  --input data/output/evaluation/app_rankings.csv \
  --output data/output/evaluation/consistency
```
```bash
python -m code.consistency.ranking_criteria_consistency \
  --input data/output/evaluation/app_ranking_criteria.csv \
  --output data/output/evaluation/consistency/ranking_criteria
```
```bash
python -m code.visualization.criteria_visualization \
  --input data/output/features/rq1/gemini/all_criteria.csv \
  --output data/output/features/rq1/gemini/ \
  --similarity-threshold 0.72
```
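
The `--similarity-threshold` flag controls how aggressively near-duplicate ranking criteria are merged. A rough sketch of threshold-based semantic clustering with sentence embeddings follows; it assumes the sentence-transformers and scipy libraries and hypothetical example criteria, and the repository's own module may work differently:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sentence_transformers import SentenceTransformer

criteria = [
    "number of downloads",
    "download count",
    "user ratings",
    "average star rating",
]

# Embed each criterion and cluster with average linkage on cosine distance.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(criteria, normalize_embeddings=True)
tree = linkage(embeddings, method="average", metric="cosine")

# A similarity threshold of 0.72 corresponds to a cosine distance of 0.28.
labels = fcluster(tree, t=1 - 0.72, criterion="distance")
for criterion, label in zip(criteria, labels):
    print(label, criterion)
```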
```bash
python -m code.visualization.source_visualization \
  --input data/output/features/rq1/gemini/all_criteria.csv \
  --output data/output/features/rq1/gemini/
```

- JSON Responses: Individual LLM responses for each experimental trial
- CSV Rankings: Consolidated app rankings across all trials and models
- Consistency Metrics: RBO and Jaccard similarity calculations
- Visualization Files: Heatmaps, dendrograms, and analysis charts
- Evaluation Reports: Comprehensive analysis of LLM behavior patterns
```
data/output/
├── features/rq1/          # Feature-based analysis results
│   ├── gemini/            # Google Gemini results
│   ├── mistral/           # Mistral AI results
│   └── openai/            # OpenAI results
├── category/rq1/          # Category-based analysis results
│   ├── gemini/            # Google Gemini results
│   ├── mistral/           # Mistral AI results
│   └── openai/            # OpenAI results
├── evaluation/            # Consistency and correlation analysis
│   ├── consistency/       # Ranking consistency metrics
│   └── correlation/       # Cross-model correlation analysis
└── search/                # Search functionality results
```
- Model Consistency: Analysis of ranking consistency within and across LLM models
- Feature Sensitivity: How different app features affect recommendation patterns
- Category Behavior: LLM behavior variations across AI-powered app categories
- Ranking Criteria: Semantic analysis of ranking criteria used by different models
- Heatmaps: Model comparison matrices showing ranking similarities
- Dendrograms: Hierarchical clustering of ranking criteria
- Consistency Charts: RBO and Jaccard similarity visualizations
- Correlation Plots: Cross-model correlation analysis
- ...
- ....
This project is licensed under the GPL version 3 - see the LICENSE file for details.
- ...