
Empirical Study: LLM Behavior as Recommender Systems in the Mobile App Domain

An empirical research study investigating how Large Language Models (LLMs) behave when deployed as recommender systems in the mobile app domain. The study systematically evaluates three LLM providers (OpenAI, Google Gemini, and Mistral) to understand their recommendation patterns, consistency, and behavior when generating app rankings for specific features and categories.

📋 Study Overview

This empirical investigation examines LLM behavior in mobile app recommendation scenarios across different AI-powered categories and features. The study focuses on:

  • Multi-LLM Behavioral Analysis: Comparing recommendation patterns across OpenAI GPT-4o, Google Gemini, and Mistral models
  • Feature-Based Recommendation Studies: Analyzing how LLMs generate app recommendations for specific app features (e.g., "Photo effects", "Go Live", "Collaborate with others")
  • Category-Based Behavioral Analysis: Examining LLM behavior when evaluating apps within AI-powered categories (e.g., "AI-powered entertainment", "AI-powered productivity")
  • Consistency Measurement: Quantifying ranking consistency both within and across different LLM models
  • Behavioral Visualization: Creating comprehensive visualizations of LLM recommendation patterns and criteria

πŸ“ Project Structure

llm-recommender-system/
├── code/                          # Main source code
│   ├── llm/                      # LLM integration modules
│   │   ├── google/               # Google Gemini implementation
│   │   ├── mistral/              # Mistral AI implementation
│   │   ├── openai/               # OpenAI implementation
│   │   ├── create_assistant.py   # Abstract assistant creation
│   │   └── use_assistant.py      # Assistant usage utilities
│   ├── consistency/              # Ranking consistency analysis
│   │   ├── app_consistency.py    # App ranking consistency
│   │   ├── app_internal_consistency.py
│   │   └── ranking_criteria_consistency.py
│   ├── correlation/              # Correlation analysis tools
│   ├── data-processor/           # Data processing utilities
│   └── visualization/            # Visualization modules
│       ├── criteria_visualization.py
│       └── source_visualization.py
├── data/                         # Data directory
│   ├── input/                    # Input data and configurations
│   │   ├── prompts/              # System and user prompts
│   │   ├── schema/               # JSON schemas for responses
│   │   └── use-case/             # Categories and features data
│   ├── output/                   # Generated outputs
│   │   ├── category/             # Category-based results
│   │   ├── features/             # Feature-based results
│   │   ├── evaluation/           # Evaluation metrics
│   │   └── search/               # Search results
│   └── assistants/               # Stored assistant IDs
├── experiments-*.py              # Experiment runner scripts
└── hot-fix.py                    # Utility script

🚀 Features

Supported LLM Providers

  • OpenAI GPT-4o: Advanced reasoning and ranking capabilities
  • Google Gemini: Web search integration for real-time data
  • Mistral AI: Cost-effective alternative with strong performance

Analysis Capabilities

  • Multi-Model Comparison: Evaluate consistency across different LLMs
  • Feature-Specific Rankings: Generate recommendations for specific app features
  • Category-Based Analysis: Analyze apps within AI-powered categories
  • Consistency Metrics (illustrated in the sketch after this list):
    • Rank-Biased Overlap (RBO)
    • Jaccard Similarity
    • Internal consistency (within model)
    • External consistency (across models)
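
The actual implementations live under code/consistency/; as a rough illustration of the two similarity measures, here is a minimal Python sketch (function names are ours, not the repo's; RBO is shown in its truncated prefix form):

def jaccard(list_a, list_b):
    """Set overlap of two app lists, ignoring rank order."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

def rbo(list_a, list_b, p=0.9):
    """Rank-Biased Overlap (Webber et al., 2010), truncated at the
    shorter list's depth; p < 1 weights top ranks more heavily."""
    depth = min(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

For example, jaccard(["a", "b", "c"], ["a", "c", "b"]) is 1.0 because order is ignored, while rbo on the same pair stays below its maximum because the two rankings disagree at depth 2.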

Advanced Analytics

  • Semantic Clustering: Group similar ranking criteria using embeddings (see the sketch after this list)
  • Active Learning: Interactive threshold optimization for criteria deduplication
  • Visualization: Heatmaps, dendrograms, and comprehensive charts
  • Data Processing: Automated merging and cleaning of recommendation data
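
As an illustration of the clustering step, a minimal sketch using sentence-transformers and SciPy hierarchical clustering (the embedding model is an assumption, and the 0.72 threshold simply mirrors the --similarity-threshold flag used in the visualization commands below):

from sentence_transformers import SentenceTransformer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

criteria = ["user ratings", "app store rating", "download count"]  # toy inputs
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(criteria, normalize_embeddings=True)
# Average-linkage clustering on cosine distance; criteria whose pairwise
# similarity exceeds 0.72 (i.e., distance below 0.28) share a cluster label.
labels = fcluster(linkage(pdist(emb, metric="cosine"), method="average"),
                  t=1 - 0.72, criterion="distance")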

🎯 Study Features

The empirical study examines LLM behavior when generating recommendations for 16 specific app features:

  • Broadcast messages to multiple contacts
  • Send files
  • Watch streams
  • Go Live
  • Play playlist on shuffle mode
  • Access to podcasts
  • Build photo collage
  • Photo effects
  • Access to movies
  • Rate movies
  • Keeping up with friends
  • Play games
  • Collaborate with others
  • Write notes
  • Search for offer on item
  • List items for sale

🛠️ Experimental Setup

Prerequisites

  • Python 3.8+
  • Required API keys for LLM providers

Environment Setup

  1. Clone the repository:
git clone <repository-url>
cd llm-recommender-system
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables (a loading sketch follows this list):
# Create .env file with your API keys
OPENAI_API_KEY=your_openai_key
GOOGLE_API_KEY=your_google_key
MISTRAL_API_KEY=your_mistral_key
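
The experiment scripts are expected to read these keys from the environment; a typical loading pattern with python-dotenv (a sketch, not necessarily the repo's exact code):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the working directory
openai_key = os.getenv("OPENAI_API_KEY")
if openai_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set")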

🔬 Running Experiments

Feature-Based Behavioral Analysis (RQ1)

# Run experiments for all LLM providers
python experiments-gemini-rq1.py
python experiments-mistral-rq1.py
python experiments-openai-rq1.py

Category-Based Behavioral Analysis (RQ3)

# Run category-based experiments
python experiments-gemini-rq3.py
python experiments-mistral-rq3.py
python experiments-openai-rq3.py

Individual LLM Searches

Google Gemini

python -m code.llm.google.search_gemini_rq1 \
    --output ./data/output/features/rq1/gemini/k20_Photo_effects \
    --k 20 \
    --search "Photo effects" \
    --n 10 \
    --model "gemini-2.0-flash" \
    --system-prompt "data/input/prompts/system-prompt-output-rq1.txt"

OpenAI

python -m code.llm.openai.search_openai_rq1 \
    --output ./data/output/features/rq1/openai/k20_Photo_effects \
    --k 20 \
    --search "Photo effects" \
    --n 10 \
    --model "gpt-4o" \
    --system-prompt "data/input/prompts/system-prompt-output-rq1.txt"

Mistral

python -m code.llm.mistral.search_mistral_rq1 \
    --output ./data/output/features/rq1/mistral/k20_Photo_effects \
    --k 20 \
    --search "Photo effects" \
    --n 10 \
    --model "mistral-large-latest" \
    --system-prompt "data/input/prompts/system-prompt-output-rq1.txt"
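
To sweep a search over several study features instead of one, a hypothetical batch wrapper around the invocations above could look like this (the feature names and flags come from this README; the runner itself is ours):

import subprocess

FEATURES = ["Photo effects", "Go Live", "Collaborate with others"]  # extend to all 16

for feature in FEATURES:
    slug = feature.replace(" ", "_")
    subprocess.run([
        "python", "-m", "code.llm.google.search_gemini_rq1",
        "--output", f"./data/output/features/rq1/gemini/k20_{slug}",
        "--k", "20", "--search", feature, "--n", "10",
        "--model", "gemini-2.0-flash",
        "--system-prompt", "data/input/prompts/system-prompt-output-rq1.txt",
    ], check=True)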

Consistency Analysis

App Ranking Consistency

python -m code.consistency.app_consistency \
    --input data/output/evaluation/app_rankings.csv \
    --output data/output/evaluation/consistency

Ranking Criteria Consistency

python -m code.consistency.ranking_criteria_consistency \
    --input data/output/evaluation/app_ranking_criteria.csv \
    --output data/output/evaluation/consistency/ranking_criteria

Visualization

Criteria Visualization

python -m code.visualization.criteria_visualization \
    --input data/output/features/rq1/gemini/all_criteria.csv \
    --output data/output/features/rq1/gemini/ \
    --similarity-threshold 0.72

Source Visualization

python -m code.visualization.source_visualization \
    --input data/output/features/rq1/gemini/all_criteria.csv \
    --output data/output/features/rq1/gemini/

📊 Experimental Output Structure

Generated Data

  • JSON Responses: Individual LLM responses for each experimental trial
  • CSV Rankings: Consolidated app rankings across all trials and models (a consolidation sketch follows this list)
  • Consistency Metrics: RBO and Jaccard similarity calculations
  • Visualization Files: Heatmaps, dendrograms, and analysis charts
  • Evaluation Reports: Comprehensive analysis of LLM behavior patterns
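
The consolidation behind the CSV rankings might resemble this pandas sketch (the per-trial file layout and the "apps" field are hypothetical; the real response schemas live in data/input/schema/):

import json
import pandas as pd
from pathlib import Path

rows = []
for path in Path("data/output/features/rq1/gemini").glob("*/*.json"):
    trial = json.loads(path.read_text())
    for rank, app in enumerate(trial.get("apps", []), start=1):  # "apps" is assumed
        rows.append({"trial": path.stem, "rank": rank, "app": app})

pd.DataFrame(rows).to_csv("data/output/evaluation/app_rankings.csv", index=False)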

Data Organization

data/output/
├── features/rq1/          # Feature-based analysis results (RQ1)
│   ├── gemini/            # Google Gemini results
│   ├── mistral/           # Mistral AI results
│   └── openai/            # OpenAI results
├── category/rq3/          # Category-based analysis results (RQ3)
│   ├── gemini/            # Google Gemini results
│   ├── mistral/           # Mistral AI results
│   └── openai/            # OpenAI results
├── evaluation/            # Consistency and correlation analysis
│   ├── consistency/       # Ranking consistency metrics
│   └── correlation/       # Cross-model correlation analysis
└── search/                # Search functionality results

📈 Results and Analysis

Key Findings

  • Model Consistency: Analysis of ranking consistency within and across LLM models
  • Feature Sensitivity: How different app features affect recommendation patterns
  • Category Behavior: LLM behavior variations across AI-powered app categories
  • Ranking Criteria: Semantic analysis of ranking criteria used by different models

Visualization Examples

  • Heatmaps: Model comparison matrices showing ranking similarities (see the sketch after this list)
  • Dendrograms: Hierarchical clustering of ranking criteria
  • Consistency Charts: RBO and Jaccard similarity visualizations
  • Correlation Plots: Cross-model correlation analysis
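
For instance, a pairwise model-similarity heatmap can be drawn with seaborn along these lines (the matrix values are placeholders, not study results):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

models = ["gemini", "mistral", "openai"]
rbo_matrix = np.array([[1.00, 0.60, 0.55],   # placeholder values,
                       [0.60, 1.00, 0.50],   # not actual findings
                       [0.55, 0.50, 1.00]])
sns.heatmap(rbo_matrix, annot=True, cmap="viridis",
            xticklabels=models, yticklabels=models)
plt.title("Pairwise ranking similarity (RBO)")
plt.tight_layout()
plt.savefig("model_similarity_heatmap.png")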


📄 License

This project is licensed under the GPL version 3 - see the LICENSE file for details.

📄 Acknowledgments

  • ...
