diff --git a/CALIBRATION_VISUALIZATION.md b/CALIBRATION_VISUALIZATION.md
new file mode 100644
index 0000000..4647b12
--- /dev/null
+++ b/CALIBRATION_VISUALIZATION.md
@@ -0,0 +1,342 @@
+# Calibration Curve Visualization for Fraud Scores
+
+## Overview
+
+This module provides comprehensive calibration analysis tools for fraud detection models in the AstroML framework. Calibration curves help assess whether predicted fraud probabilities accurately reflect the true likelihood of fraudulent behavior.
+
+---
+
+## 🎯 **Why Calibration Matters**
+
+### **Business Impact**
+- **Risk Assessment**: Accurate probabilities enable better risk-based decisions
+- **Regulatory Compliance**: Many regulations require well-calibrated risk scores
+- **Operational Efficiency**: Proper calibration reduces false positives/negatives
+- **Model Trust**: Stakeholders need reliable probability estimates
+
+### **Technical Benefits**
+- **Threshold Optimization**: Well-calibrated scores enable optimal threshold selection
+- **Ensemble Methods**: Calibration improves model combination strategies
+- **Cost-Sensitive Learning**: Accurate probabilities are essential for cost-sensitive applications
+
+---
+
+## 📊 **Key Metrics**
+
+### **Primary Calibration Metrics**
+
+| Metric | Range | Interpretation | Good Target |
+|--------|-------|----------------|-------------|
+| **Brier Score** | 0-1 | Overall accuracy + calibration | < 0.25 |
+| **Log Loss** | 0-∞ | Probabilistic accuracy | < 0.5 |
+| **Expected Calibration Error (ECE)** | 0-1 | Average calibration error | < 0.05 |
+
+### **Confidence Metrics**
+
+| Metric | Range | Interpretation |
+|--------|-------|----------------|
+| **Overconfidence** | 0-1 | Model too certain (predictions too extreme) |
+| **Underconfidence** | 0-1 | Model too uncertain (predictions too conservative) |
+| **Sharpness** | 0-0.25 | Variance of the predictions (higher = more decisive) |
+
+---
+
+## 🛠️ **Usage Examples**
+
+### **Basic Calibration Analysis**
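Before reaching for the analyzer, it can help to see exactly what the ECE from the table above computes. Below is a minimal, self-contained NumPy sketch; the `expected_calibration_error` helper is illustrative only and is not part of the module:

```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Weighted mean of |observed positive rate - mean predicted probability| per bin."""
    # Assign each prediction to a uniform-width bin; clamp so that
    # p == 1.0 falls into the last bin instead of out of range.
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.20, 0.80, 0.30, 0.70, 0.90, 0.15, 0.65])
ece = expected_calibration_error(y_true, y_prob)
```

A toy sample like this typically shows a sizeable ECE simply because each bin holds very few points, which is one reason the best practices section asks for large per-bin sample counts.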
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+from astroml.validation.calibration import CalibrationAnalyzer
+
+# Initialize analyzer
+analyzer = CalibrationAnalyzer(n_bins=10, strategy='uniform')
+
+# Compute calibration curve
+fraction_pos, mean_pred = analyzer.compute_calibration_curve(y_true, y_prob)
+
+# Generate comprehensive visualization
+fig = analyzer.plot_calibration_curve(y_true, y_prob, "Fraud Detection Model")
+plt.show()
+
+# Generate detailed report
+report = analyzer.generate_calibration_report(y_true, y_prob, "Fraud Detection Model")
+print(report)
+```
+
+### **Multi-Model Comparison**
+
+```python
+# Compare multiple fraud detection models
+models_data = {
+    'Baseline Model': (y_true1, y_prob1),
+    'Advanced Model': (y_true2, y_prob2),
+    'Ensemble Model': (y_true3, y_prob3)
+}
+
+fig = analyzer.plot_multiple_models(models_data)
+plt.show()
+```
+
+### **Calibration Improvement**
+
+```python
+# Apply temperature scaling: divide the logit by T, then map back
+# through the sigmoid (T > 1 softens overconfident predictions)
+temperature = 1.5
+y_prob_calibrated = 1 / (1 + np.exp(-np.log(y_prob / (1 - y_prob)) / temperature))
+
+# Compare before and after
+models_data = {
+    'Before Calibration': (y_true, y_prob),
+    'After Calibration': (y_true, y_prob_calibrated)
+}
+
+fig = analyzer.plot_multiple_models(models_data)
+plt.show()
+```
+
+---
+
+## 📈 **Visualization Components**
+
+### **1. Main Calibration Curve**
+- **X-axis**: Mean predicted probability per bin
+- **Y-axis**: Actual fraud rate per bin
+- **Perfect Calibration**: Diagonal line (y = x)
+- **Model Performance**: Deviation from diagonal
+
+### **2. Prediction Distribution**
+- **Green histogram**: Legitimate transactions
+- **Red histogram**: Fraudulent transactions
+- **Overlap**: Model discrimination ability
+
+### **3. Reliability Diagram**
+- **Bars**: Calibration per bin
+- **Color intensity**: Sample count per bin
+- **Reference line**: Perfect calibration
+
+### **4. Metrics Summary**
+- Comprehensive calibration metrics
+- Sample statistics
+- Interpretation guidelines
+
+---
+
+## 🔍 **Interpretation Guide**
+
+### **Calibration Curve Patterns**
+
+| Pattern | Interpretation | Action |
+|---------|----------------|--------|
+| **Close to diagonal** | Well-calibrated | No action needed |
+| **Above diagonal** | Underconfident | Apply temperature scaling (T < 1) |
+| **Below diagonal** | Overconfident | Apply temperature scaling (T > 1) or Platt scaling |
+| **S-shaped curve** | Systematic bias | Consider isotonic regression |
+
+### **Common Issues & Solutions**
+
+#### **Overconfidence**
+```python
+# Symptoms: high overconfidence metric, curve below diagonal
+# Solution: temperature scaling with T > 1 (scaled logit through the sigmoid)
+temperature = 1.5
+y_prob_calibrated = 1 / (1 + np.exp(-np.log(y_prob / (1 - y_prob)) / temperature))
+```
+
+#### **Underconfidence**
+```python
+# Symptoms: high underconfidence metric, curve above diagonal
+# Solution: temperature scaling with T < 1
+temperature = 0.7
+y_prob_calibrated = 1 / (1 + np.exp(-np.log(y_prob / (1 - y_prob)) / temperature))
+```
+
+#### **Non-monotonic Calibration**
+```python
+# Symptoms: complex calibration curve shape
+# Solution: isotonic regression
+from sklearn.isotonic import IsotonicRegression
+ir = IsotonicRegression(out_of_bounds='clip')
+y_prob_calibrated = ir.fit_transform(y_prob, y_true)
+```
+
+---
+
+## 📋 **Best Practices**
+
+### **Data Requirements**
+- **Minimum samples**: 1000+ per calibration bin
+- **Fraud representation**: At least 50 fraud cases per bin
+- **Time consistency**: Validate across different time periods
+
+### **Model Development**
+1. **Split data**: Train/validation/test with temporal split
+2. **Calibrate on validation**: Fit calibration methods on the validation set
+3. **Test on holdout**: Evaluate calibration on unseen test data
+4. **Monitor over time**: Track calibration drift in production
+
+### **Production Monitoring**
+```python
+# Regular calibration checks on labeled recent data
+def monitor_calibration(model, X_recent, y_recent, threshold_ece=0.05):
+    y_prob = model.predict_proba(X_recent)[:, 1]
+    ece = analyzer.compute_calibration_metrics(y_recent, y_prob)['ece']
+
+    if ece > threshold_ece:
+        logger.warning(f"Calibration degradation detected: ECE = {ece:.3f}")
+        return False
+    return True
+```
+
+---
+
+## 🧪 **Advanced Features**
+
+### **Adaptive Binning**
+```python
+# Use quantile-based binning for imbalanced datasets
+analyzer = CalibrationAnalyzer(n_bins=10, strategy='quantile')
+```
+
+### **Sample Weighting**
+```python
+# Account for different transaction values
+sample_weights = transaction_amounts / np.mean(transaction_amounts)
+fraction_pos, mean_pred = analyzer.compute_calibration_curve(
+    y_true, y_prob, sample_weight=sample_weights
+)
+```
+
+### **Confidence Intervals**
+```python
+# plot_calibration_curve shades per-bin standard-error bands automatically
+fig = analyzer.plot_calibration_curve(y_true, y_prob, "Fraud Detection Model")
+```
+
+---
+
+## 📊 **Example Output**
+
+### **Sample Calibration Report**
+
+```
+# Calibration Report for Advanced Fraud Model
+
+## Summary Statistics
+- Total Samples: 50,000
+- Fraud Rate: 0.082 (8.2%)
+- Mean Prediction: 0.095
+- Prediction Range: [0.001, 0.998]
+
+## Calibration Metrics
+
+### Primary Metrics
+- **Brier Score**: 0.0843 (Excellent)
+- **Log Loss**: 0.2341 (Good)
+
+### Calibration Error Metrics
+- **Expected Calibration Error (ECE)**: 0.0234 (Good)
+- **Maximum Calibration Error (MCE)**: 0.0891 (Fair)
+- **Adaptive Calibration Error (ACE)**: 0.0312 (Good)
+
+### Confidence Analysis
+- **Overconfidence**: 0.0123 (Low)
+- **Underconfidence**: 0.0000 (None)
+- **Sharpness**: 0.0876 (Good)
+
+## Recommendations
+- Model calibration is good with minor overconfidence
+- Consider slight temperature scaling for optimal performance
+- Monitor ECE in production, 
retrain if > 0.05
+```
+
+---
+
+## 🔧 **Integration with AstroML**
+
+### **Model Training Pipeline**
+```python
+from astroml.validation.calibration import CalibrationAnalyzer
+
+class FraudModelPipeline:
+    def __init__(self):
+        self.calibration_analyzer = CalibrationAnalyzer()
+
+    def train_and_calibrate(self, X_train, y_train, X_val, y_val):
+        # Train base model
+        self.model = self.train_base_model(X_train, y_train)
+
+        # Get validation predictions
+        y_val_prob = self.model.predict_proba(X_val)[:, 1]
+
+        # Check calibration
+        ece = self.calibration_analyzer.compute_calibration_metrics(
+            y_val, y_val_prob
+        )['ece']
+
+        # Apply calibration if needed
+        if ece > 0.05:
+            self.calibrator = self.fit_calibrator(y_val_prob, y_val)
+        else:
+            self.calibrator = None
+
+    def predict_proba(self, X):
+        y_prob = self.model.predict_proba(X)[:, 1]
+        if self.calibrator:
+            y_prob = self.calibrator.transform(y_prob)
+        return y_prob
+```
+
+### **Production Monitoring**
+```python
+class ProductionMonitor:
+    def __init__(self, model, calibration_threshold=0.05):
+        self.model = model
+        self.analyzer = CalibrationAnalyzer()
+        self.threshold = calibration_threshold
+
+    def check_model_health(self, recent_data):
+        y_true, y_prob = self.model.predict_with_labels(recent_data)
+        ece = self.analyzer.compute_calibration_metrics(y_true, y_prob)['ece']
+
+        return {
+            'calibration_healthy': ece < self.threshold,
+            'current_ece': ece,
+            'recommendation': 'Retrain' if ece >= self.threshold else 'Monitor'
+        }
+```
+
+---
+
+## 📚 **Technical References**
+
+1. **Guo, C., et al.** "On Calibration of Modern Neural Networks"
+2. **Naeini, M. P., et al.** "Obtaining Well Calibrated Probabilities Using Bayesian Binning"
+3. **Kull, M., et al.** "Beyond Temperature Scaling: Obtaining Well-Calibrated Multi-Class Probabilities"
+
+---
+
+## 🚀 **Getting Started**
+
+### **Installation**
+```python
+# The calibration module ships with AstroML's validation package
+from astroml.validation import calibration
+```
+
+### **Quick Start**
+```bash
+# Run the complete example suite
+python examples/calibration_example.py
+```
+
+### **Custom Analysis**
+```python
+# Create your own calibration analysis
+from astroml.validation.calibration import CalibrationAnalyzer
+
+analyzer = CalibrationAnalyzer()
+fig = analyzer.plot_calibration_curve(y_true, y_prob, "My Model")
+plt.show()
+```
+
+This calibration visualization system provides comprehensive tools for ensuring your fraud detection models produce reliable, trustworthy probability estimates that align with real-world outcomes.
diff --git a/astroml/validation/__init__.py b/astroml/validation/__init__.py
index 761fff4..1231e80 100644
--- a/astroml/validation/__init__.py
+++ b/astroml/validation/__init__.py
@@ -8,11 +8,13 @@ from . import integrity
 from . import validator
 
-# Try to import leakage (may fail if numpy is not installed)
+# Try to import leakage and calibration (may fail if numpy is not installed)
 try:
     from . import leakage
+    from . import calibration
 
     __all__ = [
         "leakage",
+        "calibration",
         "dedupe",
         "hashing",
         "validator",
diff --git a/astroml/validation/calibration.py b/astroml/validation/calibration.py
new file mode 100644
index 0000000..bfb8660
--- /dev/null
+++ b/astroml/validation/calibration.py
@@ -0,0 +1,570 @@
+"""Calibration curve visualization and analysis for fraud scores.
+
+This module provides tools to assess the reliability of fraud detection
+models by comparing predicted probabilities against observed outcomes.
+"""
+from __future__ import annotations
+
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+from typing import Dict, Optional, Tuple
+from sklearn.calibration import calibration_curve
+from sklearn.metrics import brier_score_loss, log_loss
+import warnings
+
+# Set style for better visualizations
+plt.style.use('seaborn-v0_8')
+sns.set_palette("husl")
+
+
+class CalibrationAnalyzer:
+    """Comprehensive calibration analysis for fraud detection models."""
+
+    def __init__(self, n_bins: int = 10, strategy: str = 'uniform'):
+        """
+        Initialize calibration analyzer.
+
+        Args:
+            n_bins: Number of bins for calibration curves
+            strategy: Binning strategy ('uniform', 'quantile')
+        """
+        self.n_bins = n_bins
+        self.strategy = strategy
+        self.calibration_data = {}
+        self.metrics = {}
+
+    def compute_calibration_curve(self,
+                                  y_true: np.ndarray,
+                                  y_prob: np.ndarray,
+                                  sample_weight: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
+        """
+        Compute calibration curve for fraud scores.
+
+        Args:
+            y_true: True binary labels (0 = legitimate, 1 = fraudulent)
+            y_prob: Predicted probabilities of fraud
+            sample_weight: Optional sample weights
+
+        Returns:
+            Tuple of (fraction_of_positives, mean_predicted_probability)
+        """
+        if len(y_true) != len(y_prob):
+            raise ValueError("y_true and y_prob must have the same length")
+
+        if not np.all((y_prob >= 0) & (y_prob <= 1)):
+            raise ValueError("y_prob must be between 0 and 1")
+
+        # NOTE: sklearn's calibration_curve has no sample_weight parameter,
+        # so the binned curve itself is always computed unweighted; weights
+        # are still honored by the metric computations.
+        if sample_weight is not None:
+            warnings.warn("calibration_curve ignores sample_weight; "
+                          "the curve is computed unweighted")
+        fraction_of_positives, mean_predicted_probability = calibration_curve(
+            y_true, y_prob, n_bins=self.n_bins, strategy=self.strategy
+        )
+
+        # Store calibration data
+        self.calibration_data = {
+            'fraction_of_positives': fraction_of_positives,
+            'mean_predicted_probability': mean_predicted_probability,
+            'n_bins': len(fraction_of_positives)
+        }
+
+        return fraction_of_positives, mean_predicted_probability
+
+    def compute_calibration_metrics(self,
+                                    y_true: np.ndarray,
+                                    y_prob: np.ndarray,
+                                    sample_weight: Optional[np.ndarray] = None) -> Dict[str, float]:
+        """
+        Compute comprehensive calibration metrics.
+ + Args: + y_true: True binary labels + y_prob: Predicted probabilities + sample_weight: Optional sample weights + + Returns: + Dictionary of calibration metrics + """ + metrics = {} + + # Brier Score (lower is better) + metrics['brier_score'] = brier_score_loss( + y_true, y_prob, sample_weight=sample_weight, pos_label=1 + ) + + # Log Loss (lower is better) + try: + metrics['log_loss'] = log_loss( + y_true, y_prob, sample_weight=sample_weight, normalize=True + ) + except ValueError: + # Handle edge cases where y_prob contains 0 or 1 + y_prob_clipped = np.clip(y_prob, 1e-15, 1 - 1e-15) + metrics['log_loss'] = log_loss( + y_true, y_prob_clipped, sample_weight=sample_weight, normalize=True + ) + + # Expected Calibration Error (ECE) + metrics['ece'] = self._compute_ece(y_true, y_prob, sample_weight) + + # Maximum Calibration Error (MCE) + metrics['mce'] = self._compute_mce(y_true, y_prob, sample_weight) + + # Adaptive Calibration Error (ACE) + metrics['ace'] = self._compute_ace(y_true, y_prob, sample_weight) + + # Overconfidence and Underconfidence + metrics['overconfidence'] = self._compute_overconfidence(y_true, y_prob) + metrics['underconfidence'] = self._compute_underconfidence(y_true, y_prob) + + # Sharpness (average prediction variance) + metrics['sharpness'] = np.var(y_prob) + + self.metrics = metrics + return metrics + + def _compute_ece(self, y_true: np.ndarray, y_prob: np.ndarray, + sample_weight: Optional[np.ndarray] = None) -> float: + """Compute Expected Calibration Error.""" + if 'fraction_of_positives' not in self.calibration_data: + self.compute_calibration_curve(y_true, y_prob, sample_weight) + + fraction_pos = self.calibration_data['fraction_of_positives'] + mean_pred = self.calibration_data['mean_predicted_probability'] + + # Calculate bin weights + if sample_weight is not None: + # Weighted bin counts + bin_counts = [] + for i in range(len(mean_pred)): + mask = self._get_bin_mask(y_prob, i) + bin_counts.append(np.sum(sample_weight[mask])) + else: + 
# Unweighted bin counts + bin_counts = [len(y_prob[self._get_bin_mask(y_prob, i)]) + for i in range(len(mean_pred))] + + bin_weights = np.array(bin_counts) / np.sum(bin_counts) + + # ECE = sum(|fraction_pos - mean_pred| * bin_weight) + ece = np.sum(np.abs(fraction_pos - mean_pred) * bin_weights) + return ece + + def _compute_mce(self, y_true: np.ndarray, y_prob: np.ndarray, + sample_weight: Optional[np.ndarray] = None) -> float: + """Compute Maximum Calibration Error.""" + if 'fraction_of_positives' not in self.calibration_data: + self.compute_calibration_curve(y_true, y_prob, sample_weight) + + fraction_pos = self.calibration_data['fraction_of_positives'] + mean_pred = self.calibration_data['mean_predicted_probability'] + + mce = np.max(np.abs(fraction_pos - mean_pred)) + return mce + + def _compute_ace(self, y_true: np.ndarray, y_prob: np.ndarray, + sample_weight: Optional[np.ndarray] = None) -> float: + """Compute Adaptive Calibration Error.""" + # Use quantile-based bins for ACE + quantiles = np.quantile(y_prob, np.linspace(0, 1, self.n_bins + 1)) + + ace = 0.0 + total_samples = len(y_true) + + for i in range(self.n_bins): + mask = (y_prob >= quantiles[i]) & (y_prob < quantiles[i + 1]) + if np.sum(mask) > 0: + bin_pred_prob = np.mean(y_prob[mask]) + bin_true_rate = np.mean(y_true[mask]) + bin_weight = np.sum(mask) / total_samples + + ace += np.abs(bin_pred_prob - bin_true_rate) * bin_weight + + return ace + + def _compute_overconfidence(self, y_true: np.ndarray, y_prob: np.ndarray) -> float: + """Compute overconfidence metric.""" + # Overconfidence = mean(max(p, 1-p)) - accuracy + confidence = np.maximum(y_prob, 1 - y_prob) + accuracy = np.mean((y_prob > 0.5) == y_true) + overconfidence = np.mean(confidence) - accuracy + return max(0, overconfidence) + + def _compute_underconfidence(self, y_true: np.ndarray, y_prob: np.ndarray) -> float: + """Compute underconfidence metric.""" + # Underconfidence = accuracy - mean(max(p, 1-p)) + confidence = np.maximum(y_prob, 
1 - y_prob)
+        accuracy = np.mean((y_prob > 0.5) == y_true)
+        underconfidence = accuracy - np.mean(confidence)
+        return max(0, underconfidence)
+
+    def _get_bin_mask(self, y_prob: np.ndarray, bin_idx: int) -> np.ndarray:
+        """Get mask for samples in a specific bin (the last bin includes 1.0)."""
+        if self.strategy == 'uniform':
+            edges = np.linspace(0, 1, self.n_bins + 1)
+        else:  # quantile
+            edges = np.quantile(y_prob, np.linspace(0, 1, self.n_bins + 1))
+        if bin_idx == self.n_bins - 1:
+            # Close the final bin on the right so y_prob == 1.0 is not dropped
+            return (y_prob >= edges[bin_idx]) & (y_prob <= edges[bin_idx + 1])
+        return (y_prob >= edges[bin_idx]) & (y_prob < edges[bin_idx + 1])
+
+    def plot_calibration_curve(self,
+                               y_true: np.ndarray,
+                               y_prob: np.ndarray,
+                               model_name: str = "Model",
+                               figsize: Tuple[int, int] = (12, 8),
+                               save_path: Optional[str] = None) -> plt.Figure:
+        """
+        Create comprehensive calibration curve visualization.
+
+        Args:
+            y_true: True binary labels
+            y_prob: Predicted probabilities
+            model_name: Name of the model for labeling
+            figsize: Figure size
+            save_path: Optional path to save the plot
+
+        Returns:
+            Matplotlib figure object
+        """
+        # Compute calibration data
+        fraction_pos, mean_pred = self.compute_calibration_curve(y_true, y_prob)
+        metrics = self.compute_calibration_metrics(y_true, y_prob)
+
+        # Create figure with subplots
+        fig, axes = plt.subplots(2, 2, figsize=figsize)
+        fig.suptitle(f'Calibration Analysis for {model_name}', fontsize=16, fontweight='bold')
+
+        # 1. 
Main calibration curve + ax1 = axes[0, 0] + ax1.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated') + ax1.plot(mean_pred, fraction_pos, 's-', label=model_name, linewidth=2, markersize=6) + + # Add confidence intervals + n_samples_per_bin = [] + for i in range(len(mean_pred)): + mask = self._get_bin_mask(y_prob, i) + n_samples_per_bin.append(np.sum(mask)) + + # Simple confidence intervals based on bin counts + stderr = np.sqrt(fraction_pos * (1 - fraction_pos) / np.array(n_samples_per_bin)) + ax1.fill_between(mean_pred, fraction_pos - stderr, fraction_pos + stderr, + alpha=0.2, color='blue') + + ax1.set_xlabel('Mean Predicted Probability') + ax1.set_ylabel('Fraction of Positives') + ax1.set_title('Calibration Curve') + ax1.legend() + ax1.grid(True, alpha=0.3) + + # 2. Histogram of predictions + ax2 = axes[0, 1] + ax2.hist(y_prob[y_true == 0], bins=20, alpha=0.6, label='Legitimate', color='green') + ax2.hist(y_prob[y_true == 1], bins=20, alpha=0.6, label='Fraudulent', color='red') + ax2.set_xlabel('Predicted Fraud Probability') + ax2.set_ylabel('Count') + ax2.set_title('Prediction Distribution') + ax2.legend() + ax2.grid(True, alpha=0.3) + + # 3. 
Reliability diagram with bin details + ax3 = axes[1, 0] + bin_counts = [np.sum(self._get_bin_mask(y_prob, i)) for i in range(len(mean_pred))] + + # Color bins by sample count + colors = plt.cm.YlOrRd(np.array(bin_counts) / max(bin_counts)) + bars = ax3.bar(mean_pred, fraction_pos, width=1.0/self.n_bins, + alpha=0.7, color=colors, edgecolor='black') + + ax3.plot([0, 1], [0, 1], 'k:', label='Perfect calibration') + ax3.set_xlabel('Mean Predicted Probability') + ax3.set_ylabel('Observed Fraud Rate') + ax3.set_title('Reliability Diagram (colored by sample count)') + ax3.legend() + ax3.grid(True, alpha=0.3) + + # Add colorbar for sample counts + sm = plt.cm.ScalarMappable(cmap=plt.cm.YlOrRd, + norm=plt.Normalize(vmin=0, vmax=max(bin_counts))) + sm.set_array([]) + plt.colorbar(sm, ax=ax3, label='Sample Count per Bin') + + # 4. Metrics summary + ax4 = axes[1, 1] + ax4.axis('off') + + # Create metrics text + metrics_text = f""" + Calibration Metrics: + + Brier Score: {metrics['brier_score']:.4f} + Log Loss: {metrics['log_loss']:.4f} + Expected Calibration Error (ECE): {metrics['ece']:.4f} + Maximum Calibration Error (MCE): {metrics['mce']:.4f} + Adaptive Calibration Error (ACE): {metrics['ace']:.4f} + + Confidence Metrics: + Overconfidence: {metrics['overconfidence']:.4f} + Underconfidence: {metrics['underconfidence']:.4f} + Sharpness: {metrics['sharpness']:.4f} + + Sample Statistics: + Total Samples: {len(y_true):,} + Fraud Rate: {np.mean(y_true):.3f} + Mean Prediction: {np.mean(y_prob):.3f} + """ + + ax4.text(0.1, 0.9, metrics_text, transform=ax4.transAxes, + fontsize=10, verticalalignment='top', fontfamily='monospace', + bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8)) + + plt.tight_layout() + + if save_path: + plt.savefig(save_path, dpi=300, bbox_inches='tight') + + return fig + + def plot_multiple_models(self, + models_data: Dict[str, Tuple[np.ndarray, np.ndarray]], + figsize: Tuple[int, int] = (15, 10), + save_path: Optional[str] = None) -> plt.Figure: 
+ """ + Compare calibration curves for multiple models. + + Args: + models_data: Dictionary of model_name -> (y_true, y_prob) + figsize: Figure size + save_path: Optional path to save the plot + + Returns: + Matplotlib figure object + """ + fig, axes = plt.subplots(2, 2, figsize=figsize) + fig.suptitle('Multi-Model Calibration Comparison', fontsize=16, fontweight='bold') + + colors = plt.cm.tab10(np.linspace(0, 1, len(models_data))) + + # 1. Calibration curves comparison + ax1 = axes[0, 0] + ax1.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated', linewidth=2) + + for i, (model_name, (y_true, y_prob)) in enumerate(models_data.items()): + fraction_pos, mean_pred = self.compute_calibration_curve(y_true, y_prob) + ax1.plot(mean_pred, fraction_pos, 'o-', label=model_name, + color=colors[i], linewidth=2, markersize=6) + + ax1.set_xlabel('Mean Predicted Probability') + ax1.set_ylabel('Fraction of Positives') + ax1.set_title('Calibration Curves Comparison') + ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left') + ax1.grid(True, alpha=0.3) + + # 2. Metrics comparison + ax2 = axes[0, 1] + model_names = [] + brier_scores = [] + ece_scores = [] + + for model_name, (y_true, y_prob) in models_data.items(): + model_names.append(model_name) + metrics = self.compute_calibration_metrics(y_true, y_prob) + brier_scores.append(metrics['brier_score']) + ece_scores.append(metrics['ece']) + + x = np.arange(len(model_names)) + width = 0.35 + + ax2.bar(x - width/2, brier_scores, width, label='Brier Score', alpha=0.7) + ax2.bar(x + width/2, ece_scores, width, label='ECE', alpha=0.7) + + ax2.set_xlabel('Models') + ax2.set_ylabel('Score') + ax2.set_title('Calibration Metrics Comparison') + ax2.set_xticks(x) + ax2.set_xticklabels(model_names, rotation=45, ha='right') + ax2.legend() + ax2.grid(True, alpha=0.3) + + # 3. 
Prediction distributions comparison + ax3 = axes[1, 0] + + for i, (model_name, (y_true, y_prob)) in enumerate(models_data.items()): + ax3.hist(y_prob, bins=20, alpha=0.5, label=model_name, + color=colors[i], density=True) + + ax3.set_xlabel('Predicted Fraud Probability') + ax3.set_ylabel('Density') + ax3.set_title('Prediction Distributions') + ax3.legend() + ax3.grid(True, alpha=0.3) + + # 4. Metrics table + ax4 = axes[1, 1] + ax4.axis('off') + + # Create detailed metrics table + table_data = [] + headers = ['Model', 'Brier', 'Log Loss', 'ECE', 'MCE', 'Overconf.', 'Sharpness'] + + for model_name, (y_true, y_prob) in models_data.items(): + metrics = self.compute_calibration_metrics(y_true, y_prob) + row = [ + model_name, + f"{metrics['brier_score']:.3f}", + f"{metrics['log_loss']:.3f}", + f"{metrics['ece']:.3f}", + f"{metrics['mce']:.3f}", + f"{metrics['overconfidence']:.3f}", + f"{metrics['sharpness']:.3f}" + ] + table_data.append(row) + + table = ax4.table(cellText=table_data, colLabels=headers, + cellLoc='center', loc='center', bbox=[0, 0, 1, 1]) + table.auto_set_font_size(False) + table.set_fontsize(9) + table.scale(1, 1.5) + + # Style the table + for i in range(len(headers)): + table[(0, i)].set_facecolor('#4CAF50') + table[(0, i)].set_text_props(weight='bold', color='white') + + plt.tight_layout() + + if save_path: + plt.savefig(save_path, dpi=300, bbox_inches='tight') + + return fig + + def generate_calibration_report(self, + y_true: np.ndarray, + y_prob: np.ndarray, + model_name: str = "Model", + save_path: Optional[str] = None) -> str: + """ + Generate a comprehensive calibration report. 
+ + Args: + y_true: True binary labels + y_prob: Predicted probabilities + model_name: Name of the model + save_path: Optional path to save the report + + Returns: + Formatted report string + """ + metrics = self.compute_calibration_metrics(y_true, y_prob) + + report = f""" +# Calibration Report for {model_name} + +## Summary Statistics +- Total Samples: {len(y_true):,} +- Fraud Rate: {np.mean(y_true):.3f} ({np.mean(y_true)*100:.1f}%) +- Mean Prediction: {np.mean(y_prob):.3f} +- Prediction Range: [{np.min(y_prob):.3f}, {np.max(y_prob):.3f}] + +## Calibration Metrics + +### Primary Metrics +- **Brier Score**: {metrics['brier_score']:.4f} {'(Lower is better)' if metrics['brier_score'] < 0.25 else '(Needs improvement)'} +- **Log Loss**: {metrics['log_loss']:.4f} {'(Lower is better)' if metrics['log_loss'] < 0.5 else '(Needs improvement)'} + +### Calibration Error Metrics +- **Expected Calibration Error (ECE)**: {metrics['ece']:.4f} +- **Maximum Calibration Error (MCE)**: {metrics['mce']:.4f} +- **Adaptive Calibration Error (ACE)**: {metrics['ace']:.4f} + +### Confidence Analysis +- **Overconfidence**: {metrics['overconfidence']:.4f} +- **Underconfidence**: {metrics['underconfidence']:.4f} +- **Sharpness**: {metrics['sharpness']:.4f} + +## Interpretation + +### Brier Score +- Excellent: < 0.1 +- Good: 0.1 - 0.2 +- Fair: 0.2 - 0.35 +- Poor: > 0.35 + +### Expected Calibration Error (ECE) +- Excellent: < 0.01 +- Good: 0.01 - 0.05 +- Fair: 0.05 - 0.1 +- Poor: > 0.1 + +### Recommendations +""" + + # Add recommendations based on metrics + if metrics['brier_score'] > 0.25: + report += "- Consider improving model discrimination or probability calibration\n" + + if metrics['ece'] > 0.05: + report += "- Apply calibration methods (Platt scaling, isotonic regression)\n" + + if metrics['overconfidence'] > 0.1: + report += "- Model is overconfident, consider temperature scaling\n" + + if metrics['underconfidence'] > 0.1: + report += "- Model is underconfident, may need more training 
or feature engineering\n" + + if metrics['sharpness'] < 0.05: + report += "- Model predictions are not very confident, consider threshold tuning\n" + + if save_path: + with open(save_path, 'w') as f: + f.write(report) + + return report + + +def create_sample_fraud_data(n_samples: int = 10000, fraud_rate: float = 0.1) -> Tuple[np.ndarray, np.ndarray]: + """ + Create sample fraud detection data for testing calibration. + + Args: + n_samples: Number of samples to generate + fraud_rate: Base fraud rate + + Returns: + Tuple of (y_true, y_prob) + """ + np.random.seed(42) + + # Generate true labels + y_true = np.random.choice([0, 1], size=n_samples, p=[1-fraud_rate, fraud_rate]) + + # Generate predicted probabilities with various calibration issues + y_prob = np.zeros(n_samples) + + # Well-calibrated predictions for legitimate transactions + legit_mask = y_true == 0 + y_prob[legit_mask] = np.random.beta(2, 10, size=np.sum(legit_mask)) + + # Slightly overconfident predictions for fraudulent transactions + fraud_mask = y_true == 1 + y_prob[fraud_mask] = np.random.beta(8, 2, size=np.sum(fraud_mask)) + + # Add some noise and ensure valid probability range + y_prob = np.clip(y_prob + np.random.normal(0, 0.05, n_samples), 0.01, 0.99) + + return y_true, y_prob + + +if __name__ == "__main__": + # Example usage + y_true, y_prob = create_sample_fraud_data() + + analyzer = CalibrationAnalyzer(n_bins=10) + + # Generate calibration curve plot + fig = analyzer.plot_calibration_curve(y_true, y_prob, "Sample Fraud Model") + plt.show() + + # Generate report + report = analyzer.generate_calibration_report(y_true, y_prob, "Sample Fraud Model") + print(report) diff --git a/astroml/validation/tests/test_calibration.py b/astroml/validation/tests/test_calibration.py new file mode 100644 index 0000000..a53598d --- /dev/null +++ b/astroml/validation/tests/test_calibration.py @@ -0,0 +1,323 @@ +"""Tests for calibration curve visualization and analysis.""" +from __future__ import annotations + 
+import numpy as np +import pytest +from unittest.mock import patch, MagicMock +import matplotlib.pyplot as plt + +from astroml.validation.calibration import ( + CalibrationAnalyzer, + create_sample_fraud_data +) + + +class TestCalibrationAnalyzer: + """Test suite for CalibrationAnalyzer class.""" + + @pytest.fixture + def analyzer(self): + """Create a calibration analyzer instance.""" + return CalibrationAnalyzer(n_bins=10, strategy='uniform') + + @pytest.fixture + def sample_data(self): + """Create sample fraud detection data.""" + return create_sample_fraud_data(n_samples=1000, fraud_rate=0.2) + + @pytest.fixture + def perfect_data(self): + """Create perfectly calibrated data.""" + np.random.seed(42) + n_samples = 1000 + y_true = np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]) + y_prob = y_true + np.random.normal(0, 0.1, n_samples) + y_prob = np.clip(y_prob, 0.01, 0.99) + return y_true, y_prob + + def test_initialization(self): + """Test analyzer initialization.""" + analyzer = CalibrationAnalyzer(n_bins=15, strategy='quantile') + assert analyzer.n_bins == 15 + assert analyzer.strategy == 'quantile' + assert analyzer.calibration_data == {} + assert analyzer.metrics == {} + + def test_compute_calibration_curve_basic(self, analyzer, sample_data): + """Test basic calibration curve computation.""" + y_true, y_prob = sample_data + + fraction_pos, mean_pred = analyzer.compute_calibration_curve(y_true, y_prob) + + assert len(fraction_pos) == len(mean_pred) + assert len(fraction_pos) <= analyzer.n_bins + assert all(0 <= fp <= 1 for fp in fraction_pos) + assert all(0 <= mp <= 1 for mp in mean_pred) + + # Check calibration data is stored + assert 'fraction_of_positives' in analyzer.calibration_data + assert 'mean_predicted_probability' in analyzer.calibration_data + + def test_compute_calibration_curve_length_mismatch(self, analyzer): + """Test error handling for mismatched input lengths.""" + y_true = np.array([0, 1, 0]) + y_prob = np.array([0.1, 0.8]) # Different 
length + + with pytest.raises(ValueError, match="y_true and y_prob must have the same length"): + analyzer.compute_calibration_curve(y_true, y_prob) + + def test_compute_calibration_curve_invalid_probabilities(self, analyzer): + """Test error handling for invalid probabilities.""" + y_true = np.array([0, 1, 0]) + y_prob = np.array([0.1, 1.5, -0.1]) # Invalid probabilities + + with pytest.raises(ValueError, match="y_prob must be between 0 and 1"): + analyzer.compute_calibration_curve(y_true, y_prob) + + def test_compute_calibration_metrics(self, analyzer, sample_data): + """Test calibration metrics computation.""" + y_true, y_prob = sample_data + + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + + expected_metrics = [ + 'brier_score', 'log_loss', 'ece', 'mce', 'ace', + 'overconfidence', 'underconfidence', 'sharpness' + ] + + for metric in expected_metrics: + assert metric in metrics + assert isinstance(metrics[metric], (int, float)) + assert not np.isnan(metrics[metric]) + + # Check metric ranges + assert metrics['brier_score'] >= 0 + assert metrics['log_loss'] >= 0 + assert metrics['ece'] >= 0 + assert metrics['mce'] >= 0 + assert metrics['ace'] >= 0 + assert metrics['sharpness'] >= 0 + + def test_compute_ece(self, analyzer, sample_data): + """Test Expected Calibration Error computation.""" + y_true, y_prob = sample_data + + # First compute calibration curve + analyzer.compute_calibration_curve(y_true, y_prob) + + # Then compute ECE + ece = analyzer._compute_ece(y_true, y_prob) + + assert isinstance(ece, (int, float)) + assert 0 <= ece <= 1 + + def test_compute_mce(self, analyzer, sample_data): + """Test Maximum Calibration Error computation.""" + y_true, y_prob = sample_data + + analyzer.compute_calibration_curve(y_true, y_prob) + mce = analyzer._compute_mce(y_true, y_prob) + + assert isinstance(mce, (int, float)) + assert 0 <= mce <= 1 + + def test_compute_ace(self, analyzer, sample_data): + """Test Adaptive Calibration Error computation.""" + 
y_true, y_prob = sample_data + + ace = analyzer._compute_ace(y_true, y_prob) + + assert isinstance(ace, (int, float)) + assert 0 <= ace <= 1 + + def test_compute_confidence_metrics(self, analyzer, sample_data): + """Test overconfidence and underconfidence metrics.""" + y_true, y_prob = sample_data + + overconf = analyzer._compute_overconfidence(y_true, y_prob) + underconf = analyzer._compute_underconfidence(y_true, y_prob) + + assert overconf >= 0 + assert underconf >= 0 + # Either overconfidence or underconfidence should be zero (or both) + assert overconf == 0 or underconf == 0 + + def test_plot_calibration_curve(self, analyzer, sample_data): + """Test calibration curve plotting.""" + y_true, y_prob = sample_data + + with patch('matplotlib.pyplot.show'): + fig = analyzer.plot_calibration_curve(y_true, y_prob, "Test Model") + + assert isinstance(fig, plt.Figure) + assert len(fig.axes) == 4 # 2x2 subplot layout + + # Check that metrics were computed + assert analyzer.metrics != {} + + def test_plot_multiple_models(self, analyzer): + """Test multi-model calibration comparison.""" + # Create data for multiple models + models_data = { + 'Model A': create_sample_fraud_data(500, 0.1), + 'Model B': create_sample_fraud_data(500, 0.15), + 'Model C': create_sample_fraud_data(500, 0.2) + } + + with patch('matplotlib.pyplot.show'): + fig = analyzer.plot_multiple_models(models_data) + + assert isinstance(fig, plt.Figure) + assert len(fig.axes) == 4 + + def test_generate_calibration_report(self, analyzer, sample_data): + """Test calibration report generation.""" + y_true, y_prob = sample_data + + report = analyzer.generate_calibration_report(y_true, y_prob, "Test Model") + + assert isinstance(report, str) + assert "Test Model" in report + assert "Calibration Metrics" in report + assert "Brier Score" in report + assert "Expected Calibration Error" in report + assert "Recommendations" in report + + def test_bin_mask_uniform(self, analyzer): + """Test bin mask generation for 
uniform strategy.""" + y_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]) + + # Test first bin (0-0.1) + mask = analyzer._get_bin_mask(y_prob, 0) + expected = y_prob < 0.1 + np.testing.assert_array_equal(mask, expected) + + # Test last bin (0.9-1.0) + mask = analyzer._get_bin_mask(y_prob, 9) + expected = y_prob >= 0.9 + np.testing.assert_array_equal(mask, expected) + + def test_bin_mask_quantile(self): + """Test bin mask generation for quantile strategy.""" + analyzer = CalibrationAnalyzer(n_bins=5, strategy='quantile') + y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]) + + # For quantile strategy, bins should have roughly equal samples + mask = analyzer._get_bin_mask(y_prob, 0) + assert np.sum(mask) == 2 # First two samples in first quantile + + def test_perfect_calibration(self, analyzer, perfect_data): + """Test metrics on perfectly calibrated data.""" + y_true, y_prob = perfect_data + + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + + # Perfect calibration should have low errors + assert metrics['brier_score'] < 0.2 + assert metrics['ece'] < 0.1 + assert metrics['mce'] < 0.2 + + def test_edge_cases(self, analyzer): + """Test edge cases and boundary conditions.""" + # All same prediction + y_true = np.array([0, 1, 0, 1]) + y_prob = np.array([0.5, 0.5, 0.5, 0.5]) + + # Should not raise errors + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + assert all(not np.isnan(v) for v in metrics.values()) + + # All legitimate + y_true = np.array([0, 0, 0, 0]) + y_prob = np.array([0.1, 0.2, 0.3, 0.4]) + + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + assert all(not np.isnan(v) for v in metrics.values()) + + # All fraudulent + y_true = np.array([1, 1, 1, 1]) + y_prob = np.array([0.6, 0.7, 0.8, 0.9]) + + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + assert all(not np.isnan(v) for v in metrics.values()) + + +class TestCreateSampleFraudData: + """Test suite 
for sample data generation.""" + + def test_basic_generation(self): + """Test basic sample data generation.""" + y_true, y_prob = create_sample_fraud_data(n_samples=100, fraud_rate=0.2) + + assert len(y_true) == len(y_prob) == 100 + assert all(y in [0, 1] for y in y_true) + assert all(0 <= p <= 1 for p in y_prob) + assert abs(np.mean(y_true) - 0.2) < 0.05 # Within expected range + + def test_different_parameters(self): + """Test with different parameters.""" + y_true, y_prob = create_sample_fraud_data(n_samples=50, fraud_rate=0.5) + + assert len(y_true) == 50 + assert abs(np.mean(y_true) - 0.5) < 0.1 + + def test_reproducibility(self): + """Test that data generation is reproducible.""" + y_true1, y_prob1 = create_sample_fraud_data() + y_true2, y_prob2 = create_sample_fraud_data() + + np.testing.assert_array_equal(y_true1, y_true2) + np.testing.assert_array_equal(y_prob1, y_prob2) + + +class TestIntegration: + """Integration tests for the calibration module.""" + + def test_full_workflow(self): + """Test complete calibration analysis workflow.""" + # Create sample data + y_true, y_prob = create_sample_fraud_data(n_samples=1000) + + # Initialize analyzer + analyzer = CalibrationAnalyzer(n_bins=8, strategy='quantile') + + # Compute calibration curve + fraction_pos, mean_pred = analyzer.compute_calibration_curve(y_true, y_prob) + + # Compute metrics + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + + # Generate plot + with patch('matplotlib.pyplot.show'): + fig = analyzer.plot_calibration_curve(y_true, y_prob, "Integration Test") + + # Generate report + report = analyzer.generate_calibration_report(y_true, y_prob, "Integration Test") + + # Verify all components work together + assert len(fraction_pos) > 0 + assert len(metrics) == 8 + assert isinstance(fig, plt.Figure) + assert len(report) > 100 + assert "Integration Test" in report + + def test_multiple_models_comparison(self): + """Test multi-model comparison workflow.""" + # Generate different 
quality models + models_data = { + 'Poor Model': create_sample_fraud_data(200, 0.1), + 'Good Model': create_sample_fraud_data(200, 0.2), + 'Excellent Model': create_sample_fraud_data(200, 0.15) + } + + analyzer = CalibrationAnalyzer(n_bins=10) + + with patch('matplotlib.pyplot.show'): + fig = analyzer.plot_multiple_models(models_data) + + assert isinstance(fig, plt.Figure) + assert len(fig.axes) == 4 + + +if __name__ == "__main__": + pytest.main([__file__]) diff --git a/docs/requirements.txt b/docs/requirements.txt index 4330a4b..e99f978 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,3 +1,44 @@ -sphinx>=7.0 -sphinx-rtd-theme>=2.0 -sphinx-autobuild>=2024.0 +# Documentation requirements for AstroML + +# Core documentation tools +sphinx>=5.0.0 +sphinx-rtd-theme>=1.2.0 +sphinx-autodoc-typehints>=1.19.0 +myst-parser>=0.18.0 + +# Code documentation and examples +jupyter>=1.0.0 +jupyterlab>=3.6.0 +ipython>=8.10.0 +nbconvert>=7.2.0 +nbsphinx>=0.8.0 + +# Mathematical and scientific documentation +numpydoc>=1.5.0 +sphinx-math-dollar>=1.2.0 + +# API documentation +sphinx-autosummary-accessors>=0.1.2 +sphinx-copybutton>=0.5.0 + +# Documentation building and deployment +sphinx-multiversion>=0.2.4 +sphinxcontrib-openapi>=0.8.0 + +# Plotting and visualization for docs +matplotlib>=3.7.0 +seaborn>=0.12.0 +plotly>=5.14.0 + +# Performance and profiling +memory-profiler>=0.60.0 +line-profiler>=4.0.0 + +# Documentation testing (doctest is stdlib; pytest runs doctests natively) +pytest>=7.0.0 +pytest-doctestplus>=0.12.0 + +# Documentation utilities +sphinx-click>=4.4.0 +sphinx-tabs>=3.4.0 +sphinx-design>=0.3.0 diff --git a/examples/calibration_example.py b/examples/calibration_example.py new file mode 100644 index 0000000..7032882 --- /dev/null +++ b/examples/calibration_example.py @@ -0,0 +1,343 @@ +"""Example usage of calibration curve visualization for fraud scores. + +This example demonstrates how to use the calibration analysis tools +to evaluate fraud detection models in the AstroML framework.
+""" +from __future__ import annotations + +import numpy as np +import matplotlib.pyplot as plt +from typing import Dict, Tuple + +from astroml.validation.calibration import ( + CalibrationAnalyzer, + create_sample_fraud_data +) + + +def create_realistic_fraud_models() -> Dict[str, Tuple[np.ndarray, np.ndarray]]: + """ + Create realistic fraud detection model outputs for comparison. + + Returns: + Dictionary of model_name -> (y_true, y_prob) + """ + np.random.seed(42) + + models = {} + + # Model 1: Well-calibrated baseline model + y_true1, y_prob1 = create_sample_fraud_data(n_samples=2000, fraud_rate=0.08) + models['Baseline Model'] = (y_true1, y_prob1) + + # Model 2: Overconfident model (common issue) + y_true2, y_prob2 = create_sample_fraud_data(n_samples=2000, fraud_rate=0.08) + # Make predictions more extreme (overconfident) + y_prob2 = np.power(y_prob2, 0.7) # Push probabilities toward 0 and 1 + y_prob2 = np.clip(y_prob2, 0.01, 0.99) + models['Overconfident Model'] = (y_true2, y_prob2) + + # Model 3: Underconfident model + y_true3, y_prob3 = create_sample_fraud_data(n_samples=2000, fraud_rate=0.08) + # Make predictions more conservative (underconfident) + y_prob3 = np.power(y_prob3, 1.5) # Push probabilities toward 0.5 + y_prob3 = np.clip(y_prob3, 0.01, 0.99) + models['Underconfident Model'] = (y_true3, y_prob3) + + # Model 4: Poorly calibrated model + y_true4, y_prob4 = create_sample_fraud_data(n_samples=2000, fraud_rate=0.08) + # Add systematic bias + y_prob4 = y_prob4 * 0.7 + 0.15 # Shift predictions upward + y_prob4 = np.clip(y_prob4, 0.01, 0.99) + models['Poorly Calibrated Model'] = (y_true4, y_prob4) + + return models + + +def demonstrate_single_model_calibration(): + """Demonstrate calibration analysis for a single model.""" + print("=" * 60) + print("Single Model Calibration Analysis") + print("=" * 60) + + # Create sample data + y_true, y_prob = create_sample_fraud_data(n_samples=5000, fraud_rate=0.12) + + # Initialize analyzer + analyzer = 
CalibrationAnalyzer(n_bins=15, strategy='quantile') + + # Generate calibration plot + fig = analyzer.plot_calibration_curve( + y_true, y_prob, + model_name="Fraud Detection Model", + figsize=(15, 10) + ) + + # Generate detailed report + report = analyzer.generate_calibration_report( + y_true, y_prob, + model_name="Fraud Detection Model" + ) + + print("\nCalibration Report:") + print(report) + + # Save the plot + fig.savefig('examples/single_model_calibration.png', dpi=300, bbox_inches='tight') + print("\nPlot saved as 'single_model_calibration.png'") + + plt.show() + + +def demonstrate_multi_model_comparison(): + """Demonstrate calibration comparison across multiple models.""" + print("\n" + "=" * 60) + print("Multi-Model Calibration Comparison") + print("=" * 60) + + # Create multiple models + models_data = create_realistic_fraud_models() + + # Initialize analyzer + analyzer = CalibrationAnalyzer(n_bins=12, strategy='uniform') + + # Generate comparison plot + fig = analyzer.plot_multiple_models( + models_data, + figsize=(16, 12) + ) + + # Generate individual reports for each model + print("\nIndividual Model Reports:") + print("-" * 40) + + for model_name, (y_true, y_prob) in models_data.items(): + print(f"\n{model_name}:") + metrics = analyzer.compute_calibration_metrics(y_true, y_prob) + + print(f" Brier Score: {metrics['brier_score']:.4f}") + print(f" ECE: {metrics['ece']:.4f}") + print(f" Overconfidence: {metrics['overconfidence']:.4f}") + print(f" Underconfidence: {metrics['underconfidence']:.4f}") + + # Quick interpretation + if metrics['overconfidence'] > 0.05: + print(" โ†’ Model is OVERCONFIDENT") + elif metrics['underconfidence'] > 0.05: + print(" โ†’ Model is UNDERCONFIDENT") + else: + print(" โ†’ Model is reasonably calibrated") + + # Save the comparison plot + fig.savefig('examples/multi_model_calibration.png', dpi=300, bbox_inches='tight') + print("\nComparison plot saved as 'multi_model_calibration.png'") + + plt.show() + + +def 
demonstrate_calibration_improvement(): + """Demonstrate calibration improvement techniques.""" + print("\n" + "=" * 60) + print("Calibration Improvement Demonstration") + print("=" * 60) + + # Create poorly calibrated model + y_true, y_prob_poor = create_sample_fraud_data(n_samples=3000, fraud_rate=0.1) + + # Make it poorly calibrated + y_prob_poor = np.clip(y_prob_poor * 0.6 + 0.2, 0.01, 0.99) + + # Apply simple temperature scaling (calibration improvement) + temperature = 1.5 # Temperature > 1 makes predictions less extreme + # Sigmoid of the temperature-scaled logit; note the minus sign in the exponent + y_prob_calibrated = 1 / (1 + np.exp(-np.log(y_prob_poor / (1 - y_prob_poor)) / temperature)) + + # Compare before and after calibration + models_data = { + 'Before Calibration': (y_true, y_prob_poor), + 'After Temperature Scaling': (y_true, y_prob_calibrated) + } + + analyzer = CalibrationAnalyzer(n_bins=10) + + # Generate comparison + fig = analyzer.plot_multiple_models(models_data, figsize=(14, 10)) + + # Show improvement metrics + print("\nCalibration Improvement Metrics:") + print("-" * 40) + + for model_name, (y_t, y_p) in models_data.items(): + metrics = analyzer.compute_calibration_metrics(y_t, y_p) + print(f"\n{model_name}:") + print(f" Brier Score: {metrics['brier_score']:.4f}") + print(f" ECE: {metrics['ece']:.4f}") + print(f" Log Loss: {metrics['log_loss']:.4f}") + + # Calculate improvement + metrics_before = analyzer.compute_calibration_metrics(y_true, y_prob_poor) + metrics_after = analyzer.compute_calibration_metrics(y_true, y_prob_calibrated) + + ece_improvement = (metrics_before['ece'] - metrics_after['ece']) / metrics_before['ece'] * 100 + brier_improvement = (metrics_before['brier_score'] - metrics_after['brier_score']) / metrics_before['brier_score'] * 100 + + print("\nImprovement:") + print(f" ECE Improvement: {ece_improvement:.1f}%") + print(f" Brier Score Improvement: {brier_improvement:.1f}%") + + # Save the plot + fig.savefig('examples/calibration_improvement.png', dpi=300, bbox_inches='tight') +
print("\nImprovement plot saved as 'calibration_improvement.png'") + + plt.show() + + +def demonstrate_threshold_optimization(): + """Demonstrate threshold optimization based on calibration.""" + print("\n" + "=" * 60) + print("Threshold Optimization Based on Calibration") + print("=" * 60) + + # Create sample data + y_true, y_prob = create_sample_fraud_data(n_samples=5000, fraud_rate=0.08) + + analyzer = CalibrationAnalyzer(n_bins=20) + + # Test different thresholds + thresholds = np.arange(0.1, 0.9, 0.05) + + results = { + 'threshold': [], + 'precision': [], + 'recall': [], + 'f1': [], + 'calibration_error': [] + } + + for threshold in thresholds: + y_pred = (y_prob >= threshold).astype(int) + + # Calculate metrics + tp = np.sum((y_pred == 1) & (y_true == 1)) + fp = np.sum((y_pred == 1) & (y_true == 0)) + fn = np.sum((y_pred == 0) & (y_true == 1)) + + precision = tp / (tp + fp) if (tp + fp) > 0 else 0 + recall = tp / (tp + fn) if (tp + fn) > 0 else 0 + f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0 + + # Calculate calibration error for predictions above threshold + mask = y_prob >= threshold + if np.sum(mask) > 0: + pred_mean = np.mean(y_prob[mask]) + true_rate = np.mean(y_true[mask]) + calibration_error = abs(pred_mean - true_rate) + else: + calibration_error = 0 + + results['threshold'].append(threshold) + results['precision'].append(precision) + results['recall'].append(recall) + results['f1'].append(f1) + results['calibration_error'].append(calibration_error) + + # Create threshold optimization plot + fig, axes = plt.subplots(2, 2, figsize=(14, 10)) + fig.suptitle('Threshold Optimization Analysis', fontsize=16, fontweight='bold') + + # Precision-Recall curve + ax1 = axes[0, 0] + ax1.plot(results['recall'], results['precision'], 'b-', linewidth=2, marker='o') + ax1.set_xlabel('Recall') + ax1.set_ylabel('Precision') + ax1.set_title('Precision-Recall Curve') + ax1.grid(True, alpha=0.3) + + # F1 Score vs Threshold + ax2 = 
axes[0, 1] + ax2.plot(results['threshold'], results['f1'], 'g-', linewidth=2, marker='s') + ax2.set_xlabel('Threshold') + ax2.set_ylabel('F1 Score') + ax2.set_title('F1 Score vs Threshold') + ax2.grid(True, alpha=0.3) + + # Calibration Error vs Threshold + ax3 = axes[1, 0] + ax3.plot(results['threshold'], results['calibration_error'], 'r-', linewidth=2, marker='^') + ax3.set_xlabel('Threshold') + ax3.set_ylabel('Calibration Error') + ax3.set_title('Calibration Error vs Threshold') + ax3.grid(True, alpha=0.3) + + # Combined metrics + ax4 = axes[1, 1] + ax4_twin = ax4.twinx() + + line1 = ax4.plot(results['threshold'], results['f1'], 'g-', linewidth=2, label='F1 Score') + line2 = ax4_twin.plot(results['threshold'], results['calibration_error'], 'r-', linewidth=2, label='Calibration Error') + + ax4.set_xlabel('Threshold') + ax4.set_ylabel('F1 Score', color='g') + ax4_twin.set_ylabel('Calibration Error', color='r') + ax4.tick_params(axis='y', labelcolor='g') + ax4_twin.tick_params(axis='y', labelcolor='r') + + # Combined legend + lines = line1 + line2 + labels = [l.get_label() for l in lines] + ax4.legend(lines, labels, loc='upper left') + ax4.set_title('Combined Metrics') + ax4.grid(True, alpha=0.3) + + # Find optimal threshold (max F1 with reasonable calibration) + threshold_array = np.array(results['threshold']) + f1_array = np.array(results['f1']) + cal_error_array = np.array(results['calibration_error']) + + # Filter for reasonable calibration (error < 0.1) + reasonable_mask = cal_error_array < 0.1 + if np.any(reasonable_mask): + # Map the argmax within the masked subset back to an index into the + # full arrays (a boolean mask cannot index the plain lists in `results`) + candidate_indices = np.flatnonzero(reasonable_mask) + optimal_idx = candidate_indices[np.argmax(f1_array[reasonable_mask])] + else: + # Fallback to max F1 + optimal_idx = np.argmax(f1_array) + optimal_threshold = threshold_array[optimal_idx] + optimal_f1 = f1_array[optimal_idx] + optimal_cal_error = cal_error_array[optimal_idx] +
+ print(f"\nOptimal Threshold Analysis:") + print(f" Optimal Threshold: {optimal_threshold:.3f}") + print(f" F1 Score: {optimal_f1:.3f}") + print(f" Calibration Error: {optimal_cal_error:.3f}") + + plt.tight_layout() + fig.savefig('examples/threshold_optimization.png', dpi=300, bbox_inches='tight') + print("\nThreshold optimization plot saved as 'threshold_optimization.png'") + + plt.show() + + +def main(): + """Run all calibration analysis examples.""" + print("AstroML Calibration Analysis Examples") + print("=====================================") + + # Create examples directory + import os + os.makedirs('examples', exist_ok=True) + + # Run demonstrations + demonstrate_single_model_calibration() + demonstrate_multi_model_comparison() + demonstrate_calibration_improvement() + demonstrate_threshold_optimization() + + print("\n" + "=" * 60) + print("All calibration analysis examples completed!") + print("Check the 'examples/' directory for generated plots.") + print("=" * 60) + + +if __name__ == "__main__": + main()