Machine learning analysis of marriage outcomes using 5,000 synthetic couples to identify predictive patterns, risk factors, and relationship dynamics.
Three complementary analysis frameworks examine what actually predicts divorce through classification models, survival analysis, and advanced probability mapping.
| Framework | Focus | Visualizations |
|---|---|---|
| Prediction Analysis | Classification models and feature importance | 4 visualization suites |
| Advanced Profiling | Survival curves and relationship archetypes | 4 visualization suites |
| Network & Probability | Feature networks and density mapping | 4 visualization suites |
| Educational Synthesis | Plain language explanations of methodology | Interactive HTML document |
Source: Synthetic dataset simulating realistic marriage conditions
| Attribute | Value |
|---|---|
| Total Couples | 5,000 |
| Features | 21 predictors + 1 target |
| Outcome Variable | Divorced (binary: 0 or 1) |
| Feature Types | Numeric, categorical, engineered |
| Category | Variables |
|---|---|
| Demographics | Age at marriage, marriage duration, education level |
| Economic | Combined income, employment status |
| Relationship | Communication score, conflict frequency |
| Family | Number of children |
| Compatibility | Religious compatibility, cultural background match |
Random Forest, Gradient Boosting, and Logistic Regression models identify strongest predictors. ROC curves evaluate performance. Confusion matrices reveal error patterns. Interaction effects show how variables compound.
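The three-model comparison described above can be sketched as follows. This is a minimal illustration, not the repository's script: it assumes scikit-learn's `make_classification` as a stand-in for `divorce_df.csv`, and the split ratio, seeds, and `results` dictionary are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data with 5,000 rows and 21 features (NOT the repository's dataset).
X, y = make_classification(
    n_samples=5000, n_features=21, n_informative=8, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class drives the ROC curve and AUC.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "auc": roc_auc_score(y_test, proba),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_test, model.predict(X_test)),
    }
    print(f"{name}: AUC={results[name]['auc']:.3f}")
```

The AUC values printed here depend entirely on the stand-in data and will not match the figures reported for the actual dataset.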
Kaplan-Meier curves track marriage longevity. Hazard functions identify high-risk periods. Relationship profiling through K-means clustering discovers five distinct archetypes from Harmonious to Distressed.
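To make the Kaplan-Meier idea concrete, here is a small NumPy sketch of the product-limit estimator. The function name `kaplan_meier` and the toy durations are assumptions for illustration; the repository may compute its curves differently.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Product-limit survival estimate for right-censored data.

    durations: years each couple was observed
    events: 1 if a divorce was observed at that time, 0 if censored
            (the marriage was still ongoing when observation ended)
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(durations[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)              # still married just before t
        divorces = np.sum((durations == t) & (events == 1))
        s *= 1.0 - divorces / at_risk                 # conditional survival at t
        survival.append(s)
    return event_times, np.array(survival)

# Toy data: eight couples, three of them censored.
durations = [2, 3, 3, 5, 7, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]
times, surv = kaplan_meier(durations, events)
for t, s in zip(times, surv):
    print(f"year {t:.0f}: S(t) = {s:.3f}")
```

Censored couples still count in the at-risk denominator until their observation ends, which is how ongoing marriages contribute information without being treated as divorces.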
Graph-based feature relationships visualize interconnections. 2D probability density maps show risk across variable combinations. Partial dependence plots isolate individual feature effects. Waterfall charts decompose predictions.
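A 2D probability map of the kind described above can be built by binning two features and dividing divorce counts by total counts per cell. The sketch below uses invented data: the variable ranges and the logistic relationship generating `divorced` are assumptions, not properties of the repository's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
communication = rng.uniform(1, 10, n)   # hypothetical 1-10 score
conflict = rng.uniform(0, 6, n)         # hypothetical conflicts per month

# Invented relationship, for illustration only.
logit = -0.5 * (communication - 5.5) + 0.6 * (conflict - 3)
divorced = rng.random(n) < 1 / (1 + np.exp(-logit))

# 9 communication bins by 6 conflict bins.
bins = [np.linspace(1, 10, 10), np.linspace(0, 6, 7)]
total, xedges, yedges = np.histogram2d(communication, conflict, bins=bins)
hits, _, _ = np.histogram2d(
    communication[divorced], conflict[divorced], bins=bins
)

# Empirical divorce rate per cell; empty cells become NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    rate = np.where(total > 0, hits / total, np.nan)

print(np.round(rate, 2))
```

The resulting `rate` array can be passed to `matplotlib.pyplot.pcolormesh` or `contourf` together with `xedges` and `yedges` to draw the heatmap.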
git clone https://github.com/Cazzy-Aporbo/Divorce-Prediction.git
cd Divorce-Prediction
pip install pandas numpy matplotlib seaborn scikit-learn scipy networkx
python divorce_prediction_analysis.py
Generates:
- predictive_power_analysis.png
- classification_performance.png
- relationship_dynamics_matrix.png
- interaction_effects_analysis.png
python divorce_advanced_profiling.py
Generates:
- survival_analysis_curves.png
- risk_profiling_matrix.png
- temporal_pattern_analysis.png
- multidimensional_relationship_space.png
python divorce_network_probability.py
Generates:
- feature_network_graph.png
- probability_density_maps.png
- partial_dependence_analysis.png
- feature_contribution_waterfall.png
Open divorce_analysis_synthesis.html in any browser for plain language explanations of the math and methodology.
Feature correlation with divorce outcome, Random Forest importance rankings, communication impact stratification, and temporal risk profiles.
ROC curves comparing three models, confusion matrix showing prediction accuracy, decision tree logic visualization, and threshold optimization analysis.
Communication-conflict phase space mapping, religious compatibility impact, income-children risk heatmap, and education level effects.
Age-duration interactions, communication-conflict compounding, income-children relationships, and cultural-religious compatibility patterns.
Marriage survival probability over time, stratified by communication quality and income quartiles, plus hazard function showing danger zones.
Five relationship archetypes with divorce rates, profile feature signatures, communication distributions, and comparative risk factor analysis.
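Archetype discovery of this kind typically follows a scale, cluster, evaluate pattern. The sketch below assumes three hypothetical features drawn at random, so the clusters it finds carry no meaning; it only shows the mechanics of K-means with five centers, `StandardScaler`, and a silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical features on very different scales.
features = np.column_stack([
    rng.uniform(1, 10, 1000),           # communication score
    rng.uniform(0, 6, 1000),            # conflicts per month
    rng.normal(80_000, 25_000, 1000),   # combined income
])

# Standardize so income does not dominate the Euclidean distances.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Silhouette ranges from -1 to 1; higher means better-separated clusters.
score = silhouette_score(scaled, labels)
print(f"silhouette = {score:.3f}")
print("cluster sizes:", np.bincount(labels))
```

Without the scaling step, the income column (tens of thousands) would swamp the 1-10 communication score in every distance calculation.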
Divorce rate evolution over marriage duration, age-stratified patterns, critical period analysis, and factor importance shifts over time.
Communication-conflict topology, profile distribution in socioeconomic space, stability score distributions, and outcome patterns by archetype.
Graph visualization showing correlations between features and divorce outcome. Node size indicates importance, edge width shows relationship strength.
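One way to build such a graph is to threshold the absolute correlation matrix and use the surviving correlations as edge weights. The sketch below is an assumption about the general approach, not the repository's code: the column names, the synthetic data, and the `threshold` of 0.2 are all illustrative.

```python
import networkx as nx
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
communication = rng.normal(size=n)
conflict = -0.6 * communication + 0.8 * rng.normal(size=n)
income = rng.normal(size=n)  # independent of everything else by construction
divorced = (0.8 * conflict - 0.8 * communication + rng.normal(size=n) > 0).astype(int)

df = pd.DataFrame({
    "communication": communication,
    "conflict": conflict,
    "income": income,
    "divorced": divorced,
})

corr = df.corr()
threshold = 0.2  # prune weak edges for clarity (illustrative value)

G = nx.Graph()
G.add_nodes_from(corr.columns)
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        r = float(corr.loc[a, b])
        if abs(r) >= threshold:
            G.add_edge(a, b, weight=abs(r))  # edge width can follow |r|

pos = nx.spring_layout(G, seed=42)        # node coordinates for plotting
centrality = nx.degree_centrality(G)      # one possible importance ranking
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
```

`pos` and the edge weights can then be handed to `nx.draw_networkx` to render node positions and edge widths.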
Four 2D heatmaps showing divorce probability across feature combinations: communication-conflict landscape, economic-temporal risk, distribution overlaps, and age-children surface.
Six plots showing isolated effect of each feature on divorce probability while holding others constant.
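A partial dependence curve can be computed by hand: overwrite one feature with a fixed value for every row, average the model's predicted probabilities, and repeat across a grid. Strictly speaking this averages over the observed values of the other features rather than freezing them. The helper name `partial_dependence_curve` and the synthetic data are assumptions for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def partial_dependence_curve(model, X, feature_idx, grid):
    """Average predicted probability as one feature is swept over `grid`."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value            # perturb a single feature
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(curve)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
# Invented relationship: feature 0 lowers the outcome probability.
logit = -2.0 * X[:, 0] + 0.5 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

grid = np.linspace(-2, 2, 9)
curve = partial_dependence_curve(model, X, feature_idx=0, grid=grid)
for g, c in zip(grid, curve):
    print(f"x0 = {g:+.1f} -> mean P(y=1) = {c:.2f}")
```

scikit-learn also ships `sklearn.inspection.partial_dependence`, which implements the same idea with more options.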
Decomposition of predictions for high-risk and low-risk cases showing how each feature pushes probability up or down from baseline.
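For a linear model, this kind of decomposition has a simple closed form: each feature contributes its coefficient times its deviation from the average, measured in log-odds. The sketch below assumes a logistic regression and invented data; the repository's waterfall charts may use a different model or attribution method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
feature_names = ["communication", "conflict", "income"]  # hypothetical names
X = rng.normal(size=(2000, 3))
true_logit = -1.5 * X[:, 0] + 1.0 * X[:, 1] - 0.3 * X[:, 2]
y = (rng.random(2000) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

def waterfall_terms(model, X_background, x):
    """Split one prediction's log-odds into a baseline plus per-feature terms."""
    mean = X_background.mean(axis=0)
    coef = model.coef_[0]
    baseline = model.intercept_[0] + coef @ mean   # log-odds of the average case
    contributions = coef * (x - mean)              # push up (+) or down (-)
    return baseline, contributions

x_case = np.array([-2.0, 2.0, 0.0])  # low communication, high conflict
baseline, contributions = waterfall_terms(model, X, x_case)

print(f"{'baseline':>14}: {baseline:+.2f} log-odds")
for name, c in zip(feature_names, contributions):
    print(f"{name:>14}: {c:+.2f} log-odds")

total = baseline + contributions.sum()
probability = 1 / (1 + np.exp(-total))
print(f"predicted probability = {probability:.2f}")
```

The terms sum exactly to the model's log-odds for the case, which is what lets a waterfall chart stack them from baseline to final prediction. Tree ensembles need a different attribution method, since they have no per-feature coefficients.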
- Random Forest with 100 estimators for ensemble learning
- Gradient Boosting for sequential error correction
- Logistic Regression for interpretable linear relationships
- Cross-validation for robust performance estimates
- Kaplan-Meier curves for time-to-event analysis
- Hazard function computation for instantaneous risk
- Log-rank tests for group comparisons
- Censoring handling for ongoing marriages
- K-means clustering with 5 centers for archetype discovery
- StandardScaler normalization for fair distance metrics
- Silhouette analysis for cluster quality
- Feature-based profile characterization
- Kernel Density Estimation for smooth distributions
- 2D binning for probability surface construction
- Contour plotting for risk threshold visualization
- Partial dependence via feature perturbation
- Correlation-based edge weighting
- Spring layout algorithm for node positioning
- Centrality metrics for importance ranking
- Graph pruning for clarity
| Package | Version | Purpose |
|---|---|---|
| pandas | 1.3+ | Data manipulation |
| numpy | 1.21+ | Numerical operations |
| matplotlib | 3.4+ | Visualization |
| seaborn | 0.11+ | Statistical plots |
| scikit-learn | 0.24+ | Machine learning |
| scipy | 1.7+ | Scientific computing |
| networkx | 2.6+ | Network analysis |
| Color | Hex Code | Usage |
|---|---|---|
| Deep Slate | #0F1618 | Primary background |
| Black | #000000 | Contrast elements |
| Purple | #4A4682 | Accent highlights |
| Teal | #3A5C60 | Secondary elements |
| Steel Blue | #4A696E | Data series |
| Mint | #8FC7B8 | Primary foreground |
Divorce-Prediction/
├── divorce_df.csv
├── divorce_prediction_analysis.py
├── divorce_advanced_profiling.py
├── divorce_network_probability.py
├── divorce_analysis_synthesis.html
├── predictive_power_analysis.png
├── classification_performance.png
├── relationship_dynamics_matrix.png
├── interaction_effects_analysis.png
├── survival_analysis_curves.png
├── risk_profiling_matrix.png
├── temporal_pattern_analysis.png
├── multidimensional_relationship_space.png
├── feature_network_graph.png
├── probability_density_maps.png
├── partial_dependence_analysis.png
├── feature_contribution_waterfall.png
└── README.md
| Finding | Description |
|---|---|
| Communication Dominance | Communication score is strongest predictor with 3x divorce rate difference between high and low scorers |
| Conflict Sweet Spot | 1-2 conflicts per month optimal - zero suggests avoidance, 3+ signals distress |
| Seven Year Reality | Divorce hazard spikes around years 7-10, echoing the cultural "seven-year itch" within this synthetic dataset |
| Five Archetypes | Couples cluster into distinct profiles: Harmonious, Distressed, Passionate, Disconnected, Moderate |
| Interaction Effects | Variables compound - low income plus many children creates multiplicative risk beyond additive |
| Temporal Dynamics | Factor importance shifts over marriage duration - communication matters more early, conflict matters more late |
| Model | AUC | Accuracy | Best Use Case |
|---|---|---|---|
| Random Forest | 0.85 | 78% | Feature importance ranking |
| Gradient Boosting | 0.84 | 77% | High-accuracy predictions |
| Logistic Regression | 0.79 | 73% | Interpretable coefficients |
| Profile | Divorce Rate | Communication | Conflict | Prevalence |
|---|---|---|---|---|
| Harmonious | 10% | High (8+) | Low (0-1) | 18% |
| Distressed | 70% | Low (<5) | High (3+) | 22% |
| Passionate | 35% | High (7+) | High (3+) | 15% |
| Disconnected | 45% | Low (<5) | Low (0-2) | 25% |
| Moderate | 30% | Medium (5-7) | Medium (2-3) | 20% |
Plain language explanations of machine learning, survival analysis, and probability mapping without academic jargon
The HTML synthesis explains:
- How Random Forest creates predictions through ensemble voting
- What ROC curves actually measure and why AUC matters
- Survival analysis borrowed from medical research
- Interaction effects and when two plus two equals five
- K-means clustering for discovering relationship archetypes
- Correlation versus causation and why it matters
- Confusion matrices and the precision-recall tradeoff
- Partial dependence for isolating individual feature effects
Identify high-risk profiles early. Distressed couples need immediate intervention. Disconnected couples need engagement strategies. Passionate couples need conflict resolution skills.
Target support programs at critical periods. Years 1-3 and 7-10 show elevated hazard rates. Low-income families with multiple children face compounded stress.
Use probability maps to assess couple compatibility. Communication-conflict space provides clear risk zones. Age-children surface identifies demographic vulnerabilities.
Natural experiments for causal inference. Longitudinal tracking for temporal validation. Cross-cultural comparison for generalizability testing.
| Limitation | Impact |
|---|---|
| Synthetic Data | Patterns may not perfectly reflect real-world complexity |
| Unmeasured Variables | Love, chemistry, values not captured in dataset |
| Correlation Focus | Models predict but don't establish causation |
| Snapshot Design | Cross-sectional rather than longitudinal tracking |
| Cultural Context | Results may vary across different populations |
MIT License - see dataset source for data-specific terms.
Cazandra Aporbo
Data Scientist
GitHub
Analysis frameworks developed using open-source scientific Python ecosystem. Visualization aesthetics optimized for dark backgrounds and perceptual uniformity.