A statistical and exploratory data analysis of the Wisconsin Breast Cancer dataset for cancer diagnosis prediction.
This project analyzes breast cancer diagnostic data, using hypothesis testing, correlation analysis, and feature clustering to identify the characteristics that distinguish malignant from benign tumors.
- Source: Wisconsin Breast Cancer Dataset
- Samples: 569 patients
- Features: 30 numeric features + 1 target variable
- Task: Binary classification (Malignant vs Benign)
- Target Distribution: 357 Benign (62.7%), 212 Malignant (37.3%)
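
The class balance listed above can be checked from the counts alone. The sketch below rebuilds a label column from the reported counts rather than reading the dataset file, because the file format and the name of the label column are not stated here; `diagnosis` is a hypothetical column name.

```python
import pandas as pd

# Rebuild the label column from the reported counts (no patient data needed).
# "diagnosis" is a hypothetical column name used only for illustration.
labels = pd.Series(["Benign"] * 357 + ["Malignant"] * 212, name="diagnosis")

counts = labels.value_counts()
percentages = (labels.value_counts(normalize=True) * 100).round(1)

# Combine counts and percentages into one summary table.
summary = pd.DataFrame({"count": counts, "percent": percentages})
print(summary)
```

With the real data loaded into a DataFrame, the same two `value_counts` calls on the label column would give the distribution directly.
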
- No missing values, so no imputation is required
- Strong predictive features identified through correlation analysis
- Statistically significant differences between malignant and benign groups
- Feature clustering reveals natural groupings of measurements
- concave points_worst (correlation with diagnosis: 0.794)
- perimeter_worst (correlation with diagnosis: 0.783)
- concave points_mean (correlation with diagnosis: 0.777)
- radius_worst (correlation with diagnosis: 0.776)
- perimeter_mean (correlation with diagnosis: 0.743)
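
A ranking like the one above can be produced by correlating every feature with a numerically encoded target. This is a minimal sketch, not the notebook's code: it assumes the diagnosis has been encoded as 0/1, and the helper name `rank_by_target_correlation`, the column names, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df, target, top_n=5):
    """Return the top_n features by absolute Pearson correlation with target."""
    features = df.drop(columns=[target])
    corr = features.corrwith(df[target])
    # Sort by magnitude but keep the signed correlation values.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top_n)


# Synthetic stand-in data: one strong, one weak, and one uninformative feature.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "strong_feature": y * 2.0 + rng.normal(0, 0.5, n),
    "weak_feature": y * 0.3 + rng.normal(0, 1.0, n),
    "noise_feature": rng.normal(0, 1.0, n),
    "is_malignant": y,  # assumed 0/1 encoding of the diagnosis
})

ranking = rank_by_target_correlation(demo, "is_malignant", top_n=3)
print(ranking)
```

Pearson correlation against a binary target is equivalent to the point-biserial correlation, which is why a 0/1 encoding is enough here.
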
- Hypothesis Testing: Highly significant differences in tumor size (p < 0.001)
- Effect Size: Large by Cohen's conventional thresholds (Cohen's d = 0.875, above the 0.8 cutoff for a large effect)
- Clinical Significance: Malignant tumors are 5.32 units larger in radius on average
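
The quantities reported above (a p-value, Cohen's d, and a mean difference) can be computed along the following lines. This is a sketch under assumptions: the exact test used in the notebook is not stated here, so Welch's t-test is a stand-in, Cohen's d uses the pooled standard deviation, and the two samples are synthetic values invented for illustration.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Synthetic samples (NOT the real measurements); only the group sizes mirror the dataset.
rng = np.random.default_rng(42)
group_malignant = rng.normal(loc=10.0, scale=2.0, size=212)
group_benign = rng.normal(loc=8.0, scale=2.0, size=357)

# Welch's t-test does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(group_malignant, group_benign, equal_var=False)
mean_diff = group_malignant.mean() - group_benign.mean()
d = cohens_d(group_malignant, group_benign)

print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print(f"mean difference = {mean_diff:.2f}, Cohen's d = {d:.2f}")
```

Reporting the effect size next to the p-value matters because, with several hundred samples, even small differences can reach statistical significance.
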
├── breast_Cancer_analysis.ipynb # Main analysis notebook
├── Breast_Cancer_dataset # Dataset file
└── README.md # This file
- Data Summary and Overview - Dataset characteristics and quality assessment
- Data Exploration Plan - Structured approach to analysis
- Exploratory Data Analysis - Statistical analysis and visualizations
- Key Findings and Insights - Summary of important discoveries
- Hypotheses Development - Three research hypotheses formulated
- Statistical Significance Testing - Rigorous hypothesis testing
- Feature Relationship Analysis - Correlation and clustering analysis
- Conclusions and Next Steps - Summary and recommendations
- Python 3.x
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Matplotlib & Seaborn - Data visualization
- SciPy - Statistical analysis and hypothesis testing
- Jupyter Notebook - Interactive analysis environment
- Comprehensive correlation heatmaps
- Statistical distribution comparisons
- Feature relationship dendrograms
- Hypothesis testing visualizations
- Clinical significance plots
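
The feature-relationship dendrograms rest on hierarchical clustering of the features. One common way to do this, sketched below as an assumption rather than the notebook's actual method, is to turn the absolute correlation matrix into a distance (1 - |r|) and apply average linkage. The helper `cluster_features` and the synthetic column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_features(df, n_clusters):
    """Group columns of df by correlation; returns (linkage matrix, cluster labels)."""
    corr = df.corr().abs().to_numpy()
    dist = np.clip(1.0 - corr, 0.0, None)  # strongly correlated features are close
    dist = (dist + dist.T) / 2.0           # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, pd.Series(labels, index=df.columns, name="cluster")


# Synthetic data driven by two independent latent factors ("size" and "shape").
rng = np.random.default_rng(1)
n = 300
size = rng.normal(size=n)
shape = rng.normal(size=n)
demo = pd.DataFrame({
    "radius_like": size + rng.normal(0, 0.1, n),
    "perimeter_like": 6.3 * size + rng.normal(0, 0.5, n),
    "area_like": 3.0 * size + rng.normal(0, 0.3, n),
    "concavity_like": shape + rng.normal(0, 0.1, n),
    "concave_points_like": 0.5 * shape + rng.normal(0, 0.05, n),
})

Z, clusters = cluster_features(demo, n_clusters=2)
print(clusters)
```

Passing `Z` and `labels=list(demo.columns)` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree with Matplotlib.
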
- ✅ Significant size differences between tumor types
- ✅ Large effect sizes indicating clinical relevance
- ✅ Robust feature correlations for prediction
- ✅ Clear feature clustering patterns identified
- Larger tumors strongly associated with malignancy
- Shape irregularities (concave points) are key indicators
- Multiple complementary features provide diagnostic confidence
- Statistical foundation supports clinical decision-making
- Implement classification models (Random Forest, SVM, Neural Networks)
- Perform feature selection optimization
- Apply cross-validation and hyperparameter tuning
- Evaluate model performance
- External dataset validation
- Ensemble method development
- Clinical decision support tool creation
- Cost-effectiveness analysis
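
As a possible starting point for the modeling steps listed above, the sketch below runs a cross-validated Random Forest. It is an assumption-laden illustration, not part of the completed analysis: it uses scikit-learn, which is not among the tools listed in this README, and loads scikit-learn's bundled copy of the Wisconsin diagnostic data (569 samples, 30 features) instead of the project's own dataset file. The hyperparameters are arbitrary defaults, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# scikit-learn's bundled copy of the data; in it, target 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Untuned baseline model; n_estimators and random_state are illustrative choices.
model = RandomForestClassifier(n_estimators=200, random_state=0)

# Stratified folds keep the benign/malignant ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {scores.round(3)}")
print(f"Mean ROC AUC: {scores.mean():.3f}")
```

Swapping `RandomForestClassifier` for `sklearn.svm.SVC` inside a pipeline with feature scaling would cover the SVM option in the same framework.
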
Analysis completed: September 2025
Methods: Statistical Analysis, EDA, Hypothesis Testing
Tools: Python, Jupyter, Statistical Libraries