Skip to content

wadhwa19/Breast_Cancer_Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

🏥 Breast Cancer Dataset Analysis

A comprehensive statistical and exploratory data analysis of the Wisconsin Breast Cancer dataset for cancer diagnosis prediction.

📊 Project Overview

This project provides a complete analysis of breast cancer diagnostic data, using statistical methods and exploratory data analysis to identify key characteristics that distinguish malignant from benign tumors.

📋 Dataset Information

  • Source: Wisconsin Breast Cancer Dataset
  • Samples: 569 patients
  • Features: 30 numeric features + 1 target variable
  • Task: Binary classification (Malignant vs Benign)
  • Target Distribution: 357 Benign (62.7%), 212 Malignant (37.3%)

🔬 Analysis Highlights

Key Findings

  • No missing values - excellent data quality
  • Strong predictive features identified through correlation analysis
  • Statistically significant differences between malignant and benign groups
  • Feature clustering reveals natural groupings of measurements

Top Discriminative Features

  1. concave points_worst (correlation: 0.794)
  2. perimeter_worst (correlation: 0.783)
  3. concave points_mean (correlation: 0.777)
  4. radius_worst (correlation: 0.776)
  5. perimeter_mean (correlation: 0.743)

Statistical Results

  • Hypothesis Testing: Highly significant differences in tumor size (p < 0.001)
  • Effect Size: Large clinical effect (Cohen's d = 0.875)
  • Clinical Significance: Malignant tumors are 5.32 units larger in radius on average

📁 File Structure

├── breast_Cancer_analysis.ipynb    # Main analysis notebook
├── Breast_Cancer_dataset          # Dataset file
└── README.md                      # This file

🚀 Notebook Sections

  1. Data Summary and Overview - Dataset characteristics and quality assessment
  2. Data Exploration Plan - Structured approach to analysis
  3. Exploratory Data Analysis - Statistical analysis and visualizations
  4. Key Findings and Insights - Summary of important discoveries
  5. Hypotheses Development - Three research hypotheses formulated
  6. Statistical Significance Testing - Rigorous hypothesis testing
  7. Feature Relationship Analysis - Correlation and clustering analysis
  8. Conclusions and Next Steps - Summary and recommendations

🛠️ Technologies Used

  • Python 3.x
  • Pandas - Data manipulation and analysis
  • NumPy - Numerical computing
  • Matplotlib & Seaborn - Data visualization
  • SciPy - Statistical analysis and hypothesis testing
  • Jupyter Notebook - Interactive analysis environment

📈 Key Visualizations

  • Comprehensive correlation heatmaps
  • Statistical distribution comparisons
  • Feature relationship dendrograms
  • Hypothesis testing visualizations
  • Clinical significance plots

🎯 Results Summary

Statistical Validation

  • Significant size differences between tumor types
  • Large effect sizes indicating clinical relevance
  • Robust feature correlations for prediction
  • Clear feature clustering patterns identified

Clinical Implications

  • Larger tumors strongly associated with malignancy
  • Shape irregularities (concave points) are key indicators
  • Multiple complementary features provide diagnostic confidence
  • Statistical foundation supports clinical decision-making

🔄 Next Steps

Machine Learning Development

  • Implement classification models (Random Forest, SVM, Neural Networks)
  • Perform feature selection optimization
  • Cross-validation and hyperparameter tuning
  • Model performance evaluation

Advanced Analytics

  • External dataset validation
  • Ensemble method development
  • Clinical decision support tool creation
  • Cost-effectiveness analysis

Analysis completed: September 2025
Methods: Statistical Analysis, EDA, Hypothesis Testing
Tools: Python, Jupyter, Statistical Libraries

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published