This project aims to classify liver cancer types, specifically Cholangiocarcinoma (CCK) and Hepatocellular Carcinoma (CHC), using Principal Component Analysis (PCA) for dimensionality reduction and Logistic Regression and Random Forests for classification. The dataset includes 147 tumors across 146 patients, each imaged in four phases: PORT, ART, VEIN, and TARD.
Our ultimate goal is to enhance diagnostic accuracy and provide insights for medical professionals to better understand liver cancer types.
├── data_analysis.ipynb # Main analysis notebook
├── datasets/ # Data folder (not included in this repo)
├── README.md # Project README
└── images/ # Folder for visual outputs and figures
The dataset contains 147 tumors across 146 patients, categorized into different classes based on tumor type. We focus on a binary classification problem distinguishing CCK from CHC tumors. Each tumor is analyzed across four medical imaging phases:
- PORT
- ART
- VEIN
- TARD
Each tumor has 428 variables, which are reduced through dimensionality reduction techniques before classification.
Several machine learning models are used to analyze and classify the liver tumor data:
- Principal Component Analysis (PCA): Used to reduce the dimensionality of the dataset from 428 variables to 2 principal components (PC1 and PC2).
  - Variance explained:
    - PC1: 39.2%
    - PC2: 22.23%
    - Cumulative variance: 61.43%
- Logistic Regression: A simple and interpretable model for binary classification.
- Random Forest: A more complex, non-linear model to enhance classification performance and account for variable interactions.
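The PCA step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the feature matrix is random synthetic data with the same shape as the real table (147 tumors by 428 variables), and standardizing before PCA is an assumption, so the printed variance ratios will not match the figures reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature table (assumption):
# 147 tumors x 428 variables, drawn from a standard normal distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(147, 428))

# Standardize each variable so that no single one dominates the components.
# Whether the notebook scales the data this way is an assumption.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components (PC1 and PC2).
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component, and their sum.
explained = pca.explained_variance_ratio_
print("Reduced shape:", X_2d.shape)
print("Variance explained per component:", explained)
print("Cumulative variance explained:", explained.sum())
```

On the real data, `explained` would hold the PC1 and PC2 values and `explained.sum()` the cumulative variance.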
- Install Python (>= 3.7).
- Install necessary libraries by running:
pip install -r requirements.txt
To run the analysis, you can open the provided Jupyter notebook and run the cells sequentially:
jupyter notebook data_analysis.ipynb
Make sure the dataset is in the correct directory (datasets/).
- pandas
- numpy
- scikit-learn
- matplotlib
Below is a visualization of the data reduced to two dimensions using PCA. The distinction between the CCK and CHC tumors is clearly visible.
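A plot of this kind could be produced along the following lines. This is a hedged sketch on synthetic data from `make_classification`, not the project's own plotting code: the mapping of label 0 to CCK and 1 to CHC, the output file name `pca_scatter.png`, and the scaling step are all assumptions made for illustration.

```python
import os

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with the same shape as the real table (assumption).
X, y = make_classification(
    n_samples=147, n_features=428, n_informative=10, random_state=0
)

# Standardize, then project onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# One scatter series per class; the label-to-name mapping is hypothetical.
fig, ax = plt.subplots(figsize=(6, 5))
for label, name in [(0, "CCK"), (1, "CHC")]:
    mask = y == label
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=name, alpha=0.7)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()

# Save into the images/ folder; the file name is a placeholder.
os.makedirs("images", exist_ok=True)
out_path = os.path.join("images", "pca_scatter.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
```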
- Logistic Regression: The more interpretable model, but it may struggle with complex interactions between variables.
- Random Forest: A robust model that performs better when the dataset contains nonlinear relationships.
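One way to put the two models side by side is cross-validated accuracy on the PCA-reduced features. The sketch below assumes 5-fold cross-validation, a scaling step, and default-style hyperparameters, none of which are confirmed by the notebook, and it runs on synthetic data, so its scores say nothing about the real tumors.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with the same shape as the real table (assumption).
X, y = make_classification(
    n_samples=147, n_features=428, n_informative=10, random_state=0
)

# Hyperparameters here are illustrative choices, not the project's settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in models.items():
    # Scaling and PCA sit inside the pipeline so they are refit on each
    # training fold, which avoids leaking information from the held-out fold.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2), clf)
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}")
```

Keeping the preprocessing inside the pipeline matters with only 147 samples, since fitting PCA on the full dataset before splitting would make the held-out scores optimistic.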
- Mixed Tumor Types: Expand the model to handle mixed tumors (CHC + CCK).
- Larger Datasets: Train the models on a larger dataset to improve generalization.
This project is licensed under the MIT License - see the LICENSE file for details.