Liver Cancer Diagnosis Project

Overview

This project classifies two liver cancer types, Cholangiocarcinoma (CCK) and Hepatocellular Carcinoma (CHC), using Principal Component Analysis (PCA) for dimensionality reduction followed by Logistic Regression and Random Forest classifiers. The dataset includes 147 tumors from 146 patients, each analyzed across four imaging phases: PORT, ART, VEIN, and TARD.

Our ultimate goal is to enhance diagnostic accuracy and provide insights for medical professionals to better understand liver cancer types.

Project Structure

├── data_analysis.ipynb        # Main analysis notebook
├── datasets/                  # Data folder (not included in this repo)
├── README.md                  # Project README
└── images/                    # Folder for visual outputs and figures

Dataset

The dataset contains 147 tumors from 146 patients, labeled by tumor type. We focus on the binary classification problem of distinguishing CCK from CHC tumors. Each tumor is analyzed across four medical imaging phases:

- PORT
- ART
- VEIN
- TARD

Each tumor has 428 variables, which are reduced through dimensionality reduction techniques before classification.
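
The reduction step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the feature matrix is random synthetic data with the same shape as the dataset (147 tumors, 428 variables), and standardizing the variables before PCA is an assumption. Because the data is synthetic, the printed variance ratios will not match the ones reported for the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption):
# 147 tumors x 428 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(147, 428))

# Standardize each variable so no single scale dominates the components
# (assumed preprocessing step, not confirmed by the README).
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components (PC1 and PC2).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component, and their sum.
explained = pca.explained_variance_ratio_
print("reduced shape:", X_pca.shape)
print("variance explained per component:", explained)
print("cumulative variance explained:", explained.sum())
```

With the real data, `explained` is where the per-component and cumulative variance figures would come from.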

Models

Several machine learning models are used to analyze and classify the liver tumor data:

- Principal Component Analysis (PCA): Reduces the dimensionality of the dataset from 428 variables to 2 principal components (PC1 and PC2).
  - Variance explained by PC1: 39.2%
  - Variance explained by PC2: 22.23%
  - Cumulative variance explained: 61.43%
- Logistic Regression: A simple and interpretable model for binary classification.
- Random Forest: A more complex, non-linear model to enhance classification performance and account for variable interactions.
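
One way to combine these models is to put the PCA step and each classifier into a scikit-learn pipeline and compare them with cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_classification` in place of the real dataset, and the label encoding (0 = CCK, 1 = CHC), the 5-fold split, the scaling step, and all hyperparameters are assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 147 samples, 428 variables, two classes.
X, y = make_classification(
    n_samples=147, n_features=428, n_informative=10, random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling and PCA live inside each pipeline, so they are refit on the
# training folds only and never see the held-out fold.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000)
    ),
    "random_forest": make_pipeline(
        StandardScaler(),
        PCA(n_components=2),
        RandomForestClassifier(n_estimators=200, random_state=0),
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Fitting PCA inside the pipeline, rather than once on the full dataset, avoids leaking information from the validation folds into the reduction step. With only 147 tumors this can noticeably affect the estimated accuracy.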

How to Run

Prerequisites

Install Python (>= 3.7).
Install necessary libraries by running:

pip install -r requirements.txt

Running the Project

To run the analysis, you can open the provided Jupyter notebook and run the cells sequentially:

jupyter notebook data_analysis.ipynb

Make sure the dataset is in the datasets/ directory before running the notebook; the data is not included in this repo.

Key Dependencies

pandas
numpy
scikit-learn
matplotlib

Results

PCA Visualization

A visualization of the data reduced to two dimensions using PCA shows a clearly visible distinction between the CCK and CHC tumors.

Logistic Regression vs. Random Forest

Logistic Regression:
- A more interpretable model, but it may struggle with complex interactions in the data.
Random Forest:
- A robust model that performs better when the dataset contains nonlinear relationships.
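
The interpretability difference shows up in what each fitted model exposes. The sketch below is a hedged illustration on synthetic two-feature data standing in for the PC1 and PC2 scores, not the project's results. Logistic regression gives one signed coefficient per input, while a random forest gives only non-negative importances with no direction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in (assumption) for the PC1/PC2 scores of 147 tumors.
X, y = make_classification(
    n_samples=147, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

log_reg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Logistic regression: each coefficient is the change in the log-odds of
# class 1 per unit increase in that input, so its sign gives a direction.
print("coefficients (PC1, PC2):", log_reg.coef_[0])
print("intercept:", log_reg.intercept_[0])

# Random forest: impurity-based importances are non-negative and sum to 1;
# they rank the inputs but do not say which class an input favors.
print("feature importances (PC1, PC2):", forest.feature_importances_)
```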

Future Work

Mixed Tumor Types: Expand the model to handle mixed tumors (CHC + CCK).
Larger Datasets: Train the models on a larger dataset to improve generalization.

License

This project is licensed under the MIT License - see the LICENSE file for details.
