AI & ML Internship – Task 1: Understanding Dataset & Data Types
This repository contains my submission for Task 1 of the AI & ML Internship. The task focuses on understanding the dataset structure, identifying different data types, and analyzing whether the dataset is suitable for machine learning.
Task Objective
The main objective of this task is to:
Understand the dataset and its structure
Identify different types of data such as numerical, categorical, ordinal, and binary
Analyze data quality issues like missing values and imbalance
Identify the target variable and input features
Check if the dataset is suitable for machine learning
Datasets Used
For this task, I worked on:
Titanic Dataset
Students Performance Dataset
These datasets are commonly used for practicing machine learning concepts.
Tools Used
Python
Pandas
NumPy
Google Colab Notebook
Work Done
Loaded the dataset in Google Colab using Pandas.
Displayed the first and last few rows to understand the structure.
Used df.info() to check each column's data type and non-null count, which reveals missing values.
Used df.describe() to view statistical summaries of numerical columns.
Identified different feature types:
Numerical
Categorical
Ordinal
Binary
Checked unique values in categorical columns to understand data distribution.
Identified the target variable and input features.
Analyzed the dataset size and discussed its suitability for machine learning.
Wrote observations about missing values and possible data imbalance.
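The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is made-up sample data with Titanic-style column names, classify_feature is a hypothetical helper, and the choice of "Pclass" as ordinal and "Survived" as the target is an assumption for the example.

```python
import pandas as pd

# Made-up sample rows with Titanic-style column names (an assumption for
# illustration; the real notebook loads the full dataset instead).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "Embarked": ["S", "C", "S", "S", "Q", None],
})

# First and last rows, data types with non-null counts, and summary statistics.
print(df.head(3))
print(df.tail(3))
df.info()
print(df.describe())


def classify_feature(series, ordinal_columns=()):
    """Hypothetical helper: label a column as binary, ordinal, numerical, or categorical.

    Ordinal columns cannot be detected from the data alone, so they are
    passed in by name.
    """
    if series.nunique(dropna=True) == 2:
        return "binary"
    if series.name in ordinal_columns:
        return "ordinal"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    return "categorical"


# Assumption: passenger class is treated as the only ordinal column.
feature_types = {
    col: classify_feature(df[col], ordinal_columns=("Pclass",))
    for col in df.columns
}
print(feature_types)

# Unique values of a categorical column, ignoring missing entries.
print(df["Embarked"].dropna().unique())

# Assumption: "Survived" is the target; every other column is an input feature.
target = "Survived"
X = df.drop(columns=[target])
y = df[target]
print(X.shape, y.shape)
```

Passing the ordinal columns by name is a deliberate choice here: whether a set of values has a meaningful order depends on domain knowledge, not on the column's data type.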
Observations
The dataset contains a mix of numerical and categorical features.
Some columns have missing values that need preprocessing.
Certain categories appear imbalanced, which may affect model performance.
After cleaning and preprocessing, the dataset is suitable for machine learning tasks.
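Observations like these can be backed by a couple of quick checks. The sketch below assumes a Titanic-style target column named "Survived" and uses made-up sample values, so the numbers it produces say nothing about the real datasets.

```python
import pandas as pd

# Made-up sample values (an assumption for illustration only).
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 0, 0],
    "Age": [22.0, None, 26.0, 35.0, None, 54.0, 2.0, 27.0],
    "Sex": ["male", "male", "female", "male", "female", "female", "male", "male"],
})

# Missing values per column, as counts and as a share of all rows.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share)

# Class balance of the assumed target: the share of rows in each class.
class_share = df["Survived"].value_counts(normalize=True)
print(class_share)

# Ratio of the most common class to the least common one;
# a value near 1 means the classes are roughly balanced.
imbalance_ratio = class_share.max() / class_share.min()
print(imbalance_ratio)
```

How much missing data or imbalance is acceptable depends on the model and the task, so these numbers are a starting point for preprocessing decisions rather than a pass/fail test.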
Deliverables
Google Colab Notebook with complete data analysis
One-page dataset analysis report
Learning Outcome
Through this task, I learned:
How to explore and understand a dataset before modeling
The importance of identifying feature types
How to detect missing values and imbalance
Why data understanding is a critical step in machine learning
How to Use
Clone this repository from GitHub.
Open the notebook in Google Colab or Jupyter Notebook.
Run the cells to view the dataset analysis.
Author
Muskan Pandey, AI & ML Internship