GitHub - UBC-MDS/dataprofiler_group-30

Contributors

Dongchun Chen, Ismail (Husain) Bhinderwala, Jingyuan Wang

🚀 Meet `datpro`: Your Data’s Best Friend

If you’ve ever worked with raw data, you know the struggle—messy columns, missing values, outliers lurking where you least expect them. Before you can even start your analysis, you spend hours cleaning, summarizing, and trying to make sense of what’s in front of you.

That’s why we built datpro, a lightweight Python package for fast, straightforward data profiling. Whether you're trying to spot anomalies, summarize key statistics, or visualize your dataset, datpro does the heavy lifting so you can focus on what really matters: getting insights.

✨ Why Use `datpro`?

Imagine you're working on a new dataset. You want to quickly:
✅ Understand the structure and key statistics
✅ Find missing values, duplicates, and outliers
✅ Generate visualizations without writing long scripts

Instead of juggling multiple tools, datpro lets you do all of this with just a few lines of code. It’s lightweight, flexible, and fits right into your workflow.

🔍 What Can `datpro` Do?

summarize_data(): Summarizes the numeric columns of a DataFrame. It returns a summary DataFrame containing the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum of each numeric column.
detect_anomalies(): Detects anomalies in a dataset by identifying missing values, outliers, and duplicates. It calculates the percentage of missing data for each column, detects numerical outliers using the interquartile range (IQR) method, and identifies duplicate rows. The function also allows users to specify a particular anomaly type to focus on, making it flexible for targeted data quality checks. If no specific type is provided, all anomaly categories are analyzed by default. This helps in understanding and addressing potential data quality issues efficiently.
plotify(): A versatile function that simplifies DataFrame visualization by automatically generating appropriate plots based on the data types of your columns. It supports various plot types, including histograms and density plots for numeric data, bar charts for categorical data, scatter plots for pairwise numeric relationships, correlation heatmaps for exploring numeric variable relationships, and box plots for numeric vs. categorical comparisons. For pairwise categorical columns, it generates stacked bar charts. The function dynamically analyzes your DataFrame and provides insightful visualizations tailored to your data structure, making exploratory data analysis efficient and comprehensive.
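
To make the summary described for summarize_data() concrete, here is a minimal sketch of a five-number summary built with plain pandas. It is an illustration only, not datpro's implementation: the helper name `five_number_summary`, the row and column labels, and the example data are all assumptions made for this sketch.

```python
import pandas as pd


def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of datpro): min, Q1, median, Q3, max."""
    # Keep only numeric columns; text columns such as 'department' are ignored
    numeric = df.select_dtypes(include="number")
    # quantile() skips missing values; each row of the result is one quantile level
    summary = numeric.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    summary.index = ["min", "25%", "50%", "75%", "max"]
    # Transpose so that each numeric column becomes one row of the summary
    return summary.T


example = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "department": ["HR", "Finance", "HR", "IT"],
})
result = five_number_summary(example)
print(result)
```

The output layout of datpro's own summary may differ; check the package documentation for the exact shape.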

While tools like ydata-profiling provide auto-generated reports, datpro is designed to be modular—so you can use only what you need, when you need it.

📦 Installation

$ pip install datpro

Usage

Let’s say you're analyzing employee data and need a quick overview. Instead of manually checking each column, let datpro do the work:

import pandas as pd
import datpro as dp

# Example DataFrame
data = {
    'age': [25, 30, 35, None, 40, 30, 35, 100],
    'salary': [50000, 60000, 70000, 80000, 90000, None, 85000, 400000],
    'department': ['HR', 'Finance', 'HR', 'IT', 'Finance', 'IT', 'HR', 'Finance']
}
df = pd.DataFrame(data)

# Summarize numeric data
summary = dp.summarize_data(df)
print("Summary of numeric data:")
print(summary)

# Detect anomalies
anomalies = dp.detect_anomalies(df)
print("Anomaly detection report:")
print(anomalies)

# Visualize data
print("Generating visualizations...")
dp.plotify(df, plot_types=['histogram', 'box', 'correlation'])

And just like that, you get a clear, structured summary, an anomaly report, and meaningful visualizations without spending hours on manual data exploration.
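
If you are curious what the three checks described for detect_anomalies() look like, here is a rough sketch in plain pandas using the same employee data. It is not datpro's code: the helper `iqr_outlier_mask`, the variable names, and the conventional 1.5 × IQR fence multiplier are assumptions made for illustration, and datpro's report format may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, None, 40, 30, 35, 100],
    "salary": [50000, 60000, 70000, 80000, 90000, None, 85000, 400000],
    "department": ["HR", "Finance", "HR", "IT", "Finance", "IT", "HR", "Finance"],
})

# 1. Percentage of missing values in each column
missing_pct = df.isna().mean() * 100


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Missing values compare as False, so they are never flagged as outliers
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# 2. Number of IQR outliers in each numeric column (k = 1.5 is an assumption)
numeric = df.select_dtypes(include="number")
outlier_counts = {
    col: int(iqr_outlier_mask(numeric[col]).sum()) for col in numeric.columns
}

# 3. Number of rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())

print(missing_pct)
print(outlier_counts)
print(duplicate_rows)
```

On this data, the sketch flags the age of 100 and the salary of 400000 as outliers, reports 12.5% missing values for both age and salary, and finds no duplicate rows.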

Run the tests

Run the following command in a terminal to execute the tests:

$ pytest tests/

🤝 Want to Contribute?

We’d love your help in improving datpro! If you have ideas, bug fixes, or feature suggestions, check out our contribution guidelines.

By contributing, you agree to follow our Code of Conduct—we’re all about collaboration and respect.

📜 License

datpro was created by Dongchun Chen, Ismail (Husain) Bhinderwala, and Jingyuan Wang. It's open-source and licensed under MIT, so feel free to use and improve it!

Credits

datpro was created with cookiecutter and the py-pkgs-cookiecutter template.

Latest commit: 1.1.7 (0fb9502) by actions-user, Feb 3, 2025 · 93 commits

Name	Last commit message	Last commit date
.github/workflows	change token name	Jan 30, 2025
data	fix: Move generate_data.py to scripts/ folder	Feb 1, 2025
docs	Fix: Feedback addressed by returning the plots in dictionary and also…	Feb 2, 2025
scripts	fix: Move generate_data.py to scripts/ folder	Feb 1, 2025
src/datpro	fix: Feedback addressed by review from Milestone2, add example in doc…	Feb 3, 2025
tests	Fix: Added a test to check if the plots are being saved to the specif…	Feb 2, 2025
.gitignore	Initial commit	Jan 9, 2025
.readthedocs.yml	try another python version	Jan 25, 2025
CHANGELOG.md	1.1.7	Feb 3, 2025
CONDUCT.md	Initial commit	Jan 9, 2025
CONTRIBUTING.md	change package name, add dependencies	Jan 23, 2025
LICENSE	fix: correct author names in pyproject.toml and LICENSE	Feb 1, 2025
README.md	fix: Feedback addressed by review from Celine, add a section in READM…	Feb 3, 2025
poetry.lock	Added imports in init. Changed Readme as well	Jan 23, 2025
pyproject.toml	1.1.7	Feb 3, 2025
