This is a Python package that performs exploratory data analysis for users. It takes in a CSV or Excel file and generates three outputs: a text report containing a descriptive summary, a series of plots, and a cleaned CSV file.
- Python 3.8.x
- matplotlib==3.1.2
- numpy==1.18.1
- pandas==1.0.0
- PySimpleGUI==4.19.0
- scikit-learn==0.22.1
- scipy==1.4.1
- seaborn==0.10.0
- statsmodels==0.11.1
- more-itertools==8.3.0
- You can clone or download my package.
- Using terminal, move to the directory.
- Example for Mac OS users:
$ cd Downloads/Edator
- Install the required packages using:
$ pip install -r requirements.txt
- After that, change directory into the Script folder using:
$ cd Script
- Now, execute the main.py file by:
$ python main.py
- You should see the following:
- Choose the format of the file (csv or xls), the path to the file, and the paths to export the plots, the report and the cleaned csv file to.
- Done!
I handle NaN values column by column. When the percentage of NaN within a column is less than 5%, I remove the affected rows; this applies to both numerical and categorical columns. Otherwise, I replace the NaN values with the median for numerical columns and with the mode for categorical columns.
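The rule above can be sketched roughly as follows. This is a minimal illustration and not the package's actual code: the function name `handle_missing` and the `threshold` parameter are hypothetical, and it assumes the NaN percentages are computed once, before any rows are dropped.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows for columns with few NaN values, impute the others."""
    df = df.copy()
    # Fraction of NaN per column, measured on the original data (assumption).
    nan_fraction = df.isna().mean()

    # Columns with a NaN share below the threshold: drop the affected rows.
    drop_cols = [c for c in df.columns if 0 < nan_fraction[c] < threshold]
    if drop_cols:
        df = df.dropna(subset=drop_cols)

    # Remaining columns with NaN: median for numerical, mode for categorical.
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df.reset_index(drop=True)
```

Calling `handle_missing(df)` returns a new DataFrame and leaves the input untouched.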
Dealing with zeros is much harder as it is challenging to differentiate between a zero that is meaningful (has a purpose and should not be removed) and a zero that serves no purpose and can potentially add more noise to the dataset. Hence, I decided to inform the user about the percentage of zeros in the dataset.
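As a rough sketch of how such a figure could be reported, the snippet below computes the share of zeros per numerical column. Reporting per column (rather than one figure for the whole dataset) is an assumption, and `zero_percentage` is a hypothetical name.

```python
import pandas as pd


def zero_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of zero entries in each numerical column."""
    numeric = df.select_dtypes(include="number")
    # Comparing to 0 gives booleans; their mean is the fraction of zeros.
    return (numeric == 0).mean() * 100
```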
I use Z-score to detect outliers. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
In most cases, a threshold of 3 or -3 is used to filter out outliers: a data point with a Z-score above 3 or below -3 is treated as an outlier. I have used this approach for all of my analysis.
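A minimal sketch of this filter is shown below. The function name `zscore_outliers` is hypothetical, and using the population standard deviation (`ddof=0`) is an assumption; the package may compute the Z-score differently.

```python
import pandas as pd


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |Z-score| > threshold."""
    # Z-score: distance from the mean in units of standard deviation.
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold
```

With a mask like this, `values[~zscore_outliers(values)]` would keep only the points within three standard deviations of the mean.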
For correlation, I included:
- Pearson and Spearman correlation for numerical-numerical variables.
- One Way ANOVA for numerical-categorical variables
- Chi-Square test for categorical-categorical variables
Using itertools.combinations, I identify every possible combination among numerical-numerical variables, numerical-categorical variables and categorical-categorical variables. I then apply the correlation test that matches each pair, based on the criteria set out above.
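The pairing logic might look something like the sketch below. It is an illustration under assumptions, not the package's code: `pairwise_tests` is a hypothetical name, the data is assumed to be free of NaN values, and mixed numerical-categorical pairs are built with `itertools.product` because they come from two different lists of columns.

```python
from itertools import combinations, product

import pandas as pd
from scipy import stats


def pairwise_tests(df: pd.DataFrame) -> list:
    """Return (test, var_a, var_b, statistic, p_value) for each column pair."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    results = []

    # Numerical-numerical: Pearson and Spearman correlation.
    for x, y in combinations(numeric, 2):
        r, p = stats.pearsonr(df[x], df[y])
        results.append(("pearson", x, y, float(r), float(p)))
        rho, p = stats.spearmanr(df[x], df[y])
        results.append(("spearman", x, y, float(rho), float(p)))

    # Numerical-categorical: one-way ANOVA across the category groups.
    for num, cat in product(numeric, categorical):
        groups = [group[num].to_numpy() for _, group in df.groupby(cat)]
        f_stat, p = stats.f_oneway(*groups)
        results.append(("anova", num, cat, float(f_stat), float(p)))

    # Categorical-categorical: chi-square test on the contingency table.
    for a, b in combinations(categorical, 2):
        table = pd.crosstab(df[a], df[b])
        chi2, p, _dof, _expected = stats.chi2_contingency(table)
        results.append(("chi2", a, b, float(chi2), float(p)))

    return results
```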
For plots, I created:
- Scatterplot for numerical variables
- Countplot for categorical variables
- Boxplot for numerical-categorical variables
Similar to correlation, I used itertools.combinations to create every possible plot. I have also added the hue feature to each scatterplot, but only when the categorical variable has fewer than 5 unique values. For example, if hue = "fruits", there should be at most 4 types of fruits.
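One way to sketch the scatterplot selection is to build the list of (x, y, hue) combinations first and plot them afterwards. The function name `scatter_specs` and the `max_levels` parameter are hypothetical, and including a plain plot without hue (`None`) for every pair is an assumption.

```python
from itertools import combinations

import pandas as pd


def scatter_specs(df: pd.DataFrame, max_levels: int = 5) -> list:
    """List (x, y, hue) tuples for every pair of numerical columns."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    # Only categorical columns with fewer than max_levels unique values
    # are eligible to be used as hue.
    hues = [c for c in categorical if df[c].nunique() < max_levels]

    specs = []
    for x, y in combinations(numeric, 2):
        specs.append((x, y, None))  # plain scatterplot (assumption)
        for hue in hues:
            specs.append((x, y, hue))
    return specs
```

Each tuple could then be passed to seaborn, for example `seaborn.scatterplot(data=df, x=x, y=y, hue=hue)`.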
- Accept more input file formats beyond CSV and Excel.
- Based on user feedback, increase the variety of plots beyond scatterplots, barplots and boxplots.
- Generate the report in HTML format.

