Data Validation Component #2

Open
vaasu2002 opened this issue Jan 13, 2023 · 0 comments
vaasu2002 commented Jan 13, 2023

Data validation is an important step in the training pipeline because it helps ensure that the
data is accurate and suitable for training. In the data validation step of our training pipeline,
we check whether all the categorical and numerical columns are present in both the training and
testing data. If they are, we continue the training process; otherwise we raise an exception and
stop the pipeline, because the features needed to build a robust model are missing from the
data.
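
As a rough illustration of the column check described above, here is a minimal sketch. The function names, the `REQUIRED` schema, and the choice of `ValueError` are all hypothetical; the issue does not say how the pipeline stores its schema or which exception it raises.

```python
from typing import Iterable, List


def missing_columns(present: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `present`."""
    present_set = set(present)
    return [name for name in required if name not in present_set]


def validate_columns(present: Iterable[str], required: Iterable[str],
                     dataset_name: str = "dataset") -> None:
    """Raise ValueError if any required column is missing, stopping the pipeline."""
    missing = missing_columns(present, required)
    if missing:
        raise ValueError(f"{dataset_name} is missing required columns: {missing}")


# Hypothetical schema: the categorical and numerical columns the model needs.
REQUIRED = ["age", "income", "city"]

# Passes silently: every required column is present (extra columns are allowed).
validate_columns(["age", "income", "city", "target"], REQUIRED, "train")

# Fails: "income" is absent, so the check raises and training would stop here.
try:
    validate_columns(["age", "city"], REQUIRED, "test")
except ValueError as err:
    print(err)
```

With pandas, the `present` argument could simply be `df.columns` for the training and testing DataFrames.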
Another data validation technique we will use is checking for data drift. Data drift is the
change in the statistical properties of data over time. It can occur for a variety of reasons,
such as changes in the underlying system or environment being measured, changes in the data
collection process, or changes in the data itself.
As an example of data drift, consider a machine learning model trained to predict the demand
for a particular product from historical sales data. If the model is trained on data from the
first half of the year and then deployed to make predictions for the second half of the year,
the data may have drifted due to changes in the market or in consumer behavior. As a result,
the model may no longer predict demand for the product accurately, and its performance
degrades.
In the training stage, we will check for data drift between our training and testing data. We
will use the two-sample Kolmogorov-Smirnov test, which is built into the stats module of the
SciPy library. It compares the empirical cumulative distribution functions of the two samples
(datasets) to determine whether they come from the same distribution or from different
distributions.
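
The drift check could look roughly like the sketch below, which applies `scipy.stats.ks_2samp` to one numerical column at a time. The `detect_drift` helper, the 0.05 threshold, and the synthetic data are assumptions made for illustration; the issue does not specify a significance level or how results are reported.

```python
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(train_values, test_values, threshold=0.05):
    """Run the two-sample KS test on one numerical column.

    The null hypothesis is that both samples come from the same distribution.
    A p-value below `threshold` (an assumed cutoff) is flagged as drift.
    """
    result = ks_2samp(train_values, test_values)
    p_value = float(result.pvalue)
    return {
        "statistic": float(result.statistic),
        "p_value": p_value,
        "drift_detected": p_value < threshold,
    }


# Synthetic data standing in for one column of the training and testing sets.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=500)
shifted = rng.normal(loc=3.0, scale=1.0, size=500)  # mean moved by 3 std devs

print(detect_drift(train, train))    # identical samples: no drift flagged
print(detect_drift(train, shifted))  # strongly shifted sample: drift flagged
```

In a pipeline, the same call would be repeated for each numerical column shared by the training and testing data, and the per-column results collected into a report.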

@vaasu2002 vaasu2002 self-assigned this Jan 13, 2023