Data validation is an important step in the training pipeline because it helps ensure that the
data is accurate and suitable for training. In the data validation step of our training pipeline,
we check whether all the categorical and numerical columns are present in both the training and
testing data. If they are, we continue the training process; otherwise, we raise an exception and
stop the pipeline, because the features needed to build a robust model are missing from the
data.
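
The column check described above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function names `find_missing_columns` and `validate_columns`, the example column names, and the choice of `ValueError` as the exception type are all assumptions.

```python
from typing import Iterable, List


def find_missing_columns(
    present_columns: Iterable[str], required_columns: Iterable[str]
) -> List[str]:
    """Return the required columns that are absent, in their original order."""
    present = set(present_columns)
    return [col for col in required_columns if col not in present]


def validate_columns(
    present_columns: Iterable[str],
    required_columns: Iterable[str],
    dataset_name: str,
) -> None:
    """Raise an exception (stopping the pipeline) if any required column is missing."""
    missing = find_missing_columns(present_columns, required_columns)
    if missing:
        raise ValueError(
            f"{dataset_name} data is missing required columns: {missing}"
        )


if __name__ == "__main__":
    # Hypothetical schema: these column names are placeholders.
    required = ["age", "income", "city"]
    train_columns = ["age", "income", "city", "target"]
    test_columns = ["age", "income", "city", "target"]

    validate_columns(train_columns, required, "training")
    validate_columns(test_columns, required, "testing")
    print("Column validation passed")
```

With pandas, the same check can be run by passing `train_df.columns` and `test_df.columns` as `present_columns`, since the function only needs an iterable of column names.
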
Another data validation technique we will use is checking for data drift. Data drift is a
change in the statistical properties of data over time. It can occur for a variety of reasons,
such as changes in the underlying system or environment being measured, changes in the data
collection process, or changes in the data itself.
As an example of data drift, consider a machine learning model that is trained to predict the
demand for a particular product based on historical sales data. If the model is trained on data
from the first half of the year and then deployed to make predictions for the second half of the
year, the data may have drifted due to changes in the market or in consumer behavior. As a
result, the model may no longer be able to predict the demand for the product accurately, and
its performance degrades.
In the training stage, we will check for data drift between our training and testing data. Using
the two-sample Kolmogorov-Smirnov test, which is available as a built-in function in the stats
module of the SciPy library, we will determine whether the two samples (datasets) come from
the same distribution or from different distributions.
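
A minimal sketch of this drift check, using `scipy.stats.ks_2samp`, might look like the following. The helper name `drift_report`, the significance level `alpha=0.05`, and the restriction to numerical columns are assumptions made for illustration; the pipeline's actual threshold and column handling may differ.

```python
from typing import Dict, Mapping, Sequence

from scipy.stats import ks_2samp


def drift_report(
    train: Mapping[str, Sequence[float]],
    test: Mapping[str, Sequence[float]],
    numerical_columns: Sequence[str],
    alpha: float = 0.05,  # assumed significance level
) -> Dict[str, Dict[str, object]]:
    """Run a two-sample KS test per column and flag columns that appear to drift.

    The null hypothesis is that both samples come from the same distribution.
    A p-value below `alpha` rejects it, which is reported here as drift.
    """
    report: Dict[str, Dict[str, object]] = {}
    for col in numerical_columns:
        result = ks_2samp(train[col], test[col])
        report[col] = {
            "statistic": float(result.statistic),
            "p_value": float(result.pvalue),
            "drift_detected": bool(result.pvalue < alpha),
        }
    return report


if __name__ == "__main__":
    # Hypothetical data: "stable" is identical in both sets, "shifted" is not.
    train = {
        "stable": [float(i) for i in range(100)],
        "shifted": [float(i) for i in range(100)],
    }
    test = {
        "stable": [float(i) for i in range(100)],
        "shifted": [float(i) + 1000.0 for i in range(100)],
    }
    for column, stats in drift_report(train, test, ["stable", "shifted"]).items():
        print(column, stats)
```

Because the function only indexes `train[col]` and `test[col]`, pandas DataFrames can be passed in directly. Note that the KS test compares numerical samples, so categorical columns would need a different check.
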