Odd size revamp #247
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #247 +/- ##
==========================================
+ Coverage 95.93% 96.10% +0.16%
==========================================
Files 16 16
Lines 985 1001 +16
Branches 194 195 +1
==========================================
+ Hits 945 962 +17
+ Misses 20 19 -1
Partials 20 20
LGTM.
Just curious if you think a size distribution that is multi-modal (e.g. mostly 8Mb and 12Mb images, but a handful of, say, 16Mb images) should work well with this IQR rule?
@@ -153,17 +150,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The main way to interface with your data is via the `Imagelab` class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image) as well as additional information about the dataset. It has three main attributes:\n",
+"The main way to interface with your data is via the [Imagelab](https://cleanvision.readthedocs.io/en/latest/cleanvision/imagelab.html#cleanvision.imagelab.Imagelab) class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image) as well as additional information about the dataset. It has three main attributes:\n",
Should this link be "stable" or are these the same?
But both should work.
Tutorial installs the latest version (main branch), so I think this link is good.
It wouldn't unless you change the |
Previous logic:
Any image that was x times larger or smaller than the median size in the dataset was marked as oddly sized, where x was hardcoded to 10.
New logic:
Now we compute q1 = 25th percentile and q3 = 75th percentile of the image sizes in the dataset, with IQR = q3 - q1.
An image is marked as oddly sized if
size > q1 + 3 * IQR
or
size < q3 - 3 * IQR.
For scoring the issues, the distance of each image's size from the midpoint is computed, where
midpoint = (q1 + q3) / 2
and normalized such that the images within the range have a score of 0.5 or less and outlier images have a score > 0.5.
There is one edge case, where q1 = median = q3. In this case the images with size = median are assigned a score of 1.0, the rest of the scores are scaled accordingly, and the threshold is 1.0. This threshold can be supplied as a hyperparameter.
Since cleanvision shows low scores for problematic images, these scores are flipped and clipped between 0 and 1.
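The rule and scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `odd_size_scores` and the `iqr_factor` parameter are hypothetical, the fences follow the description literally (`q1 + 3 * IQR` and `q3 - 3 * IQR`), and dividing by the midpoint in the `q1 = q3` edge case is an assumption about what "scaled accordingly" means. A conventional Tukey-style fence would use `q3 + 3 * IQR` and `q1 - 3 * IQR` instead, which only changes the two lines marked below.

```python
import numpy as np


def odd_size_scores(sizes, iqr_factor=3.0):
    """Hypothetical sketch of the IQR-based odd-size rule described above.

    Returns (scores, is_issue, threshold). Low scores mean problematic images.
    """
    sizes = np.asarray(sizes, dtype=float)
    q1, q3 = np.percentile(sizes, [25, 75])
    iqr = q3 - q1

    # Fences exactly as written in the PR description (swap here for Tukey fences).
    upper = q1 + iqr_factor * iqr
    lower = q3 - iqr_factor * iqr

    midpoint = (q1 + q3) / 2
    distance = np.abs(sizes - midpoint)
    width = upper - lower

    if width > 0:
        # An image sitting exactly on a fence is half a width from the midpoint,
        # so in-range images get a normalized distance <= 0.5.
        norm = distance / width
        threshold = 0.5
    else:
        # Edge case q1 == median == q3. Scaling by the midpoint is an assumption.
        norm = distance / midpoint if midpoint > 0 else np.zeros_like(distance)
        threshold = 1.0

    # Flip so that low scores indicate problems, and clip to [0, 1].
    scores = 1.0 - np.clip(norm, 0.0, 1.0)
    return scores, scores < threshold, threshold


if __name__ == "__main__":
    scores, is_issue, threshold = odd_size_scores(
        [90, 100, 100, 110, 100, 95, 105, 1000]
    )
    print(threshold, is_issue)
```

In the edge case every image whose size differs from the median ends up with a score below 1.0, so a threshold of 1.0 flags all of them.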
pd.Series.describe() instead of values for the whole dataset. Here's an example of how it would look
size instead of original size
def mark_issue(self, scores: pd.DataFrame, threshold: float, issue_type: str):
to
def mark_issue(self, scores: pd.DataFrame, issue_type: str, threshold: Optional[float] = None)
This makes it more visible that the threshold arg is optional.
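As a rough illustration of what the optional argument allows, here is a standalone sketch rather than the method from this PR. The module-level `DEFAULT_THRESHOLD` and the column naming (`<issue_type>_score`, `is_<issue_type>_issue`) are assumptions made for the example.

```python
from typing import Optional

import pandas as pd

# Hypothetical fallback used when the caller does not pass a threshold.
DEFAULT_THRESHOLD = 0.5


def mark_issue(
    scores: pd.DataFrame, issue_type: str, threshold: Optional[float] = None
) -> pd.DataFrame:
    """Flag rows whose score falls below the threshold (low score = problem)."""
    if threshold is None:
        threshold = DEFAULT_THRESHOLD
    flags = pd.DataFrame(index=scores.index)
    # Column names are assumed for illustration only.
    flags[f"is_{issue_type}_issue"] = scores[f"{issue_type}_score"] < threshold
    return flags


if __name__ == "__main__":
    df = pd.DataFrame({"odd_size_score": [0.9, 0.2]})
    print(mark_issue(df, "odd_size"))
    print(mark_issue(df, "odd_size", threshold=0.95))
```

With the keyword default, callers that rely on the built-in threshold can omit it, while the `q1 = q3` edge case can pass 1.0 explicitly.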
Results on cleanvision test/demo dataset