-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(api): dataset fields statistics #1360
base: master
Are you sure you want to change the base?
Conversation
type: array | ||
items: | ||
type: string | ||
description: 'Keys of the fields for which the statistics are provided.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the list of all fields from dataset.fields
in DB or really the list of fields for which we have statistics? If the latter one then the question is if we need to be returning this redundant information as
response.data.fields === Object.keys(response.data.statistics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the list of all fields defined in dataset.schema.fields
.
I was thinking the same but because we store it separately and in docs there is this sentence: When you configure the dataset fields schema, we generate a field list and measure the following statistics
, I though it has some reason.
But from implementation it looks like there will be always all fields in statistics
, which means
response.data.fields === Object.keys(response.data.statistics)
If some field defined in dataset schema is never in dataset itself, it will have emptyCount=number_of_items
.
And if some field is in dataset but isn't defined in dataset schema, it won't be in statistics.
So I assume you are right and it's redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In such case, I'd remove it; we can always add new properties later, but we can't ever remove them as that would break existing integrations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already updated it, but now I wonder, whether it would be better like this having the fields right under data
object. E.g. { "data": { "name": { "emptyCount": 100 } } }
Or keep the data.statistics
object having the place to add other things under data
in the future 🤔 such as
{ "data": { "statistics": { "name": { "emptyCount": 100 } } } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@netmilk and @fnesveda , what do you think?
/v2/datasets/{datasetId}/field-statistics
{
"data":{
"statistics":{
"someValue":{
"emptyCount":100
},
"anotherValue":{
"min":100,
"max":200,
"emptyCount":0
}
}
}
}
vs simply
{
"data":{
"someValue":{
"emptyCount":100
},
"anotherValue":{
"min":100,
"max":200,
"emptyCount":0
}
}
}
The former one is extensible and the latter one is simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
/v2/datasets/{datasetId}/statistics
{
"data":{
"fieldStatistics":{
"someValue":{
"emptyCount":100
},
"anotherValue":{
"min":100,
"max":200,
"emptyCount":0
}
}
}
}
That way, if we want to add more statistics about datasets later on, we can do it in the same endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the ticket https://github.com/apify/apify-core/issues/18807 was note about this:
Basically we want the API to return the output of the data from the dataset statistics collection.
And the endpoint could potentially be /<datasetId>/stats or /<datasetId>/validation-statistics if it's the first one then we might want to add also the normal dataset statistics there, so that might be confusing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking about what is better. We could split it but then we will have a few more endpoints in the docs. Or we can go with /stats
and have this properties. Considering we don't plan to add much anytime soon, I'd go with a single endpoint for stats for simplicity.
What would you prefer @netmilk ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer @fnesveda's proposal: The uri .../statistics
+ fieldStatistics
objects under fieldStatistics
property, especially if you foresee additional types of statistics in the future returned in the response.
Naming the key just statistics
doesn't provide any additional semantic meaning, it would now introduce just an additional nesting and it would lead to an overload of the term in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- name: token | ||
in: query | ||
description: | | ||
API authentication token. It is required only when using the `username~dataset-name` format for `datasetId`. | ||
style: form | ||
explode: true | ||
schema: | ||
type: string | ||
example: soSkq9ekdmfOslopH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtrunkat @netmilk - @drobnikj suggest in this PR to require the authentication always in this endpoint. Is it right?
This is questionable as we plan to remove anonymous access for storage. We can do it for new endpoints already.
I would make this isAuthenticationRequired: true, and restrict this endpoint.
- name: token | |
in: query | |
description: | | |
API authentication token. It is required only when using the `username~dataset-name` format for `datasetId`. | |
style: form | |
explode: true | |
schema: | |
type: string | |
example: soSkq9ekdmfOslopH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Let's make it authenticated = less work in the next months. As we will turn all endpoints this way.
Solves https://github.com/apify/apify-core/issues/18807 - proposal for new API endpoint
/v2/datasets/{datasetId}/field-statistics
which should return dataset field statistics