Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): dataset fields statistics #1360

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

MFori
Copy link
Contributor

@MFori MFori commented Dec 18, 2024

Solves https://github.com/apify/apify-core/issues/18807 - proposal for new API endpoint /v2/datasets/{datasetId}/field-statistics which should return dataset field statistics

@MFori MFori added the t-console Issues with this label are in the ownership of the console team. label Dec 18, 2024
@MFori MFori requested review from netmilk and mtrunkat December 18, 2024 10:20
@MFori MFori self-assigned this Dec 18, 2024
type: array
items:
type: string
description: 'Keys of the fields for which the statistics are provided.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the list of all fields from dataset.fields in DB or really the list of fields for which we have statistics? If the latter one then the question is if we need to be returning this redundant information as

response.data.fields === Object.keys(response.data.statistics)

Copy link
Contributor Author

@MFori MFori Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the list of all fields defined in dataset.schema.fields.

I was thinking the same but because we store it separately and in docs there is this sentence: When you configure the dataset fields schema, we generate a field list and measure the following statistics, I though it has some reason.

But from implementation it looks like there will be always all fields in statistics, which means

response.data.fields === Object.keys(response.data.statistics)

If some field defined in dataset schema is never in dataset itself, it will have emptyCount=number_of_items.
And if some field is in dataset but isn't defined in dataset schema, it won't be in statistics.

So I assume you are right and it's redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such case, I'd remove it; we can always add new properties later, but we can't ever remove them as that would break existing integrations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already updated it, but now I wonder, whether it would be better like this having the fields right under data object. E.g. { "data": { "name": { "emptyCount": 100 } } }

Or keep the data.statistics object having the place to add other things under data in the future 🤔 such as
{ "data": { "statistics": { "name": { "emptyCount": 100 } } } }

Copy link
Member

@mtrunkat mtrunkat Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@netmilk and @fnesveda , what do you think?

/v2/datasets/{datasetId}/field-statistics
{
   "data":{
      "statistics":{
         "someValue":{
            "emptyCount":100
         },
         "anotherValue":{
            "min":100,
            "max":200,
            "emptyCount":0
         }
      }
   }
}

vs simply

{
   "data":{
      "someValue":{
         "emptyCount":100
      },
      "anotherValue":{
         "min":100,
         "max":200,
         "emptyCount":0
      }
   }
}

The former one is extensible and the latter one is simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

/v2/datasets/{datasetId}/statistics
{
   "data":{
      "fieldStatistics":{
         "someValue":{
            "emptyCount":100
         },
         "anotherValue":{
            "min":100,
            "max":200,
            "emptyCount":0
         }
      }
   }
}

That way, if we want to add more statistics about datasets later on, we can do it in the same endpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ticket https://github.com/apify/apify-core/issues/18807 was note about this:

Basically we want the API to return the output of the data from the dataset statistics collection.
And the endpoint could potentially be /<datasetId>/stats or /<datasetId>/validation-statistics if it's the first one then we might want to add also the normal dataset statistics there, so that might be confusing...

Copy link
Member

@mtrunkat mtrunkat Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about what is better. We could split it but then we will have a few more endpoints in the docs. Or we can go with /stats and have this properties. Considering we don't plan to add much anytime soon, I'd go with a single endpoint for stats for simplicity.

What would you prefer @netmilk ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer @fnesveda's proposal: The uri .../statistics + fieldStatistics objects under fieldStatistics property, especially if you foresee additional types of statistics in the future returned in the response.

Naming the key just statistics doesn't provide any additional semantic meaning, it would now introduce just an additional nesting and it would lead to an overload of the term in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I used @fnesveda's approach, please take a look @mtrunkat @netmilk

@MFori MFori requested a review from mtrunkat December 18, 2024 17:19
Comment on lines 19 to 27
- name: token
in: query
description: |
API authentication token. It is required only when using the `username~dataset-name` format for `datasetId`.
style: form
explode: true
schema:
type: string
example: soSkq9ekdmfOslopH
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtrunkat @netmilk - @drobnikj suggest in this PR to require the authentication always in this endpoint. Is it right?

This is questionable as we plan to remove anonymous access for storage. We can do it for new endpoints already.
I would make this isAuthenticationRequired: true, and restrict this endpoint.

Suggested change
- name: token
in: query
description: |
API authentication token. It is required only when using the `username~dataset-name` format for `datasetId`.
style: form
explode: true
schema:
type: string
example: soSkq9ekdmfOslopH

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Let's make it authenticated = less work in the next months. As we will turn all endpoints this way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-console Issues with this label are in the ownership of the console team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants