feat(api): dataset fields statistics #1360

Open
wants to merge 9 commits into
base: master
Changes from 2 commits
@@ -0,0 +1,19 @@
title: DatasetFieldStatistics
type: object
properties:
  min:
    type: number
    description: 'Minimum value of the field. For numbers, this is calculated directly. For strings, this is the length of the shortest string. For arrays, this is the length of the shortest array. For objects, this is the number of keys in the smallest object.'
    nullable: true
  max:
    type: number
    description: 'Maximum value of the field. For numbers, this is calculated directly. For strings, this is the length of the longest string. For arrays, this is the length of the longest array. For objects, this is the number of keys in the largest object.'
    nullable: true
  nullCount:
    type: number
    description: 'How many items in the dataset have a null value for this field.'
    nullable: true
  emptyCount:
    type: number
    description: 'How many items in the dataset are `undefined`, meaning that for example empty string is not considered empty.'
    nullable: true
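To make the `min`/`max`/`nullCount`/`emptyCount` descriptions above concrete, here is a minimal sketch of how such statistics could be computed for one field. It only follows the wording of the schema descriptions; it is not the platform's actual implementation, and the names `FieldStatistics`, `measure` and `computeFieldStatistics` are hypothetical.

```typescript
// Hypothetical shape mirroring the DatasetFieldStatistics schema.
interface FieldStatistics {
  min: number | null;
  max: number | null;
  nullCount: number;
  emptyCount: number;
}

type Item = Record<string, unknown>;

// Maps a value to the number that min/max are computed over, following the
// schema descriptions: numbers as-is, strings and arrays by length,
// objects by number of keys.
function measure(value: unknown): number | null {
  if (typeof value === "number") return value;
  if (typeof value === "string") return value.length;
  if (Array.isArray(value)) return value.length;
  if (typeof value === "object" && value !== null) return Object.keys(value).length;
  return null; // other types (e.g. booleans) do not contribute to min/max here
}

function computeFieldStatistics(items: Item[], field: string): FieldStatistics {
  const stats: FieldStatistics = { min: null, max: null, nullCount: 0, emptyCount: 0 };
  for (const item of items) {
    const value = item[field];
    if (value === undefined) {
      stats.emptyCount += 1; // field missing: counted as empty
      continue;
    }
    if (value === null) {
      stats.nullCount += 1;
      continue;
    }
    const size = measure(value);
    if (size === null) continue;
    stats.min = stats.min === null ? size : Math.min(stats.min, size);
    stats.max = stats.max === null ? size : Math.max(stats.max, size);
  }
  return stats;
}
```

Note that in this sketch an empty string contributes a length of 0 to `min` rather than incrementing `emptyCount`, which matches the `emptyCount` description.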
@@ -0,0 +1,21 @@
title: GetDatasetFieldStatisticsResponse
required:
  - data
type: object
properties:
  data:
    type: object
    required:
      - fields
      - statistics
    properties:
      fields:
        type: array
        items:
          type: string
        description: 'Keys of the fields for which the statistics are provided.'
Member

Is this the list of all fields from dataset.fields in the DB, or really the list of fields for which we have statistics? If the latter, the question is whether we need to return this redundant information, given that

response.data.fields === Object.keys(response.data.statistics)

Contributor Author

@MFori MFori Dec 18, 2024

It's the list of all fields defined in dataset.schema.fields.

I was thinking the same, but because we store it separately and the docs contain this sentence: When you configure the dataset fields schema, we generate a field list and measure the following statistics, I thought there was some reason for it.

But from the implementation it looks like all fields will always be in statistics, which means

response.data.fields === Object.keys(response.data.statistics)

If a field defined in the dataset schema never appears in the dataset itself, it will have emptyCount=number_of_items.
And if a field is in the dataset but isn't defined in the dataset schema, it won't be in statistics.

So I assume you are right and it's redundant.
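The `fields === Object.keys(statistics)` shorthand used in this thread is informal: in JavaScript, `===` on two arrays compares references, not contents. A small sketch of how the claimed redundancy could actually be verified, assuming the response shape from the diff (`fields` plus `statistics`); the names `FieldStatisticsData` and `fieldsMatchStatistics` are hypothetical.

```typescript
// Hypothetical type for the `data` object as defined in this revision of the PR.
interface FieldStatisticsData {
  fields: string[];
  statistics: Record<string, unknown>;
}

// Returns true when `fields` and the keys of `statistics` contain exactly
// the same names, regardless of order.
function fieldsMatchStatistics(data: FieldStatisticsData): boolean {
  const fieldSet = new Set(data.fields);
  const keySet = new Set(Object.keys(data.statistics));
  if (fieldSet.size !== keySet.size) return false;
  return Array.from(fieldSet).every((name) => keySet.has(name));
}
```

If this check holds for every response, `fields` carries no information beyond the keys of `statistics`.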

Member

In that case, I'd remove it; we can always add new properties later, but we can never remove them, as that would break existing integrations.

Contributor Author

I already updated it, but now I wonder whether it would be better to have the fields directly under the data object, e.g. { "data": { "name": { "emptyCount": 100 } } }

Or keep the data.statistics object, so there is a place to add other things under data in the future 🤔, such as
{ "data": { "statistics": { "name": { "emptyCount": 100 } } } }

Member

@mtrunkat mtrunkat Dec 19, 2024

@netmilk and @fnesveda , what do you think?

/v2/datasets/{datasetId}/field-statistics
{
   "data":{
      "statistics":{
         "someValue":{
            "emptyCount":100
         },
         "anotherValue":{
            "min":100,
            "max":200,
            "emptyCount":0
         }
      }
   }
}

vs simply

{
   "data":{
      "someValue":{
         "emptyCount":100
      },
      "anotherValue":{
         "min":100,
         "max":200,
         "emptyCount":0
      }
   }
}

The former one is extensible and the latter one is simpler.

Member

How about this?

/v2/datasets/{datasetId}/statistics
{
   "data":{
      "fieldStatistics":{
         "someValue":{
            "emptyCount":100
         },
         "anotherValue":{
            "min":100,
            "max":200,
            "emptyCount":0
         }
      }
   }
}

That way, if we want to add more statistics about datasets later on, we can do it in the same endpoint.
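The extensibility argument can be illustrated with a short sketch. It compares how a client written today would read field names from the nested shape and from the flat shape once a new top-level statistic is added. The `itemCount` key is a hypothetical future addition used only for illustration, and the helper names are assumptions, not part of the PR.

```typescript
type FieldStats = { min?: number; max?: number; nullCount?: number; emptyCount?: number };

// Nested shape: per-field objects live under a dedicated `fieldStatistics` key.
function fieldNamesNested(data: { fieldStatistics: Record<string, FieldStats> }): string[] {
  return Object.keys(data.fieldStatistics);
}

// Flat shape: every key directly under `data` is assumed to be a field name.
function fieldNamesFlat(data: Record<string, FieldStats>): string[] {
  return Object.keys(data);
}

// `itemCount` is a hypothetical later addition to the response.
const nestedData = JSON.parse(
  '{"fieldStatistics":{"someValue":{"emptyCount":100}},"itemCount":250}',
);
const flatData = JSON.parse('{"someValue":{"emptyCount":100},"itemCount":250}');

// The nested client still sees only real fields.
const nestedNames = fieldNamesNested(nestedData);
// The flat client now mistakes the new statistic for a dataset field.
const flatNames = fieldNamesFlat(flatData);
```

With the nested shape, existing clients keep working when siblings of `fieldStatistics` appear; with the flat shape, any new key is indistinguishable from a dataset field of the same name.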

Contributor Author

The ticket https://github.com/apify/apify-core/issues/18807 had a note about this:

Basically we want the API to return the output of the data from the dataset statistics collection.
And the endpoint could potentially be /<datasetId>/stats or /<datasetId>/validation-statistics if it's the first one then we might want to add also the normal dataset statistics there, so that might be confusing...

Member

@mtrunkat mtrunkat Dec 20, 2024

I am thinking about which is better. We could split it, but then we would have a few more endpoints in the docs. Or we can go with /stats and have these properties there. Considering we don't plan to add much anytime soon, I'd go with a single endpoint for stats, for simplicity.

What would you prefer @netmilk ?

Contributor

I prefer @fnesveda's proposal: the URI .../statistics, with the per-field objects under a fieldStatistics property, especially if you foresee additional types of statistics being returned in the response in the future.

Naming the key just statistics doesn't add any semantic meaning; right now it would only introduce an extra level of nesting, and in the future it would lead to an overloaded term.

Contributor Author

OK, I used @fnesveda's approach. Please take a look, @mtrunkat @netmilk.

      statistics:
        type: object
        additionalProperties:
          $ref: ./DatasetFieldStatistics.yaml
        description: 'Statistics for each field. The keys are the same as in the `fields` array.'
4 changes: 4 additions & 0 deletions apify-api/openapi/components/tags.yaml
@@ -758,6 +758,10 @@
x-legacy-doc-urls:
- '#/reference/datasets/item-collection'
x-trait: 'true'
- name: Datasets/Field statistics
x-displayName: Field statistics
x-parent-tag-name: Datasets
x-trait: 'true'
- name: Request queues
x-displayName: Request queues
x-legacy-doc-urls:
1 change: 1 addition & 0 deletions apify-api/openapi/components/x-tag-groups.yaml
@@ -62,6 +62,7 @@
- Datasets/Dataset collection
- Datasets/Dataset
- Datasets/Item collection
- Datasets/Field statistics
- name: Request queues
tags:
- Request queues
2 changes: 2 additions & 0 deletions apify-api/openapi/openapi.yaml
@@ -566,6 +566,8 @@ paths:
$ref: 'paths/datasets/datasets@{datasetId}.yaml'
'/v2/datasets/{datasetId}/items':
$ref: 'paths/datasets/datasets@{datasetId}@items.yaml'
'/v2/datasets/{datasetId}/field-statistics':
$ref: 'paths/datasets/datasets@{datasetId}@field-statistics.yaml'
/v2/request-queues:
$ref: paths/request-queues/request-queues.yaml
'/v2/request-queues/{queueId}':
@@ -0,0 +1,55 @@
get:
  tags:
    - Datasets/Field statistics
  summary: Get field statistics
  description: |
    Returns field statistics for given dataset.
    When you configure the dataset [fields schema](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation), we generate a field list and measure the statistics such as `min`, `max`, `nullCount` and `emptyCount`.

    See dataset fields schema [documentation](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation#dataset-field-statistics) for more information.

  operationId: dataset_field_statistics_get
  parameters:
    - name: datasetId
      in: path
      description: Dataset ID or `username~dataset-name`.
      required: true
      style: simple
      schema:
        type: string
        example: WkzbQMuFYuamGv3YF
    - name: token
      in: query
      description: |
        API authentication token. It is required only when using the `username~dataset-name` format for `datasetId`.
      style: form
      explode: true
      schema:
        type: string
        example: soSkq9ekdmfOslopH
  responses:
    '200':
      description: ''
      content:
        application/json:
          schema:
            $ref: "../../components/schemas/datasets/GetDatasetFieldStatisticsResponse.yaml"
          example:
            data:
              fields: ["name", "price"]
              statistics: {
                name: {
                  nullCount: 122
                },
                price: {
                  min: 59,
                  max: 89
                }
              }
  # TODO: add clients methods
  # x-js-parent: DatasetClient
  # x-js-name: fieldStatistics
  # x-js-doc-url: https://docs.apify.com/api/client/js/reference/class/DatasetClient#fieldStatistics
  # x-py-parent: DatasetClientAsync
  # x-py-name: field_statistics
  # x-py-doc-url: https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync#field_statistics
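As a rough illustration of how a caller might address this operation, the sketch below builds the request URL from the `datasetId` path parameter and the optional `token` query parameter. The base URL `https://api.apify.com` and the function name `buildFieldStatisticsUrl` are assumptions, and the path is the one from this revision of the diff; the review thread suggests it may change to `/statistics`.

```typescript
// Assumed API base URL; not stated in this diff.
const API_BASE_URL = "https://api.apify.com";

// Builds the URL for the field statistics endpoint as defined in this revision.
// `datasetId` may be a dataset ID or `username~dataset-name`.
function buildFieldStatisticsUrl(
  datasetId: string,
  token?: string,
  baseUrl: string = API_BASE_URL,
): string {
  const path = `/v2/datasets/${encodeURIComponent(datasetId)}/field-statistics`;
  const url = new URL(path, baseUrl);
  if (token !== undefined) {
    // Per the parameter description, the token is only required for the
    // `username~dataset-name` form of `datasetId`.
    url.searchParams.set("token", token);
  }
  return url.toString();
}
```

The resulting string can be passed to any HTTP client; the response body is expected to follow GetDatasetFieldStatisticsResponse.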
2 changes: 1 addition & 1 deletion package-lock.json