Release: 2.14.5 (#6219)
albertvillanova committed Oct 24, 2023
1 parent 22d750f commit 1a598a0
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion setup.py
@@ -246,7 +246,7 @@

 setup(
     name="datasets",
-    version="2.14.4",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="2.14.5",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description="HuggingFace community-driven open-source library of datasets",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -17,7 +17,7 @@
 # pylint: enable=line-too-long
 # pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "2.14.4"
+__version__ = "2.14.5"

 from .arrow_dataset import Dataset
 from .arrow_reader import ReadInstruction
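
The comment in setup.py says the version must look like x.y.z.dev0, x.y.z.rc1, or x.y.z, using dots rather than dashes. A minimal sketch of a check for that format follows. The regular expression and the function name `is_expected_version` are illustrative assumptions, not part of the `datasets` release tooling.

```python
import re

# Hypothetical pattern for the format described in the setup.py comment:
# x.y.z, optionally followed by ".devN" or ".rcN" (dots only, no dashes).
VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.dev\d+|\.rc\d+)?")


def is_expected_version(version: str) -> bool:
    """Return True if `version` matches x.y.z, x.y.z.devN, or x.y.z.rcN."""
    return VERSION_RE.fullmatch(version) is not None


if __name__ == "__main__":
    for candidate in ["2.14.5", "2.14.5.dev0", "2.14.5.rc1", "2.14.5-rc1"]:
        print(candidate, is_expected_version(candidate))
```

A check like this could run before tagging a release so that the strings in setup.py and `__init__.py` are validated the same way.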

1 comment on commit 1a598a0

@github-actions

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006568 / 0.011353 (-0.004785) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003920 / 0.011008 (-0.007088) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.084644 / 0.038508 (0.046136) |
| read_batch_unformated after write_array2d | 0.075319 / 0.023109 (0.052210) |
| read_batch_unformated after write_flattened_sequence | 0.314028 / 0.275898 (0.038129) |
| read_batch_unformated after write_nested_sequence | 0.350930 / 0.323480 (0.027450) |
| read_col_formatted_as_numpy after write_array2d | 0.005274 / 0.007986 (-0.002712) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003296 / 0.004328 (-0.001032) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.064692 / 0.004250 (0.060441) |
| read_col_unformated after write_array2d | 0.053484 / 0.037052 (0.016432) |
| read_col_unformated after write_flattened_sequence | 0.314474 / 0.258489 (0.055985) |
| read_col_unformated after write_nested_sequence | 0.373728 / 0.293841 (0.079887) |
| read_formatted_as_numpy after write_array2d | 0.030785 / 0.128546 (-0.097762) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008482 / 0.075646 (-0.067164) |
| read_formatted_as_numpy after write_nested_sequence | 0.287171 / 0.419271 (-0.132100) |
| read_unformated after write_array2d | 0.051585 / 0.043533 (0.008052) |
| read_unformated after write_flattened_sequence | 0.319586 / 0.255139 (0.064447) |
| read_unformated after write_nested_sequence | 0.333481 / 0.283200 (0.050281) |
| write_array2d | 0.024103 / 0.141683 (-0.117580) |
| write_flattened_sequence | 1.477348 / 1.452155 (0.025193) |
| write_nested_sequence | 1.564192 / 1.492716 (0.071476) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.234426 / 0.018006 (0.216420) |
| get_batch_of_1024_rows | 0.447022 / 0.000490 (0.446532) |
| get_first_row | 0.007785 / 0.000200 (0.007585) |
| get_last_row | 0.000223 / 0.000054 (0.000169) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.028411 / 0.037411 (-0.009001) |
| shard | 0.081448 / 0.014526 (0.066922) |
| shuffle | 0.643406 / 0.176557 (0.466850) |
| sort | 0.150882 / 0.737135 (-0.586254) |
| train_test_split | 0.099957 / 0.296338 (-0.196382) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.380065 / 0.215209 (0.164856) |
| read 50000 | 3.786338 / 2.077655 (1.708683) |
| read_batch 50000 10 | 1.845895 / 1.504120 (0.341775) |
| read_batch 50000 100 | 1.651506 / 1.541195 (0.110312) |
| read_batch 50000 1000 | 1.718432 / 1.468490 (0.249942) |
| read_formatted numpy 5000 | 0.481514 / 4.584777 (-4.103263) |
| read_formatted pandas 5000 | 3.518689 / 3.745712 (-0.227023) |
| read_formatted tensorflow 5000 | 3.239472 / 5.269862 (-2.030390) |
| read_formatted torch 5000 | 2.019484 / 4.565676 (-2.546193) |
| read_formatted_batch numpy 5000 10 | 0.056624 / 0.424275 (-0.367651) |
| read_formatted_batch numpy 5000 1000 | 0.007201 / 0.007607 (-0.000406) |
| shuffled read 5000 | 0.453250 / 0.226044 (0.227205) |
| shuffled read 50000 | 4.533900 / 2.268929 (2.264972) |
| shuffled read_batch 50000 10 | 2.357417 / 55.444624 (-53.087208) |
| shuffled read_batch 50000 100 | 1.959401 / 6.876477 (-4.917076) |
| shuffled read_batch 50000 1000 | 2.182984 / 2.142072 (0.040911) |
| shuffled read_formatted numpy 5000 | 0.578748 / 4.805227 (-4.226479) |
| shuffled read_formatted_batch numpy 5000 10 | 0.134022 / 6.500664 (-6.366642) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.060176 / 0.075469 (-0.015293) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.285423 / 1.841788 (-0.556365) |
| map fast-tokenizer batched | 19.009917 / 8.074308 (10.935609) |
| map identity | 14.269455 / 10.191392 (4.078063) |
| map identity batched | 0.172044 / 0.680424 (-0.508380) |
| map no-op batched | 0.018799 / 0.534201 (-0.515402) |
| map no-op batched numpy | 0.392381 / 0.579283 (-0.186902) |
| map no-op batched pandas | 0.415742 / 0.434364 (-0.018622) |
| map no-op batched pytorch | 0.453084 / 0.540337 (-0.087254) |
| map no-op batched tensorflow | 0.626881 / 1.386936 (-0.760055) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007071 / 0.011353 (-0.004282) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004059 / 0.011008 (-0.006949) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.064864 / 0.038508 (0.026356) |
| read_batch_unformated after write_array2d | 0.082207 / 0.023109 (0.059097) |
| read_batch_unformated after write_flattened_sequence | 0.401518 / 0.275898 (0.125620) |
| read_batch_unformated after write_nested_sequence | 0.440300 / 0.323480 (0.116820) |
| read_col_formatted_as_numpy after write_array2d | 0.005537 / 0.007986 (-0.002449) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003387 / 0.004328 (-0.000941) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.064721 / 0.004250 (0.060470) |
| read_col_unformated after write_array2d | 0.059282 / 0.037052 (0.022230) |
| read_col_unformated after write_flattened_sequence | 0.422479 / 0.258489 (0.163990) |
| read_col_unformated after write_nested_sequence | 0.441441 / 0.293841 (0.147600) |
| read_formatted_as_numpy after write_array2d | 0.032588 / 0.128546 (-0.095958) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008476 / 0.075646 (-0.067170) |
| read_formatted_as_numpy after write_nested_sequence | 0.071122 / 0.419271 (-0.348150) |
| read_unformated after write_array2d | 0.048456 / 0.043533 (0.004923) |
| read_unformated after write_flattened_sequence | 0.397275 / 0.255139 (0.142136) |
| read_unformated after write_nested_sequence | 0.419922 / 0.283200 (0.136722) |
| write_array2d | 0.025119 / 0.141683 (-0.116563) |
| write_flattened_sequence | 1.494473 / 1.452155 (0.042318) |
| write_nested_sequence | 1.562734 / 1.492716 (0.070018) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.223006 / 0.018006 (0.205000) |
| get_batch_of_1024_rows | 0.441940 / 0.000490 (0.441450) |
| get_first_row | 0.004728 / 0.000200 (0.004528) |
| get_last_row | 0.000100 / 0.000054 (0.000045) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.032899 / 0.037411 (-0.004512) |
| shard | 0.097374 / 0.014526 (0.082849) |
| shuffle | 0.106328 / 0.176557 (-0.070228) |
| sort | 0.159354 / 0.737135 (-0.577781) |
| train_test_split | 0.107043 / 0.296338 (-0.189296) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.432240 / 0.215209 (0.217031) |
| read 50000 | 4.319790 / 2.077655 (2.242136) |
| read_batch 50000 10 | 2.296526 / 1.504120 (0.792406) |
| read_batch 50000 100 | 2.141768 / 1.541195 (0.600573) |
| read_batch 50000 1000 | 2.228748 / 1.468490 (0.760258) |
| read_formatted numpy 5000 | 0.492544 / 4.584777 (-4.092232) |
| read_formatted pandas 5000 | 3.651541 / 3.745712 (-0.094171) |
| read_formatted tensorflow 5000 | 3.288034 / 5.269862 (-1.981828) |
| read_formatted torch 5000 | 2.057921 / 4.565676 (-2.507755) |
| read_formatted_batch numpy 5000 10 | 0.057919 / 0.424275 (-0.366356) |
| read_formatted_batch numpy 5000 1000 | 0.007347 / 0.007607 (-0.000260) |
| shuffled read 5000 | 0.509605 / 0.226044 (0.283561) |
| shuffled read 50000 | 5.092092 / 2.268929 (2.823164) |
| shuffled read_batch 50000 10 | 2.750518 / 55.444624 (-52.694106) |
| shuffled read_batch 50000 100 | 2.421300 / 6.876477 (-4.455177) |
| shuffled read_batch 50000 1000 | 2.646553 / 2.142072 (0.504481) |
| shuffled read_formatted numpy 5000 | 0.598149 / 4.805227 (-4.207078) |
| shuffled read_formatted_batch numpy 5000 10 | 0.131703 / 6.500664 (-6.368961) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.059763 / 0.075469 (-0.015706) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.340406 / 1.841788 (-0.501382) |
| map fast-tokenizer batched | 19.888576 / 8.074308 (11.814268) |
| map identity | 15.084079 / 10.191392 (4.892687) |
| map identity batched | 0.164248 / 0.680424 (-0.516176) |
| map no-op batched | 0.020125 / 0.534201 (-0.514076) |
| map no-op batched numpy | 0.394682 / 0.579283 (-0.184601) |
| map no-op batched pandas | 0.429332 / 0.434364 (-0.005032) |
| map no-op batched pytorch | 0.471680 / 0.540337 (-0.068658) |
| map no-op batched tensorflow | 0.665083 / 1.386936 (-0.721853) |
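
Each benchmark value above is reported as `new / old (diff)`, where the number in parentheses is `new - old`. A minimal sketch of parsing one such value and recomputing the difference follows. The names `Cell`, `parse_cell`, and `diff_is_consistent` are hypothetical helpers, not part of the benchmark tooling, and the tolerance is an assumption to allow for rounding in the printed figures.

```python
import re
from typing import NamedTuple


class Cell(NamedTuple):
    new: float
    old: float
    diff: float


# Matches text such as "0.006568 / 0.011353 (-0.004785)".
CELL_RE = re.compile(r"\s*([\d.]+)\s*/\s*([\d.]+)\s*\((-?[\d.]+)\)\s*")


def parse_cell(text: str) -> Cell:
    """Split one 'new / old (diff)' value into its three numbers."""
    match = CELL_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a 'new / old (diff)' value: {text!r}")
    new, old, diff = (float(group) for group in match.groups())
    return Cell(new, old, diff)


def diff_is_consistent(cell: Cell, tol: float = 2e-6) -> bool:
    """Check that the reported diff equals new - old within a rounding tolerance."""
    return abs((cell.new - cell.old) - cell.diff) <= tol


if __name__ == "__main__":
    cell = parse_cell("0.006568 / 0.011353 (-0.004785)")
    print(cell, diff_is_consistent(cell))
```

The tolerance matters because the printed values are rounded to six decimals, so a reported diff can differ from the recomputed one in the last digit, as in `0.064692 / 0.004250 (0.060441)`.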