Release: 2.14.6 (#6342)
release: 2.14.6
lhoestq authored and albertvillanova committed Oct 24, 2023
1 parent c52d90e commit 06c3ffb
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion setup.py
@@ -246,7 +246,7 @@

 setup(
     name="datasets",
-    version="2.14.5", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="2.14.6", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description="HuggingFace community-driven open-source library of datasets",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
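
The inline comment on the `version` line names three accepted shapes: `x.y.z`, `x.y.z.dev0`, and `x.y.z.rc1`, with dots and no dashes. A minimal sketch of a check for those shapes follows. The regular expression and the name `is_expected_version` are illustrative assumptions and are not part of the repository; the sketch also assumes any number may follow `dev` or `rc`, which the comment does not state.

```python
import re

# Hypothetical pattern for the three shapes named in the setup.py comment:
# x.y.z, x.y.z.devN, x.y.z.rcN (dot-separated, no dashes).
# Assumption: any non-negative integer N is allowed after "dev" or "rc".
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+(?:\.(?:dev|rc)\d+)?")


def is_expected_version(version: str) -> bool:
    """Return True if `version` has one of the three dot-separated shapes."""
    return _VERSION_RE.fullmatch(version) is not None
```

Under these assumptions, `is_expected_version("2.14.6")` accepts the version set by this commit, while a dashed form such as `"2.14.6-rc1"` is rejected.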
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -17,7 +17,7 @@
 # pylint: enable=line-too-long
 # pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "2.14.5"
+__version__ = "2.14.6"

 from .arrow_dataset import Dataset
 from .arrow_reader import ReadInstruction
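
This commit changes the version string in two files, `setup.py` and `src/datasets/__init__.py`, and both must carry the same value. The sketch below shows one way to confirm that from the file contents. The function name `extract_versions` and both regular expressions are assumptions made for illustration; nothing in the commit indicates the project runs such a check.

```python
import re


def extract_versions(setup_py_text: str, init_py_text: str) -> tuple[str, str]:
    """Pull the version strings out of the text of the two files.

    Hypothetical helper: looks for `version="..."` in setup.py and
    `__version__ = "..."` in __init__.py, each at the start of a line.
    """
    setup_match = re.search(r'^\s*version\s*=\s*"([^"]+)"', setup_py_text, re.MULTILINE)
    init_match = re.search(r'^__version__\s*=\s*"([^"]+)"', init_py_text, re.MULTILINE)
    if setup_match is None or init_match is None:
        raise ValueError("version string not found in one of the inputs")
    return setup_match.group(1), init_match.group(1)


# Usage with text shaped like the two hunks in this commit.
setup_text = 'setup(\n    name="datasets",\n    version="2.14.6",\n)\n'
init_text = '__version__ = "2.14.6"\n'
setup_version, init_version = extract_versions(setup_text, init_text)
versions_agree = setup_version == init_version
```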

2 comments on commit 06c3ffb

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.006689 / 0.011353 (-0.004664) | 0.003949 / 0.011008 (-0.007059) | 0.083652 / 0.038508 (0.045144) | 0.071456 / 0.023109 (0.048347) | 0.314100 / 0.275898 (0.038202) | 0.361947 / 0.323480 (0.038467) | 0.004125 / 0.007986 (-0.003861) | 0.003985 / 0.004328 (-0.000343) | 0.064429 / 0.004250 (0.060179) | 0.055380 / 0.037052 (0.018328) | 0.326543 / 0.258489 (0.068054) | 0.376491 / 0.293841 (0.082651) | 0.031820 / 0.128546 (-0.096726) | 0.008582 / 0.075646 (-0.067065) | 0.289175 / 0.419271 (-0.130097) | 0.053342 / 0.043533 (0.009810) | 0.321044 / 0.255139 (0.065905) | 0.345363 / 0.283200 (0.062164) | 0.025257 / 0.141683 (-0.116426) | 1.507694 / 1.452155 (0.055539) | 1.586166 / 1.492716 (0.093449) |

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.228326 / 0.018006 (0.210319) | 0.449039 / 0.000490 (0.448549) | 0.006674 / 0.000200 (0.006474) | 0.000234 / 0.000054 (0.000179) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.028163 / 0.037411 (-0.009249) | 0.082184 / 0.014526 (0.067659) | 0.746603 / 0.176557 (0.570047) | 0.174312 / 0.737135 (-0.562823) | 0.100046 / 0.296338 (-0.196293) |

Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.398620 / 0.215209 (0.183411) | 3.958655 / 2.077655 (1.881000) | 1.921031 / 1.504120 (0.416911) | 1.739017 / 1.541195 (0.197822) | 1.790264 / 1.468490 (0.321774) | 0.482202 / 4.584777 (-4.102575) | 3.588951 / 3.745712 (-0.156761) | 3.216253 / 5.269862 (-2.053608) | 1.992300 / 4.565676 (-2.573376) | 0.056850 / 0.424275 (-0.367425) | 0.007562 / 0.007607 (-0.000045) | 0.473505 / 0.226044 (0.247460) | 4.733126 / 2.268929 (2.464197) | 2.521351 / 55.444624 (-52.923273) | 2.172975 / 6.876477 (-4.703502) | 2.343656 / 2.142072 (0.201584) | 0.578087 / 4.805227 (-4.227140) | 0.131902 / 6.500664 (-6.368762) | 0.060247 / 0.075469 (-0.015222) |

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.262018 / 1.841788 (-0.579770) | 18.492898 / 8.074308 (10.418590) | 14.155975 / 10.191392 (3.964583) | 0.170271 / 0.680424 (-0.510153) | 0.018527 / 0.534201 (-0.515674) | 0.391644 / 0.579283 (-0.187639) | 0.413278 / 0.434364 (-0.021085) | 0.459711 / 0.540337 (-0.080627) | 0.621301 / 1.386936 (-0.765635) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.006656 / 0.011353 (-0.004697) | 0.003978 / 0.011008 (-0.007030) | 0.065569 / 0.038508 (0.027061) | 0.070538 / 0.023109 (0.047429) | 0.406595 / 0.275898 (0.130697) | 0.436739 / 0.323480 (0.113259) | 0.005455 / 0.007986 (-0.002531) | 0.003365 / 0.004328 (-0.000963) | 0.064474 / 0.004250 (0.060224) | 0.055481 / 0.037052 (0.018428) | 0.404323 / 0.258489 (0.145834) | 0.440957 / 0.293841 (0.147116) | 0.032692 / 0.128546 (-0.095854) | 0.008646 / 0.075646 (-0.067001) | 0.071625 / 0.419271 (-0.347647) | 0.048439 / 0.043533 (0.004906) | 0.403190 / 0.255139 (0.148051) | 0.431252 / 0.283200 (0.148052) | 0.021968 / 0.141683 (-0.119715) | 1.541450 / 1.452155 (0.089295) | 1.571178 / 1.492716 (0.078462) |

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.220633 / 0.018006 (0.202627) | 0.438881 / 0.000490 (0.438391) | 0.003910 / 0.000200 (0.003711) | 0.000092 / 0.000054 (0.000038) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.031487 / 0.037411 (-0.005924) | 0.092015 / 0.014526 (0.077489) | 0.105273 / 0.176557 (-0.071284) | 0.155985 / 0.737135 (-0.581150) | 0.104896 / 0.296338 (-0.191443) |

Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.426794 / 0.215209 (0.211585) | 4.258034 / 2.077655 (2.180380) | 2.298295 / 1.504120 (0.794175) | 2.150244 / 1.541195 (0.609050) | 2.220197 / 1.468490 (0.751707) | 0.482585 / 4.584777 (-4.102192) | 3.616610 / 3.745712 (-0.129102) | 3.231079 / 5.269862 (-2.038783) | 1.991405 / 4.565676 (-2.574272) | 0.057172 / 0.424275 (-0.367103) | 0.007204 / 0.007607 (-0.000404) | 0.500565 / 0.226044 (0.274520) | 4.997270 / 2.268929 (2.728342) | 2.791970 / 55.444624 (-52.652655) | 2.457247 / 6.876477 (-4.419230) | 2.628052 / 2.142072 (0.485979) | 0.582815 / 4.805227 (-4.222412) | 0.131794 / 6.500664 (-6.368870) | 0.059641 / 0.075469 (-0.015828) |

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.354988 / 1.841788 (-0.486800) | 19.131794 / 8.074308 (11.057486) | 14.838394 / 10.191392 (4.647002) | 0.165598 / 0.680424 (-0.514826) | 0.020074 / 0.534201 (-0.514127) | 0.396229 / 0.579283 (-0.183054) | 0.430108 / 0.434364 (-0.004256) | 0.491266 / 0.540337 (-0.049072) | 0.646840 / 1.386936 (-0.740096) |

@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.006685 / 0.011353 (-0.004668) | 0.003997 / 0.011008 (-0.007011) | 0.087833 / 0.038508 (0.049325) | 0.072768 / 0.023109 (0.049659) | 0.307823 / 0.275898 (0.031925) | 0.348631 / 0.323480 (0.025151) | 0.004117 / 0.007986 (-0.003869) | 0.004992 / 0.004328 (0.000664) | 0.067820 / 0.004250 (0.063570) | 0.055451 / 0.037052 (0.018398) | 0.318957 / 0.258489 (0.060467) | 0.357282 / 0.293841 (0.063441) | 0.031306 / 0.128546 (-0.097240) | 0.008702 / 0.075646 (-0.066944) | 0.292940 / 0.419271 (-0.126331) | 0.053219 / 0.043533 (0.009686) | 0.306812 / 0.255139 (0.051673) | 0.329096 / 0.283200 (0.045896) | 0.026333 / 0.141683 (-0.115350) | 1.482455 / 1.452155 (0.030300) | 1.553290 / 1.492716 (0.060573) |

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.223221 / 0.018006 (0.205215) | 0.453463 / 0.000490 (0.452973) | 0.002992 / 0.000200 (0.002792) | 0.000084 / 0.000054 (0.000030) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.028764 / 0.037411 (-0.008648) | 0.084449 / 0.014526 (0.069923) | 0.097121 / 0.176557 (-0.079436) | 0.152706 / 0.737135 (-0.584430) | 0.098731 / 0.296338 (-0.197608) |

Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.396142 / 0.215209 (0.180933) | 3.942547 / 2.077655 (1.864893) | 1.929574 / 1.504120 (0.425454) | 1.750275 / 1.541195 (0.209080) | 1.761705 / 1.468490 (0.293215) | 0.477921 / 4.584777 (-4.106856) | 3.633022 / 3.745712 (-0.112690) | 3.294242 / 5.269862 (-1.975620) | 2.039106 / 4.565676 (-2.526571) | 0.056459 / 0.424275 (-0.367816) | 0.007736 / 0.007607 (0.000129) | 0.471868 / 0.226044 (0.245823) | 4.729055 / 2.268929 (2.460127) | 2.447061 / 55.444624 (-52.997564) | 2.062401 / 6.876477 (-4.814076) | 2.294270 / 2.142072 (0.152197) | 0.575593 / 4.805227 (-4.229634) | 0.131962 / 6.500664 (-6.368702) | 0.059765 / 0.075469 (-0.015704) |

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.301281 / 1.841788 (-0.540506) | 18.836185 / 8.074308 (10.761877) | 14.267568 / 10.191392 (4.076176) | 0.176001 / 0.680424 (-0.504423) | 0.018925 / 0.534201 (-0.515276) | 0.400253 / 0.579283 (-0.179030) | 0.419626 / 0.434364 (-0.014738) | 0.468363 / 0.540337 (-0.071974) | 0.647242 / 1.386936 (-0.739694) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.006895 / 0.011353 (-0.004458) | 0.004129 / 0.011008 (-0.006879) | 0.065237 / 0.038508 (0.026729) | 0.075298 / 0.023109 (0.052189) | 0.408780 / 0.275898 (0.132882) | 0.440833 / 0.323480 (0.117354) | 0.006111 / 0.007986 (-0.001875) | 0.003316 / 0.004328 (-0.001013) | 0.065015 / 0.004250 (0.060764) | 0.056089 / 0.037052 (0.019036) | 0.413388 / 0.258489 (0.154899) | 0.454509 / 0.293841 (0.160668) | 0.033201 / 0.128546 (-0.095346) | 0.008578 / 0.075646 (-0.067068) | 0.071473 / 0.419271 (-0.347798) | 0.048799 / 0.043533 (0.005267) | 0.411923 / 0.255139 (0.156784) | 0.420577 / 0.283200 (0.137378) | 0.022451 / 0.141683 (-0.119232) | 1.504651 / 1.452155 (0.052496) | 1.562836 / 1.492716 (0.070120) |

Benchmark: benchmark_getitem_100B.json

| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.212011 / 0.018006 (0.194005) | 0.451499 / 0.000490 (0.451010) | 0.005048 / 0.000200 (0.004848) | 0.000099 / 0.000054 (0.000044) |

Benchmark: benchmark_indices_mapping.json

| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.032051 / 0.037411 (-0.005361) | 0.093425 / 0.014526 (0.078899) | 0.105768 / 0.176557 (-0.070789) | 0.159522 / 0.737135 (-0.577613) | 0.107111 / 0.296338 (-0.189228) |

Benchmark: benchmark_iterating.json

| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.429075 / 0.215209 (0.213866) | 4.281620 / 2.077655 (2.203965) | 2.302248 / 1.504120 (0.798128) | 2.156373 / 1.541195 (0.615178) | 2.292953 / 1.468490 (0.824463) | 0.491095 / 4.584777 (-4.093682) | 3.682012 / 3.745712 (-0.063700) | 3.355180 / 5.269862 (-1.914681) | 2.088757 / 4.565676 (-2.476920) | 0.057594 / 0.424275 (-0.366681) | 0.007383 / 0.007607 (-0.000224) | 0.508920 / 0.226044 (0.282875) | 5.083140 / 2.268929 (2.814212) | 2.830892 / 55.444624 (-52.613732) | 2.538064 / 6.876477 (-4.338413) | 2.773471 / 2.142072 (0.631399) | 0.591664 / 4.805227 (-4.213564) | 0.134470 / 6.500664 (-6.366194) | 0.061795 / 0.075469 (-0.013674) |

Benchmark: benchmark_map_filter.json

| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.367235 / 1.841788 (-0.474552) | 20.143622 / 8.074308 (12.069314) | 14.809485 / 10.191392 (4.618093) | 0.197121 / 0.680424 (-0.483303) | 0.020486 / 0.534201 (-0.513715) | 0.406520 / 0.579283 (-0.172763) | 0.436532 / 0.434364 (0.002168) | 0.477516 / 0.540337 (-0.062822) | 0.649380 / 1.386936 (-0.737556) |
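
Every benchmark cell above has the form `new / old (diff)`. The reported values are consistent with `diff = new - old`, up to rounding in the sixth decimal place: for example, `0.228326 / 0.018006` is reported with a diff of `0.210319` rather than `0.210320`. A minimal sketch for parsing a cell and checking that relation follows. The names `parse_cell` and `diff_is_consistent` and the tolerance value are assumptions chosen for illustration and do not come from the benchmark tooling.

```python
import re

# Matches one cell such as "0.006689 / 0.011353 (-0.004664)".
_CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> tuple[float, float, float]:
    """Split a 'new / old (diff)' cell into three floats."""
    match = _CELL_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(part) for part in match.groups())
    return new, old, diff


def diff_is_consistent(cell: str, tol: float = 2e-6) -> bool:
    """Check that the reported diff equals new - old within a rounding tolerance.

    The default tolerance is an assumption that allows for values printed
    with six decimal places.
    """
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol
```

A positive diff means the new measurement is larger than the old one for that metric.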