Release: 1.6.0
lhoestq committed Apr 20, 2021
1 parent dc95ade commit 40bb9e6
Showing 3 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -25,7 +25,7 @@
# The short X.Y version
version = ""
# The full version, including alpha/beta/rc tags
-release = "1.5.0"
+release = "1.6.0"


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion setup.py
@@ -204,7 +204,7 @@

setup(
name="datasets",
-version="1.5.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+version="1.6.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
description=DOCLINES[0],
long_description="\n".join(DOCLINES[2:]),
author="HuggingFace Inc.",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -18,7 +18,7 @@
# pylint: enable=line-too-long
# pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "1.5.0.dev0"
+__version__ = "1.6.0"

import pyarrow
from pyarrow import total_allocated_bytes
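The comment next to `version=` in setup.py describes the expected format: x.y.z.dev0, x.y.z.rc1, or x.y.z, with dots rather than dashes. A minimal sketch of a check for that format follows. The helper name `is_expected_version` is hypothetical and not part of the repository, and accepting any `rcN` number (not only `rc1`) is an assumption.

```python
import re

# Hypothetical helper, not part of the datasets repository.
# Accepts x.y.z, x.y.z.dev0 and x.y.z.rcN (any rc number is an assumption;
# the setup.py comment only shows rc1 as an example).
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.dev0|\.rc\d+)?")


def is_expected_version(version: str) -> bool:
    """Return True if `version` uses the dotted format described in setup.py."""
    return _VERSION_RE.fullmatch(version) is not None


if __name__ == "__main__":
    for candidate in ("1.5.0.dev0", "1.6.0", "1.6.0.rc1", "1.6.0-rc1"):
        print(candidate, is_expected_version(candidate))
```

The dashed form `1.6.0-rc1` is rejected, which matches the "no to dashes, yes to dots" note.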

2 comments on commit 40bb9e6

@github-actions


PyArrow==1.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.023798 / 0.011353 (0.012445) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017359 / 0.011008 (0.006351) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.052346 / 0.038508 (0.013838) |
| read_batch_unformated after write_array2d | 0.038566 / 0.023109 (0.015457) |
| read_batch_unformated after write_flattened_sequence | 0.397777 / 0.275898 (0.121879) |
| read_batch_unformated after write_nested_sequence | 0.429755 / 0.323480 (0.106275) |
| read_col_formatted_as_numpy after write_array2d | 0.011602 / 0.007986 (0.003616) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005174 / 0.004328 (0.000846) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.012144 / 0.004250 (0.007894) |
| read_col_unformated after write_array2d | 0.054393 / 0.037052 (0.017341) |
| read_col_unformated after write_flattened_sequence | 0.391196 / 0.258489 (0.132707) |
| read_col_unformated after write_nested_sequence | 0.461223 / 0.293841 (0.167382) |
| read_formatted_as_numpy after write_array2d | 0.178575 / 0.128546 (0.050029) |
| read_formatted_as_numpy after write_flattened_sequence | 0.144584 / 0.075646 (0.068938) |
| read_formatted_as_numpy after write_nested_sequence | 0.471669 / 0.419271 (0.052397) |
| read_unformated after write_array2d | 0.451078 / 0.043533 (0.407545) |
| read_unformated after write_flattened_sequence | 0.402093 / 0.255139 (0.146954) |
| read_unformated after write_nested_sequence | 0.427082 / 0.283200 (0.143882) |
| write_array2d | 1.818628 / 0.141683 (1.676945) |
| write_flattened_sequence | 1.924235 / 1.452155 (0.472080) |
| write_nested_sequence | 2.050306 / 1.492716 (0.557590) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.018397 / 0.018006 (0.000391) |
| get_batch_of_1024_rows | 0.000475 / 0.000490 (-0.000015) |
| get_first_row | 0.000199 / 0.000200 (-0.000001) |
| get_last_row | 0.000059 / 0.000054 (0.000005) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.049135 / 0.037411 (0.011723) |
| shard | 0.023746 / 0.014526 (0.009220) |
| shuffle | 0.032881 / 0.176557 (-0.143675) |
| sort | 0.054322 / 0.737135 (-0.682814) |
| train_test_split | 0.034806 / 0.296338 (-0.261532) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.536516 / 0.215209 (0.321307) |
| read 50000 | 5.411101 / 2.077655 (3.333446) |
| read_batch 50000 10 | 2.399303 / 1.504120 (0.895183) |
| read_batch 50000 100 | 2.075573 / 1.541195 (0.534378) |
| read_batch 50000 1000 | 2.139118 / 1.468490 (0.670628) |
| read_formatted numpy 5000 | 7.641936 / 4.584777 (3.057159) |
| read_formatted pandas 5000 | 6.747050 / 3.745712 (3.001338) |
| read_formatted tensorflow 5000 | 9.478884 / 5.269862 (4.209022) |
| read_formatted torch 5000 | 8.327582 / 4.565676 (3.761905) |
| read_formatted_batch numpy 5000 10 | 0.748419 / 0.424275 (0.324144) |
| read_formatted_batch numpy 5000 1000 | 0.011337 / 0.007607 (0.003730) |
| shuffled read 5000 | 0.673803 / 0.226044 (0.447758) |
| shuffled read 50000 | 6.789266 / 2.268929 (4.520337) |
| shuffled read_batch 50000 10 | 3.605281 / 55.444624 (-51.839343) |
| shuffled read_batch 50000 100 | 3.023948 / 6.876477 (-3.852529) |
| shuffled read_batch 50000 1000 | 3.116456 / 2.142072 (0.974383) |
| shuffled read_formatted numpy 5000 | 7.856719 / 4.805227 (3.051492) |
| shuffled read_formatted_batch numpy 5000 10 | 4.936142 / 6.500664 (-1.564522) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.973346 / 0.075469 (7.897876) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.334522 / 1.841788 (10.492734) |
| map fast-tokenizer batched | 14.233399 / 8.074308 (6.159091) |
| map identity | 42.639224 / 10.191392 (32.447832) |
| map identity batched | 0.960193 / 0.680424 (0.279769) |
| map no-op batched | 0.619575 / 0.534201 (0.085374) |
| map no-op batched numpy | 0.846052 / 0.579283 (0.266769) |
| map no-op batched pandas | 0.675200 / 0.434364 (0.240836) |
| map no-op batched pytorch | 0.759837 / 0.540337 (0.219499) |
| map no-op batched tensorflow | 1.681786 / 1.386936 (0.294850) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.024731 / 0.011353 (0.013378) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016361 / 0.011008 (0.005353) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.052794 / 0.038508 (0.014286) |
| read_batch_unformated after write_array2d | 0.039221 / 0.023109 (0.016112) |
| read_batch_unformated after write_flattened_sequence | 0.358330 / 0.275898 (0.082432) |
| read_batch_unformated after write_nested_sequence | 0.384023 / 0.323480 (0.060543) |
| read_col_formatted_as_numpy after write_array2d | 0.011867 / 0.007986 (0.003881) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005215 / 0.004328 (0.000886) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011471 / 0.004250 (0.007221) |
| read_col_unformated after write_array2d | 0.060624 / 0.037052 (0.023572) |
| read_col_unformated after write_flattened_sequence | 0.354031 / 0.258489 (0.095542) |
| read_col_unformated after write_nested_sequence | 0.394146 / 0.293841 (0.100305) |
| read_formatted_as_numpy after write_array2d | 0.173191 / 0.128546 (0.044645) |
| read_formatted_as_numpy after write_flattened_sequence | 0.137124 / 0.075646 (0.061477) |
| read_formatted_as_numpy after write_nested_sequence | 0.458397 / 0.419271 (0.039126) |
| read_unformated after write_array2d | 0.482534 / 0.043533 (0.439002) |
| read_unformated after write_flattened_sequence | 0.343158 / 0.255139 (0.088019) |
| read_unformated after write_nested_sequence | 0.384075 / 0.283200 (0.100875) |
| write_array2d | 1.986538 / 0.141683 (1.844855) |
| write_flattened_sequence | 1.970650 / 1.452155 (0.518495) |
| write_nested_sequence | 2.016002 / 1.492716 (0.523286) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.017988 / 0.018006 (-0.000018) |
| get_batch_of_1024_rows | 0.000419 / 0.000490 (-0.000070) |
| get_first_row | 0.000200 / 0.000200 (0.000000) |
| get_last_row | 0.000055 / 0.000054 (0.000000) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.041826 / 0.037411 (0.004415) |
| shard | 0.023560 / 0.014526 (0.009035) |
| shuffle | 0.032424 / 0.176557 (-0.144133) |
| sort | 0.048784 / 0.737135 (-0.688351) |
| train_test_split | 0.030214 / 0.296338 (-0.266124) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.515860 / 0.215209 (0.300651) |
| read 50000 | 5.064995 / 2.077655 (2.987341) |
| read_batch 50000 10 | 2.374331 / 1.504120 (0.870211) |
| read_batch 50000 100 | 2.032647 / 1.541195 (0.491452) |
| read_batch 50000 1000 | 2.039654 / 1.468490 (0.571164) |
| read_formatted numpy 5000 | 7.284068 / 4.584777 (2.699291) |
| read_formatted pandas 5000 | 6.497594 / 3.745712 (2.751882) |
| read_formatted tensorflow 5000 | 9.231881 / 5.269862 (3.962019) |
| read_formatted torch 5000 | 8.137452 / 4.565676 (3.571776) |
| read_formatted_batch numpy 5000 10 | 0.740687 / 0.424275 (0.316412) |
| read_formatted_batch numpy 5000 1000 | 0.011250 / 0.007607 (0.003643) |
| shuffled read 5000 | 0.634690 / 0.226044 (0.408645) |
| shuffled read 50000 | 6.385092 / 2.268929 (4.116164) |
| shuffled read_batch 50000 10 | 3.576756 / 55.444624 (-51.867868) |
| shuffled read_batch 50000 100 | 3.024843 / 6.876477 (-3.851633) |
| shuffled read_batch 50000 1000 | 3.020211 / 2.142072 (0.878138) |
| shuffled read_formatted numpy 5000 | 7.585863 / 4.805227 (2.780636) |
| shuffled read_formatted_batch numpy 5000 10 | 6.673547 / 6.500664 (0.172883) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.762227 / 0.075469 (7.686758) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.094649 / 1.841788 (10.252861) |
| map fast-tokenizer batched | 13.970157 / 8.074308 (5.895848) |
| map identity | 42.653895 / 10.191392 (32.462503) |
| map identity batched | 0.954281 / 0.680424 (0.273857) |
| map no-op batched | 0.653846 / 0.534201 (0.119646) |
| map no-op batched numpy | 0.840919 / 0.579283 (0.261636) |
| map no-op batched pandas | 0.669184 / 0.434364 (0.234820) |
| map no-op batched pytorch | 0.753811 / 0.540337 (0.213474) |
| map no-op batched tensorflow | 1.626253 / 1.386936 (0.239317) |


@github-actions


PyArrow==1.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.025657 / 0.011353 (0.014305) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017979 / 0.011008 (0.006970) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.057917 / 0.038508 (0.019409) |
| read_batch_unformated after write_array2d | 0.040598 / 0.023109 (0.017489) |
| read_batch_unformated after write_flattened_sequence | 0.369462 / 0.275898 (0.093564) |
| read_batch_unformated after write_nested_sequence | 0.409188 / 0.323480 (0.085708) |
| read_col_formatted_as_numpy after write_array2d | 0.012495 / 0.007986 (0.004509) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005406 / 0.004328 (0.001077) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.012740 / 0.004250 (0.008489) |
| read_col_unformated after write_array2d | 0.053203 / 0.037052 (0.016151) |
| read_col_unformated after write_flattened_sequence | 0.357163 / 0.258489 (0.098674) |
| read_col_unformated after write_nested_sequence | 0.415039 / 0.293841 (0.121198) |
| read_formatted_as_numpy after write_array2d | 0.180392 / 0.128546 (0.051846) |
| read_formatted_as_numpy after write_flattened_sequence | 0.149249 / 0.075646 (0.073602) |
| read_formatted_as_numpy after write_nested_sequence | 0.470625 / 0.419271 (0.051353) |
| read_unformated after write_array2d | 0.472773 / 0.043533 (0.429240) |
| read_unformated after write_flattened_sequence | 0.385641 / 0.255139 (0.130502) |
| read_unformated after write_nested_sequence | 0.401641 / 0.283200 (0.118441) |
| write_array2d | 1.884313 / 0.141683 (1.742630) |
| write_flattened_sequence | 1.966217 / 1.452155 (0.514062) |
| write_nested_sequence | 2.005766 / 1.492716 (0.513050) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.018381 / 0.018006 (0.000375) |
| get_batch_of_1024_rows | 0.000472 / 0.000490 (-0.000018) |
| get_first_row | 0.000180 / 0.000200 (-0.000020) |
| get_last_row | 0.000056 / 0.000054 (0.000002) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.049239 / 0.037411 (0.011827) |
| shard | 0.027343 / 0.014526 (0.012817) |
| shuffle | 0.032138 / 0.176557 (-0.144419) |
| sort | 0.050638 / 0.737135 (-0.686497) |
| train_test_split | 0.032516 / 0.296338 (-0.263823) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.528575 / 0.215209 (0.313366) |
| read 50000 | 5.127312 / 2.077655 (3.049657) |
| read_batch 50000 10 | 2.478301 / 1.504120 (0.974181) |
| read_batch 50000 100 | 2.155765 / 1.541195 (0.614570) |
| read_batch 50000 1000 | 2.159532 / 1.468490 (0.691042) |
| read_formatted numpy 5000 | 7.716309 / 4.584777 (3.131532) |
| read_formatted pandas 5000 | 7.032695 / 3.745712 (3.286983) |
| read_formatted tensorflow 5000 | 9.897024 / 5.269862 (4.627162) |
| read_formatted torch 5000 | 8.478060 / 4.565676 (3.912384) |
| read_formatted_batch numpy 5000 10 | 0.796837 / 0.424275 (0.372562) |
| read_formatted_batch numpy 5000 1000 | 0.012420 / 0.007607 (0.004813) |
| shuffled read 5000 | 0.672482 / 0.226044 (0.446437) |
| shuffled read 50000 | 6.614490 / 2.268929 (4.345561) |
| shuffled read_batch 50000 10 | 3.619045 / 55.444624 (-51.825580) |
| shuffled read_batch 50000 100 | 3.157535 / 6.876477 (-3.718942) |
| shuffled read_batch 50000 1000 | 3.222804 / 2.142072 (1.080732) |
| shuffled read_formatted numpy 5000 | 8.316613 / 4.805227 (3.511386) |
| shuffled read_formatted_batch numpy 5000 10 | 5.801345 / 6.500664 (-0.699319) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.562244 / 0.075469 (7.486775) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.946749 / 1.841788 (11.104962) |
| map fast-tokenizer batched | 14.386661 / 8.074308 (6.312353) |
| map identity | 42.675459 / 10.191392 (32.484067) |
| map identity batched | 0.994317 / 0.680424 (0.313893) |
| map no-op batched | 0.677810 / 0.534201 (0.143609) |
| map no-op batched numpy | 0.879490 / 0.579283 (0.300207) |
| map no-op batched pandas | 0.702888 / 0.434364 (0.268524) |
| map no-op batched pytorch | 0.815691 / 0.540337 (0.275354) |
| map no-op batched tensorflow | 1.786458 / 1.386936 (0.399522) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.026639 / 0.011353 (0.015286) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016968 / 0.011008 (0.005960) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.053488 / 0.038508 (0.014980) |
| read_batch_unformated after write_array2d | 0.040990 / 0.023109 (0.017881) |
| read_batch_unformated after write_flattened_sequence | 0.391468 / 0.275898 (0.115569) |
| read_batch_unformated after write_nested_sequence | 0.419872 / 0.323480 (0.096392) |
| read_col_formatted_as_numpy after write_array2d | 0.011548 / 0.007986 (0.003562) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006021 / 0.004328 (0.001692) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.011810 / 0.004250 (0.007560) |
| read_col_unformated after write_array2d | 0.072578 / 0.037052 (0.035526) |
| read_col_unformated after write_flattened_sequence | 0.368128 / 0.258489 (0.109638) |
| read_col_unformated after write_nested_sequence | 0.409392 / 0.293841 (0.115551) |
| read_formatted_as_numpy after write_array2d | 0.181709 / 0.128546 (0.053163) |
| read_formatted_as_numpy after write_flattened_sequence | 0.140888 / 0.075646 (0.065242) |
| read_formatted_as_numpy after write_nested_sequence | 0.467502 / 0.419271 (0.048230) |
| read_unformated after write_array2d | 0.481175 / 0.043533 (0.437642) |
| read_unformated after write_flattened_sequence | 0.361829 / 0.255139 (0.106690) |
| read_unformated after write_nested_sequence | 0.397535 / 0.283200 (0.114336) |
| write_array2d | 1.877830 / 0.141683 (1.736147) |
| write_flattened_sequence | 2.051703 / 1.452155 (0.599549) |
| write_nested_sequence | 2.098484 / 1.492716 (0.605767) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.019984 / 0.018006 (0.001977) |
| get_batch_of_1024_rows | 0.000455 / 0.000490 (-0.000035) |
| get_first_row | 0.000186 / 0.000200 (-0.000014) |
| get_last_row | 0.000058 / 0.000054 (0.000004) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.043626 / 0.037411 (0.006214) |
| shard | 0.024809 / 0.014526 (0.010283) |
| shuffle | 0.033560 / 0.176557 (-0.142997) |
| sort | 0.051457 / 0.737135 (-0.685678) |
| train_test_split | 0.032501 / 0.296338 (-0.263837) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.509983 / 0.215209 (0.294774) |
| read 50000 | 4.944591 / 2.077655 (2.866937) |
| read_batch 50000 10 | 2.253467 / 1.504120 (0.749347) |
| read_batch 50000 100 | 1.973815 / 1.541195 (0.432620) |
| read_batch 50000 1000 | 1.987211 / 1.468490 (0.518721) |
| read_formatted numpy 5000 | 7.474555 / 4.584777 (2.889778) |
| read_formatted pandas 5000 | 6.721563 / 3.745712 (2.975850) |
| read_formatted tensorflow 5000 | 9.551561 / 5.269862 (4.281700) |
| read_formatted torch 5000 | 8.199687 / 4.565676 (3.634011) |
| read_formatted_batch numpy 5000 10 | 0.741217 / 0.424275 (0.316942) |
| read_formatted_batch numpy 5000 1000 | 0.010231 / 0.007607 (0.002624) |
| shuffled read 5000 | 0.617441 / 0.226044 (0.391396) |
| shuffled read 50000 | 6.121524 / 2.268929 (3.852596) |
| shuffled read_batch 50000 10 | 3.314725 / 55.444624 (-52.129899) |
| shuffled read_batch 50000 100 | 2.862248 / 6.876477 (-4.014229) |
| shuffled read_batch 50000 1000 | 2.833091 / 2.142072 (0.691018) |
| shuffled read_formatted numpy 5000 | 7.531522 / 4.805227 (2.726295) |
| shuffled read_formatted_batch numpy 5000 10 | 7.270122 / 6.500664 (0.769458) |
| shuffled read_formatted_batch numpy 5000 1000 | 9.178513 / 0.075469 (9.103044) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.331112 / 1.841788 (10.489324) |
| map fast-tokenizer batched | 14.328543 / 8.074308 (6.254235) |
| map identity | 41.548669 / 10.191392 (31.357277) |
| map identity batched | 0.932877 / 0.680424 (0.252453) |
| map no-op batched | 0.664344 / 0.534201 (0.130143) |
| map no-op batched numpy | 0.874565 / 0.579283 (0.295281) |
| map no-op batched pandas | 0.699682 / 0.434364 (0.265318) |
| map no-op batched pytorch | 0.796579 / 0.540337 (0.256241) |
| map no-op batched tensorflow | 1.676036 / 1.386936 (0.289100) |

