ARROW-12549: [JS] Table and RecordBatch should not extend Vector, make JS lib smaller

This pull request addresses a number of issues that require a more substantial refactor. The main goals are:

1. Eliminate cruft by dropping support for outdated browsers/environments.
2. Reduce total surface area by eliminating unnecessary `Vector`, `Chunked`, and `Column` classes.
3. Reduce the amount of the library pulled in when Table, RecordBatch, or Vector classes are imported.

In this pull request, we have eliminated type-specific Vector classes. There is now only one vector class, which has a data instance, and we use type-specific visitors. Record batches no longer inherit from vectors, and neither do Tables. Columns are gone. To create vectors and tables, we now have separate methods that can be easily tree-shaken.

We also added tests for the bundles, fixed some issues with bundling in webpack, and updated dependencies (including TypeScript and flatbuffers). We also added memoization to dictionary vectors to reduce the overhead of decoding UTF-8 to strings.

A quick overview of Arrow with the new API: https://observablehq.com/d/9480eccb30a21010.

Also addresses:

* [ARROW-10255](https://issues.apache.org/jira/browse/ARROW-10255)
* [ARROW-11347](https://issues.apache.org/jira/browse/ARROW-11347)
* [ARROW-12548](https://issues.apache.org/jira/browse/ARROW-12548)
* [ARROW-13514](https://issues.apache.org/jira/browse/ARROW-13514)
* [ARROW-10220](https://issues.apache.org/jira/browse/ARROW-10220)
* [ARROW-14933](https://issues.apache.org/jira/browse/ARROW-14933)
* [ARROW-12538](https://issues.apache.org/jira/browse/ARROW-12538)
* [ARROW-12536](https://issues.apache.org/jira/browse/ARROW-12536)

## Performance comparison:

### Master:

```
Prepare Data: 502.401ms
Running "Parse" suite...
  dataset: tracks, function: Table.from
    15,578 ops/s ±0.67%, 0.064 ms, 94 samples
  dataset: tracks, function: readBatches
    15,853 ops/s ±0.59%, 0.063 ms, 97 samples
  dataset: tracks, function: serialize
    969 ops/s ±1.8%, 1 ms, 93 samples
Running "Get values by index" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    78 ops/s ±0.090%, 13 ms, 82 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    79 ops/s ±0.090%, 13 ms, 70 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    1.59 ops/s ±25%, 563 ms, 9 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    1.74 ops/s ±3.2%, 576 ms, 9 samples
Running "Iterate vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    85 ops/s ±0.14%, 12 ms, 74 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    85 ops/s ±0.11%, 12 ms, 75 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    1.51 ops/s ±3.1%, 657 ms, 8 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    1.49 ops/s ±4.0%, 666 ms, 8 samples
Running "Slice toArray vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    2,588 ops/s ±3.0%, 0.4 ms, 74 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    2,345 ops/s ±1.7%, 0.43 ms, 73 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    1.29 ops/s ±5.3%, 760 ms, 8 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    1.28 ops/s ±4.1%, 784 ms, 8 samples
Running "Slice vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    4,212,193 ops/s ±0.23%, 0 ms, 100 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    4,400,234 ops/s ±0.80%, 0 ms, 92 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    4,764,651 ops/s ±0.13%, 0 ms, 101 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    4,763,581 ops/s ±0.050%, 0 ms, 98 samples
Running "DataFrame Iterate" suite...
  dataset: tracks, length: 1,000,000
    23.1 ops/s ±2.1%, 43 ms, 43 samples
Running "DataFrame Count By" suite...
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    535 ops/s ±0.050%, 1.9 ms, 99 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    535 ops/s ±0.040%, 1.9 ms, 96 samples
Running "DataFrame Filter-Scan Count" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32, test: gt, value: 0
    57 ops/s ±0.090%, 18 ms, 75 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32, test: gt, value: 0
    57 ops/s ±0.050%, 18 ms, 74 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle
    99 ops/s ±0.060%, 10 ms, 86 samples
Running "DataFrame Filter-Iterate" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32, test: gt, value: 0
    37 ops/s ±0.12%, 27 ms, 66 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32, test: gt, value: 0
    37 ops/s ±0.14%, 27 ms, 66 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle
    70 ops/s ±0.45%, 14 ms, 73 samples
Running "DataFrame Direct Count" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32, test: gt, value: 0
    160 ops/s ±0.040%, 6.3 ms, 83 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32, test: gt, value: 0
    162 ops/s ±0.12%, 6.1 ms, 85 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle
    1.51 ops/s ±5.6%, 664 ms, 8 samples
```

### This branch:

```
Running "vectorFromArray" suite...
  from: numbers
    106 ops/s ±1.1%, 9.3 ms, 79 samples
  from: booleans
    101 ops/s ±1.4%, 9.8 ms, 76 samples
  from: dictionary
    105 ops/s ±4.1%, 9 ms, 78 samples
Running "Iterate Vector" suite...
  from: uint8Array
    896 ops/s ±0.21%, 1.1 ms, 94 samples
  from: uint16Array
    896 ops/s ±0.82%, 1.1 ms, 94 samples
  from: uint32Array
    884 ops/s ±0.39%, 1.1 ms, 95 samples
  from: uint64Array
    285 ops/s ±0.19%, 3.5 ms, 92 samples
  from: int8Array
    882 ops/s ±0.65%, 1.1 ms, 95 samples
  from: int16Array
    899 ops/s ±0.37%, 1.1 ms, 95 samples
  from: int32Array
    887 ops/s ±0.46%, 1.1 ms, 92 samples
  from: int64Array
    280 ops/s ±0.60%, 3.5 ms, 91 samples
  from: float32Array
    805 ops/s ±0.86%, 1.2 ms, 90 samples
  from: float64Array
    814 ops/s ±0.44%, 1.2 ms, 92 samples
  from: numbers
    812 ops/s ±0.39%, 1.2 ms, 91 samples
  from: booleans
    284 ops/s ±0.14%, 3.5 ms, 92 samples
  from: dictionary
    298 ops/s ±0.44%, 3.3 ms, 91 samples
  from: string
    16.2 ops/s ±3.9%, 59 ms, 45 samples
Running "Spread Vector" suite...
  from: uint8Array
    360 ops/s ±1.2%, 2.7 ms, 93 samples
  from: uint16Array
    374 ops/s ±0.55%, 2.6 ms, 92 samples
  from: uint32Array
    372 ops/s ±1.1%, 2.6 ms, 91 samples
  from: uint64Array
    164 ops/s ±0.66%, 6 ms, 78 samples
  from: int8Array
    372 ops/s ±0.64%, 2.7 ms, 96 samples
  from: int16Array
    380 ops/s ±0.42%, 2.6 ms, 94 samples
  from: int32Array
    375 ops/s ±0.87%, 2.6 ms, 92 samples
  from: int64Array
    164 ops/s ±0.64%, 6.1 ms, 86 samples
  from: float32Array
    327 ops/s ±0.62%, 3 ms, 85 samples
  from: float64Array
    318 ops/s ±1.1%, 3.1 ms, 91 samples
  from: numbers
    326 ops/s ±0.74%, 3 ms, 89 samples
  from: booleans
    178 ops/s ±0.92%, 5.6 ms, 84 samples
  from: dictionary
    189 ops/s ±0.51%, 5.2 ms, 89 samples
  from: string
    14.8 ops/s ±3.7%, 65 ms, 41 samples
Running "toArray Vector" suite...
  from: uint8Array
    28,488,216 ops/s ±0.22%, 0 ms, 101 samples
  from: uint16Array
    28,777,482 ops/s ±0.41%, 0 ms, 98 samples
  from: uint32Array
    28,387,333 ops/s ±0.25%, 0 ms, 97 samples
  from: uint64Array
    23,412,763 ops/s ±0.68%, 0 ms, 97 samples
  from: int8Array
    21,497,600 ops/s ±0.22%, 0 ms, 94 samples
  from: int16Array
    21,990,137 ops/s ±0.16%, 0 ms, 101 samples
  from: int32Array
    21,809,196 ops/s ±0.68%, 0 ms, 96 samples
  from: int64Array
    20,084,822 ops/s ±0.68%, 0 ms, 93 samples
  from: float32Array
    18,452,580 ops/s ±0.83%, 0 ms, 96 samples
  from: float64Array
    18,527,057 ops/s ±0.54%, 0 ms, 92 samples
  from: numbers
    18,555,045 ops/s ±0.52%, 0 ms, 99 samples
  from: booleans
    178 ops/s ±0.43%, 5.6 ms, 84 samples
  from: dictionary
    189 ops/s ±0.61%, 5.3 ms, 89 samples
  from: string
    15.8 ops/s ±0.76%, 63 ms, 43 samples
Running "get Vector" suite...
  from: uint8Array
    441 ops/s ±1.1%, 2.2 ms, 95 samples
  from: uint16Array
    441 ops/s ±0.48%, 2.2 ms, 95 samples
  from: uint32Array
    443 ops/s ±0.23%, 2.2 ms, 96 samples
  from: uint64Array
    414 ops/s ±0.68%, 2.4 ms, 93 samples
  from: int8Array
    439 ops/s ±0.30%, 2.3 ms, 95 samples
  from: int16Array
    447 ops/s ±0.35%, 2.2 ms, 96 samples
  from: int32Array
    439 ops/s ±0.48%, 2.3 ms, 94 samples
  from: int64Array
    415 ops/s ±0.17%, 2.4 ms, 97 samples
  from: float32Array
    472 ops/s ±0.49%, 2.1 ms, 94 samples
  from: float64Array
    471 ops/s ±0.26%, 2.1 ms, 97 samples
  from: numbers
    473 ops/s ±0.22%, 2.1 ms, 98 samples
  from: booleans
    429 ops/s ±0.25%, 2.3 ms, 97 samples
  from: dictionary
    464 ops/s ±0.23%, 2.1 ms, 96 samples
  from: string
    17.8 ops/s ±1.3%, 56 ms, 48 samples
Running "Parse" suite...
  dataset: tracks, function: read recordBatches
    12,047 ops/s ±0.77%, 0.082 ms, 100 samples
  dataset: tracks, function: write recordBatches
    1,028 ops/s ±0.72%, 0.96 ms, 96 samples
Running "Get values by index" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    46 ops/s ±0.12%, 22 ms, 61 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    46 ops/s ±0.15%, 22 ms, 61 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    25.3 ops/s ±0.37%, 39 ms, 46 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    25.1 ops/s ±0.76%, 39 ms, 46 samples
Running "Iterate vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    84 ops/s ±0.20%, 12 ms, 73 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    82 ops/s ±0.65%, 12 ms, 72 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    30 ops/s ±0.94%, 33 ms, 54 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    30 ops/s ±0.41%, 33 ms, 54 samples
Running "Slice toArray vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    2,911 ops/s ±3.3%, 0.33 ms, 86 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    2,765 ops/s ±3.2%, 0.35 ms, 77 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    18 ops/s ±1.2%, 55 ms, 49 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    18.2 ops/s ±0.73%, 54 ms, 50 samples
Running "Slice vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    4,338,570 ops/s ±0.52%, 0 ms, 94 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    4,341,418 ops/s ±0.41%, 0 ms, 97 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    3,656,243 ops/s ±0.45%, 0 ms, 101 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    3,598,448 ops/s ±1.0%, 0 ms, 97 samples
Running "Spread vectors" suite...
  dataset: tracks, column: lat, length: 1,000,000, type: Float32
    16 ops/s ±4.3%, 59 ms, 44 samples
  dataset: tracks, column: lng, length: 1,000,000, type: Float32
    16.1 ops/s ±4.2%, 60 ms, 45 samples
  dataset: tracks, column: origin, length: 1,000,000, type: Dictionary<Int8, Utf8>
    17.8 ops/s ±1.5%, 55 ms, 49 samples
  dataset: tracks, column: destination, length: 1,000,000, type: Dictionary<Int8, Utf8>
    17.6 ops/s ±1.7%, 55 ms, 48 samples
Running "Table" suite...
  Iterate, dataset: tracks, numRows: 1,000,000
    27 ops/s ±0.28%, 37 ms, 49 samples
  Spread, dataset: tracks, numRows: 1,000,000
    8.73 ops/s ±3.7%, 111 ms, 25 samples
  toArray, dataset: tracks, numRows: 1,000,000
    8.15 ops/s ±4.9%, 115 ms, 26 samples
  get, dataset: tracks, numRows: 1,000,000
    17.2 ops/s ±0.31%, 58 ms, 47 samples
Running "Table Direct Count" suite...
  dataset: tracks, column: lat, numRows: 1,000,000, type: Float32, test: gt, value: 0
    74 ops/s ±0.16%, 14 ms, 77 samples
  dataset: tracks, column: lng, numRows: 1,000,000, type: Float32, test: gt, value: 0
    74 ops/s ±0.20%, 14 ms, 77 samples
  dataset: tracks, column: origin, numRows: 1,000,000, type: Dictionary<Int8, Utf8>, test: eq, value: Seattle
    80 ops/s ±0.060%, 12 ms, 71 samples
```

Closes apache#10371 from trxcllnt/fea/simplify

Lead-authored-by: Paul Taylor
Co-authored-by: Dominik Moritz
Co-authored-by: ptaylor
Signed-off-by: Dominik Moritz
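
The message above mentions memoizing dictionary vectors so that repeated keys do not trigger repeated UTF-8 decoding. As a rough illustration of that idea only (this is not the library's implementation; `MemoizedDictionary`, `buildDictionary`, and `decodeCount` are hypothetical names, and the offsets-plus-bytes layout is a simplifying assumption), the sketch below caches each decoded dictionary entry the first time its key is read:

```typescript
// Hypothetical sketch of memoized dictionary decoding; none of these
// names are apache-arrow APIs.
class MemoizedDictionary {
  private readonly bytes: Uint8Array;   // concatenated UTF-8 bytes of all entries
  private readonly offsets: Int32Array; // entry i spans offsets[i]..offsets[i + 1]
  private readonly cache: (string | undefined)[];
  private readonly decoder = new TextDecoder("utf-8");
  public decodeCount = 0;               // counts real decodes, for illustration

  constructor(bytes: Uint8Array, offsets: Int32Array) {
    this.bytes = bytes;
    this.offsets = offsets;
    this.cache = new Array(offsets.length - 1);
  }

  get(key: number): string {
    const cached = this.cache[key];
    if (cached !== undefined) return cached; // cache hit: no UTF-8 decoding
    this.decodeCount++;
    const slice = this.bytes.subarray(this.offsets[key], this.offsets[key + 1]);
    const value = this.decoder.decode(slice);
    this.cache[key] = value;
    return value;
  }
}

// Helper that packs strings into the bytes + offsets layout assumed above.
function buildDictionary(values: string[]): MemoizedDictionary {
  const encoder = new TextEncoder();
  const encoded = values.map((v) => encoder.encode(v));
  const offsets = new Int32Array(values.length + 1);
  encoded.forEach((e, i) => { offsets[i + 1] = offsets[i] + e.length; });
  const bytes = new Uint8Array(offsets[values.length]);
  encoded.forEach((e, i) => bytes.set(e, offsets[i]));
  return new MemoizedDictionary(bytes, offsets);
}

// Six keys referencing three distinct dictionary entries.
const dict = buildDictionary(["Seattle", "Portland", "Boise"]);
const keys = Int8Array.from([0, 1, 0, 2, 0, 1]);
const decoded = Array.from(keys, (k) => dict.get(k));
```

Here six lookups cause only three decodes. Columns like `origin` and `destination` in the benchmarks, where 1,000,000 rows share a dictionary with `Int8` keys, are the kind of case where such caching would be expected to help most, though the numbers above reflect the whole refactor rather than this change alone.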