A sort application that sorts the 4 KB blocks of a big file that cannot fit in main memory.
Bigsort is meant as a test application for exploring the Seastar framework.
Future improvements could make this application genuinely useful.
bigsort handles big data using two strategies: internal and external sort.
In the first phase the input file is partitioned so that each partition fits in main memory.
The blocks of each partition are sorted in main memory and stored to a file: filename.partition_number
At the end of this phase we have a set of sorted files that need to be merged together; for this I've used a simple merge-sort algorithm that compares the head element of each file and extracts the minimum at each iteration.
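As an illustration of the first phase, here is a minimal synchronous sketch in plain C++ (the real application uses Seastar's asynchronous primitives; the 4 KB block size and the filename.partition_number naming follow the description above, while the function names are made up for this sketch). A worked example of both phases follows below.

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    constexpr std::size_t BLOCK_SIZE = 4096;   // one sortable record
    using Block = std::array<char, BLOCK_SIZE>;

    // Sort one memory-sized partition of 4K blocks and spill it to
    // <name>.<partition_number>.
    void spill_sorted_partition(std::vector<Block>& blocks,
                                const std::string& name, unsigned partition)
    {
        // std::array compares lexicographically, so this sorts raw blocks.
        std::stable_sort(blocks.begin(), blocks.end());
        std::ofstream out(name + "." + std::to_string(partition),
                          std::ios::binary);
        for (const Block& b : blocks)
            out.write(b.data(), BLOCK_SIZE);
    }

    // Phase 1: read the input in partitions of at most mem_blocks blocks,
    // sorting and spilling each partition in turn.
    void internal_sort(const std::string& name, std::size_t mem_blocks)
    {
        std::ifstream in(name, std::ios::binary);
        unsigned partition = 0;
        for (;;) {
            std::vector<Block> blocks;
            Block b;
            while (blocks.size() < mem_blocks && in.read(b.data(), BLOCK_SIZE))
                blocks.push_back(b);
            if (blocks.empty())
                break;
            spill_sorted_partition(blocks, name, partition++);
        }
    }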
Example:
unsorted data = [z,t,s,c,a,b,r,i,l,m,d,e] where every character is a 4 KB chunk; there are N elements.
In memory it is possible to load M chunks at a time, so the unsorted chunks are partitioned into loadable groups: [z,t,s] [c,a,b] [r,i,l] [m,d,e]
Every group can be sorted in memory (with std::stable_sort, std::sort, or Boost) and stored to disk:
[z,t,s] [c,a,b] [r,i,l] [m,d,e] => [s,t,z] [a,b,c] [i,l,r] [d,e,m]
Now we have K sorted subsets of size C; a k-way merge sort can merge them, obtaining N sorted elements.
Read the head element of every chunk from disk and append the minimum to a new file; each line below shows the remaining chunks after the minimum has been extracted:
[s,t,z] [b,c] [i,l,r] [d,e,m] => a
[s,t,z] [c] [i,l,r] [d,e,m] => b
[s,t,z] [] [i,l,r] [d,e,m] => c
[s,t,z] [] [i,l,r] [e,m] => d
[s,t,z] [] [i,l,r] [m] => e
[s,t,z] [] [l,r] [m] => i
[s,t,z] [] [r] [m] => l
[s,t,z] [] [r] [] => m
[s,t,z] [] [] [] => r
[t,z] [] [] [] => s
[z] [] [] [] => t
[] [] [] [] => z
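The description above scans the K head elements directly; the following sketch (same simplified synchronous style as before, names again illustrative) keeps the heads in a std::priority_queue instead, which finds the minimum in O(log K) per step, but the idea is the same:

    #include <array>
    #include <cstddef>
    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    constexpr std::size_t BLOCK_SIZE = 4096;
    using Block = std::array<char, BLOCK_SIZE>;

    struct Head {
        Block block;        // current head block of one sorted run
        std::size_t run;    // which sorted file it belongs to
    };

    struct HeadGreater {
        bool operator()(const Head& a, const Head& b) const {
            return b.block < a.block;   // min-heap: smallest block on top
        }
    };

    // Phase 2: k-way merge of the sorted partition files into out_name.
    void kway_merge(const std::vector<std::string>& run_files,
                    const std::string& out_name)
    {
        std::vector<std::ifstream> runs;
        for (const std::string& f : run_files)
            runs.emplace_back(f, std::ios::binary);

        std::priority_queue<Head, std::vector<Head>, HeadGreater> heads;
        for (std::size_t i = 0; i < runs.size(); ++i) {
            Head h{{}, i};
            if (runs[i].read(h.block.data(), BLOCK_SIZE))
                heads.push(h);   // initial head of every run
        }

        std::ofstream out(out_name, std::ios::binary);
        while (!heads.empty()) {
            Head min = heads.top();
            heads.pop();
            out.write(min.block.data(), BLOCK_SIZE);  // emit global minimum
            // Refill from the run the minimum came from; an exhausted run
            // simply drops out, like the [] chunks in the trace above.
            if (runs[min.run].read(min.block.data(), BLOCK_SIZE))
                heads.push(min);
        }
    }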
To build bigsort the Seastar framework is needed.

    mkdir build
    cd build
    cmake ..
    make
    ./bigsort filename
bigsort creates a new file filename.sort in the same folder as filename.
To set up the folder where test files are handled, edit TEST_PATTERN_DIR inside test/CMakeFiles.txt
    tests/unit_test -t test_internal_sort
    tests/unit_test -t test_external_sort
To test bigsort in depth, a big file to be sorted is needed.
The executable genbigfile builds two pattern files of ~10 GB by default (up to ~60 GB by specifying the size on the command line): one file is ordered and the other unordered, so that bigsort's results can be checked against the ordered one.
The pattern is built using a textual binary sequence inside each block, so 4096^2 permutations of a block are possible, giving 4096^2 * block_size = 68,719,476,736 bytes as a bound; however, it is easily extensible to produce any size by repeating patterns.
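To make the bound concrete, here is one hypothetical way such a pattern could be generated (a guess for illustration only, not necessarily genbigfile's actual scheme): write a 24-bit counter as a text string of '0' and '1' characters and repeat it across the 4 KB block. Since 2^24 = 4096^2, at most 4096^2 distinct blocks exist before the counter wraps:

    #include <array>
    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <string>

    constexpr std::size_t BLOCK_SIZE = 4096;

    // Hypothetical illustration: fill a 4K block with the textual binary
    // form of a 24-bit counter. 2^24 = 4096^2 counters give at most
    // 4096^2 distinct blocks, i.e. 4096^2 * BLOCK_SIZE bytes before the
    // pattern must repeat.
    std::array<char, BLOCK_SIZE> make_block(std::uint32_t counter)
    {
        std::string bits = std::bitset<24>(counter).to_string(); // "0100..."
        std::array<char, BLOCK_SIZE> block;
        for (std::size_t i = 0; i < BLOCK_SIZE; ++i)
            block[i] = bits[i % bits.size()];
        return block;
    }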
    tests/genbigfile path
Let's start by generating a 2 GB pattern:
    alex@wildcat:~/develop/bigsort/build_res$ ./tests/genbigfile --size 2 big2
    Start genbigfile big2 of size:2 GB -- blocks:524288
    16 Mbytes has been written
    32 Mbytes has been written
    48 Mbytes has been written
    64 Mbytes has been written
    ...
    ...
    2032 Mbytes has been written
    big file have been done in 162480ms
    2048 Mbytes has been written
    alex@wildcat:~/develop/bigsort/build_res$ ls -l big2.*
    -rw-r--r-- 1 alex alex 2147483648 feb 11 00:50 big2.sorted
    -rw-r--r-- 1 alex alex 2147483648 feb 11 00:50 big2.unsorted
Now sort the unsorted pattern:
    alex@wildcat:~/develop/bigsort/build_res$ ./bigsort big2.unsorted
    bigsort lexicographic sort of 4K blocks.
    Available memory for internal sort 888 Mb
    file name big2.unsorted
At the end it is possible to compare the results with diff or, better, with md5sum.
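For example, with the 2 GB pattern generated above, bigsort writes its output to big2.unsorted.sort (following the filename.sort naming), and its checksum can be checked against the pre-sorted reference file:

    md5sum big2.sorted big2.unsorted.sort

The two checksums should be identical if the sort is correct.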