- 2025.01.13: v2 released
- Supports GUI-based operation
- 2024.02.25: v1 released
The small file archiving solution transfers millions of files into Amazon S3 efficiently by aggregating small files into large tarfiles. The application supports not only transferring files but also searching for an original file and restoring it individually. This helps users reduce cloud storage cost by cutting PUT request cost (for example, uploading 10 million small files individually requires 10 million PUT requests, while 1,000 tarfiles of 10,000 files each require only 1,000), and makes transfer to the cloud faster than uploading the original files one by one. The code of this solution was developed by me together with Amazon Q Developer.
- Aggregating files on on-premises storage into tarfiles and uploading them to S3 directly
- Providing manifest files that record tarfile name, subset file name, date, file size, first block, last block, and md5 (see the example row below)
- Finding the tarfiles that include a specific subset file by condition, such as filename or date range
- Retrieving a subset file individually from a tarfile in S3 using byte-range requests
- Generating tarfiles from an input file list instead of scanning the filesystem
- GUI-based operation, built on Streamlit
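Each manifest row records where one subset file sits inside its tarfile. As an illustration only (the values and the exact column layout here are hypothetical; the application defines the real format), a row might look like this:

```csv
tarfile,subset_file,date,file_size,first_block,last_block,md5
archive_20250113_080619_0001.tar,d0001/dir0018/index.html,2025-01-13,44,5120,6144,<md5-of-file>
```

The first/last block offsets are what make byte-range retrieval of a single file possible later.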
- Linux
- Python >= 3.7
- Streamlit
- AWS CLI
- AWS credentials configured via `aws configure` to access Amazon S3
- Clone github repository
git clone https://github.com/aws-samples/small-files-archiving-solution.git
cd small-files-archiving-solution/current
- Install required packages:
pip install -r requirements.txt
- To run the application, use the following command:
streamlit run app.py --server.fileWatcherType none
or
./run-app.sh
After running either command, you will see a URL to access; open this URL in your web browser.
- Select "Archiver" from the sidebar.
- Choose the source type and batch strategy, then click the "Next" button.
- source type: filesystem or S3 is supported
- destination type: S3 is supported
- batch strategy (a sketch of both strategies follows the parameter list below):
- size: the tarfile is generated up to a specified maximum size.
- count: the tarfile is generated from a specified number of original files.
- Enter the required parameters and click the "Run" button to start archiving.
- Number of Threads: the number of threads used for the archiving process. More threads can speed up the process by performing multiple operations in parallel, but they also require more system resources.
- Storage Class: the destination S3 storage class in which the tarfiles are stored.
- Source Path: the path to the source directory or file system from which the files will be archived. It is used when the source type is a file system.
- Destination Bucket: the name of the S3 bucket where the archived files will be stored. It is required for both file system to S3 (fs_to_s3) and S3 to S3.
- Destination Prefix: the prefix (or folder path) within the destination S3 bucket where the archived files will be stored. It helps organize the files within the bucket.
- Max Tarfile Size (MB, GB): the maximum size of each tarfile created during the archiving process, specified in units such as MB (megabytes) or GB (gigabytes). This keeps the archived files from exceeding a certain limit.
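A minimal Python sketch of how these settings interact; this is an illustration under assumptions, not the application's actual code (the helper names are hypothetical):

```python
# Hypothetical sketch of the two batch strategies; the real logic lives in app.py.
def parse_size(value: str) -> int:
    """Convert a size string such as '100MB' or '1GB' to bytes."""
    units = {"MB": 1024 ** 2, "GB": 1024 ** 3}
    for suffix, factor in units.items():
        if value.upper().endswith(suffix):
            return int(value[:-len(suffix)]) * factor
    raise ValueError(f"size must end with MB or GB: {value!r}")

def make_batches(files, strategy: str, limit: int):
    """files: iterable of (path, size); strategy: 'size' or 'count'.

    Yields lists of files; each yielded batch is written into one tarfile.
    """
    batch, batch_bytes = [], 0
    for path, size in files:
        batch.append((path, size))
        batch_bytes += size
        full = batch_bytes >= limit if strategy == "size" else len(batch) >= limit
        if full:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch  # the final partial batch becomes the last, smaller tarfile
```

With the size strategy and a limit of roughly 10 MB, this reproduces the pattern in the listing below: every tarfile is about the same size except the final, smaller one.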
- Navigate in the S3 console to the destination bucket and prefix.
- Confirm the new prefixes for archive files and manifest files.
ec2-user@ip v2% aws s3 ls s3://your-dst-bucket/day20250123/
PRE archives/
PRE manifests/
- Confirm the tarfiles under the archives/ prefix.
ec2-user@ip v2% aws s3 ls s3://your-dst-bucket/day20250123/archives/
2025-01-13 17:06:21 10895360 archive_20250113_080619_0001.tar
2025-01-13 17:06:21 10885120 archive_20250113_080619_0002.tar
2025-01-13 17:06:20 10905600 archive_20250113_080619_0003.tar
2025-01-13 17:06:21 10895360 archive_20250113_080619_0004.tar
2025-01-13 17:06:21 10895360 archive_20250113_080619_0005.tar
2025-01-13 17:06:22 10905600 archive_20250113_080619_0006.tar
2025-01-13 17:06:22 10885120 archive_20250113_080619_0007.tar
2025-01-13 17:06:22 10915840 archive_20250113_080619_0008.tar
2025-01-13 17:06:22 10885120 archive_20250113_080619_0009.tar
2025-01-13 17:06:22 10885120 archive_20250113_080619_0010.tar
2025-01-13 17:06:23 10946560 archive_20250113_080619_0011.tar
2025-01-13 17:06:23 10905600 archive_20250113_080619_0012.tar
2025-01-13 17:06:23 10926080 archive_20250113_080619_0013.tar
2025-01-13 17:06:23 10885120 archive_20250113_080619_0014.tar
2025-01-13 17:06:23 10874880 archive_20250113_080619_0015.tar
2025-01-13 17:06:24 10915840 archive_20250113_080619_0016.tar
2025-01-13 17:06:24 10905600 archive_20250113_080619_0017.tar
2025-01-13 17:06:24 10895360 archive_20250113_080619_0018.tar
2025-01-13 17:06:24 10895360 archive_20250113_080619_0019.tar
2025-01-13 17:06:24 4935680 archive_20250113_080619_0020.tar
- Confirm the manifest files under the manifests/ prefix.
ec2-user@ip v2% aws s3 ls s3://your-dst-bucket/day20250123/manifests/
2025-01-13 17:06:21 84191 manifest_20250113_080619_0001.csv
2025-01-13 17:06:21 82740 manifest_20250113_080619_0002.csv
2025-01-13 17:06:21 82755 manifest_20250113_080619_0003.csv
2025-01-13 17:06:21 83878 manifest_20250113_080619_0004.csv
2025-01-13 17:06:21 82590 manifest_20250113_080619_0005.csv
2025-01-13 17:06:23 86792 manifest_20250113_080619_0006.csv
2025-01-13 17:06:23 80970 manifest_20250113_080619_0007.csv
2025-01-13 17:06:23 85648 manifest_20250113_080619_0008.csv
2025-01-13 17:06:22 81118 manifest_20250113_080619_0009.csv
2025-01-13 17:06:23 81145 manifest_20250113_080619_0010.csv
2025-01-13 17:06:24 89525 manifest_20250113_080619_0011.csv
2025-01-13 17:06:24 84372 manifest_20250113_080619_0012.csv
2025-01-13 17:06:24 86457 manifest_20250113_080619_0013.csv
2025-01-13 17:06:24 81787 manifest_20250113_080619_0014.csv
2025-01-13 17:06:24 82103 manifest_20250113_080619_0015.csv
2025-01-13 17:06:24 84048 manifest_20250113_080619_0016.csv
2025-01-13 17:06:25 85186 manifest_20250113_080619_0017.csv
2025-01-13 17:06:25 82892 manifest_20250113_080619_0018.csv
2025-01-13 17:06:25 82963 manifest_20250113_080619_0019.csv
2025-01-13 17:06:24 38798 manifest_20250113_080619_0020.csv
Whenever the archiving program runs, it generates a logfile under the logs/{dst_prefix} directory. In these logs, you can find errors and result messages.
- bucket name: the destination bucket.
- prefix: the prefix in the destination bucket that contains the manifest files.
- Enter the search type and value, then click the "Search" button. The search feature supports two types: by name and by date range.
- Confirm the query results.
- In the "Select file lists" pane, click the item you would like to restore; the required parameters will then be filled in automatically when you select the Restore menu in the sidebar.
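Conceptually, a search by name scans the manifest CSVs under the given prefix for matching rows. A minimal boto3 sketch, using the `subset_file` column from the example row earlier (that column name is an assumption, not the application's confirmed format):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

def search_by_name(bucket: str, prefix: str, filename: str):
    """Scan manifest CSVs under a prefix and yield rows whose subset file matches."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".csv"):
                continue
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for row in csv.DictReader(io.StringIO(body.decode("utf-8"))):
                # "subset_file" is an assumed column name; adjust to the real manifest.
                if filename in row.get("subset_file", ""):
                    yield obj["Key"], row
```

Each matching row carries the tarfile name and the byte offsets that the Restore step needs.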
- Select "Restore" from the sidebar.
- Enter the required parameters and click the "Restore" button to start restoring. You can find required parameters from Search result.
- Enter the bucket name: bucket name where the archived files is stored.
- Enter the tar name (tar file): tarfile which contains a file which you're looking for.
- Enter the start byte: start byte position in tarfile
- Enter the stop byte: end byte position in tarfile
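Under the hood, the restore amounts to a single ranged GET against the tarfile object. A minimal sketch assuming boto3, with hypothetical argument values (the real application writes restored files under the restored_data directory, as shown below):

```python
import boto3

s3 = boto3.client("s3")

def restore_range(bucket: str, tar_key: str, start_byte: int,
                  stop_byte: int, out_path: str) -> None:
    """Fetch only the requested byte range of a tarfile object from S3 (sketch)."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=tar_key,
        Range=f"bytes={start_byte}-{stop_byte}",  # HTTP Range header; offsets are inclusive
    )
    with open(out_path, "wb") as f:
        f.write(resp["Body"].read())

# Hypothetical usage, with values copied from a search result:
# restore_range("your-dst-bucket",
#               "day20250123/archives/archive_20250113_080619_0001.tar",
#               5120, 6144, "restored_data/index.html")
```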
- Access the workstation where Streamlit is running.
- Find the restored file in the $git_repo/v2/restored_data directory.
[ec2-user@ip v2]$ ls -l restored_data/d0001/dir0018/index.html
-rw-rw-r-- 1 ec2-user ec2-user 44 Dec 30 2019 restored_data/d0001/dir0018/index.html
Instead of the Streamlit application, you can use the shell scripts.
[ec2-user@ip v2]$ ls -l shell
total 12
-rw-r--r-- 1 ec2-user ec2-user 1830 Jan 10 10:04 run-archiver.sh
-rw-rw-r-- 1 ec2-user ec2-user 209 Dec 27 01:30 run-restore.sh
-rw-rw-r-- 1 ec2-user ec2-user 385 Dec 27 01:31 run-search.sh
To be updated soon
The small file archiving solution is built to provide an efficient way of archiving small files on Amazon S3. Combining small files into large TAR files helps customers reduce PUT request cost and monitoring cost, and storing data in Amazon S3 Intelligent-Tiering helps customers save storage cost, especially for long-term archive data. With this solution, customers can search for and restore a specific file whenever they need to retrieve it.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.