Support for Archives(zip,tar) #34
Changes from 12 commits
75badd8
ff59ec8
0a6eeca
a52ae4f
e735358
1e654d2
073193a
9e0cf2b
7a94408
7c136db
8496ea5
297abdb
c3a74fa
829ac05
018786a
309c1dc
5b8246a
972476b
f3d603c
d6d4331
8430e13
782ab23
e2370c0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| Albatross | ||
| Acorn Woodpecker | ||
| American Kestrel | ||
| Anna's Hummingbird | ||
| Bald Eagle | ||
| Baltimore Oriole | ||
| Barn Swallow | ||
| Belted Kingfisher | ||
| Bicolored Antbird | ||
| Black Capped Chickadee | ||
| Black Skimmer | ||
| Blue Jay | ||
| Bluebird | ||
| Bobolink | ||
| Bohemian Waxwing | ||
| Brown Creeper | ||
| Brown Pelican | ||
| Burrowing Owl | ||
| California Condor | ||
| California Quail | ||
| Canada Goose | ||
| Cardinal | ||
| Caspian Tern | ||
| Cedar Waxwing | ||
| Chestnut Sided Warbler | ||
| Chimney Swift | ||
| Chipping Sparrow | ||
| Clark's Nutcracker | ||
| Clay Colored Sparrow | ||
| Cliff Swallow | ||
| Columbiformes | ||
| Common Eider | ||
| Common Goldeneye | ||
| Common Grackle | ||
| Common Loon | ||
| Common Merganser | ||
| Common Raven | ||
| Common Tern | ||
| Common Yellowthroat | ||
| Coopers Hawk | ||
| Cory's Shearwater | ||
| Crested Flycatcher | ||
| Curve Billed Thrasher | ||
| Dark Eyed Junco | ||
| Dickcissel | ||
| Dovekie | ||
| Downy Woodpecker | ||
| Drab Seedeater | ||
| Dunnock | ||
| Eastern Bluebird | ||
| Eastern Meadowlark | ||
| Eastern Phoebe | ||
| Eastern Screech Owl | ||
| Eastern Towhee | ||
| Eastern Wood Pewee | ||
| Eared Grebe | ||
| Egyptian Plover | ||
| Elanus leucurus | ||
| Evening Grosbeak | ||
| Eared Quetzal | ||
| Eurasian Wigeon | ||
| European Starling | ||
| Fabulous Flamingo | ||
| Ferruginous Hawk | ||
| Fiscal Flycatcher | ||
| Flammulated Owl | ||
| Flatbill | ||
| Flesh Footed Shearwater | ||
| Florida Jay | ||
| Fringilla coelebs | ||
| Fulmar | ||
| Gadwall | ||
| Gambel's Quail | ||
| Gannet | ||
| Garden Warbler | ||
| Gnatcatcher | ||
| Godwit | ||
| Golden Eagle | ||
| Golden Winged Warbler | ||
| Goldeneye | ||
| Goldfinch | ||
| Goosander | ||
| Goshawk | ||
| Grace's Warbler | ||
| Grasshopper Sparrow | ||
| Gray Catbird | ||
| Great Black Backed Gull | ||
| Great Blue Heron | ||
| Great Crested Flycatcher | ||
| Great Horned Owl | ||
| Great Kiskadee | ||
| Great Spotted Woodpecker | ||
| Great Tit | ||
| Grebe | ||
| Greenbul | ||
| Green Heron | ||
| Green Tailed Towhee | ||
| Green Winged Teal | ||
| Greenlet | ||
| Grey Kingbird | ||
| Grey Owl | ||
| Grosbeaks | ||
| Grouse | ||
| Gull | ||
| Hairy Woodpecker | ||
| Hammond's Flycatcher | ||
| Harris Hawk | ||
| Harris Sparrow | ||
| Hawaiian Creeper | ||
| Hawaiian Goose | ||
| Hawfinch | ||
| Heathland Francolin | ||
| Herring Gull | ||
| Hoary Puffleg | ||
| Hooded Merganser | ||
| Hooded Oriole | ||
| Hooded Warbler | ||
| Hoopoe | ||
| Horned Auk | ||
| Horned Grebe | ||
| Horned Lark | ||
| House Finch | ||
| House Sparrow | ||
| House Wren | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,239 @@ | ||||||
| # Archive Task | ||||||
|
|
||||||
| The `archive` task packs and unpacks file data in various archive formats (TAR, ZIP), enabling efficient data packaging and extraction within pipelines. | ||||||
|
||||||
|
|
||||||
| ## Function | ||||||
|
|
||||||
| The archive task handles two primary operations: | ||||||
| - **Pack**: Creates an archive from input data (e.g., a ZIP or TAR file) | ||||||
| - **Unpack**: Extracts an archive to retrieve its individual files | ||||||
|
||||||
|
|
||||||
| ## Behavior | ||||||
|
|
||||||
| The archive task operates in two modes depending on the specified action: | ||||||
|
|
||||||
| - **Pack mode** (`action: pack`): Takes input data records and creates an archive file. Each record's data is packaged into the specified archive format with the configured filename. The task outputs the complete archive data. | ||||||
|
|
||||||
| - **Unpack mode** (`action: unpack`): Takes archive file data as input and extracts individual files. For each file found in the archive, the task outputs a separate record containing that file's data. | ||||||
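
The two modes described above can be sketched with Go's `archive/zip` package. This is only an illustration under stated assumptions: the `Entry` type and the `packZip` and `unpackZip` functions are hypothetical names, not the task's actual code, and each entry carries its own name here, whereas the task takes a single configured `file_name`.

```go
package main

import (
	"archive/zip"
	"bytes"
	"io"
)

// Entry is a hypothetical stand-in for one pipeline record: a name plus its bytes.
type Entry struct {
	Name string
	Data []byte
}

// packZip writes every entry into a single in-memory ZIP archive (pack mode).
func packZip(entries []Entry) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, e := range entries {
		w, err := zw.Create(e.Name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(e.Data); err != nil {
			return nil, err
		}
	}
	// Close writes the central directory; the archive is unreadable without it.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackZip returns one Entry per regular file in the archive (unpack mode).
// Directory entries are skipped.
func unpackZip(archive []byte) ([]Entry, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	var out []Entry
	for _, f := range zr.File {
		if !f.FileInfo().Mode().IsRegular() {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		out = append(out, Entry{Name: f.Name, Data: data})
	}
	return out, nil
}
```

Packing and then unpacking the same entries should return them unchanged, which is the round trip the multi-step pipeline example exercises.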
|
|
||||||
| ## Configuration Fields | ||||||
|
|
||||||
| | Field | Type | Default | Description | | ||||||
| |-------|------|---------|-------------| | ||||||
| | `name` | string | - | Task name for identification | | ||||||
| | `type` | string | `archive` | Must be "archive" | | ||||||
| | `format` | string | `zip` | Archive format: `zip` or `tar` | | ||||||
| | `action` | string | `pack` | Operation to perform: `pack` or `unpack` | | ||||||
| | `file_name` | string | - | Name of the file within the archive (required for `pack` action) | | ||||||
|
|
||||||
| ### File Name Format | ||||||
|
|
||||||
| The `file_name` field specifies how files are stored within archives. Different formats have specific requirements: | ||||||
|
|
||||||
| #### ZIP Archives | ||||||
| - **Paths**: Filenames can include directory paths, represented using forward slashes (/) | ||||||
| - Example: `docs/readme.txt` | ||||||
| - **Separators**: Only forward slashes (/) are allowed as folder separators, regardless of platform | ||||||
| - **Relative Paths**: Filenames must be relative (no drive letters like C: and no leading slash /) | ||||||
| - **Case Sensitivity**: ZIP stores filenames as is, but whether they are case-sensitive depends on the extraction platform | ||||||
| - **Directories**: End directory names with a trailing slash (/) to indicate a folder | ||||||
| - **Duplicates**: Duplicate names are allowed, but may cause confusion for some zip tools | ||||||
| - **Allowed Characters**: Supports Unicode, but stick to common, portable characters for best compatibility | ||||||
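
The ZIP naming rules above (forward slashes only, relative paths, no drive letters) could be checked before packing. The following is a minimal sketch, not the task's actual validation; `validateZipName` is a hypothetical helper introduced for illustration.

```go
package main

import (
	"errors"
	"strings"
)

// isLetter reports whether c is an ASCII letter.
func isLetter(c byte) bool {
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

// validateZipName is a hypothetical check for the ZIP file_name rules:
// non-empty, forward slashes only, no leading slash, no drive letter.
func validateZipName(name string) error {
	switch {
	case name == "":
		return errors.New("file name is empty")
	case strings.Contains(name, `\`):
		return errors.New("file name uses a backslash; only '/' is allowed as a separator")
	case strings.HasPrefix(name, "/"):
		return errors.New("file name must be relative (no leading slash)")
	case len(name) >= 2 && isLetter(name[0]) && name[1] == ':':
		return errors.New("file name must be relative (no drive letter)")
	}
	return nil
}
```

Under these rules `docs/readme.txt` and `docs/` pass, while `/docs/readme.txt`, `docs\readme.txt`, and `C:/readme.txt` are rejected.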
|
|
||||||
| #### TAR Archives | ||||||
| - **Paths**: Filenames can include paths separated by forward slashes (/) | ||||||
| - Example: `src/main.c` | ||||||
| - **Relative and Absolute Paths**: Both relative (foo.txt) and absolute paths (/foo.txt) can technically be stored, but using relative paths is strongly recommended for portability and to avoid extraction issues | ||||||
| - **Case Sensitivity**: Tar files store names as is; case sensitivity depends on the underlying filesystem | ||||||
| - **Long Paths**: Traditional tar limits path length to 100 bytes for the filename, but modern tar formats (ustar, pax) allow longer names | ||||||
| - **Directories**: Represented as entries ending in a slash (/) | ||||||
| - **Duplicates**: Duplicate filenames are possible; later entries usually overwrite earlier ones on extraction | ||||||
| - **Allowed Characters**: Generally supports any characters, but best practice is to stick to ASCII (letters, digits, underscores, dashes, periods, slashes) for maximum compatibility | ||||||
|
|
||||||
| ## Supported Formats | ||||||
|
|
||||||
| ### ZIP | ||||||
| - **Extension**: `.zip` | ||||||
| - **Use case**: Cross-platform, widely supported compression format | ||||||
| - **Features**: Per-file compression, preserves file structure | ||||||
| - **Pack**: Creates a ZIP archive with single or multiple files | ||||||
| - **Unpack**: Extracts all regular files from ZIP archive | ||||||
|
|
||||||
| ### TAR | ||||||
| - **Extension**: `.tar` or `.tar.gz` | ||||||
| - **Use case**: Unix/Linux native format, streaming support | ||||||
| - **Features**: Preserves file metadata, supports compression (with gzip) | ||||||
| - **Pack**: Creates a TAR archive with file metadata | ||||||
| - **Unpack**: Extracts all regular files from TAR archive (including gzip-compressed) | ||||||
|
|
||||||
| ## Example Configurations | ||||||
|
|
||||||
| ### Pack a file into ZIP | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: create_zip | ||||||
|     type: archive | ||||||
|     format: zip | ||||||
|     action: pack | ||||||
|     file_name: output.txt | ||||||
| ``` | ||||||
|
|
||||||
| ### Unpack ZIP archive | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: unpack_zip | ||||||
|     type: archive | ||||||
|     format: zip | ||||||
|     action: unpack | ||||||
| ``` | ||||||
|
|
||||||
| ### Pack a file into TAR | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: create_tar | ||||||
|     type: archive | ||||||
|     format: tar | ||||||
|     action: pack | ||||||
|     file_name: data.txt | ||||||
| ``` | ||||||
|
|
||||||
| ### Unpack TAR.GZ archive | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: unpack_tar_gz | ||||||
|     type: archive | ||||||
|     format: tar | ||||||
|     action: unpack | ||||||
| ``` | ||||||
|
|
||||||
| ## Complete Pipeline Examples | ||||||
|
|
||||||
| ### Read files, pack to ZIP, write to file | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: read_source | ||||||
|     type: file | ||||||
|     path: source/*.txt | ||||||
|
|
||||||
|   - name: pack_to_zip | ||||||
|     type: archive | ||||||
|     format: zip | ||||||
|     action: pack | ||||||
|     file_name: archive.txt | ||||||
|
|
||||||
|   - name: write_archive | ||||||
|     type: file | ||||||
|     path: output/archive.zip | ||||||
| ``` | ||||||
|
|
||||||
| ### Extract TAR.GZ and write individual files | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: read_archive | ||||||
|     type: file | ||||||
|     path: data.tar.gz | ||||||
|
|
||||||
|   - name: decompress_file | ||||||
|     type: compress | ||||||
|     format: gzip | ||||||
|     action: decompress | ||||||
|
|
||||||
|   - name: unpack_files | ||||||
|     type: archive | ||||||
|     format: tar | ||||||
|     action: unpack | ||||||
|
|
||||||
|   - name: write_unpacked | ||||||
|     type: file | ||||||
|     path: /output/data.txt | ||||||
| ``` | ||||||
|
|
||||||
| ### Multi-step packing pipeline | ||||||
| ```yaml | ||||||
| tasks: | ||||||
|   - name: read_data | ||||||
|     type: file | ||||||
|     path: test/pipelines/birds.txt | ||||||
|
|
||||||
|   - name: pack_zip | ||||||
|     type: archive | ||||||
|     format: zip | ||||||
|     action: pack | ||||||
|     file_name: birds.zip | ||||||
|
|
||||||
|   - name: unpack_zip | ||||||
|     type: archive | ||||||
|     format: zip | ||||||
|     action: unpack | ||||||
|
|
||||||
|   - name: write_result | ||||||
|     type: file | ||||||
|     path: unpacked_birds/birds.txt | ||||||
| ``` | ||||||
|
|
||||||
| ## Data Flow | ||||||
|
|
||||||
| ### Pack Operation | ||||||
| ``` | ||||||
| Input Records | ||||||
| ↓ | ||||||
| [Record Data] → Archive Creation → [Archive Bytes] → Output | ||||||
| ``` | ||||||
|
|
||||||
| ### Unpack Operation | ||||||
| ``` | ||||||
| Input Records | ||||||
| ↓ | ||||||
| [Archive Bytes] → File Extraction → [File 1], [File 2], ... → Output | ||||||
| ``` | ||||||
|
|
||||||
| ## Use Cases | ||||||
|
|
||||||
| - **Data packaging**: Bundle multiple files into a single archive | ||||||
| - **Data extraction**: Process archived data within pipelines | ||||||
| - **Archive conversion**: Convert between ZIP and TAR formats | ||||||
| - **Backup workflows**: Create and manage compressed backups | ||||||
| - **Data distribution**: Package files for downstream consumption | ||||||
|
|
||||||
| ## Error Handling | ||||||
|
|
||||||
| - **Missing file_name**: Returns an error if `file_name` is not specified for the `pack` action | ||||||
| - **Invalid format**: Returns an error if `format` is not `zip` or `tar` | ||||||
| - **Invalid action**: Returns an error if `action` is not `pack` or `unpack` | ||||||
| - **Corrupt archive**: May return an error when unpacking malformed archives | ||||||
| - **Empty data**: Skips processing of empty records | ||||||
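
The configuration checks listed above could look roughly like the sketch below. It assumes the defaults from the Configuration Fields table (`zip` and `pack`); the `ArchiveConfig` struct and its `Validate` method are hypothetical names chosen for illustration, not the task's actual types.

```go
package main

import "fmt"

// ArchiveConfig is a hypothetical mirror of the documented configuration fields.
type ArchiveConfig struct {
	Name     string
	Format   string
	Action   string
	FileName string
}

// Validate applies the documented defaults and rejects invalid combinations.
func (c *ArchiveConfig) Validate() error {
	if c.Format == "" {
		c.Format = "zip" // documented default
	}
	if c.Action == "" {
		c.Action = "pack" // documented default
	}
	if c.Format != "zip" && c.Format != "tar" {
		return fmt.Errorf("invalid format %q: must be zip or tar", c.Format)
	}
	if c.Action != "pack" && c.Action != "unpack" {
		return fmt.Errorf("invalid action %q: must be pack or unpack", c.Action)
	}
	if c.Action == "pack" && c.FileName == "" {
		return fmt.Errorf("file_name is required for the pack action")
	}
	return nil
}
```

Validating once at startup, before any record is processed, keeps a misconfigured pipeline from failing partway through a run.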
|
|
||||||
| ## Technical Details | ||||||
|
|
||||||
| ### ZIP Format | ||||||
| - Uses Go's `archive/zip` package | ||||||
| - Supports standard ZIP compression | ||||||
| - Preserves file metadata (size, modification time) | ||||||
| - Regular files only (directories not included in unpacking) | ||||||
|
|
||||||
| ### TAR Format | ||||||
| - Uses Go's `archive/tar` package | ||||||
| - Supports raw TAR and gzip-compressed TAR files | ||||||
| - Automatic format detection for gzip compression | ||||||
| - Preserves tar header information | ||||||
| - Regular files only (directories and special files filtered) | ||||||
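
One common way to implement the gzip detection and regular-file filtering listed above is to check for the gzip magic bytes (`0x1f 0x8b`) before handing the stream to `archive/tar`. The sketch below assumes that approach; the source does not confirm how the task detects gzip, and `TarFile`, `packTar`, `gzipBytes`, and `unpackTar` are hypothetical names.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"io"
)

// TarFile is a hypothetical stand-in for one extracted file.
type TarFile struct {
	Name string
	Data []byte
}

// packTar writes the files into an uncompressed in-memory TAR archive.
func packTar(files []TarFile) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, f := range files {
		hdr := &tar.Header{Name: f.Name, Mode: 0o644, Size: int64(len(f.Data)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(f.Data); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gzipBytes compresses data with gzip, producing .tar.gz-style input.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	if _, err := gw.Write(data); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackTar extracts regular files from raw or gzip-compressed TAR data.
func unpackTar(data []byte) ([]TarFile, error) {
	var r io.Reader = bytes.NewReader(data)
	// Gzip streams start with the magic bytes 0x1f 0x8b.
	if len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b {
		gz, err := gzip.NewReader(r)
		if err != nil {
			return nil, err
		}
		defer gz.Close()
		r = gz
	}
	tr := tar.NewReader(r)
	var out []TarFile
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, links, and other special entries
		}
		content, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		out = append(out, TarFile{Name: hdr.Name, Data: content})
	}
}
```

Like the task itself, this sketch holds the whole archive in memory, so the memory caveat under Performance Considerations applies to it as well.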
|
|
||||||
| ## Performance Considerations | ||||||
|
|
||||||
| - **Memory usage**: Entire archive loaded into memory for processing | ||||||
| - **Compression ratio**: TAR alone only bundles files and does not compress them; ZIP compresses each entry, so ZIP output is typically smaller than an uncompressed TAR | ||||||
| - **Processing speed**: TAR is generally faster than ZIP due to simpler format | ||||||
| - **Large files**: For very large archives, consider chunking or streaming approaches | ||||||
|
|
||||||
| ## Sample Pipelines | ||||||
|
|
||||||
| - `test/pipelines/zip_pack_test.yaml` - Create ZIP archives | ||||||
| - `test/pipelines/zip_unpack_test.yaml` - Extract from ZIP archives | ||||||
| - `test/pipelines/tar_unpack_multifile_test.yaml` - Extract multiple files from TAR archive | ||||||
|
|
||||||
| ## Security Considerations | ||||||
|
|
||||||
| - Archives are processed in-memory; ensure sufficient memory for large files | ||||||
| - ZIP bomb protection: Be cautious with untrusted archive sources | ||||||
| - Path traversal: Archive extraction validates file paths to prevent escaping base directory | ||||||
|
||||||
| - Path traversal: Archive extraction validates file paths to prevent escaping base directory | |
| - Path traversal: Archive extraction does not currently normalize or validate file paths; if extracted data is written to disk by callers or future changes, they must implement their own validation to prevent directory traversal (e.g., `../`) from escaping the intended base directory |
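
For callers that do write extracted entries to disk, a check along the following lines is one way to keep an entry name such as `../evil.txt` from escaping the target directory. This is a sketch under that assumption; `safeJoin` is a hypothetical helper and is not part of the archive task.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins an archive entry name onto baseDir and rejects any result
// that would resolve outside baseDir.
func safeJoin(baseDir, entryName string) (string, error) {
	base := filepath.Clean(baseDir)
	// Archive names use forward slashes; convert them to the OS separator.
	target := filepath.Join(base, filepath.FromSlash(entryName))
	rel, err := filepath.Rel(base, target)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("entry %q escapes base directory %q", entryName, baseDir)
	}
	return target, nil
}
```

Because `filepath.Join` cleans the combined path, a name like `a/../b.txt` resolves inside the base directory and is accepted, while `a/../../evil.txt` is rejected.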
This file appears to be a duplicate of test/pipelines/birds.txt and is located at the repository root, which is likely unintentional. Test data files should be placed in the test/pipelines directory along with the related test YAML files. This file should either be removed or moved to the appropriate test directory.