Merged
23 commits
75badd8
Support for ZIP file compression and decompression.
ma-gk Jan 28, 2026
ff59ec8
Merge remote-tracking branch 'origin/main' into zip_support
ma-gk Jan 28, 2026
0a6eeca
Merge remote-tracking branch 'origin' into zip_support
ma-gk Jan 29, 2026
a52ae4f
Support for tar and zip archive
ma-gk Jan 29, 2026
e735358
refactor: remove zip format support from compression handlers
ma-gk Jan 29, 2026
1e654d2
feat: add zip packing and unpacking test pipelines
ma-gk Jan 29, 2026
073193a
fix: correct tar archive writing logic and buffer handling
ma-gk Jan 30, 2026
9e0cf2b
test file for the tar with multi file output
ma-gk Jan 30, 2026
7a94408
feat: add comprehensive README for archive task with ZIP and TAR support
ma-gk Jan 30, 2026
7c136db
feat: add birds file and update zip pack/unpack test configurations
ma-gk Jan 30, 2026
8496ea5
refactor: rename extraction tasks for clarity and consistency in README
ma-gk Jan 30, 2026
297abdb
refactor: rename action types for clarity in archiving process
ma-gk Jan 30, 2026
c3a74fa
fix: correct error handling in tar archive read function and improve …
ma-gk Jan 30, 2026
829ac05
fix: improve error handling in zip archive read function
ma-gk Jan 30, 2026
018786a
removed duplicate file
ma-gk Jan 30, 2026
309c1dc
multi file support with proper naming conventions
ma-gk Jan 30, 2026
5b8246a
Refactored code used map instead of switch case
ma-gk Feb 2, 2026
972476b
refactor: replace string literals with context keys for file path han…
ma-gk Feb 4, 2026
f3d603c
Merge branch 'main' into zip_support
Mayureshpawar29 Feb 4, 2026
d6d4331
fix: update log message for empty filepath in context
ma-gk Feb 5, 2026
8430e13
Merge remote-tracking branch 'origin' into zip_support
ma-gk Feb 5, 2026
782ab23
Merge branch 'zip_support' of ssh://github.com/patterninc/caterpillar…
ma-gk Feb 5, 2026
e2370c0
refactor: rename context keys for file path handling in archive and f…
ma-gk Feb 5, 2026
124 changes: 124 additions & 0 deletions birds_file.txt
@@ -0,0 +1,124 @@
Albatross
Acorn Woodpecker
American Kestrel
Anna's Hummingbird
Bald Eagle
Baltimore Oriole
Barn Swallow
Belted Kingfisher
Bicolored Antbird
Black Capped Chickadee
Black Skimmer
Blue Jay
Bluebird
Bobolink
Bohemian Waxwing
Brown Creeper
Brown Pelican
Burrowing Owl
California Condor
California Quail
Canada Goose
Cardinal
Caspian Tern
Cedar Waxwing
Chestnut Sided Warbler
Chimney Swift
Chipping Sparrow
Clark's Nutcracker
Clay Colored Sparrow
Cliff Swallow
Columbiformes
Common Eider
Common Goldeneye
Common Grackle
Common Loon
Common Merganser
Common Raven
Common Tern
Common Yellowthroat
Coopers Hawk
Cory's Shearwater
Crested Flycatcher
Curve Billed Thrasher
Dark Eyed Junco
Dickcissel
Dovekie
Downy Woodpecker
Drab Seedeater
Dunnock
Eastern Bluebird
Eastern Meadowlark
Eastern Phoebe
Eastern Screech Owl
Eastern Towhee
Eastern Wood Pewee
Eared Grebe
Egyptian Plover
Elanus leucurus
Evening Grosbeak
Eared Quetzal
Eurasian Wigeon
European Starling
Fabulous Flamingo
Ferruginous Hawk
Fiscal Flycatcher
Flammulated Owl
Flatbill
Flesh Footed Shearwater
Florida Jay
Fringilla coelebs
Fulmar
Gadwall
Gambel's Quail
Gannet
Garden Warbler
Gnatcatcher
Godwit
Golden Eagle
Golden Winged Warbler
Goldeneye
Goldfinch
Goosander
Goshawk
Grace's Warbler
Grasshopper Sparrow
Gray Catbird
Great Black Backed Gull
Great Blue Heron
Great Crested Flycatcher
Great Horned Owl
Great Kiskadee
Great Spotted Woodpecker
Great Tit
Grebe
Greenbul
Green Heron
Green Tailed Towhee
Green Winged Teal
Greenlet
Grey Kingbird
Grey Owl
Grosbeaks
Grouse
Gull
Hairy Woodpecker
Hammond's Flycatcher
Harris Hawk
Harris Sparrow
Hawaiian Creeper
Hawaiian Goose
Hawfinch
Heathland Francolin
Herring Gull
Hoary Puffleg
Hooded Merganser
Hooded Oriole
Hooded Warbler
Hoopoe
Horned Auk
Horned Grebe
Horned Lark
House Finch
House Sparrow
House Wren
Copilot AI Jan 30, 2026

This file appears to be a duplicate of test/pipelines/birds.txt and is located at the repository root, which is likely unintentional. Test data files should be placed in the test/pipelines directory along with the related test YAML files. This file should either be removed or moved to the appropriate test directory.

Suggested change
# This file previously contained duplicate bird test data.
# The authoritative test data file now lives at: test/pipelines/birds.txt
# This stub is kept only to avoid reintroducing the duplicate by accident.

239 changes: 239 additions & 0 deletions internal/pkg/pipeline/task/archive/README.md
@@ -0,0 +1,239 @@
# Archive Task

The `archive` task packs and unpacks file data in archive formats (TAR, ZIP), enabling data packaging and extraction within pipelines.

## Function

The archive task handles two primary operations:
- **Pack**: Creates an archive from input data (e.g., a ZIP or TAR file)
- **Unpack**: Extracts an archive to retrieve its individual files

## Behavior

The archive task operates in two modes depending on the specified action:

- **Pack mode** (`action: pack`): Takes input data records and creates an archive file. Each record's data is packaged into the specified archive format with the configured filename. The task outputs the complete archive data.

- **Unpack mode** (`action: unpack`): Takes archive file data as input and extracts individual files. For each file found in the archive, the task outputs a separate record containing that file's data.
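
The two modes can be illustrated with a minimal in-memory round trip using Go's `archive/zip` package. This is only a sketch under stated assumptions: the helper names `packZip` and `unpackZip` are hypothetical and are not taken from the task's actual implementation.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// packZip (hypothetical helper) writes one entry named fileName into an
// in-memory ZIP archive and returns the archive bytes.
func packZip(fileName string, data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create(fileName)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close writes the central directory; the archive is unreadable without it.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpackZip (hypothetical helper) returns the contents of every regular
// file in the archive, keyed by entry name. Directory entries are skipped.
func unpackZip(archive []byte) (map[string][]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	files := make(map[string][]byte)
	for _, f := range zr.File {
		if !f.FileInfo().Mode().IsRegular() {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		content, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		files[f.Name] = content
	}
	return files, nil
}

func main() {
	archive, err := packZip("birds.txt", []byte("Albatross\nBald Eagle\n"))
	if err != nil {
		panic(err)
	}
	files, err := unpackZip(archive)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d file(s) extracted, birds.txt holds %d bytes\n", len(files), len(files["birds.txt"]))
}
```

In this sketch one input record becomes one archive entry on pack, and each regular entry becomes one output record on unpack, which mirrors the behavior described above.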

## Configuration Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | string | - | Task name for identification |
| `type` | string | `archive` | Must be "archive" |
| `format` | string | `zip` | Archive format: `zip` or `tar` |
| `action` | string | `pack` | Operation to perform: `pack` or `unpack` |
| `file_name` | string | - | Name of the file within the archive (required for `pack` action) |

### File Name Format

The `file_name` field specifies how files are stored within archives. Different formats have specific requirements:

#### ZIP Archives
- **Paths**: Filenames can include directory paths, represented using forward slashes (/)
  - Example: `docs/readme.txt`
- **Separators**: Only forward slashes (/) are allowed as folder separators, regardless of platform
- **Relative Paths**: Filenames must be relative (no drive letters like C: and no leading slash /)
- **Case Sensitivity**: ZIP stores filenames as is, but whether they are case-sensitive depends on the extraction platform
- **Directories**: End directory names with a trailing slash (/) to indicate a folder
- **Duplicates**: Duplicate names are allowed, but may cause confusion for some zip tools
- **Allowed Characters**: Supports Unicode, but stick to common, portable characters for best compatibility
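
The ZIP naming rules above could be checked before packing with a small validator. The function `validateZipName` below is a hypothetical illustration, not part of the archive task, and it only covers the separator and relative-path rules.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isASCIILetter reports whether b is in A-Z or a-z.
func isASCIILetter(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// validateZipName (hypothetical helper) rejects names that break the ZIP
// rules listed above: backslash separators, a leading slash, or a drive letter.
func validateZipName(name string) error {
	switch {
	case name == "":
		return errors.New("file name is empty")
	case strings.Contains(name, `\`):
		return errors.New("only forward slashes are allowed as separators")
	case strings.HasPrefix(name, "/"):
		return errors.New("file name must be relative (leading slash found)")
	case len(name) >= 2 && name[1] == ':' && isASCIILetter(name[0]):
		return errors.New("file name must be relative (drive letter found)")
	}
	return nil
}

func main() {
	for _, name := range []string{"docs/readme.txt", `docs\readme.txt`, "/readme.txt", "C:/readme.txt"} {
		fmt.Printf("%q -> %v\n", name, validateZipName(name))
	}
}
```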

#### TAR Archives
- **Paths**: Filenames can include paths separated by forward slashes (/)
  - Example: `src/main.c`
- **Relative and Absolute Paths**: Both relative (foo.txt) and absolute paths (/foo.txt) can technically be stored, but using relative paths is strongly recommended for portability and to avoid extraction issues
- **Case Sensitivity**: Tar files store names as is; case sensitivity depends on the underlying filesystem
- **Long Paths**: Traditional tar limits path length to 100 bytes for the filename, but modern tar formats (ustar, pax) allow longer names
- **Directories**: Represented as entries ending in a slash (/)
- **Duplicates**: Duplicate filenames are possible; later entries usually overwrite earlier ones on extraction
- **Allowed Characters**: Generally supports any characters, but best practice is to stick to ASCII (letters, digits, underscores, dashes, periods, slashes) for maximum compatibility

## Supported Formats

### ZIP
- **Extension**: `.zip`
- **Use case**: Cross-platform, widely supported compression format
- **Features**: Compresses each file individually, preserves file structure
- **Pack**: Creates a ZIP archive with single or multiple files
- **Unpack**: Extracts all regular files from ZIP archive

### TAR
- **Extension**: `.tar` or `.tar.gz`
- **Use case**: Unix/Linux native format, streaming support
- **Features**: Preserves file metadata, supports compression when combined with gzip
- **Pack**: Creates a TAR archive with file metadata
- **Unpack**: Extracts all regular files from TAR archive (including gzip-compressed)

## Example Configurations

### Pack a file into ZIP
```yaml
tasks:
  - name: create_zip
    type: archive
    format: zip
    action: pack
    file_name: output.txt
```

### Unpack ZIP archive
```yaml
tasks:
  - name: unpack_zip
    type: archive
    format: zip
    action: unpack
```

### Pack a file into TAR
```yaml
tasks:
  - name: create_tar
    type: archive
    format: tar
    action: pack
    file_name: data.txt
```

### Unpack TAR.GZ archive
```yaml
tasks:
  - name: unpack_tar_gz
    type: archive
    format: tar
    action: unpack
```

## Complete Pipeline Examples

### Read files, pack to ZIP, write to file
```yaml
tasks:
  - name: read_source
    type: file
    path: source/*.txt

  - name: pack_to_zip
    type: archive
    format: zip
    action: pack
    file_name: archive.txt

  - name: write_archive
    type: file
    path: output/archive.zip
```

### Extract TAR.GZ and write individual files
```yaml
tasks:
  - name: read_archive
    type: file
    path: data.tar.gz

  - name: decompress_file
    type: compress
    format: gzip
    action: decompress

  - name: unpack_files
    type: archive
    format: tar
    action: unpack

  - name: write_unpacked
    type: file
    path: /output/data.txt
```

### Multi-step packing pipeline
```yaml
tasks:
  - name: read_data
    type: file
    path: test/pipelines/birds.txt

  - name: pack_zip
    type: archive
    format: zip
    action: pack
    file_name: birds.zip

  - name: unpack_zip
    type: archive
    format: zip
    action: unpack

  - name: write_result
    type: file
    path: unpacked_birds/birds.txt
```

## Data Flow

### Pack Operation
```
Input Records
[Record Data] → Archive Creation → [Archive Bytes] → Output
```

### Unpack Operation
```
Input Records
[Archive Bytes] → File Extraction → [File 1], [File 2], ... → Output
```

## Use Cases

- **Data packaging**: Bundle multiple files into a single archive
- **Data extraction**: Process archived data within pipelines
- **Archive conversion**: Convert between ZIP and TAR formats
- **Backup workflows**: Create and manage compressed backups
- **Data distribution**: Package files for downstream consumption

## Error Handling

- **Missing file_name**: Returns an error if `file_name` is not specified for the `pack` action
- **Invalid format**: Returns an error if `format` is not `zip` or `tar`
- **Invalid action**: Returns an error if `action` is not `pack` or `unpack`
- **Corrupt archive**: May return an error when unpacking a malformed archive
- **Empty data**: Skips processing of empty records
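
The configuration checks listed above might look roughly like the following. The `Config` struct, its field names, and the `Validate` method are assumptions made for illustration; the defaults follow the Configuration Fields table, and the map lookups echo the "map instead of switch case" refactor mentioned in the commit history without claiming to reproduce it.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the archive task's settings.
type Config struct {
	Format   string
	Action   string
	FileName string
}

// Lookup tables for the accepted values.
var (
	validFormats = map[string]bool{"zip": true, "tar": true}
	validActions = map[string]bool{"pack": true, "unpack": true}
)

// withDefaults fills in the documented defaults (format zip, action pack).
func (c Config) withDefaults() Config {
	if c.Format == "" {
		c.Format = "zip"
	}
	if c.Action == "" {
		c.Action = "pack"
	}
	return c
}

// Validate applies the defaults and then reports the first rule that fails.
func (c Config) Validate() error {
	c = c.withDefaults()
	if !validFormats[c.Format] {
		return fmt.Errorf("invalid format %q: must be zip or tar", c.Format)
	}
	if !validActions[c.Action] {
		return fmt.Errorf("invalid action %q: must be pack or unpack", c.Action)
	}
	if c.Action == "pack" && c.FileName == "" {
		return errors.New("file_name is required for the pack action")
	}
	return nil
}

func main() {
	fmt.Println(Config{Format: "zip", Action: "pack", FileName: "output.txt"}.Validate())
	fmt.Println(Config{Format: "rar", Action: "pack", FileName: "output.txt"}.Validate())
	fmt.Println(Config{Format: "tar", Action: "pack"}.Validate())
}
```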

## Technical Details

### ZIP Format
- Uses Go's `archive/zip` package
- Supports standard ZIP compression
- Preserves file metadata (size, modification time)
- Regular files only (directories not included in unpacking)

### TAR Format
- Uses Go's `archive/tar` package
- Supports raw TAR and gzip-compressed TAR files
- Automatic format detection for gzip compression
- Preserves tar header information
- Regular files only (directories and special files filtered)
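
One common way to detect gzip compression automatically is to check for the two-byte gzip magic number (`0x1f 0x8b`) before choosing a reader. The sketch below assumes that approach; whether the task's `tar.go` does exactly this is not confirmed by this README, and the helper names are hypothetical.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
)

// isGzip reports whether data starts with the gzip magic number.
func isGzip(data []byte) bool {
	return len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

// openTar (hypothetical helper) returns a tar reader over data,
// transparently unwrapping gzip when the magic number is present.
func openTar(data []byte) (*tar.Reader, error) {
	if isGzip(data) {
		gz, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			return nil, err
		}
		return tar.NewReader(gz), nil
	}
	return tar.NewReader(bytes.NewReader(data)), nil
}

// buildTar creates a one-entry TAR archive for the demonstration.
func buildTar(name string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content)), Typeflag: tar.TypeReg}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(content); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gzipBytes compresses data with gzip.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	raw, err := buildTar("birds.txt", []byte("Albatross\n"))
	if err != nil {
		panic(err)
	}
	compressed, err := gzipBytes(raw)
	if err != nil {
		panic(err)
	}
	for _, input := range [][]byte{raw, compressed} {
		tr, err := openTar(input)
		if err != nil {
			panic(err)
		}
		hdr, err := tr.Next()
		if err != nil {
			panic(err)
		}
		fmt.Printf("gzip=%v first entry=%s\n", isGzip(input), hdr.Name)
	}
}
```

A plain TAR begins with the first entry's name, so the magic-number check would only misfire on an archive whose first name starts with those two bytes, which is unlikely in practice.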

## Performance Considerations

- **Memory usage**: Entire archive loaded into memory for processing
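- **Compression ratio**: TAR itself does not compress data, so a ZIP archive is typically smaller than a plain TAR; a TAR must be combined with gzip to get comparable savings
- **Processing speed**: TAR is generally faster than ZIP because its format is simpler and it does no compression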
- **Large files**: For very large archives, consider chunking or streaming approaches

## Sample Pipelines

- `test/pipelines/zip_pack_test.yaml` - Create ZIP archives
- `test/pipelines/zip_unpack_test.yaml` - Extract from ZIP archives
- `test/pipelines/tar_unpack_multifile_test.yaml` - Extract multiple files from TAR archive

## Security Considerations

- Archives are processed in-memory; ensure sufficient memory for large files
- ZIP bomb protection: Be cautious with untrusted archive sources
- Path traversal: Archive extraction validates file paths to prevent escaping base directory
Copilot AI Jan 30, 2026

The documentation claims "Archive extraction validates file paths to prevent escaping base directory" but no such validation is implemented in the code. Both tar.go and zip.go extract files without checking for path traversal attacks (e.g., filenames containing "../"). While the current implementation doesn't write to disk, if this changes in the future, it could create a security vulnerability. Either implement the validation or update the documentation to accurately reflect the current behavior.

Suggested change
- Path traversal: Archive extraction validates file paths to prevent escaping base directory
- Path traversal: Archive extraction does not currently normalize or validate file paths; if extracted data is written to disk by callers or future changes, they must implement their own validation to prevent directory traversal (e.g., `../`) from escaping the intended base directory

- File permissions: TAR format supports Unix permissions; ZIP has limited permission support