Archive

Work with archive files directly from a DAG step without relying on shell utilities. The executor is built on top of github.com/mholt/archives and streams data for efficiency.

Supported Formats

Archive Formats

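| Format | Extension | Read | Write | Password Support | Notes |
| --- | --- | --- | --- | --- | --- |
| ZIP | .zip | Yes | Yes | No | Full read/write support |
| TAR | .tar | Yes | Yes | No | Full read/write support |
| RAR | .rar | Yes | No | Yes (read) | Read-only; extraction with password |
| 7-Zip | .7z | Yes | No | Yes (read) | Read-only; extraction with password |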

Compression Formats (Single File)

| Format | Extension | Compression Level | Notes |
| --- | --- | --- | --- |
| GZIP | .gz | 0-9 (default: -1) | Configurable compression |
| Bzip2 | .bz2 | 0-9 (default: -1) | Configurable compression |
| XZ | .xz | Fixed | High compression ratio |
| Zstandard | .zst, .zstd | Fixed | Fast with good compression |
| LZ4 | .lz4 | Fixed | Very fast, lower ratio |

Combined Formats (Archive + Compression)

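| Format | Extensions | Read | Write | Compression Level |
| --- | --- | --- | --- | --- |
| TAR+GZIP | .tar.gz, .tgz | Yes | Yes | 0-9 (default: -1) |
| TAR+Bzip2 | .tar.bz2, .tbz2, .tbz | Yes | Yes | 0-9 (default: -1) |
| TAR+XZ | .tar.xz, .txz | Yes | Yes | Fixed |
| TAR+Zstandard | .tar.zst, .tar.zstd | Yes | Yes | Fixed |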
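
As a minimal sketch of how a configurable level might be applied to a combined format, the step below creates a .tgz archive at the highest gzip level. It uses only the create command and the compression_level field described under Configuration; the paths ./reports and reports.tgz are placeholder names chosen for illustration.

```yaml
steps:
  - name: package-reports
    type: archive
    config:
      source: ./reports          # placeholder source directory
      destination: reports.tgz   # .tgz is listed above as a TAR+GZIP extension
      compression_level: 9       # 1-9 = level; -1 keeps the default
    command: create
```

For the formats marked Fixed (XZ, Zstandard), the setting does not apply, so it can simply be left out.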

Format Detection

The executor determines the archive format from:

  1. File extension - Recognizes all standard extensions (.tar.gz, .zip, etc.)
  2. Magic bytes - Examines file headers when the extension is ambiguous
  3. Explicit configuration - The format field overrides auto-detection when needed
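
The explicit override in the list above can be sketched as follows. The file name upload.bin is a hypothetical example of an archive whose extension does not reveal its format; the format value zip comes from the values listed for the format field under Configuration.

```yaml
steps:
  - name: unpack-upload
    type: archive
    config:
      source: upload.bin     # hypothetical file with an uninformative extension
      destination: ./upload
      format: zip            # explicit override instead of auto-detection
    command: extract
```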

Supported Operations

| Command | Description |
| --- | --- |
| extract | Unpack an archive into a directory |
| create | Create an archive from files/folders |
| list | Enumerate entries in an archive |

Quick Start

```yaml
steps:
  - name: unpack
    type: archive
    config:
      source: logs.tar.gz
      destination: ./logs
    command: extract

  - name: package
    type: archive
    config:
      source: ./logs
      destination: logs-backup.tar.gz
    command: create

  - name: inspect
    type: archive
    config:
      source: logs-backup.tar.gz
    command: list
    output: ARCHIVE_INDEX
```

extract and create emit a JSON summary (files processed, bytes, duration, etc.) on stdout. list emits a JSON document that includes a files array of entries, so subsequent steps can filter or inspect the archive with tools like jq.

Configuration

| Field | Description | Type | Default | Notes |
| --- | --- | --- | --- | --- |
| source | Input archive or directory | string | required | Path to archive file (extract/list) or source directory (create) |
| destination | Output directory or archive path | string | . (extract) | Target directory (extract) or output archive path (create); optional for list |
| format | Archive format override | string | auto-detect | Explicit format: zip, tar, tar.gz, tar.bz2, tar.xz, tar.zst, 7z, rar, etc. |
| compression_level | Compression level | int | -1 | -1 = default, 0 = none, 1-9 = level; applies to gzip and bzip2 only |
| overwrite | Replace existing files | bool | false | When false, extraction fails if destination file exists |
| strip_components | Strip leading path segments | int | 0 | Remove N leading directories from paths (like tar --strip-components=N) |
| preserve_paths | Preserve full paths | bool | true | When false, only extracts the basename of each file |
| include | Include glob patterns | []string | all files | Only process files matching these patterns (e.g., **/*.csv) |
| exclude | Exclude glob patterns | []string | none | Skip files matching these patterns (applied after include) |
| follow_symlinks | Follow symlinks when creating | bool | false | When true, dereferences symlinks; when false, preserves them |
| verify_integrity | Verify archive after operation | bool | false | Performs full read pass to validate archive integrity |
| continue_on_error | Continue on individual file errors | bool | false | Logs errors but continues processing remaining files |
| dry_run | Simulate operation | bool | false | Calculate metrics without writing files to disk |
| password | Archive password | string | none | Extraction only for password-protected 7z and rar archives |
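
Several of these fields can be combined in one step. The sketch below previews an extraction without writing anything to disk, using only fields from the table above; the archive name, destination, and glob patterns are illustrative placeholders, not values the executor requires.

```yaml
steps:
  - name: preview-extract
    type: archive
    config:
      source: dataset.tar.gz       # placeholder archive name
      destination: ./data
      include:
        - "**/*.csv"               # only consider CSV files
      exclude:
        - "**/*_backup.csv"        # hypothetical pattern; applied after include
      continue_on_error: true      # log per-file errors and keep going
      dry_run: true                # calculate metrics without writing files
    command: extract
```

Setting dry_run back to false (or removing it) would turn the same step into a real extraction.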

All fields support environment interpolation (${VAR}) and outputs from previous steps.
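
A minimal sketch of interpolation in config fields, assuming two environment variables, BUILD_DIR and RELEASE_NAME, are defined elsewhere; both names are hypothetical and chosen only for illustration.

```yaml
steps:
  - name: package-release
    type: archive
    config:
      source: ${BUILD_DIR}                   # hypothetical variable: directory to archive
      destination: ${RELEASE_NAME}.tar.gz    # hypothetical variable: archive base name
    command: create
```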

Additional Examples

Selective Extraction

```yaml
working_dir: /data/pipeline

steps:
  - name: extract-csv
    type: archive
    config:
      source: dataset.tar.zst
      destination: ./data
      include:
        - "**/*.csv"
      strip_components: 1
    command: extract
```

Create Archive With Verification

```yaml
working_dir: /deploy/release

steps:
  - name: bundle-artifacts
    type: archive
    config:
      source: ./dist
      destination: dist.tar.gz
      format: tar.gz
      verify_integrity: true
    command: create
```

Extract Password-Protected 7z (Read-Only)

```yaml
working_dir: /data/decrypted

secrets:
  - name: ARCHIVE_PASSWORD
    provider: env
    key: ARCHIVE_PASSWORD

steps:
  - name: unpack-secure
    type: archive
    config:
      source: secure-data.7z
      destination: ./decrypted
      password: ${ARCHIVE_PASSWORD}
      include:
        - "**/*.csv"
      overwrite: true
    command: extract
```

Important: Password protection is read-only. You can extract password-protected 7z and rar archives, but creating encrypted archives is not supported.

Security Features

The executor implements security protections against malicious archives:

  • Path traversal prevention - Rejects archives with entries escaping the destination directory
  • Symlink validation - Blocks symlinks with absolute targets or paths escaping the destination
  • Safe path handling - Validates all extracted paths before writing files

These protections defend against "zip slip" and similar archive-based attacks.

Limitations

| Format or Feature | Limitation |
| --- | --- |
| RAR | Read-only; cannot create RAR archives |
| 7-Zip | Read-only; cannot create 7z archives |
| Password Protection | Extraction only; cannot create encrypted archives |
| Compression Levels | Only GZIP and Bzip2 support configurable levels (0-9) |

Output Format

Extract and Create Operations

Both extract and create commands output JSON to stdout with operation metrics:

```json
{
  "operation": "extract",
  "source": "logs.tar.gz",
  "destination": "./logs",
  "filesExtracted": 1523,
  "bytesExtracted": 45829384,
  "filesSkipped": 0,
  "duration": "1.234s",
  "verifyPerformed": false,
  "errors": []
}
```

List Operation

The list command outputs a JSON object containing summary fields and a files array with one entry per archive member:

```json
{
  "operation": "list",
  "source": "logs.tar.gz",
  "totalFiles": 1523,
  "totalSize": 45829384,
  "verified": false,
  "duration": "0.123s",
  "files": [
    {
      "path": "logs/app.log",
      "size": 12345,
      "mode": "-rw-r--r--",
      "modTime": "2025-11-02T12:34:56Z",
      "isDir": false
    }
  ]
}
```

Released under the MIT License.