@tobilg commented Nov 8, 2025

Add NDJSON Logging Format with Optimized Structured Logging

Summary

Adds NDJSON (Newline Delimited JSON) as a new logging format alongside CSV. It implements direct Value→JSON conversion for efficient structured logging, adds a helper function to reduce boilerplate, and maintains 100% backward compatibility.

Key Features

1. NDJSON Format Support

  • New NDJSONLogStorage class for NDJSON output format
  • Configurable via SET log_storage = 'ndjson'
  • NDJSON profiler output via SET profiler_format = 'ndjson'
  • Each log entry is a complete JSON object on a single line
  • Machine-readable format ideal for log processing pipelines

2. Direct Value→JSON Conversion

Added WriteLogEntryStructured() API to LogStorage hierarchy:

  • Base class provides string fallback (CSV compatibility)
  • NDJSONLogStorage converts DuckDB Value objects directly to JSON
  • Eliminates unnecessary Value→String→JSON round-trip
  • ValueToJSON() helper handles STRUCT, LIST, MAP, and all primitive types

3. Structured Message Constructors

Added ConstructLogMessageValue() methods to all log types:

  • PhysicalOperatorLogType - Operator execution details
  • CheckpointLogType - Checkpoint and vacuum operations
  • HTTPLogType - HTTP request/response data
  • FileSystemLogType - Filesystem operations
  • MetricsLogType - Query metrics

4. HTTPFS Stats in JSON/NDJSON Profiling (Infrastructure)

Added infrastructure for structured HTTPFS statistics in JSON/NDJSON profiler output:

  • New WriteProfilingInformationToJSON() virtual method in ClientContextState base class
  • Allows extensions to contribute structured metrics to JSON/NDJSON profiling output
  • Maintains 100% backward compatibility with text-based profiling formats (query_tree, EXPLAIN ANALYZE)

Note: The HTTPFS extension implementation is in a separate PR: duckdb/duckdb-httpfs#158

Once the httpfs PR merges, the structured metrics will include:

  • total_bytes_received - Total bytes downloaded via HTTP(S)
  • total_bytes_sent - Total bytes uploaded via HTTP(S)
  • head_count - Number of HTTP HEAD requests
  • get_count - Number of HTTP GET requests
  • put_count - Number of HTTP PUT requests
  • post_count - Number of HTTP POST requests
  • delete_count - Number of HTTP DELETE requests

Benefits:

  • Machine-readable HTTP statistics in profiling output
  • Enables programmatic analysis of remote file access patterns
  • Extensible pattern for other extensions to provide JSON profiling metrics

Implementation Details

Call Sites Updated (6 locations)

Migrated to the direct structured API via the LogStructured helper:

  • PhysicalHashJoin (2 sites) - Finalize and Repartition events
  • JoinHashTable (1 site) - Build event
  • RowGroupCollection (2 sites) - Vacuum and Checkpoint tasks
  • HTTPUtil (1 site) - Request logging

Backward Compatibility

100% backward compatible:

  • CSV format unchanged, uses existing string-based logging
  • Polymorphic design: format-specific implementation via virtual dispatch
  • All existing DUCKDB_LOG macro calls work unchanged
  • Default WriteLogEntryStructured() falls back to .ToString() for CSV

Dual-Path Architecture

  1. Direct Structured API (PhysicalOperator, Checkpoint, HTTP logs)

    • Direct Value→JSON conversion
    • Most efficient path: Value → JSON → NDJSON
  2. Buffered Path with JSON Parsing (FileSystem, Metrics logs)

    • Existing macro-based logging continues to work
    • Enhanced FlushChunk() parses JSON strings via yyjson
    • Path: Value → ConstructLogMessage() → JSON string → yyjson parse → NDJSON

Testing

All tests pass (274 total assertions):

  • logging_ndjson.test - 83 assertions (NEW)
  • logging_csv.test - 182 assertions
  • test_enable_profile.test - 9 assertions

Coverage includes all log types (FileSystem, Query, Metrics, HTTP, PhysicalOperator, Checkpoint) in both formats.

New Files

  • test/sql/logging/logging_ndjson.test - Comprehensive NDJSON test suite

Core Infrastructure

  • src/include/duckdb/logging/log_storage.hpp - WriteLogEntryStructured API
  • src/logging/log_storage.cpp - ValueToJSON implementation, NDJSONLogStorage
  • src/include/duckdb/logging/log_manager.hpp - LogStructured helper
  • src/logging/log_manager.cpp - Helper implementation
  • src/include/duckdb/logging/log_type.hpp - ConstructLogMessageValue declarations
  • src/logging/log_types.cpp - ConstructLogMessageValue implementations
  • src/include/duckdb/logging/logger.hpp - GetLogManager() accessor

Call Sites

  • src/execution/operator/join/physical_hash_join.cpp
  • src/execution/join_hashtable.cpp
  • src/storage/table/row_group_collection.cpp
  • src/main/http/http_util.cpp

Profiling Support

  • src/main/query_profiler.cpp - NDJSON format support, extension stats integration
  • src/include/duckdb/common/enums/profiler_format.hpp - NDJSON enum
  • src/main/settings/custom_settings.cpp - Settings integration
  • src/include/duckdb/main/client_context_state.hpp - WriteProfilingInformationToJSON API for extensions

Usage

Enable NDJSON Logging

```sql
-- Enable logging with NDJSON format to file
CALL enable_logging(
    storage='file',
    storage_format='ndjson',
    storage_path='/tmp/duckdb.ndjson'
);

-- Or use the shorthand (storage_path implies storage='file')
CALL enable_logging(
    storage_format='ndjson',
    storage_path='/tmp/duckdb.ndjson'
);

-- Run queries - all logs are written to /tmp/duckdb.ndjson in NDJSON format
SELECT * FROM my_table;

-- Disable logging
CALL disable_logging();

-- Note: the NDJSON format can only be configured via the enable_logging() function;
-- there is no SET storage_format command.
```

NDJSON Profiler Output

```sql
SET enable_profiling = 'ndjson';
SELECT * FROM 'https://example.com/data.parquet';
```

JSON Profiler Output with HTTPFS Stats

```sql
SET enable_profiling = 'json';
SELECT * FROM 'https://example.com/data.parquet';
```

Example NDJSON Logging Output

```json
{"timestamp":"2025-01-15T10:30:45.123456","level":"INFO","type":"physical_operator","message":{"operator":"PhysicalHashJoin","event":"Finalize","external":"false"},"scope":"THREAD","connection_id":1,"transaction_id":2,"query_id":3}
{"timestamp":"2025-01-15T10:30:45.234567","level":"INFO","type":"checkpoint","message":{"database":"main","table":"my_table","task":"vacuum","segment_idx":0,"merge_count":3},"scope":"DATABASE"}
{"timestamp":"2025-01-15T10:30:45.345678","level":"INFO","type":"http","message":{"method":"GET","url":"https://example.com/data","status_code":200,"bytes":1024},"scope":"DATABASE"}
```

Example JSON Profiler Output with httpfs extension stats

Note: The example below shows httpfs_stats which will be available after duckdb-httpfs#158 merges. The infrastructure to support this is included in this PR.

```json
{
    "query_name": "SELECT * FROM 'https://example.com/data.parquet' LIMIT 1;",
    "latency": 0.059454625,
    "rows_returned": 1,
    "total_bytes_read": 10308,
    "httpfs_stats": {
        "total_bytes_received": 10308,
        "total_bytes_sent": 0,
        "head_count": 1,
        "get_count": 1,
        "put_count": 0,
        "post_count": 0,
        "delete_count": 0
    },
    "children": [
        {
            "operator_type": "STREAMING_LIMIT",
            "operator_timing": 0.000001,
            "children": [
                {
                    "operator_type": "TABLE_SCAN",
                    "operator_name": "PARQUET_SCAN",
                    "operator_timing": 0.000051,
                    "extra_info": {
                        "Function": "PARQUET_SCAN",
                        "Total Files Read": "1"
                    }
                }
            ]
        }
    ]
}
```
