Skip to content

Improve logging in Arrow Flight SQL code paths #270

@jghoman

Description

@jghoman

Summary

The Flight SQL data plane has essentially no server-side logging. Infrastructure logging (worker lifecycle, shutdown) is solid, but queries, errors, auth events, and session lifecycle in the Flight path leave no trace.

Critical Gaps

No query execution logging (High)

Neither duckdbservice/flight_handler.go, server/flight_executor.go, nor server/flightsqlingress/ingress.go logs any query at any level. A query that fails, times out, or produces unexpected results leaves no server-side trace — only the gRPC error code is visible on the client.

Affected locations:

  • flight_handler.go:226-232 (DoGetStatement)
  • flight_handler.go:273-287 (DoPutCommandStatementUpdate)
  • flight_executor.go:163 (QueryContext)
  • flight_executor.go:212 (ExecContext)
  • ingress.go:398-430 (DoGetStatement)
  • ingress.go:448 (DoPutCommandStatementUpdate)

RowsAffected error silently converted to 0 (High — correctness bug)

// flight_handler.go:283-285
affected, err := result.RowsAffected()
if err != nil {
    return 0, nil  // silently reports 0 rows affected
}

A DML statement could mutate rows, fail to report the count, and the client sees UPDATE 0 with no error.

DestroySession RPC failure silently dropped (High)

session_mgr.go:128_ = worker.DestroySession(ctx, session.SessionToken) — an RPC call on a 5-second timeout that can fail; failure is never logged, making session cleanup leaks impossible to diagnose.

Medium Priority Gaps

Flight auth success/failure never logged

ingress.go:290-295authenticateBasicCredentials returns a gRPC error on failure but never calls slog. Auth success is also silent. Compare: the PG path logs both at control.go:510,527.

Dead worker discovery bypasses crash logging

worker_mgr.go:687-695cleanDeadWorkersLocked discovers dead workers during AcquireWorker and calls cleanupDeadWorker in a goroutine, bypassing the slog.Warn("Worker crashed.") path at line 925.

No per-query debug logging (text, latency, row count)

Prometheus counters exist but there's no per-query observability at any log level across the entire Flight SQL data plane.

Low Priority Gaps

  • Flight ingress session create/destroy/expiry never logged (ingress.go)
  • ~15+ _ = resource.Close() silent drops across service.go, flight_handler.go, session_mgr.go, worker_mgr.go — resource close failures silently swallowed

What's Already Well-Covered

  • Worker management (spawn, health check, retire, kill, crash)
  • Control plane PG connections (auth, disconnect)
  • Shutdown sequences (graceful drain, handover, signals)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions