
feat: Measure execution status duration#174

Open
morgan-wowk wants to merge 1 commit into master from task-status-duration

Conversation

@morgan-wowk
Collaborator

@morgan-wowk morgan-wowk commented Mar 18, 2026

Changes

  • Adds status_updated_at column to ExecutionNode table to track when execution status last changed
  • Implements SQLAlchemy event listener to automatically update status_updated_at timestamp when container_execution_status changes
  • Creates execution_status_transition_duration histogram metric to measure time spent in each execution status
  • Adds _transition_execution_status() helper function to centralize status updates and metric recording across all status transitions
  • Implements database migration logic to add the new status_updated_at column to existing tables
  • Replaces direct status assignments throughout the orchestrator with calls to the new transition helper function
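The listener described above can be sketched roughly as follows. This is a minimal stand-in under stated assumptions: the model and column names (`ExecutionNode`, `container_execution_status`, `status_updated_at`) come from the PR description, but the model definition and listener body are illustrative, not the actual orchestrator code.

```python
import datetime

import sqlalchemy
from sqlalchemy import event, orm

Base = orm.declarative_base()


class ExecutionNode(Base):
    """Minimal stand-in for the real ExecutionNode model."""

    __tablename__ = "execution_node"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    container_execution_status = sqlalchemy.Column(sqlalchemy.String)
    # New nullable column tracking when the status last changed.
    status_updated_at = sqlalchemy.Column(sqlalchemy.DateTime, nullable=True)


@event.listens_for(ExecutionNode.container_execution_status, "set")
def _touch_status_updated_at(target, value, oldvalue, initiator):
    # Fires on every assignment to container_execution_status;
    # refresh the timestamp only when the status actually changes.
    if value != oldvalue:
        target.status_updated_at = datetime.datetime.now(datetime.timezone.utc)
```

Because the listener runs on attribute assignment, both columns land in the same flush, which is what keeps the steady-state DB cost at zero (see the load-impact note below).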

Show of work

Note: Attribute names have since changed to execution_status_ prefix

image.png

Local smoke-test and verification completed ✅

Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@morgan-wowk morgan-wowk marked this pull request as ready for review March 18, 2026 04:45
@morgan-wowk morgan-wowk requested a review from Ark-kun as a code owner March 18, 2026 04:45
Collaborator Author

Load Impact: feat: Measure execution status duration

The commit adds zero additional database calls during steady-state operation. status_updated_at is written as part of the existing UPDATE statement that already commits container_execution_status — SQLAlchemy batches both columns into one flush. Reading the previous status and timestamp before each transition is always an in-memory access from the already-loaded SQLAlchemy instance. The only one-time database cost is a single SELECT on every process restart to check for the new column, plus a single ALTER TABLE on first deployment. On the metrics side, record_status_transition appends to an in-memory histogram bucket and returns immediately; the OTel SDK exports observations to the collector in background batches, adding one additional metric stream with no synchronous I/O on the hot path.

| Area | Impact |
| --- | --- |
| DB calls at startup | +1 SELECT per restart; +1 ALTER TABLE on first deploy only |
| DB calls per transition | None — status_updated_at piggybacks on the existing UPDATE |
| DB calls for reading previous status | None — in-memory reads from the live SQLAlchemy instance |
| CPU per transition | Negligible — one datetime.now() + one histogram bucket increment |
| Memory | Negligible — one additional datetime field per loaded ExecutionNode |
| Network (OTel export) | One additional histogram metric stream, exported in background batches |

@morgan-wowk morgan-wowk requested a review from Volv-G March 18, 2026 05:03
Comment on lines +58 to +77

def _add_columns_if_missing(db_engine: sqlalchemy.Engine) -> None:
    """Add new nullable columns to existing tables via ALTER TABLE when they are
    not yet present. SQLAlchemy's create_all() only creates missing tables, not
    missing columns, so new columns require an explicit migration step."""
    _COLUMN_MIGRATIONS: list[tuple[str, str, str]] = [
        # (table_name, column_name, column_definition)
        ("execution_node", "status_updated_at", "DATETIME"),
    ]
    inspector = sqlalchemy.inspect(db_engine)
    for table_name, column_name, column_def in _COLUMN_MIGRATIONS:
        existing_columns = {col["name"] for col in inspector.get_columns(table_name)}
        if column_name not in existing_columns:
            with db_engine.connect() as conn:
                conn.execute(
                    sqlalchemy.text(
                        f"ALTER TABLE {table_name} ADD COLUMN {column_name} {column_def}"
                    )
                )
                conn.commit()
Collaborator

Add * to the function signature to make the arguments keyword-only; good style even with just one parameter.

Collaborator Author

Done — added * to _add_columns_if_missing.

    missing columns, so new columns require an explicit migration step."""
    _COLUMN_MIGRATIONS: list[tuple[str, str, str]] = [
        # (table_name, column_name, column_definition)
        ("execution_node", "status_updated_at", "DATETIME"),
Collaborator

Let's derive these instead of hard-coding them:

_COLUMN_MIGRATIONS = [
    bts.ExecutionNode.__table__.c.status_updated_at,
]

# Example
col = bts.ExecutionNode.__table__.c.status_updated_at
table_name = col.table.name    # "execution_node"
column_name = col.name         # "status_updated_at"
column_type = col.type         # DateTime()

Collaborator Author

Done — _COLUMN_MIGRATIONS now holds bts.ExecutionNode.__table__.c.status_updated_at and derives the table name, column name, and type from it.

Comment on lines +72 to +76
conn.execute(
    sqlalchemy.text(
        f"ALTER TABLE {table_name} ADD COLUMN {column_name} {column_def}"
    )
)
Collaborator

Let's use DDL for this and avoid sqlalchemy.text if possible, for type safety and portability (which is why we're using SQLAlchemy in the first place).

from sqlalchemy import Column, DateTime
from sqlalchemy.schema import AddColumn

col_obj = bts.ExecutionNode.__table__.c.status_updated_at
with db_engine.connect() as conn:
    conn.execute(AddColumn("execution_node", col_obj.copy()))
    conn.commit()

Collaborator Author

sqlalchemy.schema.AddColumn isn't available in our SQLAlchemy version, so I kept sqlalchemy.text but compile the type string from the model via col.type.compile(dialect=db_engine.dialect) to stay dialect-portable.
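That compile-from-the-model approach can be sketched like this. The table and column names follow the PR; the standalone `Table` definition and in-memory SQLite engine are assumptions made so the sketch is self-contained.

```python
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")
metadata = sqlalchemy.MetaData()
execution_node = sqlalchemy.Table(
    "execution_node",
    metadata,
    sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column("status_updated_at", sqlalchemy.DateTime, nullable=True),
)

col = execution_node.c.status_updated_at
# Render the column's type for the engine's dialect instead of
# hard-coding a type string like "DATETIME".
type_sql = col.type.compile(dialect=engine.dialect)
ddl = f"ALTER TABLE {col.table.name} ADD COLUMN {col.name} {type_sql}"
```

Swapping the engine's dialect (e.g. PostgreSQL vs. SQLite) changes `type_sql` without touching the migration list, which is the portability the review is after.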

return db_engine


def _add_columns_if_missing(db_engine: sqlalchemy.Engine) -> None:
Collaborator

From my backfill lessons learned, I would suggest:

  • add logs so we know whether the migration is happening, and surface any errors
  • a single transaction, even if you only have one column, for future-proofing; otherwise remove the for loop for now so the code aligns

Merging all my suggestions into something like this:

    _COLUMN_MIGRATIONS = [
        bts.ExecutionNode.__table__.c.status_updated_at,
    ]
    inspector = sqlalchemy.inspect(db_engine)
    with db_engine.connect() as conn:
        for col in _COLUMN_MIGRATIONS:
            existing = {c["name"] for c in inspector.get_columns(col.table.name)}
            if col.name not in existing:
                _logger.info(
                    f"Migrating: ALTER TABLE {col.table.name} ADD COLUMN {col.name} ({col.type})"
                )
                try:
                    conn.execute(sqlalchemy.schema.AddColumn(col.table.name, col.copy()))
                except sqlalchemy.exc.OperationalError:
                    _logger.info(
                        f"Column {col.table.name}.{col.name} already exists (concurrent migration)"
                    )
            else:
                _logger.info(
                    f"Column {col.table.name}.{col.name} already exists — skipping"
                )
        conn.commit()

Collaborator Author

Done — adopted your suggestion almost verbatim: single with db_engine.connect() wrapping the loop, an info log before each ALTER TABLE, and a catch for OperationalError on concurrent migration

Comment on lines +1122 to +1128
The caller is always responsible for supplying prev_status and
prev_status_updated_at. Typically these are read directly from the live
instance immediately before the call. At call sites that cross a
session.rollback() boundary, the values should be captured before the
rollback while the instance is still live, as rollback expires all loaded
instances and reading an expired attribute would trigger an unintended
lazy-load SELECT.
Collaborator

Curious, for my own learning: what does "rollback will expire ..." mean? Can you give me some examples? I'm not following the comment.

Collaborator Author

Expanded the docstring with a concrete explanation: rollback() marks every attribute on every loaded instance as "expired", so reading prev_status or prev_status_updated_at after a rollback would silently fire a fresh SELECT instead of returning the in-memory value.
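A minimal demonstration of that expiry behavior, using a toy model against in-memory SQLite (the model here is illustrative, not the real ExecutionNode):

```python
import sqlalchemy
from sqlalchemy import orm

Base = orm.declarative_base()


class Node(Base):
    __tablename__ = "node"
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    status = sqlalchemy.Column(sqlalchemy.String)


engine = sqlalchemy.create_engine("sqlite://")
Base.metadata.create_all(engine)

session = orm.Session(engine)
node = Node(status="RUNNING")
session.add(node)
session.commit()

prev_status = node.status    # capture while the instance is live (in-memory read)
session.rollback()           # expires every attribute on every loaded instance
state = sqlalchemy.inspect(node)
expired_after_rollback = "status" in state.expired_attributes
# Reading node.status at this point would silently emit a fresh SELECT
# to re-load the expired attribute — exactly the lazy-load the docstring warns about.
```

Capturing `prev_status` before the rollback is the pattern the docstring prescribes for call sites that cross a rollback boundary.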

_get_current_time() - prev_status_updated_at
).total_seconds(),
)
execution.container_execution_status = new_status
Collaborator

this is really cool!

Curious, this should handle ALL node status transitions, right? Asking because I might want to piggyback off this function in the future for doing status search in the search API.

Collaborator Author

Yes!

execution_status_transition_duration = orchestrator_meter.create_histogram(
    name="execution.status_transition.duration",
    description="Duration an execution spent in a status before transitioning to the next status",
    unit="s",
Collaborator

Can this be a StrEnum, so we know all the units that are possible?

Collaborator Author

Done — added MetricUnit(str, enum.Enum) with SECONDS = "s" and ERRORS = "{error}", used for both instruments.
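The enum described in the reply would look roughly like this (the member names and values come from the reply; the docstring is added for illustration):

```python
import enum


class MetricUnit(str, enum.Enum):
    """Closed set of OTel instrument units used by the orchestrator.

    Subclassing str lets members compare equal to, and serialize as,
    plain strings wherever the SDK expects a unit string.
    """

    SECONDS = "s"
    ERRORS = "{error}"
```

At the instrument definition site this would read `unit=MetricUnit.SECONDS` instead of a bare `"s"`, so the set of valid units is discoverable in one place.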

@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from c1a2020 to a53502b Compare March 20, 2026 18:50
@morgan-wowk morgan-wowk requested a review from yuechao-qin March 20, 2026 18:57
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from a53502b to 959b2f6 Compare March 20, 2026 19:00
**Changes:**

* Adds histogram measurement for execution node status duration without adding additional database load to the system
@morgan-wowk morgan-wowk force-pushed the task-status-duration branch from 959b2f6 to 949e41c Compare March 20, 2026 19:10