
Fix intermittent loss of final large markdown in print_to_web websocket flow #100

Merged

Yazawazi merged 3 commits into develop from fix/websocket-print-to-web-reliability on Feb 21, 2026

Conversation

@forrestbao
Member

Summary

This PR hardens print_to_web websocket response handling so large final markdown payloads are not intermittently lost when the socket closes immediately after the last send.

Why this fix

We observed a rare production symptom: after a long Step 3 (e.g., "[Step 3] Generating answer... Time: 88s"), logs appeared in the output panel but the final markdown result was occasionally missing.

The issue is high-impact for long-running LLM calls: users can wait minutes and still receive no final answer.

How we identified the cause

Investigation signals

  • The missing output occurs after long execution and with large final markdown payloads.
  • The Step 3 timing line still appears, so function execution reaches the end.
  • Behavior is rare/intermittent, matching transport/close timing issues.

Code-path findings

  • print_to_web writes output frames via websocket and then closes immediately.
  • Global sys.stdout/sys.stderr replacement was restored only on the success path, not guaranteed on errors.
  • No explicit end-of-stream protocol marker existed for frontend/backend coordination.
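The second finding above — stdout restored only on the success path — can be sketched as follows. The function names here are illustrative, not the actual funix code; in the repo the restore lives in output_to_web_function:

```python
import io
import sys

def output_to_web_unsafe(fn):
    """Pre-fix pattern: stdout is swapped but restored only on success."""
    original = sys.stdout
    sys.stdout = io.StringIO()  # capture prints for the websocket
    result = fn()               # an exception here skips the restore below
    sys.stdout = original
    return result

def output_to_web_safe(fn):
    """Fixed pattern: the restore is guaranteed by a finally block."""
    original = sys.stdout
    sys.stdout = io.StringIO()
    try:
        return fn()
    finally:
        sys.stdout = original
```

If fn raises in the unsafe variant, sys.stdout stays replaced for the rest of the process, silently swallowing all subsequent output.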

A/B validation

A deterministic regression test reproduces the loss model:

  • Pre-fix behavior (immediate close): Step 3 line present, large final markdown missing.
  • Fixed behavior: Step 3 line present, large final markdown preserved.
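The loss model can be reproduced with a toy transport whose close() discards any frames still queued. The class and buffer semantics below are illustrative, not the repo's actual test fixture:

```python
class ModeledSocket:
    """Toy websocket: send() queues a frame, close() drops anything unflushed."""

    def __init__(self):
        self.queue = []
        self.delivered = []

    def send(self, frame):
        self.queue.append(frame)

    def flush(self):
        self.delivered.extend(self.queue)
        self.queue.clear()

    def close(self, drain=False):
        if drain:
            self.flush()    # fixed path: drain before tearing down
        self.queue.clear()  # pre-fix path: queued frames are lost


def run_print_to_web(sock, drain):
    sock.send("[Step 3] Generating answer... Time: 88s")
    sock.flush()  # small log frames get out quickly
    sock.send("# Final answer\n" + "word " * 10000)  # large final markdown
    sock.close(drain=drain)  # immediate close vs. close-with-drain
```

With drain=False the Step 3 line is delivered but the large final frame vanishes; with drain=True both arrive, matching the A/B results above.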

What changed

Backend (backend/funix/decorator/call.py)

  • Added safe websocket helpers:
    • send_ws_json(...)
    • send_ws_done() sends {"__funix_event":"done"}
    • close_ws_with_drain(...)
  • Switched websocket close paths to close_ws_with_drain with default 200ms delay.
  • Ensured sys.stdout/sys.stderr are restored in a finally block in output_to_web_function.
  • Normalized websocket error/result sends through send_ws_json.
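A minimal sketch of what these helpers might look like. The names match the PR description, but the bodies are assumptions, written against a generic ws object exposing send/close:

```python
import json
import time

DONE_FRAME = {"__funix_event": "done"}

def send_ws_json(ws, payload):
    """Serialize and send one JSON frame; tolerate a socket the client closed."""
    try:
        ws.send(json.dumps(payload))
    except Exception:
        pass  # the peer is gone; nothing useful to do

def send_ws_done(ws):
    """Emit the end-of-stream control frame the frontend watches for."""
    send_ws_json(ws, DONE_FRAME)

def close_ws_with_drain(ws, delay=0.2):
    """Signal done, give the transport a drain window, then close."""
    send_ws_done(ws)
    time.sleep(delay)  # default 200 ms so buffered frames can flush
    ws.close()
```

The ordering matters: the done marker goes out first, the delay gives the OS and any proxies time to flush earlier frames, and only then is the socket closed.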

Frontend (frontend/src/components/FunixFunction/InputPanel.tsx)

  • Added control-frame detection for websocket messages.
  • Ignores protocol control frames (__funix_event) so they do not overwrite user output.
  • Marks request completion when control frame is received.

Tests (backend/funix/test/test_print_to_web_websocket_reliability.py)

  • Added regression tests covering:
    • legacy immediate-close loss behavior for large final markdown
    • fixed-path preservation and done marker behavior

If problem persists, check these first

  1. Proxy/websocket infrastructure

    • Reverse proxy idle timeouts (ALB/Nginx/Cloudflare/websocket gateway)
    • websocket frame/message size limits
    • upstream buffering/compression behavior for large frames
  2. Client runtime

    • Browser console websocket errors/close codes
    • Extensions/interceptors that mutate websocket traffic
  3. Server logs and payload shape

    • Confirm final markdown is actually produced by app logic
    • Confirm done marker is sent and received
  4. Deployment version parity

    • Ensure backend and frontend are both updated to this PR
    • Rebuild and redeploy frontend assets (avoid stale client bundle)
  5. Payload characteristics

    • If payloads are extremely large, consider chunking/streaming strategy in app layer
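For the last point, an app-layer chunking strategy could be as simple as the sketch below. This is purely illustrative; funix does not expose this function:

```python
def chunk_markdown(markdown, size=64 * 1024):
    """Yield fixed-size character slices so no single websocket frame is huge."""
    for start in range(0, len(markdown), size):
        yield markdown[start : start + size]
```

Each chunk can then be sent as its own frame, with the frontend concatenating until the done marker arrives.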

Verification executed

  • Ran new reliability regression tests on fixed branch: pass.
  • Ran same tests against pre-fix source with explicit PYTHONPATH: pre-fix fails on final large markdown preservation as expected.

@forrestbao forrestbao requested a review from Yazawazi February 21, 2026 00:16
@Yazawazi Yazawazi merged commit 1543055 into develop Feb 21, 2026
4 checks passed