test(net): reproduce socket error crash with instrumentation#7891
test(net): reproduce socket error crash with instrumentation#7891khanayan123 wants to merge 7 commits intomasterfrom
Conversation
…ntation When dd-trace wraps Socket.emit in the net instrumentation, unhandled socket errors (EPIPE/ECONNRESET) crash the process with a stack trace pointing to dd-trace's net.js rather than Node.js internals. This can affect applications that rely on process-level error handling with stack-trace-based filtering. The test spawns a child process that creates a TCP connection, triggers a socket error (ECONNRESET) by writing after the remote end closes, and verifies the process does not crash. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7891 +/- ##
==========================================
- Coverage 74.26% 74.25% -0.01%
==========================================
Files 765 765
Lines 35786 35786
==========================================
- Hits 26575 26574 -1
- Misses 9211 9212 +1 Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Overall package sizeSelf size: 5.45 MB Dependency sizes| name | version | self size | total size | |------|---------|-----------|------------| | import-in-the-middle | 3.0.0 | 81.15 kB | 815.98 kB | | dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |🤖 This report was automatically generated by heaviest-objects-in-the-universe |
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: 62e8d5b | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
BenchmarksBenchmark execution time: 2026-03-31 03:54:48 Comparing candidate commit dcc32c8 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 227 metrics, 33 unstable metrics. |
Add regression tests to verify the ws instrumentation does not crash when clients disconnect abruptly or when sending to destroyed sockets. Results: ws library handles socket errors internally, so the ws instrumentation does NOT reproduce the customer's EPIPE crash. The net instrumentation (raw Socket.emit wrapping) is the only confirmed crash vector, and that crash also occurs without dd-trace. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the child-process-based test that incorrectly asserted "should not crash" for a socket with no error listener (Node.js is supposed to crash in that case). Replace with inline unit tests in index.spec.js that test the actual concern: when an app attaches an 'error' listener, dd-trace's wrapped Socket.emit must not interfere with error delivery or crash the process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirm that unhandled socket errors crash the process both with and without dd-trace (expected Node.js behavior). With dd-trace, the stack trace includes datadog-instrumentations/src/net.js; without, it only references Node.js internals. This formalizes the finding that dd-trace does not cause the crash but does appear in the stack trace, which can mislead error attribution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Net tests: - Add uncaughtException guard to literally assert "no crash" - Use .once() to prevent double-done - Add fail-fast timer if no socket error is emitted - Make server-side destroy more deterministic with setImmediate - Rename second test to "preserve native error shape" and assert constructor name, errno to confirm no wrapper type WS tests: - Add uncaughtException guard to both tests - Add error listeners on both server and client sides - Use ws.close()/terminate() instead of _socket.destroy() - Assert totalSendAttempts > 0 in burst test to confirm work happened Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Attempt to reproduce APMS-18805 by testing the tracer's HTTP connection
to the agent under conditions matching the customer scenario: high WS
span volume, agent dropping connections, keep-alive socket reuse.
Unit tests (request.spec.js):
- Agent destroys socket mid-write (single request)
- Rapid concurrent requests with agent closing connections
- Large payload flush when agent socket dies
Unit tests with net instrumentation active:
- Same scenarios but with dd-trace wrapping Socket.emit
- Socket reuse with keep-alive when agent kills idle sockets
- Repeated flush cycles with intermittent socket death
Integration test (epipe-crash.spec.js):
- Full tracer with ws+net+http instrumentation
- WS server with burst sends and abrupt client disconnects
- Unstable fake agent that kills connections periodically
- Aggressive agent that kills every connection
Result: all tests pass. The request module's req.once('error') handler
catches socket errors before they become unhandled, even with net
instrumentation wrapping Socket.emit. We were unable to reproduce the
customer's crash through any of these vectors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Socket.emitinnet.js:65becomes the error origin, which can bypass process-level error handling in some applications'error'listener, and asserts the process should not crashNote: The underlying socket error also crashes without dd-trace (no error listener = crash per Node.js spec). However, dd-trace changes the stack trace origin from
node:internal/streams/destroytodd-trace/packages/datadog-instrumentations/src/net.js:65, which can affect applications using stack-trace-based error filtering.Key findings
net.jshas had no functional changes since v5.83.0 — this is a pre-existing issueSocket.emitbut doesn't handle the'error'event case specially