Add query cancellation support via _tasks/_cancel API for PPL queries#5254
Add query cancellation support via _tasks/_cancel API for PPL queries#5254sunil9977 wants to merge 2 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Sunil Ramchandra Pawar <pawar_sr@apple.com>
PR Reviewer Guide 🔍(Review updated until commit fd7642c)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to fd7642c
Previous suggestionsSuggestions up to commit 024a8ca
|
|
Hi @sunil9977, thanks for the changes! Please take a look at the CI failures |
ahkcs
left a comment
There was a problem hiding this comment.
Overall the approach looks good — aligns well with the plan from the issue discussion. Left a couple of minor suggestions inline.
| executionThread.set(thread); | ||
| } | ||
|
|
||
| public void clearExecutionThread() { |
There was a problem hiding this comment.
Minor: there's a small window between task registration and setExecutionThread() where a cancel would see a null thread and be a no-op. In practice this is nearly impossible to hit (cancellation requires separate HTTP roundtrips), but it could matter if the sql-worker pool is saturated.
Since the fix is small, might be worth hardening:
private final AtomicBoolean cancelled = new AtomicBoolean(false);
public void setExecutionThread(Thread thread) {
executionThread.set(thread);
if (cancelled.get()) {
thread.interrupt();
}
}
@Override
public void onCancelled() {
cancelled.set(true);
Thread thread = executionThread.get();
if (thread != null) {
thread.interrupt();
}
}There was a problem hiding this comment.
I don't see any precedent for this in OS core, I wonder if we're missing something? Only 2 hits:
% rg -w 'onCancelled' -B2 -A10
server/src/main/java/org/opensearch/index/reindex/BulkByScrollTask.java
203-
204- @Override
205: public void onCancelled() {
206- /*
207- * If this task is a leader, we don't need to do anything extra because the cancel action cancels child tasks for us
208- * If it's is a worker, we know how to cancel it here
209- * If we don't know whether it's a leader or worker yet, we do nothing here. If the task is later set to be a worker, we cancel the
210- * worker at that time.
211- */
212- if (isWorker()) {
213- workerState.handleCancel();
214- }
215- }
server/src/main/java/org/opensearch/tasks/CancellableTask.java
93- assert reason != null;
94- if (cancelledInfo.trySet(new CancelledInfo(reason))) {
95: onCancelled();
96- }
97- }
98-
99- public boolean isCancelled() {
100- return cancelledInfo.get() != null;
101- }
102-
103- /**
104- * Returns true if this task can potentially have children that need to be cancelled when it parent is cancelled.
105- */
--
113- * Called after the task is cancelled so that it can take any actions that it has to take.
114- */
115: protected void onCancelled() {}
116-
117- /**
118- * Returns true if this task should be automatically cancelled if the coordinating node that
119- * requested this task left the cluster.
120- */
121- public boolean cancelOnParentLeaving() {
122- return true;
123- }
124-
125- @Nullable
| @@ -33,26 +34,48 @@ | |||
| public static final String SQL_WORKER_THREAD_POOL_NAME = "sql-worker"; | |||
| public static final String SQL_BACKGROUND_THREAD_POOL_NAME = "sql_background_io"; | |||
|
|
|||
There was a problem hiding this comment.
Nit: could be private static since the accessor methods already exist.
There was a problem hiding this comment.
These consts are directly referenced in a few places iirc, I originally refactored to this from a lot of places duplicating a "sql-worker" string.
There was a problem hiding this comment.
Sorry, I wasn't referring to the thread pool name constants — those should stay public static. I meant the cancellationCallBackThreadLocal field added in this PR (line 45). It's declared public static but has setCancellationCallback/clearCancellationCallback accessor methods, so the field itself could be private static to prevent direct access.
| @@ -0,0 +1,44 @@ | |||
| /* | |||
There was a problem hiding this comment.
We can rename this file to PPLQueryTask
|
CI failure seems to be caused by spotless check, please run |
|
During e2e testing with the OSD cancel button, we noticed that Adding an interrupt check in // opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java
@Override
public boolean moveNext() {
if (Thread.interrupted()) {
throw new NonFallbackCalciteException(
String.format("Query was cancelled after processing %d rows.", queryCount));
}
if (queryCount >= maxResponseSize) {
return false;
}
// ... rest of method
}This is outside the scope of this PR but would be a nice follow-up to make the cancel experience snappy. In our testing it brought cancel response time from ~50s down to <5s. |
| String prefix = (queryId != null) ? "PPL [queryId=" + queryId + "]: " : "PPL: "; | ||
|
|
||
| if (pplQuery != null && pplQuery.length() > 512) { | ||
| return prefix + pplQuery.substring(0,512) + "..."; |
There was a problem hiding this comment.
suggestion: I don't think this is worth truncating.
The only way to access this description is to directly check the task by ID. Seeing the full query might be useful debug context. Similar debug logic in OSD core exists and doesn't truncate.
| if (Thread.interrupted() || e.getCause() instanceof InterruptedException) { | ||
| if (callBack != null && callBack.isCancelled()) { | ||
| LOG.info("Query was cancelled"); | ||
| throw new OpenSearchException("Query was cancelled."); |
There was a problem hiding this comment.
issue: This exception will propagate as a 500, which is bad for people monitoring errors.
OpenSearchException is the generic top-level exception class, it defaults as internal server error when the status isn't set. You want to use one of its subclasses.
TaskCancelledException seems like a better choice.
That's on me lol, I thought when I implemented it that the CPU part of joins would be fast (the bulk of the time on most joins is waiting for batches), so it'd be overkill to check every iteration and we could just throw when we request new batches. Happy to take it up if there's a followup task. Do you have more precise repro steps? Since I haven't noticed this issue before. I'm also interested in if doing this on every row would have a negative effect on benchmarks. |
|
|
||
| @Override | ||
| public boolean shouldCancelChildrenOnCancellation() { | ||
| return false; |
There was a problem hiding this comment.
Should this be false? Query tasks tend to spawn background IO tasks, but I don't know if those count as children.
I was doing a |
|
Based on @Swiddis's observation that almost no core code overrides I think a simpler approach might work here: pass the |
|
Oh, interesting. If I'm understanding right they implemented their own thread interruption mechanism? I wonder why not use regular |
|
Ok, talked with the core folks, task cancellation is the better approach. Since opensearch uses a thread pool system, interrupting the thread has potential to poison the whole pool. It works for us because we're careful to uninterrupt the thread when handling interruption errors, but there might theoretically be a path that leaves a perpetually-interrupted thread (either currently or in the future). Cancellation also has the benefit of propagating across clusters better (which is more relevant to Core which is fanning out requests to many data nodes regularly). So the correct approach is change our thread interrupts to use the native task cancellation feature. If all that understanding is correct, it means we can also clean up the logic that un-interrupts the thread, but I think it's only like 2 lines anyway so not a massive win. |
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit fd7642c.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
|
Persistent review updated to latest commit fd7642c |
|
Thank you all for reviews. |
| LOG.warn( | ||
| "Query execution timed out after {}. Interrupting execution thread.", | ||
| timeout); | ||
| executionThread.interrupt(); |
There was a problem hiding this comment.
future PR: this interrupt needs to be changed to cancellation as well (and all code paths that depend on it)
|
@sunil9977 LGTM, one more chore: please add DCO signoffs to the commits as described in https://github.com/opensearch-project/sql/pull/5254/checks?check_run_id=68346380999 & https://github.com/opensearch-project/.github/blob/main/CONTRIBUTING.md#developer-certificate-of-origin. |
Fixes #4887