[FEATURE] Add system to assign a "maximum benefit" for plan analysis rules #215

@jobbish-sql

Description

Which component(s) does this affect?

  • Desktop App
  • CLI Tool
  • SSMS Extension
  • Plan Analysis Rules
  • Documentation

Problem Statement

IMO there are two "holy grails" for plan analysis rules:

  1. Be able to ingest a big, complicated plan and to spit out actionable, properly prioritized advice. Query tuners may have the skill to do this themselves but plans with many operators or statements are extremely unwieldy to work with in SSMS for a variety of reasons.
  2. Be able to ingest a set of estimated plans (such as from query store) and to spit out actionable, properly prioritized advice for them. Obviously much harder to do than point 1.

This feature primarily addresses holy grail 1 but I think it could have positive impacts for 2 as well in that some or all of the same framework could be used.

The "maximum benefit" idea is the same concept as what SSMS shows in its missing index report, although I hope we would do a better job than that. It provides a natural way to rank action items in priority order and lets query tuners focus on the parts of the plan that will be the most impactful. The current system suffers from too much noise (cardinality estimates will always be wrong somewhere in big plans), and the info/warning/critical classification can be misleading or feel arbitrary. Those classification levels also have to be set by hand by a developer, which is annoying as well.

Overview of redesign to add "maximum benefit" to action items:

  1. Calculate true operator-level statistics that account for batch mode/row mode differences, child operators, and parallel zones. The tool is already doing this, although perhaps not perfectly.
  2. Calculate the "maximum benefit" of addressing the wait stats already available in the XML. Some of these are really easy to calculate, like memory grant waits or resource governor throttling. For example, if the query executes for 12 seconds but had 4 seconds of memory grant wait time, then the maximum impact of addressing the RESOURCE_SEMAPHORE wait is 33%. Other wait types would need more detailed calculations for parallel plans (such as I/O waits), but I think that the math is worth doing. Still other wait types are probably best ignored entirely. One nice benefit of this approach is that support for more wait types can be added slowly over time. It would be very useful to emit the list of supported wait types as a text file generated from the source code.
  3. Calculate the "maximum benefit" of improving various operators in the query plan. Similar to wait stats, some of the operators have a maximum benefit that's easy to calculate. For example, if a sort operator takes 1% of overall time and spills to tempdb, the maximum benefit of fixing up that operator (by addressing the spill, in this case) would be 1%. That provides a natural way to rank the apparent issue (tempdb spill) compared to a clustered index scan with a residual predicate that takes up 90% of the elapsed time for the plan. Of course, some operators would need much more complicated heuristics that look at parent operators, but like before, support for operators could be slowly added over time. It would be very useful to spit out supported operators along with their heuristic benefit formulas in a text file. I'd be happy to give my opinion on how to calculate the heuristic for various operators if you buy into the overall idea.
  4. Calculate the "maximum benefit" of cardinality mis-estimations. This would need to look at various child/parent nodes but I don't think that the calculation would be affected by the node that causes the issue. I also think it would be good to call out the nodes which cause the cardinality to be wrong and to perform the calculation only for those nodes. This is different from the current system which can flag many operators as having a cardinality estimate problem even though the issue might have been caused by a single filter that was a child node of all of them.
  5. Try to identify improvements with actionable advice for the operators in part 3. I think the tool already does a lot of this. For example, a scan with a residual predicate may have been better off as a seek, or the sort that spills to tempdb needs more memory available to it. As before, it would be helpful if the tool could produce documentation on what types of rule checking happen for each operator.
  6. Now you have a list of possible improvements from wait stats, operator timings, and cardinality estimation issues. Rank the improvements according to "maximum benefit" and don't show any operator benefits (step 3) that don't have a suggested fix. Doing it this way solves a lot of the noise problems with the current approach. For example, if a plan has key lookups with a maximum benefit of only 0.1%, they won't be shown as a CRITICAL issue anymore. They would rank at the bottom of the list and tuners could focus on the more important items first.
  7. Move the checks that cannot be assigned a "maximum benefit" somewhere else in the tool. There's a lot of empty space at the top so maybe they could be integrated with the runtime stats? Moving the DOP and memory checks into the "Runtime" box seems to fit well, but I don't think that this is too important of a point.
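To make step 2 concrete, here is a minimal sketch of the wait-stat calculation. Everything in it (the `SUPPORTED_WAITS` set, the `wait_max_benefits` name, the millisecond units) is an illustrative assumption, not the tool's actual API:

```python
# Illustrative sketch of step 2; SUPPORTED_WAITS and all names here
# are assumptions for this example, not the tool's real API.
SUPPORTED_WAITS = {"RESOURCE_SEMAPHORE", "IO_COMPLETION"}

def wait_max_benefits(elapsed_ms, waits_ms):
    """Return {wait_type: max_benefit_fraction} for supported waits.

    elapsed_ms -- total elapsed time of the query, in milliseconds
    waits_ms   -- {wait_type: wait_time_ms} taken from the plan XML
    """
    benefits = {}
    for wait_type, wait_ms in waits_ms.items():
        if wait_type not in SUPPORTED_WAITS:
            continue  # unsupported wait types are ignored for now
        # Removing the wait entirely saves at most its full duration.
        benefits[wait_type] = min(wait_ms, elapsed_ms) / elapsed_ms
    return benefits

# The example from the text: 12 s of elapsed time with 4 s of memory
# grant waits caps the RESOURCE_SEMAPHORE fix at 4/12, about 33%.
print(wait_max_benefits(12000, {"RESOURCE_SEMAPHORE": 4000,
                                "SOS_SCHEDULER_YIELD": 500}))
```

The nice property is that the supported set is just data: adding a wait type later means adding an entry and, where a parallel plan needs it, a more detailed formula.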
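Steps 3 and 6 could share one small data structure: each finding carries its maximum benefit as a fraction of elapsed time, and the ranking pass hides anything without a suggested fix. A hypothetical sketch (the `Finding` type and every field name are made up for this example):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model for steps 3 and 6; not the tool's real classes.
@dataclass
class Finding:
    node_id: int
    issue: str
    max_benefit: float            # fraction of total elapsed time
    suggested_fix: Optional[str]  # None -> no actionable advice yet

def rank_findings(findings):
    """Hide findings without a fix, then sort by maximum benefit."""
    actionable = [f for f in findings if f.suggested_fix is not None]
    return sorted(actionable, key=lambda f: f.max_benefit, reverse=True)

findings = [
    Finding(7, "Sort spills to tempdb", 0.01, "increase the memory grant"),
    Finding(2, "Clustered Index Scan with residual predicate", 0.90,
            "add an index so the predicate can seek"),
    Finding(11, "Key Lookup", 0.001, "widen the nonclustered index"),
    Finding(4, "exchange skew", 0.05, None),  # no suggested fix: hidden
]
for f in rank_findings(findings):
    print(f"{f.max_benefit:6.1%}  node {f.node_id}: {f.issue}")
```

With this shape, the 0.1% key lookup from the text naturally falls to the bottom of the list instead of being promoted to CRITICAL.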
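For step 4, attributing the misestimate only to the node that caused it can be done with a bottom-up walk: blame a node only when its own estimate is off and no descendant has already been blamed. A rough sketch under an assumed dict-based plan tree (all key names are illustrative):

```python
# Illustrative sketch of step 4; the dict-based plan tree and its key
# names are assumptions for this example.
def misestimate_roots(node, threshold=10.0, roots=None):
    """Collect ids of nodes that *cause* a cardinality misestimate.

    A node is blamed only if its own estimate is off by `threshold`x
    or more and no descendant has already been blamed, so a single bad
    filter does not flag every operator above it.
    """
    if roots is None:
        roots = []
    descendant_blamed = False
    for child in node.get("children", []):
        before = len(roots)
        misestimate_roots(child, threshold, roots)
        descendant_blamed = descendant_blamed or len(roots) > before
    low = max(1, min(node["est_rows"], node["actual_rows"]))
    high = max(node["est_rows"], node["actual_rows"])
    if high / low >= threshold and not descendant_blamed:
        roots.append(node["id"])
    return roots

# A plan over a badly estimated filter: only the filter (node 5) is
# blamed, even though nodes 1 and 3 inherit the wrong estimate.
plan = {"id": 1, "est_rows": 1, "actual_rows": 100000, "children": [
    {"id": 3, "est_rows": 1, "actual_rows": 100000, "children": [
        {"id": 5, "est_rows": 1, "actual_rows": 100000, "children": []},
        {"id": 6, "est_rows": 500, "actual_rows": 510, "children": []},
    ]},
]}
print(misestimate_roots(plan))  # -> [5]
```

This is exactly the difference from the current behavior described above: one root-cause node gets the benefit calculation instead of every ancestor being flagged.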

Overall, I like this approach because it draws a clear separation between all of the annoying math that computers are good at and the expert tuning rules that we would be adding in. One easy way to improve the tool is to find plans for which it cannot detect or appropriately prioritize the root cause of the performance issue. For those situations, there are only two possible fixes: refine the heuristic calculation for the "maximum benefit" of the operator causing the problem, or add a possible improvement for an operator under part 5. We would be able to continually add new checks over time without disturbing other parts of the code, and contributors would be able to add their own ideas after seeing what we currently support in the text files.

For estimated plans you could hold your nose and use operator costs for part 1 and implement most of part 3. I certainly don't like using operator costs but I don't know what else you could reasonably do for estimated plans. I do think there are a lot of benefits in using the same framework as much as possible between actual and estimated plans. With that said, I'm mostly interested in seeing how this performs in practice for actual plans so maybe the issue of estimated plans could be deferred for now.
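The hold-your-nose fallback described above could be as simple as normalizing estimated operator costs into shares of the plan total, then feeding those shares into the same ranking framework that actual plans use for elapsed-time shares. A sketch with made-up field names:

```python
# Illustrative sketch of the estimated-plan fallback; the input shape
# is an assumption, not the actual showplan schema.
def cost_shares(operators):
    """operators: [(node_id, estimated_operator_cost)] pairs.

    Returns {node_id: share_of_total_cost}, a stand-in for the
    elapsed-time shares that actual plans provide.
    """
    total = sum(cost for _, cost in operators)
    if total <= 0:
        return {node_id: 0.0 for node_id, _ in operators}
    return {node_id: cost / total for node_id, cost in operators}

# Operator costs of 0.5, 4.5, and 5.0: node 3 gets half of the total,
# so its findings would rank first even without runtime stats.
print(cost_shares([(1, 0.5), (2, 4.5), (3, 5.0)]))
```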

Proposed Solution

:^)

Use Case

:^)

Alternatives Considered

:^)

Labels: enhancement (New feature or request)
