Skip to content

Bli/xasolver observability tolerant#239

Open
dwr-bli wants to merge 2 commits intomainfrom
bli/xasolver-observability-tolerant
Open

Bli/xasolver observability tolerant#239
dwr-bli wants to merge 2 commits intomainfrom
bli/xasolver-observability-tolerant

Conversation

@dwr-bli
Copy link
Contributor

@dwr-bli dwr-bli commented Mar 10, 2026

Summary

This PR refactors XASolver to improve observability and runtime diagnostics, and introduces an optional tolerant build mode while preserving legacy behavior by default.

Main changes

  • add structured XA solver logging using SLF4J fluent API
  • add MDC context fields such as run_id, cycleName, timeStep, solver, and mode
  • add shared PerformanceTimer timing for major XA stages:
    • loadModel
    • setConstraints
    • setDVars
    • setWeights
    • solve
    • assign
    • total
  • introduce LEGACY and TOLERANT execution modes
  • keep legacy behavior as the default mode
  • add tolerant build path for:
    • missing constraints
    • missing dvars
    • missing weights
    • missing slack/surplus weights
    • unknown constraint signs
  • aggregate tolerant-mode build gaps using lightweight BuildStats
  • keep core XA solve flow unchanged:
    • load model
    • set constraints
    • set decision variables
    • set weights
    • solve
    • assign dvar values

Behavior notes

  • default behavior remains LEGACY
  • tolerant mode is enabled only when explicitly requested via system properties
  • no intended change to normal legacy XA solve behavior
  • tolerant mode is intended to reduce hard failures caused by partial/missing build artifacts and to provide better diagnostics

Suggested runtime flags

  • -Dwrims.xa.mode=tolerant
    or
  • -Dwrims.xa.tolerant=true

@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)
6.9% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)
B Maintainability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

@dwr-zroy dwr-zroy added the enhancement New feature or request label Mar 10, 2026
@dwr-zroy dwr-zroy added this to the 3.0 Beta milestone Mar 10, 2026
@dwr-zroy dwr-zroy moved this to Needs Review in WRIMS 3 Development Mar 10, 2026
Copy link
Collaborator

@dwr-zroy dwr-zroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big picture:

  • I need more info on the Tolerant mode before I can approve. It seems to be out of scope of the observability topic of this PR. Let's chat about this one offline.
  • There are changes to the CbcSolver class, which also seems out of scope for this PR as well. I might be missing something here.
  • Misc changes requested about the use of MDC and the use of INFO level.
  • Thoughts on where the RunContext class should be initialized.
  • White-space changes are causing many issues with the sonar code quality tool. We will need to make sure we don't change lines of code (even with whitespace or formatting only) unless we are willing to address those sonar issues. This might mean some annoying line-by-line reverts for this PR.
  • Related to the MDC and white-space notes above: I am not sure MDC is valuable enough to justify the cleanup needed. We can talk more about this comment too.

Comment on lines +104 to +130
logger.atInfo()
.setMessage("Performance Statistics Summary:")
.log();
logger.atInfo()
.setMessage(" Total problems solved: {}")
.addArgument(totalProblemsSolved)
.log();
logger.atInfo()
.setMessage(" Total solve time: {} ms")
.addArgument(totalSolverTime)
.log();
logger.atInfo()
.setMessage(" Total model creation time: {} ms")
.addArgument(totalModelCreationTime)
.log();
logger.atInfo()
.setMessage(" Total constraint setup time: {} ms")
.addArgument(totalConstraintSetupTime)
.log();
logger.atInfo()
.setMessage(" Total variable setup time: {} ms")
.addArgument(totalVariableSetupTime)
.log();
logger.atInfo()
.setMessage(" Average solve time: {} ms")
.addArgument(totalSolverTime / totalProblemsSolved)
.log();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer these as a single logging call, easier to read, fewer lines. Also this message does not currently appear in the WRIMS 2 console during a run, so it should not be at the info level.

The comment about the use of INFO applies to many places below, not all of them are called out. Review the current WRIMS 2 console messages and try to limit additional logs at that INFO level.

Suggested change
logger.atInfo()
.setMessage("Performance Statistics Summary:")
.log();
logger.atInfo()
.setMessage(" Total problems solved: {}")
.addArgument(totalProblemsSolved)
.log();
logger.atInfo()
.setMessage(" Total solve time: {} ms")
.addArgument(totalSolverTime)
.log();
logger.atInfo()
.setMessage(" Total model creation time: {} ms")
.addArgument(totalModelCreationTime)
.log();
logger.atInfo()
.setMessage(" Total constraint setup time: {} ms")
.addArgument(totalConstraintSetupTime)
.log();
logger.atInfo()
.setMessage(" Total variable setup time: {} ms")
.addArgument(totalVariableSetupTime)
.log();
logger.atInfo()
.setMessage(" Average solve time: {} ms")
.addArgument(totalSolverTime / totalProblemsSolved)
.log();
logger.atDebug()
.setMessage("""
Performance Statistics Summary:
Total problems solved: {}
Total solve time: {} ms
Total model creation time: {} ms
Total constraint setup time: {} ms
Total variable setup time: {} ms
Average solve time: {} ms""")
.addArgument(totalProblemsSolved)
.addArgument(totalSolverTime)
.addArgument(totalModelCreationTime)
.addArgument(totalConstraintSetupTime)
.addArgument(totalVariableSetupTime)
.addArgument(totalSolverTime / totalProblemsSolved)
.log();

Comment on lines +60 to +73

private static void putSolverMdc() {
try {
MDC.put("solver", "cbc");
} catch (Exception ignore) {
}
}

private static void clearSolverMdc() {
try {
MDC.remove("solver");
} catch (Exception ignore) {
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we are using MDC when the contexts that are added are entirely local to this class?


// record lp
public static double record_if_obj_diff = 10000.0;
public static double log_if_obj_diff = 500.0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 140 - 259 detected as changes in git, because of white-space changes. If code formatter was applied to the whole file that could cause changes in white-space characters. Make sure to configure your code formatter to only apply to VCS changes.

Since changes were made, we fail the sonar check because of unused variables here. You could either:

  1. revert these changes,
  2. or do the cleanup.

Comment on lines +286 to +320
logger.atInfo()
.setMessage("CBC Solver Initialization Configuration:")
.log();
logger.atInfo()
.setMessage(" cbcViolationCheck = {}")
.addArgument(cbcViolationCheck)
.log();
logger.atInfo()
.setMessage(" lowerBoundZero_check = {}")
.addArgument(lowerBoundZero_check)
.log();
logger.atInfo()
.setMessage(" cbcSolutionRounding = {}")
.addArgument(cbcSolutionRounding)
.log();
logger.atInfo()
.setMessage(" usejCbc2021 = {}")
.addArgument(usejCbc2021)
.log();
logger.atInfo()
.setMessage(" usejCbc2021a = {}")
.addArgument(usejCbc2021a)
.log();
logger.atInfo()
.setMessage(" jCbc Version: {}")
.addArgument(cbcVersion)
.log();
logger.atInfo()
.setMessage(" whsScaling = {}")
.addArgument(whsScaling)
.log();
logger.atInfo()
.setMessage(" whsSafe = {}")
.addArgument(whsSafe)
.log();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here about combining the logging calls, and the use of the INFO level


public static void newProblem(){
putSolverMdc();
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this try block then marks the entire function as "new code", and we would need to do massive refactoring to not fail the sonar check. If we do not use MDC, we do not need the try-catch block.

Comment on lines +968 to +977
private static void putMdc(RunContext rc) {
try {
MDC.put("run_id", rc.runId);
MDC.put("cycleName", rc.cycleName);
MDC.put("timeStep", rc.timeStep);
MDC.put("solver", "XA");
MDC.put("mode", rc.modeName());
} catch (Exception ignore) {
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used in one location, I prefer to just move it into the init

Comment on lines +40 to +41
private static final Logger LOG_MAIN =
LoggerFactory.getLogger("gov.ca.water.wrims.engine.core.solver.xa.main");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not prefer this. These logs will still be caught by the root logger. So I do not see an advantage here of not using

Suggested change
private static final Logger LOG_MAIN =
LoggerFactory.getLogger("gov.ca.water.wrims.engine.core.solver.xa.main");
private static final Logger LOG_MAIN =
LoggerFactory.getLogger(XASolver.class);

Comment on lines +46 to +49
private enum Mode {
LEGACY,
TOLERANT
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need more of an explanation of the tolerant mode. I do not fully understand how it relates to the logging scope of this PR, it seems like it changes behavior more than just in observability.

Comment on lines +68 to +96
private static final class RunContext {
final Mode mode;
final int cycleIndex1Based;
final String dateStr;
final String cycleName;
final String timeStep;
final String runId;

RunContext(Mode mode) {
this.mode = mode;
this.cycleIndex1Based = ControlData.currCycleIndex + 1;
this.dateStr = ControlData.currMonth + "/" + ControlData.currDay + "/" + ControlData.currYear;
this.cycleName = ControlData.currCycleName;
this.timeStep = String.valueOf(ControlData.timeStep);
this.runId = UUID.randomUUID().toString();
}

String modeName() {
return mode == null ? "legacy" : mode.name().toLowerCase();
}

String ctx() {
return "date=" + dateStr
+ " cycleIndex=" + cycleIndex1Based
+ " cycleName=" + cycleName
+ " timeStep=" + timeStep
+ " run_id=" + runId;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer this being it's own package class, and adding a factory method onto the ControlData class to create instances of this class. Right now, this implementation depends on the global state of ControlData which I would like to try and eliminate.

Let me know if this does not make sense.

* It aggregates missing/skipped items and keeps only a few samples,
* avoiding heavy per-item logging in high-frequency loops.
*/
private static final class BuildStats {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer a more descriptive name here. Maybe

Suggested change
private static final class BuildStats {
private static final class TrackedWreslCycleCharateristics {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: Needs Review

Development

Successfully merging this pull request may close these issues.

2 participants