Skip to content

Add post-processing step for a grid and generate grid plot#423

Open
EmmaPostolec wants to merge 133 commits intomainfrom
ep/post_processing_grid
Open

Add post-processing step for a grid and generate grid plot#423
EmmaPostolec wants to merge 133 commits intomainfrom
ep/post_processing_grid

Conversation

@EmmaPostolec
Copy link
Copy Markdown
Member

@EmmaPostolec EmmaPostolec commented Jun 5, 2025

Description

Adds proteus grid-analyse CLI command to post-process PROTEUS simulation grids: extract final-timestep outputs to a summary CSV and generate status summary + ECDF grid plots. Driven entirely by existing .grid.toml config — no separate analysis config file needed.

Key changes:

  • src/proteus/grid/post_processing.py: CSV extraction and ECDF/status plotting. Colormap, output variables, plot format all read from top-level grid config keys.
  • src/proteus/utils/plot.py: _preset_labels, _preset_scales, _preset_log_scales dicts for automatic axis labelling, unit scaling, and log-scale detection by quantity name.
  • src/proteus/cli.py: grid-analyse command using @config_option (consistent with other commands).
  • input/ensembles/example.grid.toml: Extended with post-processing keys (update_csv, plot_status, plot_ecdf, colormap, output_variables, plot_format).
  • tests/grid/test_post_processing.py: 26 @pytest.mark.unit tests for pure helper functions (get_label, clean_series, validate_output_variables, flatten_input_parameters, load_ecdf_plot_settings, group_output_by_parameter, etc.).
  • tests/grid/test_grid.py: test_grid_post_process integration test runs post-processing on the dummy grid and asserts CSV + plots are created.
  • Bug fix: load_ecdf_plot_settings was reading colormap from cfg['input_parameters']['colormap'] instead of top-level cfg['colormap'], causing user-provided colormaps to be silently ignored.

Usage:

proteus grid-analyse --config input/ensembles/example.grid.toml

Output layout:

your_grid/
 └─ post_processing/
     ├─ extracted_data/
     │   └─ your_grid_final_extracted_data_{all,completed,running_error}.csv
     └─ grid_plots/
         ├─ summary_grid_statuses_your_grid.{png,pdf}
         ├─ ecdf_grid_plot_your_grid.{png,pdf}
         └─ single_plots_ecdf/

Validation of changes

Tested on multiple grids on Habrok HPC cluster. Unit tests cover all pure helper functions; integration test (test_grid_post_process) runs the full pipeline on the dummy 2-case grid.

Checklist

  • I have followed the contributing guidelines
  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • My changes generate no new warnings or errors
  • I have checked that the tests still pass on my computer
  • I have updated the docs, as appropriate
  • I have added tests for these changes, as appropriate
  • I have checked that all dependencies have been updated, as required

Relevant people

@nichollsh @IKisvardai @Emeline0110 @planetmariana @timlichtenberg

E.N. Postolec and others added 30 commits April 8, 2025 14:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.

Copy link
Copy Markdown

Copilot AI commented Apr 8, 2026

This is a useful tool that will probably see a lot of use in the future, since running grids of models is such a common approach.

I have tried it with a dummy grid, and it works generally as descr...

Tests have been added in tests/grid/test_grid.py as test_grid_post_process — it calls gpostprocess(GRID_CONFIG) on the dummy grid and asserts that the summary CSV, status summary plot, and ECDF plot are all generated correctly.

Copilot AI changed the title Add post processing step for a grid and generate grid plot Add post-processing step for a grid and generate grid plot Apr 8, 2026
@EmmaPostolec EmmaPostolec requested a review from Copilot April 8, 2026 19:06
@EmmaPostolec EmmaPostolec marked this pull request as ready for review April 8, 2026 19:07
@EmmaPostolec EmmaPostolec requested a review from a team as a code owner April 8, 2026 19:07
@EmmaPostolec
Copy link
Copy Markdown
Member Author

Thanks, @EmmaPostolec. Please let me know when you'd like this to be reviewed.

I think it's ready now. I implemented @copilot feedback. I guess the remaining task is to update the pytest to increase the coverage and to pass the cheks for merging ?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

Comment on lines +485 to +489
# Extract plot_format from cfg
plot_format = cfg.get('plot_format')

if 'status' not in df.columns:
raise ValueError("CSV must contain a 'status' column")
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plot_format is read with cfg.get('plot_format') and then used directly in the output filename. If the key is missing or misspelled, this will create files ending in .None (or another invalid extension). It would be safer to default to a known format (e.g. png) and/or validate plot_format is one of the supported values before saving.

Copilot uses AI. Check for mistakes.
Comment on lines +981 to +986
# --- ECDF plots ---
if cfg.get('plot_ecdf', True):
completed_simulations_data_csv = pd.read_csv(summary_csv_completed, sep='\t')
columns_output = validate_output_variables(
completed_simulations_data_csv, cfg['output_variables']
)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main() accesses cfg['output_variables'] unconditionally when plot_ecdf is enabled. If a user enables ECDF plotting but forgets to provide output_variables, this will raise a KeyError rather than a helpful message. Consider using cfg.get('output_variables') and raising a clear ValueError describing the required config key.

Copilot uses AI. Check for mistakes.
Comment on lines +174 to +179
elif method == 'arange':
arr = list(np.arange(value['start'], value['stop'], value['step']))
# Ensure endpoint is included
if not np.isclose(arr[-1], value['stop']):
arr.append(value['stop'])
tested_params[key] = np.array(arr, dtype=float)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the arange method, arr[-1] is accessed without guarding against np.arange(...) returning an empty array (e.g. start == stop or an incompatible step). This can raise IndexError and abort post-processing. Add an empty-check before reading arr[-1], and decide how to handle the degenerate case (e.g. treat it as a single-value grid).

Copilot uses AI. Check for mistakes.
Comment on lines +243 to +247
def extract_solidification_time(cases_data: list, grid_dir: str | Path):
"""
Extract solidification time for each simulation of the grid for
the condition Phi_global < phi_crit at last time step.

Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extract_solidification_time() docstring says the condition Phi_global < phi_crit is checked “at last time step”, but the implementation finds the first timestep where the condition becomes true (idxmax() on the boolean condition). Please update the docstring to match the actual behavior (solidification time = first crossing).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EmmaPostolec please can you check this?

Comment on lines +560 to +565
def flatten_input_parameters(d: dict, parent_key: str = '') -> dict:
"""
Flattens a nested input-parameter dictionary from a TOML configuration
into a flat mapping of dot-separated parameter paths to their plotting
configuration.

Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flatten_input_parameters() is defined but not called anywhere in this module (or elsewhere in the repo). Keeping unused helpers in a new ~1000-line module makes the code harder to maintain; consider removing it or wiring it into the ECDF settings loading if it’s intended for future config-driven labels/scales.

Copilot uses AI. Check for mistakes.
- `{grid_name}_final_extracted_data_completed.csv` which contains only successful runs (used for ECDF plots)
- `{grid_name}_final_extracted_data_running_error.csv` for only failed simulations with status `Running` or `Error`.

- `grid_plots/` This directory contains a status summary plot and a ECDF plot
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: “a ECDF plot” should be “an ECDF plot”.

Copilot uses AI. Check for mistakes.
reservoir = "outgas" # Escaping reservoir: "bulk", "outgas", "pxuv".

[escape.zephyrus]
Pxuv = 1e-2 # Pressure at which XUV radiation become opaque in the planetary atmosphere [bar]
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar in the inline comment: “XUV radiation become opaque” → “XUV radiation becomes opaque”.

Copilot uses AI. Check for mistakes.
Copilot AI and others added 2 commits April 8, 2026 19:22
Agent-Logs-Url: https://github.com/FormingWorlds/PROTEUS/sessions/43d242cd-ebd2-458b-9dde-201556591aef

Co-authored-by: EmmaPostolec <122358811+EmmaPostolec@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI commented Apr 8, 2026

Thanks, @EmmaPostolec. Please let me know when you'd like this to be reviewed.

I think it's ready now. I implemented @copilot feedback. I guess the remaining task is to update the pytest to inc...

The CI checks are passing on the latest commits. I've now added 26 @pytest.mark.unit tests in a new file tests/grid/test_post_processing.py (commit e65c564) that cover the pure helper functions in post_processing.py: get_grid_name, get_label, get_scale, get_log_scale, latex, clean_series, validate_output_variables, flatten_input_parameters, load_ecdf_plot_settings, and group_output_by_parameter. These are fast (<100ms each), require no running grid or compiled binaries, and run as part of the fast CI gate.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • julialang-s3.julialang.org
    • Triggering command: /usr/bin/python python -m pytest tests/grid/test_post_processing.py -v -m unit (dns block)
    • Triggering command: `/usr/bin/python python -c
      import sys
      sys.path.insert(0, 'src')

Test the imports work

from proteus.grid.post_processing import (
clean_series, flatten_input_parameters, get_grid_name,
get_label, get_log_scale, get_scale, group_output_by_parameter,
latex, load_ecdf_` (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@nichollsh nichollsh self-requested a review April 9, 2026 07:09
Copy link
Copy Markdown
Member

@nichollsh nichollsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, @EmmaPostolec, it's great to have some utilities for easily analysing the results of PROTEUS grids.

However, I have some concerns about some of the code that need clarifying and/or actioning. I am worried that a lot of the code is sprawling and some of it is redundant. This makes it hard to review, but also will make it harder to maintain in the future (tech debt) and could lead to some inconsistencies in how grids are analysed.

These comments are quite minor overall, and should not be too difficult to address. Thanks again for adding these! Looking forward to seeing it used frequently :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use print() statements. PROTEUS uses logging throughout its codebase, so all print statements in this file should be replaced with log.info or log.warning as appropriate.

Comment on lines +163 to +179
if method == 'direct':
tested_params[key] = value['values']

elif method == 'linspace':
tested_params[key] = np.linspace(value['start'], value['stop'], value['count'])

elif method == 'logspace':
tested_params[key] = np.logspace(
np.log10(value['start']), np.log10(value['stop']), value['count']
)

elif method == 'arange':
arr = list(np.arange(value['start'], value['stop'], value['step']))
# Ensure endpoint is included
if not np.isclose(arr[-1], value['stop']):
arr.append(value['stop'])
tested_params[key] = np.array(arr, dtype=float)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is a reflection of how the grid axes are constructed in manage.py. However, if the code in manage.py is updated without also updating this code, the analysis will become inconsistent with the grid output. See here:

# Set a dimension by linspace
def set_dimension_linspace(self, name: str, start: float, stop: float, count: int):
self.dim_avars[name] = list(np.linspace(start, stop, count))
self.dim_avars[name] = [float(v) for v in self.dim_avars[name]]
# Set a dimension by arange (inclusive of endpoint)
def set_dimension_arange(self, name: str, start: float, stop: float, step: float):
self.dim_avars[name] = list(np.arange(start, stop, step))
if not np.isclose(self.dim_avars[name][-1], stop):
self.dim_avars[name].append(stop)
self.dim_avars[name] = [float(v) for v in self.dim_avars[name]]
# Set a dimension by logspace
def set_dimension_logspace(self, name: str, start: float, stop: float, count: int):
self.dim_avars[name] = list(np.logspace(np.log10(start), np.log10(stop), count))
self.dim_avars[name] = [float(v) for v in self.dim_avars[name]]
# Set a dimension directly
def set_dimension_direct(self, name: str, values: list, sort: bool = True):
the_list = list(set(values)) # remove duplicates
if (not isinstance(values[0], str)) and sort: # sort if numeric
the_list = sorted(the_list)
self.dim_avars[name] = the_list

Perhaps you could generalise the three functions in manage.py by moving them outside the Grid object, which would allow them to be called by both Grid.__init__() and get_tested_grid_parameters, without duplicating any code.


else:
print(f'⚠️ Unknown method for {key}: {method}')
continue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably error here, rather than continuing.

Comment on lines +243 to +247
def extract_solidification_time(cases_data: list, grid_dir: str | Path):
"""
Extract solidification time for each simulation of the grid for
the condition Phi_global < phi_crit at last time step.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EmmaPostolec please can you check this?


else:
if not columns_printed:
print('Warning: Missing Phi_global or Time column.')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be log.warning

return grouped


def latex(label: str) -> str:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a duplicate of latexify here:

def latexify(gas: str):

import matplotlib.cm as cm

cfg = {
'colormap': 'plasma',
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use scientific colour maps

Comment on lines +598 to +603
def load_ecdf_plot_settings(cfg, tested_params=None):
"""
Load ECDF plotting settings for both input parameters and output variables
from a configuration dictionary loaded from TOML.

Parameters
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is huge and only used once. Is it actually needed? Seems like a lot of redundant code. Currently, the settings are loaded by the pathway:

_preset dict -> get_ function -> this function -> settings dict

This is quite cumbersome and therefore prone to more errors. Instead, perhaps the ECDF plotting function could call the get_label (etc.) functions directly, and then this whole load_ecdf_plot_settings function and the output_settings and param_settings would not be needed at all.


# List of parameter names (rows) and output names (columns)
param_names = list(param_settings.keys())
out_names = list(output_settings.keys())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above about these dictionaries being redundant.

# ---------------------------------------------------------


def main(grid_analyse_toml_file: str | Path):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not call this function main. This implies that it is the main function of PROTEUS, which is confusing and hard to maintain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docs Update documentation webpage Enhancement A new feature or request Grids/Ensembles Relating to grids or forward models Priority 3: standard Priority level 3: medium time criticality or importance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Include a post-processing grid script to get global grid plot

5 participants