Add data validation and missing data handling practices by nick-gorman · Pull Request #11 · Open-ISP/dev-practices

nick-gorman · 2026-01-29T05:24:54Z

A write-up of the proposed data validation practices we discussed at the 2026 planning workshop.

My notes on this session were a little light, so I mostly went off memory and added extra details where I thought they were warranted. As always, please treat this as a draft and let me know what you think, or if you want to add details from your notes.

Addresses #5

Documents schema-based approach for handling missing data in templater and translator modules, including schema enforcement at module boundaries, testing requirements, and documentation standards.

nick-gorman · 2026-02-04T02:58:51Z

Another thought for consideration. Should the schemas allow additional columns that are neither required nor optional, but which get dropped at enforcement time? I'm thinking of columns like "status" in the ecaa_generators table. It seems strange to make the status column required, or to create it and fill it with NaNs when it isn't included. We could either specify a set of metadata columns that are allowed but silently dropped on schema enforcement, or just silently drop all columns that are neither required nor optional.
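A minimal sketch of the second option (silently dropping anything the schema doesn't list), assuming a schema is just a dict of required and optional column names. The `SCHEMA` layout, the `drop_unlisted_columns` helper and every column name other than "status" are hypothetical, not the real ecaa_generators schema.

```python
import pandas as pd

# Hypothetical schema for illustration only; the real schema format is not
# defined in this thread.
SCHEMA = {
    "required": ["generator", "capacity_mw"],
    "optional": ["fuel_type"],
}


def drop_unlisted_columns(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Keep only the columns the schema lists as required or optional."""
    allowed = set(schema["required"]) | set(schema["optional"])
    # Preserve the incoming column order; everything else is dropped silently.
    keep = [col for col in df.columns if col in allowed]
    return df[keep].copy()


ecaa_generators = pd.DataFrame(
    {
        "generator": ["A", "B"],
        "capacity_mw": [100.0, 250.0],
        "status": ["existing", "committed"],  # metadata, not in the schema
    }
)
enforced = drop_unlisted_columns(ecaa_generators, SCHEMA)
```

The first option would only differ in the `allowed` set: it would also include an explicit list of metadata columns, and anything outside all three lists could then raise instead of being dropped.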

EllieKallmier · 2026-02-19T03:57:24Z

data-validation.md

+
+### NaN values in compulsory columns
+
+**Compulsory columns should not permit NaN values.** If a column can contain `NaN` values, the data it represents is effectively optional and the column should be defined as such. This distinction ensures the schema accurately reflects data requirements—compulsory columns guarantee complete data, while optional columns explicitly signal that missing values are acceptable and handled by the model.


I think if we stick with this it might help to shape other decisions around potential templater restructures - particularly about what tables should be combined in the templating step. I'm just imagining for example generation vs. storage assets having different sets of columns that are compulsory under this definition, so it wouldn't make sense to create one combined assets table.

And a follow up q that might clarify for me - is this definition specifically only referring to NaN values, or does it encompass other "empty" values as well? (e.g. "")

That's a really good point. I think it makes sense to have a clear set of compulsory columns that holds regardless of whether, say, any batteries are present, rather than making battery columns compulsory in a combined asset table and then having them all NaN if there are no batteries. That would be confusing. Anyway, I think it's a good argument against combining all the asset tables.

Regarding the follow-up, maybe NaN is actually the wrong term and null is the better one, since it would cover both NaN and empty values. We should probably also consolidate around a single consistent value for nulls.
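As a rough illustration of what "null" could cover, the sketch below treats NaN, None, pd.NA and empty or whitespace-only strings as null when checking a compulsory column. The helper names and the decision to count blank strings are assumptions for discussion, not something agreed in this thread.

```python
import pandas as pd


def null_like(series: pd.Series) -> pd.Series:
    """Boolean mask: True for NaN/None/pd.NA and for empty or blank strings."""
    is_missing = series.isna()
    # Non-string values (including NaN) are never counted as blank strings.
    is_blank = series.map(lambda v: isinstance(v, str) and v.strip() == "").astype(bool)
    return is_missing | is_blank


def check_compulsory_columns(df: pd.DataFrame, compulsory: list, table: str) -> None:
    """Raise if any compulsory column contains a null-like value."""
    for col in compulsory:
        n_null = int(null_like(df[col]).sum())
        if n_null:
            raise ValueError(
                f"Table '{table}': compulsory column '{col}' contains {n_null} null value(s)"
            )


# Hypothetical table: one empty string and one None in a compulsory column.
generators = pd.DataFrame({"generator": ["A", "", None], "capacity_mw": [1.0, 2.0, 3.0]})
mask = null_like(generators["generator"])
```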

EllieKallmier · 2026-02-19T03:59:33Z

data-validation.md

+#### Optional Elements
+
+- Missing optional tables are added as empty DataFrames with all schema-defined columns
+- Missing optional columns are added and populated with `NaN` values


Do we need/want to specify a blanket NaN type to use in these cases (and elsewhere) as part of these docs? E.g. None vs pd.NA vs np.nan

Yes, definitely!
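A sketch of how a single, centrally defined null value might be used when filling in optional elements. `np.nan` is only a placeholder here, since the choice between None, pd.NA and np.nan hasn't been made yet, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: np.nan stands in for whichever null value the docs end up
# specifying. Defining it once means the choice can change in one place.
NULL_VALUE = np.nan


def add_missing_optional_columns(df: pd.DataFrame, optional_columns: list) -> pd.DataFrame:
    """Add any missing optional column, filled with the shared null value."""
    out = df.copy()
    for col in optional_columns:
        if col not in out.columns:
            out[col] = NULL_VALUE
    return out


def empty_table(columns: list) -> pd.DataFrame:
    """Stand-in for a missing optional table: no rows, all schema columns."""
    return pd.DataFrame(columns=list(columns))


generators = add_missing_optional_columns(
    pd.DataFrame({"generator": ["A", "B"]}), ["fuel_type"]
)
batteries = empty_table(["battery", "capacity_mw"])
```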

EllieKallmier · 2026-02-19T04:04:49Z

data-validation.md

+
+#### Error message conventions
+
+Error messages should identify the table name, column name (if applicable), and the nature of the violation. This consistency aids debugging and helps users quickly identify and resolve issues with their input data.


And include module info too? e.g. for the first example - "Missing required table 'generators' as input to 'translator'" (or something). Maybe that's already assumed but could be good to lay out explicitly (addresses #6 too)

Yep, I think including the module name makes sense.
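One way the convention could look in code, as a sketch only: a small helper that always puts the violation, column (if any), table and module into the message. `SchemaError` and `schema_error` are hypothetical names, and the wording follows the example message suggested above.

```python
from typing import Optional


class SchemaError(ValueError):
    """Hypothetical exception type for schema violations."""


def schema_error(
    module: str, table: str, violation: str, column: Optional[str] = None
) -> SchemaError:
    """Build an error naming the violation, column (if any), table and module."""
    location = f"table '{table}'"
    if column is not None:
        location = f"column '{column}' in {location}"
    return SchemaError(f"{violation} {location} as input to '{module}'")


table_msg = str(schema_error("translator", "generators", "Missing required"))
column_msg = str(
    schema_error("translator", "generators", "Missing required", column="capacity_mw")
)
```

Building every message through one helper keeps the format consistent across modules and makes it easy to assert on in tests.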

EllieKallmier · 2026-02-19T04:07:22Z

data-validation.md

+Schema enforcement must be tested to ensure:
+
+- Compulsory tables and columns that are missing raise appropriate errors
+- Optional tables and columns that are missing are correctly added


Is it useful to specify logging practices for stuff like this too, and include notes in these kinds of docs about checking logs? I don't have a good sense of best practices for checking logging so that might be overkill

hmmm there is probably a broader conversation to be had about logging practices.
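Leaving logging aside, the two testing requirements quoted above could be covered by tests along these lines. `enforce_schema` here is a deliberately small, hypothetical stand-in for the real enforcement step, and the schema layout is assumed. The tests use plain try/except so the sketch runs on its own; under pytest, `pytest.raises` would be the more usual way to write them.

```python
import pandas as pd

# Hypothetical schema layout, for illustration only.
SCHEMA = {
    "generators": {"compulsory": True, "required": ["generator"], "optional": ["fuel_type"]},
    "batteries": {"compulsory": False, "required": ["battery"], "optional": []},
}


def enforce_schema(tables: dict, schema: dict) -> dict:
    """Minimal stand-in for schema enforcement at a module boundary."""
    out = {}
    for name, spec in schema.items():
        columns = spec["required"] + spec["optional"]
        if name not in tables:
            if spec["compulsory"]:
                raise ValueError(f"Missing required table '{name}'")
            out[name] = pd.DataFrame(columns=columns)
            continue
        df = tables[name].copy()
        missing = [c for c in spec["required"] if c not in df.columns]
        if missing:
            raise ValueError(f"Table '{name}' is missing required columns {missing}")
        for col in spec["optional"]:
            if col not in df.columns:
                df[col] = float("nan")
        out[name] = df
    return out


def test_missing_compulsory_table_raises():
    try:
        enforce_schema({}, SCHEMA)
    except ValueError as err:
        assert "generators" in str(err)
    else:
        raise AssertionError("expected a ValueError")


def test_missing_compulsory_column_raises():
    tables = {"generators": pd.DataFrame({"fuel_type": ["coal"]})}
    try:
        enforce_schema(tables, SCHEMA)
    except ValueError as err:
        assert "'generator'" in str(err)
    else:
        raise AssertionError("expected a ValueError")


def test_missing_optional_elements_are_added():
    tables = {"generators": pd.DataFrame({"generator": ["A"]})}
    result = enforce_schema(tables, SCHEMA)
    assert result["generators"]["fuel_type"].isna().all()
    assert result["batteries"].empty
    assert list(result["batteries"].columns) == ["battery"]
```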

EllieKallmier · 2026-02-19T04:10:37Z

data-validation.md

+
+#### Empty DataFrame Handling
+
+- Functions accept DataFrames with no rows without raising errors


For my clarity - this should be the case on the assumption that if the function requires a particular DataFrame to contain data, the error should have already been raised at the schema enforcement step?

I guess it depends on whether it's a public function or not. By making a function public we are saying it is safe to use: it's an API intended to be called directly by the user. So if it's public, it should probably either handle empties gracefully or throw an error saying it can't accept them.

Mmm yep good point, that makes sense!
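To make the "handle empties gracefully" option concrete, here is a sketch of a public function that accepts a DataFrame with no rows and returns an empty result with the expected columns. The function, table and column names are all hypothetical.

```python
import pandas as pd


def total_capacity_by_region(generators: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical public function: sum capacity_mw per region."""
    required = ["region", "capacity_mw"]
    missing = [c for c in required if c not in generators.columns]
    if missing:
        # Public functions can't assume enforcement already ran upstream.
        raise ValueError(f"Table 'generators' is missing required columns {missing}")
    if generators.empty:
        # No rows is valid input: return an empty result with the same columns.
        return pd.DataFrame(columns=required)
    return generators.groupby("region", as_index=False)["capacity_mw"].sum()


empty_result = total_capacity_by_region(pd.DataFrame(columns=["region", "capacity_mw"]))
totals = total_capacity_by_region(
    pd.DataFrame({"region": ["NSW", "NSW", "VIC"], "capacity_mw": [100.0, 50.0, 10.0]})
)
```

A function that genuinely cannot work without rows would replace the empty branch with an explicit error saying so, rather than failing somewhere deeper in the calculation.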


EllieKallmier · 2026-02-25T04:54:50Z

Another question I just thought about - do we want to engage with Indexes in a particular way?
I guess a few sub-questions pop out from there -

1. Do we want to always assume that at the enforcement stage there are 'no' indexes for any tables? e.g. enforce a reset_index() for each table?
2. Are there any tables with columns for which we want to enforce uniqueness, and if so is there a best way to add that to a schema?
3. If we allow/want to use indexes, should we write that into the schema like a column and then set index at end of enforcement, or does it have a separate/specific schema feature that describes index properties?


nick-gorman · 2026-02-26T04:57:53Z

> Another question I just thought about - do we want to engage with Indexes in a particular way? I guess a few sub-questions pop out from there -
>
> 1. Do we want to always assume that at the enforcement stage there are 'no' indexes for any tables? e.g. enforce a reset_index() for each table?
> 2. Are there any tables with columns for which we want to enforce uniqueness, and if so is there a best way to add that to a schema?
> 3. If we allow/want to use indexes, should we write that into the schema like a column and then set index at end of enforcement, or does it have a separate/specific schema feature that describes index properties?

I think the answer is probably yes, in general, to these:

1. I'd be happy to assume all data is in columns and that indexes can be reset.
2. Enforcing uniqueness for primary-key-type columns makes sense, and would help catch issues in the data that a user could easily miss.
3. I'm open to this, but don't have a strong preference.
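A rough sketch of the first two points, assuming enforcement moves any named index into columns and then checks uniqueness on a key column named by the schema. The helper names and the idea of the schema naming a single key column are assumptions, not something decided here.

```python
import pandas as pd


def flatten_index(df: pd.DataFrame) -> pd.DataFrame:
    """Move a named index into columns; discard a default unnamed index."""
    has_named_index = any(name is not None for name in df.index.names)
    return df.reset_index(drop=not has_named_index)


def check_unique(df: pd.DataFrame, key: str, table: str) -> None:
    """Raise if the key column contains repeated values."""
    duplicated = df.loc[df[key].duplicated(), key].unique().tolist()
    if duplicated:
        raise ValueError(
            f"Table '{table}': column '{key}' has duplicate values {duplicated}"
        )


# Hypothetical table that arrives with "generator" as its index.
indexed = pd.DataFrame(
    {"capacity_mw": [1.0, 2.0]}, index=pd.Index(["A", "B"], name="generator")
)
flat = flatten_index(indexed)
check_unique(flat, "generator", "generators")
```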

Add data validation and missing data handling practices

71bf3ea

Documents schema-based approach for handling missing data in templater and translator modules, including schema enforcement at module boundaries, testing requirements, and documentation standards.

nick-gorman requested review from EllieKallmier and dylanjmcconnell January 29, 2026 05:25

EllieKallmier reviewed Feb 19, 2026


EllieKallmier added the `type: feature` (New feature or request) and `category: style guide` (Relates to the code style guide) labels Feb 25, 2026

EllieKallmier added this to the Define a consistent guide for data validation milestone Feb 25, 2026

EllieKallmier mentioned this pull request Feb 25, 2026

Implement templater review suggestions Open-ISP/ISPyPSA#83

Open

6 tasks

Update data-validation.md

fc2c615

Co-authored-by: EllieKallmier <61219730+EllieKallmier@users.noreply.github.com>

EllieKallmier approved these changes Feb 26, 2026



