Add data validation and missing data handling practices#11
Add data validation and missing data handling practices#11nick-gorman wants to merge 2 commits intomainfrom
Conversation
Documents schema-based approach for handling missing data in templater and translator modules, including schema enforcement at module boundaries, testing requirements, and documentation standards.
|
Another thought for consideration. In the schemas, should we either allow additional columns which aren't required or optional, but get dropped at enforcement time? I'm thinking of columns like "status" in the ecaa_generators table. It seems strange to make the status column required or even create it and fill it with NaNs if not included. But we could either specify a set of metadata columns that are allowed but dropped (silently) on schema enforcement or just drop (silently) all columns which aren't either required or optional. |
|
|
||
| ### NaN values in compulsory columns | ||
|
|
||
| **Compulsory columns should not permit NaN values.** If a column can contain `NaN` values, the data it represents is effectively optional and the column should be defined as such. This distinction ensures the schema accurately reflects data requirements—compulsory columns guarantee complete data, while optional columns explicitly signal that missing values are acceptable and handled by the model. |
There was a problem hiding this comment.
I think if we stick with this it might help to shape other decisions around potential templater restructures - particularly about what tables should be combined in the templating step. I'm just imagining for example generation vs. storage assets having different sets of columns that are compulsory under this definition, so it wouldn't make sense to create one combined assets table.
And a follow up q that might clarify for me - is this definition specifically only referring to NaN values, or does it encompass other "empty" values as well? (e.g. "")
There was a problem hiding this comment.
That's a really good point. I think having a clear set of compulsory columns that makes sense regardless of whether, say, there are any battery present makes sense. Rather than making battery column compulsory in a combined asset table, but then having them all NaN if there are no batteries. That would be confusing. Anyway, I think a good argument against combining all the asset tables.
Regarding the follow up, maybe NaN is actually the wrong term, maybe null is the better term, which would cover NaN and empty. But we should also maybe consolidate around using a consistent value for nulls.
| #### Optional Elements | ||
|
|
||
| - Missing optional tables are added as empty DataFrames with all schema-defined columns | ||
| - Missing optional columns are added and populated with `NaN` values |
There was a problem hiding this comment.
Do we need/want to specify a blanket NaN type to use in these cases (and elsewhere) as part of these docs? E.g. None vs pd.NA vs np.nan
|
|
||
| #### Error message conventions | ||
|
|
||
| Error messages should identify the table name, column name (if applicable), and the nature of the violation. This consistency aids debugging and helps users quickly identify and resolve issues with their input data. |
There was a problem hiding this comment.
And include module info too? e.g. for the first example - "Missing required table 'generators' as input to 'translator'" (or something). Maybe that's already assumed but could be good to lay out explicitly (addresses #6 too)
There was a problem hiding this comment.
Yep, I think including the module name makes sense.
| Schema enforcement must be tested to ensure: | ||
|
|
||
| - Compulsory tables and columns that are missing raise appropriate errors | ||
| - Optional tables and columns that are missing are correctly added |
There was a problem hiding this comment.
Is it useful to specify logging practices for stuff like this too, and include notes in this kind of docs about checking logs? I don't have a good sense of best practices for checking logging so that might be overkill
There was a problem hiding this comment.
hmmm there is probably a broader conversation to be had about logging practices.
|
|
||
| #### Empty DataFrame Handling | ||
|
|
||
| - Functions accept DataFrames with no rows without raising errors |
There was a problem hiding this comment.
For my clarity - this should be the case on the assumption that if the function requires a particular DataFrame to contain data, the error should have already been raised at the schema enforcement step?
There was a problem hiding this comment.
I guess it depends on if its a public function or not. By making a function public we are saying this is safe to use, its an API intended to be called directly by the user. So if it's public, it should probably handle empties gracefully or throw an error saying it can't accept them.
There was a problem hiding this comment.
Mmm yep good point, that makes sense!
|
Another question I just thought about - do we want to engage with Indexes in a particular way?
|
Co-authored-by: EllieKallmier <61219730+EllieKallmier@users.noreply.github.com>
I think the answer is probably yes to in general to this:
|
A write-up of the proposed data validation practices we discussed at the 2026 planning workshop.
My notes on this session were a little light, so I mostly went off memory, and have added extra details where I thought they were warranted. As always please treat as draft and let me know what you think, or if you want to add details from your notes.
Addresses #5