Conversation
tkachyna
commented
Mar 1, 2026
- adds scraping of product description
- improves parsing of security packages
- adds status computation
- adds missing method for heuristics
41e2300 to
a0d52d0
Compare
a0d52d0 to
6c497b2
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #558 +/- ##
===========================================
- Coverage 69.97% 56.94% -13.02%
===========================================
Files 78 78
Lines 9126 9203 +77
===========================================
- Hits 6385 5240 -1145
- Misses 2741 3963 +1222 ☔ View full report in Codecov by Sentry. |
src/sec_certs/sample/eucc.py
Outdated
| holder_website = EUCCCertificate._extract_holder_website(metadata.get("holder_website", "")) | ||
|
|
||
| if "package" in metadata and metadata["package"]: | ||
| metadata["package"] = EUCCCertificate._parse_package(metadata["package"]) |
There was a problem hiding this comment.
I have an issue with this. The rest of the function just parses the metadata to fill the certificate attributes, so mutating the dict here is unexpected. This is probably why the param type had to be widened to dict[str, Any]; however, the type of the other_metadata attribute in EUCCCertificate remained unchanged, so it's now inaccurate.
If you want to have it as a part of metadata, I'd move this into _parse_page_metadata where the dict is constructed. Or you can just follow the existing pattern here: create a new certificate attribute and save it there.
But I'd think more about the general semantics of the metadata dict. So far, it was just raw data from the certificate details page used to fill relevant existing certificate attributes. But because there is more useful data with no specific attributes, you decided to save it under the other_metadata attribute, right?
If you now expect there will be more cases where you need to operate on this raw data, transform it, parse it, etc., I'd probably introduce a dataclass attribute in the certificate for this metadata (similar to heuristics, state, pdf_data) so the type hinting is more accurate and it's clearer what fields are available.
But whatever you decide, mutating the dict in the function that should just use it to return a new certificate instance feels wrong for me. At least the type hinting should be corrected in the certificate.
Wdyt?
There was a problem hiding this comment.
Okay, I can move it to _parse_page_metadata, where it would make more sense (I plan to do further parsing of some metadata fields in the near future). The other_metadata fields are only meant to be displayed at the beginning of the certificate pages in sec-certs. There won’t be any further operations with them after they are saved into the EUCCCertificate object.
fill relevant existing certificate attributes
Yes, as these existing certificate attributes are often used for heuristic computation.
There was a problem hiding this comment.
Okay, try whatever feels right for you. If there won’t be any further operations with them, then I can accept this solution as well. I was just confused as a reader.
Personally, after reflecting on this further, I'd keep _parse_page_metadata as it is because it's straightforward. And introduce in the EUCCCertificate dataclass attribute for other_metadata (or whatever name works best). Then you could for instance just create a helper function in that class or wherever that you would call in _from_metadata_dict that takes the original metadata dict and returns the filled dataclass, which is then stored in the certificate. In this way, all of the parsing logic you plan to do would live there.
My reasoning is that this approach would provide clear type hinting and eliminate duplication. Currently, some of the metadata is used to fill the attributes, and then they are stored in the other_metadata as well. But it's just my preference, and I don't say it's the best way to do it.
There was a problem hiding this comment.
I like the idea with the dataclass for it. I ll push the changes tonight. The name other_metadata will be changed for sure, for something like 'enisa_metadata'.
Removed implicit root_dir value for consistency across datasets. In other datasets, root_dir is implicitly created during get_certs_from_web via web_dir creation. EUCC performs no downloads, but still requires root_dir for serialization, so it is now created explicitly. Other stages already create directories inside root_dir, which implicitly creates it when needed.
|
It seems that merging the previous EUCC PR broke CC dataset creation and download. The website is broken after today's update. The errors are: Is this fixed by ca37bf6? If so, we need to merge this ASAP and do a run on the web, as I need the web ready and working as on Monday the 9th I am giving a talk at RWC. |
|
I remember I observed the similar problem when running my EUCC pipeline, for some reason the pdf files under certificates, reports and security targets were missing (see the pic). I added them manually and everything was working fine. I can try running it with this commit. It is still a mystery for me why it did not work. Or @jborsky, does this commit ca37bf6 fix the problem I mentioned above? |
|
Well I see the issue now. sec-certs/src/sec_certs/dataset/cc.py Lines 635 to 649 in cc87bc5 The directory gets created before the downloads happen. Grepping "reports_pdf_dir" in current main in the dataset directory shows no mkdir on it. So it got lost somewhere along the way in a bad refactor. I see that some complicated attr getting code tries to create it, but something gets fucked. Not sure what exactly. If you see this kind of behavior, please work it out before marking a PR ready to merge. |
|
Oh yeah, now I see it: Creates a txt_dir instead of pdf_dir. |
|
Fixed in 30f8746. |
e2c6f86 to
4aef85f
Compare
|
@tkachyna, do you plan to implement any other improvements in this PR? |
