Skip to content

EUCC scraping improvements#558

Open
tkachyna wants to merge 9 commits intomainfrom
eucc-improvements
Open

EUCC scraping improvements#558
tkachyna wants to merge 9 commits intomainfrom
eucc-improvements

Conversation

@tkachyna
Copy link
Collaborator

@tkachyna tkachyna commented Mar 1, 2026

  • adds scraping of product description
  • improves parsing of security packages
  • adds status computation
  • adds missing method for heuristics

@tkachyna tkachyna force-pushed the eucc-improvements branch from 41e2300 to a0d52d0 Compare March 1, 2026 09:47
@tkachyna tkachyna added the cc Related to CC certification label Mar 1, 2026
@tkachyna tkachyna self-assigned this Mar 1, 2026
@tkachyna tkachyna force-pushed the eucc-improvements branch from a0d52d0 to 6c497b2 Compare March 1, 2026 09:51
@tkachyna tkachyna requested a review from jborsky March 1, 2026 09:54
@codecov
Copy link

codecov bot commented Mar 1, 2026

Codecov Report

❌ Patch coverage is 53.16456% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.94%. Comparing base (9109260) to head (491aa6f).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/sec_certs/sample/eucc.py 60.00% 28 Missing ⚠️
src/sec_certs/dataset/eucc.py 0.00% 9 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (9109260) and HEAD (491aa6f). Click for more details.

HEAD has 3 uploads less than BASE
Flag BASE (9109260) HEAD (491aa6f)
4 1
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #558       +/-   ##
===========================================
- Coverage   69.97%   56.94%   -13.02%     
===========================================
  Files          78       78               
  Lines        9126     9203       +77     
===========================================
- Hits         6385     5240     -1145     
- Misses       2741     3963     +1222     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

holder_website = EUCCCertificate._extract_holder_website(metadata.get("holder_website", ""))

if "package" in metadata and metadata["package"]:
metadata["package"] = EUCCCertificate._parse_package(metadata["package"])
Copy link
Collaborator

@jborsky jborsky Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an issue with this. The rest of the function just parses the metadata to fill the certificate attributes, so mutating the dict here is unexpected. This is probably why the param type had to be widened to dict[str, Any]; however, the type of the other_metadata attribute in EUCCCertificate remained unchanged, so it's now inaccurate.

If you want to have it as a part of metadata, I'd move this into _parse_page_metadata where the dict is constructed. Or you can just follow the existing pattern here: create a new certificate attribute and save it there.

But I'd think more about the general semantics of the metadata dict. So far, it was just raw data from the certificate details page used to fill relevant existing certificate attributes. But because there is more useful data with no specific attributes, you decided to save it under the other_metadata attribute, right?

If you now expect there will be more cases where you need to operate on this raw data, transform it, parse it, etc., I'd probably introduce a dataclass attribute in the certificate for this metadata (similar to heuristics, state, pdf_data) so the type hinting is more accurate and it's clearer what fields are available.

But whatever you decide, mutating the dict in the function that should just use it to return a new certificate instance feels wrong for me. At least the type hinting should be corrected in the certificate.

Wdyt?

Copy link
Collaborator Author

@tkachyna tkachyna Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I can move it to _parse_page_metadata, where it would make more sense (I plan to do further parsing of some metadata fields in the near future). The other_metadata fields are only meant to be displayed at the beginning of the certificate pages in sec-certs. There won’t be any further operations with them after they are saved into the EUCCCertificate object.

fill relevant existing certificate attributes

Yes, as these existing certificate attributes are often used for heuristic computation.

Copy link
Collaborator

@jborsky jborsky Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, try whatever feels right for you. If there won’t be any further operations with them, then I can accept this solution as well. I was just confused as a reader.

Personally, after reflecting on this further, I'd keep _parse_page_metadata as it is because it's straightforward. And introduce in the EUCCCertificate dataclass attribute for other_metadata (or whatever name works best). Then you could for instance just create a helper function in that class or wherever that you would call in _from_metadata_dict that takes the original metadata dict and returns the filled dataclass, which is then stored in the certificate. In this way, all of the parsing logic you plan to do would live there.

My reasoning is that this approach would provide clear type hinting and eliminate duplication. Currently, some of the metadata is used to fill the attributes, and then they are stored in the other_metadata as well. But it's just my preference, and I don't say it's the best way to do it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea with the dataclass for it. I ll push the changes tonight. The name other_metadata will be changed for sure, for something like 'enisa_metadata'.

Removed implicit root_dir value for consistency across datasets.

In other datasets, root_dir is implicitly created during
get_certs_from_web via web_dir creation. EUCC performs no
downloads, but still requires root_dir for serialization,
so it is now created explicitly.

Other stages already create directories inside root_dir,
which implicitly creates it when needed.
@J08nY
Copy link
Member

J08nY commented Mar 2, 2026

It seems that merging the previous EUCC PR broke CC dataset creation and download. The website is broken after today's update. The errors are:

[2026-03-02 12:24:04,493] [PID 3854340] [Thread-99 (worker)] [sec_certs.utils.helpers] [ERROR] Failed to download from https://sec-certs.org/proxy/cc/nfs/ccpfiles/files/ep
files/certificat%20ANSSI-CC-2017_73-S01.pdf; [Errno 2] No such file or directory: '/data/flask/instance/cc_dataset/certs/certificates/pdf/759407f9d41632b4.pdf

Is this fixed by ca37bf6? If so, we need to merge this ASAP and do a run on the web, as I need the web ready and working as on Monday the 9th I am giving a talk at RWC.

@tkachyna
Copy link
Collaborator Author

tkachyna commented Mar 2, 2026

I remember I observed the similar problem when running my EUCC pipeline, for some reason the pdf files under certificates, reports and security targets were missing (see the pic). I added them manually and everything was working fine. I can try running it with this commit. It is still a mystery for me why it did not work. Or @jborsky, does this commit ca37bf6 fix the problem I mentioned above?
image

@J08nY
Copy link
Member

J08nY commented Mar 2, 2026

Well I see the issue now.
See this before:

@staged(logger, "Downloading PDFs of CC certification reports.")
def _download_reports(self, fresh: bool = True) -> None:
self.reports_pdf_dir.mkdir(parents=True, exist_ok=True)
certs_to_process = [x for x in self if x.state.report.is_ok_to_download(fresh) and x.report_link]
if not fresh and certs_to_process:
logger.info(
f"Downloading {len(certs_to_process)} PDFs of CC certification reports for which previous download failed."
)
cert_processing.process_parallel(
CCCertificate.download_pdf_report,
certs_to_process,
progress_bar_desc="Downloading PDFs of CC certification reports",
)

The directory gets created before the downloads happen.

Grepping "reports_pdf_dir" in current main in the dataset directory shows no mkdir on it. So it got lost somewhere along the way in a bad refactor. I see that some complicated attr getting code tries to create it, but something gets fucked. Not sure what exactly.

If you see this kind of behavior, please work it out before marking a PR ready to merge.

@J08nY
Copy link
Member

J08nY commented Mar 2, 2026

Oh yeah, now I see it:

https://github.com/crocs-muni/sec-certs/blob/main/src/sec_certs/dataset/cc_eucc_common.py#L71C4-L71C64

Creates a txt_dir instead of pdf_dir.

@J08nY
Copy link
Member

J08nY commented Mar 2, 2026

Fixed in 30f8746.

@tkachyna tkachyna force-pushed the eucc-improvements branch from e2c6f86 to 4aef85f Compare March 2, 2026 23:18
@jborsky
Copy link
Collaborator

jborsky commented Mar 10, 2026

@tkachyna, do you plan to implement any other improvements in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cc Related to CC certification

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants