EUCC scraping improvements by tkachyna · Pull Request #558 · crocs-muni/sec-certs

tkachyna · 2026-03-01T09:44:38Z

adds scraping of product description
improves parsing of security packages
adds status computation
adds missing method for heuristics

codecov · 2026-03-01T09:55:34Z

Codecov Report

❌ Patch coverage is 53.16456% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.94%. Comparing base (9109260) to head (491aa6f).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/sec_certs/sample/eucc.py	60.00%	28 Missing ⚠️
src/sec_certs/dataset/eucc.py	0.00%	9 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (9109260) and HEAD (491aa6f).

HEAD has 3 uploads less than BASE

Flag	BASE (9109260)	HEAD (491aa6f)
	4	1

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #558       +/-   ##
===========================================
- Coverage   69.97%   56.94%   -13.02%     
===========================================
  Files          78       78               
  Lines        9126     9203       +77     
===========================================
- Hits         6385     5240     -1145     
- Misses       2741     3963     +1222

jborsky · 2026-03-02T11:03:41Z

src/sec_certs/sample/eucc.py

        holder_website = EUCCCertificate._extract_holder_website(metadata.get("holder_website", ""))

+        if "package" in metadata and metadata["package"]:
+            metadata["package"] = EUCCCertificate._parse_package(metadata["package"])


I have an issue with this. The rest of the function just parses the metadata to fill the certificate attributes, so mutating the dict here is unexpected. This is probably why the param type had to be widened to dict[str, Any]; however, the type of the other_metadata attribute in EUCCCertificate remained unchanged, so it's now inaccurate.

If you want to have it as a part of metadata, I'd move this into _parse_page_metadata where the dict is constructed. Or you can just follow the existing pattern here: create a new certificate attribute and save it there.

But I'd think more about the general semantics of the metadata dict. So far, it was just raw data from the certificate details page used to fill relevant existing certificate attributes. But because there is more useful data with no specific attributes, you decided to save it under the other_metadata attribute, right?

If you now expect there will be more cases where you need to operate on this raw data, transform it, parse it, etc., I'd probably introduce a dataclass attribute in the certificate for this metadata (similar to heuristics, state, pdf_data) so the type hinting is more accurate and it's clearer what fields are available.

But whatever you decide, mutating the dict in a function that should only read it to return a new certificate instance feels wrong to me. At the very least, the type hint in the certificate should be corrected.

Wdyt?
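
To make the concern concrete, here is a minimal sketch of the two styles. Everything in it is an assumption for illustration: `parse_package`, `build_mutating`, and `build_pure` are hypothetical stand-ins, and the comma-separated package format is invented rather than taken from the PR.

```python
from typing import Any


def parse_package(raw: str) -> list[str]:
    # Hypothetical stand-in for EUCCCertificate._parse_package.
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_mutating(metadata: dict[str, Any]) -> dict[str, Any]:
    # Mirrors the reviewed diff: the caller's dict is changed in place.
    if metadata.get("package"):
        metadata["package"] = parse_package(metadata["package"])
    return {"package": metadata.get("package")}


def build_pure(metadata: dict[str, str]) -> dict[str, Any]:
    # Only reads from the dict; the parsed value lives in the result.
    raw = metadata.get("package", "")
    return {"package": parse_package(raw) if raw else None}


raw_page = {"package": "EAL4, ALC_DVS.2"}

build_pure(raw_page)
unchanged = raw_page["package"]   # still the original string

build_mutating(raw_page)
changed = raw_page["package"]     # now a list, so dict[str, str] no longer holds
```

The second call is what forces the parameter type to widen to `dict[str, Any]`: after it runs, the caller's dict no longer holds only strings.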

Okay, I can move it to _parse_page_metadata, where it would make more sense (I plan to do further parsing of some metadata fields in the near future). The other_metadata fields are only meant to be displayed at the beginning of the certificate pages in sec-certs. There won’t be any further operations with them after they are saved into the EUCCCertificate object.

> fill relevant existing certificate attributes

Yes, as these existing certificate attributes are often used for heuristic computation.

Okay, try whatever feels right for you. If there won’t be any further operations with them, then I can accept this solution as well. I was just confused as a reader.

Personally, after reflecting on this further, I'd keep _parse_page_metadata as it is because it's straightforward, and I'd introduce a dataclass attribute in EUCCCertificate for other_metadata (or whatever name works best). Then you could, for instance, create a helper function in that class (or elsewhere) that is called in _from_metadata_dict, takes the original metadata dict, and returns the filled dataclass, which is then stored in the certificate. That way, all of the parsing logic you plan to add would live there.

My reasoning is that this approach would provide clear type hinting and eliminate duplication. Currently, some of the metadata is used to fill the attributes and is then stored in other_metadata as well. But it's just my preference, and I'm not saying it's the best way to do it.
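
A minimal sketch of that suggestion, under stated assumptions: the class name `EnisaMetadata` is borrowed from a later commit title in this PR, but the fields, the `from_metadata_dict` helper, and the comma-separated package format are hypothetical and not taken from the actual change.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EnisaMetadata:
    # Field names are illustrative, not the ones used in the PR.
    product_description: str | None = None
    package: tuple[str, ...] = ()

    @classmethod
    def from_metadata_dict(cls, metadata: dict[str, str]) -> EnisaMetadata:
        # All parsing of the raw page data lives here; the input dict is
        # only read, never modified.
        raw_package = metadata.get("package", "")
        return cls(
            product_description=metadata.get("product_description") or None,
            package=tuple(p.strip() for p in raw_package.split(",") if p.strip()),
        )


page = {"package": "EAL4, ALC_DVS.2", "product_description": ""}
meta = EnisaMetadata.from_metadata_dict(page)
```

With this shape, the raw dict can stay `dict[str, str]`, and the typed fields document which values are available on the certificate.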

I like the idea of a dataclass for it. I'll push the changes tonight. The name other_metadata will definitely be changed to something like 'enisa_metadata'.

Removed implicit root_dir value for consistency across datasets. In other datasets, root_dir is implicitly created during get_certs_from_web via web_dir creation. EUCC performs no downloads, but still requires root_dir for serialization, so it is now created explicitly. Other stages already create directories inside root_dir, which implicitly creates it when needed.
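
The difference between implicit and explicit creation can be sketched as follows. This is an illustration only: `save_dataset` and the `dataset.json` file name are hypothetical and do not reflect the sec-certs serialization code.

```python
import json
import tempfile
from pathlib import Path


def save_dataset(root_dir: Path, payload: dict) -> Path:
    # Explicit creation: nothing earlier is assumed to have made root_dir.
    root_dir.mkdir(parents=True, exist_ok=True)
    target = root_dir / "dataset.json"
    target.write_text(json.dumps(payload))
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Implicit creation: making a nested directory with parents=True
    # also creates every missing ancestor, including the root.
    implicit_root = Path(tmp) / "cc_dataset"
    (implicit_root / "web").mkdir(parents=True, exist_ok=True)
    implicit_ok = implicit_root.is_dir()

    # A dataset that never creates a subdirectory has to create the root itself.
    explicit_root = Path(tmp) / "eucc_dataset"
    saved = save_dataset(explicit_root, {"certs": []})
    explicit_ok = saved.exists()
```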

J08nY · 2026-03-02T21:27:30Z

It seems that merging the previous EUCC PR broke CC dataset creation and download. The website is broken after today's update. The errors are:

[2026-03-02 12:24:04,493] [PID 3854340] [Thread-99 (worker)] [sec_certs.utils.helpers] [ERROR] Failed to download from https://sec-certs.org/proxy/cc/nfs/ccpfiles/files/epfiles/certificat%20ANSSI-CC-2017_73-S01.pdf; [Errno 2] No such file or directory: '/data/flask/instance/cc_dataset/certs/certificates/pdf/759407f9d41632b4.pdf

Is this fixed by ca37bf6? If so, we need to merge this ASAP and do a run on the web, as I need the web ready and working as on Monday the 9th I am giving a talk at RWC.

tkachyna · 2026-03-02T21:37:58Z

I remember observing a similar problem when running my EUCC pipeline: for some reason the pdf directories under certificates, reports, and security targets were missing (see the pic). I added them manually and everything worked fine. I can try running it with this commit. It is still a mystery to me why it did not work. Or @jborsky, does commit ca37bf6 fix the problem I mentioned above?

J08nY · 2026-03-02T21:48:39Z

Well I see the issue now.
See this before:

sec-certs/src/sec_certs/dataset/cc.py

Lines 635 to 649 in cc87bc5

    @staged(logger, "Downloading PDFs of CC certification reports.")
    def _download_reports(self, fresh: bool = True) -> None:
        self.reports_pdf_dir.mkdir(parents=True, exist_ok=True)
        certs_to_process = [x for x in self if x.state.report.is_ok_to_download(fresh) and x.report_link]
        if not fresh and certs_to_process:
            logger.info(
                f"Downloading {len(certs_to_process)} PDFs of CC certification reports for which previous download failed."
            )
        cert_processing.process_parallel(
            CCCertificate.download_pdf_report,
            certs_to_process,
            progress_bar_desc="Downloading PDFs of CC certification reports",
        )

The directory gets created before the downloads happen.

Grepping "reports_pdf_dir" in current main in the dataset directory shows no mkdir on it. So it got lost somewhere along the way in a bad refactor. I see that some complicated attr getting code tries to create it, but something gets fucked. Not sure what exactly.

If you see this kind of behavior, please work it out before marking a PR ready to merge.

J08nY · 2026-03-02T21:50:08Z

Oh yeah, now I see it:

https://github.com/crocs-muni/sec-certs/blob/main/src/sec_certs/dataset/cc_eucc_common.py#L71C4-L71C64

Creates a txt_dir instead of pdf_dir.
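
The failure mode can be reproduced in isolation. This is a sketch, not the sec-certs code: `download_to` and the directory layout are hypothetical stand-ins, and the only point is that opening a file for writing does not create its parent directory.

```python
import tempfile
from pathlib import Path


def download_to(path: Path, content: bytes) -> None:
    # Stand-in for a PDF download; open() does not create parent directories.
    with path.open("wb") as handle:
        handle.write(content)


with tempfile.TemporaryDirectory() as tmp:
    pdf_dir = Path(tmp) / "reports" / "pdf"
    txt_dir = Path(tmp) / "reports" / "txt"

    # The bug: the wrong directory is created before the downloads start.
    txt_dir.mkdir(parents=True, exist_ok=True)
    try:
        download_to(pdf_dir / "report.pdf", b"%PDF-")
        failed = False
    except FileNotFoundError:
        # [Errno 2] No such file or directory, as in the log above.
        failed = True

    # The fix: create the directory the download actually writes to.
    pdf_dir.mkdir(parents=True, exist_ok=True)
    download_to(pdf_dir / "report.pdf", b"%PDF-")
    fixed = (pdf_dir / "report.pdf").exists()
```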

J08nY · 2026-03-02T21:53:44Z

Fixed in 30f8746.

jborsky · 2026-03-10T15:34:07Z

@tkachyna, do you plan to implement any other improvements in this PR?

tkachyna added 3 commits February 23, 2026 21:23

Improve package and security level parsing; add missing method for heuristics

85d74fe

Add product description scraping

49ec219

Add status computation

3a0d5a2

tkachyna force-pushed the eucc-improvements branch from 41e2300 to a0d52d0 Compare March 1, 2026 09:47

tkachyna added the cc Related to CC certification label Mar 1, 2026

tkachyna self-assigned this Mar 1, 2026

Delete unsused import, remove double decorator, typing

6c497b2

tkachyna force-pushed the eucc-improvements branch from a0d52d0 to 6c497b2 Compare March 1, 2026 09:51

tkachyna requested a review from jborsky March 1, 2026 09:54

jborsky reviewed Mar 2, 2026

tkachyna added 3 commits March 2, 2026 23:59

Create EnisaMetadata dataclass

c78976f

Merge remote-tracking branch 'origin/eucc-improvements' into eucc-improvements

16f0c1a

Fix typing and mypy

4aef85f

tkachyna force-pushed the eucc-improvements branch from e2c6f86 to 4aef85f Compare March 2, 2026 23:18

jborsky reviewed Mar 3, 2026

Add user agent to fix 429 error when downloading pdf files from enisa

491aa6f

J08nY mentioned this pull request Mar 18, 2026

1.3. Add EUCC certificates scraping and integration #569

Open

tkachyna wants to merge 9 commits into main from eucc-improvements
