feat: add ossf data fetcher (CM-952) #3839

ulemons wants to merge 12 commits into feat/add-dal-automatic-project-discovery from
Conversation
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Pull request overview
Adds an Automatic Projects Discovery Temporal worker that fetches external criticality datasets (OSSF + LF) and upserts them into the projectCatalog table, while aligning the data-access layer with the underlying schema.
Changes:
- Split `criticalityScore` into `ossfCriticalityScore` and `lfCriticalityScore` across DAL types and SQL upsert/insert logic.
- Implement a Temporal workflow + activities pipeline to list datasets per source and process datasets in batches via `bulkUpsertProjectCatalog`.
- Add source abstractions/registry plus OSSF (GCS CSV) and LF (API JSON) fetchers, along with scheduling/docs/deps.
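The batch-and-upsert pipeline described above can be sketched roughly as follows. `bulkUpsertProjectCatalog` and the 5k batch size are named in the PR; the row shape and the function signature here are illustrative assumptions, not the actual implementation:

```typescript
// Sketch: accumulate streamed rows and flush in fixed-size batches.
// The row type and the shape of the upsert callback are assumptions.
type CatalogRow = {
  repoUrl: string
  ossfCriticalityScore?: number
  lfCriticalityScore?: number
}

const BATCH_SIZE = 5000 // batch size mentioned in the PR description

async function processDataset(
  rows: AsyncIterable<CatalogRow>,
  upsert: (batch: CatalogRow[]) => Promise<void>, // e.g. bulkUpsertProjectCatalog
): Promise<number> {
  let batch: CatalogRow[] = []
  let total = 0
  for await (const row of rows) {
    batch.push(row)
    if (batch.length >= BATCH_SIZE) {
      await upsert(batch)
      total += batch.length
      batch = []
    }
  }
  // Flush the final partial batch, if any.
  if (batch.length > 0) {
    await upsert(batch)
    total += batch.length
  }
  return total
}
```

Keeping the upsert behind a callback like this also makes the batching logic trivially testable without a database.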
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| services/libs/data-access-layer/src/project-catalog/types.ts | Updates DB types to use ossfCriticalityScore + lfCriticalityScore. |
| services/libs/data-access-layer/src/project-catalog/projectCatalog.ts | Updates select/insert/bulk insert/upsert/update SQL to use the two score columns. |
| services/apps/automatic_projects_discovery_worker/src/workflows/discoverProjects.ts | New workflow orchestration (incremental vs full) + activity timeouts/retries. |
| services/apps/automatic_projects_discovery_worker/src/sources/types.ts | Introduces source/dataset interfaces and CSV vs JSON streaming contract. |
| services/apps/automatic_projects_discovery_worker/src/sources/registry.ts | Registers discovery sources and exposes lookup/list helpers. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/source.ts | Implements dataset listing + CSV row parsing for OSSF snapshots. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/bucketClient.ts | Minimal HTTP/XML client for listing GCS prefixes + streaming all.csv. |
| services/apps/automatic_projects_discovery_worker/src/sources/lf-criticality-score/source.ts | Implements LF dataset descriptor generation + paginated API streaming + parsing. |
| services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts | Registers Temporal schedule (daily cron) and workflow args/timeouts. |
| services/apps/automatic_projects_discovery_worker/src/main.ts | Enables Postgres for this worker and schedules discovery on startup. |
| services/apps/automatic_projects_discovery_worker/src/activities/activities.ts | Adds activities to list sources/datasets and process datasets with batching + upsert. |
| services/apps/automatic_projects_discovery_worker/src/activities.ts | Switches barrel export to explicit named exports for activity typing. |
| services/apps/automatic_projects_discovery_worker/package.json | Adds csv-parse dependency and adjusts local debug script. |
| services/apps/automatic_projects_discovery_worker/README.md | Adds architecture/workflow documentation for the worker. |
| pnpm-lock.yaml | Locks csv-parse dependency version. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
```ts
const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'
const PAGE_SIZE = 100
```
DEFAULT_API_URL is set to an ngrok-free.dev address. That’s ephemeral and not suitable as a production/default endpoint; it can also silently route traffic to an unexpected host if the env var isn’t set. Consider removing the default and failing fast when LF_CRITICALITY_SCORE_API_URL is missing (or replace it with a stable, owned HTTPS endpoint).
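The fail-fast alternative suggested here might look like the sketch below. Only the env var name `LF_CRITICALITY_SCORE_API_URL` comes from the PR; the `requireApiUrl` helper is hypothetical:

```typescript
// Sketch: refuse to start with a missing endpoint instead of silently
// falling back to an ephemeral default. `requireApiUrl` is a hypothetical
// helper, not part of the PR.
function requireApiUrl(env: Record<string, string | undefined>): string {
  const url = env['LF_CRITICALITY_SCORE_API_URL']
  if (!url) {
    throw new Error(
      'LF_CRITICALITY_SCORE_API_URL must be set; refusing to fall back to a default endpoint',
    )
  }
  return url
}
```

Failing at startup surfaces a misconfiguration immediately, rather than letting every LF fetch time out against a dead tunnel in production.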
```ts
// Sort newest-first by date
datasets.sort((a, b) => b.date.localeCompare(a.date))
```
Datasets are sorted newest-first only by the date field. If the bucket contains multiple snapshots for the same date (different time prefixes), this sort won’t guarantee the latest snapshot is first, which can make incremental mode pick an older dataset. Consider sorting by the full dataset id/prefix (date+time) or by id descending instead of only date.
Suggested change:

```diff
-// Sort newest-first by date
-datasets.sort((a, b) => b.date.localeCompare(a.date))
+// Sort newest-first by full dataset id (includes date and time)
+datasets.sort((a, b) => b.id.localeCompare(a.id))
```
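A small self-contained demo of the issue: with two snapshots on the same date, a stable date-only sort leaves the earlier snapshot first, while sorting by the full id picks the latest. The `date/time` id shape is inferred from the example in the comment (`2024.07.01/060102`):

```typescript
// Minimal reproduction of the sort-order bug described in the review.
type Dataset = { id: string; date: string }

const datasets: Dataset[] = [
  { id: '2024.07.01/060102', date: '2024.07.01' }, // earlier snapshot, newest date
  { id: '2024.07.01/180102', date: '2024.07.01' }, // later snapshot, same date
  { id: '2024.06.30/060102', date: '2024.06.30' },
]

// Date-only sort: Array.prototype.sort is stable, so the two same-date
// entries keep their insertion order and the earlier snapshot stays first.
const byDate = [...datasets].sort((a, b) => b.date.localeCompare(a.date))
// byDate[0].id === '2024.07.01/060102' — the stale snapshot

// Full-id sort: the time component breaks the tie.
const byId = [...datasets].sort((a, b) => b.id.localeCompare(a.id))
// byId[0].id === '2024.07.01/180102' — the truly latest snapshot
```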
services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts (comment resolved)
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```ts
const log = getServiceLogger()

const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'
```
Ngrok development URL hardcoded as production default
High Severity
DEFAULT_API_URL is set to a temporary ngrok tunnel URL (https://hypervascular-nonduplicative-vern.ngrok-free.dev). If LF_CRITICALITY_SCORE_API_URL is not set in the environment, every LfCriticalityScoreSource operation will attempt to reach this ephemeral dev tunnel, which will almost certainly be dead in production. This causes all LF criticality score data fetching to fail silently or with errors.
Additional Locations (1)
```ts
}

// Sort newest-first by date
datasets.sort((a, b) => b.date.localeCompare(a.date))
```
OSSF dataset sort ignores time within same date
Low Severity
The sort in listAvailableDatasets only compares by date (e.g. 2024.07.01), not by the full id which includes the time component (e.g. 2024.07.01/060102). When multiple time-based snapshots exist for the same date, their relative order is determined by insertion order (earliest time first from the GCS listing). In incremental mode, allDatasets[0] would pick the earliest time snapshot for the newest date rather than the latest one.


Note
Medium Risk
Adds a new scheduled Temporal workflow that streams large external datasets and bulk-upserts into Postgres, and changes `projectCatalog` to store two separate score fields; failures or schema mismatches could impact data quality and worker stability.

Overview

Adds an automatic projects discovery Temporal worker that enumerates discovery sources, lists available dataset snapshots, streams each dataset (CSV via `csv-parse` or object-mode JSON), and bulk upserts projects into `projectCatalog` in 5k-row batches.

Introduces two discovery sources: OSSF Criticality Score (public GCS CSV snapshots) and LF Criticality Score (paged JSON API, last 12 months), and registers a daily midnight Temporal schedule to run `discoverProjects` in incremental mode with a 2-hour execution timeout.

Updates the data-access layer for `projectCatalog` to replace `criticalityScore` with separate `ossfCriticalityScore` and `lfCriticalityScore` columns, including insert/bulk-insert/upsert semantics that preserve existing scores when new data omits them.

Written by Cursor Bugbot for commit ed37511.
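The "preserve existing scores when new data omits them" upsert semantics can be modeled as a small merge rule. In the PR this lives in SQL (the DAL's upsert logic); the TypeScript sketch below only mirrors the rule, and `mergeScores` is a hypothetical name:

```typescript
// Sketch of the upsert merge rule: an incoming row that omits a score
// keeps the existing stored value; a present score overwrites it.
// This mirrors the SQL semantics described in the PR; it is not the DAL code.
type Scores = {
  ossfCriticalityScore: number | null
  lfCriticalityScore: number | null
}

function mergeScores(existing: Scores, incoming: Partial<Scores>): Scores {
  return {
    // `??` keeps the existing value when the incoming field is absent,
    // analogous to COALESCE(new, old) in an ON CONFLICT update.
    ossfCriticalityScore: incoming.ossfCriticalityScore ?? existing.ossfCriticalityScore,
    lfCriticalityScore: incoming.lfCriticalityScore ?? existing.lfCriticalityScore,
  }
}
```

This matters because the two sources each supply only one of the scores, so an LF upsert must not wipe an earlier OSSF score and vice versa.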