feat: add ossf data fetcher (CM-952) #3839

Open
ulemons wants to merge 12 commits into feat/add-dal-automatic-project-discovery from feat/add-assf-data-fetcher

Contributor

@ulemons ulemons commented Feb 10, 2026

Note

Medium Risk
Adds a new scheduled Temporal workflow that streams large external datasets and bulk-upserts into Postgres, and changes projectCatalog to store two separate score fields; failures or schema mismatches could impact data quality and worker stability.

Overview
Adds an automatic projects discovery Temporal worker that enumerates discovery sources, lists available dataset snapshots, streams each dataset (CSV via csv-parse or object-mode JSON), and bulk upserts projects into projectCatalog in 5k-row batches.
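
The batching step described above can be sketched as an async generator that groups streamed rows into fixed-size chunks. This is a minimal illustration, not the PR's implementation: `inBatches`, `fakeRows`, and `collectBatchSizes` are hypothetical names, and the row shape is assumed.

```typescript
// Hypothetical helper: groups rows from any async source into fixed-size batches.
async function* inBatches<T>(rows: AsyncIterable<T>, batchSize: number): AsyncGenerator<T[]> {
  if (batchSize <= 0) throw new Error('batchSize must be positive')
  let batch: T[] = []
  for await (const row of rows) {
    batch.push(row)
    if (batch.length === batchSize) {
      yield batch
      batch = []
    }
  }
  if (batch.length > 0) yield batch // flush the final partial batch
}

// Stand-in for a parsed CSV or JSON stream (assumed row shape).
async function* fakeRows(count: number): AsyncGenerator<{ repoUrl: string }> {
  for (let i = 0; i < count; i++) yield { repoUrl: `https://example.com/repo-${i}` }
}

// Consumes the batches; a real worker would call its bulk upsert per batch.
async function collectBatchSizes(total: number, batchSize: number): Promise<number[]> {
  const sizes: number[] = []
  for await (const batch of inBatches(fakeRows(total), batchSize)) {
    sizes.push(batch.length)
  }
  return sizes
}
```

Because only one batch is held in memory at a time, the memory footprint is bounded by the batch size (5,000 rows in the PR) rather than by the dataset size.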

Introduces two discovery sources: OSSF Criticality Score (public GCS CSV snapshots) and LF Criticality Score (paged JSON API, last 12 months), and registers a daily midnight Temporal schedule to run discoverProjects in incremental mode with a 2-hour execution timeout.

Updates the data-access layer for projectCatalog to replace criticalityScore with separate ossfCriticalityScore and lfCriticalityScore columns, including insert/bulk-insert/upsert semantics that preserve existing scores when new data omits them.
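
One way to read "preserve existing scores when new data omits them" is a per-column `COALESCE` in the conflict clause. The sketch below mirrors that rule in plain TypeScript and shows an assumed SQL shape. The PR's actual SQL is not shown in this conversation, so `mergeScores`, `upsertClauseSketch`, and the `"repoUrl"` conflict target are hypothetical.

```typescript
interface ScoreFields {
  ossfCriticalityScore: number | null
  lfCriticalityScore: number | null
}

// Mirrors COALESCE(EXCLUDED.col, existing.col): a missing or null incoming
// value keeps the stored one, while a real value (including 0) overwrites it.
function mergeScores(existing: ScoreFields, incoming: Partial<ScoreFields>): ScoreFields {
  return {
    ossfCriticalityScore: incoming.ossfCriticalityScore ?? existing.ossfCriticalityScore,
    lfCriticalityScore: incoming.lfCriticalityScore ?? existing.lfCriticalityScore,
  }
}

// Assumed shape of the conflict clause; the conflict target is a guess.
const upsertClauseSketch = `
  ON CONFLICT ("repoUrl") DO UPDATE SET
    "ossfCriticalityScore" = COALESCE(EXCLUDED."ossfCriticalityScore", "projectCatalog"."ossfCriticalityScore"),
    "lfCriticalityScore"   = COALESCE(EXCLUDED."lfCriticalityScore", "projectCatalog"."lfCriticalityScore")
`
```

With this rule, an OSSF run that carries no LF score leaves `lfCriticalityScore` untouched, and vice versa, so the two sources can write to the same row independently.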

Written by Cursor Bugbot for commit ed37511. This will update automatically on new commits.

@ulemons ulemons self-assigned this Feb 10, 2026
Contributor

@github-actions github-actions bot left a comment

Conventional Commits FTW!

@ulemons ulemons changed the title from "Feat/add assf data fetcher" to "feat: add ossf data fetcher" Feb 10, 2026
@ulemons ulemons added the POC label Feb 11, 2026
@ulemons ulemons changed the title from "feat: add ossf data fetcher" to "feat: add ossf data fetcher (CM-952)" Feb 11, 2026
@ulemons ulemons changed the base branch from feat/add-project-discovery-worker to feat/add-dal-automatic-project-discovery February 11, 2026 10:46
@ulemons ulemons force-pushed the feat/add-dal-automatic-project-discovery branch 2 times, most recently from f4cc9b8 to 543aa5f March 24, 2026 10:31
@ulemons ulemons force-pushed the feat/add-assf-data-fetcher branch from 264892a to e4e5fc2 March 24, 2026 10:37
@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ulemons ulemons force-pushed the feat/add-assf-data-fetcher branch from e4e5fc2 to 0b93da4 March 24, 2026 10:38
@ulemons ulemons force-pushed the feat/add-dal-automatic-project-discovery branch from 3cbe8ec to 0af35ee March 24, 2026 11:56
@ulemons ulemons force-pushed the feat/add-assf-data-fetcher branch from 76c308c to 02ab17d March 24, 2026 11:58
@ulemons ulemons marked this pull request as ready for review March 24, 2026 14:01
Copilot AI review requested due to automatic review settings March 24, 2026 14:01
@ulemons ulemons force-pushed the feat/add-dal-automatic-project-discovery branch from 0af35ee to 82f29d9 March 24, 2026 14:02
ulemons and others added 12 commits March 24, 2026 15:02
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
@ulemons ulemons force-pushed the feat/add-assf-data-fetcher branch from 02ab17d to ed37511 March 24, 2026 14:02
Contributor

Copilot AI left a comment

Pull request overview

Adds an Automatic Projects Discovery Temporal worker that fetches external criticality datasets (OSSF + LF) and upserts them into the projectCatalog table, while aligning the data-access layer with the underlying schema.

Changes:

  • Split criticalityScore into ossfCriticalityScore and lfCriticalityScore across DAL types and SQL upsert/insert logic.
  • Implement a Temporal workflow + activities pipeline to list datasets per source and process datasets in batches via bulkUpsertProjectCatalog.
  • Add source abstractions/registry plus OSSF (GCS CSV) and LF (API JSON) fetchers, along with scheduling/docs/deps.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/project-catalog/types.ts | Updates DB types to use ossfCriticalityScore + lfCriticalityScore. |
| services/libs/data-access-layer/src/project-catalog/projectCatalog.ts | Updates select/insert/bulk insert/upsert/update SQL to use the two score columns. |
| services/apps/automatic_projects_discovery_worker/src/workflows/discoverProjects.ts | New workflow orchestration (incremental vs full) + activity timeouts/retries. |
| services/apps/automatic_projects_discovery_worker/src/sources/types.ts | Introduces source/dataset interfaces and CSV vs JSON streaming contract. |
| services/apps/automatic_projects_discovery_worker/src/sources/registry.ts | Registers discovery sources and exposes lookup/list helpers. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/source.ts | Implements dataset listing + CSV row parsing for OSSF snapshots. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/bucketClient.ts | Minimal HTTP/XML client for listing GCS prefixes + streaming all.csv. |
| services/apps/automatic_projects_discovery_worker/src/sources/lf-criticality-score/source.ts | Implements LF dataset descriptor generation + paginated API streaming + parsing. |
| services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts | Registers Temporal schedule (daily cron) and workflow args/timeouts. |
| services/apps/automatic_projects_discovery_worker/src/main.ts | Enables Postgres for this worker and schedules discovery on startup. |
| services/apps/automatic_projects_discovery_worker/src/activities/activities.ts | Adds activities to list sources/datasets and process datasets with batching + upsert. |
| services/apps/automatic_projects_discovery_worker/src/activities.ts | Switches barrel export to explicit named exports for activity typing. |
| services/apps/automatic_projects_discovery_worker/package.json | Adds csv-parse dependency and adjusts local debug script. |
| services/apps/automatic_projects_discovery_worker/README.md | Adds architecture/workflow documentation for the worker. |
| pnpm-lock.yaml | Locks csv-parse dependency version. |

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +11 to +12
const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'
const PAGE_SIZE = 100

Copilot AI Mar 24, 2026

DEFAULT_API_URL is set to an ngrok-free.dev address. That’s ephemeral and not suitable as a production/default endpoint; it can also silently route traffic to an unexpected host if the env var isn’t set. Consider removing the default and failing fast when LF_CRITICALITY_SCORE_API_URL is missing (or replace it with a stable, owned HTTPS endpoint).
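
A fail-fast variant of the suggestion above might look like the following. This is a sketch under assumptions: `resolveLfApiUrl` is a hypothetical helper that does not appear in the PR, and the HTTPS-only check is an extra guard added here for illustration.

```typescript
// Hypothetical helper, not taken from the PR: reads the LF API base URL from
// an env-like map and refuses to continue without it.
function resolveLfApiUrl(env: Record<string, string | undefined>): string {
  const raw = (env['LF_CRITICALITY_SCORE_API_URL'] ?? '').trim()
  if (raw === '') {
    throw new Error('LF_CRITICALITY_SCORE_API_URL is not set; refusing to fall back to a default host')
  }
  // Assumption: only HTTPS endpoints are acceptable for this source.
  if (!raw.startsWith('https://')) {
    throw new Error('LF_CRITICALITY_SCORE_API_URL must be an https:// URL')
  }
  return raw.replace(/\/+$/, '') // drop trailing slashes so paths can be appended
}
```

Calling `resolveLfApiUrl(process.env)` once at worker startup would surface a missing variable immediately, instead of at the first scheduled run.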

Comment on lines +30 to +31
// Sort newest-first by date
datasets.sort((a, b) => b.date.localeCompare(a.date))

Copilot AI Mar 24, 2026

Datasets are sorted newest-first only by the date field. If the bucket contains multiple snapshots for the same date (different time prefixes), this sort won’t guarantee the latest snapshot is first, which can make incremental mode pick an older dataset. Consider sorting by the full dataset id/prefix (date+time) or by id descending instead of only date.

Suggested change
- // Sort newest-first by date
- datasets.sort((a, b) => b.date.localeCompare(a.date))
+ // Sort newest-first by full dataset id (includes date and time)
+ datasets.sort((a, b) => b.id.localeCompare(a.id))

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


const log = getServiceLogger()

const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'

Ngrok development URL hardcoded as production default

High Severity

DEFAULT_API_URL is set to a temporary ngrok tunnel URL (https://hypervascular-nonduplicative-vern.ngrok-free.dev). If LF_CRITICALITY_SCORE_API_URL is not set in the environment, every LfCriticalityScoreSource operation will attempt to reach this ephemeral dev tunnel, which will almost certainly be dead in production. This causes all LF criticality score data fetching to fail silently or with errors.

}

// Sort newest-first by date
datasets.sort((a, b) => b.date.localeCompare(a.date))

OSSF dataset sort ignores time within same date

Low Severity

The sort in listAvailableDatasets only compares by date (e.g. 2024.07.01), not by the full id which includes the time component (e.g. 2024.07.01/060102). When multiple time-based snapshots exist for the same date, their relative order is determined by insertion order (earliest time first from the GCS listing). In incremental mode, allDatasets[0] would pick the earliest time snapshot for the newest date rather than the latest one.
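
The difference between the two sort keys can be reproduced with a small example. The descriptor shape and the second same-day snapshot (`2024.07.01/180000`) are assumptions made for illustration; only the `2024.07.01/060102` id format comes from the comment above.

```typescript
interface DatasetDescriptor {
  id: string // assumed format: "<date>/<time>", e.g. "2024.07.01/060102"
  date: string
}

// Listing order as a bucket listing might return it: earliest time first.
const listed: DatasetDescriptor[] = [
  { id: '2024.06.24/060102', date: '2024.06.24' },
  { id: '2024.07.01/060102', date: '2024.07.01' },
  { id: '2024.07.01/180000', date: '2024.07.01' }, // hypothetical second snapshot
]

// Sorting by date only: ties keep their listing order (the sort is stable),
// so the earlier same-day snapshot ends up first.
const byDate = [...listed].sort((a, b) => b.date.localeCompare(a.date))

// Sorting by the full id breaks the tie using the time component.
const byId = [...listed].sort((a, b) => b.id.localeCompare(a.id))

console.log(byDate[0].id) // 2024.07.01/060102
console.log(byId[0].id) // 2024.07.01/180000
```

Comparing ids as strings only matches chronological order because the date and time segments are fixed-width and zero-padded.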
