feat: add ossf data fetcher (CM-952) #3839

ulemons wants to merge 12 commits into feat/add-dal-automatic-project-discovery from
Conversation
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Pull request overview
Adds an Automatic Projects Discovery Temporal worker that fetches external criticality datasets (OSSF + LF) and upserts them into the projectCatalog table, while aligning the data-access layer with the underlying schema.
Changes:
- Split `criticalityScore` into `ossfCriticalityScore` and `lfCriticalityScore` across DAL types and SQL upsert/insert logic.
- Implement a Temporal workflow + activities pipeline to list datasets per source and process datasets in batches via `bulkUpsertProjectCatalog`.
- Add source abstractions/registry plus OSSF (GCS CSV) and LF (API JSON) fetchers, along with scheduling/docs/deps.
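The batch-and-upsert pipeline described above can be sketched roughly as follows. `bulkUpsertProjectCatalog` and the 5k batch size are named in the PR; the row shape and the function signature here are illustrative assumptions, not the actual implementation:

```typescript
// Sketch: accumulate streamed rows and flush in fixed-size batches.
// The row type and the shape of the upsert callback are assumptions.
type CatalogRow = {
  repoUrl: string
  ossfCriticalityScore?: number
  lfCriticalityScore?: number
}

const BATCH_SIZE = 5000 // batch size mentioned in the PR description

async function processDataset(
  rows: AsyncIterable<CatalogRow>,
  upsert: (batch: CatalogRow[]) => Promise<void>, // e.g. bulkUpsertProjectCatalog
): Promise<number> {
  let batch: CatalogRow[] = []
  let total = 0
  for await (const row of rows) {
    batch.push(row)
    if (batch.length >= BATCH_SIZE) {
      await upsert(batch)
      total += batch.length
      batch = []
    }
  }
  // Flush the final partial batch, if any.
  if (batch.length > 0) {
    await upsert(batch)
    total += batch.length
  }
  return total
}
```

Keeping the upsert behind a callback like this also makes the batching logic trivially testable without a database.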
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| services/libs/data-access-layer/src/project-catalog/types.ts | Updates DB types to use ossfCriticalityScore + lfCriticalityScore. |
| services/libs/data-access-layer/src/project-catalog/projectCatalog.ts | Updates select/insert/bulk insert/upsert/update SQL to use the two score columns. |
| services/apps/automatic_projects_discovery_worker/src/workflows/discoverProjects.ts | New workflow orchestration (incremental vs full) + activity timeouts/retries. |
| services/apps/automatic_projects_discovery_worker/src/sources/types.ts | Introduces source/dataset interfaces and CSV vs JSON streaming contract. |
| services/apps/automatic_projects_discovery_worker/src/sources/registry.ts | Registers discovery sources and exposes lookup/list helpers. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/source.ts | Implements dataset listing + CSV row parsing for OSSF snapshots. |
| services/apps/automatic_projects_discovery_worker/src/sources/ossf-criticality-score/bucketClient.ts | Minimal HTTP/XML client for listing GCS prefixes + streaming all.csv. |
| services/apps/automatic_projects_discovery_worker/src/sources/lf-criticality-score/source.ts | Implements LF dataset descriptor generation + paginated API streaming + parsing. |
| services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts | Registers Temporal schedule (daily cron) and workflow args/timeouts. |
| services/apps/automatic_projects_discovery_worker/src/main.ts | Enables Postgres for this worker and schedules discovery on startup. |
| services/apps/automatic_projects_discovery_worker/src/activities/activities.ts | Adds activities to list sources/datasets and process datasets with batching + upsert. |
| services/apps/automatic_projects_discovery_worker/src/activities.ts | Switches barrel export to explicit named exports for activity typing. |
| services/apps/automatic_projects_discovery_worker/package.json | Adds csv-parse dependency and adjusts local debug script. |
| services/apps/automatic_projects_discovery_worker/README.md | Adds architecture/workflow documentation for the worker. |
| pnpm-lock.yaml | Locks csv-parse dependency version. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
```ts
const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'
const PAGE_SIZE = 100
```
DEFAULT_API_URL is set to an ngrok-free.dev address. That’s ephemeral and not suitable as a production/default endpoint; it can also silently route traffic to an unexpected host if the env var isn’t set. Consider removing the default and failing fast when LF_CRITICALITY_SCORE_API_URL is missing (or replace it with a stable, owned HTTPS endpoint).
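The fail-fast alternative suggested here might look like the sketch below. Only the env var name `LF_CRITICALITY_SCORE_API_URL` comes from the PR; the `requireApiUrl` helper is hypothetical:

```typescript
// Sketch: refuse to start with a missing endpoint instead of silently
// falling back to an ephemeral default. `requireApiUrl` is a hypothetical
// helper, not part of the PR.
function requireApiUrl(env: Record<string, string | undefined>): string {
  const url = env['LF_CRITICALITY_SCORE_API_URL']
  if (!url) {
    throw new Error(
      'LF_CRITICALITY_SCORE_API_URL must be set; refusing to fall back to a default endpoint',
    )
  }
  return url
}
```

Failing at startup surfaces a misconfiguration immediately, rather than letting every LF fetch time out against a dead tunnel in production.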
```ts
// Sort newest-first by date
datasets.sort((a, b) => b.date.localeCompare(a.date))
```
Datasets are sorted newest-first only by the date field. If the bucket contains multiple snapshots for the same date (different time prefixes), this sort won’t guarantee the latest snapshot is first, which can make incremental mode pick an older dataset. Consider sorting by the full dataset id/prefix (date+time) or by id descending instead of only date.
Suggested change:

```diff
-// Sort newest-first by date
-datasets.sort((a, b) => b.date.localeCompare(a.date))
+// Sort newest-first by full dataset id (includes date and time)
+datasets.sort((a, b) => b.id.localeCompare(a.id))
```
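A small self-contained demo of the issue: with two snapshots on the same date, a stable date-only sort leaves the earlier snapshot first, while sorting by the full id picks the latest. The `date/time` id shape is inferred from the example in the comment (`2024.07.01/060102`):

```typescript
// Minimal reproduction of the sort-order bug described in the review.
type Dataset = { id: string; date: string }

const datasets: Dataset[] = [
  { id: '2024.07.01/060102', date: '2024.07.01' }, // earlier snapshot, newest date
  { id: '2024.07.01/180102', date: '2024.07.01' }, // later snapshot, same date
  { id: '2024.06.30/060102', date: '2024.06.30' },
]

// Date-only sort: Array.prototype.sort is stable, so the two same-date
// entries keep their insertion order and the earlier snapshot stays first.
const byDate = [...datasets].sort((a, b) => b.date.localeCompare(a.date))
// byDate[0].id === '2024.07.01/060102' — the stale snapshot

// Full-id sort: the time component breaks the tie.
const byId = [...datasets].sort((a, b) => b.id.localeCompare(a.id))
// byId[0].id === '2024.07.01/180102' — the truly latest snapshot
```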
services/apps/automatic_projects_discovery_worker/src/schedules/scheduleProjectsDiscovery.ts (comment resolved)
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```ts
const log = getServiceLogger()

const DEFAULT_API_URL = 'https://hypervascular-nonduplicative-vern.ngrok-free.dev'
```
Ngrok development URL hardcoded as production default
High Severity
DEFAULT_API_URL is set to a temporary ngrok tunnel URL (https://hypervascular-nonduplicative-vern.ngrok-free.dev). If LF_CRITICALITY_SCORE_API_URL is not set in the environment, every LfCriticalityScoreSource operation will attempt to reach this ephemeral dev tunnel, which will almost certainly be dead in production. This causes all LF criticality score data fetching to fail silently or with errors.
Additional Locations (1)
```ts
}

// Sort newest-first by date
datasets.sort((a, b) => b.date.localeCompare(a.date))
```
OSSF dataset sort ignores time within same date
Low Severity
The sort in listAvailableDatasets only compares by date (e.g. 2024.07.01), not by the full id which includes the time component (e.g. 2024.07.01/060102). When multiple time-based snapshots exist for the same date, their relative order is determined by insertion order (earliest time first from the GCS listing). In incremental mode, allDatasets[0] would pick the earliest time snapshot for the newest date rather than the latest one.


Note
Medium Risk
Adds a new scheduled Temporal workflow that streams large external datasets and bulk-upserts into Postgres, and changes `projectCatalog` to store two separate score fields; failures or schema mismatches could impact data quality and worker stability.

Overview

Adds an automatic projects discovery Temporal worker that enumerates discovery sources, lists available dataset snapshots, streams each dataset (CSV via `csv-parse` or object-mode JSON), and bulk upserts projects into `projectCatalog` in 5k-row batches.

Introduces two discovery sources: OSSF Criticality Score (public GCS CSV snapshots) and LF Criticality Score (paged JSON API, last 12 months), and registers a daily midnight Temporal schedule to run `discoverProjects` in incremental mode with a 2-hour execution timeout.

Updates the data-access layer for `projectCatalog` to replace `criticalityScore` with separate `ossfCriticalityScore` and `lfCriticalityScore` columns, including insert/bulk-insert/upsert semantics that preserve existing scores when new data omits them.

Written by Cursor Bugbot for commit ed37511.
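The "preserve existing scores when new data omits them" upsert semantics can be modeled as a small merge rule. In the PR this lives in SQL (the DAL's upsert logic); the TypeScript sketch below only mirrors the rule, and `mergeScores` is a hypothetical name:

```typescript
// Sketch of the upsert merge rule: an incoming row that omits a score
// keeps the existing stored value; a present score overwrites it.
// This mirrors the SQL semantics described in the PR; it is not the DAL code.
type Scores = {
  ossfCriticalityScore: number | null
  lfCriticalityScore: number | null
}

function mergeScores(existing: Scores, incoming: Partial<Scores>): Scores {
  return {
    // `??` keeps the existing value when the incoming field is absent,
    // analogous to COALESCE(new, old) in an ON CONFLICT update.
    ossfCriticalityScore: incoming.ossfCriticalityScore ?? existing.ossfCriticalityScore,
    lfCriticalityScore: incoming.lfCriticalityScore ?? existing.lfCriticalityScore,
  }
}
```

This matters because the two sources each supply only one of the scores, so an LF upsert must not wipe an earlier OSSF score and vice versa.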