Add Statcan data scraping functionality for issue #47 by jameslong · Pull Request #51 · BuildCanada/OutcomeTrackerAPI

jameslong · 2025-07-08T01:10:27Z

This pull request introduces functionality for managing and syncing Statistics Canada datasets, including new models, jobs, services, migrations, and tests. It also includes updates to dependencies and configurations to support the new features.

See #47 for context.

New functionality for Statistics Canada dataset management:

app/models/statcan_dataset.rb: Added StatcanDataset model with validations for statcan_url, name, and sync_schedule, as well as methods for determining stale datasets and validating cron expressions.
app/jobs/statcan_cron_job.rb: Created StatcanCronJob to enqueue sync jobs for stale datasets based on their schedules.
app/jobs/statcan_sync_job.rb: Created StatcanSyncJob to fetch and update data for individual datasets using StatcanFetcher.
app/services/statcan_fetcher.rb: Added StatcanFetcher service to fetch and parse CSV data from Statistics Canada URLs.
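As a rough illustration of the fetch-and-parse step described for StatcanFetcher, here is a minimal sketch using Ruby's net/http and the csv gem. The module name CsvFetcherSketch, its method names, and the sample columns are assumptions for illustration only; the actual service in this PR may be structured differently.

```ruby
require "csv"
require "net/http"
require "uri"

# Hypothetical stand-in for the PR's StatcanFetcher; the real class may differ.
module CsvFetcherSketch
  # Download the body at `url`, returning nil for any non-2xx response.
  def self.download(url)
    response = Net::HTTP.get_response(URI(url))
    response.is_a?(Net::HTTPSuccess) ? response.body : nil
  end

  # Parse CSV text with a header row into an array of hashes keyed by column name.
  def self.parse(csv_text)
    CSV.parse(csv_text, headers: true).map(&:to_h)
  end

  # Fetch and parse in one step; returns nil if the download failed.
  def self.fetch(url)
    body = download(url)
    body && parse(body)
  end
end

# Illustrative sample only; not taken from a real dataset.
SAMPLE_CSV = <<~DATA
  REF_DATE,GEO,VALUE
  2025-01,Canada,100.5
  2025-02,Canada,101.2
DATA

rows = CsvFetcherSketch.parse(SAMPLE_CSV)
puts rows.length
puts rows.first["GEO"]
```

Parsed values come back as strings, so any numeric conversion would need to happen in a later step.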

Database changes:

db/migrate/20250707155320_create_statcan_datasets.rb: Added migration to create the statcan_datasets table with fields for dataset metadata, sync schedule, and current data.
db/schema.rb: Updated schema to include the new statcan_datasets table and its indexes.

Configuration and dependencies:

Gemfile: Added csv and fugit gems to support CSV parsing and cron expression handling.
config/initializers/good_job.rb: Configured a cron schedule for StatcanCronJob to run hourly.
config/environments/test.rb: Changed the Active Job queue adapter to :test so that the assert_enqueued_* test helpers can be used.

Tests:

test/models/statcan_dataset_test.rb: Added unit tests for the StatcanDataset model, including validations and sync logic.
test/jobs/statcan_cron_job_test.rb: Added tests for StatcanCronJob to verify job enqueueing for stale datasets.
test/jobs/statcan_sync_job_test.rb: Added tests for StatcanSyncJob to ensure data fetching and updates work as expected.
test/test_helper.rb: Included minitest/mock for mocking in tests.

These changes collectively enable automated syncing of Statistics Canada datasets, ensuring data is regularly updated and accessible for further processing or analysis.

jameslong · 2025-07-08T01:43:45Z

@xrendan @verrixkio Hey! Here's an initial incomplete draft for #47.

A few points:

Single cron job for dataset syncing: There's a cron, which enqueues one-off jobs to sync stale datasets. This is a simple/flexible approach, and means scheduling can be easily changed. However, it does introduce some latency to dataset syncs (up to 1 hour as currently set). That feels acceptable to me, but let me know what you think.

No dataset history: For simplicity, I haven't stored any history for the datasets. If that's a problem, let me know and I can update.

No API endpoints yet: Will add this next. Let me know if you have a preference for a specific path (/datasets/:name vs /statcan-datasets/:id etc), or want specific fields included in the json payload etc.

Finally, this is the first time I've written any Ruby/Rails code, and some of it is fairly different from my usual setup (Elixir/Phoenix), so if I've made any basic mistakes, please say. 😅

jameslong · 2025-07-08T01:48:40Z

lib/tasks/statcan.rake

+namespace :statcan do
+  desc "Setup Statcan datasets"
+  task setup_datasets: :environment do
+    statcan_datasets = [


Names/urls/schedules taken from the existing workflows https://github.com/BuildCanada/OutcomeTracker/tree/c165db79919c77fc66f0663c5267bd0b0e300337/.github/workflows.

Is there a specific reason for the existing schedules?

Note, we might not need this task now that the Avo resource has been added, as it's easy to add the datasets via the admin panel.

It'd be good to add this to the db seeds instead, so people have a good database to start with: `db/seeds/canada.rb`

I originally implemented this as seed data, and then Claude said I should do as a task instead. 😄

Refactored back to seeds: 32e6671.

jameslong · 2025-07-08T22:58:11Z

Marked as ready. This should now be feature complete:

api endpoint added
avo resource added

I'll do a final pass on the dataset names/links to double check, but other than that everything is ready for review. 🙂

xrendan

Looks good to me. Some minor nits to make this more Railsy, but after that it's good to go.

xrendan · 2025-07-09T01:01:23Z

lib/tasks/statcan.rake

+namespace :statcan do
+  desc "Setup Statcan datasets"
+  task setup_datasets: :environment do
+    statcan_datasets = [


It'd be good to add this to the db seeds instead, so people have a good database to start with: `db/seeds/canada.rb`

xrendan · 2025-07-09T01:12:16Z

app/jobs/statcan_sync_job.rb

+class StatcanSyncJob < ApplicationJob
+  queue_as :default
+
+  def perform(statcan_dataset_id)


nit: Using an id here is fine (and more space-efficient, especially when using Sidekiq or another Redis-backed queue), but practically it's nicer to let Rails do its magic with global_id and just pass in the object.

So when it's in the queue, the argument looks like gid://OutcomeTrackerAPI/StatcanDataset/<id> instead of a bare id. This is nice from an operational perspective because you know what you're working with instead of just having an integer.
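To make the idea concrete, here is a plain-Ruby sketch of what a self-describing reference buys you over a bare integer. It is a simplified illustration of the concept, not the globalid gem's API: the module GidSketch, its in-memory RECORDS table, and the dataset names are all hypothetical.

```ruby
require "uri"

# Simplified illustration of the idea behind GlobalID; not the gem's real API.
module GidSketch
  APP = "OutcomeTrackerAPI" # app name as written in the review comment

  Dataset = Struct.new(:id, :name)

  # Hypothetical in-memory "table" standing in for the database.
  RECORDS = {
    1 => Dataset.new(1, "labour-force"),
    2 => Dataset.new(2, "cpi")
  }.freeze

  MODELS = { "StatcanDataset" => RECORDS }.freeze

  # Serialize a record reference into a self-describing URI string.
  def self.to_gid(model_name, record)
    "gid://#{APP}/#{model_name}/#{record.id}"
  end

  # Resolve a gid string back to the record it points at.
  def self.locate(gid)
    uri = URI.parse(gid)
    raise ArgumentError, "not a gid: #{gid}" unless uri.scheme == "gid"

    model_name, id = uri.path.delete_prefix("/").split("/")
    MODELS.fetch(model_name).fetch(Integer(id, 10))
  end
end

gid = GidSketch.to_gid("StatcanDataset", GidSketch::RECORDS[2])
puts gid
puts GidSketch.locate(gid).name
```

With Active Job, passing the record itself to perform_later gets you this kind of serialization and lookup without writing any of the above.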

xrendan · 2025-07-09T01:16:14Z

app/jobs/statcan_sync_job.rb

+
+  def perform(statcan_dataset_id)
+    dataset = StatcanDataset.find(statcan_dataset_id)
+    data = StatcanFetcher.fetch(dataset.statcan_url)


nit: Rails has a convention to prefer fat models and skinny controllers (and imo that applies to jobs too).

I'd rather have a method on StatcanDataset called sync! instead of having the logic for refreshing live in the job itself.

Combining this and the above suggestion you get:

def perform(statcan_dataset)
  statcan_dataset.sync!
end
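For anyone less familiar with the pattern, the split between a fat model and a skinny job might look roughly like this in plain Ruby. Only sync! and perform come from the suggestion above; DatasetSketch, SyncJobSketch, FakeFetcher, and the injected fetcher argument are hypothetical, added so the sketch runs without Rails or a network.

```ruby
# Plain-Ruby sketch; no Rails required. Names other than sync! and perform
# are hypothetical.
class DatasetSketch
  attr_reader :url, :current_data, :last_synced_at

  def initialize(url)
    @url = url
    @current_data = nil
    @last_synced_at = nil
  end

  # The model owns the refresh logic. `fetcher` is anything that responds to
  # fetch(url); it is injected so the sketch can run without a network.
  def sync!(fetcher:, now: Time.now)
    data = fetcher.fetch(url)
    raise "fetch failed for #{url}" if data.nil? # keep the old data on failure

    @current_data = data
    @last_synced_at = now
    self
  end
end

# The job stays a thin wrapper that only delegates to the model.
class SyncJobSketch
  def perform(dataset, fetcher:)
    dataset.sync!(fetcher: fetcher)
  end
end

# Test double returning canned rows instead of hitting a real URL.
class FakeFetcher
  def fetch(_url)
    [{ "REF_DATE" => "2025-01", "VALUE" => "100.5" }]
  end
end

dataset = DatasetSketch.new("https://example.com/data.csv")
SyncJobSketch.new.perform(dataset, fetcher: FakeFetcher.new)
puts dataset.current_data.length
```

One benefit of this shape is that the sync logic can be tested directly on the model, and the job test only needs to check that it delegates.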

xrendan · 2025-07-09T01:20:02Z

app/jobs/statcan_cron_job.rb

+  queue_as :default
+
+  def perform(current_time = Time.current)
+    datasets = StatcanDataset.select(:id, :sync_schedule, :last_synced_at)


nit: You should use a scope for stale datasets instead of querying directly here

datasets = StatcanDataset.stale.select(:id, :sync_schedule, :last_synced_at)

I'm not sure this is easily doable. The stale logic involves parsing the cron schedule, and I don't think that's possible within a sql query. I could refactor to store e.g. a next_sync_at field, but then we're caching state.

Maybe it's ok to leave with the method approach for now?

Bit new to this, but for clarity, my understanding is:

scopes should return an ActiveRecord::Relation Object (so that scopes are composable)

we need non-sql logic for the cron schedule parsing/calculation

the suggested approach (using a mix of scope + all.select + application logic) wouldn't return a Relation object, and also wouldn't use the provided filters for the all.select query. i.e. it would fetch all the current_data for all datasets, even though that's not required for the cron job

Hope that makes sense. Very possible I'm misunderstanding something!

You're mostly right, you could do it as something like this, but it sucks.

scope :stale, ->(current_time = Time.current) { StatcanDataset.where(id: all.select { |dataset| dataset.needs_sync?(current_time) }.pluck(:id)) }
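To show why the stale check ends up in application code, here is a plain-Ruby sketch of the two steps discussed in this thread: decide staleness in Ruby, then reduce the result to ids that a where(id: ...) query could use. It is heavily simplified and rests on assumptions: only daily schedules of the form "MIN HOUR * * *" are handled (the PR uses fugit for real cron expressions), all times are UTC, and StaleSketch, previous_run, and stale_ids are hypothetical names. The actual needs_sync? in the PR may work differently.

```ruby
# Simplified stand-in for cron parsing: only "MIN HOUR * * *" (daily) is
# supported. Real cron expressions would need a parser such as fugit.
module StaleSketch
  SECONDS_PER_DAY = 86_400

  Dataset = Struct.new(:id, :sync_schedule, :last_synced_at) do
    # Stale if never synced, or if a scheduled run has passed since the last sync.
    def needs_sync?(now)
      last_synced_at.nil? ||
        last_synced_at < StaleSketch.previous_run(sync_schedule, now)
    end
  end

  # Most recent scheduled time at or before `now`. `now` is assumed to be UTC.
  def self.previous_run(schedule, now)
    minute, hour, *rest = schedule.split
    raise ArgumentError, "unsupported schedule: #{schedule}" unless rest == %w[* * *]

    candidate = Time.utc(now.year, now.month, now.day,
                         Integer(hour, 10), Integer(minute, 10))
    candidate > now ? candidate - SECONDS_PER_DAY : candidate
  end

  # Filter in Ruby, then keep only the ids for a follow-up query.
  def self.stale_ids(datasets, now)
    datasets.select { |dataset| dataset.needs_sync?(now) }.map(&:id)
  end
end

now = Time.utc(2025, 7, 9, 10, 30)
datasets = [
  StaleSketch::Dataset.new(1, "0 9 * * *", Time.utc(2025, 7, 8, 9, 5)),
  StaleSketch::Dataset.new(2, "0 9 * * *", Time.utc(2025, 7, 9, 9, 1)),
  StaleSketch::Dataset.new(3, "0 12 * * *", nil)
]
p StaleSketch.stale_ids(datasets, now)
```

The trade-off raised earlier in the thread still applies: every dataset row has to be loaded before filtering, so this only stays cheap while the table is small.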

xrendan · 2025-07-09T01:23:13Z

app/models/statcan_dataset.rb

+  validates :sync_schedule, presence: true
+  validate :valid_cron_expression
+
+  def self.filter_stale(datasets, current_time = Time.current)


replace this with the scope

See #51 (comment).

xrendan · 2025-07-09T01:23:15Z

app/models/statcan_dataset.rb

@@ -0,0 +1,30 @@
+class StatcanDataset < ApplicationRecord
+  validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }


You can add your scope here

Suggested change

-  validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }
+  scope :stale, ->(current_time = Time.current) {
+    all.select { |dataset| dataset.needs_sync?(current_time) }
+  }
+  validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }

jameslong · 2025-07-09T02:29:26Z

Looks good to me. Some minor nits to make this more Railsy, but after that it's good to go.

Amazing. Thank you for all the feedback. ❤️ Will make the changes tomorrow. 🚀

jameslong · 2025-07-10T00:06:19Z

@xrendan Addressed all changes, except the scope refactor - see comment thread here. Let me know your thoughts on that, happy to make the change if you still think it's best. 🙂

jameslong · 2025-07-11T15:38:56Z

@xrendan Addressed all changes, except the scope refactor - see comment thread here. Let me know your thoughts on that, happy to make the change if you still think it's best. 🙂

@xrendan Bump! Are you ok if I go ahead and merge?

jameslong added 3 commits July 7, 2025 18:02

Add dataset migration

d4b190c

Add Fugit gemfile (for cron expression parsing)

cb81dbc

Add StatcanDataset model and tests

70a07eb

jameslong self-assigned this Jul 8, 2025

jameslong changed the title ~~Add Statcan data scraping functionality (#47)~~ Add Statcan data scraping functionality for issue #47 Jul 8, 2025

jameslong linked an issue Jul 8, 2025 that may be closed by this pull request

Make Statcan scraping more reusable #47

Closed

jameslong added 6 commits July 7, 2025 18:20

Add CSV gem explicitly (not included in Ruby 3.4+)

5eefd9a

Add service to fetch and parse StatCan datasets

ded3511

Add job to sync datasets

06965e0

Add syncing/stale detection logic

62aa380

Add cron job to schedule StatCan dataset syncs

1b24995

Schedule statcan cron job to run every hour

915e0a7

jameslong force-pushed the feature/issue-47-statcan-scraping branch from 5479a30 to 68f85b3 Compare July 8, 2025 01:20

Add StatcanDatasets setup task

0abca46

jameslong force-pushed the feature/issue-47-statcan-scraping branch from 68f85b3 to 0abca46 Compare July 8, 2025 01:32

melkuo requested a review from xrendan July 8, 2025 14:06

Add GET route for StatCan datasets

0c937b3

jameslong added 2 commits July 8, 2025 17:53

Add dataset fixture and simplify new tests

e991333

Add Avo resource for StatCan datasets

4210e51

jameslong force-pushed the feature/issue-47-statcan-scraping branch from 377e9d3 to 4210e51 Compare July 8, 2025 22:55

jameslong marked this pull request as ready for review July 8, 2025 22:57

xrendan requested changes Jul 9, 2025

jameslong added 2 commits July 9, 2025 16:38

Review feedback: Add StatcanDatasets seed data

32e6671

Review feedback: Enqueue jobs with global ids

329e46b

jameslong added 5 commits July 9, 2025 16:58

Review feedback: Move sync logic from job -> model

7eb00ab

Fix test indentation (for consistency)

c97200e

Merge branch 'main' into feature/issue-47-statcan-scraping

0e2e23d

Fix: Don't store dataset data if not successfully fetched

f8a8f11

Undo whitespace changes to gemfile

09af215

xrendan self-requested a review July 15, 2025 18:51

xrendan merged commit 7d0fd4d into BuildCanada:main Jul 15, 2025
3 checks passed
