Skip to content

Add Statcan data scraping functionality for issue #47#51

Merged
xrendan merged 20 commits intoBuildCanada:mainfrom
jameslong:feature/issue-47-statcan-scraping
Jul 15, 2025
Merged

Add Statcan data scraping functionality for issue #47#51
xrendan merged 20 commits intoBuildCanada:mainfrom
jameslong:feature/issue-47-statcan-scraping

Conversation

@jameslong
Copy link
Copy Markdown
Contributor

@jameslong jameslong commented Jul 8, 2025

This pull request introduces functionality for managing and syncing Statistics Canada datasets, including new models, jobs, services, migrations, and tests. It also includes updates to dependencies and configurations to support the new features.

See #47 for context.

New functionality for Statistics Canada dataset management:

Database changes:

Configuration and dependencies:

Tests:

These changes collectively enable automated syncing of Statistics Canada datasets, ensuring data is regularly updated and accessible for further processing or analysis.

@jameslong jameslong self-assigned this Jul 8, 2025
@jameslong jameslong changed the title Add Statcan data scraping functionality (#47) Add Statcan data scraping functionality for issue #47 Jul 8, 2025
@jameslong jameslong linked an issue Jul 8, 2025 that may be closed by this pull request
@jameslong jameslong force-pushed the feature/issue-47-statcan-scraping branch from 5479a30 to 68f85b3 Compare July 8, 2025 01:20
@jameslong jameslong force-pushed the feature/issue-47-statcan-scraping branch from 68f85b3 to 0abca46 Compare July 8, 2025 01:32
@jameslong
Copy link
Copy Markdown
Contributor Author

@xrendan @verrixkio Hey! Here's an initial incomplete draft for #47 .

Few points:

Single cron job for dataset syncing: There's a cron, which enqueues one-off jobs to sync stale datasets. This is a simple/flexible approach, and means scheduling can be easily changed. However, it does introduce some latency to dataset syncs (up to 1 hour as currently set). That feels acceptable to me, but let me know what you think.

No dataset history: For simplicity, I haven't stored any history for the datasets. If that's a problem, let me know and I can update.

No API endpoints yet: Will add this next. Let me know if you have a preference for a specific path (/datasets/:name vs /statcan-datasets/:id etc), or want specific fields included in the json payload etc.

Finally, this is the first time I've written any ruby/rails code, some of it is fairly different to my usual setup (elixir/phoenix), so if I've made any basic mistakes please say. 😅

namespace :statcan do
desc "Setup Statcan datasets"
task setup_datasets: :environment do
statcan_datasets = [
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names/urls/schedules taken from the existing workflows https://github.com/BuildCanada/OutcomeTracker/tree/c165db79919c77fc66f0663c5267bd0b0e300337/.github/workflows.

Is there is a specific reason for the existing schedules?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, we might not need this task now that the Avo resource has been added, as it's easy to add the datasets via the admin panel.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to add this to the db seeds instead so people have a good database to start with `db/seeds/canada.rb

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally implemented this as seed data, and then Claude said I should do as a task instead. 😄

Refactored back to seeds: 32e6671.

@melkuo melkuo requested a review from xrendan July 8, 2025 14:06
@jameslong jameslong force-pushed the feature/issue-47-statcan-scraping branch from 377e9d3 to 4210e51 Compare July 8, 2025 22:55
@jameslong jameslong marked this pull request as ready for review July 8, 2025 22:57
@jameslong
Copy link
Copy Markdown
Contributor Author

jameslong commented Jul 8, 2025

Marked as ready. This should now be feature complete:

  • api endpoint added
  • avo resource added

I'll do a final pass on the dataset names/links to double check, but other than that everything is ready for review. 🙂

Copy link
Copy Markdown
Member

@xrendan xrendan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, some minor nits to make this more railsy, but after it's good to go

namespace :statcan do
desc "Setup Statcan datasets"
task setup_datasets: :environment do
statcan_datasets = [
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to add this to the db seeds instead so people have a good database to start with `db/seeds/canada.rb

class StatcanSyncJob < ApplicationJob
queue_as :default

def perform(statcan_dataset_id)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Using an id here is fine (and more space-efficient especially when using sidekiq or a redis-backed) but practically, it's nicer to let rails do it's magic with global_id and just pass in the object.

So instead when it's in the queue the argument looks like: gid://OutcomeTrackerAPI/StatcanDataset/<id> instead of id. This is nice from an operational perspective because you know what you're working with instead of just having an integer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


def perform(statcan_dataset_id)
dataset = StatcanDataset.find(statcan_dataset_id)
data = StatcanFetcher.fetch(dataset.statcan_url)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rails has a convention to prefer fat models and skinny controllers (and imo that applies to jobs too).

I'd rather have a method on a StatcanDataset called sync! instead of having the logic for refreshing live in the job itself.

Combining this and the above suggestion you get:

def perform(statcan_dataset)
    statcan_dataset.sync!
end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queue_as :default

def perform(current_time = Time.current)
datasets = StatcanDataset.select(:id, :sync_schedule, :last_synced_at)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You should use a scope on for stale datasets instead of querying directly here

datasets = StatcanDataset.stale.select(:id, :sync_schedule, :last_synced_at)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is easily doable. The stale logic involves parsing the cron schedule, and I don't think that's possible within a sql query. I could refactor to store e.g. a next_sync_at field, but then we're caching state.

Maybe it's ok to leave with the method approach for now?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit new to this, but for clarity, my understanding is:

  • scopes should return an ActiveRecord::Relation Object (so that scopes are composable)
  • we need non-sql logic for the cron schedule parsing/calculation
  • the suggested approach (using a mix of scope + all.select + application logic) wouldn't return a Relation object, and also wouldn't use the provided filters for the all.select query. i.e. it would fetch all the current_data for all datasets, even though that's not required for the cron job

Hope that makes sense. Very possible I'm misunderstanding something!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're mostly right, you could do it as something like this, but it sucks.

scope :stale, ->(current_time = Time.current) {
  StatcanDataset.where(id: all.select { |dataset| dataset.needs_sync?(current_time).pluck(:id)) }
}

validates :sync_schedule, presence: true
validate :valid_cron_expression

def self.filter_stale(datasets, current_time = Time.current)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace this with the scope

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,30 @@
class StatcanDataset < ApplicationRecord
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add your scope here

Suggested change
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }
scope :stale, ->(current_time = Time.current) {
all.select { |dataset| dataset.needs_sync?(current_time) }
}
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }

@jameslong
Copy link
Copy Markdown
Contributor Author

Looks good to me, some minor nits to make this more railsy, but after it's good to go

Amazing. Thank you for all the feedback. ❤️ Will make the changes tomorrow. 🚀

@jameslong
Copy link
Copy Markdown
Contributor Author

@xrendan Addressed all changes, except the scope refactor - see comment thread here. Let me know your thoughts on that, happy to make the change if you still think it's best. 🙂

@jameslong
Copy link
Copy Markdown
Contributor Author

@xrendan Addressed all changes, except the scope refactor - see comment thread here. Let me know your thoughts on that, happy to make the change if you still think it's best. 🙂

@xrendan Bump! Are you ok if I go ahead and merge?

@xrendan xrendan self-requested a review July 15, 2025 18:51
@xrendan xrendan merged commit 7d0fd4d into BuildCanada:main Jul 15, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make Statcan scraping more reusable

2 participants