-
Notifications
You must be signed in to change notification settings - Fork 10
Add Statcan data scraping functionality for issue #47 #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d4b190c
cb81dbc
70a07eb
5eefd9a
ded3511
06965e0
62aa380
1b24995
915e0a7
0abca46
0c937b3
e991333
4210e51
32e6671
329e46b
7eb00ab
c97200e
0e2e23d
f8a8f11
09af215
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| class Avo::Resources::StatcanDataset < Avo::BaseResource | ||
| # self.includes = [] | ||
| # self.attachments = [] | ||
| # self.search = { | ||
| # query: -> { query.ransack(id_eq: params[:q], m: "or").result(distinct: false) } | ||
| # } | ||
|
|
||
| def fields | ||
| field :id, as: :id | ||
| field :statcan_url, as: :text | ||
| field :name, as: :text | ||
| field :sync_schedule, as: :text | ||
| field :current_data, as: :code | ||
| field :last_synced_at, as: :date_time | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| # This controller has been generated to enable Rails' resource routes. | ||
| # More information on https://docs.avohq.io/3.0/controllers.html | ||
| class Avo::StatcanDatasetsController < Avo::ResourcesController | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| class StatcanDatasetsController < ApplicationController | ||
| def show | ||
| dataset = StatcanDataset.find(params[:id]) | ||
| render json: dataset | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| class StatcanCronJob < ApplicationJob | ||
| queue_as :default | ||
|
|
||
| def perform(current_time = Time.current) | ||
| datasets = StatcanDataset.select(:id, :sync_schedule, :last_synced_at) | ||
| stale_datasets = StatcanDataset.filter_stale(datasets, current_time) | ||
| stale_datasets.each(&StatcanSyncJob.method(:perform_later)) | ||
|
|
||
| Rails.logger.info "Enqueued #{stale_datasets.count} Statcan sync jobs" | ||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| class StatcanSyncJob < ApplicationJob | ||
| queue_as :default | ||
|
|
||
| def perform(statcan_dataset) | ||
| statcan_dataset.sync! | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,35 @@ | ||||||||||||
| class StatcanDataset < ApplicationRecord | ||||||||||||
| validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp } | ||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can add your scope here
Suggested change
|
||||||||||||
| validates :name, presence: true, uniqueness: true, format: { with: /\A[a-z0-9-]+\z/, message: "must be lowercase with hyphens only" } | ||||||||||||
| validates :sync_schedule, presence: true | ||||||||||||
| validate :valid_cron_expression | ||||||||||||
|
|
||||||||||||
| def self.filter_stale(datasets, current_time = Time.current) | ||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. replace this with the scope
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See #51 (comment). |
||||||||||||
| datasets.select { |dataset| dataset.needs_sync?(current_time) } | ||||||||||||
| end | ||||||||||||
|
|
||||||||||||
| def needs_sync?(current_time = Time.current) | ||||||||||||
| return true if last_synced_at.nil? | ||||||||||||
|
|
||||||||||||
| cron = Fugit::Cron.parse(sync_schedule) | ||||||||||||
| last_scheduled_time = cron.previous_time(current_time) | ||||||||||||
|
|
||||||||||||
| last_synced_at.to_i < last_scheduled_time.seconds | ||||||||||||
| end | ||||||||||||
|
|
||||||||||||
| def sync! | ||||||||||||
| data = StatcanFetcher.fetch(statcan_url) | ||||||||||||
| update!(current_data: data, last_synced_at: Time.current) | ||||||||||||
| end | ||||||||||||
|
|
||||||||||||
| private | ||||||||||||
|
|
||||||||||||
| def valid_cron_expression | ||||||||||||
| return unless sync_schedule.present? | ||||||||||||
|
|
||||||||||||
| parsed_cron = Fugit::Cron.parse(sync_schedule) | ||||||||||||
| if parsed_cron.nil? | ||||||||||||
| errors.add(:sync_schedule, "must be a valid cron expression") | ||||||||||||
| end | ||||||||||||
| end | ||||||||||||
| end | ||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| require "csv" | ||
|
|
||
| class StatcanFetcher | ||
| def self.fetch(url) | ||
| response = HTTP | ||
| .timeout(connect: 10, read: 60) | ||
| .headers("User-Agent" => "BuildCanada/OutcomeTrackerAPI") | ||
jameslong marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| .get(url) | ||
|
|
||
| unless response.status.success? | ||
| raise "HTTP Error: #{response.status} - #{response.status.reason}" | ||
| end | ||
|
|
||
| csv_string = response.body.to_s | ||
|
|
||
| # Remove UTF-8 Byte Order Mark (BOM) if present | ||
| csv_string = csv_string.sub(/\A\uFEFF/, "") | ||
|
|
||
| CSV.parse(csv_string, headers: true, liberal_parsing: true, skip_blanks: true).map(&:to_h) | ||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| class CreateStatcanDatasets < ActiveRecord::Migration[8.0] | ||
| def change | ||
| create_table :statcan_datasets do |t| | ||
| t.text :statcan_url, null: false | ||
| t.string :name, null: false | ||
| t.string :sync_schedule, null: false | ||
| t.jsonb :current_data | ||
| t.timestamp :last_synced_at | ||
|
|
||
| t.timestamps | ||
| end | ||
|
|
||
| add_index :statcan_datasets, :statcan_url, unique: true | ||
| add_index :statcan_datasets, :name, unique: true | ||
| add_index :statcan_datasets, :last_synced_at | ||
| end | ||
| end |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -637,4 +637,6 @@ | |
|
|
||
| puts "Seeding Evidences..." | ||
|
|
||
| require_relative 'statcan_datasets' | ||
|
|
||
| puts "Done seeding" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| puts "Seeding StatcanDatasets..." | ||
|
|
||
| statcan_datasets = [ | ||
| { | ||
| name: "balance-sheets", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1010001501&latestN=0&startDate=19901001&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B2%5D%2C%5B%5D%5D&checkedLevels=2D1%2C2D2%2C2D3", | ||
| sync_schedule: "23 6 * * *" # Daily at 6:23 AM | ||
| }, | ||
| { | ||
| name: "demographic-incomes-non-permanent-residents", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1110009101&latestN=2&startDate=&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B%5D%2C%5B%5D%2C%5B%5D%2C%5B5%5D%5D&checkedLevels=1D1%2C1D2%2C1D3%2C2D1%2C3D1", | ||
| sync_schedule: "23 8 * * *" # Daily at 8:23 AM | ||
| }, | ||
| { | ||
| name: "gdp", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=3610010401&latestN=0&startDate=19610101&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B1%5D%2C%5B1%5D%2C%5B%5D%5D&checkedLevels=3D1%2C3D2%2C3D3%2C3D4", | ||
| sync_schedule: "23 6 * * *" # Daily at 6:23 AM | ||
| }, | ||
| { | ||
| name: "housing-starts", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData.action?pid=3410015101&latestN=0&startDate=19880101&endDate=&csvLocale=en&selectedMembers=%5B%5B%5D%2C%5B1%5D%2C%5B%5D%5D&checkedLevels=0D1%2C2D1%2C2D2", | ||
| sync_schedule: "23 6 * * *" # Daily at 6:23 AM | ||
| }, | ||
| { | ||
| name: "labour-productivity", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=3610020701&latestN=0&startDate=19801001&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B5%5D%2C%5B1%2C2%2C3%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C13%2C14%2C15%2C16%2C17%2C18%2C19%2C20%2C21%5D%5D&checkedLevels=", | ||
| sync_schedule: "23 9 * * *" # Daily at 9:23 AM | ||
| }, | ||
| { | ||
| name: "non-permanent-residents", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData.action?pid=1710012101&latestN=0&startDate=20210101&endDate=&csvLocale=en&selectedMembers=%5B%5B%5D%2C%5B%5D%5D&checkedLevels=0D1%2C1D1%2C1D2%2C1D3", | ||
| sync_schedule: "23 7 * * *" # Daily at 7:23 AM | ||
| }, | ||
| { | ||
| name: "population", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1710000901&latestN=0&startDate=19000101&endDate=&csvLocale=en&selectedMembers=%5B%5B1%2C2%2C3%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C12%2C14%2C15%5D%5D&checkedLevels=", | ||
| sync_schedule: "23 6 * * *" # Daily at 6:23 AM | ||
| }, | ||
| { | ||
| name: "primary-energy-production", | ||
| statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=2510007901&latestN=5&startDate=&endDate=&csvLocale=en&selectedMembers=%5B%5B%5D%2C%5B%5D%2C%5B%5D%5D&checkedLevels=0D1%2C1D1%2C1D2%2C1D3%2C2D1", | ||
| sync_schedule: "23 10 * * *" # Daily at 10:23 AM | ||
| } | ||
| ] | ||
|
|
||
| statcan_datasets.each do |dataset_attrs| | ||
| dataset = StatcanDataset.find_or_create_by(name: dataset_attrs[:name]) do |d| | ||
| d.statcan_url = dataset_attrs[:statcan_url] | ||
| d.sync_schedule = dataset_attrs[:sync_schedule] | ||
| end | ||
|
|
||
| if dataset.persisted? | ||
| if dataset.previously_new_record? | ||
| puts "✓ #{dataset.name} - created" | ||
| else | ||
| puts "✓ #{dataset.name} - already exists" | ||
| end | ||
| else | ||
| puts "✗ #{dataset.name} - failed to create: #{dataset.errors.full_messages.join(', ')}" | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| require "test_helper" | ||
|
|
||
| class StatcanDatasetsControllerTest < ActionDispatch::IntegrationTest | ||
| test "should show dataset" do | ||
| dataset = statcan_datasets(:synced) | ||
|
|
||
| get statcan_dataset_url(dataset) | ||
|
|
||
| assert_response :success | ||
| assert_equal "application/json; charset=utf-8", response.content_type | ||
|
|
||
| json_response = JSON.parse(response.body) | ||
| assert_equal dataset.id, json_response["id"] | ||
| assert_equal dataset.name, json_response["name"] | ||
| end | ||
|
|
||
| test "should return 404 for non-existent dataset" do | ||
| get statcan_dataset_url(id: 99999) | ||
|
|
||
| assert_response :not_found | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| unsynced: | ||
| statcan_url: "https://statcan.gc.ca/123.csv" | ||
| name: "test-dataset-unsynced" | ||
| sync_schedule: "0 0 * * *" | ||
| current_data: null | ||
| last_synced_at: null | ||
|
|
||
| synced: | ||
| statcan_url: "https://statcan.gc.ca/456.csv" | ||
| name: "test-dataset-synced" | ||
| sync_schedule: "0 0 * * *" | ||
| current_data: | ||
| - year: 2020 | ||
| population: 38000000 | ||
| last_synced_at: "2024-01-15 10:30:00" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| require "test_helper" | ||
|
|
||
| class StatcanCronJobTest < ActiveJob::TestCase | ||
| def setup | ||
| # Remove fixture data before each test | ||
| StatcanDataset.delete_all | ||
| end | ||
|
|
||
| test "should enqueue sync jobs for stale datasets only" do | ||
| current_time = Time.parse("2025-01-02 14:00:00") # 2pm | ||
|
|
||
| # Create a stale dataset (never synced) | ||
| stale_dataset1 = StatcanDataset.create!( | ||
| name: "stale-never-synced", | ||
| statcan_url: "https://statcan.gc.ca/stale1.csv", | ||
| sync_schedule: "0 0 * * *", | ||
| last_synced_at: nil | ||
| ) | ||
|
|
||
| # Create another stale dataset (old sync) | ||
| stale_dataset2 = StatcanDataset.create!( | ||
| name: "stale-old-sync", | ||
| statcan_url: "https://statcan.gc.ca/stale2.csv", | ||
| sync_schedule: "0 0 * * *", | ||
| last_synced_at: Time.parse("2025-01-01 23:00:00") # Yesterday 11pm | ||
| ) | ||
|
|
||
| # Create a fresh dataset (recent sync) | ||
| _fresh_dataset = StatcanDataset.create!( | ||
| name: "fresh-dataset", | ||
| statcan_url: "https://statcan.gc.ca/fresh.csv", | ||
| sync_schedule: "0 0 * * *", | ||
| last_synced_at: Time.parse("2025-01-02 01:00:00") # 1am today | ||
| ) | ||
|
|
||
| # Track enqueued jobs | ||
| assert_enqueued_jobs 2, only: StatcanSyncJob do | ||
| StatcanCronJob.perform_now(current_time) | ||
| end | ||
|
|
||
| # Verify the correct jobs were enqueued | ||
| assert_enqueued_with(job: StatcanSyncJob, args: [ stale_dataset1 ]) | ||
| assert_enqueued_with(job: StatcanSyncJob, args: [ stale_dataset2 ]) | ||
| end | ||
|
|
||
| test "should not enqueue jobs when no datasets need syncing" do | ||
| current_time = Time.parse("2025-01-02 14:00:00") | ||
|
|
||
| # Create only fresh datasets | ||
| StatcanDataset.create!( | ||
| name: "fresh-dataset-1", | ||
| statcan_url: "https://statcan.gc.ca/fresh1.csv", | ||
| sync_schedule: "0 0 * * *", | ||
| last_synced_at: Time.parse("2025-01-02 01:00:00") # 1am today | ||
| ) | ||
|
|
||
| # Should not enqueue any jobs | ||
| assert_enqueued_jobs 0, only: StatcanSyncJob do | ||
| StatcanCronJob.perform_now(current_time) | ||
| end | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You should use a scope on for stale datasets instead of querying directly here
datasets = StatcanDataset.stale.select(:id, :sync_schedule, :last_synced_at)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is easily doable. The stale logic involves parsing the cron schedule, and I don't think that's possible within a sql query. I could refactor to store e.g. a
next_sync_atfield, but then we're caching state.Maybe it's ok to leave with the method approach for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit new to this, but for clarity, my understanding is:
all.selectquery. i.e. it would fetch all thecurrent_datafor all datasets, even though that's not required for the cron jobHope that makes sense. Very possible I'm misunderstanding something!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're mostly right, you could do it as something like this, but it sucks.