Merged
20 commits
4 changes: 4 additions & 0 deletions Gemfile
Original file line number Diff line number Diff line change
@@ -43,6 +43,8 @@ group :development, :test do
end
gem "avo", ">= 3.2"

gem "csv", "~> 3.3"

gem "devise", "~> 4.9"

gem "importmap-rails", "~> 2.1"
@@ -56,6 +58,8 @@ gem "dotenv", groups: [ :development, :test ]

gem "feedjira", "~> 3.2"

gem "fugit", "~> 1.11"

gem "http", "~> 5.3"

gem "iconv", "~> 1.1"
3 changes: 3 additions & 0 deletions Gemfile.lock
@@ -132,6 +132,7 @@ GEM
concurrent-ruby (1.3.5)
connection_pool (2.5.3)
crass (1.0.6)
csv (3.3.5)
date (3.4.1)
debug (1.11.0)
irb (~> 1.10)
@@ -428,10 +429,12 @@ DEPENDENCIES
bootsnap
brakeman
commonmarker (~> 2.3)
csv (~> 3.3)
debug
devise (~> 4.9)
dotenv
feedjira (~> 3.2)
fugit (~> 1.11)
good_job (~> 4.11)
http (~> 5.3)
iconv (~> 1.1)
16 changes: 16 additions & 0 deletions app/avo/resources/statcan_dataset.rb
@@ -0,0 +1,16 @@
class Avo::Resources::StatcanDataset < Avo::BaseResource
# self.includes = []
# self.attachments = []
# self.search = {
# query: -> { query.ransack(id_eq: params[:q], m: "or").result(distinct: false) }
# }

def fields
field :id, as: :id
field :statcan_url, as: :text
field :name, as: :text
field :sync_schedule, as: :text
field :current_data, as: :code
field :last_synced_at, as: :date_time
end
end
4 changes: 4 additions & 0 deletions app/controllers/avo/statcan_datasets_controller.rb
@@ -0,0 +1,4 @@
# This controller has been generated to enable Rails' resource routes.
# More information on https://docs.avohq.io/3.0/controllers.html
class Avo::StatcanDatasetsController < Avo::ResourcesController
end
6 changes: 6 additions & 0 deletions app/controllers/statcan_datasets_controller.rb
@@ -0,0 +1,6 @@
class StatcanDatasetsController < ApplicationController
  def show
    dataset = StatcanDataset.find(params[:id])
    render json: dataset
  end
end
11 changes: 11 additions & 0 deletions app/jobs/statcan_cron_job.rb
@@ -0,0 +1,11 @@
class StatcanCronJob < ApplicationJob
  queue_as :default

  def perform(current_time = Time.current)
    datasets = StatcanDataset.select(:id, :sync_schedule, :last_synced_at)
Member

nit: You should use a scope for stale datasets instead of querying directly here:

datasets = StatcanDataset.stale.select(:id, :sync_schedule, :last_synced_at)

Contributor Author

I'm not sure this is easily doable. The stale logic involves parsing the cron schedule, and I don't think that's possible within a SQL query. I could refactor to store, e.g., a next_sync_at field, but then we're caching state.

Maybe it's OK to leave it with the method approach for now?

Contributor Author

Bit new to this, but for clarity, my understanding is:

  • scopes should return an ActiveRecord::Relation object (so that scopes are composable)
  • we need non-SQL logic for the cron schedule parsing/calculation
  • the suggested approach (a mix of scope + all.select + application logic) wouldn't return a Relation object, and it also wouldn't use the provided filters for the all.select query; i.e., it would fetch current_data for every dataset, even though the cron job doesn't need it

Hope that makes sense. Very possible I'm misunderstanding something!
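
To make the point above concrete, here is a minimal plain-Ruby sketch of the in-memory staleness filter. It is an illustration only: `Dataset` is a hypothetical Struct standing in for the model, and `previous_run` is a simplified stand-in for the Fugit call that only understands daily "M H * * *" schedules. It shows that filtering with `select` yields a plain Array rather than a composable relation.

```ruby
# Hypothetical stand-in for the StatcanDataset model: no database, no Fugit.
Dataset = Struct.new(:name, :minute, :hour, :last_synced_at) do
  # Simplified assumption: the schedule fires once a day at hour:minute (UTC).
  # Returns the most recent firing time at or before `now`.
  def previous_run(now)
    today = Time.utc(now.year, now.month, now.day, hour, minute)
    today <= now ? today : today - 86_400
  end

  # Stale when never synced, or last synced before the most recent firing.
  def needs_sync?(now)
    return true if last_synced_at.nil?

    last_synced_at < previous_run(now)
  end
end

now = Time.utc(2025, 1, 2, 14, 0, 0)
datasets = [
  Dataset.new("never-synced", 0, 0, nil),
  Dataset.new("old-sync", 0, 0, Time.utc(2025, 1, 1, 23, 0, 0)),
  Dataset.new("fresh", 0, 0, Time.utc(2025, 1, 2, 1, 0, 0))
]

# The filter runs in application code, so the result is an Array.
stale = datasets.select { |d| d.needs_sync?(now) }
stale_names = stale.map(&:name)
```

Because the result is an Array, nothing further can be chained onto it as a query, which is the composability concern raised in the first bullet.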

Member

You're mostly right. You could do it with something like this, but it sucks:

scope :stale, ->(current_time = Time.current) {
  StatcanDataset.where(id: all.select { |dataset| dataset.needs_sync?(current_time) }.pluck(:id))
}

    stale_datasets = StatcanDataset.filter_stale(datasets, current_time)
    stale_datasets.each(&StatcanSyncJob.method(:perform_later))

    Rails.logger.info "Enqueued #{stale_datasets.count} Statcan sync jobs"
  end
end
7 changes: 7 additions & 0 deletions app/jobs/statcan_sync_job.rb
@@ -0,0 +1,7 @@
class StatcanSyncJob < ApplicationJob
  queue_as :default

  def perform(statcan_dataset)
    statcan_dataset.sync!
  end
end
35 changes: 35 additions & 0 deletions app/models/statcan_dataset.rb
@@ -0,0 +1,35 @@
class StatcanDataset < ApplicationRecord
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }
Member

You can add your scope here

Suggested change
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }
scope :stale, ->(current_time = Time.current) {
all.select { |dataset| dataset.needs_sync?(current_time) }
}
validates :statcan_url, presence: true, uniqueness: true, format: { with: URI::DEFAULT_PARSER.make_regexp }

validates :name, presence: true, uniqueness: true, format: { with: /\A[a-z0-9-]+\z/, message: "must be lowercase with hyphens only" }
validates :sync_schedule, presence: true
validate :valid_cron_expression

def self.filter_stale(datasets, current_time = Time.current)
Member

replace this with the scope

    datasets.select { |dataset| dataset.needs_sync?(current_time) }
  end

  def needs_sync?(current_time = Time.current)
    return true if last_synced_at.nil?

    cron = Fugit::Cron.parse(sync_schedule)
    last_scheduled_time = cron.previous_time(current_time)

    last_synced_at.to_i < last_scheduled_time.seconds
  end

  def sync!
    data = StatcanFetcher.fetch(statcan_url)
    update!(current_data: data, last_synced_at: Time.current)
  end

  private

  def valid_cron_expression
    return unless sync_schedule.present?

    parsed_cron = Fugit::Cron.parse(sync_schedule)
    if parsed_cron.nil?
      errors.add(:sync_schedule, "must be a valid cron expression")
    end
  end
end
21 changes: 21 additions & 0 deletions app/services/statcan_fetcher.rb
@@ -0,0 +1,21 @@
require "csv"

class StatcanFetcher
  def self.fetch(url)
    response = HTTP
      .timeout(connect: 10, read: 60)
      .headers("User-Agent" => "BuildCanada/OutcomeTrackerAPI")
      .get(url)

    unless response.status.success?
      raise "HTTP Error: #{response.status} - #{response.status.reason}"
    end

    csv_string = response.body.to_s

    # Remove UTF-8 Byte Order Mark (BOM) if present
    csv_string = csv_string.sub(/\A\uFEFF/, "")

    CSV.parse(csv_string, headers: true, liberal_parsing: true, skip_blanks: true).map(&:to_h)
  end
end
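
The parsing half of the fetcher can be tried in isolation with a small sketch. The sample payload and its column names below are made up for illustration, and no HTTP request is made; the BOM removal and `CSV.parse` options are the same ones used in `StatcanFetcher.fetch`.

```ruby
require "csv"

# Made-up sample payload: a UTF-8 BOM, a header row, two data rows,
# and a blank line in between.
raw = "\uFEFFREF_DATE,VALUE\n2020,38000000\n\n2021,38200000\n"

# Drop a leading BOM so it does not end up inside the first header name.
clean = raw.sub(/\A\uFEFF/, "")

# Parse with headers, skip blank lines, and turn each row into a Hash.
rows = CSV.parse(clean, headers: true, liberal_parsing: true, skip_blanks: true).map(&:to_h)
```

Every value comes back as a String, so any numeric conversion would have to happen in a separate step.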
2 changes: 1 addition & 1 deletion config/environments/test.rb
@@ -52,5 +52,5 @@
config.action_controller.raise_on_missing_callback_actions = true

# Use inline adapter for Active Job in tests for faster execution
config.active_job.queue_adapter = :inline
config.active_job.queue_adapter = :test
end
4 changes: 4 additions & 0 deletions config/initializers/good_job.rb
@@ -16,6 +16,10 @@
class: "FeedRefresherJob", # name of the job class as a String; must reference an Active Job job class
description: "Refreshed feeds and creates new entries", # optional description that appears in Dashboard,
enabled_by_default: -> { Rails.env.production? } # Only enable in production, otherwise can be enabled manually through Dashboard
},
statcan_sync: {
cron: "0 * * * *", # Every hour
class: "StatcanCronJob"
}
}

1 change: 1 addition & 0 deletions config/routes.rb
@@ -18,6 +18,7 @@
resources :promises, only: [ :index, :show ]
resources :evidences, only: [ :index, :show ]
resources :builders, only: [ :index, :show ]
resources :statcan_datasets, only: [ :show ]

namespace :admin do
resources :promises, only: [ :index, :show, :update, :destroy ]
17 changes: 17 additions & 0 deletions db/migrate/20250707155320_create_statcan_datasets.rb
@@ -0,0 +1,17 @@
class CreateStatcanDatasets < ActiveRecord::Migration[8.0]
def change
create_table :statcan_datasets do |t|
t.text :statcan_url, null: false
t.string :name, null: false
t.string :sync_schedule, null: false
t.jsonb :current_data
t.timestamp :last_synced_at

t.timestamps
end

add_index :statcan_datasets, :statcan_url, unique: true
add_index :statcan_datasets, :name, unique: true
add_index :statcan_datasets, :last_synced_at
end
end
16 changes: 15 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions db/seeds/canada.rb
@@ -637,4 +637,6 @@

puts "Seeding Evidences..."

require_relative 'statcan_datasets'

puts "Done seeding"
61 changes: 61 additions & 0 deletions db/seeds/statcan_datasets.rb
@@ -0,0 +1,61 @@
puts "Seeding StatcanDatasets..."

statcan_datasets = [
{
name: "balance-sheets",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1010001501&latestN=0&startDate=19901001&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B2%5D%2C%5B%5D%5D&checkedLevels=2D1%2C2D2%2C2D3",
sync_schedule: "23 6 * * *" # Daily at 6:23 AM
},
{
name: "demographic-incomes-non-permanent-residents",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1110009101&latestN=2&startDate=&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B%5D%2C%5B%5D%2C%5B%5D%2C%5B5%5D%5D&checkedLevels=1D1%2C1D2%2C1D3%2C2D1%2C3D1",
sync_schedule: "23 8 * * *" # Daily at 8:23 AM
},
{
name: "gdp",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=3610010401&latestN=0&startDate=19610101&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B1%5D%2C%5B1%5D%2C%5B%5D%5D&checkedLevels=3D1%2C3D2%2C3D3%2C3D4",
sync_schedule: "23 6 * * *" # Daily at 6:23 AM
},
{
name: "housing-starts",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData.action?pid=3410015101&latestN=0&startDate=19880101&endDate=&csvLocale=en&selectedMembers=%5B%5B%5D%2C%5B1%5D%2C%5B%5D%5D&checkedLevels=0D1%2C2D1%2C2D2",
sync_schedule: "23 6 * * *" # Daily at 6:23 AM
},
{
name: "labour-productivity",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=3610020701&latestN=0&startDate=19801001&endDate=&csvLocale=en&selectedMembers=%5B%5B1%5D%2C%5B5%5D%2C%5B1%2C2%2C3%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C13%2C14%2C15%2C16%2C17%2C18%2C19%2C20%2C21%5D%5D&checkedLevels=",
sync_schedule: "23 9 * * *" # Daily at 9:23 AM
},
{
name: "non-permanent-residents",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData.action?pid=1710012101&latestN=0&startDate=20210101&endDate=&csvLocale=en&selectedMembers=%5B%5B%5D%2C%5B%5D%5D&checkedLevels=0D1%2C1D1%2C1D2%2C1D3",
sync_schedule: "23 7 * * *" # Daily at 7:23 AM
},
{
name: "population",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1710000901&latestN=0&startDate=19000101&endDate=&csvLocale=en&selectedMembers=%5B%5B1%2C2%2C3%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C12%2C14%2C15%5D%5D&checkedLevels=",
sync_schedule: "23 6 * * *" # Daily at 6:23 AM
},
{
name: "primary-energy-production",
statcan_url: "https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=2510007901&latestN=5&startDate=&endDate=&csvLocale=en&selectedMembers=%5B%5B%5D%2C%5B%5D%2C%5B%5D%5D&checkedLevels=0D1%2C1D1%2C1D2%2C1D3%2C2D1",
sync_schedule: "23 10 * * *" # Daily at 10:23 AM
}
]

statcan_datasets.each do |dataset_attrs|
dataset = StatcanDataset.find_or_create_by(name: dataset_attrs[:name]) do |d|
d.statcan_url = dataset_attrs[:statcan_url]
d.sync_schedule = dataset_attrs[:sync_schedule]
end

if dataset.persisted?
if dataset.previously_new_record?
puts "✓ #{dataset.name} - created"
else
puts "✓ #{dataset.name} - already exists"
end
else
puts "✗ #{dataset.name} - failed to create: #{dataset.errors.full_messages.join(', ')}"
end
end
22 changes: 22 additions & 0 deletions test/controllers/statcan_datasets_controller_test.rb
@@ -0,0 +1,22 @@
require "test_helper"

class StatcanDatasetsControllerTest < ActionDispatch::IntegrationTest
test "should show dataset" do
dataset = statcan_datasets(:synced)

get statcan_dataset_url(dataset)

assert_response :success
assert_equal "application/json; charset=utf-8", response.content_type

json_response = JSON.parse(response.body)
assert_equal dataset.id, json_response["id"]
assert_equal dataset.name, json_response["name"]
end

test "should return 404 for non-existent dataset" do
get statcan_dataset_url(id: 99999)

assert_response :not_found
end
end
15 changes: 15 additions & 0 deletions test/fixtures/statcan_datasets.yml
@@ -0,0 +1,15 @@
unsynced:
statcan_url: "https://statcan.gc.ca/123.csv"
name: "test-dataset-unsynced"
sync_schedule: "0 0 * * *"
current_data: null
last_synced_at: null

synced:
statcan_url: "https://statcan.gc.ca/456.csv"
name: "test-dataset-synced"
sync_schedule: "0 0 * * *"
current_data:
- year: 2020
population: 38000000
last_synced_at: "2024-01-15 10:30:00"
62 changes: 62 additions & 0 deletions test/jobs/statcan_cron_job_test.rb
@@ -0,0 +1,62 @@
require "test_helper"

class StatcanCronJobTest < ActiveJob::TestCase
def setup
# Remove fixture data before each test
StatcanDataset.delete_all
end

test "should enqueue sync jobs for stale datasets only" do
current_time = Time.parse("2025-01-02 14:00:00") # 2pm

# Create a stale dataset (never synced)
stale_dataset1 = StatcanDataset.create!(
name: "stale-never-synced",
statcan_url: "https://statcan.gc.ca/stale1.csv",
sync_schedule: "0 0 * * *",
last_synced_at: nil
)

# Create another stale dataset (old sync)
stale_dataset2 = StatcanDataset.create!(
name: "stale-old-sync",
statcan_url: "https://statcan.gc.ca/stale2.csv",
sync_schedule: "0 0 * * *",
last_synced_at: Time.parse("2025-01-01 23:00:00") # Yesterday 11pm
)

# Create a fresh dataset (recent sync)
_fresh_dataset = StatcanDataset.create!(
name: "fresh-dataset",
statcan_url: "https://statcan.gc.ca/fresh.csv",
sync_schedule: "0 0 * * *",
last_synced_at: Time.parse("2025-01-02 01:00:00") # 1am today
)

# Track enqueued jobs
assert_enqueued_jobs 2, only: StatcanSyncJob do
StatcanCronJob.perform_now(current_time)
end

# Verify the correct jobs were enqueued
assert_enqueued_with(job: StatcanSyncJob, args: [ stale_dataset1 ])
assert_enqueued_with(job: StatcanSyncJob, args: [ stale_dataset2 ])
end

test "should not enqueue jobs when no datasets need syncing" do
current_time = Time.parse("2025-01-02 14:00:00")

# Create only fresh datasets
StatcanDataset.create!(
name: "fresh-dataset-1",
statcan_url: "https://statcan.gc.ca/fresh1.csv",
sync_schedule: "0 0 * * *",
last_synced_at: Time.parse("2025-01-02 01:00:00") # 1am today
)

# Should not enqueue any jobs
assert_enqueued_jobs 0, only: StatcanSyncJob do
StatcanCronJob.perform_now(current_time)
end
end
end