
Analytics DB — Sync Stoppage Incident & Resilience Fix

Date: April 17, 2026

Status: Root cause confirmed; resilience fixes implemented in the codebase. Pre-existing dead chains require a one-time manual kickoff to resume (details in Section 5). A daily safety-net cron is proposed as a follow-up (Section 7).

Related: Analytics DB — Duplication Incident Report · Analytics Database — Reference


1. Summary

As of April 17, 2026, 7 projects are not syncing to their analytics databases:

Projects        | last_synced_at             | next_job  | Category
85              | NEVER                      | MISSING   | Chain never started (kickoff never ran)
103, 99, 97, 94 | 2026-04-06 09:11–09:12 UTC | RESTARTED | Chain died on April 6, 2026, then was re-kicked

Root cause: the hourly analytics sync chain was implemented so that a single uncaught exception inside any populate step would prevent the next Delayed::Job from being scheduled, permanently breaking the chain for that project until someone manually re-kicked it. April 6, 2026 ~09:10–09:30 UTC was a single incident window in which many projects' hourly runs raised simultaneously, killing 30+ chains at once.

Fix: the sync loop was made resilient at two levels:

  1. Per-step isolation in populate_database_with_morale_and_checkin_submissions — each populate step is attempted independently.
  2. Guaranteed reschedule in fetch_data_and_schedule_next_fetch via an ensure block — the next hourly job is always queued, even if populate blew up.

These changes apply to both DbCredential (per-project) and ProjectGroup (group-level).


2. Root cause analysis

2.1 How the sync loop works

Each project with an analytics DB has an hourly self-rescheduling Delayed::Job. The job:

  1. Calls populate_database_with_morale_and_checkin_submissions to run six populate steps.
  2. On success, updates last_synced_at.
  3. Enqueues the next Delayed::Job for 1 hour later, tagged with a unique identifier so it cannot be enqueued twice.

DbCredential#fetch_data_and_schedule_next_fetch
    ├── populate_database_with_morale_and_checkin_submissions
    │       ├── populate_checkin_submissions_table
    │       ├── populate_morale_data_table
    │       ├── populate_analytics_baselines_table
    │       ├── populate_analytics_benchmarks_table
    │       ├── populate_analytics_spotchecks_table
    │       ├── populate_dim_score_feelings_table
    │       └── self.update(last_synced_at: Time.now)
    └── self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch

ProjectGroup#fetch_data_and_schedule_next_fetch has the same shape.

2.2 The defect (pre-fix)

The loop was written as a straight-line sequence with no exception handling:

def fetch_data_and_schedule_next_fetch
  return if job_already_scheduled?
  populate_database_with_morale_and_checkin_submissions
  # Schedule the next job to run in 1 hour
  job = self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
  job.update(unique_identifier: self.data_fetching_job_identifier)
end

def populate_database_with_morale_and_checkin_submissions
  populate_checkin_submissions_table
  populate_morale_data_table
  populate_analytics_baselines_table
  populate_analytics_benchmarks_table
  populate_analytics_spotchecks_table
  populate_dim_score_feelings_table
  self.update(last_synced_at: Time.now)
end

Consequences of any raise inside any populate step:

  • Later populate steps are skipped.
  • self.update(last_synced_at: Time.now) is skipped.
  • Execution unwinds out of fetch_data_and_schedule_next_fetch without reaching self.delay(...).
  • Delayed::Job records the job as failed (Delayed::Worker.max_attempts = 1 → no retry).
  • No future hourly job exists for this project. The chain is dead until someone calls fetch_data_and_schedule_next_fetch again manually.

2.3 What typically causes a populate step to raise

Grouped by category:

Category              | Examples                                                                            | Which steps
Connectivity          | PG::ConnectionBad, Sequel::DatabaseConnectionError, ActiveRecord::StatementTimeout  | All six
Schema drift          | Missing PK on id; missing columns (is_positive, score_4_count, score_5_count) on analytics DBs provisioned before newer migrations | Check-ins, morale, spotchecks
Data quality          | Malformed tags JSON (JSON::ParserError); nil question["us"] / answer["us"]; nil score | Check-ins, morale, spotchecks
Configuration         | Unknown DashboardSetting.baseline_calculation_period; corrupt benchmark_calculation_projects_array | Baselines, benchmarks
Scale / memory / time | .to_a on multi-year submission ranges → OOM; Delayed::Worker.max_run_time = 20.minutes hit; giant IN (?) clauses | Baselines, benchmarks, spotchecks
External-service bugs | MoraleCalculationService, BaselineRangeCalculator edge cases                        | Baselines, benchmarks

The April 6, 2026 incident is consistent with a single cause (likely a deploy or transient DB condition) that caused many concurrent hourly runs to raise in the same window. Because of the defect in 2.2, every run that raised lost its future job permanently.


3. Fix implemented

3.1 Files changed

  • app/models/db_credential.rb — fetch_data_and_schedule_next_fetch, populate_database_with_morale_and_checkin_submissions
  • app/models/project_group.rb — fetch_data_and_schedule_next_fetch, populate_database_with_morale_and_checkin_submissions

3.2 Layer 1 — Guarantee the reschedule (outer method)

fetch_data_and_schedule_next_fetch now wraps populate in begin/rescue/ensure. The ensure block always attempts to enqueue the next hourly job; if that enqueue itself fails (e.g. DB outage at exactly that moment), the failure is logged and the current job finishes cleanly rather than crashing.

def fetch_data_and_schedule_next_fetch
  return if job_already_scheduled?

  begin
    populate_database_with_morale_and_checkin_submissions
  rescue => e
    Rails.logger.error("[analytics_sync] Project #{project_id} populate raised unexpectedly: ...")
  ensure
    begin
      job = self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
      job.update(unique_identifier: self.data_fetching_job_identifier)
    rescue => e
      Rails.logger.error("[analytics_sync] Project #{project_id} FAILED to schedule next fetch: ...")
    end
  end
end

3.3 Layer 2 — Isolate each populate step (inner method)

Each of the six populate steps now runs inside its own begin/rescue. Errors are logged with backtrace and collected into an errors array. self.update(last_synced_at: Time.now) is gated on errors.empty? — we deliberately do not advance last_synced_at on a partial run.

def populate_database_with_morale_and_checkin_submissions
  steps = %i[
    populate_checkin_submissions_table
    populate_morale_data_table
    populate_analytics_baselines_table
    populate_analytics_benchmarks_table
    populate_analytics_spotchecks_table
    populate_dim_score_feelings_table
  ]

  errors = []
  steps.each do |step|
    begin
      public_send(step)
    rescue => e
      errors << { step: step, error: "#{e.class}: #{e.message}" }
      Rails.logger.error("[analytics_sync] Project #{project_id} step #{step} failed: ...")
    end
  end

  if errors.empty?
    self.update(last_synced_at: Time.now)
  else
    Rails.logger.warn("[analytics_sync] Project #{project_id} finished with #{errors.size} failed step(s); last_synced_at NOT advanced. Failed: ...")
  end

  errors
end

3.4 Why last_synced_at is intentionally gated

last_synced_at is used by fetch_checkin_submissions (in DbCredential) as the lower bound of "what's new since last time":

def fetch_checkin_submissions
  ActiveRecord::Base.connection_pool.with_connection do
    last_synced_at = self.last_synced_at || 10.years.ago
    self.project.check_in_submissions.where("updated_at > ?", last_synced_at)
  end
end

If we advanced last_synced_at on a partial run where populate_checkin_submissions_table failed, we would permanently skip the check-ins that failed to write that hour. By gating on errors.empty?, the next hour will retry the same delta — which is safe because populate_checkin_submissions_table and populate_morale_data_table use upsert_records (idempotent given the PK added in the prior duplication fix).
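The safety of retrying the same delta can be illustrated with a toy model. This is not upsert_records itself (which lives in the models and writes through Sequel) — just a sketch of why keyed writes make hourly retries harmless:

```ruby
# Toy illustration only — not the real upsert_records. With rows keyed
# on a primary key, re-applying the same delta is a no-op.
def upsert_rows(table, rows)
  rows.each { |row| table[row[:id]] = row } # last write wins per id
  table
end

table = {}
delta = [{ id: 1, score: 4 }, { id: 2, score: 5 }]
upsert_rows(table, delta)
upsert_rows(table, delta) # the next hour retries the same delta
table.size # still 2 — no duplicates
```

Without the PK (the pre-fix duplication bug), the second application would have appended two more rows instead of overwriting in place.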

3.5 Same pattern applied to ProjectGroup

ProjectGroup#fetch_data_and_schedule_next_fetch runs only two populate steps inline (populate_checkin_submissions_table, populate_morale_data_table). It received the same begin/rescue/ensure treatment, preserving current behavior (2 steps, not 6). The separately-defined ProjectGroup#populate_database_with_morale_and_checkin_submissions (used by admin batch actions) received the same per-step rescue pattern as DbCredential's.

3.6 Layer 3 — Per-row / per-config hardening inside each populate step

Per-step rescue (Section 3.3) prevents one bad step from killing the others, but a single bad row inside a step still aborts that whole step. Layer 3 pushes the safety net down into each step so that a single malformed row, a missing translation, or a corrupt config value skips just that row / sub-unit and keeps the rest of the step running.

Shared private helpers (added in DbCredential, the heavy code path):

  • safe_localized_text(value, locale = 'us') — safely extract a localized string from a value that may be a Hash (e.g. {"us" => "..."}), a plain String, or nil. Falls back through locale.to_sym and then .values.compact.first. Used wherever we previously did something["us"].
  • safe_json_array(value) — coerce any tags-like value into an Array. Accepts Array, JSON-encoded String, or returns [] for anything else (nil, Hash, malformed JSON).
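The helper bodies are not reproduced in this report. A minimal sketch consistent with the descriptions above (helper names come from the source; the exact bodies are assumptions):

```ruby
require "json"

# Sketch only — the real helpers are private methods on DbCredential.
# Behavior inferred from the descriptions above; bodies are assumptions.

# Safely extract a localized string from a Hash like {"us" => "..."},
# a plain String, or nil. Falls back through the symbol key, then any
# non-nil value; returns nil for anything else.
def safe_localized_text(value, locale = "us")
  case value
  when Hash
    value[locale] || value[locale.to_sym] || value.values.compact.first
  when String
    value
  end
end

# Coerce a tags-like value into an Array. Accepts an Array or a
# JSON-encoded String; returns [] for nil, Hashes, or malformed JSON.
def safe_json_array(value)
  case value
  when Array
    value
  when String
    parsed = JSON.parse(value) rescue nil
    parsed.is_a?(Array) ? parsed : []
  else
    []
  end
end
```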

Per-step hardening applied:

  • populate_checkin_submissions_table — per-row rescue around each submission (one bad submission is logged and skipped, not fatal). safe_json_array for tags. safe_localized_text for spot_check.question and answer.answer. project.name / project.baseline_morale cached once with safe fallbacks. Daily morale computed only from integer-coercible scores.
  • populate_morale_data_table — pre-filter out submissions whose created_at isn't usable or whose score isn't integer-coercible. Per-day rescue around the aggregation so a single odd day is skipped, not fatal. average_mood safe against empty score sets.
  • populate_analytics_baselines_table — baseline_calculation_period validated as Integer (fallback 3). DashboardSetting::PERIOD_TO_RANGE guarded with const_defined? and fallback 'last_3_months'. BaselineRangeCalculator.calculate_date_range wrapped — on failure, skip the step cleanly. start_date / end_date validated to respond to :to_time before use. MoraleCalculationService.calculate_baseline_morale_from_submissions wrapped — on failure, write a 0 baseline instead of raising.
  • populate_analytics_benchmarks_table — benchmark_calculation_projects_array validated as Array; non-array values logged and skipped. Each ID coerced to Integer (rejecting zero/negative/non-numeric). Same period/range hardening as baselines. Per-project rescue around MoraleCalculationService.calculate_team_morale_from_submissions — one bad benchmark project doesn't kill the benchmarks step.
  • populate_analytics_spotchecks_table — per-spot-check rescue (one bad spot-check is logged and skipped). All localized text accessors (positive_question_text, negative_question_text, title, answer_text) wrapped — missing translations fall back to "Spotcheck #<id>". is_default / is_scheduled coerced via !! to avoid odd truthy/falsy values. Per-row rescue around the final conn_tbl.insert.
  • populate_dim_score_feelings_table — filter_map with per-row rescue; bad ScoreFeeling rows logged and dropped, not fatal. safe_localized_text for name. Per-row rescue around insert.

ProjectGroup mirror: since ProjectGroup mostly copies already-computed rows from each project's analytics DB, hardening there is per-row rescue around the Sequel insert calls for analytics_baselines, analytics_benchmarks, analytics_spotchecks, dim_score_feelings. populate_dim_score_feelings_table also got the same locale-key-safe name extraction and per-row skip pattern.
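Condensed, the per-row idiom shared by these steps looks roughly like this (the method name, row shape, and field names here are hypothetical — each real step works on its own row type):

```ruby
# Sketch of the Layer-3 per-row idiom (names and fields are hypothetical).
# A malformed row is counted, logged, and dropped; the rest proceed.
def rows_for_insert(raw_rows, logger: nil)
  skipped = 0
  rows = raw_rows.filter_map do |row|
    begin
      # Integer() raises on nil / non-numeric, so bad scores are skipped
      { id: row.fetch(:id), score: Integer(row.fetch(:score)) }
    rescue => e
      skipped += 1
      logger&.warn("[analytics_sync] skipped row id=#{row[:id].inspect}: #{e.class}")
      nil # filter_map drops nils
    end
  end
  logger&.warn("[analytics_sync] populate step skipped #{skipped} row(s)") if skipped > 0
  rows
end
```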

Log tag [analytics_sync] is used everywhere so you can grep one string to see everything this resilience layer absorbed:

  • Project <id> skipped checkin_submission id=<id>: ...
  • Project <id> skipped morale day=<date>: ...
  • Project <id> BaselineRangeCalculator failed for option=...: ...
  • Project <id> benchmark_calculation_projects_array is not an Array (...); skipping benchmarks
  • Project <id> team morale failed for benchmark_pid=<pid>: ...
  • Project <id> skipped spotcheck id=<id>: ...
  • Project <id> populate_<step> skipped N row(s)
  • Group <id> <step> insert failed id=<id>: ...

These are warn-level because they are recoverable per-row skips. The only error-level logs remain the ones from Section 3 (whole-step failures and reschedule failures).


4. Result matrix (post-fix)

Scenario                                                  | last_synced_at advances? | Next job queued?    | Chain alive?
All 6 steps succeed                                       | ✅ Yes                   | ✅ Yes              | ✅ Yes
1–5 steps fail, others succeed                            | ❌ No (intentional)      | ✅ Yes (via ensure) | ✅ Yes — retries next hour
Populate raises something otherwise uncatchable           | ❌ No                    | ✅ Yes (via ensure) | ✅ Yes
self.delay(...) itself raises (DB outage at that instant) | ❌ No                    | ❌ No               | ⚠️ Broken — needs kickoff
job_already_scheduled? true at entry                      | N/A                      | ✅ (already queued) | ✅ Yes

Observable behavior changes in logs:

  • [analytics_sync] Project <id> step <name> failed: ... — per-step failures, per hour.
  • [analytics_sync] Project <id> finished with N failed step(s); last_synced_at NOT advanced. Failed: ... — summary per run.
  • [analytics_sync] Project <id> FAILED to schedule next fetch: ... — the only remaining fatal signal to alert on.

A simple log alert on [analytics_sync] .* FAILED to schedule is sufficient to catch the rare remaining dead-chain case.
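Expressed as code, the alert predicate is a one-liner (a sketch — the regexp is derived from the log lines listed above; wire it into whatever log-alerting pipeline you use):

```ruby
# Sketch: predicate for the one remaining fatal log signal.
# Pattern mirrors the "[analytics_sync] ... FAILED to schedule" line above.
ALERT_PATTERN = /\[analytics_sync\] .* FAILED to schedule/

def dead_chain_alert?(log_line)
  ALERT_PATTERN.match?(log_line)
end
```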


5. Remediation for already-dead chains

Important: this fix does not auto-heal chains that died before deploy. Those projects need a one-time manual kickoff. Once kicked off, the new resilience keeps them alive going forward.

5.1 From the admin UI

Admin → Projects → (select affected projects) → "Kickoff data fetching"

5.2 From Rails console

DbCredential.where(project_id: [85, 94, 97, 99, 103, 124, 125]).each do |c|
  puts "Kicking project #{c.project_id}..."
  c.fetch_data_and_schedule_next_fetch
end

For ProjectGroup:

ProjectGroup.where(id: [<affected group ids>]).each do |g|
  puts "Kicking group #{g.id}..."
  g.fetch_data_and_schedule_next_fetch
end

5.3 Health-check queries

Sort worst-first (NULL last_synced_at first, then oldest):

DbCredential.order(Arel.sql('last_synced_at NULLS FIRST')).each do |c|
  nj = Delayed::Job.where(unique_identifier: c.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                   .where('run_at > ?', Time.now).pick(:run_at)
  puts format("project=%-4d last_sync=%-30s next_job=%s",
              c.project_id, c.last_synced_at || "NEVER", nj || "MISSING")
end

Interpretation:

  • last_sync=recent + next_job=future → healthy.
  • last_sync=old + next_job=future → failing per step but chain alive (new post-fix signal). Grep logs for [analytics_sync] + that project to see which step is broken.
  • last_sync=* + next_job=MISSING → chain dead; needs kickoff.
  • last_sync=NEVER + next_job=MISSING → never started; needs kickoff.

Failed Delayed::Jobs for this pipeline:

Delayed::Job.where("unique_identifier LIKE ? OR unique_identifier LIKE ?",
                   "analytics_%_fetch_data", "analytics_group_%_fetch_data")
            .where.not(failed_at: nil)
            .order(failed_at: :desc).limit(50)
            .pluck(:id, :unique_identifier, :failed_at, :attempts, :last_error)

6. What this fix does not fix

  1. Pre-existing dead chains. Must be manually kicked off (Section 5). The fix is preventative from here on out.
  2. Permanent bugs in a populate step. They now log every hour with the same error instead of silently breaking the chain. This is an intentional tradeoff: self-healing over fail-fast. Log volume is predictable and taggable.
  3. self.delay(...) itself raising at reschedule time. If the Rails DB is unreachable at exactly that instant, no new Delayed::Job row is created, and the chain is dead until someone kicks it off again. Rare, but possible.
  4. Worker-level failures. If the delayed_job worker process is killed mid-populate (deploy, OOM, crash), none of the code in this file runs, so nothing reschedules. Delayed::Worker.max_run_time = 20.minutes partially mitigates by unlocking the job, but DJ retry behavior depends on the version and configuration.
  5. Delayed::Worker.max_run_time timeouts. A populate run that takes more than 20 minutes (typically populate_analytics_spotchecks_table on projects with years of history) can be unlocked and re-claimed, causing overlap.
  6. Schema drift. Projects whose analytics DBs were provisioned before newer migrations (missing PK, missing columns) will fail every hour with the same schema error until an admin re-runs Init Analytics DB for that project or rake analytics:add_primary_keys.
  7. Data quality issues. A single malformed row in source data (e.g. a spot-check without a question["us"] translation) will fail the same step every hour until the bad row is fixed in Pepcore Base.

7. Future fixes (proposed)

In roughly priority order:

7.1 Daily safety-net kickoff rake — closes gap #3

A daily rake task that finds any DbCredential / ProjectGroup with no future Delayed::Job row and re-enqueues it. This closes the one remaining "chain dies" scenario (reschedule itself failing during a DB outage) and makes the system fully self-healing.

Sketch:

namespace :analytics do
  desc "Re-enqueue hourly fetch jobs for credentials/groups with no future job."
  task ensure_scheduled: :environment do
    # DbCredential
    DbCredential.find_each do |c|
      has_job = Delayed::Job.where(unique_identifier: c.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                            .where('run_at > ?', Time.now).exists?
      next if has_job
      Rails.logger.warn("[analytics_sync] Re-kicking project #{c.project_id} (no future job)")
      c.fetch_data_and_schedule_next_fetch
    end

    # ProjectGroup — same pattern
    ProjectGroup.find_each do |g|
      has_job = Delayed::Job.where(unique_identifier: g.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                            .where('run_at > ?', Time.now).exists?
      next if has_job
      Rails.logger.warn("[analytics_sync] Re-kicking group #{g.id} (no future job)")
      g.fetch_data_and_schedule_next_fetch
    end
  end
end

Cron it daily from whatever scheduler you use.
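For example, if the app happens to use the whenever gem (an assumption — any scheduler works), the schedule entry could be:

```ruby
# config/schedule.rb (whenever gem) — sketch; assumes whenever is in use.
every 1.day, at: "4:30 am" do
  rake "analytics:ensure_scheduled"
end
```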

7.2 Per-step last_synced_at — improves accuracy

Today last_synced_at is a single column that only advances on a clean run. For projects with a persistently-failing step (e.g. spotchecks schema drift), the delta for check-ins grows unbounded and each hourly run re-fetches the same backlog.

Alternative: per-step timestamp columns (checkins_last_synced_at, morale_last_synced_at, etc.) so each step advances independently. Tradeoff: migration + code change across both DbCredential and ProjectGroup. Not urgent.
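A sketch of how the delta lower bound would change under this proposal (the column mapping follows the names suggested above; the Hash standing in for the model, and the helper itself, are assumptions):

```ruby
# Sketch only — per-step delta lower bound for the proposed columns.
# Column names follow the proposal; a Hash stands in for the AR model.
STEP_TIMESTAMP_COLUMNS = {
  populate_checkin_submissions_table: :checkins_last_synced_at,
  populate_morale_data_table:         :morale_last_synced_at,
  # ...one entry per populate step
}.freeze

def delta_lower_bound(credential, step)
  col = STEP_TIMESTAMP_COLUMNS.fetch(step)
  # Mirrors the existing `last_synced_at || 10.years.ago` fallback
  credential[col] || Time.now - (10 * 365 * 24 * 3600)
end
```

Each step would then advance only its own column on success, so a persistently-failing spotchecks step no longer forces check-ins to re-fetch an ever-growing backlog.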


8. Change log

Date                        | Change                                                                                                                      | Commit
2025-04-15 (± original refactor) | Upsert logic switched to import with on_duplicate_key_update without a unique constraint on id. Source of the duplication bug. | cc4c6bf / aeffd9b
2026-03-15                  | Four new analytics tables (baselines, benchmarks, spotchecks, dim_score_feelings) added to the populate sequence. Increased surface for per-step raises. | ac8b31d
2026-04-06 ~09:10–09:30 UTC | Incident: ~30 project chains died simultaneously. Root cause unknown (suspected deploy or transient DB condition); effect is the pre-fix defect in Section 2.2. | —
2026-04-15                  | ensure_primary_key! + upsert_records added to fix duplication. Does not address stoppage.                                    | 4896dfb
2026-04-16                  | Follow-up duplication fixes and rake analytics:add_primary_keys.                                                             | 124834a
2026-04-17                  | This fix (Layers 1 + 2): resilience changes in DbCredential and ProjectGroup (per-step rescue + ensure reschedule).          | (this PR)
2026-04-17                  | This fix (Layer 3): per-row / per-config hardening inside each populate step; shared safe_localized_text / safe_json_array helpers. | (this PR)

9. Related files

File                                           | Role
app/models/db_credential.rb                    | Per-project sync loop and populate steps. Resilience changes applied here.
app/models/project_group.rb                    | Group-level sync loop. Resilience changes applied here.
app/admin/projects.rb                          | Admin batch actions: kickoff_data_fetching, init_analytics_db, populate_analytics_data, reset_analytics_data.
config/initializers/delayed_job_config.rb      | max_attempts = 1, max_run_time = 20.minutes, destroy_failed_jobs = false (failed jobs kept for inspection).
lib/tasks/analytics.rake                       | Inspection + PK maintenance rake tasks. Proposed home for the Section 7.1 safety-net task.
docs/ANALYTICS_DATABASE_REFERENCE.md           | Table/schema reference for the analytics DB.
docs/ANALYTICS_DUPLICATION_INCIDENT_REPORT.md  | Prior incident (duplication), which added the PK/upsert safeguards.