
Analytics DB — Sync Stoppage Incident & Resilience Fix

Date: April 17, 2026

Status: Root cause confirmed; resilience fixes implemented in the codebase. Pre-existing dead chains require a one-time manual kickoff to resume (details in Section 5). A daily safety-net cron is proposed as a follow-up (Section 7).

Related: Analytics DB — Duplication Incident Report · Analytics Database — Reference


1. Summary

As of April 17, 2026, 7 projects are not syncing to their analytics databases:

Projects        | last_synced_at             | next_job  | Category
85              | NEVER                      | MISSING   | Chain never started (kickoff never ran)
103, 99, 97, 94 | 2026-04-06 09:11–09:12 UTC | RESTARTED | Chain died on April 6, 2026, then was re-kicked

Root cause: the hourly analytics sync chain was implemented so that a single uncaught exception inside any populate step would prevent the next Delayed::Job from being scheduled, permanently breaking the chain for that project until someone manually re-kicked it. April 6, 2026 ~09:10–09:30 UTC was a single incident window in which many projects' hourly runs raised simultaneously, killing 30+ chains at once.

Fix: the sync loop was made resilient at two levels:

  1. Per-step isolation in populate_database_with_morale_and_checkin_submissions — each populate step is attempted independently.
  2. Guaranteed reschedule in fetch_data_and_schedule_next_fetch via an ensure block — the next hourly job is always queued, even if populate blew up.

These changes apply to both DbCredential (per-project) and ProjectGroup (group-level).


2. Root cause analysis

2.1 How the sync loop works

Each project with an analytics DB has an hourly self-rescheduling Delayed::Job. The job:

  1. Calls populate_database_with_morale_and_checkin_submissions to run six populate steps.
  2. On success, updates last_synced_at.
  3. Enqueues the next Delayed::Job for 1 hour later, tagged with a unique identifier so it cannot be enqueued twice.

DbCredential#fetch_data_and_schedule_next_fetch
    ├── populate_database_with_morale_and_checkin_submissions
    │       ├── populate_checkin_submissions_table
    │       ├── populate_morale_data_table
    │       ├── populate_analytics_baselines_table
    │       ├── populate_analytics_benchmarks_table
    │       ├── populate_analytics_spotchecks_table
    │       ├── populate_dim_score_feelings_table
    │       └── self.update(last_synced_at: Time.now)
    └── self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch

ProjectGroup#fetch_data_and_schedule_next_fetch has the same shape.

2.2 The defect (pre-fix)

The loop was written as a straight-line sequence with no exception handling:

def fetch_data_and_schedule_next_fetch
  return if job_already_scheduled?
  populate_database_with_morale_and_checkin_submissions
  # Schedule the next job to run in 1 hour
  job = self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
  job.update(unique_identifier: self.data_fetching_job_identifier)
end

def populate_database_with_morale_and_checkin_submissions
  populate_checkin_submissions_table
  populate_morale_data_table
  populate_analytics_baselines_table
  populate_analytics_benchmarks_table
  populate_analytics_spotchecks_table
  populate_dim_score_feelings_table
  self.update(last_synced_at: Time.now)
end

Consequences of any raise inside any populate step:

  • Later populate steps are skipped.
  • self.update(last_synced_at: Time.now) is skipped.
  • Execution unwinds out of fetch_data_and_schedule_next_fetch without reaching self.delay(...).
  • Delayed::Job records the job as failed (Delayed::Worker.max_attempts = 1 → no retry).
  • No future hourly job exists for this project. The chain is dead until someone calls fetch_data_and_schedule_next_fetch again manually.

2.3 What typically causes a populate step to raise

Grouped by category:

Category              | Examples                                                                            | Which steps
Connectivity          | PG::ConnectionBad, Sequel::DatabaseConnectionError, ActiveRecord::StatementTimeout  | All six
Schema drift          | Missing PK on id; missing columns (is_positive, score_4_count, score_5_count) on analytics DBs provisioned before newer migrations | Check-ins, morale, spotchecks
Data quality          | Malformed tags JSON (JSON::ParserError); nil question["us"] / answer["us"]; nil score | Check-ins, morale, spotchecks
Configuration         | Unknown DashboardSetting.baseline_calculation_period; corrupt benchmark_calculation_projects_array | Baselines, benchmarks
Scale / memory / time | .to_a on multi-year submission ranges → OOM; Delayed::Worker.max_run_time = 20.minutes hit; giant IN (?) clauses | Baselines, benchmarks, spotchecks
External-service bugs | MoraleCalculationService, BaselineRangeCalculator edge cases                        | Baselines, benchmarks

The April 6, 2026 incident is consistent with a single cause (likely a deploy or transient DB condition) that caused many concurrent hourly runs to raise in the same window. Because of the defect in 2.2, every run that raised lost its future job permanently.


3. Fix implemented

3.1 Files changed

  • app/models/db_credential.rb — fetch_data_and_schedule_next_fetch, populate_database_with_morale_and_checkin_submissions
  • app/models/project_group.rb — fetch_data_and_schedule_next_fetch, populate_database_with_morale_and_checkin_submissions

3.2 Layer 1 — Guarantee the reschedule (outer method)

fetch_data_and_schedule_next_fetch now wraps populate in begin/rescue/ensure. The ensure block always attempts to enqueue the next hourly job; if that enqueue itself fails (e.g. DB outage at exactly that moment), the failure is logged and the current job finishes cleanly rather than crashing.

def fetch_data_and_schedule_next_fetch
  return if job_already_scheduled?

  begin
    populate_database_with_morale_and_checkin_submissions
  rescue => e
    Rails.logger.error("[analytics_sync] Project #{project_id} populate raised unexpectedly: ...")
  ensure
    begin
      job = self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
      job.update(unique_identifier: self.data_fetching_job_identifier)
    rescue => e
      Rails.logger.error("[analytics_sync] Project #{project_id} FAILED to schedule next fetch: ...")
    end
  end
end

3.3 Layer 2 — Isolate each populate step (inner method)

Each of the six populate steps now runs inside its own begin/rescue. Errors are logged with backtrace and collected into an errors array. self.update(last_synced_at: Time.now) is gated on errors.empty? — we deliberately do not advance last_synced_at on a partial run.

def populate_database_with_morale_and_checkin_submissions
  steps = %i[
    populate_checkin_submissions_table
    populate_morale_data_table
    populate_analytics_baselines_table
    populate_analytics_benchmarks_table
    populate_analytics_spotchecks_table
    populate_dim_score_feelings_table
  ]

  errors = []
  steps.each do |step|
    begin
      public_send(step)
    rescue => e
      errors << { step: step, error: "#{e.class}: #{e.message}" }
      Rails.logger.error("[analytics_sync] Project #{project_id} step #{step} failed: ...")
    end
  end

  if errors.empty?
    self.update(last_synced_at: Time.now)
  else
    Rails.logger.warn("[analytics_sync] Project #{project_id} finished with #{errors.size} failed step(s); last_synced_at NOT advanced. Failed: ...")
  end

  errors
end

3.4 Why last_synced_at is intentionally gated

last_synced_at is used by fetch_checkin_submissions (in DbCredential) as the lower bound of "what's new since last time":

def fetch_checkin_submissions
  ActiveRecord::Base.connection_pool.with_connection do
    last_synced_at = self.last_synced_at || 10.years.ago
    self.project.check_in_submissions.where("updated_at > ?", last_synced_at)
  end
end

If we advanced last_synced_at on a partial run where populate_checkin_submissions_table failed, we would permanently skip the check-ins that failed to write that hour. By gating on errors.empty?, the next hour will retry the same delta — which is safe because populate_checkin_submissions_table and populate_morale_data_table use upsert_records (idempotent given the PK added in the prior duplication fix).
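The safety of retrying the same delta can be illustrated with a toy model. This is not upsert_records itself (which lives in the models and writes through Sequel) — just a sketch of why keyed writes make hourly retries harmless:

```ruby
# Toy illustration only — not the real upsert_records. With rows keyed
# on a primary key, re-applying the same delta is a no-op.
def upsert_rows(table, rows)
  rows.each { |row| table[row[:id]] = row } # last write wins per id
  table
end

table = {}
delta = [{ id: 1, score: 4 }, { id: 2, score: 5 }]
upsert_rows(table, delta)
upsert_rows(table, delta) # the next hour retries the same delta
table.size # still 2 — no duplicates
```

Without the PK (the pre-fix duplication bug), the second application would have appended two more rows instead of overwriting in place.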

3.5 Same pattern applied to ProjectGroup

ProjectGroup#fetch_data_and_schedule_next_fetch runs only two populate steps inline (populate_checkin_submissions_table, populate_morale_data_table). It received the same begin/rescue/ensure treatment, preserving current behavior (2 steps, not 6). The separately-defined ProjectGroup#populate_database_with_morale_and_checkin_submissions (used by admin batch actions) received the same per-step rescue pattern as DbCredential's.

3.6 Layer 3 — Per-row / per-config hardening inside each populate step

Per-step rescue (Section 3.3) prevents one bad step from killing the others, but a single bad row inside a step still aborts that whole step. Layer 3 pushes the safety net down into each step so that a single malformed row, a missing translation, or a corrupt config value skips just that row / sub-unit and keeps the rest of the step running.

Shared private helpers (added in DbCredential, the heavy code path):

  • safe_localized_text(value, locale = 'us') — safely extract a localized string from a value that may be a Hash (e.g. {"us" => "..."}), a plain String, or nil. Falls back through locale.to_sym and then .values.compact.first. Used wherever we previously did something["us"].
  • safe_json_array(value) — coerce any tags-like value into an Array. Accepts Array, JSON-encoded String, or returns [] for anything else (nil, Hash, malformed JSON).
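The helper bodies are not reproduced in this report. A minimal sketch consistent with the descriptions above (helper names come from the source; the exact bodies are assumptions):

```ruby
require "json"

# Sketch only — the real helpers are private methods on DbCredential.
# Behavior inferred from the descriptions above; bodies are assumptions.

# Safely extract a localized string from a Hash like {"us" => "..."},
# a plain String, or nil. Falls back through the symbol key, then any
# non-nil value; returns nil for anything else.
def safe_localized_text(value, locale = "us")
  case value
  when Hash
    value[locale] || value[locale.to_sym] || value.values.compact.first
  when String
    value
  end
end

# Coerce a tags-like value into an Array. Accepts an Array or a
# JSON-encoded String; returns [] for nil, Hashes, or malformed JSON.
def safe_json_array(value)
  case value
  when Array
    value
  when String
    parsed = JSON.parse(value) rescue nil
    parsed.is_a?(Array) ? parsed : []
  else
    []
  end
end
```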

Per-step hardening applied:

  • populate_checkin_submissions_table — per-row rescue around each submission (one bad submission is logged and skipped, not fatal). safe_json_array for tags. safe_localized_text for spot_check.question and answer.answer. project.name / project.baseline_morale cached once with safe fallbacks. Daily morale computed only from integer-coercible scores.
  • populate_morale_data_table — pre-filter out submissions whose created_at isn't usable or whose score isn't integer-coercible. Per-day rescue around the aggregation so a single odd day is skipped, not fatal. average_mood safe against empty score sets.
  • populate_analytics_baselines_table — baseline_calculation_period validated as Integer (fallback 3). DashboardSetting::PERIOD_TO_RANGE guarded with const_defined? and fallback 'last_3_months'. BaselineRangeCalculator.calculate_date_range wrapped — on failure, skip the step cleanly. start_date / end_date validated to respond to :to_time before use. MoraleCalculationService.calculate_baseline_morale_from_submissions wrapped — on failure, write a 0 baseline instead of raising.
  • populate_analytics_benchmarks_table — benchmark_calculation_projects_array validated as Array; non-array values logged and skipped. Each ID coerced to Integer (rejecting zero/negative/non-numeric). Same period/range hardening as baselines. Per-project rescue around MoraleCalculationService.calculate_team_morale_from_submissions — one bad benchmark project doesn't kill the benchmarks step.
  • populate_analytics_spotchecks_table — per-spot-check rescue (one bad spot-check is logged and skipped). All localized text accessors (positive_question_text, negative_question_text, title, answer_text) wrapped — missing translations fall back to "Spotcheck #<id>". is_default / is_scheduled coerced via !! to avoid odd truthy/falsy values. Per-row rescue around the final conn_tbl.insert.
  • populate_dim_score_feelings_table — filter_map with per-row rescue; bad ScoreFeeling rows logged and dropped, not fatal. safe_localized_text for name. Per-row rescue around insert.

ProjectGroup mirror: since ProjectGroup mostly copies already-computed rows from each project's analytics DB, hardening there is per-row rescue around the Sequel insert calls for analytics_baselines, analytics_benchmarks, analytics_spotchecks, dim_score_feelings. populate_dim_score_feelings_table also got the same locale-key-safe name extraction and per-row skip pattern.
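Condensed, the per-row idiom shared by these steps looks roughly like this (the method name, row shape, and field names here are hypothetical — each real step works on its own row type):

```ruby
# Sketch of the Layer-3 per-row idiom (names and fields are hypothetical).
# A malformed row is counted, logged, and dropped; the rest proceed.
def rows_for_insert(raw_rows, logger: nil)
  skipped = 0
  rows = raw_rows.filter_map do |row|
    begin
      # Integer() raises on nil / non-numeric, so bad scores are skipped
      { id: row.fetch(:id), score: Integer(row.fetch(:score)) }
    rescue => e
      skipped += 1
      logger&.warn("[analytics_sync] skipped row id=#{row[:id].inspect}: #{e.class}")
      nil # filter_map drops nils
    end
  end
  logger&.warn("[analytics_sync] populate step skipped #{skipped} row(s)") if skipped > 0
  rows
end
```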

Log tag [analytics_sync] is used everywhere so you can grep one string to see everything this resilience layer absorbed:

  • Project <id> skipped checkin_submission id=<id>: ...
  • Project <id> skipped morale day=<date>: ...
  • Project <id> BaselineRangeCalculator failed for option=...: ...
  • Project <id> benchmark_calculation_projects_array is not an Array (...); skipping benchmarks
  • Project <id> team morale failed for benchmark_pid=<pid>: ...
  • Project <id> skipped spotcheck id=<id>: ...
  • Project <id> populate_<step> skipped N row(s)
  • Group <id> <step> insert failed id=<id>: ...

These are warn-level because they are recoverable per-row skips. The only error-level logs remain the ones from Section 3 (whole-step failures and reschedule failures).


4. Result matrix (post-fix)

Scenario                                                  | last_synced_at advances? | Next job queued?    | Chain alive?
All 6 steps succeed                                       | ✅ Yes                   | ✅ Yes              | ✅ Yes
1–5 steps fail, others succeed                            | ❌ No (intentional)      | ✅ Yes (via ensure) | ✅ Yes — retries next hour
Populate raises something otherwise uncatchable           | ❌ No                    | ✅ Yes (via ensure) | ✅ Yes
self.delay(...) itself raises (DB outage at that instant) | ❌ No                    | ❌ No               | ⚠️ Broken — needs kickoff
job_already_scheduled? true at entry                      | N/A                      | ✅ (already queued) | ✅ Yes

Observable behavior changes in logs:

  • [analytics_sync] Project <id> step <name> failed: ... — per-step failures, per hour.
  • [analytics_sync] Project <id> finished with N failed step(s); last_synced_at NOT advanced. Failed: ... — summary per run.
  • [analytics_sync] Project <id> FAILED to schedule next fetch: ... — the only remaining fatal signal to alert on.

A simple log alert on [analytics_sync] .* FAILED to schedule is sufficient to catch the rare remaining dead-chain case.
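Expressed as code, the alert predicate is a one-liner (a sketch — the regexp is derived from the log lines listed above; wire it into whatever log-alerting pipeline you use):

```ruby
# Sketch: predicate for the one remaining fatal log signal.
# Pattern mirrors the "[analytics_sync] ... FAILED to schedule" line above.
ALERT_PATTERN = /\[analytics_sync\] .* FAILED to schedule/

def dead_chain_alert?(log_line)
  ALERT_PATTERN.match?(log_line)
end
```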


5. Remediation for already-dead chains

Important: this fix does not auto-heal chains that died before deploy. Those projects need a one-time manual kickoff. Once kicked off, the new resilience keeps them alive going forward.

5.1 From the admin UI

Admin → Projects → (select affected projects) → "Kickoff data fetching"

5.2 From Rails console

DbCredential.where(project_id: [85, 94, 97, 99, 103, 124, 125]).each do |c|
  puts "Kicking project #{c.project_id}..."
  c.fetch_data_and_schedule_next_fetch
end

For ProjectGroup:

ProjectGroup.where(id: [<affected group ids>]).each do |g|
  puts "Kicking group #{g.id}..."
  g.fetch_data_and_schedule_next_fetch
end

5.3 Health-check queries

Sort worst-first (NULL last_synced_at first, then oldest):

DbCredential.order(Arel.sql('last_synced_at NULLS FIRST')).each do |c|
  nj = Delayed::Job.where(unique_identifier: c.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                   .where('run_at > ?', Time.now).pick(:run_at)
  puts format("project=%-4d last_sync=%-30s next_job=%s",
              c.project_id, c.last_synced_at || "NEVER", nj || "MISSING")
end

Interpretation:

  • last_sync=recent + next_job=future → healthy.
  • last_sync=old + next_job=future → failing per step but chain alive (new post-fix signal). Grep logs for [analytics_sync] + that project to see which step is broken.
  • last_sync=* + next_job=MISSING → chain dead; needs kickoff.
  • last_sync=NEVER + next_job=MISSING → never started; needs kickoff.

Failed Delayed::Jobs for this pipeline:

Delayed::Job.where("unique_identifier LIKE ? OR unique_identifier LIKE ?",
                   "analytics_%_fetch_data", "analytics_group_%_fetch_data")
            .where.not(failed_at: nil)
            .order(failed_at: :desc).limit(50)
            .pluck(:id, :unique_identifier, :failed_at, :attempts, :last_error)

6. What this fix does not fix

  1. Pre-existing dead chains. Must be manually kicked off (Section 5). The fix is preventative from here on out.
  2. Permanent bugs in a populate step. They now log every hour with the same error instead of silently breaking the chain. This is an intentional tradeoff: self-healing over fail-fast. Log volume is predictable and taggable.
  3. self.delay(...) itself raising at reschedule time. If the Rails DB is unreachable at exactly that instant, no new Delayed::Job row is created, and the chain is dead until someone kicks it off again. Rare, but possible.
  4. Worker-level failures. If the delayed_job worker process is killed mid-populate (deploy, OOM, crash), none of the code in this file runs, so nothing reschedules. Delayed::Worker.max_run_time = 20.minutes partially mitigates by unlocking the job, but DJ retry behavior depends on the version and configuration.
  5. Delayed::Worker.max_run_time timeouts. A populate run that takes more than 20 minutes (typically populate_analytics_spotchecks_table on projects with years of history) can be unlocked and re-claimed, causing overlap.
  6. Schema drift. Projects whose analytics DBs were provisioned before newer migrations (missing PK, missing columns) will fail every hour with the same schema error until an admin re-runs Init Analytics DB for that project or rake analytics:add_primary_keys.
  7. Data quality issues. A single malformed row in source data (e.g. a spot-check without a question["us"] translation) will fail the same step every hour until the bad row is fixed in Pepcore Base.

7. Future fixes (proposed)

In roughly priority order:

7.1 Daily safety-net kickoff rake — closes gap #3

A daily rake task that finds any DbCredential / ProjectGroup with no future Delayed::Job row and re-enqueues it. This closes the one remaining "chain dies" scenario (reschedule itself failing during a DB outage) and makes the system fully self-healing.

Sketch:

namespace :analytics do
  desc "Re-enqueue hourly fetch jobs for credentials/groups with no future job."
  task ensure_scheduled: :environment do
    # DbCredential
    DbCredential.find_each do |c|
      has_job = Delayed::Job.where(unique_identifier: c.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                            .where('run_at > ?', Time.now).exists?
      next if has_job
      Rails.logger.warn("[analytics_sync] Re-kicking project #{c.project_id} (no future job)")
      c.fetch_data_and_schedule_next_fetch
    end

    # ProjectGroup — same pattern
    ProjectGroup.find_each do |g|
      has_job = Delayed::Job.where(unique_identifier: g.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                            .where('run_at > ?', Time.now).exists?
      next if has_job
      Rails.logger.warn("[analytics_sync] Re-kicking group #{g.id} (no future job)")
      g.fetch_data_and_schedule_next_fetch
    end
  end
end

Cron it daily from whatever scheduler you use.
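For example, if the app happens to use the whenever gem (an assumption — any scheduler works), the schedule entry could be:

```ruby
# config/schedule.rb (whenever gem) — sketch; assumes whenever is in use.
every 1.day, at: "4:30 am" do
  rake "analytics:ensure_scheduled"
end
```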

7.2 Per-step last_synced_at — improves accuracy

Today last_synced_at is a single column that only advances on a clean run. For projects with a persistently-failing step (e.g. spotchecks schema drift), the delta for check-ins grows unbounded and each hourly run re-fetches the same backlog.

Alternative: per-step timestamp columns (checkins_last_synced_at, morale_last_synced_at, etc.) so each step advances independently. Tradeoff: migration + code change across both DbCredential and ProjectGroup. Not urgent.
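A sketch of how the delta lower bound would change under this proposal (the column mapping follows the names suggested above; the Hash standing in for the model, and the helper itself, are assumptions):

```ruby
# Sketch only — per-step delta lower bound for the proposed columns.
# Column names follow the proposal; a Hash stands in for the AR model.
STEP_TIMESTAMP_COLUMNS = {
  populate_checkin_submissions_table: :checkins_last_synced_at,
  populate_morale_data_table:         :morale_last_synced_at,
  # ...one entry per populate step
}.freeze

def delta_lower_bound(credential, step)
  col = STEP_TIMESTAMP_COLUMNS.fetch(step)
  # Mirrors the existing `last_synced_at || 10.years.ago` fallback
  credential[col] || Time.now - (10 * 365 * 24 * 3600)
end
```

Each step would then advance only its own column on success, so a persistently-failing spotchecks step no longer forces check-ins to re-fetch an ever-growing backlog.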


8. Change log

Date                        | Change                                                                                                                      | Commit
2025-04-15 (± original refactor) | Upsert logic switched to import with on_duplicate_key_update without a unique constraint on id. Source of the duplication bug. | cc4c6bf / aeffd9b
2026-03-15                  | Four new analytics tables (baselines, benchmarks, spotchecks, dim_score_feelings) added to the populate sequence. Increased surface for per-step raises. | ac8b31d
2026-04-06 ~09:10–09:30 UTC | Incident: ~30 project chains died simultaneously. Root cause unknown (suspected deploy or transient DB condition); effect is the pre-fix defect in Section 2.2. | —
2026-04-15                  | ensure_primary_key! + upsert_records added to fix duplication. Does not address stoppage.                                    | 4896dfb
2026-04-16                  | Follow-up duplication fixes and rake analytics:add_primary_keys.                                                             | 124834a
2026-04-17                  | This fix (Layers 1 + 2): resilience changes in DbCredential and ProjectGroup (per-step rescue + ensure reschedule).          | (this PR)
2026-04-17                  | This fix (Layer 3): per-row / per-config hardening inside each populate step; shared safe_localized_text / safe_json_array helpers. | (this PR)

9. Related files

File                                           | Role
app/models/db_credential.rb                    | Per-project sync loop and populate steps. Resilience changes applied here.
app/models/project_group.rb                    | Group-level sync loop. Resilience changes applied here.
app/admin/projects.rb                          | Admin batch actions: kickoff_data_fetching, init_analytics_db, populate_analytics_data, reset_analytics_data.
config/initializers/delayed_job_config.rb      | max_attempts = 1, max_run_time = 20.minutes, destroy_failed_jobs = false (failed jobs kept for inspection).
lib/tasks/analytics.rake                       | Inspection + PK maintenance rake tasks. Proposed home for the Section 7.1 safety-net task.
docs/ANALYTICS_DATABASE_REFERENCE.md           | Table/schema reference for the analytics DB.
docs/ANALYTICS_DUPLICATION_INCIDENT_REPORT.md  | Prior incident (duplication), which added the PK/upsert safeguards.