Analytics DB — Sync Stoppage Incident & Resilience Fix
Date: April 17, 2026
Status: Root cause confirmed; resilience fixes implemented in the codebase. Pre-existing dead chains require a one-time manual kickoff to resume (details in Section 5). A daily safety-net cron is proposed as a follow-up (Section 7).
Related: Analytics DB — Duplication Incident Report · Analytics Database — Reference
1. Summary
As of April 17, 2026, 7 projects are not syncing to their analytics databases:
| Projects | last_synced_at | next_job | Category |
|---|---|---|---|
| 85 | NEVER | MISSING | Chain never started (kickoff never ran) |
| 103, 99, 97, 94 | 2026-04-06 09:11–09:12 UTC | RESTARTED | Chain died on April 6, 2026, then was re-kicked |
Root cause: the hourly analytics sync chain was implemented so that a single uncaught exception inside any populate step prevented the next Delayed::Job from being scheduled, permanently breaking the chain for that project until someone manually re-kicked it. The ~09:10–09:30 UTC window on April 6, 2026 was a single incident in which many projects' hourly runs raised simultaneously, killing 30+ chains at once.
Fix: the sync loop was made resilient at two levels:
1. Per-step isolation in `populate_database_with_morale_and_checkin_submissions` — each populate step is attempted independently.
2. Guaranteed reschedule in `fetch_data_and_schedule_next_fetch` via an `ensure` block — the next hourly job is always queued, even if populate blew up.

These changes apply to both DbCredential (per-project) and ProjectGroup (group-level).
2. Root cause analysis
2.1 How the sync loop works
Each project with an analytics DB has an hourly self-rescheduling Delayed::Job. The job:
- Calls `populate_database_with_morale_and_checkin_submissions` to run six populate steps.
- On success, updates `last_synced_at`.
- Enqueues the next Delayed::Job for 1 hour later, tagged with a unique identifier so it cannot be enqueued twice.
```
DbCredential#fetch_data_and_schedule_next_fetch
├── populate_database_with_morale_and_checkin_submissions
│   ├── populate_checkin_submissions_table
│   ├── populate_morale_data_table
│   ├── populate_analytics_baselines_table
│   ├── populate_analytics_benchmarks_table
│   ├── populate_analytics_spotchecks_table
│   ├── populate_dim_score_feelings_table
│   └── self.update(last_synced_at: Time.now)
└── self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
```
`ProjectGroup#fetch_data_and_schedule_next_fetch` has the same shape.
2.2 The defect (pre-fix)
The loop was written as a straight-line sequence with no exception handling:
```ruby
def fetch_data_and_schedule_next_fetch
  return if job_already_scheduled?
  populate_database_with_morale_and_checkin_submissions
  # Schedule the next job to run in 1 hour
  job = self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
  job.update(unique_identifier: self.data_fetching_job_identifier)
end

def populate_database_with_morale_and_checkin_submissions
  populate_checkin_submissions_table
  populate_morale_data_table
  populate_analytics_baselines_table
  populate_analytics_benchmarks_table
  populate_analytics_spotchecks_table
  populate_dim_score_feelings_table
  self.update(last_synced_at: Time.now)
end
```
Consequences of any raise inside any populate step:
- Later populate steps are skipped.
- `self.update(last_synced_at: Time.now)` is skipped.
- Execution unwinds out of `fetch_data_and_schedule_next_fetch` without reaching `self.delay(...)`.
- Delayed::Job records the job as failed (`Delayed::Worker.max_attempts = 1` → no retry).
- No future hourly job exists for this project. The chain is dead until someone calls `fetch_data_and_schedule_next_fetch` again manually.
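The failure mode can be reproduced in a few lines of plain Ruby (toy names, not the app's API): once any step raises, everything after it — including the call that would schedule the next run — is skipped.

```ruby
# Toy model of the pre-fix defect: a straight-line sequence means one raise
# skips everything after it, including the "schedule next job" call.
calls = []

schedule_next = -> { calls << :scheduled_next_job }

run_chain = lambda do
  calls << :step_1
  raise "boom in step 2"   # simulates one populate step raising
  calls << :step_3         # never reached
  schedule_next.call       # never reached -> chain is dead
end

begin
  run_chain.call
rescue => e
  # Delayed::Job would record the job as failed here (max_attempts = 1).
end

calls  # => [:step_1] — no :step_3, and crucially no :scheduled_next_job
```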
2.3 What typically causes a populate step to raise
Grouped by category:
| Category | Examples | Which steps |
|---|---|---|
| Connectivity | `PG::ConnectionBad`, `Sequel::DatabaseConnectionError`, `ActiveRecord::StatementTimeout` | All six |
| Schema drift | Missing PK on `id`; missing columns (`is_positive`, `score_4_count`, `score_5_count`) on analytics DBs provisioned before newer migrations | Check-ins, morale, spotchecks |
| Data quality | Malformed `tags` JSON (`JSON::ParserError`); nil `question["us"]` / `answer["us"]`; nil score | Check-ins, morale, spotchecks |
| Configuration | Unknown `DashboardSetting.baseline_calculation_period`; corrupt `benchmark_calculation_projects_array` | Baselines, benchmarks |
| Scale / memory / time | `.to_a` on multi-year submission ranges → OOM; `Delayed::Worker.max_run_time = 20.minutes` hit; giant `IN (?)` clauses | Baselines, benchmarks, spotchecks |
| External-service bugs | `MoraleCalculationService`, `BaselineRangeCalculator` edge cases | Baselines, benchmarks |
The April 6, 2026 incident is consistent with a single cause (likely a deploy or transient DB condition) that caused many concurrent hourly runs to raise in the same window. Because of the defect in 2.2, every run that raised lost its future job permanently.
3. Fix implemented
3.1 Files changed
- `app/models/db_credential.rb` — `fetch_data_and_schedule_next_fetch`, `populate_database_with_morale_and_checkin_submissions`
- `app/models/project_group.rb` — `fetch_data_and_schedule_next_fetch`, `populate_database_with_morale_and_checkin_submissions`
3.2 Layer 1 — Guarantee the reschedule (outer method)
`fetch_data_and_schedule_next_fetch` now wraps populate in `begin/rescue/ensure`. The `ensure` block always attempts to enqueue the next hourly job; if that enqueue itself fails (e.g. DB outage at exactly that moment), the failure is logged and the current job finishes cleanly rather than crashing.
```ruby
def fetch_data_and_schedule_next_fetch
  return if job_already_scheduled?
  begin
    populate_database_with_morale_and_checkin_submissions
  rescue => e
    Rails.logger.error("[analytics_sync] Project #{project_id} populate raised unexpectedly: ...")
  ensure
    begin
      job = self.delay(run_at: 1.hour.from_now).fetch_data_and_schedule_next_fetch
      job.update(unique_identifier: self.data_fetching_job_identifier)
    rescue => e
      Rails.logger.error("[analytics_sync] Project #{project_id} FAILED to schedule next fetch: ...")
    end
  end
end
```
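The guarantee can be seen in a minimal plain-Ruby sketch (toy stand-ins, not the app's code): the body may raise, but the "reschedule" in `ensure` still runs, and a failure inside the reschedule itself is logged and swallowed rather than crashing the job.

```ruby
# Minimal sketch of the Layer-1 guarantee: `ensure` runs whether or not the
# body raised, so the stand-in for "enqueue next job" always happens.
log = []

fetch_and_reschedule = lambda do |populate_raises:|
  begin
    raise "populate blew up" if populate_raises
    log << :populated
  rescue => e
    log << :populate_error_logged    # Rails.logger.error in the real code
  ensure
    begin
      log << :next_job_enqueued      # stands in for self.delay(...)
    rescue => e
      log << :enqueue_failure_logged # logged, not re-raised
    end
  end
end

fetch_and_reschedule.call(populate_raises: true)
log  # => [:populate_error_logged, :next_job_enqueued] — chain stays alive
```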
3.3 Layer 2 — Isolate each populate step (inner method)
Each of the six populate steps now runs inside its own `begin/rescue`. Errors are logged with backtrace and collected into an `errors` array. `self.update(last_synced_at: Time.now)` is gated on `errors.empty?` — we deliberately do not advance `last_synced_at` on a partial run.
```ruby
def populate_database_with_morale_and_checkin_submissions
  steps = %i[
    populate_checkin_submissions_table
    populate_morale_data_table
    populate_analytics_baselines_table
    populate_analytics_benchmarks_table
    populate_analytics_spotchecks_table
    populate_dim_score_feelings_table
  ]
  errors = []
  steps.each do |step|
    begin
      public_send(step)
    rescue => e
      errors << { step: step, error: "#{e.class}: #{e.message}" }
      Rails.logger.error("[analytics_sync] Project #{project_id} step #{step} failed: ...")
    end
  end
  if errors.empty?
    self.update(last_synced_at: Time.now)
  else
    Rails.logger.warn("[analytics_sync] Project #{project_id} finished with #{errors.size} failed step(s); last_synced_at NOT advanced. Failed: ...")
  end
  errors
end
```
3.4 Why last_synced_at is intentionally gated
`last_synced_at` is used by `fetch_checkin_submissions` (in DbCredential) as the lower bound of "what's new since last time":
```ruby
def fetch_checkin_submissions
  ActiveRecord::Base.connection_pool.with_connection do
    last_synced_at = self.last_synced_at || 10.years.ago
    self.project.check_in_submissions.where("updated_at > ?", last_synced_at)
  end
end
```
If we advanced `last_synced_at` on a partial run where `populate_checkin_submissions_table` failed, we would permanently skip the check-ins that failed to write that hour. By gating on `errors.empty?`, the next hour will retry the same delta — which is safe because `populate_checkin_submissions_table` and `populate_morale_data_table` use `upsert_records` (idempotent given the PK added in the prior duplication fix).
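The tradeoff can be simulated in a few lines of plain Ruby (hypothetical data; a Hash keyed by id stands in for the upserted table): gating the watermark means a partial run retries the same delta next hour, and idempotent writes make that retry harmless.

```ruby
# Simulation of the last_synced_at watermark. Rows are [id, updated_at];
# the sink Hash is keyed by id, so writes are idempotent like upsert_records.
source = [[1, 10], [2, 20], [3, 30]]
sink   = {}

sync = lambda do |watermark, fail_after:|
  delta   = source.select { |_id, t| t > watermark }
  written = 0
  failed  = false
  delta.each do |id, t|
    if written == fail_after
      failed = true
      break              # simulates the step raising mid-run
    end
    sink[id] = t         # idempotent upsert
    written += 1
  end
  # Gated: only advance the watermark on a clean run.
  failed ? watermark : (delta.map { |_, t| t }.max || watermark)
end

# Hour 1: the run fails after one row; the gated watermark stays at 0.
wm = sync.call(0, fail_after: 1)
# Hour 2: same delta retried; already-written row 1 is harmlessly overwritten.
wm = sync.call(wm, fail_after: 99)
sink  # => {1=>10, 2=>20, 3=>30} — nothing lost
```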
3.5 Same pattern applied to ProjectGroup
`ProjectGroup#fetch_data_and_schedule_next_fetch` runs only two populate steps inline (`populate_checkin_submissions_table`, `populate_morale_data_table`). It received the same `begin/rescue/ensure` treatment, preserving current behavior (2 steps, not 6). The separately defined `ProjectGroup#populate_database_with_morale_and_checkin_submissions` (used by admin batch actions) received the same per-step rescue pattern as DbCredential's.
3.6 Layer 3 — Per-row / per-config hardening inside each populate step
Per-step rescue (Section 3.3) prevents one bad step from killing the others, but a single bad row inside a step still aborts that whole step. Layer 3 pushes the safety net down into each step so that a single malformed row, a missing translation, or a corrupt config value skips just that row / sub-unit and keeps the rest of the step running.
Shared private helpers (added in DbCredential, the heavy code path):
- `safe_localized_text(value, locale = 'us')` — safely extract a localized string from a value that may be a `Hash` (e.g. `{"us" => "..."}`), a plain `String`, or `nil`. Falls back through `locale.to_sym` and then `.values.compact.first`. Used wherever we previously did `something["us"]`.
- `safe_json_array(value)` — coerce any tags-like value into an `Array`. Accepts `Array`, JSON-encoded `String`, or returns `[]` for anything else (`nil`, `Hash`, malformed JSON).
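The helper bodies themselves aren't reproduced in this report; one plausible implementation consistent with the behavior described above (the real methods in `app/models/db_credential.rb` may differ) is:

```ruby
require 'json'

# Sketch of safe_localized_text: Hash -> try the string key, then the symbol
# key, then any non-nil value; String passes through; nil returns nil.
def safe_localized_text(value, locale = 'us')
  case value
  when Hash
    value[locale] || value[locale.to_sym] || value.values.compact.first
  when String
    value
  end
end

# Sketch of safe_json_array: Arrays pass through, JSON-encoded strings are
# parsed, and everything else (nil, Hash, malformed JSON) becomes [].
def safe_json_array(value)
  case value
  when Array
    value
  when String
    parsed = JSON.parse(value) rescue nil
    parsed.is_a?(Array) ? parsed : []
  else
    []
  end
end

safe_localized_text({ "us" => "Hello" })    # => "Hello"
safe_localized_text({ de: "Hallo" }, 'us')  # => "Hallo" (fell back to values)
safe_json_array('["a","b"]')                # => ["a", "b"]
safe_json_array("{not json")                # => []
```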
Per-step hardening applied:
| Step | Hardening |
|---|---|
| `populate_checkin_submissions_table` | Per-row rescue around each submission (one bad submission is logged and skipped, not fatal). `safe_json_array` for tags. `safe_localized_text` for `spot_check.question` and `answer.answer`. `project.name` / `project.baseline_morale` cached once with safe fallbacks. Daily morale computed only from integer-coercible scores. |
| `populate_morale_data_table` | Pre-filter out submissions whose `created_at` isn't usable or whose score isn't integer-coercible. Per-day rescue around the aggregation so a single odd day is skipped, not fatal. `average_mood` safe against empty score sets. |
| `populate_analytics_baselines_table` | `baseline_calculation_period` validated as `Integer` (fallback 3). `DashboardSetting::PERIOD_TO_RANGE` guarded with `const_defined?` and fallback `'last_3_months'`. `BaselineRangeCalculator.calculate_date_range` wrapped — on failure, skip the step cleanly. `start_date` / `end_date` validated to respond to `:to_time` before use. `MoraleCalculationService.calculate_baseline_morale_from_submissions` wrapped — on failure, write a 0 baseline instead of raising. |
| `populate_analytics_benchmarks_table` | `benchmark_calculation_projects_array` validated as `Array`; non-array values logged and skipped. Each ID coerced to `Integer` (rejecting zero/negative/non-numeric). Same period/range hardening as baselines. Per-project rescue around `MoraleCalculationService.calculate_team_morale_from_submissions` — one bad benchmark project doesn't kill the benchmarks step. |
| `populate_analytics_spotchecks_table` | Per-spot-check rescue (one bad spot-check is logged and skipped). All localized text accessors (`positive_question_text`, `negative_question_text`, `title`, `answer_text`) wrapped — missing translations fall back to `"Spotcheck #<id>"`. `is_default` / `is_scheduled` coerced via `!!` to avoid odd truthy/falsy values. Per-row rescue around the final `conn_tbl.insert`. |
| `populate_dim_score_feelings_table` | `filter_map` with per-row rescue — bad `ScoreFeeling` rows logged and dropped, not fatal. `safe_localized_text` for `name`. Per-row rescue around insert. |
ProjectGroup mirror: since ProjectGroup mostly copies already-computed rows from each project's analytics DB, hardening there is per-row rescue around the Sequel insert calls for analytics_baselines, analytics_benchmarks, analytics_spotchecks, dim_score_feelings. populate_dim_score_feelings_table also got the same locale-key-safe name extraction and per-row skip pattern.
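The per-row shape used throughout this layer can be sketched generically in plain Ruby (all names here are illustrative stand-ins, not the app's data or API):

```ruby
# Generic Layer-3 per-row rescue: one bad row is counted and dropped,
# the rest of the step completes. `rows` stands in for source records and
# the returned hashes stand in for conn_tbl.insert calls.
rows = [
  { id: 1, name: { "us" => "Safety" } },
  { id: 2, name: nil },                  # bad row: no usable name
  { id: 3, name: { "us" => "Morale" } },
]

skipped  = 0
inserted = rows.filter_map do |row|
  begin
    name = row[:name].fetch("us")        # raises NoMethodError for the bad row
    { id: row[:id], name: name }         # the real code would insert here
  rescue => e
    skipped += 1                         # logged at warn level with [analytics_sync]
    nil                                  # filter_map drops the bad row
  end
end

inserted.map { |r| r[:id] }  # => [1, 3] — one bad row skipped, step completes
```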
Log tag `[analytics_sync]` is used everywhere so you can grep one string to see everything this resilience layer absorbed:

- `Project <id> skipped checkin_submission id=<id>: ...`
- `Project <id> skipped morale day=<date>: ...`
- `Project <id> BaselineRangeCalculator failed for option=...: ...`
- `Project <id> benchmark_calculation_projects_array is not an Array (...); skipping benchmarks`
- `Project <id> team morale failed for benchmark_pid=<pid>: ...`
- `Project <id> skipped spotcheck id=<id>: ...`
- `Project <id> populate_<step> skipped N row(s)`
- `Group <id> <step> insert failed id=<id>: ...`
These are warn-level because they are recoverable per-row skips. The only error-level logs remain the ones from Section 3 (whole-step failures and reschedule failures).
4. Result matrix (post-fix)
| Scenario | last_synced_at advances? | Next job queued? | Chain alive? |
|---|---|---|---|
| All 6 steps succeed | ✅ Yes | ✅ Yes | ✅ Yes |
| 1–5 steps fail, others succeed | ❌ No (intentional) | ✅ Yes (via ensure) | ✅ Yes — retries next hour |
| Populate raises something otherwise uncatchable | ❌ No | ✅ Yes (via ensure) | ✅ Yes |
| `self.delay(...)` itself raises (DB outage at that instant) | ❌ No | ❌ No | ⚠️ Broken — needs kickoff |
| `job_already_scheduled?` true at entry | N/A | ✅ (already queued) | ✅ Yes |
Observable behavior changes in logs:
- `[analytics_sync] Project <id> step <name> failed: ...` — per-step failures, per hour.
- `[analytics_sync] Project <id> finished with N failed step(s); last_synced_at NOT advanced. Failed: ...` — summary per run.
- `[analytics_sync] Project <id> FAILED to schedule next fetch: ...` — the only remaining fatal signal to alert on.
A simple log alert on `[analytics_sync] .* FAILED to schedule` is sufficient to catch the rare remaining dead-chain case.
5. Remediation for already-dead chains
Important: this fix does not auto-heal chains that died before deploy. Those projects need a one-time manual kickoff. Once kicked off, the new resilience keeps them alive going forward.
5.1 From the admin UI
Admin → Projects → (select affected projects) → "Kickoff data fetching"
5.2 From Rails console
```ruby
DbCredential.where(project_id: [85, 94, 97, 99, 103, 124, 125]).each do |c|
  puts "Kicking project #{c.project_id}..."
  c.fetch_data_and_schedule_next_fetch
end
```
For ProjectGroup:
```ruby
ProjectGroup.where(id: [<affected group ids>]).each do |g|
  puts "Kicking group #{g.id}..."
  g.fetch_data_and_schedule_next_fetch
end
```
5.3 Health-check queries
Sort worst-first (NULL last_synced_at first, then oldest):
```ruby
DbCredential.order(Arel.sql('last_synced_at NULLS FIRST')).each do |c|
  nj = Delayed::Job.where(unique_identifier: c.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                   .where('run_at > ?', Time.now).pick(:run_at)
  puts format("project=%-4d last_sync=%-30s next_job=%s",
              c.project_id, c.last_synced_at || "NEVER", nj || "MISSING")
end
```
Interpretation:
- `last_sync=recent` + `next_job=future` → healthy.
- `last_sync=old` + `next_job=future` → failing per step but chain alive (new post-fix signal). Grep logs for `[analytics_sync]` + that project to see which step is broken.
- `last_sync=*` + `next_job=MISSING` → chain dead; needs kickoff.
- `last_sync=NEVER` + `next_job=MISSING` → never started; needs kickoff.
Failed Delayed::Jobs for this pipeline:
```ruby
Delayed::Job.where("unique_identifier LIKE ? OR unique_identifier LIKE ?",
                   "analytics_%_fetch_data", "analytics_group_%_fetch_data")
            .where.not(failed_at: nil)
            .order(failed_at: :desc).limit(50)
            .pluck(:id, :unique_identifier, :failed_at, :attempts, :last_error)
```
6. What this fix does not fix
- Pre-existing dead chains. Must be manually kicked off (Section 5). The fix is preventative from here on out.
- Permanent bugs in a populate step. They now log every hour with the same error instead of silently breaking the chain. This is an intentional tradeoff: self-healing over fail-fast. Log volume is predictable and taggable.
- `self.delay(...)` itself raising at reschedule time. If the Rails DB is unreachable at exactly that instant, no new Delayed::Job row is created, and the chain is dead until someone kicks it off again. Rare, but possible.
- Worker-level failures. If the `delayed_job` worker process is killed mid-populate (deploy, OOM, crash), none of the code in this file runs, so nothing reschedules. `Delayed::Worker.max_run_time = 20.minutes` partially mitigates by unlocking the job, but DJ retry behavior depends on the version and configuration.
- `Delayed::Worker.max_run_time` timeouts. A populate run that takes more than 20 minutes (typically `populate_analytics_spotchecks_table` on projects with years of history) can be unlocked and re-claimed, causing overlap.
- Schema drift. Projects whose analytics DBs were provisioned before newer migrations (missing PK, missing columns) will fail every hour with the same schema error until an admin re-runs Init Analytics DB for that project or `rake analytics:add_primary_keys`.
- Data quality issues. A single malformed row in source data (e.g. a spot-check without a `question["us"]` translation) will fail the same step every hour until the bad row is fixed in Pepcore Base.
7. Future fixes (proposed)
In roughly priority order:
7.1 Daily safety-net kickoff rake — closes gap #3
A daily rake task that finds any DbCredential / ProjectGroup with no future Delayed::Job row and re-enqueues it. This closes the one remaining "chain dies" scenario (reschedule itself failing during a DB outage) and makes the system fully self-healing.
Sketch:
```ruby
namespace :analytics do
  desc "Re-enqueue hourly fetch jobs for credentials/groups with no future job."
  task ensure_scheduled: :environment do
    # DbCredential
    DbCredential.find_each do |c|
      has_job = Delayed::Job.where(unique_identifier: c.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                            .where('run_at > ?', Time.now).exists?
      next if has_job
      Rails.logger.warn("[analytics_sync] Re-kicking project #{c.project_id} (no future job)")
      c.fetch_data_and_schedule_next_fetch
    end

    # ProjectGroup — same pattern
    ProjectGroup.find_each do |g|
      has_job = Delayed::Job.where(unique_identifier: g.data_fetching_job_identifier, failed_at: nil, locked_at: nil)
                            .where('run_at > ?', Time.now).exists?
      next if has_job
      Rails.logger.warn("[analytics_sync] Re-kicking group #{g.id} (no future job)")
      g.fetch_data_and_schedule_next_fetch
    end
  end
end
```
Cron it daily from whatever scheduler you use.
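For example, if the app schedules cron via the `whenever` gem (an assumption — substitute your actual scheduler's syntax), the entry could look like:

```ruby
# config/schedule.rb — hypothetical whenever-gem entry for the safety net
every 1.day, at: '4:30 am' do
  rake 'analytics:ensure_scheduled'
end
```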
7.2 Per-step last_synced_at — improves accuracy
Today `last_synced_at` is a single column that only advances on a clean run. For projects with a persistently failing step (e.g. spotchecks schema drift), the delta for check-ins grows unbounded and each hourly run re-fetches the same backlog.
Alternative: per-step timestamp columns (`checkins_last_synced_at`, `morale_last_synced_at`, etc.) so each step advances independently. Tradeoff: migration + code change across both DbCredential and ProjectGroup. Not urgent.
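If this alternative is ever picked up, the migration might look like the following (class and column names are invented here to illustrate the shape; nothing is implemented):

```ruby
# Hypothetical migration sketch for per-step watermarks on db_credentials.
class AddPerStepSyncTimestampsToDbCredentials < ActiveRecord::Migration[7.0]
  def change
    add_column :db_credentials, :checkins_last_synced_at,   :datetime
    add_column :db_credentials, :morale_last_synced_at,     :datetime
    add_column :db_credentials, :baselines_last_synced_at,  :datetime
    add_column :db_credentials, :benchmarks_last_synced_at, :datetime
    add_column :db_credentials, :spotchecks_last_synced_at, :datetime
  end
end
```

A matching change would be needed on `project_groups`, plus per-step reads/writes in both sync loops, which is why the report defers it.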
8. Change log
| Date | Change | Commit |
|---|---|---|
| 2025-04-15 (± original refactor) | Upsert logic switched to `import` with `on_duplicate_key_update` without a unique constraint on `id`. Source of duplication bug. | cc4c6bf / aeffd9b |
| 2026-03-15 | Four new analytics tables (baselines, benchmarks, spotchecks, dim_score_feelings) added to the populate sequence. Increased surface for per-step raises. | ac8b31d |
| 2026-04-06 ~09:10–09:30 UTC | Incident: ~30 project chains died simultaneously. Root cause unknown (suspected deploy or transient DB condition); effect is the pre-fix defect in Section 2.2. | — |
| 2026-04-15 | `ensure_primary_key!` + `upsert_records` added to fix duplication. Does not address stoppage. | 4896dfb |
| 2026-04-16 | Follow-up duplication fixes and `rake analytics:add_primary_keys`. | 124834a |
| 2026-04-17 | This fix (Layers 1 + 2): resilience changes in DbCredential and ProjectGroup (per-step rescue + ensure reschedule). | (this PR) |
| 2026-04-17 | This fix (Layer 3): per-row / per-config hardening inside each populate step; shared `safe_localized_text` / `safe_json_array` helpers. | (this PR) |
9. Related files
| File | Role |
|---|---|
| `app/models/db_credential.rb` | Per-project sync loop and populate steps. Resilience changes applied here. |
| `app/models/project_group.rb` | Group-level sync loop. Resilience changes applied here. |
| `app/admin/projects.rb` | Admin batch actions: `kickoff_data_fetching`, `init_analytics_db`, `populate_analytics_data`, `reset_analytics_data`. |
| `config/initializers/delayed_job_config.rb` | `max_attempts = 1`, `max_run_time = 20.minutes`, `destroy_failed_jobs = false` (failed jobs kept for inspection). |
| `lib/tasks/analytics.rake` | Inspection + PK maintenance rake tasks. Proposed home for the Section 7.1 safety-net task. |
| `docs/ANALYTICS_DATABASE_REFERENCE.md` | Table/schema reference for the analytics DB. |
| `docs/ANALYTICS_DUPLICATION_INCIDENT_REPORT.md` | Prior incident (duplication), which added the PK/upsert safeguards. |