
Analytics DB — Duplication Incident Report

Date: April 15, 2026

Status: Root cause confirmed; safeguards implemented in the codebase; production investigation and duplicate-day audit documented below. No analytics DB reset or data repair has been run in production yet — that remains after merge/deploy and operational sign-off.


1. Summary

32 of the 122 directory projects (26.9% of the 119 audited credentials) have duplicated checkin_submissions records in their analytics databases; 87 are clean, and 3 have no analytics DB. The worst-affected project, Project 58, had 634,210 rows in analytics against only 16,596 in source: 617,619 duplicate rows, a 38x inflation.

The duplication was caused by repeated analytics sync operations (admin batch actions and scheduled background jobs) writing to analytics tables that were missing a unique constraint on the id column. Without this constraint, the intended upsert logic silently degraded into repeated inserts.

The duplication traces back to a code refactor on April 15, 2025 (commit cc4c6bf) that changed the populate logic from a safe check-then-insert pattern to Sequel's batch import with on_duplicate_key_update: { conflict_target: [:id] } — without adding the unique constraint this approach requires. Projects that had active scheduled jobs during or shortly after this transition accumulated duplicates before the last_synced_at delta mechanism stabilized. See Section 2.5 for full timeline.


2. Root Cause Analysis

2.1 How the analytics pipeline works

Pepcore Source DB                    Analytics DB (per project)
┌──────────────────────┐             ┌──────────────────────────┐
│ check_in_submission  │  ──sync──▶  │ checkin_submissions      │
│ (PK on id, unique)   │             │ (no PK, no unique index) │
└──────────────────────┘             └──────────────────────────┘
  • Each project has an analytics database (e.g. analytics_44) with a checkin_submissions table.
  • Data is copied from the Pepcore source (check_in_submission table) into analytics via a sync/populate process.
  • The populate code uses Sequel's import with on_duplicate_key_update: { conflict_target: [:id] } — an upsert that requires a unique constraint on id to detect conflicts.
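The failure mode can be illustrated without a database. In the pure-Ruby sketch below, a Hash keyed by id stands in for a table with a unique index (an upsert replaces in place), while an Array stands in for the analytics table as it was actually created (nothing to conflict against, so every sync appends):

```ruby
# Illustrative only: a Hash keyed by id models a table whose id column has a
# unique index (upsert replaces in place); an Array models the analytics table
# as actually created (no constraint), where every "upsert" is just an append.

source_rows = [{ id: 1, score: 10 }, { id: 2, score: 20 }]

with_unique_index  = {}  # id => row
without_constraint = []  # plain heap of rows

3.times do
  source_rows.each do |row|
    with_unique_index[row[:id]] = row  # conflict on id -> update in place
    without_constraint << row          # nothing to conflict against -> insert
  end
end

puts with_unique_index.size   # 2 rows, as intended
puts without_constraint.size  # 6 rows: every sync re-inserted everything
```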

2.2 The defect

When analytics tables were created via init_analytics_db, the id column was added as a plain nullable integer — no primary key, no unique index, no constraints at all.

File: app/models/db_credential.rb (method create_checkin_submissions_attributes)

connection.alter_table :checkin_submissions do
  add_column :id, :integer unless attributes.include?(:id)
  # ... other columns ...
end
# ← No unique index or PK was created here

2.3 The trigger

Without a unique constraint, conflict_target: [:id] has nothing to conflict against. Every populate run inserted all fetched rows as new records instead of updating existing ones.

Four code paths trigger populate:

Trigger                                          File                           Method
Admin batch action: "Populate Analytics Data"    app/admin/projects.rb          populate_analytics_data
Admin batch action: "Kickoff Data Fetching"      app/admin/projects.rb          kickoff_data_fetching
Recurring Delayed Job (hourly)                   app/models/db_credential.rb    fetch_data_and_schedule_next_fetch
ProjectGroup Delayed Job (hourly)                app/models/project_group.rb    fetch_data_and_schedule_next_fetch

Each execution added duplicate copies of the same source records. Over days/weeks, this compounded into massive inflation.
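A back-of-envelope model of the compounding (an illustration, not a reconstruction of the actual job history): if each run re-inserts the full source set, totals grow linearly with the number of runs, which is consistent with the double-digit inflation ratios observed in Section 3:

```ruby
# Hypothetical: if every populate run re-inserts all source rows, the
# analytics row count is simply source_rows * runs.
source_rows = 3_154  # Project 44 source count (Section 3.1)
runs        = 17     # hypothetical number of full re-inserts

total = source_rows * runs
ratio = total.to_f / source_rows

puts total  # 53618
puts ratio  # 17.0 -- in the same ballpark as Project 44's observed 16.8x
```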

2.4 Why not all projects were affected equally

Production audit confirmed that both affected and clean projects have last_synced_at set to the audit date (April 16, 2026) and active scheduled jobs. The duplication is historical, not ongoing: the hourly jobs now fetch only delta records (usually empty), so no new duplicates are being created, but old duplicates remain.

Latest production audit (Section 3.3): 119 DbCredential rows checked — 32 duplicated, 87 clean, 0 errors. (122 directory projects minus 3 without a credential / analytics link in that run.)

2.5 Root cause timeline (from git history)

  • Mar 14, 2025: Analytics DB feature created (65e4512). Populate used a safe check-then-insert: if existing_record.present? → update, else → insert. No unique constraint needed; no duplicates possible.
  • Mar 17, 2025: Most DB credentials created and jobs kicked off for early projects. Safe pattern still in use.
  • Apr 15, 2025: Refactor commit cc4c6bf switched populate to batch import with on_duplicate_key_update: { conflict_target: [:id] } and added the last_synced_at column (defaulting to nil for all existing credentials). Breaking change: the batch upsert requires a unique constraint on id, which was never added.
  • Apr 15, 2025 onward: The first job run after deploy saw last_synced_at = nil → fetched ALL source records → ran import against a table with no unique constraint → inserted duplicates. The 32 projects with active jobs accumulated duplicates during this window.
  • Eventually: last_synced_at stabilized for most projects, so subsequent runs fetched an empty delta and created no new duplicates. Duplication stopped growing, but existing duplicates remained.

Why only 32 of 82 projects with jobs were affected:

The original check-then-insert code (if existing_record.present?) was safe without a unique constraint. The April 15 refactor removed this safety. Projects that had active hourly jobs during the transition window re-inserted all records before last_synced_at could stabilize. Projects whose first post-refactor populate completed cleanly (setting last_synced_at) only ever fetched empty deltas afterward, avoiding duplication.
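The stabilization mechanism described above can be sketched in a few lines (a toy model of the assumed last_synced_at semantics: nil means fetch everything, otherwise fetch only rows newer than the mark):

```ruby
require "time"

# Toy model (assumed semantics) of the last_synced_at delta mechanism:
# nil -> fetch everything; otherwise fetch only rows created since the mark.
def fetch_delta(source_rows, last_synced_at)
  return source_rows if last_synced_at.nil?
  source_rows.select { |r| r[:created_at] > last_synced_at }
end

source = [
  { id: 1, created_at: Time.utc(2025, 4, 10) },
  { id: 2, created_at: Time.utc(2025, 4, 12) },
]

first_run  = fetch_delta(source, nil)  # full fetch (the dangerous one)
mark       = source.map { |r| r[:created_at] }.max
second_run = fetch_delta(source, mark) # empty delta once the mark is set

puts first_run.size   # 2
puts second_run.size  # 0
```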


3. Evidence (sampled April 15, 2026)

3.1 Project 44 (originally reported)

Source DB (check_in_submission) rows:           3,154 (all unique)
Analytics DB (checkin_submissions) total rows:  55,513
Analytics DB distinct IDs:                      3,308
Duplicate rows:                                 52,205
Inflation ratio:                                16.8x
Duplicate window:                               March 24 – April 2, 2026
Data after April 2:                             Clean (dup factor 1.0)
Last data timestamp:                            April 10 (no data since April 11)
Top duplicated ID (447565):                     1,189 copies

Note: The duplicate-heavy calendar days for Project 44 are 2026-03-24 through 2026-04-02 (8 days). That is not the same as the April 15, 2025 engineering date when the unsafe batch upsert shipped — see Section 3.4 for all projects’ duplicate-day ranges.

3.2 Random 10-project sample

Project ID  Project Name               Source Count  Analytics Total  Distinct IDs  Duplicate Rows  Ratio
86          Team ML08x                 2,388         171,357          2,388         168,969         71.8x
79          Team FRA 134/135           11,435        442,778          11,430        431,348         38.7x
72          Team Newport               9,055         259,518          9,050         250,468         28.7x
51          Mace Pure Dub01            3,702         52,435           3,700         48,735          14.2x
113         Yondr Ops DC FRA1          704           656              656           0               0.93x
111         Equinix/DPR DC17 Virginia  1,055         971              971           0               0.92x
100         Project Advance            1,386         1,274            1,274         0               0.92x
112         Yondr Ops DC NOV1          943           803              803           0               0.85x
110         STO QTS DFW2 DC4           74            53               53            0               0.72x
122         Yondr Collen Building C    0             N/A

4 out of 9 projects with analytics DBs had severe duplication (initial sample).

3.3 Full production audit (counts)

Total projects (directory):                   122
DbCredential records checked (prod console):  119
Clean (total_rows == distinct_ids):           87
Duplicated:                                   32 (26.9% of checked credentials)
No checkin_submissions table / other:         0 (in this run)
Errors:                                       0

Three projects have no DbCredential in this run (122 − 119 = 3), presumably because they are not yet linked to an analytics DB.

Top 10 most affected projects:

Project ID  Project Name      Source Count  Analytics Total  Duplicate Rows  Ratio
58          Team GBL6         16,596        634,210          617,619         38.2x
25          SigmaRoc          20,601        483,924          460,713         23.5x
79          Team FRA 134/135  11,435        442,778          431,348         38.7x
74          Team Madrid       14,930        436,722          421,799         29.3x
52          SAM3A Team        8,495         396,996          388,502         46.7x
73          Team Ersbo        23,743        406,843          383,110         17.1x
50          Team Powergate    44,615        417,783          373,172         9.4x
43          Team PA14         3,891         329,637          325,748         84.7x
71          IPS Rahway MACS   8,890         296,270          287,382         33.3x
72          Team Newport      9,057         259,521          250,468         28.7x

All 32 duplicated projects (sorted by duplicate rows): see Section 3.4 for per-project duplicate-day windows.

3.4 Per-project duplicate-day windows (production)

Method: For each project’s analytics DB, days where COUNT(*) > COUNT(DISTINCT id) on checkin_submissions grouped by created_at::date (duplicate-heavy calendar days in analytics).
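The same day-level check can be mirrored in pure Ruby (the SQL form is roughly GROUP BY created_at::date HAVING COUNT(*) > COUNT(DISTINCT id)):

```ruby
require "date"

# Pure-Ruby mirror of the audit query: a calendar day is "duplicate-heavy"
# when its total row count exceeds its distinct-id count.
rows = [
  { id: 1, created_at: Date.new(2026, 3, 20) },
  { id: 1, created_at: Date.new(2026, 3, 20) },  # duplicate of id 1
  { id: 2, created_at: Date.new(2026, 3, 21) },
]

dup_days = rows.group_by { |r| r[:created_at] }.select do |_day, day_rows|
  day_rows.size > day_rows.map { |r| r[:id] }.uniq.size
end.keys

puts dup_days.map(&:iso8601).inspect  # ["2026-03-20"]
```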

Executive summary

  1. March 20, 2026 onward (majority of the volume): In the production audit, 20 of 32 duplicated projects show their first duplicate-heavy day on exactly 2026-03-20; several others show their first such day a few days later in the same window (2026-03-23 to 2026-03-30, e.g. Projects 44, 41, 53). Almost all heavy duplication calendar days (where total_rows > distinct_ids per day) fall between 2026-03-20 and 2026-04-06. That pattern is consistent with a bulk operational action around that time, most likely the Admin → Projects batch action “Populate Analytics Data” (or equivalent) run across many projects: each run re-fetched a large slice of rows and inserted it again because id had no unique constraint, so the same source IDs appear many times on the same created_at dates.
  2. Earlier traces do not contradict that story: 7 of 32 projects have at least one duplicate-heavy day before 2026-03-20 (e.g. 12, 16, 18, 21, 71, 78, 1 — dates in 2024–2025 or scattered). Those are expected from the same structural bug since the April 2025 batch-upsert change: any earlier hourly job or populate that re-fetched overlapping rows could create duplicate-heavy days on whatever calendar dates those check-ins were stamped. So: the big March–April 2026 spike aligns with a deliberate bulk populate; smaller or older spikes show the defect was exploitable before that action as well.
  3. Project 44 (narrower window): Duplicate-heavy days 2026-03-24 through 2026-04-02 only (8 days) — first dup day after 2026-03-20, so not every project participated on the exact same first day.
  4. Minimal / single-day (2026): Project 38 (2026-03-24 only), Project 2 (2026-04-02 only), Project 53 (2026-03-23–24).

Evidence vs. proof: The calendar alignment across dozens of projects is strong evidence for a shared trigger (bulk populate). Server logs / admin audit trails would be the definitive proof of who ran which batch action when; the database alone cannot name the user.

Per-project summary (first duplicate-heavy day → last; count of such days)

Project ID  Name                   Dup days  First dup day  Last dup day
58          Team GBL6              14        2026-03-20     2026-04-04
25          SigmaRoc               16        2026-03-20     2026-04-04
79          Team FRA 134/135       14        2026-03-20     2026-04-05
74          Team Madrid            11        2026-03-20     2026-04-01
52          SAM3A Team             14        2026-03-20     2026-04-04
73          Team Ersbo             18        2026-03-20     2026-04-06
50          Team Powergate         17        2026-03-20     2026-04-06
43          Team PA14              14        2026-03-20     2026-04-04
71          IPS Rahway MACS        17        2025-05-05     2026-04-04
72          Team Newport           14        2026-03-20     2026-04-04
32          Team GRQ               18        2026-03-20     2026-04-06
86          Team ML08x             13        2026-03-20     2026-04-03
42          Team LD14              15        2026-03-20     2026-04-04
81          Project NVA05D         11        2026-03-20     2026-04-02
75          Team Frankfurt         14        2026-03-20     2026-04-05
21          Peps Peeps             16        2025-01-23     2026-04-02
48          StructureTone NTT TX4  13        2026-03-20     2026-04-04
44          Team FR8.2             8         2026-03-24     2026-04-02
51          Mace Pure Dub01        12        2026-03-20     2026-04-03
80          Holder SVY01C-F        8         2026-03-20     2026-04-04
76          Team Harlow            9         2026-03-20     2026-04-02
77          DPR Expansion Team     9         2026-03-20     2026-04-03
53          Lebanon Peptide        2         2026-03-23     2026-03-24
14          Lebanon Gen Site       5         2026-03-20     2026-04-03
41          AIB                    4         2026-03-30     2026-04-02
16          Yondr Redkite          100       2025-02-13     2025-05-31
38          StructureTone Hawk     1         2026-03-24     2026-03-24
2           CS Demos               1         2026-04-02     2026-04-02
78          PepTalk Live           31        2025-03-20     2026-01-22
12          Team PepTalk           46        2023-11-10     2025-04-01
18          John Paul Drogheda     8         2025-06-04     2025-06-12
1           Legacy                 1         2024-03-06     2024-03-06

Full day lists for each project are in the production console output archive (same order as above).

3.5 Source DB integrity confirmed

  • check_in_submission in Pepcore Base has a primary key (check_in_submission_pkey) and a unique index (check_in_submission_couchdb_id_unique).
  • For Project 44: COUNT(*) == COUNT(DISTINCT id) — source IDs are fully unique.
  • The CheckInSubmission model is read-only (def readonly?; true; end) and no write paths exist in this codebase.
  • Conclusion: source data is clean. Duplication is entirely analytics-side.

4. Safeguards Implemented

4.1 Shared concern: AnalyticsUpsertSafety

File: app/models/concerns/analytics_upsert_safety.rb

A reusable module included by both DbCredential and ProjectGroup that:

  1. Checks if a unique index or primary key already exists on the table's id column.
  2. If missing and no duplicates exist: creates the unique index automatically.
  3. If missing and duplicates exist: logs a warning and returns false so the caller skips populate gracefully (no crash, no further duplication).
# Returns true  → safe to upsert
# Returns false → duplicates found, skip populate, log warning
def ensure_unique_id!(connection, table)
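A sketch of the logic described above (the real implementation lives in app/models/concerns/analytics_upsert_safety.rb; the helper methods on connection used here, unique_index_on_id?, duplicate_ids?, and create_unique_index!, are hypothetical stand-ins for this illustration):

```ruby
# Hedged sketch of the safeguard, with hypothetical helper names:
# true  -> safe to upsert (index existed or was just created)
# false -> duplicates found, caller skips populate and logs a warning
def ensure_unique_id!(connection, table)
  return true if connection.unique_index_on_id?(table)

  if connection.duplicate_ids?(table)
    warn "#{table}: duplicates present, skipping populate"
    return false
  end

  connection.create_unique_index!(table)
  true
end

# Stub connection for demonstration: index missing, duplicates present.
stub = Object.new
def stub.unique_index_on_id?(_table); false; end
def stub.duplicate_ids?(_table); true; end

puts ensure_unique_id!(stub, :checkin_submissions)  # false
```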

4.2 Guard applied to all upsert paths

Both DbCredential and ProjectGroup now call ensure_unique_id! before every upsert on:

  • checkin_submissions
  • morale_data

If the guard returns false, the populate step is skipped for that table — the process continues with remaining tables and other projects.

4.3 Error isolation in admin batch actions

File: app/admin/projects.rb

All four batch actions (init_analytics_db, populate_analytics_data, reset_analytics_data, kickoff_data_fetching) now wrap each project in rescue => e:

  • Failed projects are skipped instead of crashing the batch.
  • The admin sees which projects were skipped and why.
  • Remaining projects proceed normally.
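The isolation pattern looks roughly like this (a sketch with simulated projects; the actual batch actions are in app/admin/projects.rb):

```ruby
# Sketch of per-project error isolation: each project runs inside its own
# begin/rescue, so one failure is recorded and skipped rather than aborting
# the whole batch action.
skipped = {}

projects = [
  { id: 1, name: "OK project" },
  { id: 2, name: "Broken project" },
]

projects.each do |project|
  raise "connection refused" if project[:name] == "Broken project"  # simulated failure
  # ... populate analytics for this project ...
rescue => e
  skipped[project[:id]] = e.message
end

puts skipped.inspect  # {2=>"connection refused"}
```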


5. Remediation Steps

Step 1: Deploy the safeguard code

Push the changes to production. Files changed:

  • app/models/concerns/analytics_upsert_safety.rb (new)
  • app/models/db_credential.rb (updated)
  • app/models/project_group.rb (updated)
  • app/admin/projects.rb (updated)

Step 2: Reset all analytics databases

SSH into production, open rails console, and run:

errors = []

DbCredential.includes(:project).find_each do |cred|
  project = cred.project
  next unless project

  puts "Resetting project #{project.id} (#{project.name})..."
  cred.reset_analytics_data
  puts "  Done."
rescue => e
  errors << { project_id: project&.id, error: e.message }
  puts "  SKIPPED: #{e.message}"
end

puts "\n=== COMPLETE ==="
puts "Errors: #{errors.size}"
errors.each { |e| puts "  Project #{e[:project_id]}: #{e[:error]}" }

What reset_analytics_data does per project:

  1. Deletes all rows from all 6 analytics tables.
  2. Resets last_synced_at to nil (so the next populate fetches all source records).
  3. Calls populate_database_with_morale_and_checkin_submissions, which:
     • Runs ensure_unique_id! → creates the unique index on the now-empty table.
     • Upserts all source records (now properly deduplicated by the index).

Step 3: Verify all projects are clean

clean = 0; dirty = 0

DbCredential.includes(:project).find_each do |cred|
  next unless cred.project
  cred.with_analytics_connection do |conn|
    next unless conn.table_exists?(:checkin_submissions)
    row = conn.fetch("SELECT COUNT(*) AS total, COUNT(DISTINCT id) AS distinct_ids FROM checkin_submissions").first
    dupes = row[:total].to_i - row[:distinct_ids].to_i
    if dupes > 0
      dirty += 1
      puts "DIRTY project #{cred.project_id}: #{dupes} duplicates remain"
    else
      clean += 1
    end
  end
rescue => e
  puts "ERROR project #{cred.project_id}: #{e.message}"
end

puts "\nClean: #{clean}, Dirty: #{dirty}"

Expected result: Dirty: 0

Step 4: Verify unique index exists

DbCredential.includes(:project).find_each do |cred|
  next unless cred.project
  cred.with_analytics_connection do |conn|
    next unless conn.table_exists?(:checkin_submissions)
    idx = conn.fetch("SELECT indexname FROM pg_indexes WHERE tablename='checkin_submissions' AND indexname='idx_checkin_submissions_unique_id'").first
    status = idx ? "HAS INDEX" : "MISSING INDEX"
    puts "Project #{cred.project_id}: #{status}"
  end
rescue => e
  puts "Project #{cred.project_id}: ERROR #{e.message}"
end

Expected result: all projects show HAS INDEX.


6. Prevention (going forward)

Risk                                        Mitigation
Repeated populate creating duplicates       Unique index on id enforced before every upsert
Dirty DB silently getting worse             ensure_unique_id! detects existing duplicates and skips populate
One project error killing the entire batch  Per-project rescue in all admin batch actions
Same bug in morale_data table               Same safeguard applied to morale_data upserts
Same bug in ProjectGroup DB                 Same concern included in the ProjectGroup model

7. Files Changed

File                                            Change
app/models/concerns/analytics_upsert_safety.rb  New. Shared concern for unique constraint enforcement.
app/models/db_credential.rb                     Include concern; replace old guard; add morale_data guard; use next unless pattern.
app/models/project_group.rb                     Include concern; replace old guard; add morale_data guard; use next unless pattern.
app/admin/projects.rb                           Add per-project error handling in all 4 batch actions.