A fragmented manual data pipeline was rebuilt to support AI diagnostics across 280,000 active patient records with full compliance from day one.

A multi-site health network managing 280,000 active patient records across eight facilities had accumulated patient data across four incompatible systems over 14 years of organic growth and two hospital acquisitions. Their EHR held clinical encounter data. A separate practice management system held scheduling and billing records. A legacy lab system managed diagnostic results under a different patient identifier schema. And a recently acquired facility was still operating on its own standalone EHR with no integration to the parent network.
The data team was spending 11 days per analytics cycle manually reconciling patient records across the four systems before any clinical analysis could begin. The network's strategy team had identified three AI diagnostics initiatives — a readmission risk model, a chronic disease progression predictor, and a population health segmentation tool — that were blocked entirely until the data environment could support them. Verttx rebuilt the patient data infrastructure from the ground up. The unified platform went live 14 weeks after the initial discovery call.
The root cause of the data fragmentation problem was a lack of a master patient index that could reliably link records for the same individual across all four systems. Each system used a different patient identifier: the EHR used a network-assigned MRN, the practice management system used an insurance member ID as its primary key, the lab system used its own sequential identifier, and the acquired facility's EHR used a date-of-birth plus name hash. Linking records across systems required a manual matching process performed by two data analysts who spent an estimated 440 combined hours per analytics cycle on reconciliation work before any analysis could begin.
The matching process was also error-prone. An internal audit of a 12,000-record sample found that 6.3% of records had been incorrectly linked at least once in the previous 18 months — creating patient records where diagnostic results, medication histories, or encounter notes from one patient had been associated with another. In a clinical analytics context, incorrect patient matching does not just produce bad data. It produces clinically dangerous data. The network's Chief Data Officer had flagged the error rate to the board as a patient safety risk, not just an operational one.
The three AI initiatives sitting on the roadmap were not the only casualty of the data environment. The network's value-based care contracts required quarterly quality reporting to two payer partners. Each quarterly report was taking the data team three weeks to produce because the source data required the same manual reconciliation process as the analytics work. Both payer partners had flagged submission delays in the previous contract year.
The foundation of the new platform is a probabilistic master patient index (MPI) that links patient records across all four source systems using a deterministic-first, probabilistic-fallback matching algorithm. Deterministic matching uses exact agreement on MRN, date of birth, and name — linking 84.7% of records with high confidence. The remaining 15.3% are processed through a probabilistic matching model trained on the network's own confirmed duplicate and non-duplicate pairs, using a weighted Fellegi-Sunter scoring approach across name, DOB, address, phone, and insurance ID fields. Records with a match probability below 0.82 are routed to a human review queue rather than auto-linked — a threshold calibrated to produce fewer than 0.3% incorrect links at the volume of records the network manages.
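The deterministic-first, probabilistic-fallback flow can be sketched as follows. This is a simplified illustration, not the production implementation: the field weights below are hypothetical stand-ins for values that would be estimated from the network's confirmed duplicate and non-duplicate pairs, and the 0.82 threshold is the one cited above.

```python
import math

# Hypothetical (agreement, disagreement) log-odds weights per field.
# In a real Fellegi-Sunter model these are estimated from labelled
# duplicate / non-duplicate pairs, not hand-picked.
FIELD_WEIGHTS = {
    "name":      (2.2, -1.0),
    "dob":       (3.0, -2.5),
    "address":   (1.5, -0.5),
    "phone":     (1.8, -0.8),
    "insurance": (2.5, -1.2),
}

def deterministic_match(a: dict, b: dict) -> bool:
    """Exact agreement on MRN, date of birth, and name links with high confidence."""
    return all(a.get(f) and a.get(f) == b.get(f) for f in ("mrn", "dob", "name"))

def fellegi_sunter_score(a: dict, b: dict) -> float:
    """Sum per-field agreement/disagreement weights (Fellegi-Sunter style)."""
    score = 0.0
    for field, (agree_w, disagree_w) in FIELD_WEIGHTS.items():
        if a.get(field) is None or b.get(field) is None:
            continue  # a missing field contributes no evidence either way
        score += agree_w if a[field] == b[field] else disagree_w
    return score

def match_probability(score: float) -> float:
    """Map the log-odds score to a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

def link_decision(a: dict, b: dict, threshold: float = 0.82) -> str:
    """Auto-link on deterministic match or high probability; else route to review."""
    if deterministic_match(a, b):
        return "auto-link"
    p = match_probability(fellegi_sunter_score(a, b))
    return "auto-link" if p >= threshold else "human-review"
```

Under this scheme, a pair agreeing on most demographic fields clears the threshold and is linked automatically, while a pair agreeing only on date of birth scores below it and lands in the review queue — which is exactly the behaviour the calibrated 0.82 cutoff is meant to produce.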
All four source systems now feed into a centralised clinical data repository through HL7 FHIR R4 interfaces. Where source systems did not natively support FHIR output — which was true of both the legacy lab system and the acquired facility's EHR — Verttx built lightweight transformation services that convert the proprietary data formats into FHIR-conformant resources before ingestion. The repository standardises clinical data across 14 FHIR resource types including Patient, Encounter, Observation, MedicationRequest, DiagnosticReport, and Condition — providing a consistent data model that AI models can be trained and evaluated against without per-project data preparation.
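A transformation service of this kind is essentially a field mapping from the proprietary schema to a FHIR-conformant resource. The sketch below shows the shape of such a mapping for a legacy lab result becoming a FHIR R4 Observation; the input field names (`lab_id`, `loinc`, `mpi_id`, and so on) are hypothetical stand-ins for the legacy system's actual schema, not Verttx's real mapping.

```python
def lab_record_to_fhir_observation(rec: dict) -> dict:
    """Map a proprietary lab result row to a FHIR R4 Observation resource.

    The keys expected on `rec` are illustrative; a real transformation
    service would be driven by the legacy system's documented schema.
    """
    return {
        "resourceType": "Observation",
        "identifier": [{
            # Preserve the legacy system's own result identifier.
            "system": "urn:legacy-lab:result-id",
            "value": rec["lab_id"],
        }],
        "status": "final",
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": rec["loinc"],
                "display": rec.get("test_name", ""),
            }]
        },
        # The MPI identifier resolved upstream links the result to the
        # canonical Patient resource in the repository.
        "subject": {"reference": f"Patient/{rec['mpi_id']}"},
        "effectiveDateTime": rec["collected_at"],
        "valueQuantity": {
            "value": rec["value"],
            "unit": rec["units"],
            "system": "http://unitsofmeasure.org",
        },
    }
```

The key design point is that the legacy identifier is retained in `identifier` while `subject` carries the MPI-resolved patient reference — so provenance back to the source system survives the transformation.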
All data at rest is encrypted using AES-256. All data in transit uses TLS 1.3. Access is governed by role-based access controls aligned to the minimum necessary standard under HIPAA's Privacy Rule. The platform operates within the network's existing HIPAA-compliant cloud environment under a Business Associate Agreement, and all PHI processing is logged to an immutable audit trail — the audit controls safeguard under 45 CFR §164.312(b) — retained for six years in line with the documentation retention requirement of 45 CFR §164.316(b)(2).
A data quality monitoring layer runs continuously against the unified repository, tracking 23 quality dimensions across completeness, consistency, timeliness, and conformance. Automated alerts surface when any dimension falls outside agreed thresholds — for example, when the percentage of encounters missing a diagnosis code exceeds 2.4%, or when lab result ingestion latency from the legacy system exceeds 6 hours. Quality metrics are surfaced in a live dashboard visible to the data engineering team and reviewed in the network's monthly data governance committee meeting.
The data preparation cycle that had previously taken 11 days of manual analyst work now completes automatically in 4 hours. The two analysts who had been spending 440 combined hours per cycle on reconciliation were redeployed to clinical analytics — the work the data team had been hired to do but had never had capacity to do properly. Manual data reconciliation was eliminated entirely for 94% of use cases; the remaining 6% are genuinely ambiguous identity-resolution cases that are appropriately routed to human review.
The MPI incorrect link rate fell from 6.3% to 0.28% on the 12,000-record sample the internal audit had previously assessed — a 22x reduction in incorrect links. The Chief Data Officer presented the improvement to the board's patient safety committee as a direct reduction in clinical risk exposure from data errors.
All three AI initiatives that had been blocked by the data environment were initiated within 90 days of the platform going live. The readmission risk model reached production in a separate 8-week engagement. The quarterly payer quality reports, previously taking three weeks to produce, are now generated automatically in 6 hours and submitted on time for the first contract year since the value-based care agreements were signed. Both payer partners acknowledged the improvement in their annual contract review meetings.
The complete data platform — the master patient index, all FHIR transformation services, the quality monitoring layer, and the full infrastructure-as-code specification — was transferred to the network's data engineering team at handover with no proprietary tooling and no ongoing reliance on Verttx to operate or extend it.
"We had been trying to fix this data problem for three years. Every analytics project we started ran into the same wall — the data wasn't in a state where you could trust what it was telling you. Verttx rebuilt the foundation in fourteen weeks. The AI work we had been planning is finally happening." — Chief Data Officer, Regional Health Network
Data preparation time per analytics cycle fell from 11 days to 4 hours. Patient record matching accuracy improved from 93.7% to 99.72% — a 22x reduction in incorrect links. Manual reconciliation work was eliminated for 94% of use cases. All three blocked AI initiatives were initiated within 90 days of go-live. Quarterly payer quality reports now generate automatically in 6 hours and are submitted on time for the first contract year since the value-based care agreements were signed.