Data Generator

Healthcare systems use standardized data formats, but each hospital or clinic configures its data differently. This creates challenges when building applications that need to work across multiple healthcare systems.

The data generator creates test data that matches the structure and format expected by Electronic Health Record (EHR) systems. It's designed for testing your applications, not for research studies that need realistic patient populations.

According to the UK ONS synthetic data classification, HealthChain generates "level 1: synthetic structural data" - data that follows the correct format but contains fictional information.

CDS Data Generator

The .generate_prefetch() method returns a Prefetch model whose prefetch field is populated with a dictionary of FHIR resources. Each key in the dictionary corresponds to a FHIR resource type, and the value is a list of FHIR resources of that type. For more information, check out the CDS Hooks documentation.

For each workflow, a pre-configured list of FHIR resources is randomly generated and placed in the prefetch field of a CDSRequest.
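
To make the shape described above concrete, here is a minimal sketch that uses plain dictionaries rather than HealthChain itself. The resource contents and the iter_resources helper are illustrative assumptions, not generator output or library API. The helper accepts either a single resource or a list per key, in case the shape differs between workflows.

```python
# Hand-written stand-in for a prefetch dictionary; all field values are made up.
prefetch = {
    "patient": [
        {"resourceType": "Patient", "id": "example-patient", "gender": "female"}
    ],
    "encounter": [
        {"resourceType": "Encounter", "id": "example-encounter", "status": "finished"}
    ],
}


def iter_resources(prefetch):
    """Yield (key, resource) pairs, accepting a single resource or a list per key."""
    for key, value in prefetch.items():
        resources = value if isinstance(value, list) else [value]
        for resource in resources:
            yield key, resource


# Collect the resource types present, in insertion order.
resource_types = [resource["resourceType"] for _, resource in iter_resources(prefetch)]
print(resource_types)
```

Iterating this way keeps downstream test code independent of whether a key holds one resource or several.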

Currently implemented workflows:

| Workflow | Implementation Completeness | Generated Synthetic Resources |
| --- | --- | --- |
| patient-view | | Patient, Encounter (Future: MedicationStatement, AllergyIntolerance) |
| encounter-discharge | | Patient, Encounter, Procedure, MedicationRequest, Optional DocumentReference |
| order-sign | Partial | Future: MedicationRequest, ProcedureRequest, ServiceRequest |
| order-select | Partial | Future: MedicationRequest, ProcedureRequest, ServiceRequest |

For more information on CDS workflows, see the CDS Hooks Protocol documentation.
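
When writing tests against generated data, it can help to check that a prefetch contains the resource types listed for its workflow. The sketch below is one possible way to do that with plain dictionaries. EXPECTED_RESOURCES and missing_resource_types are hypothetical names introduced here for illustration, not part of HealthChain, and only the required resources of the two fully listed workflows are included.

```python
# Required resource types per workflow, copied from the list of workflows above.
# Hypothetical lookup table for illustration; not a HealthChain object.
EXPECTED_RESOURCES = {
    "patient-view": {"Patient", "Encounter"},
    "encounter-discharge": {"Patient", "Encounter", "Procedure", "MedicationRequest"},
}


def missing_resource_types(workflow, prefetch):
    """Return the expected resource types that do not appear in the prefetch dict."""
    found = set()
    for value in prefetch.values():
        # Accept either a single resource or a list of resources per key.
        resources = value if isinstance(value, list) else [value]
        found.update(resource.get("resourceType") for resource in resources)
    return EXPECTED_RESOURCES[workflow] - found


# Made-up prefetch content containing only a Patient and an Encounter.
sample_prefetch = {
    "patient": [{"resourceType": "Patient", "id": "example-patient"}],
    "encounter": [{"resourceType": "Encounter", "id": "example-encounter"}],
}

print(sorted(missing_resource_types("encounter-discharge", sample_prefetch)))
```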

You can use the data generator within a client function or on its own.

Within a client function:

import healthchain as hc
from healthchain.sandbox.use_cases import ClinicalDecisionSupport
from healthchain.models import Prefetch
from healthchain.data_generators import CdsDataGenerator

@hc.sandbox
class MyCoolSandbox(ClinicalDecisionSupport):
    def __init__(self) -> None:
        self.data_generator = CdsDataGenerator()

    @hc.ehr(workflow="patient-view")
    def load_data_in_client(self) -> Prefetch:
        prefetch = self.data_generator.generate_prefetch()
        return prefetch

    @hc.api
    def my_server(self, request) -> None:
        pass

On its own:

from healthchain.data_generators import CdsDataGenerator
from healthchain.sandbox.workflows import Workflow

# Initialize data generator
data_generator = CdsDataGenerator()

# Generate FHIR resources for use case workflow
data_generator.set_workflow(Workflow.encounter_discharge)
prefetch = data_generator.generate_prefetch()

print(prefetch.model_dump())

# {
#     "prefetch": {
#         "encounter": {
#             "resourceType": ...
#         }
#     }
# }

Other synthetic data sources

If you are looking for realistic datasets, you can also load your own data in a sandbox run. Check out MIMIC for comprehensive continuity of care records and free-text data, or Synthea for synthetically generated FHIR resources. Both are open-source, although you will need to complete PhysioNet Credentialing to access MIMIC.
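
As a rough illustration of preparing your own FHIR data, the sketch below groups the entries of a FHIR Bundle by resource type using only the standard library. The bundle content is a made-up stand-in for an exported file, and group_by_resource_type is a hypothetical helper; how the grouped resources are then passed into a sandbox is not covered here.

```python
import json
from collections import defaultdict

# Made-up miniature FHIR Bundle standing in for a file exported by another tool.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "example-patient"}},
    {"resource": {"resourceType": "Encounter", "id": "example-encounter-1"}},
    {"resource": {"resourceType": "Encounter", "id": "example-encounter-2"}}
  ]
}
"""


def group_by_resource_type(bundle):
    """Group the resources in a FHIR Bundle into lists keyed by resourceType."""
    grouped = defaultdict(list)
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        grouped[resource.get("resourceType", "Unknown")].append(resource)
    return dict(grouped)


grouped = group_by_resource_type(json.loads(bundle_json))
for resource_type, resources in grouped.items():
    print(resource_type, len(resources))
```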

Loading free-text

You can pass the free_text_csv parameter to the .generate_prefetch() method to load free-text sources, e.g. discharge summaries, into the data generator. This wraps the text in a FHIR DocumentReference resource (N.B. currently we place the text directly in the resource attachment, although it is technically supposed to be base64 encoded).

A random text document from the CSV file is picked for each generation.

# Load free text into a DocumentReference FHIR resource
data = data_generator.generate_prefetch(free_text_csv="./dir/to/csv/file")
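
For reference, the note above about base64 encoding can be sketched with the standard library alone. This is not HealthChain's implementation: the "text" column name, the in-memory CSV content, and the make_document_reference helper are assumptions made for illustration. It picks a random row and stores the text base64 encoded in the attachment's data field.

```python
import base64
import csv
import io
import random

# In-memory stand-in for a CSV file; the "text" column name is an assumption.
csv_file = io.StringIO(
    "text\n"
    "Patient discharged in stable condition.\n"
    "Follow up in two weeks.\n"
)
rows = list(csv.DictReader(csv_file))


def make_document_reference(text):
    """Wrap free text in a minimal DocumentReference with a base64 attachment."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "content": [
            {"attachment": {"contentType": "text/plain", "data": encoded}}
        ],
    }


# Pick one document at random, mirroring the behaviour described above.
chosen = random.choice(rows)["text"]
document = make_document_reference(chosen)

# Decode again to confirm the round trip preserves the original text.
decoded = base64.b64decode(document["content"][0]["attachment"]["data"]).decode("utf-8")
print(decoded)
```

A consumer of such a resource would need to decode the attachment before processing the text, which is the main practical difference from placing the text in the attachment directly.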