Data Generator
Healthcare systems use standardized data formats, but each hospital or clinic configures its data differently. This creates challenges when building applications that need to work across multiple healthcare systems.
The data generator creates test data that matches the structure and format expected by Electronic Health Record (EHR) systems. It's designed for testing your applications, not for research studies that need realistic patient populations.
According to the UK ONS synthetic data classification, HealthChain generates "level 1: synthetic structural data" - data that follows the correct format but contains fictional information.
CDS Data Generator
The .generate_prefetch() method returns a Prefetch model with the prefetch field populated with a dictionary of FHIR resources. Each key in the dictionary corresponds to a FHIR resource type, and the value is a list of FHIR resources of that type. For more information, check out the CDS Hooks documentation.
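To make that shape concrete, here is a minimal sketch that uses plain Python dictionaries in place of HealthChain's own Prefetch model. The resource contents and the helper resource_types are illustrative assumptions, not output from the generator.

```python
# Illustrative only: plain dicts standing in for the Prefetch model.
prefetch = {
    "prefetch": {
        "patient": [
            {"resourceType": "Patient", "id": "example-patient"},
        ],
        "encounter": [
            {"resourceType": "Encounter", "id": "example-encounter", "status": "finished"},
        ],
    }
}


def resource_types(prefetch_payload: dict) -> dict:
    """Map each prefetch key to the resourceType values found under it."""
    return {
        key: [resource["resourceType"] for resource in resources]
        for key, resources in prefetch_payload["prefetch"].items()
    }


print(resource_types(prefetch))
# {'patient': ['Patient'], 'encounter': ['Encounter']}
```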
For each workflow, a pre-configured list of FHIR resources is randomly generated and placed in the prefetch field of a CDSRequest.
Currently implemented workflows:
Workflow | Implementation Completeness | Generated Synthetic Resources |
---|---|---|
patient-view | | Patient, Encounter (Future: MedicationStatement, AllergyIntolerance) |
encounter-discharge | | Patient, Encounter, Procedure, MedicationRequest, optional DocumentReference |
order-sign | Partial | Future: MedicationRequest, ProcedureRequest, ServiceRequest |
order-select | Partial | Future: MedicationRequest, ProcedureRequest, ServiceRequest |
For more information on CDS workflows, see the CDS Hooks Protocol documentation.
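The table above can be read as a lookup from workflow name to resource types. The sketch below is a hypothetical illustration of that idea in plain Python. WORKFLOW_RESOURCES and generate_stub_prefetch are invented names and do not reflect how CdsDataGenerator is implemented. It covers only the resources the table lists as currently generated, and the stub resources carry nothing beyond a type and a random id.

```python
import uuid

# Hypothetical lookup built from the table above (not HealthChain internals).
WORKFLOW_RESOURCES = {
    "patient-view": ["Patient", "Encounter"],
    "encounter-discharge": ["Patient", "Encounter", "Procedure", "MedicationRequest"],
}


def generate_stub_prefetch(workflow: str) -> dict:
    """Return a prefetch-like dict with one stub resource per configured type."""
    try:
        resource_types = WORKFLOW_RESOURCES[workflow]
    except KeyError:
        raise ValueError(f"No resources configured for workflow {workflow!r}") from None
    return {
        "prefetch": {
            resource_type.lower(): [
                {"resourceType": resource_type, "id": str(uuid.uuid4())}
            ]
            for resource_type in resource_types
        }
    }


stub = generate_stub_prefetch("encounter-discharge")
print(sorted(stub["prefetch"]))
```

Raising an error for unlisted workflows mirrors the table: order-sign and order-select have no generated resources yet, so this sketch refuses them rather than returning an empty payload.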
You can use the data generator within a client function or on its own.
# Usage within a sandbox client function
import healthchain as hc
from healthchain.sandbox.use_cases import ClinicalDecisionSupport
from healthchain.models import Prefetch
from healthchain.data_generators import CdsDataGenerator


@hc.sandbox
class MyCoolSandbox(ClinicalDecisionSupport):
    def __init__(self) -> None:
        self.data_generator = CdsDataGenerator()

    @hc.ehr(workflow="patient-view")
    def load_data_in_client(self) -> Prefetch:
        # The workflow is taken from the decorator above
        prefetch = self.data_generator.generate_prefetch()
        return prefetch

    @hc.api
    def my_server(self, request) -> None:
        pass

# Standalone usage, outside a sandbox
from healthchain.data_generators import CdsDataGenerator
from healthchain.sandbox.workflows import Workflow

# Initialize data generator
data_generator = CdsDataGenerator()

# Generate FHIR resources for use case workflow
data_generator.set_workflow(Workflow.encounter_discharge)
prefetch = data_generator.generate_prefetch()

print(prefetch.model_dump())
# {
#     "prefetch": {
#         "encounter": {
#             "resourceType": ...
#         }
#     }
# }
Other synthetic data sources
If you are looking for realistic datasets, you are also free to load your own data in a sandbox run! Check out MIMIC for comprehensive continuity of care records and free-text data, or Synthea for synthetically generated FHIR resources. Both are open-source, although you will need to complete PhysioNet Credentialing to access MIMIC.
Loading free-text
You can set the free_text_csv argument of the .generate_prefetch() method to load free-text sources, such as discharge summaries, into the data generator. This wraps the text in a FHIR DocumentReference resource (N.B. the text is currently placed directly in the resource attachment, although the FHIR specification expects attachment data to be base64 encoded).
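For reference, the FHIR specification defines the data element of an Attachment as base64-encoded bytes. The following is a minimal sketch of what a wrapper that follows the specification could look like, using plain dictionaries. wrap_free_text is a hypothetical helper and not part of HealthChain, which currently stores the text unencoded as noted above.

```python
import base64


def wrap_free_text(text: str) -> dict:
    """Wrap free text in a DocumentReference-shaped dict with base64 data."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "content": [
            {"attachment": {"contentType": "text/plain", "data": encoded}}
        ],
    }


doc = wrap_free_text("Patient discharged in stable condition.")
attachment = doc["content"][0]["attachment"]
# Decoding the attachment recovers the original text.
print(base64.b64decode(attachment["data"]).decode("utf-8"))
```

A consumer reading such a resource would need to decode the data element before passing the text to downstream processing.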
A random text document from the CSV file will be picked for each generation.
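The random selection can be sketched with the standard library alone. This is an illustration under stated assumptions, not HealthChain's loader: the column name text and the helper pick_random_document are hypothetical, and the CSV layout that free_text_csv actually expects may differ.

```python
import csv
import io
import random


def pick_random_document(csv_file, column="text", rng=None):
    """Return the value of `column` from one randomly chosen CSV row."""
    rng = rng or random.Random()
    # Keep only rows that have a non-empty value in the requested column.
    documents = [row[column] for row in csv.DictReader(csv_file) if row.get(column)]
    if not documents:
        raise ValueError(f"No non-empty values found in column {column!r}")
    return rng.choice(documents)


# An in-memory CSV stands in for a file on disk.
sample = io.StringIO("text\nDischarge summary A\nDischarge summary B\n")
print(pick_random_document(sample, rng=random.Random(0)))
```

Passing a seeded random.Random instance makes the choice repeatable, which can help when a test needs the same document on every run.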