Skip to content

FHIR Feature Mapper

The FHIRFeatureMapper allows you to easily extract relevant features from FHIR Bundles based on a declarative schema. This makes it simple to generate ML-ready DataFrames for downstream analysis and modeling.

Overview

The mapper uses feature schemasβ€”YAML configs that define which clinical data to extract and how to transform it. This enables:

  • Declarative mapping: Define features in YAML, not code
  • Reproducible pipelines: Same schema = same features across train/test/prod
  • Built-in validation: Type checking catches mismatches before inference
  • FHIR-native: Works with any EHR's FHIR Bundle

Usage

Write a YAML file specifying which FHIR resources and codes to extract, desired data types, and any transformations:

name: sepsis_prediction_features
version: "1.0"
description: Feature schema for sepsis risk model

# Optional: Control how patient age is calculated
metadata:
  age_calculation: event_date        # Calculate age at event time
  event_date_source: Observation     # Use earliest observation date
  event_date_strategy: earliest

features:
  # Vital signs from Observations
  heart_rate:
    fhir_resource: Observation
    code: "220045"                   # MIMIC-IV itemID
    code_system: http://mimic.mit.edu/fhir/mimic/CodeSystem/mimic-chartevents-d-items
    display: Heart Rate
    unit: bpm
    dtype: float64
    required: true

  # Demographics from Patient resource
  age:
    fhir_resource: Patient
    field: birthDate             # Extract this field
    transform: calculate_age     # Apply this transformation
    dtype: int64
    required: true

Standalone Use

In most cases you should use the Dataset API to automatically load your schema and extract features. It's the easiest and most robust workflow.

For advanced usage, you can load a schema and use the FHIRFeatureMapper directly for more control:

from healthchain.io.mappers import FHIRFeatureMapper
from healthchain.io.containers import FeatureSchema

# Manually load your YAML feature schema
schema = FeatureSchema.from_yaml("configs/features/my_model.yaml")

# Create the feature mapper with your schema
mapper = FHIRFeatureMapper(schema)

# Extract features from a FHIR Bundle
features_df = mapper.map(bundle, aggregation="mean")

print(features_df.head())
# (Optional) Access patient references
patient_refs = features_df["patient_ref"].tolist()

Aggregation Strategies

When a patient has multiple observations for the same code (e.g., multiple temperature readings), specify how to aggregate them:

# Take the mean of all values
dataset = Dataset.from_fhir_bundle(bundle, schema, aggregation="mean")

# Use the most recent value
dataset = Dataset.from_fhir_bundle(bundle, schema, aggregation="last")

# Other options: "median", "max", "min"

Multiple Code Systems

Different EHRs use different code systems. You can map the same clinical concept across systems:

# LOINC code for heart rate (standard)
heart_rate_loinc:
  fhir_resource: Observation
  code: "8867-4"
  code_system: http://loinc.org
  display: Heart Rate
  dtype: float64

# MIMIC-IV internal code
heart_rate_mimic:
  fhir_resource: Observation
  code: "220045"
  code_system: http://mimic.mit.edu/fhir/mimic/CodeSystem/mimic-chartevents-d-items
  display: Heart Rate
  dtype: float64

Then in your code, merge as needed:

df = dataset.data
# Combine both columns, preferring LOINC
df['heart_rate'] = df['heart_rate_loinc'].fillna(df['heart_rate_mimic'])

Validation and Error Handling

Check that incoming data matches your training schema:

from healthchain.io.containers import FeatureSchema

schema = FeatureSchema.from_yaml("configs/features/my_model.yaml")
result = dataset.validate(schema, raise_on_error=False)