Data Container
The healthchain.io.containers
module provides classes for storing and manipulating data throughout the pipeline. The main classes are DataContainer
, Document
, and Tabular
.
DataContainer 📦
DataContainer
is a generic base class for storing data of any type.
from healthchain.io.containers import DataContainer
# Create a DataContainer with string data
container = DataContainer("Some data")
# Convert to dictionary and JSON
data_dict = container.to_dict()
data_json = container.to_json()
# Create from dictionary or JSON
container_from_dict = DataContainer.from_dict(data_dict)
container_from_json = DataContainer.from_json(data_json)
Document 📄
The Document
class provides comprehensive functionality for working with text, NLP annotations, FHIR resources, and more.
Attribute | Access | Primary Purpose | Key Features | Common Use Cases |
---|---|---|---|---|
FHIR Data | doc.fhir |
Manage clinical data in FHIR format | • Resource bundles • Clinical lists (problems, meds, allergies) • Document references • CDS prefetch |
• Store patient records • Track medical history • Manage clinical documents |
NLP | doc.nlp |
Process and analyze text | • Tokenization • Entity recognition • Embeddings • spaCy integration |
• Extract medical terms • Analyze clinical text • Generate features |
CDS | doc.cds |
Clinical decision support | • Recommendation cards • Suggested actions • Clinical alerts |
• Generate alerts • Suggest interventions • Guide clinical decisions |
Model Outputs | doc.models |
Store ML model results | • Multi-framework support • Task-specific outputs • Text generation |
• Store classifications • Keep predictions • Track generations |
FHIR Data (doc.fhir
)
The FHIR component serves as a comprehensive manager for FHIR resources, providing:
Storage and Management:
- Automatic
Bundle
creation and management - Resource type validation
- Convenient access to common clinical data lists
Clinical Data Lists:
problem_list
: List ofCondition
resources (diagnoses, problems)medication_list
: List ofMedicationStatement
resourcesallergy_list
: List ofAllergyIntolerance
resources
Document Reference Management:
- Document relationship tracking (parent/child/sibling)
- Attachment handling with
base64
encoding - Document family retrieval
CDS Support:
- Support for CDS Hooks prefetch resources
- Resource indexing by type
Example usage:
from healthchain.io import Document
from healthchain.fhir import (
create_condition,
create_medication_statement,
create_document_reference,
)
# Basic usage with clinical lists
doc = Document("Patient presents with hypertension")
# Add problems to the problem list
doc.fhir.problem_list = [
create_condition(subject="Patient/123", code="38341003", display="Hypertension")
]
# Add medications
doc.fhir.medication_list = [
create_medication_statement(
subject="Patient/123", code="1049221", display="Acetaminophen"
)
]
# Working with document references
parent_doc = create_document_reference(
data="Original clinical note", content_type="text/plain"
)
parent_id = doc.fhir.add_document_reference(parent_doc)
# Add a child document with relationship
child_doc = create_document_reference(
data="Updated clinical note", content_type="text/plain"
)
child_id = doc.fhir.add_document_reference(
child_doc, parent_id=parent_id, relationship_type="replaces"
)
# Get document family (parents/children/siblings)
family = doc.fhir.get_document_reference_family(child_id)
print(f"Parent doc: {family['parents'][0].description}")
# Using CDS prefetch resources
prefetch = {
"Condition": doc.fhir.problem_list,
"MedicationStatement": doc.fhir.medication_list,
}
doc.fhir.prefetch_resources = prefetch
conditions = doc.fhir.get_prefetch_resources("Condition")
Technical Notes:
- All FHIR resources are validated using fhir.resources
- Document relationships follow the FHIR DocumentReference.relatesTo standard
Resource Documentation:
NLP Component (doc.nlp
)
get_tokens()
: Returns list of tokens from the textget_entities()
: Returns named entities with optional CUI (SNOMED CT concept IDs)get_embeddings()
: Returns vector representations of textget_spacy_doc()
: Returns the underlying spaCy document for advanced NLP featuresword_count()
: Returns total word count based on tokenization
Clinical Decision Support (doc.cds
)
cards
: List of Card objects with clinical recommendationsactions
: List of Action objects for suggested interventions
Model Outputs (doc.models
)
get_output(model_name, task)
: Get specific model outputget_generated_text(model_name, task)
: Get generated text- Supports outputs from various models (e.g., Hugging Face, LangChain)
Example using all components:
from healthchain.io import Document
from healthchain.fhir import create_condition
from healthchain.models import Card, Action
# Create document with text
doc = Document("Patient reports chest pain and shortness of breath.")
# Access and update NLP features
print(f"Word count: {doc.word_count()}")
tokens = doc.nlp.get_tokens()
spacy_doc = doc.nlp.get_spacy_doc()
# Add FHIR resources
doc.fhir.problem_list.append(
create_condition(subject="Patient/123", code="29857009", display="Chest pain")
)
# Update problem list from NLP entities (requires spaCy doc with CUI extension)
doc.update_problem_list_from_nlp()
# Add CDS results
doc.cds.cards = [
Card(
summary="Consider cardiac evaluation",
indicator="warning",
source={"label": "AHA Guidelines"},
)
]
doc.cds.actions = [Action(...)]
# Get model outputs
classification = doc.models.get_output("huggingface", "classification")
Tabular 📊
The Tabular
class is used for storing and manipulating tabular data, wrapping a pandas DataFrame.
import pandas as pd
from healthchain.io.containers import Tabular
# Create a Tabular object from a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
tabular = Tabular(df)
# Access basic information
print(f"Columns: {tabular.columns}")
print(f"Row count: {tabular.row_count()}")
print(f"Column count: {tabular.column_count()}")
print(f"Data types: {tabular.dtypes}")
# Describe the tabular data
print(tabular.describe())
# Remove a column
tabular.remove_column('A')
# Save to CSV
tabular.to_csv('output.csv')
# Create from CSV
tabular_from_csv = Tabular.from_csv('input.csv')
These classes provide a consistent interface for working with different types of data in the healthchain pipeline.