Component

BaseComponent

Bases: Generic[T], ABC

Abstract base class for all components in the pipeline.

This class should be subclassed to create specific components. Subclasses must implement the __call__ method.

Source code in healthchain/pipeline/components/base.py
class BaseComponent(Generic[T], ABC):
    """
    Abstract base class for all components in the pipeline.

    This class should be subclassed to create specific components.
    Subclasses must implement the __call__ method.

    Attributes:
        None
    """

    @abstractmethod
    def __call__(self, data: DataContainer[T]) -> DataContainer[T]:
        """
        Process the input data and return the processed data.

        Args:
            data (DataContainer[T]): The input data to be processed.

        Returns:
            DataContainer[T]: The processed data.
        """
        pass

__call__(data) abstractmethod

Process the input data and return the processed data.

PARAMETER DESCRIPTION
data

The input data to be processed.

TYPE: DataContainer[T]

RETURNS DESCRIPTION
DataContainer[T]

The processed data.

Source code in healthchain/pipeline/components/base.py
@abstractmethod
def __call__(self, data: DataContainer[T]) -> DataContainer[T]:
    """
    Process the input data and return the processed data.

    Args:
        data (DataContainer[T]): The input data to be processed.

    Returns:
        DataContainer[T]: The processed data.
    """
    pass
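
As a brief illustration, here is a minimal sketch of a custom component built on BaseComponent. The UppercaseComponent name is hypothetical, and it assumes the container exposes its payload via .data, as Document does in the examples further down this page:

class UppercaseComponent(BaseComponent[str]):
    """Toy component that upper-cases the wrapped string payload."""

    def __call__(self, data: DataContainer[str]) -> DataContainer[str]:
        # DataContainer is assumed to expose the payload via .data
        data.data = data.data.upper()
        return data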

Component

Bases: BaseComponent[T]

A concrete implementation of the BaseComponent class.

This class can be used as a base for creating specific components that do not require any additional processing logic.

METHOD DESCRIPTION
__call__

Process the input data and return the processed data. In this implementation, the input data is returned unmodified.

Source code in healthchain/pipeline/components/base.py
class Component(BaseComponent[T]):
    """
    A concrete implementation of the BaseComponent class.

    This class can be used as a base for creating specific components
    that do not require any additional processing logic.

    Methods:
        __call__(data: DataContainer[T]) -> DataContainer[T]:
            Process the input data and return the processed data.
            In this implementation, the input data is returned unmodified.
    """

    def __call__(self, data: DataContainer[T]) -> DataContainer[T]:
        return data

HFTransformer

Bases: BaseComponent[str]

A component that integrates Hugging Face transformers models into the pipeline.

This component allows using any Hugging Face model and task within the pipeline by wrapping the transformers.pipeline API. The model outputs are stored in the document's model_outputs container under the "huggingface" source key.

Note that this component is only recommended for non-conversational language tasks. For chat-based tasks, consider using LangChainLLM instead.

PARAMETER DESCRIPTION
pipeline

A pre-configured HuggingFace pipeline object to use for inference. Must be an instance of transformers.pipelines.base.Pipeline.

TYPE: Any

ATTRIBUTE DESCRIPTION
task

The task name of the underlying pipeline, e.g. "sentiment-analysis", "ner". Automatically extracted from the pipeline object.

TYPE: str

RAISES DESCRIPTION
ImportError

If the transformers package is not installed

TypeError

If pipeline is not a valid HuggingFace Pipeline instance

Example

Initialize for sentiment analysis

from transformers import pipeline
nlp = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
component = HFTransformer(pipeline=nlp)
doc = component(doc)  # Analyzes sentiment of doc.data

Or use the factory method

component = HFTransformer.from_model_id(
    model="facebook/bart-large-cnn",
    task="summarization",
    max_length=130,
    min_length=30,
    do_sample=False
)
doc = component(doc)  # Generates summary of doc.data

Source code in healthchain/pipeline/components/integrations.py
class HFTransformer(BaseComponent[str]):
    """
    A component that integrates Hugging Face transformers models into the pipeline.

    This component allows using any Hugging Face model and task within the pipeline
    by wrapping the transformers.pipeline API. The model outputs are stored in the
    document's model_outputs container under the "huggingface" source key.

    Note that this component is only recommended for non-conversational language tasks.
    For chat-based tasks, consider using LangChainLLM instead.

    Args:
        pipeline (Any): A pre-configured HuggingFace pipeline object to use for inference.
            Must be an instance of transformers.pipelines.base.Pipeline.

    Attributes:
        task (str): The task name of the underlying pipeline, e.g. "sentiment-analysis", "ner".
            Automatically extracted from the pipeline object.

    Raises:
        ImportError: If the transformers package is not installed
        TypeError: If pipeline is not a valid HuggingFace Pipeline instance

    Example:
        >>> # Initialize for sentiment analysis
        >>> from transformers import pipeline
        >>> nlp = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
        >>> component = HFTransformer(pipeline=nlp)
        >>> doc = component(doc)  # Analyzes sentiment of doc.data
        >>>
        >>> # Or use the factory method
        >>> component = HFTransformer.from_model_id(
        ...     model="facebook/bart-large-cnn",
        ...     task="summarization",
        ...     max_length=130,
        ...     min_length=30,
        ...     do_sample=False
        ... )
        >>> doc = component(doc)  # Generates summary of doc.data
    """

    @requires_package("transformers", "transformers.pipelines")
    def __init__(self, pipeline: Any):
        """Initialize with a pre-configured HuggingFace pipeline.

        Args:
            pipeline: A pre-configured HuggingFace pipeline object from transformers.pipeline().
                     Must be an instance of transformers.pipelines.base.Pipeline.

        Raises:
            ImportError: If transformers package is not installed
            TypeError: If pipeline is not a valid HuggingFace Pipeline instance
        """
        from transformers.pipelines.base import Pipeline

        if not isinstance(pipeline, Pipeline):
            raise TypeError(
                f"Expected HuggingFace Pipeline object, got {type(pipeline)}"
            )
        self._pipe = pipeline
        self.task = pipeline.task

    @classmethod
    @requires_package("transformers", "transformers.pipelines")
    def from_model_id(cls, model: str, task: str, **kwargs: Any) -> "HFTransformer":
        """Create a transformer component from a model identifier.

        Factory method that initializes a HuggingFace pipeline with the specified model and task,
        then wraps it in a HFTransformer component.

        Args:
            model: The model identifier or path to load. Can be:
                - A model ID from the HuggingFace Hub (e.g. "bert-base-uncased")
                - A local path to a saved model
            task: The task to run (e.g. "text-classification", "token-classification", "summarization")
            **kwargs: Additional configuration options passed to transformers.pipeline()
                Common options include:
                - device: Device to run on ("cpu", "cuda", etc.)
                - batch_size: Batch size for inference
                - model_kwargs: Dict of model-specific args

        Returns:
            HFTransformer: Initialized transformer component wrapping the pipeline

        Raises:
            TypeError: If invalid kwargs are passed to pipeline initialization
            ValueError: If pipeline initialization fails for any other reason
            ImportError: If transformers package is not installed
        """
        from transformers import pipeline

        try:
            pipe = pipeline(task=task, model=model, **kwargs)
        except TypeError as e:
            raise TypeError(f"Invalid kwargs for transformers.pipeline: {str(e)}")
        except Exception as e:
            raise ValueError(f"Error initializing transformer pipeline: {str(e)}")

        return cls(pipeline=pipe)

    def __call__(self, doc: Document) -> Document:
        """Process the document using the Hugging Face pipeline. Adds outputs to .model_outputs['huggingface']."""
        output = self._pipe(doc.data)
        doc.models.add_output("huggingface", self.task, output)

        return doc

__call__(doc)

Process the document using the Hugging Face pipeline. Adds outputs to .model_outputs['huggingface'].

Source code in healthchain/pipeline/components/integrations.py
def __call__(self, doc: Document) -> Document:
    """Process the document using the Hugging Face pipeline. Adds outputs to .model_outputs['huggingface']."""
    output = self._pipe(doc.data)
    doc.models.add_output("huggingface", self.task, output)

    return doc

__init__(pipeline)

Initialize with a pre-configured HuggingFace pipeline.

PARAMETER DESCRIPTION
pipeline

A pre-configured HuggingFace pipeline object from transformers.pipeline(). Must be an instance of transformers.pipelines.base.Pipeline.

TYPE: Any

RAISES DESCRIPTION
ImportError

If transformers package is not installed

TypeError

If pipeline is not a valid HuggingFace Pipeline instance

Source code in healthchain/pipeline/components/integrations.py
@requires_package("transformers", "transformers.pipelines")
def __init__(self, pipeline: Any):
    """Initialize with a pre-configured HuggingFace pipeline.

    Args:
        pipeline: A pre-configured HuggingFace pipeline object from transformers.pipeline().
                 Must be an instance of transformers.pipelines.base.Pipeline.

    Raises:
        ImportError: If transformers package is not installed
        TypeError: If pipeline is not a valid HuggingFace Pipeline instance
    """
    from transformers.pipelines.base import Pipeline

    if not isinstance(pipeline, Pipeline):
        raise TypeError(
            f"Expected HuggingFace Pipeline object, got {type(pipeline)}"
        )
    self._pipe = pipeline
    self.task = pipeline.task

from_model_id(model, task, **kwargs) classmethod

Create a transformer component from a model identifier.

Factory method that initializes a HuggingFace pipeline with the specified model and task, then wraps it in a HFTransformer component.

PARAMETER DESCRIPTION
model

The model identifier or path to load. Can be a model ID from the HuggingFace Hub (e.g. "bert-base-uncased") or a local path to a saved model.

TYPE: str

task

The task to run (e.g. "text-classification", "token-classification", "summarization")

TYPE: str

**kwargs

Additional configuration options passed to transformers.pipeline(). Common options include device (e.g. "cpu", "cuda"), batch_size (batch size for inference), and model_kwargs (a dict of model-specific arguments).

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
HFTransformer

Initialized transformer component wrapping the pipeline

TYPE: HFTransformer

RAISES DESCRIPTION
TypeError

If invalid kwargs are passed to pipeline initialization

ValueError

If pipeline initialization fails for any other reason

ImportError

If transformers package is not installed

Source code in healthchain/pipeline/components/integrations.py
@classmethod
@requires_package("transformers", "transformers.pipelines")
def from_model_id(cls, model: str, task: str, **kwargs: Any) -> "HFTransformer":
    """Create a transformer component from a model identifier.

    Factory method that initializes a HuggingFace pipeline with the specified model and task,
    then wraps it in a HFTransformer component.

    Args:
        model: The model identifier or path to load. Can be:
            - A model ID from the HuggingFace Hub (e.g. "bert-base-uncased")
            - A local path to a saved model
        task: The task to run (e.g. "text-classification", "token-classification", "summarization")
        **kwargs: Additional configuration options passed to transformers.pipeline()
            Common options include:
            - device: Device to run on ("cpu", "cuda", etc.)
            - batch_size: Batch size for inference
            - model_kwargs: Dict of model-specific args

    Returns:
        HFTransformer: Initialized transformer component wrapping the pipeline

    Raises:
        TypeError: If invalid kwargs are passed to pipeline initialization
        ValueError: If pipeline initialization fails for any other reason
        ImportError: If transformers package is not installed
    """
    from transformers import pipeline

    try:
        pipe = pipeline(task=task, model=model, **kwargs)
    except TypeError as e:
        raise TypeError(f"Invalid kwargs for transformers.pipeline: {str(e)}")
    except Exception as e:
        raise ValueError(f"Error initializing transformer pipeline: {str(e)}")

    return cls(pipeline=pipe)

LangChainLLM

Bases: BaseComponent[str]

A component that integrates LangChain chains into the pipeline.

This component allows using any LangChain chain within the pipeline by wrapping the chain's invoke method. The chain outputs are stored in the document's model_outputs container under the "langchain" source key.

PARAMETER DESCRIPTION
chain

The LangChain chain to run on the document text. Must be a Runnable object from the LangChain library.

TYPE: Runnable

task

The task name to use when storing outputs, e.g. "summarization", "chat". Used as key to organize model outputs in the document's model container.

TYPE: str

**kwargs

Additional parameters to pass to the chain's invoke method. These are forwarded directly to the chain's invoke() call.

TYPE: Any DEFAULT: {}

RAISES DESCRIPTION
TypeError

If chain is not a LangChain Runnable object or if invalid kwargs are passed

ValueError

If there is an error during chain invocation

ImportError

If langchain-core package is not installed

Example

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = ChatPromptTemplate.from_template("What is {input}?") | ChatOpenAI()
component = LangChainLLM(chain=chain, task="chat")
doc = component(doc)  # Runs the chain on doc.data and stores output

Source code in healthchain/pipeline/components/integrations.py
class LangChainLLM(BaseComponent[str]):
    """
    A component that integrates LangChain chains into the pipeline.

    This component allows using any LangChain chain within the pipeline by wrapping
    the chain's invoke method. The chain outputs are stored in the document's
    model_outputs container under the "langchain" source key.

    Args:
        chain (Runnable): The LangChain chain to run on the document text.
            Must be a Runnable object from the LangChain library.
        task (str): The task name to use when storing outputs, e.g. "summarization", "chat".
            Used as key to organize model outputs in the document's model container.
        **kwargs: Additional parameters to pass to the chain's invoke method.
            These are forwarded directly to the chain's invoke() call.

    Raises:
        TypeError: If chain is not a LangChain Runnable object or if invalid kwargs are passed
        ValueError: If there is an error during chain invocation
        ImportError: If langchain-core package is not installed

    Example:
        >>> from langchain_core.prompts import ChatPromptTemplate
        >>> from langchain_openai import ChatOpenAI

        >>> chain = ChatPromptTemplate.from_template("What is {input}?") | ChatOpenAI()
        >>> component = LangChainLLM(chain=chain, task="chat")
        >>> doc = component(doc)  # Runs the chain on doc.data and stores output
    """

    @requires_package("langchain-core", "langchain_core.runnables")
    def __init__(self, chain: Any, task: str, **kwargs: Any):
        """Initialize with a LangChain chain."""
        from langchain_core.runnables import Runnable

        if not isinstance(chain, Runnable):
            raise TypeError(f"Expected LangChain Runnable object, got {type(chain)}")

        self.chain = chain
        self.task = task
        self.kwargs = kwargs

    def __call__(self, doc: Document) -> Document:
        """Process the document using the LangChain chain. Adds outputs to .model_outputs['langchain']."""
        try:
            output = self.chain.invoke(doc.data, **self.kwargs)
        except TypeError as e:
            raise TypeError(f"Invalid kwargs for chain.invoke: {str(e)}")
        except Exception as e:
            raise ValueError(f"Error during chain invocation: {str(e)}")

        doc.models.add_output("langchain", self.task, output)

        return doc

__call__(doc)

Process the document using the LangChain chain. Adds outputs to .model_outputs['langchain'].

Source code in healthchain/pipeline/components/integrations.py
def __call__(self, doc: Document) -> Document:
    """Process the document using the LangChain chain. Adds outputs to .model_outputs['langchain']."""
    try:
        output = self.chain.invoke(doc.data, **self.kwargs)
    except TypeError as e:
        raise TypeError(f"Invalid kwargs for chain.invoke: {str(e)}")
    except Exception as e:
        raise ValueError(f"Error during chain invocation: {str(e)}")

    doc.models.add_output("langchain", self.task, output)

    return doc

__init__(chain, task, **kwargs)

Initialize with a LangChain chain.

Source code in healthchain/pipeline/components/integrations.py
@requires_package("langchain-core", "langchain_core.runnables")
def __init__(self, chain: Any, task: str, **kwargs: Any):
    """Initialize with a LangChain chain."""
    from langchain_core.runnables import Runnable

    if not isinstance(chain, Runnable):
        raise TypeError(f"Expected LangChain Runnable object, got {type(chain)}")

    self.chain = chain
    self.task = task
    self.kwargs = kwargs

SpacyNLP

Bases: BaseComponent[str]

A component that integrates spaCy models into the pipeline.

This component allows using any spaCy model within the pipeline by loading and applying it to process text documents. The spaCy doc outputs are stored in the document's nlp annotations container under .spacy_docs.

PARAMETER DESCRIPTION
nlp

A pre-configured spaCy Language object.

TYPE: Language

Example

Using pre-configured pipeline

import spacy
nlp = spacy.load("en_core_web_sm", disable=["parser"])
component = SpacyNLP(nlp)
doc = component(doc)

Or using model name

component = SpacyNLP.from_model_id("en_core_web_sm", disable=["parser"])
doc = component(doc)

Source code in healthchain/pipeline/components/integrations.py
class SpacyNLP(BaseComponent[str]):
    """
    A component that integrates spaCy models into the pipeline.

    This component allows using any spaCy model within the pipeline by loading
    and applying it to process text documents. The spaCy doc outputs are stored
    in the document's nlp annotations container under .spacy_docs.

    Args:
        nlp: A pre-configured spaCy Language object.

    Example:
        >>> # Using pre-configured pipeline
        >>> import spacy
        >>> nlp = spacy.load("en_core_web_sm", disable=["parser"])
        >>> component = SpacyNLP(nlp)
        >>> doc = component(doc)
        >>>
        >>> # Or using model name
        >>> component = SpacyNLP.from_model_id("en_core_web_sm", disable=["parser"])
        >>> doc = component(doc)
    """

    def __init__(self, nlp: "Language"):
        """Initialize with a pre-configured spaCy Language object."""
        self._nlp = nlp

    @classmethod
    def from_model_id(cls, model: str, **kwargs: Any) -> "SpacyNLP":
        """
        Create a SpacyNLP component from a model identifier.

        Args:
            model (str): The name or path of the spaCy model to load.
                Can be a model name like 'en_core_web_sm' or path to saved model.
            **kwargs: Additional configuration options passed to spacy.load.
                Common options include disable, exclude, enable.

        Returns:
            SpacyNLP: Initialized spaCy component

        Raises:
            ImportError: If spaCy or the specified model is not installed
            TypeError: If invalid kwargs are passed to spacy.load
        """
        try:
            import spacy
        except ImportError:
            raise ImportError(
                "Could not import spacy. Please install it with: " "`pip install spacy`"
            )

        try:
            nlp = spacy.load(model, **kwargs)
        except TypeError as e:
            raise TypeError(f"Invalid kwargs for spacy.load: {str(e)}")
        except Exception as e:
            raise ImportError(
                f"Could not load spaCy model {model}! "
                "Make sure you have installed it with: "
                f"`python -m spacy download {model}`"
            ) from e

        return cls(nlp)

    def _add_concepts_to_hc_doc(self, spacy_doc: SpacyDoc, hc_doc: Document):
        """
        Extract entities from spaCy Doc and add them to the HealthChain Document concepts.

        Args:
            spacy_doc (Doc): The processed spaCy Doc object containing entities
            hc_doc (Document): The HealthChain Document to store concepts in

        Note: Defaults to ProblemConcepts and SNOMED CT concepts
        # TODO: make configurable
        """
        concepts = []
        for ent in spacy_doc.ents:
            # Check for CUI attribute from extensions like medcat
            concept = ProblemConcept(
                code=ent._.cui if hasattr(ent, "_.cui") else None,
                code_system="2.16.840.1.113883.6.96",
                code_system_name="SNOMED CT",
                display_name=ent.text,
            )
            concepts.append(concept)

        # Add to document concepts
        hc_doc.add_concepts(problems=concepts)

    def __call__(self, doc: Document) -> Document:
        """Process the document using the spaCy pipeline. Adds outputs to nlp.spacy_docs."""
        spacy_doc = self._nlp(doc.data)
        self._add_concepts_to_hc_doc(spacy_doc, doc)
        doc.nlp.add_spacy_doc(spacy_doc)
        return doc

__call__(doc)

Process the document using the spaCy pipeline. Adds outputs to nlp.spacy_docs.

Source code in healthchain/pipeline/components/integrations.py
def __call__(self, doc: Document) -> Document:
    """Process the document using the spaCy pipeline. Adds outputs to nlp.spacy_docs."""
    spacy_doc = self._nlp(doc.data)
    self._add_concepts_to_hc_doc(spacy_doc, doc)
    doc.nlp.add_spacy_doc(spacy_doc)
    return doc

__init__(nlp)

Initialize with a pre-configured spaCy Language object.

Source code in healthchain/pipeline/components/integrations.py
def __init__(self, nlp: "Language"):
    """Initialize with a pre-configured spaCy Language object."""
    self._nlp = nlp

from_model_id(model, **kwargs) classmethod

Create a SpacyNLP component from a model identifier.

PARAMETER DESCRIPTION
model

The name or path of the spaCy model to load. Can be a model name like 'en_core_web_sm' or a path to a saved model.

TYPE: str

**kwargs

Additional configuration options passed to spacy.load. Common options include disable, exclude, enable.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
SpacyNLP

Initialized spaCy component

TYPE: SpacyNLP

RAISES DESCRIPTION
ImportError

If spaCy or the specified model is not installed

TypeError

If invalid kwargs are passed to spacy.load

Source code in healthchain/pipeline/components/integrations.py
@classmethod
def from_model_id(cls, model: str, **kwargs: Any) -> "SpacyNLP":
    """
    Create a SpacyNLP component from a model identifier.

    Args:
        model (str): The name or path of the spaCy model to load.
            Can be a model name like 'en_core_web_sm' or path to saved model.
        **kwargs: Additional configuration options passed to spacy.load.
            Common options include disable, exclude, enable.

    Returns:
        SpacyNLP: Initialized spaCy component

    Raises:
        ImportError: If spaCy or the specified model is not installed
        TypeError: If invalid kwargs are passed to spacy.load
    """
    try:
        import spacy
    except ImportError:
        raise ImportError(
            "Could not import spacy. Please install it with: " "`pip install spacy`"
        )

    try:
        nlp = spacy.load(model, **kwargs)
    except TypeError as e:
        raise TypeError(f"Invalid kwargs for spacy.load: {str(e)}")
    except Exception as e:
        raise ImportError(
            f"Could not load spaCy model {model}! "
            "Make sure you have installed it with: "
            f"`python -m spacy download {model}`"
        ) from e

    return cls(nlp)

requires_package(package_name, import_path)

Decorator to check if an optional package is available.

PARAMETER DESCRIPTION
package_name

Name of the package to install (e.g., 'langchain-core')

TYPE: str

import_path

Import path to check (e.g., 'langchain_core.runnables')

TYPE: str

Source code in healthchain/pipeline/components/integrations.py
def requires_package(package_name: str, import_path: str) -> Callable:
    """Decorator to check if an optional package is available.

    Args:
        package_name: Name of the package to install (e.g., 'langchain-core')
        import_path: Import path to check (e.g., 'langchain_core.runnables')
    """

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            try:
                __import__(import_path)
            except ImportError:
                raise ImportError(
                    f"This feature requires {package_name}. "
                    f"Please install it with: `pip install {package_name}`"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
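
For context, a small sketch of how this decorator is typically applied (mirroring its use on HFTransformer's __init__ and from_model_id above; the load_spacy_model function below is hypothetical):

@requires_package("spacy", "spacy")
def load_spacy_model(name: str):
    # The decorator attempts the import first and raises ImportError with an
    # install hint if the optional dependency is missing.
    import spacy
    return spacy.load(name)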

TextPreProcessor

Bases: BaseComponent[Document]

A component for preprocessing text documents.

This class applies various cleaning and tokenization steps to a Document object, based on the provided configuration.

ATTRIBUTE DESCRIPTION
tokenizer

The tokenizer to use. Can be "basic" or a custom tokenization function that takes a string and returns a list of tokens. Defaults to "basic".

TYPE: Union[str, Callable[[str], List[str]]]

lowercase

Whether to convert text to lowercase. Defaults to False.

TYPE: bool

remove_punctuation

Whether to remove punctuation. Defaults to False.

TYPE: bool

standardize_spaces

Whether to standardize spaces. Defaults to False.

TYPE: bool

regex

List of regex patterns and replacements. Defaults to an empty list.

TYPE: List[Tuple[str, str]]

tokenizer_func

The tokenization function.

TYPE: Callable[[str], List[str]]

cleaning_steps

List of text cleaning functions.

TYPE: List[Callable[[str], str]]

Source code in healthchain/pipeline/components/preprocessors.py
class TextPreProcessor(BaseComponent[Document]):
    """
    A component for preprocessing text documents.

    This class applies various cleaning and tokenization steps to a Document object,
    based on the provided configuration.

    Attributes:
        tokenizer (Union[str, Callable[[str], List[str]]]): The tokenizer to use. Can be "basic" or a custom
            tokenization function that takes a string and returns a list of tokens. Defaults to "basic".
        lowercase (bool): Whether to convert text to lowercase. Defaults to False.
        remove_punctuation (bool): Whether to remove punctuation. Defaults to False.
        standardize_spaces (bool): Whether to standardize spaces. Defaults to False.
        regex (List[Tuple[str, str]]): List of regex patterns and replacements. Defaults to an empty list.
        tokenizer_func (Callable[[str], List[str]]): The tokenization function.
        cleaning_steps (List[Callable[[str], str]]): List of text cleaning functions.
    """

    def __init__(
        self,
        tokenizer: Union[str, Callable[[str], List[str]]] = "basic",
        lowercase: bool = False,
        remove_punctuation: bool = False,
        standardize_spaces: bool = False,
        regex: List[Tuple[str, str]] = None,
    ):
        """
        Initialize the TextPreprocessor with the given configuration.

        Args:
            tokenizer (Union[str, Callable[[str], List[str]]]): The tokenizer to use. Can be "basic" or a custom
                tokenization function that takes a string and returns a list of tokens. Defaults to "basic".
            lowercase (bool): Whether to convert text to lowercase. Defaults to False.
            remove_punctuation (bool): Whether to remove punctuation. Defaults to False.
            standardize_spaces (bool): Whether to standardize spaces. Defaults to False.
            regex (List[Tuple[str, str]], optional): List of regex patterns and replacements. Defaults to None.
        """
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.standardize_spaces = standardize_spaces
        self.regex = regex or []
        self.tokenizer = self._get_tokenizer(tokenizer)
        self.cleaning_steps = self._configure_cleaning_steps()

    def _get_tokenizer(
        self, tokenizer: Union[str, Callable[[str], List[str]]]
    ) -> Callable[[str], List[str]]:
        """
        Get the tokenization function based on the specified tokenizer.

        Args:
            tokenizer: Either "basic" or a custom tokenization function.

        Returns:
            Callable[[str], List[str]]: The tokenization function.

        Raises:
            ValueError: If an unsupported tokenizer string is specified.
        """
        if callable(tokenizer):
            return tokenizer
        elif tokenizer == "basic":
            return lambda text: text.split()
        else:
            raise ValueError(
                f"Unsupported tokenizer: {tokenizer}. Use 'basic' or provide a custom tokenization function."
            )

    def _configure_cleaning_steps(self) -> List[Callable[[str], str]]:
        """
        Configure the text cleaning steps based on the preprocessor configuration.

        Returns:
            List[Callable[[str], str]]: List of text cleaning functions.
        """
        steps = []
        if self.lowercase:
            steps.append(lambda text: text.lower())

        regex_steps = []
        if self.regex:
            regex_steps.extend(self.regex)
        else:
            if self.remove_punctuation:
                regex_steps.append((r"[^\w\s]", ""))
            if self.standardize_spaces:
                regex_steps.append((r"\s+", " "))

        for pattern, repl in regex_steps:
            steps.append(self._create_regex_step(pattern, repl))

        if self.standardize_spaces:
            steps.append(str.strip)

        return steps

    @staticmethod
    def _create_regex_step(pattern: str, repl: str) -> Callable[[str], str]:
        """
        Create a regex-based cleaning step. This can be used in place of other cleaning steps, if required.

        Args:
            pattern (str): The regex pattern to match.
            repl (str): The replacement string.

        Returns:
            Callable[[str], str]: A function that applies the regex substitution.
        """
        return lambda text: re.sub(pattern, repl, text)

    def _clean_text(self, text: str) -> str:
        """
        Apply all cleaning steps to the input text.

        Args:
            text (str): The input text to clean.

        Returns:
            str: The cleaned text.
        """
        for step in self.cleaning_steps:
            text = step(text)
        return text

    def __call__(self, doc: Document) -> Document:
        """
        Preprocess the given Document.

        This method applies the configured cleaning steps and tokenization to the document's text (in that order).

        Args:
            doc (Document): The document to preprocess.

        Returns:
            Document: The preprocessed document with updated tokens and preprocessed text.
        """
        # Preprocess text
        preprocessed_text = self._clean_text(doc.text)
        doc.preprocessed_text = preprocessed_text

        if self.tokenizer:
            tokens = self.tokenizer(preprocessed_text)
            doc.tokens = tokens

        return doc
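
A short usage sketch, assuming a Document whose text attribute holds the raw string (the configuration shown is illustrative only):

preprocessor = TextPreProcessor(
    lowercase=True,
    remove_punctuation=True,
    standardize_spaces=True,
)
doc = preprocessor(doc)
# doc.preprocessed_text -> cleaned, lowercased text
# doc.tokens            -> whitespace-split tokens from the "basic" tokenizer

Note that when a regex list is supplied, those patterns are applied instead of the built-in remove_punctuation and standardize_spaces substitutions, as shown in _configure_cleaning_steps above.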

__call__(doc)

Preprocess the given Document.

This method applies the configured cleaning steps and tokenization to the document's text (in that order).

PARAMETER DESCRIPTION
doc

The document to preprocess.

TYPE: Document

RETURNS DESCRIPTION
Document

The preprocessed document with updated tokens and preprocessed text.

TYPE: Document

Source code in healthchain/pipeline/components/preprocessors.py
def __call__(self, doc: Document) -> Document:
    """
    Preprocess the given Document.

    This method applies the configured cleaning steps and tokenization to the document's text (in that order).

    Args:
        doc (Document): The document to preprocess.

    Returns:
        Document: The preprocessed document with updated tokens and preprocessed text.
    """
    # Preprocess text
    preprocessed_text = self._clean_text(doc.text)
    doc.preprocessed_text = preprocessed_text

    if self.tokenizer:
        tokens = self.tokenizer(preprocessed_text)
        doc.tokens = tokens

    return doc

__init__(tokenizer='basic', lowercase=False, remove_punctuation=False, standardize_spaces=False, regex=None)

Initialize the TextPreprocessor with the given configuration.

PARAMETER DESCRIPTION
tokenizer

The tokenizer to use. Can be "basic" or a custom tokenization function that takes a string and returns a list of tokens. Defaults to "basic".

TYPE: Union[str, Callable[[str], List[str]]] DEFAULT: 'basic'

lowercase

Whether to convert text to lowercase. Defaults to False.

TYPE: bool DEFAULT: False

remove_punctuation

Whether to remove punctuation. Defaults to False.

TYPE: bool DEFAULT: False

standardize_spaces

Whether to standardize spaces. Defaults to False.

TYPE: bool DEFAULT: False

regex

List of regex patterns and replacements. Defaults to None.

TYPE: List[Tuple[str, str]] DEFAULT: None

Source code in healthchain/pipeline/components/preprocessors.py
def __init__(
    self,
    tokenizer: Union[str, Callable[[str], List[str]]] = "basic",
    lowercase: bool = False,
    remove_punctuation: bool = False,
    standardize_spaces: bool = False,
    regex: List[Tuple[str, str]] = None,
):
    """
    Initialize the TextPreprocessor with the given configuration.

    Args:
        tokenizer (Union[str, Callable[[str], List[str]]]): The tokenizer to use. Can be "basic" or a custom
            tokenization function that takes a string and returns a list of tokens. Defaults to "basic".
        lowercase (bool): Whether to convert text to lowercase. Defaults to False.
        remove_punctuation (bool): Whether to remove punctuation. Defaults to False.
        standardize_spaces (bool): Whether to standardize spaces. Defaults to False.
        regex (List[Tuple[str, str]], optional): List of regex patterns and replacements. Defaults to None.
    """
    self.lowercase = lowercase
    self.remove_punctuation = remove_punctuation
    self.standardize_spaces = standardize_spaces
    self.regex = regex or []
    self.tokenizer = self._get_tokenizer(tokenizer)
    self.cleaning_steps = self._configure_cleaning_steps()

TextPostProcessor

Bases: BaseComponent[Document]

A component for post-processing text documents, specifically for refining entities.

This class applies post-coordination rules to entities in a Document object, replacing entities with their refined versions based on a lookup dictionary.

ATTRIBUTE DESCRIPTION
entity_lookup

A dictionary for entity refinement lookups.

TYPE: Dict[str, str]

Source code in healthchain/pipeline/components/postprocessors.py
class TextPostProcessor(BaseComponent[Document]):
    """
    A component for post-processing text documents, specifically for refining entities.

    This class applies post-coordination rules to entities in a Document object,
    replacing entities with their refined versions based on a lookup dictionary.

    Attributes:
        entity_lookup (Dict[str, str]): A dictionary for entity refinement lookups.
    """

    def __init__(self, postcoordination_lookup: Dict[str, str] = None):
        """
        Initialize the TextPostProcessor with an optional postcoordination lookup.

        Args:
            postcoordination_lookup (Dict[str, str], optional): A dictionary for entity refinement lookups.
                If not provided, an empty dictionary will be used.
        """
        self.entity_lookup = postcoordination_lookup or {}

    def __call__(self, doc: Document) -> Document:
        """
        Apply post-processing to the given Document.

        This method refines the entities in the document based on the entity_lookup.
        If an entity exists in the lookup, it is replaced with its refined version.

        Args:
            doc (Document): The document to be post-processed.

        Returns:
            Document: The post-processed document with refined entities.

        Note:
            If the entity_lookup is empty or the document has no 'entities' attribute,
            the document is returned unchanged.
        """
        if not self.entity_lookup or not hasattr(doc._nlp, "_entities"):
            return doc

        refined_entities = []
        for entity in doc.nlp.get_entities():
            entity_text = entity["text"]
            if entity_text in self.entity_lookup:
                entity["text"] = self.entity_lookup[entity_text]
            refined_entities.append(entity)

        doc.nlp.set_entities(refined_entities)

        return doc
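
A brief usage sketch, assuming the document's NLP container already holds entities with a "text" field (the lookup entry below is illustrative only):

postprocessor = TextPostProcessor(
    postcoordination_lookup={"heart attack": "myocardial infarction"}
)
doc = postprocessor(doc)
# Entities whose text matches a lookup key are replaced with the refined term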

__call__(doc)

Apply post-processing to the given Document.

This method refines the entities in the document based on the entity_lookup. If an entity exists in the lookup, it is replaced with its refined version.

PARAMETER DESCRIPTION
doc

The document to be post-processed.

TYPE: Document

RETURNS DESCRIPTION
Document

The post-processed document with refined entities.

TYPE: Document

Note

If the entity_lookup is empty or the document has no 'entities' attribute, the document is returned unchanged.

Source code in healthchain/pipeline/components/postprocessors.py
def __call__(self, doc: Document) -> Document:
    """
    Apply post-processing to the given Document.

    This method refines the entities in the document based on the entity_lookup.
    If an entity exists in the lookup, it is replaced with its refined version.

    Args:
        doc (Document): The document to be post-processed.

    Returns:
        Document: The post-processed document with refined entities.

    Note:
        If the entity_lookup is empty or the document has no 'entities' attribute,
        the document is returned unchanged.
    """
    if not self.entity_lookup or not hasattr(doc._nlp, "_entities"):
        return doc

    refined_entities = []
    for entity in doc.nlp.get_entities():
        entity_text = entity["text"]
        if entity_text in self.entity_lookup:
            entity["text"] = self.entity_lookup[entity_text]
        refined_entities.append(entity)

    doc.nlp.set_entities(refined_entities)

    return doc

__init__(postcoordination_lookup=None)

Initialize the TextPostProcessor with an optional postcoordination lookup.

PARAMETER DESCRIPTION
postcoordination_lookup

A dictionary for entity refinement lookups. If not provided, an empty dictionary will be used.

TYPE: Dict[str, str] DEFAULT: None

Source code in healthchain/pipeline/components/postprocessors.py
def __init__(self, postcoordination_lookup: Dict[str, str] = None):
    """
    Initialize the TextPostProcessor with an optional postcoordination lookup.

    Args:
        postcoordination_lookup (Dict[str, str], optional): A dictionary for entity refinement lookups.
            If not provided, an empty dictionary will be used.
    """
    self.entity_lookup = postcoordination_lookup or {}

CdsCardCreator

Bases: BaseComponent[str]

Component that creates CDS Hooks cards from model outputs or static content.

This component formats text into CDS Hooks cards that can be displayed in an EHR system.
It can create cards from either:
1. Model-generated text stored in a document's model outputs container
2. Static content provided during initialization

The component uses Jinja2 templates to format the text into valid CDS Hooks card JSON.
The generated cards are added to the document's CDS container.

Args:
    template (str, optional): Jinja2 template string for card creation. If not provided,
        uses a default template that creates an info card.
    template_path (Union[str, Path], optional): Path to a Jinja2 template file.
    static_content (str, optional): Static text to use instead of model output.
    source (str, optional): Source framework to get model output from (e.g. "huggingface").
    task (str, optional): Task name to get model output from (e.g. "summarization").
    delimiter (str, optional): String to split model output into multiple cards.
    default_source (Dict[str, Any], optional): Default source info for cards.
        Defaults to {"label": "Card Generated by HealthChain"}.

Example:
    >>> # Create cards from model output
    >>> creator = CdsCardCreator(source="huggingface", task="summarization")
    >>> doc = creator(doc)  # Creates cards from model output
    >>>
    >>> # Create cards with static content
    >>> creator = CdsCardCreator(static_content="Static card message")
    >>> doc = creator(doc)  # Creates card with static content
    >>>
    >>> # Create cards with custom template
    >>> template = '''
    ... {
    ...     "summary": "{{ model_output[:140] }}",
    ...     "indicator": "info",
    ...     "source": {{ default_source | tojson }},
    ...     "detail": "{{ model_output }}"
    ... }
    ... '''
    >>> creator = CdsCardCreator(
    ...     template=template,
    ...     source="langchain",
    ...     task="chat",
    ...     delimiter="\n"
    ... )
    >>> doc = creator(doc)  # Creates cards split by newlines

Source code in healthchain/pipeline/components/cdscardcreator.py
class CdsCardCreator(BaseComponent[str]):
    """
    Component that creates CDS Hooks cards from model outputs or static content.

    This component formats text into CDS Hooks cards that can be displayed in an EHR system.
    It can create cards from either:
    1. Model-generated text stored in a document's model outputs container
    2. Static content provided during initialization

    The component uses Jinja2 templates to format the text into valid CDS Hooks card JSON.
    The generated cards are added to the document's CDS container.

    Args:
        template (str, optional): Jinja2 template string for card creation. If not provided,
            uses a default template that creates an info card.
        template_path (Union[str, Path], optional): Path to a Jinja2 template file.
        static_content (str, optional): Static text to use instead of model output.
        source (str, optional): Source framework to get model output from (e.g. "huggingface").
        task (str, optional): Task name to get model output from (e.g. "summarization").
        delimiter (str, optional): String to split model output into multiple cards.
        default_source (Dict[str, Any], optional): Default source info for cards.
            Defaults to {"label": "Card Generated by HealthChain"}.

    Example:
        >>> # Create cards from model output
        >>> creator = CdsCardCreator(source="huggingface", task="summarization")
        >>> doc = creator(doc)  # Creates cards from model output
        >>>
        >>> # Create cards with static content
        >>> creator = CdsCardCreator(static_content="Static card message")
        >>> doc = creator(doc)  # Creates card with static content
        >>>
        >>> # Create cards with custom template
        >>> template = '''
        ... {
        ...     "summary": "{{ model_output[:140] }}",
        ...     "indicator": "info",
        ...     "source": {{ default_source | tojson }},
        ...     "detail": "{{ model_output }}"
        ... }
        ... '''
        >>> creator = CdsCardCreator(
        ...     template=template,
        ...     source="langchain",
        ...     task="chat",
        ...     delimiter="\n"
        ... )
        >>> doc = creator(doc)  # Creates cards split by newlines
    """

    # TODO: make source and other fields configurable from model too
    DEFAULT_TEMPLATE = """
    {
        "summary": "{{ model_output[:140] }}",
        "indicator": "info",
        "source": {{ default_source | tojson }},
        "detail": "{{ model_output }}"
    }
    """

    def __init__(
        self,
        template: Optional[str] = None,
        template_path: Optional[Union[str, Path]] = None,
        static_content: Optional[str] = None,
        source: Optional[str] = None,
        task: Optional[str] = None,
        delimiter: Optional[str] = None,
        default_source: Optional[Dict[str, Any]] = None,
    ):
        # Load template from file or use string template
        if template_path:
            try:
                template_path = Path(template_path)
                if not template_path.exists():
                    raise FileNotFoundError(f"Template file not found: {template_path}")
                with open(template_path) as f:
                    template = f.read()
            except Exception as e:
                logger.error(f"Error loading template from {template_path}: {str(e)}")
                template = self.DEFAULT_TEMPLATE

        self.template = Template(
            template if template is not None else self.DEFAULT_TEMPLATE
        )
        self.static_content = static_content
        self.source = source
        self.task = task
        self.delimiter = delimiter
        self.default_source = default_source or {
            "label": "Card Generated by HealthChain"
        }

    def create_card(self, content: str) -> Card:
        """Creates a CDS Card using the template and model output."""
        try:
            # Clean and escape the content
            # TODO: format to html that can be rendered in card
            content = content.replace("\n", " ").replace("\r", " ").strip()
            content = content.replace('"', '\\"')  # Escape double quotes

            try:
                card_json = self.template.render(
                    model_output=content, default_source=self.default_source
                )
            except Exception as e:
                raise ValueError(f"Error rendering template: {str(e)}")

            # Parse the rendered JSON into card fields
            card_fields = json.loads(card_json)

            return Card(
                summary=card_fields["summary"][:140],  # Enforce max length
                indicator=IndicatorEnum(card_fields["indicator"]),
                source=Source(**card_fields["source"]),
                detail=card_fields.get("detail"),
                suggestions=card_fields.get("suggestions"),
                selectionBehavior=card_fields.get("selectionBehavior"),
                overrideReasons=card_fields.get("overrideReasons"),
                links=card_fields.get("links"),
            )
        except Exception as e:
            raise ValueError(
                f"Error creating CDS card: Failed to render template or parse card fields: {str(e)}"
            )

    def __call__(self, doc: Document) -> Document:
        """
        Process a document and create CDS Hooks cards from model outputs or static content.

        Creates cards in one of two ways:
        1. From model-generated text stored in the document's model outputs container,
           accessed using the configured source and task
        2. From static content provided during initialization

        The generated text can optionally be split into multiple cards using a delimiter.
        Each piece of text is formatted using the configured template into a CDS Hooks card
        and added to the document's CDS container.

        Args:
            doc (Document): Document containing model outputs and CDS container

        Returns:
            Document: The input document with generated CDS cards added to its CDS container

        Raises:
            ValueError: If neither model configuration (source and task) nor static content
                is provided for card creation
        """
        if self.source and self.task:
            generated_text = doc.models.get_generated_text(self.source, self.task)
            if not generated_text:
                logger.warning(
                    f"No generated text for {self.source}/{self.task} found for CDS card creation!"
                )
                return doc
        elif self.static_content:
            generated_text = [self.static_content]
        else:
            raise ValueError(
                "Either model output (source and task) or content need to be provided for CDS card creation!"
            )

        # Create card from model output
        cards = []
        for text in generated_text:
            texts = [text] if not self.delimiter else text.split(self.delimiter)
            for t in texts:
                try:
                    cards.append(self.create_card(t))
                except Exception as e:
                    logger.warning(f"Error creating card: {str(e)}")

        if cards:
            doc.add_cds_cards(cards)

        return doc

__call__(doc)

Process a document and create CDS Hooks cards from model outputs or static content.

Creates cards in one of two ways:

1. From model-generated text stored in the document's model outputs container, accessed using the configured source and task
2. From static content provided during initialization

The generated text can optionally be split into multiple cards using a delimiter. Each piece of text is formatted using the configured template into a CDS Hooks card and added to the document's CDS container.

PARAMETER DESCRIPTION
doc

Document containing model outputs and CDS container

TYPE: Document

RETURNS DESCRIPTION
Document

The input document with generated CDS cards added to its CDS container

TYPE: Document

RAISES DESCRIPTION
ValueError

If neither model configuration (source and task) nor static content is provided for card creation

Source code in healthchain/pipeline/components/cdscardcreator.py
def __call__(self, doc: Document) -> Document:
    """
    Process a document and create CDS Hooks cards from model outputs or static content.

    Creates cards in one of two ways:
    1. From model-generated text stored in the document's model outputs container,
       accessed using the configured source and task
    2. From static content provided during initialization

    The generated text can optionally be split into multiple cards using a delimiter.
    Each piece of text is formatted using the configured template into a CDS Hooks card
    and added to the document's CDS container.

    Args:
        doc (Document): Document containing model outputs and CDS container

    Returns:
        Document: The input document with generated CDS cards added to its CDS container

    Raises:
        ValueError: If neither model configuration (source and task) nor static content
            is provided for card creation
    """
    if self.source and self.task:
        generated_text = doc.models.get_generated_text(self.source, self.task)
        if not generated_text:
            logger.warning(
                f"No generated text for {self.source}/{self.task} found for CDS card creation!"
            )
            return doc
    elif self.static_content:
        generated_text = [self.static_content]
    else:
        raise ValueError(
            "Either model output (source and task) or content need to be provided for CDS card creation!"
        )

    # Create card from model output
    cards = []
    for text in generated_text:
        texts = [text] if not self.delimiter else text.split(self.delimiter)
        for t in texts:
            try:
                cards.append(self.create_card(t))
            except Exception as e:
                logger.warning(f"Error creating card: {str(e)}")

    if cards:
        doc.add_cds_cards(cards)

    return doc

create_card(content)

Creates a CDS Card using the template and model output.

Source code in healthchain/pipeline/components/cdscardcreator.py
def create_card(self, content: str) -> Card:
    """Creates a CDS Card using the template and model output."""
    try:
        # Clean and escape the content
        # TODO: format to html that can be rendered in card
        content = content.replace("\n", " ").replace("\r", " ").strip()
        content = content.replace('"', '\\"')  # Escape double quotes

        try:
            card_json = self.template.render(
                model_output=content, default_source=self.default_source
            )
        except Exception as e:
            raise ValueError(f"Error rendering template: {str(e)}")

        # Parse the rendered JSON into card fields
        card_fields = json.loads(card_json)

        return Card(
            summary=card_fields["summary"][:140],  # Enforce max length
            indicator=IndicatorEnum(card_fields["indicator"]),
            source=Source(**card_fields["source"]),
            detail=card_fields.get("detail"),
            suggestions=card_fields.get("suggestions"),
            selectionBehavior=card_fields.get("selectionBehavior"),
            overrideReasons=card_fields.get("overrideReasons"),
            links=card_fields.get("links"),
        )
    except Exception as e:
        raise ValueError(
            f"Error creating CDS card: Failed to render template or parse card fields: {str(e)}"
        )