Component

BaseComponent

Bases: Generic[T], ABC

Abstract base class for all components in the pipeline.

This class should be subclassed to create specific components. Subclasses must implement the __call__ method.

Source code in healthchain/pipeline/components/base.py
class BaseComponent(Generic[T], ABC):
    """
    Abstract base class for all components in the pipeline.

    This class should be subclassed to create specific components.
    Subclasses must implement the __call__ method.

    Attributes:
        None
    """

    @abstractmethod
    def __call__(self, data: DataContainer[T]) -> DataContainer[T]:
        """
        Process the input data and return the processed data.

        Args:
            data (DataContainer[T]): The input data to be processed.

        Returns:
            DataContainer[T]: The processed data.
        """
        pass

__call__(data) abstractmethod

Process the input data and return the processed data.

PARAMETER DESCRIPTION
data

The input data to be processed.

TYPE: DataContainer[T]

RETURNS DESCRIPTION
DataContainer[T]

The processed data.

Source code in healthchain/pipeline/components/base.py
@abstractmethod
def __call__(self, data: DataContainer[T]) -> DataContainer[T]:
    """
    Process the input data and return the processed data.

    Args:
        data (DataContainer[T]): The input data to be processed.

    Returns:
        DataContainer[T]: The processed data.
    """
    pass
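
As a brief illustration, here is a minimal sketch of a custom component built on BaseComponent. The UppercaseComponent name is hypothetical, and it assumes the container exposes its payload via .data, as Document does in the examples further down this page:

class UppercaseComponent(BaseComponent[str]):
    """Toy component that upper-cases the wrapped string payload."""

    def __call__(self, data: DataContainer[str]) -> DataContainer[str]:
        # DataContainer is assumed to expose the payload via .data
        data.data = data.data.upper()
        return data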

Component

Bases: BaseComponent[T]

A concrete implementation of the BaseComponent class.

This class can be used as a base for creating specific components that do not require any additional processing logic.

METHOD DESCRIPTION
__call__

Process the input data and return the processed data. In this implementation, the input data is returned unmodified.

Source code in healthchain/pipeline/components/base.py
class Component(BaseComponent[T]):
    """
    A concrete implementation of the BaseComponent class.

    This class can be used as a base for creating specific components
    that do not require any additional processing logic.

    Methods:
        __call__(data: DataContainer[T]) -> DataContainer[T]:
            Process the input data and return the processed data.
            In this implementation, the input data is returned unmodified.
    """

    def __call__(self, data: DataContainer[T]) -> DataContainer[T]:
        return data

HFTransformer

Bases: BaseComponent[str]

A component that integrates Hugging Face transformers models into the pipeline.

This component allows using any Hugging Face model and task within the pipeline by wrapping the transformers.pipeline API. The model outputs are stored in the document's model_outputs container under the "huggingface" source key.

Note that this component is only recommended for non-conversational language tasks. For chat-based tasks, consider using LangChainLLM instead.

PARAMETER DESCRIPTION
pipeline

A pre-configured HuggingFace pipeline object to use for inference. Must be an instance of transformers.pipelines.base.Pipeline.

TYPE: Any

ATTRIBUTE DESCRIPTION
task

The task name of the underlying pipeline, e.g. "sentiment-analysis", "ner". Automatically extracted from the pipeline object.

TYPE: str

RAISES DESCRIPTION
ImportError

If the transformers package is not installed

TypeError

If pipeline is not a valid HuggingFace Pipeline instance

Example

Initialize for sentiment analysis

from transformers import pipeline
nlp = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
component = HFTransformer(pipeline=nlp)
doc = component(doc)  # Analyzes sentiment of doc.data

Or use the factory method

component = HFTransformer.from_model_id(
    model="facebook/bart-large-cnn",
    task="summarization",
    max_length=130,
    min_length=30,
    do_sample=False
)
doc = component(doc)  # Generates summary of doc.data

Source code in healthchain/pipeline/components/integrations.py
class HFTransformer(BaseComponent[str]):
    """
    A component that integrates Hugging Face transformers models into the pipeline.

    This component allows using any Hugging Face model and task within the pipeline
    by wrapping the transformers.pipeline API. The model outputs are stored in the
    document's model_outputs container under the "huggingface" source key.

    Note that this component is only recommended for non-conversational language tasks.
    For chat-based tasks, consider using LangChainLLM instead.

    Args:
        pipeline (Any): A pre-configured HuggingFace pipeline object to use for inference.
            Must be an instance of transformers.pipelines.base.Pipeline.

    Attributes:
        task (str): The task name of the underlying pipeline, e.g. "sentiment-analysis", "ner".
            Automatically extracted from the pipeline object.

    Raises:
        ImportError: If the transformers package is not installed
        TypeError: If pipeline is not a valid HuggingFace Pipeline instance

    Example:
        >>> # Initialize for sentiment analysis
        >>> from transformers import pipeline
        >>> nlp = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
        >>> component = HFTransformer(pipeline=nlp)
        >>> doc = component(doc)  # Analyzes sentiment of doc.data
        >>>
        >>> # Or use the factory method
        >>> component = HFTransformer.from_model_id(
        ...     model="facebook/bart-large-cnn",
        ...     task="summarization",
        ...     max_length=130,
        ...     min_length=30,
        ...     do_sample=False
        ... )
        >>> doc = component(doc)  # Generates summary of doc.data
    """

    @requires_package("transformers", "transformers.pipelines")
    def __init__(self, pipeline: Any):
        """Initialize with a pre-configured HuggingFace pipeline.

        Args:
            pipeline: A pre-configured HuggingFace pipeline object from transformers.pipeline().
                     Must be an instance of transformers.pipelines.base.Pipeline.

        Raises:
            ImportError: If transformers package is not installed
            TypeError: If pipeline is not a valid HuggingFace Pipeline instance
        """
        from transformers.pipelines.base import Pipeline

        if not isinstance(pipeline, Pipeline):
            raise TypeError(
                f"Expected HuggingFace Pipeline object, got {type(pipeline)}"
            )
        self._pipe = pipeline
        self.task = pipeline.task

    @classmethod
    @requires_package("transformers", "transformers.pipelines")
    def from_model_id(cls, model: str, task: str, **kwargs: Any) -> "HFTransformer":
        """Create a transformer component from a model identifier.

        Factory method that initializes a HuggingFace pipeline with the specified model and task,
        then wraps it in a HFTransformer component.

        Args:
            model: The model identifier or path to load. Can be:
                - A model ID from the HuggingFace Hub (e.g. "bert-base-uncased")
                - A local path to a saved model
            task: The task to run (e.g. "text-classification", "token-classification", "summarization")
            **kwargs: Additional configuration options passed to transformers.pipeline()
                Common options include:
                - device: Device to run on ("cpu", "cuda", etc.)
                - batch_size: Batch size for inference
                - model_kwargs: Dict of model-specific args

        Returns:
            HFTransformer: Initialized transformer component wrapping the pipeline

        Raises:
            TypeError: If invalid kwargs are passed to pipeline initialization
            ValueError: If pipeline initialization fails for any other reason
            ImportError: If transformers package is not installed
        """
        from transformers import pipeline

        try:
            pipe = pipeline(task=task, model=model, **kwargs)
        except TypeError as e:
            raise TypeError(f"Invalid kwargs for transformers.pipeline: {str(e)}")
        except Exception as e:
            raise ValueError(f"Error initializing transformer pipeline: {str(e)}")

        return cls(pipeline=pipe)

    def __call__(self, doc: Document) -> Document:
        """Process the document using the Hugging Face pipeline. Adds outputs to .model_outputs['huggingface']."""
        output = self._pipe(doc.data)
        doc.models.add_output("huggingface", self.task, output)

        return doc

__call__(doc)

Process the document using the Hugging Face pipeline. Adds outputs to .model_outputs['huggingface'].

Source code in healthchain/pipeline/components/integrations.py
def __call__(self, doc: Document) -> Document:
    """Process the document using the Hugging Face pipeline. Adds outputs to .model_outputs['huggingface']."""
    output = self._pipe(doc.data)
    doc.models.add_output("huggingface", self.task, output)

    return doc

__init__(pipeline)

Initialize with a pre-configured HuggingFace pipeline.

PARAMETER DESCRIPTION
pipeline

A pre-configured HuggingFace pipeline object from transformers.pipeline(). Must be an instance of transformers.pipelines.base.Pipeline.

TYPE: Any

RAISES DESCRIPTION
ImportError

If transformers package is not installed

TypeError

If pipeline is not a valid HuggingFace Pipeline instance

Source code in healthchain/pipeline/components/integrations.py
@requires_package("transformers", "transformers.pipelines")
def __init__(self, pipeline: Any):
    """Initialize with a pre-configured HuggingFace pipeline.

    Args:
        pipeline: A pre-configured HuggingFace pipeline object from transformers.pipeline().
                 Must be an instance of transformers.pipelines.base.Pipeline.

    Raises:
        ImportError: If transformers package is not installed
        TypeError: If pipeline is not a valid HuggingFace Pipeline instance
    """
    from transformers.pipelines.base import Pipeline

    if not isinstance(pipeline, Pipeline):
        raise TypeError(
            f"Expected HuggingFace Pipeline object, got {type(pipeline)}"
        )
    self._pipe = pipeline
    self.task = pipeline.task

from_model_id(model, task, **kwargs) classmethod

Create a transformer component from a model identifier.

Factory method that initializes a HuggingFace pipeline with the specified model and task, then wraps it in a HFTransformer component.

PARAMETER DESCRIPTION
model

The model identifier or path to load. Can be a model ID from the HuggingFace Hub (e.g. "bert-base-uncased") or a local path to a saved model.

TYPE: str

task

The task to run (e.g. "text-classification", "token-classification", "summarization")

TYPE: str

**kwargs

Additional configuration options passed to transformers.pipeline(). Common options include device (e.g. "cpu", "cuda"), batch_size (batch size for inference), and model_kwargs (a dict of model-specific arguments).

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
HFTransformer

Initialized transformer component wrapping the pipeline

TYPE: HFTransformer

RAISES DESCRIPTION
TypeError

If invalid kwargs are passed to pipeline initialization

ValueError

If pipeline initialization fails for any other reason

ImportError

If transformers package is not installed

Source code in healthchain/pipeline/components/integrations.py
@classmethod
@requires_package("transformers", "transformers.pipelines")
def from_model_id(cls, model: str, task: str, **kwargs: Any) -> "HFTransformer":
    """Create a transformer component from a model identifier.

    Factory method that initializes a HuggingFace pipeline with the specified model and task,
    then wraps it in a HFTransformer component.

    Args:
        model: The model identifier or path to load. Can be:
            - A model ID from the HuggingFace Hub (e.g. "bert-base-uncased")
            - A local path to a saved model
        task: The task to run (e.g. "text-classification", "token-classification", "summarization")
        **kwargs: Additional configuration options passed to transformers.pipeline()
            Common options include:
            - device: Device to run on ("cpu", "cuda", etc.)
            - batch_size: Batch size for inference
            - model_kwargs: Dict of model-specific args

    Returns:
        HFTransformer: Initialized transformer component wrapping the pipeline

    Raises:
        TypeError: If invalid kwargs are passed to pipeline initialization
        ValueError: If pipeline initialization fails for any other reason
        ImportError: If transformers package is not installed
    """
    from transformers import pipeline

    try:
        pipe = pipeline(task=task, model=model, **kwargs)
    except TypeError as e:
        raise TypeError(f"Invalid kwargs for transformers.pipeline: {str(e)}")
    except Exception as e:
        raise ValueError(f"Error initializing transformer pipeline: {str(e)}")

    return cls(pipeline=pipe)

LangChainLLM

Bases: BaseComponent[str]

A component that integrates LangChain chains into the pipeline.

This component allows using any LangChain chain within the pipeline by wrapping the chain's invoke method. The chain outputs are stored in the document's model_outputs container under the "langchain" source key.

PARAMETER DESCRIPTION
chain

The LangChain chain to run on the document text. Must be a Runnable object from the LangChain library.

TYPE: Runnable

task

The task name to use when storing outputs, e.g. "summarization", "chat". Used as key to organize model outputs in the document's model container.

TYPE: str

**kwargs

Additional parameters to pass to the chain's invoke method. These are forwarded directly to the chain's invoke() call.

TYPE: Any DEFAULT: {}

RAISES DESCRIPTION
TypeError

If chain is not a LangChain Runnable object or if invalid kwargs are passed

ValueError

If there is an error during chain invocation

ImportError

If langchain-core package is not installed

Example

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = ChatPromptTemplate.from_template("What is {input}?") | ChatOpenAI()
component = LangChainLLM(chain=chain, task="chat")
doc = component(doc)  # Runs the chain on doc.data and stores output

Source code in healthchain/pipeline/components/integrations.py
class LangChainLLM(BaseComponent[str]):
    """
    A component that integrates LangChain chains into the pipeline.

    This component allows using any LangChain chain within the pipeline by wrapping
    the chain's invoke method. The chain outputs are stored in the document's
    model_outputs container under the "langchain" source key.

    Args:
        chain (Runnable): The LangChain chain to run on the document text.
            Must be a Runnable object from the LangChain library.
        task (str): The task name to use when storing outputs, e.g. "summarization", "chat".
            Used as key to organize model outputs in the document's model container.
        **kwargs: Additional parameters to pass to the chain's invoke method.
            These are forwarded directly to the chain's invoke() call.

    Raises:
        TypeError: If chain is not a LangChain Runnable object or if invalid kwargs are passed
        ValueError: If there is an error during chain invocation
        ImportError: If langchain-core package is not installed

    Example:
        >>> from langchain_core.prompts import ChatPromptTemplate
        >>> from langchain_openai import ChatOpenAI

        >>> chain = ChatPromptTemplate.from_template("What is {input}?") | ChatOpenAI()
        >>> component = LangChainLLM(chain=chain, task="chat")
        >>> doc = component(doc)  # Runs the chain on doc.data and stores output
    """

    @requires_package("langchain-core", "langchain_core.runnables")
    def __init__(self, chain: Any, task: str, **kwargs: Any):
        """Initialize with a LangChain chain."""
        from langchain_core.runnables import Runnable

        if not isinstance(chain, Runnable):
            raise TypeError(f"Expected LangChain Runnable object, got {type(chain)}")

        self.chain = chain
        self.task = task
        self.kwargs = kwargs

    def __call__(self, doc: Document) -> Document:
        """Process the document using the LangChain chain. Adds outputs to .model_outputs['langchain']."""
        try:
            output = self.chain.invoke(doc.data, **self.kwargs)
        except TypeError as e:
            raise TypeError(f"Invalid kwargs for chain.invoke: {str(e)}")
        except Exception as e:
            raise ValueError(f"Error during chain invocation: {str(e)}")

        doc.models.add_output("langchain", self.task, output)

        return doc

__call__(doc)

Process the document using the LangChain chain. Adds outputs to .model_outputs['langchain'].

Source code in healthchain/pipeline/components/integrations.py
def __call__(self, doc: Document) -> Document:
    """Process the document using the LangChain chain. Adds outputs to .model_outputs['langchain']."""
    try:
        output = self.chain.invoke(doc.data, **self.kwargs)
    except TypeError as e:
        raise TypeError(f"Invalid kwargs for chain.invoke: {str(e)}")
    except Exception as e:
        raise ValueError(f"Error during chain invocation: {str(e)}")

    doc.models.add_output("langchain", self.task, output)

    return doc

__init__(chain, task, **kwargs)

Initialize with a LangChain chain.

Source code in healthchain/pipeline/components/integrations.py
@requires_package("langchain-core", "langchain_core.runnables")
def __init__(self, chain: Any, task: str, **kwargs: Any):
    """Initialize with a LangChain chain."""
    from langchain_core.runnables import Runnable

    if not isinstance(chain, Runnable):
        raise TypeError(f"Expected LangChain Runnable object, got {type(chain)}")

    self.chain = chain
    self.task = task
    self.kwargs = kwargs

SpacyNLP

Bases: BaseComponent[str]

A component that integrates spaCy models into the pipeline.

This component allows using any spaCy model within the pipeline by loading and applying it to process text documents. The spaCy doc outputs are stored in the document's nlp annotations container under .spacy_docs.

PARAMETER DESCRIPTION
nlp

A pre-configured spaCy Language object.

TYPE: Language

Example

Using pre-configured pipeline

import spacy
nlp = spacy.load("en_core_web_sm", disable=["parser"])
component = SpacyNLP(nlp)
doc = component(doc)

Or using model name

component = SpacyNLP.from_model_id("en_core_web_sm", disable=["parser"])
doc = component(doc)

Source code in healthchain/pipeline/components/integrations.py
class SpacyNLP(BaseComponent[str]):
    """
    A component that integrates spaCy models into the pipeline.

    This component allows using any spaCy model within the pipeline by loading
    and applying it to process text documents. The spaCy doc outputs are stored
    in the document's nlp annotations container under .spacy_docs.

    Args:
        nlp: A pre-configured spaCy Language object.

    Example:
        >>> # Using pre-configured pipeline
        >>> import spacy
        >>> nlp = spacy.load("en_core_web_sm", disable=["parser"])
        >>> component = SpacyNLP(nlp)
        >>> doc = component(doc)
        >>>
        >>> # Or using model name
        >>> component = SpacyNLP.from_model_id("en_core_web_sm", disable=["parser"])
        >>> doc = component(doc)
    """

    def __init__(self, nlp: "Language"):
        """Initialize with a pre-configured spaCy Language object."""
        self._nlp = nlp

    @classmethod
    def from_model_id(cls, model: str, **kwargs: Any) -> "SpacyNLP":
        """
        Create a SpacyNLP component from a model identifier.

        Args:
            model (str): The name or path of the spaCy model to load.
                Can be a model name like 'en_core_web_sm' or path to saved model.
            **kwargs: Additional configuration options passed to spacy.load.
                Common options include disable, exclude, enable.

        Returns:
            SpacyNLP: Initialized spaCy component

        Raises:
            ImportError: If spaCy or the specified model is not installed
            TypeError: If invalid kwargs are passed to spacy.load
        """
        try:
            import spacy
        except ImportError:
            raise ImportError(
                "Could not import spacy. Please install it with: " "`pip install spacy`"
            )

        try:
            nlp = spacy.load(model, **kwargs)
        except TypeError as e:
            raise TypeError(f"Invalid kwargs for spacy.load: {str(e)}")
        except Exception as e:
            raise ImportError(
                f"Could not load spaCy model {model}! "
                "Make sure you have installed it with: "
                f"`python -m spacy download {model}`"
            ) from e

        return cls(nlp)

    def _add_concepts_to_hc_doc(self, spacy_doc: SpacyDoc, hc_doc: Document):
        """
        Extract entities from spaCy Doc and add them to the HealthChain Document concepts.

        Args:
            spacy_doc (Doc): The processed spaCy Doc object containing entities
            hc_doc (Document): The HealthChain Document to store concepts in

        Note: Defaults to ProblemConcepts and SNOMED CT concepts
        # TODO: make configurable
        """
        concepts = []
        for ent in spacy_doc.ents:
            # Check for CUI attribute from extensions like medcat
            concept = ProblemConcept(
                code=ent._.cui if hasattr(ent, "_.cui") else None,
                code_system="2.16.840.1.113883.6.96",
                code_system_name="SNOMED CT",
                display_name=ent.text,
            )
            concepts.append(concept)

        # Add to document concepts
        hc_doc.add_concepts(problems=concepts)

    def __call__(self, doc: Document) -> Document:
        """Process the document using the spaCy pipeline. Adds outputs to nlp.spacy_docs."""
        spacy_doc = self._nlp(doc.data)
        self._add_concepts_to_hc_doc(spacy_doc, doc)
        doc.nlp.add_spacy_doc(spacy_doc)
        return doc

__call__(doc)

Process the document using the spaCy pipeline. Adds outputs to nlp.spacy_docs.

Source code in healthchain/pipeline/components/integrations.py
def __call__(self, doc: Document) -> Document:
    """Process the document using the spaCy pipeline. Adds outputs to nlp.spacy_docs."""
    spacy_doc = self._nlp(doc.data)
    self._add_concepts_to_hc_doc(spacy_doc, doc)
    doc.nlp.add_spacy_doc(spacy_doc)
    return doc

__init__(nlp)

Initialize with a pre-configured spaCy Language object.

Source code in healthchain/pipeline/components/integrations.py
def __init__(self, nlp: "Language"):
    """Initialize with a pre-configured spaCy Language object."""
    self._nlp = nlp

from_model_id(model, **kwargs) classmethod

Create a SpacyNLP component from a model identifier.

PARAMETER DESCRIPTION
model

The name or path of the spaCy model to load. Can be a model name like 'en_core_web_sm' or a path to a saved model.

TYPE: str

**kwargs

Additional configuration options passed to spacy.load. Common options include disable, exclude, enable.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
SpacyNLP

Initialized spaCy component

TYPE: SpacyNLP

RAISES DESCRIPTION
ImportError

If spaCy or the specified model is not installed

TypeError

If invalid kwargs are passed to spacy.load

Source code in healthchain/pipeline/components/integrations.py
@classmethod
def from_model_id(cls, model: str, **kwargs: Any) -> "SpacyNLP":
    """
    Create a SpacyNLP component from a model identifier.

    Args:
        model (str): The name or path of the spaCy model to load.
            Can be a model name like 'en_core_web_sm' or path to saved model.
        **kwargs: Additional configuration options passed to spacy.load.
            Common options include disable, exclude, enable.

    Returns:
        SpacyNLP: Initialized spaCy component

    Raises:
        ImportError: If spaCy or the specified model is not installed
        TypeError: If invalid kwargs are passed to spacy.load
    """
    try:
        import spacy
    except ImportError:
        raise ImportError(
            "Could not import spacy. Please install it with: " "`pip install spacy`"
        )

    try:
        nlp = spacy.load(model, **kwargs)
    except TypeError as e:
        raise TypeError(f"Invalid kwargs for spacy.load: {str(e)}")
    except Exception as e:
        raise ImportError(
            f"Could not load spaCy model {model}! "
            "Make sure you have installed it with: "
            f"`python -m spacy download {model}`"
        ) from e

    return cls(nlp)

requires_package(package_name, import_path)

Decorator to check if an optional package is available.

PARAMETER DESCRIPTION
package_name

Name of the package to install (e.g., 'langchain-core')

TYPE: str

import_path

Import path to check (e.g., 'langchain_core.runnables')

TYPE: str

Source code in healthchain/pipeline/components/integrations.py
def requires_package(package_name: str, import_path: str) -> Callable:
    """Decorator to check if an optional package is available.

    Args:
        package_name: Name of the package to install (e.g., 'langchain-core')
        import_path: Import path to check (e.g., 'langchain_core.runnables')
    """

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            try:
                __import__(import_path)
            except ImportError:
                raise ImportError(
                    f"This feature requires {package_name}. "
                    f"Please install it with: `pip install {package_name}`"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
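
For context, a small sketch of how this decorator is typically applied (mirroring its use on HFTransformer's __init__ and from_model_id above; the load_spacy_model function below is hypothetical):

@requires_package("spacy", "spacy")
def load_spacy_model(name: str):
    # The decorator attempts the import first and raises ImportError with an
    # install hint if the optional dependency is missing.
    import spacy
    return spacy.load(name)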

TextPreProcessor

Bases: BaseComponent[Document]

A component for preprocessing text documents.

This class applies various cleaning and tokenization steps to a Document object, based on the provided configuration.

ATTRIBUTE DESCRIPTION
tokenizer

The tokenizer to use. Can be "basic" or a custom tokenization function that takes a string and returns a list of tokens. Defaults to "basic".

TYPE: Union[str, Callable[[str], List[str]]]

lowercase

Whether to convert text to lowercase. Defaults to False.

TYPE: bool

remove_punctuation

Whether to remove punctuation. Defaults to False.

TYPE: bool

standardize_spaces

Whether to standardize spaces. Defaults to False.

TYPE: bool

regex

List of regex patterns and replacements. Defaults to an empty list.

TYPE: List[Tuple[str, str]]

tokenizer_func

The tokenization function.

TYPE: Callable[[str], List[str]]

cleaning_steps

List of text cleaning functions.

TYPE: List[Callable[[str], str]]

Source code in healthchain/pipeline/components/preprocessors.py
class TextPreProcessor(BaseComponent[Document]):
    """
    A component for preprocessing text documents.

    This class applies various cleaning and tokenization steps to a Document object,
    based on the provided configuration.

    Attributes:
        tokenizer (Union[str, Callable[[str], List[str]]]): The tokenizer to use. Can be "basic" or a custom
            tokenization function that takes a string and returns a list of tokens. Defaults to "basic".
        lowercase (bool): Whether to convert text to lowercase. Defaults to False.
        remove_punctuation (bool): Whether to remove punctuation. Defaults to False.
        standardize_spaces (bool): Whether to standardize spaces. Defaults to False.
        regex (List[Tuple[str, str]]): List of regex patterns and replacements. Defaults to an empty list.
        tokenizer_func (Callable[[str], List[str]]): The tokenization function.
        cleaning_steps (List[Callable[[str], str]]): List of text cleaning functions.
    """

    def __init__(
        self,
        tokenizer: Union[str, Callable[[str], List[str]]] = "basic",
        lowercase: bool = False,
        remove_punctuation: bool = False,
        standardize_spaces: bool = False,
        regex: List[Tuple[str, str]] = None,
    ):
        """
        Initialize the TextPreprocessor with the given configuration.

        Args:
            tokenizer (Union[str, Callable[[str], List[str]]]): The tokenizer to use. Can be "basic" or a custom
                tokenization function that takes a string and returns a list of tokens. Defaults to "basic".
            lowercase (bool): Whether to convert text to lowercase. Defaults to False.
            remove_punctuation (bool): Whether to remove punctuation. Defaults to False.
            standardize_spaces (bool): Whether to standardize spaces. Defaults to False.
            regex (List[Tuple[str, str]], optional): List of regex patterns and replacements. Defaults to None.
        """
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.standardize_spaces = standardize_spaces
        self.regex = regex or []
        self.tokenizer = self._get_tokenizer(tokenizer)
        self.cleaning_steps = self._configure_cleaning_steps()

    def _get_tokenizer(
        self, tokenizer: Union[str, Callable[[str], List[str]]]
    ) -> Callable[[str], List[str]]:
        """
        Get the tokenization function based on the specified tokenizer.

        Args:
            tokenizer: Either "basic" or a custom tokenization function.

        Returns:
            Callable[[str], List[str]]: The tokenization function.

        Raises:
            ValueError: If an unsupported tokenizer string is specified.
        """
        if callable(tokenizer):
            return tokenizer
        elif tokenizer == "basic":
            return lambda text: text.split()
        else:
            raise ValueError(
                f"Unsupported tokenizer: {tokenizer}. Use 'basic' or provide a custom tokenization function."
            )

    def _configure_cleaning_steps(self) -> List[Callable[[str], str]]:
        """
        Configure the text cleaning steps based on the preprocessor configuration.

        Returns:
            List[Callable[[str], str]]: List of text cleaning functions.
        """
        steps = []
        if self.lowercase:
            steps.append(lambda text: text.lower())

        regex_steps = []
        if self.regex:
            regex_steps.extend(self.regex)
        else:
            if self.remove_punctuation:
                regex_steps.append((r"[^\w\s]", ""))
            if self.standardize_spaces:
                regex_steps.append((r"\s+", " "))

        for pattern, repl in regex_steps:
            steps.append(self._create_regex_step(pattern, repl))

        if self.standardize_spaces:
            steps.append(str.strip)

        return steps

    @staticmethod
    def _create_regex_step(pattern: str, repl: str) -> Callable[[str], str]:
        """
        Create a regex-based cleaning step. This can be used in place of other cleaning steps, if required.

        Args:
            pattern (str): The regex pattern to match.
            repl (str): The replacement string.

        Returns:
            Callable[[str], str]: A function that applies the regex substitution.
        """
        return lambda text: re.sub(pattern, repl, text)

    def _clean_text(self, text: str) -> str:
        """
        Apply all cleaning steps to the input text.

        Args:
            text (str): The input text to clean.

        Returns:
            str: The cleaned text.
        """
        for step in self.cleaning_steps:
            text = step(text)
        return text

    def __call__(self, doc: Document) -> Document:
        """
        Preprocess the given Document.

        This method applies the configured cleaning steps and tokenization to the document's text (in that order).

        Args:
            doc (Document): The document to preprocess.

        Returns:
            Document: The preprocessed document with updated tokens and preprocessed text.
        """
        # Preprocess text
        preprocessed_text = self._clean_text(doc.text)
        doc.preprocessed_text = preprocessed_text

        if self.tokenizer:
            tokens = self.tokenizer(preprocessed_text)
            doc.tokens = tokens

        return doc
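
A short usage sketch, assuming a Document whose text attribute holds the raw string (the configuration shown is illustrative only):

preprocessor = TextPreProcessor(
    lowercase=True,
    remove_punctuation=True,
    standardize_spaces=True,
)
doc = preprocessor(doc)
# doc.preprocessed_text -> cleaned, lowercased text
# doc.tokens            -> whitespace-split tokens from the "basic" tokenizer

Note that when a regex list is supplied, those patterns are applied instead of the built-in remove_punctuation and standardize_spaces substitutions, as shown in _configure_cleaning_steps above.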

__call__(doc)

Preprocess the given Document.

This method applies the configured cleaning steps and tokenization to the document's text (in that order).

PARAMETER DESCRIPTION
doc

The document to preprocess.

TYPE: Document

RETURNS DESCRIPTION
Document

The preprocessed document with updated tokens and preprocessed text.

TYPE: Document

Source code in healthchain/pipeline/components/preprocessors.py
def __call__(self, doc: Document) -> Document:
    """
    Preprocess the given Document.

    This method applies the configured cleaning steps and tokenization to the document's text (in that order).

    Args:
        doc (Document): The document to preprocess.

    Returns:
        Document: The preprocessed document with updated tokens and preprocessed text.
    """
    # Preprocess text
    preprocessed_text = self._clean_text(doc.text)
    doc.preprocessed_text = preprocessed_text

    if self.tokenizer:
        tokens = self.tokenizer(preprocessed_text)
        doc.tokens = tokens

    return doc

__init__(tokenizer='basic', lowercase=False, remove_punctuation=False, standardize_spaces=False, regex=None)

Initialize the TextPreprocessor with the given configuration.

PARAMETER DESCRIPTION
tokenizer

The tokenizer to use. Can be "basic" or a custom tokenization function that takes a string and returns a list of tokens. Defaults to "basic".

TYPE: Union[str, Callable[[str], List[str]]] DEFAULT: 'basic'

lowercase

Whether to convert text to lowercase. Defaults to False.

TYPE: bool DEFAULT: False

remove_punctuation

Whether to remove punctuation. Defaults to False.

TYPE: bool DEFAULT: False

standardize_spaces

Whether to standardize spaces. Defaults to False.

TYPE: bool DEFAULT: False

regex

List of regex patterns and replacements. Defaults to None.

TYPE: List[Tuple[str, str]] DEFAULT: None

Source code in healthchain/pipeline/components/preprocessors.py
def __init__(
    self,
    tokenizer: Union[str, Callable[[str], List[str]]] = "basic",
    lowercase: bool = False,
    remove_punctuation: bool = False,
    standardize_spaces: bool = False,
    regex: List[Tuple[str, str]] = None,
):
    """
    Initialize the TextPreprocessor with the given configuration.

    Args:
        tokenizer (Union[str, Callable[[str], List[str]]]): The tokenizer to use. Can be "basic" or a custom
            tokenization function that takes a string and returns a list of tokens. Defaults to "basic".
        lowercase (bool): Whether to convert text to lowercase. Defaults to False.
        remove_punctuation (bool): Whether to remove punctuation. Defaults to False.
        standardize_spaces (bool): Whether to standardize spaces. Defaults to False.
        regex (List[Tuple[str, str]], optional): List of regex patterns and replacements. Defaults to None.
    """
    self.lowercase = lowercase
    self.remove_punctuation = remove_punctuation
    self.standardize_spaces = standardize_spaces
    self.regex = regex or []
    self.tokenizer = self._get_tokenizer(tokenizer)
    self.cleaning_steps = self._configure_cleaning_steps()

TextPostProcessor

Bases: BaseComponent[Document]

A component for post-processing text documents, specifically for refining entities.

This class applies post-coordination rules to entities in a Document object, replacing entities with their refined versions based on a lookup dictionary.

ATTRIBUTE DESCRIPTION
entity_lookup

A dictionary for entity refinement lookups.

TYPE: Dict[str, str]

Source code in healthchain/pipeline/components/postprocessors.py
class TextPostProcessor(BaseComponent[Document]):
    """
    A component for post-processing text documents, specifically for refining entities.

    This class applies post-coordination rules to entities in a Document object,
    replacing entities with their refined versions based on a lookup dictionary.

    Attributes:
        entity_lookup (Dict[str, str]): A dictionary for entity refinement lookups.
    """

    def __init__(self, postcoordination_lookup: Dict[str, str] = None):
        """
        Initialize the TextPostProcessor with an optional postcoordination lookup.

        Args:
            postcoordination_lookup (Dict[str, str], optional): A dictionary for entity refinement lookups.
                If not provided, an empty dictionary will be used.
        """
        self.entity_lookup = postcoordination_lookup or {}

    def __call__(self, doc: Document) -> Document:
        """
        Apply post-processing to the given Document.

        This method refines the entities in the document based on the entity_lookup.
        If an entity exists in the lookup, it is replaced with its refined version.

        Args:
            doc (Document): The document to be post-processed.

        Returns:
            Document: The post-processed document with refined entities.

        Note:
            If the entity_lookup is empty or the document has no 'entities' attribute,
            the document is returned unchanged.
        """
        if not self.entity_lookup or not hasattr(doc._nlp, "_entities"):
            return doc

        refined_entities = []
        for entity in doc.nlp.get_entities():
            entity_text = entity["text"]
            if entity_text in self.entity_lookup:
                entity["text"] = self.entity_lookup[entity_text]
            refined_entities.append(entity)

        doc.nlp.set_entities(refined_entities)

        return doc
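
A brief usage sketch, assuming the document's NLP container already holds entities with a "text" field (the lookup entry below is illustrative only):

postprocessor = TextPostProcessor(
    postcoordination_lookup={"heart attack": "myocardial infarction"}
)
doc = postprocessor(doc)
# Entities whose text matches a lookup key are replaced with the refined term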

__call__(doc)

Apply post-processing to the given Document.

This method refines the entities in the document based on the entity_lookup. If an entity exists in the lookup, it is replaced with its refined version.

PARAMETER DESCRIPTION
doc

The document to be post-processed.

TYPE: Document

RETURNS DESCRIPTION
Document

The post-processed document with refined entities.

TYPE: Document

Note

If the entity_lookup is empty or the document has no 'entities' attribute, the document is returned unchanged.

Source code in healthchain/pipeline/components/postprocessors.py
def __call__(self, doc: Document) -> Document:
    """
    Apply post-processing to the given Document.

    This method refines the entities in the document based on the entity_lookup.
    If an entity exists in the lookup, it is replaced with its refined version.

    Args:
        doc (Document): The document to be post-processed.

    Returns:
        Document: The post-processed document with refined entities.

    Note:
        If the entity_lookup is empty or the document has no 'entities' attribute,
        the document is returned unchanged.
    """
    if not self.entity_lookup or not hasattr(doc._nlp, "_entities"):
        return doc

    refined_entities = []
    for entity in doc.nlp.get_entities():
        entity_text = entity["text"]
        if entity_text in self.entity_lookup:
            entity["text"] = self.entity_lookup[entity_text]
        refined_entities.append(entity)

    doc.nlp.set_entities(refined_entities)

    return doc

__init__(postcoordination_lookup=None)

Initialize the TextPostProcessor with an optional postcoordination lookup.

PARAMETER DESCRIPTION
postcoordination_lookup

A dictionary for entity refinement lookups. If not provided, an empty dictionary will be used.

TYPE: Dict[str, str] DEFAULT: None

Source code in healthchain/pipeline/components/postprocessors.py
def __init__(self, postcoordination_lookup: Dict[str, str] = None):
    """
    Initialize the TextPostProcessor with an optional postcoordination lookup.

    Args:
        postcoordination_lookup (Dict[str, str], optional): A dictionary for entity refinement lookups.
            If not provided, an empty dictionary will be used.
    """
    self.entity_lookup = postcoordination_lookup or {}

CdsCardCreator

Bases: BaseComponent[str]

Component that creates CDS Hooks cards from model outputs or static content.

This component formats text into CDS Hooks cards that can be displayed in an EHR system.
It can create cards from either:
1. Model-generated text stored in a document's model outputs container
2. Static content provided during initialization

The component uses Jinja2 templates to format the text into valid CDS Hooks card JSON.
The generated cards are added to the document's CDS container.

Args:
    template (str, optional): Jinja2 template string for card creation. If not provided,
        uses a default template that creates an info card.
    template_path (Union[str, Path], optional): Path to a Jinja2 template file.
    static_content (str, optional): Static text to use instead of model output.
    source (str, optional): Source framework to get model output from (e.g. "huggingface").
    task (str, optional): Task name to get model output from (e.g. "summarization").
    delimiter (str, optional): String to split model output into multiple cards.
    default_source (Dict[str, Any], optional): Default source info for cards.
        Defaults to {"label": "Card Generated by HealthChain"}.

Example:
    >>> # Create cards from model output
    >>> creator = CdsCardCreator(source="huggingface", task="summarization")
    >>> doc = creator(doc)  # Creates cards from model output
    >>>
    >>> # Create cards with static content
    >>> creator = CdsCardCreator(static_content="Static card message")
    >>> doc = creator(doc)  # Creates card with static content
    >>>
    >>> # Create cards with custom template
    >>> template = '''
    ... {
    ...     "summary": "{{ model_output[:140] }}",
    ...     "indicator": "info",
    ...     "source": {{ default_source | tojson }},
    ...     "detail": "{{ model_output }}"
    ... }
    ... '''
    >>> creator = CdsCardCreator(
    ...     template=template,
    ...     source="langchain",
    ...     task="chat",
    ...     delimiter="\n"
    ... )
    >>> doc = creator(doc)  # Creates cards split by newlines

Source code in healthchain/pipeline/components/cdscardcreator.py
class CdsCardCreator(BaseComponent[str]):
    """
    Component that creates CDS Hooks cards from model outputs or static content.

    This component formats text into CDS Hooks cards that can be displayed in an EHR system.
    It can create cards from either:
    1. Model-generated text stored in a document's model outputs container
    2. Static content provided during initialization

    The component uses Jinja2 templates to format the text into valid CDS Hooks card JSON.
    The generated cards are added to the document's CDS container.

    Args:
        template (str, optional): Jinja2 template string for card creation. If not provided,
            uses a default template that creates an info card.
        template_path (Union[str, Path], optional): Path to a Jinja2 template file.
        static_content (str, optional): Static text to use instead of model output.
        source (str, optional): Source framework to get model output from (e.g. "huggingface").
        task (str, optional): Task name to get model output from (e.g. "summarization").
        delimiter (str, optional): String to split model output into multiple cards.
        default_source (Dict[str, Any], optional): Default source info for cards.
            Defaults to {"label": "Card Generated by HealthChain"}.

    Example:
        >>> # Create cards from model output
        >>> creator = CdsCardCreator(source="huggingface", task="summarization")
        >>> doc = creator(doc)  # Creates cards from model output
        >>>
        >>> # Create cards with static content
        >>> creator = CdsCardCreator(static_content="Static card message")
        >>> doc = creator(doc)  # Creates card with static content
        >>>
        >>> # Create cards with custom template
        >>> template = '''
        ... {
        ...     "summary": "{{ model_output[:140] }}",
        ...     "indicator": "info",
        ...     "source": {{ default_source | tojson }},
        ...     "detail": "{{ model_output }}"
        ... }
        ... '''
        >>> creator = CdsCardCreator(
        ...     template=template,
        ...     source="langchain",
        ...     task="chat",
        ...     delimiter="\n"
        ... )
        >>> doc = creator(doc)  # Creates cards split by newlines
    """

    # TODO: make source and other fields configurable from model too
    DEFAULT_TEMPLATE = """
    {
        "summary": "{{ model_output[:140] }}",
        "indicator": "info",
        "source": {{ default_source | tojson }},
        "detail": "{{ model_output }}"
    }
    """

    def __init__(
        self,
        template: Optional[str] = None,
        template_path: Optional[Union[str, Path]] = None,
        static_content: Optional[str] = None,
        source: Optional[str] = None,
        task: Optional[str] = None,
        delimiter: Optional[str] = None,
        default_source: Optional[Dict[str, Any]] = None,
    ):
        # Load template from file or use string template
        if template_path:
            try:
                template_path = Path(template_path)
                if not template_path.exists():
                    raise FileNotFoundError(f"Template file not found: {template_path}")
                with open(template_path) as f:
                    template = f.read()
            except Exception as e:
                logger.error(f"Error loading template from {template_path}: {str(e)}")
                template = self.DEFAULT_TEMPLATE

        self.template = Template(
            template if template is not None else self.DEFAULT_TEMPLATE
        )
        self.static_content = static_content
        self.source = source
        self.task = task
        self.delimiter = delimiter
        self.default_source = default_source or {
            "label": "Card Generated by HealthChain"
        }

    def create_card(self, content: str) -> Card:
        """Creates a CDS Card using the template and model output."""
        try:
            # Clean and escape the content
            # TODO: format to html that can be rendered in card
            content = content.replace("\n", " ").replace("\r", " ").strip()
            content = content.replace('"', '\\"')  # Escape double quotes

            try:
                card_json = self.template.render(
                    model_output=content, default_source=self.default_source
                )
            except Exception as e:
                raise ValueError(f"Error rendering template: {str(e)}")

            # Parse the rendered JSON into card fields
            card_fields = json.loads(card_json)

            return Card(
                summary=card_fields["summary"][:140],  # Enforce max length
                indicator=IndicatorEnum(card_fields["indicator"]),
                source=Source(**card_fields["source"]),
                detail=card_fields.get("detail"),
                suggestions=card_fields.get("suggestions"),
                selectionBehavior=card_fields.get("selectionBehavior"),
                overrideReasons=card_fields.get("overrideReasons"),
                links=card_fields.get("links"),
            )
        except Exception as e:
            raise ValueError(
                f"Error creating CDS card: Failed to render template or parse card fields: {str(e)}"
            )

    def __call__(self, doc: Document) -> Document:
        """
        Process a document and create CDS Hooks cards from model outputs or static content.

        Creates cards in one of two ways:
        1. From model-generated text stored in the document's model outputs container,
           accessed using the configured source and task
        2. From static content provided during initialization

        The generated text can optionally be split into multiple cards using a delimiter.
        Each piece of text is formatted using the configured template into a CDS Hooks card
        and added to the document's CDS container.

        Args:
            doc (Document): Document containing model outputs and CDS container

        Returns:
            Document: The input document with generated CDS cards added to its CDS container

        Raises:
            ValueError: If neither model configuration (source and task) nor static content
                is provided for card creation
        """
        if self.source and self.task:
            generated_text = doc.models.get_generated_text(self.source, self.task)
            if not generated_text:
                logger.warning(
                    f"No generated text for {self.source}/{self.task} found for CDS card creation!"
                )
                return doc
        elif self.static_content:
            generated_text = [self.static_content]
        else:
            raise ValueError(
                "Either model output (source and task) or content need to be provided for CDS card creation!"
            )

        # Create card from model output
        cards = []
        for text in generated_text:
            texts = [text] if not self.delimiter else text.split(self.delimiter)
            for t in texts:
                try:
                    cards.append(self.create_card(t))
                except Exception as e:
                    logger.warning(f"Error creating card: {str(e)}")

        if cards:
            doc.add_cds_cards(cards)

        return doc

__call__(doc)

Process a document and create CDS Hooks cards from model outputs or static content.

Creates cards in one of two ways:

1. From model-generated text stored in the document's model outputs container, accessed using the configured source and task
2. From static content provided during initialization

The generated text can optionally be split into multiple cards using a delimiter. Each piece of text is formatted using the configured template into a CDS Hooks card and added to the document's CDS container.

PARAMETER DESCRIPTION
doc

Document containing model outputs and CDS container

TYPE: Document

RETURNS DESCRIPTION
Document

The input document with generated CDS cards added to its CDS container

TYPE: Document

RAISES DESCRIPTION
ValueError

If neither model configuration (source and task) nor static content is provided for card creation

Source code in healthchain/pipeline/components/cdscardcreator.py
def __call__(self, doc: Document) -> Document:
    """
    Process a document and create CDS Hooks cards from model outputs or static content.

    Creates cards in one of two ways:
    1. From model-generated text stored in the document's model outputs container,
       accessed using the configured source and task
    2. From static content provided during initialization

    The generated text can optionally be split into multiple cards using a delimiter.
    Each piece of text is formatted using the configured template into a CDS Hooks card
    and added to the document's CDS container.

    Args:
        doc (Document): Document containing model outputs and CDS container

    Returns:
        Document: The input document with generated CDS cards added to its CDS container

    Raises:
        ValueError: If neither model configuration (source and task) nor static content
            is provided for card creation
    """
    if self.source and self.task:
        generated_text = doc.models.get_generated_text(self.source, self.task)
        if not generated_text:
            logger.warning(
                f"No generated text for {self.source}/{self.task} found for CDS card creation!"
            )
            return doc
    elif self.static_content:
        generated_text = [self.static_content]
    else:
        raise ValueError(
            "Either model output (source and task) or content need to be provided for CDS card creation!"
        )

    # Create card from model output
    cards = []
    for text in generated_text:
        texts = [text] if not self.delimiter else text.split(self.delimiter)
        for t in texts:
            try:
                cards.append(self.create_card(t))
            except Exception as e:
                logger.warning(f"Error creating card: {str(e)}")

    if cards:
        doc.add_cds_cards(cards)

    return doc

create_card(content)

Creates a CDS Card using the template and model output.

Source code in healthchain/pipeline/components/cdscardcreator.py
def create_card(self, content: str) -> Card:
    """Creates a CDS Card using the template and model output."""
    try:
        # Clean and escape the content
        # TODO: format to html that can be rendered in card
        content = content.replace("\n", " ").replace("\r", " ").strip()
        content = content.replace('"', '\\"')  # Escape double quotes

        try:
            card_json = self.template.render(
                model_output=content, default_source=self.default_source
            )
        except Exception as e:
            raise ValueError(f"Error rendering template: {str(e)}")

        # Parse the rendered JSON into card fields
        card_fields = json.loads(card_json)

        return Card(
            summary=card_fields["summary"][:140],  # Enforce max length
            indicator=IndicatorEnum(card_fields["indicator"]),
            source=Source(**card_fields["source"]),
            detail=card_fields.get("detail"),
            suggestions=card_fields.get("suggestions"),
            selectionBehavior=card_fields.get("selectionBehavior"),
            overrideReasons=card_fields.get("overrideReasons"),
            links=card_fields.get("links"),
        )
    except Exception as e:
        raise ValueError(
            f"Error creating CDS card: Failed to render template or parse card fields: {str(e)}"
        )