
Type-Hinting DataFrames for Static Evaluation and Runtime Validation


How StaticFrame Enables Comprehensive DataFrame Type Hints

Towards Data Science
A multi-colored glass mosaic
Photo by Author

Since the advent of type hints in Python 3.5, statically typing a DataFrame has generally been limited to specifying just the type:

def process(f: DataFrame) -> Series: ...

This is inadequate, as it ignores the types contained within the container. A DataFrame might have string column labels and three columns of integer, string, and floating-point values; these characteristics define the type. A function argument with such type hints provides developers, static analyzers, and runtime checkers with all the information needed to understand the expectations of the interface. StaticFrame 2 (an open-source project of which I am lead developer) now permits this:

from typing import Any
import numpy as np
from static_frame import Frame, Index, TSeriesAny

def process(f: Frame[   # type of the container
        Any,            # type of the index labels
        Index[np.str_], # type of the column labels
        np.int_,        # type of the first column
        np.str_,        # type of the second column
        np.float64,     # type of the third column
        ]) -> TSeriesAny: ...

All core StaticFrame containers now support generic specifications. In addition to being statically checkable, these type hints can be validated at runtime on function interfaces with a new decorator, @CallGuard.check. Further, using Annotated generics, the new Require class defines a family of powerful runtime validators, permitting per-column or per-row data checks. Finally, each container exposes a new via_type_clinic interface to derive and validate type hints. Together, these tools offer a cohesive approach to type-hinting and validating DataFrames.

Requirements of a Generic DataFrame

Python’s built-in generic types (e.g., tuple or dict) require specification of component types (e.g., tuple[int, str, bool] or dict[str, int]). Defining component types permits more accurate static analysis. While the same is true for DataFrames, there have been few attempts to define comprehensive type hints for DataFrames.

Pandas, even with the pandas-stubs package, does not permit specifying the types of a DataFrame’s components. The Pandas DataFrame, permitting extensive in-place mutation, may not be sensible to type statically. Fortunately, immutable DataFrames are available in StaticFrame.

Further, Python’s tools for defining generics, until recently, have not been well-suited for DataFrames. That a DataFrame has a variable number of heterogeneous columnar types poses a challenge for generic specification. Typing such a structure became easier with the new TypeVarTuple, introduced in Python 3.11 (and back-ported in the typing_extensions package).

A TypeVarTuple permits defining generics that accept a variable number of types. (See PEP 646 for details.) With this new type variable, StaticFrame can define a generic Frame with a TypeVar for the index, a TypeVar for the columns, and a TypeVarTuple for zero or more columnar types.
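To make the shape of such a generic concrete, here is a minimal sketch using only the standard library. MiniFrame is a hypothetical stand-in invented for illustration; it is not how StaticFrame defines Frame, only the same pattern of two TypeVars followed by a TypeVarTuple.

```python
from typing import Generic, TypeVar, get_args, get_origin

try:  # Python 3.11+
    from typing import TypeVarTuple, Unpack
except ImportError:  # older interpreters; assumes typing_extensions is installed
    from typing_extensions import TypeVarTuple, Unpack

TIndex = TypeVar("TIndex")          # type of the index
TColumns = TypeVar("TColumns")      # type of the columns
TDtypes = TypeVarTuple("TDtypes")   # zero or more columnar types


class MiniFrame(Generic[TIndex, TColumns, Unpack[TDtypes]]):
    """Hypothetical container used only to illustrate the generic signature."""


# Index type, columns type, then any number of columnar types.
three_columns = MiniFrame[int, str, int, str, float]
no_columns = MiniFrame[int, str]

print(get_origin(three_columns) is MiniFrame)
print(get_args(three_columns))
print(get_args(no_columns))
```

The first two arguments always bind to the index and columns; everything after them is collected by the TypeVarTuple, which is why the same class accepts both aliases.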

A generic Series is defined with a TypeVar for the index and a TypeVar for the values. The StaticFrame Index and IndexHierarchy are also generic, the latter again benefiting from TypeVarTuple to define a variable number of component Index types, one for each depth level.

StaticFrame uses NumPy types to define the columnar types of a Frame, or the values of a Series or Index. This permits narrowly specifying sized numerical types, such as np.uint8 or np.complex128, or broadly specifying categories of types, such as np.integer or np.inexact. As StaticFrame supports all NumPy types, the correspondence is direct.
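The distinction between narrow and broad NumPy types follows from NumPy's scalar type hierarchy, which can be inspected directly. The sketch below uses only NumPy; dtype_matches is a hypothetical helper written for this illustration and says nothing about how StaticFrame performs its own checks.

```python
import numpy as np


def dtype_matches(dtype, hint) -> bool:
    """Return True if a concrete dtype falls under a NumPy scalar type hint."""
    return issubclass(np.dtype(dtype).type, hint)


# A sized type matches itself and the abstract categories above it.
print(dtype_matches(np.uint8, np.uint8))
print(dtype_matches(np.uint8, np.integer))

# np.inexact covers both floating-point and complex types.
print(dtype_matches(np.float64, np.inexact))
print(dtype_matches(np.complex128, np.inexact))

# An integer dtype does not satisfy a floating-point category.
print(dtype_matches(np.int64, np.inexact))
```

A hint of np.integer therefore accepts a column of any integer width, while np.uint8 accepts exactly one.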

Interfaces Defined with Generic DataFrames

Extending the example above, the function interface below shows a Frame with three columns transformed into a dictionary of Series. With so much more information provided by component type hints, the function’s purpose is almost obvious.

from typing import Any
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth

def process(f: Frame[
        Any,
        Index[np.str_],
        np.int_,
        np.str_,
        np.float64,
        ]) -> dict[
                int,
                Series[                 # type of the container
                        IndexYearMonth, # type of the index labels
                        np.float64,     # type of the values
                        ],
                ]: ...

This function processes a signal table from an Open Source Asset Pricing (OSAP) dataset (Firm Level Characteristics / Individual / Predictors). Each table has three columns: security identifier (labeled “permno”), year and month (labeled “yyyymm”), and the signal (with a name specific to the signal).

The function ignores the index of the provided Frame (typed as Any) and creates groups defined by the np.int_ values of the first column, “permno”. A dictionary keyed by “permno” is returned, where each value is a Series of np.float64 values for that “permno”; the index is an IndexYearMonth created from the np.str_ “yyyymm” column. (StaticFrame uses NumPy datetime64 values to define unit-typed indices: IndexYearMonth stores datetime64[M] labels.)

Rather than returning a dict, the function below returns a Series with a hierarchical index. The IndexHierarchy generic specifies a component Index for each depth level; here, the outer depth is an Index[np.int_] (derived from the “permno” column), and the inner depth is an IndexYearMonth (derived from the “yyyymm” column).

from typing import Any
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy

def process(f: Frame[
        Any,
        Index[np.str_],
        np.int_,
        np.str_,
        np.float64,
        ]) -> Series[                    # type of the container
                IndexHierarchy[          # type of the index labels
                        Index[np.int_],  # type of index depth 0
                        IndexYearMonth], # type of index depth 1
                np.float64,              # type of the values
                ]: ...

Rich type hints provide a self-documenting interface that makes functionality explicit. Even better, these type hints can be used for static analysis with Pyright (now) and Mypy (pending full TypeVarTuple support). For example, calling this function with a Frame of two columns of np.float64 will fail a static analysis type check or deliver a warning in an editor.

Runtime Type Validation

Static type checking may not be enough: runtime validation provides even stronger constraints, particularly for dynamic or incompletely (or incorrectly) type-hinted values.

Building on a new runtime type checker named TypeClinic, StaticFrame 2 introduces @CallGuard.check, a decorator for runtime validation of type-hinted interfaces. All StaticFrame and NumPy generics are supported, and most built-in Python types are supported, even when deeply nested. The function below adds the @CallGuard.check decorator.

from typing import Any
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard

@CallGuard.check
def process(f: Frame[
        Any,
        Index[np.str_],
        np.int_,
        np.str_,
        np.float64,
        ]) -> Series[
                IndexHierarchy[Index[np.int_], IndexYearMonth],
                np.float64,
                ]: ...

Now decorated with @CallGuard.check, if the function above is called with an unlabelled Frame of two columns of np.float64, a ClinicError exception will be raised, illustrating that, where three columns were expected, two were provided, and where string column labels were expected, integer labels were provided. (To issue warnings instead of raising exceptions, use the @CallGuard.warn decorator.)

ClinicError:
In args of (f: Frame[Any, Index[str_], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Index[str_], int64, str_, float64]
    └── Expected Frame has 3 dtype, provided Frame has 2 dtype
In args of (f: Frame[Any, Index[str_], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Index[str_], int64, str_, float64]
    └── Index[str_]
        └── Expected str_, provided int64 invalid
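For readers curious about the general mechanism behind a call-checking decorator, the following is a minimal standard-library sketch. The names check and CheckError are hypothetical and the logic handles only plain classes; it is not TypeClinic or CallGuard, which additionally understand generics and nested types.

```python
import functools
import inspect
from typing import get_type_hints


class CheckError(TypeError):
    """Hypothetical error type for this sketch."""


def check(func):
    """Validate arguments and the return value against plain-class annotations."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; generic aliases are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise CheckError(
                    f"In args: {name} expected {expected.__name__}, "
                    f"provided {type(value).__name__}"
                )
        result = func(*args, **kwargs)
        expected = hints.get("return")
        if isinstance(expected, type) and not isinstance(result, expected):
            raise CheckError(
                f"In return: expected {expected.__name__}, "
                f"provided {type(result).__name__}"
            )
        return result

    return wrapper


@check
def scale(values: list, factor: float) -> list:
    return [v * factor for v in values]


print(scale([1, 2], 2.0))
try:
    scale((1, 2), 2.0)  # a tuple is not a list
except CheckError as error:
    print(error)
```

The essential steps are the same in any such tool: read the annotations once at decoration time, bind the call's arguments to parameter names, and compare each value with its hint before and after the wrapped call.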

Runtime Data Validation

Other characteristics can be validated at runtime: for example, the shape or name attributes, or the sequence of labels on the index or columns. The StaticFrame Require class provides a family of configurable validators.

  • Require.Name: Validate the “name” attribute of the container.
  • Require.Len: Validate the length of the container.
  • Require.Shape: Validate the “shape” attribute of the container.
  • Require.LabelsOrder: Validate the ordering of the labels.
  • Require.LabelsMatch: Validate inclusion of labels independent of order.
  • Require.Apply: Apply a Boolean-returning function to the container.

Aligning with a growing trend, these objects are provided within type hints as one or more additional arguments to an Annotated generic. (See PEP 593 for details.) The type referenced by the first Annotated argument is the target of subsequent-argument validators. For example, if an Index[np.str_] type hint is replaced with an Annotated[Index[np.str_], Require.Len(20)] type hint, the runtime length validation is applied to the index associated with the first argument.
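The way a runtime checker can recover validators from an Annotated hint is visible with the standard library alone. In this sketch, RequireLen and validate are hypothetical names chosen for illustration; they mimic the idea of Require.Len but are not StaticFrame's implementation.

```python
from typing import Annotated, get_args, get_origin


class RequireLen:
    """Hypothetical validator: the target must have exactly `size` items."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __call__(self, value) -> bool:
        return len(value) == self.size


def validate(value, hint) -> list:
    """Return one message for each Annotated validator that the value fails."""
    if get_origin(hint) is not Annotated:
        return []
    # The first argument is the target type; the rest are metadata objects.
    _, *validators = get_args(hint)
    failures = []
    for validator in validators:
        if callable(validator) and not validator(value):
            failures.append(f"{type(validator).__name__} failed")
    return failures


hint = Annotated[list, RequireLen(3)]
print(validate(["permno", "yyyymm", "Mom12m"], hint))
print(validate(["permno"], hint))
```

Because static checkers treat Annotated[T, ...] as T, the metadata adds runtime checks without changing what Pyright or Mypy see.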

Extending the example of processing an OSAP signal table, we might validate our expectation of column labels. The Require.LabelsOrder validator can define a sequence of labels, optionally using ... for contiguous regions of zero or more unspecified labels. To specify that the first two columns of the table are labeled “permno” and “yyyymm”, while the third label is variable (depending on the signal), the following Require.LabelsOrder can be defined within an Annotated generic:

from typing import Any, Annotated
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard, Require

@CallGuard.check
def process(f: Frame[
        Any,
        Annotated[
                Index[np.str_],
                Require.LabelsOrder('permno', 'yyyymm', ...),
                ],
        np.int_,
        np.str_,
        np.float64,
        ]) -> Series[
                IndexHierarchy[Index[np.int_], IndexYearMonth],
                np.float64,
                ]: ...
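The ordering rule described above, where ... stands for zero or more unspecified labels, can be sketched as a small matching routine. The function labels_in_order is a hypothetical illustration of the rule as stated in this article, not the algorithm Require.LabelsOrder actually uses.

```python
def labels_in_order(labels, pattern) -> bool:
    """Match labels against a pattern in which ... covers zero or more labels."""
    labels = tuple(labels)
    pattern = tuple(pattern)

    def match(i: int, j: int) -> bool:
        if j == len(pattern):
            # Pattern exhausted: succeed only if all labels were consumed.
            return i == len(labels)
        if pattern[j] is ...:
            # Let the ellipsis absorb 0, 1, 2, ... of the remaining labels.
            return any(match(k, j + 1) for k in range(i, len(labels) + 1))
        return i < len(labels) and labels[i] == pattern[j] and match(i + 1, j + 1)

    return match(0, 0)


pattern = ("permno", "yyyymm", ...)
print(labels_in_order(("permno", "yyyymm", "Mom12m"), pattern))
print(labels_in_order(("yyyymm", "permno", "Mom12m"), pattern))
```

With this rule, a trailing ... leaves the signal column unconstrained while still fixing the position of the first two labels.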

If the interface expects a small collection of OSAP signal tables, we can validate the third column with the Require.LabelsMatch validator. This validator can specify required labels, sets of labels (from which at least one must match), and regular expression patterns. If tables from only three files are expected (i.e., “Mom12m.csv”, “Mom6m.csv”, and “LRreversal.csv”), we can validate the label of the third column by defining Require.LabelsMatch with a set:

@CallGuard.check
def process(f: Frame[
        Any,
        Annotated[
                Index[np.str_],
                Require.LabelsOrder('permno', 'yyyymm', ...),
                Require.LabelsMatch({'Mom12m', 'Mom6m', 'LRreversal'}),
                ],
        np.int_,
        np.str_,
        np.float64,
        ]) -> Series[
                IndexHierarchy[Index[np.int_], IndexYearMonth],
                np.float64,
                ]: ...
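The three kinds of specifier mentioned above (a required label, a set from which at least one label must be present, and a regular expression) can be sketched with the standard library. The function labels_match is hypothetical and reflects only the behavior described in this article, independent of order.

```python
import re


def labels_match(labels, *specs) -> list:
    """Return the specifiers that no label satisfies (empty means valid)."""
    labels = tuple(labels)
    unmatched = []
    for spec in specs:
        if isinstance(spec, (set, frozenset)):
            # At least one member of the set must be among the labels.
            found = any(label in spec for label in labels)
        elif isinstance(spec, re.Pattern):
            # At least one label must match the pattern (labels assumed str).
            found = any(spec.search(label) for label in labels)
        else:
            # A plain label must be present exactly.
            found = spec in labels
        if not found:
            unmatched.append(spec)
    return unmatched


signals = {"Mom12m", "Mom6m", "LRreversal"}
print(labels_match(("permno", "yyyymm", "Mom12m"), "permno", signals))
print(labels_match(("permno", "yyyymm", "Mom3m"), "permno", signals))
print(labels_match(("permno", "yyyymm", "Mom3m"), re.compile(r"^Mom\d+m$")))
```

Returning the unmatched specifiers, rather than a bare Boolean, makes it straightforward to report which expectation failed.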

Both Require.LabelsOrder and Require.LabelsMatch can associate functions with label specifiers to validate data values. If the validator is applied to column labels, a Series of column values will be provided to the function; if the validator is applied to index labels, a Series of row values will be provided to the function.

Similar to the usage of Annotated, the label specifier is replaced with a list, where the first item is the label specifier, and the remaining items are row- or column-processing functions that return a Boolean.

To extend the example above, we might validate that all “permno” values are greater than zero and that all signal values (“Mom12m”, “Mom6m”, “LRreversal”) are greater than or equal to -1.

from typing import Any, Annotated
import numpy as np
from static_frame import Frame, Series, Index, IndexYearMonth, IndexHierarchy, CallGuard, Require

@CallGuard.check
def process(f: Frame[
        Any,
        Annotated[
                Index[np.str_],
                Require.LabelsOrder(
                        ['permno', lambda s: (s > 0).all()],
                        'yyyymm',
                        ...,
                        ),
                Require.LabelsMatch(
                        [{'Mom12m', 'Mom6m', 'LRreversal'}, lambda s: (s >= -1).all()],
                        ),
                ],
        np.int_,
        np.str_,
        np.float64,
        ]) -> Series[
                IndexHierarchy[Index[np.int_], IndexYearMonth],
                np.float64,
                ]: ...

If a validation fails, @CallGuard.check will raise an exception. For example, if the above function is called with a Frame that has an unexpected third-column label, the following exception will be raised:

ClinicError:
In args of (f: Frame[Any, Annotated[Index[str_], LabelsOrder(['permno', <lambda>], 'yyyymm', ...), LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])], int64, str_, float64]) -> Series[IndexHierarchy[Index[int64], IndexYearMonth], float64]
└── Frame[Any, Annotated[Index[str_], LabelsOrder(['permno', <lambda>], 'yyyymm', ...), LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])], int64, str_, float64]
    └── Annotated[Index[str_], LabelsOrder(['permno', <lambda>], 'yyyymm', ...), LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])]
        └── LabelsMatch([{'Mom12m', 'LRreversal', 'Mom6m'}, <lambda>])
            └── Expected label to match frozenset({'Mom12m', 'LRreversal', 'Mom6m'}), no provided match

The Expressive Power of TypeVarTuple

As shown above, TypeVarTuple permits specifying a Frame with zero or more heterogeneous columnar types. For example, we can provide type hints for a Frame of two float columns or of six mixed types:

>>> from typing import Any
>>> import numpy as np
>>> import static_frame as sf

>>> f1: sf.Frame[Any, Any, np.float64, np.float64]
>>> f2: sf.Frame[Any, Any, np.bool_, np.float64, np.int8, np.int8, np.str_, np.datetime64]

While this accommodates diverse DataFrames, type-hinting wide DataFrames, such as those with hundreds of columns, would be unwieldy. Python 3.11 introduces a new syntax to provide a variable range of types in TypeVarTuple generics: star expressions of tuple generic aliases. For example, to type-hint a Frame with a date index, string column labels, and any configuration of columnar types, we can star-unpack a tuple of zero or more Any:

>>> from typing import Any
>>> import numpy as np
>>> import static_frame as sf

>>> f: sf.Frame[sf.Index[np.datetime64], sf.Index[np.str_], *tuple[Any, ...]]

The tuple star expression can go anywhere in a list of types, but there can be only one. For example, the type hint below defines a Frame that must start with Boolean and string columns but has a flexible specification for any number of subsequent np.float64 columns.

>>> from typing import Any
>>> import numpy as np
>>> import static_frame as sf

>>> f: sf.Frame[Any, Any, np.bool_, np.str_, *tuple[np.float64, ...]]
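Because only one star expression is permitted, checking actual column types against such a hint reduces to comparing a fixed prefix, a fixed suffix, and a homogeneous middle. The sketch below illustrates that reasoning with a hypothetical dtypes_match function, using built-in classes as stand-ins for NumPy types; it is not how StaticFrame or a static checker implements the rule.

```python
def dtypes_match(actual, prefix, repeated, suffix=()) -> bool:
    """Check `actual` against (*prefix, *tuple[repeated, ...], *suffix)."""
    actual = tuple(actual)
    prefix = tuple(prefix)
    suffix = tuple(suffix)
    if len(actual) < len(prefix) + len(suffix):
        return False  # not enough columns to cover the fixed positions
    split = len(actual) - len(suffix)
    head = actual[:len(prefix)]
    middle = actual[len(prefix):split]
    tail = actual[split:]
    return (
        head == prefix
        and tail == suffix
        and all(dtype is repeated for dtype in middle)
    )


# Hint shape: bool, str, then zero or more float columns.
print(dtypes_match((bool, str, float, float), (bool, str), float))
print(dtypes_match((bool, str), (bool, str), float))
print(dtypes_match((bool, str, float, int), (bool, str), float))
```

If two star expressions were allowed, the boundaries between the variable regions would be ambiguous, which is one way to see why the restriction exists.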

Utilities for Type Hinting

Working with such detailed type hints can be challenging. To aid users, StaticFrame provides convenient utilities for runtime type hinting and checking. All StaticFrame 2 containers now feature a via_type_clinic interface, permitting access to TypeClinic functionality.

First, utilities are provided to translate a container, such as a complete Frame, into a type hint. The string representation of the via_type_clinic interface provides a string representation of the container’s type hint; alternatively, the to_hint() method returns a complete generic alias object.

>>> import static_frame as sf
>>> f = sf.Frame.from_records(([3, '192004', 0.3], [3, '192005', -0.4]), columns=('permno', 'yyyymm', 'Mom3m'))

>>> f.via_type_clinic
Frame[Index[int64], Index[str_], int64, str_, float64]

>>> f.via_type_clinic.to_hint()
static_frame.core.frame.Frame[static_frame.core.index.Index[numpy.int64], static_frame.core.index.Index[numpy.str_], numpy.int64, numpy.str_, numpy.float64]

Second, utilities are provided for runtime-type-hint testing. The via_type_clinic.check() function permits validating the container against a provided type hint.

>>> import typing as tp
>>> import numpy as np
>>> f.via_type_clinic.check(sf.Frame[sf.Index[np.str_], sf.TIndexAny, *tuple[tp.Any, ...]])
ClinicError:
In Frame[Index[str_], Index[Any], Unpack[Tuple[Any, ...]]]
└── Index[str_]
    └── Expected str_, provided int64 invalid

To support gradual typing, StaticFrame defines several generic aliases configured with Any for every component type. For example, TFrameAny can be used for any Frame, and TSeriesAny for any Series. As expected, TFrameAny will validate the Frame created above.

>>> f.via_type_clinic.check(sf.TFrameAny)

Conclusion

Better type hinting for DataFrames is overdue. With modern Python typing tools and a DataFrame built on an immutable data model, StaticFrame 2 meets this need, providing powerful resources for engineers prioritizing maintainability and verifiability.
