Skip to content

Feature System

Metaxy has a declarative (defined statically at class level), expressive, flexible feature system. It has been inspired by Dagster's Software-Defined Assets and Nix.

Abstract

Features represent tabular metadata, typically containing references to external multi-modal data such as files, images, or videos. But it can be just pure metadata as well.

I will highlight data and metadata with bold so it really stands out.

Metaxy is responsible for providing correct metadata to users.

During incremental processing, Metaxy will automatically resolve added, changed and deleted metadata rows and calculate the right sample versions for them.

Metaxy does not interact with data directly, the user is responsible for writing it, typically using metadata to identify sample locations in storage.

Keeping Historical Data

Include metaxy_data_version in your data path to avoid collisions between different versions of the same data sample. Doing this will ensure that newer samples are never written over older ones.

I hope we can stop using bold for data and metadata from now on, hopefully we've made our point.

Feature Definitions

Metaxy provides a BaseFeature class that can be extended to create user-defined features. It's a Pydantic model.

Abstract

Features must have unique (across all projects) FeatureKey associated with them.

Users must provide one or more ID columns to FeatureSpec, telling Metaxy how to uniquely identify feature samples.

from metaxy import BaseFeature, FeatureSpec


class VideoFeature(
    BaseFeature, spec=FeatureSpec(key="/raw/video", id_columns=["video_id"])
):
    path: str

Since VideoFeature is a root feature, it doesn't have any dependencies.

That's it! Easy.

Tip

You may now use VideoFeature.spec() class method to access the original feature spec: it's bound to the class.

Now let's define a child feature.

class Transcript(
    BaseFeature,
    spec=FeatureSpec(
        key="/processed/transcript", id_columns=["video_id"], deps=[VideoFeature]
    ),
):
    transcript_path: str
    speakers_json_path: str
    num_speakers: int
The God FeatureGraph object

Features live on a global FeatureGraph object (typically users do not need to interact with it directly).

Hurray! You get the idea.

Field-Level Dependencies

A core (1) feature of Metaxy is the concept of field-level dependencies. These are used to define dependencies between logical fields of features.

  1. really a killer 🔫

Abstract

A Metaxy field is not to be confused with metadata column. Columns refer to metadata and are stored in metadata stores (such as databases) supported by Metaxy. (1)

  1. columns can be defined with Pydantic fields 😅

Fields refer to data and are purely logical - users are free to define them as they see fit. Fields are supposed to represent parts of data that users care about. For example, a Video feature - an .mp4 file - may have frames and audio fields.

At this point, careful readers have probably noticed that the Transcript feature from the example above should not depend on the full video: it only needs the audio track in order to generate the transcript. Let's express this with Metaxy:

from metaxy import FieldDep, FieldSpec

video_spec = FeatureSpec(key="/raw/video", fields=["audio", "frames"])


class VideoFeature(BaseFeature, spec=video_spec):
    path: str


transcript_spec = TranscriptFeatureSpec(
    key="/raw/transcript",
    id_columns=["video_id"],
    fields=[
        FieldSpec(
            key="text",
            deps=[FieldDep(feature=VideoFeature, fields=["audio"])],
        )
    ],
)


class TranscriptFeature(BaseTranscriptFeature, spec=transcript_spec):
    path: str

Voilà!

Use boilerplate-free API

Metaxy allows passing simplified types to some of the models like FeatureSpec or FeatureKey. See syntactic sugar for more details.

The Data Versioning docs explain more about how Metaxy calculates versions for different components of a feature graph.

Attaching user-defined metadata

Users can attach arbitrary JSON-like metadata dictionary to feature specs, typically used for declaring ownership, providing information to third-party tooling, or documentation purposes. This metadata does not influence graph topology or the versioning system.

Fully Qualified Field Key

Abstract

A fully qualified field key (FQFK) is an identifier that uniquely identifies a field within the whole feature graph.

It consists of the feature key and the field key, separated by a colon.

Example

  • /raw/video:frames

  • /raw/video:audio/english