Metaxy


Warning

Metaxy hasn't been publicly released yet, but you can try the latest dev release:

pip install --pre metaxy

Metaxy is a metadata layer for multi-modal Data and ML pipelines that manages and tracks metadata: sample versions, dependencies, and data lineage across complex computational graphs.

It's agnostic to orchestration frameworks, compute engines, data or metadata storage. Metaxy has no strict infrastructure requirements.

Metaxy scales to large volumes of metadata.

All of this is possible thanks to (1) Narwhals, Ibis, and a few clever tricks.

  1. we really do stand on the shoulders of giants

What problem exactly does Metaxy solve?

Data, ML and AI workloads processing large amounts of images, videos, audio, or text (1) can be very expensive to run. In contrast to traditional data engineering, re-running the whole pipeline on every change is no longer an option. It therefore becomes crucial to implement incremental processing and sample-level versioning correctly.

  1. or really any kind of data

These workloads typically aren't static: they evolve all the time, with new data being shipped, bugfixes or algorithm changes introduced, and new features added to the pipeline. This means the pipeline has to be re-computed frequently, while unnecessary recomputations for individual samples in the datasets must still be avoided.

Here are some of the cases where re-computing would be undesirable:

  • merging two consecutive steps into one (refactoring the graph topology)

  • partial data updates, e.g. changing only the audio track inside a video file

  • backfilling metadata from another source

Correctly distinguishing these scenarios from cases where the feature should be re-computed is surprisingly challenging. Tracking and propagating these changes correctly to the right subset of samples and features can become incredibly complicated and time-consuming. Until now, a general solution for this problem did not exist, but this is not the case anymore.

Metaxy's solution

  1. Metaxy builds a versioned graph from declarative feature definitions and tracks version changes across individual samples.

  2. Metaxy introduces a unique field-level dependency system to express partial data dependencies and avoid unnecessary downstream recomputations. Each sample holds a dictionary mapping its fields to their respective versions. (1)

  3. Metaxy implements a general MetadataStore interface that enables it to work with any kind of storage system, be it an analytical database or a lakehouse format.

  1. for example, a video sample could independently version the frames and the audio track: {"audio": "asddsa", "frames": "joasdb"}
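
To make the field-level idea above concrete, here is a minimal sketch in plain Python. It is not Metaxy's API: `field_version`, `transcript_version`, and `thumbnail_version` are hypothetical helpers, and SHA-256 is just one possible hashing choice. It assumes a downstream field's version is derived from its own code version plus the versions of the upstream fields it actually reads.

```python
import hashlib


def field_version(code_version: str, upstream: dict[str, str]) -> str:
    """Derive a short version hash for one field (hypothetical helper)."""
    digest = hashlib.sha256(code_version.encode())
    for name in sorted(upstream):  # sort so the hash is order-independent
        digest.update(f"|{name}={upstream[name]}".encode())
    return digest.hexdigest()[:8]


# Two versions of the same video sample: only the audio track changed.
video_old = {"audio": "a1", "frames": "f1"}
video_new = {"audio": "a2", "frames": "f1"}


def transcript_version(video: dict[str, str]) -> str:
    # A transcript reads only the audio field.
    return field_version("transcript-v1", {"audio": video["audio"]})


def thumbnail_version(video: dict[str, str]) -> str:
    # Thumbnails read only the frames field.
    return field_version("thumbnail-v1", {"frames": video["frames"]})


# Compare downstream versions before and after the upstream change.
needs_transcript = transcript_version(video_old) != transcript_version(video_new)
needs_thumbnail = thumbnail_version(video_old) != thumbnail_version(video_new)
```

Under these assumptions, replacing the audio track changes the transcript's version but leaves the thumbnail's version untouched, so only the transcript would be recomputed.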

Quickstart

Head to Quickstart (WIP!).

About Metaxy

Metaxy is:

  • 🧩 composable --- bring your own everything!

    • supports DuckDB, ClickHouse, and 20+ databases via Ibis, lakehouse storage formats such as DeltaLake or DuckLake, and other solutions such as LanceDB

    • is agnostic to tabular compute engines: Polars, Spark, Pandas, and databases thanks to Narwhals

    • we totally don't care how the multi-modal data is produced or where it is stored: Metaxy is responsible for yielding input metadata and writing output metadata

    • blends right in with orchestrators: see the excellent Dagster integration 🐙

  • 🪨 rock solid when it matters:

    • versioning is guaranteed to be consistent across DBs and local (1) compute engines. We really have tested this very well!

    • unique field-level dependency system prevents unnecessary recomputations for features that depend on partial data

    • metadata is append-only to ensure data integrity and immutability, but Metaxy provides the tooling to perform cleanup on-demand

  • 📈 scalable:

    • supports feature organization and discovery patterns such as packaging entry points. This enables collaboration across teams and projects.

    • is built with performance in mind: by default, all operations run in the DB and only the resolved increment is fetched into Python

  • 🧑‍💻 dev friendly:

    • clean, intuitive Python API with syntactic sugar that simplifies common feature definitions

    • feature discovery system for effortless dependency management

    • comprehensive type hints with all the typing shenanigans like @overload. Metaxy utilizes Pydantic for feature definitions.

    • first-class support for local development, testing, preview environments, and CI/CD workflows

    • CLI tool for easy interaction, inspection and visualization of feature graphs, enriched with real metadata and stats

    • integrations with popular tools such as Dagster, DuckDB, SQLModel and others

  1. the local versioning engine is implemented in Polars and polars-hash
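
The append-only metadata and "resolved increment" ideas from the list above can be illustrated with a toy in-memory sketch. Everything here is an assumption made for illustration: `AppendOnlyStore` and `resolve_increment` are hypothetical names, not Metaxy's `MetadataStore` interface, and the real comparison is described as running inside the database rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyStore:
    """Toy stand-in for a metadata table (hypothetical, illustration only)."""

    rows: list[tuple[str, str]] = field(default_factory=list)

    def append(self, sample_id: str, version: str) -> None:
        # Rows are only ever added, never updated or deleted.
        self.rows.append((sample_id, version))

    def latest(self) -> dict[str, str]:
        # Later rows overwrite earlier ones, keeping the newest version per sample.
        return dict(self.rows)


def resolve_increment(expected: dict[str, str], store: AppendOnlyStore) -> list[str]:
    """Return the sample IDs whose stored version is missing or outdated."""
    current = store.latest()
    return sorted(s for s, v in expected.items() if current.get(s) != v)


store = AppendOnlyStore()
store.append("a", "v1")
store.append("b", "v1")

# Versions the pipeline expects after an upstream change to "b" and a new sample "c".
expected = {"a": "v1", "b": "v2", "c": "v1"}
increment = resolve_increment(expected, store)

# Writing the results appends new rows; the old row for "b" stays in place.
for sample_id in increment:
    store.append(sample_id, expected[sample_id])
remaining = resolve_increment(expected, store)
```

In this sketch only the changed sample `b` and the new sample `c` land in the increment, and the history of `b` is preserved because nothing is overwritten.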

What's Next?