Basic Example¶

Overview¶

This example demonstrates how Metaxy automatically detects changes in upstream features and triggers recomputation of downstream features. It shows the core value proposition of Metaxy: avoiding unnecessary recomputation while ensuring data consistency.

We will build a simple two-feature pipeline where a child feature depends on a parent feature. When the parent's algorithm changes (represented by code_version), the child feature is automatically recomputed.

The Pipeline¶

Let's define a pipeline with two features:

---
title: Feature Graph
---
flowchart TB
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    examples_parent["<div style="text-align:left"><b>examples/parent</b><br/>0aad9b8a<br/><font color="#999">---</font><br/>- embeddings (05e66510)</div>"]
    examples_child["<div style="text-align:left"><b>examples/child</b><br/>467b2c02<br/><font color="#999">---</font><br/>- predictions (15e27d1c)</div>"]
    examples_parent --> examples_child

Defining features: `ParentFeature`¶

The parent feature represents raw embeddings computed from source data. It has a single field embeddings with a code_version that tracks the algorithm version.

src/example_basic/features.py

"""Feature definitions for recompute example."""

from metaxy import (
    BaseFeature,
    FeatureSpec,
    FieldSpec,
)


class ParentFeature(
    BaseFeature,
    spec=FeatureSpec(
        key="examples/parent",
        fields=[
            FieldSpec(
                key="embeddings",
                code_version="1",
            ),
        ],
        id_columns=("sample_uid",),
    ),
):
    """Parent feature that generates embeddings from raw data."""

    pass

Defining features: `ChildFeature`¶

The child feature depends on the parent and produces predictions. The key configuration is the FeatureDep, declared here through the deps argument, which states that ChildFeature depends on ParentFeature.

src/example_basic/features.py

class ChildFeature(
    BaseFeature,
    spec=FeatureSpec(
        key="examples/child",
        deps=[ParentFeature],
        fields=["predictions"],
        id_columns=("sample_uid",),
    ),
):
    """Child feature that uses parent embeddings to generate predictions."""

    pass

The FeatureDep declaration tells Metaxy:

- ChildFeature depends on ParentFeature
- When the parent's field provenance changes, the child must be recomputed
- This dependency is tracked automatically, enabling incremental recomputation
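To make the idea of dependency tracking concrete, here is a minimal sketch in plain Python of how a change to one feature can be propagated to everything downstream of it. It does not use Metaxy: the `DEPS` mapping and the `downstream_of` helper are hypothetical names invented for illustration, and only the feature keys are taken from this example.

```python
from collections import deque

# Hypothetical dependency map: feature key -> keys of the features it depends on.
# The keys mirror this example; the data structure itself is an assumption.
DEPS = {
    "examples/parent": [],
    "examples/child": ["examples/parent"],
}


def downstream_of(changed: str, deps: dict[str, list[str]]) -> list[str]:
    """Return every feature that transitively depends on `changed`."""
    # Invert the map so we can walk from a parent to its children.
    children: dict[str, list[str]] = {key: [] for key in deps}
    for key, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(key)

    affected: list[str] = []
    seen: set[str] = set()
    queue = deque(children.get(changed, []))
    while queue:
        key = queue.popleft()
        if key in seen:
            continue
        seen.add(key)
        affected.append(key)
        queue.extend(children.get(key, []))
    return affected


print(downstream_of("examples/parent", DEPS))
```

With the two features from this example, a change to `examples/parent` marks `examples/child` as affected, while a change to `examples/child` affects nothing else.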

Walkthrough¶

Step 1: Initial Run¶

Run the pipeline to create parent embeddings and child predictions:

$ python -m example_basic.pipeline

Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
Written 3 rows for feature examples/parent
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3

📊 Computing examples/child...
  feature_version: 467b2c02f5a1629d83fa892000dbc7e441d7ca5e6754d8788a2c93951951f939
Identified: 3 new samples, 0 samples with new provenance_by_field
✓ Materialized 3 new samples

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '14324470123186761611'}
  sample_uid=2: {'predictions': '17377221795775496311'}
  sample_uid=3: {'predictions': '11499809972266932532'}


✅ Pipeline complete!

The pipeline materialized 3 samples for the child feature and recorded a provenance_by_field entry for each one.

Step 2: Verify Idempotency¶

Run the pipeline again without any changes:

$ python -m example_basic.pipeline

Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
Metadata already exists for feature examples/parent (feature_version: 0aad9b8a2ea055cd...)
Skipping write to avoid duplicates
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3

📊 Computing examples/child...
  feature_version: 467b2c02f5a1629d83fa892000dbc7e441d7ca5e6754d8788a2c93951951f939
Identified: 0 new samples, 0 samples with new provenance_by_field

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '14324470123186761611'}
  sample_uid=2: {'predictions': '17377221795775496311'}
  sample_uid=3: {'predictions': '11499809972266932532'}

No changes detected (idempotent)

✅ Pipeline complete!

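Key observation: No recomputation occurred. The parent write was skipped because its metadata already exists, and the child run identified 0 new samples and 0 samples with new provenance.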

Step 3: Update Parent Algorithm¶

Now let's simulate an algorithm improvement by changing the parent's code_version from "1" to "2":


patches/01_update_parent_algorithm.patch

--- a/src/example_basic/features.py
+++ b/src/example_basic/features.py
@@ -15,7 +15,7 @@ class ParentFeature(
         fields=[
             FieldSpec(
                 key="embeddings",
-                code_version="1",
+                code_version="2",
             ),
         ],
         id_columns=("sample_uid",),

---
title: Feature Graph Changes
---
flowchart TB
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    examples_parent["<div style="text-align:left"><b>examples/parent</b><br/><font color="#FF0000">0aad9b8a</font> → <font color="#00FF00">a007f308</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">embeddings</font> (<font color="#FF0000">05e66510</font> → <font color="#00FF00">3c8d3e9b</font>)</div>"]
    examples_child["<div style="text-align:left"><b>examples/child</b><br/><font color="#FF0000">467b2c02</font> → <font color="#00FF00">415c7848</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">predictions</font> (<font color="#FF0000">15e27d1c</font> → <font color="#00FF00">391c5ef3</font>)</div>"]
    examples_parent --> examples_child


    style examples_parent stroke:#FFAA00,stroke-width:2px
    style examples_child stroke:#FFAA00,stroke-width:2px

This change means that the existing embeddings and the downstream feature have to be recomputed.

Step 4: Observe Automatic Recomputation¶

Run the pipeline again after the algorithm change:

$ python -m example_basic.pipeline

Graph snapshot_version: 1b0b50c116514994c79e989666226a2f8b2d7c5e42b565bf5779e70f6a80fb5c
Written 3 rows for feature examples/parent
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph snapshot_version: 1b0b50c116514994c79e989666226a2f8b2d7c5e42b565bf5779e70f6a80fb5c

📊 Computing examples/child...
  feature_version: 415c78486684ec5f285b05b4f0043395c0ab2ac193123c3f5b5d6bfb0b145c43
Identified: 3 new samples, 0 samples with new provenance_by_field
✓ Materialized 3 new samples

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '14324470123186761611'}
  sample_uid=2: {'predictions': '17377221795775496311'}
  sample_uid=3: {'predictions': '11499809972266932532'}


✅ Pipeline complete!

Key observation: The child feature was automatically recomputed because:

1. The parent's code_version changed from "1" to "2"
2. This changed the parent's metaxy_feature_version
3. The child's field dependency on embeddings detected the change
4. All child samples were marked for recomputation
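The chain of causes above can be sketched with ordinary hashing. The snippet below is an illustration only, not Metaxy's hashing scheme: `version_hash` and `field_version` are hypothetical helpers, SHA-256 is an arbitrary choice, and the child's code_version of "1" is an assumption, since the example does not set one explicitly.

```python
import hashlib


def version_hash(*parts: str) -> str:
    """Hash the given strings together into one hex digest."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def field_version(field_key: str, code_version: str, upstream: list[str]) -> str:
    # A field's version depends on its own code_version and on the versions
    # of the upstream fields it reads (sorted so the order does not matter).
    return version_hash(field_key, code_version, *sorted(upstream))


# Parent field before and after the code_version bump.
parent_v1 = field_version("embeddings", "1", [])
parent_v2 = field_version("embeddings", "2", [])

# The child's own code_version never changes (assumed "1" here),
# yet its version changes because the upstream version did.
child_v1 = field_version("predictions", "1", [parent_v1])
child_v2 = field_version("predictions", "1", [parent_v2])

print("parent changed:", parent_v1 != parent_v2)
print("child changed: ", child_v1 != child_v2)
```

Because the upstream version is an input to the child's hash, a change anywhere upstream surfaces as a new version downstream without the child's definition being edited.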

How It Works¶

Metaxy tracks provenance at the field level using content hashes:

- Feature Version: A hash of the feature specification (including code_version of all fields)
- Field Provenance: A hash combining the field's code_version and upstream provenance
- Dependency Resolution: When resolving updates, Metaxy computes what the provenance would be and compares it to what's stored

The resolve_update() method returns:

- added: New samples that don't exist in the store
- changed: Existing samples whose computed provenance differs from stored provenance
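The added/changed split can be illustrated with a small comparison over plain dictionaries. This is a sketch of the idea only and does not reproduce the signature or return type of resolve_update(): `diff_provenance` is a hypothetical helper, and the provenance strings are made-up placeholders rather than real hashes.

```python
def diff_provenance(
    expected: dict[int, str], stored: dict[int, str]
) -> tuple[list[int], list[int]]:
    """Compare expected provenance per sample_uid against what is stored.

    Returns (added, changed):
      added   - sample_uids present in `expected` but missing from `stored`
      changed - sample_uids present in both whose provenance differs
    """
    added = sorted(uid for uid in expected if uid not in stored)
    changed = sorted(
        uid for uid in expected if uid in stored and stored[uid] != expected[uid]
    )
    return added, changed


# Made-up provenance values keyed by sample_uid.
stored = {1: "aaa", 2: "bbb"}
expected = {1: "aaa", 2: "bbb-new", 3: "ccc"}

added, changed = diff_provenance(expected, stored)
print("added:", added)
print("changed:", changed)
```

In this toy data, sample 3 is new, sample 2 has a different provenance, and sample 1 is untouched, so only samples 2 and 3 would need to be processed. When `expected` equals `stored`, both lists are empty, which matches the idempotent run in Step 2.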

This enables precise, incremental recomputation without re-processing unchanged data.

Conclusion¶

Metaxy provides automatic change detection and incremental recomputation through:

- Feature dependency tracking via FeatureDep
- Algorithm versioning via code_version
- Provenance-based change detection via resolve_update()

This ensures your pipelines are efficient and your data stays up to date.

Learn more about: