Basic Example

Overview

This example demonstrates how Metaxy automatically detects changes in upstream features and triggers recomputation of downstream features. It shows the core value proposition of Metaxy: avoiding unnecessary recomputation while ensuring data consistency.

We will build a simple two-feature pipeline where a child feature depends on a parent feature. When the parent's algorithm changes (represented by code_version), the child feature is automatically recomputed.

The Pipeline

Let's define a pipeline with two features:

---
title: Feature Graph
---
flowchart TB
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    examples_parent["<div style="text-align:left"><b>examples/parent</b><br/>0aad9b8a<br/><font color="#999">---</font><br/>- embeddings (05e66510)</div>"]
    examples_child["<div style="text-align:left"><b>examples/child</b><br/>467b2c02<br/><font color="#999">---</font><br/>- predictions (15e27d1c)</div>"]
    examples_parent --> examples_child

Defining features: ParentFeature

The parent feature represents raw embeddings computed from source data. It has a single field embeddings with a code_version that tracks the algorithm version.

src/example_basic/features.py
"""Feature definitions for recompute example."""

from metaxy import (
    BaseFeature,
    FeatureSpec,
    FieldSpec,
)


class ParentFeature(
    BaseFeature,
    spec=FeatureSpec(
        key="examples/parent",
        fields=[
            FieldSpec(
                key="embeddings",
                code_version="1",
            ),
        ],
        id_columns=("sample_uid",),
    ),
):
    """Parent feature that generates embeddings from raw data."""

    pass

Defining features: ChildFeature

The child feature depends on the parent and produces predictions. The key configuration is the deps argument, which declares that ChildFeature depends on ParentFeature (the dependency this page refers to as a FeatureDep).

src/example_basic/features.py
class ChildFeature(
    BaseFeature,
    spec=FeatureSpec(
        key="examples/child",
        deps=[ParentFeature],
        fields=["predictions"],
        id_columns=("sample_uid",),
    ),
):
    """Child feature that uses parent embeddings to generate predictions."""

    pass

The dependency declaration (deps=[ParentFeature]) tells Metaxy:

  1. ChildFeature depends on ParentFeature
  2. When the parent's field provenance changes, the child must be recomputed
  3. This dependency is tracked automatically, enabling incremental recomputation
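
To make the idea of a tracked dependency concrete, here is a minimal sketch of how a dependency map can be walked to find everything downstream of a changed feature. It uses only the standard library and is not Metaxy's implementation; the DEPS mapping and the downstream_of helper are hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical adjacency map: feature key -> keys of the features it depends on.
DEPS = {
    "examples/parent": [],
    "examples/child": ["examples/parent"],
}


def downstream_of(changed: str, deps: dict[str, list[str]]) -> list[str]:
    """Return every feature that directly or transitively depends on `changed`."""
    # Invert the map so we can walk from a parent to its children.
    children: dict[str, list[str]] = {}
    for key, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(key)

    affected: list[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(downstream_of("examples/parent", DEPS))  # ['examples/child']
```

With only two features the result is trivial, but the same breadth-first walk would also pick up grandchildren in a deeper graph.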

Walkthrough

Step 1: Initial Run

Run the pipeline to create parent embeddings and child predictions:

$ python -m example_basic.pipeline
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
Written 3 rows for feature examples/parent
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3

📊 Computing examples/child...
  feature_version: 467b2c02f5a1629d83fa892000dbc7e441d7ca5e6754d8788a2c93951951f939
Identified: 3 new samples, 0 samples with new provenance_by_field
✓ Materialized 3 new samples

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '14324470123186761611'}
  sample_uid=2: {'predictions': '17377221795775496311'}
  sample_uid=3: {'predictions': '11499809972266932532'}


✅ Pipeline complete!

The pipeline materialized 3 samples for the child feature. Each sample's provenance is recorded per field in provenance_by_field.

Step 2: Verify Idempotency

Run the pipeline again without any changes:

$ python -m example_basic.pipeline
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
Metadata already exists for feature examples/parent (feature_version: 0aad9b8a2ea055cd...)
Skipping write to avoid duplicates
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3

📊 Computing examples/child...
  feature_version: 467b2c02f5a1629d83fa892000dbc7e441d7ca5e6754d8788a2c93951951f939
Identified: 0 new samples, 0 samples with new provenance_by_field

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '14324470123186761611'}
  sample_uid=2: {'predictions': '17377221795775496311'}
  sample_uid=3: {'predictions': '11499809972266932532'}

No changes detected (idempotent)

✅ Pipeline complete!

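Key observation: No recomputation occurred. The parent write was skipped because metadata for that feature_version already exists, and the child run identified 0 new samples and 0 samples with new provenance.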
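
The "Skipping write to avoid duplicates" behaviour in the log can be pictured as a write keyed by feature key and feature version. The sketch below is an illustration under that assumption, not Metaxy's store API; the in-memory store dictionary and the write_once helper are hypothetical.

```python
# Hypothetical in-memory store: (feature_key, feature_version) -> rows.
store: dict[tuple[str, str], list[dict]] = {}


def write_once(feature_key: str, feature_version: str, rows: list[dict]) -> bool:
    """Write rows unless this feature version was already written.

    Returns True if the rows were written, False if the write was skipped.
    """
    slot = (feature_key, feature_version)
    if slot in store:
        return False  # metadata already exists: skip to avoid duplicates
    store[slot] = list(rows)
    return True


rows = [{"sample_uid": uid} for uid in (1, 2, 3)]
print(write_once("examples/parent", "0aad9b8a", rows))  # True  (first run)
print(write_once("examples/parent", "0aad9b8a", rows))  # False (second run)
```

Because the key includes the version, a later run with a different feature version would be written again rather than skipped.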

Step 3: Update Parent Algorithm

Now let's simulate an algorithm improvement by changing the parent's code_version from "1" to "2":

patches/01_update_parent_algorithm.patch
--- a/src/example_basic/features.py
+++ b/src/example_basic/features.py
@@ -15,7 +15,7 @@ class ParentFeature(
         fields=[
             FieldSpec(
                 key="embeddings",
-                code_version="1",
+                code_version="2",
             ),
         ],
         id_columns=("sample_uid",),
---
title: Feature Graph Changes
---
flowchart TB
    %% Snapshot version: none
    %%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
    examples_parent["<div style="text-align:left"><b>examples/parent</b><br/><font color="#FF0000">0aad9b8a</font> → <font color="#00FF00">a007f308</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">embeddings</font> (<font color="#FF0000">05e66510</font> → <font color="#00FF00">3c8d3e9b</font>)</div>"]
    examples_child["<div style="text-align:left"><b>examples/child</b><br/><font color="#FF0000">467b2c02</font> → <font color="#00FF00">415c7848</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">predictions</font> (<font color="#FF0000">15e27d1c</font> → <font color="#00FF00">391c5ef3</font>)</div>"]
    examples_parent --> examples_child


    style examples_parent stroke:#FFAA00,stroke-width:2px
    style examples_child stroke:#FFAA00,stroke-width:2px

This change alters both feature versions (parent 0aad9b8a → a007f308, child 467b2c02 → 415c7848), so the existing embeddings and the downstream feature have to be recomputed.

Step 4: Observe Automatic Recomputation

Run the pipeline again after the algorithm change:

$ python -m example_basic.pipeline
Graph snapshot_version: 1b0b50c116514994c79e989666226a2f8b2d7c5e42b565bf5779e70f6a80fb5c
Written 3 rows for feature examples/parent
Pipeline
============================================================

[1/2] Computing parent feature...

[2/2] Computing child feature...
Graph snapshot_version: 1b0b50c116514994c79e989666226a2f8b2d7c5e42b565bf5779e70f6a80fb5c

📊 Computing examples/child...
  feature_version: 415c78486684ec5f285b05b4f0043395c0ab2ac193123c3f5b5d6bfb0b145c43
Identified: 3 new samples, 0 samples with new provenance_by_field
✓ Materialized 3 new samples

📋 Child provenance_by_field:
  sample_uid=1: {'predictions': '14324470123186761611'}
  sample_uid=2: {'predictions': '17377221795775496311'}
  sample_uid=3: {'predictions': '11499809972266932532'}


✅ Pipeline complete!

Key observation: The child feature was automatically recomputed because:

  1. The parent's code_version changed from "1" to "2"
  2. This changed the parent's metaxy_feature_version
  3. The child's field dependency on embeddings detected the change
  4. All child samples were marked for recomputation
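
The chain above can be sketched as hashes that feed into one another: a feature's version covers its own field code versions plus the versions of its parents, so a change at the top ripples down. This is a simplified model using SHA-256 over JSON, not Metaxy's actual hashing scheme. The digest, feature_version, and graph_versions helpers are hypothetical, and the child's code version "1" is an assumed placeholder.

```python
import hashlib
import json


def digest(payload: dict) -> str:
    """Stable SHA-256 hex digest of a JSON-serialisable payload."""
    raw = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(raw).hexdigest()


def feature_version(key: str, fields: dict[str, str], parent_versions: list[str]) -> str:
    """Hash the feature's own spec together with the versions of its parents."""
    return digest({"key": key, "fields": fields, "parents": sorted(parent_versions)})


def graph_versions(parent_code_version: str) -> tuple[str, str]:
    """Return (parent_version, child_version) for the two-feature pipeline."""
    parent = feature_version("examples/parent", {"embeddings": parent_code_version}, [])
    # Assumed placeholder code version for the child's field.
    child = feature_version("examples/child", {"predictions": "1"}, [parent])
    return parent, child


before = graph_versions("1")
after = graph_versions("2")
print("parent changed:", before[0] != after[0])
print("child changed: ", before[1] != after[1])
```

The child's own spec is identical in both calls. Its version differs only because the parent's version is one of its hash inputs.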

How It Works

Metaxy tracks provenance at the field level using content hashes:

  1. Feature Version: A hash of the feature specification (including code_version of all fields)
  2. Field Provenance: A hash combining the field's code_version and upstream provenance
  3. Dependency Resolution: When resolving updates, Metaxy computes what the provenance would be and compares it to what's stored

The resolve_update() method returns:

  • added: New samples that don't exist in the store
  • changed: Existing samples whose computed provenance differs from stored provenance

This enables precise, incremental recomputation without re-processing unchanged data.
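
As a rough sketch of the comparison described above, the code below computes the provenance each sample would have and compares it with what is stored, sorting samples into added and changed. It is a simplified stand-in rather than the real resolve_update(): it ignores feature versions, and the field_provenance and resolve helpers, along with the sample upstream values, are hypothetical.

```python
import hashlib


def field_provenance(code_version: str, upstream: list[str]) -> str:
    """Combine a field's code_version with the provenance of its upstream fields."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    for value in sorted(upstream):
        h.update(b"\x00" + value.encode())  # separator keeps inputs distinct
    return h.hexdigest()


def resolve(expected: dict[int, str], stored: dict[int, str]) -> tuple[list[int], list[int]]:
    """Return (added, changed) sample ids by comparing expected and stored provenance."""
    added = sorted(uid for uid in expected if uid not in stored)
    changed = sorted(
        uid for uid in expected if uid in stored and stored[uid] != expected[uid]
    )
    return added, changed


# Hypothetical upstream provenance per sample_uid.
upstream = {1: "p1", 2: "p2", 3: "p3"}
expected = {uid: field_provenance("1", [prov]) for uid, prov in upstream.items()}

print(resolve(expected, {}))        # ([1, 2, 3], []) - nothing stored yet
print(resolve(expected, expected))  # ([], [])        - idempotent rerun

# Upstream provenance of sample 2 changes, so only that sample is flagged.
upstream[2] = "p2-updated"
recomputed = {uid: field_provenance("1", [prov]) for uid, prov in upstream.items()}
print(resolve(recomputed, expected))  # ([], [2])
```

Only the sample whose inputs changed is returned, which is what keeps recomputation incremental.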

Conclusion

Metaxy provides automatic change detection and incremental recomputation through:

  • Feature dependency tracking via FeatureDep
  • Algorithm versioning via code_version
  • Provenance-based change detection via resolve_update()

This ensures your pipelines stay efficient and your data stays up to date.

Learn more about: