Basic Example¶
Overview¶
This example demonstrates how Metaxy automatically detects changes in upstream features and triggers recomputation of downstream features. It shows the core value proposition of Metaxy: avoiding unnecessary recomputation while ensuring data consistency.
We will build a simple two-feature pipeline where a child feature depends on a parent feature. When the parent's algorithm changes (represented by code_version), the child feature is automatically recomputed.
The Pipeline¶
Let's define a pipeline with two features:
---
title: Feature Graph
---
flowchart TB
%% Snapshot version: none
%%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
examples_parent["<div style="text-align:left"><b>examples/parent</b><br/>0aad9b8a<br/><font color="#999">---</font><br/>- embeddings (05e66510)</div>"]
examples_child["<div style="text-align:left"><b>examples/child</b><br/>467b2c02<br/><font color="#999">---</font><br/>- predictions (15e27d1c)</div>"]
examples_parent --> examples_child
Defining features: ParentFeature¶
The parent feature represents raw embeddings computed from source data. It has a single field embeddings with a code_version that tracks the algorithm version.
"""Feature definitions for recompute example."""

from metaxy import (
    BaseFeature,
    FeatureSpec,
    FieldSpec,
)


class ParentFeature(
    BaseFeature,
    spec=FeatureSpec(
        key="examples/parent",
        fields=[
            FieldSpec(
                key="embeddings",
                code_version="1",
            ),
        ],
        id_columns=("sample_uid",),
    ),
):
    """Parent feature that generates embeddings from raw data."""

    pass
Defining features: ChildFeature¶
The child feature depends on the parent and produces predictions. The key configuration is deps=[ParentFeature], the FeatureDep declaration stating that ChildFeature depends on ParentFeature.
class ChildFeature(
    BaseFeature,
    spec=FeatureSpec(
        key="examples/child",
        deps=[ParentFeature],
        fields=["predictions"],
        id_columns=("sample_uid",),
    ),
):
    """Child feature that uses parent embeddings to generate predictions."""

    pass
The FeatureDep declaration tells Metaxy:
- ChildFeature depends on ParentFeature
- When the parent's field provenance changes, the child must be recomputed
- This dependency is tracked automatically, enabling incremental recomputation
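To make the idea of dependency tracking concrete, here is a minimal sketch in plain Python of how a declared dependency graph can be walked to find which features are downstream of a change. It does not use Metaxy's API; the function name downstream_of and the deps mapping are hypothetical and exist only for illustration.

```python
from collections import deque


def downstream_of(changed: str, deps: dict[str, list[str]]) -> list[str]:
    """Return features that transitively depend on `changed`, in BFS order.

    `deps` maps each feature key to the keys of its direct parents
    (a hypothetical stand-in for declared dependencies).
    """
    # Invert the child -> parents mapping into parent -> children.
    children: dict[str, list[str]] = {}
    for child, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    affected: list[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# The two-feature graph from this example.
deps = {"examples/child": ["examples/parent"]}
print(downstream_of("examples/parent", deps))  # ['examples/child']
```

A change to examples/parent marks examples/child as affected, while a change to examples/child affects nothing else, because no feature lists it as a parent.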
Walkthrough¶
Step 1: Initial Run¶
Run the pipeline to create parent embeddings and child predictions:
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
Written 3 rows for feature examples/parent
Pipeline
============================================================
[1/2] Computing parent feature...
[2/2] Computing child feature...
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
📊 Computing examples/child...
feature_version: 467b2c02f5a1629d83fa892000dbc7e441d7ca5e6754d8788a2c93951951f939
Identified: 3 new samples, 0 samples with new provenance_by_field
✓ Materialized 3 new samples
📋 Child provenance_by_field:
sample_uid=1: {'predictions': '14324470123186761611'}
sample_uid=2: {'predictions': '17377221795775496311'}
sample_uid=3: {'predictions': '11499809972266932532'}
✅ Pipeline complete!
The pipeline materialized 3 samples for the child feature. Each sample has its provenance tracked.
Step 2: Verify Idempotency¶
Run the pipeline again without any changes:
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
Metadata already exists for feature examples/parent (feature_version: 0aad9b8a2ea055cd...)
Skipping write to avoid duplicates
Pipeline
============================================================
[1/2] Computing parent feature...
[2/2] Computing child feature...
Graph snapshot_version: 0ac29a763c520b92fd66a1622b9f340fc0b139a9134699a67c18a9efdece5cd3
📊 Computing examples/child...
feature_version: 467b2c02f5a1629d83fa892000dbc7e441d7ca5e6754d8788a2c93951951f939
Identified: 0 new samples, 0 samples with new provenance_by_field
📋 Child provenance_by_field:
sample_uid=1: {'predictions': '14324470123186761611'}
sample_uid=2: {'predictions': '17377221795775496311'}
sample_uid=3: {'predictions': '11499809972266932532'}
No changes detected (idempotent)
✅ Pipeline complete!
Key observation: No recomputation occurred.
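The "Skipping write to avoid duplicates" line in the output suggests a check keyed on the feature version. The following sketch shows one way such a check could work, assuming a simple in-memory store. ToyMetadataStore and write_if_absent are hypothetical names, not part of Metaxy.

```python
class ToyMetadataStore:
    """In-memory stand-in for a metadata store (illustration only)."""

    def __init__(self) -> None:
        # Rows are grouped by (feature key, feature version).
        self._rows: dict[tuple[str, str], list[dict]] = {}

    def write_if_absent(
        self, feature_key: str, feature_version: str, rows: list[dict]
    ) -> bool:
        """Write rows unless this feature version was already written.

        Returns True if rows were written, False if the write was skipped.
        """
        key = (feature_key, feature_version)
        if key in self._rows:
            return False
        self._rows[key] = list(rows)
        return True


store = ToyMetadataStore()
rows = [{"sample_uid": uid} for uid in (1, 2, 3)]
print(store.write_if_absent("examples/parent", "0aad9b8a", rows))  # True
print(store.write_if_absent("examples/parent", "0aad9b8a", rows))  # False
```

Because the key includes the feature version, a second run with unchanged code is a no-op, whereas a run after a version change writes a fresh set of rows.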
Step 3: Update Parent Algorithm¶
Now let's simulate an algorithm improvement by changing the parent's code_version from "1" to "2":
---
title: Feature Graph Changes
---
flowchart TB
%% Snapshot version: none
%%{init: {'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'themeVariables': {'fontSize': '14px'}}}%%
examples_parent["<div style="text-align:left"><b>examples/parent</b><br/><font color="#FF0000">0aad9b8a</font> → <font color="#00FF00">a007f308</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">embeddings</font> (<font color="#FF0000">05e66510</font> → <font color="#00FF00">3c8d3e9b</font>)</div>"]
examples_child["<div style="text-align:left"><b>examples/child</b><br/><font color="#FF0000">467b2c02</font> → <font color="#00FF00">415c7848</font><br/><font color="#999">---</font><br/>- <font color="#FFAA00">predictions</font> (<font color="#FF0000">15e27d1c</font> → <font color="#00FF00">391c5ef3</font>)</div>"]
examples_parent --> examples_child
style examples_parent stroke:#FFAA00,stroke-width:2px
style examples_child stroke:#FFAA00,stroke-width:2px
This change means that the existing embeddings and the downstream feature have to be recomputed.
Step 4: Observe Automatic Recomputation¶
Run the pipeline again after the algorithm change:
Graph snapshot_version: 1b0b50c116514994c79e989666226a2f8b2d7c5e42b565bf5779e70f6a80fb5c
Written 3 rows for feature examples/parent
Pipeline
============================================================
[1/2] Computing parent feature...
[2/2] Computing child feature...
Graph snapshot_version: 1b0b50c116514994c79e989666226a2f8b2d7c5e42b565bf5779e70f6a80fb5c
📊 Computing examples/child...
feature_version: 415c78486684ec5f285b05b4f0043395c0ab2ac193123c3f5b5d6bfb0b145c43
Identified: 3 new samples, 0 samples with new provenance_by_field
✓ Materialized 3 new samples
📋 Child provenance_by_field:
sample_uid=1: {'predictions': '14324470123186761611'}
sample_uid=2: {'predictions': '17377221795775496311'}
sample_uid=3: {'predictions': '11499809972266932532'}
✅ Pipeline complete!
Key observation: The child feature was automatically recomputed because:
- The parent's code_version changed from "1" to "2"
- This changed the parent's metaxy_feature_version
- The child's field dependency on embeddings detected the change
- All child samples were marked for recomputation
How It Works¶
Metaxy tracks provenance at the field level using content hashes:
- Feature Version: A hash of the feature specification (including code_version of all fields)
- Field Provenance: A hash combining the field's code_version and upstream provenance
- Dependency Resolution: When resolving updates, Metaxy computes what the provenance would be and compares it to what's stored
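The hashing scheme described above can be illustrated with a small sketch. This is not Metaxy's actual hash function or input layout: the digest helper, the field_version function, the "|" separator, and the child's code_version of "1" are all assumptions made for the example. It only shows why bumping a parent's code_version also changes every hash computed downstream of it.

```python
import hashlib


def digest(*parts: str) -> str:
    """Short, deterministic hash of the given parts (hypothetical helper)."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:8]


def field_version(
    feature_key: str,
    field_key: str,
    code_version: str,
    upstream_versions: list[str],
) -> str:
    """Combine a field's own code_version with its upstream versions."""
    return digest(feature_key, field_key, code_version, *sorted(upstream_versions))


def versions(parent_code_version: str) -> tuple[str, str]:
    """Compute (parent, child) field versions for the two-feature graph."""
    parent = field_version("examples/parent", "embeddings", parent_code_version, [])
    # The child's code_version is assumed to be "1" for this sketch.
    child = field_version("examples/child", "predictions", "1", [parent])
    return parent, child


parent_v1, child_v1 = versions("1")
parent_v2, child_v2 = versions("2")

# Bumping only the parent's code_version changes both hashes.
print(parent_v1 != parent_v2, child_v1 != child_v2)  # True True
```

The child's own code never changed, yet its hash differs because the parent's hash is one of its inputs.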
The resolve_update() method returns:
- added: New samples that don't exist in the store
- changed: Existing samples whose computed provenance differs from stored provenance
This enables precise, incremental recomputation without re-processing unchanged data.
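As a rough illustration of the added/changed split, the sketch below compares expected provenance against stored provenance using plain dictionaries. It is not the implementation of resolve_update(); the function diff_provenance and the provenance strings are hypothetical.

```python
Provenance = dict[int, dict[str, str]]  # sample_uid -> {field: provenance hash}


def diff_provenance(expected: Provenance, stored: Provenance) -> tuple[list[int], list[int]]:
    """Split sample ids into (added, changed) by comparing provenance.

    added:   ids present in `expected` but missing from `stored`
    changed: ids present in both whose provenance differs
    """
    added = sorted(uid for uid in expected if uid not in stored)
    changed = sorted(
        uid for uid in expected if uid in stored and stored[uid] != expected[uid]
    )
    return added, changed


# Hypothetical provenance values, chosen only to exercise both cases.
expected = {
    1: {"predictions": "p1-new"},
    2: {"predictions": "p2"},
    3: {"predictions": "p3"},
}
stored = {
    1: {"predictions": "p1-old"},
    2: {"predictions": "p2"},
}

added, changed = diff_provenance(expected, stored)
print(added, changed)  # [3] [1]
```

Sample 2 appears in neither list, which is the point: only new or stale samples need to be processed.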
Conclusion¶
Metaxy provides automatic change detection and incremental recomputation through:
- Feature dependency tracking via FeatureDep
- Algorithm versioning via code_version
- Provenance-based change detection via resolve_update()
This ensures your pipelines stay efficient and your data stays up to date.
Related Materials¶
Learn more about: