End-to-end lineage with DVC and Amazon SageMaker AI MLflow apps


Production ML teams often struggle to trace a model back to the exact data and code that trained it. In this post, we close that gap by combining DVC for data versioning, Amazon SageMaker AI for scalable processing and training, and Amazon SageMaker AI MLflow Apps for experiment tracking and model registry, turning multi-day audit investigations into a single query.

Full text here, and GitHub repository here.

We walk through two deployable patterns you can run end-to-end in your own AWS account. The first is a foundational dataset-level lineage pattern, where every MLflow run logs the DVC commit hash (data_git_commit_id) that points to the exact versioned dataset in Amazon S3. The second is a record-level lineage pattern for regulated environments (healthcare, financial services, GDPR opt-out scenarios) that adds manifests and a consent registry, so you can answer questions like “which models were trained on patient X’s data?” directly from MLflow artifacts. The result is a clean separation of concerns: DVC owns data-to-training lineage, MLflow owns training-to-deployment lineage, and the Git commit hash ties them together.
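To make the two patterns concrete, here is a minimal sketch in plain Python. It is not the code from the repository: the helper names, the manifest layout, and the consent registry modeled as a simple set of opted-out IDs are all assumptions for illustration. Only the data_git_commit_id key comes from the post. MLflow is deliberately left out so the snippet runs with the standard library alone; in a real run, the tags would presumably be attached with something like mlflow.set_tags and the manifest stored as a run artifact.

```python
import subprocess
from typing import Dict, Iterable, List, Optional, Set


def current_git_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the HEAD commit hash, or None if Git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def lineage_tags(data_git_commit_id: str) -> Dict[str, str]:
    """Dataset-level pattern: the tag a training run would log.

    Assumption: this dict would be handed to the tracking client,
    for example via mlflow.set_tags(...), inside an active run.
    """
    return {"data_git_commit_id": data_git_commit_id}


def filter_consented(record_ids: Iterable[str], opted_out: Set[str]) -> List[str]:
    """Drop records whose owners appear in the (hypothetical) opt-out registry."""
    return [rid for rid in record_ids if rid not in opted_out]


def build_manifest(run_id: str, data_git_commit_id: str,
                   record_ids: Iterable[str]) -> Dict[str, object]:
    """Record-level pattern: a per-run manifest of the records used.

    Assumption: the layout below is illustrative, not the post's format.
    """
    return {
        "run_id": run_id,
        "data_git_commit_id": data_git_commit_id,
        "record_ids": sorted(record_ids),
    }


def runs_trained_on(record_id: str, manifests: Iterable[Dict[str, object]]) -> List[str]:
    """Answer 'which runs were trained on this record?' from stored manifests."""
    return [m["run_id"] for m in manifests if record_id in m["record_ids"]]


if __name__ == "__main__":
    opted_out = {"patient-4"}  # hypothetical consent registry contents
    batch_a = filter_consented(["patient-1", "patient-2", "patient-4"], opted_out)
    batch_b = filter_consented(["patient-2", "patient-3"], opted_out)

    # Placeholder hashes; a real run would use current_git_commit().
    manifests = [
        build_manifest("run-a", "abc123", batch_a),
        build_manifest("run-b", "def456", batch_b),
    ]
    print(lineage_tags("abc123"))
    print(runs_trained_on("patient-2", manifests))  # ['run-a', 'run-b']
    print(runs_trained_on("patient-4", manifests))  # []
```

The design choice worth noting is that the manifest carries the same data_git_commit_id as the run tags, so a record-level answer can always be followed back to the exact dataset version.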