Talks and presentations

Accelerate FM pre-training on Amazon SageMaker HyperPod

September 04, 2024

Conference, AWS Summit Zurich, Zurich, Switzerland

Co-presented with Ankit Anand and Matt Nightingale, this session explored the challenges of training foundation models (FMs) at scale and how Amazon SageMaker HyperPod addresses them. The talk covered the generative AI landscape and the growing computational demands of FM development, from prompt engineering and retrieval-augmented generation (RAG) to full pre-training. We introduced SageMaker HyperPod as a resilient, performant, and customizable environment for large-scale distributed training. Its self-healing clusters automatically detect hardware failures, replace faulty instances, and resume training jobs from checkpoints, reducing training time by up to 20%. The session then went under the hood of HyperPod, covering cluster architecture, instance groups, lifecycle scripts, Elastic Fabric Adapter (EFA) for high-speed inter-node communication, distributed training software stacks for both GPU and Trainium, and job scheduling with auto-healing. Customer stories from Stability AI, Perplexity AI, and Hugging Face illustrated real-world benefits.

Productionize ML workloads using Amazon SageMaker MLOps

September 26, 2023

Conference, AWS Cloud Day Zurich, Zurich, Switzerland

This talk presented a comprehensive overview of MLOps on AWS, covering the journey from experimental notebooks to production-ready ML systems using Amazon SageMaker. Starting from the premise that ML code is only a small fraction of a real-world ML system, the session walked through an MLOps maturity framework with four phases (Initial, Repeatable, Reliable, and Scalable), mapping each to specific AWS services and capabilities. Topics included SageMaker Studio for experimentation, SageMaker Experiments for tracking, SageMaker Pipelines for workflow automation, Model Registry for versioning and promotion, SageMaker Projects for one-click CI/CD provisioning, shadow testing and deployment guardrails, Model Monitor for drift detection, and Model Cards and Model Dashboard for governance. The talk also covered team structures, multi-account strategies, and custom project templates for enterprise-scale MLOps.

Paolo Di Francesco
