[Session Report] Revealing the Secret of 40% Cost Reduction! Best Practices for Large-scale Model Training Proven in Amazon Nova Development (AWS-56) #AWSSummit

2025.06.26

June 26, 2025 (Thu) 15:50 - 16:30
Speaker: Keita Watanabe
Sr. World Wide Specialist Solutions Architect, Frameworks WWSO
Amazon Web Services Japan G.K.

Overview

The combination of Amazon SageMaker HyperPod and EC2 UltraClusters has achieved high fault tolerance and efficiency in large-scale foundation model training. By leveraging these best practices, proven during the development of Amazon Nova, it is possible to reduce costs and shorten training time. In particular, the right combination of 3D parallelism (data parallel, tensor parallel, pipeline parallel) in distributed training and techniques such as asynchronous checkpoint generation played a crucial role.

Title Slide

Evolution and Challenges of Distributed Training

Machine learning has evolved from workloads that fit on a single GPU to distributed training, driven by the emergence of large-scale foundation models.

Distributed Training

There are three main parallelization methods in distributed training (a minimal layout sketch follows this list):

  • Data Parallelism
    Processing different data across multiple model replicas
  • Tensor Parallelism
    Distributing processing at the MLP and Attention block level
  • Pipeline Parallelism
    Splitting the model's layers into stages that are processed as a pipeline

However, distributed training tightly couples state across nodes, so a single node failure can stall the entire training job.

Innovative Features of Amazon SageMaker HyperPod

To overcome these drawbacks of distributed training, AWS packages the insights gained from developing its own generative AI, Amazon Nova, into Amazon SageMaker HyperPod. HyperPod is a foundation model development environment that reflects best practices in large-scale distributed training.

  • Resiliency Feature
    Automatic recovery from node failures
  • HyperPod Observability
    Visualization of system issues
  • Asynchronous Checkpoint Generation
    Creating checkpoints without interrupting training (see the sketch after this list)
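
As a rough illustration of the asynchronous checkpointing idea (not HyperPod's internal implementation), PyTorch Distributed Checkpoint offers an async_save API that writes the checkpoint in the background so training steps are not blocked; the storage path below is a hypothetical shared filesystem mount.

```python
# Minimal sketch of asynchronous checkpointing with PyTorch Distributed
# Checkpoint (assumes PyTorch >= 2.4); not HyperPod-specific code.
# `model`, `optimizer`, and `step` are placeholders, and /fsx/ckpts is a
# hypothetical shared-storage mount.
import torch.distributed.checkpoint as dcp

def save_async(model, optimizer, step):
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    # Returns a Future; the GPU training loop keeps running while the
    # checkpoint is written to storage in the background.
    return dcp.async_save(state_dict, checkpoint_id=f"/fsx/ckpts/step-{step}")
```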

Leveraging Amazon EC2 UltraClusters

UltraClusters function as a supercomputer foundation integrating high-performance computing, networking, and storage.

  • High-speed accelerators and large-capacity device memory
  • High-bandwidth interconnect
  • Scalable distributed file storage

UltraClusters

AWS Deep Learning Software Stack

AWS provides Deep Learning AMIs (DLAMI), machine images that bundle the libraries needed for model development.

  • ML Frameworks
    PyTorch, JAX, DDP, FSDP, Megatron-LM, DeepSpeed, torch-neuronx
  • Communication Libraries & SDKs
    NCCL (key for GPU-to-GPU communication), AWS OFI NCCL, SMP, SMDDP (a quick sanity check for this layer is sketched after this list)
  • Hardware & Kernels
    Accelerator drivers, EFA kernel drivers
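
Since NCCL sits at the heart of this stack, one common practice (my illustration, not something shown in the session) is to run a small all-reduce sanity check before launching a large job, confirming that GPU-to-GPU communication over the interconnect works end to end.

```python
# Minimal NCCL sanity check: each rank all-reduces a tensor so that the full
# communication path (NCCL, and EFA on UltraClusters) is exercised.
# Launch with torchrun, which sets RANK/WORLD_SIZE for each process.
import torch
import torch.distributed as dist

def nccl_sanity_check():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)                        # default op is SUM across ranks
    assert x.item() == dist.get_world_size()
    dist.destroy_process_group()

if __name__ == "__main__":
    nccl_sanity_check()
```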

Proven Case Studies

  • Llama 3.3 Swallow
    Adopted best practices for distributed training using HyperPod

Impression

I was impressed by the innovation in large-scale model training brought about by AWS's advanced distributed computing architecture. In particular, the improved fault tolerance and efficiency achieved by combining Amazon SageMaker HyperPod and EC2 UltraClusters is expected to provide a significant competitive advantage in future AI development.
