How to Migrate AI Workloads to Colocation
Moving AI workloads from cloud to colocation is one of the most impactful infrastructure decisions an AI-focused organization can make. At scale, colocation can reduce costs by 40-60% compared to cloud GPU instances while providing better control, predictable pricing, and hardware customization. But the migration process is complex — involving hardware procurement, network design, data transfer, and operational changes. This guide walks through every step of migrating AI workloads to colocation.
Why Migrate AI from Cloud to Colocation?
Before diving into the how, let's be clear about the why. Not every organization should migrate to colocation, but the economics become compelling at certain thresholds:
- Cost savings at scale: Cloud GPU instances (e.g., AWS p5.48xlarge with 8x H100) cost approximately $98/hour on-demand, or $40-60/hour reserved. Even at reserved rates, that is roughly $350K-$525K per year per 8-GPU node, and well over $800K on-demand. Owning the same hardware in colocation (including depreciation, power, cooling, and space) costs approximately $150K-$200K per year; a rough cost model follows this list. At 10+ nodes, the savings become substantial.
- GPU availability: Cloud GPU capacity has been severely constrained since 2023. Reserved capacity requires 1-3 year commitments at still-premium pricing. Owning your hardware guarantees availability when you need it.
- Customization: In the cloud, you get the instance types the provider offers. In colocation, you can build exactly the cluster you need — custom networking topologies, specific storage configurations, and tailored cooling solutions.
- Data gravity: If you have large training datasets (petabytes), storing them in the cloud is expensive and moving them between providers is slow. Colocation lets you keep data on local high-speed storage without egress charges.
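To put rough numbers on the break-even question, here is a minimal cost-model sketch in Python. Every input (the $45/hour reserved rate, $350K node cost, three-year depreciation, 11 kW per-node draw, $200/kW/month colocation rate, and operations budget) is an illustrative assumption drawn from the ranges above; substitute your own quotes before relying on the output.

```python
# Rough cloud-vs-colocation cost model for one 8-GPU node.
# All inputs are illustrative assumptions -- replace them with your own quotes.

HOURS_PER_YEAR = 8760

def cloud_annual_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Annual cost of a cloud instance at a given hourly rate and utilization."""
    return hourly_rate * HOURS_PER_YEAR * utilization

def colo_annual_cost(hardware_cost: float, depreciation_years: int,
                     node_kw: float, colo_rate_per_kw_month: float,
                     ops_per_year: float) -> float:
    """Annual cost of an owned node: straight-line depreciation + colo fees + operations."""
    depreciation = hardware_cost / depreciation_years
    colo_fees = node_kw * colo_rate_per_kw_month * 12
    return depreciation + colo_fees + ops_per_year

if __name__ == "__main__":
    cloud = cloud_annual_cost(hourly_rate=45.0)            # within the $40-60/hr reserved range
    colo = colo_annual_cost(hardware_cost=350_000,          # 8x H100 server (assumed)
                            depreciation_years=3,
                            node_kw=11.0,                    # per-node draw incl. overhead (assumed)
                            colo_rate_per_kw_month=200.0,    # within the $150-250/kW/month range
                            ops_per_year=40_000)             # remote hands, maintenance (assumed)
    print(f"Cloud (reserved, 100% util): ${cloud:,.0f}/yr")
    print(f"Colocation (owned node):     ${colo:,.0f}/yr")
    print(f"Estimated savings:           {1 - colo / cloud:.0%}")
```

Rerunning it with your actual utilization shows how quickly intermittent usage erodes the savings, which is exactly the "when to stay in the cloud" question below.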
When to Stay in the Cloud
Colocation isn't right for everyone. Consider staying in the cloud if:
- Your GPU usage is intermittent (less than 50% utilization over time)
- You need to scale up and down rapidly and unpredictably
- You don't have (or want to build) hardware operations expertise
- Your total GPU spend is under $500K/year (the overhead of colocation may not justify the savings)
- You need instant access to the latest GPU generations without procurement lead times
Phase 1: Assessment and Planning (4-8 Weeks)
Inventory Your Current Workloads
Start by documenting everything currently running in the cloud that you plan to migrate:
- Compute requirements: How many GPUs, what types, what utilization rates? Track actual usage over 30-60 days to understand real (not provisioned) requirements; a utilization summary sketch follows this list.
- Storage requirements: Total data volume, I/O patterns (sequential vs random), throughput needs. AI training often requires high-throughput parallel storage (like Lustre or GPFS).
- Networking: Inter-node communication bandwidth (critical for distributed training), external connectivity, latency requirements for inference endpoints.
- Software stack: Frameworks (PyTorch, JAX, TensorFlow), orchestration (Kubernetes, Slurm), monitoring, and management tools.
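Here is a minimal sketch of that utilization analysis, assuming you can export per-instance GPU utilization samples (from CloudWatch, DCGM, or your own metrics pipeline) into a CSV. The column names and file name are hypothetical.

```python
# Summarize real GPU utilization from exported metric samples to size the target cluster.
# Assumes a CSV with columns: timestamp, instance_id, gpu_util_pct (a hypothetical export format).
import csv
from collections import defaultdict
from statistics import mean

def summarize_utilization(path: str) -> dict[str, float]:
    """Return mean GPU utilization (%) per instance over the sampled window."""
    samples = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            samples[row["instance_id"]].append(float(row["gpu_util_pct"]))
    return {inst: mean(vals) for inst, vals in samples.items()}

if __name__ == "__main__":
    util = summarize_utilization("gpu_metrics_60d.csv")
    for inst, avg in sorted(util.items(), key=lambda kv: kv[1]):
        print(f"{inst}: {avg:5.1f}% average utilization")
    fleet_avg = mean(util.values())
    print(f"Fleet average: {fleet_avg:.1f}%  "
          f"({'colocation candidate' if fleet_avg >= 50 else 'cloud may still be cheaper'})")
```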
Define Your Target Architecture
Design the colocation deployment architecture based on your workload analysis:
- GPU cluster sizing: How many GPU nodes, what interconnect (InfiniBand vs RoCE), what topology (fat-tree, rail-optimized)?
- Storage architecture: Local NVMe vs networked storage vs hybrid. AI training benefits from high-bandwidth parallel file systems.
- Network design: Separate management, storage, and GPU interconnect networks. Plan for 400 Gbps+ InfiniBand for large training clusters.
- Power and cooling: Calculate total power draw (including GPU thermal design power at full utilization) and confirm the facility can support your density.
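As a sanity check on the power and cooling point, the sketch below estimates per-rack draw from component figures. The GPU TDP, node overhead, and switch numbers are assumptions; replace them with vendor specifications for your actual configuration.

```python
# Rough rack power budget for a GPU cluster, including the ~20% headroom
# recommended later in this guide. All component figures are assumptions.

def rack_power_kw(gpus_per_node: int = 8, gpu_tdp_w: int = 700,      # H100 SXM-class TDP
                  node_overhead_w: int = 2500,                        # CPUs, fans, NICs, drives (assumed)
                  nodes_per_rack: int = 4,
                  network_storage_w: int = 2000,                      # ToR/IB switches, etc. (assumed)
                  headroom: float = 0.20) -> float:
    """Estimated rack draw in kW at full utilization, with headroom applied."""
    node_w = gpus_per_node * gpu_tdp_w + node_overhead_w
    rack_w = node_w * nodes_per_rack + network_storage_w
    return rack_w * (1 + headroom) / 1000

if __name__ == "__main__":
    kw = rack_power_kw()
    print(f"Estimated rack draw: {kw:.1f} kW -- confirm the facility can deliver this per rack")
```

With these assumed figures a four-node rack lands around 41 kW, which is why the 40-80+ kW per-rack criterion in the vendor section matters.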
Budget and Timeline
A realistic colocation migration budget should include:
- Hardware: GPU servers ($200K-$400K each for 8x H100 systems), networking switches ($50K-$200K for InfiniBand fabric), storage systems ($100K-$1M+ depending on capacity), and spares.
- Colocation: Space, power, cooling, and connectivity. Budget $150-250/kW/month for high-density AI deployments in major markets.
- One-time costs: Installation, cabling, data migration, project management, and consulting. Typically 10-15% of hardware costs.
- Ongoing operations: Remote hands, monitoring, hardware warranty/maintenance, spare parts inventory.
- Timeline: Plan for 3-6 months from decision to fully operational, depending on hardware lead times and facility readiness.
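A back-of-envelope first-year budget model, using placeholder figures from the ranges above, can help keep these line items in one place. It is a planning aid, not a quote.

```python
# Back-of-envelope first-year budget for a small training cluster.
# Every figure in main() is an assumption pulled from the ranges discussed above.

def first_year_budget(num_nodes: int, node_cost: float, network_cost: float,
                      storage_cost: float, total_kw: float,
                      colo_rate_per_kw_month: float, ops_per_year: float,
                      one_time_pct: float = 0.125) -> dict[str, float]:
    """Return budget line items; one_time_pct covers the 10-15% install/migration costs."""
    hardware = num_nodes * node_cost + network_cost + storage_cost
    one_time = hardware * one_time_pct
    colo = total_kw * colo_rate_per_kw_month * 12
    return {"hardware": hardware, "one_time": one_time,
            "colocation": colo, "operations": ops_per_year,
            "total": hardware + one_time + colo + ops_per_year}

if __name__ == "__main__":
    budget = first_year_budget(num_nodes=8, node_cost=300_000, network_cost=150_000,
                               storage_cost=400_000, total_kw=100.0,
                               colo_rate_per_kw_month=200.0, ops_per_year=120_000)
    for line, amount in budget.items():
        print(f"{line:>11}: ${amount:,.0f}")
```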
Phase 2: Vendor Selection (2-4 Weeks)
Choosing a Colocation Provider
For AI workloads, not all colocation providers are equal. Key evaluation criteria:
- Power density support: Can the facility deliver 40-80+ kW per rack? Many older facilities max out at 10-15 kW. Confirm actual available density, not marketing claims.
- Cooling infrastructure: Does the facility support liquid cooling (direct-to-chip or immersion)? Air-only facilities struggle to cool dense modern GPU clusters efficiently.
- Network connectivity: Carrier diversity, cloud on-ramps (if you'll maintain a hybrid architecture), and available bandwidth.
- Location: Consider latency to users (for inference), power costs, and proximity to your team for maintenance visits.
- Expansion capacity: Can you grow in the same facility if your AI cluster needs to scale? Getting locked into a facility with no room to expand is a common mistake.
- Operational support: Remote hands capabilities, 24/7 staffing, and experience with GPU/HPC infrastructure.
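One simple way to keep facility comparisons structured is a weighted scoring matrix over the criteria above. The weights and the 1-5 scores in this sketch are placeholders; set them to reflect your own priorities and site-visit findings.

```python
# Simple weighted scoring matrix for comparing colocation facilities against
# the criteria above. Weights and scores below are placeholders.

CRITERIA_WEIGHTS = {
    "power_density": 0.25, "cooling": 0.20, "connectivity": 0.15,
    "location": 0.10, "expansion": 0.15, "operational_support": 0.15,
}

def score_facility(scores: dict[str, int]) -> float:
    """Weighted score: each criterion rated 1-5 during site evaluation."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

if __name__ == "__main__":
    facilities = {
        "Facility A": {"power_density": 5, "cooling": 4, "connectivity": 3,
                       "location": 4, "expansion": 5, "operational_support": 4},
        "Facility B": {"power_density": 3, "cooling": 3, "connectivity": 5,
                       "location": 5, "expansion": 2, "operational_support": 5},
    }
    for name, scores in sorted(facilities.items(),
                               key=lambda kv: score_facility(kv[1]), reverse=True):
        print(f"{name}: {score_facility(scores):.2f} / 5")
```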
Hardware Procurement
GPU server procurement requires careful planning:
- Lead times: NVIDIA GPU servers have had 3-12 month lead times since 2023. Order early and have backup plans.
- OEM vs direct: Major OEMs (Dell, HPE, Lenovo, Supermicro) offer pre-configured AI systems. NVIDIA sells DGX direct. Each has trade-offs in customization, support, and pricing.
- Networking: InfiniBand switches and cables from NVIDIA/Mellanox should be ordered alongside compute. Don't forget optics, cables, and spare parts.
- Storage: High-throughput storage (VAST, Weka, DDN, or building your own with NVMe-over-Fabrics) should be specced and ordered early.
Phase 3: Deployment and Configuration (4-8 Weeks)
Physical Installation
- Rack and cable management planning — AI clusters require significant cabling, especially InfiniBand
- Power circuit provisioning and testing — verify power delivery under load before deploying production workloads
- Cooling verification — ensure liquid cooling loops are operational and thermal performance meets specs under load
- Network infrastructure — install top-of-rack switches, InfiniBand fabric, and management networking
Software Stack Deployment
- Operating system: Ubuntu is the standard for AI workloads. Deploy via PXE boot or Ansible for consistency across nodes.
- GPU drivers: NVIDIA drivers and CUDA toolkit. Pin versions and test thoroughly — driver mismatches cause silent training failures.
- Container runtime: Docker + NVIDIA Container Toolkit for GPU-accelerated containers. Kubernetes with GPU scheduling if using orchestration.
- Distributed training: NCCL for multi-GPU/multi-node communication. Tune NCCL environment variables for your specific network topology (see the launch sketch after this list).
- Cluster management: Slurm for HPC-style job scheduling, or Kubernetes with GPU-aware scheduling for cloud-native approaches.
- Monitoring: DCGM (Data Center GPU Manager) for GPU health and performance, Prometheus + Grafana for system metrics, custom dashboards for training metrics.
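As a concrete example of the NCCL tuning mentioned above, the sketch below sets a few commonly used NCCL environment variables and launches a distributed PyTorch job with torchrun. The interface and HCA names (bond0, mlx5), the rendezvous ID, and the script name are placeholders for your topology; treat this as a starting point, not a recipe, and confirm the right values against your fabric.

```python
# Launch a distributed PyTorch job with NCCL tuned for an InfiniBand fabric.
# Interface/HCA names and torchrun arguments below are placeholders -- adjust
# them to your actual topology and job, and run this on each node in the job.
import os
import subprocess

NCCL_ENV = {
    "NCCL_DEBUG": "WARN",            # raise to INFO when debugging communication issues
    "NCCL_IB_HCA": "mlx5",           # restrict NCCL to your IB HCAs (placeholder)
    "NCCL_SOCKET_IFNAME": "bond0",   # bootstrap/management interface (placeholder)
    "NCCL_NET_GDR_LEVEL": "PHB",     # allow GPUDirect RDMA where the PCIe topology permits
}

def launch_training(nodes: int, gpus_per_node: int, master_addr: str, script: str) -> None:
    """Run torchrun with the NCCL environment merged into the current environment."""
    env = {**os.environ, **NCCL_ENV}
    subprocess.run(
        ["torchrun",
         f"--nnodes={nodes}", f"--nproc_per_node={gpus_per_node}",
         "--rdzv_backend=c10d", "--rdzv_id=colo-validation",
         f"--rdzv_endpoint={master_addr}:29500",
         script],
        env=env, check=True)

if __name__ == "__main__":
    launch_training(nodes=4, gpus_per_node=8, master_addr="node001", script="train.py")
```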
Phase 4: Data Migration (1-4 Weeks)
Moving large training datasets from cloud to colocation is often the most time-consuming step:
- Network transfer: For datasets under 50 TB, direct transfer over dedicated circuits (AWS Direct Connect, Azure ExpressRoute) is practical. At 10 Gbps sustained, 50 TB takes roughly 12 hours (see the sketch after this list).
- Physical data transfer: For datasets over 100 TB, consider physical transfer (AWS Snowball, Azure Data Box). Mailing drives is often faster than network transfer for very large datasets.
- Incremental sync: Start the data migration weeks before the compute cutover. Use rsync or similar tools to keep the colocation copy in sync as new training data arrives.
- Validation: Always verify data integrity after transfer with checksums. A corrupted training dataset can waste weeks of GPU time.
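The sketch below covers the two calculations that matter most in this phase: estimating transfer time for a given dataset size and link speed (assuming roughly 90% effective throughput, which is an assumption), and streaming SHA-256 checksums for post-transfer validation. The file path in the comment is hypothetical.

```python
# Estimate transfer time for a dataset over a dedicated circuit and verify
# integrity after the copy. The effective-throughput figure is an assumption.
import hashlib
from pathlib import Path

def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.9) -> float:
    """Hours to move dataset_tb terabytes at link_gbps, assuming ~90% effective throughput."""
    bits = dataset_tb * 1e12 * 8
    return bits / (link_gbps * 1e9 * efficiency) / 3600

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Streaming SHA-256 so multi-terabyte files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    print(f"50 TB at 10 Gbps:  ~{transfer_hours(50, 10):.1f} hours")
    print(f"500 TB at 10 Gbps: ~{transfer_hours(500, 10):.0f} hours -- consider shipping drives")
    # Compare against a manifest of checksums generated on the source side, e.g.:
    # assert sha256_file(Path("/data/shard-0001.tar")) == expected_checksums["shard-0001.tar"]
```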
Phase 5: Cutover and Optimization (2-4 Weeks)
Validation Testing
- Run NCCL all-reduce benchmarks to verify inter-node GPU communication performance (a quick microbenchmark sketch follows this list)
- Run storage bandwidth tests to verify training data throughput
- Execute a small training run with known expected results to validate end-to-end functionality
- Test failure scenarios — node failures, network failures, storage failures — and verify recovery procedures
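For the NCCL validation step, NVIDIA's nccl-tests suite is the usual tool; the sketch below is a quicker sanity check using torch.distributed, assuming PyTorch is installed and the script is launched with torchrun across the nodes under test.

```python
# Minimal all-reduce microbenchmark with torch.distributed (NCCL backend).
# Launch with torchrun across the nodes under test; nccl-tests gives more
# thorough numbers, but this is a quick end-to-end sanity check.
import os
import time
import torch
import torch.distributed as dist

def allreduce_benchmark(size_mb: int = 256, iters: int = 50) -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    tensor = torch.ones(size_mb * 1024 * 1024 // 4, device="cuda")  # fp32 elements

    for _ in range(5):                      # warm-up iterations
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if rank == 0:
        size_gb = tensor.numel() * 4 / 1e9
        # Ring all-reduce bus bandwidth approximation: 2*(n-1)/n * size / time
        busbw = 2 * (world - 1) / world * size_gb / elapsed
        print(f"{world} ranks, {size_mb} MB: {elapsed*1e3:.2f} ms/iter, ~{busbw:.1f} GB/s bus bw")
    dist.destroy_process_group()

if __name__ == "__main__":
    allreduce_benchmark()
```

Compare the reported bus bandwidth against your fabric's line rate; a large gap is worth investigating before production cutover.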
Production Cutover
- Migrate workloads gradually — start with non-critical training jobs, then move production training
- Maintain cloud capacity during the transition period (typically 1-2 months) as a fallback
- Monitor performance closely during the first 2-4 weeks — training throughput (samples/second), GPU utilization, and thermal performance (a throughput-tracking sketch follows this list)
- Optimize: tune NCCL parameters, adjust storage caching, fine-tune cooling setpoints
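To track the samples-per-second comparison against your cloud baseline, a lightweight rolling monitor like the sketch below (a hypothetical helper, not part of any framework) can be called once per training step and logged alongside your existing metrics.

```python
# Lightweight throughput tracker for comparing post-migration training speed
# against your cloud baseline. Intended to be called once per training step.
import time
from collections import deque

class ThroughputMonitor:
    def __init__(self, window_steps: int = 100):
        self.times = deque(maxlen=window_steps + 1)     # step timestamps
        self.samples = deque(maxlen=window_steps + 1)   # cumulative sample counts

    def step(self, batch_size: int) -> float | None:
        """Record one step; return rolling samples/second once the window has data."""
        self.times.append(time.perf_counter())
        total = (self.samples[-1] if self.samples else 0) + batch_size
        self.samples.append(total)
        if len(self.times) < 2:
            return None
        elapsed = self.times[-1] - self.times[0]
        return (self.samples[-1] - self.samples[0]) / elapsed

if __name__ == "__main__":
    monitor = ThroughputMonitor(window_steps=10)
    for _ in range(20):
        time.sleep(0.05)                      # stand-in for a real training step
        rate = monitor.step(batch_size=512)
        if rate:
            print(f"~{rate:,.0f} samples/sec")
```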
Hybrid Architecture Considerations
Many organizations maintain a hybrid architecture post-migration:
- AI training in colocation (predictable, long-running, cost-sensitive)
- AI inference in cloud (burstable, globally distributed, auto-scaling)
- Data preprocessing and experimentation in cloud (flexible, collaborative)
- Dedicated network connections between colocation and cloud providers for seamless data movement
Common Migration Mistakes
- Underestimating power requirements: A GPU cluster at full utilization draws more than the servers' nameplate ratings suggest once you account for networking, storage, and cooling overhead. Build in 20% headroom.
- Ignoring power and cooling limits: Signing a colocation contract before confirming the facility can actually deliver and cool your required density. Many facilities claim "high density" support that maxes out at 15 kW/rack.
- Network topology mistakes: Under-provisioning InfiniBand fabric or choosing the wrong topology for your cluster size. Consult NVIDIA's reference architectures.
- No operational runbooks: Not documenting procedures for common operations (reboot, firmware updates, hardware replacement) before going live.
- Insufficient spare parts: GPU servers have many components that can fail. Keep spare GPUs, DIMMs, NVMe drives, fans, and network cables on-site.
Migration Timeline Summary
Pulling the phases together:
- Phase 1, assessment and planning: 4-8 weeks
- Phase 2, vendor selection and procurement: 2-4 weeks, with hardware lead times running in the background
- Phase 3, deployment and configuration: 4-8 weeks
- Phase 4, data migration: 1-4 weeks, ideally started in parallel with deployment
- Phase 5, cutover and optimization: 2-4 weeks
End to end, plan for roughly 3-6 months from decision to fully operational, with hardware lead times the most common source of delay beyond that.
Ready to find the right colocation facility for your AI workloads? Browse AI-ready facilities in our directory, or compare colocation vs cloud costs for AI workloads.