Scaling Without Disruption: Engineering a Seamless Cloud Transition

June 4, 2025

Cloud migrations are often associated with risk — downtime, performance instability, and a potential hit to developer productivity. But for us, this wasn’t just about moving workloads. It was an opportunity to rethink and refine how we build, deploy, and scale our infrastructure. The challenge? Ensuring zero disruption to millions of users while overhauling an ecosystem of 130 microservices, hundreds of databases, and a deeply optimized engineering stack.

Rather than taking a high-risk “big bang” approach, we engineered a multi-cloud bridge, allowing services to be migrated in a controlled, incremental manner. Dedicated fiber connections ensured sub-2ms latency between cloud environments, making the transition seamless. To manage traffic dynamically, we deployed a dual ingress routing system, leveraging ZooKeeper-based service discovery and real-time configuration updates. This allowed us to shift workloads without causing disruptions or increasing latency.
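The traffic-shifting idea above can be sketched in a few lines. This is a hypothetical illustration, not the actual ingress code: it shows how a single migration weight (which in the real system would arrive live via a ZooKeeper watch) can deterministically split requests between clouds, and how setting the weight back to zero is an instant rollback. Function and parameter names are invented for the example.

```python
import hashlib

# Hypothetical sketch: route each request to 'aws' or 'gcp' based on a
# migration weight (0.0 = all traffic on AWS, 1.0 = all traffic on GCP).
# In production this weight would be updated in real time via a
# ZooKeeper watch rather than passed in directly.
def route_request(request_id: str, gcp_weight: float) -> str:
    """Deterministically pick a cloud for a given request id."""
    # Hash the request id to a stable value in [0, 1) so the same
    # request always lands on the same cloud for a given weight.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "gcp" if bucket < gcp_weight else "aws"

# Raising the weight gradually shifts traffic; dropping it back to 0.0
# is an immediate rollback with no redeploy.
print(route_request("user-42", 0.0))  # weight 0 always routes to 'aws'
```

The key property is that the split is driven purely by configuration, so a rollback is a config change rather than a deployment.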

Addressing Infrastructure Challenges

We began by identifying key technical and operational risks. The complexity of the release process was a significant challenge: coordination between QA, development, and SRE teams meant that even minor delays could disrupt the migration timeline. Manual release processes further increased the risk of errors that would force deployment rollbacks.

Infrastructure inconsistencies posed another major hurdle. We had a mix of tools for infrastructure provisioning, multiple storage and deployment systems, and scattered service configurations. Without addressing these inconsistencies, the risk of configuration drift and cross-cloud incompatibility was high.

Ensuring operational continuity during migration was equally critical. Testing environments often lacked consistency, increasing the risk of undetected migration-related issues. The time-consuming process of setting up test environments could further slow validation. Our developer teams also relied heavily on SRE for infrastructure support, creating a bottleneck that could threaten migration timelines.

We saw an opportunity to use migration as more than just a technical transition. It was a chance to address longstanding inefficiencies, modernize processes, and reduce dependencies on manual infrastructure management.

Finding the Right Migration Tooling

The turning point came when we decided to standardize our migration approach through a platform solution. Initially, we considered building custom Terraform solutions but quickly realized that a platform-based approach would provide greater consistency, enable rapid rollbacks, and maintain deployment velocity without adding to the SRE workload.

The chosen platform delivered key capabilities, including unified configuration management across AWS and GCP, simplified service migration via configuration changes, and self-service deployment features that reduced reliance on SRE. With infrastructure complexity abstracted through a centralized management layer, we could ensure a seamless transition while keeping business operations intact.

Architecting a Reliable Migration

The backbone of our migration strategy was dedicated cross-cloud connectivity. We established dual active-active fiber connections between AWS Mumbai and GCP Mumbai, costing approximately $8,000 per month. With sub-2ms latency, comparable to inter-AZ latency within a single cloud, these connections transformed the migration process from a high-risk cutover into a controlled, incremental transition.

This connectivity enabled a robust service discovery and traffic routing system. We deployed separate ingress controllers on both clouds and extended our internal “Pathfinder” library to support cross-cloud service discovery. Our configuration-based traffic control allowed seamless traffic routing between AWS and GCP with instant rollback capabilities.
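The configuration-based routing described above can be illustrated with a small resolver sketch. Pathfinder's internals are not public, so everything here (service names, endpoint formats, the routing table) is an invented stand-in; the point is that flipping a single config entry moves a service between clouds, and reverting that entry is the rollback.

```python
# Hypothetical sketch of config-driven cross-cloud service discovery.
# Each service has an endpoint registered per cloud; a routing table
# entry decides which cloud a lookup resolves to.
ENDPOINTS = {
    "payments": {
        "aws": "payments.aws.internal:8080",
        "gcp": "payments.gcp.internal:8080",
    },
}

# Flipping this entry migrates the service; reverting it rolls back.
ROUTING = {"payments": "aws"}

def resolve(service: str) -> str:
    """Return the endpoint for a service in its currently assigned cloud."""
    cloud = ROUTING.get(service, "aws")  # default to the legacy cloud
    return ENDPOINTS[service][cloud]

# Migration of the service is a one-line configuration change:
ROUTING["payments"] = "gcp"
print(resolve("payments"))  # prints the GCP endpoint
```

Because callers only ever ask the resolver, no application code changes when a service moves clouds.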

Rather than migrating services haphazardly, we developed a strategic sequencing approach. By mapping service dependencies, we identified tightly coupled service “islands” and migrated them together to reduce cross-cloud latency. For databases, the migration strategy varied based on criticality: some were temporarily disabled during low-traffic hours, while others relied on cross-cloud replication before a controlled cutover.

Organizational Structure for Execution

To execute the migration while maintaining development velocity, we implemented a three-tier organizational model. The central SRE team focused on infrastructure, security, and database migration. A dedicated GCP professional services team provided expertise on cloud architecture, networking, and best practices. Finally, a platform layer managed orchestration, deployment, and service discovery across both clouds, ensuring consistency throughout the transition.

This structure allowed application developers to continue releasing features without directly engaging in migration complexities. Kaustubh Bhoyar, Head of Engineering, emphasized this streamlined process: “The developer’s job was to change the configuration, deploy the service in the QA environment, test if everything was working fine, and release that service to production.” This model preserved feature velocity while ensuring a smooth cloud transition.

Achieving Transformation Beyond Cost Savings

Within seven months, we had successfully migrated 130 microservices and hundreds of databases, achieving our infrastructure cost-savings target. However, the benefits extended far beyond cost reduction.

The migration modernized our development processes, eliminating SRE dependencies for QA releases through a self-service deployment platform. All infrastructure was now managed through a single source of truth, ensuring parity across development, testing, and production environments.

Operational efficiency improved dramatically. Environment provisioning for performance testing, which once took days, was now completed in minutes. Kubernetes adoption led to optimized resource utilization, while standardized deployment processes reduced incidents. Release velocity increased to 8–10 services weekly, and the SRE team could shift focus from operational firefighting to reliability engineering.

The Strategic Value of a Well-Executed Migration

While cloud migrations are inherently complex, they can drive significant long-term value when executed strategically. By treating the transition as a catalyst for broader transformation, we not only reduced infrastructure costs but also streamlined development, modernized operations, and enhanced overall engineering efficiency.

For CTOs facing similar migration pressures, the key takeaway is clear: focus beyond the infrastructure shift and use migration as an opportunity to improve processes, standardize configurations, and reduce operational dependencies. With the right technical architecture, organizational alignment, and tooling support, a cloud migration can be more than a necessity — it can be a foundation for long-term success.