
Jul 4, 2025 | 3 Minute Read

Building A Cloud-Native System To Handle Real-Time Traffic Spikes

Introduction

Building real-time platforms that can handle sudden spikes in user activity, while controlling cloud costs, is a common challenge in digital engineering today. Axelerant partnered with a rapidly scaling platform where users interact concurrently and continuously, demanding high availability, low latency, and massive scalability.

To meet these needs, Axelerant engineered a high-throughput, cloud-native system leveraging Kubernetes, Go, Kafka, Terraform, and advanced observability tools. This engagement was not just about infrastructure; it was about ensuring sustained performance at scale, operational efficiency, and security excellence in a real-time, high-concurrency environment.

Understanding The Stakes

When dealing with dynamic workloads that experience unpredictable traffic spikes, such as major live events or high-demand digital campaigns, the system architecture must guarantee both reliability and speed. The core use case demanded real-time responsiveness for thousands of concurrent users interacting within milliseconds.

Traffic simulations suggested peak loads could jump from a few thousand to tens of thousands of concurrent users within seconds. Performance budgets were defined to meet low-latency SLAs across critical transaction paths. The engineering team monitored P95 and P99 latencies with real-user data to ensure consistent responsiveness.
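As a minimal sketch of how a P95/P99 budget check can be computed from a window of latency samples (the sample values below are hypothetical, and the nearest-rank method is one of several common percentile definitions):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (0 < p <= 100)
// of a slice of latency samples.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	return sorted[rank]
}

func main() {
	// Hypothetical request latencies collected over a monitoring window.
	var samples []time.Duration
	for i := 1; i <= 100; i++ {
		samples = append(samples, time.Duration(i)*time.Millisecond)
	}
	fmt.Println("P95:", percentile(samples, 95))
	fmt.Println("P99:", percentile(samples, 99))
}
```

In production this aggregation is typically done by the metrics backend (e.g. Prometheus histograms) rather than in application code; the sketch only shows what the P95/P99 numbers mean.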

Architecting For Scale

The platform’s foundation lies in a fully AWS-hosted, microservices-based architecture, orchestrated via Kubernetes with isolated clusters for different environments (development, production). Kubernetes namespaces were aligned with service domains, and role-based access control (RBAC) was implemented per namespace for operational security.

The CI/CD pipeline, powered by Docker, Amazon ECR, Helm, and Argo CD, included static code analysis, unit tests, integration environments, and rollback validation stages. Artifacts were promoted between environments through GitOps workflows, and rollbacks were automated using Argo CD.

Secrets and configuration were managed using external secret stores integrated with Kubernetes through CSI drivers, enabling secure, audited access to sensitive credentials. Network design used separate VPCs per environment with peering connections, public/private subnet segregation, and NAT gateways for controlled internet access.

Redis caching was migrated to AWS ElastiCache for Redis with Multi-AZ replication and automatic failover enabled. Kafka was adopted for real-time stream processing and for decoupling data ingestion from processing pipelines. A planned migration to Confluent will add schema registry support, observability dashboards, and managed scaling for producer/consumer workloads.
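The decoupling that Kafka provides can be illustrated in-process with Go channels standing in for a topic (a sketch only; the real system uses a Kafka broker with producer and consumer clients, and the `Event` type here is hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical ingested record.
type Event struct{ ID int }

func main() {
	// A buffered channel stands in for a Kafka topic: ingestion
	// (producer) and processing (consumers) are decoupled and can
	// scale independently.
	topic := make(chan Event, 64)

	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		processed []int
	)

	// Consumer group: two workers draining the "topic".
	for w := 0; w < 2; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range topic {
				mu.Lock()
				processed = append(processed, ev.ID)
				mu.Unlock()
			}
		}()
	}

	// Producer: ingestion publishes without waiting for processing.
	for i := 1; i <= 10; i++ {
		topic <- Event{ID: i}
	}
	close(topic)
	wg.Wait()

	fmt.Println("processed", len(processed), "events")
}
```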

[Figure: The Cloud Platform Architecture]

Solving The Performance Bottleneck

Python was initially used for building core services due to its development speed. However, when concurrency requirements intensified, the Global Interpreter Lock (GIL) and multiprocessing overhead led to significant compute strain. Profiling with Py-spy and Pyroscope showed high I/O wait times and inefficient CPU usage.

The team chose to rewrite these services in Go, re-architecting them around asynchronous, event-driven patterns. Using goroutines, the services achieved:

  • A 6–10x improvement in compute efficiency
  • CPU reduction from 6–8 cores to 0.75 cores per instance
  • Memory usage reduced from 16GB to under 500MB per pod
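The goroutine-based concurrency that drove these gains can be sketched as a bounded fan-out over simulated I/O waits (the request count, concurrency limit, and sleep duration below are illustrative, not measured values from the project):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const requests = 1000
	const maxInFlight = 100 // lightweight goroutines replace heavyweight processes

	sem := make(chan struct{}, maxInFlight) // semaphore bounding concurrency
	var done atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < requests; i++ {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			time.Sleep(5 * time.Millisecond) // simulated I/O wait
			done.Add(1)
		}()
	}
	wg.Wait()
	fmt.Println("handled", done.Load(), "requests")
}
```

Because goroutines cost kilobytes rather than megabytes, a single small pod can keep thousands of I/O-bound requests in flight, which is the source of the CPU and memory reductions listed above.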

Refactoring included modularization, REST and gRPC support, and integration with internal service discovery. Each new service underwent load simulation with tools like k6 and Locust before deployment.
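In Go, a miniature in-process smoke version of such a load test can be written with `httptest`, which is handy as a pre-merge sanity check before the heavier k6/Locust runs (the handler and concurrency level here are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

func main() {
	// A stand-in service endpoint; real load tests targeted staging deployments.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	const concurrent = 50
	var ok atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < concurrent; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL)
			if err == nil && resp.StatusCode == http.StatusOK {
				resp.Body.Close()
				ok.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d/%d requests succeeded\n", ok.Load(), concurrent)
}
```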

Scaling Infrastructure Efficiently

To control costs while scaling, Axelerant implemented:

  • OpenCost for real-time Kubernetes cost tracking by namespace and workload
  • PerfectScale to flag oversized deployments and recommend right-sized resource allocations

Karpenter allowed node-level autoscaling across spot and on-demand instances, improving cost/performance trade-offs. Node pools were defined with custom taints and affinities to isolate stateful and stateless services. Storage optimization included resizing EBS volumes based on Prometheus metrics and leveraging lifecycle policies on S3 for data archiving.

Observability And Resilience

Prometheus was used to collect metrics across application and infrastructure layers. Grafana dashboards were created to visualize:

  • Service response times by route
  • Kubernetes resource usage by pod/deployment
  • Kafka consumer lag and throughput
  • Node-level disk I/O and memory saturation

Centralized logging was implemented using OpenTelemetry to forward logs to Grafana Loki. Logs were enriched with correlation IDs for traceability across microservices.
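Correlation-ID enrichment is typically done in HTTP middleware. A minimal sketch in Go (the `X-Correlation-ID` header name and the handler are assumptions for illustration; the real services integrate this with OpenTelemetry context propagation):

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

type ctxKey struct{}

// withCorrelationID reads an inbound X-Correlation-ID header (or mints one)
// and stores it on the request context so downstream log lines and outbound
// calls can carry it across service boundaries.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Correlation-ID")
		if id == "" {
			buf := make([]byte, 8)
			rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		w.Header().Set("X-Correlation-ID", id)
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), ctxKey{}, id)))
	})
}

func main() {
	h := withCorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Enriched log line: every entry carries the correlation ID.
		log.Printf("corr_id=%v handling request", r.Context().Value(ctxKey{}))
	}))
	srv := httptest.NewServer(h)
	defer srv.Close()

	req, _ := http.NewRequest("GET", srv.URL, nil)
	req.Header.Set("X-Correlation-ID", "abc123")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("echoed:", resp.Header.Get("X-Correlation-ID"))
}
```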

Alertmanager rules were configured for latency spikes, resource thresholds, DB connection pool saturation, and pod evictions. Load testing with synthetic traffic validated HPA rules, scaling pods under spike loads within minutes while maintaining latency SLAs. The team implemented SLO dashboards that informed deployment freezes and post-incident reviews.
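The HPA scaling behavior being validated follows Kubernetes's documented formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch of what that means during a spike (the pod count and CPU figures below are hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the Kubernetes HPA scaling formula:
// ceil(currentReplicas * currentMetric / targetMetric).
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// Average CPU jumps to 240% of the 100% target across 4 pods during a spike.
	fmt.Println(desiredReplicas(4, 240, 100))
}
```

This is why a sudden traffic spike drives a proportional jump in replica count rather than a slow incremental ramp.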

Security And Governance

Security and account governance were managed using AWS Control Tower. Separate AWS accounts were created for development and production, provisioned through Landing Zones with enforced guardrails.

Terraform modules managed IAM roles, policies, and service boundaries. IAM roles were designed for least-privilege access, scoped via tags and environment-specific trust policies. Service accounts used IRSA (IAM Roles for Service Accounts) to securely access AWS resources from within Kubernetes.

Access was federated via AWS SSO integrated with Google Workspace. MFA, session timeouts, and group-based access helped streamline compliance with enterprise IT policies. CloudTrail and GuardDuty were used for logging, anomaly detection, and compliance reporting. All secrets were rotated using AWS Secrets Manager.

Future-Ready Vision

Looking ahead, the platform roadmap includes:

  • Enabling predictive autoscaling driven by business-critical signals rather than raw resource usage
  • Moving to ephemeral environments for PR-based testing using preview deployments
  • Adopting a service mesh for fine-grained traffic management and policy enforcement

This project is a testament to Axelerant's deep digital engineering expertise. From re-architecting performance-critical services in Go, to designing a resilient microservices platform on Kubernetes, and implementing scalable observability and security practices, every decision was made with scale, reliability, and cost in mind.

Through strategic optimization, Axelerant helped the client achieve:

  • Up to 10x performance efficiency improvements
  • A fully observable and auto-scaling infrastructure
  • Streamlined security and governance with AWS best practices

Are You Engineering For Scale? 

Let’s talk about how Axelerant can help you build robust, cloud-native systems that meet the demands of tomorrow.

 

About the Author
Bassam Ismail, Director of Digital Engineering

Away from work, he likes cooking with his wife, reading comic strips, or playing around with programming languages for fun.

