How we migrated our CDN to AWS CloudFront at Trainline

Mohamed Abukar · Trainline’s Blog · Jan 9, 2024 · 12 min read

Introduction

What’s the Deal? 🚄


Trainline is Europe’s leading train and coach app. To put it simply, we are a one-stop shop for train and coach travel. Every day, we gather routes, prices, and travel times from over 270 rail and coach operators in 45 countries, so that everyone can buy tickets quickly and save time, effort, and money. With millions of users relying on our platform, delivering high-performance content is crucial for ensuring an exceptional user experience. Running 24/7 for over 100 million users, such a system needs to always be highly available and easily scalable, and our CDN game has to be strong!

While our previous CDN provider (which will be unnamed in this article and referred to as CDN X from here onwards) served us well, we recognised the need to grow and adapt, which led us to explore new CDN alternatives like AWS CloudFront.

Amazon CloudFront is a global content delivery network (CDN) service that can help businesses deliver their content, videos, applications, and APIs to customers worldwide with low latency and high data transfer speeds.

A content delivery network (CDN) is a geographically distributed group of servers that work together to provide fast delivery of internet content.

Why Switch? ⚡

Motivation for Migration 🛫

CDN X served us well, but as our needs evolved, AWS CloudFront offered clear advantages:

  • No More Manual Work 🛠️: Streamlined CDN management by cutting out manual processes; CDN changes are now managed through Terraform automation.
  • AWS-Native: We are already an AWS house, so CloudFront keeps us closer to the rest of their stack.
  • Cost-Effective 💰: AWS CloudFront fits well within our budget.
  • API Goodness 🤖: Robust APIs made our automation workflows even better.
  • DIY Configs 🎛️: Self-service features freed us from additional pro services.
  • Automate All The Things 🔄: Full-stack deployment automation, thanks to Infrastructure-as-Code.
  • Easy Certs 🔐: AWS Certificate Manager simplified our SSL/TLS needs.
  • Cloud-Native 🛡️: Built for high availability and low latency globally.
  • Docs and Support 📚: AWS responded quickly and had all the info we needed.

The Planning Phase 📝

In the Planning Phase, which spanned six weeks, we had a core team of 6–8 engineers, 1 team lead, 1 engineering manager & 1 product manager.

Switching CDNs is no small task. Here’s how we prepared:

Research

  • Origins: Our primary content sources were AWS-hosted ALBs, backends, and S3 buckets. Seamless AWS integration made setting this up a breeze.
  • Cache Rules: We set cache behaviours tailored for different content types (static, dynamic, API responses).
  • Lambda@Edge: Custom Lambda functions added on-the-fly tweaks at the edge locations. Think URL rewriting, setting cookies & setting universal headers.
  • Geo-Restrictions: As a precautionary step to protect our infrastructure, we used CloudFront’s geo-blocking.
  • Data Squeeze️: Gzip & Brotli compression minimised data transfer and sped things up (both compression and geo-blocking appear in the sketch after this list).
  • SSL/TLS with ACM: AWS Certificate Manager integrated seamlessly with our security needs and takes care of the headache of renewing certificates.
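
To make a few of these concrete, here is a minimal Terraform sketch of a distribution that combines compression, geo-blocking and an AWS-managed cache policy. The resource names, origin domain and blocked locations are illustrative assumptions, not our production values.

# Minimal sketch; names, origin domain and locations are illustrative
data "aws_cloudfront_cache_policy" "optimized" {
  name = "Managed-CachingOptimized" # AWS-managed cache policy
}

resource "aws_cloudfront_distribution" "example" {
  enabled = true

  origin {
    domain_name = "origin-alb.example.com" # hypothetical ALB origin
    origin_id   = "primary-alb"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "primary-alb"
    viewer_protocol_policy = "redirect-to-https" # HTTP is bounced to HTTPS
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    compress               = true # gzip/brotli compression at the edge
    cache_policy_id        = data.aws_cloudfront_cache_policy.optimized.id
  }

  restrictions {
    geo_restriction {
      restriction_type = "blacklist"
      locations        = ["XX"] # placeholder: real ISO 3166 country codes
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true # production would reference an ACM cert
  }
}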

Assessment 📋

We started by reviewing our existing CDN X configurations:

  • List of Assets: Documented all static and dynamic assets.
  • Edge Rules: Cataloged all caching and forwarding rules.

Mapping 🗺️

We then mapped CDN X features to their AWS CloudFront equivalents.

  • Cache Policies: Mapped to CloudFront’s caching settings.
  • Forwarding Rules: Mapped to Lambda@Edge functions.
  • WAF: Mapped CDN X’s WAF rules to AWS WAF.
  • And many more features that we mapped onto their CloudFront equivalents.

Test Plans 🧪

  • Proof of Concept: Tested CloudFront with a single, low-traffic distribution.
  • We used preview distributions (an internal Trainline solution) to test our configurations before deploying them to production. At the time, AWS didn’t offer such a feature; it has since launched one, though it still has certain limitations.

Not all goes as planned

Most, if not all, of our testing was small-scale, which meant we didn’t see the bigger picture until the end. Some examples:

  1. Billing Surprises: When testing, we only had one URL redirect, which was cost-effective. But once we scaled to 100 redirects, the AWS bill surprised us 📈. To mitigate this, we shifted from Lambda@Edge to CloudFront Functions.
  2. Limitations with CloudFront Functions: While CloudFront Functions are great, they struggled with the workload of thousands of redirects, often leading to timeouts ⌛. Our solution was to isolate the redirects into their own CloudFront distributions, creating a more manageable system (a sketch of one such redirect follows this list).
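
To illustrate, here is a hypothetical redirect implemented as a CloudFront Function and deployed through Terraform; the function name, path and target URL are made up for the example.

# Hypothetical redirect as a CloudFront Function, deployed via Terraform
resource "aws_cloudfront_function" "redirect" {
  name    = "legacy-path-redirect" # illustrative name
  runtime = "cloudfront-js-1.0"
  publish = true

  code = <<-EOT
    function handler(event) {
      var request = event.request;
      // Illustrative rule: permanently redirect one legacy path
      if (request.uri === '/old-path') {
        return {
          statusCode: 301,
          statusDescription: 'Moved Permanently',
          headers: { location: { value: 'https://www.thetrainline.com/new-path' } }
        };
      }
      return request;
    }
  EOT
}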

The Migration Phase (Summarised) 🚀

During the Migration Phase, which lasted about four to six months, we looped in our principal engineers and developers, and had weekly syncs with various internal stakeholders.

Rough architecture estate once the migration was complete

Phase 1: Small-Scale Testing 🧪

  • We created our internal CloudFront Terraform module.
  • Set up a CloudFront Distribution: Created a new distribution via our custom module.
# Our rough module structure
module "cloudfront-distribution" {
  source  = "<module_source>"
  version = "<module_version>"

  cloudfront_functions   = "function_demo"
  default_cache_behavior = ["<Default Cache Behaviour ARN goes here>"]
  domain_names           = ["thetrainline.com", "trainline.eu"]
  environment_type       = "Production"

  custom_origins = ["<ALB origin>"]
  description    = "CloudFront Distribution for Trainline"
  http_version   = "http2"
  waf_name       = "CDN-WAF"
}
  • DNS Routing: Used Cloudflare for DNS testing.
  • Monitor and Tweak: Observed performance through CloudFront logs, Splunk, and NewRelic.
  • We migrated some of the smaller domains first (the ones with less traffic).

Phase 2: Full Migration

  • Once the smaller domains were completed, we went on to migrate the larger domains such as thetrainline.com & others.
  • Gradual Traffic Routing (for larger domains): Increased the weight for CloudFront in Route 53 in small increments: 5% > 10% > 25% > 50% > 75% > 100% (see the sketch after this list).
  • Final Checks: Ensured all assets and rules were correctly migrated.
  • Traffic Switchover: Moved 100% of traffic to CloudFront.
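
As a rough sketch, this is what the weighted routing can look like in Terraform, assuming CNAME records; the zone ID, endpoints and starting weights are placeholders, not our production values.

# Sketch of weighted DNS records for a gradual cutover
resource "aws_route53_record" "cdn_x" {
  zone_id        = "<hosted_zone_id>"
  name           = "www.thetrainline.com"
  type           = "CNAME"
  ttl            = 300 # a low TTL keeps rollback fast
  set_identifier = "cdn-x"
  records        = ["<cdn_x_endpoint>"]

  weighted_routing_policy {
    weight = 95 # shrinks as CloudFront takes more traffic
  }
}

resource "aws_route53_record" "cloudfront" {
  zone_id        = "<hosted_zone_id>"
  name           = "www.thetrainline.com"
  type           = "CNAME"
  ttl            = 300
  set_identifier = "cloudfront"
  records        = ["<cloudfront_distribution_domain>"]

  weighted_routing_policy {
    weight = 5 # start small: roughly 5% of lookups
  }
}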

Tech Stack 🛠️

  • Terraform: Infrastructure as Code for CloudFront configurations.
  • AWS Route 53 & Cloudflare: DNS management.
  • NewRelic, Splunk and CloudWatch: Monitoring & Alerting
  • Kinesis Logging: Real-time logging.
  • Spacelift: Terraform state management & CI/CD for Terraform.
  • Terratest: Automated module testing.
  • AWS ACM & Shield: SSL/TLS certificates and DDoS protection.
  • Custom WAF Module: Our home-brewed Web Application Firewall module.
  • Internal CDN-CloudFront Module: Custom module for easier CloudFront setup.
  • OPA Policies: Compliance checks.

Architecture 🗺️

General Architecture of a CDN

How a CDN works

Pre-Migration (CDN X)

The pre-migration architecture of our CDN

Post-Migration (CloudFront)

Post-migration architecture of our CDN

Deep dive of the Migration (step-by-step)

In this section, we deep dive into the migration steps and the considerations that shaped our migration journey.

Distribution Scaffolding

We developed an internal scaffolding tool. It simplifies the initial setup and generates a PR that creates new Spacelift stacks (which we use to manage our Terraform configuration) and a CDN distribution template based on our CDN module. It’s a streamlined way to kick off a new CDN distribution. Each distribution gets a new repository named with the pattern “cdn-distribution-<distribution-name>”.

One repository per distribution: We created a separate repository for each distribution. This allowed us to manage changes and deployments independently, and it was a neat way to handle granular control and isolated deployments.

Replicating CDN X Behaviors

We aimed to copy the CDN X behaviours onto AWS CloudFront, ensuring the transition maintained functionality and performance across different domains.

Deploying to cdn-staging

To facilitate testing, we deployed changes to the cdn-staging environment. Triggering a manual deployment in Spacelift allowed for comprehensive testing.

Integration Testing

We created test suites to verify intended behaviours and policies. These suites were triggered manually and via continuous integration (CI) for every commit in the main repository. The tests ran in Jest against a mocked origin.

Non-Production Deployment

Before going live, we started with non-production deployments. The CDN- repository housed the Terraform code for CDN resource creation. Staging deployments were performed against a mock origin.

External Test Deployment

After successful staging, external test deployments were initiated to confirm expected behaviours and policies.

Preview Domains

We set up a preview domain for testing our production CDN configurations. This allowed for robust testing and minimised risks during the live environment switch.

Production Deployment

Finally, we moved to production. The changes were closely monitored to ensure a seamless transition.

Network Changes and DNS Flip

Our networking team handled DNS changes and gradually shifted the traffic from CDN X to CloudFront.

Roll Forward and Observe

After the DNS flip, we observed CDN performance for a week to validate the migration’s effectiveness.

Rollback Plan

We had a well-defined rollback plan to mitigate any unforeseen issues, ensuring minimal user disruption.

We did this for more than 300 domains, fully migrating them from CDN X to CloudFront, including the main site www.thetrainline.com, all European sites and many other domains owned by Trainline.

Graph showing traffic after we fully migrated from CDN X to CloudFront (CDN X in blue, CloudFront in orange)

Security & various other aspects of the Migration 🛡️

When migrating to a CDN setup, it’s not just about speed and efficiency. Security is equally important. Let’s dig into how we layered our security during this migration.

Key Security Measures

  1. Attack Prevention: We deployed AWS Shield Advanced for robust DDoS protection and AWS WAF to guard against common web threats like SQL injections and XSS.
  2. Rate & Traffic Control: AWS WAF’s rate-based rules helped us handle sketchy traffic surges (see the sketch after this list). We also used DataDome for behavioural analysis, identifying bad bots and rogue IPs.
  3. Data Encryption & Access: Using CloudFront’s field-level encryption, we ensured that sensitive information stays secure. IAM policies dictate who has access to what, so there’s no funny business.
  4. Monitoring & Alerts: All critical data goes to Splunk, where we’ve set up alerts for unusual behaviour. This keeps us in the loop and ready to act.
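
To give a flavour of the rate-based rules mentioned above, here is a minimal Terraform sketch of a CloudFront-scoped web ACL; the ACL name, request limit and metric names are illustrative rather than our production rules.

# Sketch of a rate-based rule in a CloudFront-scoped web ACL
resource "aws_wafv2_web_acl" "cdn" {
  name  = "cdn-waf"    # illustrative name
  scope = "CLOUDFRONT" # CloudFront-scoped ACLs are created in us-east-1

  default_action {
    allow {}
  }

  rule {
    name     = "rate-limit"
    priority = 1

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000 # requests per 5-minute window, per IP
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "cdn-waf"
    sampled_requests_enabled   = true
  }
}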

Enhancing Security at the Origin

  • Origin Access Identity (OAI) secures our S3 bucket, while custom headers add another layer of security for Application Load Balancers (ALB).
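
Here is a minimal sketch of the S3 side, assuming a hypothetical assets bucket: an Origin Access Identity plus a bucket policy that only allows that identity to read objects.

# Sketch: lock an S3 origin down to CloudFront via an Origin Access Identity
resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-bucket" # hypothetical bucket name
}

resource "aws_cloudfront_origin_access_identity" "assets" {
  comment = "CloudFront access to the assets bucket"
}

data "aws_iam_policy_document" "assets_read" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.assets.arn}/*"]

    principals {
      type        = "AWS"
      identifiers = [aws_cloudfront_origin_access_identity.assets.iam_arn]
    }
  }
}

resource "aws_s3_bucket_policy" "assets" {
  bucket = aws_s3_bucket.assets.id
  policy = data.aws_iam_policy_document.assets_read.json
}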

Certificates & Protocols

  • AWS Certificate Manager auto-renews our SSL certificates. We support TLS 1.3 and 1.2 with modern cipher suites, sticking to the latest encryption standards.
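
As a sketch, the certificate side can be expressed in Terraform like so; the domain is illustrative, and the commented distribution snippet assumes the certificate has already been validated.

# Sketch: DNS-validated ACM certificate; DNS validation enables auto-renewal
resource "aws_acm_certificate" "cdn" {
  domain_name       = "www.thetrainline.com" # illustrative domain
  validation_method = "DNS"
  # Note: certificates used by CloudFront must be issued in us-east-1.
}

# Referenced from the distribution, pinning the minimum TLS version:
#   viewer_certificate {
#     acm_certificate_arn      = aws_acm_certificate.cdn.arn
#     ssl_support_method       = "sni-only"
#     minimum_protocol_version = "TLSv1.2_2021"
#   }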

Customization & Flexibility

  • Lambda@Edge and CloudFront Functions handle domain redirects and request customization. We also make use of CloudFront’s geo-headers to adapt content based on user location.

Logging & Traceability

  • All logs flow into an S3 bucket before landing in Splunk for deeper analysis. Our Infrastructure as Code (IaC) approach ensures all CDN config changes are version-controlled.

By keeping our focus on these key areas, we managed to secure our CDN migration without any major hiccups. With these measures in place, we’re not just faster; we’re also more secure.

Caching Strategies in CloudFront: Do’s and Don’ts

Caching can dramatically improve your website’s performance, but it’s not as straightforward as it seems. In this section, we’ll delve into some key considerations for setting up caching policies, with a special focus on AWS CloudFront.

Desired Outcomes

  • Versioned Assets: Always use versioned asset names. This allows you to set longer cache durations without the need for a CDN-level cache reset.
  • Cache-Control: Application teams should set cache-control headers directly within the app. Use the s-maxage directive specifically for CDN-level caching and max-age for both browser and CDN caching.
  • S3 Backend: If your resources live in an S3 bucket, ideally the Cache-Control header should be set on the S3 objects themselves (see the sketch after this list).
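
For example, a versioned asset uploaded with Cache-Control set at the object level might look like this in Terraform; the bucket, file name and content hash are made up.

# Sketch: versioned asset with a long cache lifetime set on the S3 object
resource "aws_s3_object" "app_js" {
  bucket        = "example-assets-bucket" # hypothetical bucket
  key           = "assets/app.3f9c2b.js"  # versioned filename: safe to cache hard
  source        = "dist/app.3f9c2b.js"
  content_type  = "application/javascript"
  cache_control = "public, max-age=31536000, immutable"
}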

Limitations

  • File Extensions: Unfortunately, CloudFront doesn’t let you set caching policies based on file extensions directly.
  • Negative Match: You can’t set up rules like “cache this but not that within the same directory.” This complicates scenarios where you’d like varying cache durations for similar endpoints.
  • Arbitrary Headers: Conditional caching based on the existence of arbitrary headers isn’t straightforward either.

Caching mechanism proposals

  1. Default to Max-Age 0: If Cache-Control is not set at the origin, it's a good practice to set the default max-age to zero. This prevents sensitive data from being accidentally cached (a sketch of such a policy follows this list).
  2. Log Analysis: Regularly review your logs to adjust your caching strategy, ensuring you’re not sacrificing security for performance or vice versa.
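
One way to express the first proposal in Terraform is a custom cache policy whose default TTL is zero, so nothing is cached unless the origin opts in via Cache-Control; the policy name and TTL cap are illustrative.

# Sketch: conservative cache policy; origin must opt in via Cache-Control
resource "aws_cloudfront_cache_policy" "origin_controlled" {
  name        = "origin-controlled-caching"
  min_ttl     = 0
  default_ttl = 0        # no Cache-Control from the origin => not cached
  max_ttl     = 31536000 # cap honoured when the origin does set headers

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_gzip   = true
    enable_accept_encoding_brotli = true

    cookies_config {
      cookie_behavior = "none"
    }
    headers_config {
      header_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

# Attached to a cache behaviour via:
#   cache_policy_id = aws_cloudfront_cache_policy.origin_controlled.id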

Challenges faced

Switching a major domain from one CDN to another is no small task. Let’s delve into how we approached migrating www.thetrainline.com from CDN X to CloudFront, given its complexity and high traffic volume.

The Challenge (one of them)

www.thetrainline.com used to account for a whopping 70%+ of all Trainline’s CDN X traffic, with over 8,500 lines of JSON configuration. A direct switch to CloudFront risked downtime and disruption, something we wanted to avoid.

Incremental Migration: A Safer Approach

When we decided to migrate from CDN X to CloudFront, we knew we had to tread carefully to ensure a smooth transition without disruptions. Here’s a streamlined version of our approach:

Replicating Configurations

We began by replicating the essential CDN X configurations in CloudFront for a specific subset of requests. This groundwork was crucial for a successful migration.

Testing the Waters

Rather than making an abrupt switch, we opted for a gradual transition. We used Route 53 weighted DNS to direct a portion of our traffic to CloudFront while maintaining a watchful eye on performance and any potential issues that might arise.

Rollback plan

Recognising that no migration is entirely risk-free, we reduced the TTL on the CNAME record for www.thetrainline.com to 5 minutes well in advance of the migration. If anything went awry during the process, we could quickly revert all requests back to CDN X. This safety net gave us the confidence to proceed.

Lessons Learned 📚

In a short span of less than 9 months, and with just a small team, we pulled off the mammoth task of migrating from CDN X to AWS CloudFront. This wasn’t a mere lift-and-shift; it was more like re-engineering the aeroplane while in flight!

We had to rethink our architecture, break down our CDN monolith into more manageable parts, and pave the way for a smoother transition. The automated frameworks we set up ensured faster and safer changes during deployment.

Key Takeaways 👇

  • Planning: No matter how straightforward a project seems, don’t skimp on planning. A little planning upfront can save you hours of work down the line.
  • Route 53 Magic & Cloudflare: Use Route 53’s weighted records and CNAMEs for a smooth DNS transition. It’s like a GPS for your data packets.
  • Read Up, Level Up: Spend time with CloudFront’s & AWS docs. It’ll save you hours of debugging and redesign. Trust us, you don’t want to learn CDN features the hard way.
  • Think Before You Leap: Always consider the architectural differences between your current CDN and the one you’re migrating to. One size doesn’t fit all; the trick is to tailor-fit your solutions.
  • Lambda@Edge Limits: If you’re planning on using Lambda@Edge, understand its scaling model. And don’t hesitate to contact AWS if you need to extend your quotas, particularly for large-scale deployments.
  • HTTPS All The Way: Make sure to redirect all HTTP traffic to HTTPS for added security. You can set this up easily in CloudFront.
  • WAF Magic: Use AWS WAF with CloudFront to protect against common web exploits. With just a few clicks, you can secure your application without breaking a sweat.
  • Use IaC: Infrastructure as Code, particularly Terraform, can be a lifesaver. It streamlines your operations, making changes quicker and less error-prone.
  • Rollback Plan: Always have a rollback strategy when making changes. If things go south, you should be able to revert to the previous state ASAP.
  • DDoS Protection: Employ rate limiting and other DDoS protection mechanisms to shield your services.
  • Phased Migration: Especially for high-traffic domains, consider migrating in phases. This helps in risk mitigation and allows for easier troubleshooting.
  • Monitoring: Don’t just set it and forget it. Keep a constant eye on performance metrics. Your future self will thank you.

Conclusion

Migrating a CDN is not just a technical task; it’s a strategic move that impacts your organisation on multiple levels. For those in leadership positions and on the engineering frontlines, the insights gained from such an endeavour are invaluable.

Plan carefully, understand the constraints, and invest time in learning about the features and limitations of your chosen services. Security should be a core focus, not an add-on. Leverage tools that facilitate a smoother transition, and don’t underestimate the importance of monitoring performance metrics.

The experiences and challenges faced during a migration offer a unique learning opportunity. Use it as a chance to refine processes, improve collaboration, and further develop technical skills across your team.

In short, a CDN migration is a significant commitment but, if executed well, it can lead to streamlined operations, improved performance, and a stronger security posture.

Acknowledgements:

A HUGE shoutout to the internal CDN Team from Trainline (Vin, Vasan, Andrew, Merlin, Ravi, Nilesh and myself Mo). Couldn’t have been possible without them!

To wrap things up, this migration was a massive team effort and a learning experience for all of us. We’ve seen tangible benefits, from some cost savings to better automation/self-service and even improved site performance. What’s next, you ask? We have loads of projects that we are currently working on and hopefully, aim to share with you soon. Stay tuned!

We are HIRING!

At Trainline, we’re at the forefront of developing cutting-edge technology solutions that deliver a seamless and personalised travel experience to millions of users worldwide. 🌎

Check out all of our open roles across Europe here and jump on board as we empower people to make greener travel choices. 💚

Mohamed Abukar
AWS Community Builder | Red Hat Accelerator | Platform Engineer @ Trainline