September 27, 2019

A Network Engineer’s Journey to Connect the Cloud

by Mariusz Lenart
in Networking

It all started with a couple of VPCs, several IPSec VPN connections, a couple of on-premises firewalls and limited AWS knowledge.

Today, our network engineering team runs multiple connections to AWS’s public and private resources. We use a wide range of AWS networking services, including Transit Gateways and Direct Connect Gateways, VPC peering connections, Virtual Private Gateways, and BGP failover mechanisms. Our connectivity has multiple levels of redundancy to keep our cloud infrastructure available.

The past four years were pretty exciting for us, and I want to tell the story of how we got where we are today. Hopefully, it will help you with your journey to the cloud.

Chapter I: We need you to connect on-prem to AWS… and don’t spend any money.

We carved out a dedicated context on our existing firewalls and used it to terminate VPN connections. We soon began building VPN tunnels for each of our VPCs. This is roughly what our first connectivity to AWS looked like.

At this stage, we ended up building roughly ten different VPN connections between our data center and our AWS VPCs. The process was labor-intensive, as we had to manually configure a new VPN connection for every new VPC. We had only just begun using AWS and no production workloads were using those connections, but our team knew that at some point a non-redundant, VPN-only solution would not be enough.

Chapter II: Maybe we can get a dedicated circuit before devs start complaining about our VPN.

In our second implementation of the AWS network, our team began investigating the use of Direct Connect for AWS connectivity to give us higher throughput and more predictable latency. We also liked the idea of using existing VPNs as a backup.

With those improvements, our network became more resilient because we now had a redundant connection capable of failing over automatically. Unfortunately, just like in the beginning, every time a new VPC was created, we had to configure new VPN connectivity, deploy Direct Connect interfaces, configure BGP, and perform route redistribution. A single data center was still used for terminating both the VPNs and the Direct Connect link. In addition, our AWS usage grew, and so did the need for more bandwidth. More VPCs and AWS accounts were created, and we also began using a second AWS region. Managing all those new connections began to feel like a full-time job. New solutions to address scalability issues were desperately needed.
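To make the failover idea concrete, here is a minimal sketch of how a router could prefer the Direct Connect path and fall back to the VPN when the primary BGP session drops. The post does not say which BGP attribute was used to steer traffic, so the use of local preference, the path names, and the numeric values are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Path:
    name: str        # hypothetical label for the link
    local_pref: int  # assumed steering attribute; higher value wins
    up: bool         # True while the BGP session for this path is established


def best_path(paths: list[Path]) -> Optional[Path]:
    """Return the usable path with the highest local preference, or None."""
    usable = [p for p in paths if p.up]
    return max(usable, key=lambda p: p.local_pref, default=None)


vpn = Path("ipsec-vpn", local_pref=100, up=True)
dx_up = Path("direct-connect", local_pref=200, up=True)
dx_down = Path("direct-connect", local_pref=200, up=False)

print(best_path([dx_up, vpn]).name)    # direct-connect
print(best_path([dx_down, vpn]).name)  # ipsec-vpn
```

When the Direct Connect session goes down, its routes are withdrawn and the VPN path becomes the best remaining choice, with no manual change required.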

Chapter III: I refuse to manually configure another VPN session and BGP peering connection.

Transit VPC came to the rescue, and we hoped that our scalability problems would be solved once and for all. We acquired Cisco licenses and began deploying two CSR 1000v routers to handle on-prem to VPC communication. We used the CloudFormation template to build out the Transit VPC, and even though a lot of components were needed to make it work (various Lambda functions for automation and S3 for configuration storage), getting everything up and running was relatively easy. In our third implementation of network connectivity, we also decided to upgrade our Direct Connect links to 10G and deployed them in multiple data centers. Both the Direct Connect circuits and the VPN backup were attached to our new Transit VPC. We no longer had to configure IPsec peers or BGP sessions every time a new VPC was deployed. A simple tag would signal the VGW Poller to start building VPN and BGP connectivity between new VPCs and the CSRs. Cisco Configurator pushed all the configuration changes and everything magically worked. This is what our connectivity looked like when the CSR routers were deployed.
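The tag-driven automation described above can be sketched roughly as follows. This is not the actual VGW Poller code: the tag key and value, the dictionary layout, and the function name are assumptions chosen for illustration, and the sketch works on plain Python data rather than calling AWS.

```python
# Assumed tag that marks a Virtual Private Gateway as a spoke to be connected.
SPOKE_TAG_KEY = "transitvpc:spoke"
SPOKE_TAG_VALUE = "true"


def gateways_to_connect(gateways, already_connected):
    """Return IDs of tagged gateways that do not yet have VPN connectivity."""
    return [
        gw["id"]
        for gw in gateways
        if gw.get("tags", {}).get(SPOKE_TAG_KEY) == SPOKE_TAG_VALUE
        and gw["id"] not in already_connected
    ]


# Hypothetical inventory: two tagged gateways, one untagged.
gateways = [
    {"id": "vgw-aaa", "tags": {"transitvpc:spoke": "true"}},
    {"id": "vgw-bbb", "tags": {"team": "data"}},
    {"id": "vgw-ccc", "tags": {"transitvpc:spoke": "true"}},
]

pending = gateways_to_connect(gateways, already_connected={"vgw-aaa"})
print(pending)  # ['vgw-ccc']
```

A poller running on a schedule would repeat this check and hand each pending gateway to whatever builds the VPN and BGP configuration, which is why adding a tag was all that a new VPC required.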

Things were great for a while, but as we began using AWS more, we started noticing IPsec tunnel instability between our VPCs and the CSRs. At times, we would even see both the primary and backup VPN tunnels go down, completely cutting off connectivity to a given VPC. Some time later, we experienced an EC2 instance degradation that took down one of our CSR routers. By the time a new CSR was deployed, the network configuration had fallen out of sync, which gave us additional headaches. These occasional issues and the limited visibility made us like our new solution a little less. We quickly realized that our AWS network had become too complex, so we were quite happy when AWS announced Direct Connect Gateway.

Chapter IV: A bittersweet goodbye and it looks like we are back to configuring VPNs by hand.

Direct Connect Gateway was a big step. We were able to connect our on-premises infrastructure to multiple VPCs across different AWS regions. In addition, unlike in the Transit VPC architecture, all Direct Connect traffic was riding the AWS backbone network. This improved our latency, throughput, and the stability of our infrastructure. We also had no virtual routers to manage and no need to worry about EC2 instances.

The team knew that those benefits would take our network to the next level, but the lack of a centralized IPsec VPN backup solution meant that we had to go back to our manual model of deploying on-prem IPsec VPNs. At the time, it felt a bit like a step back, since we already had a scalable solution for VPN backup connectivity when we built the Transit VPC.

Chapter V: “And Now, Ladies And Gentlemen, May I Present To You The Real Star Of The Evening!”

Manually deploying multiple VPNs for the second time on our on-premises firewalls got old pretty quickly. Something had to be done. Fortunately for us, AWS announced a new service called Transit Gateway. We immediately began re-architecting our VPN infrastructure. Things couldn’t be better now. Establishing connectivity between VPCs and the Transit Gateway was as easy as adding a route to the AWS routing table. There was no longer a need to deploy per-VPC VPN tunnels to our on-premises infrastructure. The new Transit Gateway solution relied on only two redundant VPN connections per AWS region – one to our primary and one to our secondary data center.
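A little arithmetic shows why this scales better than the per-VPC model. The sketch below compares the number of on-premises VPN connections under each design. The "roughly ten" VPCs and the two regions come from the story above; the larger VPC counts and the function names are hypothetical, and one VPN connection per VPC is an assumed simplification.

```python
def per_vpc_vpn_connections(vpc_count: int, connections_per_vpc: int = 1) -> int:
    """Per-VPC model: every VPC needs its own VPN connection to on-prem."""
    return vpc_count * connections_per_vpc


def transit_gateway_vpn_connections(region_count: int, data_centers: int = 2) -> int:
    """Transit Gateway model: one VPN connection per data center, per region."""
    return region_count * data_centers


REGIONS = 2  # the post mentions a second AWS region

for vpcs in (10, 50, 100):  # 50 and 100 are hypothetical growth figures
    print(
        f"{vpcs} VPCs: per-VPC={per_vpc_vpn_connections(vpcs)}, "
        f"transit-gateway={transit_gateway_vpn_connections(REGIONS)}"
    )
```

Under these assumptions the per-VPC count grows with every new VPC, while the Transit Gateway count stays at four no matter how many VPCs are attached.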

Our newest design has been in place for a while now and we have not run into any issues. It relies on several layers of redundancy deployed in geographically dispersed locations. We were able to build a scalable solution with enough bandwidth to meet our needs for the next several years.

In conclusion, starting small and without fully resilient infrastructure was acceptable for our needs in the early days of our journey. Having a simple solution allowed us to learn how AWS networking works. In the past four years, we have re-architected our network as new features became available. Our team was able to adjust quickly to changing requirements, and we learned to fail fast when our solutions were sub-optimal. Also, from a technical standpoint, we now strongly believe that embracing native solutions is the preferred way of doing things. I am pretty sure that we will continue to evolve our networks as Enova’s needs change, but for now, we are taking a short break.

I am interested to learn where you are in your cloud connectivity journey, and I would like to hear what you would do differently. Please feel free to share this blog with your peers and let me know if you are interested in having a conversation. I know a couple of good coffee shops in Chicago.
