Shopify’s Unified Edge — Infrastructure (2022)



While working on improving how the Shopify platform handles traffic, we found that each change became more challenging over time. When deploying a feature that the whole platform should benefit from, we couldn’t simply build it once in a “one way fits all” fashion: we already had more than six different ways for traffic to reach us. To list some, we had a traffic path for:

  • “general population” (GenPop) Shopify Core: the monolith serving shops that’s already using an edge provider
  • GenPop Shopify applications and services
  • GenPop Shopify applications and services that use an edge provider
  • Personally identifiable information (PII) restricted Shopify Core: where we need to make sure that the traffic and related data are kept in a specific area of the world
  • PII restricted Shopify applications and services
  • publicly accessible APIs that require Mutual Transport Layer Security (mTLS) authentication.

LOTR meme stating: "How many ways are there for traffic to reach Shopify?"

We had to choose which traffic path could use those new features or build the same feature in more than six different ways. Moreover, many different traffic paths means that observability gets blurred. When we receive requests and something doesn’t work properly, figuring out why and how to fix it requires more time. It also takes longer to onboard new team members to all of those possibilities and how to distinguish between them.

One edge to reach them all, one edge to secure them; One edge to bring them all and in the clusters bind them.

This isn’t the way to build a highway to new features and improvements. I’d like to tell you why and how we built the one edge to front our services and systems.

One does not simply know what “The Edge” stands for.

The most straightforward definition of the edge, or network edge, is the point at which an enterprise-owned network connects to a third-party network. With cloud computing, lines are slightly blurred as we use third parties to provide us with servers and networks (even more when using a provider to front your network, like we do at Shopify). But in both those cases, as long as they’re used and controlled by Shopify, they’re considered part of our network.

The edge of Shopify is where requests from outside our network enter our network.

Unifying our edge became our next objective, and two projects were born to make this possible. The first is Möbius, which, as its name (taken from the “Möbius strip”) suggests, was to be the one edge of Shopify. The second is Shopify Front End (SFE), the routing layer that receives traffic from Möbius and dispatches it to where it needs to go.

Möbius’s traffic path takes requests from the internet to the routing layer, which then sends them to the application’s clusters to be served. Purple entities are on the traffic path for PII restricted traffic, while the beige ones are for the GenPop traffic.

About a year before starting Möbius, we already had a small number of applications handled through our edge. We saw limitations in how to properly automate such an approach at scale, even though the gains to the platform justified the monetary costs of reaching them. We designed SFE and Möbius together, leading to a better separation of concerns between the edge and the routing layers.

The Shopify Front End

SFE is designed to provide a unified routing layer behind Möbius. Deployed in many different regions, routing clusters can receive any kind of web traffic from Möbius, whether for Shopify Core or for applications. Those clusters are mainly nginx deployments with custom Lua code that handles the routing according to a number of criteria, including but not limited to the IP address a client connected to and the domain that was used to reach Shopify. For the PII restricted requirements, the same routing cluster code is deployed in parallel in the relevant regions.
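As a rough illustration of routing on those two criteria, here is a minimal Python sketch. It is not the actual nginx/Lua code: the `Route` type, the `pick_route` function, the IP range, and the domain table are all hypothetical, and the idea that a dedicated IP range marks PII restricted traffic is an assumption made only for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Route:
    backend: str                  # hypothetical name of the destination cluster group
    pii_restricted: bool = False  # whether traffic must stay in restricted regions


# Assumption: a dedicated IP range receives PII restricted traffic (documentation range).
RESTRICTED_RANGES = [ip_network("203.0.113.0/24")]

# Assumption: a static domain table; the real routing data is not described in this post.
DOMAIN_BACKENDS = {
    "shop.example.com": "core",
    "app.example.com": "apps",
}


def pick_route(dest_ip: str, host: str) -> Optional[Route]:
    """Choose a route from the IP the client connected to and the requested domain."""
    backend = DOMAIN_BACKENDS.get(host.lower().rstrip("."))
    if backend is None:
        return None  # unknown domain: nothing to route to
    addr = ip_address(dest_ip)
    restricted = any(addr in net for net in RESTRICTED_RANGES)
    return Route(backend, restricted)
```

For example, `pick_route("203.0.113.7", "app.example.com")` returns a route to the `apps` backend flagged as PII restricted, while the same domain reached through any other address is routed as GenPop traffic.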

To handle traffic for applications and services, SFE relies on a centralized API that receives requests from Kubernetes controllers deployed in every cluster running such applications and services. This links the domain names declared by an application to the clusters where the application is deployed. We also use this to provide active/active load balancing (two or more instances of a given service can receive requests at the same time) or active/passive load balancing (only a single instance of a given service receives requests at any time).

Providing load balancing at the routing layer instead of in DNS allows for near-instantaneous traffic changes, instead of depending on the DNS Time to Live (TTL) as described in my previous post. It avoids those decisions being made on the client side and thus gives us better command and control over the traffic.
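To make the two modes concrete, here is a small Python sketch of a registry that links domains to clusters and picks a target per request. The `RoutingRegistry` class and its methods are hypothetical names for illustration; the post does not describe the actual API or its data model.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DomainEntry:
    mode: str                                  # "active/active" or "active/passive"
    clusters: List[str] = field(default_factory=list)   # ordered: first is the primary
    healthy: Dict[str, bool] = field(default_factory=dict)


class RoutingRegistry:
    """Hypothetical in-memory stand-in for the centralized routing API."""

    def __init__(self) -> None:
        self._domains: Dict[str, DomainEntry] = {}

    def register(self, domain: str, cluster: str, mode: str = "active/active") -> None:
        # The mode is fixed by the first registration of a domain.
        entry = self._domains.setdefault(domain, DomainEntry(mode))
        if cluster not in entry.clusters:
            entry.clusters.append(cluster)
        entry.healthy[cluster] = True

    def set_health(self, domain: str, cluster: str, healthy: bool) -> None:
        self._domains[domain].healthy[cluster] = healthy

    def candidates(self, domain: str) -> List[str]:
        entry = self._domains[domain]
        up = [c for c in entry.clusters if entry.healthy.get(c, False)]
        # Active/passive: only the first healthy cluster takes traffic.
        return up[:1] if entry.mode == "active/passive" else up

    def pick(self, domain: str) -> Optional[str]:
        options = self.candidates(domain)
        return random.choice(options) if options else None
```

Because the decision is made for each request at the routing layer, marking a cluster unhealthy with `set_health` affects the very next call to `pick`, with no DNS record left to expire on the client side.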

Möbius

Möbius’s core concerns are simple: we grab the traffic from outside of Shopify and make sure it makes its way inside of Shopify in a stable, secure, and performant manner. Outside of Shopify is any client connecting to Shopify from outside a Shopify cluster. Inside of Shopify is, as far as Möbius is concerned, the routing cluster with the lowest latency to the receiving edge’s point-of-presence (PoP).
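The "lowest latency" rule can be pictured with a few lines of Python. This is only a sketch: the function name and the latency table are hypothetical, and how latencies are actually measured between a PoP and the routing clusters is not covered here.

```python
from typing import Mapping, Optional


def nearest_routing_cluster(latency_ms: Mapping[str, float]) -> Optional[str]:
    """Return the routing cluster with the lowest latency from a given PoP.

    `latency_ms` maps a routing cluster name to the latency observed from
    the PoP that received the request (illustrative data only).
    """
    if not latency_ms:
        return None  # no routing cluster is known for this PoP
    return min(latency_ms, key=latency_ms.__getitem__)
```

A PoP that sees `{"us-east": 12.5, "eu-west": 80.1}` would forward its traffic to `us-east`.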

Möbius is responsible for TLS and TCP termination with the clients, and for doing that termination as close as possible to the client. This makes requests faster and improves DDoS protection, and it allows us to filter malicious requests before the traffic even reaches our clusters. This was already done for our GenPop Shopify Core traffic, but Möbius now standardizes it. On top of handling the certificates for the shops, we added an automated path to handle certificates for application domains.

Configuration of the edge with Möbius and SFE. Domain updates are intercepted to update the edge provider’s domain and certificate store, making sure that we’re able to terminate TCP and TLS for those domains and let the request follow its path.

SFE already needs to be aware of the domains that the applications respond to, so instead of building the same logic a second time to configure the edge, we piggybacked on the work the SFE controller was already doing. We added handlers in the centralized API that configure those domains at the edge through API requests to our vendor, indicating that we expect to receive traffic on them and that requests should be forwarded to SFE. Our API handler takes care of any DNS challenge needed to prove that we own the domain so traffic can start flowing, and it also obtains a valid certificate.
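The sequence of steps can be sketched as follows. Everything in this Python example is hypothetical: `EdgeVendor` and `DnsProvider` are made-up interfaces rather than a real vendor SDK, the `_edge-challenge` record name is an assumption, and the in-memory classes exist only so the sketch runs on its own.

```python
from typing import Dict, Protocol, Set

CHALLENGE_PREFIX = "_edge-challenge"  # assumption: name of the TXT challenge record


class DnsProvider(Protocol):
    def publish_txt(self, name: str, value: str) -> None: ...


class EdgeVendor(Protocol):
    def add_domain(self, domain: str, origin: str) -> str: ...  # returns a challenge token
    def validate_domain(self, domain: str) -> bool: ...
    def issue_certificate(self, domain: str) -> bool: ...


def onboard_domain(domain: str, vendor: EdgeVendor, dns: DnsProvider,
                   origin: str = "sfe") -> str:
    """Declare a domain at the edge, answer the DNS challenge, get a certificate."""
    token = vendor.add_domain(domain, origin)          # expect traffic, forward to origin
    dns.publish_txt(f"{CHALLENGE_PREFIX}.{domain}", token)  # prove domain ownership
    if not vendor.validate_domain(domain):
        return "pending_validation"
    if not vendor.issue_certificate(domain):
        return "pending_certificate"
    return "active"


class InMemoryDns:
    def __init__(self) -> None:
        self.txt: Dict[str, str] = {}

    def publish_txt(self, name: str, value: str) -> None:
        self.txt[name] = value


class InMemoryVendor:
    """Fake vendor that validates a domain once the expected TXT record exists."""

    def __init__(self, dns: InMemoryDns) -> None:
        self.dns = dns
        self.tokens: Dict[str, str] = {}
        self.validated: Set[str] = set()

    def add_domain(self, domain: str, origin: str) -> str:
        self.tokens[domain] = f"token-for-{domain}"
        return self.tokens[domain]

    def validate_domain(self, domain: str) -> bool:
        record = self.dns.txt.get(f"{CHALLENGE_PREFIX}.{domain}")
        if record is not None and record == self.tokens.get(domain):
            self.validated.add(domain)
        return domain in self.validated

    def issue_certificate(self, domain: str) -> bool:
        return domain in self.validated
```

Returning a status string rather than raising on an incomplete step mirrors the idea that a handler can be retried until the domain becomes active, although the retry behavior itself is an assumption.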

Prior to Möbius, if an application owner wanted to take advantage of the edge, they had to configure their domain manually at the edge (validating ownership, obtaining a certificate, setting up the routing). Möbius fully automates that setup, so application owners can simply configure their ingress and DNS and reap the benefits of the edge right away.

Finally, it’s never easy to have many systems migrate to a new one. We aimed to make that change as easy as possible for application owners. With automation deploying all that was required, the last step was a simple DNS change for application domains, from targeting a direct-to-cluster record to targeting Möbius. We kept that change manual so that application owners own the process and can check that nothing breaks.

Example dashboard for the shopify-debug.com application (accessible publicly and used for debugging connectivity issues with merchants). On the dashboard, we can find a link to its edge logs and see that the domains of the application are properly configured at the edge to receive traffic and provide a TLS certificate. A test link also lets us simulate a connection to the platform using that domain so the response can be verified manually.

To make sure all is fine for an application before (and after!) migration, we also added observability that makes it easy to:

  • access the logs for a given application at the edge
  • identify which domains an application has configured at the edge
  • understand the status of those domains.

This allows owners of applications and services to immediately identify if one of their domains isn’t configured or behaving as expected.


On top of all the direct benefits that Möbius provides right away, it allows us to build the future of Shopify’s edge. Different teams are already working on improvements to the way we do caching at the edge, for instance, or on ways to use other edge features that we’re not yet taking advantage of. We also have ongoing projects to handle cluster-to-cluster communications through SFE, so that this traffic no longer has to go out through the edge and come back to our clusters.

Using new edge features and standardizing internal communications is possible because we unified the edge. There are exceptions, though: we need to avoid a cross-dependency for the applications and services that Möbius or SFE themselves depend on to function. If we onboarded them to Möbius and SFE, any issue would put us in a crash-loop situation: Möbius/SFE requires that application to work, but that application requires Möbius/SFE to work.
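One way to picture that exception is a check on a dependency graph: an application is unsafe to onboard if the edge or the routing layer depends on it, directly or transitively. The Python sketch below assumes a hypothetical graph format and component names; it is not a description of any actual Shopify tooling.

```python
from typing import Iterable, Mapping, Set


def transitive_dependencies(deps: Mapping[str, Iterable[str]], start: str) -> Set[str]:
    """Return everything that `start` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in deps.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def safe_to_onboard(app: str, deps: Mapping[str, Iterable[str]],
                    edge_components: Iterable[str] = ("mobius", "sfe")) -> bool:
    """False when an edge component needs `app`, which would create a cycle."""
    return all(app not in transitive_dependencies(deps, component)
               for component in edge_components)
```

With `deps = {"mobius": ["sfe"], "sfe": ["routing-api"], "routing-api": ["config-store"]}`, the hypothetical `config-store` service is not safe to onboard, while an application that merely sits behind SFE is.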

It’s now way easier to explain to new Shopifolk how traffic reaches us and what happens between a client and Shopify. There’s no need for as many conditionals in those explanations, nor as many whiteboards… but we might need more of those to explain all that we do as we grow the capabilities on our now-unified edge!

Raphaël Beamonte holds a Ph.D. in Computer Engineering in systems performance analysis and tracing, and sometimes gives lectures to future engineers, at Polytechnique Montréal, about Distributed Systems and Cloud Computing.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.


