Shopify’s New Machine Learning Platform — Data Science & Engineering (2022)

0
36
Shopify's New Machine Learning Platform — Data Science & Engineering (2022)


Shopify’s machine learning platform team builds the infrastructure, tools and abstracted layers to help data scientists streamline, accelerate and simplify their machine learning workflows. There are many different kinds of machine learning use cases at Shopify, internal and external. Internal use cases are being developed and used in specialized domains like fraud detection and revenue predictions. External use cases are merchant and buyer facing, and include projects such as product categorization and recommendation systems.

At Shopify we build for the long term, and last year we decided to redesign our machine learning platform. We need a machine learning platform that can handle different (often conflicting) requirements, inputs, data types, dependencies and integrations. The platform should be flexible enough to support the different aspects of building machine learning solutions in production, and enable our data scientists to use the best tools for the job.

In this post, we walk through how we built Merlin, our magical new machine learning platform. We dive into the architecture, working with the platform, and a product use case.

The Magic of Merlin

Our new machine learning platform is based on an open source stack and technologies. Using open source tooling end-to-end was important to us because we wanted to both draw from and contribute to the most up-to-date technologies and their communities as well as provide the agility in evolving the platform to our users’ needs.

Merlin’s objective is to enable Shopify’s teams to train, test, deploy, serve and monitor machine learning models efficiently and quickly. In other words, Merlin enables:

  1. Scalability: robust infrastructure that can scale up our machine learning workflows
  2. Fast Iterations: tools that reduce friction and increase productivity for our data scientists and machine learning engineers by minimizes the gap between prototyping and production
  3. Flexibility: users can use any libraries or packages they need for their models

For the first iteration of Merlin, we focused on enabling training and batch inference on the platform.

Merlin Architecture

A high level diagram of Merlin’s architecture

Merlin gives our users the tools to run their machine learning workflows. Typically, large scale data modeling and processing at Shopify happens in other parts of our data platform, using tools such as Spark. The data and features are then saved to our data lake or Pano, our feature store. Merlin uses these features and datasets as inputs to the machine learning tasks it runs, such as preprocessing, training, and batch inference.

With Merlin, each use case runs in a dedicated environment that can be defined by its tasks, dependencies and required resources — we call these environments Merlin Workspaces. These dedicated environments also enable distributed computing and scalability for the machine learning tasks that run on them. Behind the scenes, Merlin Workspaces are actually Ray clusters that we deploy on our Kubernetes cluster, and are designed to be short lived for batch jobs, as processing only happens for a certain amount of time.

We built the Merlin API as a consolidated service to allow the creation of Merlin Workspaces on demand. Our users can then use their Merlin Workspace from Jupyter Notebooks to prototype their work, or orchestrate it through Airflow or Oozie.

Merlin’s architecture, and Merlin Workspaces in particular, are enabled by one of our core components—Ray.

What Is Ray?

Ray is an open source framework that provides a simple, universal API for building distributed systems and tools to parallelize machine learning workflows. Ray is a large ecosystem of applications, libraries and tools dedicated to machine learning such as distributed scikit-learn, XGBoost, TensorFlow, PyTorch, etc.

When using Ray, you get a cluster that enables you to distribute your computation across multiple CPUs and machines. In the following example, we train a model using Ray:

We start by importing the Ray package. We call ray.init() to start a new Ray runtime that can run either on a laptop/machine or connect to an existing Ray cluster locally or remotely. This enables us to seamlessly take the same code that runs locally, and run it on a distributed cluster. When working with a remote Ray cluster, we can use the Ray Client API to connect to it and distribute the work.

In the example above, we use the integration between Ray and XGBoost to train a new model and distribute the training across a Ray cluster by defining the number of Ray actors for the job and different resources each Ray actor will use (CPUs, GPUs, etc.).

For more information, details and examples for Ray usage and integrations, check out the Ray documentation.

Ray In Merlin

At Shopify, machine learning development is usually done using Python. We chose to use Ray for Merlin’s distributed workflows because it enables us to write end-to-end machine learning workflows with Python, integrate it with the machine learning libraries we use at Shopify and easily distribute and scale them with little to no code changes. In Merlin, each machine learning project comes with the Ray library as part of its dependencies, and uses it for distributed preprocessing, training and prediction.

Ray makes it easy for data scientists and machine learning engineers to move from prototype to production. Our users start by prototyping on their local machines or in a Jupyter Notebook. Even at this stage, their work can be distributed on a remote Ray cluster, allowing them to run the code at scale from an early stage of development.

Ray is a fast evolving open source project. It has short release cycles and the Ray team is continuously adding and working on new features. In Merlin, we adopted capabilities and features such as:

  • Ray Train: a library for distributed deep learning which we use for training our TensorFlow and PyTorch models
  • Ray Tune: a library for experiment execution and hyperparameter tuning
  • Ray Kubernetes Operator: a component for managing deployments of Ray on Kubernetes and autoscale Ray clusters

Building On Merlin

A diagram of the user’s development journey in Merlin

A user’s first interaction with Merlin usually happens when they start a new machine learning project. Let’s walk through a user’s development journey:

  1. Creating a new project: The user starts by creating a Merlin Project where they can place their code and specify the requirements and packages they need for development
  2. Prototyping: Next, the user will create a Merlin Workspace, the sandbox where they use Jupyter notebooks to prototype on a distributed and scalable environment
  3. Moving to Production: When the user is done prototyping, they can productionize their project by updating their Merlin Project with the updated code and any additional requirements
  4. Automating: Once the Merlin Project is updated, the user can orchestrate and schedule their workflow to run regularly in production
  5. Iterating: When needed, the user can iterate on their project by spinning up another Merlin Workspace and prototyping with different models, features, parameters, etc.

Let’s dive a little deeper into these steps.

Merlin Projects

The first step of each machine learning use case on our platform is creating a dedicated Merlin Project. Users can create Merlin Projects for machine learning tasks like training a model or performing batch predictions. Each project can be customized to fit the needs of the project by specifying the system-level packages or Python libraries required for development. From a technical perspective, a Merlin Project is a Docker container with a dedicated virtual environment (e.g. Conda, pyenv, etc.), which isolates code and dependencies. As the project requirements change, the user can update and change their Merlin Project to fit their new needs. Our users can leverage a simple-to-use command line interface that allows them to create, define and use their Merlin Project.

Below is an example of a Merlin Project file hierarchy:

The config.yml file allows users to specify the different dependencies and machine learning libraries that they need for their use case. All the code relevant to a specific use case is stored in the src folder.

Once users push their Merlin Project code to their branch, our CI/CD pipelines build a custom Docker image.

Merlin Workspaces

Once the Merlin Project is ready, our data scientists can use the centralized Merlin API to create dedicated Merlin Workspaces in prototype and production environments. The interface abstracts away all of the infrastructure-related logic (e.g. deployment of Ray clusters on Kubernetes, creation of ingress, service accounts) so they can focus on the core of the job.

A high level architecture diagram of Merlin Workspaces

Merlin Workspaces also allow users to define the resources required for running their project. While some use cases need GPUs, others might need more memory and additional CPUs or more machines to run on. The Docker image that was created for a Merlin Project will be used to spin up the Ray cluster in a dedicated Kubernetes namespace for a Merlin Workspace. The user can configure all of this through the Merlin API, which gives them either a default environment or allows them to select the specific resource types (GPUs, memory, machine types, etc.) that their job requires.

Here’s an example of a payload that we send the Merlin API in order to create a Merlin Workspace:

Using this payload will result in a new Merlin Workspace which will spin up a new Ray cluster with the specific pre-built Docker image of one of our models at Shopify—our product categorization model, which we’ll dive into more later on. This cluster will use 20 Ray workers, each one with 10 CPUs, 30GB of memory and 1 nvidia-tesla-t4 GPU. The cluster will be able to scale up to 30 workers.

After the job is complete, the Merlin Workspace can be shut down, either manually or automatically, and return the resources back to the Kubernetes cluster.

Prototyping From Jupyter Notebooks

Once our users have their Merlin Workspace up and running, they can start prototyping and experimenting with their code from Shopify’s centrally hosted JupyterHub environment. This environment allows them to spin up a new machine learning notebook using their Merlin Project’s Docker image, which includes all their code and dependencies that will be available in their notebook.

An example of how our users can create a Merlin Jupyter Notebook

From the notebook, the user can access the Ray Client API to connect remotely to their Merlin Workspaces. They can then run their remote Ray Tasks and Ray Actors to parallelize and distribute the computation work on the Ray cluster underlying the Merlin Workspace.

This method of working with Merlin minimizes the gap between prototyping and production by providing our users with the full capabilities of Merlin and Ray right from the beginning.

Moving to Production

Once the user is done prototyping, they can push their code to their Merlin Project. This will kick off our CI/CD pipelines and create a new version of the project’s Docker image.

Merlin was built to be fully integrated with the tools and systems that we already have in place to process data at Shopify. Once the Merlin Project’s production Docker image is ready, the user can build the orchestration around their machine learning flows using declarative YAML (yet another markup language) templates or by configuring a DAG (directed Acyclic Graph) in our Airflow environment. The jobs can be scheduled to run periodically, call the production Merlin API to spin up Merlin Workspaces and run Merlin jobs on them.

A simple example of an Airflow DAG running a training job on Merlin

The DAG in the image above demonstrates a training flow, where we create a Merlin Workspace, train our model on it and—when it’s done—delete the workspace and return the resources back to the Kubernetes cluster.

We also integrated Merlin with our monitoring and observability tools. Each Merlin Workspace gets its own dedicated Datadog dashboard which allows users to monitor their Merlin job. It also helps them understand more about the computation load of their job and the resources it requires. On top of this, each Merlin job sends its logs to Splunk so that our users can also debug their job based on the errors or stacktrace.

At this point, our user’s journey is done! They created their Merlin Project, prototyped their use case on a Merlin Workspace and scheduled their Merlin jobs using one of the orchestrators we have at Shopify (e.g Airflow). Later on, when the data scientist needs to update their model or machine learning flow, they can go back to their Merlin Project to start the development cycle again from the prototype phase.

Now that we explained Merlin’s architecture and our user journey, let’s dive into how we onboarded a real-world algorithm to Merlin—Shopify’s product categorization model.

Onboarding Shopify’s Product Categorization Model to Merlin

A high level diagram of the machine learning workflow for the Product Categorization model

Recently we rebuilt our product categorization model to ensure we understand what our merchants are selling, so we can build the best products that help power their sales. This is a complex use case that requires several workflows for its training and batch prediction. Onboarding this use case to Merlin early on enabled us to validate our new platform, as it requires large scale computation and includes complex machine learning logic and flows. The training and batch prediction workflows were migrated to Merlin and converted using Ray.

Migrating the training code

To onboard the product categorization model training stage to Merlin, we integrated its Tensorflow training code with Ray Train, for distributing training across a Ray cluster. With Ray Train, changing the code to support the distributed training was easy and required few code changes – the original logic stayed the same, and the core changes are described in the example below.

The following is an example of how we integrated Ray Train with our Tensorflow training code for this use-case:

The TensorFlow logic for the training step stays the same, but is separated out into its own function. The primary change is adding Ray logic to the main function. Ray Train allows us to specify the job configuration, with details such as number of workers, backend type, and GPU usage.

Migrating inference

The inference step in the product categorization model is a multi-step process. We migrated each step separately, using the following method. We used Ray ActorPool to distribute each step of batch inference across a Ray cluster. Ray ActorPool is similar to Python’s `multiprocessing.Pool` and allows scheduling Ray tasks over a fixed pool of actors. Using Ray ActorPool is straightforward and allows easy configuration for parallelizing the computation.

Here’s an example of how we integrated Ray ActorPool with our existing inference code to perform distributed batch predictions:

We first need to create our Predictor class (a Ray Actor), which includes the logic for loading the product categorization model, and performing predictions on product datasets. In the main function, we use the size of the cluster (ray.available_resources()["CPU"]) to create all the Actors that will run in the ActorPool. We then send all of our dataset partitions to the ActorPool for prediction.

While this method works for us at the moment, we plan to migrate to using Ray Dataset Pipelines which provides a more robust way to distribute the load of the data and perform batch inference across the cluster with less dependence on the number of data partitions or their size.

What’s next for Merlin

As Merlin and its infrastructure mature, we plan to continue growing and evolving to better support the needs of our users. Our aspiration is to create a centralized platform that streamlines our machine learning workflows in order to enable our data scientists to innovate and focus on their craft.

Our next milestones include:

  • Migration: Intention to migrate all of Shopify’s machine learning use cases and workflows to Merlin and adding a low code framework to onboard new use cases
  • Online inference: Support real time serving of machine learning models at scale
  • Model lifecycle management: Add model registry and experiment tracking
  • Monitoring: Support monitoring for machine learning

While Merlin is still a new platform at Shopify, it’s already empowering us with the scalability, fast iteration and flexibility that we had in mind when designing it. We’re excited to keep building the platform and onboarding new data scientists, so Merlin can help enable the millions of businesses powered by Shopify.

Isaac Vidas is a tech lead on the ML Platform team, focusing on designing and building Merlin, Shopify’s machine learning platform. Connect with Isaac on LinkedIn.


If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.



Source link

Leave a reply

Please enter your comment!
Please enter your name here