Model Serving – Scaling

Why is scaling important?

  • Incoming traffic grows over time, and models become more complex
  • Scaling dimensions
    • Data volume
    • Model parameters

Types of scaling

  • Vertical
    • More power on a single device
    • Upgrade RAM/CPU/GPU/TPU
    • Faster storage
  • Horizontal
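    • More devices on the network
    • Add devices as demand rises
    • Remove devices, down to a minimum, as demand falls
    • Horizontal scaling is generally preferred to vertical scaling
      • Elasticity
      • No need to take the service offline to change capacity
      • Not capped by the hardware limits of a single device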
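
As a rough illustration of horizontal elasticity, the sketch below computes how many replicas a service would need for a given load. It uses a proportional rule of the same shape as the one described for the Kubernetes Horizontal Pod Autoscaler; the function name, the load units, and the default minimum and maximum are assumptions made for this example, not values from any real deployment.

```python
import math


def desired_replicas(current_replicas: int, current_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return the replica count that brings per-replica load back to the target.

    current_load and target_load are per-replica values in the same
    (hypothetical) unit, for example requests per second.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    # Proportional rule: scale the replica count by the load ratio, rounding up.
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp so we never drop below the minimum or exceed the maximum.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, current_load=90, target_load=60))  # traffic spike
    print(desired_replicas(4, current_load=10, target_load=60))  # quiet period
```

The clamp mirrors the "scale down to a minimum" bullet above: capacity follows demand in both directions, but never falls below a floor that keeps the service reachable.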


Containerization is the usual route to a highly scalable serving infrastructure

  • Left – virtual machines: each VM runs its own separate guest OS on top of a hypervisor
    • The guest OS runs on the hypervisor, not directly on the hardware
  • Right – Docker containers: lightweight environments that run on top of the Docker engine
    • Each container has its own file system
    • A Docker image is a stack of file system layers; the lower layers are read-only (snapshots of the file systems of the images it depends on)
    • The host OS still runs directly on the hardware, and containers share it instead of booting their own OS
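
The layering idea can be mimicked in a few lines of Python. This is only a toy model, not how Docker stores images: each layer is a dictionary from path to content, the image layers are made read-only, and every container gets its own writable layer on top. All paths, contents, and the `start_container` helper are hypothetical.

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical read-only image layers, lowest first (path -> file content).
base_layer = MappingProxyType({"/bin/sh": "shell", "/etc/os-release": "base-os"})
app_layer = MappingProxyType({"/app/model.pkl": "weights-v1",
                              "/etc/os-release": "patched-os"})


def start_container(*image_layers):
    """Stack a fresh writable layer on top of the shared read-only image layers."""
    writable = {}
    # ChainMap searches left to right, so upper layers shadow lower ones,
    # and all writes go to the first mapping (the writable layer).
    return ChainMap(writable, *reversed(image_layers))


c1 = start_container(base_layer, app_layer)
c2 = start_container(base_layer, app_layer)

# The write lands in c1's own writable layer; the shared image is untouched.
c1["/app/model.pkl"] = "weights-v2"

print(c1["/app/model.pkl"])  # c1 sees its own copy
print(c2["/app/model.pkl"])  # c2 still sees the image's version
```

Because the image layers are shared and never modified, starting another container costs only one new empty writable layer, which is part of why containers are cheap to replicate.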


Container Orchestration

  • Manage the life cycle of container instances in production environments
  • Scaling of containers
  • Reliability of containers
    • Containers on hot standby
  • Distribute resources among containers
  • Monitor health of containers
  • Popular container orchestration tools
    • Kubernetes
      • Kubeflow – ML workflows on Kubernetes
        • Makes deployments of ML workflows portable and scalable
        • Kubeflow can be used anywhere Kubernetes runs, whether on premises or in the cloud
    • Docker Swarm
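
The responsibilities listed above (life cycle, scaling, reliability, health monitoring) are often implemented as a control loop that compares the desired state with the observed state and corrects the difference. The sketch below is a toy version of such a loop in plain Python; it does not call Kubernetes or Docker Swarm, and the `Container` and `Orchestrator` classes and the `model-server-N` names are invented for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Container:
    name: str
    healthy: bool = True


@dataclass
class Orchestrator:
    desired: int                                   # how many containers should run
    containers: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def reconcile(self):
        """One control-loop pass: drop unhealthy containers, then match the desired count."""
        # Monitor health: keep only containers that still report healthy.
        self.containers = [c for c in self.containers if c.healthy]
        # Scale up: start replacements until the desired count is reached.
        while len(self.containers) < self.desired:
            self.containers.append(Container(f"model-server-{next(self._ids)}"))
        # Scale down: stop any surplus containers.
        del self.containers[self.desired:]
        return [c.name for c in self.containers]


if __name__ == "__main__":
    orch = Orchestrator(desired=3)
    print(orch.reconcile())            # initial rollout
    orch.containers[1].healthy = False
    print(orch.reconcile())            # failed container is replaced
```

A real orchestrator runs this kind of pass continuously and also decides which machine each container lands on, which this sketch leaves out.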

