Cloud Native machine learning at Lyft with AWS Batch and Amazon EKS
Concept of AWS Batch and workflow:
From 01:37, Steve explains that AWS Batch is a cloud-native way to schedule batch jobs. Unlike on-premises schedulers, which are installed on machines with limited resources and therefore limited scheduling choices, Batch is a managed service that can draw on the much larger pool of resources available in the cloud.
At 02:52, he describes the native integration with the AWS platform. AWS Batch integrates with new file systems and new instance families as they are released, as well as with Step Functions and Auto Scaling, to give maximum control over cloud resources. From 03:24, he notes that Batch is optimized for resource provisioning, which means that AWS runs and controls the customer's workflow at minimum cost while making maximum use of the resources. Batch also integrates with the Spot market, which customers can use to reduce compute costs by 70-80%.
At 04:20, Steve talks about who uses AWS Batch. It is used in weather forecasting, healthcare, risk analysis and industry formation.
AWS Batch job architecture:
At 04:50, he walks through the architecture. The customer provides a job definition, and new inputs arrive in the customer's account, for example in an S3 bucket. The inputs are submitted to an AWS Batch job queue, which is managed by AWS Batch.
The AWS Batch scheduler evaluates the queue and runs the jobs on instances in an AWS Batch compute environment. When the jobs finish, the output is transferred back to an S3 bucket. At 06:50, an example highlights that customers can define dependencies between jobs, which AWS Batch then manages. This makes AWS Batch a natural fit for workflow managers.
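The submission flow above, including a dependency between two jobs, might look roughly like the following sketch. It only builds the parameters for the boto3 Batch `submit_job` call. The queue name, job definition names, S3 paths, environment variable names and the job id are made-up placeholders, and the real submission sits in a function that is not called here because it needs boto3 and AWS credentials.

```python
from typing import Any, Dict, List, Optional


def build_submit_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    input_uri: str,
    output_uri: str,
    depends_on: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Batch SubmitJob API."""
    request: Dict[str, Any] = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Pass the S3 locations to the container as environment variables
        # (the variable names are an assumption of this sketch).
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_URI", "value": input_uri},
                {"name": "OUTPUT_URI", "value": output_uri},
            ]
        },
    }
    if depends_on:
        # Batch holds this job until every listed job has succeeded.
        request["dependsOn"] = [{"jobId": job_id} for job_id in depends_on]
    return request


def submit(request: Dict[str, Any]) -> str:
    """Submit the job and return its id (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("batch").submit_job(**request)["jobId"]


# Hypothetical two-step pipeline: training waits for preprocessing.
preprocess = build_submit_request(
    "preprocess", "example-queue", "example-preprocess-def",
    "s3://example-bucket/raw/", "s3://example-bucket/clean/",
)
train = build_submit_request(
    "train", "example-queue", "example-train-def",
    "s3://example-bucket/clean/", "s3://example-bucket/model/",
    depends_on=["example-preprocess-job-id"],
)
```

In real use, the id passed in `depends_on` would be the value returned by `submit(preprocess)` rather than a fixed string.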
Flyte; how their products use Amazon services:
At 09:50, Ally Gale introduces Flyte, the cloud-native machine learning and data processing platform built at Lyft. Building a skilled team and developing complex machine learning workflows and data pipelines is hard for a single company to do. The infrastructure matters, and it is complex and time-consuming to develop.
At 11:01, Ally explains that, for the business, collecting data, processing it and then applying it in machine learning are converging. Machine learning is more than just an algorithm: it also involves collecting data, transforming it and tuning the use of machine resources.
At 12:08, an example is shown in which each box is a service, and each box consists of many complex pipelines and workflows. Flyte executes 1 million tasks per month.
From 13:35, Matt explains how Flyte makes it easier to work, instead of maintaining complex infrastructure, pricing and forecasting. Each workflow becomes difficult to handle when it is replicated several times for the ETA teams. Using many existing orchestration engines becomes difficult for teams to manage, and so does managing the infrastructure. At 15:42, Matt highlights that each job has a different size and shape, so jobs are very hard to pack onto a cluster. Rollbacks are also hard to manage, and a team might end up trampling the model that exists in production.
Abstraction of task: from 16:22, a task is a function that takes inputs and produces certain outputs from them. The image below shows a Python configuration for a Spark task. It specifies the number of executors, the amount of memory and the number of CPUs needed to complete the task, as well as the output the task produces. A task is the small unit from which complex workflows are built. In this case AWS Batch is used: the tasks are compiled, their spec is maintained in protobuf, and they are submitted to AWS Batch, which allows the scheduler to work efficiently.
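The idea of a task as a function with a declared interface and resource needs can be sketched in plain Python as follows. This is not the flytekit API: the `Task` and `Resources` classes, their field names and the `count_rides` example are all hypothetical, and JSON stands in for the protobuf spec mentioned in the talk.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Resources:
    """Resources a task asks for (hypothetical field names)."""
    executors: int
    cpus: int
    memory_gb: int


@dataclass(frozen=True)
class Task:
    """A function plus a declared interface and resource request."""
    name: str
    inputs: Dict[str, str]    # input name -> type name
    outputs: Dict[str, str]   # output name -> type name
    resources: Resources
    fn: Callable[..., Dict[str, Any]]

    def spec(self) -> str:
        """Serialize the declarative part (JSON as a stand-in for protobuf)."""
        return json.dumps(
            {
                "name": self.name,
                "inputs": self.inputs,
                "outputs": self.outputs,
                "resources": asdict(self.resources),
            },
            sort_keys=True,
        )

    def __call__(self, **kwargs: Any) -> Dict[str, Any]:
        # Reject calls that do not match the declared input names.
        if set(kwargs) != set(self.inputs):
            raise TypeError(f"{self.name} expects inputs {sorted(self.inputs)}")
        result = self.fn(**kwargs)
        # Reject results that do not match the declared output names.
        if set(result) != set(self.outputs):
            raise TypeError(f"{self.name} must produce {sorted(self.outputs)}")
        return result


count_rides = Task(
    name="count_rides",
    inputs={"rides": "list"},
    outputs={"total": "int"},
    resources=Resources(executors=4, cpus=2, memory_gb=8),
    fn=lambda rides: {"total": len(rides)},
)

result = count_rides(rides=["a", "b", "c"])
```

Keeping the spec separate from the function body is what lets a scheduler decide where to run the task without executing any user code.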
From 17:41, a workflow is similar to a task: it has an interface, and it is composed of tasks. Larger workflows can in turn be composed of sub-workflows. How does it work? The inputs are passed to the first task of the workflow, and the outputs of that task are then passed to the next task in the workflow. This makes replication easy and makes it easy to work with different schedulers. It also helps to analyze errors from production and to roll back quickly to a previous pipeline.
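A minimal sketch of this chaining, assuming each task takes a dictionary of inputs and returns a dictionary of outputs. The `Workflow` class and the `clean` and `mean` tasks are hypothetical illustrations, not Flyte code. Because a workflow exposes the same call interface as a task, it can be nested inside a larger workflow as a sub-workflow.

```python
from typing import Callable, Dict, List

Data = Dict[str, object]
Step = Callable[[Data], Data]


class Workflow:
    """Runs steps in order, feeding each step's outputs to the next one."""

    def __init__(self, name: str, steps: List[Step]) -> None:
        self.name = name
        self.steps = steps

    def __call__(self, inputs: Data) -> Data:
        data = dict(inputs)
        for step in self.steps:
            data = step(data)
        return data


def clean(data: Data) -> Data:
    """Task 1: drop missing values."""
    return {"values": [v for v in data["values"] if v is not None]}


def mean(data: Data) -> Data:
    """Task 2: average the cleaned values."""
    values = data["values"]
    return {"mean": sum(values) / len(values)}


# A workflow is callable like a task, so it can be used as a sub-workflow.
prepare = Workflow("prepare", [clean])
pipeline = Workflow("pipeline", [prepare, mean])

output = pipeline({"values": [1.0, None, 3.0]})
```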
From 20:00, Matt explains multi-tenancy, which is the use of workflows by multiple customers. Workflows can also be scheduled to execute in the future. Multi-tenant use needs to draw on a huge amount of data from Lyft's production history.
Project: at 20:34, a project offers a logical grouping of workflows and tasks, which can be split across one or more containers that run on AWS Batch. Workflows run on thousands of nodes and need to be retrained after a while, and AWS Batch is used to prevent them from being trampled.
Domains help with the integration of workflows and give workflows their semantics. With domains, a user can push a new version to production to produce data for the business, and can roll back to a previous version. Domains are configured globally.
Sharing: from 22:23, sharing is an important aspect when teams are working on similar problems. An existing pipeline or workflow can be reused to run and solve the problem with a different version of the scheduler. The flytekit sharing example below shows project B running on the project resources of project A.
From 24:24, the system architecture is described. Flyte has a control plane, a user plane and an execution plane, along with components that allow users to interact with the platform in a variety of ways. The architecture is programmed in Python but can be carried over to other languages. By using this:
- One can register workflows and tasks.
- One can execute workflows.
- One can monitor workflows and retrieve their inputs and outputs.
At 25:22, central registration maintains the version history of workflows and tasks, and any version can be used at any time. The full number of workflows, versions and nodes cannot run on a single cloud, so multiple clouds are used to isolate domains and workflow nodes and to run them at the same time.
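Central registration with a version history and rollback can be sketched as a small in-memory registry. This is an illustration under stated assumptions, not Flyte's implementation: the `Registry` class, its method names and the project, domain and version strings are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (project, domain, workflow name)


class Registry:
    """Keeps every registered version and tracks which one is active."""

    def __init__(self) -> None:
        self._versions: Dict[Key, List[str]] = defaultdict(list)
        self._active: Dict[Key, str] = {}

    def register(self, project: str, domain: str, name: str, version: str) -> None:
        key = (project, domain, name)
        if version in self._versions[key]:
            raise ValueError(f"{version} is already registered for {key}")
        self._versions[key].append(version)
        self._active[key] = version  # the newest registration becomes active

    def history(self, project: str, domain: str, name: str) -> List[str]:
        return list(self._versions.get((project, domain, name), []))

    def active(self, project: str, domain: str, name: str) -> str:
        return self._active[(project, domain, name)]

    def rollback(self, project: str, domain: str, name: str, version: str) -> None:
        key = (project, domain, name)
        if version not in self._versions.get(key, []):
            raise KeyError(f"{version} was never registered for {key}")
        self._active[key] = version  # old versions are kept, so this is cheap


registry = Registry()
registry.register("pricing", "production", "train_model", "v1")
registry.register("pricing", "production", "train_model", "v2")
active_before = registry.active("pricing", "production", "train_model")
registry.rollback("pricing", "production", "train_model", "v1")
active_after = registry.active("pricing", "production", "train_model")
```

Because earlier versions are never overwritten, rolling back only changes a pointer and does not disturb what was registered later.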
From 28:20, users do not need to manage any physical machines. They only need to request CPU, GPU and memory, and the work is submitted to Batch through Kubernetes. Flyte also uses automatic scaling for resources and offers custom rate limiting for any resource. At 29:22, the service-oriented design means that clients for other languages can be generated from the API. It can also be used to build a service in which a client drags and drops tasks in a GUI.
Data catalog and memoization:
From 30:40, every task execution is recorded in the catalog. Each task has a specific signature and specific input values. Recorded workflows can be reused with different inputs. The data catalog avoids the costly cycle of debugging, fixing and then re-running the workflow again and again to check it. With Flyte, it is easy to work from the outputs of a previously executed workflow. The catalog is also used to evaluate the behavior of workflows. From 33:13, this process is described as time-saving and reliable.
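The memoization idea can be sketched as a cache keyed on the task's identity and its input values, so that a repeated execution returns the recorded output instead of running again. The `Catalog` class and its key layout are assumptions made for illustration, not how Flyte's data catalog is built.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


class Catalog:
    """Records task outputs keyed by task name, version and inputs."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0

    @staticmethod
    def key(task_name: str, task_version: str, inputs: Dict[str, Any]) -> str:
        # Sorting keys makes the hash independent of argument order.
        payload = json.dumps(
            {"task": task_name, "version": task_version, "inputs": inputs},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task_name: str, task_version: str,
            fn: Callable[..., Any], inputs: Dict[str, Any]) -> Any:
        cache_key = self.key(task_name, task_version, inputs)
        if cache_key in self._store:
            self.hits += 1            # reuse the recorded output
            return self._store[cache_key]
        output = fn(**inputs)         # first execution: run and record
        self._store[cache_key] = output
        return output


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)                   # track how often the task really runs
    return x * x


catalog = Catalog()
first = catalog.run("square", "v1", square, {"x": 4})
second = catalog.run("square", "v1", square, {"x": 4})   # served from the catalog
third = catalog.run("square", "v2", square, {"x": 4})    # new version, runs again
```

Including the version in the key means that a changed task never picks up stale outputs from an older version.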
The reliability goals are:
- Observability: extensive user visibility per project, to ensure that the reliability goals are being achieved.
- Monitoring: customizable notifications with existing integrations. Latencies in the system are visible, to ensure that customer demands are met.
- Security: pre-execution RBAC using service accounts. Flyte is backed by AWS service accounts to authorize specific users on Flyte.
Extensible; plugin and backend plugin:
From 34:33, extensibility is covered. Flyte provides tools for testing and development and takes care of the boilerplate, and plugins are executed in containers, which lets Flyte's capabilities grow rapidly. From 35:48, backend plugins are used for deep integration with Flyte through a Golang interface, for special visualization and resource management. Extra logs can be added in the lifecycle.
At 36:35, in a demo of Flyte, a workflow is created and trained, and its results are obtained by passing inputs into the workflow's tasks. The demo uses the Python SDK and covers the creation of tasks, registration, workflow sharing, and the data catalog and memoization of workflows.
At 40:47, the steps involved in executing the workflow are shown in order of execution. At 40:42, the run with the catalog took 29 seconds, whereas the workflow would take hours to execute.