Mastering Chaos – A Netflix Guide to Microservices
A complement to the Microservices at Netflix Scale talk is this presentation from Josh Evans: Mastering Chaos – A Netflix Guide to Microservices.
Josh provides a deeper dive into the details of the architecture's components. Download the slides & audio at InfoQ.
Introduction and Overview
Beginning at 5:30 Josh provides an introduction to what microservices are and aren’t.
He starts with the basics: the anatomy of a microservice, the challenges around distributed systems, and the benefits. Then he builds on that foundation, exploring the cultural, architectural, and operational methods that lead to microservice mastery.
At 6:02 he states that Netflix initially had an infrastructure that was hardware-oriented and very expensive, and that the base architecture was subject to frequent change.
At 8:22 Josh defines microservices as an architectural approach for developing a single application as a suite of small services, each running in its own process and communicating via lightweight mechanisms. At 9:38 he emphasizes separation of concerns, including modularity and encapsulation, and the biggest benefits Netflix gains from microservices: scalability and virtualization.
Edge Service
From 10:00 Josh walks through the overall architecture of the Netflix platform.
He starts with the Edge Service, made up of the ELB and Zuul for dynamic routing, the NCCP (Netflix Content Control Plane) and the API gateway that is core to their modern architecture, calling out to all the other services to fulfill customer requests.
At 10:40 he moves on to the middle tier and platform, an environment made up of many components, such as A/B testing and subscriber services, a recommendations system, and platform services such as microservice routing, dynamic configuration, crypto operations and persistence layers.
Microservices architecture and practices
From 11:40 he explores the core principles and challenges of microservices, highlighting the complexity of their interoperation with client libraries and caches, and explores mitigating these challenges through four key factors: Dependency, Scale, Variance and Change.
At 12:55 Josh explains how microservices achieve abstraction, using the example of the EVCache client sending a request that the server then processes against the backend database.
At 15:10 he explains the scenario of Cascading Failure: when one service fails and its callers lack proper defenses against that failure, the failure cascades and can, in turn, bring down the entire deployed system.
Josh elaborates on the role of Hystrix-based ‘circuit breakers’ in preventing this from occurring, and at 16:40 he describes the practice this enables: FIT – Failure Injection Testing. As the name suggests, this provides a method for testing various microservice failure scenarios within a live context.
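The circuit-breaker pattern Josh refers to can be sketched in a few lines. This is a generic illustration of the pattern, not Netflix's Hystrix implementation; the thresholds and names are hypothetical:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: trip open after N consecutive
    failures, serve a fallback while open, and allow a retry once a
    cooldown period has elapsed."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, serve the fallback until the cooldown elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: let one call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result
```

The key property is the one Josh highlights: a caller protected this way returns a degraded response instead of hanging on, and thereby propagating, a downstream failure.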
Critical Microservices
A challenge that this presents is how to manage the scope of what is tested, given the exponential scale of permutations that can arise from so many microservice interactions.
Central to achieving this is that Netflix defined ‘Critical Microservices’ – the minimum level of service a customer would want in the event of failures, i.e. a basic browse and watch capability. Customers will accept the loss of some value-add services, like personalization, as long as they can still achieve at least this.
From this they created ‘FIT recipes’, templates that blacklisted all the other non-critical services. This enabled them to test the ongoing availability of critical services in the event of the loss of the other functions.
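The ‘FIT recipe’ idea can be sketched as a simple blacklist derivation. All names here are hypothetical; the real recipes are richer, but the principle is the same: take the critical set and inject failure into everything else.

```python
# Hypothetical service names, for illustration only.
CRITICAL = {"login", "browse", "play"}

def fit_recipe(all_services, critical=CRITICAL):
    """Derive the blacklist for a FIT run: every non-critical
    service gets failure injected."""
    return sorted(set(all_services) - set(critical))

def call_service(name, real_call, blacklist):
    """Simulate a service call under FIT: blacklisted services fail."""
    if name in blacklist:
        raise RuntimeError(f"FIT: injected failure for {name}")
    return real_call()
```

Running such a recipe in production then verifies that the critical browse-and-watch path still works while everything else is failing.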
Return of the Monolith
From 19:15 Josh explains the heated debate Netflix had about whether to adopt an approach of building client libraries or not.
They saw considerable benefit in them: common logic and access patterns for calling services offer a much simpler approach, and so they decided to adopt them.
The interesting challenge that arose is that, in essence, this began rebuilding a monolith – a new kind, in which the API gateway runs a lot of in-process code, giving rise to problems like heap consumption, logical defects and transitive dependencies that pull in conflicting libraries.
The conclusion of this debate has been to aim for a balance between the ultra-simple, bare-bones REST-only model and the most simplified client libraries possible.
Eventual Consistency
Beginning at 22:00 Josh explores their approach to persistence, with the ‘CAP Theorem‘ guiding their thoughts in this area: in scenarios where a particular database node in one availability zone is unreachable while the others remain available, one has to choose between consistency and availability.
Netflix opted for an approach of ‘Eventual Consistency‘: write to the available nodes and catch up later, with full replication between them eventually achieving data consistency – a model that Cassandra supports very well.
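The write-now, catch-up-later behaviour can be sketched as follows. This is a toy model in the spirit of Cassandra's hinted handoff, not its implementation; the class names are invented for illustration:

```python
class Replica:
    """A single database replica that may be temporarily unreachable."""
    def __init__(self):
        self.data = {}
        self.available = True

    def write(self, key, value):
        if not self.available:
            raise ConnectionError("replica unreachable")
        self.data[key] = value

class EventuallyConsistentStore:
    """Sketch of eventual consistency: write to reachable replicas
    now, queue 'hints' for unreachable ones, replay them later."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (replica, key, value) writes to retry

    def write(self, key, value):
        for r in self.replicas:
            try:
                r.write(key, value)
            except ConnectionError:
                self.hints.append((r, key, value))

    def repair(self):
        # Catch-up replication once replicas come back online.
        remaining = []
        for r, key, value in self.hints:
            try:
                r.write(key, value)
            except ConnectionError:
                remaining.append((r, key, value))
        self.hints = remaining
```

The trade-off is exactly the one Josh describes: a reader may briefly see stale data from a lagging replica, but the system stays available through the outage and converges afterwards.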
Scale
At 25:00 Josh moves on to a major section discussing scaling.
This is broken down into sections addressing stateless and stateful service scenarios. He first explains that, through the use of auto-scaling, dealing with the scaling requirements of stateless services is almost a no-brainer: responding to failures and traffic spikes is easily accomplished by auto-provisioning new AWS resources.
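The core of such an auto-scaling policy fits in a few lines. This sketch mimics a target-tracking policy of the kind AWS offers; the target value and bounds are hypothetical:

```python
import math

def desired_instances(current, cpu_utilization, target=0.5,
                      min_n=2, max_n=100):
    """Target-tracking sketch: size the fleet so that average CPU
    utilization moves toward the target, clamped to fleet bounds."""
    desired = math.ceil(current * cpu_utilization / target)
    return max(min_n, min(max_n, desired))
```

A traffic spike that pushes 10 instances to 80% CPU would grow the fleet to 16; when load falls away, the same rule shrinks it back down, which is why stateless scaling is "almost a no-brainer".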
Stateful services, those that use databases and/or store significant levels of their own data, are a different ball game altogether, presenting a much more significant challenge for handling their scaling requirements.
Technology central to this scenario is the sharded use of EVCache. Not only does this write data to multiple nodes, but across multiple availability zones too. Reads happen locally, but the application can again read across multiple zones if needed.
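The zone-replication behaviour described here can be sketched as below. This is an illustration of the idea, not EVCache's actual client; zone names and structure are invented:

```python
class ZoneReplicatedCache:
    """Sketch of zone replication: writes fan out to a cache replica
    in every availability zone; reads try the local zone first and
    fall back to remote zones on a miss."""

    def __init__(self, zones, local_zone):
        self.caches = {z: {} for z in zones}
        self.local_zone = local_zone

    def set(self, key, value):
        for cache in self.caches.values():  # write to every zone
            cache[key] = value

    def get(self, key):
        local = self.caches[self.local_zone]
        if key in local:
            return local[key]  # fast local-zone read
        for zone, cache in self.caches.items():  # cross-zone fallback
            if zone != self.local_zone and key in cache:
                return cache[key]
        return None
```

The payoff is that losing one zone's cache (or one node's shard) costs latency on the fallback path, not data.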
At 31:40 he moves on to hybrid services, where excessive load scenarios can be managed well with the hybrid architectural design; he adds that workload partitioning and request-level caching are employed in these cases.
Variance
At 33:35 Josh explores the challenges of variance within an IT architecture, a challenge that grows as the architecture increases in scale, primarily through operational drift and the introduction of new languages and containers.
Operational drift is the inevitable, unconscious aging of the various factors involved in maintaining a complex system, best addressed through continuous learning and automation: for example, quickly learning from incidents as they happen and automating the remedying best practices into the infrastructure, so that “knowledge becomes code”.
In contrast, introducing new languages and technologies like containers is a conscious decision to add new complexity to the environment. His operations team standardized a ‘paved road’ of the best technologies for Netflix, baking in their pre-defined best practices, based around Java and EC2; but as the business evolved, developers began adding new components such as Python, Ruby, Node.js and Docker.
This repeated the challenge of growing a new monolith, with the API service becoming overloaded with code in such a way as to cause various failure scenarios; they addressed this by carving out Node.js components to run as small apps in Docker containers.
At 40:15 Josh summarizes the cost of these variances, notably:
- Productivity tooling – Managing these new technologies required new tools.
- Insight and triage capabilities – New tools were likewise needed to reveal insights about performance factors and aid triage.
- Base image fragmentation – A simple base AMI became more fragmented and specialized.
- Node management – The challenge of node management was so significant that they found no off-the-shelf technology available for it, so they built Titus.
- Library / platform duplication – Providing the same core platform functions to these new technologies required duplicating them for the new stacks, such as rewriting some of them in Node.js.
- Learning curve / production expertise – Inevitably introducing new technologies presents new challenges that must be overcome and learned from.
Ultimately this resulted in them operating multiple ‘paved roads’, a variance they sought to minimize and manage by constraining centralized support, educating teams about the costs of their decisions, and seeking reusable solutions where possible.
Change velocity
From 43:30 Josh begins to wrap up by examining the headline theme – How do you achieve software innovation velocity with confidence? How can you be continually introducing new change into a system with minimal breakages?
Fundamentally Netflix achieved this through their use of Spinnaker. This integrated the best practices they had learned into their deployment life-cycles, automatically applying capabilities such as canary analysis and staged deployments.
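The canary-analysis step that Spinnaker automates boils down to a statistical comparison between a canary and a baseline. The sketch below illustrates the idea with a crude error-rate check; Spinnaker's actual judge is far more sophisticated, and the tolerance factor here is hypothetical:

```python
def canary_verdict(baseline_errors, canary_errors, tolerance=1.5):
    """Toy canary analysis: compare the canary's mean error rate
    against the baseline's and roll back if it degrades beyond a
    tolerance factor."""
    baseline_rate = sum(baseline_errors) / len(baseline_errors)
    canary_rate = sum(canary_errors) / len(canary_errors)
    if baseline_rate == 0:
        return "promote" if canary_rate == 0 else "rollback"
    return "promote" if canary_rate <= baseline_rate * tolerance else "rollback"
```

The point Josh makes is that baking this kind of check into every deployment pipeline is what lets teams ship continuously with confidence, rather than relying on each team remembering to do it.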
Closing out, he goes back to 2009 to describe the evolution of the Netflix tech organization, with a view to explaining how departmental structures and dynamics can also be a major force in shaping the design of systems and a factor in enabling or inhibiting change.