Netflix Security Tools: Chaos Monkey and Chaos Gorilla

Netflix has a security tool called Chaos Monkey, along with its successor Chaos Gorilla, that changed the way businesses test for resilience.


Netflix is one of the most popular tech companies of the 21st century, especially during this pandemic, which has so many people sitting at home looking for ways to entertain themselves. They were one of the first to make online streaming of movies and TV shows easily accessible and affordable for the everyday consumer, and they put older companies like Blockbuster completely out of business. But their innovation doesn’t stop there. They took an extremely daring and creative approach to testing the fault tolerance of their servers, one that most companies would not be willing to try.

Rather than hoping that their servers never failed or were hacked, they decided to develop a system that would cause their servers to fail periodically, forcing their employees to learn how to handle these failures without losing availability for their customers. They named this project Chaos Monkey, and its successor Chaos Gorilla.

Chaos Monkey
In 2010 Netflix had decided to move their system to the cloud, an environment in which hosts can be terminated and replaced at any time. To provide service at all times, Netflix needed to be sure that their infrastructure could handle servers going down without losing the ability to meet the demands of their customers. To address this they created Chaos Monkey, which pseudo-randomly terminates Netflix's servers. This makes it obvious to the company whether or not their systems are redundant enough to handle a few hosts going down at any time. Rather than planning around host failures and trying to avoid them, Netflix decided to create those uncertain conditions on purpose and make sure they could handle them.
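The idea of pseudo-randomly taking hosts away and then checking that enough capacity remains can be sketched as a small in-memory simulation. This is only an illustration and not Netflix's implementation: the function names, the fleet of "web-N" hosts, the per-host probability, and the minimum healthy count are all assumptions made up for this example, and nothing here touches a real cloud API.

```python
import random


def pick_victims(instances, probability, rng):
    """Return the hosts a single chaos run would take down.

    Each host is selected independently with the given probability.
    (Hypothetical helper; the selection policy is an assumption.)
    """
    return [name for name in instances if rng.random() < probability]


def service_survives(instances, victims, min_healthy):
    """Check whether enough hosts remain after the victims are removed."""
    healthy = [name for name in instances if name not in victims]
    return len(healthy) >= min_healthy


if __name__ == "__main__":
    rng = random.Random(2010)  # seeded so a run can be repeated
    fleet = [f"web-{n}" for n in range(6)]  # made-up host names
    victims = pick_victims(fleet, probability=0.3, rng=rng)
    print("terminated:", victims)
    print("still serving:", service_survives(fleet, victims, min_healthy=3))
```

In a sketch like this, the interesting result is not which hosts were picked but whether the final check ever comes back false, since that would point to a fleet without enough redundancy.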

Since the creation of Chaos Monkey, Netflix has gone further and created a series of tools for this type of testing called the Simian Army. Among these tools is a more advanced version of Chaos Monkey called Chaos Gorilla, which simulates the failure of an entire AWS availability zone. Netflix has used Chaos Gorilla successfully to verify that they can still provide their customers with service if an entire AWS availability zone goes down. As a result Netflix has great uptime, even when part of their cloud service provider (AWS) is unavailable. For example, Netflix was able to stay up and running despite AWS having issues with its Simple Storage Service (S3) in 2017.
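A zone-level failure test can be sketched in the same toy style: remove every host in one zone and check whether the remaining hosts still meet a minimum. This is an assumption-laden illustration, not how Chaos Gorilla is actually built; the placement map, zone names, and function names are hypothetical.

```python
def hosts_after_zone_loss(placement, lost_zone):
    """Return the hosts that remain when every host in lost_zone is gone.

    placement maps a host name to the zone it runs in (made-up data).
    """
    return [name for name, zone in placement.items() if zone != lost_zone]


def survives_any_zone_loss(placement, min_healthy):
    """Check that losing any single zone still leaves enough hosts."""
    zones = set(placement.values())
    return all(
        len(hosts_after_zone_loss(placement, zone)) >= min_healthy
        for zone in zones
    )


if __name__ == "__main__":
    placement = {
        "web-0": "zone-a",
        "web-1": "zone-a",
        "web-2": "zone-b",
        "web-3": "zone-c",
    }
    print("survivors without zone-a:", hosts_after_zone_loss(placement, "zone-a"))
    print("tolerates any zone loss:", survives_any_zone_loss(placement, min_healthy=2))
```

The design point this illustrates is that spreading hosts across zones only helps if the smallest set of survivors is still large enough to carry the load.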

Lessons Learned
Netflix's implementation of Chaos Monkey helped to build the credibility of a new engineering practice known as chaos engineering. Chaos engineering is defined as “the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.” It goes back to the idea that in order to know that something will work when you need it to, you have to test it regularly. Whether that’s the resilience of your cloud infrastructure, your cybersecurity (through bug bounty programs), or your business’s Disaster Recovery and Business Continuity contingency plans, it needs to be tested on a regular basis to ensure that it will work.

The best way to test your company's response would be to use automation similar to what Netflix has done. The test is built into the design of the system, rather than being something you have to organize every year or every six months. In 2012 Netflix had this to say about Chaos Monkey: “we have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient”.

This approach may not be suited to companies that offer more critical services, such as those in the healthcare industry or financial sector, where people need constant access and even slightly lower uptime in the short run can have very negative consequences. But in the long run an approach like this would give a company much higher uptime and much more reliable results in the event of an unexpected failure. I think this sort of approach needs to be built into the design of companies wherever possible, so that having resilient systems becomes a must rather than just a nice-to-have.



The beautiful image used in this article was created by Zaki Abdelmounim.