International DevOps Certification Academy™
How Should You Enable Your DevOps Continuous Learning?

From your software projects you already know that even with checklists, peer reviews, controls, audits and compliance mechanisms, you still have problems. This is inevitable. It is time for your DevOps team and organization to build a culture of self-diagnosis, self-learning and self-improvement. Such a culture accepts problems, and its teams are ready when problems occur. Solving problems is not an exceptional state of work. It must be part of your daily work, so that it contributes to the continuous learning and improvement journey of your organization. And you multiply the effect of every solution by making it transparent, available and easily accessible within your entire DevOps organization.

One of the prominent DevOps organizations, Netflix, has built in-house software (Chaos Monkey) to simulate catastrophic events in its cloud-based infrastructure. Chaos Monkey randomly terminates servers in production systems, so that the Netflix team can build additional assurance of its operational ability to deliver resilience, stability and uninterrupted service quality to its clients. From each failure the team learns new lessons, and it applies these lessons to make its systems even more stable and resilient.
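The idea of random failure injection can be sketched in a few lines. What follows is a toy in-memory simulation, not Netflix's actual Chaos Monkey: the `Fleet` class, the `run_experiment` function and the `min_healthy` threshold are hypothetical names chosen for illustration only.

```python
import random


class Fleet:
    """A toy, in-memory stand-in for a pool of production servers (hypothetical)."""

    def __init__(self, names, min_healthy):
        self.healthy = set(names)
        self.min_healthy = min_healthy  # assumed capacity needed to serve clients

    def terminate_random(self, rng):
        """Destroy one randomly chosen healthy server and return its name."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.remove(victim)
        return victim

    def can_serve(self):
        """True while enough servers survive to keep the service up."""
        return len(self.healthy) >= self.min_healthy


def run_experiment(fleet, kills, seed=0):
    """Terminate `kills` servers and record whether service survived each one."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    lessons = []
    for _ in range(kills):
        victim = fleet.terminate_random(rng)
        lessons.append((victim, fleet.can_serve()))
    return lessons


fleet = Fleet(["web-1", "web-2", "web-3", "web-4"], min_healthy=2)
for victim, survived in run_experiment(fleet, kills=3):
    print(f"terminated {victim}: service {'OK' if survived else 'DOWN'}")
```

In this sketch the third termination drops the fleet below the assumed minimum, which is exactly the kind of finding such an experiment is meant to surface before a real outage does.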

Understanding the Importance of Building a Learning Culture

If there is finger-pointing after incidents in your organization, it creates a culture of fear among engineers. Your organization then becomes slow, bureaucratic and politically slippery. Instead of consciously learning from errors, becoming organically more resilient against them and being more mindful and careful to prevent them, everyone in such an organization cares about self-protection. Work, problems and even solutions are never fully transparent.

Because problems are inevitable in complex systems, your organization should not point fingers at, blame or shame the people who cause them. Instead, it should value actions that make problems visible in your daily work. It should encourage organizational learning from errors and inefficiencies, so that everyone in your DevOps organization can learn and profit from these problems, solutions and knowledge. When engineers in your DevOps organization feel safe giving details about mistakes, they voluntarily go the extra mile and spend a lot of energy to make sure that a similar problem does not happen again in their own work center or in other work centers in your organizational value stream. If engineers are punished when they make mistakes, or even just feel that they are punished, they become afraid of making mistakes, so:

  1. They produce less work in order to make fewer mistakes.
  2. They are not transparent about work, problems and solutions.
  3. They are not incentivized to convert solutions of problems into organizational learnings.
  4. They are bound to see the same or a very similar problem happen again, because nobody spends time and energy to learn, share and teach about problems and solutions and to make them visible.

Run Post-Mortems As Soon As Incidents Happen, Before Memories About Problem Causes Fade

The goals of a post-mortem review are very simple:

  • To identify the things you did right, so that you can remember to try them again in similar situations.
  • To note the things that should have been done differently, so that you can refine your techniques in the future.
  • To note the things that you did wrong, and to suggest alternative approaches or safety measures that you should employ the next time you face a similar problem.
  • To find out why it made sense at the time to take (or not to take) the action that resulted in the incident.

Exploring what you did wrong is frightening, and in some organizations it is dangerous. If admitting mistakes opens you to criticism or discipline, you are unlikely to make such admissions. This strategy is ultimately self-defeating, since failing to understand a past mistake usually condemns you to repeating it in the future. Organizations that are serious about improvement understand this, and take the trouble to create a process and culture in which it is safe to explore mistakes.

When you enter into a post-mortem review process, you must accept a few basic premises:

  • Everybody tries to do their best, as best they understand it.
  • You make your decisions in stressful situations, with imperfect information.
  • You are often called upon to carry out tasks for which you have not been trained, with whatever tools and resources happened to be at hand.
  • Mistakes are inevitable in such situations.
  • The goal of this process is not to find fault with any individual or their actions. Rather it is to look at what happened and see what lessons you can learn from it.
  • The output of this process will not be an assessment of any person or group of people, but rather an assessment of your processes, and how they can be improved.

It is absolutely essential that everyone involved completely accepts this "no blame, we are here to learn" model. Many organizations go to great trouble to create such safe environments. The FAA, for instance, has an Aviation Safety Reporting System, whereby pilots who make "mistakes" can gain immunity from regulatory discipline if they report those incidents.

Post-mortem reviews must always define actionable measures to prevent the incident from happening again. Examples of such preventive measures are new telemetry metrics, new automated test cases, identifying the types of changes that require additional code reviews, refactoring code, and decoupling complex system components that cause frequent problems.
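One way to keep a review honest about this is to record it in a structure that makes missing measures obvious. The following is a minimal sketch under stated assumptions: the `PostMortem` and `ActionItem` classes, their field names and the sample incident are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    """One preventive measure agreed in the review (hypothetical structure)."""
    description: str
    kind: str  # e.g. "telemetry", "automated test", "review rule", "refactoring"
    done: bool = False


@dataclass
class PostMortem:
    """A blameless review record: it describes events and processes, not people."""
    incident: str
    went_well: list = field(default_factory=list)
    to_improve: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def is_actionable(self):
        """A review without at least one preventive measure is incomplete."""
        return len(self.actions) > 0

    def open_actions(self):
        """Measures that still have to be implemented."""
        return [a for a in self.actions if not a.done]


review = PostMortem(
    incident="Checkout API outage",  # invented example incident
    went_well=["Failover to the standby region worked"],
    to_improve=["Alert fired 20 minutes after the first errors"],
    actions=[
        ActionItem("Add error-rate alert on checkout API", "telemetry"),
        ActionItem("Add regression test for the failing input", "automated test", done=True),
    ],
)

print(review.is_actionable())
for item in review.open_actions():
    print(f"open: [{item.kind}] {item.description}")
```

Note that the record has no field for a responsible person or a culprit. That omission is deliberate in this sketch, in keeping with the premise that the review assesses processes rather than people.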

Publish post-mortem review protocols and lessons learnt widely in your organization. This helps you convert local learnings from one work center in your value stream into organization-wide, global learnings. It also sends a clear message that your DevOps organization nurtures transparency, openness and a learning culture.

Organize Game Days To Improve Your Systems

A game day is not one of your typical boring team events where extraverts enjoy the show and introverts play with their mobile phones to speed up the flow of time.

On a game day, catastrophic failures are simulated in your test systems, and your DevOps teams work on fixing and learning from these failures.

For instance, a critical server is terminated to validate that the failover mechanism operates successfully without service interruptions. Then your DevOps team validates whether and how your mechanism for recovering from backups or from your Infrastructure as Code (IaC) works. Identifying problems in these failure scenarios helps your DevOps team build resilient, fault-tolerant systems and create learnings.
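The scenario just described can be rehearsed as a small script. This is a simplified sketch that assumes a single primary server with one standby. The `Service` class, its methods and the `game_day` function are hypothetical, and `restore` merely stands in for a real recovery from backups or IaC.

```python
class Service:
    """Toy service with a primary and a standby server (hypothetical names)."""

    def __init__(self):
        self.servers = {"primary": True, "standby": True}  # name -> is running
        self.active = "primary"

    def terminate(self, name):
        """Simulate the game-day failure: the named server stops running."""
        self.servers[name] = False

    def restore(self, name):
        """Stand-in for rebuilding a server from backups or IaC."""
        self.servers[name] = True

    def handle_request(self):
        """Serve from the active server, failing over if it is down."""
        if not self.servers[self.active]:
            running = [n for n, up in self.servers.items() if up]
            if not running:
                raise RuntimeError("no server available")
            self.active = running[0]  # failover to the first running server
        return f"served by {self.active}"


def game_day(service):
    """Run the scenario and return a list of (check, passed) findings."""
    findings = []
    service.terminate("primary")
    try:
        reply = service.handle_request()
        findings.append(("failover without interruption", reply == "served by standby"))
    except RuntimeError:
        findings.append(("failover without interruption", False))
    service.restore("primary")
    findings.append(("primary recovered", service.servers["primary"]))
    return findings


for check, passed in game_day(Service()):
    print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

Returning findings as data rather than stopping at the first failed check keeps every observation from the exercise, so the failed checks can feed directly into the learnings the team records afterwards.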

While solving these problems, your DevOps team builds relationships with other departments as they rehearse failure events together under non-stress conditions. You get a visible chance to test and improve communication and troubleshooting processes within your larger global organization.

Furthermore, you will be able to observe weak signals of potentially larger issues that may reveal themselves in the future. Low-priority incidents that happen frequently during these failure scenarios, or a small side effect that came close to crashing another critical component in your architecture, are important weak signals that you should take into account and work on to improve your systems.


Encourage calculated risk-taking in your DevOps team. High-performing DevOps organizations like yours make errors more often. This is not only OK, it is also what your organization needs in order to learn and perform better.

Compared with typical organizations, high performers nevertheless have 80% fewer critical failures in their production systems. In other words, they have only one fifth as many incidents that impact their clients. This is why the engineers in your DevOps organization need to feel free to make errors and learn from them.
