
Who Broke It?

Discover how a blameless engineering culture can enable innovation, risk-taking, and systems resilience.
4 min read | by Tameem Hourani | August 20, 2020

Do you remember the last time something blew up and you were in a room with other managers and directors, all awkwardly looking at each other, asking who broke it? I certainly do.

I remember it happening over and over again. I even remember the CIO once walking in, asking why the Internet WAFs weren't functioning correctly, and explicitly asking whose fault it was. His goal was to make a scene in front of around 30 or 40 engineers. You probably guessed that this was not during my time at Wayfair; it was at a job I had about 10 years ago. Thankfully no one got fired, but the scenario stuck with me, and I'll always remember the bad taste it left in everyone's mouth.

It was a very tough environment to be innovative in, because nobody wanted to be the reason a customer couldn't access an application. Moving to a culture that was truly blameless was something I'd never experienced before, and it really opened my eyes to what is possible. Everyone was focused on writing code that was creative and that they truly believed would have a positive impact on the project they were working on. These efforts would cause outages, they would take systems down, and sometimes they would even lead to our website going down. The magic of this culture came from the fact that these outages created opportunities for us to improve our systems.

If engineers ever deployed code that really did impact our systems, it was very easy for them to ask for help and get feedback on how to become a better engineer, as well as on how to make our platforms more resilient. I remember sitting in a meeting with one of the co-founders. We were talking about how it was going to take us about six months to be ready to expand our platform into a truly hybrid cloud, bursting into the public cloud when we had traffic overages. He listened attentively to the whole conversation. There were plenty of engineering leaders in the room, and when we were done discussing, he spoke up and asked a simple question… This question resonated with me and with a lot of other people in the room. It was the reason a lot of us were driven and motivated to work at Wayfair. It was a real indicator of the culture they had built, and of the willingness to innovate, to move fast, and to break things.

The question Steve asked at the time was very simple: "Why can't we get this done in 2 months?"

The real reason for the six-month timeline was not that we didn't have enough engineers, or that we needed that long to make sure it worked. It was that we were taking our time planning and preparing, to make sure that we avoided any outages or downtime when we actually made the change. The question Steve asked really painted a picture of a willingness to take risks, and to potentially break things when we made this move. It was openly discussed that rushing things into 2 months was likely to take our website down multiple times. The response to that was very positive, reassuring the team that we would make mistakes, and that this was the cost of innovating and remaining a leader in our industry. That's how we operate here.

Hearing this from someone who built the company we were all part of was very reassuring, and it reinforced the culture that had been built over the years. It's a culture of true blameless engineering, and one that drives the company's ability to innovate and take risks without always looking for someone to blame or point fingers at. The uniqueness of this culture really brought to light the importance of innovation and disruption.

Being able to take risks like this is not something that can be implemented in every organization. I would imagine that teams working in healthcare face a lot more scrutiny, as do teams with extremely low risk tolerances, such as those at major financial institutions. I wouldn't say that it's a good idea to break hospital systems for 5 minutes at a time to try to make them better, as it could end up costing someone's life!

What's important to keep in mind is that there is almost always going to be a way to innovate quickly and effectively by leveraging technology, as long as the culture supports it.

This is a great example of what a good culture at a company can do, even when you have thousands of engineers working together. Good leadership, with a clear motivation to support and back up its engineers, is a great way to build momentum, and that's something I learned from some of the most effective leaders I've had the opportunity to work with.

I'd love to hear how your culture is similar to or different from what I've experienced, and about any challenges you might be running into in achieving a truly blameless culture that supports innovation and disruption. As always, reach out to me via email or LinkedIn if you want to chat some more!

Written by
Tameem Hourani
Boston
I'm an engineer in Boston. I started my career as a Cisco guy and quickly took a liking to Tech-Ops. I've learned a ton over the past 5 years in the DevOps space, so I decided to start blogging about it!