The Boeing 737 MAX 8 disaster: a failure of engineering or a failure of psychological safety?
One of the case studies most often cited by proponents of psychological safety is the Boeing 737 MAX 8 disaster. The high profile of this debacle, and the fact that it was completely avoidable had the leaders listened to their engineers, made it the poster child for a lack of psychological safety and for all the negative repercussions of a culture of fear and silence.
You would be hard-pressed to find someone who has heard about psychological safety without also hearing how it relates to Boeing’s tragic failure.
But what happened? How did a culture of fear lead to such incalculable loss, above all the 346 souls who perished in the two crashes? And what does it say about the engineers who worked on this project?
“Boeing employees raised concerns about 737 MAX before crashes, documents show” (cnbc.com)
At the height of this crisis, many Boeing engineers came out publicly to say they had serious concerns about the plane, and that all along the development process red flags kept popping up, putting the plane’s viability into question. At the same time, they described a climate where speaking up was not possible, where they did not feel safe voicing any objection. They knew that even if they had said something, they would have been ignored. Some were quoted in the news saying they thought they would lose their jobs if they spoke up about problems.
The morality of someone weighing losing their job against other people losing their lives might sound awful, and it is probably unlikely that an engineer would be fired just for saying that there is a problem with what they are building. But a simple fact remains: when people have that perception, real or not, it leads to a fear-based management system. A system in which great engineers forget what made them great and simply try to deal with what they perceive as the imminent danger (their boss breathing down their necks). This becomes more visible when we look at the engineering practices that led to these planes being built, and subsequently crashing, and compare them with how other engineers all over the world typically work.
First, a short recap.
According to Arnold Barnett, an MIT professor of statistics and one of the nation’s leading experts on aviation safety, airplane passengers had a one in one million chance of dying from 1960 to 1969; from 2000 to 2007, that chance dropped to one in twenty million. Or as he rephrases it, “An American child about to board a U.S. aircraft is more likely to grow up to be President than to fail to reach her destination.”
Air travel has continually grown safer, to the point where, in 2012 and 2013, the world saw the fewest deaths since 1945, a year with only one percent of today’s air traffic. And the time between serious airline accidents has been steadily increasing for the last 30 years.
These simple statistics give us an idea of just how much of an outlier the Boeing failure is. In October 2018, a Boeing 737 MAX operated by the Indonesian airline Lion Air crashed into the sea, tragically claiming the lives of all 189 passengers and crew members on board. Most people thought it was an isolated incident, most likely the result of human error. But in March 2019, another Boeing 737 MAX, operated by Ethiopian Airlines, crashed, claiming another 157 lives.
With two separate crashes involving the 737 MAX occurring within the space of just a few months, the reality became painfully obvious: something was fundamentally wrong with the airplane’s design, and it was causing these tragedies.
By now, almost everyone knows about the Maneuvering Characteristics Augmentation System (MCAS), the system Boeing put in place to avoid having to recertify its planes with the Federal Aviation Administration (FAA), and the component responsible for putting these two planes into a nosedive.
A system so poorly conceived, designed, and implemented that no engineer worthy of the title would agree to its use in a plane that carries people, unless they felt they had no choice.
What got us here will get us there.
This huge increase in airplane safety (these two incidents aside) has been achieved by following two simple rules (simple, but not easy):
- Learn and adapt.
- Good engineering practices and culture.
These two simple principles guide all aspects of civil aviation, from plane design and manufacturing to flight crew management, and it is clear that they work when they are followed.
Learn and adapt
In the book Antifragile, Nassim Nicholas Taleb talks about how some things benefit from disruption, how they become better under pressure or after a shock. The human immune system is a good example: after each infection, it develops antibodies that let it identify and neutralize the same invader more easily next time.
He also specifically mentions how civil aviation improves its safety record, how each plane that crashes makes the next flight safer. In the aftermath of each disaster, a whole panel of experts gets together to analyze the situation, determine the root cause, and develop remediation actions so that similar failures do not happen again. This continuous inspection and adaptation has led to some of the greatest innovations and, by consequence, to the safety record we are seeing today in civil aviation.
Good engineering practice and culture.
Engineers are a special breed of people. I do not say that just because I am referred to as a software engineer, but also because I have had the chance to work closely with many engineers in my career. I used to be employed by an organization that specializes in offering job postings, training, and other services to engineers (genium360). As part of my job, I interacted with a lot of engineers, from all sectors of the economy and branches of engineering, from civil to electrical and mechanical, and I became fascinated by how they saw themselves and their contribution to society. In fact, since my arrival in Montreal, QC, in 2012, I have talked to or met a Canadian or American engineer daily, and they all seem to share this sense of pride in being the ones who build things, things that last, and who get the job done.
They do that by being lifelong learners, always trying to stay aware of the latest trends in their field, and by respecting two simple principles:
Simpler is better.
Since we know that both 737 MAX 8 crashes were caused by the malfunction of the Maneuvering Characteristics Augmentation System (MCAS), the issue is often labeled a software problem, a glitch in the system that needs to be addressed by designing and building a better version of this tool. And it is true that the device was poorly designed and implemented, as we will see in the next section. But before that, we need to state that the very idea of using such a device in the first place runs contrary to good engineering practice and culture.
When an engineer is faced with a problem, they often try to find the simplest solution: the one with the fewest components, the one that requires minimal input from the operator, or the one that can be mass-produced fastest and cheapest. The device they are building still must do its job, and do it perfectly according to the specified parameters. But the simpler their system, the more reliably they can get that job done.
Centuries of building things have taught us that more complex systems are more prone to failure. And it makes complete sense. If you have a tool composed of one part, then as long as that part is well built, you do not have to worry about anything else. But a tool made of ten parts can fail if any one of those parts breaks, or if any interaction between them does not go as it is supposed to.
You can see how the number of potential failure points grows rapidly with complexity: every part, plus every interaction between parts, is something that can go wrong, and a system can quickly become unwieldy. The best way to avoid all this is to prioritize simpler systems as much as possible, and this is one of the core tenets of the engineering mindset.
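To make the intuition concrete, here is a back-of-the-envelope sketch (assuming, purely for illustration, that parts fail independently and that the system needs every part to work, which is rarely exactly true in practice):

```python
# A minimal sketch: if each independent part works with probability p,
# a system that needs all n parts to work becomes unreliable surprisingly fast.
def chance_system_works(p: float, n: int) -> float:
    """Probability that all n independent parts work."""
    return p ** n

for n in (1, 10, 100):
    print(f"{n:>3} parts at 99.9% each -> {chance_system_works(0.999, n):.1%} overall")
# Output:
#   1 parts at 99.9% each -> 99.9% overall
#  10 parts at 99.9% each -> 99.0% overall
# 100 parts at 99.9% each -> 90.5% overall
```

Even with excellent individual parts, overall reliability erodes as the part count climbs, which is exactly why engineers reach for the simplest design that does the job.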
The famous MCAS
The purpose of MCAS is to automatically regulate the angle of attack (AOA), roughly the angle between the wing and the oncoming air, or in other words how far up the nose is pointing relative to the airflow. If the AOA is too high, the airplane risks stalling. So MCAS automatically senses when this is about to happen and pushes the plane’s nose down.
The problem on both flights was that MCAS falsely concluded the AOA was too high when it was in fact normal and, to avoid any complication, forced the plane to dip its nose downward. The pilots tried to switch MCAS off and control the plane by hand, but it would almost immediately re-engage and continue to force the plane into a nosedive.
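To see why this behavior is so dangerous, here is a deliberately simplified, purely illustrative sketch of a control loop that trusts a single sensor and re-evaluates it every cycle. This is not Boeing’s code; every name, threshold, and value below is invented.

```python
# Purely illustrative: a control loop that trusts one sensor and ignores
# what the pilot just did. None of these names or numbers come from Boeing.
AOA_LIMIT_DEG = 15.0  # hypothetical "too high" threshold

def naive_trim_step(single_aoa_sensor_deg: float, pilot_just_countered: bool) -> str:
    """One control cycle; note that the pilot's input is deliberately ignored."""
    if single_aoa_sensor_deg > AOA_LIMIT_DEG:
        # A stuck or damaged sensor keeps this branch firing on every cycle,
        # even right after the pilot has pulled the nose back up.
        return "command nose-down trim"
    return "do nothing"

# A faulty sensor stuck at 22 degrees: the loop pushes the nose down
# again and again, no matter what the pilot does in between.
for cycle in range(3):
    print(cycle, naive_trim_step(single_aoa_sensor_deg=22.0, pilot_just_countered=True))
```

With a single bad input and no cross-check, the automation and the crew end up fighting each other, which is the failure mode described above.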
Why did we need to constantly regulate the AOA?
A well-designed plane flying without any input from the pilot will return to a stable attitude on its own. This is important for safety because, even if you do not feel it when you are traveling, a plane is constantly being knocked out of straight and level flight by turbulence, and it is not simply hard but impossible to ask the pilot to keep an eye on every one of these movements and readjust.
Everything in a modern airliner, from the power, weight, and position of the jet engines, to the wing and wingtip geometry, to the size and shape of the tail, has been designed for maximum efficiency without sacrificing safety. This means that if something would make the plane burn less fuel but also make it less stable, it should be scrapped in favor of some other approach that ensures stability.
This last part, the tail, or the horizontal stabilizer to be exact, is important in our case because it is responsible for providing a balancing force between the center of gravity and the center of lift.
Or in plain language, it makes sure the plane stays level by correcting for any variation in the angle of attack. The remarkable thing is that a horizontal stabilizer provides passive pitch stability, which means it requires zero input from the pilot, and it does so without any fancy software or moving parts. It is just a huge piece of metal, attached to the plane’s tail, set at a negative angle of attack.
So if a gust pitches the nose up, for example, the stabilizer’s negative angle of attack becomes less pronounced, its downforce decreases, and the weight of the nose pulls the plane back toward level; the reverse happens if the plane pitches down.
As we saw earlier, there is little that can go wrong with a static surface like this. As long as the airline regularly performs proper maintenance on the tail surfaces, a failure of the horizontal stabilizer is essentially impossible.
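For readers who like to see the feedback written out, here is a toy model of the idea (all constants are invented, and this is a caricature, not an aerodynamic simulation): treat the stabilizer as producing a restoring moment proportional to how far the nose has been knocked away from its trimmed attitude, plus some natural damping. Any disturbance then dies out on its own, with no sensors or software involved.

```python
# Toy model of passive pitch stability: a pitch disturbance decays on its own
# because the restoring moment always opposes the deviation.
# All constants are invented for illustration; this is not an aerodynamic model.
RESTORING = 2.0   # restoring moment per degree of pitch deviation
DAMPING = 1.0     # damping moment per (degree/second) of pitch rate
DT = 0.05         # integration time step in seconds

pitch_deg = 5.0   # a gust just knocked the nose 5 degrees up
pitch_rate = 0.0  # degrees per second

for _ in range(200):  # simulate roughly 10 seconds
    pitch_accel = -RESTORING * pitch_deg - DAMPING * pitch_rate
    pitch_rate += pitch_accel * DT
    pitch_deg += pitch_rate * DT

print(f"pitch after 10 seconds: {pitch_deg:+.3f} degrees")  # close to zero: back to level
```

Nothing in that loop can “fail” in flight: the restoring behavior is baked into the geometry itself, which is exactly the appeal of passive stability.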
On the other hand, a system like MCAS uses what is called active stability. This means that the system is constantly adjusting the control surfaces to keep the plane stable. This has numerous points of failure: the control surfaces with their various moving parts could fail, the sensors feeding the computer with data could fail, or the software itself could fail.
Boeing could not rely on the tried-and-true passive stability provided by the horizontal stabilizer, because the 737 MAX 8 had bigger engines than its predecessor, and they could not be mounted in the same place without major changes to the landing gear. This new engine size and position led to a major alteration in the flight characteristics of the plane, rendering the horizontal stabilizer unable to keep the plane level on its own. Airplane manufacturers usually fix this issue by redesigning the airframe to regain passive stability, but, as you know by now, Boeing decided against this because it would trigger the need for FAA recertification.
That decision forced their engineers to go against their training, experience, and instinct, and to break one of their cardinal rules, by opting for a complex system when a simpler, more reliable solution was possible.
Redundancy, redundancy, redundancy.
One of the words uttered most often in the book The Martian, by Andy Weir, is redundancy. Redundancy is one of the main plot devices in that story. Without NASA’s insistence on having backups for backups, our favorite Mars botanist, Mark Watney, would have been dead a few days after he was left alone on the red planet.
This idea of having secondary parts in a system, to double-check the primaries’ work or to kick in when they fail, is crucial to any high-performing machine. Anyone who has ever built a device can tell you that components often do not operate as expected, or stop working completely.
This focus on redundancy is not limited to the high-stakes world of space exploration. Engineers all over the world understand that their system’s resilience and accuracy depend on avoiding any single point of failure.
So it is mind-boggling that when Boeing installed MCAS to automatically correct the stability issue, the system required only a single sensor to report a high angle of attack for MCAS to activate and have the plane’s computer take over and pitch the nose down.
Something like this would not even be acceptable for a delivery drone today, let alone a multi-million-dollar plane with people on board. For the engineers at Boeing to go ahead with the design and implementation of a system like this, it is clear that they did not feel they had a choice.
Engineers expect that sensors and other components will break at some point. This is why it is standard engineering practice, when building a system like MCAS, not to rely on a single sensor but to use an array of sensors, either as part of the same assembly or, better yet, spread out in separate places to minimize the risk of all of them being obstructed at the same time. The system that acts on the data provided by these devices should take readings from multiple sources, compare them, and act accordingly: follow the normal procedure if the sensors agree, or defer to the operator if there is any inconsistency, as in the sketch below.
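Here is a minimal sketch of that cross-checking idea (illustrative only; the sensor values, tolerance, and action strings are invented, not drawn from any real avionics system):

```python
# Cross-check several sensors before letting automation act on their data.
# Illustrative only: names, thresholds, and actions are all invented.
from statistics import median

DISAGREEMENT_TOLERANCE_DEG = 5.0   # hypothetical allowed spread between sensors
AOA_LIMIT_DEG = 15.0               # hypothetical "too high" threshold

def decide_action(aoa_readings_deg: list[float]) -> str:
    """Compare multiple AOA readings and only act when they agree."""
    spread = max(aoa_readings_deg) - min(aoa_readings_deg)
    if spread > DISAGREEMENT_TOLERANCE_DEG:
        # The sensors disagree: do not let automation act on suspect data.
        return "alert crew and disable automatic trim"
    if median(aoa_readings_deg) > AOA_LIMIT_DEG:
        return "command nose-down trim"
    return "do nothing"

print(decide_action([22.0, 4.5, 5.0]))  # one stuck sensor -> defer to the crew
print(decide_action([4.8, 4.5, 5.0]))   # sensors agree, AOA normal -> do nothing
```

The point is not the specific numbers but the shape of the logic: no single reading, and no single component, gets to put the aircraft into a dive on its own.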
If someone like me, whose biggest build is a Raspberry Pi-powered self-driving car, knows this, one can imagine that the engineers at Boeing did too. All the evidence suggests they did. After all, as we mentioned, this practice is so widespread in the engineering world that it might as well be a rule.
Like the first rule we pointed to earlier, it takes a lot for an engineer to choose to break it: it takes a climate in which one is so paralyzed by fear that they cannot stay true to themselves.
Conclusion.
Psychological safety is the belief that one can take risks and speak one’s mind, that one should not be afraid of asking questions, raising concerns, or offering ideas.
This kind of culture fosters innovation and enables learning. But not just that: at the same time, it serves as a safeguard that stops fear from creeping into our workplaces. Because the reality is, fear is not something we can get rid of once and for all. If we are not constantly working to stop it at the door, it will take hold in our offices and factories, and once it does, it will make our people act against their grain by making them believe they do not have a choice.
In the context of the engineers at Boeing, this meant not being true to some of the principles that define the way they, and millions of engineers around the world, work. Principles like:
- Constantly learning from other people’s experiences to avoid making the same mistakes
- Avoiding complex systems when simpler solutions will do the job better
- Building resiliency by incorporating redundancy
These rules, and others like them, are not just guidelines one can use to become a better engineer; they are a core component of what makes one. Going against them is not a sign of being a bad engineer so much as a sign of working within a bad engineering culture, a culture like the one we saw at Boeing as this scandal unraveled: a way of working marked by command, control, and, more than anything else, fear.
Psychological safety aims to make this fear visible and to work to eliminate it, by creating an environment where failure is not punished and where speaking up is more than welcomed, it is encouraged, even when what you want to say is unpopular or you are not 100% sure it is correct.
A culture where engineers can stay true to what they have learned and do what they know is right.