Ep 2: Failure Modes – The Currency of Reliability
April 13, 2022 / April 19, 2022
In this episode, we talk about what a Failure Mode is and why Failure Modes are so important to equipment Reliability. As responsible custodians, it’s up to us to identify the plausible Failure Modes that could occur so that we can figure out if and how we should manage each one. If we don’t, it can end up in disaster.
Hi Everyone. And welcome to The Heart of Reliability, where we break Reliability down to the basics. Because after all, anything without a firm foundation will eventually crumble.
Welcome to episode two, where today we are going to talk about one of the most important topics when it comes to Reliability. Today, we’re going to talk about Failure Modes.
Now, before I go any further, let’s first define what I mean by a Failure Mode so we’re all on the same page. A Failure Mode is what specifically causes Functional Failure. A lot of people use the term Failure Cause. I use the term Failure Mode. So today we’re going to be using those two terms synonymously. So just know that when I use the term Failure Mode, I’m talking about something that specifically causes a Functional Failure.
Now here’s why the topic of Failure Modes is so very important. It’s vital. My mentor, John Moubray, here’s what he said about Failure Modes. He said that Failure Modes are the currency of Physical Asset Management. In other words, it’s the currency of Reliability. And you know, just like in our personal lives, if we don’t keep tabs on our money, that can end up in disaster.
I’m going to give you one example, one personal example. So, I don’t know how long ago, but I had a subscription to McAfee. It’s an antivirus software for computers. And I had a subscription on an old laptop. But I recently changed my laptop and I don’t use that old one anymore.
Well, I didn’t realize it, but it was on auto renew. So, I saw the fee on my American Express bill and I called right away and said, I don’t have that machine anymore. I don’t need this. And the representative on the phone said, yeah, sure. Okay. Yep. All right.
I got the charge reversed. Well, I’m going to be honest with you. I didn’t follow up with it. You know, life gets busy. And it should have been a priority because it’s my money, right? It’s a very important resource in life. We need money to live and to get the things we need and to get the things that we want.
But I got busy, and I wasn’t paying attention. And then just a few days ago, I was going through some statements and I realized that I was never credited the money. So now this is three months. I let three months go by.
Now, we’re only talking about maybe 139 US dollars, which is not a huge amount of money, but it’s all relative, right? It’s a significant amount of money. So, I called them up and, you know, they said that the representative had stopped the auto-renew but didn’t initiate the credit on my account. So anyway, it took some time and it was a little bit annoying, but I got that sorted out.
But the point is that I wasn’t paying attention to that money. And if I hadn’t gone back on it, it would have been lost. It’s very important that we pay attention to the currency of our equipment. And the currency of our equipment is Failure Modes. John Moubray used to say that we manage physical assets at the Failure Mode level.
And that’s why it’s so important. I like to liken it to money like he did because it’s easy to get your mind wrapped around it. Right?
So, what I have here in my hands is some American dollars. If you’re listening and not watching, I’m holding up a bunch of money. So, I have two hundred-dollar bills. I’ve got a few twenties. I’ve got a couple of tens. I’ve got a few fives. And I’ve got a bunch of ones. So, I have way more ones than I do fives, tens, twenties or hundreds. Now why is that? Well, obviously that’s because this hundred-dollar bill is worth a hundred times more than this one-dollar bill. Right?
Well, the same thing goes when it comes to Failure Modes for our equipment. Think of the Failure Modes that, if they were to occur, could cause a failure if we let them happen and don’t manage them, the Failure Modes that would cause certain disaster if we ignored them. Well, there generally are far fewer of those than just nuisance Failure Modes. Right?
So, for example, if I lost these, I don’t know how many are here, about 10 one-dollar bills. I don’t know. Let’s just say I’m walking around the supermarket and I get distracted and I drop them. It’s not a huge deal because it’s only $10. But if I’m walking around with these two hundred-dollar bills in my hand as I shop and I get distracted and I drop them, well, that matters way more. And the same thing goes for our physical assets.
Now money is easy because you can see it and you know what you’ve got in your bank account. And, you know, it’s much easier to know if you’ve lost any. But when it comes to our machines, if we don’t proactively identify Failure Modes, there are many of them that we may not even realize exist.
And if you think that’s not true, you just turn on the news or read the newspaper or surf a news site on the internet. And you will see disaster after disaster after disaster that happened because of a specific cause, AKA a Failure Mode. So as responsible custodians, it’s up to us to identify all plausible Failure Modes that could occur so that we can figure out if and how we should manage each one.
Now in episode 1, we talked about what Reliability was, right? And we said that what Reliability means is getting what we need from our equipment.
Good Reliability means getting what we need from our equipment, but the Failure Modes are what would stop us from getting that Reliability. And that’s why it’s so important to identify them.
Okay. So let’s now get a little bit more technical about Failure Modes and about the importance of identifying them. You know, two of the mistakes that I’ve often seen people make are that they identify Failure Modes either at way too detailed a level or at way too high a level. Now, what do I mean by that? Okay. Let’s just take, for example, let’s say we need to do an analysis on a diesel engine, right? Well, one of the steps in that process would be to identify what Failure Modes that engine is subject to.
So, I could just say engine fails. That’s a Failure Mode, right? A Failure Mode consists of a noun and verb. Well, engine fails is definitely a Failure Mode, but is it written in enough detail for us to do anything with it? I mean, obviously not, right? It’s way too vague.
I can’t make any decisions based upon the level of detail that I’ve identified. But if I say oil deteriorates due to normal use or air filter clogs due to normal use, I can take those and analyze them and figure out if or what I’m going to do to manage them. So, I know that I’m going to have to change my oil on a specified frequency or maybe do oil analysis. For the air filter, I’m going to change the air filter on a scheduled basis. But the point is that it has to be written in enough level of detail to do something about it.
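To make the level-of-detail point concrete, here is a minimal sketch in Python. It is not part of the episode: the `FailureMode` class and the `is_actionable` rule are hypothetical, and the rule assumes (as a simplification of the examples above) that a statement needs a stated cause before we can decide how to manage it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical record: a noun, a verb, and an optional cause."""
    noun: str                    # the item that fails, e.g. "Oil"
    verb: str                    # what happens to it, e.g. "deteriorates"
    cause: Optional[str] = None  # e.g. "normal use"; None means no cause stated

    def text(self) -> str:
        base = f"{self.noun} {self.verb}"
        return f"{base} due to {self.cause}" if self.cause else base

    def is_actionable(self) -> bool:
        # Assumption for this sketch only: without a cause, the statement
        # is too vague to pick a way of managing it.
        return self.cause is not None


# The three statements used in the diesel engine example.
modes = [
    FailureMode("Engine", "fails"),
    FailureMode("Oil", "deteriorates", "normal use"),
    FailureMode("Air filter", "clogs", "normal use"),
]

for mode in modes:
    label = "actionable" if mode.is_actionable() else "too vague"
    print(f"{mode.text()} -> {label}")
```

In this sketch only "Engine fails" is flagged as too vague, which matches the point that it gives us nothing to act on.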
Let’s just take a simple everyday example so I can illustrate what I mean by writing Failure Modes in the right amount of detail, but also identifying plausible ones. So recently my computer mouse, my laptop mouse, which if you’re just listening, I’m holding up a Microsoft Bluetooth mouse, but just recently I went to use it and it wouldn’t work. So the first thing I do, is look on the bottom of the mouse and there’s a little button with a little Bluetooth icon.
I pushed that to make sure that Bluetooth was on. And of course, that didn’t work. And so then what I did is I removed the battery, the battery compartment door, and this is what I found. If you’re just listening and you’re not watching, I’ve got a small Ziploc bag and I’ve got a nasty AAA battery that you can see. Because of non-use, the battery discharged.
So not only is the battery bad, but it actually damaged the connection and the mouse. So it doesn’t work anymore. The mouse doesn’t work at all anymore. But what that did is that it alerted me to a Failure Mode that I had not considered, and I do this for a living, right? So, this is a living example of what I’m talking about.
That as Responsible Custodians, in whatever capacity you’re working, whether you’re a CEO or you’re a middle manager, or you’re an operator, or you’re a maintainer, you’re an electrician, whatever it is that you do, we all get busy, right? And it’s very easy to overlook something as I did. I did not identify that one of the Failure Modes was that the battery can discharge due to non-use.
The Failure Effect is that if that happens, it can actually damage the connection which renders the mouse failed. So, when it comes to my mouse, there are several different Failure Modes that I now need to manage. I know I need to manage battery discharges due to non-use. This happened during a time when I wasn’t using my laptop very much.
I was using my desktop primarily, and that’s why it happened, but that’s one plausible Failure Mode. So that brought my attention to the fact that there is a Failure Mode I overlooked that the battery can discharge due to non-use. Now the way I’ve learned to manage that is, I remove the batteries if I know I’m not going to be using it on a daily basis.
So, another Failure Mode identified is that the batteries discharge due to normal use. The more I use my mouse, eventually the batteries are going to be drained, right? And then another Failure Mode is that the mouse could just pack up and fail, right? That is a very high-level Failure Mode. So, one of the reasons why it’s so important to identify Failure Modes is definitely so that we can figure out how to manage them.
Because when we identify the Failure Mode, then we can figure out what happens if that Failure Mode were to occur. Well, if my battery discharges due to non-use, it can destroy the connection in the mouse itself. And then I will have to replace the mouse. If my batteries discharge just due to normal use, well, that will render my mouse failed. And then the mouse could just pack up and fail, right? But why this puts you in a position of strength is that it sets you up so that you can then go figure out if and how you’re going to manage each one.
Now, in the simple example of my mouse, what I now do is I remove the batteries if I know I’m not going to be using my mouse for an extended period of time. I also carry two extra AAA batteries in my laptop case. So, because I know what’s going to happen when the batteries finally run out or discharge, I will have them as a backup.
Now, the other Failure Mode, that the mouse could just pack up and fail internally for whatever reason. Well, that definitely could happen. I don’t carry around a spare mouse because I have the mouse pad that’s actually on the laptop. But that is because the consequence of that Failure Mode is not really that big of a deal. If it were, I’m in a position of strength to do something about it.
Maybe I might want to carry around a spare mouse with me, right? So now I know I’ve gone on and on about a mouse, which is not really a big deal, but at The Heart of Reliability, what we do here is we get down to the basics. We make it so that we can understand things on a very basic level.
So, we can then translate that knowledge to the more technically advanced stuff that we have to deal with on a day-to-day basis. Right?
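The mouse example can be written down as a small table of Failure Mode, Failure Effect, and what is done about it. The sketch below is only an illustration: the `MouseFailureMode` class and its field names are hypothetical, and the wording of each row is a paraphrase of what was said above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouseFailureMode:
    """Hypothetical row: one Failure Mode, its effect, and how it is managed."""
    failure_mode: str
    failure_effect: str
    strategy: str


MOUSE_FAILURE_MODES = [
    MouseFailureMode(
        "Battery discharges due to non-use",
        "Can damage the connection; the mouse has to be replaced",
        "Remove the batteries when the mouse will sit unused for a while",
    ),
    MouseFailureMode(
        "Batteries discharge due to normal use",
        "Mouse stops working",
        "Carry two spare AAA batteries in the laptop case",
    ),
    MouseFailureMode(
        "Mouse packs up and fails",
        "Mouse stops working; the pad on the laptop can be used instead",
        "No spare mouse carried, because the consequence is minor",
    ),
]

for row in MOUSE_FAILURE_MODES:
    print(row.failure_mode)
    print("  effect:  ", row.failure_effect)
    print("  strategy:", row.strategy)
```

Note that the third row is still a decision: choosing not to carry a spare is only possible because the Failure Mode was identified and its consequence was judged to be small.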
So, there you have it. We’ve identified three different Failure Modes for my mouse, one of which I had never considered, and I lost a mouse because of it. Now let’s talk a little bit about real equipment for a moment and why Failure Modes – why it’s so important that we identify them.
Let’s just talk about a filter, right? So, we talked about how it’s important to identify Failure Modes. But we also have to identify them at the right level and with the right amount of detail. So, if I say filter clogs, well, I don’t really know what to do with that, right? Because it’s not written in enough detail to lead me to where I could use my brain and figure out a solution for it or a Failure Management Strategy for it. But if I break it down, let’s break the filter down into three things like we did the mouse. We could say: (1) Filter clogs due to normal use. We could say (2) Filter media deteriorates due to normal use. And we could say (3) Filter damaged due to improper installation.
So, number one, it’s vital that we identify the Failure Modes so we can figure out how we’re going to manage them. But number two, they have to be written at the right level so we can do so. So, with the first example, filter clogs due to normal use, well, I could change the filter on a regular basis, or I could do Condition-Based Maintenance where I could maybe inspect the differential pressure gauge every so often. And that would give me an indication that the filter is in the process of clogging. With filter media deteriorates, maybe that’s an age-related failure. For example, maybe I’ve got a fuel filter that has a paper element in it, and that will deteriorate over time. So, I have to replace the filter every so often to manage that Failure Mode. Another one is fuel filter is installed improperly, right? And that’s what causes the damage. So that allows me to then sit back and say, if that’s a plausible Failure Mode in my operating environment, then that means I probably have to train someone better.
Maybe the steps in replacing that filter, maybe they’re vague in a tech manual or in the maintenance instruction, right. It allows me to bring the issue to the light. That’s one of the biggest things about identifying Failure Modes is that it allows you to bring all of those things to light that could stop you from getting the Reliability that you need.
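The filter example can be sketched the same way. This is an illustration under stated assumptions, not a prescribed method: `FILTER_STRATEGIES` simply lists the options mentioned above, and `clogging_indicated` is a hypothetical check whose readings and alert level are made-up numbers. It relies only on the fact that the pressure drop across a filter rises as it clogs.

```python
# Options mentioned in the episode for each filter Failure Mode.
FILTER_STRATEGIES = {
    "Filter clogs due to normal use": [
        "Change the filter on a regular basis",
        "Condition-Based Maintenance: inspect the differential pressure gauge",
    ],
    "Filter media deteriorates due to normal use": [
        "Replace the filter every so often",
    ],
    "Filter damaged due to improper installation": [
        "Train the person installing it better",
        "Clarify the steps in the tech manual or maintenance instruction",
    ],
}


def clogging_indicated(differential_pressure: float, alert_level: float) -> bool:
    """Hypothetical check: a reading at or above the alert level suggests clogging."""
    return differential_pressure >= alert_level


for failure_mode, options in FILTER_STRATEGIES.items():
    print(failure_mode)
    for option in options:
        print("  -", option)

# Made-up reading and alert level, in whatever units the gauge shows.
print("Clogging indicated:", clogging_indicated(12.0, alert_level=10.0))
```

The actual alert level would have to come from the filter manufacturer or from operating experience; the number here only shows the shape of the check.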
But the key is to do it proactively. One of the reasons why so many organizations are in Reactive Mode or what we often hear as firefighting mode is because Failure Modes, to a large degree, were not identified proactively. So, you know, when I talk to a lot of people whose organizations are in firefighting mode, it’s not just a negative thing on Reliability.
Sure. I mean, you know, it causes chronic downtime, increased costs, scrap product, you know, firefighting mode can get us into, you know, a lot of different negative scenarios with respect to Reliability and with respect to machines.
But one thing that I think is not talked about enough is what that is doing to the people in our organizations, right? I mean, when you’re running around in firefighting mode, you know, people are annoyed. They’re frustrated, you know, maybe they’re angry. People have other people, you know, barking at them. Why is this machine down? You’ve got to get this machine up. We’re losing money by the minute. And it creates this, I’ll use the word culture, right?
It creates a culture of like negativity and anxiety. And, you know, nobody likes to be like that. I mean, isn’t it nice when you, the night before, like you’re getting ready for work. And the night before you get your clothes ready, you know, maybe you pack some food for your lunch. You get, you know, you write your to-do list the day before for what you have to do. So, when you start your work day, you’re ready as opposed to it being Sunday night and being lazy and laying on the couch and watching TV and being distracted. So then when you get up now you’re running around and you’re late and anxious.
You’re talking to yourself, like, why did I leave everything to the last minute? You’re trying to figure out, do I even have any clean wrinkle-free clothes to wear? And, oh, shoot, I didn’t make my lunch. So now I’m going to have to stop what I’m doing and go out and get something that I probably shouldn’t eat.
Right? And then you arrive at work, and you don’t have your plan for the day. So, you know, now you get there a little late because you weren’t ready. And as soon as you walk in and put your bags down, someone is in your office saying, Hey, this machine is down. Or I need this, or I need that. You know, you didn’t give yourself time to get acclimated. I mean, you started the day all frazzled and frustrated, and that’s what’s happening within our organizations.
Now, one of the reasons why I started this podcast is to shed some light on these basics. There’s a lot that goes into Reliability. So, if you’re out there and your organization is in Reactive Mode, or maybe you’re not in Reactive Mode. Maybe your costs are sky-high because maybe you’re doing way too much maintenance. Maybe you’re doing too much of what you need to do. So, you’re getting the Reliability that you need, but it is at a huge cost. You know, whatever it is, you have to take the time to stop and get a plan in place as to how you’re going to turn things around.
And it starts with a first step. You know, Reliability can be very overwhelming if you think of it in one big chunk. But when you break it down, it becomes so much more simple. You know, really Reliability in general is simple. It’s simple, but it’s not easy at times. But it’s really quite simple when you break it down.
I mean, just talking about today’s topic, which is Failure Modes, right? What is a Failure Mode? It’s what specifically causes Functional Failure. Well, what will specifically cause that filter to clog due to normal use? That’s a very simple concept. It’s so simple though, that it is very often overlooked.
And I think that, generally speaking, people, organizations have big Reliability problems because the simple stuff is way too often overlooked, right? We’re looking for that next shiny object or we’re looking to see where we can throw some money to hopefully turn things around. And sure, throwing money at the right thing at the right time can help.
But when it comes to Reliability, you’ve got to get systematic and there’s nothing more systematic than identifying what specifically could cause Functional Failure. And all that means is what specifically could cause me to not get the Reliability I need from my machine. And that is why Failure Modes are so important. That is why proactively identifying Failure Modes for your equipment is so vital.
I’m Nancy Regan. Thank you for joining me for the second episode of The Heart of Reliability. Be sure to join me for the next one, because in the next one, we’re going to talk about Preventive Maintenance. I look forward to welcoming you back to that one. Thank you for watching.