Engineering Risk and Safety: Designing for Failure

(Level 4)

Engineering isn't just about making things work—it's about making things work safely, reliably, and predictably over their entire operating life. Identifying risks, anticipating failure modes, and designing with safety in mind is what separates responsible engineers from reckless ones.

Every component will eventually fail. Your job is to make sure it fails safely, predictably, and only after a long, useful life. That starts with thinking about failure from day one.

Designing for Real-World Variability

A team designs a gearbox housing. They perform a stress calculation under the worst-case loading condition. The stress is below yield with a factor of safety of 2.5, so the design goes into production. Six months later, a handful of field failures appear. A root cause analysis on the returned parts reveals that every failure starts from the same location: the mounting bosses. Further investigation shows that the casting process has some variability in shrinkage during solidification, which occasionally leaves a small internal void at the critical location. During startup, a random load spike produces high stress concentrations around the void. Fatigue takes over, and a crack propagates from the void until the component fails.

The design was "optimized" for the nominal case: expected loads, perfect material, ideal geometry. But what happens when the loads aren't quite so expected? When the material isn't quite so perfect? When manufacturing isn't quite so ideal? Real parts have tolerances, and defects are inevitable. Designing only for the expected case is a recipe for failure.

Experienced engineers prepare for worst-case scenarios: operator misuse during start-up and shut-down sequences, unforeseen manufacturing defects in component material, corrosion creeping in over several years, and sub-systems poorly assembled by less-than-perfectly-trained technicians. The universal question is "what could go wrong?" If you can identify the potential failures, you can design to prevent them.

The following failure modes will eventually bite you if you ignore them. Not every one will be catastrophic or irreversible, but if you don't address them during design, they only get worse and more difficult to fix.

Overloads: Loads are rarely exactly what you predicted from an engineering brochure. In practice, things get taken over their rating from time to time: an unintended impact load from a failed piece of upstream equipment, or someone hoisting twice the design load in the worst possible orientation for a few seconds. So what happens then? Does the member take permanent distortion, or does it fail catastrophically?

Fatigue: Static analysis tells you nothing about fatigue life. A part rated for a 5000 N static load may be good for far less under cyclic loading (e.g., 100,000 cycles to failure at 3000 N). Fatigue matters for any part that sees vibration, oscillatory loads, or cyclic thermal expansion.
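To get a feel for why a static rating says so little about cyclic life, Basquin's relation, S = a·N^b, links stress amplitude to cycles to failure. A minimal sketch, where the coefficients are illustrative placeholders rather than data for any real material:

```python
def basquin_life(stress_amplitude_mpa: float,
                 a_coeff: float = 900.0,
                 b_exp: float = -0.09) -> float:
    """Cycles to failure N from stress amplitude S via Basquin's
    relation S = a * N**b. a_coeff and b_exp are illustrative
    placeholders, not real material data."""
    return (stress_amplitude_mpa / a_coeff) ** (1.0 / b_exp)

# A modest rise in stress amplitude slashes predicted life:
life_lo = basquin_life(300.0)   # baseline stress amplitude
life_hi = basquin_life(450.0)   # 50% higher stress amplitude
```

With these placeholder coefficients, raising the stress amplitude by 50% cuts the predicted life by roughly two orders of magnitude, which is why a part that is comfortably "fine" statically can be hopeless in fatigue.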

Environment: Saltwater corrodes metals, UV light degrades plastics, and many materials behave differently wet than dry. Even mundane exposure (dust, humidity, how the part was stored) can degrade a component far faster than its nominal duty cycle would suggest. Any design used outdoors, in a factory, or anywhere other than a temperature- and humidity-controlled lab will degrade faster than bench conditions imply.

Wear and friction: the bane of designers and maintenance crews everywhere. As moving parts wear down, bearings fail, seals leak, and coatings flake away. Wear is the inevitable cost of making something move; the only questions are how long it takes and whether you accounted for it in your operating costs.

Manufacturing flaws: porosity in castings, inclusions in welds, tool marks in machining that become crack initiators, inhomogeneities in heat treatment, and more. Even your best design can fail through a flawed part.

Human error: Parts get assembled incorrectly, fasteners are left loose, scheduled maintenance is skipped, warning labels are ignored, and operators treat the system as if it were fail-safe. If your design can fail through a simple operator or maintenance error, eventually it will.

Anticipating failure is not paranoia. Failure you anticipate can be designed against; failure you have not considered will find you anyway.

Safety Factors as Uncertainty Buffers

A factor of safety is more than just making something stronger because bad things can happen. It is a buffer against everything you can't know for sure: variation in material properties, uncertainty in loads, the limits of your model, the limits of your manufacturing process, and the inevitable degradation of the part over time. The question is how much of that uncertainty the design must absorb and still work.

Too much safety margin, however, wastes material and adds weight and cost for no good reason. Too little invites catastrophic failure. If an elevator rope were designed with a factor of safety (FoS) of 1.2 instead of 4, it would be lighter and cheaper, but the chance of early failure from a manufacturing defect or unexpected dynamic loading would be much greater. Conversely, a simple machined bracket designed with an FoS of 10 would essentially never fail, but the resulting product would be unnecessarily heavy and expensive for the use case at hand.

One thing I try to get across is the context and consequences behind different factors of safety. There is a huge difference between the margin needed in a part whose failure may seriously harm a person and the margin needed in a part that can simply be swapped out so the machine keeps running. The context you are working in drives how much safety margin is necessary.

Factor of safety varies wildly based on consequences:

Elevator cable: FoS = 10-12. Failure of an elevator cable can be catastrophic, with life-threatening consequences. Extremely high safety factors are therefore built into the design, regular inspections are enforced, and cables are replaced on a deliberately early schedule. Here, weight matters little compared with the risk to human life.

Pressure vessels: FoS = 3-4. Failure could injure people or harm the environment, so design codes such as the ASME pressure vessel code mandate minimum safety margins based on decades of failure data. Designers are expected to stay within these established boundaries.

Industrial machine component: FoS = 2-3. Failure means downtime and repair cost but no risk to human safety. Moderate margins are built in, the design is made as easy as possible to inspect, and sufficient spares of rotating components are held in stock.

Aerospace structures: FoS = 1.5-2.0. Every kilogram of structure costs fuel over the life of the aircraft, so these low margins are earned through extensive testing, tightly controlled materials and analysis, multiple load paths, and a regular inspection regime. High performance brings its own obligations.

Consider what the factor of safety actually covers. Material strength is inherently variable from lot to lot, and even within the same part. Load requirements are estimates at best, and operating habits can greatly affect actual loading. FEA models rely on assumptions about boundary conditions and load paths, while parts as actually made have defects the model never saw. Corrosion and wear cause progressive strength loss over time. The factor of safety is your buffer for all of it.

There are perils to operating with either too low or too high of a factor of safety in this way. Too low, and you are essentially gambling with the reliability of your design. Too high, and you are pointlessly solving a problem that doesn't actually exist. The art is determining just the right amount of safety margin relative to the actual risk to the system.
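To make the "uncertainty buffer" idea concrete, here is a minimal Monte Carlo sketch: draw strength and load from assumed normal distributions and count how often the load draw exceeds the strength draw. The distributions and coefficients of variation are illustrative assumptions, not design values:

```python
import random

def failure_probability(nominal_strength: float, nominal_load: float,
                        strength_cov: float = 0.08, load_cov: float = 0.25,
                        trials: int = 100_000, seed: int = 42) -> float:
    """Fraction of trials where a random load draw exceeds a random
    strength draw. Normal distributions and the CoV values are
    illustrative assumptions, not measured data."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        strength = rng.gauss(nominal_strength, strength_cov * nominal_strength)
        load = rng.gauss(nominal_load, load_cov * nominal_load)
        if load > strength:
            failures += 1
    return failures / trials

p_low_fos  = failure_probability(300.0, 250.0)  # nominal FoS = 1.2
p_high_fos = failure_probability(300.0, 120.0)  # nominal FoS = 2.5
```

With this much scatter, a nominal FoS of 1.2 fails in a large fraction of trials, while 2.5 drives the failure probability toward zero; the margin is absorbing variability, not adding "extra strength" for its own sake.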

Failure Mode Identification with FMEA

You can't leave failure mode mitigation to chance. You need a structured method to identify potential failure modes, determine their impact and likelihood, and to decide where to concentrate mitigation efforts. This is known as Failure Mode and Effects Analysis (FMEA). FMEA forces you to methodically consider every part or assembly and ask questions like: How might the item fail? What effect would that failure have? And how can we ensure that the part fails less severely or less frequently?

FMEA is performed in various forms in automotive, aerospace, medical devices and any other industry where reliability actually matters. While there may be differences in the approach to capturing information, the fundamental goal is the same - identify potential field failures proactively. For each part, one would typically capture the potential failure modes, their effects, the severity of those effects, the likelihood of part failure and likelihood of detection prior to field failure.

Example FMEA for a bolted joint in rotating machinery:

Component: M12 bolt connecting flywheel to shaft
Failure Mode: Bolt fractures under cyclic loading (fatigue)
Failure Effect: Flywheel separates from the shaft; equipment fails catastrophically; flying fragments could strike the operator.
Severity: 9/10 (could cause serious injury)
Likelihood: 5/10. Medium risk, depends on preload accuracy, material quality, and load magnitude.
Detection: 2/10 (the fracture gives essentially no warning before it occurs)
RPN = 9 × 5 × 2 = 90 → high priority for mitigation

Mitigations:
→ Use property class 10.9 bolts or higher for increased fatigue strength
→ Specify the assembly torque and require a calibrated torque wrench (preload accuracy drives fatigue life)
→ Include regular inspections (replace bolts after 5000 hours of use) in the schedule.
→ Add thread-locking compound (prevents loosening from vibration)
→ Add a safety shield around the flywheel to contain fragments in the event of failure

The point is not to produce pages of documentation, but to discipline yourself into thinking about failure modes systematically and to identify where to spend your effort. Failure modes with severe consequences, high likelihood, and low detectability get priority attention. Those that are unlikely, easily detected, and minor in consequence can be accepted or simply monitored.
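The prioritization above is just arithmetic on the three ratings. A minimal sketch, using the illustrative 1-10 values from the example plus two hypothetical entries for comparison:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: the product of the three 1-10 ratings."""
    return severity * occurrence * detection

# (name, severity, occurrence, detection) -- illustrative entries only
failure_modes = [
    ("bolt fatigue fracture", 9, 5, 2),
    ("seal leak",             4, 6, 3),
    ("cover plate scratch",   2, 7, 1),
]

# Highest RPN first: that is where mitigation effort goes.
ranked = sorted(failure_modes, key=lambda m: rpn(*m[1:]), reverse=True)
```

Real FMEA processes refine this (severity thresholds often override a low RPN, for instance), but the core discipline is the same: rate, rank, and spend effort where the numbers say it matters.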

FMEA isn't just an exercise in predicting and planning for potential failures; it also creates a permanent record. Later, when some suit asks whether you considered the possibility of fatigue failure, you can point to your work and say: yes, we considered it, we evaluated the risk, and these are the mitigations we took.

Fail-Safe Design and Damage Tolerance

Failure is sometimes unavoidable. Parts wear out, components reach end of life, and unexpected loads occur. With proper design, though, failure can be made safe: the part fails in a controlled manner, does not trigger further dangerous effects, and gives warning before anything catastrophic happens.

This philosophy is called fail-safe design, and it is central to aerospace, automotive, structural engineering, and any field where the consequences of something going wrong are serious. Engineers design under the assumption that any individual component can fail, while ensuring the system as a whole stays safe when it does.

Fail-safe strategies that actually work:

Redundancy: Critical systems have backups. Brake systems have dual circuits, aircraft have multiple hydraulic systems, and power supplies have redundant feeds, for example. Critical systems don't rely on a single component, because the cost of failure is too high.
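The value of redundancy follows directly from probability: with n independent backups, all of them must fail for the system to fail. A sketch, with the caveat that independence is an idealization and common-cause failures erode the benefit in practice:

```python
def parallel_reliability(r: float, n: int) -> float:
    """Probability that at least one of n independent redundant
    components survives, given each survives with probability r:
    R = 1 - (1 - r)**n. Independence is an idealized assumption."""
    return 1.0 - (1.0 - r) ** n

single = parallel_reliability(0.99, 1)   # one 99%-reliable circuit
dual   = parallel_reliability(0.99, 2)   # two in parallel
```

A second 99%-reliable brake circuit drops the failure probability from 1 in 100 to 1 in 10,000, provided the two circuits do not share a failure cause such as a common fluid reservoir.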

Design for graceful degradation: Allow some components to fail so that the system as a whole keeps functioning at a reduced level rather than failing completely. A multi-engine aircraft is designed to keep flying after losing an engine. A RAID storage array tolerates the failure of one or two of its drives. A structure is designed so that load shed by a cracked member can be taken up by adjacent members, which continue to support it.

Fuses and weak links: intentional failure points, where something is deliberately designed to fail first in order to protect more critical components. Shear pins are a classic example: if the driveline jams, the cheap pins shear before the expensive gearbox is destroyed. A breakaway sign post is designed to fail on vehicle impact rather than stop the car dead. A pressure relief valve opens before the tank it protects can rupture. In each case, a cheap, replaceable part is sacrificed for the good of something more expensive or more dangerous.
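Sizing a sacrificial weak link is a short calculation: pin shear area times ultimate shear strength gives the force at failure, and that force times the pin-circle radius gives the torque at which the pin lets go. The function and the numbers below are illustrative assumptions, not a real design:

```python
import math

def shear_pin_failure_torque(pin_diameter_m: float,
                             shear_strength_pa: float,
                             pin_circle_radius_m: float,
                             n_pins: int = 1) -> float:
    """Torque (N*m) at which the pins shear, assuming single shear
    and an even load split between pins. Illustrative sketch only."""
    shear_area = math.pi * (pin_diameter_m / 2.0) ** 2
    failure_force = n_pins * shear_area * shear_strength_pa
    return failure_force * pin_circle_radius_m

# Hypothetical 3 mm brass pin (tau ~ 200 MPa) on a 20 mm pin circle:
t_fail = shear_pin_failure_torque(0.003, 200e6, 0.020)
```

You would pick the pin so this failure torque sits comfortably above normal operating torque but below whatever the gearbox downstream can survive.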

Fail-to-safe: When power is lost or control fails, the system falls back to a safe state. Spring-applied truck air brakes engage when air pressure is lost; the air holds them released, so a leak stops the vehicle rather than leaving it without brakes. Pneumatic valves close when the supply fails, preventing leaks. Elevator doors will not open unless the car is at that floor. The default state of the system is a safe one.

Design for detectable degradation before catastrophic failure: Ductile components yield visibly before they fracture, giving warning of approaching failure; crack-stop holes arrest crack propagation; and inspection ports enable timely detection of wear.

Real example—aircraft wing design and multiple load paths:

Aircraft wings are designed with multiple, redundant load paths, so that if one component begins to fail, the wing remains structurally sound through the others. The skin, spars, ribs, and stringers all carry load; if one spar starts to crack slowly, catastrophic failure is unlikely because adjacent structure picks up the load. Aircraft undergo routine inspections, and even if a crack grew unexpectedly between inspections, the wing would deform and show signs of distress, such as cracked paint, local buckling, or asymmetric wing flexure, before catastrophic failure could occur.

People often wonder how commercial airliners maintain such remarkable safety records over decades of harsh operating conditions. The answer is that failures are anticipated, detected before they become disasters, and managed in a structured, well-understood way. The design methodology never relies on the perpetual perfect operation of individual components.

Fail-safe design isn't giving up; it recognizes that things will go wrong. Despite your best efforts, parts will break, people will make errors, and loads will exceed design parameters. By anticipating what could go wrong and designing the failure itself to be safe, you turn what could have been a catastrophe into a minor incident.

Professional Responsibility and Sign-Off

When a designer signs off on a release, they are stating that they believe the design will perform safely and effectively under all expected conditions of use. That is not a formality; it carries considerable legal and moral weight. If the design fails catastrophically and injures or kills someone, the designer's analysis and decisions, particularly their assumptions, will be examined in great detail. Ultimately, the designer must accept responsibility for what they knew and the decisions they made. "I was just following orders" is never an adequate excuse if you knew that what you were doing was unsafe.

Not every design carries the same stakes, but when risk to life and safety is on the table, knowing when to speak up is key. Treat your work with appropriate seriousness: it affects people's safety. Validating your assumptions, recognizing when you don't know something, and speaking up when something appears unsafe are all part of professional responsibility.

Engineering failures kill people. Bridges collapse. Pressure vessels explode. Medical devices malfunction. Buildings fail in major earthquakes. When things go wrong, it usually follows a chain of decisions, decisions in which engineers played a role. Many of these failures could have been avoided if an engineer who knew something was amiss had spoken up. Speaking up is not easy, but it is part of an engineer's responsibility and duty.

Case study: Hyatt Regency walkway collapse (Kansas City, 1981)

On a crowded evening in July 1981, two suspended walkways in the atrium of the Kansas City Hyatt Regency collapsed onto the people below, killing 114 and injuring 216 more in one of the deadliest structural failures in United States history.

The original design called for continuous steel rods supporting both walkways. During construction, this was changed to separate rods: one set hanging the upper walkway from the roof, and a second set hanging the lower walkway from the upper one. The change simplified fabrication, and the engineers accepted it without fully re-analyzing the connections.

In the original design, each rod-to-beam connection carried the weight of one walkway. In the as-built design, the upper connection carried the weight of both walkways, double its original design load. Under the crowd loading that evening, the connection ruptured and failed catastrophically.

The lesson endures. Every design change, no matter how slight or who suggested it, must be reconsidered and re-analyzed. An assumption that held for Design A will not automatically hold for Design B. The fabricator's change seemed innocuous but shifted the entire load path, and the engineers never caught it because they did not conduct a rigorous re-analysis.

What does professional responsibility actually mean day-to-day?

Don't sign off on work you don't understand. If you're not confident in your analysis, say so. Get a second opinion. Ask someone more experienced to review it. Admitting uncertainty is professional. Pretending you know something you don't is reckless.

Speak up if you see something unsafe. Identify design flaws, deferred maintenance, and insufficient testing. Document your concerns and push for action if necessary. Your job is to prevent failures, not just to do your job.

Document the assumptions and limitations of every calculation. State clearly what it does and does not cover, for example: "This calculation assumes static loading and does not account for fatigue, thermal effects, or impact loading." If someone later uses the work outside its valid limits, at least the limits were documented.
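One lightweight way to make those limits hard to miss is to put them in the calculation itself. A sketch, where the function name and numbers are hypothetical:

```python
def bracket_static_stress(load_n: float, area_m2: float) -> float:
    """Nominal static stress in the bracket cross-section (Pa).

    ASSUMPTIONS / LIMITS (hypothetical example):
      - Static loading only: no fatigue, thermal, or impact effects.
      - Uniform stress over the net section; no stress concentration.
    Do not reuse outside these limits without re-analysis.
    """
    if area_m2 <= 0:
        raise ValueError("cross-section area must be positive")
    return load_n / area_m2
```

The docstring travels with the calculation wherever it is copied, so anyone reaching for it later sees the limits before they see the answer.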

Risk Management Without Analysis Paralysis

Thinking about failure doesn't mean you're afraid to design. It means you're designing responsibly. The goal is informed confidence: you know the risks, you've identified the critical failure modes, you've implemented mitigations, and you can defend your decisions with evidence. That's not paranoia—that's competence.

You can't eliminate all risk. That's not the job. The job is managing risk to acceptable levels based on consequences, likelihood, and cost. Some failures are acceptable if they're unlikely, easily detected, and have minor consequences. Other failures demand multiple layers of protection because the consequences are unacceptable. Context determines how much caution is appropriate.

Here's how to balance safety with actually getting work done:

Focus effort where failure matters most. Not every component is critical. A decorative cover plate doesn't need the same scrutiny as a load-bearing weld. Identify which components are safety-critical, which affect reliability, and which are just annoying if they fail. Allocate your analysis time accordingly.

Use testing to validate assumptions. Calculations and simulations give you predictions. Testing gives you reality. If you're doing something novel or pushing boundaries, build prototypes and test early. Real-world validation catches problems your model didn't account for.

Leverage standards and codes. ASME pressure vessel codes, AISC steel design standards, automotive safety regulations—these exist because people have already done the risk analysis and compiled best practices. Following established standards doesn't make you unimaginative; it means you're not rediscovering lessons that were learned decades ago through expensive failures.

Consult experienced engineers. They've seen failures you haven't. They know which assumptions are safe and which are dangerous. They can tell you "we tried that approach five years ago and here's why it didn't work." Don't try to figure everything out in isolation when someone with experience can save you from predictable mistakes.

Design reviews catch what you missed. You've been staring at this design for weeks. Your brain fills in gaps and overlooks problems. A fresh set of eyes will ask questions you didn't think to ask and spot issues you've become blind to. Reviews aren't bureaucracy—they're quality control.

The point isn't to be fearful. The point is to think clearly about consequences, identify risks systematically, and make deliberate choices about which risks are acceptable and which demand mitigation. That's what separates professional engineers from people who just run calculations and hope for the best.

Ready for the Next Level?

With risk awareness established, you're ready to master communication, learn to defend your decisions, and develop professional engineering judgment.

Continue to Level 5: Communication and Judgment →