Engineering Risk and Safety: Designing for Failure

(Level 4)

Engineering isn't just about making things work—it's about making things work safely, reliably, and predictably over their entire operating life. Identifying risks, anticipating failure modes, and designing with safety in mind is what separates responsible engineers from reckless ones.

Every component will eventually fail. Your job is to make sure it fails safely, predictably, and only after a long, useful life. That starts with thinking about failure from day one.

Designing for Real-World Variability

A team designs a gearbox housing. They perform a stress calculation under the worst-case loading condition. The stress is below yield with a factor of safety of 2.5, so the design goes into production. Six months later, a handful of field failures appear. A root cause analysis on the returned parts reveals that every failure starts from the same location: the mounting bosses. Further investigation shows that the casting process has some variability in shrinkage during solidification, which occasionally leaves a small internal void at the critical location. During startup, a random load spike produces high stress concentrations around the void. Fatigue takes over, and a crack propagates from the void until the component fails.

The design was "optimized" for the nominal case: expected loads, perfect material, ideal geometry. But what happens when the loads aren't quite so expected? When the material isn't quite so perfect? When manufacturing isn't quite so ideal? Real parts have tolerances, and defects are inevitable. Designing only for the expected case is a recipe for failure.

Experienced engineers prepare for worst-case scenarios: operator misuse during start-up and shut-down sequences, unforeseen manufacturing defects in component material, corrosion creeping in over several years, and sub-systems poorly assembled by less-than-perfectly-trained technicians. The universal question is "what could go wrong?" If you can identify the potential failures, you can design to prevent them.

The following failure modes will eventually bite you if you ignore them. Not every one will be catastrophic or irreversible, but if you don't address them during design, they only get worse and more difficult to fix.

Overloads: Loads are rarely exactly what you predicted from an engineering brochure. In practice, things get taken over their rating from time to time: an unintended impact load from a failed piece of upstream equipment, or someone hoisting twice the design load in the worst possible orientation for a few seconds. So what happens then? Does the member take permanent distortion, or does it fail catastrophically?

Fatigue: Static analysis tells you nothing about fatigue life. A part rated for a 5000 N static load may be good for far less under cyclic loading (e.g., 100,000 cycles to failure at 3000 N). Fatigue matters for any part that sees vibration, oscillatory loads, or cyclic thermal expansion.
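To get a feel for why a static rating says so little about cyclic life, Basquin's relation, S = a·N^b, links stress amplitude to cycles to failure. A minimal sketch, where the coefficients are illustrative placeholders rather than data for any real material:

```python
def basquin_life(stress_amplitude_mpa: float,
                 a_coeff: float = 900.0,
                 b_exp: float = -0.09) -> float:
    """Cycles to failure N from stress amplitude S via Basquin's
    relation S = a * N**b. a_coeff and b_exp are illustrative
    placeholders, not real material data."""
    return (stress_amplitude_mpa / a_coeff) ** (1.0 / b_exp)

# A modest rise in stress amplitude slashes predicted life:
life_lo = basquin_life(300.0)   # baseline stress amplitude
life_hi = basquin_life(450.0)   # 50% higher stress amplitude
```

With these placeholder coefficients, raising the stress amplitude by 50% cuts the predicted life by roughly two orders of magnitude, which is why a part that is comfortably "fine" statically can be hopeless in fatigue.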

Environment: Saltwater corrodes metals, UV light degrades plastics, and many materials behave differently wet than dry. Even mundane exposure (dust, humidity, how the part was stored) can degrade a component far faster than its nominal duty cycle would suggest. Any design used outdoors, in a factory, or anywhere other than a temperature- and humidity-controlled lab will degrade faster than bench conditions imply.

Wear and friction: the bane of designers and maintenance crews everywhere. As moving parts wear down, bearings fail, seals leak, and coatings flake away. Wear is the inevitable cost of making something move; the only questions are how long it takes and whether you accounted for it in your operating costs.

Manufacturing flaws: porosity in castings, inclusions in welds, tool marks in machining that become crack initiators, inhomogeneities in heat treatment, and more. Even your best design can fail through a flawed part.

Human error: Parts get assembled incorrectly, fasteners are left loose, scheduled maintenance is skipped, warning labels are ignored, and operators treat the system as if it were fail-safe. If your design can fail through a simple operator or maintenance error, eventually it will.

Anticipating failure is not paranoia. Failure you anticipate can be designed against; failure you have not considered will find you anyway.

Safety Factors as Uncertainty Buffers

A factor of safety is more than just making something stronger because bad things can happen. It is a buffer against everything you can't know for sure: variation in material properties, uncertainty in loads, the limits of your model, the limits of your manufacturing process, and the inevitable degradation of the part over time. The question is how much of that uncertainty the design must absorb and still work.

Too much safety margin, however, wastes material and adds weight and cost for no good reason. Too little invites catastrophic failure. If an elevator rope were designed with a factor of safety (FoS) of 1.2 instead of 4, it would be lighter and cheaper, but the chance of early failure from a manufacturing defect or unexpected dynamic loading would be much greater. Conversely, a simple machined bracket designed with an FoS of 10 would essentially never fail, but the resulting product would be unnecessarily heavy and expensive for the use case at hand.

One thing I try to get across is the context and consequences behind different factors of safety. There is a huge difference between the margin needed in a part whose failure may seriously harm a person and the margin needed in a part that can simply be swapped out so the machine keeps running. The context you are working in drives how much safety margin is necessary.

Factor of safety varies wildly based on consequences:

Elevator cable: FoS = 10-12. Failure of an elevator cable can be catastrophic, with life-threatening consequences. Extremely high safety factors are therefore built into the design, regular inspections are enforced, and cables are replaced on a deliberately early schedule. Here, weight matters little compared with the risk to human life.

Pressure vessels: FoS = 3-4. Failure could injure people or harm the environment, so design codes such as the ASME pressure vessel code mandate minimum safety margins based on decades of failure data. Designers are expected to stay within these established boundaries.

Industrial machine component: FoS = 2-3. Failure means downtime and repair cost but no risk to human safety. Moderate margins are built in, the design is made as easy as possible to inspect, and sufficient spares of rotating components are held in stock.

Aerospace structures: FoS = 1.5-2.0. Every kilogram of structure costs fuel over the life of the aircraft, so these low margins are earned through extensive testing, tightly controlled materials and analysis, multiple load paths, and a regular inspection regime. High performance brings its own obligations.

Consider what the factor of safety actually covers. Material strength is inherently variable from lot to lot, and even within the same part. Load requirements are estimates at best, and operating habits can greatly affect actual loading. FEA models rely on assumptions about boundary conditions and load paths, while parts as actually made have defects the model never saw. Corrosion and wear cause progressive strength loss over time. The factor of safety is your buffer for all of it.

There are perils to operating with either too low or too high of a factor of safety in this way. Too low, and you are essentially gambling with the reliability of your design. Too high, and you are pointlessly solving a problem that doesn't actually exist. The art is determining just the right amount of safety margin relative to the actual risk to the system.
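To make the "uncertainty buffer" idea concrete, here is a minimal Monte Carlo sketch: draw strength and load from assumed normal distributions and count how often the load draw exceeds the strength draw. The distributions and coefficients of variation are illustrative assumptions, not design values:

```python
import random

def failure_probability(nominal_strength: float, nominal_load: float,
                        strength_cov: float = 0.08, load_cov: float = 0.25,
                        trials: int = 100_000, seed: int = 42) -> float:
    """Fraction of trials where a random load draw exceeds a random
    strength draw. Normal distributions and the CoV values are
    illustrative assumptions, not measured data."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        strength = rng.gauss(nominal_strength, strength_cov * nominal_strength)
        load = rng.gauss(nominal_load, load_cov * nominal_load)
        if load > strength:
            failures += 1
    return failures / trials

p_low_fos  = failure_probability(300.0, 250.0)  # nominal FoS = 1.2
p_high_fos = failure_probability(300.0, 120.0)  # nominal FoS = 2.5
```

With this much scatter, a nominal FoS of 1.2 fails in a large fraction of trials, while 2.5 drives the failure probability toward zero; the margin is absorbing variability, not adding "extra strength" for its own sake.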

Failure Mode Identification with FMEA

You can't leave failure mode mitigation to chance. You need a structured method to identify potential failure modes, determine their impact and likelihood, and to decide where to concentrate mitigation efforts. This is known as Failure Mode and Effects Analysis (FMEA). FMEA forces you to methodically consider every part or assembly and ask questions like: How might the item fail? What effect would that failure have? And how can we ensure that the part fails less severely or less frequently?

FMEA is performed in various forms in automotive, aerospace, medical devices and any other industry where reliability actually matters. While there may be differences in the approach to capturing information, the fundamental goal is the same - identify potential field failures proactively. For each part, one would typically capture the potential failure modes, their effects, the severity of those effects, the likelihood of part failure and likelihood of detection prior to field failure.

Example FMEA for a bolted joint in rotating machinery:

Component: M12 bolt connecting flywheel to shaft
Failure Mode: Bolt fractures under cyclic loading (fatigue)
Failure Effect: Flywheel separates from the shaft; equipment fails catastrophically; flying fragments could strike the operator.
Severity: 9/10 (could cause serious injury)
Likelihood: 5/10. Medium risk, depends on preload accuracy, material quality, and load magnitude.
Detection: 2/10 (the fracture gives essentially no warning before it occurs)
RPN = 9 × 5 × 2 = 90 → high priority for mitigation

Mitigations:
→ Use property class 10.9 bolts or higher for increased fatigue strength
→ Specify the assembly torque and require a calibrated torque wrench (preload accuracy drives fatigue life)
→ Include regular inspections (replace bolts after 5000 hours of use) in the schedule.
→ Add thread-locking compound (prevents loosening from vibration)
→ Add a safety shield around the flywheel to contain fragments in the event of failure

The point is not to produce pages of documentation, but to discipline yourself into thinking about failure modes systematically and to identify where to spend your effort. Failure modes with severe consequences, high likelihood, and low detectability get priority attention. Those that are unlikely, easily detected, and minor in consequence can be accepted or simply monitored.
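The prioritization above is just arithmetic on the three ratings. A minimal sketch, using the illustrative 1-10 values from the example plus two hypothetical entries for comparison:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: the product of the three 1-10 ratings."""
    return severity * occurrence * detection

# (name, severity, occurrence, detection) -- illustrative entries only
failure_modes = [
    ("bolt fatigue fracture", 9, 5, 2),
    ("seal leak",             4, 6, 3),
    ("cover plate scratch",   2, 7, 1),
]

# Highest RPN first: that is where mitigation effort goes.
ranked = sorted(failure_modes, key=lambda m: rpn(*m[1:]), reverse=True)
```

Real FMEA processes refine this (severity thresholds often override a low RPN, for instance), but the core discipline is the same: rate, rank, and spend effort where the numbers say it matters.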

FMEA isn't just an exercise in predicting and planning for potential failures; it also creates a permanent record. Later, when some suit asks whether you considered the possibility of fatigue failure, you can point to your work and say: yes, we considered it, we evaluated the risk, and these are the mitigations we took.

Fail-Safe Design and Damage Tolerance

Failure is sometimes unavoidable. Parts wear out, components reach end of life, and unexpected loads occur. With proper design, though, failure can be made safe: the part fails in a controlled manner, does not trigger further dangerous effects, and gives warning before anything catastrophic happens.

This philosophy is called fail-safe design, and it is central to aerospace, automotive, structural engineering, and any field where the consequences of something going wrong are serious. Engineers design under the assumption that any individual component can fail, while ensuring the system as a whole stays safe when it does.

Fail-safe strategies that actually work:

Redundancy: Critical systems have backups. Brake systems have dual circuits, aircraft have multiple hydraulic systems, and power supplies have redundant feeds, for example. Critical systems don't rely on a single component, because the cost of failure is too high.
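The value of redundancy follows directly from probability: with n independent backups, all of them must fail for the system to fail. A sketch, with the caveat that independence is an idealization and common-cause failures erode the benefit in practice:

```python
def parallel_reliability(r: float, n: int) -> float:
    """Probability that at least one of n independent redundant
    components survives, given each survives with probability r:
    R = 1 - (1 - r)**n. Independence is an idealized assumption."""
    return 1.0 - (1.0 - r) ** n

single = parallel_reliability(0.99, 1)   # one 99%-reliable circuit
dual   = parallel_reliability(0.99, 2)   # two in parallel
```

A second 99%-reliable brake circuit drops the failure probability from 1 in 100 to 1 in 10,000, provided the two circuits do not share a failure cause such as a common fluid reservoir.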

Design for graceful degradation: Allow some components to fail so that the system as a whole keeps functioning at a reduced level rather than failing completely. A multi-engine aircraft is designed to keep flying after losing an engine. A RAID storage array tolerates the failure of one or two of its drives. A structure is designed so that load shed by a cracked member can be taken up by adjacent members, which continue to support it.

Fuses and weak links: intentional failure points, where something is deliberately designed to fail first in order to protect more critical components. Shear pins are a classic example: if the driveline jams, the cheap pins shear before the expensive gearbox is destroyed. A breakaway sign post is designed to fail on vehicle impact rather than stop the car dead. A pressure relief valve opens before the tank it protects can rupture. In each case, a cheap, replaceable part is sacrificed for the good of something more expensive or more dangerous.
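Sizing a sacrificial weak link is a short calculation: pin shear area times ultimate shear strength gives the force at failure, and that force times the pin-circle radius gives the torque at which the pin lets go. The function and the numbers below are illustrative assumptions, not a real design:

```python
import math

def shear_pin_failure_torque(pin_diameter_m: float,
                             shear_strength_pa: float,
                             pin_circle_radius_m: float,
                             n_pins: int = 1) -> float:
    """Torque (N*m) at which the pins shear, assuming single shear
    and an even load split between pins. Illustrative sketch only."""
    shear_area = math.pi * (pin_diameter_m / 2.0) ** 2
    failure_force = n_pins * shear_area * shear_strength_pa
    return failure_force * pin_circle_radius_m

# Hypothetical 3 mm brass pin (tau ~ 200 MPa) on a 20 mm pin circle:
t_fail = shear_pin_failure_torque(0.003, 200e6, 0.020)
```

You would pick the pin so this failure torque sits comfortably above normal operating torque but below whatever the gearbox downstream can survive.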

Fail-to-safe: When power is lost or control fails, the system falls back to a safe state. Spring-applied truck air brakes engage when air pressure is lost; the air holds them released, so a leak stops the vehicle rather than leaving it without brakes. Pneumatic valves close when the supply fails, preventing leaks. Elevator doors will not open unless the car is at that floor. The default state of the system is a safe one.

Design for detectable degradation before catastrophic failure: Ductile components yield visibly before they fracture, giving warning of approaching failure; crack-stop holes arrest crack propagation; and inspection ports enable timely detection of wear.

Real example—aircraft wing design and multiple load paths:

Aircraft wings are designed with multiple, redundant load paths, so that if one component begins to fail, the wing remains structurally sound through the others. The skin, spars, ribs, and stringers all carry load; if one spar starts to crack slowly, catastrophic failure is unlikely because adjacent structure picks up the load. Aircraft undergo routine inspections, and even if a crack grew unexpectedly between inspections, the wing would deform and show signs of distress, such as cracked paint, local buckling, or asymmetric wing flexure, before catastrophic failure could occur.

People often wonder how commercial airliners maintain such remarkable safety records over decades of harsh operating conditions. The answer is that failures are anticipated, detected before they become disasters, and managed in a structured, well-understood way. The design methodology never relies on the perpetual perfect operation of individual components.

Fail-safe design isn't giving up; it recognizes that things will go wrong. Despite your best efforts, parts will break, people will make errors, and loads will exceed design parameters. By anticipating what could go wrong and designing the failure itself to be safe, you turn what could have been a catastrophe into a minor incident.

Professional Responsibility and Sign-Off

When a designer signs off on a release, they are stating that they believe the design will perform safely and effectively under all expected conditions of use. That is not a formality; it carries considerable legal and moral weight. If the design fails catastrophically and injures or kills someone, the designer's analysis and decisions, particularly their assumptions, will be examined in great detail. Ultimately, the designer must accept responsibility for what they knew and the decisions they made. "I was just following orders" is never an adequate excuse if you knew that what you were doing was unsafe.

Not every design carries the same stakes, but when risk to life and safety is on the table, knowing when to speak up is key. Treat your work with appropriate seriousness: it affects people's safety. Validating your assumptions, recognizing when you don't know something, and speaking up when something appears unsafe are all part of professional responsibility.

Engineering failures kill people. Bridges collapse. Pressure vessels explode. Medical devices malfunction. Buildings fail in major earthquakes. When things go wrong, it usually follows a chain of decisions, decisions in which engineers played a role. Many of these failures could have been avoided if an engineer who knew something was amiss had spoken up. Speaking up is not easy, but it is part of an engineer's responsibility and duty.

Case study: Hyatt Regency walkway collapse (Kansas City, 1981)

On a crowded evening in July 1981, two suspended walkways in the atrium of the Kansas City Hyatt Regency collapsed onto the people below, killing 114 and injuring 216 more in one of the deadliest structural failures in United States history.

The original design called for continuous steel rods supporting both walkways. During construction, this was changed to separate rods: one set hanging the upper walkway from the roof, and a second set hanging the lower walkway from the upper one. The change simplified fabrication, and the engineers accepted it without fully re-analyzing the connections.

In the original design, each rod-to-beam connection carried the weight of one walkway. In the as-built design, the upper connection carried the weight of both walkways, double its original design load. Under the crowd loading that evening, the connection ruptured and failed catastrophically.

The lesson endures. Every design change, no matter how slight or who suggested it, must be reconsidered and re-analyzed. An assumption that held for Design A will not automatically hold for Design B. The fabricator's change seemed innocuous but shifted the entire load path, and the engineers never caught it because they did not conduct a rigorous re-analysis.

What does professional responsibility actually mean day-to-day?

Don't sign off on work you don't understand. If you're not confident in your analysis, say so. Get a second opinion. Ask someone more experienced to review it. Admitting uncertainty is professional. Pretending you know something you don't is reckless.

Speak up if you see something unsafe. Identify design flaws, deferred maintenance, and insufficient testing. Document your concerns and push for action if necessary. Your job is to prevent failures, not just to do your job.

Document the assumptions and limitations of every calculation. State clearly what it does and does not cover, for example: "This calculation assumes static loading and does not account for fatigue, thermal effects, or impact loading." If someone later uses the work outside its valid limits, at least the limits were documented.
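One lightweight way to make those limits hard to miss is to put them in the calculation itself. A sketch, where the function name and numbers are hypothetical:

```python
def bracket_static_stress(load_n: float, area_m2: float) -> float:
    """Nominal static stress in the bracket cross-section (Pa).

    ASSUMPTIONS / LIMITS (hypothetical example):
      - Static loading only: no fatigue, thermal, or impact effects.
      - Uniform stress over the net section; no stress concentration.
    Do not reuse outside these limits without re-analysis.
    """
    if area_m2 <= 0:
        raise ValueError("cross-section area must be positive")
    return load_n / area_m2
```

The docstring travels with the calculation wherever it is copied, so anyone reaching for it later sees the limits before they see the answer.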

Risk Management Without Analysis Paralysis

Thinking about failure doesn't mean you're afraid to design. It means you're designing responsibly. The goal is informed confidence: you know the risks, you've identified the critical failure modes, you've implemented mitigations, and you can defend your decisions with evidence. That's not paranoia—that's competence.

You can't eliminate all risk. That's not the job. The job is managing risk to acceptable levels based on consequences, likelihood, and cost. Some failures are acceptable if they're unlikely, easily detected, and have minor consequences. Other failures demand multiple layers of protection because the consequences are unacceptable. Context determines how much caution is appropriate.

Here's how to balance safety with actually getting work done:

Focus effort where failure matters most. Not every component is critical. A decorative cover plate doesn't need the same scrutiny as a load-bearing weld. Identify which components are safety-critical, which affect reliability, and which are just annoying if they fail. Allocate your analysis time accordingly.

Use testing to validate assumptions. Calculations and simulations give you predictions. Testing gives you reality. If you're doing something novel or pushing boundaries, build prototypes and test early. Real-world validation catches problems your model didn't account for.

Leverage standards and codes. ASME pressure vessel codes, AISC steel design standards, automotive safety regulations—these exist because people have already done the risk analysis and compiled best practices. Following established standards doesn't make you unimaginative; it means you're not rediscovering lessons that were learned decades ago through expensive failures.

Consult experienced engineers. They've seen failures you haven't. They know which assumptions are safe and which are dangerous. They can tell you "we tried that approach five years ago and here's why it didn't work." Don't try to figure everything out in isolation when someone with experience can save you from predictable mistakes.

Design reviews catch what you missed. You've been staring at this design for weeks. Your brain fills in gaps and overlooks problems. A fresh set of eyes will ask questions you didn't think to ask and spot issues you've become blind to. Reviews aren't bureaucracy—they're quality control.

The point isn't to be fearful. The point is to think clearly about consequences, identify risks systematically, and make deliberate choices about which risks are acceptable and which demand mitigation. That's what separates professional engineers from people who just run calculations and hope for the best.

Ready for the Next Level?

With risk awareness established, you're ready to master communication, learn to defend your decisions, and develop professional engineering judgment.

Continue to Level 5: Communication and Judgment →