Failure Handling Model Specification
Document Type: Normative System Specification
Scope: Sensor Hub (Sub-Hub) Fault Detection, Classification, and Recovery
Traceability: SR-DIAG-001 through SR-DIAG-011, SR-SYS-002, SR-SYS-004
1. Purpose
This document defines the fault taxonomy, detection rules, escalation rules, recovery behaviors, reporting channels, and state machine integration for the Sensor Hub (Sub-Hub). All components SHALL adhere to this failure handling model.
2. Fault Taxonomy
2.1 Severity Levels
| Severity | Code | Description | State Impact | Recovery Behavior |
| --- | --- | --- | --- | --- |
| INFO | DIAG_SEV_INFO | Informational event, no action required | None | Log only |
| WARNING | DIAG_SEV_WARNING | Non-critical fault, degraded operation | RUNNING → WARNING | Continue with reduced functionality |
| ERROR | DIAG_SEV_ERROR | Critical fault, feature disabled | Feature-specific | Feature isolation, retry logic |
| FATAL | DIAG_SEV_FATAL | System-critical fault, core functionality disabled | RUNNING → FAULT | Controlled teardown, recovery attempt |
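Informative example: the severity levels can be modeled as an ordered enumeration so that "more severe" is a plain comparison. This is a minimal sketch in Python. Only the `DIAG_SEV_*` identifiers and the state impacts come from the table above; the numeric values and the names `DiagSeverity`, `STATE_IMPACT`, and `worst` are assumptions made for illustration.

```python
from enum import IntEnum


class DiagSeverity(IntEnum):
    """Severity levels; a higher value means more severe (values are assumed)."""
    DIAG_SEV_INFO = 0
    DIAG_SEV_WARNING = 1
    DIAG_SEV_ERROR = 2
    DIAG_SEV_FATAL = 3


# State impact per severity, copied from the table in 2.1.
STATE_IMPACT = {
    DiagSeverity.DIAG_SEV_INFO: "None",
    DiagSeverity.DIAG_SEV_WARNING: "RUNNING -> WARNING",
    DiagSeverity.DIAG_SEV_ERROR: "Feature-specific",
    DiagSeverity.DIAG_SEV_FATAL: "RUNNING -> FAULT",
}


def worst(first, *rest):
    """Return the most severe of the given severities."""
    return max((first, *rest))
```

Ordering the enumeration lets later rules such as escalation be written as comparisons instead of per-level branches.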
2.2 Fault Categories
| Category | Description | Examples | Typical Severity |
| --- | --- | --- | --- |
| SENSOR | Sensor hardware or communication failure | Disconnection, out-of-range, non-responsive | WARNING (single), ERROR (multiple), FATAL (all) |
| COMMUNICATION | Network or protocol failure | Link loss, timeout, authentication failure | WARNING (temporary), ERROR (persistent), FATAL (critical) |
| STORAGE | Persistence or storage medium failure | SD card failure, NVM corruption, write failure | WARNING (degraded), ERROR (persistent), FATAL (critical) |
| SECURITY | Security violation or authentication failure | Secure boot failure, key corruption, unauthorized access | FATAL (always) |
| SYSTEM | System resource or configuration failure | Memory exhaustion, task failure, configuration error | ERROR (recoverable), FATAL (unrecoverable) |
| OTA | Firmware update failure | Validation failure, transfer error, flash error | ERROR (retry), FATAL (rollback) |
| CALIBRATION | Calibration or machine constants failure | Invalid MC, calibration error, sensor mismatch | WARNING (single), ERROR (critical) |
3. Diagnostic Code Structure
3.1 Diagnostic Code Format
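Each diagnostic code has the form DIAG-CATEGORY-COMPONENT-NUMBER (for example, DIAG-SN-TEMP-0001), where:
- CATEGORY: Two-letter category code (SN = SENSOR, CM = COMMUNICATION, ST = STORAGE, SC = SECURITY, SY = SYSTEM, OT = OTA, CL = CALIBRATION)
- COMPONENT: Component identifier (e.g., TEMP, HUM, CO2, NET, SD, OTA)
- NUMBER: Unique four-digit fault number (0001-9999)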
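Informative example: a diagnostic code can be validated and split into its fields with a regular expression. The sketch below assumes the `DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>` layout seen in the registry entries (for example `DIAG-SN-TEMP-0001`). The names `parse_diag_code`, `DiagCode`, and `CATEGORY_CODES` are hypothetical, and the rule that COMPONENT is upper-case alphanumeric is an assumption.

```python
import re
from typing import NamedTuple

# Two-letter category codes mapped to the category names used in this document.
CATEGORY_CODES = {
    "SN": "SENSOR", "CM": "COMMUNICATION", "ST": "STORAGE", "SC": "SECURITY",
    "SY": "SYSTEM", "OT": "OTA", "CL": "CALIBRATION",
}

_CODE_RE = re.compile(r"DIAG-([A-Z]{2})-([A-Z0-9]+)-([0-9]{4})")


class DiagCode(NamedTuple):
    category: str
    component: str
    number: int


def parse_diag_code(text: str) -> DiagCode:
    """Split a diagnostic code into (category, component, number) or raise ValueError."""
    match = _CODE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"malformed diagnostic code: {text!r}")
    short, component, digits = match.groups()
    if short not in CATEGORY_CODES:
        raise ValueError(f"unknown category code: {short!r}")
    number = int(digits)
    if number == 0:  # valid range is 0001-9999
        raise ValueError("fault number 0000 is out of range")
    return DiagCode(CATEGORY_CODES[short], component, number)
```

For instance, `parse_diag_code("DIAG-CM-NET-0002")` yields the category `COMMUNICATION`, the component `NET`, and the number `2`.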
3.2 Diagnostic Code Registry
| Code | Severity | Category | Component | Description |
| --- | --- | --- | --- | --- |
| DIAG-SN-TEMP-0001 | WARNING | SENSOR | Temperature | Temperature sensor disconnected |
| DIAG-SN-TEMP-0002 | ERROR | SENSOR | Temperature | Temperature sensor out of range |
| DIAG-SN-TEMP-0003 | FATAL | SENSOR | Temperature | All temperature sensors failed |
| DIAG-CM-NET-0001 | WARNING | COMMUNICATION | Network | Main Hub link temporarily lost |
| DIAG-CM-NET-0002 | ERROR | COMMUNICATION | Network | Main Hub link persistently lost |
| DIAG-ST-SD-0001 | WARNING | STORAGE | SD Card | SD card write failure (retry successful) |
| DIAG-ST-SD-0002 | ERROR | STORAGE | SD Card | SD card persistent write failure |
| DIAG-ST-SD-0003 | FATAL | STORAGE | SD Card | SD card corruption detected |
| DIAG-SC-BOOT-0001 | FATAL | SECURITY | Secure Boot | Secure boot verification failed |
| DIAG-SY-MEM-0001 | ERROR | SYSTEM | Memory | Memory allocation failure |
| DIAG-OT-FW-0001 | ERROR | OTA | Firmware | Firmware integrity validation failed |
| DIAG-CL-MC-0001 | WARNING | CALIBRATION | Machine Constants | Invalid sensor slot configuration |
4. Fault Detection Rules
4.1 Sensor Fault Detection
| Condition | Detection Method | Severity Assignment |
| --- | --- | --- |
| Sensor disconnected | Hardware presence signal | WARNING (if other sensors available) |
| Sensor non-responsive | Communication timeout (3 retries) | ERROR (if critical sensor) |
| Sensor out of range | Value validation against limits | WARNING (if single occurrence), ERROR (if persistent) |
| All sensors failed | Count of failed sensors = total | FATAL |
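Informative example: the sensor rules above can be expressed as one classification function. This is a sketch under stated assumptions. The function name, the condition strings, and the parameters are hypothetical, and the table does not say how a non-responsive, non-critical sensor is rated, so the sketch assumes WARNING for that case.

```python
def classify_sensor_fault(condition, failed_count, total_count,
                          *, critical=False, persistent=False):
    """Return the severity name for a sensor fault per the rules in 4.1."""
    # "All sensors failed" takes precedence over any per-sensor rule.
    if total_count > 0 and failed_count >= total_count:
        return "FATAL"
    if condition == "disconnected":
        return "WARNING"  # other sensors are still available at this point
    if condition == "non_responsive":
        # ERROR for a critical sensor; WARNING otherwise is an assumption.
        return "ERROR" if critical else "WARNING"
    if condition == "out_of_range":
        return "ERROR" if persistent else "WARNING"
    raise ValueError(f"unknown sensor condition: {condition!r}")
```

Checking the failed-sensor count first keeps the "all sensors failed" rule from being masked by a lower per-sensor rating.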
4.2 Communication Fault Detection
| Condition | Detection Method | Severity Assignment |
| --- | --- | --- |
| Link temporarily lost | Heartbeat timeout (< 30s) | WARNING |
| Link persistently lost | Heartbeat timeout (> 5 minutes) | ERROR |
| Authentication failure | Security layer rejection | FATAL |
| Protocol error | Message parsing failure (3 consecutive) | ERROR |
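Informative example: link-loss rating and consecutive parse-failure counting might be sketched as follows. The table leaves the interval between 30 seconds and 5 minutes unspecified; this sketch assumes the fault stays at WARNING until the 5-minute limit is exceeded. All names (`classify_link_loss`, `ParseFailureCounter`, `PERSISTENT_LOSS_S`) are hypothetical.

```python
PERSISTENT_LOSS_S = 5 * 60  # "Link persistently lost" threshold from 4.2


def classify_link_loss(seconds_without_heartbeat):
    """Rate an already-detected heartbeat loss by its duration."""
    if seconds_without_heartbeat > PERSISTENT_LOSS_S:
        return "ERROR"
    return "WARNING"  # includes the unspecified 30 s to 5 min interval (assumption)


class ParseFailureCounter:
    """Tracks consecutive message parsing failures."""

    def __init__(self, limit=3):
        self.limit = limit
        self.consecutive = 0

    def record(self, parsed_ok):
        """Return "ERROR" once `limit` consecutive failures are seen, else None."""
        if parsed_ok:
            self.consecutive = 0  # any good message resets the run
            return None
        self.consecutive += 1
        return "ERROR" if self.consecutive >= self.limit else None
```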
4.3 Storage Fault Detection
| Condition | Detection Method | Severity Assignment |
| --- | --- | --- |
| Write failure (retry successful) | Write operation with retry | WARNING |
| Write failure (persistent) | Write operation failure (3 retries) | ERROR |
| SD card corruption | File system check failure | FATAL |
| Storage full | Available space < threshold | WARNING |
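Informative example: the two write-failure rows can be sketched as a wrapper that retries a write and reports the resulting severity. The function name and the use of `OSError` to signal a failed write are assumptions, as is returning `None` when the first attempt succeeds and no fault is raised.

```python
def write_with_retry(write, data, retries=3):
    """Attempt a write; return None, "WARNING", or "ERROR" per the rules in 4.3.

    `write` is any callable that raises OSError on failure (assumed interface).
    """
    try:
        write(data)
        return None  # first attempt succeeded: no fault
    except OSError:
        pass
    for _ in range(retries):
        try:
            write(data)
            return "WARNING"  # write failure, retry successful
        except OSError:
            continue
    return "ERROR"  # still failing after all retries: persistent failure
```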
4.4 Security Fault Detection
| Condition | Detection Method | Severity Assignment |
| --- | --- | --- |
| Secure boot failure | Boot verification failure | FATAL (always) |
| Key corruption | Cryptographic key validation failure | FATAL |
| Unauthorized access | Authentication failure (3 attempts) | FATAL |
| Message tampering | Integrity check failure | ERROR (if persistent → FATAL) |
5. Escalation Rules
5.1 Severity Escalation
| Current Severity | Escalation Trigger | New Severity | State Transition |
| --- | --- | --- | --- |
| INFO | N/A | N/A | None |
| WARNING | Same fault persists > 5 minutes | ERROR | WARNING → WARNING (feature degraded) |
| WARNING | Multiple warnings (≥3) | ERROR | WARNING → WARNING (feature degraded) |
| WARNING | Critical feature affected | FATAL | WARNING → FAULT |
| ERROR | Same fault persists > 10 minutes | FATAL | RUNNING → FAULT |
| ERROR | Cascading failures (≥2 features) | FATAL | RUNNING → FAULT |
| FATAL | N/A | N/A | RUNNING → FAULT |
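Informative example: the escalation triggers can be written as a pure function of the current severity and a few observations. This is a sketch only. The function name and parameters are hypothetical, and checking the critical-feature trigger before the other WARNING triggers is an assumed precedence, since the table does not order them.

```python
WARNING_PERSIST_LIMIT_S = 5 * 60   # WARNING -> ERROR
ERROR_PERSIST_LIMIT_S = 10 * 60    # ERROR -> FATAL


def escalate(severity, *, persisted_s=0.0, active_warnings=0,
             critical_feature=False, failed_features=0):
    """Return the severity after applying the escalation triggers in 5.1."""
    if severity == "WARNING":
        if critical_feature:
            return "FATAL"
        if persisted_s > WARNING_PERSIST_LIMIT_S or active_warnings >= 3:
            return "ERROR"
    elif severity == "ERROR":
        if persisted_s > ERROR_PERSIST_LIMIT_S or failed_features >= 2:
            return "FATAL"
    # INFO and FATAL never escalate; untriggered faults keep their severity.
    return severity
```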
5.2 Cascading Failure Detection
A cascading failure is detected when any of the following occurs:
- Multiple independent features fail simultaneously
- A failure in one feature causes a failure in another
- A system resource (memory, CPU, storage) is exhausted
Response: Immediate escalation to FATAL and transition to the FAULT state.
6. Recovery Behaviors
6.1 Recovery Strategies by Severity
| Severity | Recovery Strategy | Retry Logic | State Impact |
| --- | --- | --- | --- |
| INFO | None | N/A | None |
| WARNING | Automatic retry, degraded operation | 3 retries with exponential backoff | Continue in WARNING state |
| ERROR | Feature isolation, automatic retry | 3 retries, then manual intervention | Feature disabled, system continues |
| FATAL | Controlled teardown, recovery attempt | Single recovery attempt, then manual | FAULT → TEARDOWN → INIT |
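Informative example: "3 retries with exponential backoff" could be sketched as below. This document does not specify the base delay or the growth factor; the 1-second base and the doubling are assumptions, and the function name and its `sleep` parameter (injectable so the logic can be exercised without waiting) are hypothetical.

```python
import time


def retry_with_backoff(operation, retries=3, base_delay_s=1.0, sleep=time.sleep):
    """Call `operation` up to `retries` times, doubling the wait between attempts.

    `operation` returns True on success. Returns True if any attempt succeeds.
    """
    for attempt in range(retries):
        if operation():
            return True
        if attempt < retries - 1:
            sleep(base_delay_s * 2 ** attempt)  # 1 s, 2 s, ... (assumed schedule)
    return False
```

A `False` result corresponds to the point where the table hands over to degraded operation or manual intervention.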
6.2 Recovery Time Limits
| Fault Type | Maximum Recovery Time | Recovery Action |
| --- | --- | --- |
| Sensor (WARNING) | 5 minutes | Automatic retry, sensor exclusion |
| Communication (WARNING) | 30 seconds | Automatic reconnection |
| Storage (WARNING) | 10 seconds | Retry write operation |
| Sensor (ERROR) | Manual intervention | Sensor marked as failed |
| Communication (ERROR) | Manual intervention | Communication feature disabled |
| Storage (ERROR) | Manual intervention | Persistence disabled, system continues |
| FATAL (any) | 60 seconds | Controlled teardown and recovery attempt |
6.3 Latching Behavior
| Severity | Latching Rule | Clear Condition |
| --- | --- | --- |
| INFO | Not latched | Overwritten by new event |
| WARNING | Latched until cleared | Fault condition cleared + manual clear OR automatic clear after 1 hour |
| ERROR | Latched until cleared | Manual clear via diagnostic session OR system reset |
| FATAL | Latched until cleared | Manual clear via diagnostic session OR system reset |
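Informative example: the clear conditions can be sketched as a predicate. The WARNING rule is ambiguous as written; this sketch assumes it means "the fault condition has cleared AND (a manual clear was requested OR the fault has been latched for at least one hour)". The function name and parameters are hypothetical.

```python
AUTO_CLEAR_AFTER_S = 60 * 60  # automatic clear after 1 hour (WARNING only)


def may_clear(severity, *, condition_cleared=False, manual_clear=False,
              latched_for_s=0.0, system_reset=False):
    """Return True if a latched fault of this severity may be cleared."""
    if severity == "INFO":
        return True  # not latched; simply overwritten by the next event
    if severity == "WARNING":
        # Assumed reading of the clear condition, see the note above.
        return condition_cleared and (
            manual_clear or latched_for_s >= AUTO_CLEAR_AFTER_S)
    if severity in ("ERROR", "FATAL"):
        return manual_clear or system_reset
    raise ValueError(f"unknown severity: {severity!r}")
```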
7. Fault Reporting
7.1 Reporting Channels
| Severity | Local HMI | Diagnostic Log | Main Hub | Diagnostic Session |
| --- | --- | --- | --- | --- |
| INFO | Optional | Yes | No | Yes |
| WARNING | Yes (status indicator) | Yes | Yes (periodic) | Yes |
| ERROR | Yes (status indicator) | Yes | Yes (immediate) | Yes |
| FATAL | Yes (status indicator) | Yes | Yes (immediate) | Yes |
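Informative example: the reporting matrix lends itself to a lookup table. In this sketch the channel keys (`hmi`, `log`, `main_hub`, `session`) and the function name are hypothetical; the values mirror the table above, with `None` standing for "No".

```python
# Reporting channels per severity, transcribed from the table in 7.1.
REPORTING = {
    "INFO":    {"hmi": "optional", "log": True, "main_hub": None, "session": True},
    "WARNING": {"hmi": "status indicator", "log": True, "main_hub": "periodic", "session": True},
    "ERROR":   {"hmi": "status indicator", "log": True, "main_hub": "immediate", "session": True},
    "FATAL":   {"hmi": "status indicator", "log": True, "main_hub": "immediate", "session": True},
}


def main_hub_mode(severity):
    """Return "periodic", "immediate", or None (not reported to the Main Hub)."""
    return REPORTING[severity]["main_hub"]
```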
7.2 Diagnostic Event Structure
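Informative example: a hypothetical event record, assuming only the fields implied elsewhere in this document (the diagnostic code, the severity, and the timestamp and source association named under SR-DIAG-004). The class name, field names, and types are illustrative assumptions, not the normative structure.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DiagnosticEvent:
    """Hypothetical diagnostic event record; all field names are assumed."""
    code: str          # e.g. "DIAG-SN-TEMP-0001"
    severity: str      # "INFO", "WARNING", "ERROR", or "FATAL"
    source: str        # reporting component (assumed free-form identifier)
    timestamp: float = field(default_factory=time.time)  # seconds since epoch
    description: str = ""
```

Making the record immutable (`frozen=True`) is a design choice of this sketch, so a logged event cannot be altered after it is created.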
8. Integration with State Machine
8.1 Fault-to-State Mapping
| Fault Severity | Current State | Target State | Transition Trigger |
| --- | --- | --- | --- |
| INFO | Any | Same | None (no state change) |
| WARNING | RUNNING | WARNING | First WARNING fault |
| WARNING | WARNING | WARNING | Additional WARNING (latched) |
| ERROR | RUNNING | RUNNING | Feature isolation, continue |
| ERROR | WARNING | WARNING | Feature isolation, continue |
| FATAL | RUNNING | FAULT | First FATAL fault |
| FATAL | WARNING | FAULT | Escalation to FATAL |
| FATAL | FAULT | FAULT | Additional FATAL (latched) |
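Informative example: the fault-to-state mapping can be sketched as a transition table keyed by (severity, current state). The names `TRANSITIONS` and `next_state` are hypothetical, and leaving the state unchanged for combinations the table does not list is an assumption.

```python
# (fault severity, current state) -> target state, transcribed from 8.1.
TRANSITIONS = {
    ("WARNING", "RUNNING"): "WARNING",
    ("WARNING", "WARNING"): "WARNING",
    ("ERROR", "RUNNING"): "RUNNING",
    ("ERROR", "WARNING"): "WARNING",
    ("FATAL", "RUNNING"): "FAULT",
    ("FATAL", "WARNING"): "FAULT",
    ("FATAL", "FAULT"): "FAULT",
}


def next_state(severity, current_state):
    """Return the target state for a fault of `severity` raised in `current_state`."""
    if severity == "INFO":
        return current_state  # INFO never changes state
    # Unlisted combinations keep the current state (assumption).
    return TRANSITIONS.get((severity, current_state), current_state)
```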
8.2 State-Dependent Fault Handling
| State | Fault Handling Behavior |
| --- | --- |
| INIT | Boot-time faults → BOOT_FAILURE if security-related |
| RUNNING | Full fault detection and handling |
| WARNING | Fault escalation monitoring, recovery attempts |
| FAULT | Fault logging only, recovery attempt preparation |
| OTA_PREP | OTA-related faults only, others deferred |
| OTA_UPDATE | OTA progress faults only |
| TEARDOWN | Fault logging only, no new fault detection |
| SERVICE | Fault inspection only, no new fault detection |
9. Error Handler Responsibilities
The Error Handler component SHALL:
- Receive fault reports from all components
- Classify faults according to taxonomy
- Determine severity and escalation
- Trigger state transitions when required
- Manage fault latching and clearing
- Coordinate recovery attempts
- Report faults to diagnostics and Main Hub
10. Traceability
- SR-DIAG-001: Implemented via diagnostic code framework
- SR-DIAG-002: Implemented via unique diagnostic code assignment
- SR-DIAG-003: Implemented via severity classification
- SR-DIAG-004: Implemented via timestamp and source association
- SR-SYS-002: Implemented via fault-to-state mapping
- SR-SYS-004: Implemented via FATAL fault → TEARDOWN transition
11. Mermaid Fault Escalation Diagram