System Design/Failure_Handling_Model.md
# Failure Handling Model Specification

**Document Type:** Normative System Specification
**Scope:** Sensor Hub (Sub-Hub) Fault Detection, Classification, and Recovery
**Traceability:** SR-DIAG-001 through SR-DIAG-011, SR-SYS-002, SR-SYS-004

## 1. Purpose

This document defines the fault taxonomy, escalation rules, recovery behaviors, and integration with the system state machine. All components SHALL adhere to this failure handling model.

## 2. Fault Taxonomy

### 2.1 Severity Levels

| Severity | Code | Description | State Impact | Recovery Behavior |
|----------|------|-------------|--------------|-------------------|
| **INFO** | `DIAG_SEV_INFO` | Informational event, no action required | None | Log only |
| **WARNING** | `DIAG_SEV_WARNING` | Non-critical fault, degraded operation | `RUNNING` → `WARNING` | Continue with reduced functionality |
| **ERROR** | `DIAG_SEV_ERROR` | Critical fault, feature disabled | Feature-specific | Feature isolation, retry logic |
| **FATAL** | `DIAG_SEV_FATAL` | System-critical fault, core functionality disabled | `RUNNING` → `FAULT` | Controlled teardown, recovery attempt |

### 2.2 Fault Categories

| Category | Description | Examples | Typical Severity |
|----------|-------------|----------|------------------|
| **SENSOR** | Sensor hardware or communication failure | Disconnection, out-of-range, non-responsive | WARNING (single), ERROR (multiple), FATAL (all) |
| **COMMUNICATION** | Network or protocol failure | Link loss, timeout, authentication failure | WARNING (temporary), ERROR (persistent), FATAL (critical) |
| **STORAGE** | Persistence or storage medium failure | SD card failure, NVM corruption, write failure | WARNING (degraded), ERROR (persistent), FATAL (critical) |
| **SECURITY** | Security violation or authentication failure | Secure boot failure, key corruption, unauthorized access | FATAL (always) |
| **SYSTEM** | System resource or configuration failure | Memory exhaustion, task failure, configuration error | ERROR (recoverable), FATAL (unrecoverable) |
| **OTA** | Firmware update failure | Validation failure, transfer error, flash error | ERROR (retry), FATAL (rollback) |
| **CALIBRATION** | Calibration or machine constants failure | Invalid MC, calibration error, sensor mismatch | WARNING (single), ERROR (critical) |

## 3. Diagnostic Code Structure

### 3.1 Diagnostic Code Format

```
DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>
```

- **CATEGORY:** Two-letter category code: SN (SENSOR), CM (COMMUNICATION), ST (STORAGE), SC (SECURITY), SY (SYSTEM), OT (OTA), CL (CALIBRATION)
- **COMPONENT:** Component identifier (e.g., TEMP, HUM, CO2, NET, SD, OTA)
- **NUMBER:** Unique four-digit, zero-padded fault number (0001-9999)

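As an illustration of the format above, the following C sketch builds and validates a code string. The helper names `diag_code_format` and `diag_code_parse`, the 7-character component limit, and the buffer sizes are assumptions made for this example; the specification does not define them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>" into buf.
 * Returns 0 on success, -1 if an argument is out of range or buf is too small. */
int diag_code_format(char *buf, size_t len, const char *category,
                     const char *component, unsigned number)
{
    if (strlen(category) != 2 || number < 1 || number > 9999) {
        return -1;
    }
    int n = snprintf(buf, len, "DIAG-%s-%s-%04u", category, component, number);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: splits a code string into its three fields.
 * category must hold 3 bytes and component 8 bytes (assumed limits). */
int diag_code_parse(const char *code, char category[3], char component[8],
                    unsigned *number)
{
    char check[32];

    if (sscanf(code, "DIAG-%2[A-Z]-%7[A-Z0-9]-%4u",
               category, component, number) != 3) {
        return -1;
    }
    /* Re-format the parsed fields and compare with the input. This rejects
     * trailing characters, missing zero padding, and out-of-range numbers. */
    if (diag_code_format(check, sizeof check, category, component, *number) != 0 ||
        strcmp(check, code) != 0) {
        return -1;
    }
    return 0;
}
```

Round-tripping through the formatter keeps the parser strict without duplicating the validation rules.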
### 3.2 Diagnostic Code Registry

| Code | Severity | Category | Component | Description |
|------|----------|----------|-----------|-------------|
| `DIAG-SN-TEMP-0001` | WARNING | SENSOR | Temperature | Temperature sensor disconnected |
| `DIAG-SN-TEMP-0002` | ERROR | SENSOR | Temperature | Temperature sensor out of range |
| `DIAG-SN-TEMP-0003` | FATAL | SENSOR | Temperature | All temperature sensors failed |
| `DIAG-CM-NET-0001` | WARNING | COMMUNICATION | Network | Main Hub link temporarily lost |
| `DIAG-CM-NET-0002` | ERROR | COMMUNICATION | Network | Main Hub link persistently lost |
| `DIAG-ST-SD-0001` | WARNING | STORAGE | SD Card | SD card write failure (retry successful) |
| `DIAG-ST-SD-0002` | ERROR | STORAGE | SD Card | SD card persistent write failure |
| `DIAG-ST-SD-0003` | FATAL | STORAGE | SD Card | SD card corruption detected |
| `DIAG-SC-BOOT-0001` | FATAL | SECURITY | Secure Boot | Secure boot verification failed |
| `DIAG-SY-MEM-0001` | ERROR | SYSTEM | Memory | Memory allocation failure |
| `DIAG-OT-FW-0001` | ERROR | OTA | Firmware | Firmware integrity validation failed |
| `DIAG-CL-MC-0001` | WARNING | CALIBRATION | Machine Constants | Invalid sensor slot configuration |

## 4. Fault Detection Rules

### 4.1 Sensor Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Sensor disconnected | Hardware presence signal | WARNING (if other sensors available) |
| Sensor non-responsive | Communication timeout (3 retries) | ERROR (if critical sensor) |
| Sensor out of range | Value validation against limits | WARNING (if single occurrence), ERROR (if persistent) |
| All sensors failed | Count of failed sensors = total | FATAL |

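The rows above can be combined into a single classifier, sketched below. How the rows interact (for example, that "all sensors failed" takes precedence over the other conditions) is an assumption of this example, as are the function name `sensor_fault_severity` and the numeric ordering of the enum values; only the `DIAG_SEV_*` identifiers come from Section 2.1.

```c
#include <stdbool.h>

/* Severity codes from Section 2.1; the numeric ordering is an assumption. */
typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical classifier, called after a sensor has been marked as failed.
 * failed:      number of sensors currently failed (including this one)
 * total:       number of sensors installed
 * is_critical: whether the sensor that just failed is a critical sensor */
diag_severity_t sensor_fault_severity(unsigned failed, unsigned total,
                                      bool is_critical)
{
    if (failed == 0) {
        return DIAG_SEV_INFO;      /* nothing has failed */
    }
    if (failed >= total) {
        return DIAG_SEV_FATAL;     /* count of failed sensors = total */
    }
    if (is_critical) {
        return DIAG_SEV_ERROR;     /* critical sensor lost */
    }
    return DIAG_SEV_WARNING;       /* other sensors are still available */
}
```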
### 4.2 Communication Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Link temporarily lost | Heartbeat timeout (< 30 s) | WARNING |
| Link persistently lost | Heartbeat timeout (> 5 minutes) | ERROR |
| Authentication failure | Security layer rejection | FATAL |
| Protocol error | Message parsing failure (3 consecutive) | ERROR |

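A minimal sketch of the heartbeat rows follows. The table does not say what severity applies when the link has been silent for between 30 seconds and 5 minutes; this example assumes the fault remains a WARNING in that range. The function name `link_loss_severity` and the idea of measuring "seconds since the last expected heartbeat was missed" are also assumptions.

```c
#include <stdint.h>

/* Severity codes from Section 2.1; the numeric ordering is an assumption. */
typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

#define LINK_PERSISTENT_LOSS_S 300u   /* 5 minutes, from the table above */

/* Hypothetical classifier.
 * silent_s: seconds elapsed since the first missed heartbeat (0 = link healthy). */
diag_severity_t link_loss_severity(uint32_t silent_s)
{
    if (silent_s == 0) {
        return DIAG_SEV_INFO;      /* heartbeat arrived on time */
    }
    if (silent_s > LINK_PERSISTENT_LOSS_S) {
        return DIAG_SEV_ERROR;     /* link persistently lost */
    }
    /* Covers the "< 30 s" row and, by assumption, the unspecified
     * range between 30 s and 5 minutes. */
    return DIAG_SEV_WARNING;
}
```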
### 4.3 Storage Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Write failure (retry successful) | Write operation with retry | WARNING |
| Write failure (persistent) | Write operation failure (3 retries) | ERROR |
| SD card corruption | File system check failure | FATAL |
| Storage full | Available space < threshold | WARNING |

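The two write-failure rows describe a retry loop whose outcome selects the severity. The sketch below shows one way to structure it; the `storage_write_fn` hook, the result enum, and the stand-in `flaky_write` function are hypothetical names introduced only for this example.

```c
#include <stddef.h>

/* Outcome of a write, mapped to the table above:
 * WRITE_OK_AFTER_RETRY -> WARNING, WRITE_FAILED -> ERROR. */
typedef enum {
    WRITE_OK,
    WRITE_OK_AFTER_RETRY,
    WRITE_FAILED
} write_result_t;

/* Hypothetical low-level write hook: returns 0 on success, non-zero on failure. */
typedef int (*storage_write_fn)(const void *data, size_t len);

#define WRITE_RETRY_LIMIT 3   /* "3 retries" from the table above */

write_result_t storage_write_with_retry(storage_write_fn write_fn,
                                        const void *data, size_t len)
{
    if (write_fn(data, len) == 0) {
        return WRITE_OK;                  /* first attempt succeeded, no fault */
    }
    for (int retry = 0; retry < WRITE_RETRY_LIMIT; retry++) {
        if (write_fn(data, len) == 0) {
            return WRITE_OK_AFTER_RETRY;  /* report as WARNING */
        }
    }
    return WRITE_FAILED;                  /* report as ERROR */
}

/* Stand-in writer for demonstration: fails the next flaky_failures_left calls. */
int flaky_failures_left = 0;

int flaky_write(const void *data, size_t len)
{
    (void)data;
    (void)len;
    if (flaky_failures_left > 0) {
        flaky_failures_left--;
        return -1;
    }
    return 0;
}
```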
### 4.4 Security Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Secure boot failure | Boot verification failure | FATAL (always) |
| Key corruption | Cryptographic key validation failure | FATAL |
| Unauthorized access | Authentication failure (3 attempts) | FATAL |
| Message tampering | Integrity check failure | ERROR (FATAL if persistent) |

## 5. Escalation Rules

### 5.1 Severity Escalation

| Current Severity | Escalation Trigger | New Severity | State Transition |
|------------------|--------------------|--------------|------------------|
| INFO | N/A | N/A | None |
| WARNING | Same fault persists > 5 minutes | ERROR | `WARNING` → `WARNING` (feature degraded) |
| WARNING | Multiple warnings (≥3) | ERROR | `WARNING` → `WARNING` (feature degraded) |
| WARNING | Critical feature affected | FATAL | `WARNING` → `FAULT` |
| ERROR | Same fault persists > 10 minutes | FATAL | `RUNNING` → `FAULT` |
| ERROR | Cascading failures (≥2 features) | FATAL | `RUNNING` → `FAULT` |
| FATAL | N/A | N/A | `RUNNING` → `FAULT` |

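The escalation table can be read as a pure function from the current severity and a few counters to the new severity. The sketch below is one possible encoding; the function name `diag_escalate`, its parameter list, and the evaluation order of the triggers are assumptions, not part of the specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Severity codes from Section 2.1; the numeric ordering is an assumption. */
typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

#define WARNING_PERSIST_LIMIT_S  300u  /* 5 minutes */
#define ERROR_PERSIST_LIMIT_S    600u  /* 10 minutes */
#define WARNING_COUNT_LIMIT        3u  /* multiple warnings (>= 3) */
#define CASCADE_FEATURE_LIMIT      2u  /* cascading failures (>= 2 features) */

/* Hypothetical escalation check.
 * persisted_s:      how long the same fault has been active, in seconds
 * active_warnings:  number of WARNING faults currently active
 * failed_features:  number of features currently failed
 * critical_feature: whether the fault affects a critical feature */
diag_severity_t diag_escalate(diag_severity_t current, uint32_t persisted_s,
                              unsigned active_warnings, unsigned failed_features,
                              bool critical_feature)
{
    switch (current) {
    case DIAG_SEV_WARNING:
        if (critical_feature) {
            return DIAG_SEV_FATAL;
        }
        if (persisted_s > WARNING_PERSIST_LIMIT_S ||
            active_warnings >= WARNING_COUNT_LIMIT) {
            return DIAG_SEV_ERROR;
        }
        return DIAG_SEV_WARNING;
    case DIAG_SEV_ERROR:
        if (persisted_s > ERROR_PERSIST_LIMIT_S ||
            failed_features >= CASCADE_FEATURE_LIMIT) {
            return DIAG_SEV_FATAL;
        }
        return DIAG_SEV_ERROR;
    default:
        return current;   /* INFO and FATAL do not escalate */
    }
}
```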
### 5.2 Cascading Failure Detection

A cascading failure is detected when any of the following conditions holds:
- Multiple independent features fail simultaneously
- A failure in one feature causes a failure in another
- A system resource (memory, CPU, storage) is exhausted

**Response:** Immediate escalation to FATAL, transition to `FAULT` state.

## 6. Recovery Behaviors

### 6.1 Recovery Strategies by Severity

| Severity | Recovery Strategy | Retry Logic | State Impact |
|----------|-------------------|-------------|--------------|
| **INFO** | None | N/A | None |
| **WARNING** | Automatic retry, degraded operation | 3 retries with exponential backoff | Continue in `WARNING` state |
| **ERROR** | Feature isolation, automatic retry | 3 retries, then manual intervention | Feature disabled, system continues |
| **FATAL** | Controlled teardown, recovery attempt | Single recovery attempt, then manual | `FAULT` → `TEARDOWN` → `INIT` |

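The WARNING row calls for three retries with exponential backoff but does not give a base delay. The sketch below assumes a 100 ms base that doubles on each retry; the base value, the function names, and the callback-style hooks for the retried operation and the delay are all assumptions of this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define RETRY_LIMIT      3u     /* "3 retries" from the table above */
#define BACKOFF_BASE_MS  100u   /* assumed; the specification gives no base delay */

/* Delay before retry number `retry` (0-based): 100, 200, 400 ms. */
uint32_t backoff_delay_ms(unsigned retry)
{
    return BACKOFF_BASE_MS << retry;
}

/* Hypothetical retry driver.
 * attempt:  performs one recovery attempt, returns true on success
 * sleep_ms: platform delay hook (for example, an RTOS task delay)
 * Returns true if any retry succeeded, false once all retries are used up. */
bool retry_with_backoff(bool (*attempt)(void *ctx),
                        void (*sleep_ms)(uint32_t ms), void *ctx)
{
    for (unsigned retry = 0; retry < RETRY_LIMIT; retry++) {
        sleep_ms(backoff_delay_ms(retry));
        if (attempt(ctx)) {
            return true;
        }
    }
    return false;   /* caller escalates or requests manual intervention */
}
```

Passing the delay as a hook keeps the loop independent of any particular scheduler and makes it straightforward to exercise in a host-side test.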
### 6.2 Recovery Time Limits

| Fault Type | Maximum Recovery Time | Recovery Action |
|------------|-----------------------|-----------------|
| Sensor (WARNING) | 5 minutes | Automatic retry, sensor exclusion |
| Communication (WARNING) | 30 seconds | Automatic reconnection |
| Storage (WARNING) | 10 seconds | Retry write operation |
| Sensor (ERROR) | Not bounded (manual intervention required) | Sensor marked as failed |
| Communication (ERROR) | Not bounded (manual intervention required) | Communication feature disabled |
| Storage (ERROR) | Not bounded (manual intervention required) | Persistence disabled, system continues |
| FATAL (any) | 60 seconds | Controlled teardown and recovery attempt |

### 6.3 Latching Behavior

| Severity | Latching Rule | Clear Condition |
|----------|---------------|-----------------|
| **INFO** | Not latched | Overwritten by new event |
| **WARNING** | Latched until cleared | (Fault condition cleared AND manual clear) OR automatic clear after 1 hour |
| **ERROR** | Latched until cleared | Manual clear via diagnostic session OR system reset |
| **FATAL** | Latched until cleared | Manual clear via diagnostic session OR system reset |

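The clear conditions above reduce to a small predicate. In this sketch the WARNING rule is read as "(condition cleared AND manual clear) OR one hour elapsed", and a system reset is treated as out of scope because it clears all latches by restarting the firmware. The function name `fault_may_clear` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Severity codes from Section 2.1; the numeric ordering is an assumption. */
typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

#define WARNING_AUTO_CLEAR_S 3600u   /* automatic clear after 1 hour */

/* Hypothetical predicate: may the latched fault be cleared now?
 * condition_cleared: the underlying fault condition is no longer present
 * manual_clear:      a clear was requested via a diagnostic session
 * latched_s:         seconds since the fault was latched */
bool fault_may_clear(diag_severity_t severity, bool condition_cleared,
                     bool manual_clear, uint32_t latched_s)
{
    switch (severity) {
    case DIAG_SEV_INFO:
        return true;   /* not latched, simply overwritten */
    case DIAG_SEV_WARNING:
        return (condition_cleared && manual_clear) ||
               latched_s >= WARNING_AUTO_CLEAR_S;
    case DIAG_SEV_ERROR:
    case DIAG_SEV_FATAL:
        return manual_clear;   /* system reset is handled outside this function */
    }
    return false;
}
```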
## 7. Fault Reporting

### 7.1 Reporting Channels

| Severity | Local HMI | Diagnostic Log | Main Hub | Diagnostic Session |
|----------|-----------|----------------|----------|--------------------|
| **INFO** | Optional | Yes | No | Yes |
| **WARNING** | Yes (status indicator) | Yes | Yes (periodic) | Yes |
| **ERROR** | Yes (status indicator) | Yes | Yes (immediate) | Yes |
| **FATAL** | Yes (status indicator) | Yes | Yes (immediate) | Yes |

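One compact way to encode the channel matrix is a bitmask per severity, sketched below. The `REPORT_CH_*` flag names and both function names are assumptions; the optional HMI output for INFO events is left out of the mask in this example.

```c
#include <stdbool.h>

/* Severity codes from Section 2.1; the numeric ordering is an assumption. */
typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical channel flags, one bit per column of the table above. */
enum {
    REPORT_CH_HMI      = 1u << 0,
    REPORT_CH_LOG      = 1u << 1,
    REPORT_CH_MAIN_HUB = 1u << 2,
    REPORT_CH_SESSION  = 1u << 3
};

/* Returns the set of channels a fault of the given severity is reported on. */
unsigned report_channels(diag_severity_t severity)
{
    if (severity == DIAG_SEV_INFO) {
        /* HMI is optional for INFO and omitted here; Main Hub is not notified. */
        return REPORT_CH_LOG | REPORT_CH_SESSION;
    }
    return REPORT_CH_HMI | REPORT_CH_LOG | REPORT_CH_MAIN_HUB | REPORT_CH_SESSION;
}

/* ERROR and FATAL go to the Main Hub immediately; WARNING is sent periodically. */
bool main_hub_report_is_immediate(diag_severity_t severity)
{
    return severity == DIAG_SEV_ERROR || severity == DIAG_SEV_FATAL;
}
```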
### 7.2 Diagnostic Event Structure

```c
#include <stdbool.h>
#include <stdint.h>

// diag_severity_t and fault_category_t are assumed to be declared before this point.
typedef struct {
    uint32_t         diagnostic_code;   // Unique diagnostic code
    diag_severity_t  severity;          // INFO, WARNING, ERROR, FATAL
    uint64_t         timestamp;         // System timestamp (microseconds)
    const char*      source_component;  // Component identifier
    uint32_t         occurrence_count;  // Number of occurrences
    bool             is_latched;        // Latching status
    fault_category_t category;          // SENSOR, COMMUNICATION, etc.
} diagnostic_event_t;
```

## 8. Integration with State Machine

### 8.1 Fault-to-State Mapping

| Fault Severity | Current State | Target State | Transition Trigger |
|----------------|---------------|--------------|--------------------|
| INFO | Any | Same | None (no state change) |
| WARNING | `RUNNING` | `WARNING` | First WARNING fault |
| WARNING | `WARNING` | `WARNING` | Additional WARNING (latched) |
| ERROR | `RUNNING` | `RUNNING` | Feature isolation, continue |
| ERROR | `WARNING` | `WARNING` | Feature isolation, continue |
| FATAL | `RUNNING` | `FAULT` | First FATAL fault |
| FATAL | `WARNING` | `FAULT` | Escalation to FATAL |
| FATAL | `FAULT` | `FAULT` | Additional FATAL (latched) |

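The mapping above is small enough to express as a transition function. The sketch below covers only the three states that appear in the table; the `STATE_*` names, the `system_state_t` type, and the choice to leave the state unchanged for combinations the table does not list are assumptions of this example.

```c
/* Severity codes from Section 2.1; the numeric ordering is an assumption. */
typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical subset of the system states, limited to those in the table. */
typedef enum {
    STATE_RUNNING,
    STATE_WARNING,
    STATE_FAULT
} system_state_t;

/* Returns the target state for a fault of the given severity. */
system_state_t fault_target_state(system_state_t current, diag_severity_t severity)
{
    switch (severity) {
    case DIAG_SEV_FATAL:
        return STATE_FAULT;   /* from RUNNING, WARNING, or FAULT */
    case DIAG_SEV_WARNING:
        /* First WARNING moves RUNNING to WARNING; otherwise stay put. */
        return (current == STATE_RUNNING) ? STATE_WARNING : current;
    case DIAG_SEV_INFO:
    case DIAG_SEV_ERROR:
    default:
        return current;       /* INFO: no change; ERROR: feature isolation only */
    }
}
```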
### 8.2 State-Dependent Fault Handling

| State | Fault Handling Behavior |
|-------|-------------------------|
| `INIT` | Boot-time faults → `BOOT_FAILURE` if security-related |
| `RUNNING` | Full fault detection and handling |
| `WARNING` | Fault escalation monitoring, recovery attempts |
| `FAULT` | Fault logging only, recovery attempt preparation |
| `OTA_PREP` | OTA-related faults only, others deferred |
| `OTA_UPDATE` | OTA progress faults only |
| `TEARDOWN` | Fault logging only, no new fault detection |
| `SERVICE` | Fault inspection only, no new fault detection |

## 9. Error Handler Responsibilities

The Error Handler component SHALL:
1. Receive fault reports from all components
2. Classify faults according to taxonomy
3. Determine severity and escalation
4. Trigger state transitions when required
5. Manage fault latching and clearing
6. Coordinate recovery attempts
7. Report faults to diagnostics and Main Hub

## 10. Traceability

- **SR-DIAG-001:** Implemented via diagnostic code framework
- **SR-DIAG-002:** Implemented via unique diagnostic code assignment
- **SR-DIAG-003:** Implemented via severity classification
- **SR-DIAG-004:** Implemented via timestamp and source association
- **SR-SYS-002:** Implemented via fault-to-state mapping
- **SR-SYS-004:** Implemented via FATAL fault → TEARDOWN transition

## 11. Mermaid Fault Escalation Diagram

```mermaid
flowchart TD
    FaultDetected[Fault Detected] --> ClassifySeverity{Classify Severity}
    ClassifySeverity -->|INFO| LogOnly[Log Only]
    ClassifySeverity -->|WARNING| CheckState1{Current State?}
    ClassifySeverity -->|ERROR| IsolateFeature[Isolate Feature]
    ClassifySeverity -->|FATAL| TriggerFaultState[Trigger FAULT State]

    CheckState1 -->|RUNNING| TransitionWarning[Transition to WARNING]
    CheckState1 -->|WARNING| LatchWarning[Latch Warning]

    IsolateFeature --> RetryLogic{Retry Logic}
    RetryLogic -->|Success| ClearError[Clear Error]
    RetryLogic -->|Failure| EscalateToFatal{Escalate?}
    EscalateToFatal -->|Yes| TriggerFaultState
    EscalateToFatal -->|No| ManualIntervention[Manual Intervention]

    TriggerFaultState --> TeardownSequence[Initiate Teardown]
    TeardownSequence --> RecoveryAttempt{Recovery Attempt}
    RecoveryAttempt -->|Success| ResetToInit[Reset to INIT]
    RecoveryAttempt -->|Failure| ManualIntervention
```