2026-01-19 16:19:41 +01:00
commit edd3e96591
301 changed files with 36763 additions and 0 deletions

# Failure Handling Model Specification
**Document Type:** Normative System Specification
**Scope:** Sensor Hub (Sub-Hub) Fault Detection, Classification, and Recovery
**Traceability:** SR-DIAG-001 through SR-DIAG-011, SR-SYS-002, SR-SYS-004
## 1. Purpose
This document defines the fault taxonomy, escalation rules, recovery behaviors, and integration with the system state machine. All components SHALL adhere to this failure handling model.
## 2. Fault Taxonomy
### 2.1 Severity Levels
| Severity | Code | Description | State Impact | Recovery Behavior |
|----------|------|-------------|--------------|-------------------|
| **INFO** | `DIAG_SEV_INFO` | Informational event, no action required | None | Log only |
| **WARNING** | `DIAG_SEV_WARNING` | Non-critical fault, degraded operation | `RUNNING` → `WARNING` | Continue with reduced functionality |
| **ERROR** | `DIAG_SEV_ERROR` | Critical fault, feature disabled | Feature-specific | Feature isolation, retry logic |
| **FATAL** | `DIAG_SEV_FATAL` | System-critical fault, core functionality disabled | `RUNNING` → `FAULT` | Controlled teardown, recovery attempt |
### 2.2 Fault Categories
| Category | Description | Examples | Typical Severity |
|----------|-------------|----------|------------------|
| **SENSOR** | Sensor hardware or communication failure | Disconnection, out-of-range, non-responsive | WARNING (single), ERROR (multiple), FATAL (all) |
| **COMMUNICATION** | Network or protocol failure | Link loss, timeout, authentication failure | WARNING (temporary), ERROR (persistent), FATAL (critical) |
| **STORAGE** | Persistence or storage medium failure | SD card failure, NVM corruption, write failure | WARNING (degraded), ERROR (persistent), FATAL (critical) |
| **SECURITY** | Security violation or authentication failure | Secure boot failure, key corruption, unauthorized access | FATAL (always) |
| **SYSTEM** | System resource or configuration failure | Memory exhaustion, task failure, configuration error | ERROR (recoverable), FATAL (unrecoverable) |
| **OTA** | Firmware update failure | Validation failure, transfer error, flash error | ERROR (retry), FATAL (rollback) |
| **CALIBRATION** | Calibration or machine constants failure | Invalid MC, calibration error, sensor mismatch | WARNING (single), ERROR (critical) |
## 3. Diagnostic Code Structure
### 3.1 Diagnostic Code Format
```
DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>
```
- **CATEGORY:** Two-letter code (SN, CM, ST, SC, SY, OT, CL)
- **COMPONENT:** Component identifier (e.g., TEMP, HUM, CO2, NET, SD, OTA)
- **NUMBER:** Unique fault number (0001-9999)
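The format above can be sketched as a small formatting helper. This is a minimal illustration, not part of the specification: the function name `diag_code_format` and its return convention are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds "DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>".
 * Returns 0 on success, -1 if an input is out of spec or the buffer is too small. */
static int diag_code_format(char *out, size_t out_size,
                            const char *category, const char *component,
                            unsigned number)
{
    assert(out != NULL && category != NULL && component != NULL);

    /* CATEGORY is a two-letter code; NUMBER must lie in 0001-9999. */
    if (strlen(category) != 2 || component[0] == '\0' ||
        number < 1 || number > 9999) {
        return -1;
    }

    int written = snprintf(out, out_size, "DIAG-%s-%s-%04u",
                           category, component, number);

    /* snprintf reports the length it wanted; anything >= out_size was truncated. */
    return (written > 0 && (size_t)written < out_size) ? 0 : -1;
}
```

The helper does not check that the category is one of the seven registered two-letter codes; a real implementation would likely validate against the registry.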
### 3.2 Diagnostic Code Registry
| Code | Severity | Category | Component | Description |
|------|----------|----------|-----------|-------------|
| `DIAG-SN-TEMP-0001` | WARNING | SENSOR | Temperature | Temperature sensor disconnected |
| `DIAG-SN-TEMP-0002` | ERROR | SENSOR | Temperature | Temperature sensor out of range |
| `DIAG-SN-TEMP-0003` | FATAL | SENSOR | Temperature | All temperature sensors failed |
| `DIAG-CM-NET-0001` | WARNING | COMMUNICATION | Network | Main Hub link temporarily lost |
| `DIAG-CM-NET-0002` | ERROR | COMMUNICATION | Network | Main Hub link persistently lost |
| `DIAG-ST-SD-0001` | WARNING | STORAGE | SD Card | SD card write failure (retry successful) |
| `DIAG-ST-SD-0002` | ERROR | STORAGE | SD Card | SD card persistent write failure |
| `DIAG-ST-SD-0003` | FATAL | STORAGE | SD Card | SD card corruption detected |
| `DIAG-SC-BOOT-0001` | FATAL | SECURITY | Secure Boot | Secure boot verification failed |
| `DIAG-SY-MEM-0001` | ERROR | SYSTEM | Memory | Memory allocation failure |
| `DIAG-OT-FW-0001` | ERROR | OTA | Firmware | Firmware integrity validation failed |
| `DIAG-CL-MC-0001` | WARNING | CALIBRATION | Machine Constants | Invalid sensor slot configuration |
## 4. Fault Detection Rules
### 4.1 Sensor Fault Detection
| Condition | Detection Method | Severity Assignment |
|-----------|------------------|-------------------|
| Sensor disconnected | Hardware presence signal | WARNING (if other sensors available) |
| Sensor non-responsive | Communication timeout (3 retries) | ERROR (if critical sensor) |
| Sensor out of range | Value validation against limits | WARNING (if single occurrence), ERROR (if persistent) |
| All sensors failed | Count of failed sensors = total | FATAL |
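The count-based part of the sensor rules can be sketched as follows. This is an illustrative reading that combines the table above with the "WARNING (single), ERROR (multiple), FATAL (all)" guidance from Section 2.2; the function name `sensor_fault_severity` is hypothetical, and the per-sensor "critical sensor" and "persistent" qualifiers are deliberately left out.

```c
#include <assert.h>

typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical helper: severity derived only from how many sensors have failed.
 * Precondition: at least one sensor has failed and failed <= total. */
static diag_severity_t sensor_fault_severity(unsigned failed, unsigned total)
{
    assert(failed >= 1 && failed <= total);

    if (failed == total) {
        return DIAG_SEV_FATAL;    /* count of failed sensors = total */
    }
    if (failed > 1) {
        return DIAG_SEV_ERROR;    /* multiple sensors failed, some remain */
    }
    return DIAG_SEV_WARNING;      /* single failure, other sensors available */
}
```

Note that a hub with exactly one sensor goes straight to FATAL when that sensor fails, since the "all sensors failed" condition is checked first.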
### 4.2 Communication Fault Detection
| Condition | Detection Method | Severity Assignment |
|-----------|------------------|-------------------|
| Link temporarily lost | Heartbeat timeout (< 30s) | WARNING |
| Link persistently lost | Heartbeat timeout (> 5 minutes) | ERROR |
| Authentication failure | Security layer rejection | FATAL |
| Protocol error | Message parsing failure (3 consecutive) | ERROR |
### 4.3 Storage Fault Detection
| Condition | Detection Method | Severity Assignment |
|-----------|------------------|-------------------|
| Write failure (retry successful) | Write operation with retry | WARNING |
| Write failure (persistent) | Write operation failure (3 retries) | ERROR |
| SD card corruption | File system check failure | FATAL |
| Storage full | Available space < threshold | WARNING |
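The two write-failure rows can be illustrated with a retry wrapper. This is a sketch under stated assumptions: "3 retries" is read as one initial attempt plus up to three retries, and `storage_write_fn`, `storage_write_result_t`, `storage_write_with_retry`, and `demo_flaky_write` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical callback: performs one write attempt, returns true on success. */
typedef bool (*storage_write_fn)(void *ctx);

typedef struct {
    bool            written;       /* data reached the medium */
    bool            fault_raised;  /* a diagnostic event should be reported */
    diag_severity_t severity;      /* meaningful only when fault_raised is true */
} storage_write_result_t;

#define STORAGE_MAX_RETRIES 3u

static storage_write_result_t storage_write_with_retry(storage_write_fn write_once,
                                                       void *ctx)
{
    assert(write_once != NULL);
    storage_write_result_t result = { false, false, DIAG_SEV_INFO };

    if (write_once(ctx)) {             /* first attempt succeeded: no fault */
        result.written = true;
        return result;
    }
    for (unsigned retry = 0; retry < STORAGE_MAX_RETRIES; ++retry) {
        if (write_once(ctx)) {         /* write failure, retry successful */
            result.written = true;
            result.fault_raised = true;
            result.severity = DIAG_SEV_WARNING;
            return result;
        }
    }
    result.fault_raised = true;        /* persistent write failure */
    result.severity = DIAG_SEV_ERROR;
    return result;
}

/* Demo stub for trying the wrapper: fails until the counter in ctx reaches zero. */
static bool demo_flaky_write(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        --*failures_left;
        return false;
    }
    return true;
}
```

Corruption and storage-full detection are separate checks and are not covered by this wrapper.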
### 4.4 Security Fault Detection
| Condition | Detection Method | Severity Assignment |
|-----------|------------------|-------------------|
| Secure boot failure | Boot verification failure | FATAL (always) |
| Key corruption | Cryptographic key validation failure | FATAL |
| Unauthorized access | Authentication failure (3 attempts) | FATAL |
| Message tampering | Integrity check failure | ERROR (FATAL if persistent) |
## 5. Escalation Rules
### 5.1 Severity Escalation
| Current Severity | Escalation Trigger | New Severity | State Transition |
|------------------|-------------------|--------------|-----------------|
| INFO | N/A | N/A | None |
| WARNING | Same fault persists > 5 minutes | ERROR | `WARNING` → `WARNING` (feature degraded) |
| WARNING | Multiple warnings (≥3) | ERROR | `WARNING` → `WARNING` (feature degraded) |
| WARNING | Critical feature affected | FATAL | `WARNING` → `FAULT` |
| ERROR | Same fault persists > 10 minutes | FATAL | `RUNNING` → `FAULT` |
| ERROR | Cascading failures (≥2 features) | FATAL | `RUNNING` → `FAULT` |
| FATAL | N/A | N/A | `RUNNING` → `FAULT` |
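The escalation table can be read as a pure function from the current fault situation to a new severity. The sketch below is one possible encoding, not the normative implementation: `escalation_input_t` and `escalate` are hypothetical names, and the function applies a single escalation step per call rather than chaining WARNING through ERROR to FATAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical snapshot of everything the escalation triggers refer to. */
typedef struct {
    diag_severity_t severity;                  /* current severity of the fault */
    uint32_t        persisted_s;               /* how long the same fault has persisted */
    unsigned        active_warnings;           /* number of active WARNING faults */
    unsigned        failed_features;           /* number of features currently failed */
    bool            critical_feature_affected;
} escalation_input_t;

static diag_severity_t escalate(const escalation_input_t *in)
{
    assert(in != NULL);

    switch (in->severity) {
    case DIAG_SEV_WARNING:
        if (in->critical_feature_affected) {
            return DIAG_SEV_FATAL;
        }
        if (in->persisted_s > 5u * 60u || in->active_warnings >= 3u) {
            return DIAG_SEV_ERROR;
        }
        return DIAG_SEV_WARNING;
    case DIAG_SEV_ERROR:
        if (in->persisted_s > 10u * 60u || in->failed_features >= 2u) {
            return DIAG_SEV_FATAL;
        }
        return DIAG_SEV_ERROR;
    default:
        return in->severity;   /* INFO and FATAL have no escalation trigger */
    }
}
```

The associated state transition is left to the caller, which keeps the severity decision easy to unit test.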
### 5.2 Cascading Failure Detection
A cascading failure is detected when:
- Multiple independent features fail simultaneously
- Failure in one feature causes failure in another
- System resource exhaustion (memory, CPU, storage)
**Response:** Immediate escalation to FATAL, transition to `FAULT` state.
## 6. Recovery Behaviors
### 6.1 Recovery Strategies by Severity
| Severity | Recovery Strategy | Retry Logic | State Impact |
|----------|------------------|-------------|--------------|
| **INFO** | None | N/A | None |
| **WARNING** | Automatic retry, degraded operation | 3 retries with exponential backoff | Continue in `WARNING` state |
| **ERROR** | Feature isolation, automatic retry | 3 retries, then manual intervention | Feature disabled, system continues |
| **FATAL** | Controlled teardown, recovery attempt | Single recovery attempt, then manual | `FAULT` → `TEARDOWN` → `INIT` |
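The WARNING row calls for 3 retries with exponential backoff but does not fix the base delay or the growth factor. The sketch below assumes a doubling factor and a caller-supplied base delay; `backoff_delay_ms` and `backoff_total_ms` are hypothetical helpers, not names from this specification.

```c
#include <assert.h>
#include <stdint.h>

#define RECOVERY_MAX_RETRIES 3u

/* Delay before retry number retry_index (0-based), doubling on each retry.
 * The doubling factor is an assumption; the table only says "exponential". */
static uint32_t backoff_delay_ms(unsigned retry_index, uint32_t base_ms)
{
    assert(retry_index < RECOVERY_MAX_RETRIES);
    assert(base_ms <= UINT32_MAX / 8u);   /* keeps the shift and the sum from wrapping */
    return base_ms << retry_index;
}

/* Total time spent waiting across all retries for a given base delay. */
static uint32_t backoff_total_ms(uint32_t base_ms)
{
    uint32_t total = 0;
    for (unsigned i = 0; i < RECOVERY_MAX_RETRIES; ++i) {
        total += backoff_delay_ms(i, base_ms);
    }
    return total;
}
```

With a doubling factor the total wait is seven times the base delay, which gives a quick way to check a chosen base against a recovery time budget.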
### 6.2 Recovery Time Limits
| Fault Type | Maximum Recovery Time | Recovery Action |
|------------|----------------------|----------------|
| Sensor (WARNING) | 5 minutes | Automatic retry, sensor exclusion |
| Communication (WARNING) | 30 seconds | Automatic reconnection |
| Storage (WARNING) | 10 seconds | Retry write operation |
| Sensor (ERROR) | Manual intervention | Sensor marked as failed |
| Communication (ERROR) | Manual intervention | Communication feature disabled |
| Storage (ERROR) | Manual intervention | Persistence disabled, system continues |
| FATAL (any) | 60 seconds | Controlled teardown and recovery attempt |
### 6.3 Latching Behavior
| Severity | Latching Rule | Clear Condition |
|----------|--------------|----------------|
| **INFO** | Not latched | Overwritten by new event |
| **WARNING** | Latched until cleared | Fault condition cleared + manual clear OR automatic clear after 1 hour |
| **ERROR** | Latched until cleared | Manual clear via diagnostic session OR system reset |
| **FATAL** | Latched until cleared | Manual clear via diagnostic session OR system reset |
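The clear conditions can be sketched as a predicate. The WARNING rule is read here as "(fault condition cleared AND manual clear) OR latched for at least one hour"; that grouping is an interpretation of the table, and `latch_query_t` and `latch_may_clear` are hypothetical names. Durations are in microseconds to match the timestamp unit used for diagnostic events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

#define WARNING_AUTO_CLEAR_US (3600ULL * 1000000ULL)   /* 1 hour */

/* Hypothetical inputs for deciding whether a latched fault may be cleared. */
typedef struct {
    diag_severity_t severity;
    bool            condition_cleared;  /* underlying fault condition is gone */
    bool            manual_clear;       /* clear requested, e.g. via diagnostic session */
    bool            system_reset;       /* a system reset has occurred */
    uint64_t        latched_for_us;     /* time since the fault was latched */
} latch_query_t;

static bool latch_may_clear(const latch_query_t *q)
{
    assert(q != NULL);

    switch (q->severity) {
    case DIAG_SEV_INFO:
        return true;    /* not latched, simply overwritten by the next event */
    case DIAG_SEV_WARNING:
        return (q->condition_cleared && q->manual_clear) ||
               q->latched_for_us >= WARNING_AUTO_CLEAR_US;
    case DIAG_SEV_ERROR:
    case DIAG_SEV_FATAL:
        return q->manual_clear || q->system_reset;
    }
    return false;       /* unknown severity: stay latched */
}
```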
## 7. Fault Reporting
### 7.1 Reporting Channels
| Severity | Local HMI | Diagnostic Log | Main Hub | Diagnostic Session |
|----------|-----------|----------------|----------|-------------------|
| **INFO** | Optional | Yes | No | Yes |
| **WARNING** | Yes (status indicator) | Yes | Yes (periodic) | Yes |
| **ERROR** | Yes (status indicator) | Yes | Yes (immediate) | Yes |
| **FATAL** | Yes (status indicator) | Yes | Yes (immediate) | Yes |
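One way to keep the reporting matrix in a single place is a constant lookup table indexed by severity. The sketch below is illustrative: `hub_report_mode_t`, `report_policy_t`, and `report_policy_for` are hypothetical names, and the "Optional" HMI entry for INFO is encoded as "off" purely as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

typedef enum {
    HUB_REPORT_NONE,
    HUB_REPORT_PERIODIC,
    HUB_REPORT_IMMEDIATE
} hub_report_mode_t;

typedef struct {
    bool              hmi_indicator;       /* Local HMI status indicator */
    bool              diagnostic_log;
    hub_report_mode_t main_hub;
    bool              diagnostic_session;
} report_policy_t;

/* One row per severity, mirroring the reporting channel table. */
static const report_policy_t REPORT_POLICY[] = {
    [DIAG_SEV_INFO]    = { false, true, HUB_REPORT_NONE,      true },  /* HMI optional: assumed off */
    [DIAG_SEV_WARNING] = { true,  true, HUB_REPORT_PERIODIC,  true },
    [DIAG_SEV_ERROR]   = { true,  true, HUB_REPORT_IMMEDIATE, true },
    [DIAG_SEV_FATAL]   = { true,  true, HUB_REPORT_IMMEDIATE, true },
};

static const report_policy_t *report_policy_for(diag_severity_t severity)
{
    assert((int)severity >= DIAG_SEV_INFO && (int)severity <= DIAG_SEV_FATAL);
    return &REPORT_POLICY[severity];
}
```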
### 7.2 Diagnostic Event Structure
```c
#include <stdbool.h>
#include <stdint.h>

/* diag_severity_t and fault_category_t are assumed to be defined elsewhere. */
typedef struct {
    uint32_t         diagnostic_code;   // Unique diagnostic code
    diag_severity_t  severity;          // INFO, WARNING, ERROR, FATAL
    uint64_t         timestamp;         // System timestamp (microseconds)
    const char*      source_component;  // Component identifier
    uint32_t         occurrence_count;  // Number of occurrences
    bool             is_latched;        // Latching status
    fault_category_t category;          // SENSOR, COMMUNICATION, etc.
} diagnostic_event_t;
```
## 8. Integration with State Machine
### 8.1 Fault-to-State Mapping
| Fault Severity | Current State | Target State | Transition Trigger |
|----------------|---------------|--------------|-------------------|
| INFO | Any | Same | None (no state change) |
| WARNING | `RUNNING` | `WARNING` | First WARNING fault |
| WARNING | `WARNING` | `WARNING` | Additional WARNING (latched) |
| ERROR | `RUNNING` | `RUNNING` | Feature isolation, continue |
| ERROR | `WARNING` | `WARNING` | Feature isolation, continue |
| FATAL | `RUNNING` | `FAULT` | First FATAL fault |
| FATAL | `WARNING` | `FAULT` | Escalation to FATAL |
| FATAL | `FAULT` | `FAULT` | Additional FATAL (latched) |
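The mapping table can be sketched as a small transition function. The `system_state_t` enumerator names below are hypothetical, and leaving every state other than `RUNNING`, `WARNING`, and `FAULT` unchanged is an assumption, since the table does not list those states.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DIAG_SEV_INFO,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Hypothetical enumerator names for the system states. */
typedef enum {
    SYS_STATE_INIT,
    SYS_STATE_RUNNING,
    SYS_STATE_WARNING,
    SYS_STATE_FAULT,
    SYS_STATE_OTA_PREP,
    SYS_STATE_OTA_UPDATE,
    SYS_STATE_TEARDOWN,
    SYS_STATE_SERVICE
} system_state_t;

static system_state_t state_after_fault(system_state_t current,
                                        diag_severity_t severity)
{
    assert((int)severity >= DIAG_SEV_INFO && (int)severity <= DIAG_SEV_FATAL);

    const bool mapped = current == SYS_STATE_RUNNING ||
                        current == SYS_STATE_WARNING ||
                        current == SYS_STATE_FAULT;
    if (!mapped) {
        return current;             /* states not covered by the table: unchanged */
    }
    if (severity == DIAG_SEV_FATAL) {
        return SYS_STATE_FAULT;     /* RUNNING, WARNING, or FAULT all end in FAULT */
    }
    if (severity == DIAG_SEV_WARNING && current == SYS_STATE_RUNNING) {
        return SYS_STATE_WARNING;   /* first WARNING fault */
    }
    return current;                 /* INFO, ERROR, and repeated WARNING keep the state */
}
```

ERROR faults leave the state untouched because their effect is feature isolation, which is handled outside the state machine.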
### 8.2 State-Dependent Fault Handling
| State | Fault Handling Behavior |
|-------|------------------------|
| `INIT` | Boot-time faults → `BOOT_FAILURE` if security-related |
| `RUNNING` | Full fault detection and handling |
| `WARNING` | Fault escalation monitoring, recovery attempts |
| `FAULT` | Fault logging only, recovery attempt preparation |
| `OTA_PREP` | OTA-related faults only, others deferred |
| `OTA_UPDATE` | OTA progress faults only |
| `TEARDOWN` | Fault logging only, no new fault detection |
| `SERVICE` | Fault inspection only, no new fault detection |
## 9. Error Handler Responsibilities
The Error Handler component SHALL:
1. Receive fault reports from all components
2. Classify faults according to taxonomy
3. Determine severity and escalation
4. Trigger state transitions when required
5. Manage fault latching and clearing
6. Coordinate recovery attempts
7. Report faults to diagnostics and Main Hub
## 10. Traceability
- **SR-DIAG-001:** Implemented via diagnostic code framework
- **SR-DIAG-002:** Implemented via unique diagnostic code assignment
- **SR-DIAG-003:** Implemented via severity classification
- **SR-DIAG-004:** Implemented via timestamp and source association
- **SR-SYS-002:** Implemented via fault-to-state mapping
- **SR-SYS-004:** Implemented via FATAL fault → TEARDOWN transition
## 11. Mermaid Fault Escalation Diagram
```mermaid
flowchart TD
FaultDetected[Fault Detected] --> ClassifySeverity{Classify Severity}
ClassifySeverity -->|INFO| LogOnly[Log Only]
ClassifySeverity -->|WARNING| CheckState1{Current State?}
ClassifySeverity -->|ERROR| IsolateFeature[Isolate Feature]
ClassifySeverity -->|FATAL| TriggerFaultState[Trigger FAULT State]
CheckState1 -->|RUNNING| TransitionWarning[Transition to WARNING]
CheckState1 -->|WARNING| LatchWarning[Latch Warning]
IsolateFeature --> RetryLogic{Retry Logic}
RetryLogic -->|Success| ClearError[Clear Error]
RetryLogic -->|Failure| EscalateToFatal{Escalate?}
EscalateToFatal -->|Yes| TriggerFaultState
EscalateToFatal -->|No| ManualIntervention[Manual Intervention]
TriggerFaultState --> TeardownSequence[Initiate Teardown]
TeardownSequence --> RecoveryAttempt{Recovery Attempt}
RecoveryAttempt -->|Success| ResetToInit[Reset to INIT]
RecoveryAttempt -->|Failure| ManualIntervention
```