ASF_01_sys_sw_arch/System Design/Failure_Handling_Model.md
2026-01-19 16:19:41 +01:00

Failure Handling Model Specification

Document Type: Normative System Specification
Scope: Sensor Hub (Sub-Hub) Fault Detection, Classification, and Recovery
Traceability: SR-DIAG-001 through SR-DIAG-011, SR-SYS-002, SR-SYS-004

1. Purpose

This document defines the fault taxonomy, escalation rules, recovery behaviors, and integration with the system state machine. All components SHALL adhere to this failure handling model.

2. Fault Taxonomy

2.1 Severity Levels

| Severity | Code | Description | State Impact | Recovery Behavior |
|----------|------|-------------|--------------|-------------------|
| INFO | DIAG_SEV_INFO | Informational event, no action required | None | Log only |
| WARNING | DIAG_SEV_WARNING | Non-critical fault, degraded operation | RUNNING → WARNING | Continue with reduced functionality |
| ERROR | DIAG_SEV_ERROR | Critical fault, feature disabled | Feature-specific | Feature isolation, retry logic |
| FATAL | DIAG_SEV_FATAL | System-critical fault, core functionality disabled | RUNNING → FAULT | Controlled teardown, recovery attempt |

2.2 Fault Categories

| Category | Description | Examples | Typical Severity |
|----------|-------------|----------|------------------|
| SENSOR | Sensor hardware or communication failure | Disconnection, out-of-range, non-responsive | WARNING (single), ERROR (multiple), FATAL (all) |
| COMMUNICATION | Network or protocol failure | Link loss, timeout, authentication failure | WARNING (temporary), ERROR (persistent), FATAL (critical) |
| STORAGE | Persistence or storage medium failure | SD card failure, NVM corruption, write failure | WARNING (degraded), ERROR (persistent), FATAL (critical) |
| SECURITY | Security violation or authentication failure | Secure boot failure, key corruption, unauthorized access | FATAL (always) |
| SYSTEM | System resource or configuration failure | Memory exhaustion, task failure, configuration error | ERROR (recoverable), FATAL (unrecoverable) |
| OTA | Firmware update failure | Validation failure, transfer error, flash error | ERROR (retry), FATAL (rollback) |
| CALIBRATION | Calibration or machine constants failure | Invalid MC, calibration error, sensor mismatch | WARNING (single), ERROR (critical) |
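
The taxonomy above could be expressed in C roughly as follows. This is a minimal sketch, not the project's actual header: the `DIAG_SEV_*` names come from section 2.1, while the numeric ordering, the `FAULT_CAT_*` enumerator names, and both helper functions are assumptions made for illustration.

```c
#include <stddef.h>

/* Severity codes from section 2.1. The ascending numeric order is an
 * assumption that lets severities be compared with < and >. */
typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Categories from section 2.2. The FAULT_CAT_* names are hypothetical. */
typedef enum {
    FAULT_CAT_SENSOR = 0,
    FAULT_CAT_COMMUNICATION,
    FAULT_CAT_STORAGE,
    FAULT_CAT_SECURITY,
    FAULT_CAT_SYSTEM,
    FAULT_CAT_OTA,
    FAULT_CAT_CALIBRATION,
    FAULT_CAT_COUNT /* number of categories, not a category itself */
} fault_category_t;

/* Returns the category name used in section 2.2, or NULL if out of range. */
const char *fault_category_name(fault_category_t cat)
{
    static const char *const names[FAULT_CAT_COUNT] = {
        "SENSOR", "COMMUNICATION", "STORAGE", "SECURITY",
        "SYSTEM", "OTA", "CALIBRATION"
    };
    if ((int)cat < 0 || (int)cat >= (int)FAULT_CAT_COUNT) {
        return NULL;
    }
    return names[cat];
}

/* Returns the more severe of two severities (relies on the enum ordering). */
diag_severity_t diag_severity_max(diag_severity_t a, diag_severity_t b)
{
    return (a > b) ? a : b;
}
```

Keeping the severities ordered means an aggregate such as "worst active fault" reduces to repeated calls to `diag_severity_max`.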

3. Diagnostic Code Structure

3.1 Diagnostic Code Format

DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>
  • CATEGORY: Two-letter code (SN, CM, ST, SC, SY, OT, CL)
  • COMPONENT: Component identifier (e.g., TEMP, HUM, CO2, NET, SD, OTA)
  • NUMBER: Unique fault number (0001-9999)
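
As an illustration of the format above, the following sketch builds a code string from its three parts and rejects values outside the listed category codes or the 0001-9999 range. The function names `diag_category_code_is_valid` and `diag_code_format` are hypothetical and not taken from the project.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Two-letter category codes listed in section 3.1. */
static const char *const k_category_codes[] = {
    "SN", "CM", "ST", "SC", "SY", "OT", "CL"
};

/* Returns true if code is one of the two-letter category codes. */
bool diag_category_code_is_valid(const char *code)
{
    if (code == NULL) {
        return false;
    }
    for (size_t i = 0; i < sizeof k_category_codes / sizeof k_category_codes[0]; ++i) {
        if (strcmp(code, k_category_codes[i]) == 0) {
            return true;
        }
    }
    return false;
}

/* Writes "DIAG-<CATEGORY>-<COMPONENT>-<NUMBER>" into buf.
 * Returns 0 on success, -1 on invalid input or if buf is too small. */
int diag_code_format(char *buf, size_t buf_len, const char *category,
                     const char *component, unsigned number)
{
    if (buf == NULL || component == NULL || component[0] == '\0') {
        return -1;
    }
    if (!diag_category_code_is_valid(category)) {
        return -1;
    }
    if (number < 1u || number > 9999u) {
        return -1;
    }
    int written = snprintf(buf, buf_len, "DIAG-%s-%s-%04u",
                           category, component, number);
    return (written > 0 && (size_t)written < buf_len) ? 0 : -1;
}
```

For example, calling `diag_code_format(buf, sizeof buf, "SN", "TEMP", 1)` with a 32-byte buffer yields `DIAG-SN-TEMP-0001`.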

3.2 Diagnostic Code Registry

| Code | Severity | Category | Component | Description |
|------|----------|----------|-----------|-------------|
| DIAG-SN-TEMP-0001 | WARNING | SENSOR | Temperature | Temperature sensor disconnected |
| DIAG-SN-TEMP-0002 | ERROR | SENSOR | Temperature | Temperature sensor out of range |
| DIAG-SN-TEMP-0003 | FATAL | SENSOR | Temperature | All temperature sensors failed |
| DIAG-CM-NET-0001 | WARNING | COMMUNICATION | Network | Main Hub link temporarily lost |
| DIAG-CM-NET-0002 | ERROR | COMMUNICATION | Network | Main Hub link persistently lost |
| DIAG-ST-SD-0001 | WARNING | STORAGE | SD Card | SD card write failure (retry successful) |
| DIAG-ST-SD-0002 | ERROR | STORAGE | SD Card | SD card persistent write failure |
| DIAG-ST-SD-0003 | FATAL | STORAGE | SD Card | SD card corruption detected |
| DIAG-SC-BOOT-0001 | FATAL | SECURITY | Secure Boot | Secure boot verification failed |
| DIAG-SY-MEM-0001 | ERROR | SYSTEM | Memory | Memory allocation failure |
| DIAG-OT-FW-0001 | ERROR | OTA | Firmware | Firmware integrity validation failed |
| DIAG-CL-MC-0001 | WARNING | CALIBRATION | Machine Constants | Invalid sensor slot configuration |
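
One possible way to hold this registry in firmware is a constant lookup table keyed by the code string. The sketch below copies the codes, severities, and descriptions from the table; the type `diag_registry_entry_t`, the table name, and `diag_registry_find` are hypothetical, and a linear search is assumed to be acceptable for a registry of this size.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* One row of the registry in section 3.2 (hypothetical layout). */
typedef struct {
    const char     *code;
    diag_severity_t severity;
    const char     *description;
} diag_registry_entry_t;

static const diag_registry_entry_t k_diag_registry[] = {
    {"DIAG-SN-TEMP-0001", DIAG_SEV_WARNING, "Temperature sensor disconnected"},
    {"DIAG-SN-TEMP-0002", DIAG_SEV_ERROR,   "Temperature sensor out of range"},
    {"DIAG-SN-TEMP-0003", DIAG_SEV_FATAL,   "All temperature sensors failed"},
    {"DIAG-CM-NET-0001",  DIAG_SEV_WARNING, "Main Hub link temporarily lost"},
    {"DIAG-CM-NET-0002",  DIAG_SEV_ERROR,   "Main Hub link persistently lost"},
    {"DIAG-ST-SD-0001",   DIAG_SEV_WARNING, "SD card write failure (retry successful)"},
    {"DIAG-ST-SD-0002",   DIAG_SEV_ERROR,   "SD card persistent write failure"},
    {"DIAG-ST-SD-0003",   DIAG_SEV_FATAL,   "SD card corruption detected"},
    {"DIAG-SC-BOOT-0001", DIAG_SEV_FATAL,   "Secure boot verification failed"},
    {"DIAG-SY-MEM-0001",  DIAG_SEV_ERROR,   "Memory allocation failure"},
    {"DIAG-OT-FW-0001",   DIAG_SEV_ERROR,   "Firmware integrity validation failed"},
    {"DIAG-CL-MC-0001",   DIAG_SEV_WARNING, "Invalid sensor slot configuration"},
};

/* Returns the registry entry for code, or NULL if the code is unknown. */
const diag_registry_entry_t *diag_registry_find(const char *code)
{
    if (code == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < sizeof k_diag_registry / sizeof k_diag_registry[0]; ++i) {
        if (strcmp(code, k_diag_registry[i].code) == 0) {
            return &k_diag_registry[i];
        }
    }
    return NULL;
}
```

A caller that receives an unregistered code gets `NULL` back and can decide how to report it; the specification does not say how unknown codes are handled.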

4. Fault Detection Rules

4.1 Sensor Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Sensor disconnected | Hardware presence signal | WARNING (if other sensors available) |
| Sensor non-responsive | Communication timeout (3 retries) | ERROR (if critical sensor) |
| Sensor out of range | Value validation against limits | WARNING (if single occurrence), ERROR (if persistent) |
| All sensors failed | Count of failed sensors = total | FATAL |
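
The severity assignment in this table can be sketched as a single classification function. The condition enum and the function name are hypothetical. The table does not say what a non-responsive, non-critical sensor should produce, so WARNING is assumed here and flagged in the code.

```c
#include <stdbool.h>

typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Detected sensor conditions from section 4.1 (hypothetical names). */
typedef enum {
    SENSOR_COND_DISCONNECTED,
    SENSOR_COND_NON_RESPONSIVE, /* communication timeout after 3 retries */
    SENSOR_COND_OUT_OF_RANGE
} sensor_condition_t;

/* Maps a detected sensor condition to a severity.
 * failed_count includes the sensor currently being reported. */
diag_severity_t sensor_fault_severity(sensor_condition_t cond,
                                      unsigned failed_count,
                                      unsigned total_count,
                                      bool is_critical,
                                      bool is_persistent)
{
    /* "All sensors failed" overrides every per-sensor rule. */
    if (total_count > 0u && failed_count >= total_count) {
        return DIAG_SEV_FATAL;
    }
    switch (cond) {
    case SENSOR_COND_DISCONNECTED:
        return DIAG_SEV_WARNING; /* other sensors are still available */
    case SENSOR_COND_NON_RESPONSIVE:
        /* ASSUMPTION: the non-critical case is not specified; WARNING is used. */
        return is_critical ? DIAG_SEV_ERROR : DIAG_SEV_WARNING;
    case SENSOR_COND_OUT_OF_RANGE:
        return is_persistent ? DIAG_SEV_ERROR : DIAG_SEV_WARNING;
    default:
        return DIAG_SEV_WARNING;
    }
}
```

Checking the all-failed condition first keeps the FATAL rule independent of which individual condition triggered the report.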

4.2 Communication Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Link temporarily lost | Heartbeat timeout (< 30s) | WARNING |
| Link persistently lost | Heartbeat timeout (> 5 minutes) | ERROR |
| Authentication failure | Security layer rejection | FATAL |
| Protocol error | Message parsing failure (3 consecutive) | ERROR |
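
A rough sketch of two of these rules follows: link-loss severity as a function of outage duration, and a counter for consecutive parsing failures. All names are hypothetical. The table leaves outages between 30 seconds and 5 minutes unspecified, so this sketch assumes the fault stays at WARNING until the 5-minute threshold is exceeded.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

#define LINK_PERSISTENT_LOSS_S (5u * 60u) /* "> 5 minutes" from section 4.2 */
#define PROTO_ERROR_LIMIT      3u         /* "3 consecutive" from section 4.2 */

/* outage_s: seconds elapsed since the heartbeat was first missed.
 * ASSUMPTION: anything up to and including 5 minutes is reported as WARNING. */
diag_severity_t link_outage_severity(uint32_t outage_s)
{
    return (outage_s > LINK_PERSISTENT_LOSS_S) ? DIAG_SEV_ERROR
                                               : DIAG_SEV_WARNING;
}

/* Tracks consecutive message parsing failures. */
typedef struct {
    uint8_t consecutive_failures;
} proto_monitor_t;

/* Records one parse result. Returns true when the protocol-error fault
 * (ERROR severity) should be raised. A successful parse resets the count. */
bool proto_monitor_record(proto_monitor_t *mon, bool parse_ok)
{
    if (parse_ok) {
        mon->consecutive_failures = 0;
        return false;
    }
    if (mon->consecutive_failures < UINT8_MAX) {
        mon->consecutive_failures++;
    }
    return mon->consecutive_failures >= PROTO_ERROR_LIMIT;
}
```

Because the counter resets on any successful parse, isolated parsing errors never accumulate into an ERROR.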

4.3 Storage Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Write failure (retry successful) | Write operation with retry | WARNING |
| Write failure (persistent) | Write operation failure (3 retries) | ERROR |
| SD card corruption | File system check failure | FATAL |
| Storage full | Available space < threshold | WARNING |
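
The two write-failure rows can be sketched as a retry wrapper around a write callback. The callback type, result enum, and function names are hypothetical. "3 retries" is read here as one initial attempt plus up to three retries, which is an assumption. `demo_flaky_write` is only a test double that fails a configurable number of times.

```c
#include <stdbool.h>
#include <stddef.h>

#define STORAGE_MAX_RETRIES 3u

/* Hypothetical write callback: returns true on success. */
typedef bool (*storage_write_fn)(void *ctx, const void *data, size_t len);

typedef enum {
    STORAGE_WRITE_OK,             /* first attempt succeeded, no fault      */
    STORAGE_WRITE_OK_AFTER_RETRY, /* maps to WARNING (retry successful)     */
    STORAGE_WRITE_FAILED          /* maps to ERROR (persistent failure)     */
} storage_write_result_t;

/* Performs one initial write attempt followed by up to three retries. */
storage_write_result_t storage_write_with_retry(storage_write_fn write_cb,
                                                void *ctx,
                                                const void *data,
                                                size_t len)
{
    if (write_cb(ctx, data, len)) {
        return STORAGE_WRITE_OK;
    }
    for (unsigned retry = 0; retry < STORAGE_MAX_RETRIES; ++retry) {
        if (write_cb(ctx, data, len)) {
            return STORAGE_WRITE_OK_AFTER_RETRY;
        }
    }
    return STORAGE_WRITE_FAILED;
}

/* Test double: ctx points to an int holding the number of failures left. */
bool demo_flaky_write(void *ctx, const void *data, size_t len)
{
    int *failures_left = (int *)ctx;
    (void)data;
    (void)len;
    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

With the test double set to fail twice, the wrapper returns `STORAGE_WRITE_OK_AFTER_RETRY`; set to fail four times, it returns `STORAGE_WRITE_FAILED`.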

4.4 Security Fault Detection

| Condition | Detection Method | Severity Assignment |
|-----------|------------------|---------------------|
| Secure boot failure | Boot verification failure | FATAL (always) |
| Key corruption | Cryptographic key validation failure | FATAL |
| Unauthorized access | Authentication failure (3 attempts) | FATAL |
| Message tampering | Integrity check failure | ERROR (if persistent → FATAL) |

5. Escalation Rules

5.1 Severity Escalation

| Current Severity | Escalation Trigger | New Severity | State Transition |
|------------------|--------------------|--------------|------------------|
| INFO | N/A | N/A | None |
| WARNING | Same fault persists > 5 minutes | ERROR | WARNING → WARNING (feature degraded) |
| WARNING | Multiple warnings (≥3) | ERROR | WARNING → WARNING (feature degraded) |
| WARNING | Critical feature affected | FATAL | WARNING → FAULT |
| ERROR | Same fault persists > 10 minutes | FATAL | RUNNING → FAULT |
| ERROR | Cascading failures (≥2 features) | FATAL | RUNNING → FAULT |
| FATAL | N/A | N/A | RUNNING → FAULT |
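
The escalation table can be read as a pure function from the current severity and a few observed inputs to a new severity. The sketch below is one such reading. The struct `escalation_inputs_t` and the function `diag_escalate` are hypothetical, and checking "critical feature affected" before the other WARNING triggers is an assumption about precedence that the table does not state.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Observations needed by the escalation rules (hypothetical layout). */
typedef struct {
    uint32_t persisted_s;               /* how long the same fault has persisted */
    unsigned active_warning_count;      /* number of active WARNING faults       */
    unsigned failed_feature_count;      /* number of features currently failed   */
    bool     critical_feature_affected;
} escalation_inputs_t;

/* Returns the severity after applying the rules in section 5.1. */
diag_severity_t diag_escalate(diag_severity_t current,
                              const escalation_inputs_t *in)
{
    switch (current) {
    case DIAG_SEV_WARNING:
        /* ASSUMPTION: the FATAL trigger takes precedence over ERROR triggers. */
        if (in->critical_feature_affected) {
            return DIAG_SEV_FATAL;
        }
        if (in->persisted_s > 5u * 60u || in->active_warning_count >= 3u) {
            return DIAG_SEV_ERROR;
        }
        return DIAG_SEV_WARNING;
    case DIAG_SEV_ERROR:
        if (in->persisted_s > 10u * 60u || in->failed_feature_count >= 2u) {
            return DIAG_SEV_FATAL;
        }
        return DIAG_SEV_ERROR;
    default:
        /* INFO and FATAL have no escalation rule. */
        return current;
    }
}
```

Keeping the function free of side effects makes each row of the table directly testable; the resulting state transition would be handled separately by the state machine.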

5.2 Cascading Failure Detection

A cascading failure is detected when:

  • Multiple independent features fail simultaneously
  • A failure in one feature causes a failure in another
  • System resources (memory, CPU, storage) are exhausted

Response: Immediate escalation to FATAL, transition to FAULT state.

6. Recovery Behaviors

6.1 Recovery Strategies by Severity

| Severity | Recovery Strategy | Retry Logic | State Impact |
|----------|-------------------|-------------|--------------|
| INFO | None | N/A | None |
| WARNING | Automatic retry, degraded operation | 3 retries with exponential backoff | Continue in WARNING state |
| ERROR | Feature isolation, automatic retry | 3 retries, then manual intervention | Feature disabled, system continues |
| FATAL | Controlled teardown, recovery attempt | Single recovery attempt, then manual | FAULT → TEARDOWN → INIT |
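
The "3 retries with exponential backoff" rule for WARNING faults might be implemented along these lines. The specification gives neither a base delay nor a growth factor, so the caller-supplied `base_ms` and the doubling factor are assumptions, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_MAX_RETRIES 3u /* retry budget from section 6.1 */

/* Computes the delay before retry number retry_index (zero-based).
 * ASSUMPTION: the delay doubles on each retry, starting at base_ms.
 * Returns false once the retry budget is exhausted. */
bool retry_next_delay_ms(unsigned retry_index, uint32_t base_ms,
                         uint32_t *delay_ms)
{
    if (delay_ms == NULL || retry_index >= DIAG_MAX_RETRIES) {
        return false;
    }
    if (base_ms > (UINT32_MAX >> retry_index)) {
        *delay_ms = UINT32_MAX; /* saturate instead of overflowing */
    } else {
        *delay_ms = base_ms << retry_index;
    }
    return true;
}
```

With a base delay of 100 ms the three retries would wait 100 ms, 200 ms, and 400 ms; a fourth call returns `false`, at which point the caller would apply the escalation or manual-intervention rule for that severity.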

6.2 Recovery Time Limits

| Fault Type | Maximum Recovery Time | Recovery Action |
|------------|-----------------------|-----------------|
| Sensor (WARNING) | 5 minutes | Automatic retry, sensor exclusion |
| Communication (WARNING) | 30 seconds | Automatic reconnection |
| Storage (WARNING) | 10 seconds | Retry write operation |
| Sensor (ERROR) | Manual intervention | Sensor marked as failed |
| Communication (ERROR) | Manual intervention | Communication feature disabled |
| Storage (ERROR) | Manual intervention | Persistence disabled, system continues |
| FATAL (any) | 60 seconds | Controlled teardown and recovery attempt |

6.3 Latching Behavior

| Severity | Latching Rule | Clear Condition |
|----------|---------------|-----------------|
| INFO | Not latched | Overwritten by new event |
| WARNING | Latched until cleared | Fault condition cleared + manual clear OR automatic clear after 1 hour |
| ERROR | Latched until cleared | Manual clear via diagnostic session OR system reset |
| FATAL | Latched until cleared | Manual clear via diagnostic session OR system reset |
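
The clear conditions could be checked with a small predicate such as the one below. The WARNING rule is ambiguous as written; this sketch assumes it means "the fault condition has cleared AND (a manual clear was requested OR one hour has passed since the condition cleared)". The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

#define WARNING_AUTO_CLEAR_S 3600u /* "automatic clear after 1 hour" */

/* Inputs to the latch-clear decision (hypothetical layout). */
typedef struct {
    bool     condition_cleared;       /* underlying fault no longer present */
    bool     manual_clear_requested;  /* e.g. via a diagnostic session      */
    bool     system_reset;
    uint32_t seconds_since_condition_cleared;
} latch_clear_inputs_t;

/* Returns true if a latched fault of the given severity may be cleared. */
bool diag_latch_may_clear(diag_severity_t sev, const latch_clear_inputs_t *in)
{
    switch (sev) {
    case DIAG_SEV_INFO:
        return true; /* not latched */
    case DIAG_SEV_WARNING:
        /* ASSUMPTION: see the reading of the WARNING rule described above. */
        return in->condition_cleared &&
               (in->manual_clear_requested ||
                in->seconds_since_condition_cleared >= WARNING_AUTO_CLEAR_S);
    case DIAG_SEV_ERROR:
    case DIAG_SEV_FATAL:
        return in->manual_clear_requested || in->system_reset;
    default:
        return false;
    }
}
```

Under this reading a WARNING never clears while its fault condition is still present, regardless of elapsed time.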

7. Fault Reporting

7.1 Reporting Channels

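| Severity | Local HMI | Diagnostic Log | Main Hub | Diagnostic Session |
|----------|-----------|----------------|----------|--------------------|
| INFO | Optional | Yes | No | Yes |
| WARNING | Yes (status indicator) | Yes | Yes (periodic) | Yes |
| ERROR | Yes (status indicator) | Yes | Yes (immediate) | Yes |
| FATAL | Yes (status indicator) | Yes | Yes (immediate) | Yes |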
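
The channel matrix lends itself to a bitmask per severity. The sketch below is illustrative only: the `REPORT_*` flags and both functions are hypothetical, and the "Optional" HMI entry for INFO is treated as disabled by default, which is an assumption.

```c
#include <stdbool.h>

typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* Reporting channels from section 7.1 as bit flags (hypothetical names). */
enum {
    REPORT_HMI          = 1u << 0,
    REPORT_LOG          = 1u << 1,
    REPORT_MAIN_HUB     = 1u << 2,
    REPORT_DIAG_SESSION = 1u << 3
};

/* Returns the set of channels a fault of this severity is reported on. */
unsigned diag_report_channels(diag_severity_t sev)
{
    switch (sev) {
    case DIAG_SEV_INFO:
        /* ASSUMPTION: the optional HMI report is off by default. */
        return REPORT_LOG | REPORT_DIAG_SESSION;
    case DIAG_SEV_WARNING:
    case DIAG_SEV_ERROR:
    case DIAG_SEV_FATAL:
        return REPORT_HMI | REPORT_LOG | REPORT_MAIN_HUB | REPORT_DIAG_SESSION;
    default:
        return REPORT_LOG;
    }
}

/* True if the Main Hub report is sent immediately rather than periodically. */
bool diag_report_main_hub_immediate(diag_severity_t sev)
{
    return sev == DIAG_SEV_ERROR || sev == DIAG_SEV_FATAL;
}
```

A reporter can then test individual bits, for example `diag_report_channels(sev) & REPORT_MAIN_HUB`, before queuing a message for the Main Hub.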

7.2 Diagnostic Event Structure

#include <stdbool.h>
#include <stdint.h>

/* diag_severity_t and fault_category_t are assumed to be declared elsewhere. */
typedef struct {
    uint32_t         diagnostic_code;   // Unique diagnostic code
    diag_severity_t  severity;          // INFO, WARNING, ERROR, FATAL
    uint64_t         timestamp;         // System timestamp (microseconds)
    const char      *source_component;  // Component identifier
    uint32_t         occurrence_count;  // Number of occurrences
    bool             is_latched;        // Latching status
    fault_category_t category;          // SENSOR, COMMUNICATION, etc.
} diagnostic_event_t;

8. Integration with State Machine

8.1 Fault-to-State Mapping

| Fault Severity | Current State | Target State | Transition Trigger |
|----------------|---------------|--------------|--------------------|
| INFO | Any | Same | None (no state change) |
| WARNING | RUNNING | WARNING | First WARNING fault |
| WARNING | WARNING | WARNING | Additional WARNING (latched) |
| ERROR | RUNNING | RUNNING | Feature isolation, continue |
| ERROR | WARNING | WARNING | Feature isolation, continue |
| FATAL | RUNNING | FAULT | First FATAL fault |
| FATAL | WARNING | FAULT | Escalation to FATAL |
| FATAL | FAULT | FAULT | Additional FATAL (latched) |
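
The mapping above can be sketched as a next-state function. The state names follow sections 8.1 and 8.2, but the `SYS_STATE_*` identifiers and `fault_next_state` are hypothetical. The table only covers the RUNNING, WARNING, and FAULT states; leaving every other state unchanged is an assumption.

```c
typedef enum {
    DIAG_SEV_INFO = 0,
    DIAG_SEV_WARNING,
    DIAG_SEV_ERROR,
    DIAG_SEV_FATAL
} diag_severity_t;

/* System states named in sections 8.1 and 8.2 (hypothetical identifiers). */
typedef enum {
    SYS_STATE_INIT,
    SYS_STATE_RUNNING,
    SYS_STATE_WARNING,
    SYS_STATE_FAULT,
    SYS_STATE_OTA_PREP,
    SYS_STATE_OTA_UPDATE,
    SYS_STATE_TEARDOWN,
    SYS_STATE_SERVICE
} sys_state_t;

/* Returns the target state for a fault of severity sev raised in current. */
sys_state_t fault_next_state(sys_state_t current, diag_severity_t sev)
{
    switch (sev) {
    case DIAG_SEV_WARNING:
        /* Only the first WARNING moves RUNNING to WARNING. */
        return (current == SYS_STATE_RUNNING) ? SYS_STATE_WARNING : current;
    case DIAG_SEV_FATAL:
        if (current == SYS_STATE_RUNNING || current == SYS_STATE_WARNING ||
            current == SYS_STATE_FAULT) {
            return SYS_STATE_FAULT;
        }
        /* ASSUMPTION: states not listed in the table are left unchanged. */
        return current;
    case DIAG_SEV_INFO:
    case DIAG_SEV_ERROR:
    default:
        /* INFO: no state change. ERROR: feature isolation, state kept. */
        return current;
    }
}
```

Returning the current state for unlisted combinations keeps the sketch conservative; state-specific handling for the remaining states would be layered on top.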

8.2 State-Dependent Fault Handling

| State | Fault Handling Behavior |
|-------|-------------------------|
| INIT | Boot-time faults → BOOT_FAILURE if security-related |
| RUNNING | Full fault detection and handling |
| WARNING | Fault escalation monitoring, recovery attempts |
| FAULT | Fault logging only, recovery attempt preparation |
| OTA_PREP | OTA-related faults only, others deferred |
| OTA_UPDATE | OTA progress faults only |
| TEARDOWN | Fault logging only, no new fault detection |
| SERVICE | Fault inspection only, no new fault detection |

9. Error Handler Responsibilities

The Error Handler component SHALL:

  1. Receive fault reports from all components
  2. Classify faults according to taxonomy
  3. Determine severity and escalation
  4. Trigger state transitions when required
  5. Manage fault latching and clearing
  6. Coordinate recovery attempts
  7. Report faults to diagnostics and Main Hub

10. Traceability

  • SR-DIAG-001: Implemented via diagnostic code framework
  • SR-DIAG-002: Implemented via unique diagnostic code assignment
  • SR-DIAG-003: Implemented via severity classification
  • SR-DIAG-004: Implemented via timestamp and source association
  • SR-SYS-002: Implemented via fault-to-state mapping
  • SR-SYS-004: Implemented via FATAL fault → TEARDOWN transition

11. Mermaid Fault Escalation Diagram

flowchart TD
    FaultDetected[Fault Detected] --> ClassifySeverity{Classify Severity}
    ClassifySeverity -->|INFO| LogOnly[Log Only]
    ClassifySeverity -->|WARNING| CheckState1{Current State?}
    ClassifySeverity -->|ERROR| IsolateFeature[Isolate Feature]
    ClassifySeverity -->|FATAL| TriggerFaultState[Trigger FAULT State]
    
    CheckState1 -->|RUNNING| TransitionWarning[Transition to WARNING]
    CheckState1 -->|WARNING| LatchWarning[Latch Warning]
    
    IsolateFeature --> RetryLogic{Retry Logic}
    RetryLogic -->|Success| ClearError[Clear Error]
    RetryLogic -->|Failure| EscalateToFatal{Escalate?}
    EscalateToFatal -->|Yes| TriggerFaultState
    EscalateToFatal -->|No| ManualIntervention[Manual Intervention]
    
    TriggerFaultState --> TeardownSequence[Initiate Teardown]
    TeardownSequence --> RecoveryAttempt{Recovery Attempt}
    RecoveryAttempt -->|Success| ResetToInit[Reset to INIT]
    RecoveryAttempt -->|Failure| ManualIntervention