
Feature Specification: Diagnostics & Health Monitoring

Feature ID: F-DIAG (F-DIAG-001 to F-DIAG-004)

Document Type: Feature Specification
Version: 1.0
Date: 2025-01-19
Feature Category: Diagnostics & Health Monitoring

1. Feature Overview

1.1 Feature Purpose

The Diagnostics & Health Monitoring feature provides comprehensive system health assessment, fault detection, diagnostic event management, and engineering access capabilities for the ASF Sensor Hub. This feature ensures system reliability through proactive monitoring, structured fault reporting, and maintenance support.

1.2 Feature Scope

In Scope:

  • Structured diagnostic code framework with severity classification
  • Persistent diagnostic event storage and management
  • Engineering diagnostic sessions with secure access
  • System health monitoring and performance metrics
  • Cross-component fault correlation and root cause analysis

Out of Scope:

  • Main Hub diagnostic aggregation and analysis
  • Predictive maintenance algorithms (future enhancement)
  • Hardware fault injection testing equipment
  • Remote diagnostic access without Main Hub coordination

2. Sub-Features

2.1 F-DIAG-001: Diagnostic Code Management

Description: Comprehensive diagnostic code framework for standardized fault identification, classification, and reporting across all system components.

Diagnostic Code Structure:

typedef struct {
    uint16_t code;                  // Unique diagnostic code (0x0001-0xFFFF)
    diagnostic_severity_t severity; // INFO, WARNING, ERROR, FATAL
    diagnostic_category_t category; // SENSOR, COMM, STORAGE, SYSTEM, SECURITY
    uint64_t timestamp_ms;          // Event occurrence time
    uint8_t source_component_id;    // Component that generated the event
    char description[64];           // Human-readable description
    uint8_t data[32];              // Context-specific diagnostic data
    uint16_t occurrence_count;      // Number of times this event occurred
} diagnostic_event_t;

typedef enum {
    DIAG_SEVERITY_INFO = 0,     // Informational, no action required
    DIAG_SEVERITY_WARNING = 1,  // Warning, monitoring required
    DIAG_SEVERITY_ERROR = 2,    // Error, corrective action needed
    DIAG_SEVERITY_FATAL = 3     // Fatal, system functionality compromised
} diagnostic_severity_t;

typedef enum {
    DIAG_CATEGORY_SENSOR = 0,   // Sensor-related diagnostics
    DIAG_CATEGORY_COMM = 1,     // Communication diagnostics
    DIAG_CATEGORY_STORAGE = 2,  // Storage and persistence diagnostics
    DIAG_CATEGORY_SYSTEM = 3,   // System management diagnostics
    DIAG_CATEGORY_SECURITY = 4, // Security-related diagnostics
    DIAG_CATEGORY_POWER = 5,    // Power and fault handling diagnostics
    DIAG_CATEGORY_OTA = 6       // OTA update diagnostics
} diagnostic_category_t;

Diagnostic Code Registry (Examples):

| Code   | Severity | Category | Description                        |
|--------|----------|----------|------------------------------------|
| 0x1001 | WARNING  | SENSOR   | Sensor communication timeout       |
| 0x1002 | ERROR    | SENSOR   | Sensor out-of-range value detected |
| 0x1003 | FATAL    | SENSOR   | Critical sensor hardware failure   |
| 0x2001 | WARNING  | COMM     | Wi-Fi signal strength low          |
| 0x2002 | ERROR    | COMM     | MQTT broker connection failed      |
| 0x2003 | FATAL    | COMM     | TLS certificate validation failed  |
| 0x3001 | WARNING  | STORAGE  | SD card space low (< 10%)          |
| 0x3002 | ERROR    | STORAGE  | SD card write failure              |
| 0x3003 | FATAL    | STORAGE  | SD card not detected               |
| 0x4001 | INFO     | SYSTEM   | System state transition            |
| 0x4002 | WARNING  | SYSTEM   | Memory usage high (> 80%)          |
| 0x4003 | FATAL    | SYSTEM   | Watchdog timer reset               |
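
As a usage illustration, the sketch below packs registry entry 0x1001 into a diagnostic_event_t and hands it to the Diagnostics Task API from Section 7.1. The header name diag_task.h and the microsecond-to-millisecond conversion via esp_timer_get_time() are assumptions for the example, not part of this specification.

#include <string.h>
#include "esp_timer.h"   // esp_timer_get_time(): microseconds since boot
#include "diag_task.h"   // assumed header exposing diagnostic types and diagTask_reportEvent()

// Minimal sketch: report registry code 0x1001 (sensor communication timeout).
static void report_sensor_timeout(uint8_t sensor_component_id)
{
    diagnostic_event_t event = {0};

    event.code                = 0x1001;
    event.severity            = DIAG_SEVERITY_WARNING;
    event.category            = DIAG_CATEGORY_SENSOR;
    event.timestamp_ms        = (uint64_t)(esp_timer_get_time() / 1000);
    event.source_component_id = sensor_component_id;
    event.occurrence_count    = 1;
    strncpy(event.description, "Sensor communication timeout",
            sizeof(event.description) - 1);   // struct is zero-initialized, so termination is guaranteed

    (void)diagTask_reportEvent(&event);
}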

2.2 F-DIAG-002: Diagnostic Data Storage

Description: Persistent storage of diagnostic events in non-volatile memory with efficient storage management and retrieval capabilities.

Storage Architecture:

graph TB
    subgraph "Diagnostic Storage System"
        GEN[Diagnostic Generator] --> BUF[Ring Buffer]
        BUF --> FILTER[Severity Filter]
        FILTER --> PERSIST[Persistence Layer]
        PERSIST --> SD[SD Card Storage]
        PERSIST --> NVS[NVS Flash Storage]
    end
    
    subgraph "Storage Policy"
        CRITICAL[FATAL/ERROR Events] --> NVS
        NORMAL[WARNING/INFO Events] --> SD
        OVERFLOW[Buffer Overflow] --> DISCARD[Discard Oldest]
    end
    
    subgraph "Retrieval Interface"
        QUERY[Query Interface] --> PERSIST
        EXPORT[Export Interface] --> PERSIST
        CLEAR[Clear Interface] --> PERSIST
    end

Storage Management:

  • Ring Buffer: 100 events in RAM for immediate access
  • NVS Storage: Critical events (ERROR/FATAL) persisted to flash
  • SD Card Storage: All events stored to SD card when available
  • Retention Policy: 30 days or 10,000 events maximum
  • Compression: Event data compressed for efficient storage
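
A minimal sketch of the severity-based routing described above, assuming hypothetical persistence-layer helpers persist_to_nvs() and persist_to_sd(); the actual Diagnostic Storage Manager interface may differ.

// Placeholder prototypes for the persistence layer (assumed names).
extern bool persist_to_nvs(const diagnostic_event_t* event);
extern bool persist_to_sd(const diagnostic_event_t* event);

// Storage policy sketch: ERROR/FATAL events are persisted to NVS flash so they
// survive power loss; every event also goes to the SD card when it is present.
static bool store_diagnostic_event(const diagnostic_event_t* event, bool sd_available)
{
    bool ok = true;

    if (event->severity >= DIAG_SEVERITY_ERROR) {
        ok = persist_to_nvs(event) && ok;
    }
    if (sd_available) {
        ok = persist_to_sd(event) && ok;
    }
    return ok;
}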

2.3 F-DIAG-003: Diagnostic Session

Description: Secure engineering access interface for diagnostic data retrieval, system inspection, and maintenance operations.

Session Types:

| Session Type | Access Level | Authentication | Capabilities                         |
|--------------|--------------|----------------|--------------------------------------|
| Read-Only    | Basic        | PIN code       | View diagnostics, system status      |
| Engineering  | Advanced     | Certificate    | Diagnostic management, configuration |
| Service      | Full         | Multi-factor   | System control, debug access         |

Session Interface:

typedef struct {
    session_id_t session_id;
    session_type_t type;
    uint64_t start_time;
    uint64_t last_activity;
    uint32_t timeout_seconds;
    bool authenticated;
    char user_id[32];
} diagnostic_session_t;

// Session management API
session_id_t diag_createSession(session_type_t type);
bool diag_authenticateSession(session_id_t session, const auth_credentials_t* creds);
bool diag_closeSession(session_id_t session);
bool diag_isSessionValid(session_id_t session);

// Diagnostic access API
bool diag_getEvents(session_id_t session, diagnostic_filter_t* filter, 
                   diagnostic_event_t* events, size_t* count);
bool diag_clearEvents(session_id_t session, diagnostic_filter_t* filter);
bool diag_exportEvents(session_id_t session, export_format_t format, 
                      uint8_t* buffer, size_t* size);
bool diag_getSystemHealth(session_id_t session, system_health_t* health);
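
An illustrative read-only session flow built on the API above. SESSION_TYPE_READ_ONLY and the min_severity filter field are placeholder names; the actual session_type_t enumerators and diagnostic_filter_t layout are defined by the implementation.

// Sketch of an engineering workflow: open a read-only session, authenticate,
// fetch recent WARNING-and-above events, then close the session.
static void example_readonly_session(const auth_credentials_t* creds)
{
    session_id_t session = diag_createSession(SESSION_TYPE_READ_ONLY);  // assumed enumerator

    if (!diag_authenticateSession(session, creds)) {
        diag_closeSession(session);
        return;
    }

    diagnostic_event_t events[16];
    size_t count = sizeof(events) / sizeof(events[0]);    // in: capacity, out: events returned
    diagnostic_filter_t filter = { .min_severity = DIAG_SEVERITY_WARNING };  // assumed field

    if (diag_getEvents(session, &filter, events, &count)) {
        // process 'count' events ...
    }

    diag_closeSession(session);
}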

2.4 F-DIAG-004: System Health Monitoring

Description: Continuous monitoring of system performance metrics, resource utilization, and component health status.

Health Metrics:

typedef struct {
    // CPU and Memory
    uint8_t cpu_usage_percent;
    uint32_t free_heap_bytes;
    uint32_t min_free_heap_bytes;
    uint16_t task_count;
    
    // Storage
    uint64_t sd_free_bytes;
    uint64_t sd_total_bytes;
    uint32_t nvs_free_entries;
    uint32_t nvs_used_entries;
    
    // Communication
    int8_t wifi_rssi_dbm;
    uint32_t mqtt_messages_sent;
    uint32_t mqtt_messages_failed;
    uint32_t comm_error_count;
    
    // Sensors
    uint8_t sensors_active;
    uint8_t sensors_total;
    uint8_t sensors_failed;
    uint32_t sensor_error_count;
    
    // System
    uint32_t uptime_seconds;
    uint32_t reset_count;
    system_state_t current_state;
    uint32_t state_change_count;
    
    // Power
    float supply_voltage;
    bool brownout_detected;
    uint32_t power_cycle_count;
} system_health_t;
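
A partial sketch of filling the CPU/memory and system portions of system_health_t, assuming the standard ESP-IDF and FreeRTOS query functions named in the includes; the remaining fields are supplied by their owning components (sensors, communication, power).

#include "esp_system.h"          // esp_get_free_heap_size(), esp_get_minimum_free_heap_size()
#include "esp_timer.h"           // esp_timer_get_time()
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"       // uxTaskGetNumberOfTasks()

// Populate the core metrics that can be read directly from the runtime.
static void collect_core_metrics(system_health_t* health)
{
    health->free_heap_bytes     = esp_get_free_heap_size();
    health->min_free_heap_bytes = esp_get_minimum_free_heap_size();
    health->task_count          = (uint16_t)uxTaskGetNumberOfTasks();
    health->uptime_seconds      = (uint32_t)(esp_timer_get_time() / 1000000);
}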

Health Monitoring Flow:

sequenceDiagram
    participant HM as Health Monitor
    participant COMP as System Components
    participant DIAG as Diagnostic Storage
    participant ES as Event System
    participant HMI as Local HMI
    
    Note over HM,HMI: Health Monitoring Cycle (10 seconds)
    
    loop Every 10 seconds
        HM->>COMP: collectHealthMetrics()
        COMP-->>HM: health_data
        
        HM->>HM: analyzeHealthTrends()
        HM->>HM: detectAnomalies()
        
        alt Anomaly detected
            HM->>DIAG: logDiagnosticEvent(anomaly)
            HM->>ES: publish(HEALTH_ANOMALY, details)
        end
        
        HM->>ES: publish(HEALTH_UPDATE, metrics)
        ES->>HMI: updateHealthDisplay(metrics)
    end
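
A minimal sketch of the 10-second cycle shown above as a periodic FreeRTOS task. healthMonitor_collectMetrics() is the Health Monitor API from Section 7.3; analyze_and_publish() is a placeholder for trend analysis, anomaly detection, and HEALTH_UPDATE publication.

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

extern bool healthMonitor_collectMetrics(void);   // Section 7.3

// Placeholder: trend analysis, anomaly detection, HEALTH_UPDATE publication.
static void analyze_and_publish(void) { /* ... */ }

// Periodic health-monitoring task implementing the 10-second cycle above.
static void health_monitor_task(void* arg)
{
    (void)arg;
    TickType_t last_wake = xTaskGetTickCount();

    for (;;) {
        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(10000));  // 10 s cycle, ±1 s tolerance
        healthMonitor_collectMetrics();
        analyze_and_publish();
    }
}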

3. Requirements Coverage

3.1 System Requirements (SR-XXX)

| Feature    | System Requirements                                | Description                                        |
|------------|----------------------------------------------------|----------------------------------------------------|
| F-DIAG-001 | SR-DIAG-001, SR-DIAG-002, SR-DIAG-003, SR-DIAG-004 | Diagnostic code framework and event management     |
| F-DIAG-002 | SR-DIAG-005, SR-DIAG-006, SR-DIAG-007              | Persistent diagnostic storage and retention        |
| F-DIAG-003 | SR-DIAG-008, SR-DIAG-009, SR-DIAG-010, SR-DIAG-011 | Engineering diagnostic sessions and access control |
| F-DIAG-004 | SR-DIAG-012, SR-DIAG-013, SR-DIAG-014              | System health monitoring and performance metrics   |

3.2 Software Requirements (SWR-XXX)

| Feature    | Software Requirements                    | Implementation Details                                   |
|------------|------------------------------------------|----------------------------------------------------------|
| F-DIAG-001 | SWR-DIAG-001, SWR-DIAG-002, SWR-DIAG-003 | Event structure, code registry, severity classification  |
| F-DIAG-002 | SWR-DIAG-004, SWR-DIAG-005, SWR-DIAG-006 | Storage management, persistence, retrieval interface     |
| F-DIAG-003 | SWR-DIAG-007, SWR-DIAG-008, SWR-DIAG-009 | Session management, authentication, access control       |
| F-DIAG-004 | SWR-DIAG-010, SWR-DIAG-011, SWR-DIAG-012 | Health metrics collection, anomaly detection, reporting  |

4. Component Implementation Mapping

4.1 Primary Components

| Component                  | Responsibility                                            | Location                           |
|----------------------------|-----------------------------------------------------------|------------------------------------|
| Diagnostics Task           | Health monitoring, event coordination, session management | application_layer/diag_task/       |
| Error Handler              | Diagnostic event generation, fault classification         | application_layer/error_handler/   |
| Diagnostic Storage Manager | Event persistence, retrieval, storage management          | application_layer/diag_storage/    |
| Health Monitor             | System metrics collection, anomaly detection              | application_layer/health_monitor/  |

4.2 Supporting Components

| Component        | Support Role                                           | Interface                                       |
|------------------|--------------------------------------------------------|-------------------------------------------------|
| Event System     | Diagnostic event distribution, component coordination  | application_layer/business_stack/event_system/  |
| Data Persistence | Storage abstraction, NVS and SD card access            | application_layer/DP_stack/persistence/         |
| Security Manager | Session authentication, access control                 | application_layer/security/                     |
| State Manager    | System state awareness, state-dependent diagnostics    | application_layer/business_stack/STM/           |

4.3 Component Interaction Diagram

graph TB
    subgraph "Diagnostics & Health Monitoring Feature"
        DT[Diagnostics Task]
        EH[Error Handler]
        DSM[Diagnostic Storage Manager]
        HM[Health Monitor]
    end
    
    subgraph "Core System Components"
        ES[Event System]
        DP[Data Persistence]
        SEC[Security Manager]
        STM[State Manager]
    end
    
    subgraph "System Components"
        SM[Sensor Manager]
        COM[Communication]
        OTA[OTA Manager]
        PWR[Power Manager]
    end
    
    subgraph "Storage"
        NVS[NVS Flash]
        SD[SD Card]
    end
    
    subgraph "Interfaces"
        HMI[Local HMI]
        UART[UART Debug]
        NET[Network Session]
    end
    
    DT <--> ES
    DT <--> DSM
    DT <--> HM
    DT <--> SEC
    
    EH --> ES
    EH --> DSM
    
    DSM <--> DP
    DSM --> NVS
    DSM --> SD
    
    HM --> SM
    HM --> COM
    HM --> OTA
    HM --> PWR
    HM --> STM
    
    ES -.->|Health Events| HMI
    ES -.->|Diagnostic Events| COM
    DT -.->|Session Access| UART
    DT -.->|Session Access| NET

4.4 Diagnostic Event Flow

sequenceDiagram
    participant COMP as System Component
    participant EH as Error Handler
    participant ES as Event System
    participant DSM as Diagnostic Storage
    participant DT as Diagnostics Task
    participant COM as Communication
    
    Note over COMP,COM: Diagnostic Event Generation and Processing
    
    COMP->>EH: reportError(error_info)
    EH->>EH: classifyError(error_info)
    EH->>EH: generateDiagnosticEvent()
    
    EH->>ES: publish(DIAGNOSTIC_EVENT, event)
    ES->>DSM: storeDiagnosticEvent(event)
    ES->>DT: processDiagnosticEvent(event)
    ES->>COM: reportDiagnosticEvent(event)
    
    DSM->>DSM: checkStoragePolicy(event.severity)
    
    alt Critical Event (ERROR/FATAL)
        DSM->>NVS: persistToFlash(event)
    end
    
    DSM->>SD: persistToSDCard(event)
    
    DT->>DT: updateHealthMetrics(event)
    DT->>DT: checkSystemHealth()
    
    alt Health degradation detected
        DT->>ES: publish(HEALTH_DEGRADATION, metrics)
    end

5. Feature Behavior

5.1 Normal Operation Flow

  1. System Initialization:

    • Initialize diagnostic storage and load existing events
    • Start health monitoring tasks and metric collection
    • Register diagnostic event handlers with all components
    • Establish baseline health metrics and thresholds
  2. Continuous Monitoring:

    • Collect system health metrics every 10 seconds
    • Process diagnostic events from all system components
    • Store events according to severity and storage policy
    • Analyze health trends and detect anomalies
  3. Event Processing:

    • Classify and timestamp all diagnostic events
    • Apply filtering and correlation rules
    • Persist events to appropriate storage (NVS/SD)
    • Distribute events to interested components
  4. Session Management:

    • Handle engineering session requests and authentication
    • Provide secure access to diagnostic data and system health
    • Log all diagnostic session activities for audit
    • Enforce session timeouts and access controls

5.2 Error Handling

| Error Condition        | Detection Method            | Response Action                                |
|------------------------|-----------------------------|------------------------------------------------|
| Storage Full           | Storage capacity monitoring | Implement retention policy, discard oldest events |
| SD Card Failure        | Write operation failure     | Switch to NVS-only storage, log degradation    |
| Memory Exhaustion      | Heap monitoring             | Reduce buffer sizes, increase event filtering  |
| Session Timeout        | Activity monitoring         | Close session, clear authentication            |
| Authentication Failure | Credential validation       | Reject session, log security event             |
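
A sketch of the Storage Full response from the table above, assuming hypothetical storage-manager helpers for counting, aging, and dropping stored events; the 10,000-event / 30-day limits come from Section 2.2.

// Placeholder storage-manager helpers (assumed names).
extern size_t   diag_storage_count(void);
extern uint32_t diag_storage_oldest_age_days(void);
extern void     diag_storage_drop_oldest(void);

// Retention policy sketch: discard the oldest events until both limits are met.
static void enforce_retention_policy(void)
{
    while (diag_storage_count() > 10000 || diag_storage_oldest_age_days() > 30) {
        diag_storage_drop_oldest();
    }
}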

5.3 State-Dependent Behavior

| System State | Feature Behavior                                             |
|--------------|--------------------------------------------------------------|
| INIT         | Initialize storage, load existing events, start monitoring  |
| RUNNING      | Full diagnostic functionality, continuous health monitoring |
| WARNING      | Enhanced monitoring, increased event generation              |
| FAULT        | Critical diagnostics only, preserve fault information        |
| OTA_UPDATE   | Suspend monitoring, log OTA-related events                   |
| TEARDOWN     | Flush pending events, preserve diagnostic state              |
| SERVICE      | Full diagnostic access, engineering session support          |
| SD_DEGRADED  | NVS-only storage, reduced event retention                    |

6. Feature Constraints

6.1 Timing Constraints

  • Event Processing: Maximum 10ms from generation to storage
  • Health Monitoring: 10-second monitoring cycle with ±1 second tolerance
  • Session Response: Maximum 500ms for diagnostic queries
  • Storage Operations: Maximum 100ms for event persistence

6.2 Resource Constraints

  • Memory Usage: Maximum 32KB for diagnostic buffers and storage
  • Event Storage: Maximum 10,000 events or 30 days retention
  • Session Limit: Maximum 2 concurrent diagnostic sessions
  • CPU Usage: Maximum 5% of available CPU time for diagnostics
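
A rough sizing check against the 32KB budget: with the diagnostic_event_t layout from Section 2.1 (roughly 128 bytes after padding on a 32-bit target, depending on enum size), the 100-event RAM ring buffer needs about 12.8KB, leaving headroom for session and health-monitoring state. A compile-time guard could make this explicit:

#include <assert.h>   // static_assert (C11)

// Assumes diagnostic_event_t from Section 2.1 is in scope.
static_assert(100 * sizeof(diagnostic_event_t) <= 32 * 1024,
              "diagnostic ring buffer must fit within the 32KB budget");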

6.3 Security Constraints

  • Session Authentication: All diagnostic access must be authenticated
  • Data Protection: Diagnostic data encrypted when stored
  • Access Logging: All diagnostic activities logged for audit
  • Privilege Separation: Role-based access to diagnostic functions

7. Interface Specifications

7.1 Diagnostics Task Public API

// Initialization and control
bool diagTask_initialize(void);
bool diagTask_start(void);
bool diagTask_stop(void);
bool diagTask_isRunning(void);

// Event management
bool diagTask_reportEvent(const diagnostic_event_t* event);
bool diagTask_getEvents(const diagnostic_filter_t* filter, 
                       diagnostic_event_t* events, size_t* count);
bool diagTask_clearEvents(const diagnostic_filter_t* filter);
bool diagTask_exportEvents(export_format_t format, uint8_t* buffer, size_t* size);

// Health monitoring
bool diagTask_getSystemHealth(system_health_t* health);
bool diagTask_getHealthHistory(health_history_t* history, size_t* count);
bool diagTask_resetHealthMetrics(void);

// Session management
session_id_t diagTask_createSession(session_type_t type);
bool diagTask_authenticateSession(session_id_t session, const auth_credentials_t* creds);
bool diagTask_closeSession(session_id_t session);
bool diagTask_isSessionValid(session_id_t session);
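
An illustrative export call using the API above; EXPORT_FORMAT_JSON is a placeholder enumerator for export_format_t.

static void example_export_events(void)
{
    uint8_t buffer[4096];
    size_t size = sizeof(buffer);   // in: buffer capacity, out: bytes written

    if (diagTask_exportEvents(EXPORT_FORMAT_JSON, buffer, &size)) {
        // hand 'size' bytes off to the engineering session or Main Hub
    }
}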

7.2 Error Handler API

// Error reporting
bool errorHandler_reportError(component_id_t source, error_code_t code, 
                             const char* description, const uint8_t* context_data);
bool errorHandler_reportWarning(component_id_t source, warning_code_t code, 
                               const char* description);
bool errorHandler_reportInfo(component_id_t source, info_code_t code, 
                            const char* description);

// Error classification
diagnostic_severity_t errorHandler_classifyError(error_code_t code);
diagnostic_category_t errorHandler_categorizeError(component_id_t source, error_code_t code);
bool errorHandler_isErrorCritical(error_code_t code);

// Error statistics
bool errorHandler_getErrorStatistics(error_statistics_t* stats);
bool errorHandler_resetErrorStatistics(void);
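
An illustrative call site for the reporting API, assuming error codes reuse the registry values from Section 2.1; COMPONENT_ID_SENSOR_MANAGER and the context bytes are placeholders.

static void example_report_out_of_range(void)
{
    uint8_t ctx[32] = {0};   // raw bus/register status, zero-padded

    errorHandler_reportError(COMPONENT_ID_SENSOR_MANAGER, 0x1002,
                             "Sensor out-of-range value detected", ctx);
}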

7.3 Health Monitor API

// Health monitoring
bool healthMonitor_initialize(void);
bool healthMonitor_startMonitoring(void);
bool healthMonitor_stopMonitoring(void);
bool healthMonitor_getCurrentHealth(system_health_t* health);

// Metric collection
bool healthMonitor_collectMetrics(void);
bool healthMonitor_updateMetric(health_metric_id_t metric_id, float value);
bool healthMonitor_getMetricHistory(health_metric_id_t metric_id, 
                                   metric_history_t* history, size_t* count);

// Anomaly detection
bool healthMonitor_setThreshold(health_metric_id_t metric_id, float threshold);
bool healthMonitor_enableAnomalyDetection(health_metric_id_t metric_id, bool enable);
bool healthMonitor_getAnomalies(anomaly_t* anomalies, size_t* count);
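
An illustrative anomaly-detection setup using the API above; the metric IDs and threshold values are placeholders chosen for the example.

static void example_configure_anomaly_detection(void)
{
    // Alert when free heap drops below ~20 KB.
    healthMonitor_setThreshold(HEALTH_METRIC_FREE_HEAP_BYTES, 20.0f * 1024);
    healthMonitor_enableAnomalyDetection(HEALTH_METRIC_FREE_HEAP_BYTES, true);

    // Alert when Wi-Fi RSSI falls below -80 dBm.
    healthMonitor_setThreshold(HEALTH_METRIC_WIFI_RSSI_DBM, -80.0f);
    healthMonitor_enableAnomalyDetection(HEALTH_METRIC_WIFI_RSSI_DBM, true);
}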

8. Testing and Validation

8.1 Unit Testing

  • Event Generation: Diagnostic event creation and classification
  • Storage Management: Event persistence and retrieval operations
  • Health Monitoring: Metric collection and anomaly detection
  • Session Management: Authentication and access control
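
A unit-test sketch for severity classification, assuming the Unity test framework bundled with ESP-IDF and that error codes map one-to-one to the registry values in Section 2.1.

#include "unity.h"

void test_classify_critical_sensor_failure(void)
{
    // Registry entry 0x1003: FATAL / SENSOR / critical sensor hardware failure.
    TEST_ASSERT_EQUAL(DIAG_SEVERITY_FATAL, errorHandler_classifyError(0x1003));
    TEST_ASSERT_TRUE(errorHandler_isErrorCritical(0x1003));
}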

8.2 Integration Testing

  • Cross-Component Events: Diagnostic events from all system components
  • Storage Integration: NVS and SD card storage operations
  • Event Distribution: Event system integration and notification
  • Session Integration: Engineering access via multiple interfaces

8.3 System Testing

  • Long-Duration Monitoring: 48-hour continuous diagnostic operation
  • Storage Stress Testing: High-frequency event generation and storage
  • Session Security Testing: Authentication bypass attempts
  • Fault Injection Testing: Component failure simulation and detection

8.4 Acceptance Criteria

  • All diagnostic events properly classified and stored
  • Health monitoring detects system anomalies within timing constraints
  • Engineering sessions provide secure access to diagnostic data
  • Storage management maintains data integrity under all conditions
  • Diagnostic overhead remains within the resource constraints of Section 6.2 and does not degrade core system functionality
  • Complete audit trail of all diagnostic activities

9. Dependencies

9.1 Internal Dependencies

  • Event System: Diagnostic event distribution and coordination
  • Data Persistence: Storage abstraction for diagnostic data
  • Security Manager: Session authentication and access control
  • State Manager: System state awareness for state-dependent diagnostics

9.2 External Dependencies

  • ESP-IDF Framework: NVS, SD card, and system monitoring APIs
  • FreeRTOS: Task scheduling and system resource monitoring
  • Hardware Components: SD card, NVS flash, UART interface
  • System Components: All components for health metric collection

10. Future Enhancements

10.1 Planned Improvements

  • Predictive Analytics: Machine learning for failure prediction
  • Advanced Correlation: Multi-component fault correlation analysis
  • Remote Diagnostics: Cloud-based diagnostic data analysis
  • Automated Recovery: Self-healing mechanisms based on diagnostics

10.2 Scalability Considerations

  • Distributed Diagnostics: Cross-hub diagnostic correlation
  • Cloud Integration: Real-time diagnostic streaming to cloud
  • Advanced Analytics: Big data analytics for fleet-wide diagnostics
  • Mobile Interface: Smartphone app for field diagnostic access

Document Status: Final for Implementation Phase
Component Dependencies: Verified against architecture
Requirements Traceability: Complete (SR-DIAG, SWR-DIAG)
Next Review: After component implementation