2026-01-26 12:43:14 +01:00

Feature Specification: Diagnostics & Health Monitoring

Feature ID: F-DIAG (F-DIAG-001 to F-DIAG-004)

Document Type: Feature Specification
Version: 1.0
Date: 2025-01-19
Feature Category: Diagnostics & Health Monitoring

1. Feature Overview

1.1 Feature Purpose

The Diagnostics & Health Monitoring feature provides comprehensive system health assessment, fault detection, diagnostic event management, and engineering access capabilities for the ASF Sensor Hub. This feature ensures system reliability through proactive monitoring, structured fault reporting, and maintenance support.

1.2 Feature Scope

In Scope:

Structured diagnostic code framework with severity classification
Persistent diagnostic event storage and management
Engineering diagnostic sessions with secure access
System health monitoring and performance metrics
Cross-component fault correlation and root cause analysis

Out of Scope:

Main Hub diagnostic aggregation and analysis
Predictive maintenance algorithms (future enhancement)
Hardware fault injection testing equipment
Remote diagnostic access without Main Hub coordination

2. Sub-Features

2.1 F-DIAG-001: Diagnostic Code Management

Description: Comprehensive diagnostic code framework for standardized fault identification, classification, and reporting across all system components.

Diagnostic Code Structure:

#include <stdint.h>

// Enums are declared first so the event struct below can reference them.
typedef enum {
    DIAG_SEVERITY_INFO = 0,     // Informational, no action required
    DIAG_SEVERITY_WARNING = 1,  // Warning, monitoring required
    DIAG_SEVERITY_ERROR = 2,    // Error, corrective action needed
    DIAG_SEVERITY_FATAL = 3     // Fatal, system functionality compromised
} diagnostic_severity_t;

typedef enum {
    DIAG_CATEGORY_SENSOR = 0,   // Sensor-related diagnostics
    DIAG_CATEGORY_COMM = 1,     // Communication diagnostics
    DIAG_CATEGORY_STORAGE = 2,  // Storage and persistence diagnostics
    DIAG_CATEGORY_SYSTEM = 3,   // System management diagnostics
    DIAG_CATEGORY_SECURITY = 4, // Security-related diagnostics
    DIAG_CATEGORY_POWER = 5,    // Power and fault handling diagnostics
    DIAG_CATEGORY_OTA = 6       // OTA update diagnostics
} diagnostic_category_t;

typedef struct {
    uint16_t code;                  // Unique diagnostic code (0x0001-0xFFFF)
    diagnostic_severity_t severity; // INFO, WARNING, ERROR, FATAL
    diagnostic_category_t category; // SENSOR, COMM, STORAGE, SYSTEM, SECURITY, POWER, OTA
    uint64_t timestamp_ms;          // Event occurrence time in milliseconds
    uint8_t source_component_id;    // Component that generated the event
    char description[64];           // Human-readable description
    uint8_t data[32];               // Context-specific diagnostic data
    uint16_t occurrence_count;      // Number of times this event occurred
} diagnostic_event_t;
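
The `occurrence_count` field suggests that repeated events can be coalesced instead of stored individually. The sketch below shows one way this might work. It is not taken from the specification: the trimmed `diag_event_lite_t` type and the helper `diag_recordOccurrence` are hypothetical, and matching on code plus source component is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Trimmed stand-in for diagnostic_event_t: only the fields this sketch needs.
typedef struct {
    uint16_t code;
    uint8_t  source_component_id;
    uint64_t timestamp_ms;
    uint16_t occurrence_count;
} diag_event_lite_t;

// Hypothetical helper (not part of the specified API).
// If (code, source) is already present, bump occurrence_count, saturating at
// UINT16_MAX, and refresh the timestamp. Otherwise append a new entry.
// Returns the entry index, or -1 when the table is full.
static int diag_recordOccurrence(diag_event_lite_t *table, size_t capacity,
                                 size_t *used, uint16_t code,
                                 uint8_t source, uint64_t now_ms)
{
    assert(table != NULL && used != NULL);

    for (size_t i = 0; i < *used; ++i) {
        if (table[i].code == code && table[i].source_component_id == source) {
            if (table[i].occurrence_count < UINT16_MAX) {
                table[i].occurrence_count++;
            }
            table[i].timestamp_ms = now_ms;
            return (int)i;
        }
    }

    if (*used >= capacity) {
        return -1;
    }

    size_t idx = (*used)++;
    table[idx] = (diag_event_lite_t){
        .code = code,
        .source_component_id = source,
        .timestamp_ms = now_ms,
        .occurrence_count = 1,
    };
    return (int)idx;
}
```

Saturating the counter rather than letting it wrap keeps a long-running fault from appearing as a small count after 65,535 occurrences.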

Diagnostic Code Registry (Examples):

Code	Severity	Category	Description
0x1001	WARNING	SENSOR	Sensor communication timeout
0x1002	ERROR	SENSOR	Sensor out-of-range value detected
0x1003	FATAL	SENSOR	Critical sensor hardware failure
0x2001	WARNING	COMM	Wi-Fi signal strength low
0x2002	ERROR	COMM	MQTT broker connection failed
0x2003	FATAL	COMM	TLS certificate validation failed
0x3001	WARNING	STORAGE	SD card space low (< 10%)
0x3002	ERROR	STORAGE	SD card write failure
0x3003	FATAL	STORAGE	SD card not detected
0x4001	INFO	SYSTEM	System state transition
0x4002	WARNING	SYSTEM	Memory usage high (> 80%)
0x4003	FATAL	SYSTEM	Watchdog timer reset
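
As an illustration, the example registry above could be held in a constant lookup table. This is a minimal sketch: the codes, severities, and descriptions come from the table, while the entry type `diag_registry_entry_t` and the function `diag_registryLookup` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

typedef enum {
    DIAG_CATEGORY_SENSOR = 0,
    DIAG_CATEGORY_COMM = 1,
    DIAG_CATEGORY_STORAGE = 2,
    DIAG_CATEGORY_SYSTEM = 3
} diagnostic_category_t; // trimmed to the categories used in the table

// Hypothetical registry entry type; the specification does not define one.
typedef struct {
    uint16_t code;
    diagnostic_severity_t severity;
    diagnostic_category_t category;
    const char *description;
} diag_registry_entry_t;

static const diag_registry_entry_t k_diagRegistry[] = {
    {0x1001, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_SENSOR,  "Sensor communication timeout"},
    {0x1002, DIAG_SEVERITY_ERROR,   DIAG_CATEGORY_SENSOR,  "Sensor out-of-range value detected"},
    {0x1003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_SENSOR,  "Critical sensor hardware failure"},
    {0x2001, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_COMM,    "Wi-Fi signal strength low"},
    {0x2002, DIAG_SEVERITY_ERROR,   DIAG_CATEGORY_COMM,    "MQTT broker connection failed"},
    {0x2003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_COMM,    "TLS certificate validation failed"},
    {0x3001, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_STORAGE, "SD card space low (< 10%)"},
    {0x3002, DIAG_SEVERITY_ERROR,   DIAG_CATEGORY_STORAGE, "SD card write failure"},
    {0x3003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_STORAGE, "SD card not detected"},
    {0x4001, DIAG_SEVERITY_INFO,    DIAG_CATEGORY_SYSTEM,  "System state transition"},
    {0x4002, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_SYSTEM,  "Memory usage high (> 80%)"},
    {0x4003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_SYSTEM,  "Watchdog timer reset"},
};

// Copies the matching entry into *out and returns true, or returns false
// for an unregistered code. Linear search is adequate for a table this small.
static bool diag_registryLookup(uint16_t code, diag_registry_entry_t *out)
{
    assert(out != NULL);
    const size_t n = sizeof(k_diagRegistry) / sizeof(k_diagRegistry[0]);
    for (size_t i = 0; i < n; ++i) {
        if (k_diagRegistry[i].code == code) {
            *out = k_diagRegistry[i];
            return true;
        }
    }
    return false;
}
```

Keeping the table `const` lets the linker place it in flash rather than RAM on most embedded targets.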

2.2 F-DIAG-002: Diagnostic Data Storage

Description: Persistent storage of diagnostic events in non-volatile memory with efficient storage management and retrieval capabilities.

Storage Architecture:

graph TB
    subgraph "Diagnostic Storage System"
        GEN[Diagnostic Generator] --> BUF[Ring Buffer]
        BUF --> FILTER[Severity Filter]
        FILTER --> PERSIST[Persistence Layer]
        PERSIST --> SD[SD Card Storage]
        PERSIST --> NVS[NVS Flash Storage]
    end
    
    subgraph "Storage Policy"
        CRITICAL[FATAL/ERROR Events] --> NVS
        CRITICAL --> SD
        NORMAL[WARNING/INFO Events] --> SD
        OVERFLOW[Buffer Overflow] --> DISCARD[Discard Oldest]
    end
    
    subgraph "Retrieval Interface"
        QUERY[Query Interface] --> PERSIST
        EXPORT[Export Interface] --> PERSIST
        CLEAR[Clear Interface] --> PERSIST
    end

Storage Management:

Ring Buffer: 100 events in RAM for immediate access
NVS Storage: Critical events (ERROR/FATAL) persisted to flash
SD Card Storage: All events stored to SD card when available
Retention Policy: 30 days or 10,000 events maximum
Compression: Event data compressed for efficient storage
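
The 100-event RAM ring buffer with a discard-oldest overflow policy can be sketched as follows. This is an illustrative sketch, not the specified implementation: `diag_ring_t`, `diagRing_push`, and `diagRing_popOldest` are hypothetical names, the event type is trimmed, and locking between tasks is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_RING_CAPACITY 100u // "100 events in RAM" from the list above

// Trimmed stand-in for diagnostic_event_t.
typedef struct {
    uint16_t code;
    uint64_t timestamp_ms;
} diag_event_lite_t;

typedef struct {
    diag_event_lite_t slots[DIAG_RING_CAPACITY];
    size_t   head;    // index of the oldest stored event
    size_t   count;   // number of valid events
    uint32_t dropped; // events overwritten because the buffer was full
} diag_ring_t;

// Stores an event. When the buffer is full, the oldest event is overwritten.
static void diagRing_push(diag_ring_t *rb, const diag_event_lite_t *ev)
{
    assert(rb != NULL && ev != NULL);
    size_t tail = (rb->head + rb->count) % DIAG_RING_CAPACITY;
    rb->slots[tail] = *ev;
    if (rb->count < DIAG_RING_CAPACITY) {
        rb->count++;
    } else {
        // tail == head here, so the oldest slot was just replaced.
        rb->head = (rb->head + 1u) % DIAG_RING_CAPACITY;
        rb->dropped++;
    }
}

// Removes the oldest event into *out. Returns false when the buffer is empty.
static bool diagRing_popOldest(diag_ring_t *rb, diag_event_lite_t *out)
{
    assert(rb != NULL && out != NULL);
    if (rb->count == 0) {
        return false;
    }
    *out = rb->slots[rb->head];
    rb->head = (rb->head + 1u) % DIAG_RING_CAPACITY;
    rb->count--;
    return true;
}
```

Tracking `dropped` gives the health monitor a cheap signal that events are being generated faster than they are persisted.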

2.3 F-DIAG-003: Diagnostic Session

Description: Secure engineering access interface for diagnostic data retrieval, system inspection, and maintenance operations.

Session Types:

Session Type	Access Level	Authentication	Capabilities
Read-Only	Basic	PIN code	View diagnostics, system status
Engineering	Advanced	Certificate	Diagnostic management, configuration
Service	Full	Multi-factor	System control, debug access
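
One way to encode the capability column is a bitmask per session type. The sketch below assumes that access levels are cumulative (Engineering includes Read-Only, Service includes both), which the table implies but does not state. The enumerator names and the `CAP_*` flags are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Enumerator names are assumed; the specification only names the types.
typedef enum {
    SESSION_READ_ONLY,
    SESSION_ENGINEERING,
    SESSION_SERVICE
} session_type_t;

// Hypothetical capability flags derived from the "Capabilities" column.
enum {
    CAP_VIEW_DIAGNOSTICS = 1u << 0,
    CAP_VIEW_STATUS      = 1u << 1,
    CAP_MANAGE_DIAG      = 1u << 2,
    CAP_CONFIGURATION    = 1u << 3,
    CAP_SYSTEM_CONTROL   = 1u << 4,
    CAP_DEBUG_ACCESS     = 1u << 5
};

// Returns the capability mask for a session type (cumulative by assumption).
static uint32_t diag_sessionCapabilities(session_type_t type)
{
    const uint32_t read_only   = CAP_VIEW_DIAGNOSTICS | CAP_VIEW_STATUS;
    const uint32_t engineering = read_only | CAP_MANAGE_DIAG | CAP_CONFIGURATION;
    const uint32_t service     = engineering | CAP_SYSTEM_CONTROL | CAP_DEBUG_ACCESS;

    switch (type) {
    case SESSION_READ_ONLY:   return read_only;
    case SESSION_ENGINEERING: return engineering;
    case SESSION_SERVICE:     return service;
    }
    return 0u; // unknown type: grant nothing
}

// A request is allowed only for an authenticated session that holds every
// capability in `required`.
static bool diag_sessionIsAllowed(session_type_t type, bool authenticated,
                                  uint32_t required)
{
    assert(required != 0u);
    if (!authenticated) {
        return false;
    }
    return (diag_sessionCapabilities(type) & required) == required;
}
```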

Session Interface:

typedef struct {
    session_id_t session_id;
    session_type_t type;
    uint64_t start_time;
    uint64_t last_activity;
    uint32_t timeout_seconds;
    bool authenticated;
    char user_id[32];
} diagnostic_session_t;

// Session management API
session_id_t diag_createSession(session_type_t type);
bool diag_authenticateSession(session_id_t session, const auth_credentials_t* creds);
bool diag_closeSession(session_id_t session);
bool diag_isSessionValid(session_id_t session);

// Diagnostic access API
bool diag_getEvents(session_id_t session, const diagnostic_filter_t* filter,
                    diagnostic_event_t* events, size_t* count);
bool diag_clearEvents(session_id_t session, const diagnostic_filter_t* filter);
bool diag_exportEvents(session_id_t session, export_format_t format, 
                      uint8_t* buffer, size_t* size);
bool diag_getSystemHealth(session_id_t session, system_health_t* health);
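
The `last_activity` and `timeout_seconds` fields support an inactivity timeout. A minimal sketch of that check follows. It assumes `last_activity` is a millisecond timestamp (the specification does not state the unit), and `session_lite_t`, `diag_sessionIsExpired`, and `diag_sessionTouch` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Trimmed stand-in for diagnostic_session_t.
typedef struct {
    uint64_t last_activity;   // assumed to be in milliseconds
    uint32_t timeout_seconds;
    bool     authenticated;
} session_lite_t;

// True when the session has been idle for at least timeout_seconds.
static bool diag_sessionIsExpired(const session_lite_t *s, uint64_t now_ms)
{
    assert(s != NULL);
    if (now_ms < s->last_activity) {
        return false; // clock moved backwards: do not expire on bad data
    }
    return (now_ms - s->last_activity) >= (uint64_t)s->timeout_seconds * 1000u;
}

// Call on every request. Refreshes the activity timestamp, or clears
// authentication and returns false if the session has timed out.
static bool diag_sessionTouch(session_lite_t *s, uint64_t now_ms)
{
    if (diag_sessionIsExpired(s, now_ms)) {
        s->authenticated = false;
        return false;
    }
    s->last_activity = now_ms;
    return true;
}
```

Widening `timeout_seconds` to 64 bits before multiplying avoids overflow for timeouts longer than about 49 days.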

2.4 F-DIAG-004: System Health Monitoring

Description: Continuous monitoring of system performance metrics, resource utilization, and component health status.

Health Metrics:

typedef struct {
    // CPU and Memory
    uint8_t cpu_usage_percent;
    uint32_t free_heap_bytes;
    uint32_t min_free_heap_bytes;
    uint16_t task_count;
    
    // Storage
    uint64_t sd_free_bytes;
    uint64_t sd_total_bytes;
    uint32_t nvs_free_entries;
    uint32_t nvs_used_entries;
    
    // Communication
    int8_t wifi_rssi_dbm;
    uint32_t mqtt_messages_sent;
    uint32_t mqtt_messages_failed;
    uint32_t comm_error_count;
    
    // Sensors
    uint8_t sensors_active;
    uint8_t sensors_total;
    uint8_t sensors_failed;
    uint32_t sensor_error_count;
    
    // System
    uint32_t uptime_seconds;
    uint32_t reset_count;
    system_state_t current_state;
    uint32_t state_change_count;
    
    // Power
    float supply_voltage;
    bool brownout_detected;
    uint32_t power_cycle_count;
} system_health_t;

Health Monitoring Flow:

sequenceDiagram
    participant HM as Health Monitor
    participant COMP as System Components
    participant DIAG as Diagnostic Storage
    participant ES as Event System
    participant HMI as Local HMI
    
    Note over HM,HMI: Health Monitoring Cycle (10 seconds)
    
    loop Every 10 seconds
        HM->>COMP: collectHealthMetrics()
        COMP-->>HM: health_data
        
        HM->>HM: analyzeHealthTrends()
        HM->>HM: detectAnomalies()
        
        alt Anomaly detected
            HM->>DIAG: logDiagnosticEvent(anomaly)
            HM->>ES: publish(HEALTH_ANOMALY, details)
        end
        
        HM->>ES: publish(HEALTH_UPDATE, metrics)
        ES->>HMI: updateHealthDisplay(metrics)
    end
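
The `detectAnomalies()` step could be as simple as fixed threshold checks. The sketch below uses the two thresholds that appear in the example code registry (SD card space below 10%, memory usage above 80%). The total heap size is passed in as a parameter because `system_health_t` does not carry it. `health_lite_t` and `health_detectAnomalies` are hypothetical names, and trend analysis is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_CODE_SD_SPACE_LOW   0x3001u // "SD card space low (< 10%)"
#define DIAG_CODE_MEM_USAGE_HIGH 0x4002u // "Memory usage high (> 80%)"

// Trimmed stand-in for system_health_t.
typedef struct {
    uint32_t free_heap_bytes;
    uint64_t sd_free_bytes;
    uint64_t sd_total_bytes;
} health_lite_t;

// Writes the diagnostic codes of detected anomalies into `codes` and returns
// how many were written (at most max_codes). Integer math avoids floats.
static size_t health_detectAnomalies(const health_lite_t *h,
                                     uint32_t total_heap_bytes,
                                     uint16_t *codes, size_t max_codes)
{
    assert(h != NULL && codes != NULL);
    size_t n = 0;

    // Free space below 10% of total: free * 10 < total.
    if (n < max_codes && h->sd_total_bytes > 0 &&
        h->sd_free_bytes * 10u < h->sd_total_bytes) {
        codes[n++] = DIAG_CODE_SD_SPACE_LOW;
    }

    // Usage above 80% is the same as free heap below 20%: free * 5 < total.
    if (n < max_codes && total_heap_bytes > 0 &&
        (uint64_t)h->free_heap_bytes * 5u < total_heap_bytes) {
        codes[n++] = DIAG_CODE_MEM_USAGE_HIGH;
    }

    return n;
}
```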

3. Requirements Coverage

3.1 System Requirements (SR-XXX)

Feature	System Requirements	Description
F-DIAG-001	SR-DIAG-001, SR-DIAG-002, SR-DIAG-003, SR-DIAG-004	Diagnostic code framework and event management
F-DIAG-002	SR-DIAG-005, SR-DIAG-006, SR-DIAG-007	Persistent diagnostic storage and retention
F-DIAG-003	SR-DIAG-008, SR-DIAG-009, SR-DIAG-010, SR-DIAG-011	Engineering diagnostic sessions and access control
F-DIAG-004	SR-DIAG-012, SR-DIAG-013, SR-DIAG-014	System health monitoring and performance metrics

3.2 Software Requirements (SWR-XXX)

Feature	Software Requirements	Implementation Details
F-DIAG-001	SWR-DIAG-001, SWR-DIAG-002, SWR-DIAG-003	Event structure, code registry, severity classification
F-DIAG-002	SWR-DIAG-004, SWR-DIAG-005, SWR-DIAG-006	Storage management, persistence, retrieval interface
F-DIAG-003	SWR-DIAG-007, SWR-DIAG-008, SWR-DIAG-009	Session management, authentication, access control
F-DIAG-004	SWR-DIAG-010, SWR-DIAG-011, SWR-DIAG-012	Health metrics collection, anomaly detection, reporting

4. Component Implementation Mapping

4.1 Primary Components

Component	Responsibility	Location
Diagnostics Task	Health monitoring, event coordination, session management	`application_layer/diag_task/`
Error Handler	Diagnostic event generation, fault classification	`application_layer/error_handler/`
Diagnostic Storage Manager	Event persistence, retrieval, storage management	`application_layer/diag_storage/`
Health Monitor	System metrics collection, anomaly detection	`application_layer/health_monitor/`

4.2 Supporting Components

Component	Support Role	Interface
Event System	Diagnostic event distribution, component coordination	`application_layer/business_stack/event_system/`
Data Persistence	Storage abstraction, NVS and SD card access	`application_layer/DP_stack/persistence/`
Security Manager	Session authentication, access control	`application_layer/security/`
State Manager	System state awareness, state-dependent diagnostics	`application_layer/business_stack/STM/`

4.3 Component Interaction Diagram

graph TB
    subgraph "Diagnostics & Health Monitoring Feature"
        DT[Diagnostics Task]
        EH[Error Handler]
        DSM[Diagnostic Storage Manager]
        HM[Health Monitor]
    end
    
    subgraph "Core System Components"
        ES[Event System]
        DP[Data Persistence]
        SEC[Security Manager]
        STM[State Manager]
    end
    
    subgraph "System Components"
        SM[Sensor Manager]
        COM[Communication]
        OTA[OTA Manager]
        PWR[Power Manager]
    end
    
    subgraph "Storage"
        NVS[NVS Flash]
        SD[SD Card]
    end
    
    subgraph "Interfaces"
        HMI[Local HMI]
        UART[UART Debug]
        NET[Network Session]
    end
    
    DT <--> ES
    DT <--> DSM
    DT <--> HM
    DT <--> SEC
    
    EH --> ES
    EH --> DSM
    
    DSM <--> DP
    DSM --> NVS
    DSM --> SD
    
    HM --> SM
    HM --> COM
    HM --> OTA
    HM --> PWR
    HM --> STM
    
    ES -.->|Health Events| HMI
    ES -.->|Diagnostic Events| COM
    DT -.->|Session Access| UART
    DT -.->|Session Access| NET

4.4 Diagnostic Event Flow

sequenceDiagram
    participant COMP as System Component
    participant EH as Error Handler
    participant ES as Event System
    participant DSM as Diagnostic Storage
    participant DT as Diagnostics Task
    participant COM as Communication
    
    Note over COMP,COM: Diagnostic Event Generation and Processing
    
    COMP->>EH: reportError(error_info)
    EH->>EH: classifyError(error_info)
    EH->>EH: generateDiagnosticEvent()
    
    EH->>ES: publish(DIAGNOSTIC_EVENT, event)
    ES->>DSM: storeDiagnosticEvent(event)
    ES->>DT: processDiagnosticEvent(event)
    ES->>COM: reportDiagnosticEvent(event)
    
    DSM->>DSM: checkStoragePolicy(event.severity)
    
    alt Critical Event (ERROR/FATAL)
        DSM->>NVS: persistToFlash(event)
    end
    
    DSM->>SD: persistToSDCard(event)
    
    DT->>DT: updateHealthMetrics(event)
    DT->>DT: checkSystemHealth()
    
    alt Health degradation detected
        DT->>ES: publish(HEALTH_DEGRADATION, metrics)
    end
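
The `checkStoragePolicy` step in the flow above can be sketched as a pure function that maps severity to storage targets. ERROR and FATAL events go to NVS, and every event goes to the SD card when it is available. Routing lower-severity events to NVS while the SD card is unavailable is an assumption; the specification only says storage becomes NVS-only. `diag_storageTargets` and the `DIAG_TARGET_*` flags are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

// Hypothetical bit flags naming where an event should be persisted.
enum {
    DIAG_TARGET_NONE = 0,
    DIAG_TARGET_NVS  = 1u << 0,
    DIAG_TARGET_SD   = 1u << 1
};

// Returns the set of storage targets for an event of the given severity.
static unsigned diag_storageTargets(diagnostic_severity_t severity,
                                    bool sd_available)
{
    assert(severity <= DIAG_SEVERITY_FATAL);
    unsigned targets = DIAG_TARGET_NONE;

    if (severity >= DIAG_SEVERITY_ERROR) {
        targets |= DIAG_TARGET_NVS; // critical events always reach flash
    }
    if (sd_available) {
        targets |= DIAG_TARGET_SD;  // all events go to SD when present
    } else {
        targets |= DIAG_TARGET_NVS; // assumption: NVS-only fallback
    }
    return targets;
}
```

Keeping the policy free of I/O makes it easy to unit test separately from the persistence layer.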

5. Feature Behavior

5.1 Normal Operation Flow

1. System Initialization:
   - Initialize diagnostic storage and load existing events
   - Start health monitoring tasks and metric collection
   - Register diagnostic event handlers with all components
   - Establish baseline health metrics and thresholds
2. Continuous Monitoring:
   - Collect system health metrics every 10 seconds
   - Process diagnostic events from all system components
   - Store events according to severity and storage policy
   - Analyze health trends and detect anomalies
3. Event Processing:
   - Classify and timestamp all diagnostic events
   - Apply filtering and correlation rules
   - Persist events to appropriate storage (NVS/SD)
   - Distribute events to interested components
4. Session Management:
   - Handle engineering session requests and authentication
   - Provide secure access to diagnostic data and system health
   - Log all diagnostic session activities for audit
   - Enforce session timeouts and access controls

5.2 Error Handling

Error Condition	Detection Method	Response Action
Storage Full	Storage capacity monitoring	Implement retention policy, discard oldest events
SD Card Failure	Write operation failure	Switch to NVS-only storage, log degradation
Memory Exhaustion	Heap monitoring	Reduce buffer sizes, increase event filtering
Session Timeout	Activity monitoring	Close session, clear authentication
Authentication Failure	Credential validation	Reject session, log security event

5.3 State-Dependent Behavior

System State	Feature Behavior
INIT	Initialize storage, load existing events, start monitoring
RUNNING	Full diagnostic functionality, continuous health monitoring
WARNING	Enhanced monitoring, increased event generation
FAULT	Critical diagnostics only, preserve fault information
OTA_UPDATE	Suspend monitoring, log OTA-related events
TEARDOWN	Flush pending events, preserve diagnostic state
SERVICE	Full diagnostic access, engineering session support
SD_DEGRADED	NVS-only storage, reduced event retention
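
The state table above lends itself to a constant lookup indexed by system state. The following sketch encodes only the behaviors the table states explicitly. The `SYS_STATE_*` enumerator names, `diag_behavior_t`, and `diag_shouldStore` are hypothetical, and "critical" is taken to mean ERROR or FATAL, as in the storage policy.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

// Enumerator names are assumed; the states themselves come from the table.
typedef enum {
    SYS_STATE_INIT,
    SYS_STATE_RUNNING,
    SYS_STATE_WARNING,
    SYS_STATE_FAULT,
    SYS_STATE_OTA_UPDATE,
    SYS_STATE_TEARDOWN,
    SYS_STATE_SERVICE,
    SYS_STATE_SD_DEGRADED,
    SYS_STATE_COUNT
} system_state_t;

// Hypothetical per-state behavior flags. Unlisted flags default to false.
typedef struct {
    bool enhanced_monitoring;  // WARNING
    bool critical_only;        // FAULT
    bool monitoring_suspended; // OTA_UPDATE
    bool flush_pending;        // TEARDOWN
    bool nvs_only;             // SD_DEGRADED
} diag_behavior_t;

static const diag_behavior_t k_diagBehavior[SYS_STATE_COUNT] = {
    [SYS_STATE_WARNING]     = { .enhanced_monitoring = true },
    [SYS_STATE_FAULT]       = { .critical_only = true },
    [SYS_STATE_OTA_UPDATE]  = { .monitoring_suspended = true },
    [SYS_STATE_TEARDOWN]    = { .flush_pending = true },
    [SYS_STATE_SD_DEGRADED] = { .nvs_only = true },
};

// In FAULT only ERROR/FATAL events are kept; other states keep everything.
static bool diag_shouldStore(system_state_t state, diagnostic_severity_t sev)
{
    assert((unsigned)state < (unsigned)SYS_STATE_COUNT);
    if (k_diagBehavior[state].critical_only) {
        return sev >= DIAG_SEVERITY_ERROR;
    }
    return true;
}
```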

6. Feature Constraints

6.1 Timing Constraints

Event Processing: Maximum 10 ms from generation to storage
Health Monitoring: 10-second monitoring cycle with ±1 second tolerance
Session Response: Maximum 500 ms for diagnostic queries
Storage Operations: Maximum 100 ms for event persistence

6.2 Resource Constraints

Memory Usage: Maximum 32 KB for diagnostic buffers and storage
Event Storage: Maximum 10,000 events or 30 days retention
Session Limit: Maximum 2 concurrent diagnostic sessions
CPU Usage: Maximum 5% of available CPU time for diagnostics
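
The memory limit can be guarded at compile time. The sketch below checks that a 100-entry ring of `diagnostic_event_t` fits inside the 32 KB budget. It covers only the RAM ring buffer, not other diagnostic allocations, and the macro names and `diag_ringFootprintBytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

typedef enum {
    DIAG_CATEGORY_SENSOR = 0,
    DIAG_CATEGORY_COMM = 1,
    DIAG_CATEGORY_STORAGE = 2,
    DIAG_CATEGORY_SYSTEM = 3,
    DIAG_CATEGORY_SECURITY = 4,
    DIAG_CATEGORY_POWER = 5,
    DIAG_CATEGORY_OTA = 6
} diagnostic_category_t;

typedef struct {
    uint16_t code;
    diagnostic_severity_t severity;
    diagnostic_category_t category;
    uint64_t timestamp_ms;
    uint8_t source_component_id;
    char description[64];
    uint8_t data[32];
    uint16_t occurrence_count;
} diagnostic_event_t;

#define DIAG_RING_CAPACITY       100u           // events held in RAM
#define DIAG_MEMORY_BUDGET_BYTES (32u * 1024u)  // 32 KB limit from this section

// Fails the build if the event struct grows past what the budget allows.
_Static_assert(sizeof(diagnostic_event_t) * DIAG_RING_CAPACITY
                   <= DIAG_MEMORY_BUDGET_BYTES,
               "diagnostic ring buffer exceeds the 32 KB memory budget");

// Runtime view of the same figure, e.g. for a health report.
static size_t diag_ringFootprintBytes(void)
{
    return sizeof(diagnostic_event_t) * DIAG_RING_CAPACITY;
}
```

The exact `sizeof` depends on the target ABI because of padding around the 64-bit timestamp, which is why the check is on the compiled size and not on a hand-computed sum of the fields.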

6.3 Security Constraints

Session Authentication: All diagnostic access must be authenticated
Data Protection: Diagnostic data encrypted when stored
Access Logging: All diagnostic activities logged for audit
Privilege Separation: Role-based access to diagnostic functions

7. Interface Specifications

7.1 Diagnostics Task Public API

// Initialization and control
bool diagTask_initialize(void);
bool diagTask_start(void);
bool diagTask_stop(void);
bool diagTask_isRunning(void);

// Event management
bool diagTask_reportEvent(const diagnostic_event_t* event);
bool diagTask_getEvents(const diagnostic_filter_t* filter, 
                       diagnostic_event_t* events, size_t* count);
bool diagTask_clearEvents(const diagnostic_filter_t* filter);
bool diagTask_exportEvents(export_format_t format, uint8_t* buffer, size_t* size);

// Health monitoring
bool diagTask_getSystemHealth(system_health_t* health);
bool diagTask_getHealthHistory(health_history_t* history, size_t* count);
bool diagTask_resetHealthMetrics(void);

// Session management
session_id_t diagTask_createSession(session_type_t type);
bool diagTask_authenticateSession(session_id_t session, const auth_credentials_t* creds);
bool diagTask_closeSession(session_id_t session);
bool diagTask_isSessionValid(session_id_t session);

7.2 Error Handler API

// Error reporting
bool errorHandler_reportError(component_id_t source, error_code_t code, 
                             const char* description, const uint8_t* context_data);
bool errorHandler_reportWarning(component_id_t source, warning_code_t code, 
                               const char* description);
bool errorHandler_reportInfo(component_id_t source, info_code_t code, 
                            const char* description);

// Error classification
diagnostic_severity_t errorHandler_classifyError(error_code_t code);
diagnostic_category_t errorHandler_categorizeError(component_id_t source, error_code_t code);
bool errorHandler_isErrorCritical(error_code_t code);

// Error statistics
bool errorHandler_getErrorStatistics(error_statistics_t* stats);
bool errorHandler_resetErrorStatistics(void);

7.3 Health Monitor API

// Health monitoring
bool healthMonitor_initialize(void);
bool healthMonitor_startMonitoring(void);
bool healthMonitor_stopMonitoring(void);
bool healthMonitor_getCurrentHealth(system_health_t* health);

// Metric collection
bool healthMonitor_collectMetrics(void);
bool healthMonitor_updateMetric(health_metric_id_t metric_id, float value);
bool healthMonitor_getMetricHistory(health_metric_id_t metric_id, 
                                   metric_history_t* history, size_t* count);

// Anomaly detection
bool healthMonitor_setThreshold(health_metric_id_t metric_id, float threshold);
bool healthMonitor_enableAnomalyDetection(health_metric_id_t metric_id, bool enable);
bool healthMonitor_getAnomalies(anomaly_t* anomalies, size_t* count);

8. Testing and Validation

8.1 Unit Testing

Event Generation: Diagnostic event creation and classification
Storage Management: Event persistence and retrieval operations
Health Monitoring: Metric collection and anomaly detection
Session Management: Authentication and access control

8.2 Integration Testing

Cross-Component Events: Diagnostic events from all system components
Storage Integration: NVS and SD card storage operations
Event Distribution: Event system integration and notification
Session Integration: Engineering access via multiple interfaces

8.3 System Testing

Long-Duration Monitoring: 48-hour continuous diagnostic operation
Storage Stress Testing: High-frequency event generation and storage
Session Security Testing: Authentication bypass attempts
Fault Injection Testing: Component failure simulation and detection

8.4 Acceptance Criteria

All diagnostic events properly classified and stored
Health monitoring detects system anomalies within timing constraints
Engineering sessions provide secure access to diagnostic data
Storage management maintains data integrity under all conditions
Diagnostic overhead does not degrade core system functionality
Complete audit trail of all diagnostic activities

9. Dependencies

9.1 Internal Dependencies

Event System: Diagnostic event distribution and coordination
Data Persistence: Storage abstraction for diagnostic data
Security Manager: Session authentication and access control
State Manager: System state awareness for state-dependent diagnostics

9.2 External Dependencies

ESP-IDF Framework: NVS, SD card, and system monitoring APIs
FreeRTOS: Task scheduling and system resource monitoring
Hardware Components: SD card, NVS flash, UART interface
System Components: All components for health metric collection

10. Future Enhancements

10.1 Planned Improvements

Predictive Analytics: Machine learning for failure prediction
Advanced Correlation: Multi-component fault correlation analysis
Remote Diagnostics: Cloud-based diagnostic data analysis
Automated Recovery: Self-healing mechanisms based on diagnostics

10.2 Scalability Considerations

Distributed Diagnostics: Cross-hub diagnostic correlation
Cloud Integration: Real-time diagnostic streaming to cloud
Advanced Analytics: Big data analytics for fleet-wide diagnostics
Mobile Interface: Smartphone app for field diagnostic access

Document Status: Final for Implementation Phase
Component Dependencies: Verified against architecture
Requirements Traceability: Complete (SR-DIAG, SWR-DIAG)
Next Review: After component implementation
