Feature Specification: Diagnostics & Health Monitoring
Feature ID: F-DIAG (F-DIAG-001 to F-DIAG-004)
Document Type: Feature Specification
Version: 1.0
Date: 2025-01-19
Feature Category: Diagnostics & Health Monitoring
1. Feature Overview
1.1 Feature Purpose
The Diagnostics & Health Monitoring feature provides comprehensive system health assessment, fault detection, diagnostic event management, and engineering access capabilities for the ASF Sensor Hub. This feature ensures system reliability through proactive monitoring, structured fault reporting, and maintenance support.
1.2 Feature Scope
In Scope:
- Structured diagnostic code framework with severity classification
- Persistent diagnostic event storage and management
- Engineering diagnostic sessions with secure access
- System health monitoring and performance metrics
- Cross-component fault correlation and root cause analysis
Out of Scope:
- Main Hub diagnostic aggregation and analysis
- Predictive maintenance algorithms (future enhancement)
- Hardware fault injection testing equipment
- Remote diagnostic access without Main Hub coordination
2. Sub-Features
2.1 F-DIAG-001: Diagnostic Code Management
Description: Comprehensive diagnostic code framework for standardized fault identification, classification, and reporting across all system components.
Diagnostic Code Structure:
#include <stdint.h>
// Enums are declared first so that diagnostic_event_t can reference them
typedef enum {
DIAG_SEVERITY_INFO = 0, // Informational, no action required
DIAG_SEVERITY_WARNING = 1, // Warning, monitoring required
DIAG_SEVERITY_ERROR = 2, // Error, corrective action needed
DIAG_SEVERITY_FATAL = 3 // Fatal, system functionality compromised
} diagnostic_severity_t;
typedef enum {
DIAG_CATEGORY_SENSOR = 0, // Sensor-related diagnostics
DIAG_CATEGORY_COMM = 1, // Communication diagnostics
DIAG_CATEGORY_STORAGE = 2, // Storage and persistence diagnostics
DIAG_CATEGORY_SYSTEM = 3, // System management diagnostics
DIAG_CATEGORY_SECURITY = 4, // Security-related diagnostics
DIAG_CATEGORY_POWER = 5, // Power and fault handling diagnostics
DIAG_CATEGORY_OTA = 6 // OTA update diagnostics
} diagnostic_category_t;
typedef struct {
uint16_t code; // Unique diagnostic code (0x0001-0xFFFF)
diagnostic_severity_t severity; // INFO, WARNING, ERROR, FATAL
diagnostic_category_t category; // SENSOR, COMM, STORAGE, SYSTEM, SECURITY, POWER, OTA
uint64_t timestamp_ms; // Event occurrence time
uint8_t source_component_id; // Component that generated the event
char description[64]; // Human-readable description
uint8_t data[32]; // Context-specific diagnostic data
uint16_t occurrence_count; // Number of times this event occurred
} diagnostic_event_t;
Diagnostic Code Registry (Examples):
| Code | Severity | Category | Description |
|---|---|---|---|
| 0x1001 | WARNING | SENSOR | Sensor communication timeout |
| 0x1002 | ERROR | SENSOR | Sensor out-of-range value detected |
| 0x1003 | FATAL | SENSOR | Critical sensor hardware failure |
| 0x2001 | WARNING | COMM | Wi-Fi signal strength low |
| 0x2002 | ERROR | COMM | MQTT broker connection failed |
| 0x2003 | FATAL | COMM | TLS certificate validation failed |
| 0x3001 | WARNING | STORAGE | SD card space low (< 10%) |
| 0x3002 | ERROR | STORAGE | SD card write failure |
| 0x3003 | FATAL | STORAGE | SD card not detected |
| 0x4001 | INFO | SYSTEM | System state transition |
| 0x4002 | WARNING | SYSTEM | Memory usage high (> 80%) |
| 0x4003 | FATAL | SYSTEM | Watchdog timer reset |
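One way the example registry could be held in firmware is a constant table with a linear lookup. The following is a minimal sketch under that assumption; `diag_registry_entry_t`, `diag_registry`, and `diag_registry_lookup` are hypothetical names that do not appear in the specified API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same values as the severity and category enums in section 2.1. */
typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

typedef enum {
    DIAG_CATEGORY_SENSOR = 0,
    DIAG_CATEGORY_COMM = 1,
    DIAG_CATEGORY_STORAGE = 2,
    DIAG_CATEGORY_SYSTEM = 3,
    DIAG_CATEGORY_SECURITY = 4,
    DIAG_CATEGORY_POWER = 5,
    DIAG_CATEGORY_OTA = 6
} diagnostic_category_t;

/* Hypothetical registry entry; the spec does not define this type. */
typedef struct {
    uint16_t code;
    diagnostic_severity_t severity;
    diagnostic_category_t category;
    const char *description;
} diag_registry_entry_t;

/* Entries copied from the example registry table. */
static const diag_registry_entry_t diag_registry[] = {
    {0x1001, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_SENSOR,  "Sensor communication timeout"},
    {0x1002, DIAG_SEVERITY_ERROR,   DIAG_CATEGORY_SENSOR,  "Sensor out-of-range value detected"},
    {0x1003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_SENSOR,  "Critical sensor hardware failure"},
    {0x2001, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_COMM,    "Wi-Fi signal strength low"},
    {0x2002, DIAG_SEVERITY_ERROR,   DIAG_CATEGORY_COMM,    "MQTT broker connection failed"},
    {0x2003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_COMM,    "TLS certificate validation failed"},
    {0x3001, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_STORAGE, "SD card space low (< 10%)"},
    {0x3002, DIAG_SEVERITY_ERROR,   DIAG_CATEGORY_STORAGE, "SD card write failure"},
    {0x3003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_STORAGE, "SD card not detected"},
    {0x4001, DIAG_SEVERITY_INFO,    DIAG_CATEGORY_SYSTEM,  "System state transition"},
    {0x4002, DIAG_SEVERITY_WARNING, DIAG_CATEGORY_SYSTEM,  "Memory usage high (> 80%)"},
    {0x4003, DIAG_SEVERITY_FATAL,   DIAG_CATEGORY_SYSTEM,  "Watchdog timer reset"},
};

/* Copies the entry for `code` into `out`; returns false for unknown codes. */
bool diag_registry_lookup(uint16_t code, diag_registry_entry_t *out)
{
    assert(out != NULL);
    const size_t n = sizeof(diag_registry) / sizeof(diag_registry[0]);
    for (size_t i = 0; i < n; i++) {
        if (diag_registry[i].code == code) {
            *out = diag_registry[i];
            return true;
        }
    }
    return false;
}
```

A linear scan is adequate for a table of this size; a larger registry might instead be sorted by code and searched with `bsearch`.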
2.2 F-DIAG-002: Diagnostic Data Storage
Description: Persistent storage of diagnostic events in non-volatile memory with efficient storage management and retrieval capabilities.
Storage Architecture:
graph TB
subgraph "Diagnostic Storage System"
GEN[Diagnostic Generator] --> BUF[Ring Buffer]
BUF --> FILTER[Severity Filter]
FILTER --> PERSIST[Persistence Layer]
PERSIST --> SD[SD Card Storage]
PERSIST --> NVS[NVS Flash Storage]
end
subgraph "Storage Policy"
CRITICAL[FATAL/ERROR Events] --> NVS
NORMAL[WARNING/INFO Events] --> SD
OVERFLOW[Buffer Overflow] --> DISCARD[Discard Oldest]
end
subgraph "Retrieval Interface"
QUERY[Query Interface] --> PERSIST
EXPORT[Export Interface] --> PERSIST
CLEAR[Clear Interface] --> PERSIST
end
Storage Management:
- Ring Buffer: 100 events in RAM for immediate access
- NVS Storage: Critical events (ERROR/FATAL) persisted to flash
- SD Card Storage: All events stored to SD card when available
- Retention Policy: 30 days or 10,000 events maximum
- Compression: Event data compressed for efficient storage
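The ring buffer and storage policy above can be sketched as follows. This is an illustration only: `diag_event_lite_t`, `diag_ring_t`, `diag_ring_push`, `diag_select_sinks`, and the `DIAG_SINK_*` flags are hypothetical names, and the sketch assumes that ERROR/FATAL events go to NVS in addition to the SD card whenever the card is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_RING_CAPACITY 100u   /* "100 events in RAM" */
#define DIAG_SINK_SD       0x1u   /* hypothetical sink flags */
#define DIAG_SINK_NVS      0x2u

typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

/* Reduced event record for illustration; diagnostic_event_t has more fields. */
typedef struct {
    uint16_t code;
    diagnostic_severity_t severity;
} diag_event_lite_t;

typedef struct {
    diag_event_lite_t slots[DIAG_RING_CAPACITY];
    size_t head;    /* index of the oldest stored event */
    size_t count;   /* number of valid events */
} diag_ring_t;

void diag_ring_init(diag_ring_t *ring)
{
    assert(ring != NULL);
    ring->head = 0;
    ring->count = 0;
}

/* Appends an event; returns true if the oldest event was discarded. */
bool diag_ring_push(diag_ring_t *ring, diag_event_lite_t ev)
{
    assert(ring != NULL);
    /* When the buffer is full, tail equals head, so the oldest slot is reused. */
    size_t tail = (ring->head + ring->count) % DIAG_RING_CAPACITY;
    ring->slots[tail] = ev;
    if (ring->count < DIAG_RING_CAPACITY) {
        ring->count++;
        return false;
    }
    ring->head = (ring->head + 1u) % DIAG_RING_CAPACITY;
    return true;
}

/* Chooses persistent sinks: NVS for ERROR/FATAL, SD for everything when present. */
unsigned diag_select_sinks(diagnostic_severity_t severity, bool sd_available)
{
    unsigned sinks = 0;
    if (severity >= DIAG_SEVERITY_ERROR) {
        sinks |= DIAG_SINK_NVS;
    }
    if (sd_available) {
        sinks |= DIAG_SINK_SD;
    }
    return sinks;
}
```

Under these assumptions a WARNING raised while the SD card is unavailable is kept only in the RAM ring buffer; whether such events should also fall back to NVS is a policy decision this sketch does not settle.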
2.3 F-DIAG-003: Diagnostic Session
Description: Secure engineering access interface for diagnostic data retrieval, system inspection, and maintenance operations.
Session Types:
| Session Type | Access Level | Authentication | Capabilities |
|---|---|---|---|
| Read-Only | Basic | PIN code | View diagnostics, system status |
| Engineering | Advanced | Certificate | Diagnostic management, configuration |
| Service | Full | Multi-factor | System control, debug access |
Session Interface:
typedef struct {
session_id_t session_id;
session_type_t type;
uint64_t start_time;
uint64_t last_activity;
uint32_t timeout_seconds;
bool authenticated;
char user_id[32];
} diagnostic_session_t;
// Session management API
session_id_t diag_createSession(session_type_t type);
bool diag_authenticateSession(session_id_t session, const auth_credentials_t* creds);
bool diag_closeSession(session_id_t session);
bool diag_isSessionValid(session_id_t session);
// Diagnostic access API
bool diag_getEvents(session_id_t session, diagnostic_filter_t* filter,
diagnostic_event_t* events, size_t* count);
bool diag_clearEvents(session_id_t session, diagnostic_filter_t* filter);
bool diag_exportEvents(session_id_t session, export_format_t format,
uint8_t* buffer, size_t* size);
bool diag_getSystemHealth(session_id_t session, system_health_t* health);
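The checks behind `diag_isSessionValid` and the per-session-type capabilities might look like the sketch below. It is not the specified implementation: `session_view_t`, the `SESSION_*` and `DIAG_OP_*` enumerators, and both functions are hypothetical, `last_activity` is assumed to be in milliseconds, and treating event clearing as an Engineering-level operation is an interpretation of "diagnostic management".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enumerators, ordered by increasing access level. */
typedef enum {
    SESSION_READ_ONLY = 0,
    SESSION_ENGINEERING = 1,
    SESSION_SERVICE = 2
} session_type_t;

/* Subset of diagnostic_session_t needed for the checks. */
typedef struct {
    session_type_t type;
    uint64_t last_activity;     /* assumed unit: milliseconds */
    uint32_t timeout_seconds;
    bool authenticated;
} session_view_t;

/* Hypothetical operation classes, one per row of the session table. */
typedef enum {
    DIAG_OP_VIEW_EVENTS,
    DIAG_OP_CLEAR_EVENTS,
    DIAG_OP_SYSTEM_CONTROL
} diag_operation_t;

/* A session is valid while authenticated and idle for less than its timeout. */
bool session_is_valid(const session_view_t *s, uint64_t now_ms)
{
    assert(s != NULL);
    if (!s->authenticated) {
        return false;
    }
    if (now_ms < s->last_activity) {
        return false;   /* clock moved backwards: fail closed */
    }
    uint64_t idle_ms = now_ms - s->last_activity;
    return idle_ms < (uint64_t)s->timeout_seconds * 1000u;
}

/* Role check: higher session types include the capabilities of lower ones. */
bool session_allows(const session_view_t *s, diag_operation_t op)
{
    assert(s != NULL);
    switch (op) {
    case DIAG_OP_VIEW_EVENTS:
        return true;
    case DIAG_OP_CLEAR_EVENTS:
        return s->type >= SESSION_ENGINEERING;
    case DIAG_OP_SYSTEM_CONTROL:
        return s->type == SESSION_SERVICE;
    }
    return false;
}
```

An access function such as `diag_clearEvents` would be expected to call both checks before touching storage, and to refresh `last_activity` on success.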
2.4 F-DIAG-004: System Health Monitoring
Description: Continuous monitoring of system performance metrics, resource utilization, and component health status.
Health Metrics:
typedef struct {
// CPU and Memory
uint8_t cpu_usage_percent;
uint32_t free_heap_bytes;
uint32_t min_free_heap_bytes;
uint16_t task_count;
// Storage
uint64_t sd_free_bytes;
uint64_t sd_total_bytes;
uint32_t nvs_free_entries;
uint32_t nvs_used_entries;
// Communication
int8_t wifi_rssi_dbm;
uint32_t mqtt_messages_sent;
uint32_t mqtt_messages_failed;
uint32_t comm_error_count;
// Sensors
uint8_t sensors_active;
uint8_t sensors_total;
uint8_t sensors_failed;
uint32_t sensor_error_count;
// System
uint32_t uptime_seconds;
uint32_t reset_count;
system_state_t current_state;
uint32_t state_change_count;
// Power
float supply_voltage;
bool brownout_detected;
uint32_t power_cycle_count;
} system_health_t;
Health Monitoring Flow:
sequenceDiagram
participant HM as Health Monitor
participant COMP as System Components
participant DIAG as Diagnostic Storage
participant ES as Event System
participant HMI as Local HMI
Note over HM,HMI: Health Monitoring Cycle (10 seconds)
loop Every 10 seconds
HM->>COMP: collectHealthMetrics()
COMP-->>HM: health_data
HM->>HM: analyzeHealthTrends()
HM->>HM: detectAnomalies()
alt Anomaly detected
HM->>DIAG: logDiagnosticEvent(anomaly)
HM->>ES: publish(HEALTH_ANOMALY, details)
end
HM->>ES: publish(HEALTH_UPDATE, metrics)
ES->>HMI: updateHealthDisplay(metrics)
end
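The `detectAnomalies()` step in the cycle above could, in its simplest form, compare the collected metrics against the two thresholds stated in the diagnostic code registry (SD card space below 10%, memory usage above 80%). The sketch below assumes this threshold-only approach; `health_sample_t` and `health_detect_anomalies` are hypothetical names, and the total heap size is passed in as a parameter because `system_health_t` does not carry it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_CODE_SD_SPACE_LOW 0x3001u  /* "SD card space low (< 10%)" */
#define DIAG_CODE_MEMORY_HIGH  0x4002u  /* "Memory usage high (> 80%)" */

/* Subset of system_health_t used by this check. */
typedef struct {
    uint32_t free_heap_bytes;
    uint64_t sd_free_bytes;
    uint64_t sd_total_bytes;
} health_sample_t;

/*
 * Writes the diagnostic codes of detected anomalies into `codes`
 * (at most `max_codes`) and returns how many were written.
 */
size_t health_detect_anomalies(const health_sample_t *h, uint32_t total_heap_bytes,
                               uint16_t *codes, size_t max_codes)
{
    assert(h != NULL && codes != NULL);
    size_t n = 0;

    /* SD free space below 10% of capacity (integer arithmetic only). */
    if (h->sd_total_bytes > 0 && h->sd_free_bytes < h->sd_total_bytes / 10u) {
        if (n < max_codes) {
            codes[n++] = DIAG_CODE_SD_SPACE_LOW;
        }
    }

    /* Heap usage above 80%: used / total > 4/5, i.e. used * 5 > total * 4. */
    if (total_heap_bytes > 0 && h->free_heap_bytes <= total_heap_bytes) {
        uint64_t used = (uint64_t)total_heap_bytes - h->free_heap_bytes;
        if (used * 5u > (uint64_t)total_heap_bytes * 4u && n < max_codes) {
            codes[n++] = DIAG_CODE_MEMORY_HIGH;
        }
    }
    return n;
}
```

Trend analysis (`analyzeHealthTrends()`) would need a history of samples and is outside the scope of this sketch.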
3. Requirements Coverage
3.1 System Requirements (SR-XXX)
| Feature | System Requirements | Description |
|---|---|---|
| F-DIAG-001 | SR-DIAG-001, SR-DIAG-002, SR-DIAG-003, SR-DIAG-004 | Diagnostic code framework and event management |
| F-DIAG-002 | SR-DIAG-005, SR-DIAG-006, SR-DIAG-007 | Persistent diagnostic storage and retention |
| F-DIAG-003 | SR-DIAG-008, SR-DIAG-009, SR-DIAG-010, SR-DIAG-011 | Engineering diagnostic sessions and access control |
| F-DIAG-004 | SR-DIAG-012, SR-DIAG-013, SR-DIAG-014 | System health monitoring and performance metrics |
3.2 Software Requirements (SWR-XXX)
| Feature | Software Requirements | Implementation Details |
|---|---|---|
| F-DIAG-001 | SWR-DIAG-001, SWR-DIAG-002, SWR-DIAG-003 | Event structure, code registry, severity classification |
| F-DIAG-002 | SWR-DIAG-004, SWR-DIAG-005, SWR-DIAG-006 | Storage management, persistence, retrieval interface |
| F-DIAG-003 | SWR-DIAG-007, SWR-DIAG-008, SWR-DIAG-009 | Session management, authentication, access control |
| F-DIAG-004 | SWR-DIAG-010, SWR-DIAG-011, SWR-DIAG-012 | Health metrics collection, anomaly detection, reporting |
4. Component Implementation Mapping
4.1 Primary Components
| Component | Responsibility | Location |
|---|---|---|
| Diagnostics Task | Health monitoring, event coordination, session management | application_layer/diag_task/ |
| Error Handler | Diagnostic event generation, fault classification | application_layer/error_handler/ |
| Diagnostic Storage Manager | Event persistence, retrieval, storage management | application_layer/diag_storage/ |
| Health Monitor | System metrics collection, anomaly detection | application_layer/health_monitor/ |
4.2 Supporting Components
| Component | Support Role | Interface |
|---|---|---|
| Event System | Diagnostic event distribution, component coordination | application_layer/business_stack/event_system/ |
| Data Persistence | Storage abstraction, NVS and SD card access | application_layer/DP_stack/persistence/ |
| Security Manager | Session authentication, access control | application_layer/security/ |
| State Manager | System state awareness, state-dependent diagnostics | application_layer/business_stack/STM/ |
4.3 Component Interaction Diagram
graph TB
subgraph "Diagnostics & Health Monitoring Feature"
DT[Diagnostics Task]
EH[Error Handler]
DSM[Diagnostic Storage Manager]
HM[Health Monitor]
end
subgraph "Core System Components"
ES[Event System]
DP[Data Persistence]
SEC[Security Manager]
STM[State Manager]
end
subgraph "System Components"
SM[Sensor Manager]
COM[Communication]
OTA[OTA Manager]
PWR[Power Manager]
end
subgraph "Storage"
NVS[NVS Flash]
SD[SD Card]
end
subgraph "Interfaces"
HMI[Local HMI]
UART[UART Debug]
NET[Network Session]
end
DT <--> ES
DT <--> DSM
DT <--> HM
DT <--> SEC
EH --> ES
EH --> DSM
DSM <--> DP
DSM --> NVS
DSM --> SD
HM --> SM
HM --> COM
HM --> OTA
HM --> PWR
HM --> STM
ES -.->|Health Events| HMI
ES -.->|Diagnostic Events| COM
DT -.->|Session Access| UART
DT -.->|Session Access| NET
4.4 Diagnostic Event Flow
sequenceDiagram
participant COMP as System Component
participant EH as Error Handler
participant ES as Event System
participant DSM as Diagnostic Storage
participant DT as Diagnostics Task
participant COM as Communication
Note over COMP,COM: Diagnostic Event Generation and Processing
COMP->>EH: reportError(error_info)
EH->>EH: classifyError(error_info)
EH->>EH: generateDiagnosticEvent()
EH->>ES: publish(DIAGNOSTIC_EVENT, event)
ES->>DSM: storeDiagnosticEvent(event)
ES->>DT: processDiagnosticEvent(event)
ES->>COM: reportDiagnosticEvent(event)
DSM->>DSM: checkStoragePolicy(event.severity)
alt Critical Event (ERROR/FATAL)
DSM->>NVS: persistToFlash(event)
end
DSM->>SD: persistToSDCard(event)
DT->>DT: updateHealthMetrics(event)
DT->>DT: checkSystemHealth()
alt Health degradation detected
DT->>ES: publish(HEALTH_DEGRADATION, metrics)
end
5. Feature Behavior
5.1 Normal Operation Flow
- System Initialization:
  - Initialize diagnostic storage and load existing events
  - Start health monitoring tasks and metric collection
  - Register diagnostic event handlers with all components
  - Establish baseline health metrics and thresholds
- Continuous Monitoring:
  - Collect system health metrics every 10 seconds
  - Process diagnostic events from all system components
  - Store events according to severity and storage policy
  - Analyze health trends and detect anomalies
- Event Processing:
  - Classify and timestamp all diagnostic events
  - Apply filtering and correlation rules
  - Persist events to appropriate storage (NVS/SD)
  - Distribute events to interested components
- Session Management:
  - Handle engineering session requests and authentication
  - Provide secure access to diagnostic data and system health
  - Log all diagnostic session activities for audit
  - Enforce session timeouts and access controls
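The specification does not define the filtering and correlation rules used during event processing. One plausible rule, sketched below purely as an assumption, is to coalesce repeated reports of the same code from the same component into a single record and count them in `occurrence_count`. `diag_record_t`, `diag_table_t`, and `diag_table_report` are hypothetical names, and the choice to keep the time of the latest occurrence is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_TABLE_SIZE 16u   /* illustrative capacity, not from the spec */

/* Subset of diagnostic_event_t relevant to coalescing. */
typedef struct {
    uint16_t code;
    uint8_t source_component_id;
    uint64_t timestamp_ms;       /* time of the most recent occurrence */
    uint16_t occurrence_count;
} diag_record_t;

typedef struct {
    diag_record_t records[DIAG_TABLE_SIZE];
    size_t used;
} diag_table_t;

/*
 * Reports an occurrence of (code, source). Repeats update the existing
 * record; new pairs get a fresh record. Returns NULL when the table is full.
 */
diag_record_t *diag_table_report(diag_table_t *t, uint16_t code,
                                 uint8_t source, uint64_t now_ms)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->used; i++) {
        diag_record_t *r = &t->records[i];
        if (r->code == code && r->source_component_id == source) {
            if (r->occurrence_count < UINT16_MAX) {
                r->occurrence_count++;   /* saturate instead of wrapping */
            }
            r->timestamp_ms = now_ms;
            return r;
        }
    }
    if (t->used == DIAG_TABLE_SIZE) {
        return NULL;
    }
    diag_record_t *r = &t->records[t->used++];
    r->code = code;
    r->source_component_id = source;
    r->timestamp_ms = now_ms;
    r->occurrence_count = 1;
    return r;
}
```

Coalescing of this kind keeps a fault that fires on every monitoring cycle from filling the ring buffer with identical entries.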
5.2 Error Handling
| Error Condition | Detection Method | Response Action |
|---|---|---|
| Storage Full | Storage capacity monitoring | Implement retention policy, discard oldest events |
| SD Card Failure | Write operation failure | Switch to NVS-only storage, log degradation |
| Memory Exhaustion | Heap monitoring | Reduce buffer sizes, increase event filtering |
| Session Timeout | Activity monitoring | Close session, clear authentication |
| Authentication Failure | Credential validation | Reject session, log security event |
5.3 State-Dependent Behavior
| System State | Feature Behavior |
|---|---|
| INIT | Initialize storage, load existing events, start monitoring |
| RUNNING | Full diagnostic functionality, continuous health monitoring |
| WARNING | Enhanced monitoring, increased event generation |
| FAULT | Critical diagnostics only, preserve fault information |
| OTA_UPDATE | Suspend monitoring, log OTA-related events |
| TEARDOWN | Flush pending events, preserve diagnostic state |
| SERVICE | Full diagnostic access, engineering session support |
| SD_DEGRADED | NVS-only storage, reduced event retention |
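The state table above can be reduced to a small policy lookup. The sketch below is one possible encoding, not the specified design: the `SYS_STATE_*` enumerator names, `diag_policy_t`, and `diag_policy_for_state` are hypothetical, and "critical diagnostics only" is read as ERROR/FATAL, matching the storage policy's definition of critical events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    DIAG_SEVERITY_INFO = 0,
    DIAG_SEVERITY_WARNING = 1,
    DIAG_SEVERITY_ERROR = 2,
    DIAG_SEVERITY_FATAL = 3
} diagnostic_severity_t;

/* Hypothetical enumerator names for the states in the table. */
typedef enum {
    SYS_STATE_INIT,
    SYS_STATE_RUNNING,
    SYS_STATE_WARNING,
    SYS_STATE_FAULT,
    SYS_STATE_OTA_UPDATE,
    SYS_STATE_TEARDOWN,
    SYS_STATE_SERVICE,
    SYS_STATE_SD_DEGRADED
} system_state_t;

/* Hypothetical policy record derived from the table. */
typedef struct {
    bool health_monitoring_enabled;
    bool sd_storage_enabled;
    diagnostic_severity_t min_stored_severity;
} diag_policy_t;

/* Fills `out` for a known state; returns false for an unknown state value. */
bool diag_policy_for_state(system_state_t state, diag_policy_t *out)
{
    assert(out != NULL);
    /* Start from full functionality and restrict per state. */
    out->health_monitoring_enabled = true;
    out->sd_storage_enabled = true;
    out->min_stored_severity = DIAG_SEVERITY_INFO;

    switch (state) {
    case SYS_STATE_INIT:
    case SYS_STATE_RUNNING:
    case SYS_STATE_WARNING:
    case SYS_STATE_TEARDOWN:
    case SYS_STATE_SERVICE:
        return true;
    case SYS_STATE_FAULT:
        out->min_stored_severity = DIAG_SEVERITY_ERROR;  /* critical only */
        return true;
    case SYS_STATE_OTA_UPDATE:
        out->health_monitoring_enabled = false;          /* monitoring suspended */
        return true;
    case SYS_STATE_SD_DEGRADED:
        out->sd_storage_enabled = false;                 /* NVS-only storage */
        return true;
    }
    return false;
}
```

Behaviors such as the enhanced monitoring in WARNING or the flush in TEARDOWN are actions rather than flags and are not modeled here.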
6. Feature Constraints
6.1 Timing Constraints
- Event Processing: Maximum 10 ms from generation to storage
- Health Monitoring: 10-second monitoring cycle with ±1 second tolerance
- Session Response: Maximum 500 ms for diagnostic queries
- Storage Operations: Maximum 100 ms for event persistence
6.2 Resource Constraints
- Memory Usage: Maximum 32 KB for diagnostic buffers and storage
- Event Storage: Maximum 10,000 events or 30 days retention
- Session Limit: Maximum 2 concurrent diagnostic sessions
- CPU Usage: Maximum 5% of available CPU time for diagnostics
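The event storage limit can be enforced with a small pruning calculation. The sketch below assumes that both limits apply together (an event is dropped if it is beyond the newest 10,000 or older than 30 days) and that timestamps are stored oldest first; `diag_prune_count` and the `DIAG_RETENTION_*` macros are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIAG_RETENTION_MAX_EVENTS 10000u
#define DIAG_RETENTION_MAX_AGE_MS (30ull * 24u * 60u * 60u * 1000u)  /* 30 days */

/*
 * Returns how many of the oldest events should be discarded.
 * `timestamps_ms` must be ordered oldest first.
 */
size_t diag_prune_count(const uint64_t *timestamps_ms, size_t count, uint64_t now_ms,
                        size_t max_events, uint64_t max_age_ms)
{
    assert(timestamps_ms != NULL || count == 0);

    /* Count limit: keep only the newest max_events entries. */
    size_t drop = (count > max_events) ? count - max_events : 0;

    /* Age limit: keep dropping from the old end while entries are too old. */
    while (drop < count &&
           now_ms > timestamps_ms[drop] &&
           now_ms - timestamps_ms[drop] > max_age_ms) {
        drop++;
    }
    return drop;
}
```

A storage manager would call this with `DIAG_RETENTION_MAX_EVENTS` and `DIAG_RETENTION_MAX_AGE_MS`; the limits are parameters here so the logic can be exercised with small values.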
6.3 Security Constraints
- Session Authentication: All diagnostic access must be authenticated
- Data Protection: Diagnostic data encrypted when stored
- Access Logging: All diagnostic activities logged for audit
- Privilege Separation: Role-based access to diagnostic functions
7. Interface Specifications
7.1 Diagnostics Task Public API
// Initialization and control
bool diagTask_initialize(void);
bool diagTask_start(void);
bool diagTask_stop(void);
bool diagTask_isRunning(void);
// Event management
bool diagTask_reportEvent(const diagnostic_event_t* event);
bool diagTask_getEvents(const diagnostic_filter_t* filter,
diagnostic_event_t* events, size_t* count);
bool diagTask_clearEvents(const diagnostic_filter_t* filter);
bool diagTask_exportEvents(export_format_t format, uint8_t* buffer, size_t* size);
// Health monitoring
bool diagTask_getSystemHealth(system_health_t* health);
bool diagTask_getHealthHistory(health_history_t* history, size_t* count);
bool diagTask_resetHealthMetrics(void);
// Session management
session_id_t diagTask_createSession(session_type_t type);
bool diagTask_authenticateSession(session_id_t session, const auth_credentials_t* creds);
bool diagTask_closeSession(session_id_t session);
bool diagTask_isSessionValid(session_id_t session);
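`diagTask_exportEvents` takes a buffer and a `size_t* size`, which suggests an in/out size parameter. The sketch below illustrates that convention for a CSV-style export, assuming `*size` holds the buffer capacity on entry and the number of bytes written on return. `diag_export_row_t` and `diag_export_csv` are hypothetical, the column layout is invented for illustration, and a `char` buffer is used instead of the API's `uint8_t` buffer to keep the `snprintf` calls simple.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reduced row; a real export would cover all event fields. */
typedef struct {
    uint16_t code;
    uint8_t severity;
    uint64_t timestamp_ms;
} diag_export_row_t;

/*
 * Writes one "code,severity,timestamp" line per row.
 * On entry *size is the buffer capacity; on success it is the number of
 * bytes written (excluding the terminating NUL). On failure it is 0.
 */
bool diag_export_csv(const diag_export_row_t *rows, size_t row_count,
                     char *buffer, size_t *size)
{
    assert(buffer != NULL && size != NULL);
    assert(rows != NULL || row_count == 0);

    size_t capacity = *size;
    size_t used = 0;
    *size = 0;
    if (capacity == 0) {
        return false;
    }
    buffer[0] = '\0';

    for (size_t i = 0; i < row_count; i++) {
        int len = snprintf(buffer + used, capacity - used,
                           "0x%04X,%u,%" PRIu64 "\n",
                           (unsigned)rows[i].code,
                           (unsigned)rows[i].severity,
                           rows[i].timestamp_ms);
        /* snprintf reports the length it wanted; equal-or-larger means truncated. */
        if (len < 0 || (size_t)len >= capacity - used) {
            return false;
        }
        used += (size_t)len;
    }
    *size = used;
    return true;
}
```

With this convention a caller can distinguish "nothing to export" (success with `*size == 0`) from "buffer too small" (failure).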
7.2 Error Handler API
// Error reporting
bool errorHandler_reportError(component_id_t source, error_code_t code,
const char* description, const uint8_t* context_data);
bool errorHandler_reportWarning(component_id_t source, warning_code_t code,
const char* description);
bool errorHandler_reportInfo(component_id_t source, info_code_t code,
const char* description);
// Error classification
diagnostic_severity_t errorHandler_classifyError(error_code_t code);
diagnostic_category_t errorHandler_categorizeError(component_id_t source, error_code_t code);
bool errorHandler_isErrorCritical(error_code_t code);
// Error statistics
bool errorHandler_getErrorStatistics(error_statistics_t* stats);
bool errorHandler_resetErrorStatistics(void);
7.3 Health Monitor API
// Health monitoring
bool healthMonitor_initialize(void);
bool healthMonitor_startMonitoring(void);
bool healthMonitor_stopMonitoring(void);
bool healthMonitor_getCurrentHealth(system_health_t* health);
// Metric collection
bool healthMonitor_collectMetrics(void);
bool healthMonitor_updateMetric(health_metric_id_t metric_id, float value);
bool healthMonitor_getMetricHistory(health_metric_id_t metric_id,
metric_history_t* history, size_t* count);
// Anomaly detection
bool healthMonitor_setThreshold(health_metric_id_t metric_id, float threshold);
bool healthMonitor_enableAnomalyDetection(health_metric_id_t metric_id, bool enable);
bool healthMonitor_getAnomalies(anomaly_t* anomalies, size_t* count);
8. Testing and Validation
8.1 Unit Testing
- Event Generation: Diagnostic event creation and classification
- Storage Management: Event persistence and retrieval operations
- Health Monitoring: Metric collection and anomaly detection
- Session Management: Authentication and access control
8.2 Integration Testing
- Cross-Component Events: Diagnostic events from all system components
- Storage Integration: NVS and SD card storage operations
- Event Distribution: Event system integration and notification
- Session Integration: Engineering access via multiple interfaces
8.3 System Testing
- Long-Duration Monitoring: 48-hour continuous diagnostic operation
- Storage Stress Testing: High-frequency event generation and storage
- Session Security Testing: Authentication bypass attempts
- Fault Injection Testing: Component failure simulation and detection
8.4 Acceptance Criteria
- All diagnostic events properly classified and stored
- Health monitoring detects system anomalies within timing constraints
- Engineering sessions provide secure access to diagnostic data
- Storage management maintains data integrity under all conditions
- Diagnostic overhead stays within the resource constraints of section 6.2 (maximum 5% CPU time, 32 KB memory) and does not degrade core system functionality
- Complete audit trail of all diagnostic activities
9. Dependencies
9.1 Internal Dependencies
- Event System: Diagnostic event distribution and coordination
- Data Persistence: Storage abstraction for diagnostic data
- Security Manager: Session authentication and access control
- State Manager: System state awareness for state-dependent diagnostics
9.2 External Dependencies
- ESP-IDF Framework: NVS, SD card, and system monitoring APIs
- FreeRTOS: Task scheduling and system resource monitoring
- Hardware Components: SD card, NVS flash, UART interface
- System Components: All components for health metric collection
10. Future Enhancements
10.1 Planned Improvements
- Predictive Analytics: Machine learning for failure prediction
- Advanced Correlation: Multi-component fault correlation analysis
- Remote Diagnostics: Cloud-based diagnostic data analysis
- Automated Recovery: Self-healing mechanisms based on diagnostics
10.2 Scalability Considerations
- Distributed Diagnostics: Cross-hub diagnostic correlation
- Cloud Integration: Real-time diagnostic streaming to cloud
- Advanced Analytics: Big data analytics for fleet-wide diagnostics
- Mobile Interface: Smartphone app for field diagnostic access
Document Status: Final for Implementation Phase
Component Dependencies: Verified against architecture
Requirements Traceability: Complete (SR-DIAG, SWR-DIAG)
Next Review: After component implementation