# Diagnostics Manager Component Specification

**Component ID:** C-DIAG-001
**Component Name:** Diagnostics Manager
**Version:** 1.0
**Date:** 2025-02-01

## 1. Component Overview

### 1.1 Purpose

The Diagnostics Manager is responsible for system health monitoring, fault detection, diagnostic data collection, and engineering access. It provides centralized diagnostic code management, persistent diagnostic data storage, and diagnostic session support for maintenance and troubleshooting.

### 1.2 Scope

- Diagnostic code framework and management
- System health monitoring and fault detection
- Diagnostic data collection and storage
- Engineering diagnostic sessions
- Layered watchdog system management
- Performance and resource monitoring

### 1.3 Responsibilities

- Implement structured diagnostic code framework
- Collect and classify diagnostic events
- Persist diagnostic data across system resets
- Provide diagnostic session interface for engineers
- Monitor system health and performance metrics
- Manage watchdog systems for fault detection
- Generate diagnostic reports and summaries

## 2. Component Architecture

### 2.1 Static View

```mermaid
graph TB
    subgraph "Diagnostics Manager"
        DM[Diagnostic Controller]
        DC[Diagnostic Collector]
        DR[Diagnostic Reporter]
        DS[Diagnostic Session]
        HM[Health Monitor]
        WM[Watchdog Manager]
    end

    subgraph "Storage Layer"
        DL[Diagnostic Logger]
        DP[Data Pool]
    end

    subgraph "Hardware Monitoring"
        TWD[Task Watchdog]
        IWD[Interrupt Watchdog]
        RWD[RTC Watchdog]
        TM[Temperature Monitor]
        VM[Voltage Monitor]
    end

    DM --> DC
    DM --> DR
    DM --> DS
    DM --> HM
    DM --> WM
    DC --> DL
    DR --> DP
    HM --> TM
    HM --> VM
    WM --> TWD
    WM --> IWD
    WM --> RWD
```

### 2.2 Internal Components

#### 2.2.1 Diagnostic Controller

- Central coordination of diagnostic activities
- Diagnostic event routing and processing
- Diagnostic policy enforcement

#### 2.2.2 Diagnostic Collector

- Diagnostic event collection and enrichment
- Timestamp and context information addition
- Diagnostic code validation and assignment

#### 2.2.3 Diagnostic Reporter

- Diagnostic event reporting to external systems
- Diagnostic summary generation
- Real-time diagnostic notifications

#### 2.2.4 Diagnostic Session

- Engineering access interface
- Diagnostic data retrieval and analysis
- Diagnostic record management

#### 2.2.5 Health Monitor

- System vital signs monitoring
- Performance metrics collection
- Resource usage tracking

#### 2.2.6 Watchdog Manager

- Multi-layer watchdog system management
- Watchdog feeding coordination
- Watchdog timeout handling

## 3. Interfaces

### 3.1 Provided Interfaces

#### 3.1.1 IDiagnosticsManager

```cpp
class IDiagnosticsManager {
public:
    virtual ~IDiagnosticsManager() = default;

    // Diagnostic Event Management
    virtual Result reportDiagnostic(DiagnosticCode code,
                                    DiagnosticSeverity severity,
                                    const std::string& context) = 0;
    virtual Result reportDiagnostic(const DiagnosticEvent& event) = 0;
    virtual Result clearDiagnostic(DiagnosticCode code) = 0;
    virtual Result clearAllDiagnostics() = 0;

    // Diagnostic Query
    virtual std::vector<DiagnosticEvent> getActiveDiagnostics() const = 0;
    virtual std::vector<DiagnosticEvent> getDiagnosticHistory(
        std::chrono::system_clock::time_point since) const = 0;
    virtual DiagnosticSummary getDiagnosticSummary() const = 0;

    // Health Monitoring
    virtual SystemHealth getSystemHealth() const = 0;
    virtual PerformanceMetrics getPerformanceMetrics() const = 0;
    virtual ResourceUsage getResourceUsage() const = 0;

    // Session Management
    virtual Result createDiagnosticSession(
        const SessionCredentials& credentials) = 0;
    virtual Result closeDiagnosticSession(DiagnosticSessionId session) = 0;
};
```

#### 3.1.2 IDiagnosticReporter

```cpp
class IDiagnosticReporter {
public:
    virtual ~IDiagnosticReporter() = default;

    virtual Result reportEvent(const DiagnosticEvent& event) = 0;
    virtual Result reportHealthStatus(const SystemHealth& health) = 0;
    virtual Result reportPerformanceMetrics(const PerformanceMetrics& metrics) = 0;
};
```

#### 3.1.3 IHealthMonitor

```cpp
class IHealthMonitor {
public:
    virtual ~IHealthMonitor() = default;

    virtual SystemHealth getCurrentHealth() const = 0;
    virtual void startHealthMonitoring() = 0;
    virtual void stopHealthMonitoring() = 0;
    virtual Result registerHealthCallback(IHealthCallback* callback) = 0;
};
```

### 3.2 Required Interfaces

#### 3.2.1 IPersistenceManager

- Persistent storage of diagnostic events
- Diagnostic data retrieval and querying
- Storage space management

#### 3.2.2 IEventSystem

- Event publication for diagnostic notifications
- Event subscription for system events
- Asynchronous event handling

#### 3.2.3 ISecurityManager

- Diagnostic session authentication
- Access control for diagnostic operations
- Security violation reporting

#### 3.2.4 ISystemStateManager

- System state information for diagnostics
- State change notifications
- System health correlation

## 4. Dynamic View

### 4.1 Diagnostic Event Processing Sequence

```mermaid
sequenceDiagram
    participant COMP as System Component
    participant DM as Diagnostics Manager
    participant DC as Diagnostic Collector
    participant DL as Diagnostic Logger
    participant DR as Diagnostic Reporter
    participant ES as Event System

    COMP->>DM: reportDiagnostic(code, severity, context)
    DM->>DC: collectDiagnosticData(code, severity, context)
    DC->>DC: enrichDiagnostic(timestamp, source, details)
    DC->>DL: persistDiagnostic(diagnostic_event)
    DL-->>DC: persistence_result
    DC->>DR: reportDiagnosticEvent(diagnostic_event)
    DR->>ES: publishDiagnosticEvent(event)

    alt Critical Diagnostic
        DM->>DM: triggerEmergencyAction()
        DM->>ES: publishCriticalAlert(event)
    end
```

### 4.2 Health Monitoring Sequence

```mermaid
sequenceDiagram
    participant HM as Health Monitor
    participant TM as Temperature Monitor
    participant VM as Voltage Monitor
    participant WM as Watchdog Manager
    participant DM as Diagnostics Manager

    loop Health Check Cycle (1 second)
        HM->>TM: getTemperature()
        TM-->>HM: temperature_value
        HM->>VM: getVoltage()
        VM-->>HM: voltage_value
        HM->>WM: getWatchdogStatus()
        WM-->>HM: watchdog_status
        HM->>HM: analyzeHealthMetrics()

        alt Health Issue Detected
            HM->>DM: reportHealthIssue(issue_type, severity)
        end

        HM->>HM: updateHealthStatus()
    end
```

### 4.3 Diagnostic Session Sequence

```mermaid
sequenceDiagram
    participant ENG as Engineer
    participant DS as Diagnostic Session
    participant DM as Diagnostics Manager
    participant SM as Security Manager
    participant DL as Diagnostic Logger

    ENG->>DS: requestDiagnosticSession(credentials)
    DS->>SM: authenticateUser(credentials)
    SM-->>DS: authentication_result

    alt Authentication Success
        DS->>DM: createSession(user_id, permissions)
        DM-->>DS: session_id
        DS-->>ENG: session_established(session_id)

        ENG->>DS: getDiagnosticSummary()
        DS->>DM: getDiagnosticSummary()
        DM-->>DS: diagnostic_summary
        DS-->>ENG: summary_data

        ENG->>DS: retrieveDiagnostics(filter)
        DS->>DL: queryDiagnostics(filter)
        DL-->>DS: diagnostic_records
        DS-->>ENG: diagnostic_data

        ENG->>DS: clearDiagnostics(codes)
        DS->>DM: clearDiagnosticCodes(codes)
        DM->>DL: removeDiagnostics(codes)
        DL-->>DM: clear_result
        DM-->>DS: operation_result
        DS-->>ENG: operation_complete
    else Authentication Failed
        DS-->>ENG: access_denied
    end
```

## 5. Diagnostic Code System

### 5.1 Diagnostic Code Structure

```cpp
struct DiagnosticCode {
    uint16_t category;   // System category (e.g., SENSOR, COMM, STORAGE)
    uint16_t component;  // Component identifier
    uint16_t error;      // Specific error code

    // Example: SEN-TEMP-001 = 0x0101001
    static constexpr uint16_t SENSOR_CATEGORY = 0x01;
    static constexpr uint16_t TEMPERATURE_COMPONENT = 0x01;
    static constexpr uint16_t SENSOR_FAILURE = 0x001;
};
```

### 5.2 Diagnostic Severity Levels

```cpp
enum class DiagnosticSeverity {
    INFO = 0,      // Informational messages
    WARNING = 1,   // Non-critical issues
    ERROR = 2,     // Recoverable errors
    CRITICAL = 3,  // System degradation
    FATAL = 4      // System failure
};
```

### 5.3 Diagnostic Event Structure

```cpp
struct DiagnosticEvent {
    DiagnosticCode code;
    DiagnosticSeverity severity;
    std::chrono::system_clock::time_point timestamp;
    std::string source_component;
    std::string context;
    std::map<std::string, std::string> metadata;  // key/value types assumed
    uint32_t occurrence_count;
    bool is_active;
};
```

## 6. Health Monitoring

### 6.1 System Health Metrics

```cpp
// Declared before SystemHealth, which uses it as a member type.
enum class HealthStatus {
    HEALTHY,
    WARNING,
    DEGRADED,
    CRITICAL,
    FAILED
};

struct SystemHealth {
    // Temperature Monitoring
    float cpu_temperature_celsius;
    bool temperature_warning;
    bool temperature_critical;

    // Memory Monitoring
    size_t free_heap_bytes;
    size_t min_free_heap_bytes;
    float heap_usage_percentage;

    // Storage Monitoring
    size_t sd_card_free_bytes;
    size_t sd_card_total_bytes;
    bool sd_card_healthy;

    // Communication Monitoring
    bool main_hub_connected;
    int wifi_signal_strength_dbm;
    uint32_t communication_errors;

    // Power Monitoring
    float supply_voltage_v;
    bool brownout_detected;

    // Overall Health Status
    HealthStatus overall_status;
};
```

### 6.2 Performance Metrics

```cpp
struct PerformanceMetrics {
    // CPU Utilization
    float cpu_utilization_percentage;
    float max_cpu_utilization_percentage;

    // Task Performance
    // Assumed: task name -> per-task metrics; TaskMetrics is not defined in this document.
    std::map<std::string, TaskMetrics> task_metrics;

    // Communication Performance
    uint32_t messages_sent_per_minute;
    uint32_t messages_received_per_minute;
    std::chrono::milliseconds average_response_time;

    // Sensor Performance
    uint32_t sensor_readings_per_minute;
    std::chrono::milliseconds average_sensor_read_time;

    // Storage Performance
    uint32_t storage_writes_per_minute;
    std::chrono::milliseconds average_write_time;
};
```

## 7. Watchdog System

### 7.1 Watchdog Configuration

```cpp
struct WatchdogConfig {
    // Task Watchdog
    bool task_watchdog_enabled;
    std::chrono::seconds task_watchdog_timeout;
    std::vector<std::string> monitored_tasks;  // element type assumed (task names)

    // Interrupt Watchdog
    bool interrupt_watchdog_enabled;
    std::chrono::seconds interrupt_watchdog_timeout;

    // RTC Watchdog
    bool rtc_watchdog_enabled;
    std::chrono::seconds rtc_watchdog_timeout;
};
```

### 7.2 Watchdog Management

- **Task Watchdog**: Monitors FreeRTOS tasks for deadlocks (10s timeout)
- **Interrupt Watchdog**: Detects ISR hangs (3s timeout)
- **RTC Watchdog**: Final safety net for total system freeze (30s timeout)

## 8. Configuration

### 8.1 Diagnostics Configuration

```cpp
struct DiagnosticsConfig {
    // Storage Configuration
    size_t max_diagnostic_records;
    std::chrono::hours diagnostic_retention_period;
    bool persistent_storage_enabled;

    // Health Monitoring Configuration
    std::chrono::seconds health_check_interval;
    HealthThresholds health_thresholds;
    bool continuous_monitoring_enabled;

    // Reporting Configuration
    bool real_time_reporting_enabled;
    DiagnosticSeverity min_reporting_severity;
    std::chrono::seconds reporting_interval;

    // Session Configuration
    std::chrono::minutes session_timeout;
    uint32_t max_concurrent_sessions;
    bool remote_sessions_enabled;
};
```

## 9. Error Handling

### 9.1 Error Categories

- **Storage Errors**: Diagnostic persistence failures
- **Memory Errors**: Insufficient memory for diagnostic operations
- **Configuration Errors**: Invalid diagnostic configuration
- **Session Errors**: Authentication or authorization failures
- **Hardware Errors**: Sensor or monitoring hardware failures

### 9.2 Error Recovery Strategies

- **Graceful Degradation**: Continue operation with reduced diagnostic capability
- **Memory Management**: Implement diagnostic record rotation and cleanup
- **Fallback Storage**: Use alternative storage when primary fails
- **Self-Diagnostics**: Monitor diagnostic system health

## 10. Performance Characteristics

### 10.1 Timing Requirements

- **Diagnostic Event Processing**: < 10ms per event
- **Health Check Cycle**: 1 second interval
- **Diagnostic Query Response**: < 100ms for typical queries
- **Session Operations**: < 500ms for session establishment

### 10.2 Resource Usage

- **Memory**: < 16KB for diagnostic buffers and metadata
- **Storage**: Configurable with rotation (default 1MB)
- **CPU**: < 2% average utilization for monitoring

## 11. Security Considerations

### 11.1 Access Control

- Diagnostic session authentication required
- Role-based access to diagnostic operations
- Audit logging of diagnostic access and modifications

### 11.2 Data Protection

- Encryption of sensitive diagnostic data
- Secure diagnostic data transmission
- Diagnostic data integrity verification

## 12. Testing Strategy

### 12.1 Unit Tests

- Diagnostic event processing and storage
- Health monitoring algorithms
- Watchdog management functionality
- Session management and authentication

### 12.2 Integration Tests

- End-to-end diagnostic reporting
- Health monitoring integration
- Diagnostic session workflows
- Cross-component diagnostic correlation

### 12.3 Hardware Tests

- Watchdog timeout and recovery testing
- Hardware monitoring accuracy
- Performance under stress conditions

## 13. Dependencies

### 13.1 Internal Dependencies

- Persistence Manager for diagnostic storage
- Event System for diagnostic notifications
- Security Manager for session authentication
- System State Manager for system context

### 13.2 External Dependencies

- ESP-IDF watchdog APIs
- FreeRTOS task monitoring
- Hardware monitoring peripherals
- File system for diagnostic storage

## 14. Constraints and Assumptions

### 14.1 Constraints

- Diagnostic system must remain operational during system faults
- Memory usage must be bounded and predictable
- Diagnostic operations must not interfere with real-time requirements
- Storage space for diagnostics is limited and requires rotation

### 14.2 Assumptions

- Sufficient system resources for diagnostic operations
- Reliable storage medium for diagnostic persistence
- Proper system time for diagnostic timestamping
- Valid security credentials for diagnostic sessions
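
The constraints in Section 14.1 call for bounded, predictable memory and for rotation of stored diagnostics. One way to meet both is a fixed-capacity ring buffer that overwrites the oldest record once it is full. The sketch below is illustrative only: `DiagnosticRecord` and `DiagnosticRing` are hypothetical names that do not appear in this specification, and the record is a trimmed-down stand-in for `DiagnosticEvent` with an assumed packed code field.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, trimmed-down record. The specification's DiagnosticEvent
// carries more fields (timestamp, context, metadata, ...).
struct DiagnosticRecord {
    uint32_t code;      // packed diagnostic code (layout assumed)
    uint8_t  severity;  // numeric DiagnosticSeverity value
};

// Fixed-capacity ring buffer. Storage is sized at compile time, so memory
// use is bounded; when full, the oldest record is overwritten (rotation).
template <std::size_t Capacity>
class DiagnosticRing {
    static_assert(Capacity > 0, "capacity must be non-zero");

public:
    // Stores a record. Returns true if an older record was overwritten.
    bool push(const DiagnosticRecord& rec) {
        const bool overwrote = (count_ == Capacity);
        records_[head_] = rec;
        head_ = (head_ + 1) % Capacity;
        if (!overwrote) {
            ++count_;
        }
        return overwrote;
    }

    // Number of records currently stored (never exceeds Capacity).
    std::size_t size() const { return count_; }

    // Returns the i-th stored record; index 0 is the oldest one retained.
    const DiagnosticRecord& at(std::size_t i) const {
        assert(i < count_);
        const std::size_t oldest = (head_ + Capacity - count_) % Capacity;
        return records_[(oldest + i) % Capacity];
    }

private:
    std::array<DiagnosticRecord, Capacity> records_{};
    std::size_t head_ = 0;   // next write position
    std::size_t count_ = 0;  // number of valid records
};
```

For example, pushing codes 1 through 5 into a `DiagnosticRing<3>` leaves codes 3, 4, and 5 in the buffer, and `at(0)` returns the oldest retained record (code 3). A persistent store would need its own rotation policy; this sketch covers only the in-memory side.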