Diagnostics Manager Component Specification

Component ID: C-DIAG-001
Component Name: Diagnostics Manager
Version: 1.0
Date: 2025-02-01

1. Component Overview

1.1 Purpose

The Diagnostics Manager monitors system health, detects faults, collects diagnostic data, and gives engineers access to that data. It centralizes diagnostic code management, persists diagnostic data across resets, and supports diagnostic sessions for maintenance and troubleshooting.

1.2 Scope

  • Diagnostic code framework and management
  • System health monitoring and fault detection
  • Diagnostic data collection and storage
  • Engineering diagnostic sessions
  • Layered watchdog system management
  • Performance and resource monitoring

1.3 Responsibilities

  • Implement structured diagnostic code framework
  • Collect and classify diagnostic events
  • Persist diagnostic data across system resets
  • Provide diagnostic session interface for engineers
  • Monitor system health and performance metrics
  • Manage watchdog systems for fault detection
  • Generate diagnostic reports and summaries

2. Component Architecture

2.1 Static View

graph TB
    subgraph "Diagnostics Manager"
        DM[Diagnostic Controller]
        DC[Diagnostic Collector]
        DR[Diagnostic Reporter]
        DS[Diagnostic Session]
        HM[Health Monitor]
        WM[Watchdog Manager]
    end
    
    subgraph "Storage Layer"
        DL[Diagnostic Logger]
        DP[Data Pool]
    end
    
    subgraph "Hardware Monitoring"
        TWD[Task Watchdog]
        IWD[Interrupt Watchdog]
        RWD[RTC Watchdog]
        TM[Temperature Monitor]
        VM[Voltage Monitor]
    end
    
    DM --> DC
    DM --> DR
    DM --> DS
    DM --> HM
    DM --> WM
    
    DC --> DL
    DR --> DP
    
    HM --> TM
    HM --> VM
    WM --> TWD
    WM --> IWD
    WM --> RWD

2.2 Internal Components

2.2.1 Diagnostic Controller

  • Central coordination of diagnostic activities
  • Diagnostic event routing and processing
  • Diagnostic policy enforcement

2.2.2 Diagnostic Collector

  • Diagnostic event collection and enrichment
  • Timestamp and context information addition
  • Diagnostic code validation and assignment

2.2.3 Diagnostic Reporter

  • Diagnostic event reporting to external systems
  • Diagnostic summary generation
  • Real-time diagnostic notifications

2.2.4 Diagnostic Session

  • Engineering access interface
  • Diagnostic data retrieval and analysis
  • Diagnostic record management

2.2.5 Health Monitor

  • System vital signs monitoring
  • Performance metrics collection
  • Resource usage tracking

2.2.6 Watchdog Manager

  • Multi-layer watchdog system management
  • Watchdog feeding coordination
  • Watchdog timeout handling

3. Interfaces

3.1 Provided Interfaces

3.1.1 IDiagnosticsManager

class IDiagnosticsManager {
public:
    virtual ~IDiagnosticsManager() = default;
    
    // Diagnostic Event Management
    virtual Result<void> reportDiagnostic(DiagnosticCode code, 
                                         DiagnosticSeverity severity,
                                         const std::string& context) = 0;
    virtual Result<void> reportDiagnostic(const DiagnosticEvent& event) = 0;
    virtual Result<void> clearDiagnostic(DiagnosticCode code) = 0;
    virtual Result<void> clearAllDiagnostics() = 0;
    
    // Diagnostic Query
    virtual std::vector<DiagnosticEvent> getActiveDiagnostics() const = 0;
    virtual std::vector<DiagnosticEvent> getDiagnosticHistory(
        std::chrono::system_clock::time_point since) const = 0;
    virtual DiagnosticSummary getDiagnosticSummary() const = 0;
    
    // Health Monitoring
    virtual SystemHealth getSystemHealth() const = 0;
    virtual PerformanceMetrics getPerformanceMetrics() const = 0;
    virtual ResourceUsage getResourceUsage() const = 0;
    
    // Session Management
    virtual Result<DiagnosticSessionId> createDiagnosticSession(
        const SessionCredentials& credentials) = 0;
    virtual Result<void> closeDiagnosticSession(DiagnosticSessionId session) = 0;
};
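The interface above returns Result<void> and Result<DiagnosticSessionId>, but this specification does not define Result. One possible shape, shown purely as an assumption for illustration, is a small error-or-value type. The member names success, failure, ok, value, and error are hypothetical and not taken from the project.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical error-or-value type; the specification does not define Result<T>.
template <typename T>
class Result {
public:
    static Result success(T value) { return Result(true, std::move(value), std::string()); }
    static Result failure(std::string error) { return Result(false, T(), std::move(error)); }

    bool ok() const { return ok_; }
    const T& value() const { return value_; }          // meaningful only when ok()
    const std::string& error() const { return error_; } // empty when ok()

private:
    Result(bool ok, T value, std::string error)
        : ok_(ok), value_(std::move(value)), error_(std::move(error)) {}
    bool ok_;
    T value_;
    std::string error_;
};

// Specialisation for operations that succeed or fail without returning a value.
template <>
class Result<void> {
public:
    static Result success() { return Result(true, std::string()); }
    static Result failure(std::string error) { return Result(false, std::move(error)); }

    bool ok() const { return ok_; }
    const std::string& error() const { return error_; }

private:
    Result(bool ok, std::string error) : ok_(ok), error_(std::move(error)) {}
    bool ok_;
    std::string error_;
};
```

With a type along these lines, a caller of reportDiagnostic or createDiagnosticSession would check ok() before using the value, and would log error() otherwise.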

3.1.2 IDiagnosticReporter

class IDiagnosticReporter {
public:
    virtual ~IDiagnosticReporter() = default;
    virtual Result<void> reportEvent(const DiagnosticEvent& event) = 0;
    virtual Result<void> reportHealthStatus(const SystemHealth& health) = 0;
    virtual Result<void> reportPerformanceMetrics(const PerformanceMetrics& metrics) = 0;
};

3.1.3 IHealthMonitor

class IHealthMonitor {
public:
    virtual ~IHealthMonitor() = default;
    virtual SystemHealth getCurrentHealth() const = 0;
    virtual void startHealthMonitoring() = 0;
    virtual void stopHealthMonitoring() = 0;
    virtual Result<void> registerHealthCallback(IHealthCallback* callback) = 0;
};

3.2 Required Interfaces

3.2.1 IPersistenceManager

  • Persistent storage of diagnostic events
  • Diagnostic data retrieval and querying
  • Storage space management

3.2.2 IEventSystem

  • Event publication for diagnostic notifications
  • Event subscription for system events
  • Asynchronous event handling

3.2.3 ISecurityManager

  • Diagnostic session authentication
  • Access control for diagnostic operations
  • Security violation reporting

3.2.4 ISystemStateManager

  • System state information for diagnostics
  • State change notifications
  • System health correlation

4. Dynamic View

4.1 Diagnostic Event Processing Sequence

sequenceDiagram
    participant COMP as System Component
    participant DM as Diagnostics Manager
    participant DC as Diagnostic Collector
    participant DL as Diagnostic Logger
    participant DR as Diagnostic Reporter
    participant ES as Event System
    
    COMP->>DM: reportDiagnostic(code, severity, context)
    DM->>DC: collectDiagnosticData(code, severity, context)
    DC->>DC: enrichDiagnostic(timestamp, source, details)
    DC->>DL: persistDiagnostic(diagnostic_event)
    DL-->>DC: persistence_result
    DC->>DR: reportDiagnosticEvent(diagnostic_event)
    DR->>ES: publishDiagnosticEvent(event)
    
    alt Critical Diagnostic
        DM->>DM: triggerEmergencyAction()
        DM->>ES: publishCriticalAlert(event)
    end
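The sequence above can be sketched in code as a small pipeline with injected collaborators. This is a minimal sketch, not the component's implementation: Pipeline, Event, and the three callbacks are hypothetical names, and two behaviours are assumptions the diagram does not settle, namely that FATAL events also take the critical path and that publication proceeds even when persistence fails.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class Severity { INFO, WARNING, ERROR, CRITICAL, FATAL };

// Reduced, hypothetical event record for illustration.
struct Event {
    int code;
    Severity severity;
    std::string context;
    std::string source;
};

struct Pipeline {
    std::function<bool(const Event&)> persist;        // stands in for the Diagnostic Logger
    std::function<void(const Event&)> publish;        // stands in for the Event System
    std::function<void(const Event&)> criticalAlert;  // emergency action plus critical alert

    // Mirrors the diagram: enrich, persist, publish, then the critical branch.
    // Returns the persistence result so the caller can react to storage failures.
    bool report(int code, Severity severity,
                const std::string& context, const std::string& source) {
        Event e{code, severity, context, source};  // enrichment step
        const bool stored = persist(e);
        publish(e);                                // assumed: publish even if not stored
        if (severity >= Severity::CRITICAL) {      // assumed: FATAL is also critical
            criticalAlert(e);
        }
        return stored;
    }
};
```

Injecting the collaborators as callbacks keeps the sketch testable without the real logger or event system.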

4.2 Health Monitoring Sequence

sequenceDiagram
    participant HM as Health Monitor
    participant TM as Temperature Monitor
    participant VM as Voltage Monitor
    participant WM as Watchdog Manager
    participant DM as Diagnostics Manager
    
    loop Health Check Cycle (1 second)
        HM->>TM: getTemperature()
        TM-->>HM: temperature_value
        HM->>VM: getVoltage()
        VM-->>HM: voltage_value
        HM->>WM: getWatchdogStatus()
        WM-->>HM: watchdog_status
        
        HM->>HM: analyzeHealthMetrics()
        
        alt Health Issue Detected
            HM->>DM: reportHealthIssue(issue_type, severity)
        end
        
        HM->>HM: updateHealthStatus()
    end
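One iteration of the health check cycle could look like the following sketch. The Thresholds type, its field names, and the issue strings are assumptions made for illustration; the specification only names HealthThresholds without defining it. Sensor access is injected so the function can run without hardware, and the watchdog status query is omitted to keep the example short.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical limits; real values would come from HealthThresholds.
struct Thresholds {
    float max_temperature_celsius;
    float min_supply_voltage_v;
};

// Runs one health check: read each sensor, compare against limits,
// and report every violation. Returns the number of issues found.
inline int runHealthCheck(const std::function<float()>& readTemperature,
                          const std::function<float()>& readVoltage,
                          const Thresholds& limits,
                          const std::function<void(const std::string&)>& reportIssue) {
    int issues = 0;
    if (readTemperature() > limits.max_temperature_celsius) {
        reportIssue("temperature");
        ++issues;
    }
    if (readVoltage() < limits.min_supply_voltage_v) {
        reportIssue("voltage");
        ++issues;
    }
    return issues;
}
```

A periodic task would call this once per health check interval and then update the stored health status.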

4.3 Diagnostic Session Sequence

sequenceDiagram
    participant ENG as Engineer
    participant DS as Diagnostic Session
    participant DM as Diagnostics Manager
    participant SM as Security Manager
    participant DL as Diagnostic Logger
    
    ENG->>DS: requestDiagnosticSession(credentials)
    DS->>SM: authenticateUser(credentials)
    SM-->>DS: authentication_result
    
    alt Authentication Success
        DS->>DM: createSession(user_id, permissions)
        DM-->>DS: session_id
        DS-->>ENG: session_established(session_id)
        
        ENG->>DS: getDiagnosticSummary()
        DS->>DM: getDiagnosticSummary()
        DM-->>DS: diagnostic_summary
        DS-->>ENG: summary_data
        
        ENG->>DS: retrieveDiagnostics(filter)
        DS->>DL: queryDiagnostics(filter)
        DL-->>DS: diagnostic_records
        DS-->>ENG: diagnostic_data
        
        ENG->>DS: clearDiagnostics(codes)
        DS->>DM: clearDiagnosticCodes(codes)
        DM->>DL: removeDiagnostics(codes)
        DL-->>DM: clear_result
        DM-->>DS: operation_result
        DS-->>ENG: operation_complete
    else Authentication Failed
        DS-->>ENG: access_denied
    end
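The authentication branch of this sequence can be sketched as a small gate in front of session creation. SessionGate and its members are hypothetical, the credential is reduced to a string, and using 0 as the "access denied" value is an assumption. The session cap corresponds to max_concurrent_sessions in section 8.1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <utility>

class SessionGate {
public:
    SessionGate(std::function<bool(const std::string&)> authenticate,
                std::size_t max_sessions)
        : authenticate_(std::move(authenticate)), max_sessions_(max_sessions) {}

    // Returns a non-zero session id on success, or 0 when access is denied
    // (failed authentication or session limit reached).
    std::uint32_t open(const std::string& credentials) {
        if (open_.size() >= max_sessions_ || !authenticate_(credentials)) {
            return 0;
        }
        const std::uint32_t id = next_id_++;
        open_.insert(id);
        return id;
    }

    // Closes a session; returns false if the id was not open.
    bool close(std::uint32_t id) { return open_.erase(id) == 1; }

    bool isOpen(std::uint32_t id) const { return open_.count(id) == 1; }

private:
    std::function<bool(const std::string&)> authenticate_;  // stands in for the Security Manager
    std::size_t max_sessions_;
    std::uint32_t next_id_ = 1;
    std::set<std::uint32_t> open_;
};
```

Operations such as clearing diagnostics would then check isOpen(session_id) before acting.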

5. Diagnostic Code System

5.1 Diagnostic Code Structure

struct DiagnosticCode {
    uint16_t category;    // System category (e.g., SENSOR, COMM, STORAGE)
    uint16_t component;   // Component identifier
    uint16_t error;       // Specific error code
    
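    // Example: SEN-TEMP-001 is category 0x01, component 0x01, error 0x001,
    // written compactly as 0x0101001 (the three fields concatenated).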
    static constexpr uint16_t SENSOR_CATEGORY = 0x01;
    static constexpr uint16_t TEMPERATURE_COMPONENT = 0x01;
    static constexpr uint16_t SENSOR_FAILURE = 0x001;
};
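The compact form 0x0101001 in the example is consistent with packing the three fields as an 8-bit category, an 8-bit component, and a 12-bit error code. The sketch below assumes that layout. CodeFields, packCode, unpackCode, and toHex are hypothetical helpers, and the bit widths are inferred from the example rather than stated by the specification.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the three fields of DiagnosticCode.
struct CodeFields {
    std::uint16_t category;
    std::uint16_t component;
    std::uint16_t error;
};

// Assumed layout: 8-bit category | 8-bit component | 12-bit error.
constexpr std::uint32_t packCode(CodeFields c) {
    return (static_cast<std::uint32_t>(c.category & 0xFFu) << 20) |
           (static_cast<std::uint32_t>(c.component & 0xFFu) << 12) |
           static_cast<std::uint32_t>(c.error & 0xFFFu);
}

// Inverse of packCode under the same assumed layout.
constexpr CodeFields unpackCode(std::uint32_t packed) {
    return CodeFields{
        static_cast<std::uint16_t>((packed >> 20) & 0xFFu),
        static_cast<std::uint16_t>((packed >> 12) & 0xFFu),
        static_cast<std::uint16_t>(packed & 0xFFFu)};
}

// Formats the packed value as seven hex digits, for example "0x0101001".
inline std::string toHex(std::uint32_t packed) {
    char buf[16];
    std::snprintf(buf, sizeof(buf), "0x%07X", static_cast<unsigned>(packed));
    return std::string(buf);
}

static_assert(packCode({0x01, 0x01, 0x001}) == 0x0101001u,
              "matches the SEN-TEMP-001 example");
```

A packed 32-bit value is convenient as a map key or log field, while the struct form stays readable in code.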

5.2 Diagnostic Severity Levels

enum class DiagnosticSeverity {
    INFO = 0,      // Informational messages
    WARNING = 1,   // Non-critical issues
    ERROR = 2,     // Recoverable errors
    CRITICAL = 3,  // System degradation
    FATAL = 4      // System failure
};

5.3 Diagnostic Event Structure

struct DiagnosticEvent {
    DiagnosticCode code;
    DiagnosticSeverity severity;
    std::chrono::system_clock::time_point timestamp;
    std::string source_component;
    std::string context;
    std::map<std::string, std::string> metadata;
    uint32_t occurrence_count;
    bool is_active;
};
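The occurrence_count and is_active fields suggest that repeated reports of the same code are merged into one record rather than stored separately. A minimal sketch of that bookkeeping follows, assuming events are keyed by a packed 32-bit code and that clearing a code keeps its history. ActiveEntry and ActiveTable are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Reduced, hypothetical version of DiagnosticEvent.
struct ActiveEntry {
    std::uint32_t occurrence_count = 0;
    bool is_active = false;
    std::chrono::system_clock::time_point last_seen{};
    std::string context;
};

class ActiveTable {
public:
    // Records one report. Repeated reports of the same code bump the counter
    // and refresh the timestamp and context.
    const ActiveEntry& report(std::uint32_t code, const std::string& context) {
        ActiveEntry& e = entries_[code];
        ++e.occurrence_count;
        e.is_active = true;
        e.last_seen = std::chrono::system_clock::now();
        e.context = context;
        return e;
    }

    // Marks a code inactive but keeps its history; returns false if unknown.
    bool clear(std::uint32_t code) {
        auto it = entries_.find(code);
        if (it == entries_.end()) {
            return false;
        }
        it->second.is_active = false;
        return true;
    }

    std::size_t activeCount() const {
        std::size_t n = 0;
        for (const auto& kv : entries_) {
            if (kv.second.is_active) {
                ++n;
            }
        }
        return n;
    }

private:
    std::map<std::uint32_t, ActiveEntry> entries_;
};
```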

6. Health Monitoring

6.1 System Health Metrics

// Declared before SystemHealth, which uses it as a member type.
enum class HealthStatus {
    HEALTHY,
    WARNING,
    DEGRADED,
    CRITICAL,
    FAILED
};

struct SystemHealth {
    // Temperature Monitoring
    float cpu_temperature_celsius;
    bool temperature_warning;
    bool temperature_critical;
    
    // Memory Monitoring
    size_t free_heap_bytes;
    size_t min_free_heap_bytes;
    float heap_usage_percentage;
    
    // Storage Monitoring
    size_t sd_card_free_bytes;
    size_t sd_card_total_bytes;
    bool sd_card_healthy;
    
    // Communication Monitoring
    bool main_hub_connected;
    int wifi_signal_strength_dbm;
    uint32_t communication_errors;
    
    // Power Monitoring
    float supply_voltage_v;
    bool brownout_detected;
    
    // Overall Health Status
    HealthStatus overall_status;
};
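The specification does not say how overall_status is derived from the individual metrics. One possible policy is sketched below: take the worst status implied by any single flag. The mapping of flags to statuses and the 90% heap limit are assumptions for illustration, as are the names HealthFlags, worst, and deriveOverallStatus.

```cpp
#include <cassert>

// Local copy of the enum from this section so the sketch compiles on its own.
enum class HealthStatus { HEALTHY, WARNING, DEGRADED, CRITICAL, FAILED };

// Subset of the SystemHealth fields that feed the decision.
struct HealthFlags {
    bool temperature_warning;
    bool temperature_critical;
    float heap_usage_percentage;
    bool sd_card_healthy;
    bool main_hub_connected;
    bool brownout_detected;
};

// Returns the more severe of two statuses (enumerators rise in severity).
constexpr HealthStatus worst(HealthStatus a, HealthStatus b) {
    return static_cast<int>(a) > static_cast<int>(b) ? a : b;
}

// Assumed policy: WARNING for early signs, DEGRADED for lost peripherals,
// CRITICAL for thermal or power faults. The 90% heap limit is illustrative.
inline HealthStatus deriveOverallStatus(const HealthFlags& in) {
    HealthStatus s = HealthStatus::HEALTHY;
    if (in.temperature_warning || in.heap_usage_percentage > 90.0f) {
        s = worst(s, HealthStatus::WARNING);
    }
    if (!in.sd_card_healthy || !in.main_hub_connected) {
        s = worst(s, HealthStatus::DEGRADED);
    }
    if (in.temperature_critical || in.brownout_detected) {
        s = worst(s, HealthStatus::CRITICAL);
    }
    return s;
}
```

Taking the worst individual status keeps the rule easy to audit; a weighted score would be an alternative if finer grading is needed.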

6.2 Performance Metrics

struct PerformanceMetrics {
    // CPU Utilization
    float cpu_utilization_percentage;
    float max_cpu_utilization_percentage;
    
    // Task Performance
    std::map<std::string, TaskMetrics> task_metrics;
    
    // Communication Performance
    uint32_t messages_sent_per_minute;
    uint32_t messages_received_per_minute;
    std::chrono::milliseconds average_response_time;
    
    // Sensor Performance
    uint32_t sensor_readings_per_minute;
    std::chrono::milliseconds average_sensor_read_time;
    
    // Storage Performance
    uint32_t storage_writes_per_minute;
    std::chrono::milliseconds average_write_time;
};

7. Watchdog System

7.1 Watchdog Configuration

struct WatchdogConfig {
    // Task Watchdog
    bool task_watchdog_enabled;
    std::chrono::seconds task_watchdog_timeout;
    std::vector<std::string> monitored_tasks;
    
    // Interrupt Watchdog
    bool interrupt_watchdog_enabled;
    std::chrono::seconds interrupt_watchdog_timeout;
    
    // RTC Watchdog
    bool rtc_watchdog_enabled;
    std::chrono::seconds rtc_watchdog_timeout;
};

7.2 Watchdog Management

  • Task Watchdog: Monitors FreeRTOS tasks for deadlocks (10s timeout)
  • Interrupt Watchdog: Detects ISR hangs (3s timeout)
  • RTC Watchdog: Final safety net for total system freeze (30s timeout)
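The three timeouts above form layers: the interrupt watchdog fires first, then the task watchdog, with the RTC watchdog as the last resort. A configuration check for that ordering might look like this sketch. WatchdogTimeouts and isLayered are hypothetical names, and the requirement that the ordering be strict is an assumption.

```cpp
#include <cassert>
#include <chrono>

// Defaults taken from the list above.
struct WatchdogTimeouts {
    std::chrono::seconds interrupt{3};
    std::chrono::seconds task{10};
    std::chrono::seconds rtc{30};
};

// The layering holds when each outer watchdog allows strictly more time
// than the one inside it, so inner layers get a chance to recover first.
inline bool isLayered(const WatchdogTimeouts& t) {
    return t.interrupt < t.task && t.task < t.rtc;
}
```

Rejecting a configuration where, for example, the task timeout exceeds the RTC timeout avoids the outer watchdog resetting the system before the inner one can report the fault.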

8. Configuration

8.1 Diagnostics Configuration

struct DiagnosticsConfig {
    // Storage Configuration
    size_t max_diagnostic_records;
    std::chrono::hours diagnostic_retention_period;
    bool persistent_storage_enabled;
    
    // Health Monitoring Configuration
    std::chrono::seconds health_check_interval;
    HealthThresholds health_thresholds;
    bool continuous_monitoring_enabled;
    
    // Reporting Configuration
    bool real_time_reporting_enabled;
    DiagnosticSeverity min_reporting_severity;
    std::chrono::seconds reporting_interval;
    
    // Session Configuration
    std::chrono::minutes session_timeout;
    uint32_t max_concurrent_sessions;
    bool remote_sessions_enabled;
};
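As an illustration of how the reporting fields might be applied, the sketch below gates real-time reporting on real_time_reporting_enabled and min_reporting_severity. ReportingConfig and shouldReportNow are hypothetical, and treating the minimum severity as inclusive is an assumption.

```cpp
#include <cassert>

// Local copy of the enum from section 5.2 so the sketch compiles on its own.
enum class DiagnosticSeverity { INFO = 0, WARNING = 1, ERROR = 2, CRITICAL = 3, FATAL = 4 };

// Subset of DiagnosticsConfig relevant to real-time reporting.
struct ReportingConfig {
    bool real_time_reporting_enabled;
    DiagnosticSeverity min_reporting_severity;
};

// True when an event of the given severity should be reported immediately.
// Assumed: the minimum severity is inclusive.
constexpr bool shouldReportNow(const ReportingConfig& cfg, DiagnosticSeverity severity) {
    return cfg.real_time_reporting_enabled && severity >= cfg.min_reporting_severity;
}
```

Events below the threshold would still be persisted; this check only decides whether they are pushed out in real time.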

9. Error Handling

9.1 Error Categories

  • Storage Errors: Diagnostic persistence failures
  • Memory Errors: Insufficient memory for diagnostic operations
  • Configuration Errors: Invalid diagnostic configuration
  • Session Errors: Authentication or authorization failures
  • Hardware Errors: Sensor or monitoring hardware failures

9.2 Error Recovery Strategies

  • Graceful Degradation: Continue operation with reduced diagnostic capability
  • Memory Management: Implement diagnostic record rotation and cleanup
  • Fallback Storage: Use alternative storage when primary fails
  • Self-Diagnostics: Monitor diagnostic system health
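Record rotation with bounded memory can be sketched as a fixed-capacity ring buffer that overwrites the oldest record when full. RingLog is a hypothetical name, and a real implementation would likely rotate records in persistent storage as well. The fixed std::array keeps memory use predictable, in line with the constraints in section 14.1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t N>
class RingLog {
    static_assert(N > 0, "capacity must be positive");

public:
    // Stores a record; returns true if the oldest record was overwritten.
    bool push(const T& value) {
        const bool overwrote = (count_ == N);
        slots_[(head_ + count_) % N] = value;  // when full, this is the oldest slot
        if (overwrote) {
            head_ = (head_ + 1) % N;           // the next-oldest record becomes the head
        } else {
            ++count_;
        }
        return overwrote;
    }

    // Returns the i-th oldest record; the caller must ensure i < size().
    const T& at(std::size_t i) const { return slots_[(head_ + i) % N]; }

    std::size_t size() const { return count_; }

private:
    std::array<T, N> slots_{};
    std::size_t head_ = 0;   // index of the oldest record
    std::size_t count_ = 0;  // number of stored records
};
```

The return value of push lets the caller count dropped records, which is itself a useful self-diagnostic.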

10. Performance Characteristics

10.1 Timing Requirements

  • Diagnostic Event Processing: < 10 ms per event
  • Health Check Cycle: 1 second interval
  • Diagnostic Query Response: < 100 ms for typical queries
  • Session Operations: < 500 ms for session establishment

10.2 Resource Usage

  • Memory: < 16 KB for diagnostic buffers and metadata
  • Storage: Configurable with rotation (default 1 MB)
  • CPU: < 2% average utilization for monitoring

11. Security Considerations

11.1 Access Control

  • Diagnostic session authentication required
  • Role-based access to diagnostic operations
  • Audit logging of diagnostic access and modifications

11.2 Data Protection

  • Sensitive diagnostic data encryption
  • Secure diagnostic data transmission
  • Diagnostic data integrity verification

12. Testing Strategy

12.1 Unit Tests

  • Diagnostic event processing and storage
  • Health monitoring algorithms
  • Watchdog management functionality
  • Session management and authentication

12.2 Integration Tests

  • End-to-end diagnostic reporting
  • Health monitoring integration
  • Diagnostic session workflows
  • Cross-component diagnostic correlation

12.3 Hardware Tests

  • Watchdog timeout and recovery testing
  • Hardware monitoring accuracy
  • Performance under stress conditions

13. Dependencies

13.1 Internal Dependencies

  • Persistence Manager for diagnostic storage
  • Event System for diagnostic notifications
  • Security Manager for session authentication
  • System State Manager for system context

13.2 External Dependencies

  • ESP-IDF watchdog APIs
  • FreeRTOS task monitoring
  • Hardware monitoring peripherals
  • File system for diagnostic storage

14. Constraints and Assumptions

14.1 Constraints

  • Diagnostic system must remain operational during system faults
  • Memory usage must be bounded and predictable
  • Diagnostic operations must not interfere with real-time requirements
  • Storage space for diagnostics is limited and requires rotation

14.2 Assumptions

  • Sufficient system resources for diagnostic operations
  • Reliable storage medium for diagnostic persistence
  • Proper system time for diagnostic timestamping
  • Valid security credentials for diagnostic sessions