ros2_medkit_fault_manager

Central fault manager node for ros2_medkit fault management system

README

ros2_medkit_fault_manager

Central fault manager node for the ros2_medkit fault management system.

Overview

The FaultManager node provides a central point for fault aggregation and lifecycle management. It receives fault reports from multiple sources, aggregates them by fault_code, and provides query and clearing interfaces.

Quick Start

By default, faults are confirmed immediately when reported - no additional configuration needed.

# Start the fault manager
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# Report a fault - it's immediately CONFIRMED
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'MOTOR_OVERHEAT', event_type: 0, severity: 2, description: 'Motor temp exceeded', source_id: '/motor_node'}"

# Query faults
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED']}"

# Clear a fault
ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
  "{fault_code: 'MOTOR_OVERHEAT'}"

Services

Service	Type	Description
`~/report_fault`	`ros2_medkit_msgs/srv/ReportFault`	Report a fault occurrence
`~/list_faults`	`ros2_medkit_msgs/srv/ListFaults`	Query faults with filtering
`~/clear_fault`	`ros2_medkit_msgs/srv/ClearFault`	Clear/acknowledge a fault
`~/get_snapshots`	`ros2_medkit_msgs/srv/GetSnapshots`	Get topic snapshots for a fault

Features

Multi-source aggregation: Same fault_code from different sources creates a single fault
Occurrence tracking: Counts total reports and tracks all reporting sources
Severity escalation: Fault severity is updated if a higher severity is reported
Persistent storage: SQLite backend ensures faults survive node restarts
Debounce filtering (optional): AUTOSAR DEM-style counter-based fault confirmation with per-entity threshold overrides
Snapshot capture: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
Fault correlation (optional): Root cause analysis with symptom muting and auto-clear

Parameters

Parameter	Type	Default	Description
`storage_type`	string	`"sqlite"`	Storage backend: `"sqlite"` or `"memory"`
`database_path`	string	`"/var/lib/ros2_medkit/faults.db"`	Path to SQLite database file
`confirmation_threshold`	int	`-1`	Counter value at which faults are confirmed
`healing_enabled`	bool	`false`	Enable automatic healing via PASSED events
`healing_threshold`	int	`3`	Counter value at which faults are healed
`auto_confirm_after_sec`	double	`0.0`	Auto-confirm PREFAILED faults after timeout (0 = disabled)
`entity_thresholds.config_file`	string	`""`	Path to YAML file with per-entity debounce threshold overrides

Snapshot Parameters

Snapshots capture topic data when faults are confirmed for post-mortem debugging.

Parameter	Type	Default	Description
`snapshots.enabled`	bool	`true`	Enable/disable snapshot capture
`snapshots.background_capture`	bool	`false`	Use background subscriptions (caches latest message) vs on-demand capture
`snapshots.timeout_sec`	double	`1.0`	Timeout waiting for topic message (on-demand mode)
`snapshots.max_message_size`	int	`65536`	Maximum message size in bytes (larger messages skipped)
`snapshots.default_topics`	string[]	`[]`	Topics to capture for all faults
`snapshots.config_file`	string	`""`	Path to YAML config for `fault_specific` and `patterns`

Topic Resolution Priority:

fault_specific - Exact match for fault code (configured via YAML config file)
patterns - Regex pattern match (configured via YAML config file)
default_topics - Fallback for all faults

Example YAML config file (snapshots.yaml):

fault_specific:
  MOTOR_OVERHEAT:
    - /joint_states
    - /motor/temperature
patterns:
  "MOTOR_.*":
    - /joint_states
    - /cmd_vel

Storage Backends

SQLite (default): Faults are persisted to disk and survive node restarts. Uses WAL mode for optimal performance.

Memory: Faults are stored in memory only. Useful for testing or when persistence is not required.

Usage

Launch

# Default (SQLite storage, immediate confirmation)
ros2 launch ros2_medkit_fault_manager fault_manager.launch.py

# With custom database path
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p database_path:=/custom/path/faults.db

# With in-memory storage (no persistence)
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p storage_type:=memory

Advanced: Debounce Filtering

For systems that need to filter transient faults, enable debounce filtering by setting a lower confirmation_threshold.

Configuration

ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p confirmation_threshold:=-3 \
  -p healing_enabled:=true \
  -p healing_threshold:=3

How It Works

The fault manager uses an AUTOSAR DEM-style debounce model:

FAILED events (fault detected): Decrement the internal counter
PASSED events (fault cleared): Increment the internal counter
Fault becomes CONFIRMED when counter reaches confirmation_threshold
Fault becomes HEALED when counter reaches healing_threshold (if enabled)

Fault Lifecycle with Debounce

     FAILED events          PASSED events
          |                      |
          v                      v
   [counter--]             [counter++]
          |                      |
          v                      v
PREFAILED -----> CONFIRMED -----> HEALED (retained)
  (counter     (counter <=     (counter >=
   < 0)         threshold)      healing)
                    |
                    v
                CLEARED (manual via ~/clear_fault)

Status Reference

Status	Description
`PREFAILED`	Debounce counter < 0, not yet confirmed
`CONFIRMED`	Fault is active and verified
`HEALED`	Resolved via PASSED events (if healing enabled)
`CLEARED`	Manually acknowledged via `~/clear_fault`

Testing Debounce

# Report FAILED events (need 3 to confirm with threshold=-3)
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'SENSOR_FAIL', event_type: 0, severity: 2, description: 'Sensor timeout', source_id: '/sensor'}"

# Report PASSED event (fault condition cleared)
ros2 service call /fault_manager/report_fault ros2_medkit_msgs/srv/ReportFault \
  "{fault_code: 'SENSOR_FAIL', event_type: 1, severity: 0, description: '', source_id: '/sensor'}"

# Query all statuses including PREFAILED
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['PREFAILED', 'CONFIRMED', 'HEALED']}"

Event types: 0 = EVENT_FAILED, 1 = EVENT_PASSED

Immediate Confirmation

CRITICAL severity faults bypass debounce and are immediately CONFIRMED, regardless of threshold.

Advanced: Fault Correlation

Fault correlation reduces noise by identifying relationships between faults. When enabled, symptom faults (effects of a root cause) can be muted and auto-cleared when the root cause is resolved.

Correlation Modes

Hierarchical: Defines explicit root cause → symptoms relationships. When a root cause fault occurs, subsequent matching symptom faults within a time window are correlated and optionally muted.

Auto-Cluster: Automatically groups related faults that match a pattern within a time window. Useful for detecting “storms” of related faults (e.g., communication errors).

Configuration

Enable correlation by providing a YAML configuration file:

ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
  -p correlation.config_file:=/path/to/correlation.yaml

Correlation Parameters

Parameter	Type	Default	Description
`correlation.config_file`	string	`""`	Path to correlation YAML config (empty = disabled)
`correlation.cleanup_interval_sec`	double	`5.0`	Interval for cleaning up expired pending correlations (seconds)

Configuration File Format

correlation:
  enabled: true
  default_window_ms: 500  # Default time window for symptom detection

  # Reusable fault patterns (supports wildcards with *)
  patterns:
    motor_errors:
      codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*"]
    drive_faults:
      codes: ["DRIVE_*"]
    comm_errors:
      codes: ["*_COMM_*", "*_TIMEOUT"]

  rules:
    # Hierarchical rule: E-Stop causes motor and drive faults
    - id: estop_cascade
      name: "E-Stop Cascade"
      mode: hierarchical
      root_cause:
        codes: ["ESTOP_001", "ESTOP_002"]
      symptoms:
        - pattern: motor_errors
        - pattern: drive_faults
      window_ms: 1000           # Symptoms within 1s of root cause
      mute_symptoms: true       # Don't publish symptom events
      auto_clear_with_root: true # Clear symptoms when root cause clears

    # Auto-cluster rule: Group communication errors
    - id: comm_storm
      name: "Communication Storm"
      mode: auto_cluster
      match:
        - pattern: comm_errors
      min_count: 3              # Need 3 faults to form cluster
      window_ms: 500            # Within 500ms
      show_as_single: true      # Only show representative fault
      representative: highest_severity  # first | most_recent | highest_severity

Pattern Wildcards

Patterns support * wildcard matching:

MOTOR_* matches MOTOR_COMM, MOTOR_TIMEOUT, MOTOR_DRIVE_FAULT
*_COMM_* matches MOTOR_COMM_FL, SENSOR_COMM_TIMEOUT
*_TIMEOUT matches MOTOR_TIMEOUT, SENSOR_TIMEOUT

Querying Correlation Data

Use include_muted and include_clusters to retrieve correlation information:

# Get faults with muted fault details
ros2 service call /fault_manager/list_faults ros2_medkit_msgs/srv/ListFaults \
  "{statuses: ['CONFIRMED'], include_muted: true, include_clusters: true}"

Response includes:

muted_count: Number of muted symptom faults
cluster_count: Number of active fault clusters
muted_faults[]: Details of muted faults (when include_muted=true)
clusters[]: Details of active clusters (when include_clusters=true)

REST API (via Gateway)

Query parameters for GET /api/v1/faults:

include_muted=true: Include muted fault details in response
include_clusters=true: Include cluster details in response

Response fields:

{
  "faults": [...],
  "count": 5,
  "muted_count": 2,
  "cluster_count": 1,
  "muted_faults": [
    {
      "fault_code": "MOTOR_COMM_FL",
      "root_cause_code": "ESTOP_001",
      "rule_id": "estop_cascade",
      "delay_ms": 50
    }
  ],
  "clusters": [
    {
      "cluster_id": "comm_storm_1",
      "rule_id": "comm_storm",
      "rule_name": "Communication Storm",
      "representative_code": "SENSOR_TIMEOUT",
      "representative_severity": "CRITICAL",
      "fault_codes": ["MOTOR_COMM_FL", "SENSOR_TIMEOUT", "DRIVE_COMM_ERR"],
      "count": 3,
      "first_at": 1705678901.123,
      "last_at": 1705678901.456
    }
  ]
}

When clearing a root cause fault, auto_cleared_codes lists symptoms that were auto-cleared:

{
  "status": "success",
  "fault_code": "ESTOP_001",
  "message": "Fault cleared",
  "auto_cleared_codes": ["MOTOR_COMM_FL", "MOTOR_COMM_FR", "DRIVE_FAULT"]
}

Example: E-Stop Cascade

E-Stop is triggered → ESTOP_001 fault reported
Motors lose power → MOTOR_COMM_FL, MOTOR_COMM_FR faults reported
Correlation engine detects motor faults are symptoms of E-Stop
Motor faults are muted (not published as events, but stored)
Dashboard shows only ESTOP_001 (root cause)
When E-Stop is cleared → Motor faults are auto-cleared

Namespaced Deployment

The fault manager can run in a custom ROS 2 namespace. The gateway resolves service and topic names automatically via the fault_manager.namespace parameter:

# gateway_params.yaml
fault_manager:
  namespace: "robot1"          # -> /robot1/fault_manager/list_faults
  service_timeout_sec: 5.0

Launch the fault manager in a namespace:

ros2 launch ros2_medkit_fault_manager fault_manager.launch.py \
  namespace:=robot1

Leading slashes are optional - "robot1" and "/robot1" are equivalent.

Building

colcon build --packages-select ros2_medkit_fault_manager
source install/setup.bash

Testing

colcon test --packages-select ros2_medkit_fault_manager
colcon test-result --verbose

License

Apache-2.0