
# Unhealthy ports

The behavior of OpenSM is well defined when it interacts with a well-behaved
Subnet Management Agent (SMA), which is embedded in each InfiniBand node.
However, several cases of unhealthy ports have raised the need for OpenSM to
keep functioning correctly even when it cannot trust the SMA's compliance.

When a port is unhealthy, OpenSM cannot configure the hardware, which may
adversely affect the rest of the hardware (for example, when the two sides of
a link cannot agree on the same number of Virtual Lanes, a trap is generated
periodically). Under some conditions OpenSM may plan to rely on the port's
capabilities (such as packet forwarding), and a failure to configure the
device then requires a global change in configuration.

To enable unhealthy port checks, define the following in the OpenSM
configuration file:

    # Enable Unhealthy Ports Configuration
    hm_unhealthy_ports_checks TRUE
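
To check which health monitoring options a configuration file sets, the file can be scanned for `hm_` keys. The following is a minimal sketch, assuming the `key value` layout and `#` comment lines seen in the snippets in this document; `read_hm_options` and the `other_option` line are hypothetical and are not part of OpenSM.

```python
def read_hm_options(conf_text):
    """Collect `hm_*` options from OpenSM-style configuration text."""
    options = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key.startswith("hm_"):
            options[key] = value
    return options


sample_conf = """\
# Enable Unhealthy Ports Configuration
hm_unhealthy_ports_checks TRUE
other_option 1
"""
hm_options = read_hm_options(sample_conf)
checks_enabled = hm_options.get("hm_unhealthy_ports_checks") == "TRUE"
print(hm_options, checks_enabled)
```

Keys without the `hm_` prefix are ignored, so the result lists only the options discussed here.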


## Unhealthy conditions

OpenSM examines the behavior of subnet ports and declares a port unhealthy
under certain conditions. Once a port is declared unhealthy, OpenSM performs
the configured action on that port. The user can control both the conditions
that cause a port to be declared unhealthy and the actions performed.
Moreover, the user can clear ports previously marked as unhealthy.


### Constantly rebooted nodes

This condition identifies a node (RTR/CA or switch) that is appearing and
disappearing from the network at a high rate.

OpenSM configuration:

    #
    # Unhealthy Ports Reboot condition options
    #
    # CA/RTR Reboot Action (ignore, report or isolate)
    hm_ca_reboot_action report

    # Switch Reboot Action (ignore, report or isolate)
    hm_sw_reboot_action report

    # Number of reboots in period to declare a node as unhealthy
    hm_num_reboots 10

    # The period for counting number of reboots in seconds
    hm_reboots_period_secs 900
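
The threshold and period options above can be read as a count of events inside a time window. The following is a minimal sketch of that idea, assuming a sliding window and assuming that reaching the threshold (rather than exceeding it) marks the node. `EventRateMonitor` is a hypothetical name, and OpenSM's actual counting logic may differ.

```python
from collections import deque


class EventRateMonitor:
    """Flags a node once `threshold` events fall inside `period_secs`."""

    def __init__(self, threshold, period_secs):
        self.threshold = threshold
        self.period_secs = period_secs
        self._events = deque()

    def record(self, timestamp):
        """Record one event; return True if the node now looks unhealthy."""
        self._events.append(timestamp)
        # Drop events that are older than the counting period.
        while timestamp - self._events[0] >= self.period_secs:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Values mirror hm_num_reboots and hm_reboots_period_secs.
monitor = EventRateMonitor(threshold=10, period_secs=900)
# Ten reboots, one per minute: only the tenth reaches the threshold.
results = [monitor.record(minute * 60) for minute in range(10)]
print(results)
```

With the same settings, a node that reboots once every 100 seconds never has more than nine reboots inside any 900-second window, so this sketch would not flag it.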


### Flapping links

This condition identifies single port failures that happen frequently.

OpenSM configuration:

    #
    # Unhealthy Ports Flapping Link condition options
    #
    # CA/RTR Flapping Link Action (ignore, report or isolate)
    hm_ca_flapping_action report

    # Switch Flapping Link Action (ignore, report or isolate)
    hm_sw_flapping_action report

    # The number of sweeps in which the link was flapping
    hm_num_flapping_sweeps 5

    # The number of sweeps of which any port exceeding
    # hm_num_flapping_sweeps is declared unhealthy
    hm_num_flapping_sweeps_window 10
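
The sweep count and sweep window options can be read as "the condition was seen in at least N of the last M sweeps". The following is a minimal sketch of that reading, assuming that reaching the count (rather than exceeding it) marks the port. `SweepWindow` is a hypothetical name and not OpenSM's implementation.

```python
from collections import deque


class SweepWindow:
    """Tracks whether a condition was seen in each of the last `window` sweeps."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        # A bounded deque discards the oldest sweep automatically.
        self._sweeps = deque(maxlen=window)

    def end_sweep(self, condition_seen):
        """Record one sweep; return True if the port now looks unhealthy."""
        self._sweeps.append(bool(condition_seen))
        return sum(self._sweeps) >= self.threshold


# Values mirror hm_num_flapping_sweeps and hm_num_flapping_sweeps_window.
port = SweepWindow(threshold=5, window=10)
# The link flaps in every other sweep: sweeps 0, 2, 4, 6 and 8.
history = [port.end_sweep(i % 2 == 0) for i in range(10)]
print(history)
```

Because old sweeps fall out of the window, a port that stops flapping drops below the threshold again after enough clean sweeps.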


### Unresponsive ports

This condition identifies ports that are not responding to SMP queries during
several sequential sweeps.

OpenSM configuration:

    #
    # Unhealthy Ports Unresponsive condition options
    #
    # CA/RTR Unresponsive Action (ignore, report or isolate)
    hm_ca_unresponsive_action report

    # Switch Unresponsive Action (ignore, report or isolate)
    hm_sw_unresponsive_action report

    # The number of sweeps that had that port unresponsive
    hm_num_no_resp_sweeps 5

    # The number of sweeps of which any port exceeding
    # hm_num_no_resp_sweeps is declared unhealthy
    hm_num_no_resp_sweeps_window 7


### Noisy ports

This condition identifies ports that are sending traps to OpenSM at an
unreasonable rate.

OpenSM configuration:

    #
    # Unhealthy Ports Noisy condition options
    #
    # CA/RTR Noisy Action (ignore, report or isolate)
    hm_ca_noisy_action report

    # Switch Noisy Action (ignore, report or isolate)
    hm_sw_noisy_action report

    # Number of traps received in period to declare the port as
    # unhealthy.
    hm_num_traps 250

    # The period for counting number of received traps in seconds
    hm_num_traps_period_secs 60


### Errors on SET SMP queries

This condition identifies ports that respond to SET SMPs with a status other
than IB_SUCCESS during several sequential sweeps.

OpenSM configuration:

    #
    # Unhealthy Ports SetErr condition options
    #
    # CA/RTR SetErr Action (ignore, report or isolate)
    hm_ca_seterr_action ignore

    # Switch SetErr Action (ignore, report or isolate)
    hm_sw_seterr_action ignore

    # The number of sweeps that had that port report back an
    # error for a Set
    hm_num_set_err_sweeps 5

    # The number of sweeps of which any port exceeding
    # hm_num_set_err_sweeps is declared unhealthy
    hm_num_set_err_sweeps_window 7


### Illegal SMPs

This condition identifies ports that respond with illegal SMP values, or whose
read-only values have changed for an existing port/node.

OpenSM configuration:

    #
    # Unhealthy Ports Illegal condition options
    #
    # CA/RTR Illegal Action (ignore, report or isolate)
    hm_ca_illegal_action report

    # Switch Illegal Action (ignore, report or isolate)
    hm_sw_illegal_action report

    # Number of illegal SMPs a port may return to be declared
    # unhealthy
    hm_num_illegal 1


### Manual specification

This condition allows a user to mark a port as unhealthy manually in the
health policy file, which is specified in the OpenSM configuration:

    hm_ports_health_policy_file filename

To mark a port as unhealthy manually, specify its peer's NodeGUID and port
number, the string "unhealthy", and the action to perform for this port.

For example:

    0x0002c90300efe8d0 1 unhealthy isolate

The action is optional. If no action is specified, the default action is
applied.

The default actions are set in the OpenSM configuration:

    #
    # Unhealthy Ports Manual condition options
    #
    # CA/RTR Manual Action (ignore, report or isolate)
    hm_ca_manual_action ignore

    # Switch Manual Action (ignore, report, isolate or no_discover)
    hm_sw_manual_action no_discover

**Note:** To apply the settings, force OpenSM to reread the configuration
files by sending it a *HUP* signal.


## OpenSM actions

For each unhealthy condition it's possible to define an action OpenSM will
perform on the port found in that condition.


### Ignore

OpenSM will ignore the unhealthy condition.


### Report

OpenSM will report the unhealthy ports in opensm-unhealthy-ports.dump.


### Isolate

OpenSM will isolate the unhealthy port from routing: no routing (unicast or
multicast) will go through the isolated port.
The corresponding switch port will be set to the INIT logical state.
If all ports of a node are isolated, the whole node is isolated.
OpenSM will also report the isolated ports in the unhealthy ports dump file.


### No Discover

OpenSM will avoid discovery through the specified port and will treat the
port as if it were disconnected.


## Reporting

OpenSM reports unhealthy ports in a dump file named
*opensm-unhealthy-ports.dump*, located in the dump files directory.

The format of the dump file is:

    # NodeGUID, PortNum, NodeDesc, PeerNodeGUID, PeerPortNum,  PeerNodeDesc, {BadCond1, BadCond2, ...}, TimeStamp
    0x0002c903007e55a1, 25, "MF0;r-qa-sit-sx104:SX6036/U1", 0x0002c90300efe8d0, 1, "reg-r-vrt-003 HCA-1", {REBOOT,FLAPPING}, "Tue Dec 24 12:59:09 2013"
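
For post-processing, a dump line in the format above can be split into named fields. The following is a minimal sketch based only on the single example line shown here; it assumes node descriptions contain no double quotes, and `parse_dump_line` and the dictionary keys are hypothetical names.

```python
import re

# One record: two (GUID, port, "description") triples, {conditions}, "timestamp".
LINE_RE = re.compile(
    r'^(0x[0-9a-fA-F]+),\s*(\d+),\s*"([^"]*)",\s*'
    r'(0x[0-9a-fA-F]+),\s*(\d+),\s*"([^"]*)",\s*'
    r'\{([^}]*)\},\s*"([^"]*)"\s*$'
)


def parse_dump_line(line):
    """Return a dict for one dump record, or None for blank/comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("unrecognized dump line: " + line)
    guid, port, desc, peer_guid, peer_port, peer_desc, conds, stamp = match.groups()
    return {
        "node_guid": int(guid, 16),
        "port_num": int(port),
        "node_desc": desc,
        "peer_node_guid": int(peer_guid, 16),
        "peer_port_num": int(peer_port),
        "peer_node_desc": peer_desc,
        "conditions": [c.strip() for c in conds.split(",") if c.strip()],
        "timestamp": stamp,  # kept as text to avoid locale-dependent parsing
    }


sample = (
    '0x0002c903007e55a1, 25, "MF0;r-qa-sit-sx104:SX6036/U1", '
    '0x0002c90300efe8d0, 1, "reg-r-vrt-003 HCA-1", '
    '{REBOOT,FLAPPING}, "Tue Dec 24 12:59:09 2013"'
)
record = parse_dump_line(sample)
print(record["port_num"], record["conditions"])
```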


## Recovering unhealthy ports

To recover an unhealthy port, specify its peer in the health policy file,
which is specified in the OpenSM configuration:

    hm_ports_health_policy_file filename

It is possible to recover all ports by specifying the "all healthy" string in
the health policy file. Alternatively, a specific port can be recovered by
specifying the NodeGUID and port number of the peer of the unhealthy port,
followed by the "healthy" string.

For example:

    0x0002c90300efe8d0 1 healthy

When the "all healthy" directive is used, all ports that are not specified in
the file are considered healthy. The directive, when used, must be specified
at the beginning of the file.

For example:

    all healthy
    0x0002c90300efe8d0 1 unhealthy no_discover

**Note:** To apply the settings, force OpenSM to reread the configuration
files by sending it a *HUP* signal.
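
Before sending the signal, it can help to sanity-check the health policy file. The following is a minimal sketch of a checker for the entry forms shown in this document ("all healthy", `GUID port healthy`, and `GUID port unhealthy [action]`). `parse_policy` is a hypothetical helper, not part of OpenSM, and the real parser may accept or reject more than this sketch does.

```python
ACTIONS = {"ignore", "report", "isolate", "no_discover"}


def parse_policy(text):
    """Return (all_healthy, entries) for health policy text.

    Each entry is a (peer_guid, peer_port, state, action) tuple;
    action is None when the line does not specify one.
    """
    all_healthy = False
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields == ["all", "healthy"]:
            # The directive must be at the beginning of the file.
            if entries or all_healthy:
                raise ValueError('"all healthy" must come first')
            all_healthy = True
            continue
        if len(fields) not in (3, 4) or fields[2] not in ("healthy", "unhealthy"):
            raise ValueError("unrecognized entry: " + line.strip())
        guid, port, state = int(fields[0], 16), int(fields[1]), fields[2]
        action = fields[3] if len(fields) == 4 else None
        if action is not None and (state == "healthy" or action not in ACTIONS):
            raise ValueError("unexpected action: " + line.strip())
        entries.append((guid, port, state, action))
    return all_healthy, entries


policy_text = """\
all healthy
0x0002c90300efe8d0 1 unhealthy no_discover
"""
all_healthy, entries = parse_policy(policy_text)
print(all_healthy, entries)
```

The sketch accepts all four actions for any peer; the configuration comments in this document list `no_discover` only for the switch manual action, so a stricter check could take the node type into account.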

