Nexus vPC Troubleshooting

This lab expands on the previous vPC configuration lab. I have made a few configuration changes since then: Nexus1 is now the HSRP active gateway, has the preferred vPC role priority (making it the vPC primary) and is the STP root.

vPC Priority

vPC role priority does not work like HSRP priority. With HSRP, a router can be given a priority higher than the default of 100 and, with preemption enabled, it will always become the active gateway.
vPC roles are non-preemptive: the role priority (where the lower value wins) only decides the roles when they are elected, and once the operational roles have changed they only revert when you use the preempt command. The preempt command should only be used during maintenance windows.
Before it runs, the preempt command warns you about the “peer-switch” configuration.
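The election and the non-preemptive behaviour can be sketched as a small Python model. This is a simplified illustration, not NX-OS code: it assumes the usual vPC rules (lower role priority wins, a tie goes to the lower system MAC, and the NX-OS default role priority is 32667). The `VpcPeer` class, the function names and the priority and MAC values in the usage are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical model of vPC role election; not an NX-OS API.
DEFAULT_ROLE_PRIORITY = 32667  # NX-OS default vPC role priority (lower wins)


@dataclass
class VpcPeer:
    name: str
    system_mac: str
    role_priority: int = DEFAULT_ROLE_PRIORITY


def mac_to_int(mac: str) -> int:
    """Turn a colon-separated MAC address into an integer for comparison."""
    return int(mac.replace(":", ""), 16)


def elect_primary(a: VpcPeer, b: VpcPeer) -> VpcPeer:
    """Role election: lowest role priority wins, then lowest system MAC."""
    return min((a, b), key=lambda p: (p.role_priority, mac_to_int(p.system_mac)))


def primary_after_recovery(current: VpcPeer, a: VpcPeer, b: VpcPeer,
                           preempt: bool = False) -> VpcPeer:
    """Roles are non-preemptive: the operational primary keeps its role when
    the failed peer returns, unless the role is explicitly preempted."""
    return elect_primary(a, b) if preempt else current


if __name__ == "__main__":
    nexus1 = VpcPeer("Nexus1", "00:00:00:00:00:02", role_priority=100)  # hypothetical values
    nexus2 = VpcPeer("Nexus2", "00:00:00:00:00:01")
    print(elect_primary(nexus1, nexus2).name)                         # Nexus1
    print(primary_after_recovery(nexus2, nexus1, nexus2).name)        # Nexus2
    print(primary_after_recovery(nexus2, nexus1, nexus2, True).name)  # Nexus1
```

The second call shows why a better role priority alone does not return Nexus1 to primary after a failure; only the preempt path re-runs the election.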

HSRP

HSRP is configured on both mcast_Nexus1 and mcast_Nexus2, and failover has been tested and works.
Nexus1 is the HSRP active router with a priority of 110, and Nexus2 uses the default priority of 100. Preemption is also configured with a 10-second delay.
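The HSRP behaviour described here can be modelled in a few lines of Python. This is a simplified sketch, not Cisco code: it assumes the standard HSRP rules (highest priority wins, a tie goes to the highest interface IP) and only lets a returning router preempt when its priority is strictly higher and the preempt delay has elapsed. `HsrpRouter`, the function names and the IP addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

# Hypothetical model of HSRP active-router selection; not a Cisco API.


@dataclass
class HsrpRouter:
    name: str
    ip: str
    priority: int = 100   # HSRP default priority
    preempt: bool = False


def elect_active(routers):
    """Highest priority wins; on a tie, the highest interface IP wins."""
    return max(routers, key=lambda r: (r.priority, IPv4Address(r.ip)))


def active_after_return(current, returning, seconds_up, delay=10):
    """Simplified preemption: the returning router takes over only if it has
    preemption enabled, a strictly higher priority, and the delay has elapsed."""
    if (returning.preempt and returning.priority > current.priority
            and seconds_up >= delay):
        return returning
    return current


if __name__ == "__main__":
    nexus1 = HsrpRouter("Nexus1", "10.0.10.2", priority=110, preempt=True)  # hypothetical IPs
    nexus2 = HsrpRouter("Nexus2", "10.0.10.3")
    print(elect_active([nexus1, nexus2]).name)                      # Nexus1
    print(active_after_return(nexus2, nexus1, seconds_up=5).name)   # Nexus2
    print(active_after_return(nexus2, nexus1, seconds_up=12).name)  # Nexus1
```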

STP

mcast_Nexus1 is the root primary for all VLANs and mcast_Nexus2 is the root secondary for all VLANs.

When Nexus1 is down, Nexus2 becomes the root.
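The root election can be sketched in Python as "lowest bridge ID wins", where the bridge ID compares priority first and MAC address second. This is a simplified model: it assumes the typical values that root primary (24576) and root secondary (28672) set when the existing root uses the default of 32768, it ignores the VLAN ID added by the extended system ID, and the "Access" switch and all MAC addresses are hypothetical.

```python
# Hypothetical model of STP root bridge election; not a Cisco API.
ROOT_PRIMARY = 24576     # typical result of "root primary" against a default-priority root
ROOT_SECONDARY = 28672   # typical result of "root secondary"
DEFAULT_PRIORITY = 32768


def mac_to_int(mac: str) -> int:
    """Turn a colon-separated MAC address into an integer for comparison."""
    return int(mac.replace(":", ""), 16)


def elect_root(bridges: dict) -> str:
    """bridges maps a switch name to (priority, MAC). The lowest bridge ID
    (priority first, then MAC address) becomes the root."""
    return min(bridges, key=lambda name: (bridges[name][0], mac_to_int(bridges[name][1])))


if __name__ == "__main__":
    bridges = {
        "Nexus1": (ROOT_PRIMARY, "00:00:00:00:00:0b"),   # hypothetical MACs
        "Nexus2": (ROOT_SECONDARY, "00:00:00:00:00:0c"),
        "Access": (DEFAULT_PRIORITY, "00:00:00:00:00:01"),
    }
    print(elect_root(bridges))  # Nexus1
    del bridges["Nexus1"]       # simulate Nexus1 going down
    print(elect_root(bridges))  # Nexus2
```

The hypothetical access switch has the lowest MAC address but never becomes root, because the priority is compared before the MAC.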

vPC Troubleshooting

This troubleshooting focuses specifically on vPC failure scenarios.

Pre-Change Show Commands

  • Nexus1 is HSRP active
  • Nexus1 is vPC primary
  • All links are up between all Nexus switches

vPC Peer-Link Failure

To simulate a peer-link failure, I shut down the peer-link ports on Nexus2. Nexus1 remains the primary vPC switch, and all traffic is directed to Nexus1.

I am using mcast_server to test reachability of the gateway.

Ports Eth1/5 and Eth1/6 on Nexus2 are shut down. These are the physical peer-link ports.


Nexus1 detects that the links are down and, because it holds the primary vPC role, it remains the vPC primary switch, the HSRP active gateway and the STP root. As the primary vPC switch, it keeps its vPC member ports up.

Nexus2 is a different story. Because it is the secondary and the keepalive link is still up, Nexus2 knows that Nexus1 is still alive, so it disables all of its vPC member ports to prevent loops and other unwanted behaviour.
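The reaction of each peer can be summarised as a small decision function. This is a simplified Python sketch, not NX-OS code: `vpc_reaction` and the `Action` values are hypothetical, `peer_alive` stands for what the keepalive link reports, and timers and other NX-OS features are ignored.

```python
from enum import Enum

# Hypothetical model of how a vPC peer reacts to link failures; not an NX-OS API.


class Action(Enum):
    NORMAL = "forward normally"
    KEEP_PORTS_UP = "keep vPC member ports up"
    SUSPEND_VPC_PORTS = "disable vPC member ports"
    BECOME_PRIMARY = "take over as operational primary"


def vpc_reaction(role: str, peer_link_up: bool, peer_alive: bool) -> Action:
    """role is 'primary' or 'secondary'; peer_alive is what the keepalive reports."""
    if role not in ("primary", "secondary"):
        raise ValueError(f"unknown role: {role}")
    if peer_link_up:
        # Losing only the keepalive changes nothing while the peer-link is up.
        return Action.NORMAL
    if role == "primary":
        return Action.KEEP_PORTS_UP
    if peer_alive:
        # The keepalive says the primary is alive, so avoid loops by disabling ports.
        return Action.SUSPEND_VPC_PORTS
    # No peer-link and no keepalive: the secondary assumes the primary is gone.
    return Action.BECOME_PRIMARY


if __name__ == "__main__":
    # Peer-link shut on Nexus2 while the keepalive stays up, as in this lab.
    print(vpc_reaction("primary", peer_link_up=False, peer_alive=True).value)
    print(vpc_reaction("secondary", peer_link_up=False, peer_alive=True).value)
```

The last branch is also what the later scenarios in this lab hit: a failed peer switch and a split brain both leave the secondary with no peer-link and no keepalive, so it takes over as operational primary.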

When the Ethernet ports are recovered, the vPC roles of Nexus1 and Nexus2 have swapped:
Nexus1 is now the secondary, and Nexus2 is the primary.

There are three ways to fix this: reload the configured secondary switch (Nexus2), shut/no shut the peer-link ports on that switch, or use the preempt command. The preempt option requires the “peer-switch” command, which makes both switches appear as a single switch to spanning tree.

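The advantage of peer-switch is faster convergence when a vPC peer fails or recovers: because both peers present the same spanning tree bridge ID, downstream switches do not see a root change.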
Peer-switch must be enabled on both peer switches and both switches must have the same spanning tree priority.
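The two peer-switch requirements can be expressed as a simple pre-check. This is a hypothetical Python sketch, not a Cisco tool: `PeerStpConfig` and `peer_switch_problems` are illustrative names, and the per-VLAN priorities would have to be gathered from each switch by some other means.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical pre-check for the peer-switch requirements; not a Cisco API.


@dataclass
class PeerStpConfig:
    name: str
    peer_switch: bool
    vlan_priority: Dict[int, int] = field(default_factory=dict)  # VLAN -> STP priority


def peer_switch_problems(a: PeerStpConfig, b: PeerStpConfig) -> List[str]:
    """Return the reasons this pair does not meet the peer-switch requirements."""
    problems = []
    for peer in (a, b):
        if not peer.peer_switch:
            problems.append(f"{peer.name}: peer-switch not enabled")
    for vlan in sorted(set(a.vlan_priority) | set(b.vlan_priority)):
        pa, pb = a.vlan_priority.get(vlan), b.vlan_priority.get(vlan)
        if pa != pb:
            problems.append(f"VLAN {vlan}: STP priority {pa} on {a.name}, {pb} on {b.name}")
    return problems


if __name__ == "__main__":
    n1 = PeerStpConfig("Nexus1", True, {10: 24576})   # hypothetical VLAN 10
    n2 = PeerStpConfig("Nexus2", False, {10: 28672})
    for problem in peer_switch_problems(n1, n2):
        print(problem)
```

For example, two peers left at the root primary and root secondary priorities would be reported as mismatched, in addition to any peer where peer-switch is missing.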

vPC Peer Keepalive Link Failure

If only the keepalive link fails, both switches continue to function normally, because the peer-link is still up. The keepalive link is a secondary mechanism for checking the health of the other switch in the vPC pair; it is used to tell whether the peer is still alive when the peer-link fails.

vPC Peer Switch Failure

When a peer switch fails, all traffic should be sent to the only remaining switch. In this scenario I shut down all the ports on Nexus1, the primary. All traffic is then directed via Nexus2, which becomes the new primary.

When all the ports on Nexus1 are brought back up, Nexus2 remains the vPC primary. To restore Nexus1 as primary, either reload Nexus2 or run the vPC role preempt command on Nexus1.

Dual Active or Split Brain

This scenario is unusual, as it requires a failure of the keepalive link followed by a failure of the peer-link.

I replicated this with Nexus2 as the operational primary by shutting down the keepalive ports and then the peer-link ports, waiting a few seconds between the two. Both Nexus switches are now in Dual Active mode.

Both switches are also the HSRP active router.

There is no loss of ICMP traffic from my mcast_server PC. The packet captures show ICMP requests and replies on both uplinks, to Nexus1 and to Nexus2.
This simple gateway test works, but two active gateways would cause other issues for traffic leaving the network.

Once the ports between the switches were brought back up, the switches re-established which one took which role, and the Dual Active scenario was resolved.
