The stability of the NSX Manager cluster is critical for VMware NSX operations. However, situations can occur where all NSX Managers crash simultaneously, leaving the control plane completely unavailable. Recently, such an event took place in my lab environment, and together with Broadcom Support I carefully analyzed the exact consequences for the data plane and operational processes.
Control Plane vs. Data Plane
It’s essential to understand the difference between the control plane and the data plane in NSX:
- Data plane: continues to forward traffic between virtual machines (VMs). Existing VM-to-VM communication is preserved.
- Control plane: operated by the NSX Manager cluster, it distributes ARP/MAC information and segment updates, and coordinates operations such as vMotion and VM power-on.
Immediate Impact of a Full Manager Cluster Outage
When the NSX Manager cluster goes down, the following consequences apply:
- vMotion and VM Power-On Failures
  vMotion operations and new VM registrations depend on the control plane. Without it, workloads cannot be migrated or powered on, although existing VMs keep running.
- Loss of ARP Suppression
  Normally, NSX suppresses ARP broadcasts by maintaining ARP/MAC tables. If the control plane fails, NSX falls back to “traditional ARP behavior”: ARP requests are broadcast (flooded) via Geneve tunnels to all transport nodes (TEPs) in the segment. This preserves connectivity but reduces efficiency.
- Aging Timers Trigger Broadcast Floods
  MAC, ARP, and VTEP entries in the data plane age out after ~10 minutes by default. When they expire while the control plane is down, new ARP broadcasts are required to relearn them.
  - In small segments, this has limited impact.
  - In large segments, repeated flooding can introduce noticeable overhead and potential performance degradation.
- Provisioning and Migration Issues
  If new VMs are provisioned or migrated during the outage, ESXi hosts may lack up-to-date forwarding information. This can cause temporary connectivity problems until the control plane is restored.
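The aging behavior described above can be illustrated with a small simulation. This is a hypothetical sketch, not NSX code: `ArpTable` models a data-plane ARP cache whose entries expire after a TTL, so that once the control plane can no longer refresh the table, a lookup past the aging timeout falls back to a broadcast (counted in `floods`).

```python
import time

# Hypothetical model of data-plane ARP table aging (not NSX source code).
# NSX ages these entries out after roughly 10 minutes by default; with the
# control plane down, an expired entry can only be relearned by flooding
# an ARP request to all TEPs in the segment.

AGING_TIMEOUT = 600  # seconds, ~10 minutes

class ArpTable:
    def __init__(self, ttl=AGING_TIMEOUT, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable clock makes aging easy to demo
        self.entries = {}    # ip -> (mac, learned_at)
        self.floods = 0      # number of ARP broadcasts that were needed

    def learn(self, ip, mac):
        """Record a mapping, e.g. from an observed ARP reply."""
        self.entries[ip] = (mac, self.clock())

    def resolve(self, ip):
        """Return the MAC for ip, or flood an ARP request on miss/expiry."""
        entry = self.entries.get(ip)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]  # suppressed: answered from the local table
        # Entry missing or aged out -> fall back to broadcast flooding.
        self.floods += 1
        return None
```

With a fake clock you can see that a fresh entry is answered locally, while the same lookup after the timeout triggers a flood, which is exactly why large segments feel the outage more than small ones.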
Root Cause and Risks
- In this specific case, the outage was linked to a known JDK bug (JDK-8330017) in the Corfu/ForkJoinPool implementation, which caused NSX services to stop executing tasks (see: “NSX is Impacted by JDK-8330017: ForkJoinPool Stops Executing Tasks Due to ctl Field Release Count (RC) Overflow”).
- The bug has been fixed in NSX versions 4.2.1.4 and 4.2.2.1 (and later), including VCF 9.0.
- For older versions, Broadcom engineering recommends a rolling reboot of the NSX Managers as a temporary workaround.
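The rolling-reboot workaround boils down to one rule: never reboot a second manager until the cluster is stable again after the first. The sketch below is a hypothetical orchestration helper, not an official Broadcom tool; `restart_node` and `cluster_is_stable` are placeholder callables standing in for whatever mechanism you actually use (NSX CLI, API, or console access).

```python
import time

# Hypothetical rolling-reboot orchestration sketch (not an official tool).
# restart_node(node) and cluster_is_stable() are caller-supplied stand-ins
# for the real restart and health-check mechanisms.

def rolling_reboot(managers, restart_node, cluster_is_stable,
                   poll_interval=30, timeout=1800, sleep=time.sleep):
    """Reboot NSX Managers one at a time, waiting for the cluster to
    report stable after each reboot before touching the next node."""
    for node in managers:
        restart_node(node)
        waited = 0
        while not cluster_is_stable():
            if waited >= timeout:
                raise TimeoutError(f"cluster not stable after rebooting {node}")
            sleep(poll_interval)
            waited += poll_interval
    return True
```

The key design choice is the serialized loop: quorum is only at risk if more than one manager is down at once, so the function blocks on the health check between nodes instead of rebooting in parallel.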
Key Takeaways
- Connectivity remains intact – the data plane continues forwarding traffic, even if the control plane is fully unavailable.
- New operations fail – vMotion, VM power-on, and dynamic provisioning cannot function without the control plane.
- Fallback = ARP flooding – this ensures connectivity but may introduce overhead in large-scale segments.
- Upgrade is essential – stability depends on running patched NSX versions.
Conclusion
A full NSX Manager cluster outage does not immediately disrupt existing workloads, but it does highlight the risks of control plane dependency. In dynamic or large-scale environments, the operational impact can be significant, ranging from reduced efficiency to failed provisioning.
To mitigate these risks:
- Upgrade to the latest patched NSX release (4.2.1.4 or later).
- Plan preventive rolling reboots of the NSX Managers where necessary.
- Incorporate control plane failure behavior into your NSX design and operational procedures.
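Detecting a manager cluster problem early makes all of the above easier. The hedged sketch below polls the NSX Manager REST API for cluster status; the endpoint path and the response field names (`mgmt_cluster_status`, `control_cluster_status`) are assumptions you should verify against the API reference for your NSX version before relying on them.

```python
import base64
import json
import urllib.request

# Hedged monitoring sketch. Assumes the NSX API endpoint
# GET /api/v1/cluster/status and the response fields used in
# cluster_is_healthy(); verify both against your version's API reference.

def fetch_cluster_status(manager, user, password):
    """Fetch the raw cluster-status payload from one NSX Manager."""
    req = urllib.request.Request(f"https://{manager}/api/v1/cluster/status")
    # Basic auth for brevity; production code should prefer sessions/certs.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def cluster_is_healthy(status):
    """Interpret the payload (assumed field names, see note above)."""
    mgmt = status.get("mgmt_cluster_status", {}).get("status")
    ctrl = status.get("control_cluster_status", {}).get("status")
    return mgmt == "STABLE" and ctrl == "STABLE"
```

Running a check like this from outside the NSX Managers themselves matters here: in a full-cluster outage such as the one described in this post, any alerting that runs on the managers goes down with them.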