Entries from Eric Sloof

Tuesday, April 15. 2025

InfiniBand on VMware vSphere 8: Updated Setup and Performance Insights

With the increasing demand for high-performance networking in virtualized environments, configuring InfiniBand on VMware vSphere 8 has become a critical task for many IT teams. This blog explores the updated process and considerations for setting up InfiniBand using the latest tools and practices, providing insights into performance, SR-IOV configuration, and common troubleshooting scenarios.

The integration of InfiniBand on vSphere 8 has been streamlined with enhancements to the vSphere Lifecycle Manager and improved native driver support. For those familiar with vSphere 7, many procedures remain consistent, but vSphere 8 introduces some important changes worth noting.

The first step is to confirm that the native Mellanox driver is already present on the ESXi host. With vSphere 8, the driver comes pre-installed, and verification can be done using a simple esxcli command.

Installing Mellanox Firmware Tools, specifically MFT and NMST, is now much easier thanks to vSphere Lifecycle Manager. Instead of deploying packages manually across hosts, admins can use Lifecycle Manager to import and remediate clusters efficiently. These packages can be downloaded from NVIDIA’s website, and after uploading them to the vSphere Client, the entire cluster can be updated with minimal downtime.

In some cases, InfiniBand cards may not be visible via the mst status command after installation. This can typically be resolved by putting the native Mellanox driver into recovery mode using specific esxcli module parameters, followed by a couple of host reboots.

Enabling SR-IOV is particularly relevant for workloads like large language model training that require multiple IB cards. A script using mlxconfig can be used to enable advanced PCI settings and configure virtual functions on each device. It's important to remember that in vSphere 8, InfiniBand VFs may appear as 'Down' in the UI, which is expected behavior. The actual link state should be verified at the VM level.

For environments that continue to use PCI passthrough rather than SR-IOV, disabling SR-IOV can be done with a similar script that reverts the card settings and resets the ESXi module parameters.

On the switching side, administrators should ensure their IB switches are running an up-to-date MLNX-OS. Compatibility between switch firmware and adapter firmware is key to avoiding communication issues. Enabling Open-SM virtualization support on the switches is also critical to support SR-IOV functionality.

The paper outlines several behavioral nuances when using passthrough versus SR-IOV. For instance, certain Open-SM utilities may not function when a virtual function is passed to a VM, which is normal. Likewise, mst status behavior may differ depending on whether a physical or virtual function is used.

Troubleshooting steps are also provided for cases where MLX cards fail to appear in the mst status output. These include reloading the appropriate kernel modules and signaling system processes, after which recovery mode is temporarily enabled until the next host reboot.

Performance tests revealed that unidirectional bandwidth reached 396.5 Gbps using four queue pairs, nearly saturating the theoretical line rate of the InfiniBand cards. Bidirectional bandwidth tests showed performance scaling up to 790 Gbps with two cards, confirming the setup’s ability to handle demanding HPC and AI workloads.

In conclusion, VMware vSphere 8 enhances the experience of deploying InfiniBand by introducing automation through Lifecycle Manager and retaining robust performance tuning capabilities. With updated best practices, simplified installation, and clear guidance on SR-IOV and troubleshooting, IT teams can now fully leverage InfiniBand’s potential in virtualized environments, including VMware Cloud Foundation.

This blog captures the essence of the technical paper authored by Yuankun Fu, who has a strong background in HPC and AI performance optimization within VMware. His guidance in this paper provides both practical instructions and valuable performance data for teams looking to adopt or enhance InfiniBand in their vSphere environments.

Friday, March 28. 2025

Boosting VM Performance with vSphere 8.0 Virtual Topology: What You Need to Know

VMware vSphere 8.0 introduces a major enhancement for virtual CPU configuration: virtual topology. This feature automatically determines the optimal way to assign virtual CPUs (vCPUs) to virtual machines, improving alignment with physical hardware and potentially boosting performance.

Before vSphere 8.0, virtual machines defaulted to one core per socket, which sometimes led to inefficiencies and required manual adjustments. With virtual topology, the system intelligently chooses the right number of cores per socket, streamlining setup and enhancing performance, particularly on systems with complex NUMA (Non-Uniform Memory Access) configurations.

VMware’s internal tests show promising results. In database scenarios, such as Oracle on Linux, performance improved by up to 14%. Microsoft SQL Server workloads saw up to a 17% gain. Virtual desktop infrastructure (VDI) and VMmark benchmarks showed stable performance with no notable change. These gains came with better CPU and memory usage, thanks to fewer NUMA misses and more efficient cache behavior.

Not all workloads benefited equally. Micro-benchmarks for storage and network showed mixed results. Iometer tests in Windows initially showed increased CPU usage due to thread migrations, but this was resolved by modifying the application to use thread affinity. Netperf network tests saw reduced performance when virtual topology was enabled, particularly in Linux environments, due to changes in how interrupts were distributed across CPU cores.

In conclusion, vSphere 8.0’s virtual topology feature simplifies VM configuration and enhances performance for many real-world workloads. While some edge cases may still require manual tuning, most users will benefit from leaving this feature enabled by default.

VMware vSphere 8.0 Virtual Topology Performance Study

Saturday, March 22. 2025

Bridging Cloud and Edge AI: VMware Meets Azure Machine Learning

In today’s AI-driven world, organizations face a tough challenge—how to leverage the power of cloud-based machine learning while keeping data secure and on-premises. A recent technical white paper from VMware and Microsoft offers an elegant solution: integrating VMware Cloud Foundation (VCF) with Azure Machine Learning (AML) using Azure Arc and Tanzu Kubernetes Grid.

What’s the Big Idea?

This integration enables hybrid machine learning deployments. That means businesses can develop AI models in Azure’s cloud and run them locally on their own infrastructure using VMware Cloud Foundation. It’s a game-changer for industries with strict data residency requirements, latency-sensitive applications, or heavy on-prem investments.

Why It Matters

• Hybrid Flexibility: Train in the cloud, deploy at the edge or on-prem.

• Data Control: Keep sensitive data in your own data centers.

• Familiar Tools: Leverage existing VMware and Azure tools without a steep learning curve.

• AI-Ready Infrastructure: GPU-powered environments with scalable Kubernetes clusters.

Under the Hood: Key Components

• VMware Cloud Foundation: The core platform, blending compute (vSphere), storage (vSAN), networking (NSX), and Kubernetes (Tanzu).

• Azure Arc: The glue that extends Azure services to your own servers.

• Azure Machine Learning Arc Extension: Brings training and inference capabilities into your local Kubernetes clusters.

How It Works

1. Set Up Your VMware Cloud Foundation Environment

• Deploy management and workload domains

• Configure NSX, vSAN, and Tanzu Kubernetes clusters

• Optionally deploy vSAN File Services for shared storage

2. Connect to Azure via Arc

• Register your on-prem Kubernetes cluster with Azure

• Deploy the AML Arc extension

3. Run ML Jobs Locally

• Define instance types for workloads

• Launch training and inference jobs directly from Azure Machine Learning Studio

• Monitor and manage as if it were native to Azure

Real-World Example

The paper walks through a practical scenario: training an image classification model using logistic regression on an on-prem cluster, fully managed through Azure Machine Learning Studio.

Who Should Care?

This solution is tailor-made for:

• Enterprises balancing cloud innovation with strict compliance

• CTOs and CIOs exploring AI in hybrid environments

• DevOps and infrastructure teams familiar with vSphere and Kubernetes

Final Thoughts

This integration isn’t just a technical feat—it’s a strategic enabler. By merging the reliability of VMware with the AI prowess of Azure, organizations can innovate faster, stay compliant, and get the most out of their data—whether it’s in the cloud, on the edge, or in the basement server room.

Wednesday, March 19. 2025

Unlocking AI Inference with VMware and NVIDIA: A Scalable Private AI Foundation

As artificial intelligence (AI) continues to transform industries, enterprises seek more cost-efficient, secure, and scalable ways to run inference workloads. Public cloud services offer flexibility but come with concerns over costs, data privacy, and governance. VMware Private AI Foundation with NVIDIA delivers an on-premise alternative, combining VMware Cloud Foundation (VCF) with NVIDIA AI Enterprise, designed for high-performance AI inference workloads using NVIDIA HGX systems.

Why Enterprises Need Private AI Infrastructure

GPU Optimization Challenges: On-prem GPUs often suffer from underutilization due to misallocation or overprovisioning. VMware’s platform enables dynamic GPU allocation, ensuring maximum utilization and efficiency.
Cloud-Like Flexibility for Data Scientists: The fast-evolving AI landscape requires a seamless, flexible environment for data scientists while IT teams retain control over infrastructure.
Data Privacy and Governance: As AI models rely on sensitive data, private AI solutions ensure security, compliance, and controlled access to proprietary models and datasets.
Familiar VMware Management Interface: IT administrators can leverage VMware’s widely used management tools, reducing learning curves and operational overhead.

The Core Components of VMware Private AI Foundation with NVIDIA

VMware Cloud Foundation (VCF): A full-stack private cloud platform integrating vSphere, vSAN, NSX, and the Aria Suite.
NVIDIA AI Enterprise: Includes NVIDIA vGPU (C-Series), NIM microservices, NeMo Retriever, and AI Blueprints, optimizing AI workloads.
HGX Systems: NVIDIA-certified servers featuring 8x H100/H200 GPUs, interconnected via NVSwitch and NVLink, delivering industry-leading performance.
Ethernet Networking: High-speed networking with NVIDIA Spectrum-X Ethernet ensures fast, efficient data transfer between nodes.

Reference Architecture for AI Inference

The architecture is designed for enterprises deploying AI workloads in private data centers. Key elements include:

1. Physical Architecture

Inference Servers: 4–16 NVIDIA HGX systems with H100/H200 GPUs
Networking: 100 GbE Ethernet fabric for inference workloads and 25 GbE for management and storage
Management Servers: 4 vSAN-ready nodes hosting VMware’s core infrastructure

2. Virtual Architecture

Management Domain: Manages the private cloud environment, including SDDC Manager, vCenter, NSX, and Aria Suite.
Workload Domain: Hosts AI workloads, leveraging Supervisor Clusters to deploy Deep Learning Virtual Machines (DLVMs) and AI Kubernetes clusters.
Vector Databases: PostgreSQL with pgVector extension enables retrieval-augmented generation (RAG) for generative AI applications.

Performance & Validation

VMware and NVIDIA validate the solution’s performance using GenAI-Perf benchmarking, comparing virtualized environments with bare-metal deployments. The optimized platform delivers high throughput and low latency, ensuring scalable, cost-effective AI inference.

Why Choose VMware Private AI Foundation with NVIDIA?

✅ Enhanced GPU Utilization: Maximizes AI compute resources
✅ Enterprise-Grade Security: Ensures data privacy and model governance
✅ Operational Efficiency: Uses familiar VMware management tools
✅ Scalable & Future-Proof: Designed for evolving AI workloads

Final Thoughts

For enterprises looking to deploy AI inference workloads while maintaining control, security, and efficiency, VMware Private AI Foundation with NVIDIA provides a powerful, flexible, and cost-effective private AI infrastructure.

Ready to optimize your AI strategy? Contact VMware and NVIDIA for deployment guidance today!

Saturday, March 15. 2025

Beginner’s Guide to Automation with vDefend Firewall

In today's fast-paced IT landscape, automation is no longer a luxury—it’s a necessity. Security operations teams face increasing complexity when managing network security policies, requiring efficient, automated solutions. The vDefend Firewall, integrated with VMware NSX, offers robust automation capabilities through various tools and scripting languages. This guide explores the automation strategies available with vDefend, helping IT professionals streamline operations and enhance security efficiency.

Understanding CRUD Actions in Network Automation

CRUD (Create, Read, Update, Delete) actions form the backbone of automation workflows. vDefend allows these actions via RESTful API methods, such as:

GET – Retrieve resource information.
POST – Create a new resource.
PUT/PATCH – Update existing resources.
DELETE – Remove a resource.

By leveraging these REST API methods, IT teams can automate firewall policies, create security groups, and configure network settings without manual intervention.

Automation Strategies for vDefend Firewall

Several automation tools can be used with vDefend, each offering unique benefits:

1. REST API Calls Using NSX Policy API

The NSX Manager Policy API allows direct execution of CRUD actions on network resources. Developers can use languages like Python, GoLang, and JavaScript to write scripts that interact with NSX Manager, enabling seamless automation for security operations.

2. Terraform and OpenTofu

These Infrastructure-as-Code (IaC) tools help standardize network and security deployments. Using declarative manifests, organizations can define load-balancers, firewall rules, and security policies that can be version-controlled and deployed through CI/CD pipelines.

3. Ansible

Ansible is commonly used to provision NSX core components, including NSX Managers, Edges, and Transport Nodes. IT teams can integrate Ansible with Terraform for a fully automated network configuration strategy.

4. PowerCLI

PowerCLI, a PowerShell module for VMware, enables administrators to automate firewall configurations and network security policies efficiently.

5. Aria Automation Suite

VMware’s Aria Automation provides an enterprise-grade platform for orchestrating network security tasks. It includes:

Aria Assembler – Develops and deploys cloud templates for security configurations.
Aria Orchestrator – Automates complex workflows for NSX security management.
Aria Service Broker – Provides a self-service portal for security and network automation.

Key API Fundamentals

To effectively leverage vDefend’s automation capabilities, understanding its API architecture is crucial:

Hierarchical API Structure: NSX API follows a tree structure where resources are organized in parent-child relationships.
Cursor Pagination: Large datasets are paginated using cursors to improve query efficiency.
Sequence Numbers: Firewall rules are executed in top-down order, with lower sequence numbers evaluated first.
Authentication Methods: API calls require authentication via basic auth, session tokens, or API keys.

Full-Scale Automation Example

A real-world automation scenario using vDefend involves:

Gathering VM Information – Identifying VMs and retrieving security tags.
Tagging VMs – Assigning labels to categorize resources.
Creating Groups – Defining security groups dynamically based on VM tags.
Defining Custom Services – Creating custom firewall services based on specific port requirements.
Creating Firewall Policies and Rules – Automating policy deployment to enforce security controls.

For example, an automated firewall rule to allow HTTPS traffic from a web group to an application group would look like this in JSON format:

{
  "action": "ALLOW",
  "source_groups": ["/infra/domains/default/groups/WebGroup"],
  "destination_groups": ["/infra/domains/default/groups/AppGroup"],
  "services": ["/infra/services/HTTPS"],
  "scope": ["/infra/domains/default/groups/WebGroup"]
}

Conclusion

The vDefend Firewall provides extensive automation capabilities that simplify network security management. By leveraging tools like Terraform, Ansible, PowerCLI, and the NSX Policy API, IT teams can deploy consistent and scalable security policies efficiently. As organizations embrace Infrastructure as Code (IaC) and API-driven security, automation will continue to play a vital role in maintaining robust network defenses.

For further reading, check out the official VMware NSX API documentation and sample automation scripts available in public repositories. Start automating today and take your network security to the next level!

Sunday, March 9. 2025

Deploying Nested VMware Cloud Foundation Environments with Holodeck 5.2

VMware Cloud Foundation (VCF) is a robust solution for building private clouds, but setting up a lab environment for testing and training can be complex and resource-intensive. That’s where Holodeck Toolkit 5.2 comes in—an automated and standardized approach to deploying nested VCF environments on VMware ESXi.

What is Holodeck 5.2?

Holodeck 5.2 is a technical lab deployment toolkit designed for:

Hands-on testing and training in VMware Cloud Foundation (VCF).
Deploying a fully functional, self-contained VCF lab without relying on external infrastructure.
Reducing hardware requirements by running multiple virtualized VCF nodes on a single ESXi host.

This toolkit supports VCF 5.2 and 5.2.1 and leverages VCF Lab Constructor (VLC) to automate deployment.

Key Benefits of Holodeck 5.2

✅ Simplified Deployment

Automated installation of VCF management domains, NSX networking, and Tanzu Kubernetes.
Fully pre-configured Cloud Builder VM for DNS, NTP, DHCP, and BGP peering.
Requires only one external IP address for the entire environment.

✅ Reduced Hardware Requirements

Instead of needing four physical vSAN Ready Nodes, Holodeck runs the entire VCF lab on a single ESXi host.
Multiple nested VCF environments can be run on a single host with enough resources.

✅ Isolated and Repeatable Testing

Uses isolated networking with virtualized NSX configurations.
Ensures no conflicts with existing infrastructure.
A fully automated deployment means consistent lab setups every time.

Single vs. Multi-Site Deployments

Holodeck 5.2 supports:

Single-Site Lab: Standard VCF Hands-on Lab configuration.
Multi-Site Lab: Advanced scenarios with NSX networking across sites and HCX-based workload migration.

Holodeck Environment Components

Each deployment includes:

Holo-Console: Windows-based jump server for managing the environment.
Holo-Router: Custom Photon OS router for external connectivity.
Cloud Builder VM: Handles VCF deployment and internal lab services.
Nested ESXi Hosts: Simulating a real VCF environment with NSX and vSAN.

How to Deploy Holodeck 5.2

Prepare the ESXi Host
- Install VMware ESXi 7.0 or 8.0.
- Configure networking with vSphere Standard Switches.
- Ensure enough resources: 16+ cores, 384GB+ RAM, 3.5TB+ SSD.
Build the Holo-Console ISO
- Custom Windows Server 2019 VM with PowerCLI, OVF Tool, and VCF tools pre-installed.
- Automated setup using PowerShell scripts.
Deploy the Holodeck Environment
- Use VLC CLI or GUI to deploy VCF management and workload domains.
- Configure NSX, AVN, and Kubernetes.
Enhance with VMware Aria Automation
- Deploy Aria Automation 8.18 for cloud management.
- Automate provisioning of VMs and Kubernetes workloads.

Conclusion

For VMware professionals looking to test and train with VCF, Holodeck 5.2 provides a cost-effective, automated, and fully isolated lab environment. With nested VCF deployments running on a single ESXi host, it’s easier than ever to experiment with NSX, vSAN, and Aria Automation.

For full setup details, refer to the official Holodeck Toolkit 5.2 Setup Guide: Holodeck Toolkit v5.2 Setup

Wednesday, February 26. 2025

For the 17th Time: vExpert! 🎉

Yes! I’m beyond excited to share that I’ve been awarded the vExpert Award 2025 for the 17th consecutive year! 🏆 It’s just as thrilling as the first time, and I’m incredibly grateful to be recognized for my contributions to the community.

I also want to extend a huge congratulations to all my fellow vExperts who have earned this recognition! It’s truly inspiring to see the passion, knowledge, and dedication within our community. Together, we continue to learn, innovate, and support each other.

A massive thank you to everyone who makes the vCommunity so special—whether you’re a mentor, speaker, blogger, or simply someone who loves to share knowledge. I’m proud to be part of this incredible network, and I look forward to another year of collaboration and innovation!

Here’s to pushing the boundaries and making an impact! 🚀

Saturday, February 22. 2025

VMware Tanzu RabbitMQ 1.3 vs. Confluent Kafka: Performance Study Insights

With the growing demand for scalable and high-performance messaging solutions, VMware Tanzu RabbitMQ 1.3 and Confluent Kafka have emerged as key players in event streaming and message queue technologies. A recent performance study on VMware Tanzu Kubernetes Grid (TKG) highlights key improvements in Tanzu RabbitMQ and provides insights into Apache Kafka’s deployment on Kubernetes. Below, we summarize the key takeaways.

Key Technologies Evaluated

The study focused on the following technologies running on VMware Tanzu Kubernetes Grid:

• Tanzu RabbitMQ 1.3 – An enterprise-grade message broker with improvements in Quorum Queues and streaming capabilities.

• Confluent Kafka – A widely used distributed event streaming platform with support for large-scale real-time data processing.

• Strimzi – A Kubernetes operator for deploying Kafka clusters with ease.

Performance Highlights

Tanzu RabbitMQ 1.3: Significant Throughput Gains

• 30% higher throughput compared to previous versions, especially in Quorum Queue scenarios.

• Increased efficiency in handling messages of different sizes:

• Small messages (~55,634 messages/sec)

• Large messages (~15,334 messages/sec)

• RabbitMQ Streams reached up to 2.47 million messages/sec in a multi-stream setup.

Apache Kafka: High Throughput with Tuning

• Kafka was tested using 3 brokers, 12 partitions, and different replication settings.

• Achieved peak throughput of 149MB/sec with single-producer configurations.

• Replication overhead was noticeable, but asynchronous replication provided a balance between performance and durability.

• End-to-end message latency was 3ms for 1KB messages.

Architecture & Deployment Insights

• Both solutions were deployed on a 4-node VMware Tanzu Kubernetes Grid cluster.

• Hardware Setup:

• 24 vCPUs per node

• 72GB RAM per node

• 2TB storage per node

• Kafka’s Strimzi Operator was leveraged for automation and scaling.

• Tanzu RabbitMQ’s warm standby replication ensured high availability and disaster recovery.

Conclusion: Which One to Choose?

• Tanzu RabbitMQ 1.3 is ideal for low-latency message delivery, flexible routing, and high availability.

• Apache Kafka is better suited for event-driven architectures and large-scale real-time data streaming.

For businesses leveraging VMware Tanzu Kubernetes Grid, both technologies offer powerful messaging and event streaming capabilities, each with distinct advantages depending on the workload.

Sunday, February 16. 2025

Optimizing VMware vSphere 8 for Latency-Sensitive Workloads

In today’s fast-paced digital world, low-latency computing is critical for applications in financial services, media streaming, and real-time automation. VMware vSphere 8 introduces a suite of performance enhancements designed to optimize virtualized environments for latency-sensitive workloads. This guide summarizes key takeaways from Broadcom’s latest whitepaper on performance tuning for vSphere 8.

Understanding Latency Optimization in vSphere 8

VMware vSphere 8 provides a powerful platform for virtualized workloads, balancing high performance with efficient resource utilization. However, for latency-sensitive applications, additional tuning is required to minimize response times. The key tradeoff here is increased CPU utilization in exchange for reduced latency.

Baseline Recommendations

To maximize performance for low-latency applications, it’s essential to ensure your environment is running the latest hardware and software:

Processor generations: Use the latest supported CPUs.
BIOS and microcode: Ensure firmware is up to date.
vSphere version: At least vSphere 8.0 U3 or 7.0 U3.
Virtual Hardware Version: Use Virtual Hardware 21 or newer.
VM Tools: Version 12.4.5 or newer.

Additionally, minimize the use of vSphere overlays like NSX and vSAN, as these services consume CPU cycles and may increase latency.

Host-Level Optimizations

Optimizing ESXi host configurations is crucial for reducing latency:

BIOS Settings: Set power management to "High Performance," disable C-states and P-states, and enable Turbo Boost.
Disable EVC: Enhanced vMotion Compatibility (EVC) can mask CPU instructions and increase latency.
vMotion and DRS Scheduling: Avoid live migrations during peak performance periods, as they can momentarily disrupt workloads.
Advanced Settings:
- Disable action affinity
- Enable SplitRX and SplitTX for improved network processing
- Disable queue pairing to prevent network bottlenecks

Virtual Machine-Level Optimizations

Configuring VMs correctly can significantly impact latency:

Rightsizing: Ensure VMs are not oversized beyond NUMA node limits.
Disable Hot-Add: This prevents NUMA topology from being exposed, reducing performance.
Set High Latency Sensitivity:
- Reserve CPU and memory exclusively for latency-sensitive VMs.
- Configure VMXNET3 network adapters for optimal performance.
NUMA Awareness: Assign VMs to NUMA nodes manually to align with the underlying hardware.
Use SR-IOV or DirectPath I/O: These features allow direct hardware access for ultra-low latency networking.

Networking Enhancements

For network-intensive workloads, consider:

VMXNET3: This paravirtualized NIC offers optimized performance.
Enhanced Datapath Mode: For NFV workloads requiring high throughput.
SmartNICs and DPUs: Offload network processing to dedicated hardware.
Queue Balancing: Adjust NIC ring buffer sizes and enable multiple network queues.

Guest OS and Application Tuning

Within the guest operating system, use:

Photon OS Real-Time Kernel: Optimized for low-latency applications.
NUMA-aware applications: Configure workloads to take advantage of NUMA topology.
Thread Affinity: Assign VM workloads to specific CPU cores to prevent contention.

Operational Best Practices

Use esxtop and net-stats to monitor performance.
Adjust ring buffers and queue sizes using esxcli.
Test all changes in a non-production environment before deployment.

Conclusion

By following these best practices, VMware vSphere 8 can be fine-tuned to deliver low-latency performance for critical workloads. Whether optimizing host settings, configuring VMs, or leveraging SmartNICs, these strategies help ensure a responsive and efficient virtualized infrastructure.

For a more in-depth guide, refer to Broadcom’s whitepaper on performance tuning in vSphere 8.

Sunday, February 9. 2025

VMware Cloud Disaster Recovery - Networking Essentials for Business Continuity

In today’s ever-changing business landscape, disasters such as natural calamities and cyberattacks can disrupt operations without warning. VMware Cloud Disaster Recovery (DR) offers a powerful solution for ensuring business continuity by seamlessly recovering workloads in a cloud-based environment. This field guide provides insights into best practices for configuring and managing networking for disaster recovery (DR) using VMware Cloud on AWS.

Introduction to VMware Cloud DR

The VMware Cloud Disaster Recovery solution is designed to protect critical business applications from outages. This involves replicating workloads from a Protected Site to a Recovery Site hosted in VMware Cloud on AWS. The solution addresses various disaster scenarios, including complete site failures and targeted ransomware attacks. Effective network configuration is crucial to ensure that DR operations run smoothly, both during routine protection and emergency recovery.

Key Networking Considerations

1. Basic Networking Topology

The Protected Site (e.g., on-premises or another cloud) connects to the Recovery Site through VMware’s DRaaS Connector. This connection facilitates data replication to the Scale-out Cloud File System (SCFS), where recovery snapshots are stored. Components such as vCenter, vSphere hosts, and virtual machines rely on a robust network configuration to ensure connectivity.

2. Networking Setup for VMware Cloud on AWS

VMware Cloud DR leverages Software-Defined Data Centers (SDDCs) in AWS to manage cloud-based resources. Organizations can establish secure connectivity using one of the following methods:

Public Internet Access with SDDC firewall rules (simpler but less secure)
VPN Connection for encrypted data transfer
AWS Direct Connect for a dedicated, high-performance connection

Review these options with your network team to determine the best fit for your environment.

3. Segmenting Networks for Disaster Recovery

Each network segment at the Protected Site requires a corresponding segment at the Recovery Site. This allows for seamless failover of virtual machines without disrupting production workloads. Additional segments can be created specifically for testing DR plans without impacting operational services.

Managing Connectivity and Security

Outbound and Inbound Access

To enable secure communication, network administrators configure gateway firewalls using NSX capabilities. Outbound access allows VMs at the Recovery Site to connect to the internet, while inbound access is managed through published VPNs or public IPs for critical workloads.

Inter-Site Connectivity

During partial failovers, applications may require communication between Protected and Recovery sites. VMware recommends using layer 2 VPNs or Direct Connect to facilitate secure inter-site application connectivity.

Ransomware Recovery Isolation

VMware Cloud DR includes features to isolate compromised networks in a Ransomware Recovery Isolated Recovery Environment (IRE). Using NSX Distributed Firewall rules, organizations can ensure that affected VMs remain isolated during recovery operations, preventing further spread of malware.

Testing and IP Remapping

Non-Disruptive Testing

VMware Cloud DR allows for comprehensive testing of recovery plans by using isolated network segments. This ensures that network configurations and failover procedures can be validated without affecting live applications.

IP Address Mapping

For scenarios where static IPs need to change during failover, VMware Cloud DR provides IP remapping capabilities. Administrators can define IP address rules in the recovery plan to ensure consistency across sites.

Conclusion

VMware Cloud Disaster Recovery is a robust solution for safeguarding business operations in the face of unexpected disruptions. Properly configuring and testing the networking components—ranging from connectivity and firewall rules to network segmentation and isolation—is critical to the success of any DR strategy. VMware offers extensive resources to guide organizations through setup and optimization, ensuring maximum uptime and data protection.

(Page 1 of 366, totaling 3653 entries)

Tuesday, April 15. 2025

Friday, March 28. 2025

Saturday, March 22. 2025

Wednesday, March 19. 2025

Saturday, March 15. 2025

Sunday, March 9. 2025

What is Holodeck 5.2?

Key Benefits of Holodeck 5.2

✅ Simplified Deployment

✅ Reduced Hardware Requirements

✅ Isolated and Repeatable Testing

Single vs. Multi-Site Deployments

Holodeck Environment Components

How to Deploy Holodeck 5.2

Conclusion

Wednesday, February 26. 2025

Saturday, February 22. 2025

Key Technologies Evaluated

Performance Highlights

Apache Kafka: High Throughput with Tuning

Architecture & Deployment Insights

Conclusion: Which One to Choose?

Sunday, February 16. 2025

Understanding Latency Optimization in vSphere 8

Baseline Recommendations

Host-Level Optimizations

Virtual Machine-Level Optimizations

Networking Enhancements

Guest OS and Application Tuning

Operational Best Practices

Conclusion

Sunday, February 9. 2025

Managing Connectivity and Security

Testing and IP Remapping

Recent Entries