Category: VMware

The Problem: ESXi Host Loses Network Connectivity

Step-by-Step Troubleshooting Guide

  1. Verify Network Configuration on the Host
  • Log in via DCUI (Direct Console User Interface) or SSH.
  • Check the IP address, gateway, and DNS settings:

esxcli network ip interface ipv4 get

  • Confirm that the host can ping the gateway or vCenter server:

ping <gateway-IP>
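The two checks above can be stitched together in a small script. The table below is sample output in the layout `esxcli network ip interface ipv4 get` prints (the exact columns vary by ESXi version, and the addresses are invented for illustration):

```shell
# Pull vmk0's address and gateway out of the table that
# `esxcli network ip interface ipv4 get` prints. The sample below stands in
# for the live command so the parsing can be shown end to end.
sample='Name  IPv4 Address  IPv4 Netmask   Address Type  Gateway
----  ------------  -------------  ------------  -----------
vmk0  192.168.1.10  255.255.255.0  STATIC        192.168.1.1'

# On a live host:  sample=$(esxcli network ip interface ipv4 get)
vmk0_ip=$(echo "$sample" | awk '$1 == "vmk0" {print $2}')
gateway=$(echo "$sample" | awk '$1 == "vmk0" {print $5}')

echo "vmk0: $vmk0_ip  gateway: $gateway"
# On a live host, confirm reachability:  ping -c 3 "$gateway"
```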

  2. Check vSwitch and Port Group Settings
  • In the vSphere Client, navigate to Networking > Virtual Switches.
  • Ensure the following:
    • The physical NICs (vmnics) are attached and active.
    • Port groups have the correct VLAN ID settings.
  • Verify the load balancing and failover policies for inconsistencies.
  3. Inspect Physical Network Connections
  • Check for issues with cables, switches, or ports.
  • Test connectivity using tools like link lights on NICs or port activity indicators on the switch.
  • Replace faulty cables or move connections to different switch ports if needed.
  4. Test and Reconfigure NICs
  • Verify the status of all NICs:

esxcli network nic list

  • Re-enable or restart any problematic NICs:

esxcli network nic down -n <vmnic-name>
esxcli network nic up -n <vmnic-name>
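Before bouncing a NIC, it helps to know which ones are actually down. This sketch parses output in the column layout of `esxcli network nic list` (the rows below are an invented sample; column positions may differ between ESXi builds):

```shell
# Find NICs whose link state is Down in `esxcli network nic list` output.
sample='Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex
vmnic0  0000:02:00.0  ntg3    Up            Up           1000   Full
vmnic1  0000:02:00.1  ntg3    Up            Down         0      Half'

# On a live host:  sample=$(esxcli network nic list)
down_nics=$(echo "$sample" | awk 'NR > 1 && $5 == "Down" {print $1}')
echo "NICs with link down: $down_nics"

# Each down NIC can then be bounced (live host only):
#   esxcli network nic down -n vmnic1 && esxcli network nic up -n vmnic1
```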

  5. Address Driver or Firmware Issues
  • Check the HCL (Hardware Compatibility List) for your ESXi version.
  • Update or reinstall the NIC driver and firmware if outdated or incompatible:

esxcli software vib install -v /path/to/driver.vib

  6. Monitor for IP Conflicts
  • Use tools like ARP tables on the switch or router to detect conflicting IP addresses.
  • Assign a new static IP address to the host if conflicts are found.
  7. Restart Management Network
  • Restart the management network via the DCUI:
    • Select Troubleshooting Options > Restart Management Network.
  • Alternatively, restart the management agents over SSH (note that this restarts all agents, including hostd and vpxa):

services.sh restart

  8. Examine Logs for Deeper Insights
  • Review network-related logs for clues:

tail -f /var/log/vmkernel.log

tail -f /var/log/hostd.log
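When skimming vmkernel.log, filtering for link-state messages is a quick way to spot a flapping NIC. The log lines below are invented samples in the usual vmkernel.log style, so the filtering itself can be demonstrated:

```shell
# Filter log lines for link-state changes to spot a flapping NIC.
log='2024-01-10T09:15:02Z cpu4: vmnic1: link down
2024-01-10T09:15:07Z cpu4: vmnic1: link up, speed 1000 Mbps
2024-01-10T09:15:09Z cpu2: hostd heartbeat'

# On a live host:  grep -i "link" /var/log/vmkernel.log | tail
link_events=$(echo "$log" | grep -ci "link")
echo "link-state events found: $link_events"
```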

Resolving a Common VMware ESXi Issue – PSOD on Boot

The PSOD is VMware’s equivalent of the “blue screen” in Windows. It halts the ESXi host and displays diagnostic information in a purple background. One scenario that often triggers a PSOD is a hardware compatibility or driver issue, especially after an upgrade or new hardware deployment.

Root Cause Analysis

Common reasons for a PSOD on boot include:

  1. Incompatible Drivers: Using a driver version that doesn’t match the hardware or ESXi version.
  2. Faulty Hardware: Issues with RAM, storage controllers, or network adapters.
  3. Configuration Errors: Misconfigured BIOS or firmware settings.
  4. Corrupted Filesystem: Problems with the ESXi boot partition.

Step-by-Step Resolution

  1. Gather Information from the PSOD Screen
  • Note the error message and codes displayed on the PSOD.
  • Look for references to specific drivers, memory modules, or hardware.
  2. Reboot in Recovery Mode
  • Restart the host and enter Recovery Mode from the boot menu.
  • Review the vmkernel log for the failure:

tail -n 100 /var/log/vmkernel.log

  3. Verify Hardware Compatibility
  • Cross-check the hardware against VMware’s Hardware Compatibility List (HCL) to ensure support for your ESXi version.
  • If issues arise, update firmware or replace problematic components.
  4. Roll Back Drivers
  • If a driver is causing the issue, try rolling back to a previous version:

esxcli software vib remove -n <driver-name>

  • Reboot the host and confirm stability.
  5. Check for Corrupted Boot Files
  • Boot from the ESXi installation media and select Repair System.
  • Reinstall or repair the ESXi system files without wiping the datastore.
  6. Update or Patch ESXi
  • Check VMware’s knowledge base for patches or updates related to your error code.
  • Update your ESXi host:

esxcli software profile update -d <URL-to-depot>
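Before and after patching, it is worth recording which driver VIB version is installed. This sketch parses output in the column layout of `esxcli software vib list` (the rows, the `ntg3` driver name, and the version string are illustrative samples):

```shell
# Extract the installed version of a given driver VIB.
vibs='Name   Version                        Vendor  Acceptance Level  Install Date
ntg3   4.1.3.0-1vmw.700.1.0.15843807  VMW     VMwareCertified   2023-05-01
tools  11.2.5.17337674                VMware  VMwareCertified   2023-05-01'

# On a live host:  vibs=$(esxcli software vib list)
ntg3_version=$(echo "$vibs" | awk '$1 == "ntg3" {print $2}')
echo "installed ntg3 driver: $ntg3_version"
```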

Object type requires hosted I/O

After a power outage, a VM will not power on and throws the following error:

Object type requires hosted I/O

  1. Log in to the ESXi host over SSH.
  2. Browse to the VM folder containing the disk files.
  3. Check the disk:

vmkfstools -x check "test.vmdk"

The output "Disk needs repaired" confirms the corruption.

  4. Repair the disk:

vmkfstools -x repair "test.vmdk"

The output "Disk was successfully repaired." means the fix worked.

  5. Start the VM.
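The check-then-repair flow above can be wrapped so the repair only runs when the check actually reports damage. The `needs_repair` helper is a hypothetical wrapper that inspects the text `vmkfstools -x check` prints; the matched phrase mirrors the output shown above:

```shell
# Decide from the check output whether a repair is needed.
needs_repair() {
    case "$1" in
        *"needs repaired"*) return 0 ;;   # damaged disk
        *)                  return 1 ;;   # healthy disk
    esac
}

# On a live host the flow would be:
#   out=$(vmkfstools -x check "test.vmdk")
#   needs_repair "$out" && vmkfstools -x repair "test.vmdk"
if needs_repair "Disk needs repaired"; then verdict="repair"; else verdict="ok"; fi
echo "$verdict"
```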

Troubleshooting Datastore Connectivity Issues in VMware ESXi

The Problem: Datastore Connectivity Lost

An ESXi host may lose connection to one or more datastores for several reasons:

  1. Network Configuration Issues (for NFS/iSCSI): Incorrect IP settings or firewalls.
  2. Storage Array Failures: Issues with the backend SAN or NAS hardware.
  3. Pathing Problems: Multipath configuration errors or a single-path failure.
  4. Corrupted Filesystem: Datastore metadata issues on the storage device.

How to Troubleshoot and Resolve

  1. Verify the Storage Status in vSphere
  • Navigate to Storage > Datastores in vSphere Client.
  • Check if the affected datastore is listed and its status (e.g., Inactive or Not Connected).
  2. Validate Physical Connections
  • For iSCSI or NFS datastores:
    • Ensure the host’s VMkernel NICs are online and configured with the correct IP.
    • Test network connectivity to the storage target using ping or vmkping:

vmkping <storage-IP>

  • For SAN-based datastores:
    • Inspect HBA (Host Bus Adapter) connections and verify the fiber cables or SFPs.
  3. Rescan Storage Adapters
  • Perform a manual rescan to detect any lost paths or devices:
    • Go to Host > Storage Adapters > Rescan All in the vSphere Client.
    • Alternatively, use SSH:

esxcli storage core adapter rescan --all

  4. Check Multipath Configuration

  • Use the command:

esxcli storage nmp device list

  • Look for any inactive paths and troubleshoot based on the path state.
  • For active-active arrays, ensure the correct Path Selection Policy (PSP) is set (e.g., Round Robin).
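Devices running on the wrong PSP can be flagged programmatically. This sketch parses an abridged sample in the block format `esxcli storage nmp device list` prints (the naa IDs are invented):

```shell
# Flag devices whose Path Selection Policy is not Round Robin.
nmp='naa.6000eb30000000000000000000000001
   Path Selection Policy: VMW_PSP_MRU
naa.6000eb30000000000000000000000002
   Path Selection Policy: VMW_PSP_RR'

# On a live host:  nmp=$(esxcli storage nmp device list)
not_rr=$(echo "$nmp" | awk '
  /^naa\./ { dev = $1 }
  /Path Selection Policy:/ && $NF != "VMW_PSP_RR" { print dev }')
echo "devices not using Round Robin: $not_rr"
```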

  5. Validate Storage Array Health

  • Log in to the storage array management interface and check for:
    • Controller failures.
    • Degraded or offline LUNs/volumes.
  • Restart the affected LUN if needed, ensuring no other hosts depend on it.

  6. Recreate the Datastore Mount (NFS/iSCSI)

  • If the datastore remains inaccessible:
    • Unmount the datastore from the ESXi host.
    • Recreate the connection by adding the NFS or iSCSI target:

esxcli storage nfs add --host=<server-IP> --share=<nfs-path> --volume-name=<name>

  7. Repair Filesystem Corruption (VMFS)

  • Use VMware’s built-in recovery tools to fix VMFS issues:

vmkfstools -R /vmfs/devices/disks/<datastore-ID>

Resolving ESXi Host CPU Overload Issues

An ESXi host experiencing CPU overload typically exhibits symptoms such as:

  • VMs becoming unresponsive or slow.
  • High CPU Ready Times in vSphere performance metrics.
  • Consistently maxed-out CPU usage in the host’s performance tab.

Common causes include:

  1. Oversized VMs: Allocating more vCPUs than needed.
  2. Resource Contention: Too many VMs competing for CPU resources.
  3. Misconfigured Resource Pools: Imbalanced resource allocation.
  4. Unoptimized Applications: Inefficient software consuming excessive CPU.
  5. Background Processes: Host-level tasks like backups or snapshots running during peak hours.

Steps to Troubleshoot and Resolve

  1. Analyze Performance Metrics
  • In the vSphere Client, go to Monitor > Performance for the affected host or VMs.
  • Look for:
    • CPU Usage (%): High values indicate overload.
    • CPU Ready (%): High values (above 5%) indicate VMs waiting too long for CPU.
    • Co-Stop (%): High values indicate vCPU scheduling issues.
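The performance charts report CPU Ready as a summation in milliseconds, not a percentage. To compare a chart value against the 5% rule of thumb above, convert it against the sampling interval (20 000 ms for the real-time chart); the 1500 ms reading below is an example value:

```shell
# Convert a CPU Ready summation value (ms) into a percentage of the
# sampling interval: ready% = ready_ms / interval_ms * 100 (per vCPU).
ready_ms=1500          # example summation value from the real-time chart
interval_ms=20000      # real-time chart interval: 20 s

ready_pct=$(awk -v r="$ready_ms" -v i="$interval_ms" 'BEGIN {printf "%.1f", r / i * 100}')
echo "CPU Ready: ${ready_pct}%"
```

Here 1500 ms of ready time in a 20-second sample works out to 7.5%, well above the 5% threshold.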
  2. Optimize VM Configurations
  • Reduce the number of vCPUs allocated to each VM unless absolutely necessary. Many applications perform well with fewer vCPUs.
  • Power off unused or idle VMs to free up resources.
  3. Check and Reconfigure Resource Pools
  • Review resource pools to ensure proper allocation.
  • Avoid strict limits unless required, as they can starve VMs of CPU during peak loads.
  4. Balance Workloads Across Hosts
  • Use vMotion to migrate high-load VMs to hosts with spare CPU capacity.
  • Enable DRS (Distributed Resource Scheduler) if available, to automatically balance workloads.
  5. Address Application-Level Issues
  • Identify high-CPU-consuming processes within the VMs.
  • Work with application owners to optimize software settings or update inefficient programs.
  6. Update ESXi and Guest OS Drivers
  • Ensure that the ESXi host and VM tools are updated to the latest versions. Outdated software can lead to inefficient CPU usage.
  7. Monitor Background Tasks
  • Stagger resource-intensive tasks such as backups, virus scans, or snapshots to run during off-peak hours.
  8. Add Host Resources
  • If the cluster consistently runs at high capacity, consider adding more hosts or upgrading the existing hardware to handle increased demand.

Preventive Measures

  • Monitor Regularly: Use vRealize Operations or another monitoring tool to proactively track resource usage.
  • Enable DRS: Automate load balancing to prevent bottlenecks.
  • Right-Size VMs: Periodically evaluate and adjust vCPU and memory allocations based on actual usage patterns.
  • Reserve Resources Strategically: Use reservations for critical VMs but avoid over-reserving resources unnecessarily.
  • Plan Capacity: Regularly review cluster capacity to ensure it aligns with business needs and future growth.

“Scan or remediation is not supported on … because of unsupported OS” for certain operating systems

VMware documented this workaround so you can manually add the operating system to the supported list for VMware Update Manager.

  1. Connect to your vCenter Server Appliance via SSH and log in.
  2. Create a backup of the vci-integrity.xml file:
mkdir /backup && cp /usr/lib/vmware-updatemgr/bin/vci-integrity.xml /backup/
  3. Open the vci-integrity.xml file in the vi editor:
vi /usr/lib/vmware-updatemgr/bin/vci-integrity.xml
  4. Locate the <vci_vcIntegrity> … </vci_vcIntegrity> section.
  5. Enter edit mode by pressing Insert or the letter i.
  6. Before the </vci_vcIntegrity> line, add the following lines, depending on the operating system configured in your virtual machine. If entering both versions of the same OS (e.g., Windows 2019 and 2022), see the Note section below.
  • For Debian 11 (32 bit):
    <supportedLinuxGuestIds>
      <debian11Guest/>
    </supportedLinuxGuestIds>
  • For Debian 11 (64 bit):
    <supportedLinuxGuestIds>
      <debian11_64Guest/>
    </supportedLinuxGuestIds>
  • For Red Hat Enterprise Linux 9 (64 bit):
    <supportedLinuxGuestIds>
      <rhel9_64Guest/>  
    </supportedLinuxGuestIds>

Some Linux distributions hit the same issue; for those, use this list of all supported Linux guest OS IDs:

asianux3Guest
asianux3_64Guest
asianux4Guest
asianux4_64Guest
asianux5_64Guest
centosGuest
centos64Guest
coreos64Guest
debian4Guest
debian4_64Guest
debian5Guest
debian5_64Guest
debian6Guest
debian6_64Guest
debian7Guest
debian7_64Guest
debian8Guest
debian8_64Guest
oracleLinuxGuest
oracleLinux64Guest
rhel7Guest
rhel7_64Guest
rhel6Guest
rhel6_64Guest
rhel5Guest
rhel5_64Guest
rockylinux_64Guest
fedoraGuest
fedora64Guest
sles12Guest
sles12_64Guest
sles11Guest
sles11_64Guest
sles10Guest
sles10_64Guest
opensuseGuest
opensuse64Guest
ubuntuGuest
ubuntu64Guest
otherLinuxGuest
otherLinux64Guest

Round Robin ESXi

Best practices to add on your ESXi host, depending on the storage array behind it:

  • Dell Compellent – esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "COMPELNT" -P "VMW_PSP_RR" -O "iops=3"
  • Dell EMC PowerMax – esxcli storage nmp satp rule add -s "VMW_SATP_SYMM" -V "EMC" -M "SYMMETRIX" -P "VMW_PSP_RR" -O "iops=1"
  • IBM SVC – esxcli storage nmp satp set --default-psp VMW_PSP_RR --satp VMW_SATP_SVC
  • Dell Unity – esxcli storage nmp satp rule add --satp "VMW_SATP_ALUA_CX" --vendor "DGC" --psp "VMW_PSP_RR" --psp-option "iops=1" --claim-option="tpgs_on"
  • Huawei – esxcli storage nmp satp rule add -V HUAWEI -M XSG1 -s VMW_SATP_DEFAULT_AA -P VMW_PSP_RR -O iops=1 -c tpgs_off
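SATP claim rules only affect devices claimed after the rule exists, so devices already present keep their old PSP. One option is to generate a per-device `esxcli storage nmp device set` command for each existing device; the sketch below builds those commands from a list of device IDs (the naa IDs are invented samples):

```shell
# Generate per-device commands to switch existing devices to Round Robin.
devices='naa.6000eb30000000000000000000000001
naa.6000eb30000000000000000000000002'

# On a live host the IDs would come from something like:
#   esxcli storage nmp device list | grep "^naa."
cmds=$(echo "$devices" | while read -r d; do
    echo "esxcli storage nmp device set --device $d --psp VMW_PSP_RR"
done)
echo "$cmds"
```

Printing the commands instead of executing them lets you review them before pasting into an SSH session.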

The Datastore Still Appears After Unmount

The datastore still appears because there is an active process, a mounted ISO, or a snapshot holding it open.

  1. Select your cluster, right-click, choose the datastore option, and rescan.
  2. In vCenter, select the datastore and check which hosts it was not unmounted from; also open the datastore's VMs tab to check whether any VM still lives there.
  3. Log in to the ESXi host over SSH and check for open files on the datastore:

lsof | grep <datastore-name>

  4. Run esxcli storage filesystem list to find the datastore UUID, then check it the same way:

lsof | grep <datastore-UUID>

  5. Map the worlds holding the device to processes (replace naa.<id> with the device ID of your datastore):

vsish -e ls /storage/scsifw/devices/naa.<id>/worlds/ | sed 's:/::' | while read i; do ps | grep $i; done

  6. Identify the lock file or process and stop it, then unmount the datastore.
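The vsish one-liner above is dense, so here is a readable version of the same idea: take the world IDs that hold the device open and match them against the process list. Both inputs below are invented samples standing in for the live `vsish` and `ps` output:

```shell
# Match the world IDs holding a device against the process table.
world_ids='2098337'                  # from: vsish -e ls .../worlds/ | sed 's:/::'
ps_output='2098337 2098337 vmx
2098400 2098400 hostd'               # from: ps

holders=$(echo "$world_ids" | while read -r w; do
    echo "$ps_output" | grep "$w"
done)
echo "processes holding the datastore:"
echo "$holders"
```

In this sample the datastore is held open by a vmx world, i.e. a still-registered VM.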