Nutanix: CVM stuck in Phoenix
Last night, I was running my monthly upgrade cycle for my clusters and was about to catch up with all the versions. Before starting the last run of patches, I ran the LCM inventory one final time. Of course, I was doing something else in parallel - LCM is so time consuming, you'd better multitask.
When I came back to the LCM inventory screen, I saw 6 pre-check errors: Cassandra down, network down, connection timeout, ... Most scary of all: node removed from the Metadata Ring.
Phoenix
My first idea was to check the faulty CVM console from the Prism UI. The root user was logged in with a Phoenix prompt!
Phoenix is a specific boot image used on the physical node to perform firmware upgrades on the various hardware components (NIC, HDD, ...). Starting with LCM version 2.3.2, it is the CVM itself that is booted into Phoenix to perform the SSD and HDD firmware upgrades. I learnt that this change was introduced to limit the impact of having to reboot the host itself, which would require VM migrations and node downtime.
Instead, the process was enhanced to leave the host and the VMs operational and only reboot the CVM into Phoenix to perform the firmware upgrades, since on every platform the CVM manages the data disks directly through PCI passthrough rather than via the hypervisor.
I remembered from my Nutanix labs that there are some commands to exit from Phoenix and boot back to the host, so I decided to try one on the CVM. The command is sh /phoenix/reboot_to_host.sh, but it did not work.
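For reference, this is what the attempt looks like at the Phoenix prompt (root user on the CVM console); in a normal host-level Phoenix session, this script boots the node back into the hypervisor:

# sh /phoenix/reboot_to_host.sh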
It was about time to call Nutanix support. This 8-node cluster runs 150+ production VMs, and nobody wants them to fail.
To avoid wasting too much time on troubleshooting and to avoid any further risk - remember, if 2 CVMs are down at the same time, your cluster is dead - the support engineer decided to recover the CVM.
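Before doing anything drastic, it is worth double-checking that the cluster can still tolerate a node being down. A quick sanity check (run as the nutanix user on any healthy CVM; the exact syntax may vary slightly between AOS versions) looks like this:

$ ncli cluster get-domain-fault-tolerance-status type=node

If the fault tolerance for metadata or data is already at 0, be extra careful before rebooting anything else.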
There is indeed a procedure to recover a faulty CVM directly at the hypervisor (AHV) level. Once the CVM is powered off (using the virsh command), you can replace its boot ISO and start it again with a fresh boot image.
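As a rough sketch of that first step (the exact procedure came from the support engineer; NTNX-xxxxx-CVM is the placeholder domain name used below), the stuck CVM is powered off from the AHV host before the boot ISO is swapped:

# virsh list --all | grep CVM
# virsh destroy NTNX-xxxxx-CVM

virsh destroy forces an immediate power-off of the domain, which is what you want here since a CVM stuck in Phoenix may not respond to a graceful virsh shutdown.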
# cp -f /var/lib/libvirt/NTNX-CVM/svmboot.iso.backup /var/lib/libvirt/NTNX-CVM/svmboot.iso
cp: overwrite '/var/lib/libvirt/NTNX-CVM/svmboot.iso'? y
# virsh start NTNX-xxxxx-CVM
Domain NTNX-xxxxx-CVM started
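You can quickly confirm from the host that the domain is running again before moving on; it should now list the NTNX-xxxxx-CVM domain in the running state:

# virsh list | grep CVM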
Of course, you need to make sure all services are started on the newly restored CVM: SSH into it as the nutanix user and issue the following command:
# cluster start
Check that all services are up and running with the following command:
$ cluster status | grep -v UP
2022-08-25 21:08:31,487Z INFO MainThread zookeeper_session.py:183 cluster is attempting to connect to Zookeeper
2022-08-25 21:08:31,489Z INFO Dummy-1 zookeeper_session.py:617 ZK session establishment complete, sessionId=0x382d5809b6f51fb, negotiated timeout=20 secs
2022-08-25 21:08:31,499Z INFO MainThread cluster:2815 Executing action status on SVMs
The state of the cluster: start
Lockdown mode: Disabled
CVM: 192.168.1.10 Up
CVM: 192.168.1.11 Up
CVM: 192.168.1.12 Up, ZeusLeader
CVM: 192.168.1.13 Up
CVM: 192.168.1.14 Up
CVM: 192.168.1.15 Up
CVM: 192.168.1.16 Up
CVM: 192.168.1.17 Up
CVM: 192.168.1.18 Up
CVM: 192.168.1.19 Up
CVM: 192.168.1.20 Up
CVM: 192.168.1.21 Up
CVM: 192.168.1.22 Up
Once all services are up and running, the node can be added back into the Metadata ring.
Depending on the size of your cluster, this can take a while. It took about 29 minutes on mine, so be patient. Once completed, you are back in business and the Prism UI should be faster again.
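If you want to keep an eye on the ring while the node rejoins, you can check Cassandra's view of it from any CVM (nutanix user); the recovered node should eventually come back with a Normal state:

$ nodetool -h 0 ring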
There is an internal Nutanix KB article known as KB 9584, titled "LCM: IVU upgrade and recovery". Unfortunately, I'm not allowed to share its content, but it is a good reference to mention to any Nutanix SRE while troubleshooting this situation!
In any case, it is always a good idea to engage Nutanix support. I must admit, Brian Hutzler was an amazing SRE when this happened to me!
I hope this helps! ;)
Thank you so much, the "sh /phoenix/reboot_to_host.sh" command worked for me :)
I'm glad to hear it ;)