Nutanix: Non-disruptive Physical Memory Upgrade
Background
After a couple of years of service, your beloved cluster starts to show its age. There is no better way to refresh it than to give your VMs and services more headroom by upgrading the nodes' memory. I did that a few days ago, and now the cluster feels brand new.
Let's do this !
For the sake of compliance, conformity and risk management, I would like to point you to the official Nutanix procedure first, so you know the authoritative reference. The official procedure is located here.
It is important to mention that the procedure below applies to AHV environments. If you are running either VMware ESXi or Hyper-V, there are additional steps described in the official Nutanix documentation.
My cluster is a five-year-old 1050 with 3 nodes, originally shipped with 128 GB of RAM. It was becoming slow, and I wanted to test Calm, which was not realistic with such low memory figures. I decided to upgrade it to 256 GB per node.
This is the status of the cluster before the upgrade:
High level procedure
- Identify the required memory modules to add
- On each node, one node at a time:
- set the node in maintenance mode; this will evacuate the VMs to the other nodes
- power off the CVM
- power off the node
- remove the node from the chassis
- install the memory modules
- re-insert the node into the chassis
- power up the node
- exit maintenance mode
- wait for the VMs to relocate
- Move on to the next node (loop through all nodes in the cluster)
- Confirm your cluster memory has been increased
Ideally, run all NCC health checks just before and just after the upgrade, so you can compare the results and confirm the cluster is in good shape.
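A minimal example, assuming the standard NCC invocation from any CVM (the run can take a while depending on cluster size):
nutanix@NTNX-B-CVM:192.168.x.x:~$ ncc health_checks run_all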
Set the node in maintenance mode
nutanix@NTNX-B-CVM:192.168.x.x:~$ acli host.enter_maintenance_mode 192.168.x.x wait=true
EnterMaintenanceMode: pending
EnterMaintenanceMode: complete
At this stage, no user VMs are running anymore on the host we just put into maintenance mode; only its CVM is still up.
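If you want to double-check, you can list the hosts from any CVM and verify that the host is no longer schedulable (a quick sketch; exact column names vary between AOS versions):
nutanix@NTNX-B-CVM:192.168.x.x:~$ acli host.list
The host in maintenance mode should be reported as not schedulable.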
Power off the CVM of the host we are upgrading
nutanix@NTNX-B-CVM:192.168.x.x:~$ cvm_shutdown -P now
2020-04-01 09:52:22 INFO zookeeper_session.py:143 cvm_shutdown is attempting to connect to Zookeeper
2020-04-01 09:52:22 INFO lcm_genesis.py:217 Rpc to [localhost] for LCM [LcmFramework.is_lcm_operation_in_progress] is successful
2020-04-01 09:52:22 INFO cvm_shutdown:157 No upgrade was found to be in progress on the cluster
2020-04-01 09:52:22 INFO cvm_shutdown:84 Acquired shutdown token successfully
2020-04-01 09:52:22 INFO cvm_shutdown:104 Validating command arguments.
2020-04-01 09:52:22 INFO cvm_shutdown:107 Executing cmd: sudo shutdown -k -P now
2020-04-01 09:52:23 INFO cvm_shutdown:92 Setting up storage traffic forwarding
2020-04-01 09:52:23 WARNING genesis_utils.py:118 Deprecated: use util.cluster.info.get_factory_config() instead
2020-04-01 09:52:23 INFO genesis_utils.py:2825 Verifying if route is set for 192.168.x.x
2020-04-01 09:52:23 INFO genesis_utils.py:2830 HA Route is not yet set for 192.168.x.x
2020-04-01 09:52:26 INFO genesis_utils.py:2830 HA Route is not yet set for 192.168.x.x
Write failed: Broken pipe
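The "Write failed: Broken pipe" line at the end simply means our SSH session was cut when the CVM powered off, which is expected. Before shutting down the host, you can confirm from the AHV host that the CVM is really off (a quick sketch; the CVM name matches your node):
[root@xxxx-KVM-B ~]# virsh list --all
The CVM should now be listed in the "shut off" state.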
Note: shutting down this CVM does not affect the running workloads; as long as only one CVM is down, the remaining CVMs take over the storage traffic and the workloads stay accessible.
Shutting down the host
[root@xxxx-KVM-B ~]# shutdown -h now
Broadcast message from root@xxxxx-KVM-B
(/dev/pts/0) at 7:52 ...
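Optionally, before pulling the node out of the chassis, you can confirm it is completely powered off through its out-of-band management interface (a sketch, assuming IPMI is reachable and you have credentials for that node):
ipmitool -I lanplus -H <ipmi_ip_of_node> -U <ipmi_user> -P <ipmi_password> chassis power status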
At this stage, you can remove the node from the chassis and start populating the memory slots.
Sexy 24 DIMM modules (384 GB RAM)
This is the node, fully populated. The next time I want to upgrade, I will have to remove all the DIMMs and replace them with bigger modules. But that will not happen.
Once the node has restarted, SSH into the AHV host to check whether the CVM has started:
[root@xxxx-KVM-A ~]# virsh list
Id Name State
----------------------------------------------------
1 xxxx-KVM-A-CVM running
Now we can try to SSH into the CVM. If it works, we can check the cluster status. If it does not, the CVM may not have started; you can start it manually from the AHV host with: virsh start <cvm_name>.
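A quick sketch of that fallback, run from the AHV host (the CVM name below is just an example; use the exact name reported by virsh list --all):
[root@xxxx-KVM-A ~]# virsh list --all
[root@xxxx-KVM-A ~]# virsh start xxxx-KVM-A-CVM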
Checking cluster status
nutanix@xxxxxxx-B-CVM:192.168.x.x:~$ cluster status
2020-04-01 10:03:49 INFO zookeeper_session.py:143 cluster is attempting to connect to Zookeeper
2020-04-01 10:03:49 INFO cluster:2712 Executing action status on The state of the cluster: start
Lockdown mode: Disabled
[...]
CVM: 192.168.x.x Up
Zeus UP [5069, 5106, 5107, 5112, 5121, 5147]
Scavenger UP [6568, 6599, 6600, 6601]
SSLTerminator UP [8879, 8975, 8976, 8977]
SecureFileSync UP [8882, 8920, 8921, 8922]
Medusa UP [9519, 9558, 9559, 9563, 9669]
DynamicRingChanger UP [9780, 9832, 9833, 9884]
Pithos UP [9784, 9852, 9853, 9873]
Mantle UP [9789, 9870, 9871, 9894]
Stargate UP [11305, 11516, 11517, 11758, 11763]
InsightsDB UP [13255, 13422, 13423, 13549]
InsightsDataTransfer UP [13319, 13477, 13478, 13542, 13543, 13545, 13546]
Ergon UP [13363, 13824, 13825, 13826]
Cerebro UP [13452, 13583, 13584, 13818]
Chronos UP [13499, 13695, 13696, 13813]
Curator UP [13575, 13737, 13738, 13898]
Athena UP [13625, 14014, 14015, 14016]
Prism UP [14748, 14901, 14902, 15052, 15055, 15063, 15142, 15143, 15144]
CIM UP [14847, 15018, 15019, 15100]
AlertManager UP [14886, 15332, 15333, 15471]
Arithmos UP [14936, 15110, 15111, 15335]
Catalog UP [14965, 15472, 15473, 15474]
Acropolis UP [15161, 15378, 15379, 15380]
Uhura UP [15290, 15496, 15497, 15498]
Snmp UP [15485, 15591, 15592, 15594]
SysStatCollector UP [15545, 15659, 15660, 15662]
NutanixGuestTools DOWN []
MinervaCVM DOWN []
ClusterConfig DOWN []
Mercury DOWN []
APLOSEngine DOWN []
APLOS DOWN []
Lazan DOWN []
Delphi DOWN []
Flow DOWN []
Anduril DOWN []
XTrim DOWN []
ClusterHealth DOWN []
[...]
2020-04-01 10:03:52 INFO cluster:2863 Success!
In the output above, not all services have started yet. Once every service is up and running, we can exit maintenance mode.
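A simple way to see only the services that are not yet up is a plain shell filter on the same output (nothing Nutanix-specific, just grep):
nutanix@NTNX-B-CVM:192.168.x.x:~$ cluster status | grep -v UP
Re-run it until no service is reported as DOWN anymore, then exit maintenance mode: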
nutanix@xxxxxxx:192.168.x.x:~$ acli host.exit_maintenance_mode 192.168.x.x
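Once the last node is done, you can confirm the cluster memory has indeed increased, either in Prism (Hardware view) or directly on each AHV host with plain Linux tools:
[root@xxxx-KVM-A ~]# grep MemTotal /proc/meminfo
[root@xxxx-KVM-A ~]# free -h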
Now, I can use Calm and other features ;)
Hope this helps...