Nutanix: Non-disruptive Physical Memory Upgrade
Background
After a couple of years of service, your beloved cluster starts to show its age. There is no better way to refresh it than to give your VMs and services more headroom by upgrading the nodes' memory. I did that a few days ago, and now the cluster feels brand new.
Let's do this !
For the sake of compliance, conformity and risk management, I would like to point you to the official Nutanix procedure first, so you know the authoritative reference. The official procedure is located here.
It is important to mention that the procedure below applies to AHV environments. If you are running either VMware ESXi or Hyper-V, there are additional steps described in the official Nutanix documentation.
My cluster is a five-year-old 1050 with 3 nodes, originally shipped with 128 GB of RAM. It was becoming slow, and I wanted to test Calm, which was not realistic with such low memory figures. I decided to upgrade it to 256 GB per node.
This is the status of the cluster before the upgrade:
High level procedure
- Identify the required memory modules to add
- On each node, one node at a time:
- set the node in maintenance mode; this will evacuate the VMs to the other nodes
- power off the CVM
- power off the node
- remove the node from the chassis
- install the memory modules
- re-insert the node into the chassis
- power up the node
- exit maintenance mode
- wait for the VMs to relocate
- Move on to the next node (loop through all nodes in the cluster)
- Confirm your cluster memory has been increased
Ideally, run all NCC health checks just before and just after the upgrade, so you can compare the results and confirm the cluster is in good shape.
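A minimal example, assuming the standard NCC invocation from any CVM (the run can take a while depending on cluster size):
nutanix@NTNX-B-CVM:192.168.x.x:~$ ncc health_checks run_all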
Set the node in maintenance mode
nutanix@NTNX-B-CVM:192.168.x.x:~$ acli host.enter_maintenance_mode 192.168.x.x wait=true
EnterMaintenanceMode: pending
EnterMaintenanceMode: complete
At this stage, no user VMs are running anymore on the host we just put into maintenance mode; only its CVM is still up.
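If you want to double-check, you can list the hosts from any CVM and verify that the host is no longer schedulable (a quick sketch; exact column names vary between AOS versions):
nutanix@NTNX-B-CVM:192.168.x.x:~$ acli host.list
The host in maintenance mode should be reported as not schedulable.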
Power off the CVM of the host we are upgrading
nutanix@NTNX-B-CVM:192.168.x.x:~$ cvm_shutdown -P now
2020-04-01 09:52:22 INFO zookeeper_session.py:143 cvm_shutdown is attempting to connect to Zookeeper
2020-04-01 09:52:22 INFO lcm_genesis.py:217 Rpc to [localhost] for LCM [LcmFramework.is_lcm_operation_in_progress] is successful
2020-04-01 09:52:22 INFO cvm_shutdown:157 No upgrade was found to be in progress on the cluster
2020-04-01 09:52:22 INFO cvm_shutdown:84 Acquired shutdown token successfully
2020-04-01 09:52:22 INFO cvm_shutdown:104 Validating command arguments.
2020-04-01 09:52:22 INFO cvm_shutdown:107 Executing cmd: sudo shutdown -k -P now
2020-04-01 09:52:23 INFO cvm_shutdown:92 Setting up storage traffic forwarding
2020-04-01 09:52:23 WARNING genesis_utils.py:118 Deprecated: use util.cluster.info.get_factory_config() instead
2020-04-01 09:52:23 INFO genesis_utils.py:2825 Verifying if route is set for 192.168.x.x
2020-04-01 09:52:23 INFO genesis_utils.py:2830 HA Route is not yet set for 192.168.x.x
2020-04-01 09:52:26 INFO genesis_utils.py:2830 HA Route is not yet set for 192.168.x.x
Write failed: Broken pipe
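The "Write failed: Broken pipe" line at the end simply means our SSH session was cut when the CVM powered off, which is expected. Before shutting down the host, you can confirm from the AHV host that the CVM is really off (a quick sketch; the CVM name matches your node):
[root@xxxx-KVM-B ~]# virsh list --all
The CVM should now be listed in the "shut off" state.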
Note: shutting down this CVM does not affect the running workloads; as long as only one CVM is down, the remaining CVMs take over the storage traffic and the workloads stay accessible.
Shutting down the host
[root@xxxx-KVM-B ~]# shutdown -h now
Broadcast message from root@xxxxx-KVM-B
(/dev/pts/0) at 7:52 ...
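Optionally, before pulling the node out of the chassis, you can confirm it is completely powered off through its out-of-band management interface (a sketch, assuming IPMI is reachable and you have credentials for that node):
ipmitool -I lanplus -H <ipmi_ip_of_node> -U <ipmi_user> -P <ipmi_password> chassis power status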
At this stage, you can remove the node from the chassis and start populating the memory slots.
Sexy 24 DIMM modules (384 GB RAM)
This is the node, fully populated. The next time I want to upgrade, I will have to remove all the DIMMs and replace them with bigger modules. But that will not happen.
Once the node has restarted, SSH into the AHV host to check whether the CVM has started:
[root@xxxx-KVM-A ~]# virsh list
Id Name State
----------------------------------------------------
1 xxxx-KVM-A-CVM running
Now we can try to SSH into the CVM. If it works, we can check the cluster status. If it does not, the CVM may not have started; you can start it manually from the AHV host with: virsh start <cvm_name>.
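A quick sketch of that fallback, run from the AHV host (the CVM name below is just an example; use the exact name reported by virsh list --all):
[root@xxxx-KVM-A ~]# virsh list --all
[root@xxxx-KVM-A ~]# virsh start xxxx-KVM-A-CVM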
Checking cluster status
nutanix@xxxxxxx-B-CVM:192.168.x.x:~$ cluster status
2020-04-01 10:03:49 INFO zookeeper_session.py:143 cluster is attempting to connect to Zookeeper
2020-04-01 10:03:49 INFO cluster:2712 Executing action status on The state of the cluster: start
Lockdown mode: Disabled
[...]
CVM: 192.168.x.x Up
Zeus UP [5069, 5106, 5107, 5112, 5121, 5147]
Scavenger UP [6568, 6599, 6600, 6601]
SSLTerminator UP [8879, 8975, 8976, 8977]
SecureFileSync UP [8882, 8920, 8921, 8922]
Medusa UP [9519, 9558, 9559, 9563, 9669]
DynamicRingChanger UP [9780, 9832, 9833, 9884]
Pithos UP [9784, 9852, 9853, 9873]
Mantle UP [9789, 9870, 9871, 9894]
Stargate UP [11305, 11516, 11517, 11758, 11763]
InsightsDB UP [13255, 13422, 13423, 13549]
InsightsDataTransfer UP [13319, 13477, 13478, 13542, 13543, 13545, 13546]
Ergon UP [13363, 13824, 13825, 13826]
Cerebro UP [13452, 13583, 13584, 13818]
Chronos UP [13499, 13695, 13696, 13813]
Curator UP [13575, 13737, 13738, 13898]
Athena UP [13625, 14014, 14015, 14016]
Prism UP [14748, 14901, 14902, 15052, 15055, 15063, 15142, 15143, 15144]
CIM UP [14847, 15018, 15019, 15100]
AlertManager UP [14886, 15332, 15333, 15471]
Arithmos UP [14936, 15110, 15111, 15335]
Catalog UP [14965, 15472, 15473, 15474]
Acropolis UP [15161, 15378, 15379, 15380]
Uhura UP [15290, 15496, 15497, 15498]
Snmp UP [15485, 15591, 15592, 15594]
SysStatCollector UP [15545, 15659, 15660, 15662]
NutanixGuestTools DOWN []
MinervaCVM DOWN []
ClusterConfig DOWN []
Mercury DOWN []
APLOSEngine DOWN []
APLOS DOWN []
Lazan DOWN []
Delphi DOWN []
Flow DOWN []
Anduril DOWN []
XTrim DOWN []
ClusterHealth DOWN []
[...]
2020-04-01 10:03:52 INFO cluster:2863 Success!
In the output above, not all services have started yet. Once every service is up and running, we can exit maintenance mode.
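A simple way to see only the services that are not yet up is a plain shell filter on the same output (nothing Nutanix-specific, just grep):
nutanix@NTNX-B-CVM:192.168.x.x:~$ cluster status | grep -v UP
Re-run it until no service is reported as DOWN anymore, then exit maintenance mode: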
nutanix@xxxxxxx:192.168.x.x:~$ acli host.exit_maintenance_mode 192.168.x.x
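Once the last node is done, you can confirm the cluster memory has indeed increased, either in Prism (Hardware view) or directly on each AHV host with plain Linux tools:
[root@xxxx-KVM-A ~]# grep MemTotal /proc/meminfo
[root@xxxx-KVM-A ~]# free -h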
Now, I can use Calm and other features ;)
Hope this helps...