Solved

AHV node stuck in marked_for_removal_but_not_detachable

  • February 26, 2021
  • 4 replies
  • 456 views

Badge

I recently used LCM to update the firmware on a 9-node cluster. One of the hosts didn't come back up. I was able to boot it back into the host OS, but it had no network connectivity: the interfaces were up and had IP addresses assigned, there was just no traffic passing over the bridge.
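(For context, the usual OVS bridge checks on an AHV host look roughly like the sketch below; br0 and br0-up are the default bridge/bond names and are only assumptions here, so adjust them for your environment.)

# Run on the AHV host as root; br0 / br0-up are the default names and may differ
ovs-vsctl show                  # confirm the bridge and its uplink ports exist
ovs-appctl bond/show br0-up     # see which physical NIC the bond is actually using
ip addr show br0                # verify the bridge interface is UP and has its IP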

After fighting with it for a while, I decided to eject the node from the cluster and simply re-image it as a new node. I started the removal process, but the node status is still stuck at marked_for_removal_but_not_detachable.

nutanix@NTNX- -A-CVM:10.1.153.30:~$ ncli host ls id=21

    Id                        : 0005ADCA-5F30-1BF1-0000-00000000008D15::21
    Uuid                      : 705089F7-2435-4A35-83FE-603470BD36D6
    Name                      : 10.1.153.15
    IPMI Address              : 10.1.151.249
    Controller VM Address     : 10.1.153.34
    Controller VM NAT Address :
    Controller VM NAT Port    :
    Hypervisor Address        : 10.1.153.15
    Host Status               : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
    Oplog Disk Size           : 400 GiB (429,496,729,600 bytes) (2.4%)
    Under Maintenance Mode    : false (LIFE_CYCLE_MANAGEMENT)
    Metadata store status     : Node removed from metadata store
    Node Position             : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
    Node Serial (UUID)        :
    Block Serial (Model)      : 0123456789 (NX-8035-G4)

I followed the "AHV | Node removal stuck after successfully entering maintenance mode" article, but confirmed that the node status is indeed still marked_for_removal_but_not_detachable, and that the host has been removed from the metadata ring:

nutanix@NTNX- -A-CVM:10.1.153.30:~$ nodetool -h 0 ring
Address          Status State    Load        Owns     Token
                                                      t6t6t6t6bp8xooghfmtbxpfevor1fjqbrz0zwdw7lumrx9gua7ghbpinihff
10.1.153.146     Up     Normal   805.96 MB   11.11%   00000000YYJS51SZT1LE5UJXJWRV7DDGQ59Q9OXIRILNAZ33W8UXUAIII1J6
10.1.153.32      Up     Normal   789.13 MB   11.11%   6t6t6t6t4ihbhbgwptcld3zm1gcapkbfog6wvvxtah1ktljbpt6ovs0onkmrf
10.1.153.31      Up     Normal   894.45 MB   11.11%   DmDmDmDm0000000000000000000000000000000000000000000000000000
10.1.153.35      Up     Normal   818.87 MB   11.11%   KFKFKFKFYFHJTJHIDVV6IQHMWIVFVXXFA5BDGRQRAVJJ777TPMETARIGOLY1
10.1.153.37      Up     Normal   1.45 GB     22.22%   Yryryryrz0qud39su9u4hcgmol0voi5vidgihhwjwrewcvnbxsfunj0uh
10.1.153.33      Up     Normal   1.41 GB     11.11%   FKFKFKFK2TH6RDGNNXUNDUZUM3TGRZWPBCU4THWX48E7URXBW6PKMYZS1X4
10.1.153.30      Up     Normal   1.5 GB      11.11%   MdMdMdMd0000000000000000000000000000000000000000000000000000
10.1.153.36      Up     Normal   812.69 MB   11.11%   T6T6T6T6BP8XAOGHFMTBXPFEVOR1FJQBRZ0ZWDW7LUMRX9GUA7GHBPINIHFF

It has been sitting in this state ever since. I also tried to remove the node manually with:

ncli host remove-start id=21 skip-space-check=true

It reported that the node removal was started successfully, but as far as I can tell nothing has changed. Any help is greatly appreciated.
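(As an aside, removal progress can apparently be watched from a CVM with something like the command below, though the exact subcommand may vary by AOS version, so treat it as an assumption.)

# From any CVM; subcommand availability may vary by AOS version
ncli host get-remove-status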


Best answer by matthearn, March 1, 2021, 23:09


View original

This topic has been closed for comments.

4 replies

Userlevel 1
Badge +5

Funnily enough, I'm having the same problem, and this was the first thing that came up when I searched for "marked_for_removal_but_not_detachable". In my case, I think the *cause* was that the host had some kind of problem and kept becoming "unschedulable", yet it was still hosting VMs. When I tried to remove the host, it threw some initial errors about not being able to enter maintenance mode:

Failed to evacuate 6/7 VMs: - 3: HypervisorConnectionError: Failed to connect to the hypervisor on host eebb3ed-c906-4116b28f0ca2 - 3: InvalidVmState: Cannot complete request in state

but it started the removal process anyway. At this point it appears to have migrated all of its metadata and storage to the other hosts in the cluster (it's only using 176 MB of storage), but it still can't be removed. Also, the VMs that were "stuck" on it have disappeared from the cluster entirely. I was able to use "acli host.exit_maintenance_mode" to make the host schedulable even though it was mid-removal, and then was able to migrate *some* of the VMs off of it, but it seems like there's one VM I can't move, even though the VM in question is powered off; I haven't actually been able to determine which VM that actually is.
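(Roughly the shape of those acli calls, with <host> and <vm> as placeholders:)

# From a CVM; <host> and <vm> are placeholders (host IP/UUID and VM name)
acli host.exit_maintenance_mode <host>
acli vm.migrate <vm> host=<host>     # move one VM to a specific host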

In your case, I'd suggest running "virsh list" on the host to see whether it still has VMs on it, and also checking Prism to see whether all of your VMs actually show up. My guess is that in both of our cases it won't remove the host until no VMs are attached to it (even if they're powered off).
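(Something like the following, run on the AHV host itself; --all makes powered-off VMs show up too.)

# On the AHV host as root; --all also lists VMs that are shut off
virsh list --all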

I'll update you if I make any progress with my cluster. I'm hoping I don't have to rebuild it for the second time in three months...

Userlevel 1
Badge +5

It looks like you can use acli to try to identify the "lingering" VM; I ran this:

acli vm.list | awk '{print $1}' | while read vm; do echo ${vm} $(acli vm.get ${vm} | grep host_uuid); done | grep eeb

The "eeb" is the first few characters of the wonky host's UUID. It gave me:

deadvm01  host_uuid: "eebb3eed-c906-416b28f0ca2"

I then powered the VM on, on another host:

acli vm.on deadvm01 host=10.5.38.4

The VM came up happily, but I still can't get the host to be removed or to stay in maintenance mode.

Userlevel 1
Badge +5

Further updates: I had to do a bunch of zeus-hacking. Apparently my attempt to remove the node simply timed out and left the node half-removed; it had copied all the necessary data off the disks, but hadn't marked them as removable.

In Production: Incorrect use of the edit-zeus command could lead to data loss or other cluster complications and should not be used unless under the guidance of Nutanix Support.

Which described using "edit-zeus" to manually set the disk-removal status, but the actual code it specified (changing 17 to 273) may be out of date. I was seeing codes like 4369, 4096, and 4113. Disks that were not scheduled for removal all had status:

$ zeus_config_printer | grep data_migration_status
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 4113
data_migration_status: 4369
data_migration_status: 4096
data_migration_status: 4113
data_migration_status: 4369
data_migration_status: 4113
data_migration_status: 4369
data_migration_status: 4113
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
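(If you want to see which disk entry each of those non-zero statuses belongs to, a few lines of context around each match helps; this is just a grep sketch, and the surrounding fields may differ between AOS versions.)

# Print some context above each stuck status to identify the disk entry
zeus_config_printer | grep -B 5 'data_migration_status: 4113'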

The "edit-zeus" command essentially allows you to replace those values, but it's smart enough to discard your changes if you pick values that don't work. I tried 273 and it wouldn't save it. I tried setting them to 0, 4096, 4113, etc., and then noticed that if I set one to 4369, it generally stayed that way, and it also became grayed out in Prism. So I set them all to 4369. Immediately they all went gray, and the host I was trying to remove began to show up in Prism with only an IP address and no CPU/Disk/Memory statistics. It still wouldn't quite disappear, though.

Which specified doing edit-zeus again, and looking for the "node_status" of the node being removed:

$ zeus_config_printer | grep node_status
node_status: kNormal
node_status: kNormal
node_status: kToBeRemoved
node_status: kNormal

I changed "kToBeRemoved" to "kOkToBeRemoved" and the host immediately became grayed out in Prism. A few minutes later it was gone, and the cluster was back down to 3 nodes, healthy and clean.
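(For completeness, the actual invocation is just the sketch below; the --editor flag is an assumption, and as the warning above says, this shouldn't be done outside of Nutanix Support guidance.)

# Opens the cluster configuration in an editor; changes are validated on save
# The --editor flag is an assumption - plain edit-zeus may also work
edit-zeus --editor=vim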

Hopefully you can do the same, although if you have a 9-node cluster I'm guessing you're *not* running CE and should probably just call support. :)

Badge

@matthearn That was exactly the missing piece I needed. The node had finished going offline, but a few of its disks were stuck being removed. I tried to remove the disks with the command below, but it just reported that the disks were already in a removed state.

ncli disk rm-start id= force=true

I fired up edit-zeus and found the stuck disks. Just like you, I saw a few of them stuck at 4113. I updated those to 4369, and after a few minutes the disks went gray in the UI. I then followed your step of setting the node_status of the down node from kToBeRemoved to kOkToBeRemoved. After that, the node disappeared from the UI! And yes, it is a 9-node cluster; it's an old cluster from a site I've retired from production, and it's no longer under support. You rock, thanks for the help!
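(For anyone who lands on this thread later, confirming the cleanup is just a couple of list commands from a CVM, roughly:)

# The removed host and its disks should no longer be listed
ncli host ls
ncli disk ls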
