My question is about a CVM that won't boot. This is on a semi-retired production cluster (not CE) with no workloads running.

I found the console output in /tmp/ntnx.serial.out.0. I can see it bring up the RAID devices, scan for UUID markers and find 2 of them, then abort, unload the mpt3sas kernel module, and retry 5 seconds later. This repeats a few times until the hypervisor resets the VM and the boot starts over.
The most relevant part of the log (with a lot of kernel noise removed) is:
```
[    9.543553] sd 2:0:3:0: [sdd] Attached SCSI disk
svmboot: === svmboot
mdadm: failed to get exclusive lock on mapfile
[    9.790075] md: md127 stopped.
mdadm: ignoring /dev/sdb3 as it reports /dev/sda3 as failed
[    9.794087] md/raid1:md127: active with 1 out of 2 mirrors
[    9.796034] md127: detected capacity change from 0 to 42915069952
mdadm: /dev/md/phoenix:2 has been started with 1 drive (out of 2).
[    9.808602] md: md126 stopped.
[    9.813330] md/raid1:md126: active with 2 out of 2 mirrors
[    9.815279] md126: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:1 has been started with 2 drives.
[    9.832111] md: md125 stopped.
mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
[    9.840436] md/raid1:md125: active with 1 out of 2 mirrors
[    9.842341] md125: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:0 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:2 exists - ignoring
[    9.887613] md: md124 stopped.
[    9.896418] md/raid1:md124: active with 1 out of 2 mirrors
[    9.898373] md124: detected capacity change from 0 to 42915069952
mdadm: /dev/md124 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:0 exists - ignoring
[    9.926863] md: md123 stopped.
[    9.937962] md/raid1:md123: active with 1 out of 2 mirrors
[    9.939950] md123: detected capacity change from 0 to 10727981056
mdadm: /dev/md123 has been started with 1 drive (out of 2).
svmboot: checking /dev/md for /nutanix_active_svm_partition
svmboot: checking /dev/md123 for /nutanix_active_svm_partition
[    9.994541] EXT4-fs (md123): mounted filesystem with ordered data mode. Opts: (null)
svmboot: suitable boot partition with /.cvm_uuid at /dev/md123
[   10.009251] EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
svmboot: suitable boot partition with /.cvm_uuid at /dev/md125
svmboot: checking /dev/nvme?*p?* for /nutanix_active_svm_partition
svmboot: ERROR: too many partitions with a valid cvm_uuid: /dev/md123 /dev/md125
sh: missing ]
svmboot: retrying in 5 seconds.
[   10.430316] md123: detected capacity change from 10727981056 to 0
[   10.432058] md: md123 stopped.
mdadm: stopped /dev/md123
[   10.467498] md124: detected capacity change from 42915069952 to 0
[   10.469245] md: md124 stopped.
mdadm: stopped /dev/md124
[   10.507492] md125: detected capacity change from 10727981056 to 0
[   10.509276] md: md125 stopped.
mdadm: stopped /dev/md125
[   10.547497] md126: detected capacity change from 10727981056 to 0
[   10.549243] md: md126 stopped.
mdadm: stopped /dev/md126
[   10.577498] md127: detected capacity change from 42915069952 to 0
[   10.579245] md: md127 stopped.
mdadm: stopped /dev/md127
[   10.586750] ata2.00: disabled
modprobe: remove 'virtio_pci': No such file or directory
[   10.673882] mpt3sas version 14.101.00.00 unloading
```
Since this happens before networking comes up, and the hypervisor then resets the VM, I have no way to interact with it.

How can I fix this?
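For context, the fatal condition is svmboot finding more than one partition carrying the `/.cvm_uuid` marker file. The check is effectively something like the following sketch (`find_uuid_marker` is a hypothetical helper name, not the actual svmboot script):

```shell
# find_uuid_marker DIR...
# Print each mounted filesystem root (passed as a directory) that
# contains the .cvm_uuid marker file svmboot looks for. The log above
# shows svmboot aborting when it finds more than one candidate.
find_uuid_marker() {
    for d in "$@"; do
        [ -f "$d/.cvm_uuid" ] && printf '%s\n' "$d"
    done
    return 0
}
```

On a live system each `/dev/md*` device would be mounted read-only and its mount point passed in; in the log above both md123 and md125 carried the marker, which triggers the "too many partitions with a valid cvm_uuid" error.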
Best answer:
FYI - the hypervisor boots from the SATADOM, but it has no device driver for the SAS HBA, so it cannot normally see the storage disks. The hypervisor boots the CVM, which does have a SAS driver (mpt3sas), so all disk access goes through the CVM. The CVM itself boots off software RAID devices built on the first 3 partitions of the SSDs.

In my case, 2 of the software RAID devices had lost sync.
```
[root@sysresccd ~]# lsscsi
[0:0:0:0]    disk    ATA      INTEL SSDSC2BX80  0140  /dev/sdb
[0:0:1:0]    disk    ATA      ST2000NX0253      SN05  /dev/sda
[0:0:2:0]    disk    ATA      ST2000NX0253      SN05  /dev/sdc
[0:0:3:0]    disk    ATA      ST2000NX0253      SN05  /dev/sde
[0:0:4:0]    disk    ATA      ST2000NX0253      SN05  /dev/sdd
[0:0:5:0]    disk    ATA      INTEL SSDSC2BX80  0140  /dev/sdg
[4:0:0:0]    disk    ATA      SATADOM-SL 3ME    119   /dev/sdf
[11:0:0:0]   cd/dvd  ATEN     Virtual CDROM     YS0J  /dev/sr0
[root@sysresccd ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0       7:0    0 632.2M  1 loop  /run/archiso/sfs/airootfs
sda         8:0    0   1.8T  0 disk
└─sda1      8:1    0   1.8T  0 part
sdb         8:16   0 745.2G  0 disk
├─sdb1      8:17   0    10G  0 part
│ └─md127   9:127  0    10G  0 raid1
├─sdb2      8:18   0    10G  0 part
│ └─md125   9:125  0    10G  0 raid1
├─sdb3      8:19   0    40G  0 part
│ └─md126   9:126  0    40G  0 raid1
└─sdb4      8:20   0 610.6G  0 part
sdc         8:32   0   1.8T  0 disk
└─sdc1      8:33   0   1.8T  0 part
sdd         8:48   0   1.8T  0 disk
└─sdd1      8:49   0   1.8T  0 part
sde         8:64   0   1.8T  0 disk
└─sde1      8:65   0   1.8T  0 part
sdf         8:80   0  59.6G  0 disk
└─sdf1      8:81   0  59.6G  0 part
sdg         8:96   0 745.2G  0 disk
├─sdg1      8:97   0    10G  0 part
├─sdg2      8:98   0    10G  0 part
│ └─md125   9:125  0    10G  0 raid1
├─sdg3      8:99   0    40G  0 part
└─sdg4      8:100  0 610.6G  0 part
sr0        11:0    1   693M  0 rom   /run/archiso/bootmnt
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active (auto-read-only) raid1 sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sdb1[2]
      10476544 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
```

I could see the two SSDs probed as sdb and sdg, with partitions 1, 2 and 3 configured as RAID members, but only partition 2 correctly in sync. The 4th partition is used for NFS in the CVM (i.e. fast storage for the cluster).
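A quick way to spot which arrays are degraded is to look for a `_` in the mirror-status column (`[UU]` vs `[U_]`) of /proc/mdstat. A small sketch (`degraded_arrays` is a hypothetical helper; it reads mdstat-format text from a file so it can also be tried against a saved copy):

```shell
# degraded_arrays FILE
# Print the names of md arrays whose mirror-status column (e.g. [U_])
# shows a missing member. Pass /proc/mdstat on a live system.
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }' "$1"
}
```

Against the output above it would report md126 and md127, matching what I found.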
So my solution was:

1. Set the devices I needed to modify back to read-write mode:
```
[root@sysresccd ~]# mdadm --readwrite md126
[root@sysresccd ~]# mdadm --readwrite md127
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active raid1 sdb1[2]
      10476544 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
```
2. Rejoin the devices into the RAID1 mirrors and let them resync:
```
[root@sysresccd ~]# mdadm /dev/md126 -a /dev/sdg3
mdadm: re-added /dev/sdg3
[root@sysresccd ~]# mdadm /dev/md127 -a /dev/sdg1
mdadm: re-added /dev/sdg1
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdg3[1] sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
      [=========>...........]  recovery = 48.5% (20361856/41909248) finish=1.7min speed=200123K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active raid1 sdg1[1] sdb1[2]
      10476544 blocks super 1.1 [2/1] [U_]
        resync=DELAYED
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
```

A couple of minutes later both mirrors were back in sync:

```
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdg3[1] sdb3[2]
      41909248 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sdg1[1] sdb1[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
```

3. As an added check, run fsck on the volumes:
```
[root@sysresccd ~]# fsck /dev/md125
fsck from util-linux 2.36
e2fsck 1.45.6 (20-Mar-2020)
/dev/md125 has gone 230 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md125: 62842/655360 files (0.2% non-contiguous), 1912185/2619136 blocks
[root@sysresccd ~]# fsck /dev/md126
fsck from util-linux 2.36
e2fsck 1.45.6 (20-Mar-2020)
/dev/md126: clean, 20006/2621440 files, 5177194/10477312 blocks
[root@sysresccd ~]# fsck /dev/md127
fsck from util-linux 2.36
e2fsck 1.45.6 (20-Mar-2020)
/dev/md127: clean, 66951/655360 files, 1866042/2619136 blocks
```

After rebooting back into the hypervisor, the CVM came up normally.