I have a problem with a CVM that will not start. This is on a semi-retired production cluster (not CE) that is not running any workloads.
I found the console output in /tmp/NTNX.serial.out. From it, I can see that it tries to bring up the RAID devices, scans for a uuid marker and finds 2 of them, then aborts and unloads the mpt3sas kernel module, and tries again 5 seconds later. This repeats a few times, and then the hypervisor resets it and the boot starts over.
The most relevant part of the log (with a lot of kernel taint messages removed) is:
[9.543553] sd 2:0:3:0: [sdd] Attached SCSI disk
svmboot: === svmboot
mdadm: Failed to get exclusive lock on mapfile
[9.790075] md: md127 stopped.
mdadm: ignoring /dev/sdb3 as it reports /dev/sda3 as failed
[9.794087] md/raid1:md127: active with 1 out of 2 mirrors
[9.796034] md127: detected capacity change from 0 to 42915069952
mdadm: /dev/md/phoenix:2 has been started with 1 drive (out of 2).
[9.808602] md: md126 stopped.
[9.813330] md/raid1:md126: active with 2 out of 2 mirrors
[9.815279] md126: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:1 has been started with 2 drives.
[9.832111] md: md125 stopped.
mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
[9.840436] md/raid1:md125: active with 1 out of 2 mirrors
[9.842341] md125: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:0 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:2 exists - ignoring
[9.887613] md: md124 stopped.
[9.896418] md/raid1:md124: active with 1 out of 2 mirrors
[9.898373] md124: detected capacity change from 0 to 42915069952
mdadm: /dev/md124 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:0 exists - ignoring
[9.926863] md: md123 stopped.
[9.937962] md/raid1:md123: active with 1 out of 2 mirrors
[9.939950] md123: detected capacity change from 0 to 10727981056
mdadm: /dev/md123 has been started with 1 drive (out of 2).
svmboot: Checking /dev/md* for /.nutanix_active_svm_partition
svmboot: Checking /dev/md123 for /.nutanix_active_svm_partition
[9.994541] EXT4-fs (md123): mounted filesystem with ordered data mode. Opts: (null)
svmboot: Suitable boot partition with /.cvm_uuid at /dev/md123
[10.009251] EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
svmboot: Suitable boot partition with /.cvm_uuid at /dev/md125
svmboot: Checking /dev/nvme*p?* for /.nutanix_active_svm_partition
svmboot: ERROR: too many partitions with valid cvm_uuid: /dev/md123 /dev/md125
svmboot: Retrying in 5 seconds.
[10.430316] md123: detected capacity change from 10727981056 to 0
[10.432058] md: md123 stopped.
mdadm: stopped /dev/md123
[10.467498] md124: detected capacity change from 42915069952 to 0
[10.469245] md: md124 stopped.
mdadm: stopped /dev/md124
[10.507492] md125: detected capacity change from 10727981056 to 0
[10.509276] md: md125 stopped.
mdadm: stopped /dev/md125
[10.547497] md126: detected capacity change from 10727981056 to 0
[10.549243] md: md126 stopped.
mdadm: stopped /dev/md126
[10.577498] md127: detected capacity change from 42915069952 to 0
[10.579245] md: md127 stopped.
mdadm: stopped /dev/md127
[10.586750] ata2.00: disabled
modprobe: remove 'virtio_pci': No such file or directory
[10.673882] mpt3sas version 14.101.00.00 unloading
Since this happens before networking is up and before the hypervisor resets the VM, I have no way to interact with it.
How can I fix this?

Best answer
FYI - the hypervisor boots from the SATADOM, but it does not have a device driver for the SAS HBA, so it cannot normally see the storage disks. The hypervisor boots the CVM, which has a SAS device driver (mpt3sas), so all disk access is done through the CVM. The CVM boots off software RAID devices built from the first 3 partitions of the SSDs.

In my case, 2 of the software RAID devices had lost sync.
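A degraded mirror is easy to spot in /proc/mdstat: a healthy two-way RAID1 shows `[2/2] [UU]`, while one running on a single member shows `[2/1] [U_]`. As a minimal sketch (the `degraded_arrays` helper is my own, not part of any Nutanix or mdadm tooling), this lists the degraded md devices; it reads mdstat-format text on stdin so you can pipe in /proc/mdstat on a live system:

```shell
#!/bin/sh
# List md arrays whose mdstat status field shows a missing member
# (an "_" inside the [UU] field). Reads mdstat-format text on stdin.
degraded_arrays() {
    awk '
        /^md/ { dev = $1 }                 # remember the current array name
        /\[[U_]+\]$/ && /_/ { print dev }  # status field contains "_": degraded
    '
}

# Example with a captured mdstat snippet (md126 degraded, md125 healthy):
degraded_arrays <<'EOF'
md125 : active raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
md126 : active raid1 sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
EOF
# prints: md126
```

On the rescue system this would be `degraded_arrays < /proc/mdstat`.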
[root@sysresccd ~]# lsscsi
[0:0:0:0] disk ATA INTEL SSDSC2BX80 0140 /dev/sdb
[0:0:1:0] disk ATA ST2000NX0253 SN05 /dev/sda
[0:0:2:0] disk ATA ST2000NX0253 SN05 /dev/sdc
[0:0:3:0] disk ATA ST2000NX0253 SN05 /dev/sde
[0:0:4:0] disk ATA ST2000NX0253 SN05 /dev/sdd
[0:0:5:0] disk ATA INTEL SSDSC2BX80 0140 /dev/sdg
[4:0:0:0] disk ATA SATADOM-SL 3ME 119 /dev/sdf
[11:0:0:0] cd/dvd ATEN Virtual CDROM YS0J /dev/sr0
[root@sysresccd ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0       7:0    0 632.2M  1 loop  /run/archiso/sfs/airootfs
sda         8:0    0   1.8T  0 disk
└─sda1      8:1    0   1.8T  0 part
sdb         8:16   0 745.2G  0 disk
├─sdb1      8:17   0    10G  0 part
│ └─md127   9:127  0    10G  0 raid1
├─sdb2      8:18   0    10G  0 part
│ └─md125   9:125  0    10G  0 raid1
├─sdb3      8:19   0    40G  0 part
│ └─md126   9:126  0    40G  0 raid1
└─sdb4      8:20   0 610.6G  0 part
sdc         8:32   0   1.8T  0 disk
└─sdc1      8:33   0   1.8T  0 part
sdd         8:48   0   1.8T  0 disk
└─sdd1      8:49   0   1.8T  0 part
sde         8:64   0   1.8T  0 disk
└─sde1      8:65   0   1.8T  0 part
sdf         8:80   0  59.6G  0 disk
└─sdf1      8:81   0  59.6G  0 part
sdg         8:96   0 745.2G  0 disk
├─sdg1      8:97   0    10G  0 part
├─sdg2      8:98   0    10G  0 part
│ └─md125   9:125  0    10G  0 raid1
├─sdg3      8:99   0    40G  0 part
└─sdg4      8:100  0 610.6G  0 part
sr0        11:0    1   693M  0 rom   /run/archiso/bootmnt
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md126 : active (auto-read-only) raid1 sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md127 : active (auto-read-only) raid1 sdb1[2]
      10476544 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

I could see the two SSDs probed as sdb and sdg, with partitions 1, 2, and 3 configured as RAID members but only partition 2 correctly in sync. The 4th partition is used for NFS in the CVM (i.e. fast storage for the cluster).
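Because each array pairs partition N on sdb with partition N on sdg (as the lsblk output shows), the member to re-add can be read off mechanically from the member that is still present. A dry-run sketch of that reasoning (my own helper; it only prints the mdadm commands, and you should confirm the actual failed member with `mdadm --detail` before running anything):

```shell
#!/bin/sh
# Given a degraded array and its surviving member, print the mdadm command
# that would re-add the missing mirror half. Dry-run only.
# Assumes the layout above: sdbN mirrors sdgN and vice versa.
suggest_readd() {
    # $1 = array name (e.g. md126), $2 = surviving member (e.g. sdb3)
    part="${2#sd?}"                     # partition number, e.g. "3"
    case "$2" in
        sdb*) missing="sdg$part" ;;
        sdg*) missing="sdb$part" ;;
        *)    echo "unexpected member name: $2" >&2; return 1 ;;
    esac
    echo "mdadm /dev/$1 -a /dev/$missing"
}

suggest_readd md126 sdb3   # prints: mdadm /dev/md126 -a /dev/sdg3
suggest_readd md127 sdb1   # prints: mdadm /dev/md127 -a /dev/sdg1
```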
So my solution was:

1. Set the devices I needed to modify back to writable mode:
[root@sysresccd ~]# mdadm --readwrite md126
[root@sysresccd ~]# mdadm --readwrite md127
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md126 : active raid1 sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md127 : active raid1 sdb1[2]
      10476544 blocks super 1.1 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
2. Rejoin the devices back into the RAID1 mirror and let them resync:
[root@sysresccd ~]# mdadm /dev/md126 -a /dev/sdg3
mdadm: re-added /dev/sdg3
[root@sysresccd ~]# mdadm /dev/md127 -a /dev/sdg1
mdadm: re-added /dev/sdg1
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md126 : active raid1 sdg3[1] sdb3[2]
      41909248 blocks super 1.1 [2/1] [U_]
      [=========>...........] recovery = 48.5% (20361856/41909248) finish=1.7min speed=200123K/sec
      bitmap: 1/1 pages [4KB], 65536KB chunk
md127 : active raid1 sdg1[1] sdb1[2]
      10476544 blocks super 1.1 [2/1] [U_]
      resync=DELAYED
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md126 : active raid1 sdg3[1] sdb3[2]
      41909248 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md127 : active raid1 sdg1[1] sdb1[2]
      10476544 blocks super 1.1 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>

3. As an added check, run fsck on the volumes:
[root@sysresccd ~]# fsck /dev/md125
fsck from util-linux 2.36
e2fsck 1.45.6 (20-Mar-2020)
/dev/md125 has gone 230 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md125: 62842/655360 files (0.2% non-contiguous), 1912185/2619136 blocks
[root@sysresccd ~]# fsck /dev/md126
fsck from util-linux 2.36
e2fsck 1.45.6 (20-Mar-2020)
/dev/md126: clean, 20006/2621440 files, 5177194/10477312 blocks
[root@sysresccd ~]# fsck /dev/md127
fsck from util-linux 2.36
e2fsck 1.45.6 (20-Mar-2020)
/dev/md127: clean, 66951/655360 files, 1866042/2619136 blocks

After rebooting back into the hypervisor, the CVM came up normally.
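Before rebooting, it is worth confirming that every mirror shows [UU] and that no resync is still pending or delayed. A small check along these lines (my own helper; it reads mdstat-format text on stdin, so on the rescue system you would feed it /proc/mdstat):

```shell
#!/bin/sh
# Exit 0 only when no array in the mdstat-format input is degraded,
# recovering, or waiting to resync. Pipe in /proc/mdstat on a live system.
all_mirrors_clean() {
    ! grep -Eq '\[U*_+U*\]|resync|recovery'
}

# Example with a clean snapshot, like the final mdstat output above:
if all_mirrors_clean <<'EOF'
md125 : active raid1 sdg2[1] sdb2[2]
      10476544 blocks super 1.1 [2/2] [UU]
EOF
then
    echo "all mirrors clean - safe to reboot"
fi
```

On the live system: `all_mirrors_clean < /proc/mdstat && reboot` (or inspect manually if it fails).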