Disk Space Usage High alert

August 30, 2020
0 replies
2813 views

UserLevel 3

Nutonian
Nutanix Employee
6 replies

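This article aims to better explain the nuances of the alert "Disk space usage for one or more disks on controller VM" as seen in Prism. I have included some basic troubleshooting steps to identify the exact issue.

SSH into any CVM in the cluster and run the following command: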

++ allssh "df -h"

This lists the use % of all the disks controlled by the CVM.
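As a rough aid for reading that output, here is a minimal sketch that picks out the rows with a high use %. It assumes the standard six-column `df -h` layout (Use% in column 5, mount point in column 6). The function name `high_usage` and the default threshold of 90 are illustrative choices, not Nutanix tooling or an official alert threshold.

```shell
#!/usr/bin/env bash
# Print "mount-point use%" for every row at or above a threshold.
# Reads df -h style output on stdin. Header rows, separator rows, and
# rows that df wrapped across two lines are skipped, because their
# fifth field is not of the form "NN%".
high_usage() {
  local threshold="${1:-90}"   # illustrative default, not an official value
  awk -v t="$threshold" '
    $5 ~ /^[0-9]+%$/ {
      pct = $5
      sub(/%/, "", pct)          # "91%" -> "91"
      if (pct + 0 >= t) print $6, $5
    }'
}
```

One possible use is `allssh "df -h" | high_usage 90`, which should leave only the filesystems worth a closer look.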

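For example, see the screenshot below.

Nutanix reserves space on the SSD under /home for its infrastructure, capped at 40 GB. This partition can sometimes run low on free space and trigger an alert.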

[Screenshot: df -h output]

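Check out this article for more information on how to reduce /home space usage.

The other disks, /dev/sdx, are the individual physical drives on each node. First, we eliminate the possibility of a faulty drive by running a smartctl check.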

++ sudo smartctl -a /dev/sdx

Replace x with the appropriate disk name from the df -h output.

If the smartctl test fails, contact your vendor for a disk replacement.
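To make the pass/fail decision less error-prone, a small sketch like the following can summarize the health line from smartctl output. It assumes the usual smartmontools wording: "SMART overall-health self-assessment test result: PASSED" for ATA drives and "SMART Health Status: OK" for SAS/SCSI drives. The helper name `smart_health` is hypothetical.

```shell
#!/usr/bin/env bash
# Summarize smartctl output read from stdin as healthy / failing / unknown.
# Only the overall health line is inspected; individual attributes are not.
smart_health() {
  awk '
    /SMART overall-health self-assessment test result:/ { status = $NF }
    /SMART Health Status:/                                { status = $NF }
    END {
      if (status == "PASSED" || status == "OK") print "healthy"
      else if (status == "")                     print "unknown"
      else                                       print "failing"
    }'
}
```

For instance, `sudo smartctl -a /dev/sdx | smart_health` would print one word per drive. An "unknown" result means no health line was found, and the full output should be read by hand.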

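Back to the df -h output: the space usage % of these disks can give us more information about the alert.

[Figure: df -h output showing one of the HDD disks at 90% usage]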

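Identify the SSD drives:

If the SSD drives show more than 95% usage, then either the rate of read/write I/O is much faster than cold data is being tiered down, or some user VM is pinned to the flash (SSD) tier via Prism.

Basically, we need to look into possibly increasing the SSD capacity, or check for any VM with Flash Mode on that is hogging all the SSD disk space.

Engage Nutanix Support in such a case for further analysis, or post your comments and output here so the community has a chance to chime in.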

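For HDD drives, the use % gives us some additional information based on the number of disks affected.

If all the disks in one CVM show a high %, then the Curator service on this node might not be doing its job.
However, if only one disk shows a high use %, then we need to check the contents of that disk itself to find the issue.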
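The distinction above can be sketched as a small classifier over one CVM's df -h rows. This is an illustration under two assumptions: data disks are mounted under a path containing `stargate-storage/disks`, and 90 is an arbitrary example threshold. The function name `classify_node` is hypothetical.

```shell
#!/usr/bin/env bash
# Read one CVM's df -h rows on stdin and report whether all, some, or
# none of its data disks are at or above the threshold.
classify_node() {
  local threshold="${1:-90}"   # example value only
  awk -v t="$threshold" '
    # Assumed mount path pattern for data disks.
    $5 ~ /^[0-9]+%$/ && $6 ~ /stargate-storage\/disks/ {
      total++
      pct = $5; sub(/%/, "", pct)
      if (pct + 0 >= t) high++
    }
    END {
      if (total == 0)         print "no data disks found"
      else if (high == total) print "all disks high: check Curator on this node"
      else if (high > 0)      print high " of " total " disks high: inspect those disks"
      else                    print "no disks high"
    }'
}
```

Running it against the output of a single CVM (for example `df -h | classify_node 90` on that CVM) keeps the per-node distinction meaningful. Feeding it the combined allssh output would mix nodes together.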

For more troubleshooting and resolution, see KB 3224 - Troubleshooting "Disk Space Usage High" alerts regarding HDD usage.

Cluster-wide high disk space usage seen:

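In this case, the alerts must be investigated, because Stargate stops writing data to a disk when the extent store on that disk is 95% full. When all the disks in a cluster hit 95% utilization, write operations from the hypervisor are rejected, which might result in VMs hanging.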
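As a worked example of that 95% limit, the sketch below estimates how much more data a disk can take before reaching it. The function name `headroom_to_95` and the use of GiB as the unit are assumptions for illustration. Real extent store accounting on a cluster may differ from this simple capacity-times-0.95 arithmetic.

```shell
#!/usr/bin/env bash
# Print the remaining space (same unit as the inputs) before a disk
# reaches 95% of its capacity. A negative result means the disk is
# already past the limit.
headroom_to_95() {
  local capacity="$1" used="$2"
  awk -v c="$capacity" -v u="$used" 'BEGIN {
    limit = c * 95 / 100       # 95% of capacity
    printf "%.1f\n", limit - u
  }'
}
```

For a 1000 GiB disk with 900 GiB used, `headroom_to_95 1000 900` gives 50.0, that is, about 50 GiB of writes before the 95% mark.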

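If the disk space is consumed by user data, this indicates that the cluster is undersized with respect to storage capacity. High usage might also be due to various AOS-related problems that require further troubleshooting.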

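Curator scans

Curator is responsible for a lot of things, but one of its main jobs is garbage collection and cleanup.

This makes Curator responsible for reclaiming space on disks. If the Curator scans don't succeed, disk space may not be reclaimed.

Say you delete a VM or a virtual disk: it is only marked for removal. The Curator service purges the vdisk during Curator partial scans or full scans.

Another scenario might be snapshots that the Curator scans did not delete after DR sync replication.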

Command to check the most recent Curator scans that ran in the cluster:

++ curator_cli get_last_successful_scans

If no Curator scan has run in the last 6 hours, please contact Nutanix Support.
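The 6-hour rule can be expressed as a simple age check. The output format of curator_cli get_last_successful_scans is not shown in this post, so the sketch below makes no attempt to parse it. It assumes you have already converted the last scan time to epoch seconds yourself. The function name `scan_is_stale` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when the last scan is more than 6 hours old.
# $1: last successful scan time, in epoch seconds (obtained separately)
# $2: current time in epoch seconds (defaults to now)
scan_is_stale() {
  local last_scan_epoch="$1"
  local now_epoch="${2:-$(date +%s)}"
  local max_age=$((6 * 60 * 60))   # 6 hours in seconds
  [ $(( now_epoch - last_scan_epoch )) -gt "$max_age" ]
}
```

A call such as `scan_is_stale "$last" && echo "No Curator scan in the last 6 hours"` would then flag the condition described above.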

More about Curator scans:

KB-2101: Under what conditions do Curator scans run?

Some further troubleshooting steps to identify the reason for the alert:

++ ncli pd ls-snaps

Shows any snapshots that are expired but not yet deleted from the cluster.

++ vdisk_config_printer | grep to_remove | wc -l

Counts the vdisks that are marked for removal but have not yet been purged.

++ ncc health_checks hardware_checks disk_checks disk_usage_check

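Verifies whether the usage of any individual disk or Controller VM (CVM) system partition is sustained above a certain threshold.

For more information on Curator scans: KB-1523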
