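Solved

Dealing with disaster... how does Nutanix respond?

7 months ago
September 25, 2021
4 replies
68 views

Jonathanroderick
Voyager
1 reply

Things won't always go to plan and the worst will happen. In this case, we lose 2 nodes from our 9-node RF2 cluster (e.g. one is down for maintenance, and someone forgets and reboots another node at the same time).

Can someone point me at some documentation that outlines how we can recover from this and get the cluster back up and running? Is there any permanent fallout?

Thanks.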


Best answer by UPX, September 26, 2021, 10:53

It mainly depends on what happens after the failure, but the chance of losing data exists, and you should definitely check the exact state of the nodes and CVMs with the support team.

There is no right answer or workaround here.
For example, a few days ago an NTC fellow found himself in the exact situation you described: while one of the nodes was in maintenance mode, another one was rebooted. One of the nodes was in a fault state; the other came back online after the reboot, but with about 200 VMs down.
With the help of support staff he rebuilt the failed node with a Phoenix ISO with AOS and the hypervisor embedded, and everything went fine, but... I really don't want to find myself in a situation like that, with about 200 VMs down... you know...

The best rule I can suggest is:
for any cluster with more than 5 nodes, use RF3 (FT2) at the cluster level and two containers: one RF3 container for critical VMs and one RF2 container for the rest.


This topic has been closed for comments.

4 replies

UserLevel 2

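UPX
Principal
71 replies
6 months ago
September 26, 2021

As a rule of thumb, with RF2 a Nutanix cluster can lose a single node at a time, so I suggest opening a support ticket to check and get your system back.

Are all hosts and CVMs available?

Preparation is the starting point of excellent work!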


Jonathanroderick
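Author
Voyager
1 reply
6 months ago
September 26, 2021

Thanks. Sorry, I should have made clear this is a 'what if' scenario (nothing is wrong now). I'd like to know what the ultimate outcome is if we were ever to face a 2-host failure. I understand the first step is a support call to Nutanix, but I want to know how things go from there and whether we stand to lose data.

Thanks.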


UserLevel 2

UPX
Principal
71 replies
6 months ago
September 26, 2021
Answer

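It mainly depends on what happens after the failure, but the chance of losing data exists, and you should definitely check the exact state of the nodes and CVMs with the support team.

There is no right answer or workaround here.
For example, a few days ago an NTC fellow found himself in the exact situation you described: while one of the nodes was in maintenance mode, another one was rebooted. One of the nodes was in a fault state; the other came back online after the reboot, but with about 200 VMs down.
With the help of support staff he rebuilt the failed node with a Phoenix ISO with AOS and the hypervisor embedded, and everything went fine, but... I really don't want to find myself in a situation like that, with about 200 VMs down... you know...

The best rule I can suggest is:
for any cluster with more than 5 nodes, use RF3 (FT2) at the cluster level and two containers: one RF3 container for critical VMs and one RF2 container for the rest.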
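
For readers who like to see a rule of thumb written down, here is a minimal Python sketch of the suggestion above. It is only an illustration of this post's advice, not an official Nutanix sizing rule or API: the `Layout` type, the `suggest_layout` function and the container names are hypothetical, and the fallback for clusters of 5 nodes or fewer (a single RF2 container under FT1) is an assumption, since the post only covers larger clusters.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of a cluster layout (not a Nutanix object)."""
    cluster_fault_tolerance: str                      # "FT1" or "FT2"
    containers: Dict[str, int] = field(default_factory=dict)  # name -> RF


def suggest_layout(node_count: int) -> Layout:
    """Encode the forum rule of thumb: more than 5 nodes -> FT2 with two containers."""
    if node_count < 1:
        raise ValueError("node_count must be positive")
    if node_count > 5:
        # Rule from the post: FT2 at cluster level, RF3 for critical VMs, RF2 for the rest.
        return Layout("FT2", {"critical-vms": 3, "general-vms": 2})
    # Assumption (not stated in the post): smaller clusters stay on FT1 with RF2.
    return Layout("FT1", {"general-vms": 2})


if __name__ == "__main__":
    print(suggest_layout(9))
```

For the 9-node cluster in the question, the sketch returns an FT2 layout with one RF3 and one RF2 container, which is the split the post recommends.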

Preparation is the starting point of excellent work!


UserLevel 2

DavidN
Guide
40 replies
6 months ago
September 27, 2021


UPX - you're right on... hope you don't mind if I add on to your comments.

If 2 or more nodes suddenly fail, or fail before the cluster becomes resilient again after the loss of the 1st node, the hypervisor should place storage offline (aka an APD - 'All Paths Down' - event). At this point the hypervisor (assuming it's configured with Nutanix best practices) should halt/power off VMs to prevent any data loss, i.e. VMs trying to 'commit' a write when in fact the hardware cannot promise such a request would/could lead to data loss.
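
To make the timing point concrete, here is a toy Python model of the scenario described above. It is a deliberately simplified sketch, not how AOS actually decides anything: the function names are hypothetical, it assumes every failed node sits in a different fault domain (so no block or rack awareness), and it assumes the cluster has enough free capacity to re-protect data after a failure.

```python
def tolerated_failures(replication_factor: int) -> int:
    """Simultaneous node losses the data can absorb: RF2 -> 1, RF3 -> 2."""
    if replication_factor < 2:
        raise ValueError("replication_factor must be at least 2")
    return replication_factor - 1


def storage_stays_online(replication_factor: int,
                         failed_nodes: int,
                         healed_failures: int = 0) -> bool:
    """Toy check: does storage stay available after `failed_nodes` losses?

    `healed_failures` counts earlier failures the cluster had already
    recovered from (resiliency restored) before the next node was lost.
    """
    if not 0 <= healed_failures <= failed_nodes:
        raise ValueError("healed_failures must be between 0 and failed_nodes")
    outstanding = failed_nodes - healed_failures
    return outstanding <= tolerated_failures(replication_factor)


if __name__ == "__main__":
    # Two nodes lost at once on RF2: storage goes offline (APD).
    print(storage_stays_online(2, failed_nodes=2))                      # False
    # Second node lost only after the cluster healed from the first.
    print(storage_stays_online(2, failed_nodes=2, healed_failures=1))   # True
    # Two nodes lost at once on RF3.
    print(storage_stays_online(3, failed_nodes=2))                      # True
```

The second call is the case that matters for maintenance planning: the same two node losses are survivable on RF2 only if the cluster had time to become resilient again in between.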

Possible exceptions and design considerations: with BLOCK or RACK resiliency, the cluster can sustain the loss of 2 or more nodes in certain situations, e.g. 2 nodes in the SAME block or 2 nodes in the SAME rack.

(Is there ANYTHING in life with 0% risk???) With that in mind, I think anyone working in storage/virtualization should at minimum perform (or review) a risk analysis and disaster planning:

What kind of an outage can be absorbed by the organization that relies on the data? 1 minute, 1 hour, 1 day, etc.? The shortest acceptable time needs to drive decisions such as setting the resiliency factor to RF3 or perhaps spreading the risk across more than one cluster, along with designing recovery steps using NearSync/Async replication, etc.
Hardware resiliency: invest in redundant components, switches, power, etc.?
Coordinate all changes, so only ONE admin actually touches/makes changes to the cluster... in a maintenance window. Preferably.
Have a very well kept disaster recovery plan (stored off the cluster and accessible to key persons) with approved tiers, documented recovery steps AND TESTING of such a plan. It should be part of any business continuity plan (i.e. not just addressing the VMs/storage, but people/logistics/communications, etc.).
etc. etc... mitigating risk requires constant work & vigilance.
Recommended reads & references:

https://www.joshodgers.com/tag/all-paths-down/

http://www.joshodgers.com/2020/06/22/i-o-path-resiliency-comparison-nutanix-aos-vmware-vsan-dellemc-vxrail/

https://portal.nutanix.com/page/documents/details?targetId=Web-Console-Guide-Prism-v6_0:arc-failure-modes-c.html

https://portal.nutanix.com/page/documents/solutions/details?targetId=TN-2068-Infrastructure-Resiliency:TN-2068-Infrastructure-Resiliency

Reviewing KBs to understand the implications of AOS fixes is highly recommended (but still good to read to understand possible issues!):

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0600000008fb7CAA

ESXi: VMware KB 2032940
