NVIDIA’s most powerful consumer and professional GPUs, the GeForce RTX 5090 and RTX PRO 6000, have hit an unexpected snag that is raising concerns across the virtualization community. 
Reports suggest these flagship graphics cards can become completely unresponsive when used in virtualized environments, forcing a full system reboot to restore functionality.
The issue first surfaced at CloudRift, a GPU cloud provider catering to developers and AI researchers. According to its engineers, after a few days of continuous use inside virtual machines, the GPUs suddenly stop responding. Crucially, once they enter this unresponsive state, no software-level reset can bring them back online – the entire host machine must be rebooted, disrupting every guest workload it serves. On a multi-tenant host, that forced reboot translates directly into downtime for customers whose VMs had nothing to do with the failure.
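For operators, the practical symptom is a GPU that simply stops answering management queries. A minimal watchdog along the lines sketched below – run inside the guest, or against a host GPU that is not handed over to vfio-pci – is one way to spot the hang; the 15-second timeout and the queried fields are illustrative assumptions, not anything CloudRift has published.

```python
import subprocess

# A healthy GPU answers nvidia-smi almost instantly; a hung one typically
# leaves the query blocked until the timeout fires. The timeout value and
# queried fields are illustrative choices, not a documented threshold.
def gpu_responds(timeout_s: int = 15) -> bool:
    try:
        subprocess.run(
            ["nvidia-smi", "--query-gpu=name,pci.bus_id", "--format=csv,noheader"],
            check=True,
            capture_output=True,
            timeout=timeout_s,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, FileNotFoundError):
        return False

if __name__ == "__main__":
    print("GPU responsive" if gpu_responds()
          else "GPU unresponsive - escalate; a host reboot may be the only fix")
```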
The problem appears tied specifically to virtualization stacks that hand the GPU to guests via VFIO (Virtual Function I/O) passthrough. After a Function Level Reset (FLR) – the standard PCIe mechanism for reinitializing a single device function, which VFIO relies on when a VM starts or shuts down – both the RTX 5090 and RTX PRO 6000 fail to recover. The host kernel ends up in a soft lockup, effectively trapping both host and guest environments until the machine is rebooted. Other GPUs such as the RTX 4090, the Hopper-based H100, and even the data-center Blackwell B200 appear unaffected, yet the issue is consistently reproducible on NVIDIA’s newest Blackwell consumer and workstation flagships.
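On Linux, the FLR in question is exposed through PCI sysfs: writing 1 to a device’s reset attribute triggers the reset that these cards reportedly fail to survive. The read-only sketch below – an illustration, not a diagnostic tool anyone involved has published – lists each NVIDIA PCI function, the driver it is bound to (vfio-pci when passed through), and the reset methods the kernel advertises for it; the reset_method attribute assumes a kernel of roughly 5.15 or newer.

```python
from pathlib import Path

PCI_ROOT = Path("/sys/bus/pci/devices")
NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA

# Read-only walk of PCI sysfs: for every NVIDIA function, report which driver
# owns it (vfio-pci means it is passed through to a guest) and which reset
# methods the kernel believes are available, e.g. "flr" or "bus".
for dev in sorted(PCI_ROOT.iterdir()):
    try:
        vendor = (dev / "vendor").read_text().strip()
    except OSError:
        continue
    if vendor != NVIDIA_VENDOR_ID:
        continue

    driver_link = dev / "driver"
    driver = driver_link.resolve().name if driver_link.exists() else "none"

    reset_file = dev / "reset_method"
    methods = reset_file.read_text().strip() if reset_file.exists() else "not exposed"

    print(f"{dev.name}: driver={driver} reset_methods={methods}")
```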
The bug is not limited to CloudRift. Proxmox users have independently reported identical crashes, with one case describing a full host crash after shutting down a Windows guest. The consistency of these reports suggests a systemic driver or firmware flaw rather than isolated misconfiguration.
CloudRift has gone as far as offering a $1,000 bug bounty for anyone who can identify a workaround or patch, highlighting just how disruptive the flaw has become for businesses relying on NVIDIA’s latest GPUs. The sum is largely symbolic, but it underscores the urgency felt by operators of GPU clouds and AI workloads. Some users online have gone further, mocking the bounty as laughable next to NVIDIA’s trillion-dollar valuation and its history of slow driver fixes.
NVIDIA has reportedly acknowledged the bug and confirmed it can reproduce the issue in its labs. That confirmation is reassuring, but until an official driver or firmware update is released, both enterprise and enthusiast users running virtualization stacks are left in a precarious position. For environments depending on 24/7 uptime, a mandatory host reboot is not just inconvenient – it is unacceptable.
The incident also reignites ongoing debates about GPU reliability in virtualized environments. AMD’s Radeon cards, while less dominant in AI workloads, have been praised by some community members for avoiding such virtualization pitfalls. Others, however, dismiss the concern, suggesting that the issue will eventually be patched before it affects wider deployments.
For now, the RTX 5090 and RTX PRO 6000 remain powerhouse GPUs on paper, but anyone looking to run them in professional virtualization contexts should proceed with caution. Until NVIDIA rolls out a fix, users may find themselves juggling reboots more often than groundbreaking AI experiments.
3 comments
lmao nvidia gonna throw 1k for a fix, trillion $ co but cheap af 😂
these cards brick faster than i can say reboot, it just works lol
funny how amd bros dont have to deal with fake frames AND this mess