Error 802: system not yet initialized when running CUDA container after NVIDIA driver reinstall on DGX — torch.cuda.is_available() crashes
https://askubuntu.com/questions/1564666/error-802-system-not-yet-initialized-when-running-cuda-container-after-nvidia-d
Background / What was working:
I have an NVIDIA DGX server running Ubuntu. Previously, I had driver 535.216 installed (CUDA 12.2), and everything worked fine — including a vLLM Docker container that uses PyTorch 2.9.0+cu129 (CUDA 12.9 internally). The container ran without issues despite the CUDA version mismatch between host and container (CUDA's minor-version compatibility within the 12.x series was working).
What I did:
Ran apt upgrade and saw nvidia-driver-590 (CUDA 13.1) recommended, so I installed it.
After installing 590, the vLLM container stopped working with:
RuntimeError: Unexpected error from cudaGetDeviceCount().
Error 802: system not yet initialized
I tried to roll back by running apt install nvidia-driver-535, but it installed 535.288.
I couldn't get back to the original 535.216.
Restarting the server didn't help.
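For completeness, this is roughly how I attempted the rollback. The exact 535.216 version string below is a guess on my part (commented out for that reason); apt-cache madison shows which builds the configured repos actually still publish:

```shell
# List every 535.x build the configured repos actually publish
# (read-only; safe to run anywhere).
apt-cache madison nvidia-driver-535

# Install a specific build by its full version string. The string below is
# hypothetical -- substitute one reported by the madison query above.
# sudo apt install nvidia-driver-535=535.216.01-0ubuntu1

# Keep a future apt upgrade from silently bumping the driver again.
# sudo apt-mark hold nvidia-driver-535
```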
Current state:
nvidia-smi works on the host, shows Driver 535.288, CUDA 12.2
nvidia-smi inside the container also shows 535.288, CUDA 12.9
The NVIDIA Container Toolkit was missing after the purge and had to be reinstalled manually
When I exec into the container and run Python:
>>> import torch
>>> torch.__version__
'2.9.0+cu129'
>>> torch.version.cuda
'12.9'
>>> torch.cuda.is_available()  # process crashes here, no output returned
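While debugging I looked up the numeric code: per the CUDA Runtime API error list, 802 is cudaErrorSystemNotReady. A small lookup of the nearby codes I kept running into (symbolic names transcribed from NVIDIA's docs; the one-line hints are my own summary):

```shell
# Map the CUDA runtime error codes that show up around driver/container
# mismatches to their symbolic names (per the CUDA Runtime API docs).
cuda_err() {
    case "$1" in
        3)   echo "cudaErrorInitializationError: driver/runtime initialization failed" ;;
        35)  echo "cudaErrorInsufficientDriver: installed driver is older than the CUDA runtime needs" ;;
        100) echo "cudaErrorNoDevice: no CUDA-capable device detected" ;;
        802) echo "cudaErrorSystemNotReady: system not yet initialized (on NVSwitch/DGX systems, check nvidia-fabricmanager)" ;;
        803) echo "cudaErrorSystemDriverMismatch: CUDA driver / display driver version mismatch" ;;
        *)   echo "unknown: see the CUDA Runtime API error list" ;;
    esac
}

cuda_err 802
```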
Key observation:
The same container with CUDA 12.9 worked fine on driver 535.216 before I touched anything. So the CUDA version mismatch between host (12.2) and container (12.9) is not the root cause: it was working before via CUDA's minor-version compatibility. Something broke during the apt purge + reinstall cycle and hasn't been restored.
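Given that, my plan is to compare the host-side state against what a working DGX needs. A sketch of the checks I intend to run, each guarded so it degrades gracefully on any machine. The fabric-manager check reflects the commonly reported cause of Error 802 on NVSwitch/DGX systems (nvidia-fabricmanager stopped, or its version no longer matching the driver); whether that applies to this box is an assumption on my part:

```shell
#!/usr/bin/env bash
# Guarded diagnostics: run a command only if its binary exists.
check() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" 2>&1 || true
    else
        echo "== $1 not found (skipped) =="
    fi
}

# 1. Host driver: userspace tools and kernel module versions must match.
check nvidia-smi
check cat /proc/driver/nvidia/version

# 2. DGX/NVSwitch: Error 802 is commonly a fabric-manager problem --
#    the service must be active AND the same version as the driver.
check systemctl status nvidia-fabricmanager --no-pager
check dpkg -l nvidia-fabricmanager-535

# 3. Container stack: was everything reinstalled after the purge?
check nvidia-ctk --version
check dpkg -l nvidia-container-toolkit
```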
How should I approach fixing this problem?
Error Stack:
Traceback (most recent call last):
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 722, in worker_main
ERROR [multiproc_executor.py] worker = WorkerProc(*args, **kwargs)
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 538, in __init__
ERROR [multiproc_executor.py] wrapper.init_worker(all_kwargs)
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 255, in init_worker
ERROR [multiproc_executor.py] worker_class = resolve_obj_by_qualname(
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/import_utils.py", line 122, in resolve_obj_by_qualname
ERROR [multiproc_executor.py] module = importlib.import_module(module_name)
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
ERROR [multiproc_executor.py] return _bootstrap._gcd_import(name[level:], package, level)
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
ERROR [multiproc_executor.py] File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
ERROR [multiproc_executor.py] File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
ERROR [multiproc_executor.py] File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
ERROR [multiproc_executor.py] File "<frozen importlib._bootstrap_external>", line 999, in exec_module
ERROR [multiproc_executor.py] File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 54, in <module>
ERROR [multiproc_executor.py] from vllm.v1.worker.gpu_model_runner import GPUModelRunner
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 140, in <module>
ERROR [multiproc_executor.py] from vllm.v1.spec_decode.eagle import EagleProposer
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py", line 30, in <module>
ERROR [multiproc_executor.py] from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 230, in <module>
ERROR [multiproc_executor.py] class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[FlashAttentionMetadata]):
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 251, in FlashAttentionMetadataBuilder
ERROR [multiproc_executor.py] if get_flash_attn_version() == 3
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/utils/fa_utils.py", line 49, in get_flash_attn_version
ERROR [multiproc_executor.py] 3 if (device_capability.major == 9 and is_fa_version_supported(3)) else 2
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 70, in is_fa_version_supported
ERROR [multiproc_executor.py] return _is_fa3_supported(device)[0]
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 51, in _is_fa3_supported
ERROR [multiproc_executor.py] if torch.cuda.get_device_capability(device)[0] < 9 \
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 598, in get_device_capability
ERROR [multiproc_executor.py] prop = get_device_properties(device)
ERROR [multiproc_executor.py] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 614, in get_device_properties
ERROR [multiproc_executor.py] _lazy_init() # will define _get_device_properties
ERROR [multiproc_executor.py] ^^^^^^^^^^^^
ERROR [multiproc_executor.py] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 410, in _lazy_init
ERROR [multiproc_executor.py] torch._C._cuda_init()
ERROR [multiproc_executor.py] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
INFO [multiproc_executor.py:709] Parent process exited, terminating worker
INFO [multiproc_executor.py:709] Parent process exited, terminating worker
INFO [multiproc_executor.py:709] Parent process exited, terminating worker
INFO [multiproc_executor.py:709] Parent process exited, terminating worker
(EngineCore_DP0 pid=157) ERROR [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=157) ERROR [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=157) ERROR [core.py:843] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR [core.py:843] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=157) ERROR [core.py:843] super().__init__(
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=157) ERROR [core.py:843] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=157) ERROR [core.py:843] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) Process EngineCore_DP0:
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=157) ERROR [core.py:843] super().__init__(vllm_config)
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=157) ERROR [core.py:843] self._init_executor()
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 174, in _init_executor
(EngineCore_DP0 pid=157) ERROR [core.py:843] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=157) ERROR [core.py:843] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR [core.py:843] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 660, in wait_for_ready
(EngineCore_DP0 pid=157) ERROR [core.py:843] raise e from None
(EngineCore_DP0 pid=157) ERROR [core.py:843] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=157) Traceback (most recent call last):
(EngineCore_DP0 pid=157) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=157) self.run()
(EngineCore_DP0 pid=157) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=157) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 847, in run_engine_core
(EngineCore_DP0 pid=157) raise e
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=157) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=157) super().__init__(
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=157) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=157) super().__init__(vllm_config)
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=157) self._init_executor()
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 174, in _init_executor
(EngineCore_DP0 pid=157) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 660, in wait_for_ready
(EngineCore_DP0 pid=157) raise e from None
(EngineCore_DP0 pid=157) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1819, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1838, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 183, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 224, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 810, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 471, in __init__
(APIServer pid=1) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
WARNING 03-09 05:48:00 [argparse_utils.py:195] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 03-09 05:48:00 [api_server.py:1772] vLLM API server version 0.12.0