One command can stand up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs — with no VM setup, no Kubernetes, and billing tied to how long the job runs.
The workflow, published on the Hugging Face Blog, uses hf jobs run with the official vllm/vllm-openai container, exposes port 8000, and returns a job-specific URL you can hit from curl, Python, a notebook, or a local UI.
“You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.”
That makes this a practical path for tests, evals, batch generation, or quick model trials. If you need a long-lived managed service, Hugging Face points users toward Inference Endpoints instead. More on that trade-off below.
Launch an OpenAI-compatible vLLM API on HF Jobs in minutes
By the end of this guide, you’ll have a running vLLM server hosted through HF Jobs that accepts OpenAI-style requests at:
https://<job_id>--8000.hf.jobs/v1
The core move is simple: run a container on Hugging Face infrastructure, ask for a GPU, expose vLLM’s API port, and serve a model.
The example from Hugging Face uses Qwen/Qwen3-4B on an a10g-large flavor. That is a sensible first test because it keeps the command small and lets you validate the flow before moving to larger models.
For readers tracking how temporary AI compute fits into broader engineering decisions, this sits alongside the infrastructure questions we covered in Key Trends Reveal the Next Tech and Finance Shake-Up and Future Trends Everyone Keeps Misreading — Here's Why. This guide is the hands-on version: launch, test, stop.
Before you start: install the HF CLI and authenticate
You need three things before the command works:
- Billing access: Hugging Face says Jobs requires a payment method or a positive prepaid credit balance.
- Hugging Face CLI support: Install or upgrade to huggingface_hub>=1.20.0.
- Local authentication: Log in with hf auth login.
Run:
pip install -U "huggingface_hub>=1.20.0"
hf auth login
Your token matters twice.
First, the CLI needs it to launch the job. Second, every request to the exposed endpoint must carry an HF token with read access to the job namespace. Hugging Face is explicit: a plain browser visit will be rejected.
Watch out for: do not paste tokens into shared notebooks, tickets, or screenshots. The endpoint is gated, not public, but the token is still the credential that unlocks it.
Pick a vLLM-ready model and match it to the GPU flavor
Start with the model Hugging Face uses in its guide:
Qwen/Qwen3-4B
That keeps your first run close to the documented path. After that, you can move up.
Hugging Face shows a larger example with Qwen/Qwen3.5-122B-A10B on h200x2, using tensor parallelism across two GPUs. The key rule from the source: --tensor-parallel-size should match the number of GPUs in the flavor.
| Use case | Example model | HF Jobs flavor shown | Extra vLLM setting |
|---|---|---|---|
| First test server | Qwen/Qwen3-4B | a10g-large | None beyond host/port |
| Larger model run | Qwen/Qwen3.5-122B-A10B | h200x2 | --tensor-parallel-size 2 |
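The matching rule in the table can be sketched in a few lines of Python. This is a minimal illustration, not part of Hugging Face's guide: the helper name serve_args and the GPUS_PER_FLAVOR mapping are hypothetical, and the GPU counts are read off the two flavors shown above.

```python
# GPU counts per flavor. Only the two flavors from the article are listed;
# the counts are an assumption inferred from the examples, so confirm them
# with `hf jobs hardware` before relying on this table.
GPUS_PER_FLAVOR = {"a10g-large": 1, "h200x2": 2}


def serve_args(model: str, flavor: str, port: int = 8000) -> list[str]:
    """Build `vllm serve` arguments with tensor parallelism matched to the flavor."""
    if flavor not in GPUS_PER_FLAVOR:
        raise ValueError(f"unknown flavor {flavor!r}; check `hf jobs hardware`")
    gpus = GPUS_PER_FLAVOR[flavor]
    args = ["vllm", "serve", model, "--host", "0.0.0.0", "--port", str(port)]
    if gpus > 1:
        # One tensor-parallel shard per GPU in the flavor.
        args += ["--tensor-parallel-size", str(gpus)]
    return args


print(" ".join(serve_args("Qwen/Qwen3.5-122B-A10B", "h200x2")))
```

Deriving the flag from the flavor, rather than typing both by hand, removes one way for the two values to drift apart when you change hardware.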
Before choosing hardware, run:
hf jobs hardware
Hugging Face says an a10g-large runs at $1.50/hour, and recommends checking hf jobs hardware for the full price list. The practical rule: pick the smallest flavor that fits your model, then scale only when memory or latency forces it.
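To make the cost concrete, here is a rough estimate in Python. It assumes billing is strictly linear per second with no minimum charge, which the guide does not spell out, and the function name estimate_cost is hypothetical.

```python
def estimate_cost(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Estimate the charge for a job billed linearly per second of runtime."""
    if hourly_rate_usd < 0 or runtime_seconds < 0:
        raise ValueError("rate and runtime must be non-negative")
    return hourly_rate_usd * runtime_seconds / 3600


# a10g-large at the $1.50/hour rate quoted above.
full_timeout = estimate_cost(1.50, 2 * 3600)  # job runs for a full two hours
quick_test = estimate_cost(1.50, 20 * 60)     # job cancelled after 20 minutes
print(f"2h: ${full_timeout:.2f}, 20 min: ${quick_test:.2f}")
```

Under those assumptions, a job left running for two hours costs about $3.00, while a 20-minute test costs about $0.50.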
Run the one-command HF Jobs deployment for a vLLM OpenAI server
Here is the minimal launch command from the Hugging Face example:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
What each part does:
- hf jobs run: Runs a container on Hugging Face infrastructure.
- --flavor a10g-large: Selects the GPU hardware.
- --expose 8000: Routes container port 8000 through the HF Jobs proxy.
- --timeout 2h: Sets a safety stop after two hours.
- vllm/vllm-openai:latest: Uses the official vLLM OpenAI-compatible image.
- vllm serve Qwen/Qwen3-4B: Starts the model server.
- --host 0.0.0.0 --port 8000: Makes vLLM listen on the exposed port.
The command prints a job URL and an exposed endpoint. Hugging Face’s example output looks like this:
✓ Job started id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332
Hint: Exposed ports are reachable at:
https://6a381ca1953ed90bfb947332--8000.hf.jobs
Save the job ID. You will need it for the API URL, SSH, and cleanup.
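If you script the launch, you can pull the job ID out of the CLI output and derive the API base URL from it. The sketch below assumes the output keeps the "id: <hex>" shape shown in the example above, which may change between CLI versions. Both function names are hypothetical.

```python
import re


def extract_job_id(cli_output: str) -> str:
    """Find the hex job ID after 'id:' in `hf jobs run` output."""
    match = re.search(r"\bid:\s*([0-9a-f]+)", cli_output)
    if match is None:
        raise ValueError("no job id found in CLI output")
    return match.group(1)


def endpoint_url(job_id: str, port: int = 8000) -> str:
    """Build the OpenAI-compatible base URL for an exposed port."""
    return f"https://{job_id}--{port}.hf.jobs/v1"


sample_output = (
    "✓ Job started id: 6a381ca1953ed90bfb947332\n"
    "url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332\n"
)
print(endpoint_url(extract_job_id(sample_output)))
```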
Watch out for: CLI syntax can move over time. If your local command rejects a flag, check your installed huggingface_hub version and the current Jobs help output.
Wait for vLLM to finish downloading and loading weights
Do not hit the endpoint immediately and assume failure.
Hugging Face says to give the job a couple of minutes to download weights and boot. The useful readiness signal is:
Application startup complete
Until then, the container may be alive while the model is still loading.
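Rather than watching logs, a script can poll the endpoint until it answers. This is a minimal sketch using only the standard library. It treats an HTTP 200 from /v1/models as "ready", which is an assumption: the guide does not document what the proxy returns while the model is loading. The helper names and the default of 60 attempts, 10 seconds apart, are also illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, token: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of an authenticated GET, or 0 on a network error."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def wait_until_ready(
    check: Callable[[], int],
    attempts: int = 60,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if check() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A typical call would be wait_until_ready(lambda: http_status(base_url + "/models", token)), where base_url and token are your job URL and HF token. The check is injected as a callable so the loop can be tested without a live endpoint.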
Common failure points to check:
- Permissions: Your token may not have read access to the job namespace or model.
- Memory: The model may not fit the selected GPU flavor.
- Arguments: A vLLM flag may not match the model or installed server behavior.
- Startup time: Larger models take longer to download and load.
For deeper debugging, Hugging Face supports SSH into a running job if you launch with --ssh and have your public key registered at huggingface.co/settings/keys.
hf jobs ssh <job_id>
Inside the container, you can run tools such as:
nvidia-smi
Test the endpoint with an OpenAI-style request
First, check that the model endpoint responds:
curl https://<job_id>--8000.hf.jobs/v1/models \
-H "Authorization: Bearer $(hf auth token)"
Then send a chat request:
curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
-H "Authorization: Bearer $(hf auth token)" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B",
"messages": [{"role": "user", "content": "Hello!"}],
"chat_template_kwargs": {"enable_thinking": false}
}'
Hugging Face says this returns normal OpenAI-style JSON, with the assistant text in:
choices[0].message.content
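In Python, reading that field from a decoded response looks roughly like this. The sample payload is hand-written to follow the OpenAI chat-completion shape and is not captured from a real job. The function name assistant_text is hypothetical.

```python
def assistant_text(response: dict) -> str:
    """Return the assistant message from an OpenAI-style chat completion."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    return choices[0]["message"]["content"]


# Illustrative payload only; a real response carries more fields (id, usage, ...).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ]
}
print(assistant_text(sample))
```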
You can also use the OpenAI Python client by pointing it at the HF Jobs URL:
from huggingface_hub import get_token
from openai import OpenAI
client = OpenAI(
base_url="https://<job_id>--8000.hf.jobs/v1",
api_key=get_token(),
)
resp = client.chat.completions.create(
model="Qwen/Qwen3-4B",
messages=[{"role": "user", "content": "Hello!"}],
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
Watch out for: the exposed URL is not an open public API. Every request needs the bearer token.
Tune larger jobs without guessing blindly
For bigger models, Hugging Face’s example uses:
hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
vllm/vllm-openai:latest \
vllm serve Qwen/Qwen3.5-122B-A10B \
--host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
--max-model-len 32768 --max-num-seqs 256
The important tuning points are:
- --tensor-parallel-size 2: Matches the two GPUs in h200x2.
- --max-model-len 32768: Caps context length.
- --max-num-seqs 256: Caps concurrent sequences.
Hugging Face says Qwen3.5-122B-A10B has a 256K-token default context, and that default does not leave enough memory for vLLM’s default batch settings in this setup. Capping context length and concurrent sequence count keeps it inside GPU memory.
If startup fails with an out-of-memory or cache-block error, Hugging Face recommends dialing down those two settings first.
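One way to dial down without guessing is to precompute a short list of fallback values. The halving schedule and the floors below are assumptions made for illustration, not a recommendation from Hugging Face or vLLM. The function name backoff_settings is hypothetical.

```python
def backoff_settings(
    max_model_len: int,
    max_num_seqs: int,
    steps: int = 3,
    min_model_len: int = 4096,
    min_num_seqs: int = 8,
) -> list[tuple[int, int]]:
    """Return successively smaller (max_model_len, max_num_seqs) pairs to retry with."""
    candidates = []
    for _ in range(steps):
        # Halve both caps each step, but never drop below the floors.
        max_model_len = max(min_model_len, max_model_len // 2)
        max_num_seqs = max(min_num_seqs, max_num_seqs // 2)
        candidates.append((max_model_len, max_num_seqs))
    return candidates


for length, seqs in backoff_settings(32768, 256):
    print(f"--max-model-len {length} --max-num-seqs {seqs}")
```

Starting from the values in the command above, this yields 16384/128, then 8192/64, then 4096/32. In practice you may prefer to lower only one of the two settings, depending on whether your workload needs long context or high concurrency.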
Keep tokens out of places they do not belong
The HF Jobs proxy acts as the API gate. That is convenient, but it means token handling is part of the deployment.
Use this safer pattern:
- Authenticate locally with hf auth login.
- Pass tokens as bearer tokens in requests.
- Avoid hard-coding tokens into shell scripts, notebooks, or logs.
- Review command history after failed experiments if you typed credentials manually.
Hugging Face’s own examples use:
-H "Authorization: Bearer $(hf auth token)"
and Python’s:
api_key=get_token()
That keeps the token out of the literal request code.
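For scripts that run where no interactive login exists, such as CI, a similar pattern can be sketched with the standard library. It reads the token from the HF_TOKEN environment variable and masks it before anything is logged. The helper names are hypothetical, and showing four leading characters is an arbitrary choice.

```python
import os
from typing import Mapping, Optional


def resolve_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the HF token from the environment instead of hard-coding it."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    return token


def auth_headers(token: str) -> dict:
    """Build the bearer header expected by the HF Jobs proxy."""
    return {"Authorization": f"Bearer {token}"}


def redact(token: str, visible: int = 4) -> str:
    """Mask all but the first few characters so the token is safe to log."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)
```

Log redact(token) rather than the token itself when you need to confirm which credential a script picked up.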
Add a local chat UI if curl is too barebones
If you prefer a browser chat window, Hugging Face shows a Gradio example that points at the same endpoint.
For Qwen3 reasoning output, add this to the vllm serve command:
--reasoning-parser deepseek_r1
Then your local Gradio app can stream reasoning into a collapsible panel and the answer below it. The server remains the same HF Jobs endpoint; only the client changes.
This is useful for quick human testing before wiring the endpoint into an app prototype.
Stop the HF Jobs server when the test is done
Jobs are billed while they run, so cleanup is not optional.
Cancel the job:
hf jobs cancel <job_id>
The --timeout 2h flag is only a safety net. Hugging Face says cancelling explicitly is cheaper, because billing stops as soon as the job does.
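If you drive tests from a script, a context manager can make the cancel step hard to forget. This is a sketch under assumptions: it shells out to the hf jobs cancel command shown above, and the names cancel_on_exit and run_cli are hypothetical.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


def run_cli(argv: Sequence[str]) -> None:
    """Run a CLI command and raise if it exits with a non-zero status."""
    subprocess.run(list(argv), check=True)


@contextmanager
def cancel_on_exit(
    job_id: str, run: Callable[[Sequence[str]], None] = run_cli
) -> Iterator[str]:
    """Cancel the job when the block exits, even if the body raised."""
    try:
        yield job_id
    finally:
        run(["hf", "jobs", "cancel", job_id])
```

Wrap your test requests in `with cancel_on_exit(job_id):` so that an exception halfway through an eval still stops the job.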
Before you shut it down, save the working recipe:
- Model ID
- GPU flavor
- vLLM image
- Port
- Tuning flags
- Successful test request
That turns a one-off experiment into a repeatable deployment command.
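One lightweight way to keep that recipe is a small JSON file checked in next to your project. The Recipe class and its field names below are hypothetical. They mirror the checklist above and are not a format that HF Jobs reads.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Recipe:
    """The settings needed to reproduce a working vLLM job."""

    model: str
    flavor: str
    image: str = "vllm/vllm-openai:latest"
    port: int = 8000
    vllm_flags: list[str] = field(default_factory=list)
    test_prompt: str = "Hello!"

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Recipe":
        return cls(**json.loads(text))


recipe = Recipe(
    model="Qwen/Qwen3.5-122B-A10B",
    flavor="h200x2",
    vllm_flags=["--tensor-parallel-size", "2", "--max-model-len", "32768"],
)
print(recipe.to_json())
```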
Choose HF Jobs for experiments, Inference Endpoints for durable service
Hugging Face draws a clear line between the two options.
| Need | Better fit | Why |
|---|---|---|
| Fast experiments, evals, batch generation | HF Jobs | You choose the image, hardware, and vllm serve flags; you pay while it runs |
| Long-lived managed service | Inference Endpoints | Adds access-control modes and scale-to-zero |
HF Jobs is the fastest path from terminal to GPU-backed inference: authenticate, pick a model, run hf jobs run, wait for vLLM, test /v1/models, send a chat request, then cancel the job.
The next practical step is to wrap the launch and cleanup commands in a small project script. After that, test the same flow with a larger model or connect the endpoint to an internal prototype — but keep the timeout short until the memory settings and cost profile are proven.
Key Takeaways
- Developers can launch a private OpenAI-compatible vLLM endpoint without provisioning VMs or Kubernetes.
- HF Jobs makes temporary GPU-backed model serving practical for testing, evals, and batch workloads.
- The workflow gives teams a fast path to validate models like Qwen/Qwen3-4B before committing to longer-lived infrastructure.










