How do you launch a private OpenAI-compatible vLLM server on Hugging Face Jobs?

Use `hf jobs run` with the official `vllm/vllm-openai` container, request a GPU flavor, expose port 8000, and serve the model. The endpoint is then available at a job-specific `/v1` URL.

What URL does an HF Jobs vLLM endpoint use?

The article gives the endpoint format as `https://<job_id>--8000.hf.jobs/v1`.

What do you need before running a vLLM server on HF Jobs?

You need billing access through a payment method or positive prepaid credit balance, `huggingface_hub>=1.20.0`, and local login with `hf auth login`.

When should Hugging Face Inference Endpoints be used instead of HF Jobs?

The article says HF Jobs is practical for tests, evals, batch generation, or quick model trials, while Hugging Face points users to Inference Endpoints for a long-lived managed service.

One Command Spins Up a Private vLLM Server on HF Jobs

Q: Is the vLLM endpoint on HF Jobs public?

No. The article says requests must include an HF token with read access to the job namespace, and a plain browser visit will be rejected.

One command can stand up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs — with no VM setup, no Kubernetes, and billing tied to how long the job runs.

The workflow, published by the Hugging Face Blog, uses hf jobs run with the official vllm/vllm-openai container, exposes port 8000, and returns a job-specific URL you can hit from curl, Python, a notebook, or a local UI.

“You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.”

That makes this a practical path for tests, evals, batch generation, or quick model trials. If you need a long-lived managed service, Hugging Face points users toward Inference Endpoints instead. More on that trade-off below.

Launch an OpenAI-compatible vLLM API on HF Jobs in minutes

By the end of this guide, you’ll have a running vLLM server hosted through HF Jobs that accepts OpenAI-style requests at:

https://<job_id>--8000.hf.jobs/v1

The core move is simple: run a container on Hugging Face infrastructure, ask for a GPU, expose vLLM’s API port, and serve a model.

The example from Hugging Face uses Qwen/Qwen3-4B on an a10g-large flavor. That is a sensible first test because it keeps the command small and lets you validate the flow before moving to larger models.

For readers tracking how temporary AI compute fits into broader engineering decisions, this sits alongside the infrastructure questions we covered in Key Trends Reveal the Next Tech and Finance Shake-Up and Future Trends Everyone Keeps Misreading — Here's Why. This guide is the hands-on version: launch, test, stop.

Before you start: install the HF CLI and authenticate

You need three things before the command works:

Billing access: Hugging Face says Jobs requires a payment method or a positive prepaid credit balance.
Hugging Face CLI support: Install or upgrade to huggingface_hub>=1.20.0.
Local authentication: Log in with hf auth login.

Run:

pip install -U "huggingface_hub>=1.20.0"
hf auth login

Your token matters twice.

First, the CLI needs it to launch the job. Second, every request to the exposed endpoint must carry an HF token with read access to the job namespace. Hugging Face is explicit: a plain browser visit will be rejected.

Watch out for: do not paste tokens into shared notebooks, tickets, or screenshots. The endpoint is gated, not public, but the token is still the credential that unlocks it.

Pick a vLLM-ready model and match it to the GPU flavor

Start with the model Hugging Face uses in its guide:

Qwen/Qwen3-4B

That keeps your first run close to the documented path. After that, you can move up.

Hugging Face shows a larger example with Qwen/Qwen3.5-122B-A10B on h200x2, using tensor parallelism across two GPUs. The key rule from the source: --tensor-parallel-size should match the number of GPUs in the flavor.

Use case	Example model	HF Jobs flavor shown	Extra vLLM setting
First test server	Qwen/Qwen3-4B	`a10g-large`	None beyond host/port
Larger model run	Qwen/Qwen3.5-122B-A10B	`h200x2`	`--tensor-parallel-size 2`
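
The matching rule can be checked before launch. The following is a minimal sketch, assuming flavor names encode the GPU count as a trailing `x<N>` (as `h200x2` does) and that a name without that suffix means a single GPU. The helpers `gpus_in_flavor` and `tensor_parallel_args` are hypothetical names, not part of the HF CLI, and the sketch does not distinguish CPU-only flavors, so confirm the real GPU count with `hf jobs hardware`.

```python
import re


def gpus_in_flavor(flavor: str) -> int:
    """Guess the GPU count from a flavor name (assumption: trailing 'x<N>')."""
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def tensor_parallel_args(flavor: str) -> list[str]:
    """Return the vLLM flag only when the flavor appears to have several GPUs."""
    count = gpus_in_flavor(flavor)
    return ["--tensor-parallel-size", str(count)] if count > 1 else []


print(gpus_in_flavor("a10g-large"))
print(tensor_parallel_args("h200x2"))
```

Deriving the flag from the flavor name keeps the two values from drifting apart when you switch hardware.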

Before choosing hardware, run:

hf jobs hardware

Hugging Face says an a10g-large runs at $1.50/hour, and recommends checking hf jobs hardware for the full price list. The practical rule: pick the smallest flavor that fits your model, then scale only when memory or latency forces it.
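
To make the cost of a short test concrete, here is a rough estimate based on the $1.50/hour figure above. It assumes billing is prorated purely per second with no minimum charge, which the article implies ("pay-per-second") but does not spell out. `estimate_cost` is an illustrative helper, not an HF API.

```python
def estimate_cost(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Cost of a job billed per second at the given hourly rate."""
    return hourly_rate_usd * runtime_seconds / 3600


A10G_LARGE_RATE = 1.50  # USD per hour, the figure quoted in the article

print(f"20-minute test:  ${estimate_cost(A10G_LARGE_RATE, 20 * 60):.2f}")
print(f"Full 2h timeout: ${estimate_cost(A10G_LARGE_RATE, 2 * 3600):.2f}")
```

Under those assumptions a 20-minute test costs about $0.50, while letting the job run to a two-hour timeout costs about $3.00.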

Run the one-command HF Jobs deployment for a vLLM OpenAI server

Here is the minimal launch command from the Hugging Face example:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

What each part does:

hf jobs run: Runs a container on Hugging Face infrastructure.
--flavor a10g-large: Selects the GPU hardware.
--expose 8000: Routes container port 8000 through the HF Jobs proxy.
--timeout 2h: Sets a safety stop after two hours.
vllm/vllm-openai:latest: Uses the official vLLM OpenAI-compatible image.
vllm serve Qwen/Qwen3-4B: Starts the model server.
--host 0.0.0.0 --port 8000: Makes vLLM listen on the exposed port.

The command prints a job URL and an exposed endpoint. Hugging Face’s example output looks like this:

✓ Job started id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332

Hint: Exposed ports are reachable at:
https://6a381ca1953ed90bfb947332--8000.hf.jobs

Save the job ID. You will need it for the API URL, SSH, and cleanup.
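
If you capture the CLI output in a script, the job ID and API base URL can be pulled out of the exposed-port line. This is a sketch that assumes the output keeps the `https://<job_id>--<port>.hf.jobs` shape shown above. `SAMPLE_OUTPUT` is a trimmed copy of the article's example, and `parse_exposed_url` is a hypothetical helper name.

```python
import re

# Trimmed copy of the example output shown in the article.
SAMPLE_OUTPUT = """\
Job started id: 6a381ca1953ed90bfb947332
Hint: Exposed ports are reachable at:
https://6a381ca1953ed90bfb947332--8000.hf.jobs
"""


def parse_exposed_url(cli_output: str) -> tuple[str, str]:
    """Return (job_id, OpenAI-style base URL) from the exposed-port line."""
    match = re.search(r"https://([0-9A-Za-z]+)--\d+\.hf\.jobs", cli_output)
    if match is None:
        raise ValueError("no exposed-port URL found in CLI output")
    return match.group(1), f"{match.group(0)}/v1"


job_id, base_url = parse_exposed_url(SAMPLE_OUTPUT)
print(job_id)
print(base_url)
```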

Watch out for: CLI syntax can move over time. If your local command rejects a flag, check your installed huggingface_hub version and the current Jobs help output.

Wait for vLLM to finish downloading and loading weights

Do not hit the endpoint immediately and assume failure.

Hugging Face says to give the job a couple of minutes to download weights and boot. The useful readiness signal is:

Application startup complete

Until then, the container may be alive while the model is still loading.
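
Instead of watching the logs, a client can poll the endpoint until it answers. The sketch below assumes that `/v1/models` fails (a connection error or a non-200 status) while the model is still loading and returns 200 once vLLM is up. That behavior is an assumption about the proxy and server, not something the article states. The function names, the 10-minute timeout, and the 10-second interval are illustrative.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ready(base_url: str, token: str) -> bool:
    """Return True once GET {base_url}/models answers with HTTP 200."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx), and socket timeouts are all OSError
        # subclasses; treat every one of them as "not ready yet".
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() every interval_s seconds until it passes or time runs out."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (fill in the job ID; obtain the token with `hf auth token`):
# ready = wait_until_ready(
#     lambda: models_endpoint_ready("https://<job_id>--8000.hf.jobs/v1", token)
# )
```

Passing `sleep` and `clock` as parameters lets you exercise the loop in tests without waiting in real time.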

Common failure points to check:

Permissions: Your token may not have read access to the job namespace or model.
Memory: The model may not fit the selected GPU flavor.
Arguments: A vLLM flag may not match the model or installed server behavior.
Startup time: Larger models take longer to download and load.

For deeper debugging, Hugging Face supports SSH into a running job if you launch with --ssh and have your public key registered at huggingface.co/settings/keys.

hf jobs ssh <job_id>

Inside the container, you can run tools such as:

nvidia-smi

Test the endpoint with an OpenAI-style request

First, check that the model endpoint responds:

curl https://<job_id>--8000.hf.jobs/v1/models \
  -H "Authorization: Bearer $(hf auth token)"

Then send a chat request:

curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Hugging Face says this returns normal OpenAI-style JSON, with the assistant text in:

choices[0].message.content

You can also use the OpenAI Python client by pointing it at the HF Jobs URL:

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<job_id>--8000.hf.jobs/v1",
    api_key=get_token(),
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(resp.choices[0].message.content)

Watch out for: the exposed URL is not an open public API. Every request needs the bearer token.

Tune larger jobs without guessing blindly

For bigger models, Hugging Face’s example uses:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256

The important tuning points are:

--tensor-parallel-size 2: Matches the two GPUs in h200x2.
--max-model-len 32768: Caps context length.
--max-num-seqs 256: Caps concurrent sequences.

Hugging Face says Qwen3.5-122B-A10B has a 256K-token default context, and that default does not leave enough memory for vLLM’s default batch settings in this setup. Capping context length and concurrent sequence count keeps it inside GPU memory.

If startup fails with an out-of-memory or cache-block error, Hugging Face recommends dialing down those two settings first.
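
One way to dial down systematically is to generate a short list of fallback settings in advance. The halving schedule and the floor values below are assumptions made for illustration. The article only says to reduce the two settings, not by how much, and `dial_down` is a hypothetical helper.

```python
from typing import Iterator


def dial_down(
    max_model_len: int, max_num_seqs: int, steps: int = 3
) -> Iterator[tuple[int, int]]:
    """Yield progressively smaller (max_model_len, max_num_seqs) pairs."""
    for _ in range(steps):
        # Halve both caps each step, with arbitrary floors of 1024 and 1.
        max_model_len = max(max_model_len // 2, 1024)
        max_num_seqs = max(max_num_seqs // 2, 1)
        yield max_model_len, max_num_seqs


for length, seqs in dial_down(32768, 256):
    print(f"--max-model-len {length} --max-num-seqs {seqs}")
```

Starting from the article's values, this prints candidates of 16384/128, 8192/64, and 4096/32 to try in order.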

Keep tokens out of places they do not belong

The HF Jobs proxy acts as the API gate. That is convenient, but it means token handling is part of the deployment.

Use this safer pattern:

Authenticate locally with hf auth login.
Pass tokens as bearer tokens in requests.
Avoid hard-coding tokens into shell scripts, notebooks, or logs.
Review command history after failed experiments if you typed credentials manually.

Hugging Face’s own examples use:

-H "Authorization: Bearer $(hf auth token)"

and Python’s:

api_key=get_token()

That keeps the token out of the literal request code.
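
For scripts that run outside an interactive login, a similar pattern is to read the token from the environment and redact it in any log output. This sketch assumes the token is exported as `HF_TOKEN`, the environment variable that `huggingface_hub` reads. The helper names are illustrative.

```python
import os


def token_from_env(name: str = "HF_TOKEN") -> str:
    """Read the token from the environment instead of hard-coding it."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; run `hf auth login` or export it")
    return token


def bearer_headers(token: str) -> dict[str, str]:
    """Build the Authorization header every request to the endpoint needs."""
    return {"Authorization": f"Bearer {token}"}


def redact(token: str, keep: int = 4) -> str:
    """Shorten a token for logs so the full credential is never printed."""
    if len(token) <= keep:
        return "*" * len(token)
    return token[:keep] + "*" * 8


print(redact("hf_exampleexampleexample"))
```

Logging only the redacted form means a pasted traceback or notebook cell does not leak the credential.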

Add a local chat UI if curl is too barebones

If you prefer a browser chat window, Hugging Face shows a Gradio example that points at the same endpoint.

For Qwen3 reasoning output, add this to the vllm serve command:

--reasoning-parser deepseek_r1

Then your local Gradio app can stream reasoning into a collapsible panel and the answer below it. The server remains the same HF Jobs endpoint; only the client changes.

This is useful for quick human testing before wiring the endpoint into an app prototype.

Stop the HF Jobs server when the test is done

Jobs are billed while they run, so cleanup is not optional.

Cancel the job:

hf jobs cancel <job_id>

The --timeout 2h flag is only a safety net. Hugging Face says cancelling explicitly is cheaper.

Before you shut it down, save the working recipe:

Model ID
GPU flavor
vLLM image
Port
Tuning flags
Successful test request

That turns a one-off experiment into a repeatable deployment command.
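
A saved recipe can be as small as a data class that rebuilds the launch and cancel commands. The sketch below uses only the flags that appear in this guide, with the timeout written as `--timeout 2h`. The `Recipe` class and `cancel_argv` function are hypothetical names, and the commands are printed rather than executed.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Recipe:
    """The pieces of a working deployment worth keeping."""
    model: str
    flavor: str
    image: str = "vllm/vllm-openai:latest"
    port: int = 8000
    timeout: str = "2h"
    extra_flags: list[str] = field(default_factory=list)

    def launch_argv(self) -> list[str]:
        """Rebuild the `hf jobs run` command as an argument list."""
        return [
            "hf", "jobs", "run",
            "--flavor", self.flavor,
            "--expose", str(self.port),
            "--timeout", self.timeout,
            self.image,
            "vllm", "serve", self.model,
            "--host", "0.0.0.0", "--port", str(self.port),
            *self.extra_flags,
        ]


def cancel_argv(job_id: str) -> list[str]:
    """Build the cleanup command for a given job ID."""
    return ["hf", "jobs", "cancel", job_id]


recipe = Recipe(model="Qwen/Qwen3-4B", flavor="a10g-large")
print(shlex.join(recipe.launch_argv()))
print(shlex.join(cancel_argv("<job_id>")))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess.run` later without shell-quoting problems.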

Choose HF Jobs for experiments, Inference Endpoints for durable service

Hugging Face draws a clear line between the two options.

Need	Better fit	Why
Fast experiments, evals, batch generation	HF Jobs	You choose the image, hardware, and `vllm serve` flags; you pay while it runs
Long-lived managed service	Inference Endpoints	Adds access-control modes and scale-to-zero

HF Jobs is the fastest path from terminal to GPU-backed inference: authenticate, pick a model, run hf jobs run, wait for vLLM, test /v1/models, send a chat request, then cancel the job.

The next practical step is to wrap the launch and cleanup commands in a small project script. After that, test the same flow with a larger model or connect the endpoint to an internal prototype — but keep the timeout short until the memory settings and cost profile are proven.

Key Takeaways

Developers can launch a private OpenAI-compatible vLLM endpoint without provisioning VMs or Kubernetes.
HF Jobs makes temporary GPU-backed model serving practical for testing, evals, and batch workloads.
The workflow gives teams a fast path to validate models like Qwen/Qwen3-4B before committing to longer-lived infrastructure.

HF Jobs vs. Hugging Face Inference Endpoints

Option	Best For	Operational Model
HF Jobs	Tests, evals, batch generation, and quick model trials	Run a private vLLM container on demand with billing tied to job runtime
Inference Endpoints	Long-lived managed model serving	Managed service for persistent production-style deployments

MLXIO Insights Team
