AI / ML · June 25, 2026 · 9 min read · By MLXIO Insights Team

One Command Spins Up a Private vLLM Server on HF Jobs

MLXIO Intelligence

Analysis Snapshot

47
Moderate
Confidence: Low · Trend: 10 · Freshness: 1 · Source Trust: 85 · Factual Grounding: 92 · Signal Cluster: 20

Moderate MLXIO Impact based on trend velocity, freshness, source trust, and factual grounding.

Thesis

High Confidence

Hugging Face Jobs can launch a private OpenAI-compatible vLLM endpoint with a single `hf jobs run` command, making it suited to temporary tests, evals, batch generation, and model trials rather than long-lived managed serving.

Evidence

  • The workflow uses `hf jobs run` with the official `vllm/vllm-openai` container and exposes port 8000.
  • The resulting endpoint follows the format `https://<job_id>--8000.hf.jobs/v1` and can be called from curl, Python, notebooks, or a local UI.
  • Jobs requires billing access, `huggingface_hub>=1.20.0`, and `hf auth login` before launch.
  • Hugging Face points users needing a long-lived managed service toward Inference Endpoints instead.

Uncertainty

  • The article does not include complete operational details for scaling, reliability, or production hardening.
  • Actual hardware availability and pricing may vary and must be checked with `hf jobs hardware`.
  • Endpoint access depends on HF token permissions for the job namespace.

What To Watch

  • Whether Hugging Face expands HF Jobs hardware flavors or changes pricing.
  • Adoption of HF Jobs for short-lived inference, eval, and batch-generation workflows.
  • Further guidance from Hugging Face on when to use Jobs versus Inference Endpoints.

Verified Claims

A private OpenAI-compatible vLLM endpoint can be launched on Hugging Face Jobs with a single command using the official vllm/vllm-openai container.
📎 The workflow uses `hf jobs run` with the official `vllm/vllm-openai` container and exposes port 8000. (High)
HF Jobs returns a job-specific URL for the vLLM endpoint in the form `https://<job_id>--8000.hf.jobs/v1`.
📎 The guide says the running vLLM server accepts OpenAI-style requests at `https://<job_id>--8000.hf.jobs/v1`. (High)
To use HF Jobs for this workflow, users need billing access, `huggingface_hub>=1.20.0`, and local authentication with `hf auth login`.
📎 The article lists billing access, Hugging Face CLI support with `huggingface_hub>=1.20.0`, and `hf auth login` as prerequisites. (High)
Requests to the exposed HF Jobs endpoint must include an HF token with read access to the job namespace.
📎 The article states that every request to the exposed endpoint must carry an HF token with read access and that a plain browser visit will be rejected. (High)
For larger vLLM runs on multi-GPU HF Jobs flavors, `--tensor-parallel-size` should match the number of GPUs in the flavor.
📎 The article says Hugging Face shows Qwen/Qwen3.5-122B-A10B on `h200x2` using tensor parallelism and that `--tensor-parallel-size` should match the number of GPUs. (High)

Frequently Asked

How do you launch a private OpenAI-compatible vLLM server on Hugging Face Jobs?

Use `hf jobs run` with the official `vllm/vllm-openai` container, request a GPU flavor, expose port 8000, and serve the model. The endpoint is then available at a job-specific `/v1` URL.

What URL does an HF Jobs vLLM endpoint use?

The article gives the endpoint format as `https://<job_id>--8000.hf.jobs/v1`.

What do you need before running a vLLM server on HF Jobs?

You need billing access through a payment method or positive prepaid credit balance, `huggingface_hub>=1.20.0`, and local login with `hf auth login`.

Is the vLLM endpoint on HF Jobs public?

No. The article says requests must include an HF token with read access to the job namespace, and a plain browser visit will be rejected.

When should Hugging Face Inference Endpoints be used instead of HF Jobs?

The article says HF Jobs is practical for tests, evals, batch generation, or quick model trials, while Hugging Face points users to Inference Endpoints for a long-lived managed service.

Updated on June 25, 2026

One command can stand up a private, OpenAI-compatible vLLM endpoint on Hugging Face Jobs — with no VM setup, no Kubernetes, and billing tied to how long the job runs.

The workflow, published by the Hugging Face Blog, uses hf jobs run with the official vllm/vllm-openai container, exposes port 8000, and returns a job-specific URL you can hit from curl, Python, a notebook, or a local UI.

“You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.”

That makes this a practical path for tests, evals, batch generation, or quick model trials. If you need a long-lived managed service, Hugging Face points users toward Inference Endpoints instead. More on that trade-off below.


Launch an OpenAI-compatible vLLM API on HF Jobs in minutes

By the end of this guide, you’ll have a running vLLM server hosted through HF Jobs that accepts OpenAI-style requests at:

https://<job_id>--8000.hf.jobs/v1
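As a small illustration of that URL pattern, the sketch below builds the base URL from a job ID. The helper name `endpoint_base_url` is hypothetical, and the job ID is the sample one from Hugging Face's example output; the only thing taken from the guide is the `https://<job_id>--<port>.hf.jobs/v1` shape.

```python
def endpoint_base_url(job_id: str, port: int = 8000) -> str:
    """Build the OpenAI-compatible base URL for a port exposed by an HF Job.

    Hypothetical helper: it only formats the URL pattern shown in the guide.
    """
    job_id = job_id.strip()
    if not job_id:
        raise ValueError("job_id must not be empty")
    return f"https://{job_id}--{port}.hf.jobs/v1"


# Sample job ID taken from Hugging Face's example output.
print(endpoint_base_url("6a381ca1953ed90bfb947332"))
# prints: https://6a381ca1953ed90bfb947332--8000.hf.jobs/v1
```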

The core move is simple: run a container on Hugging Face infrastructure, ask for a GPU, expose vLLM’s API port, and serve a model.

The example from Hugging Face uses Qwen/Qwen3-4B on an a10g-large flavor. That is a sensible first test because it keeps the command small and lets you validate the flow before moving to larger models.

For readers tracking how temporary AI compute fits into broader engineering decisions, this sits alongside the infrastructure questions we covered in Key Trends Reveal the Next Tech and Finance Shake-Up and Future Trends Everyone Keeps Misreading — Here's Why. This guide is the hands-on version: launch, test, stop.

Before you start: install the HF CLI and authenticate

You need three things before the command works:

  • Billing access: Hugging Face says Jobs requires a payment method or a positive prepaid credit balance.
  • Hugging Face CLI support: Install or upgrade to huggingface_hub>=1.20.0.
  • Local authentication: Log in with hf auth login.

Run:

pip install -U "huggingface_hub>=1.20.0"
hf auth login

Your token matters twice.

First, the CLI needs it to launch the job. Second, every request to the exposed endpoint must carry an HF token with read access to the job namespace. Hugging Face is explicit: a plain browser visit will be rejected.

Watch out for: do not paste tokens into shared notebooks, tickets, or screenshots. The endpoint is gated, not public, but the token is still the credential that unlocks it.

Pick a vLLM-ready model and match it to the GPU flavor

Start with the model Hugging Face uses in its guide:

Qwen/Qwen3-4B

That keeps your first run close to the documented path. After that, you can move up.

Hugging Face shows a larger example with Qwen/Qwen3.5-122B-A10B on h200x2, using tensor parallelism across two GPUs. The key rule from the source: --tensor-parallel-size should match the number of GPUs in the flavor.

| Use case | Example model | HF Jobs flavor shown | Extra vLLM setting |
| --- | --- | --- | --- |
| First test server | Qwen/Qwen3-4B | a10g-large | None beyond host/port |
| Larger model run | Qwen/Qwen3.5-122B-A10B | h200x2 | --tensor-parallel-size 2 |
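One way to keep `--tensor-parallel-size` in step with the flavor is to derive it from the flavor name. The sketch below assumes that a trailing `xN` in a flavor name (as in `h200x2`) gives the GPU count and that names without it are single-GPU. That naming convention is an inference from the examples here, not something the guide states, so confirm it against `hf jobs hardware`. Both function names are hypothetical.

```python
import re


def gpus_in_flavor(flavor: str) -> int:
    """Guess the GPU count from a flavor name.

    Assumption: a trailing "x<N>" means N GPUs; anything else means one GPU.
    """
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def tensor_parallel_args(flavor: str) -> list:
    """Return the extra vLLM arguments for a flavor, or [] for a single GPU."""
    gpus = gpus_in_flavor(flavor)
    return ["--tensor-parallel-size", str(gpus)] if gpus > 1 else []


print(tensor_parallel_args("a10g-large"))  # prints: []
print(tensor_parallel_args("h200x2"))      # prints: ['--tensor-parallel-size', '2']
```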

Before choosing hardware, run:

hf jobs hardware

Hugging Face says an a10g-large runs at $1.50/hour, and recommends checking hf jobs hardware for the full price list. The practical rule: pick the smallest flavor that fits your model, then scale only when memory or latency forces it.
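To make the cost side concrete, here is a rough estimate based on the $1.50/hour figure quoted for a10g-large and the pay-per-second billing Hugging Face describes. The function name is hypothetical, and the exact billing granularity and rounding are assumptions, so treat the result as a ballpark rather than an invoice.

```python
def estimate_cost_usd(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Estimate job cost, assuming billing is linear in runtime seconds."""
    if hourly_rate_usd < 0 or runtime_seconds < 0:
        raise ValueError("rate and runtime must be non-negative")
    return round(hourly_rate_usd * runtime_seconds / 3600, 4)


A10G_LARGE_RATE = 1.50  # USD per hour, as quoted in the guide

print(estimate_cost_usd(A10G_LARGE_RATE, 20 * 60))    # 20-minute test: 0.5
print(estimate_cost_usd(A10G_LARGE_RATE, 2 * 3600))   # full 2-hour timeout: 3.0
```

A 20-minute test on this flavor comes to about $0.50 under those assumptions, against $3.00 if the job runs to a two-hour timeout.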

Run the one-command HF Jobs deployment for a vLLM OpenAI server

Here is the minimal launch command from the Hugging Face example:

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

What each part does:

  • hf jobs run: Runs a container on Hugging Face infrastructure.
  • --flavor a10g-large: Selects the GPU hardware.
  • --expose 8000: Routes container port 8000 through the HF Jobs proxy.
  • --timeout 2h: Sets a safety stop after two hours.
  • vllm/vllm-openai:latest: Uses the official vLLM OpenAI-compatible image.
  • vllm serve Qwen/Qwen3-4B: Starts the model server.
  • --host 0.0.0.0 --port 8000: Makes vLLM listen on the exposed port.
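If you expect to launch more than once, it can help to assemble the command programmatically so the exposed port and the vLLM port cannot drift apart. The sketch below only builds and prints the argument list; it does not start a job. The function name is hypothetical, the flags are the ones used in this guide, and the timeout is written as a single `--timeout` flag, which is an assumption about the CLI that you should check against your installed version.

```python
import shlex
from typing import List, Sequence


def build_launch_command(
    model: str,
    flavor: str = "a10g-large",
    port: int = 8000,
    timeout: str = "2h",
    image: str = "vllm/vllm-openai:latest",
    extra_vllm_args: Sequence[str] = (),
) -> List[str]:
    """Assemble the `hf jobs run` argument list without executing it."""
    return [
        "hf", "jobs", "run",
        "--flavor", flavor,
        "--expose", str(port),      # port routed through the HF Jobs proxy
        "--timeout", timeout,       # safety stop (assumed flag spelling)
        image,
        "vllm", "serve", model,
        "--host", "0.0.0.0",
        "--port", str(port),        # must match the exposed port
        *extra_vllm_args,
    ]


cmd = build_launch_command("Qwen/Qwen3-4B")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the job; printing it first makes it easy to review before any billing starts.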

The command prints a job URL and an exposed endpoint. Hugging Face’s example output looks like this:

✓ Job started id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332

Hint: Exposed ports are reachable at:
https://6a381ca1953ed90bfb947332--8000.hf.jobs

Save the job ID. You will need it for the API URL, SSH, and cleanup.
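If you capture the CLI output in a script, the job ID and port can be pulled from the exposed URL rather than copied by hand. This is a sketch that assumes the output keeps the shape of Hugging Face's example above; the output format is not a stable interface, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches URLs such as https://<job_id>--8000.hf.jobs
EXPOSED_URL = re.compile(r"https://(?P<job_id>[0-9A-Za-z]+)--(?P<port>\d+)\.hf\.jobs")

SAMPLE_OUTPUT = """\
✓ Job started id: 6a381ca1953ed90bfb947332
url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332

Hint: Exposed ports are reachable at:
https://6a381ca1953ed90bfb947332--8000.hf.jobs
"""


def parse_exposed_endpoint(cli_output: str) -> Optional[Tuple[str, int]]:
    """Return (job_id, port) from the first exposed URL, or None if absent."""
    match = EXPOSED_URL.search(cli_output)
    if match is None:
        return None
    return match.group("job_id"), int(match.group("port"))


print(parse_exposed_endpoint(SAMPLE_OUTPUT))
# prints: ('6a381ca1953ed90bfb947332', 8000)
```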

Watch out for: CLI syntax can move over time. If your local command rejects a flag, check your installed huggingface_hub version and the current Jobs help output.

Wait for vLLM to finish downloading and loading weights

Do not hit the endpoint immediately and assume failure.

Hugging Face says to give the job a couple of minutes to download weights and boot. The useful readiness signal is:

Application startup complete

Until then, the container may be alive while the model is still loading.
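Rather than retrying by hand, a script can poll until the server answers. The sketch below treats an HTTP 200 from `/v1/models` as "ready"; that is an assumption, since the readiness signal the guide names is the log line, and what the proxy returns while the model is loading is not documented here. Function names, the attempt count, and the delay are all hypothetical choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_status(base_url: str, token: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of GET {base_url}/models, or 0 if unreachable."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:  # server answered with an error code
        return err.code
    except OSError:  # connection refused, DNS failure, timeout
        return 0


def wait_until_ready(
    check: Callable[[], int],
    attempts: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if check() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A typical call would be `wait_until_ready(lambda: models_status(base_url, token))`, with `base_url` set to the job's `/v1` URL and `token` coming from `huggingface_hub.get_token()`.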

Common failure points to check:

  • Permissions: Your token may not have read access to the job namespace or model.
  • Memory: The model may not fit the selected GPU flavor.
  • Arguments: A vLLM flag may not match the model or installed server behavior.
  • Startup time: Larger models take longer to download and load.

For deeper debugging, Hugging Face supports SSH into a running job if you launch with --ssh and have your public key registered at huggingface.co/settings/keys.

hf jobs ssh <job_id>

Inside the container, you can run tools such as:

nvidia-smi

Test the endpoint with an OpenAI-style request

First, check that the model endpoint responds:

curl https://<job_id>--8000.hf.jobs/v1/models \
  -H "Authorization: Bearer $(hf auth token)"

Then send a chat request:

curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Hugging Face says this returns normal OpenAI-style JSON, with the assistant text in:

choices[0].message.content

You can also use the OpenAI Python client by pointing it at the HF Jobs URL:

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
    base_url="https://<job_id>--8000.hf.jobs/v1",
    api_key=get_token(),
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

print(resp.choices[0].message.content)

Watch out for: the exposed URL is not an open public API. Every request needs the bearer token.


Tune larger jobs without guessing blindly

For bigger models, Hugging Face’s example uses:

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256

The important tuning points are:

  • --tensor-parallel-size 2: Matches the two GPUs in h200x2.
  • --max-model-len 32768: Caps context length.
  • --max-num-seqs 256: Caps concurrent sequences.

Hugging Face says Qwen3.5-122B-A10B has a 256K-token default context, and that default does not leave enough memory for vLLM’s default batch settings in this setup. Capping context length and concurrent sequence count keeps it inside GPU memory.

If startup fails with an out-of-memory or cache-block error, Hugging Face recommends dialing down those two settings first.
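"Dialing down" can be made systematic by writing out the settings you intend to try before you start relaunching. The sketch below halves `--max-num-seqs` first and then `--max-model-len`, stopping at floor values. The ordering, the halving step, and the floors are all arbitrary assumptions for illustration, not a recommendation from Hugging Face or vLLM.

```python
from typing import List, Tuple


def fallback_settings(
    max_model_len: int = 32768,
    max_num_seqs: int = 256,
    min_model_len: int = 8192,   # hypothetical floor for context length
    min_num_seqs: int = 32,      # hypothetical floor for concurrency
) -> List[Tuple[int, int]]:
    """List (max_model_len, max_num_seqs) pairs to try, largest first."""
    settings = [(max_model_len, max_num_seqs)]
    # Reduce concurrency first, keeping the context window intact.
    while max_num_seqs // 2 >= min_num_seqs:
        max_num_seqs //= 2
        settings.append((max_model_len, max_num_seqs))
    # Then shrink the context window.
    while max_model_len // 2 >= min_model_len:
        max_model_len //= 2
        settings.append((max_model_len, max_num_seqs))
    return settings


for model_len, num_seqs in fallback_settings():
    print(f"--max-model-len {model_len} --max-num-seqs {num_seqs}")
```

Each relaunch is billed, so it is worth keeping this list short and cancelling a failed job before trying the next pair.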

Keep tokens out of places they do not belong

The HF Jobs proxy acts as the API gate. That is convenient, but it means token handling is part of the deployment.

Use this safer pattern:

  1. Authenticate locally with hf auth login.
  2. Pass tokens as bearer tokens in requests.
  3. Avoid hard-coding tokens into shell scripts, notebooks, or logs.
  4. Review command history after failed experiments if you typed credentials manually.

Hugging Face’s own examples use:

-H "Authorization: Bearer $(hf auth token)"

and Python’s:

api_key=get_token()

That keeps the token out of the literal request code.
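The same idea extends to logging: build the header from a token held in memory, and mask the token before anything is printed. This is a minimal sketch with hypothetical helper names; `hf_example` is a placeholder string, not a real token, and in practice the value would come from `huggingface_hub.get_token()`.

```python
def auth_headers(token: str) -> dict:
    """Build the bearer header expected by the HF Jobs proxy."""
    if not token:
        raise ValueError("no token available; run `hf auth login` first")
    return {"Authorization": f"Bearer {token}"}


def redact(text: str, token: str) -> str:
    """Replace every occurrence of the token before text is logged."""
    return text.replace(token, "hf_***") if token else text


token = "hf_example"  # placeholder for illustration only
headers = auth_headers(token)
print(redact(f"sending headers: {headers}", token))
# prints: sending headers: {'Authorization': 'Bearer hf_***'}
```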

Add a local chat UI if curl is too barebones

If you prefer a browser chat window, Hugging Face shows a Gradio example that points at the same endpoint.

For Qwen3 reasoning output, add this to the vllm serve command:

--reasoning-parser deepseek_r1

Then your local Gradio app can stream reasoning into a collapsible panel and the answer below it. The server remains the same HF Jobs endpoint; only the client changes.

This is useful for quick human testing before wiring the endpoint into an app prototype.

Stop the HF Jobs server when the test is done

Jobs are billed while they run, so cleanup is not optional.

Cancel the job:

hf jobs cancel <job_id>

The --timeout 2h flag is only a safety net. Hugging Face says cancelling explicitly is cheaper.
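In a script, one way to make sure the cancel step always runs is to wrap the work in a context manager, so the job is cancelled even when a test raises. The sketch below shells out to the `hf jobs cancel` command from this guide; the `running_job` and `cancel_job` names are hypothetical, and the example at the bottom uses a stand-in cancel function so that nothing is actually called.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator


def cancel_job(job_id: str) -> None:
    """Cancel a job through the CLI; a failed cancel does not raise."""
    subprocess.run(["hf", "jobs", "cancel", job_id], check=False)


@contextmanager
def running_job(job_id: str, cancel: Callable[[str], None] = cancel_job) -> Iterator[str]:
    """Yield the job ID and always cancel the job on exit."""
    try:
        yield job_id
    finally:
        cancel(job_id)


# Demonstration with a stand-in cancel function (no CLI call is made).
cancelled = []
try:
    with running_job("example-job-id", cancel=cancelled.append):
        raise RuntimeError("test failed midway")
except RuntimeError:
    pass
print(cancelled)  # prints: ['example-job-id']
```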

Before you shut it down, save the working recipe:

  • Model ID
  • GPU flavor
  • vLLM image
  • Port
  • Tuning flags
  • Successful test request

That turns a one-off experiment into a repeatable deployment command.
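A lightweight way to save that recipe is a small JSON file checked into the project. The sketch below is one possible shape, with a hypothetical `ServingRecipe` class whose fields mirror the checklist above; the values are the ones from the first test server in this guide.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ServingRecipe:
    """Everything needed to relaunch the same server later."""
    model_id: str
    flavor: str
    image: str
    port: int
    vllm_args: list = field(default_factory=list)      # tuning flags
    test_request: dict = field(default_factory=dict)   # a request known to work


def dump_recipe(recipe: ServingRecipe) -> str:
    """Serialize the recipe as stable, diff-friendly JSON."""
    return json.dumps(asdict(recipe), indent=2, sort_keys=True)


recipe = ServingRecipe(
    model_id="Qwen/Qwen3-4B",
    flavor="a10g-large",
    image="vllm/vllm-openai:latest",
    port=8000,
    vllm_args=["--host", "0.0.0.0", "--port", "8000"],
    test_request={
        "model": "Qwen/Qwen3-4B",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(dump_recipe(recipe))
```

Pinning the image to a specific tag instead of `latest` would make the recipe more reproducible, at the cost of updating it by hand.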

Choose HF Jobs for experiments, Inference Endpoints for durable service

Hugging Face draws a clear line between the two options.

| Need | Better fit | Why |
| --- | --- | --- |
| Fast experiments, evals, batch generation | HF Jobs | You choose the image, hardware, and vllm serve flags; you pay while it runs |
| Long-lived managed service | Inference Endpoints | Adds access-control modes and scale-to-zero |

HF Jobs is the fastest path from terminal to GPU-backed inference: authenticate, pick a model, run hf jobs run, wait for vLLM, test /v1/models, send a chat request, then cancel the job.

The next practical step is to wrap the launch and cleanup commands in a small project script. After that, test the same flow with a larger model or connect the endpoint to an internal prototype — but keep the timeout short until the memory settings and cost profile are proven.

Key Takeaways

  • Developers can launch a private OpenAI-compatible vLLM endpoint without provisioning VMs or Kubernetes.
  • HF Jobs makes temporary GPU-backed model serving practical for testing, evals, and batch workloads.
  • The workflow gives teams a fast path to validate models like Qwen/Qwen3-4B before committing to longer-lived infrastructure.

HF Jobs vs. Hugging Face Inference Endpoints

| Option | Best For | Operational Model |
| --- | --- | --- |
| HF Jobs | Tests, evals, batch generation, and quick model trials | Run a private vLLM container on demand with billing tied to job runtime |
| Inference Endpoints | Long-lived managed model serving | Managed service for persistent production-style deployments |