InferenceClient runs an LLM over retrieved context — useful for RAG pipelines where you want a synthesized answer rather than raw chunk results.
Supported providers:
- OpenAI
- Anthropic (Claude) — via OpenAI-compatible API
- Redpill — private inference in TEE GPUs
- Ollama — fully local
INFERENCE_API_KEY environment variable is read automatically when api_key is not passed explicitly.
OpenAI
Anthropic (Claude)
Uses Anthropic’s OpenAI-compatible API endpoint.Redpill
Redpill provides private inference with models running in TEE (Trusted Execution Environment) GPUs, ensuring your queries remain secure during inference. This is an ideal middle ground when you need privacy protection but cannot run models locally. Key features:- Private inference: Models run in TEE GPU environments
- Unified API: Access to 200+ AI models through a single API
- Cost-effective: Transparent per-token pricing