LLM inference

InferenceClient runs an LLM over retrieved context — useful for RAG pipelines where you want a synthesized answer rather than raw chunk results. Supported providers:

OpenAI
Anthropic (Claude) — via OpenAI-compatible API
Redpill — private inference in TEE GPUs
Ollama — fully local

For end-to-end privacy, run Ollama locally. If that is not feasible, Redpill provides private inference via GPU Trusted Execution Environments (TEE). If inference privacy is not a concern, OpenAI or Anthropic can be used.

The INFERENCE_API_KEY environment variable is read automatically when api_key is not passed explicitly.
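The guidance above can be sketched as two small helpers. Both `choose_provider` and `resolve_api_key` are hypothetical names used only for illustration and are not part of xtrace_sdk; the second one mirrors the documented fallback to INFERENCE_API_KEY, but it is an assumption about the behaviour, not the SDK's actual implementation.

```python
import os
from typing import Optional

# Hypothetical helpers for illustration only; neither is part of xtrace_sdk.


def choose_provider(need_privacy: bool, can_run_locally: bool) -> str:
    """Map the privacy guidance onto one of the provider names used on this page."""
    if need_privacy and can_run_locally:
        return "ollama"   # fully local
    if need_privacy:
        return "redpill"  # private inference in TEE GPUs
    return "openai"       # "claude" is an equally valid choice in this case


def resolve_api_key(api_key: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly passed key, otherwise fall back to INFERENCE_API_KEY."""
    if api_key is not None:
        return api_key
    return os.environ.get("INFERENCE_API_KEY")


if __name__ == "__main__":
    os.environ["INFERENCE_API_KEY"] = "key-from-environment"
    print(choose_provider(need_privacy=True, can_run_locally=False))
    print(resolve_api_key())                # falls back to the environment
    print(resolve_api_key("explicit-key"))  # explicit argument wins
```

In practice this means you can export INFERENCE_API_KEY once in your shell and omit api_key from every InferenceClient call shown below.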

OpenAI

from xtrace_sdk.x_vec.inference.llm import InferenceClient

inference = InferenceClient(inference_provider="openai", model_name="gpt-4o", api_key="your_api_key")
inference.query("How many r's are in the word strawberry?")

For supported models, refer to the OpenAI documentation.

Anthropic (Claude)

Uses Anthropic’s OpenAI-compatible API endpoint.

from xtrace_sdk.x_vec.inference.llm import InferenceClient

inference = InferenceClient(inference_provider="claude", model_name="claude-sonnet-4-6", api_key="your_api_key")
inference.query("How many r's are in the word strawberry?")

For supported models, refer to the Anthropic documentation.

Redpill

Redpill runs models on GPUs inside a Trusted Execution Environment (TEE), which keeps your queries protected during inference. It is a middle ground for cases where you need privacy protection but cannot run models locally. Key features:

Private inference: Models run in TEE GPU environments
Unified API: Access to 200+ AI models through a single API
Cost-effective: Transparent per-token pricing

from xtrace_sdk.x_vec.inference.llm import InferenceClient

inference = InferenceClient(inference_provider="redpill", model_name="deepseek/deepseek-v3-0324", api_key="your_api_key")
inference.query("How many r's are in the word strawberry?")

Create an API key at redpill.ai. For the full model list and pricing, see the Redpill docs.

Ollama

Ollama runs entirely locally, providing the strongest inference privacy guarantee.

from xtrace_sdk.x_vec.inference.llm import InferenceClient

# The api_key value is a placeholder: a local Ollama server does not validate API keys.
inference = InferenceClient(inference_provider="ollama", model_name="llama3.3", api_key="ollama")
inference.query("How many r's are in the word strawberry?")

For Ollama setup instructions, see the Ollama installation docs.
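Before pointing InferenceClient at Ollama, it can help to confirm that a local server is actually running. The sketch below is independent of xtrace_sdk and uses only the standard library. It assumes Ollama's default address, http://localhost:11434, and its /api/tags endpoint, which lists locally pulled models; `ollama_is_running` is a hypothetical helper name, and the URL should be adjusted if your server listens elsewhere.

```python
import http.client
import urllib.request

# Assumption: Ollama's default local address; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the server answers /api/tags with HTTP 200, False otherwise."""
    try:
        # /api/tags lists the models that have been pulled locally.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and HTTP error responses
        # (urllib.error.URLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

If the check returns False, start the Ollama server and pull the model you plan to pass as model_name before running the example above.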
