LLM Inference
InferenceClient runs an LLM over retrieved context — useful for RAG pipelines where
you want a synthesized answer rather than raw chunk results.
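To make "an LLM over retrieved context" concrete, here is a minimal sketch of how retrieved chunks might be folded into a single prompt string. The build_prompt helper and its prompt wording are hypothetical and are not part of xtrace_sdk. The SDK may assemble context on its own, so treat this as an illustration of the idea only.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Hypothetical helper (not part of xtrace_sdk): combine chunks and a question."""
    # Number each chunk so the model's answer can be traced back to a source chunk.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "Which providers run locally?",
    ["Ollama runs entirely locally.", "Redpill runs models in TEE GPUs."],
)
print(prompt)
```

Assuming query accepts a plain string, as the examples on this page suggest, the resulting prompt could be passed to it directly.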
Supported providers:
OpenAI
Anthropic (Claude) — via OpenAI-compatible API
Redpill — private inference in TEE GPUs
Ollama — fully local
For end-to-end privacy, run Ollama locally. If that is not feasible, Redpill provides private inference via GPU Trusted Execution Environments (TEE). If inference privacy is not a concern, OpenAI or Anthropic can be used.
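The guidance above can be read as a small decision rule. The sketch below encodes it in a hypothetical choose_provider function. The function and its parameter names are illustrative assumptions, while the returned strings are the inference_provider values used in the examples on this page.

```python
def choose_provider(needs_privacy: bool, can_run_locally: bool) -> str:
    """Hypothetical helper: map privacy requirements to an inference_provider value."""
    if not needs_privacy:
        # Inference privacy is not a concern: a hosted provider is acceptable
        # ("claude" would be an equally valid choice here).
        return "openai"
    if can_run_locally:
        # Strongest guarantee: queries never leave the machine.
        return "ollama"
    # Privacy is needed but local inference is not feasible: use TEE GPUs.
    return "redpill"


print(choose_provider(needs_privacy=True, can_run_locally=False))
```

The returned value can then be passed as inference_provider when constructing InferenceClient.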
The INFERENCE_API_KEY environment variable is read automatically when api_key is
not passed explicitly.
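The fallback described above can be pictured with the following sketch. The resolve_api_key function and its env parameter are hypothetical and only mirror the documented behavior. They are not the SDK's actual implementation.

```python
import os


def resolve_api_key(api_key=None, env=os.environ):
    """Hypothetical helper mirroring the documented fallback (not SDK code)."""
    # An explicitly passed key always takes precedence.
    if api_key is not None:
        return api_key
    # Otherwise fall back to the INFERENCE_API_KEY environment variable;
    # the result is None when the variable is not set.
    return env.get("INFERENCE_API_KEY")


fake_env = {"INFERENCE_API_KEY": "key_from_env"}
print(resolve_api_key("explicit_key", env=fake_env))
print(resolve_api_key(env=fake_env))
```

In practice this means that, with INFERENCE_API_KEY exported in the shell, the api_key argument can be left out of the InferenceClient constructor calls shown in the sections below.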
OpenAI
from xtrace_sdk.x_vec.inference.llm import InferenceClient
inference = InferenceClient(inference_provider="openai", model_name="gpt-4o", api_key="your_api_key")
inference.query("How many r's are in the word strawberry?")
For supported models, refer to the OpenAI documentation.
Anthropic (Claude)
Uses Anthropic’s OpenAI-compatible API endpoint.
from xtrace_sdk.x_vec.inference.llm import InferenceClient
inference = InferenceClient(inference_provider="claude", model_name="claude-sonnet-4-6", api_key="your_api_key")
inference.query("How many r's are in the word strawberry?")
For supported models, refer to the Anthropic documentation.
Redpill
Redpill provides private inference with models running in TEE (Trusted Execution Environment) GPUs, ensuring your queries remain secure during inference. This is an ideal middle ground when you need privacy protection but cannot run models locally.
Key features:
Private inference: Models run in TEE GPU environments
Unified API: Access to 200+ AI models through a single API
Cost-effective: Transparent per-token pricing
from xtrace_sdk.x_vec.inference.llm import InferenceClient
inference = InferenceClient(inference_provider="redpill", model_name="deepseek/deepseek-v3-0324", api_key="your_api_key")
inference.query("How many r's are in the word strawberry?")
Create an API key at https://redpill.ai/. For the full model list and pricing, see https://docs.redpill.ai/.
Ollama
Ollama runs entirely locally, providing the strongest inference privacy guarantee.
from xtrace_sdk.x_vec.inference.llm import InferenceClient
inference = InferenceClient(inference_provider="ollama", model_name="llama3.3", api_key="ollama")
inference.query("How many r's are in the word strawberry?")
For Ollama setup instructions, see https://ollama.com/docs/installation.