LLM Inference
XTrace SDK also provides packaged inference capabilities for multiple inference service providers:
- OpenAI
- Anthropic
- Redpill
- Ollama
For end-to-end privacy, we recommend using a locally hosted Ollama instance as the inference service. If that is not feasible, users can still achieve inference privacy via GPU TEEs by using Redpill's inference service. If inference privacy is not a concern, users can use OpenAI's or Anthropic's inference service.
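For the locally hosted option, the call pattern is expected to mirror the provider examples below. Here is a minimal sketch, assuming InferenceClient accepts an "ollama" provider string, that the model name matches a model you have already pulled into your local Ollama instance, and that no API key is needed for a local service (none of these are confirmed by the examples below):
from xtrace_sdk.inference.llm import InferenceClient

# Sketch only: the "ollama" provider string, the model name, and the omitted api_key are assumptions.
inference = InferenceClient(inference_provider="ollama", model_name="llama3")
inference.query("how many r are there in the word strawberry")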
Use OpenAI as Inference Provider
from xtrace_sdk.inference.llm import InferenceClient
inference = InferenceClient(inference_provider="OpenAI", model_name="o1", api_key="your_api_key")
inference.query("how many r are there in the word strawberry")
For supported models, refer to OpenAI's documentation.
Use Redpill as Inference Provider
Why Use Redpill?
Redpill provides private inference with models running in TEE (Trusted Execution Environment) GPUs, ensuring your data and queries remain secure and private during inference. This makes it an ideal choice when you need privacy protection but cannot run models locally.
Key features:
- Private inference: models run in TEE GPU environments
- Unified API: access to 200+ AI models through a single API
- High performance: superior RPM (Requests Per Minute) and TPM (Tokens Per Minute)
- Cost-effective: transparent, token-based pricing
Registration and Setup
Create an API key at https://redpill.ai/
Usage Example:
from xtrace_sdk.inference.llm import InferenceClient
inference = InferenceClient(inference_provider="redpill", model_name="DeepSeek: DeepSeek V3 0324", api_key="your_api_key")
inference.query("how many r are there in the word strawberry")
Supported Models
Currently supported models include:
- Google: Gemma 3 27B
- OpenAI: GPT OSS 120B
- Qwen: Qwen3 Coder, Qwen2.5 VL 72B Instruct, Qwen2.5 7B Instruct
- DeepSeek: DeepSeek V3 0324
- Meta: Llama 3.3 70B Instruct
For the complete list of available models and pricing, visit https://docs.redpill.ai/
Embedding Models
XTrace SDK provides embedding models through Ollama, Sentence Transformers, and OpenAI. We recommend using Ollama for embeddings, since it can run locally and therefore provides better privacy protection. If that is not feasible, users can use OpenAI's embedding models.
Use Ollama as Embedding Provider
Ollama can be run as a local service. For more details about how to set up Ollama, please refer to https://ollama.com/docs/installation.
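Before requesting embeddings, you can verify that the local service is reachable. By default Ollama listens on http://localhost:11434; a minimal sketch using only the Python standard library:
import urllib.request

# A plain GET against Ollama's default local endpoint returns a short status message when the server is up.
try:
    with urllib.request.urlopen("http://localhost:11434", timeout=2) as resp:
        print("Ollama is reachable, HTTP status:", resp.status)
except OSError:
    print("Ollama does not appear to be running on the default port.")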
Assuming you have the Ollama service set up and running, you can use the following code to get embeddings:
from xtrace_sdk.inference.embedding import Embedding
embed = Embedding("ollama", "mxbai-embed-large", 1024)  # provider, model name, embedding dimension
vector = embed.bin_embed("how many r are there in the word strawberry")
Use Sentence Transformers as Embedding Provider
You can use Sentence Transformers to generate embeddings locally. For more details about Sentence Transformers, please refer to https://www.sbert.net/. Note that you need to have the model downloaded locally to use it. You can find the list of available models at https://www.sbert.net/docs/pretrained_models.html.
from xtrace_sdk.inference.embedding import Embedding
embed = Embedding("sentence_transformer", "mxbai-embed-large-v1", 512)  # provider, model name, embedding dimension
vector = embed.bin_embed("how many r are there in the word strawberry")
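To pre-download the model used above, you can load it once with the sentence-transformers package directly, which downloads and caches the weights locally. A minimal sketch, assuming the Hugging Face model id mixedbread-ai/mxbai-embed-large-v1 corresponds to the model name above (the exact name or path the XTrace SDK expects may differ):
from sentence_transformers import SentenceTransformer

# Loading the model once downloads and caches its weights locally (under ~/.cache by default).
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
print(model.get_sentence_embedding_dimension())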
Use OpenAI as Embedding Provider
If privacy is not a concern, you can use OpenAI's embedding models. For supported models, refer to OpenAI's documentation. Note that you need to set your OpenAI API key as an environment variable to use OpenAI's embedding models.
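For example, you can set the key from Python before constructing the embedding client; a minimal sketch, assuming the SDK reads the standard OPENAI_API_KEY variable used by OpenAI's client libraries:
import os

# Sets the key for the current process only; in practice prefer exporting it in your shell
# or loading it from a secrets manager instead of hard-coding it.
os.environ["OPENAI_API_KEY"] = "your_api_key"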
from xtrace_sdk.inference.embedding import Embedding
embed = Embedding("openai", "text-embedding-3-small", 1536)  # provider, model name, embedding dimension
vector = embed.bin_embed("how many r are there in the word strawberry")
Bring Your Own Vectors
If you have your own vectors, you can use them directly without using any embedding models. Just make sure the vectors are in the correct format (list of floats).
from xtrace_sdk.inference.embedding import Embedding
your_vector = [0.1, -0.2, 0.3, ...] # list of floats
xtrace_compatible_vector = Embedding.float_2_bin(your_vector) # convert to list of binary integers