# Self-Hosted Models
The LIT Platform allows you to run large language models (LLMs) directly on your own infrastructure, ensuring data privacy, reducing latency, and eliminating dependency on external API services.
## Benefits of Self-Hosted Models
- Data Privacy: All data stays within your environment
- Cost Control: No per-token charges or subscription fees
- Customization: Fine-tune models for your specific use cases
- Offline Operation: Run models without internet connectivity
- Predictable Performance: Consistent response times
## Supported Models
The LIT Platform supports a wide range of open-source language models including:
- Llama 3 (8B, 70B)
- Qwen 2 (7B, 72B)
- Mistral (7B)
- Gemma (7B, 27B)
- Phi-3 (mini, small)
- DeepSeek (7B, 67B)
- Falcon (7B, 40B)
- And many more...
## Hardware Requirements
The hardware requirements depend on the model size:
| Model Size | Minimum RAM | Recommended GPU | Approximate Speed |
|---|---|---|---|
| 7-8B | 16GB | 8GB VRAM | 15-30 tokens/sec |
| 13-14B | 24GB | 16GB VRAM | 10-20 tokens/sec |
| 30-40B | 64GB | 24GB VRAM | 5-10 tokens/sec |
| 65-70B | 128GB | 48GB VRAM | 3-8 tokens/sec |
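As a rough sanity check before downloading, you can estimate a model's weight memory from its parameter count and the bytes used per parameter at a given quantization level. This is a minimal sketch; the bytes-per-parameter values and the 1.2x overhead factor are assumptions, and the figures ignore activation memory and the KV cache.

```python
# Rough estimate of weight memory for a model at a given quantization level.
# Assumption: the overhead factor (~1.2x) and bytes-per-parameter figures are
# ballpark values; actual usage also depends on context length and KV cache.

BYTES_PER_PARAM = {
    "fp16": 2.0,   # half precision
    "q8": 1.0,     # 8-bit quantization
    "q5": 0.625,   # 5-bit quantization
    "q4": 0.5,     # 4-bit quantization
}

def estimate_weight_memory_gb(n_params_billion: float, quant: str = "q4",
                              overhead: float = 1.2) -> float:
    """Approximate GB needed to hold the model weights alone."""
    bytes_total = n_params_billion * 1e9 * BYTES_PER_PARAM[quant] * overhead
    return bytes_total / (1024 ** 3)

if __name__ == "__main__":
    for size in (8, 14, 40, 70):
        print(f"{size}B @ q4: ~{estimate_weight_memory_gb(size, 'q4'):.1f} GB")
```

For example, a 70B model at 4-bit quantization works out to roughly 39 GB of weights, which is consistent with the 48GB VRAM recommendation in the table above once runtime overhead is included.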
## Setting Up a Model
1. Navigate to the Models section in the LIT interface
2. Click "Add New Model"
3. Select from available model options or provide a custom download URL (see the download sketch after these steps)
4. Choose hardware configuration (CPU/GPU, quantization level)
5. Click "Download and Configure"
6. Wait for the model to download and initialize
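If you take the custom-URL route in step 3, one common source of quantized model files is the Hugging Face Hub. The snippet below is a sketch using the `huggingface_hub` library; the repository and file names are placeholders, not models shipped with the LIT Platform.

```python
# Sketch: fetch a quantized model file from the Hugging Face Hub so you can
# point the LIT Platform at a local path or copy it to your model directory.
# The repo_id and filename below are placeholders — substitute the model you
# actually want.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="your-org/your-model-GGUF",   # placeholder repository
    filename="your-model.Q4_K_M.gguf",    # placeholder 4-bit GGUF file
)
print(f"Model downloaded to: {local_path}")
```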
## Quantization Options
To run larger models on less powerful hardware, the LIT Platform offers various quantization options:
- GGUF Format: 4-bit, 5-bit, and 8-bit quantization
- GPTQ Format: 4-bit and 8-bit quantization with optional groupsize settings
- AWQ Format: Advanced weight quantization for better quality/performance balance
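For reference, this is roughly what loading a 4-bit GGUF file looks like with the open-source `llama-cpp-python` library; this is only an illustration of what the formats above map to in practice, not a documented LIT Platform internal. The model path and parameter values are placeholders.

```python
# Sketch: loading a 4-bit GGUF model with llama-cpp-python (shown for
# illustration only — not an assertion about how the LIT Platform loads models).
from llama_cpp import Llama

llm = Llama(
    model_path="/models/example.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```

Lower bit widths shrink memory use and usually raise throughput at some cost in output quality, which is why 4-bit and 5-bit variants are the usual choice for consumer GPUs.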
## Monitoring and Management
The Models dashboard provides:
- Real-time usage statistics
- Memory consumption metrics
- Token generation speed
- Current model status
- Model versioning and updates
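If you want to double-check the dashboard's token-speed figure, a minimal measurement is the number of tokens generated divided by elapsed wall-clock time. The sketch below assumes a hypothetical `generate` callable that returns the generated text and a token count; adapt it to however your deployment exposes generation.

```python
import time
from typing import Callable, Tuple

def measure_tokens_per_second(
    generate: Callable[[str], Tuple[str, int]],  # hypothetical: returns (text, n_tokens)
    prompt: str,
) -> float:
    """Time a single generation call and return tokens per second."""
    start = time.perf_counter()
    _text, n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else 0.0
```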
## Troubleshooting
If you encounter issues with self-hosted models:
- Check system resources (memory, GPU utilization)
- Verify model compatibility with your hardware
- Adjust quantization settings for better performance
- Restart the model server if performance degrades
- Check logs for specific error messages
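For the first step above, you can check RAM and GPU headroom with a short script. This sketch uses `psutil` for system memory and shells out to `nvidia-smi` for NVIDIA GPUs; both are standard tools, but neither is bundled with the LIT Platform.

```python
# Sketch: quick resource check (assumes psutil is installed and an NVIDIA GPU
# with nvidia-smi on PATH; skip the GPU part on other hardware).
import subprocess
import psutil

mem = psutil.virtual_memory()
print(f"RAM: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GB used")

try:
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(f"GPU: {gpu.stdout.strip()}")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available — no NVIDIA GPU detected or driver missing")
```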
For models that exceed your local hardware capabilities, consider using the LIT Platform's Ollama integration to access more efficient model serving.
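If your deployment exposes a standard Ollama server (by default on port 11434), you can exercise it directly with a request like the one below. The host, port, and model name are assumptions to adjust for your setup.

```python
# Sketch: calling a standard Ollama server directly. The URL and model name
# are assumptions — adjust them to match your deployment.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hello in five words.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```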