How to Run Xiaomi MiMo-V2-Flash Locally: A Complete Installation Guide
Xiaomi's MiMo-V2-Flash represents a breakthrough in efficient AI model design, featuring 309 billion total parameters with only 15 billion active during inference. This Mixture-of-Experts architecture delivers exceptional performance while maintaining reasonable hardware requirements for local deployment. In this comprehensive guide, we'll walk you through multiple methods to run MiMo-V2-Flash locally on your machine.
Why Run MiMo-V2-Flash Locally?
Running MiMo-V2-Flash locally offers numerous advantages:
- Data Privacy: Your sensitive data never leaves your machine
- Cost Efficiency: No per-token API charges or subscription fees
- Low Latency: Direct hardware access means faster inference times
- Customization: Full control over model parameters and fine-tuning
- Offline Capability: No internet connection required after installation
- Performance: Leverages your local GPU for optimal speed
Hardware Requirements
Minimum System Requirements
| Component | Requirement | Recommended |
|---|---|---|
| GPU | NVIDIA RTX 3080 (12GB VRAM) | RTX 4090 (24GB VRAM) or A6000 |
| RAM | 32GB | 64GB or more |
| Storage | 100GB free space | 200GB+ NVMe SSD |
| CPU | Intel i7-10700K / AMD Ryzen 7 3700X | Intel i9-12900K / AMD Ryzen 9 5900X |
| CUDA | 11.8+ | 12.4+ |
Model Size Considerations
- Total Model Size: ~180GB (quantized formats)
- Peak GPU Memory: 15-20GB VRAM (active parameters)
- Context Length: 256K tokens (uses significant RAM)
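Before downloading anything, it is worth confirming that your machine actually has the headroom described above. The following is a minimal, optional check, assuming PyTorch with CUDA support is already installed; adjust the path to wherever you plan to store the model:
import shutil
import torch
# Free disk space at the intended download location (adjust the path as needed)
_, _, free = shutil.disk_usage(".")
print(f"Free disk space: {free / 1e9:.0f} GB")
# Free GPU memory, if a CUDA device is visible
if torch.cuda.is_available():
    free_vram, total_vram = torch.cuda.mem_get_info()
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Free VRAM: {free_vram / 1e9:.1f} GB of {total_vram / 1e9:.1f} GB")
else:
    print("No CUDA device detected")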
Software Prerequisites
Before installation, ensure you have:
- Python 3.10+ installed
- CUDA Toolkit 11.8+ or 12.4+
- NVIDIA Drivers (latest version)
- Git for repository cloning
Verify CUDA Installation
nvidia-smi
nvcc --version
Method 1: Install Using SGLang (Recommended)
SGLang is the recommended framework for MiMo-V2-Flash, offering optimized performance for MoE models.
Step 1: Install SGLang
# Create virtual environment
python -m venv mimo-env
source mimo-env/bin/activate # On Windows: mimo-env\Scripts\activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Install SGLang
pip install sglang
Step 2: Download the Model
# Login to Hugging Face
huggingface-cli login
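# Optional: for non-interactive use you can instead export your access token as the
# HF_TOKEN environment variable before downloading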
# Download MiMo-V2-Flash
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./models/MiMo-V2-Flash
Step 3: Launch SGLang Server
python -m sglang.launch_server \
--model-path ./models/MiMo-V2-Flash \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--dtype float16 \
--context-length 262144 \
--mem-fraction-static 0.9
Step 4: Test the Installation
import requests
import json
url = "http://localhost:30000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": "MiMo-V2-Flash",
"messages": [
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
],
"temperature": 0.7,
"max_tokens": 500
}
response = requests.post(url, headers=headers, data=json.dumps(data))
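# If the request succeeds, the assistant's reply is at
# response.json()["choices"][0]["message"]["content"] (OpenAI-compatible schema)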
print(response.json())
Method 2: Install Using Hugging Face Transformers
Step 1: Install Dependencies
pip install transformers==4.51.0
pip install accelerate
pip install bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124
Step 2: Basic Usage Script
Create run_mimo.py:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Model ID
model_id = "XiaomiMiMo/MiMo-V2-Flash"
# Load tokenizer and model
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
model_id,
trust_remote_code=True
)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True,
load_in_8bit=True, # Enable 8-bit quantization
max_memory={0: "15GB"} # Limit GPU memory usage
)
# Generate text
prompt = "Explain the concept of machine learning"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
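# Optional: if the repository ships a chat template (check tokenizer_config.json),
# wrapping the prompt with it usually improves answers from instruction-tuned models:
# messages = [{"role": "user", "content": prompt}]
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
# ).to(model.device)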
print("Generating response...")
outputs = model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7,
do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\n{response}")Step 3: Run the Script
python run_mimo.py
Method 3: Using Ollama (Experimental)
Step 1: Install Ollama
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from ollama.com
Step 2: Create Custom Modelfile
Create a Modelfile for MiMo-V2-Flash (note that Ollama generally expects GGUF-format weights, so you may first need to convert the checkpoint to GGUF and point FROM at that file):
FROM ./models/MiMo-V2-Flash
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 50
PARAMETER num_ctx 262144
Step 3: Build and Run
# Create the model
ollama create mimo-v2-flash -f Modelfile
# Run the model
ollama run mimo-v2-flash
Method 4: Docker Deployment
Step 1: Create Dockerfile
Create Dockerfile:
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate
# Copy model and application
COPY models/MiMo-V2-Flash /app/models/MiMo-V2-Flash
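# Alternative: omit this COPY and mount the model directory at runtime instead
# (the docker run command below mounts ./models into /app/models)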
COPY app.py /app/
# Expose port
EXPOSE 30000
# Launch server
CMD ["python3", "-m", "sglang.launch_server", "--model-path", "/app/models/MiMo-V2-Flash", "--host", "0.0.0.0", "--port", "30000"]Step 2: Build and Run
# Build image
docker build -t mimo-v2-flash .
# Run container
docker run --gpus all -p 30000:30000 -v $(pwd)/models:/app/models mimo-v2-flash
Advanced Configuration
Enable Flash Attention
For better performance, install Flash Attention:
pip install flash-attn --no-build-isolation
Then, if you are using the Transformers method, request the Flash Attention 2 backend when loading the model:
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16,
    attn_implementation="flash_attention_2", trust_remote_code=True
)
Memory Optimization
If you encounter out-of-memory errors:
# Use 4-bit quantization via bitsandbytes
from transformers import BitsAndBytesConfig
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
Multi-GPU Setup
For systems with multiple GPUs:
# Distribute model across GPUs
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
max_memory={0: "10GB", 1: "10GB", 2: "10GB", 3: "10GB"}
)
Performance Optimization
1. Adjust GPU Memory Utilization
python -m sglang.launch_server \
--mem-fraction-static 0.95  # use 95% of GPU memory
2. Optimize Context Length
# Reduce context length if you need more speed
--context-length 32768  # 32K instead of 256K
3. Enable Tensor Parallelism
# Use multiple GPUs for inference
--tp-size 4
Troubleshooting Common Issues
Issue 1: Out of Memory (OOM)
Solution:
# Gradient checkpointing only helps during fine-tuning, not at inference time
model.gradient_checkpointing_enable()
# For inference, keep the batch size small instead
batch_size = 1  # instead of larger values
Issue 2: CUDA Out of Memory
Solution:
- Reduce --mem-fraction-static to 0.8
- Enable quantization (8-bit or 4-bit)
- Close other GPU-intensive applications
Issue 3: Model Loading Errors
Solution:
# Clear cache and re-download
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./models/MiMo-V2-Flash --resume-download
Issue 4: Slow Inference
Solutions:
- Enable Flash Attention: pip install flash-attn
- Use tensor parallelism for multi-GPU
- Reduce context length
- Increase GPU memory utilization
Issue 5: Import Errors
Solution:
# Reinstall dependencies
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip uninstall sglang
pip install sglang
Testing Your Installation
Comprehensive Test Script
Create test_mimo.py:
import requests
import json
def test_mimo():
url = "http://localhost:30000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
# Test 1: Basic text generation
data1 = {
"model": "MiMo-V2-Flash",
"messages": [{"role": "user", "content": "Write a hello world program in Python"}],
"max_tokens": 100
}
# Test 2: Code generation
data2 = {
"model": "MiMo-V2-Flash",
"messages": [{"role": "user", "content": "Create a REST API with FastAPI"}],
"max_tokens": 200
}
# Test 3: Math reasoning
data3 = {
"model": "MiMo-V2-Flash",
"messages": [{"role": "user", "content": "Solve: What is the derivative of x^2 + 3x + 5?"}],
"max_tokens": 100
}
for i, data in enumerate([data1, data2, data3], 1):
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
print(f"Test {i} passed!")
print(response.json()["choices"][0]["message"]["content"])
else:
print(f"Test {i} failed: {response.status_code}")
if __name__ == "__main__":
test_mimo()Run the test:
python test_mimo.py
Best Practices
- Monitor GPU Usage: Use nvidia-smi to monitor GPU memory and temperature (a Python-based alternative is sketched after this list)
- Adjust Batch Size: Start with batch size 1 and increase gradually
- Use Virtual Environments: Isolate dependencies with venv or conda
- Regular Updates: Keep drivers and CUDA toolkit updated
- Backup Model: Keep a backup of the downloaded model files
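For the GPU-monitoring point above, a small Python alternative to watching nvidia-smi in a separate terminal is sketched below. It assumes the optional nvidia-ml-py package (pip install nvidia-ml-py); any NVML wrapper would work equally well:
import pynvml  # provided by the nvidia-ml-py package
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
print(f"VRAM used: {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB, temperature: {temp}°C")
pynvml.nvmlShutdown()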
Benchmarking Performance
Run a Simple Benchmark
import time
import torch
def benchmark_model(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start_time = time.time()
    outputs = model.generate(**inputs, max_new_tokens=200)
    end_time = time.time()
    # Count only newly generated tokens, not the prompt
    tokens_generated = outputs[0].shape[-1] - inputs["input_ids"].shape[-1]
    tokens_per_second = tokens_generated / (end_time - start_time)
    return tokens_per_second
# Benchmark prompt
prompt = "Write a detailed explanation of quantum computing"
tps = benchmark_model(model, tokenizer, prompt)
print(f"Tokens per second: {tps:.2f}")
Conclusion
Running Xiaomi's MiMo-V2-Flash locally is a powerful way to leverage state-of-the-art AI capabilities while maintaining privacy and control. Whether you choose SGLang for maximum performance or Hugging Face Transformers for ease of use, this guide provides all the information you need to get started.
Key Takeaways:
- SGLang is recommended for optimal performance
- Ensure adequate GPU memory (15GB+ VRAM)
- Use quantization if you have limited memory
- Experiment with context length to balance performance and speed
- Monitor GPU usage to prevent overheating
For scaling beyond local hardware or if you encounter hardware limitations, consider cloud GPU providers. Start with the recommended SGLang method and experiment with other frameworks based on your specific needs and hardware configuration.