How to Run Xiaomi MiMo-V2-Flash Locally: A Complete Installation Guide
Xiaomi's MiMo-V2-Flash represents a breakthrough in efficient AI model design, featuring 309 billion total parameters with only 15 billion active during inference. This Mixture-of-Experts architecture delivers exceptional performance while maintaining reasonable hardware requirements for local deployment. In this comprehensive guide, we'll walk you through multiple methods to run MiMo-V2-Flash locally on your machine.
Why Run MiMo-V2-Flash Locally?
Running MiMo-V2-Flash locally offers numerous advantages:
- Data Privacy: Your sensitive data never leaves your machine
- Cost Efficiency: No per-token API charges or subscription fees
- Low Latency: Direct hardware access means faster inference times
- Customization: Full control over model parameters and fine-tuning
- Offline Capability: No internet connection required after installation
- Performance: Leverages your local GPU for optimal speed
Hardware Requirements
Minimum System Requirements
| Component | Requirement | Recommended |
|---|---|---|
| GPU | NVIDIA RTX 3080 (12GB VRAM) | RTX 4090 (24GB VRAM) or A6000 |
| RAM | 32GB | 64GB or more |
| Storage | 100GB free space | 200GB+ NVMe SSD |
| CPU | Intel i7-10700K / AMD Ryzen 7 3700X | Intel i9-12900K / AMD Ryzen 9 5900X |
| CUDA | 11.8+ | 12.4+ |
Model Size Considerations
- Total Model Size: ~180GB (quantized formats)
- Peak GPU Memory: 15-20GB VRAM (active parameters)
- Context Length: 256K tokens (uses significant RAM)
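Before downloading anything, it is worth confirming that your machine actually has the headroom described above. The following is a minimal, optional check, assuming PyTorch with CUDA support is already installed; adjust the path to wherever you plan to store the model:
import shutil
import torch
# Free disk space at the intended download location (adjust the path as needed)
_, _, free = shutil.disk_usage(".")
print(f"Free disk space: {free / 1e9:.0f} GB")
# Free GPU memory, if a CUDA device is visible
if torch.cuda.is_available():
    free_vram, total_vram = torch.cuda.mem_get_info()
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Free VRAM: {free_vram / 1e9:.1f} GB of {total_vram / 1e9:.1f} GB")
else:
    print("No CUDA device detected")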
Software Prerequisites
Before installation, ensure you have:
- Python 3.10+ installed
- CUDA Toolkit 11.8+ or 12.4+
- NVIDIA Drivers (latest version)
- Git for repository cloning
Verify CUDA Installation
nvidia-smi
nvcc --version
Method 1: Install Using SGLang (Recommended)
SGLang is the recommended framework for MiMo-V2-Flash, offering optimized performance for MoE models.
Step 1: Install SGLang
# Create virtual environment
python -m venv mimo-env
source mimo-env/bin/activate # On Windows: mimo-env\Scripts\activate
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Install SGLang
pip install sglang
Step 2: Download the Model
# Login to Hugging Face
huggingface-cli login
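# Optional: for non-interactive use you can instead export your access token as the
# HF_TOKEN environment variable before downloading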
# Download MiMo-V2-Flash
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./models/MiMo-V2-Flash
Step 3: Launch SGLang Server
python -m sglang.launch_server \
--model-path ./models/MiMo-V2-Flash \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--dtype float16 \
--context-length 262144 \
--mem-fraction-static 0.9
Step 4: Test the Installation
import requests
import json
url = "http://localhost:30000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": "MiMo-V2-Flash",
"messages": [
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
],
"temperature": 0.7,
"max_tokens": 500
}
response = requests.post(url, headers=headers, data=json.dumps(data))
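# If the request succeeds, the assistant's reply is at
# response.json()["choices"][0]["message"]["content"] (OpenAI-compatible schema)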
print(response.json())
Method 2: Install Using Hugging Face Transformers
Step 1: Install Dependencies
pip install transformers==4.51.0
pip install accelerate
pip install bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124
Step 2: Basic Usage Script
Create run_mimo.py:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Model ID
model_id = "XiaomiMiMo/MiMo-V2-Flash"
# Load tokenizer and model
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
model_id,
trust_remote_code=True
)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True,
load_in_8bit=True, # Enable 8-bit quantization
max_memory={0: "15GB"} # Limit GPU memory usage
)
# Generate text
prompt = "Explain the concept of machine learning"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
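# Optional: if the repository ships a chat template (check tokenizer_config.json),
# wrapping the prompt with it usually improves answers from instruction-tuned models:
# messages = [{"role": "user", "content": prompt}]
# inputs = tokenizer.apply_chat_template(
#     messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
# ).to(model.device)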
print("Generating response...")
outputs = model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7,
do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\n{response}")Step 3: Run the Script
python run_mimo.py
Method 3: Using Ollama (Experimental)
Step 1: Install Ollama
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from ollama.com
Step 2: Create Custom Modelfile
Create a Modelfile for MiMo-V2-Flash (note that Ollama generally expects GGUF-format weights, so you may first need to convert the checkpoint to GGUF and point FROM at that file):
FROM ./models/MiMo-V2-Flash
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 50
PARAMETER num_ctx 262144
Step 3: Build and Run
# Create the model
ollama create mimo-v2-flash -f Modelfile
# Run the model
ollama run mimo-v2-flash
Method 4: Docker Deployment
Step 1: Create Dockerfile
Create Dockerfile:
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate
# Copy model and application
COPY models/MiMo-V2-Flash /app/models/MiMo-V2-Flash
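# Alternative: omit this COPY and mount the model directory at runtime instead
# (the docker run command below mounts ./models into /app/models)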
COPY app.py /app/
# Expose port
EXPOSE 30000
# Launch server
CMD ["python3", "-m", "sglang.launch_server", "--model-path", "/app/models/MiMo-V2-Flash", "--host", "0.0.0.0", "--port", "30000"]Step 2: Build and Run
# Build image
docker build -t mimo-v2-flash .
# Run container
docker run --gpus all -p 30000:30000 -v $(pwd)/models:/app/models mimo-v2-flash
Advanced Configuration
Enable Flash Attention
For better performance, install Flash Attention:
pip install flash-attn --no-build-isolation
Then, if you are using the Transformers method, request the Flash Attention 2 backend when loading the model:
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16,
    attn_implementation="flash_attention_2", trust_remote_code=True
)
Memory Optimization
If you encounter out-of-memory errors:
# Use 4-bit quantization via bitsandbytes
from transformers import BitsAndBytesConfig
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
Multi-GPU Setup
For systems with multiple GPUs:
# Distribute model across GPUs
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
max_memory={0: "10GB", 1: "10GB", 2: "10GB", 3: "10GB"}
)
Performance Optimization
1. Adjust GPU Memory Utilization
python -m sglang.launch_server \
--mem-fraction-static 0.95  # use 95% of GPU memory
2. Optimize Context Length
# Reduce context length if you need more speed
--context-length 32768  # 32K instead of 256K
3. Enable Tensor Parallelism
# Use multiple GPUs for inference
--tp-size 4
Troubleshooting Common Issues
Issue 1: Out of Memory (OOM)
Solution:
# Gradient checkpointing only helps during fine-tuning, not at inference time
model.gradient_checkpointing_enable()
# For inference, keep the batch size small instead
batch_size = 1  # instead of larger values
Issue 2: CUDA Out of Memory
Solution:
- Reduce --mem-fraction-static to 0.8
- Enable quantization (8-bit or 4-bit)
- Close other GPU-intensive applications
Issue 3: Model Loading Errors
Solution:
# Clear cache and re-download
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./models/MiMo-V2-Flash --resume-download
Issue 4: Slow Inference
Solutions:
- Enable Flash Attention: pip install flash-attn
- Use tensor parallelism for multi-GPU
- Reduce context length
- Increase GPU memory utilization
Issue 5: Import Errors
Solution:
# Reinstall dependencies
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip uninstall sglang
pip install sglang
Testing Your Installation
Comprehensive Test Script
Create test_mimo.py:
import requests
import json
def test_mimo():
url = "http://localhost:30000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
# Test 1: Basic text generation
data1 = {
"model": "MiMo-V2-Flash",
"messages": [{"role": "user", "content": "Write a hello world program in Python"}],
"max_tokens": 100
}
# Test 2: Code generation
data2 = {
"model": "MiMo-V2-Flash",
"messages": [{"role": "user", "content": "Create a REST API with FastAPI"}],
"max_tokens": 200
}
# Test 3: Math reasoning
data3 = {
"model": "MiMo-V2-Flash",
"messages": [{"role": "user", "content": "Solve: What is the derivative of x^2 + 3x + 5?"}],
"max_tokens": 100
}
for i, data in enumerate([data1, data2, data3], 1):
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
print(f"Test {i} passed!")
print(response.json()["choices"][0]["message"]["content"])
else:
print(f"Test {i} failed: {response.status_code}")
if __name__ == "__main__":
test_mimo()Run the test:
python test_mimo.py
Best Practices
- Monitor GPU Usage: Use nvidia-smi to monitor GPU memory and temperature (a Python-based alternative is sketched after this list)
- Adjust Batch Size: Start with batch size 1 and increase gradually
- Use Virtual Environments: Isolate dependencies with venv or conda
- Regular Updates: Keep drivers and CUDA toolkit updated
- Backup Model: Keep a backup of the downloaded model files
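For the GPU-monitoring point above, a small Python alternative to watching nvidia-smi in a separate terminal is sketched below. It assumes the optional nvidia-ml-py package (pip install nvidia-ml-py); any NVML wrapper would work equally well:
import pynvml  # provided by the nvidia-ml-py package
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
print(f"VRAM used: {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB, temperature: {temp}°C")
pynvml.nvmlShutdown()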
Benchmarking Performance
Run a Simple Benchmark
import time
import torch
def benchmark_model(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start_time = time.time()
    outputs = model.generate(**inputs, max_new_tokens=200)
    end_time = time.time()
    # Count only newly generated tokens, not the prompt
    tokens_generated = outputs[0].shape[-1] - inputs["input_ids"].shape[-1]
    tokens_per_second = tokens_generated / (end_time - start_time)
    return tokens_per_second
# Benchmark prompt
prompt = "Write a detailed explanation of quantum computing"
tps = benchmark_model(model, tokenizer, prompt)
print(f"Tokens per second: {tps:.2f}")
Conclusion
Running Xiaomi's MiMo-V2-Flash locally is a powerful way to leverage state-of-the-art AI capabilities while maintaining privacy and control. Whether you choose SGLang for maximum performance or Hugging Face Transformers for ease of use, this guide provides all the information you need to get started.
Key Takeaways:
- SGLang is recommended for optimal performance
- Ensure adequate GPU memory (15GB+ VRAM)
- Use quantization if you have limited memory
- Experiment with context length to balance performance and speed
- Monitor GPU usage to prevent overheating
For scaling beyond local hardware or if you encounter hardware limitations, consider cloud GPU providers. Start with the recommended SGLang method and experiment with other frameworks based on your specific needs and hardware configuration.