Show HN: LLM Inference Performance Analytic Tool for MoE Models (DeepSeek/etc.)
I built this to answer "what-if" questions about LLM deployment without spinning up expensive infrastructure.
The tool models inference physics: latency, bandwidth saturation, and PCIe bottlenecks for large MoE models like DeepSeek-V3 (671B), Mixtral 8x7B, Qwen2.5-MoE, and Grok-1.
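To give a flavor of the first-order math involved, here is a minimal sketch (TypeScript, illustrative only, not the calculator's actual code) of a bandwidth-bound decode-latency estimate; the active-parameter count, bandwidth, and TP degree below are assumptions I picked for the example:

    // Hypothetical sketch: per-token decode latency when decode is dominated
    // by streaming the active expert weights from HBM once per token.
    interface DecodeConfig {
      activeParamsB: number;     // params activated per token, in billions (MoE routes to a few experts)
      bytesPerParam: number;     // 2 for FP16/BF16, 1 for FP8, 0.5 for INT4
      hbmBandwidthGBps: number;  // per-GPU HBM bandwidth, e.g. ~3350 GB/s for an H100 SXM
      tpDegree: number;          // tensor-parallel degree splitting the weight reads
    }

    function decodeLatencyMs(cfg: DecodeConfig): number {
      const bytesMoved = cfg.activeParamsB * 1e9 * cfg.bytesPerParam;
      const effectiveBandwidth = cfg.hbmBandwidthGBps * 1e9 * cfg.tpDegree;
      return (bytesMoved / effectiveBandwidth) * 1000;
    }

    // Example: a DeepSeek-V3-like model with ~37B active params in FP8 on 8x H100 (TP=8).
    console.log(
      decodeLatencyMs({ activeParamsB: 37, bytesPerParam: 1, hbmBandwidthGBps: 3350, tpDegree: 8 }).toFixed(2),
      "ms per token (lower bound; ignores KV-cache reads and communication)"
    );

The real tool layers more on top of this (KV-cache traffic, interconnect costs, prefill compute), but the shape of the estimate is similar.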
Key features:
- Independent prefill vs. decode parallelism config (TP/PP/SP/DP)
- Hardware modeling: H100, B200, A100, NVLink topologies, IB vs. RoCE
- Optimizations: Paged KV Cache, DualPipe, FP8/INT4 quantization
- Experimental: memory pooling (TPP, tiered storage) and near-memory computing, i.e. offloading cold experts and cold/warm KV cache to system RAM, a node-shared pool, or a global-shared memory pool (a rough sizing sketch follows this list)
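To make the memory-pooling idea concrete, here is another minimal sketch (TypeScript, illustrative assumptions only, not the calculator's code): sizing the per-token KV cache and estimating how long a cold slice takes to fetch back over PCIe from a system-RAM pool. The layer count, KV-head count, and PCIe bandwidth are example numbers, not values from the tool.

    // Hypothetical sketch: KV-cache footprint and cold-fetch cost over PCIe.
    interface KvConfig {
      layers: number;
      kvHeads: number;      // GQA/MLA keeps this well below the attention-head count
      headDim: number;
      bytesPerElem: number; // 2 for FP16 KV cache, 1 for FP8
    }

    // Bytes of KV cache produced per token: K and V, across all layers.
    function kvBytesPerToken(cfg: KvConfig): number {
      return 2 * cfg.layers * cfg.kvHeads * cfg.headDim * cfg.bytesPerElem;
    }

    // Time to pull a cold slice of KV cache back from host RAM over PCIe.
    function pcieFetchMs(bytes: number, pcieGBps: number): number {
      return (bytes / (pcieGBps * 1e9)) * 1000;
    }

    // Example: 61 layers, 8 KV heads of dim 128, FP8 cache, 32k cold tokens,
    // fetched over ~50 GB/s effective PCIe Gen5 x16.
    const perToken = kvBytesPerToken({ layers: 61, kvHeads: 8, headDim: 128, bytesPerElem: 1 });
    const coldBytes = perToken * 32_000;
    console.log(`${(perToken / 1024).toFixed(1)} KiB/token, cold fetch ~${pcieFetchMs(coldBytes, 50).toFixed(1)} ms`);

Numbers like these are what drive the tool's verdict on whether offloading cold experts or cold KV cache pays off for a given hardware topology.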
Live demo: https://llm-inference-performance-calculator-1066033662468.u...
Built with React, TypeScript, Tailwind, and Vite.
Disclaimer: I've calibrated the math models, but they're not perfect. Feedback and PRs welcome.