Show HN: LLM Inference Performance Analysis Tool for MoE Models (DeepSeek, etc.)

github.com

1 points by kevin-2025 6 hours ago

I built this to answer "what-if" questions about LLM deployment without spinning up expensive infrastructure.

The tool models inference physics (latency, bandwidth saturation, and PCIe bottlenecks) for large MoE models such as DeepSeek-V3 (671B), Mixtral 8x7B, Qwen2.5-MoE, and Grok-1.
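To give a feel for what "inference physics" means here, this is a minimal roofline-style sketch of my own (simplified, not the tool's actual formulas): decode is usually memory-bandwidth bound, so a lower bound on per-token latency follows from the bytes of active weights plus KV cache streamed from HBM each step.

```ts
// Rough roofline-style lower bound on decode latency for an MoE model.
// All names and numbers are illustrative, not the calculator's real model.

interface MoEConfig {
  activeParamsB: number;   // params activated per token, in billions (e.g. ~37 for DeepSeek-V3)
  bytesPerParam: number;   // 2 for FP16/BF16, 1 for FP8, 0.5 for INT4
  kvCacheGB: number;       // KV cache read per decode step, in GB
}

interface GpuConfig {
  hbmBandwidthGBs: number; // e.g. ~3350 GB/s for H100 SXM
  numGpus: number;         // tensor-parallel degree (assumes near-ideal sharding)
}

// Memory-bound estimate of per-token decode latency in milliseconds.
function decodeLatencyMs(model: MoEConfig, gpu: GpuConfig): number {
  const weightGB = model.activeParamsB * model.bytesPerParam; // GB of active weights per token
  const bytesMovedGB = weightGB + model.kvCacheGB;
  const aggregateBandwidthGBs = gpu.hbmBandwidthGBs * gpu.numGpus;
  return (bytesMovedGB / aggregateBandwidthGBs) * 1000;
}

// Example: an FP8 DeepSeek-V3-like config on 8x H100.
const latency = decodeLatencyMs(
  { activeParamsB: 37, bytesPerParam: 1, kvCacheGB: 10 },
  { hbmBandwidthGBs: 3350, numGpus: 8 }
);
console.log(`~${latency.toFixed(2)} ms/token lower bound`);
```

The real model layers interconnect and PCIe costs on top of this; the sketch only captures the HBM term.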

Key features:

- Independent prefill vs. decode parallelism configuration (TP/PP/SP/DP)
- Hardware modeling: H100, B200, A100, NVLink topologies, IB vs. RoCE
- Optimizations: Paged KV Cache, DualPipe, FP8/INT4 quantization
- Experimental: Memory Pooling (TPP, tiered storage) and Near-Memory Computing, i.e. offloading cold experts and cold/warm KV cache to system RAM or to a node-shared or global-shared memory pool (see the sketch below)
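To make the experimental offload feature concrete, here is a hypothetical cost sketch (names, parameters, and numbers are mine, not the tool's API) of the extra per-token latency when a cold expert has to be fetched back over PCIe from a host-side pool:

```ts
// Hypothetical penalty model for pulling an offloaded expert over PCIe.
// Illustrative only; the calculator accounts for more factors (batching,
// prefetch, interconnect topology, warm vs. cold KV tiers).

interface OffloadConfig {
  expertSizeGB: number;     // one expert's weights, e.g. ~0.5 GB quantized (assumed)
  pcieBandwidthGBs: number; // ~25 GB/s effective Gen4 x16, ~50 GB/s Gen5 (approx.)
  coldMissRate: number;     // fraction of expert activations missing the GPU-resident set
  expertsPerToken: number;  // experts activated per token (e.g. 8 routed experts)
}

// Expected extra decode latency per token, in milliseconds.
function coldExpertPenaltyMs(cfg: OffloadConfig): number {
  const fetchMs = (cfg.expertSizeGB / cfg.pcieBandwidthGBs) * 1000;
  return cfg.coldMissRate * cfg.expertsPerToken * fetchMs;
}

console.log(
  coldExpertPenaltyMs({
    expertSizeGB: 0.5,
    pcieBandwidthGBs: 50,
    coldMissRate: 0.05,
    expertsPerToken: 8,
  }).toFixed(2),
  "ms/token expected cold-expert penalty"
);
```

The point of the calculator is to let you play with exactly these kinds of knobs (miss rates, pool placement, link bandwidth) without owning the hardware.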

Live demo: https://llm-inference-performance-calculator-1066033662468.u...

Built with React, TypeScript, Tailwind, and Vite.

Disclaimer: I've calibrated the math models but they're not perfect. Feedback and PRs welcome.