
stepfun-ai/Step-3.7-Flash

Production-grade vision-language MoE (~198B total / 11B active parameters) that combines a 196B sparse language backbone with a 1.8B perception encoder, hybrid sliding-window (SWA) and global attention, and 3-way Multi-Token Prediction (MTP).

Sparse MoE VLM with hybrid attention and 3-layer MTP speculative decoding

MoE · 198B / 11B · 262,144 ctx · vLLM nightly+ · multimodal
Guide