stepfun-ai/Step-3.7-Flash
Production-grade vision-language MoE (~198B total / 11B active parameters per token) that combines a 196B sparse language backbone with a 1.8B perception encoder, hybrid sliding-window (SWA) and global attention, and 3-way Multi-Token Prediction (MTP).
Sparse MoE VLM with hybrid attention and 3-layer MTP speculative decoding
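
As a quick sanity check on the parameter counts quoted above, the following minimal sketch adds the two component sizes and estimates the share of parameters that are active. The constant and function names are hypothetical, chosen only for illustration. The figures come from the description and are approximate, and the sketch assumes the 11B active count is measured against the combined total, which the description does not state.

```python
# Parameter counts quoted in the description, in billions (approximate).
LANGUAGE_BACKBONE_B = 196.0   # sparse MoE language backbone
PERCEPTION_ENCODER_B = 1.8    # perception encoder
ACTIVE_B = 11.0               # active parameters, as quoted


def total_params_b() -> float:
    """Sum of the two quoted components, in billions."""
    return LANGUAGE_BACKBONE_B + PERCEPTION_ENCODER_B


def active_fraction() -> float:
    """Share of the total that is active (assumes 11B is relative to the full total)."""
    return ACTIVE_B / total_params_b()


if __name__ == "__main__":
    print(f"total ~ {total_params_b():.1f}B")             # total ~ 197.8B
    print(f"active fraction ~ {active_fraction():.1%}")   # active fraction ~ 5.6%
```

The sum of 197.8B matches the rounded "~198B total" figure, and roughly 5.6% of the parameters are active under the stated assumption.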