Fuck yeah! Moshi by @kyutai_labs just owned the stage! 🇪🇺/acc.

Architecture
1. 7B Multimodal LM (speech in, speech out)
2. 2-channel I/O - the streaming LM constantly generates text tokens as well as audio codec tokens (tunable; toy loop after this list)
3. Achieves 160ms latency (with a Real-Time Factor of 2)
4. The base text language model is Helium 7B, a 7B model trained from scratch
5. Helium 7B is then jointly trained on text and audio codec tokens
6. The speech codec is based on Mimi (their in-house audio compression model)
7. Mimi is a VQ-VAE capable of a 300x compression factor - trained on both semantic and acoustic information (quick back-of-the-envelope math after this list)
8. The text-to-speech engine supports 70 different emotions and styles, like whispering, accents, personas, etc.
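
To make the 2-channel streaming idea concrete, here's a toy loop in Python. This is purely my own sketch: the frame rate, codebook count, and every function name are assumptions, not Kyutai's actual interfaces.

```python
# Illustrative sketch of a dual-channel streaming step loop (not Kyutai's code).
# Assumption: each model step consumes one frame of the user's audio tokens and
# emits one text token plus one frame of audio codec tokens for Moshi's channel.

from dataclasses import dataclass, field

FRAME_RATE_HZ = 12.5        # assumed codec frame rate -> 80 ms per frame
NUM_CODEBOOKS = 8           # assumed number of codebooks per frame

@dataclass
class StreamState:
    text_tokens: list = field(default_factory=list)
    audio_frames: list = field(default_factory=list)   # each frame: list of codebook ids

def lm_step(state: StreamState, incoming_audio_frame: list[int]) -> tuple[int, list[int]]:
    """Hypothetical single step of the streaming LM.

    Takes the latest frame of the user's audio tokens and returns
    (next_text_token, next_audio_frame) for Moshi's output channel.
    A real implementation would run the 7B transformer here.
    """
    next_text_token = 0                       # placeholder
    next_audio_frame = [0] * NUM_CODEBOOKS    # placeholder
    return next_text_token, next_audio_frame

def stream(user_audio_frames: list[list[int]]) -> StreamState:
    """Consume user audio frame-by-frame, emitting text + audio every step."""
    state = StreamState()
    for frame in user_audio_frames:
        text_tok, audio_frame = lm_step(state, frame)
        state.text_tokens.append(text_tok)        # inner-monologue text channel
        state.audio_frames.append(audio_frame)    # spoken audio channel
    return state

if __name__ == "__main__":
    # 25 dummy frames ~= 2 seconds of audio at the assumed 12.5 Hz frame rate
    dummy_input = [[0] * NUM_CODEBOOKS for _ in range(25)]
    out = stream(dummy_input)
    print(len(out.text_tokens), "text tokens,", len(out.audio_frames), "audio frames")
```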
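
And a quick back-of-the-envelope check of the 300x compression claim, assuming 24 kHz 16-bit mono PCM input and a codec bitrate around 1 kbps - both numbers are my assumptions, not figures from the talk:

```python
# Back-of-the-envelope check of the ~300x compression factor (assumed numbers).
sample_rate_hz = 24_000          # assumed input sample rate
bits_per_sample = 16             # assumed PCM depth
raw_bitrate_bps = sample_rate_hz * bits_per_sample           # 384,000 bps

codec_bitrate_bps = 1_100        # assumed Mimi bitrate (~1.1 kbps)
compression_factor = raw_bitrate_bps / codec_bitrate_bps

print(f"raw: {raw_bitrate_bps/1000:.0f} kbps, codec: {codec_bitrate_bps/1000:.1f} kbps, "
      f"factor: {compression_factor:.0f}x")   # roughly 350x, same ballpark as 300x
```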

Training/ RLHF
1. The model is fine-tuned on 100K transcripts generated by Helium itself.
2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational.
3. The text-to-speech engine is further fine-tuned on 20 hours of licensed audio recorded by Alice.
4. The model can be fine-tuned with less than 30 minutes of audio.
5. Safety: generated audio is watermarked (possibly w/ AudioSeal) and indexed in a database (toy sketch after this list)
6. Trained on a Scaleway cluster of 1000 H100 GPUs
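
On point 5, here's a toy sketch of what "index generated audio in a database" could look like. The fingerprinting (a plain SHA-256 over raw bytes) and the schema are entirely my assumptions, not what Kyutai actually does, and the watermarking step itself is left out.

```python
# Illustrative-only sketch of indexing generated audio clips in a database.
import hashlib
import sqlite3
import time

def fingerprint(audio_bytes: bytes) -> str:
    """Hash the generated waveform so it can be looked up later."""
    return hashlib.sha256(audio_bytes).hexdigest()

def index_generation(db: sqlite3.Connection, audio_bytes: bytes, session_id: str) -> None:
    """Store a record of every generated clip for later provenance checks."""
    db.execute(
        "INSERT INTO generations (fingerprint, session_id, created_at) VALUES (?, ?, ?)",
        (fingerprint(audio_bytes), session_id, time.time()),
    )
    db.commit()

def was_generated(db: sqlite3.Connection, audio_bytes: bytes) -> bool:
    """Check whether a clip matches something the system previously produced."""
    row = db.execute(
        "SELECT 1 FROM generations WHERE fingerprint = ?",
        (fingerprint(audio_bytes),),
    ).fetchone()
    return row is not None

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE generations (fingerprint TEXT, session_id TEXT, created_at REAL)")
    clip = b"\x00\x01" * 24_000          # dummy audio bytes
    index_generation(conn, clip, session_id="demo-1")
    print(was_generated(conn, clip))     # True
```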

Inference
1. The deployed demo model handles batch size 2 (bs=2) on 24GB of VRAM (hosted on Scaleway and Hugging Face)
2. The model supports 4-bit and 8-bit quantisation (rough memory math after this list)
3. Works across backends - CUDA, Metal, CPU
4. Inference code is written in Rust and heavily optimised
5. Further savings to be made with better KV Caching, prompt caching, etc.
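
Rough memory math behind points 1 and 2 - every number below is a back-of-the-envelope assumption, not a measured figure:

```python
# Back-of-the-envelope VRAM math for a 7B model (assumed numbers, not measurements).
params_billion = 7.0

bytes_per_param = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    weights_gb = params_billion * nbytes          # GB, since 1B params * 1 byte ~ 1 GB
    print(f"{fmt:>10}: ~{weights_gb:.1f} GB of weights")

# fp16/bf16: ~14.0 GB of weights -> on a 24 GB card that leaves roughly 10 GB
# for KV cache and activations, which is where the bs=2 headroom has to come from.
# int8: ~7.0 GB, int4: ~3.5 GB -> quantisation frees up most of the card.
```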

Future plans
1. Short term: a technical report and open model releases.
2. The open releases will include the inference codebase, the 7B model, the audio codec, and the full optimised stack.
3. Scale and refine the model based on feedback - expect Moshi 1.1, 1.2, 2.0
4. Licensing will be as permissive as they can make it (yet to be decided)

Just 8 team members put all of this together! 🔥

After using it IRL, it feels magical to get such a quick response. It opens up so many avenues: research assistance, brainstorming/steelmanning discussion points, language learning, and, more importantly, it's on-device with the flexibility to use it however you want!

Hats off to Kyutai and the team for shipping a version that *just* works and is out in public 🫡

Your turn, OpenAI! ;)