Mistral Nemo is a transformer model, with the following architecture choices:
Layers: 40
Dim: 5,120
Head dim: 128
Hidden dim: 14,336
Activation Function: SwiGLU
Number of heads: 32
Number of kv-heads: 8 (GQA)
Vocabulary size: 2**17 = 131,072 (128k, with k = 1,024)
Rotary embeddings (theta = 1M)
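
As a rough illustration of how the values listed above relate to each other, the sketch below stores them in a small config object and derives a few quantities from them. The class name, field names, and the estimate_params helper are hypothetical and not part of any Mistral library. The parameter estimate additionally assumes bias-free linear layers, a three-matrix SwiGLU feed-forward block, and untied input/output embeddings, and it ignores normalization weights; none of these details are stated in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NemoConfig:
    """Hypothetical container; values are copied from the list above."""
    n_layers: int = 40
    dim: int = 5120
    head_dim: int = 128
    hidden_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab_size: int = 2**17
    rope_theta: float = 1_000_000.0

    @property
    def q_width(self) -> int:
        # Total width of the query projection. With these values it is
        # n_heads * head_dim = 4096, which is smaller than dim (5120).
        return self.n_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        # Width of each of the key and value projections under GQA.
        return self.n_kv_heads * self.head_dim

    @property
    def gqa_group_size(self) -> int:
        # Number of query heads that share one key/value head.
        return self.n_heads // self.n_kv_heads


def estimate_params(cfg: NemoConfig, tied_embeddings: bool = False) -> int:
    """Rough weight count. Assumptions (not from the source): no biases,
    SwiGLU with three dim x hidden_dim matrices, norm weights ignored."""
    attention = (
        cfg.dim * cfg.q_width        # query projection
        + 2 * cfg.dim * cfg.kv_width  # key and value projections
        + cfg.q_width * cfg.dim       # output projection
    )
    feed_forward = 3 * cfg.dim * cfg.hidden_dim  # gate, up, down
    embedding = cfg.vocab_size * cfg.dim
    n_embedding_matrices = 1 if tied_embeddings else 2
    return cfg.n_layers * (attention + feed_forward) + n_embedding_matrices * embedding


cfg = NemoConfig()
print("query width:", cfg.q_width)
print("key/value width:", cfg.kv_width)
print("query heads per kv-head:", cfg.gqa_group_size)
print("estimated parameters:", estimate_params(cfg))
```

Under the stated assumptions the estimate comes to roughly 12.2 billion weights. The sketch also makes one design point visible: the number of heads times the head dim (4,096) does not equal the model dim (5,120), so the head dim is set independently rather than derived as dim divided by the number of heads.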