nanoGPT with Sigmoid Self-Attention
I couldn’t resist, had to give it a try :)

Some observations on M2:
- SSA was ~5-10% faster in training, with similar final loss values
- Slightly less coherent text generation and marginally higher perplexity
- Lower memory usage compared to softmax attention
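For context, here’s a minimal sketch of the idea (my own hedged illustration, not the notebook’s exact code): the row-wise softmax in attention is replaced by an elementwise sigmoid with a -log(seq_len) bias, which keeps the expected row sum near 1 at initialization. Shapes and module names below are assumptions for illustration.

```python
# Minimal sigmoid self-attention sketch in PyTorch (illustrative, hedged).
import math
import torch
import torch.nn as nn

class SigmoidSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused q, k, v projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # causal mask: block attention to future positions
        mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        att = att.masked_fill(~mask, float('-inf'))
        # elementwise sigmoid with bias b = -log(T) instead of row-wise softmax
        att = torch.sigmoid(att - math.log(T))
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

# usage: attn = SigmoidSelfAttention(384, 6); out = attn(torch.randn(2, 64, 384))
```

Since sigmoid is applied per score rather than normalizing each row, there is no row-wise reduction, which is where the speed and memory savings come from.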

Code: https://github.com/Jaykef/ai-algorithms/blob/main/sigmoid_attn.ipynb