The Mamba Paper Diaries
One way of incorporating a selection mechanism into models is by letting the parameters that affect interactions along the sequence be input-dependent. Operating on byte-sized tokens, Transformers scale poorly, since every token must "attend" to every other token, leading to O(n^2) scaling laws. Because of this, Transformers pref
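
To make the two ideas above concrete, here is a minimal toy sketch in plain Python. It is not the paper's actual parametrization: the function names (`selective_scan`, `attention_pair_count`), the scalar state, and the sigmoid gate with hypothetical parameters `w_gate` and `bias` are all assumptions chosen for illustration. The first function shows a recurrence whose mixing coefficient is computed from the current input, which is one simple way to read "input-dependent parameters". The second counts the token pairs a full attention layer would compare.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def selective_scan(xs, w_gate: float = 1.0, bias: float = 0.0):
    """Toy scalar recurrence with an input-dependent gate (illustrative only).

    For each input x_t:
        g_t = sigmoid(w_gate * x_t + bias)      # gate depends on the input
        h_t = (1 - g_t) * h_{t-1} + g_t * x_t   # gate decides keep vs. overwrite
    """
    h = 0.0
    states = []
    for x in xs:
        g = sigmoid(w_gate * x + bias)
        h = (1.0 - g) * h + g * x
        states.append(h)
    return states


def attention_pair_count(n: int) -> int:
    """Number of (query, key) pairs when each of n tokens attends to all n tokens."""
    return n * n


if __name__ == "__main__":
    tokens = [0.5, -1.0, 2.0, 0.0]
    print(selective_scan(tokens))
    # The scan takes one step per token; full attention compares n * n pairs.
    for n in (1_000, 8_000):
        print(n, "scan steps vs", attention_pair_count(n), "attention pairs")
```

The recurrence visits each token once, so its work grows linearly with sequence length, while the pair count grows quadratically. With `w_gate=0.0` the gate no longer depends on the input and the recurrence reduces to a fixed exponential moving average, which is the non-selective case.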