
Submission declined on 9 March 2026 by Nighfidelity (talk).

This draft appears to contain text generated by a large language model (such as ChatGPT). You cannot use LLMs to generate article content.

LLM-generated pages with certain obvious signs of being machine generated may be deleted without notice.

These tools are prone to specific issues that violate our policies:

hallucinations: they often invent false information and cite non-existent references.
unencyclopedic tone: they tend to be vague, promotional, or essay-like, rather than neutral and factual.
copyright issues: they may closely paraphrase existing text, leading to copyright violations.

Instead, only summarize in your own words a range of independent, reliable, published sources that discuss the subject.

See the advice page on large language models for more information.



Submission declined on 7 December 2025 by I2Overcome (talk).


State Space Models (SSMs) are a class of neural network architectures for sequence and time series data that model sequences using principles from control theory. SSMs have emerged as efficient alternatives to Transformer and recurrent neural network (RNN) architectures, particularly for handling long-range dependencies in sequence modeling tasks.^[1] Unlike Transformers, whose self-attention has quadratic complexity with respect to sequence length, SSMs achieve linear or near-linear time complexity, which makes them well suited to processing very long sequences.^[2]^[3]

Overview

State Space Models in deep learning are based on continuous-time state space representations from classical control theory.^[2] At their core, SSMs map a one-dimensional input signal u(t) to an output signal y(t) through a hidden state x(t) using a system of differential equations.^[4] The basic SSM is defined by the equations:

x'(t) = Ax(t) + Bu(t)
y(t) = Cx(t) + Du(t)

where A is the state matrix, B is the control matrix, C is the output matrix, and D is a direct feedthrough term (often treated as a skip connection in deep learning applications).^[1]

SSMs offer several key advantages: they can naturally handle continuous data, automatically adapt to different sampling rates without retraining, and provide mathematically tractable analysis of their dynamics.^[2]^[5] Through discretization, SSMs can be viewed from three complementary perspectives: as continuous-time systems, as recurrent networks during inference, and as convolutional models during training.^[1]

History and Development

Origins in Neuroscience

The application of state space models to deep learning traces back to theoretical neuroscience research. In 2018, Aaron R. Voelker and Chris Eliasmith from the University of Waterloo proposed that the dynamic system in SSMs can effectively model "time cells" present in the hippocampus and cortex, leading to their work on applying SSMs to neural networks.^[6]^[2]

Legendre Memory Units (2019)

The Legendre Memory Unit (LMU), introduced by Voelker, Kajić, and Eliasmith in 2019, was among the first successful applications of SSMs in deep learning.^[7] LMUs are mathematically derived to orthogonalize continuous-time history by solving coupled ordinary differential equations, with their phase space mapping onto sliding windows of time via Legendre polynomials. LMUs demonstrated the ability to handle temporal dependencies spanning 100,000 time steps and achieved state-of-the-art performance on permuted sequential MNIST, exceeding 97% accuracy.^[7]

HiPPO Framework (2020)

The High-Order Polynomial Projection Operators (HiPPO) framework, introduced in 2020 by Gu et al. at Stanford University, provided a unified mathematical foundation for memory in sequence models.^[8] HiPPO optimally projects continuous signals onto polynomial bases, yielding linear dynamics for the projection coefficients. This framework produces several instantiations including HiPPO-LegS (scaled Legendre) and HiPPO-LegT (translated Legendre), which achieve timescale robustness and bounded gradients.^[8]^[9] The HiPPO framework achieved 98.3% accuracy on permuted MNIST, surpassing previous RNN approaches by a significant margin.^[8]

Parallelization (2021)

Chilkuri and Eliasmith proposed and demonstrated a method to efficiently train SSMs in parallel on GPUs.^[10] This addressed concerns that the recurrence in SSMs would make them difficult to train on GPUs, a limitation that had caused other recurrent networks such as LSTMs to fall out of favour. Subsequently, the first large language model (LLM) using SSMs was trained with this method and was reported to scale better than either LSTMs or Transformers.^[11]

Structured State Space Models (S4, 2021)

The Structured State Space sequence model (S4), introduced by Gu, Goel, and Ré in 2021, marked a breakthrough in making SSMs practical for large-scale deep learning.^[12] S4 addressed the computational challenges of naive SSM implementations through a new parameterization with three components. First, a structured initialization uses the HiPPO matrix for the state matrix A. Second, a normal plus low-rank (NPLR) decomposition allows A to be diagonalized stably. Third, the SSM computation is reduced to the evaluation of a Cauchy kernel, which improves computational efficiency.^[12]
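The relationship between the HiPPO matrix and the NPLR form can be checked numerically. The following is a minimal NumPy sketch, not the S4 implementation: it builds the HiPPO-LegS matrix from its closed form and verifies that adding a rank-1 term leaves a normal matrix. The function names hippo_legs and nplr_parts, and the state size of 8, are illustrative assumptions.

```python
import numpy as np

def hippo_legs(n: int) -> np.ndarray:
    """HiPPO-LegS state matrix A (lower triangular) from its closed form."""
    q = np.sqrt(2 * np.arange(n) + 1)        # q[i] = sqrt(2i + 1)
    a = -np.tril(np.outer(q, q), k=-1)       # below diagonal: -sqrt((2i+1)(2j+1))
    a -= np.diag(np.arange(n) + 1)           # diagonal: -(i + 1)
    return a

def nplr_parts(n: int):
    """Return (A, S, p) such that A = S - p p^T with S a normal matrix."""
    a = hippo_legs(n)
    p = np.sqrt(np.arange(n) + 0.5)          # rank-1 correction vector
    s = a + np.outer(p, p)                   # candidate normal part
    return a, s, p

a, s, p = nplr_parts(8)
# S equals -1/2 * I plus a skew-symmetric matrix, so it commutes with its transpose.
print(np.allclose(s + s.T, -np.eye(8)), np.allclose(s @ s.T, s.T @ s))
```

Because a normal matrix is unitarily diagonalizable, the diagonalization of S is well conditioned, and the low-rank term is handled separately.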

S4 achieved strong results across multiple domains:^[12]^[13]

91% accuracy on sequential CIFAR-10 (matching 2D ResNets with no data augmentation)
State-of-the-art on all tasks in the Long Range Arena benchmark
First model to solve the Path-X task (16,000 sequence length) with 88% accuracy
60× faster generation than Transformers on language modeling

The model demonstrated the ability to handle sequences exceeding 10,000 steps while maintaining linear scaling in sequence length.^[12]

Key Architectural Innovations

Mamba (2023)

Mamba, introduced by Gu and Dao in December 2023, represents a major advancement in SSM architectures through the introduction of selective state space models. Earlier language-focussed SSMs use time-invariant parameters,^[11] meaning the matrices A, B, and C remain constant across the sequence. Mamba's main innovation is making the step size Δ and the matrices B and C functions of the input, so that the discretized state matrix also varies with the input, allowing the model to selectively propagate or forget information based on content.^[14] In addition, compared to previous LLM work, Mamba provided a simplified architecture that replaces attention and MLP blocks with a unified SSM block.^[15] Finally, Mamba includes hardware-aware algorithms, including parallel scan, kernel fusion, and selective recomputation, to achieve efficient training.^[14]
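The idea of input-dependent parameters can be illustrated with a short sequential scan. This is a simplified single-channel sketch in NumPy, not Mamba's fused GPU kernel: the projection weights w_delta, w_b, and w_c are hypothetical stand-ins for the learned linear projections, A is assumed diagonal, and B is discretized with a simple Euler step.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)              # stable log(1 + exp(z)), always > 0

def selective_scan(u, a, w_delta, w_b, w_c):
    """Sequential selective SSM over one scalar input channel.

    u: (L,) inputs; a: (N,) diagonal of A (negative entries for stability).
    w_delta (scalar), w_b (N,), w_c (N,): hypothetical projection weights.
    """
    u = np.asarray(u, dtype=float)
    x = np.zeros(len(a))
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        delta = softplus(w_delta * u_k)      # input-dependent step size
        b_k = w_b * u_k                      # input-dependent B
        c_k = w_c * u_k                      # input-dependent C
        a_bar = np.exp(delta * a)            # zero-order hold for diagonal A
        b_bar = delta * b_k                  # Euler approximation for B (assumption)
        x = a_bar * x + b_bar * u_k          # state update
        y[k] = c_k @ x                       # readout
    return y

rng = np.random.default_rng(0)
y = selective_scan(rng.normal(size=16), -np.arange(1.0, 5.0), 0.5,
                   rng.normal(size=4), rng.normal(size=4))
```

A large step size makes exp(Δa) small, so the state is mostly overwritten by the current input; a small step size leaves the state nearly unchanged. Because the parameters change at every step, the fixed convolution kernel of time-invariant SSMs no longer applies, which is why a scan is used instead.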

Mamba achieved competitive or superior performance compared to Transformers while providing 5× higher inference throughput and linear scaling in sequence length, with results reported on sequences up to a million elements long.^[14] On language modeling, Mamba-3B matched Transformers twice its size in both pretraining and downstream evaluation.^[14]

Mamba-2 (2024)

In May 2024, Dao and Gu introduced Mamba-2 through their "Transformers are SSMs" paper, which established theoretical connections between SSMs and attention mechanisms via structured semiseparable matrices.^[16] The State Space Duality (SSD) framework enabled the design of Mamba-2, which is 2-8× faster than Mamba while maintaining competitive performance with Transformers on language modeling.^[16]

Mamba-2 achieves faster computation by leveraging matrix multiplication primitives and tensor cores on modern GPUs, allowing for larger state expansion (typically N=128-256 compared to N=16 in Mamba) while remaining computationally efficient.^[16] The model also enables better system-level optimizations including tensor parallelism and sequence parallelism.^[16]
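The connection to structured matrices can be made concrete with a small numerical check. The sketch below assumes a scalar state transition a_t per time step (the restricted form discussed for Mamba-2) and shows that the recurrence gives the same output as multiplying the input by a lower-triangular matrix. Function names and sizes are illustrative, and this is not the blocked SSD algorithm itself.

```python
import numpy as np

def ssm_recurrent(u, a, B, C):
    """x_t = a_t * x_{t-1} + B_t * u_t ;  y_t = C_t . x_t  (scalar a_t per step)."""
    x = np.zeros(B.shape[1])
    y = np.empty(len(u))
    for t in range(len(u)):
        x = a[t] * x + B[t] * u[t]
        y[t] = C[t] @ x
    return y

def ssm_matrix(a, B, C):
    """Lower-triangular M with M[t, s] = (C_t . B_s) * a_{s+1} * ... * a_t."""
    log_cum = np.cumsum(np.log(a))                        # requires a_t > 0
    decay = np.exp(log_cum[:, None] - log_cum[None, :])   # product of a over (s, t]
    return np.tril(decay) * (C @ B.T)                     # causal mask times C B^T

rng = np.random.default_rng(0)
L, N = 6, 4
a = rng.uniform(0.5, 1.0, size=L)
B = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
u = rng.normal(size=L)

y_rec = ssm_recurrent(u, a, B, C)
y_mat = ssm_matrix(a, B, C) @ u
assert np.allclose(y_rec, y_mat)
```

The matrix form is a masked product of C and the transpose of B, which resembles masked attention scores with a decay mask; computing it with matrix multiplications is what lets this formulation use GPU tensor cores.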

Hybrid Architectures

Jamba (2024)

AI21 Labs introduced Jamba in March 2024, the first production-grade model combining Mamba SSM layers with Transformer attention and mixture-of-experts (MoE) components.^[17] Jamba features:

A hybrid architecture interleaving Transformer and Mamba layers at a 1:7 ratio^[17]
52B total parameters with only 12B active parameters during inference^[17]
Support for 256K token context windows
3× throughput improvement on long contexts compared to Mixtral 8x7B^[17]

The architecture demonstrated that hybrid approaches can effectively balance the strengths of both SSMs (efficiency, long context) and Transformers (performance, in-context learning).^[18] Jamba 1.5, released in August 2024, scaled to 398B total parameters with 94B active, representing the largest hybrid SSM-Transformer architecture to date.^[19]

Other Hybrid Models

Recent work has explored various hybrid architectures, including Vision Mamba (Vim), which uses bidirectional Mamba blocks for visual data processing;^[15] MambaByte, which provides byte-level language modeling without tokenization;^[20] and MoE Mamba, which integrates mixture-of-experts with Mamba and requires 2.2× fewer training steps than standard Mamba.^[15]

Mathematical Framework

Continuous Representation

The continuous-time SSM is defined by linear ordinary differential equations:^[1]^[2]

x'(t) = Ax(t) + Bu(t)
y(t) = Cx(t) + Du(t)

where:

x(t) ∈ ℝᴺ is the state vector (N-dimensional latent state)
u(t) ∈ ℝ is the input signal
y(t) ∈ ℝ is the output signal
A ∈ ℝᴺˣᴺ is the state transition or dynamics matrix
B ∈ ℝᴺˣ¹ is the control matrix
C ∈ ℝ¹ˣᴺ is the output matrix
D ∈ ℝ is the feedthrough term

Discretization

To implement SSMs on digital computers, the continuous system must be discretized. The most common approach uses the zero-order hold (ZOH) method with step size Δ:^[1]^[21]

xₖ = Āxₖ₋₁ + B̄uₖ
yₖ = Cxₖ + Duₖ

where the discrete parameters are:

Ā = exp(ΔA)
B̄ = (exp(ΔA) - I)A⁻¹B
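
As an illustration of the zero-order hold formulas above, the following minimal NumPy sketch computes Ā and B̄ for a small system. It assumes A is diagonalizable and invertible, so that the matrix exponential can be formed from an eigendecomposition; the function name discretize_zoh and the example matrices are illustrative and do not come from the cited sources.

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order hold: A_bar = exp(delta * A), B_bar = (A_bar - I) A^{-1} B.

    Assumes A is diagonalizable and invertible.
    """
    lam, V = np.linalg.eig(A)
    # exp(delta * A) = V diag(exp(delta * lam)) V^{-1}
    A_bar = ((V * np.exp(delta * lam)) @ np.linalg.inv(V)).real
    B_bar = (A_bar - np.eye(A.shape[0])) @ np.linalg.solve(A, B)
    return A_bar, B_bar

A = np.array([[0.0, 1.0], [-1.0, 0.0]])      # a simple oscillator
B = np.array([[0.0], [1.0]])
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
```

For small Δ the result approaches the Euler approximation Ā ≈ I + ΔA and B̄ ≈ ΔB. A general-purpose routine such as scipy.linalg.expm could replace the eigendecomposition when A is not diagonalizable.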

This discretization highlights two complementary views:^[1]

Recurrent view: efficient linear-time O(L) inference over a sequence of length L, by carrying the state forward one step at a time
Convolutional view: parallel O(L log L) training via FFT-based convolutions^[10]

The convolution kernel K̄ can be precomputed as:

K̄ₖ = CĀᵏB̄

This duality allows SSMs to combine the inference efficiency of RNNs with the training parallelism of CNNs and Transformers.^[1]
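
The equivalence of the two views can be verified on a toy example. The sketch below is a minimal NumPy illustration under stated assumptions: the discrete parameters are chosen at random rather than learned, the feedthrough term D is omitted, the initial state is zero, and all function names are illustrative.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """Step-by-step recurrence with a zero initial state."""
    x = np.zeros(A_bar.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = A_bar @ x + B_bar * u_k
        y[k] = C @ x
    return y

def ssm_kernel(A_bar, B_bar, C, length):
    """K[k] = C A_bar^k B_bar for k = 0 .. length - 1."""
    K = np.empty(length)
    v = B_bar.copy()
    for k in range(length):
        K[k] = C @ v
        v = A_bar @ v
    return K

def ssm_convolutional(A_bar, B_bar, C, u):
    """Causal convolution of u with the SSM kernel, computed via FFT."""
    L = len(u)
    K = ssm_kernel(A_bar, B_bar, C, L)
    n = 2 * L                                 # zero-pad to avoid circular wrap-around
    return np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)[:L]

rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = np.diag(rng.uniform(0.2, 0.9, size=N))   # stable diagonal example
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
u = rng.normal(size=L)

y_rec = ssm_recurrent(A_bar, B_bar, C, u)
y_conv = ssm_convolutional(A_bar, B_bar, C, u)
assert np.allclose(y_rec, y_conv)
```

The kernel here is built with a simple loop for clarity; the cost of computing it efficiently for long sequences is the problem that structured parameterizations such as S4 address.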

Computational Complexity and Efficiency

Comparison with Transformers

A fundamental advantage of SSMs is their computational complexity compared to Transformers:^[2]^[12]^[21]^[22]

Specifically, for sequence length L and model dimension D, Transformers have the following complexity profile:

Training complexity: O(L²D)
Inference complexity: O(LD) per generated token with a KV cache, since each new token attends to all previous tokens (O(L²D) for a full sequence)
Memory: O(L) for the KV cache, growing linearly with sequence length

In contrast, State Space Models have the following complexity profile:

Training complexity: O(L log L) via FFT in the convolutional view
Inference complexity: O(1) per generated token, since only the hidden state is updated (O(L) for a full sequence)
Memory: O(1) for the state, independent of sequence length

This means that SSMs scale better during training and achieve linear-time generation, while Transformers have quadratic complexity that becomes prohibitive for very long sequences.^[3]^[21] At sequence lengths beyond 8,000-16,000 tokens, SSMs typically become significantly faster than Transformers.^[22]
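
The difference in inference memory can be illustrated with simple arithmetic. The sketch below counts stored elements only; the layer count, model width, and state size are hypothetical values chosen for illustration, and the formulas assume standard multi-head attention with one key and one value vector per token per layer, and one state of size d_model × d_state per SSM layer.

```python
def kv_cache_elements(seq_len: int, n_layers: int, d_model: int) -> int:
    """Keys and values for every past token in every layer: grows with seq_len."""
    return 2 * n_layers * seq_len * d_model

def ssm_state_elements(n_layers: int, d_model: int, d_state: int) -> int:
    """One (d_model x d_state) hidden state per layer: independent of seq_len."""
    return n_layers * d_model * d_state

# Hypothetical model sizes, for illustration only.
n_layers, d_model, d_state = 32, 4096, 16

for seq_len in (1_000, 100_000):
    kv = kv_cache_elements(seq_len, n_layers, d_model)
    ssm = ssm_state_elements(n_layers, d_model, d_state)
    print(seq_len, kv, ssm, kv // ssm)
```

Under these assumptions the cache-to-state ratio is 2 × seq_len / d_state, so it grows in proportion to the context length while the SSM state stays fixed.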

Applications

Natural Language Processing

SSMs have demonstrated strong performance on various NLP tasks, many of which are discussed in more detail above. These include:^[14]^[23]

Language modeling competitive with Transformers of similar or larger size
Long-document understanding with contexts up to 256K tokens
Strong performance on question answering, summarization, and text classification

The selective mechanism in Mamba has proven particularly effective for discrete modalities like language, addressing early limitations of S4 in this domain.^[14]

Computer Vision

Vision applications of SSMs have addressed many common areas of visual processing also tackled by Transformers, but with more of a focus on time series data.^[15]^[24] Briefly, these include image classification on ImageNet, sequential image tasks (e.g., sequential CIFAR-10), and video understanding and generation. Notably, Vision Mamba (Vim) achieved competitive results with Vision Transformers.^[25]

Audio and Speech

SSMs excel at audio tasks due to their continuous-time formulation.^[26] Audio applications include speech generation with models like SaShiMi^[26] and efficient audio classification on automatic speech recognition benchmarks.^[27]

Time Series and Scientific Computing

The continuous nature of SSMs makes them well-suited for a wide variety of time series problems.^[28] Generally, these include genomic sequence modeling (million-length DNA sequences),^[29] climate and weather prediction,^[30] and medical time series analysis.^[31]

Limitations and Open Challenges

Despite their advantages, SSMs face several challenges:^[14]^[32]

Associative recall: Pure SSM architectures may struggle with tasks requiring precise retrieval of specific information from long contexts, where attention mechanisms excel
Training speed at short sequences: Highly optimized Transformer implementations can be faster than SSMs at sequence lengths below 2,000-4,000 tokens
State capacity: Fixed-size hidden states may saturate on extremely long sequences, though this can be mitigated with larger state dimensions
Discrete modalities: While Mamba addressed this, earlier SSMs like S4 showed higher perplexity on language tasks compared to Transformers

These limitations have motivated hybrid architectures that combine SSMs with attention mechanisms to leverage the strengths of both approaches.^[17]^[18] Such hybrids are often still referred to simply as SSM architectures, as in several cases above, because they are not pure Transformer architectures.

References

1. Bourdois, L. (2024). "Introduction to State Space Models (SSM)". Hugging Face Blog. https://huggingface.co/blog/lbourdois/get-on-the-ssm-train
2. Voelker, A. R. (2019). "Dynamical Systems in Spiking Neuromorphic Hardware". Doctoral Thesis, University of Waterloo. https://compneuro.uwaterloo.ca/publications/voelker2019.html
3. Somvanshi, S., et al. (2025). "From S4 to Mamba: A Comprehensive Survey on Structured State Space Models". arXiv:2503.18970
4. Grootendorst, M. (2024). "A Visual Guide to Mamba and State Space Models". https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state
5. Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., & Ré, C. (2021). "Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers". Advances in Neural Information Processing Systems (NeurIPS), 34. arXiv:2110.13985
6. Voelker, A. R., & Eliasmith, C. (2018). "Improving Spiking Dynamical Networks: Accurate Delays, Higher-Order Synapses, and Time Cells". Neural Computation.
7. Voelker, A. R., Kajić, I., & Eliasmith, C. (2019). "Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks". Advances in Neural Information Processing Systems (NeurIPS), 32, 15544-15553.
8. Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections". Advances in Neural Information Processing Systems (NeurIPS), 33. arXiv:2008.07669
9. Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections". Stanford Hazy Research Blog. https://hazyresearch.stanford.edu/blog/2020-12-05-hippo
10. Chilkuri, N., & Eliasmith, C. (2021). "Parallelizing Legendre Memory Unit Training". Proceedings of the 38th International Conference on Machine Learning, PMLR, 1898–1907. https://proceedings.mlr.press/v139/chilkuri21a.html
11. Chilkuri, N., Hunsberger, E., Voelker, A., Malik, G., & Eliasmith, C. (2021). "Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers". arXiv:2110.02402
12. Gu, A., Goel, K., & Ré, C. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces". International Conference on Learning Representations (ICLR). arXiv:2111.00396
13. Rush, S., & Karamcheti, S. (2022). "The Annotated S4". https://srush.github.io/annotated-s4/
14. Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". arXiv:2312.00752
15. "Mamba (deep learning architecture)". Wikipedia. https://en.wikipedia.org/wiki/Mamba_(deep_learning_architecture)
16. Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality". Proceedings of Machine Learning Research, 235, 10041-10071.
17. Lieber, O., et al. (2024). "Introducing Jamba: AI21's Groundbreaking SSM-Transformer Model". AI21 Labs Blog. https://www.ai21.com/blog/announcing-jamba/
18. Lieber, O., et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model". arXiv:2403.19887
19. AI21 Labs (2024). "Jamba-1.5: Hybrid Transformer-Mamba Models at Scale". arXiv:2408.12570. https://www.ai21.com/research/jamba-1-5-hybrid-transformer-mamba-models-at-scale/
20. Wang, J., et al. (2024). "MambaByte: Token-free Selective State Space Model". arXiv:2401.13660
21. Nichkawde, C. "Beyond Transformers: Structured State Space Sequence Models". https://cnichkawde.github.io/statespacesequencemodels.html
22. Patro, B. N., & Agneeswaran, V. S. (2024). "Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling". arXiv:2404.16112
23. "State Space Models". Aman's AI Journal. https://aman.ai/primers/ai/state-space-models/
24. Zhu, X., Ruan, Q., Qian, S., & Zhang, M. (2025). "A hybrid model based on transformer and Mamba for enhanced sequence modeling". Scientific Reports. https://www.nature.com/articles/s41598-025-87574-8
25. Zhu, L., et al. (2024). "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model". arXiv:2401.09417
26. Goel, K., Gu, A., Donahue, C., & Ré, C. (2022). "It's Raw! Audio Generation with State-Space Models". International Conference on Machine Learning (ICML). https://proceedings.mlr.press/v162/goel22a/goel22a.pdf
27. HuggingFace ASR Leaderboard. https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
28. Zhou, H., et al. (2020). "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting". arXiv:2012.07436
29. Schiff et al. (2024). "Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling". arXiv:2403.03234
30. Yang et al. (2025). "WSSM: Geographic-enhanced hierarchical state-space model for global station weather forecast". arXiv:2501.11238
31. Brindle et al. (2025). "VISTA-SSM: Varying and Irregular Sampling Time-series Analysis via State Space Models". arXiv:2410.21527
32. Tiezzi, M., Casoni, M., Betti, A., Guidi, T., Gori, M., & Melacci, S. (2025). "Back to recurrent processing at the crossroad of transformers and state-space models". Nature Machine Intelligence. https://www.nature.com/articles/s42256-025-01034-6

