Today I happened to discuss the future of AI with some friends. Their core view was that people may be “overly optimistic” about AI and that current estimates of its capabilities are inflated. Without “killer” application scenarios, could the AI boom subside in two to three years, eventually bursting the “huge bubble”?

I also believe that the actual performance of AI today falls far short of public expectations. Hype in the new media makes large models seem omnipotent, but they remain limited in real-world applications and have a long way to go before they deliver ideal results. As for where the AI industry is headed, evaluations from within the industry are more comprehensive, since they combine management, business, technology, and past experience. As a student with a purely theoretical and technical background, I can only share a personal perspective focused on specific “points” rather than the broader commercial and managerial angles, so what follows are some immature thoughts from that technical standpoint.

Currently, academic research on large language models (LLMs) mainly focuses on reducing computational requirements, i.e., improving model efficiency, and on increasing the knowledge density within the model, often combined with external knowledge bases to enhance LLM performance — think of popular open-source projects such as vLLM that are widely shared on platforms like WeChat. Indeed, at this pace we might genuinely hit a bottleneck in two to three years. By then, the internal knowledge density of models will be very high, and external knowledge bases will have converged on widely recognized “best practice” designs. At that point we may no longer see the frequent, significant improvements in LLMs that we saw from 2022 to 2023, which usually signals that a breakthrough in foundational theory or model architecture is needed.
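To make the “external knowledge base” idea concrete, here is a minimal retrieval-augmented sketch in Python. The bag-of-words embedding, the example documents, and the `retrieve` helper are all toy stand-ins of my own, not any particular project’s API; real systems use a trained encoder, a vector database, and an actual LLM for the final generation step.

```python
# A minimal sketch of pairing an LLM with an external knowledge base:
# retrieve the most relevant document, then prepend it to the prompt.
# The embedding and generation steps are toy placeholders.
import numpy as np

documents = [
    "vLLM speeds up LLM serving with paged attention.",
    "Mamba is a state-space model architecture.",
    "Teacher forcing feeds ground-truth tokens during training.",
]
vocab = sorted({w for d in documents for w in d.lower().replace(".", "").split()})

def embed(text):
    # Toy bag-of-words embedding; a real system would use a trained encoder.
    return np.array([text.lower().count(word) for word in vocab], dtype=float)

def retrieve(query, k=1):
    q = embed(query)
    scores = [
        q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)) + 1e-9)
        for d in documents
    ]
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "What is Mamba?"
context = retrieve(query)
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
print(prompt)  # the augmented prompt an LLM would receive; generation itself is omitted
```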

Most large models today use the Transformer architecture, which raises the question: is the Transformer the optimal solution? This question has in fact been widely discussed over the past two years, with Mamba and TTT among the architectures attracting attention. Limited by compute and data, I have no way to run in-depth experiments on these two architectures myself, but I assume industry teams have tried. So far, however, no mature large model built on the Mamba architecture has appeared, which I suspect means the results were not as good as hoped. Even so, the Transformer may not be the optimal solution.

Proponents of the Transformer argue that multi-head attention, or self-attention, is a “simple yet effective” solution. Since traditional methods struggle to capture long-range dependencies, the idea is to abandon sequential modeling and instead compute the correlation between every pair of tokens, which dissolves the distance problem; switching to this spatial view also enables parallel computation, making it a win-win. Others, however, find the Transformer’s solution inelegant, even “ugly”: human language is inherently sequential and is read in sequence, so discarding the temporal nature of language is a “brute force” fix that does not align with natural principles. I lean toward the latter view. The Transformer is an efficient stopgap, a “workaround,” but I do not believe it is the final solution. As for why the “back-to-basics” Mamba has not been widely adopted, and what shortcomings it still has, that will take more in-depth research.
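To illustrate the “correlation between every pair of tokens” point, here is a minimal numpy sketch of scaled dot-product self-attention. The random projections and tiny dimensions are placeholders; real implementations add multiple heads, masking, positional information, and learned parameters.

```python
# Minimal scaled dot-product self-attention in numpy: every token attends
# to every other token, so dependencies no longer decay with distance and
# all positions are computed in parallel as one matrix product.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise token-to-token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                          # each output is a mixture of all tokens

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4              # toy sizes
X = rng.normal(size=(seq_len, d_model))         # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 4)
```

The contrast with sequential models is visible in the `scores` matrix: the token at position 0 and the token at position 4 interact directly, rather than through a chain of intermediate hidden states.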

Returning to the original question: when LLMs hit a bottleneck, I believe there will always be a “next Transformer.” Professor David Barber has also pointed out that many “inelegant” methods still live inside LLMs, such as the widely criticized teacher forcing. This suggests that many current solutions are compromises, and when the next game-changing foundational architecture emerges, a second wave of AI might rise. Consider the widely used neural network architectures and when they appeared: RNN (1986), LSTM (1997), CNN (1998), GAN (2014), ResNet (2015), and Transformer (2017). The intervals between new architectures have been shrinking, and each of the above was revolutionary in its time and is still widely used today. It is now 2024, seven years since the Transformer was proposed. Who can guarantee that a more elegant game changer will not appear in the next two to three years?
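For readers unfamiliar with the teacher-forcing criticism, here is a hedged PyTorch sketch of what it looks like in a toy GRU language model of my own invention: during training the ground-truth previous token is fed in at every step, while at inference the model must consume its own (possibly wrong) predictions. That train/test mismatch, often called exposure bias, is what makes the trick feel like a compromise.

```python
# Teacher forcing in a toy GRU language model (PyTorch): the training input
# at step t is the ground-truth token t, not the model's own prediction,
# whereas at inference the model feeds on its previous outputs.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 50, 16, 32     # toy sizes
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (2, 10))   # a fake batch of 2 sequences

# Training step with teacher forcing: inputs are the gold tokens shifted by one.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
hidden_states, _ = rnn(embed(inputs))
logits = head(hidden_states)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # gradients flow as if every prefix were correct

# Inference: the model must consume its own predictions instead.
seq = tokens[:, :1]
for _ in range(5):
    h, _ = rnn(embed(seq))
    next_tok = head(h[:, -1]).argmax(dim=-1, keepdim=True)
    seq = torch.cat([seq, next_tok], dim=1)      # errors compound; training never saw this regime
```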

These are just some thoughts on the way home. I’ll leave it here for now and plan to write a more detailed analysis when I have more time.