The battle for inference chips is heating up: will GPUs or ASICs seize the new commanding heights of AI applications?

 9:03am, 7 November 2025

As large language models (LLMs) gain ever wider adoption, the AI technology behind them continues to mature. In the past, chip development centered on model training (pre-training), which demands enormous compute to "teach" the AI. The focus is now shifting to inference, the stage where a trained model is actually put to work answering questions or generating content. The sharp rise in inference demand has drawn attention to the cloud service providers (CSPs) that are actively rolling out inference ASICs. Nvidia, which has long dominated training, has not ignored this shift and has launched the Nvidia Rubin CPX in response.

Breaking down the Inference process

Inference can be divided into two stages: Prefill and Decode. Most LLMs on the market today are built on the Transformer architecture, which stacks multiple layers of identically structured blocks. Each layer contains a self-attention mechanism (Self-Attention, SA) and a feed-forward neural network (FFN).

During inference, the self-attention module in each layer first projects the inputs into three sets of vectors: Query (Q), derived from the current input; Key (K), covering all tokens generated so far; and Value (V), the content retrieved according to the Q-K correlation. These are combined through scaled dot-product attention (SDPA), and the resulting vector passes through the FFN to produce the input of the next layer.
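To make the mechanism concrete, here is a minimal NumPy sketch of single-head SDPA; the dimensions, random weights, and causal mask are illustrative assumptions rather than the configuration of any particular model.

```python
# Minimal NumPy sketch of single-head scaled dot-product attention (SDPA).
# Shapes and weights are illustrative toy values, not those of a real model.
import numpy as np

d_model, seq_len = 64, 8                       # assumed toy dimensions
x = np.random.randn(seq_len, d_model)          # layer input: one vector per token

W_q = np.random.randn(d_model, d_model)        # learned projection matrices (random here)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v            # project the input into Q, K, V

scores = Q @ K.T / np.sqrt(d_model)            # scaled dot product of queries and keys
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                         # causal mask: a token cannot attend ahead
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
attn_out = weights @ V                         # weighted sum of values -> fed to the FFN
```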

The Key-Value (KV) cache is a key technique here. By storing the K and V vectors computed at each step in a dedicated region of memory, the model does not have to recompute all previous tokens every time it generates a new one, avoiding expensive redundant computation and improving efficiency.
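A minimal sketch of the idea follows, reusing the toy dimensions above; a real cache is kept per layer and per attention head, which is omitted here for brevity.

```python
# Minimal sketch of a KV cache: K and V computed for each token are stored
# so they never have to be recomputed. Shapes are illustrative assumptions.
import numpy as np

class KVCache:
    def __init__(self, d_model):
        self.K = np.empty((0, d_model))   # keys of all tokens processed so far
        self.V = np.empty((0, d_model))   # values of all tokens processed so far

    def append(self, k_new, v_new):
        # Store the K/V vectors of the newly processed token(s).
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])
```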

Prefill computes the K and V vectors for the entire input prompt up front and stores them in the KV cache. Decode then matches each newly generated token against all the K vectors in the KV cache to produce the next token.
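The split can be sketched as two functions over that cache; attend(), the projection matrices, and the single-head layout are simplifying assumptions carried over from the sketches above.

```python
# Sketch of the two inference phases, assuming the KVCache class and the toy
# weights (x, W_q, W_k, W_v) from the previous sketches.
import numpy as np

def attend(q, cache):
    # SDPA of a single query against everything currently stored in the cache.
    scores = cache.K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cache.V

def prefill(prompt_x, cache, W_k, W_v):
    # Phase 1: project the whole prompt at once (large, highly parallel matrix work).
    cache.append(prompt_x @ W_k, prompt_x @ W_v)

def decode_step(x_new, cache, W_q, W_k, W_v):
    # Phase 2: one token at a time -- project it, extend the cache, then
    # attend over every cached K/V to produce the next representation.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    cache.append(k, v)
    return attend(q, cache)

# usage, reusing the toy tensors defined earlier:
cache = KVCache(64)
prefill(x, cache, W_k, W_v)                      # whole prompt in one pass
new_token_embedding = np.random.randn(64)        # stand-in for the next token's embedding
out = decode_step(new_token_embedding, cache, W_q, W_k, W_v)
```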

Inference shifts the demand for compute and memory

The two stages place very different demands on the chip. The goal of the Prefill stage is to minimize the time to the first output token (Time-To-First-Token, TTFT), which calls for highly parallel large-matrix computation; its requirements for memory capacity and latency are less strict than Decode's. The goal of the Decode stage is to minimize the average time to generate each token (Time-Per-Output-Token, TPOT), which calls for high-bandwidth, low-latency memory.
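In concrete terms, the two metrics fall straight out of request timestamps, as in the toy calculation below; the timing figures are made up purely to show the definitions, and TPOT here is averaged over the tokens after the first.

```python
# Illustrative calculation of TTFT and TPOT from request timestamps.
request_sent = 0.00      # seconds; all timing values here are invented
first_token_at = 0.35    # end of Prefill: the first output token appears
last_token_at = 4.35     # end of Decode: the last output token appears
num_output_tokens = 200

ttft = first_token_at - request_sent                               # Time-To-First-Token
tpot = (last_token_at - first_token_at) / (num_output_tokens - 1)  # Time-Per-Output-Token

print(f"TTFT = {ttft:.2f} s, TPOT = {tpot * 1000:.1f} ms/token")
```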

The AI chips currently on the market (mostly GPUs) usually adopt an "all-in-one" design: the same chip handles both the Prefill and Decode stages.

This approach causes a waste of resources:

In the Prefill stage, the chip's compute is fully utilized but its memory bandwidth is under-used; in the Decode stage, memory demand surges while much of the compute sits idle.
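A back-of-the-envelope arithmetic-intensity calculation illustrates why; the model size, precision, and prompt length below are assumed round numbers, not measurements of any specific chip or model.

```python
# Rough illustration of why Prefill is compute-bound and Decode memory-bound.
params = 70e9                  # assumed model size (parameters)
bytes_per_param = 2            # FP16/BF16 weights
flops_per_token = 2 * params   # rule of thumb: ~2 FLOPs per parameter per token

prompt_tokens = 4096           # Prefill: the whole prompt is processed in one pass
decode_tokens = 1              # Decode: one token per step

weight_bytes = params * bytes_per_param
prefill_intensity = flops_per_token * prompt_tokens / weight_bytes
decode_intensity = flops_per_token * decode_tokens / weight_bytes

print(f"Prefill: ~{prefill_intensity:,.0f} FLOPs per weight byte -> compute-bound")
print(f"Decode:  ~{decode_intensity:,.0f} FLOPs per weight byte -> memory-bandwidth-bound")
```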

Hence the concept of specialized Prefill and Decode hardware (SPAD): designing dedicated AI hardware around the distinct requirements of the two stages to improve inference efficiency.

Prefill chips can use larger systolic arrays, and since their memory only needs to hold the KV cache temporarily, they can be paired with more cost-effective GDDR or LPDDR; Decode chips can scale back compute and pair with high-capacity, high-bandwidth memory such as HBM and HBF. A high-speed scale-out interconnect between the two transfers the KV cache.
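Put together, a SPAD-style deployment looks roughly like the sketch below; every class, method, and figure in it is a hypothetical stand-in for illustration, not a real serving framework or a measured KV-cache size.

```python
# Hypothetical sketch of SPAD-style disaggregated serving: a compute-heavy
# Prefill node builds the KV cache and hands it over a scale-out link to a
# memory-heavy Decode node. All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class KVCacheBlob:
    tokens: int
    size_bytes: int                 # what actually crosses the interconnect

class PrefillNode:                  # large systolic arrays, GDDR/LPDDR-class memory
    def prefill(self, prompt_tokens: int) -> KVCacheBlob:
        # stand-in for the parallel prompt pass; 2 KB per token is an assumed figure
        return KVCacheBlob(tokens=prompt_tokens, size_bytes=prompt_tokens * 2048)

class DecodeNode:                   # leaner compute, HBM/HBF-class memory
    def generate(self, cache: KVCacheBlob, max_new_tokens: int) -> int:
        # stand-in for the token-by-token Decode loop that reads the cache
        return max_new_tokens

def serve(prompt_tokens: int) -> int:
    cache = PrefillNode().prefill(prompt_tokens)
    # In real hardware this handoff rides the scale-out interconnect.
    print(f"handing off {cache.size_bytes / 1e6:.1f} MB of KV cache to the Decode node")
    return DecodeNode().generate(cache, max_new_tokens=256)

serve(prompt_tokens=4096)
```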

The GPU camp has actively launched dedicated chips for Prefill

In response to the growing demand for inference, manufacturers have begun rolling out chips designed for the Prefill stage of inference in 2025. The Prefill chips announced so far mainly include Nvidia Rubin CPX, Huawei Ascend 950PR, Intel Crescent Island, and Qualcomm AI200. Except for the Huawei Ascend 950PR, which is an ASIC, all of the Prefill chips launched to date are GPUs; major CSPs have not yet launched a Prefill-specific ASIC.

In the ASIC camp, Google and Meta are the most active players in inference

On the ASIC side, major CSPs have long invested in self-developed ASICs with cost-effectiveness and energy efficiency in mind. Although the up-front development cost of a custom ASIC is high, the per-unit cost after mass production can fall to roughly one-third that of a single GPU. Moreover, CSPs understand their own models and workloads and can design dedicated ASICs for specific applications, achieving better energy efficiency than general-purpose GPUs in those areas.

The CSPs most actively launching inference ASICs today are Google, whose TPU v7 (Ironwood/Ghostfish) entered mass production in the third quarter of this year, and Meta, whose MTIA 2 entered mass production in the same quarter.

Outside of China, however, none of the major CSPs has yet launched ASICs dedicated to Prefill or Decode. As the market gradually shifts from training to inference, a hardware architecture that separates Prefill and Decode is becoming increasingly important, and it fits CSPs' R&D direction of keeping costs low and energy efficiency high. Major CSPs are therefore likely to follow with Prefill- and Decode-specific ASICs, and AI chip specifications are expected to differentiate mainly in memory and SerDes.