Colleagues at Lund University presented last month a working circuit that performs, in real time, zero-forcing decoding and precoding of 8 simultaneous terminals with 128 base station antennas, over a 20 MHz bandwidth at a power consumption of about 50 milliWatt.

Impressive, and important.

Granted, this number does not include the complexity of FFTs, sampling rate conversions, and several other (non-insignificant) tasks; however, it does include the bulk of the “Massive-MIMO”-specific digital processing. The design exploits a number of tricks and Massive-MIMO specific properties: diagonal dominance of the channel Gramian, in particular, in sufficiently favorable propagation.

When I started work on Massive MIMO in 2009, the common view held was that the technology would be infeasible because of computational complexity. Particularly, the sheer idea of performing zero-forcing processing in real time was met with, if not ridicule, extreme skepticism. We quickly realized, however, that a reasonable DSP implementation would require no more than some ten Watt. While that is a small number in itself, it turned out to be an overestimate by orders of magnitude!

I spoke with some of the lead inventors of the chip, to learn more about its design. First, the architectures for decoding and for precoding differ a bit. While there is no fundamental reason for why this has to be so, one motivation is the possible use of nonlinear detectors on uplink. (The need for such detectors, for most “typical” cellular Massive MIMO deployments, is not clear – but that is another story.)

Second, and more importantly, the scalability of the design is not clear. While the complexity of the matrix operations themselves scale fast with the dimension, the precision in the arithmetics may have to be increased as well – resulting in a much-faster-than-cubically overall complexity scaling. Since Massive MIMO operates at its best when multiplexing to many tens of terminals (or even thousands, in some applications), significant challenges remain for the future. That is good news for circuit engineers, algorithm designers, and communications theoreticians alike. The next ten years will be exciting.

I think that the circuit complexity scale is not cubical, instead I believe it is square instead because:

1) The latency of Gramma matrix calculator, more precisely the latency of Multiply Accumulator (MAC) is linearly proportional to the number of antenna of BS. The number of MAC is square proportional to the number of UE.

2) If we only utilize Matched Filter (MF) and Inverse of the Diagonal of the Gramma matrix for MU-MIMO detector. If the MF is implemented by a Matrix-Vector Multiplier Systolic Array, the latency of the MAC is proportional to the antenna of BS, and the size of Matrix-Vector Multiplier for is proportional to the number of user.

Agreed on “cubically”; the statement should read “resulting in a much-faster overall complexity scaling.” Thanks for spotting this.