Energy Efficient Group-Sort QRD Processor with On-line Update for MIMO Channel Pre-processing

Zhang, Chenxin; Prabhu, Hemanth; Liu, Yangxurui; Liu, Liang; Edfors, Ove; Öwall, Viktor

Published in:
IEEE Transactions on Circuits and Systems Part 1: Regular Papers

DOI:
10.1109/TCSI.2015.2402936

Published: 2015-01-01

Citation for published version (APA):

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

- Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
- You may not further distribute the material or use it for any profit-making activity or commercial gain
- You may freely distribute the URL identifying the publication in the public portal
Energy Efficient Group-Sort QRD Processor with On-line Update for MIMO Channel Pre-processing

Chenxin Zhang, Student Member, IEEE, Hemanth Prabhu, Student Member, IEEE, Yangxurui Liu, Liang Liu, Member, IEEE, Ove Edfors, Member, IEEE, and Viktor Öwall, Member, IEEE

Abstract—This paper presents a Sorted QR-Decomposition (SQRD) processor for 3GPP LTE-A system. It achieves energy-efficiency by co-optimizing techniques, such as heterogeneous processing, reconfigurable architecture, and dual-supply voltage operation. At algorithm level, a low-complexity hybrid decomposition scheme is adopted, which switches, depending on the energy distribution of spatial channels, between the traditional brute-force SQRD and a proposed group-sort QR-update strategy. A reconfigurable vector processor is accordingly developed to support the adaptive processing with high hardware efficiency. Furthermore, on-chip power management technique is also integrated to obtain real-time power-saving by adapting the voltage supply based on the instantaneous workload. As a proof-of-concept, we implemented the processor using a 65 nm CMOS technology and conducted post-layout simulation. The proposed SQRD processor occupies 0.71 mm² core area and has a throughput of up to 69 MQRD/s. Compared to the brute-force approach, an energy reduction of 10 ~ 61.8% is achieved.

Index Terms—QR decomposition, sorting, channel pre-processing, MIMO, reconfigurable processor

I. INTRODUCTION

Most of modern wireless systems have adopted Multiple-Input Multiple-Output (MIMO) technique to improve bandwidth and transmitting power efficiency [1], [2]. One important processing in MIMO systems is channel matrix pre-processing that enables efficient signal detection at the receiver side. A widely-adopted pre-processing approach is QR-Decomposition (QRD) [3], which is a key prerequisite in many advanced signal detectors such as sphere decoder [4] and K-Best detection [5], [6]. To further improve detection performance, sorted QRD algorithm [7] has been introduced to relieve detection error propagation by optimizing the spatial processing order.

Despite their importance, channel pre-processing units are often excluded from conventional signal detectors [8], as their computations are considered to be less frequent with the assumption that channel is block-stationary [9]. Nevertheless, outdated Channel State Information (CSI) will introduce severe interference and degrade detection performance, especially in high-mobility scenarios. In these cases, in-time CSI update and the corresponding matrix pre-processing are highly desirable in maintaining link quality. On the other hand, SQRD is a quite computationally intensive procedure, e.g., state-of-the-art SQRD processors consume even more energy than signal detectors [8], around 30 times in [10]. As a consequence, frequent computation of SQRD (e.g., at each CSI update and for each sub-carrier) will introduce unacceptable hardware and energy consumption overhead, making the efficient implementation of SQRD processors a critical challenge, especially for battery-powered mobile devices.

In this study, we tend to tackle this implementation challenge by conducting a co-optimization among different design layers, including decomposition algorithm, hardware architecture, and on-chip power management technique. More specifically, a hybrid decomposition scheme is developed to enable wide trade-off in complexity and performance. Depending on instantaneous channel realizations, an on-line SQRD computation scheme switches adaptively between brute-force SQRD and a very low-complexity QR-Update (QRU). This is performed by fully exploiting the property of LTE-A pilot pattern. The proposed QRU is able to produce exact $Q$ and $R$ matrices using only one Givens rotation, reducing the complexity significantly. To obtain the low-complexity benefit of QRU in the context of optimal detection ordering, we further propose an effective group-sort technique which approximates the precise sorting with a two-step (i.e., intra- and inter-antenna group) sorting.

From the hardware-design perspective, the implementation of the proposed hybrid SQRD algorithm is based on a flexible array architecture, providing run-time reconfigurability for adaptive switching between the aforementioned two SQRD schemes. The processor is vector enhanced to effectively leverage the data-level parallelism in MIMO-OFDM systems, e.g., the data stream of different sub-carriers and/or receiving antennas can be processed simultaneously to improve the processing throughput. Effectively harnessing the potential energy savings at run-time requires appropriate power management scheme. This work employs a dual-supply voltage technique, which is a result of coherently considering and optimizing power-saving capability, processor architecture, and workload variance. To demonstrate the effectiveness of the solution, we evaluated the proposed SQRD processor using a 65 nm CMOS technology. Post-layout simulation results demonstrate a 69 MQRD/s processing throughput. The power consumption is 195.3 mW for brute-force SQRD with 1.1 V supply and 74.5 mW for group-sort QRU with 0.9 V voltage.

The remainder of this paper is organized as follows: Section II briefly introduces the background, which includes the system model and previous works on QRD update/tracking schemes. Section III describes and evaluates the hybrid SQRD
algorithm. Section IV presents the design of vector processor. The implementation results and performance comparison are shown in Section V, and conclusions are drawn in Section VI.

II. BACKGROUND

A. System Model

Considering an equivalent baseband model of a spatial-multiplexing $N \times N$ MIMO system, the $N \times 1$ received complex signal vector $y$ is expressed as

$$y = Hx + n,$$ (1)

where $x$ is the $N \times 1$ transmit vector, $n$ is the i.i.d. complex Gaussian noise vector with zero mean and variance of $\sigma^2$, and $H$ denotes the $N \times N$ channel matrix containing uncorrelated complex-valued channel coefficients. Each component of $x$ is obtained by mapping a set of information bits, encoded by error-correcting codes (ECC), onto a Gray-labelled complex constellation like $M$-QAM. In this work, we use 3GPP LTE-A downlink as a testbed and resort to Frame-Error-Rate (FER) to evaluate algorithms.

In LTE-A, data blocks are carried out using resource blocks (RBs), each containing 12 consecutive sub-carriers and 7 OFDM-symbol slots allocated in a time-frequency grid. To help with tracking channel changes in both time and frequency domain, scattered orthogonal pilots are inserted. LTE-A pilot pattern for four antenna ports are shown in Fig. 1. One special property to be noticed here is that the pilot tones allocated in the middle slot of each RB are only available for antenna ports 0 and 1. This corresponds to an update of half of columns (e.g., 2 out of 4) in the channel matrix $H$, denoted as half-H renewal with respect to the full-H counterpart (e.g., all elements in $H$ are changed) that takes place at the beginning of each RB. Such channel updating property paves way to various schemes for exploiting pre-processing, see Sec. II-B and Sec. III.

B. Channel Pre-processing - QRD

For channel matrix $H$ of each time-frequency resource grid, one pre-processing approach involves decomposition of the channel matrix into

$$H = QR,$$ (2)

where $Q$ matrix contains the orthogonal unit vectors and $R$ is an upper-triangular matrix. QRD of a matrix can be implemented efficiently by using different algorithms, such as Givens rotation [10], Householder [11], Gram-Schmidt [12], by exploiting the intricate mechanisms of each of these algorithms. However, the computation involved is expensive and has to be performed for each sub-carrier. To tackle this, various approximative approaches are proposed to reduce the complexity at the price of some performance degradation, such as hold-Q [13], [14], interpolating-Q [15].

1) Approximative Hold-Q (Tracking-R) Scheme: The approximative method proposed in [13] utilizes the time-domain correlation property of $H$ by assuming that the orthogonal vectors of $Q$ remains unchanged during successive CSI updates, e.g., at the half-H renewal slot. The corresponding $R$ matrix absorbs all the variations of the channel and thus is not a pure upper-triangular matrix. The corresponding computation in the hold-$Q$ scheme can be simplified to

$$\tilde{H}_{\text{new}} = Q_{\text{new}}R_{\text{new}} \approx Q_{\text{old}}R_{\text{new}},$$ (3)

Here the subscript “old” refers to the QRD result at previous channel update, e.g., $Q_{\text{old}}$ is the corresponding orthogonal matrix, precisely computed at the full-H renewal slot. The subscript “new” refers to the current updated channel. Different strategies can be employed to perform the tracking-$R$ (or hold-$Q$) scheme, such as least mean square [14], Kalman filter and threshold schemes. As mentioned in [13], a threshold scheme based on the inaccuracy of the outdated $Q_{\text{old}}$ (energy of non-zero elements below the diagonal of $R_{\text{new}}$) is used as the criterion of evoking the approximation in (3). Furthermore, various design trade-offs can be performed to adjust the corresponding performance and computational complexity. However, these methods are bound to introduce errors and in-turn degrade performance, particularly in cases where $H$ changes fast.

2) Exact QR Update Scheme: To avoid the potential performance loss in aforementioned hold-$Q$ scheme while still keeping the overall complexity low, we investigate a step further and propose an post-processing procedure after the computation of (3), referred to as exact QR update hereafter. The scheme can be employed in certain scenarios, where only a few columns of the matrix gets changed during CSI update. For example, if only one ($n^{\text{th}}$) column of $H$ matrix changes

$$H_{\text{new}} = [h_1, \ldots, h_{n-1}, h_{\text{new}, n}, h_{n+1}, \ldots, h_N],$$ (4)

a low complexity method to compute QR decomposition is deduced by the fact that the multiplication of

$$Q_{\text{old}}H_{\text{new}} = [Q_{\text{old}}h_1, \ldots, Q_{\text{old}}h_{n-1}, Q_{\text{old}}h_{\text{new},n}, \ldots, Q_{\text{old}}h_N]$$

is given by

$$Q_{\text{old}}H_{\text{new}} = \begin{bmatrix}
Q_{\text{old}}h_1 & \cdots & \cdots & \cdots & \cdots \\
0 & \cdots & x & \cdots & x \\
0 & \cdots & x & 0 & \cdots \\
\cdots & \cdots & 0 & x & \cdots \\
\cdots & \cdots & \cdots & x & \cdots \\
\cdots & \cdots & \cdots & x & \cdots \\
0 & \cdots & x & 0 & \cdots \\
0 & \cdots & 0 & x & \cdots \\
0 & \cdots & \cdots & x & \cdots \\
0 & \cdots & \cdots & 0 & x
\end{bmatrix},$$ (5)
resulting in a spike in \(n^{th}\) column. The unwanted sub-diagonal elements can be cleared (zeroed) by performing a sequence of Givens rotation as

\[ G_{n+1} \cdots G_N (Q_{\text{old}}^H H_{\text{new}}) = R_{\text{new}}, \]

where \(G_n\) is the Givens matrix for rotation in the planes \(n\) and \(n+1\), as described in [16].

The key advantage of this method, apart from still being low complex, is that the accurate \(Q\) and \(R\) matrix can be obtained and hence performs better than hold-\(Q\) schemes. In the next section, we apply this scheme in the framework of LTE-A.

3) Application in LTE-A Framework: As mentioned earlier, in case only parts of the matrix columns alter over time, QRD of the new matrix can be performed in a more efficient way than a brute-force computation (referred to as Case-I). Inspired by this, we apply the aforementioned method in the framework of LTE-A downlink receiver and propose a low-complexity QRU scheme during half-H (Fig. 1) renewals. Specifically, the proposed scheme starts with the brute-force SQRD during full-H renewals. During half-H renewals, \(\tilde{H}_{\text{new}}\) is obtained by updating two columns of \(H_{\text{old}}\). Although orthogonal vectors in \(Q_{\text{old}}\) may no longer triangularize \(H_{\text{new}}\), it can still have vectors pointing in the correct directions. However, due to the outdated \(Q_{\text{old}}\), \(R_{\text{new}}\) is no longer an upper-triangular matrix, but may still reveal some upper-triangular properties depending on the positions of the two renewed columns. Specifically, in cases where column changes take place at the right-most of \(\tilde{H}_{\text{new}}\), only one element in the lower triangular part of \(R_{\text{new}}\) (i.e., \(\tilde{r}_{\text{new}}(4, 3)\), 4-th row 3-rd column of \(R_{\text{new}}\)) becomes non-zero. This implies that triangularization of \(R_{\text{new}}\) can be significantly simplified by nulling the single non-zero element instead of operating on all columns afresh as

\[ G R_{\text{new}} = G Q_{\text{old}}^H \tilde{H}_{\text{new}}, \]

where

\[ G = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & c & s^* \\ 0 & 0 & -s^* & e^* \end{bmatrix}. \]

In (8), \((\cdot)^*\) is the complex conjugation, \(c\) and \(s\) are defined as

\[ c = \tilde{r}_{\text{new}}^*(3, 3)/z, \]
\[ s = \tilde{r}_{\text{new}}^*(4, 3)/z, \]
\[ z = (|\tilde{r}_{\text{new}}(3, 3)|^2 + |\tilde{r}_{\text{new}}(4, 3)|^2)^{1/2}. \]

After triangularizing \(\tilde{R}_{\text{new}}\), exact \(Q_{\text{new}}\) and \(R_{\text{new}}\) in the proposed QRU scheme are obtained, expressed as

\[ Q_{\text{new}} = (G Q_{\text{old}}^H)^H, \]
\[ R_{\text{new}} = G \tilde{R}_{\text{new}} = G (Q_{\text{old}}^H \tilde{H}_{\text{new}}). \]

Note that the lower-right diagonal element of \(R_{\text{new}}\) in (11), i.e., \(r_{\text{new}}(4, 4)\), has been transformed from real to complex-valued domain during this exact QR update. This can be easily resolved by performing an additional real-domain transformation, e.g., using Givens rotation matrix [3], in case if real-valued diagonal elements are required. In the next section, we incorporate the optimal channel ordering in the QRU schemes and describe a group-sorting mechanism to further exercise this complexity reduction. Moreover, a hybrid scheme will be introduced to explore the advantages of different QRD schemes under varying channel conditions.

III. HYBRID SQRD UPDATE ALGORITHM

SQRD is capable of improving detection performance by optimizing processing order based on energy of spatial channels. SQRD starts with a column permutation of the original channel matrix \(H\).

\[ \tilde{H} = HP. \]

where \(\tilde{H}\) and \(P\) denote the sorted channel and the permutation matrix, respectively. The optimal sorting strives to achieve \(|r(i, i)| > |r(i-1, i-1)|\), for \(i = 2, \ldots, N\), which can be computed using an iterative sorting Gram-Schmidt algorithm [7]. After sorting, a QRD is performed on \(\tilde{H}\) to obtain the orthogonal matrix \(Q\) and upper-triangular matrix \(R\). In the following, we discuss how to apply the low-complexity QRU in the context of SQRD.

A. Sorted QRU Scheme

By combining the traditional brute-force approach (i.e., computing QRD from scratch during half-H renewals) and the QRU scheme, a hybrid decomposition algorithm is formed which dynamically switches between the two schemes to reduce the computational complexity, depending on run-time conditions of the channel reordering. Evidently, the complexity reduction depends on the applicability of the QRU scheme. Intuitively, we could fix the position of antenna ports 0 and 1 to the right-most part of \(\tilde{H}_{\text{new}}\) in order to obtain a maximum complexity gain, since it completely avoids brute-force computation during half-H renewals. However, the advantage of channel reordering (for improving detection performance) is lost and we refer to this as Case-II. On the other hand (Case-III), where channel columns are permuted based on the optimal detection order without considering the position of renewed channel columns, the applicability of the QR-update is dramatically reduced. For example, considering the \(4 \times 4\) MIMO LTE-A, only \(2!\times 2! = 1/6\) of sorting combinations meet the required update condition, thus limiting the complexity reduction. As a consequence, a smart sorting strategy is needed to explore the low-complexity potential of the QR-update, while still retaining the performance gain of the optimal channel reordering.

B. Group-sort Algorithm

To fulfill the aforementioned requirement, we propose a group-sort algorithm for channel reordering. Instead of operating on individual columns, sorting of \(H\) is applied on two virtual groups, wherein columns associated with antenna ports 0 and 1 are tied together. This way, combinations of “columns” is reduced from \(4!\) to \(2!\). Consequently, the probability of having both altered columns at the right-most part of \(\tilde{H}_{\text{new}}\) increases 3 times, i.e., from \(1/6\) to \(1/2\). To reduce errors due to the sub-optimal sorting sequences, a two-step sorting scheme
is adopted. First, the sorting between groups is based on the total energy of bundled columns as

\[ I = \arg \max_{i=\{0,1\},\{2,3\}} \sum_i \|h_i\|^2, \]  

(13)

where \( I \) contains inter-group indexes, e.g., \( I = \{0,1\} \) if antenna ports 0 and 1 has the strongest channels. Second, the two columns within each group, e.g., indexes within \( I \), are intra-group based on the energy of individual columns. To conclude, Table I summarizes four cases of the hybrid SQRD scheme and their applicability, wherein we denote the group-sort method as Case-IV. A significant increase to the applicability of the low-complexity QRU is obtained, comparing to the precise sorting scheme in Case-III.

### C. Algorithm Evaluation

Before evaluating the overall system performance, an evaluation of the optimal sorting order between full-H and half-H update was performed. In Fig. 2, the probability of the change in the sorting order is shown. The probability of change in the sorting order increases with the increase of Doppler frequency, especially for ETU models. However, it should be noted that the probability of change for the strongest antenna is much lower, which in-turn leads to a good initial (starting) coordinate, critical in detectors. As a consequence, we perform sorting only during full-H updates and maintain the sorting order for half-H update slots, since the probability of change is quite low. This further reduces hardware overhead, e.g., sorting complexity and the storage of permutation matrix.

For simulating the FER of the proposed algorithms, the 3GPP Extended Vehicular A (EVA) channel model with a maximum Doppler frequency of 70 Hz is used. Operating at a 2.6 GHz carrier frequency, this corresponds to a speed of 29 km/h. Channel models with higher mobility, such as Extended Typical Urban (ETU) model with 300 Hz Doppler frequency, are not considered in this study, as smaller MIMO configurations (e.g., \( 2 \times 2 \)) or lower modulation schemes (e.g., QPSK) are expected to be used to mitigate serious interference induced by fast channel variations. In each FER simulation, 5000 LTE-A subframes are transmitted and decoded using a fixed-complexity sphere decoder [17]. The ECC scheme adopted in simulations is a rate 1/2 parallel concatenated turbo code with the coding block length of 5376 and 6 decoding iterations. Furthermore, we assume the receiver has perfect channel knowledge. Performance of the proposed group-sort QR-update and aforementioned cases (Table I) are shown in Fig. 3. Note that Case-III has the same performance as the brute-force approach and is used as a reference for FER measurements. Compared to the one where no QRDs are performed during half-H renewals (upper curve in Fig. 3), it clearly shows the importance of performing CSI and QR updates even for channels with moderate Doppler shifts. Additionally, the adoption of channel reordering during QR decomposition improves performance to that of the fixed-order approach, e.g., 1.1 dB difference between Case-II and III at FER = \( 10^{-2} \). Furthermore, the group-sort approach has only small performance degradation of about 0.2 dB compared to Case-III, however, with a large complexity reduction as analyzed in the following.

### D. Complexity Analysis

Table II summarizes the complexity (\( C \)) of computations (2)–(7) for an \( N \times N \) MIMO system. To perform the brute-force decomposition (2), we consider Gram-Schmidt algorithm [16] that has a total complexity of \( C_1 \). Computations required for both (3) and (7) have a total complexity of \( C_2 + C_3 \), which is significantly lower than \( C_1 \), e.g., by about 42% for \( N = 4 \). Note that the product of \( Q_{old}^H \tilde{H}_{new} \) in (3) requires only half of the matrix computations during QR updates, since only two columns change in \( \tilde{H}_{new} \). The complexity of sorting in both precise- and group-sort approach is denoted as \( C_4 \). Based on this analysis and in reference to Case-I,

---

**Table I**

<table>
<thead>
<tr>
<th>Channel ordering</th>
<th>Brute-force</th>
<th>QRU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Case-I</td>
<td>Ordering with precise sorting</td>
<td>100%</td>
</tr>
<tr>
<td>Case-II</td>
<td>Fixed order for antenna 0 and 1</td>
<td>0%</td>
</tr>
<tr>
<td>Case-III</td>
<td>Ordering with precise sorting</td>
<td>83.33%</td>
</tr>
<tr>
<td>Case-IV</td>
<td>Group-sort</td>
<td>50%</td>
</tr>
</tbody>
</table>
Table III shows the complexity reduction versus performance degradation of Case-II—IV for a 4×4 system. It shows that a 50% complexity reduction is obtained for Case-II. Moreover, combining the group-sort and the QR-update schemes results in more palatable trade-offs, i.e., 18% complexity reduction for only 0.2 dB performance degradation.

To further evaluate the hardware friendliness of the proposed algorithm, operations required in the four computations (Table II) are profiled. In Table IV, most operations are at vector level, representing a high degree of data level parallelism that can be exploited to achieve high processing throughput. In addition, most of them are shared among all computations, implying that extensive hardware reuse is possible.

IV. VLSI ARCHITECTURE OF THE VECTOR PROCESSOR

Based on the operation analysis, we present an Application-Specific Instruction-set Processor (ASIP) for performing the proposed hybrid SQRD algorithm. The hardware architecture is optimized jointly for flexibility, implementation cost, and processing speed. Regarding the flexibility requirements in contemporary system designs for coping with content switching (e.g., to adopt an appropriate algorithm based on the runtime update conditions in Table I), a run-time reconfigurable computing architecture is used. The adopted instruction set is tailored for baseband processing in MIMO-OFDM systems with special focus on the efficient computation of the operations listed in Table IV. The reconfigurable processor employs the vector processing concept with Single Instruction Multiple Data (SIMD) [18] architecture to parallelize the processing with high throughput.

A. Vector Processor

The implementation of the vector processor is based on the reconfigurable cell array framework presented in [19]. Fig. 4 shows a microarchitecture of the processor, consisting of 6 PEs and 2 MEs. Solid and dashed lines depict data and control bus, respectively.

Table II

<table>
<thead>
<tr>
<th>Complexity</th>
<th>Computation</th>
<th>Multiplication</th>
<th>Others*</th>
</tr>
</thead>
<tbody>
<tr>
<td>C1</td>
<td>QRD (2)</td>
<td>$N^3 + N^2$</td>
<td>$2N$</td>
</tr>
<tr>
<td>C2</td>
<td>$H_{old}^H H_{new}$ (3)</td>
<td>$\frac{1}{2}N^3$</td>
<td>0</td>
</tr>
<tr>
<td>C3</td>
<td>Triangularization (7)</td>
<td>$N^2 + 2N$</td>
<td>3</td>
</tr>
<tr>
<td>C4</td>
<td>Group Sorting (13)</td>
<td>$N^2 - N$</td>
<td>0</td>
</tr>
<tr>
<td>C5</td>
<td>Iterative Sorting (12)</td>
<td>$\frac{1}{2}N^3 - \frac{1}{2}N^2$</td>
<td>0</td>
</tr>
</tbody>
</table>

* Others include division, square-root, and CORDIC, where the latter one is often used for generating GR matrices.

Table III

<table>
<thead>
<tr>
<th>Complexity</th>
<th>Complexity and Performance Comparisons of Case-I—IV.</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Complexity</td>
</tr>
<tr>
<td>Case-I</td>
<td>$C_1 + C_5$</td>
</tr>
<tr>
<td>Case-II</td>
<td>$C_2 + C_3$</td>
</tr>
<tr>
<td>Case-III</td>
<td>$\frac{1}{2}C_1 + \frac{1}{2}(C_2 + C_3) + C_4$</td>
</tr>
<tr>
<td>Case-IV</td>
<td>$\frac{1}{2}C_1 + \frac{1}{2}(C_2 + C_3) + C_4$</td>
</tr>
</tbody>
</table>

Table IV

<table>
<thead>
<tr>
<th>Operation</th>
<th>A - B</th>
<th>A $\ominus$ B</th>
<th>A ± B</th>
<th>Scalar*</th>
</tr>
</thead>
<tbody>
<tr>
<td>QRD (2)</td>
<td>17</td>
<td>4</td>
<td>6</td>
<td>4</td>
</tr>
<tr>
<td>$H_{old}^H H_{new}$ (3)</td>
<td>8</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Triangularization (7)</td>
<td>10</td>
<td>0</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>Group Sorting (13)</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Iterative Sorting (12)</td>
<td>9</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

* Element-wise vector multiplication.

Scalar operations include division, square-root, and CORDIC.

IV. MICROARCHITECTURE OF THE VECTOR PROCESSOR

Based on the operation analysis, we present an Application-Specific Instruction-set Processor (ASIP) for performing the proposed hybrid SQRD algorithm. The hardware architecture is optimized jointly for flexibility, implementation cost, and processing speed. Regarding the flexibility requirements in contemporary system designs for coping with content switching (e.g., to adopt an appropriate algorithm based on the runtime update conditions in Table I), a run-time reconfigurable computing architecture is used. The adopted instruction set is tailored for baseband processing in MIMO-OFDM systems with special focus on the efficient computation of the operations listed in Table IV. The reconfigurable processor employs the vector processing concept with Single Instruction Multiple Data (SIMD) [18] architecture to parallelize the processing with high throughput.

A. Vector Processor

The implementation of the vector processor is based on the reconfigurable cell array framework presented in [19]. Fig. 4 shows a microarchitecture of the processor, consisting of 6 PEs and 2 MEs. Solid and dashed lines depict data and control bus, respectively.
four multiplication results in one clock cycle (see Fig. 5), achieving four concurrent vector operations in each clock cycle. To assist these vector computations, pre- and post-processing elements (PE2 and PE4 in Fig. 4) are employed, which perform operations such as matrix Hermitian (7) and result sorting (12)(13). By combining these three processing elements (PE2–4), several consecutive data manipulations can be accomplished in one single instruction without storing and loading intermediate results. This execution scheme is similar to that of VLIW processors, but has additional flexibility for loading configurations into individual processing elements without affecting others, hence resulting in reduced control overhead. The vector processor is parameterizable and in this work we have used 16 bits computations with 18 bits internal precision for accumulations. The register bank (ME2) contains 16 general purpose vector registers and each configuration memory can buffer up to 16 hardware configurations. Moreover, the instruction memory (ME1) has a capacity of 4 Kbits.

2) Multi-level Zero-delay Inner Loop Control: Multi-level loops are widely used in MIMO-OFDM systems to, for example, process multiple sub-carriers per OFDM symbol, multiple spatial layers per sub-carrier, and multiple iterations per spatial layer. To reduce control overhead during loop operations, a multi-level inner loop control scheme is adopted in PE1. Fig. 6(a) illustrates the deployed stack-based loop controller, which contains a configuration stack used to store the address of the first instruction in a loop (link address) and the corresponding loop count. Compared to other memory structures, the stack has a simple control mechanism and natively supports the execution order of multi-level loops. With the help of a finite-state machine (FSM), link addresses of loops are pushed into the stack in a “last-in-first-out” manner during the execution of a program. Fig. 6(b) shows a snapshot of the stack when the entire loop hierarchy of the enclosed code fragment is pushed into the stack. The adoption of multi-level inner loop control eases program writing, speeds up loop processing, and reduces program size and control overhead.

B. Algorithm Mapping and Scheduling

The presented hybrid SQRD scheme, including brute-force SQRD and group-sort QRU, are mapped manually onto the developed vector processor. The primary objective of the algorithm mapping and task scheduling is to achieve high utilization of the processor and to suffice the stringent timing constraint of LTE-A systems. Moreover, data-level parallelism in MIMO-OFDM systems is exploited to enable high processing throughput and energy efficiency.

In Fig. 7, we detail the mapped operations for both brute-force SQRD and the proposed group-sort QRU. Most of the operations in brute-force SQRD are vector operations and thus are mainly executed by PE3 (Fig. 4). The scalar division and square-root operations are out-sourced to PE5. As shown in Fig. 7(a), the GS-based SQRD performs matrix orthogonalization iteratively in a column-wise fashion. Each iteration starts with column-norm calculation of the matrix followed by a sorting operation, which is performed in PE4. Each iteration finishes with the energy-normalization of $Q$ matrix and the update of $R$ matrix. It should be pointed out that the workload of the straightforward operation mapping is unbalanced, e.g., calculating the column-norm of a $4 \times 4$ matrix occupies all four lanes in PE3 while only one lane is used when generating one column of the $Q$ matrix. To improve the hardware utilization, we leverage the data-level parallelism in OFDM systems and process the channel matrix of multiple sub-carriers simultaneously. More specifically, the processor works with one sub-carrier when current operation occupies all four lanes of PE3, otherwise multiple sub-carriers are processed in parallel. In summary, the execution time of brute-force SQRD for 12 sub-carriers is 87 cycles, resulting in an average execution time of 7.25 cycles per SQRD.
Table V shows the pseudo-code of the proposed QRU scheme for 12 sub-carriers within one LTE-A resource block. The corresponding operations can be separated into three stages, shown in Fig. 7(b). The first stage is the computation of $\tilde{R}_{\text{new}} = Q_{\text{old}}^H H_{\text{new}}$. Since only 2 columns of $H_{\text{new}}$ is updated during half-H renewals, column 3 and 4 of $R$ need to be computed (the left half of $R$ remains untouched). Updating each element requires a length-4 vector dot product. The Givens rotation matrix $G$ is obtained using CORDIC (PE6). The second and third stage consist of rotating $\tilde{R}_{\text{new}}$ and $Q_{\text{old}}^H$ with $G$. According to (8), $G$ is sparse, i.e., most elements being zeros, which enables parallel computation for multiple sub-carriers. As a result, QRU of 12 sub-carriers requires 42 cycles, averaging 3.5 cycles per QRU. Fig. 7(c) shows the resulting hardware utilization of brute-force SQRD and QRU mapping. It can be seen that the adopted algorithm mapping and task scheduling achieves high utilization of the processor, e.g., resulting in an average core-utilization of 84%.

C. Flexibility

By loading different programs and configurations to the processing and memory cells shown in Fig. 4, the platform has the potential to perform different algorithms, tasks, and radio standards. In the presented case study, the flexibility is demonstrated by dynamically adopting different QRD algorithms to efficiently perform MIMO channel pre-processing. Other examples include mapping of different antenna setups and run-time adaptation of system performance, e.g., adjusting the frequency of channel estimation and MIMO signal detection.

V. IMPLEMENTATION RESULTS AND DISCUSSION

To verify the aforementioned solution (both the hybrid adaptive SQRD algorithm and the reconfigurable vector processor), we implement the proposed SQRD processor using STMicroelectronics 65 nm CMOS technology and evaluate the corresponding hardware cost, processing throughput, and energy consumption with post-layout results.

A. Dual-Supply Voltage

As previously discussed, the proposed QRU scheme is capable of providing accurate $Q$ and $R$ matrix with much lower computational complexity, which is also demonstrated in Fig. 7. Given that the QRU requires only 48.2% of the execution cycles of the brute-force SQRD, one efficient method of harvesting power saving at run-time is to reduce the power supply of the processor at QRU mode. Here, the precondition is to maintain the processing throughput during SQRD mode switching in order to simplify the overall scheduling. In accordance with the available supply voltage in the provided standard cell library, we use ST general purpose standard $V_{\text{I}}$ (GPSVT) cell with 1.1 V voltage support for brute-force SQRD operations and 0.9 V for QRU operations. Cadence common power format (CPF) design flow is utilized to enable multiple-supply voltage implementation.

B. Post-layout Simulation Results

The designed SQRD processor has a total core area of 0.71 mm$^2$ equivalent to 362 K two-input NAND gates (GE).
According to post-layout static timing analysis, the maximum clock frequency for brute-force SQRD mode is 500 MHz, corresponding to a processing throughput of 69 MQRD/S. The power consumption is evaluated using *Synopsys PrimeTime* with switching activities annotated. In the brute-force SQRD mode, the processor operates at 500 MHz clock frequency with 1.1 V voltage supply and consumes 195.3 mW. In the QRU mode, the maximum clock frequency is 241 MHz with 0.9 V supply and the processor consumes 74.5 mW. The corresponding energy consumption for the two QRD computation modes is 2.83 nJ and 1.08 nJ, respectively. Table VI summarizes the implementation results.

To further understand the proposed architecture, the area and power breakdown of the processor is plotted, see Fig. 8. As can be seen, the vector block (PE1=4 and ME1=2) occupies 92% of the total area and consumes on average 85% of power. Among all, the homogeneous CMAC bank (PE3) consumes most of the area and power, and the master node (PE1) together with its instruction memory (ME1) take around 30%. It is worth emphasizing that the proposed low-complexity QRU offers 61.8% of energy saving in comparison to the brute-force SQRD. This is a result of the co-design among algorithm, architecture, and power-saving techniques.

Fig. 9 presents the resulting design trade-offs between energy and performance for Case-I–IV of the hybrid SQRD algorithm. Taking the brute-force QRD (Case-I) as a reference, the corresponding numbers on the horizontal axis measures the SNR degradation for reaching the target 10^-2 FER, while the percentage of energy reduction is shown on the vertical axis. Accordingly, algorithms having their coordinates towards the bottom-left corner is desired. In Fig. 9, it clearly shows that the proposed group-sort QR-update scheme (Case-IV) achieves a good compromise, i.e., trading 0.2 dB performance for around 31% energy reduction. In case of energy-constrained systems, the fixed-order scheme (Case-II) can be adopted to further reduce the energy consumption, i.e., by 62% in total, whereas the precise-sort scheme (Case-III) is used when high performance is demanded. All four update schemes in Fig. 9 are supported by the implemented vector processor, which can be dynamically reconfigured to adopt to run-time channel conditions and performance requirements.

### C. Result Discussion

Table VII shows the implementation results of some state-of-the-art MIMO channel pre-processors. The purpose here is not to conduct comparison saying one implementation is better than the other. Instead, we are interested in discussing techniques and strategies at different design layers that can be used to target various optimization objectives, such as system performance (algorithm selection), processing throughput, hardware efficiency, and flexibility. The discussion is supposed to serve as an overview and a design guideline on how to achieve balanced trade-off under different application scenarios. To ease the discussion we divide the table into two sections (columns), i.e., Application-Specific Integrated Circuit (ASIC) and ASIP, and cover 3 broad features (rows). Note that the 8×8 real-valued matrix supported by [11] and [3] in Table VII is an equivalent real domain transformation of the 4×4 complex-valued matrix. Therefore, it is fair to compare system performance with these works.

**Top Level Features:** ASIPs are reprogrammable within a certain application domain and hence are more flexible for algorithm adaptation and future evolution. In this study, we successfully leveraged this feature and mapped a hybrid SQRD scheme which switches adaptively between Group-sorting QRU and the brute force iterative Gram-Schmidt. Recently, the correlations in wireless channel have been utilized to reduce the computational complexity. For example, in [15], [20], an interpolation of QR in frequency domain is implemented, which can be merged with our proposed QRU scheme to utilize both the time and frequency correlations simultaneously. The reconfigurable processor has the capability to support future algorithm-level modifications.

**Architecture Features:** Advanced power management schemes like Dynamic Frequency Scaling (DFS) and Application-Driven Clock-Gating (AD-CG) [21] can be employed for low-power design [23]. In our design, the support
TABLE VII
IMPLEMENTATION RESULTS OF DIFFERENT QRD(SQRD) PROCESORS.

<table>
<thead>
<tr>
<th>Top level Features</th>
<th>ASIC</th>
<th>ASIP</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sorting, Update, Interpolation</td>
<td>[11], [3], [12], [20]</td>
<td>[21], [22], This Work</td>
</tr>
<tr>
<td>Matrix Dimension</td>
<td>8 × 8, 8 × 8, 4 × 4, 4 × 4</td>
<td>4 × 2, 4 × 4, 4 × 4</td>
</tr>
<tr>
<td>Architecture Features</td>
<td>Technology</td>
<td>Supply voltage [V]</td>
</tr>
<tr>
<td></td>
<td>65 nm, 180 nm, 90 nm, 90 nm</td>
<td>1.05</td>
</tr>
<tr>
<td>Reported Numbers</td>
<td>Execution cycles</td>
<td>4</td>
</tr>
<tr>
<td></td>
<td>Frequency [MHz]</td>
<td>225</td>
</tr>
<tr>
<td></td>
<td>Throughput [MQRD/s]</td>
<td>72</td>
</tr>
<tr>
<td></td>
<td>Power [mW]</td>
<td>252</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Technology Normalization</th>
<th>Gate Count [KGE]</th>
<th>Normalized Throughput*</th>
<th>Normalized Power*</th>
<th>Hardware Efficiency*</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>264</td>
<td>72</td>
<td>228.5</td>
<td>272.7</td>
</tr>
<tr>
<td></td>
<td>152</td>
<td>69.2</td>
<td>51.2</td>
<td>472.7</td>
</tr>
</tbody>
</table>

* Normalized Throughput [MQRD/s] = Throughput × Technology / 65
* Normalized Power = Power × Technology / (Supply Voltage / 1.1V)²
* Hardware Efficiency = Normalized Throughput/Gate Count
* Only relevant parts of the chip
* Data buffers excluded

for dual-supply voltage is implemented to reduce power during the low complexity QRU operation. One important aspect that needs to be pointed out here is the overhead in ASIP, e.g., the extra data storage for intermediate results and the corresponding memory bandwidth requirements (this is not included in this study). On the other side, ASIC implementations are highly streamlined and does not require any intermediate buffers. Optimized within the application domain, ASIPs can be deeply pipeline and parallelized to compensate for the overhead, e.g., running the same instruction for a large amount of data to reduce the instruction fetching overhead.

Internal wordlength is a critical aspect for efficient hardware implementation and system performance. For ASIC implementations, such as in [11] and [20], wordlength optimizations are performed to trade-off between hardware cost and performance. Their results show that QR pre-processing for LTE systems can be performed accurately using around 16 bits. However, performing wordlength optimization for ASIPs is not an easy task, since ASIPs are flexible platforms designed for mapping different algorithms and tasks. To suffice precision requirements for most of the baseband processing tasks, 16 bits computation with 16 bits internal precision for data accumulations is adopted in this work, similar to [22].

Hardware Efficiency: To get a rough idea of the efficiency of different design strategies, we also report the implementation efficiency of different works. The hardware efficiency of ASIPs is very high (e.g., [3] with NHE of 454) compared to ASICs, when considering the overheads involved (like memory and instruction reads). If only considering the processing core, the proposed ASIP achieves the same level of hardware efficiency as that of ASICs, thanks to the high hardware utilization (see Fig. 7).

VI. CONCLUSION

This paper exploits the energy efficient design methodology for algorithm development and VLSI implementation of a SQRD processor for 3GPP LTE-A systems. At the algorithm level, a hybrid QR-decomposition algorithm is proposed to reduce the computational complexity by combining low-complexity QR-update scheme and the traditional brute-force SQRD method. Algorithmic analyses show that a complexity reduction of up to 50% is achieved. To leverage the flexible decomposition algorithm, a reconﬁgurable vector processor is developed which is able to dynamically switch between different QR schemes based on energy distribution of spatial channels. Implementation with dual-supply voltage technique demonstrates up to ~ 62% energy reduction in comparison to the brute-force approach.

REFERENCES


Chenxin Zhang (S’09) received his M.S. degree in electrical engineering from Lund University, Sweden in 2009. He is currently working toward the Ph.D. degree in digital circuit design at the Department of Electrical and Information Technology at the same University. From Oct. 2012 to Feb. 2013, he was a visiting scholar at the Department of Electrical Engineering, University of California, Los Angeles. His research mainly focuses on developments of reconfigurable architectures for high computing performance and run-time flexible task mappings.

Hemanth Prabhu (S’11) received his B.E. and M.S. degrees in electronics and communication and System On Chip from VTU, India, and Lund University, Sweden, in 2006 and 2011 respectively. He has worked for 3 years as verification engineer in sasken communication, Bangalore, India. He is currently pursuing his Ph.D. degree in Lund University. His research interests are hardware implementation and signal processing for wireless communication systems such as LTE-A and Massive MIMO.

Liang Liu (S’10-M’12) is an Assistant Professor in the Department of Electrical and Information Technology at Lund University, Sweden. He received his B.S. and Ph.D. degree in the Department of Electronics Engineering (2005) and Microelectronics (2010) from Fudan University China. In 2010, he was with Rensselaer Polytechnic Institute, New York, USA as a visiting researcher. He joined Lund University as a Post-doc in 2010. His research interest includes wireless communication system and digital integrated circuits design.

Ove Edfors (M’91) is Professor of Radio Systems at the Department of Electrical and Information Technology, Lund University, Sweden. He graduated with a PhD in Signal Processing from Luleå University of Technology in 1996 and was appointed professor at Lund University in 2002. His research interests include statistical signal processing and low-complexity algorithms with applications in wireless communications. He has long experience with OFDM and MIMO systems and his current research focus is on how realistic propagation characteristics and hardware limitations influence system performance and baseband processing complexity.

Viktor Öwall (M’90) received the M.Sc. and Ph.D. degrees in electrical engineering from Lund University, Sweden, in 1988 and 1994, respectively. During 1995 to 1996, he joined the Electrical Engineering Department, the University of California at Los Angeles as a Postdoc. Since 1996, he has been with the Department of Electrical and Information Technology, Lund University, Sweden. He is currently full Professor at the same department and since 2009 the Head of Department. He is the Director of the VINNOVA Industrial Excellence Center in System Design on Silicon (SoS). His main research interest is in the field of digital hardware implementation, especially algorithms and architectures for wireless communication and biomedical applications.