Abstract—In H.264/AVC video encoding, rate-distortion optimization for mode selection plays a significant role to achieve outstanding performance in compression efficiency and video quality. However, this mode selection process also makes the encoding process extremely complex, especially in the computation of the rate-distortion cost function, which includes the computations of the sum of squared difference (SSD) between the original and reconstructed image blocks and context-based entropy coding of the block. In this paper, a transform-domain rate-distortion optimization accelerator based on fast SSD (FSSD) and VLC-based rate estimation algorithm is proposed. This algorithm could significantly simplify the hardware architecture for the rate-distortion cost computation with only ignorable performance degradation. An efficient hardware structure for implementing the proposed transform-domain rate-distortion optimization accelerator is also proposed. Simulation results demonstrated that the proposed algorithm reduces about 47% of total encoding time with negligible degradation of coding performance. The proposed method can be easily applied to many mobile video application areas such as a digital camera and a DMB (Digital Multimedia Broadcasting) phone.

Keywords—Context-adaptive variable length coding (CAVLC), H.264/AVC, rate-distortion optimization (RDO), sum of squared difference (SSD).

I. INTRODUCTION

H.264/AVC is the newest hybrid video coding standard developed by ITU-T Video Coding Expert Group (VCEG) and ISO/IEC MPEG Video Group named Joint Video Group (JVT) [1, 2]. H.264/AVC has proved its superiority in coding efficiency over its predecessors, e.g., it shows a more than 40% rate reduction over H.263 [3]. The rate-distortion optimization (RDO) is one of the essential parts of the H.264/AVC encoder to achieve the much better coding performance. However, the computational complexity of the RDO technique is extremely high. Hence, the cost function computation makes H.264/AVC impossible to be realized in real time applications without high computing hardware.

In order to accelerate H.264/AVC video coding, a number of researches have been made to explore fast algorithms in motion estimation [4-5] and mode decision [6-7]. To reduce complexity, a new cost function for intra 4x4 mode decisions was proposed in [9]. In this cost function, sum of absolute integer transform difference (SAITD) is used in distortion part and a rate prediction algorithm is used in rate part. A new transform domain rate estimation and distortion measure was proposed to reduce the rate-distortion cost function computation with ignorable coding performance degradation [10]. But the computation reduction is low. In order to avoid the entropy coding method during mode decision process, several methods [11, 12, 13] are introduced to predict the bit rate of 4x4 block. To reduce the complexity of distortion part of RDO technique, a new fast sum of squared difference (FSSD) computation algorithm with use of an iterative table-lookup quantization process was proposed in [14]. This FSSD algorithm was based on the theoretical equivalent of the SSDs in spatial and transform domain. This approach avoided the inverse quantization/transform and pixel reconstruction processes with nearby no rate-distortion performance degradation.

In this paper, a transform-domain rate-distortion optimization accelerator is developed based on fast sum of squared difference (FSSD) and rate estimation algorithm. A method to estimate the rate of context-adaptive variable length coding (CAVLC) is proposed. An efficient hardware structure for implementing the proposed transform-domain RDO accelerator is also introduced.

The remainder of this paper is organized as follows. Section 2 provides the review of rate distortion optimized mode decision technique. In section 3, transform domain fast sum of squared difference (FSSD) calculation is described. Section 4 introduces the fast rate estimation method. In section 5, algorithm of proposed rate distortion optimization accelerator is presented. Hardware architecture of proposed RDO accelerator is proposed in section 6. The simulation results of the proposed method are presented in section 7. Finally, section 8 concludes the paper.
II. RATE DISTORTION OPTIMIZED MODE DECISION OF H.264/AVC

To take the full advantages of all modes, the H.264/AVC encoder can determine the mode that meets the best RD tradeoff using RDO mode decision scheme. The best mode is the one with minimum rate-distortion (RD) cost and this cost is defined as

\[ J_{RD} = SSD(S,C) + \lambda \cdot R \] (1)

where \( \lambda \) is the Lagrange multiplier, \( R \) is the number of coded bits for each macroblock which can be written as follows:

\[ R = R_{header} + R_{motion} + R_{res} \] (2)

where \( R_{header}, R_{motion} \) and \( R_{res} \) means the number of bits need to encode the header information, motion vectors and quantized residual block, respectively. The SSD(S,C) in (1) is the sum of the squared difference (SSD) between the original blocks S and the reconstructed block C, and it can be expressed as

\[ SSD(S,C) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} (S_{ij} - C_{ij})^2 \] (3)

where \( S_{ij} \) and \( C_{ij} \) are the \((i,j)\)th elements of the current original block S and the reconstructed block C, respectively. In order to compute RD cost for each mode, same operation of forward and inverse transform/quantization and entropy coding is repetitively performed. All of these processing explains the high complexity of RD cost computation.

III. FAST SUM OF ABSOLUTE DIFFERENCE CALCULATION

The cause of difference between original block and reconstructed block is due to the quantization error. Hence the spatial domain SSD and transform domain SSD is theoretically equivalent [14]. Mathematically,

\[ SSD(S,C) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} (Q_{ij}f_{ij}^* - \hat{f}_{ij}^*)^2 \] (4)

where \( f_{ij}^* \) is the \((i,j)\)th element of integer cosine transformed block, \( \hat{f}_{ij}^* \) is the \((i,j)\)th element of inverse quantized transform block and \( q_{ij} \) is the \((i,j)\)th element of scaled transform matrix. The relationship between \( f_{ij}^* \) and \( \hat{f}_{ij}^* \) is defined as [14]

\[ \hat{f}_{ij} = Q^{-1}(Q_{ij}f_{ij}^*) = \Delta_{ij} \cdot round(f_{ij}^*/\Delta_{ij}) = \Delta_{ij} \cdot Z_{ij} \] (5)

where \( \Delta_{ij} = \Delta / q_{ij} \) represents the new quantization step size defined in [14] and \( \Delta \) is the quantization step size of original H.264 encoder. In order to avoid the multiplication and division operation in (5), we proposed an iterative table-lookup quantization process in [14].

From (4), it is shown that to calculate the SSD we need 16 multipliers and 16 square operators. Actually the \( q_{ij} \) is the constant value. The value of \( q_{ij} \) at different pixel position is given in Table 1. In the first region of Table 1 multiplication by \( q_{ij} \) can easily be implemented by shifting right by two bits. To avoid the multiplication operation in second region \( q_{ij} \) is approximated as \( \frac{1}{10} \approx \frac{1}{2^3} \). Hence multiplication in this region is replaced by shifting right by three bits. Similarly, for region III, \( q_{ij} \) is approximated as \( \frac{1}{4} \approx \frac{1}{2^3} + \frac{1}{2^5} \). From simulations, it is shown that these approximations do not significantly affect the SSD calculation as well as rate distortion performance. It should be noted that no computation is necessary for shifting operator. In order to avoid the square operation, we have used a square table. It should be noted again that \( f_{ij}^* \) is the coefficients before quantization and \( \hat{f}_{ij}^* \) is the coefficients after inverse quantization. So the sign of \( f_{ij}^* \) and \( \hat{f}_{ij}^* \) is always same. Additionally, the differences between \( f_{ij}^* \) and \( \hat{f}_{ij}^* \) must be smaller than quantization step size \( \Delta_{ij} \). Mathematically,

\[ f_{ij}^* - \hat{f}_{ij}^* < \Delta_{ij} \]

\[ f_{ij}^* - \hat{f}_{ij}^* < \Delta \]

\[ q_{ij}(f_{ij}^* - \hat{f}_{ij}^*) < \Delta \] (6)

**Table I**

<table>
<thead>
<tr>
<th>Region</th>
<th>Position (i, j)</th>
<th>( q_{ij} )</th>
</tr>
</thead>
<tbody>
<tr>
<td>I</td>
<td>(0,0),(0,2),(2,0),(2,2)</td>
<td>( a^2 - \frac{1}{4} )</td>
</tr>
<tr>
<td>II</td>
<td>(1,1),(3,1),(3,3)</td>
<td>( b^2 - \frac{1}{10} )</td>
</tr>
<tr>
<td>III</td>
<td>Others</td>
<td>( ab = \frac{1}{4} \cdot \frac{2}{\sqrt{2}} )</td>
</tr>
</tbody>
</table>

For example, the value of \( \Delta \) is 10, 16, 26, 40 and 64 for QP is 24, 28, 32, 36, and 40, respectively. If we choose QP=24, the left hand side of (6) is always less that 10. Additionally, the values of left hand sides are always integer. If we store the squares of all integer values which are less than 10 (for QP=24), then 16 square operation can be saved. In this way, we need additional memory requirement. But the size of memory is not large. We have to store only 10 elements in case of QP=24. Thus, the computational intensive multiplication and square operations can be avoided.
IV. FAST RATE ESTIMATION

Context adaptive variable length coding (CAVLC) uses five syntax elements to encode the 4x4 block. To estimate the bits for quantized transform coefficients, we estimate the number of bits for each of five different types of symbols of CAVLC separately.

A. Coefficient Token

Both the total number of nonzero coefficients (Ncoeff) and the number of trailing +/- 1s (T_L) are coded as combined event. For all elements, except chroma DC, a choice between three tables and one fixed length codeword is made [15]. N_T is a parameter used for table selection. Assume N_u and N_L is the number of non-zero coefficients of upper and left block, respectively. If both upper and left blocks are available then N_T = (N_u + N_L)/2. If only upper block available N_T = N_u; if only left block available N_T = N_L; if neither is available N_T = 0. Since to encode coefficient token, three different VLC tables are used, it needs more memory to store all the possible symbols. This is not efficient for hardware implementation in terms of memory requirement. It is well known that neighboring blocks are highly spatially correlated. Hence there is high correlation between parameter N_T and number of non-zero coefficients of current block (Ncoeff). Therefore, we can replace N_T by Ncoeff for selection of VLC tables. The algorithm for table selection of proposed method given below:

if (Ncoeff > 7): 6 bit fixed length codeword
if (4 <= Ncoeff <= 7): Num-VLC2
if (2 <= Ncoeff < 4): Num-VLC1
if (0 <= Ncoeff < 2): Num-VLC0
else
    search rate from Table 3;

B. Sign of Trailing ones

One bit is used to signal sign information of trailing +/- 1s. 0 is used for positive and 1 is used for negative. So bit consumption to encode the sign of trailing ones (R_trail) is exactly equal to the number of trailing ones (N_T).

C. Encode the Level

CAVLC makes the choice of the VLC look-up table for the level adaptive in a way where the choice depends on the recently coded levels [16]. One out of 7 level VLC tables is used to encode the level information [15]. Level VLC0 has its own structure while the other tables, level VLCN, N=1 to 6, share common structure. From the observation of VLC tables, we have seen that bit consumption to encode a coefficient is increasing with increasing the absolute value of that coefficient. Additionally, the bit consumption also depends on the absolute value of previously coded level coefficient. From the observation, rate for level is estimated as follows

\[ R_{\text{level}(i)} = |L_i| \quad \text{for highest frequency component} \quad (7.a) \]
\[ R_{\text{level}(i)} = \alpha_i |L_i| + \beta_i |L_{i-1}| \quad \text{for other component} \quad (7.b) \]
where \( L_{i-1} \) is the recently encoded level of \( i \) th level. In order to set the weighting factors, we have done several experiments for different video sequences (Akiyo, Foreman, Stefan, and Mobile) with QCIF format at different QP values. We have observed RD performance of these video sequences at

<table>
<thead>
<tr>
<th>Sequence</th>
<th>QP</th>
<th>28</th>
<th>32</th>
<th>36</th>
<th>40</th>
</tr>
</thead>
<tbody>
<tr>
<td>Akiyo</td>
<td></td>
<td>0.13</td>
<td>0.09</td>
<td>0.05</td>
<td>0.02</td>
</tr>
<tr>
<td>Claire</td>
<td></td>
<td>0.06</td>
<td>0.05</td>
<td>0.05</td>
<td>0.02</td>
</tr>
<tr>
<td>Foreman</td>
<td></td>
<td>0.21</td>
<td>0.19</td>
<td>0.15</td>
<td>0.13</td>
</tr>
<tr>
<td>Container</td>
<td></td>
<td>0.26</td>
<td>0.21</td>
<td>0.17</td>
<td>0.18</td>
</tr>
<tr>
<td>Stefan</td>
<td></td>
<td>0.24</td>
<td>0.17</td>
<td>0.18</td>
<td>0.17</td>
</tr>
<tr>
<td>Miss_America</td>
<td></td>
<td>0.09</td>
<td>0.06</td>
<td>0.03</td>
<td>0.03</td>
</tr>
<tr>
<td>Silent</td>
<td></td>
<td>0.24</td>
<td>0.18</td>
<td>0.15</td>
<td>0.16</td>
</tr>
<tr>
<td>Salesman</td>
<td></td>
<td>0.20</td>
<td>0.23</td>
<td>0.14</td>
<td>0.16</td>
</tr>
</tbody>
</table>

In order to justify the above algorithm, we have done several experiments and calculated the probability of wrong table selection of different video sequences with different QP values. Table 2 shows the probability of wrong table selection. From Table 2, we have seen that this probability is very low for low motion sequences such as Akiyo, Claire, and Miss America. Although this probability for medium and high motion sequences is large, it does not greatly influence the rate-distortion performance. This is because number of bits for a particular coefficient token is almost similar in all VLC tables. Let us assume a symbol with number of non-zero coefficients is 4 and number of trailing ones is 3. The number of bits to encode this symbol are 9, 7 and 6 for VLC tables 0, 1 and 2, respectively. By combining the mentioned algorithm with VLC tables, we have developed a rate table to encode the coefficient token. Rate table is generated by collecting the length of codeword from the different tables. For example, when Ncoeff is either 0 or 1, data is collected from VLC table 0 and while the value of Ncoeff is either 2 or 3, data is collected from VLC table 1. From observation of VLC tables, we have seen that the rate is exactly equal to 6 while number of non-zero coefficient of current block (Ncoeff) is greater than 7. Otherwise, rate of 4x4 block (Rcoeff) is estimated based on a new rate table defined in Table 3. Therefore, algorithm for proposed rate estimation for coefficient token of 4x4 block is given as follows:

If (Number of nonzero coefficients > 7) Rcoeff=6;
else search rate from Table 3;
different combinations of weighting factor. Better RD performance was found at $\alpha_1 = 3$ and $\beta_1 = 1$.

D. Encode Total Zeros

The codeword Total_zeros is the number of zeros between the last non-zero coefficient of the zig-zag scan and its start. In case of mode decision process, only number of bits to encode the symbol is necessary. If only the length of codeword (instead of codeword) is retrieved from the total_zero table, then we can avoid the extra calculation for codeword generation of CAVLC during mode decision process.

E. Encode Runs

Since after transformation and quantization, blocks typically contain mostly zeros, CAVLC algorithm uses run-level coding to represent strings of zeros compactly. Run means the number of preceding zeros before each non-zero coefficient and Zeros_left called the number of zeros left before each non-zero coefficient. Based on Run and Zeros_left, a run VLC tables is used to encode this element [15]. From observation of run VLC table, it is shown that no of bits to encode this coefficient is increasing with the value of run. The estimated rate for run is proposed as

$$R_{run(i)} = \alpha_2 Run_i + \beta_2$$

(8)

Here $\alpha_2$ and $\beta_2$ are the weighting factors. From observation of run table, we have seen that value of $\alpha_2$ and $\beta_2$ are depends on the Run. Table 4 shows the value of $\alpha_2$ and $\beta_2$ with different value of run.

<table>
<thead>
<tr>
<th>Run</th>
<th>$\alpha_2$</th>
<th>$\beta_2$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0,1, and 2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>3,4,5, and 6</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>$&gt;6$</td>
<td>1</td>
<td>-3</td>
</tr>
</tbody>
</table>

Mathematically,

$$R_{res(est)} = R_{coeff} + R_{trailing} + \sum_{i=1}^{P} R_{level(i)} + R_{zero} + \sum_{j=1}^{Q} R_{run(j)}$$

(9)

where $P$ and $Q$ are the number of level and run, respectively.

Fig. 1 shows the comparison of our proposed method with actual rate for the Foreman sequence. Data is collected from the first 100 4x4 block. QP factor is set as 28. It is shown that the proposed method is very closely matched with actual rate.

V. RATE-DISTORTION OPTIMIZATION ACCELERATOR

The overall rate-distortion cost computation process using the proposed method combined with FSSD [14] can be summarized as:

Step 1: Compute the predicted block using inter or intra frame prediction: $P$

Step 2: Compute the residual (difference) block: $D = S - P$

Step 3: ICT transform the residual block: $F^* = ICT(D)$

Step 4: for $i=0$ to $N-1$ and $j=0$ to $N-1$ determine the quantized coefficients $Z_{ij}$ by the following iterative table-lookup quantization process:

Step I: Set $k=0$ and if $f_{ij}^* < 0$ then set sign$=-1$, otherwise sign$=1$.

Step II: If $|f_{ij}^*| \geq k\Delta_{ij}$, then $k=k+1$ and goto Step II;

Step III: Set $Z_{ij} = sign\times k$ and $\hat{f}_{ij}^* = sign\times k\Delta_{ij}$;

Step 5: Calculate SSD;

Step 6: Estimate the number of bits $R_e = R_{header} + R_{motion} + R_{res(est)}$

Step 7: Calculate the R-D cost : $J_{RD} = SSD + \lambda \cdot R_e$

The block diagram of proposed RDO accelerator is shown in Fig. 2. The major difference is that the conventional quantization process is replaced by the iterative table-lookup quantization and the SSD calculation is preformed in the ICT transform domain. Additionally, entropy coding technique is replaced by fast rate estimation algorithm. We can obtain the rate-distortion cost without computations of two scale transforms, quantization, inverse quantization, inverse ICT transform and entropy coding.
VI. Hardware Architecture of Proposed RDO Accelerator

Fig. 3 describes a possible hardware structure for the iterative table-lookup quantization implementation [14]. As each coefficient does not depend on the others, the quantization process can be performed in parallel. In other words, we can quantize all the coefficients of $f_{ij}$ simultaneously. At first, the quantization points and boundary points are generated by quantization table generator at the beginning of encoding process and then stored in the RAM, in which there is a pointer that points to the current boundary point. Then the current boundary point and quantization point are loaded in a register, which will compare with $f_{ij}$ after the sign of $f_{ij}$ is abstracted by a sign selector. If the comparator judges that $|f_{ij}|$ is larger, then ‘Loop’ signal takes effect which tells counter to add one and the pointer to shift forward and point to the next boundary point. When $|f_{ij}|$ is smaller than the current boundary point, ‘Out’ signal takes effect which resets the pointer to the initial position, asks the counter to output the accumulated value $z_{ij}$ and asks the register to output the current quantization value, $\hat{f}_{ij}$. From the structure in Fig. 3, we can find that the proposed table-lookup quantization process is very suitable for hardware implementation, since it does not require complicated operation modules and can execute in the parallel mode.

Fig. 4 shows the architecture of SSD calculator. It is shown that 16 multiplication operations are replaced by 24 shifting operators. It should be noted that no computation is necessary for shifting operator. In order to avoid the square operator, we have used a square table. In the square table, square of all integers within quantization step size $\Delta$ are stored. In this way, we need additional memory requirement. But the size of memory (ROM) is not large because of the limited value of $\Delta$. For example encoder has to store 64 values if the quantization parameter QP is 40. Architecture of shifter for region III is given in Fig. 5. Here multiplier is replaced by two shift right operators.
The hardware architecture for proposed fast rate estimation algorithm is shown in Fig. 6. The proposed architecture contains a number of counters and register to store the information for the block. Non-zero coefficients counter is used to store the number of non-zero coefficients (N_{coeff}). Trailing one counter is used to store the number of trailing +/-1s. Total zero counter is used to store the total number of zeros before the last non-zero coefficient. Run before register is used to store the number of zeros preceding each non-zero coefficient. The level detector checks if the coefficient is either level or not. The proposed architecture begins by reading the coefficients from the input buffer in reverse zig-zag order. In each cycle, it reads one coefficient from the input buffer analyzes the coefficient and updates the information stored in the related counter and register. Reverse zig-zag scanning enables us to determine the necessary information for encoding 4x4 block. It takes 16 cycles to read the coefficients and stored the corresponding information in the counter and register file for each 4x4 block. Based on the number of non-zero coefficients, comparator 1 select either the rate table for coefficient token (ROM1) should be used or not. Comparator 1 compares the output of non-zero coefficient counter with content of register 4. Register 4 always store a fixed value 7. Comparator1 sent a signal “1” when the number of non-zero coefficients is greater than the fixed value 7. Then the register 1 is enabled. If comparator 1 sent the signal “0”, rate table for coefficient token is activated. In this case, rate is found by searching the rate table based on the number of non-zero coefficients and number of trailing ones. Similarly number of bits to encode the total zero is calculated by rate table for total zeros (ROM2). The level detector checks if the coefficient is zero or not. If the level is non-zero, it will be stored into the level-first-in-first-out buffer (FIFO), and the corresponding run information will be also stored to the run-FIFO. Based on the content of FIFO, level and run rate estimators predict the rate of level and run, respectively.
VII. SIMULATION RESULTS

To evaluate the performance of the proposed method, JM 9.6 [8] reference software is used in simulation. Different types of video sequences are used as test materials. A group of experiments were carried out on the test sequences with different quantization parameters. All the simulations are conducted under Windows Vista operating system, with Pentium 4 2.2 G CPU and 1 G RAM. The conditions of the simulation are listed in Table 5. The comparison results are produced and tabulated based on the average difference of total encoding time (ΔT%), the average PSNR differences (ΔPSNR), and the average bit rate differences (ΔRate%). PSNR and bit rate differences are calculated according to the numerical averages between RD curves derived from the original and proposed algorithm, respectively. The detail procedure to calculate these differences can be found in [17].

In order to evaluate the complexity reduction, ΔT (%) is defined as follows

\[
\Delta T\% = \frac{T_{\text{original}} - T_{\text{proposed}}}{T_{\text{original}}} \times 100\%
\]  

where, \(T_{\text{original}}\) denotes the total encoding time of the JM 9.6 encoder and \(T_{\text{proposed}}\) is total encoding time of the encoder with proposed method.

A. Results with CAVLC entropy coding method

In this experiment, after selecting the best mode by proposed technique, CAVLC is used for final entropy coding of best mode. The rate-distortion performance comparison of the conventional RDO and proposed RDO in terms of PSNR, bit-rate, and encoding time are shown in Tables 6. When only rate estimation is used, the degradation of RD performance is very negligible. We have seen that while only rate estimation is used, the average PSNR degradation is 0.017 dB and the average bit rate increment is 0.615%. When both two algorithms are used, the rate-distortion performance differences are always less than 2.2 % and 0.07 dB in terms of bit rate and PSNR, respectively. Fig. 7 shows the comparison of rate-distortion curves of proposed accelerator with original RD optimization method. The performance of proposed
method is similar with actual RD curves. As shown in Table 6, the proposed accelerator can reduce the encoding time by around 47% (on average), in which about 14% reduction is achieved by the FSSD and about 33% reduction is achieved by rate estimation.

B. Results with CABAC entropy coding method

Although the proposed rate-distortion accelerator is developed based on CAVLC, it is also suitable during mode decision when Content Adaptive Binary Arithmetic Coding (CABAC) is used as entropy coding method. This is because the proposed algorithm is used only for RD optimized mode selection process. After selection of best mode, true entropy coding is used. As shown in Table 7, PSNR reduction and bit rate increment are about 0.08 dB and 2.5% on average, respectively. Fig. 8 shows the comparison of RD optimized curves. As shown in Fig. 8, the rate distortion performance of proposed method is much closed with actual RD optimized curve.

C. Comparison with other methods

In this experiment, the proposed rate estimation method is compared with other rate estimation methods. Only rate estimation method is enabled and CAVLC is used as final entropy coding method. Complexity reduction is calculated by (11).

Complexity reduction (%) = \( \frac{T_{\text{ref}} - T_{\text{proposed}}}{T_{\text{ref}}} \times 100\% \) (11)
where, $T_{ref}$ is the total encoding time of the other conventional method and $T_{proposed}$ is the total encoding time with proposed rate estimation method only.

Fig. 7. RD performance of proposed accelerator with CAVLC (X-axis: Bit rate in kbps, Y-axis: PSNR in dB)
Fig. 8. RD performance of proposed accelerator with CABAC (X-axis: Bit rate in kbps, Y-axis: PSNR in dB)
Comparison results are presented in Table 8. Positive values for the PSNR and bit rate differences indicate increments, and negative values indicate decrements. Negative values of complexity reductions indicate the increment of complexity of proposed method. We have seen that the RD performance of proposed method is always better than the other methods mentioned in Table 8. As compared to [11], the proposed method increases the video quality by 0.09 dB and reduces the bit rate of 1.05% on average with more similar complexity reduction. With compared to [13], the proposed algorithm achieves a better complexity saving which is about 5% and improves the RD performance slightly. The proposed method achieves more PSNR increment (about 0.10dB) and bit rate reduction (about 3.10%) of [9] with the expense of a little computational load.

VIII. CONCLUSION

In this paper, fast rate-distortion optimization method for mode decision of H.264/AVC is proposed. FSSD algorithm avoids conventional quantization, inverse quantization and inverse cosine transform. Additionally, entropy coding is replaced by fast rate estimation method. The proposed rate distortion accelerator can be used for both CAVLC and CABAC in H.264 encoding. Of course, the accuracy for CAVLC should be better but our simulation results show that the performance for CABAC with use of proposed accelerator in mode selection is similar to full RD cost encoding using CABAC. The proposed technique reduces total encoding time by about 47% on average with negligible degradation of coding performance. The proposed method is suitable for many video application areas such as a digital camera and a mobile phone.

ACKNOWLEDGMENT

This work was substantially supported by a CERG grant from Hong Kong Government, Hong Kong SAR, China. [Project No.9041251].

REFERENCES


Mohammed Golam Sarwer was born in Comilla, Bangladesh in 1978. He received B.Sc. and M.Phil degree both in electronic engineering from Khulna University of Engineering and Technology and City University of Hong Kong in 2001 and 2007, respectively. Mr. Sarwer was working as a full time faculty member in Khulna University of Engineering and Technology from 2001 to 2005. Currently, he is PhD candidate in the department of electrical and computer engineering, University of Windsor, Canada.

Lai-Man Po (M’92) received the B.Sc. and Ph.D. degrees, both in electronic Engineering, from the City University of Hong Kong in 1988 and 1991, respectively. Since November 1991, he has been with the Department of Electronic Engineering, City University of Hong Kong, and is currently Associate Professor. His research interests are in vector quantization, motion estimation and image/video compression. Dr. Po was awarded First Prize (1988) in the Paper Contest for Students and Non-corporate Members organized by the Institute of Electronics and Radio Engineers of Hong Kong, and a Postgraduate Fellowship of the Sir Edward Youde Memorial Council (1988 to 1991).

Q.M. Jonathan Wu (M’92) received his Ph.D. degree in electrical engineering from the University of Swansea, U.K., in 1990. From 1995, he worked at the National Research Council of Canada (NRC) for 10 years where he became a senior research officer and group leader. He is currently a Professor in the Department of Electrical and Computer Engineering at the University of Windsor, Canada. Dr. Wu holds the Tier I Canada Research Chair (CRC) in Automotive Sensors and Sensing Systems. He has published more than 100 peer-reviewed papers in areas of computer vision, image processing, intelligent systems, robotics, micro-sensors and actuators, and integrated micro-systems. His current research interests include 3D image analysis, active video object tracking and extraction, vision-guided robotics, sensor analysis and fusion, wireless sensor network, and integrated micro-systems. Dr. Wu is an Associate Editor for IEEE Transaction on Systems, Man, and Cybernetics (part A). Dr. Wu has served on the Technical Program Committees and International Advisory Committees for many prestigious conferences.