Abstract—This paper presents a VLSI design approach of a high-speed and real-time 2-D Discrete Wavelet Transform computing. The proposed architecture, based on new and fast convolution approach, reduces the hardware complexity in addition to reduce the critical path to the multiplier delay. Furthermore, an advanced two-dimensional (2-D) discrete wavelet transform (DWT) implementation, with an efficient memory area, is designed to produce one output in every clock cycle. As a result, a very high-speed is attained. The system is verified, using JPEG2000 coefficients filters, on Xilinx Virtex-II Field Programmable Gate Array (FPGA) device without accessing any external memory. The resulting computing rate is up to 270 M samples/s and the (9,7) 2-D wavelet filter uses only 18 kb of memory (16 kb of first-in-first-out memory) with 256×256 image size. In this way, the developed design requests reduced memory and provide very high-speed processing as well as high PSNR quality.

Keywords—Discrete Wavelet Transform (DWT), Fast Convolution, FPGA, VLSI.

I. INTRODUCTION

The Discrete Wavelet Transform (DWT) has become one of the most used techniques for image compression and is applied in a large category of applications [1]. DWT can provide significant compression ratios without great loss of visual quality than the previous techniques such as the Discrete Cosine Transform (DCT) and the Discrete Fourier Transform (DFT). The DWT present the main part of the JPEG2000 standard, which permits both lossy and lossless compression of digital images. It allows an encoded image to be reconstructed progressivelly.

The compression phase is mainly divided into three sequential steps: (1) Discrete Wavelet Transform, (2) Quantization, and (3) Entropy Encoding. After preprocessing, each component is independently analyzed by an appropriate discrete wavelet transform. The computational block diagram of the functionalities of the compression system is shown in Fig. 1.

Conventionally, programmable DSP chips are used to implement such algorithms for low-rate applications and the VLSI application specific integrated circuits (ASICs) for higher rates [2]. The FPGAs are programmable logic devices that provide sufficient quantities of logic resources that can be adapted to support a large parallel distributed architecture.

Lifting and convolution present the two computing approaches to achieve the discrete wavelet transform [3]. While conventional lifting-based architectures require fewer arithmetic operations compared to the convolution-based approach for DWT, they sometimes have long critical paths. If $Ta$ and $Tm$ are the delays of the adder and multiplier, respectively, then the critical path of the lifting-based architecture for the (9,7) filter is $(4\times Tm + 8 \times Ta)$, while that of the convolution implementation is $(Tm + 2 \times Ta)$ [4]. In addition to this and for the reason to preserve proper precision, intermediate variables widths are larger in lifting-based computing. As a result, the lifting multiplier and adder delays are longer than the convolution ones.

Many VLSI lifting-based DWT architectures have been developed and implemented to reduce the memory requirements and the critical path [5]-[14]. A JPEG2000 single-chip encoder with 81 M samples/s is presented in [8]. A high-speed and area-efficient system is presented in [11], 7.5 Kb is used for the (5,3) wavelet filter and 30 Kb for the (9,7) one, with 256×256 image size. Recently, the new lifting-based architectures represent the faster option, and the main critical path is about $Tlm$, which represents the lifting multiplier delay.

The present work introduces new reflection of VLSI design for very high-speed image computing using new convolution-based DWT.

Firstly, new and fast convolution-based architecture is developed to reduce the hardware requirement complexity in addition to reduce the critical path to the multiplier delay ($Tcm$). Thus, we reach a very high-speed computing compared to the lifting option.

Secondly, a novel and adapted two-dimensional (2-D) discrete wavelet transform system, with an efficient memory area (e.g. 16 Kb for the (9,7) wavelet filter with 256×256 image size), is designed to produce one output in every clock cycle.
II. DISCRETE WAVELET TRANSFORM

A. One-Dimensional Discrete Wavelet Transform

The basic DWT can be realized by convolution-based implementation using the FIR-filters to do the transform. The input discrete signal X(n) is filtered by a low-pass filter (h) and a high-pass filter (g) at each transform level. The two output streams are then sub-sampled by simply dropping the alternate output samples in each stream to produce the low-pass subband Y_L and high-pass subband Y_H. The associated equations can be written (1).

\[ y_L(n) = \sum_{i=0}^{N-1} h(2n-i) \cdot x(i) \cdot y_H(n) = \sum_{i=0}^{N-1} g(2n-i) \cdot x(i) \]  

(1)

Fig. 2 shows the signal analysis and reconstruction in one-dimensional (1-D) Discrete Wavelet Transform.

B. Two-Dimensional Discrete Wavelet Transform

For two-dimensional (Image) analysis and reconstruction the multi-resolution approach for Discrete Wavelet transform with reduced hardware complexity and memory. The main principle of the architecture can be applied to implement any symmetric filter. The (9, 7) wavelet filter presents the developed example. These (9, 7) filter has 9 low-pass filter coefficients \( h = \{ h-4, h-3, h-2, h-1, h_0, h_1, h_2 \), \( h_3, h_4 \} \) and 7 high-pass filter coefficients \( g = \{ g-2, g-1, g_0, g_1, g_2, g_3, g_4 \} \).

Based on the low-pass filter coefficients symmetry \( (h-i = h_i) \), the equation 2 can be given.

\[ \begin{align*}
    y_{L0} &= h_0(0 + x_0) + h_1(0 + x_1) + h_2(0 + x_2) + h_3(0 + x_3) + h_4(0 + x_4) \\
    y_{L1} &= h_0(0 + x_0) + h_1(x_1 + x_2) + h_2(x_0 + x_3) + h_3(0 + x_4) + h_4(0 + x_4) \\
    y_{L2} &= h_0(0 + x_0) + h_1(x_1 + x_2) + h_2(x_0 + x_3) + h_3(x_0 + x_4) + h_4(x_0 + x_4)
\end{align*} \]  

(2)

Similarly, the high-pass filter coefficients present symmetry as follows:

\[ \begin{align*}
    y_{H0} &= h_0(0 + x_0) + h_1(x_1 + x_2) + h_2(x_0 + x_3) + h_3(x_0 + x_4) + h_4(x_0 + 0) \\
    y_{H1} &= h_0(0 + x_0) + h_1(x_1 + x_2) + h_2(x_0 + x_3) + h_3(0 + x_4) + h_4(0 + x_4) \\
    y_{H2} &= h_0(0 + x_0) + h_1(x_1 + x_2) + h_2(x_0 + x_3) + h_3(0 + x_4) + h_4(0 + 0)
\end{align*} \]  

(3)

As a result (4) can be given.

\[ \begin{align*}
    y_{L0} &= g_0(0 + 0) + g_1(0 + x_0) + g_2(0 + x_1) + g_3(0 + x_2) + g_4(0 + x_3) \\
    y_{L1} &= g_0(0 + x_0) + g_1(x_1 + x_2) + g_2(0 + x_3) + g_3(0 + x_4) + g_4(0 + x_4) \\
    y_{L2} &= g_0(0 + x_0) + g_1(x_1 + x_2) + g_2(x_0 + x_3) + g_3(x_0 + x_4) + g_4(x_0 + x_4)
\end{align*} \]  

(4)

The architecture to compute the \( Y_{Li} \) and \( Y_{Hd} \) is shown in Fig. 4. Additional registers are added, between multipliers and adders, to speed up the computing. The critical path is reduced to the multiplier delay (Tcm). Furthermore, the outputs \( Y_{Li} \) and \( Y_{Hd} \) are obtained alternately at the trailing edges of the even and odd clock cycles. (e.g., \( Y_{LD}, Y_{LD}, Y_{LD} \), \( Y_{HR}, Y_{HR}, Y_{HR} \), \( Y_{HR} \) are obtained at clock cycles 9, 11, 13, . . . and \( Y_{HR}, Y_{HR}, Y_{HR} \) are obtained at clock cycles 8, 10, 12, . . . respectively).

The preliminary gate count estimates and number of components used in the proposed fast convolution-based architecture are given in Table I.
TABLE I
PRELIMINARY GATE COUNT ESTIMATES AND NUMBER OF COMPONENTS USED IN THE PROPOSED ARCHITECTURE

<table>
<thead>
<tr>
<th>Component</th>
<th>Gate count</th>
<th>Number of units</th>
</tr>
</thead>
<tbody>
<tr>
<td>Adder (8b+8b)</td>
<td>200</td>
<td>4</td>
</tr>
<tr>
<td>Adder (16b+16b)</td>
<td>400</td>
<td>4</td>
</tr>
<tr>
<td>Register (8 bit)</td>
<td>50</td>
<td>8</td>
</tr>
<tr>
<td>Register (9 bit)</td>
<td>56</td>
<td>5</td>
</tr>
<tr>
<td>Register (16 bit)</td>
<td>100</td>
<td>10</td>
</tr>
<tr>
<td>Multiplier (8b x 9b)</td>
<td>810</td>
<td>5</td>
</tr>
</tbody>
</table>

The total number of the used gate and the estimated work frequency for the (9,7) lifting architecture and the proposed one, using the field programmable gate array Xilinx Virtex-II, is provided in Table II.

TABLE II
THE TOTAL NUMBER OF THE USED GATE AND THE ESTIMATED WORK FREQUENCY

<table>
<thead>
<tr>
<th>Architecture</th>
<th>The total gate</th>
<th>Work frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>The lifting architecture [6]</td>
<td>13400</td>
<td>210 Mhz</td>
</tr>
<tr>
<td>The proposed architecture</td>
<td>8130</td>
<td>270 Mhz</td>
</tr>
<tr>
<td></td>
<td></td>
<td>275 Mhz*</td>
</tr>
</tbody>
</table>

*With 7 bits coding coefficients

IV. THE PROPOSED FAST CONVOLUTION-BASED 2-D DWT ARCHITECTURE

In this approach and to avoid the need of a transpose circuit between the two levels, the system starts the column-processing as soon as sufficient numbers of rows have been filtered. Fig. 5 presents the main 2-D DWT fast convolution-based diagram.

The proposed FPGA implementation of the 2-D Discrete Wavelet Transform is designed with two fast convolution-based blocks. The first one, which is similar to the 1-D block, realizes the row Discrete Wavelet Transform and uses D-Latch devices for the X(n) storage. The second achieves the Column Discrete Wavelet Transform using an FPGA block RAM storage of the computed rows.

Fig. 4 The proposed fast convolution-based 1-D (9,7) DWT block diagram

Fig. 5 2-D DWT fast convolution-based block diagram
$Y_L$ and $Y_H$ present the first level outputs. The $Y_{LL}$, $Y_{LH}$, $Y_{HL}$ and $Y_{HH}$ present the second level and the 2-D DWT outputs. $Y_{L0}$, $Y_{L1}$, $Y_{L2}$, ... are obtained at clock cycles 9, 11, 13, ... and $Y_{H0}$, $Y_{H1}$, $Y_{H2}$, ... are obtained at clock cycles 8, 10, 12, ... respectively.

The first row of $Y_{LH}$ and $Y_{HH}$ can be obtained after the beginning of the third row storage of the first level outputs. After the beginning of the fifth storage of the first level outputs, we can obtain the second row of $Y_{LH}$ and $Y_{HH}$ and the first row of $Y_{LL}$ and $Y_{HL}$ (Fig. 6).

Nine FPGA block RAM in Dual-Port Mode are required to accomplish the second level of the Parallel Distributed 2-D DWT Architecture with the (9,7) wavelet filters. The SelectRAM 1-D DWT fast convolution-based Modules is composed by blocks RAM, interface circuit and a computing circuit (Fig. 7). The computing circuit presents the main architecture of the D-Latch 1-D DWT fast convolution-based.

At each step of computing, only one block RAM is selected in write mode:

1st step:
- Block RAM 1 and block RAM 2: read mode
- Block RAM 3: write mode

2nd step:
- Block RAM 1 to block RAM 4: read mode
- Block RAM 5: write mode

Advanced step:
- Block RAM 1 to block RAM 8: read mode
- Block RAM 9: write mode

We have implemented the 2-D DWT Parallel Distributed Architecture described in the previous section using one of the XC2V1000 Xilinx Virtex-II FPGA devices. This device contains 1M system gates (5,120 slices), 40 Multiplier Blocks and can operate at a maximum clock speed of 450 MHz [19]. Each Xilinx Virtex-II FPGA block SelectRAM cell is an 18 Kbit fully synchronous memory. The two ports have independent inputs and outputs and are independently clocked. The data widths of the two ports can be configured independently, providing built-in bus-width conversion. As a result, one output is produced in every clock cycle and this configuration is able to accomplish 2048×2048 block computing.

Using the first-in-first-out (FIFO) memory, the total memory size (in words) needed $2_{-D_M}$, for the proposed one level 2-D computing, is

$$2_{-D_M} = (K-1) \times N$$  \hspace{1cm} (5)
We assume that: (1) The maximum number of filter taps among the highpass and low-pass filters is K; (2) the image size is N×N. Fig. 8 presents the main FIFO-based 2-D DWT fast convolution-based diagram.

Fig. 8 The FIFO-based 2-D DWT fast convolution-based block diagram

According to Eq. (5), only eight blocks in Dual-Port are needed to compute the (9,7) wavelet filter and four blocks in Dual-Port are for the (5,3) one. Fig. 9 shows the (9,7) FIFO Memory 1-D DWT fast convolution-based block diagram.

At each step of computing, all Memory blocks are selected in FIFO mode:

1st step:
Memory block 1 and memory block 2: FIFO mode

2nd step:
Memory block 1 to memory block 4: FIFO mode

Advanced step:
Memory block 1 to memory block 8: FIFO mode

We conclude that the fast convolution-based scheme and the appropriate 2-D DWT architecture reduce the memory area and the critical path to the convolution multiplier delay, which is the smallest one. The memory resources comparisons, with 256×256 image size, between our 2-D architecture and others are listed in Table III.

The PSNR, for the selected images, after forward and inverse Discrete Wavelet Transform with 7 bits coding coefficients, are given in Table IV.

V. CONCLUSION

In this paper, we have proposed a parallel architecture for very high-speed computing Discrete Wavelet Transform using SRAM and FIFO memory. To produce one output in every clock cycle in addition to reduce the critical path as well as an efficient memory area, new fast convolution-based architecture approach is performed. In this approach, the system stars the column-processing as soon as sufficient numbers of rows have been filtered. Two fast convolution-based blocks, for the two-dimensional (2-D) discrete wavelet transform (DWT), are used to accelerate the computing.

The Xilinx Virtex-II FPGA implementation, of developed architecture using the (9,7) filter, required eight FIFO memory in Dual-Port Mode. The (3,5) 2-D wavelet filter uses only 8 kb of memory with 256×256 image size and the (9,7) one uses only 16 kb.

REFERENCES


