Abstract—Motion estimation occupies the heaviest computation in HEVC (high efficiency video coding). Many fast algorithms such as TZS (test zone search) have been proposed to reduce the computation. Still the huge computation of the motion estimation is a critical issue in the implementation of HEVC video codec. In this paper, motion estimator architecture with optimized number of PEs (processing element) is presented by exploiting early termination. It also reduces hardware size by exploiting parallel processing. The presented motion estimator architecture has 8 PEs, and it can efficiently perform TZS with very high utilization of PEs.

Keywords—Motion estimation, test zone search, high efficiency video coding, processing element, optimization.

I. INTRODUCTION

With the advent of UHD (ultra-high definition) images, a new efficient video coding technique has been required for video compression and transmission. HEVC [1], [2] is newly-established international standard for video coding by JTC1 (Joint Technical Committee) of ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) and VCET (Video Coding Experts Group) of ITU-T (International Telecommunication Union - Telecommunication Standardization Sector). It achieved significant improvement in compression ratio, but it suffers from huge computation due to complicated compression techniques.

Motion estimation [3]-[6] is the most computation-intensive processes in HEVC. There are many types of motion estimation, but BMA (block-matching algorithm) [3] is widely used in the hardware implementation. BMA divides the current frame into several blocks. Each block (this is called as current block) is matched with some blocks (they are called as search window blocks, and their positions are called as search positions) in some neighborhood region (this is called as search window). The most similar block (this is called as matched block) is found by iterative matched. SAD (sum of absolute difference) is used as a matching criterion, and lower SAD means more similar. Two components are encoded instead of whole reference block - (1) the block difference (this is called as residuals) between the current and matched blocks, and (2) their relative position displacements (this is called as motion vector). This is illustrated in Fig. 1 in detail.

FS (full search) [3] exhaustively matches all search positions in the search window. It shows the best compression ratio, but it suffers from huge computation when the image size is large. Many fast algorithms such as TSS (three-step search) [4], [5] and TZS [6] has been proposed to reduce the computation by skipping some search positions.

TZS is known as one of the most efficient motion estimation algorithm in HEVC. It first performs sparse scan in the search window to find the most possible region where the moving
object exists in. Then it performs medium scan with subsampling. After that, it performs precise scan to find the final matched block.

TZS significantly reduces the computation of FS, since it skips most search positions in the search window. Many researches have shown that its compression ratio is comparable with FS, while its computation is lower than 1/1000 with FS. However, TZS exploits several different stages with different block sizes, different search directions, and different resolution. In the hardware implementation of TZS motion estimator, these irregularities significantly increases the hardware complexity and they severely hinder the reuse of PEs (processing element)

This paper presents a motion estimator architecture for TZS search in HEVC coding. It fully supports early termination to reduce motion estimation computation. It has multiple PEs for parallel processing, and the number of PEs is optimized to achieve high utilization of hardware resources.

II. TEST ZONE SEARCH

As illustrated in Fig. 2, TZS consists of five major stages, i.e. motion vector prediction, grid search, raster search, star refinement, and raster refinement. Note that either star refinement or raster refinement is exploited in general. Most TZS motion estimators have four stages.

(1) Motion Vector Prediction
As illustrated in Fig. 3, SADs are calculated with four motion vectors, i.e. motion vectors of neighboring blocks \((MV_1, MV_2, \text{ and } MV_3)\) and their median \((MV_4)\). Motion vector with minimum SAD is determined as \(PMV\).

(2) Grid Search
Whole search window is searched with diamond (or square) pattern, as illustrated in Fig. 4. Default pattern is diamond pattern, but square pattern can be also used. It is a single-step search, i.e. diamond (or square) pattern is used only once. Displacement between minimum SAD position and \(PMV\) position is \(ui\text{BestDistance}\).

(3) Raster Search
Raster search is a subsampled full search with an interval of \(i\text{Raster}\) pixels, as illustrated in Fig. 5. When \(ui\text{BestDistance}\) is smaller than \(i\text{Raster}\), this step is skipped.

(4) Star Refinement
Star refinement is a multi-step search using diamond (or square) pattern. In each step, the size of the diamond (or square) pattern halves until its size is 1 (=4 search positions). When its size reaches 1, additional two missing points are searched.

(5) Raster Refinement
Raster refinement is similar with star refinement, but only the outer 8 search positions of diamond (or square) pattern are searched in each step, as illustrated in Fig. 6.

When multiple PEs are exploited in the TZS motion estimator, ET (early termination) should be considered where some computation is skipped during TZS. When ET occurs, some PEs are idle due to the skipped computation. However, the execution time can be reduced only when all PEs are idle.

There are two types of early termination in TZS. Inter-stage early termination occurs when TZS skips some stages in the flowchart of Fig. 2. Intra-stage early termination occurs during block matching, as illustrated in Fig. 7.

![TZS test zone search algorithm](image-url)
this paper, the number of PEs is optimized by considering intra-stage early termination.

Fig. 3 Motion vector prediction

III. PROPOSED MOTION ESTIMATOR ARCHITECTURE

Most motion estimators exploit multiple PEs for parallel processing. Usually, each PE processes each search position. Number of PEs is important in the motion estimator architecture. More PEs reduce execution time by parallel processing, but more PEs are idle due to the following reasons.

(A) Some PEs are idle due to intra-stage early termination, and the number of idle PEs increases along with the number of PEs.

(B) Some PEs are idle when the number of currently processing search positions is smaller than the number of PEs.

For (A), the number of idle PEs and the reduction of execution time are dynamically changed on the image sequences. Therefore, this effect cannot be optimized. However, it is obvious that early termination significantly reduces execution time. Therefore, the proposed motion estimator architecture has been designed to support early termination in hardware level.

For (B), the number of idle PEs and the reduction of execution time are deterministic for given motion estimation stages. Tables I and II show the utilization of the PEs and the execution cycles per current block for various number of PEs, respectively.

From Tables I and II, it is obvious that the optimum number of PEs is 8. When the number of PE is larger than 8, the utilization of PEs significantly decreases, but the execution time does not decrease. When the number of PE is smaller than 8, the utilization of PEs increases the execution time significantly increases.

Fig. 8 shows the architecture of the proposed motion estimator with 8 PEs. It is simple but efficient, and it can support early termination. The proposed motion estimator is currently under design, and it will be implemented in SoC (system-on-chip) next year.
TABLE I

<table>
<thead>
<tr>
<th>Stages</th>
<th>Utilization of PEs</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4 PEs</td>
</tr>
<tr>
<td>Motion Vector Prediction</td>
<td>100.0%</td>
</tr>
<tr>
<td>Grid Search</td>
<td>90.6%</td>
</tr>
<tr>
<td>Raster Search</td>
<td>87.5%</td>
</tr>
<tr>
<td>Star Refinement</td>
<td>96.9%</td>
</tr>
<tr>
<td>Raster Refinement</td>
<td>96.9%</td>
</tr>
</tbody>
</table>

TABLE II

<table>
<thead>
<tr>
<th>Stages</th>
<th>Execution Cycles per Current Block</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>4 PEs</td>
</tr>
<tr>
<td>Motion Vector Prediction</td>
<td>1024 cycles</td>
</tr>
<tr>
<td>Grid Search</td>
<td>8192 cycles</td>
</tr>
<tr>
<td>Raster Search</td>
<td>9464 cycles</td>
</tr>
<tr>
<td>Star Refinement</td>
<td>8192 cycles</td>
</tr>
<tr>
<td>Raster Refinement</td>
<td>8192 cycles</td>
</tr>
</tbody>
</table>

Fig. 8 The proposed motion estimator architecture

ACKNOWLEDGMENT

This research was supported by System IC Commercialization R&BD Program (10049498) funded by the Ministry of Trade, Industry, and Energy, Korea.

REFERENCES


Seongsoo Lee received B.S, M.S, and Ph.D. degrees in E.E. from Seoul National University, Korea in 1991, 1993, and 1998, respectively. In 1998-2000, he was a research associate in Institute of Industrial Science, University of Tokyo, Japan. In 2000-2002, he was a research professor in Department of Electronic Engineering, Ewha Womans University, Korea. He joined School of Electronic Engineering at Soongsil University, Korea in 2002, where he is currently a professor. His research interests include high efficiency video coding, low-power SoC, multimedia SoC, power management SoC, battery management SoC, sensor SoC, and automotive SoC.