Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

Authors

Francesc Massanes

Marie Cadennes

Jovan G. Brankov

Abstract

In this paper we describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block-matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation we compared the execution times of a GPU and a CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that a GPU card can shorten computation time by a factor of 200 for an integer search grid and by a factor of 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the evaluation shows that the method used to split data across the cards is important, but that an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized, non-full-grid-search CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the Simplified Unsymmetrical multi-Hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed a modest improvement, even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence with a resolution of 720×480 pixels, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
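To make the SAD criterion and the full grid search concrete, here is a minimal CPU sketch in plain Python. It is not the authors' CUDA implementation: the function names `sad` and `full_search`, the parameters `block` and `radius`, and the choice to skip candidates that fall outside the image are all assumptions made for illustration. It covers only the integer search grid; a non-integer grid would additionally require interpolating the reference image.

```python
def sad(ref, cur, ry, rx, cy, cx, block):
    """Summed absolute difference between two block x block patches.

    (ry, rx) is the top-left corner of the patch in `ref`,
    (cy, cx) is the top-left corner of the patch in `cur`.
    """
    total = 0
    for dy in range(block):
        for dx in range(block):
            total += abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
    return total


def full_search(ref, cur, top, left, block=4, radius=2):
    """Return (dy, dx, cost) of the best integer displacement for one block.

    The block at (top, left) in `cur` is compared against every candidate
    position in `ref` within +/- `radius` pixels. Candidates that would
    extend past the image border are skipped (an illustrative choice).
    Ties keep the first candidate found in scan order.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = top + dy, left + dx
            if ry < 0 or rx < 0 or ry + block > height or rx + block > width:
                continue
            cost = sad(ref, cur, ry, rx, top, left, block)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best
```

Images are passed as lists of rows of integer pixel values, and `full_search(ref, cur, top, left)` is called once per block. Each block and each candidate displacement is evaluated independently of the others, which is what makes the full search a natural fit for a parallel GPU implementation.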

Similar resources

Investigating the Performance of Motion Estimation Block-Matching Algorithms on GPU Cards

In the field of video compression, motion estimation (ME) is a process that leads to high computational complexity. Implementation of ME block-matching (BM) algorithms on general-purpose Central Processing Units (CPUs) has resulted in poor performance. In this paper we investigate the performance of two BM ME algorithms: Three Step Search (TSS) and Four Step Search (4SS) on Graphics Processing U...

Compute Unified Device Architecture (CUDA) Based Finite-Difference Time-Domain (FDTD) Implementation

Recent developments in the design of graphics processing units (GPUs) have made it possible to use these devices as alternatives to central processor units (CPUs) and perform high performance scientific computing on them. Though several implementations of the finite-difference time-domain (FDTD) method have been reported, the unavailability of high level languages to program graphics cards had been ...

Parallel Implementation of Bias Field Correction Fuzzy C-Means Algorithm for Image Segmentation

Image segmentation is one of the most important steps in medical diagnosis. Bias field estimation is one of the most interesting techniques for correcting the intensity inhomogeneity artifact in an image. However, such a technique requires substantial processing power and is quite expensive for large images such as medical images. Hence the idea of parallelism becomes increasi...

Exploiting current-generation graphics hardware for synthetic-scene generation

Increasing seeker frame rate and pixel count, as well as the demand for higher levels of scene fidelity, have driven scene generation software for hardware-in-the-loop (HWIL) and software-in-the-loop (SWIL) testing to higher levels of parallelization. Because modern PC graphics cards provide multiple computational cores (240 shader cores for a current NVIDIA Corporation GeForce and Quadro cards...

SiftCU: An Accelerated Cuda Based Implementation of SIFT

Scale Invariant Feature Transform (SIFT) is a popular image feature extraction algorithm. SIFT’s features are invariant to many image-related variables, including scale and change in viewpoint. Despite its broad capabilities, it is computationally expensive. This characteristic makes it hard for researchers to use SIFT in their work, especially in real-time applications. This is a common problem ...

Journal:

Journal of Electronic Imaging

Volume 20, Issue 3

Pages: -

Publication date: 2011
