opengl - Blend two images using GPU


I need to blend thousands of pairs of images, fast.

My current code is below. _apply is a function pointer to a function like blend; it is one of many functions that can be passed in, not the only one. Each function takes two values and outputs a third, and it is applied to each channel of each pixel. I would prefer a solution that is generic over such functions rather than a solution specific to blending.

typedef byte (*transform)(byte src1, byte src2);

transform _apply;

for (int i = 0; i < _framesize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

byte blend(byte src, byte blend)
{
    int resultpixel = (src + blend) / 2;
    return (byte)resultpixel;
}

I am doing this on the CPU and the performance is terrible. My understanding is that doing it on the GPU would be much faster. The program needs to run on computers that have either NVIDIA GPUs or Intel GPUs, so whatever solution I use needs to be vendor independent. If the GPU solution uses OpenGL, it would be platform independent as well.

I think a GLSL pixel shader would help, but I am not familiar with pixel shaders or with how to use them on 2D objects (like images).

Is that a reasonable solution? If so, how do I do it in 2D? If there is a library that already does this, that would be great to know.

Edit: I am receiving the image pairs from different sources. One always comes from a 3D graphics component in OpenGL (so it is in the GPU originally). The other one comes from system memory, either from a socket (in a compressed video stream) or from a memory-mapped file. The "sink" of the resulting image is the screen. I am expected to show the images on screen, so going through the GPU is an option, as is using SDL to display them.

The blend function that is going to be executed the most is this one:

byte patch(byte delta, byte lo)
{
    int resultpixel = (2 * (delta - 127)) + lo;

    if (resultpixel > 255)
        resultpixel = 255;

    if (resultpixel < 0)
        resultpixel = 0;

    return (byte)resultpixel;
}

Edit 2: The image coming from GPU land arrives in this fashion: FBO -> PBO -> system memory.

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_BGR, GL_UNSIGNED_BYTE, 0);

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void* mappedregion = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);

So it seems better to work in GPU memory. The other bitmap can come from system memory, and the video decoder may eventually be in GPU memory as well.

Edit 3: One of the images comes from D3D while the other one comes from OpenGL. It seems that something like Thrust or OpenCL is the best option.

From the looks of your blend function, this is an entirely memory-bound operation. The caches on the CPU can only hold a small fraction of the thousands of images you have, meaning most of the time is spent waiting for RAM to fulfill load/store requests, and the CPU will idle a lot.

You will not get a speedup by having to copy the images from RAM to the GPU, having the GPU arithmetic units idle while they wait for GPU RAM to feed them data, waiting for GPU RAM again to write the results, and then copying everything back to main RAM. Using the GPU for this could actually slow things down substantially.
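To put rough numbers on that (assuming 1080p BGR frames purely for illustration): one frame is about 1920 x 1080 x 3 ≈ 6 MB, so blending one pair means moving roughly 18 MB across the PCIe bus (two inputs up, one result back). Even at an effective transfer rate of a few GB/s, that is on the order of milliseconds per pair before any arithmetic happens, which is comparable to what the CPU needs just to stream the same data out of main RAM.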


But I could be wrong and you might not be saturating your memory bus already. You will have to try it on your system and profile it; a quick sketch of what I mean by that follows. Beyond that, here are some simple things you can try to optimize it.
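A minimal profiling sketch (this is an assumption about your setup, built around the loop from your question, not your actual code): time one pass over a frame and convert it into an effective bandwidth figure. The factor of 3 assumes one byte read from each input and one byte written per element. If the result is already close to what your RAM can deliver, more arithmetic power will not help.

#include <chrono>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < _framesize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}
auto t1 = std::chrono::steady_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();
printf("effective bandwidth: %.1f MB/s\n", 3.0 * _framesize / seconds / 1.0e6);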

1. Multi-thread

I would focus on optimizing the algorithm directly on the CPU. The simplest thing is to go multi-threaded, which can be as simple as enabling OpenMP in your compiler and updating your loop:

#include <omp.h> // add this, along with enabling OpenMP support in your compiler

...

#pragma omp parallel for // <--- the compiler magic happens here
for (int i = 0; i < _framesize; i++)
{
    source[i] = _apply(source[i], blend[i]);
}

If your memory bandwidth is not already saturated, this should speed up the blending by roughly as many cores as your system has.

2. Micro-optimizations

Another thing you can try is to implement your blend using the SIMD instructions most CPUs have nowadays. I can't give a portable example without knowing exactly which CPU you are targeting.
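That said, assuming an x86 target with SSE2 (a reasonable bet for the Intel- and NVIDIA-equipped machines you describe), the averaging blend maps almost directly onto the _mm_avg_epu8 intrinsic and processes 16 bytes per iteration. This is only a sketch, and note that the intrinsic rounds up where your scalar code rounds down:

#include <emmintrin.h> // SSE2 intrinsics

typedef unsigned char byte;

// sketch: (src + blend) / 2 over 16 bytes at a time; _mm_avg_epu8 actually
// computes (a + b + 1) / 2, so results can differ by one from the scalar code
void blend_sse2(const byte* src, const byte* blnd, byte* dst, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16)
    {
        __m128i a = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(blnd + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_avg_epu8(a, b));
    }
    for (; i < n; i++) // scalar tail for sizes not divisible by 16
        dst[i] = (byte)((src[i] + blnd[i]) / 2);
}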

You can also try unrolling your loop to mitigate some of the loop overhead.

One easy way to achieve both of these is to leverage the Eigen matrix library by wrapping your data in its data structures.

#include <Eigen/Dense>
using namespace Eigen;

// initialize your data and result buffers
byte *source = ...
byte *blend = ...
byte *result = ...

// tell Eigen where your data/buffers are, treating them as dynamic vectors of bytes
// (this is a cheap shallow copy, no data is moved)
Map<Matrix<byte, Dynamic, 1> > sourcemap(source, _framesize);
Map<Matrix<byte, Dynamic, 1> > blendmap(blend, _framesize);
Map<Matrix<byte, Dynamic, 1> > resultmap(result, _framesize);

// perform the blend, letting Eigen apply all manner of insane optimization voodoo
// under the covers (the cast to int avoids overflow in the byte addition)
resultmap = ((sourcemap.cast<int>() + blendmap.cast<int>()) / 2).cast<byte>();

3. Use the GPGPU

Finally, to provide a direct answer to your question, here is an easy way to leverage the GPU without having to know much about GPU programming. The simplest thing to try is the Thrust library. You will have to rewrite your algorithms as STL-style algorithms, but that's pretty easy in your case.

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// functor for the blending
struct blend_functor
{
  template <typename Tuple>
  __host__ __device__
  void operator()(Tuple t)
  {
    // c[i] = (a[i] + b[i]) / 2;
    thrust::get<2>(t) = (thrust::get<0>(t) + thrust::get<1>(t)) / 2;
  }
};

// initialize your data and result buffer
byte *source = ...
byte *blend = ...
byte *result = NULL;

// copy the data into vectors on the GPU
thrust::device_vector<byte> a(source, source + _framesize);
thrust::device_vector<byte> b(blend, blend + _framesize);
// allocate the result vector on the GPU
thrust::device_vector<byte> c(_framesize);

// process the data on the GPU device
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(
                     a.begin(), b.begin(), c.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(
                     a.end(), b.end(), c.end())),
                 blend_functor());

// copy the data back into main RAM
thrust::host_vector<byte> resultvec = c;
result = resultvec.data();
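Since you want this to stay generic over the transform function, other blends follow the same pattern: one functor per transform, with the same zip-iterator call. As a sketch (written from the scalar version in your edit, not tested), the patch function would become:

// sketch: the patch transform from the question, as a Thrust functor
struct patch_functor
{
  template <typename Tuple>
  __host__ __device__
  void operator()(Tuple t)
  {
    // get<0> is delta, get<1> is lo, get<2> is the result
    int resultpixel = 2 * ((int)thrust::get<0>(t) - 127) + (int)thrust::get<1>(t);
    if (resultpixel > 255) resultpixel = 255;
    if (resultpixel < 0)   resultpixel = 0;
    thrust::get<2>(t) = (byte)resultpixel;
  }
};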

A neat thing about Thrust is that once you have written your algorithms in a generic way, it can automagically use different back ends for the computation. CUDA is the default back end, but it can also be configured at compile time to use OpenMP or TBB (Intel's threading library).
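For example (this is an assumption about how you would build it, not something from your code), the back end can be selected with a preprocessor define before any Thrust header is included, so the same source can be built with a plain C++ compiler on machines without CUDA:

// select the OpenMP back end at compile time; this is usually passed on the
// compiler command line instead, e.g. -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP
#define THRUST_DEVICE_SYSTEM THRUST_DEVICE_SYSTEM_OMP
#include <thrust/device_vector.h>
#include <thrust/for_each.h>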

