CPU architecture - What causes the retired instructions to increase?

I have a 496 × O(n^3) loop. As a blocking optimization, I am processing 2 images at a time instead of 1; in raw terms, I am unrolling the outer loop by 2. (The non-unrolled version of the code is shown below.) By the way, I'm using an Intel Xeon X5365 machine with 8 cores, a 3 GHz clock, a 1333 MHz bus frequency, 8 MB of shared L2 cache (4 MB shared between every 2 cores), and 32 KB each of L1-I and L1-D cache.

    for (imageno = 0; imageno < 496; imageno++) {
        for (unsigned int k = 0; k < 256; k++) {
            double z = o_l + (double)k * r_l;
            for (unsigned int j = 0; j < 256; j++) {
                double y = o_l + (double)j * r_l;
                for (unsigned int i = 0; i < 256; i++) {
                    double x[1] = { o_l + (double)i * r_l };

                    // Project the voxel (x, y, z) into the image plane.
                    double w_n = (a_n[2] * x[0] + a_n[5] * y + a_n[8] * z + a_n[11]);
                    double u_n = ((a_n[0] * x[0] + a_n[3] * y + a_n[6] * z + a_n[9]) / w_n);
                    double v_n = ((a_n[1] * x[0] + a_n[4] * y + a_n[7] * z + a_n[10]) / w_n);

                    for (int loop = 0; loop < 1; loop++) {
                        px_x[loop] = (int) floor(u_n);
                        px_y[loop] = (int) floor(v_n);
                        alpha[loop] = u_n - px_x[loop];
                        beta[loop]  = v_n - px_y[loop];
                    }

                    if (px_y[0] >= 0 && px_y[0] < (int)threadcopy[0].s_y) {
                        if (px_x[0] >= 0 && px_x[0] < (int)threadcopy[0].s_x)
                            // (i, j) pixel
                            pixel_1[0] = threadcopy[0].i_n[px_y[0] * threadcopy[0].s_x + px_x[0]];
                        else
                            pixel_1[0] = 0.0;

                        if (px_x[0] + 1 >= 0 && px_x[0] + 1 < (int)threadcopy[0].s_x)
                            // (i+1, j) pixel
                            pixel_1[2] = threadcopy[0].i_n[px_y[0] * threadcopy[0].s_x + (px_x[0] + 1)];
                        else
                            pixel_1[2] = 0.0;
                    } else {
                        pixel_1[0] = 0.0;
                        pixel_1[2] = 0.0;
                    }

                    if (px_y[0] + 1 >= 0 && px_y[0] + 1 < (int)threadcopy[0].s_y) {
                        if (px_x[0] >= 0 && px_x[0] < (int)threadcopy[0].s_x)
                            pixel_1[1] = threadcopy[0].i_n[(px_y[0] + 1) * threadcopy[0].s_x + px_x[0]];
                        else
                            pixel_1[1] = 0.0;

                        if (px_x[0] + 1 >= 0 && px_x[0] + 1 < (int)threadcopy[0].s_x)
                            pixel_1[3] = threadcopy[0].i_n[(px_y[0] + 1) * threadcopy[0].s_x + (px_x[0] + 1)];
                        else
                            pixel_1[3] = 0.0;
                    } else {
                        pixel_1[1] = 0.0;
                        pixel_1[3] = 0.0;
                    }

                    // Bilinear interpolation of the four neighbouring pixels.
                    pix_1 = (1.0 - alpha[0]) * (1.0 - beta[0]) * pixel_1[0]
                          + (1.0 - alpha[0]) * beta[0]         * pixel_1[1]
                          + alpha[0]         * (1.0 - beta[0]) * pixel_1[2]
                          + alpha[0]         * beta[0]         * pixel_1[3];

                    f_l[k * l * l + j * l + i] += (float)(1.0 / (w_n * w_n) * pix_1);
                }
            }
        }
    }

I profiled the results using Intel VTune 2013 (using a binary created with GCC 4.1), and I can see a 40% reduction in memory bandwidth usage, which is expected because 2 images are processed in every iteration (the f_l store operation causes 8 bytes of traffic for every voxel). This accounts for an 11.7% reduction in bus cycles. Also, since the block size in the inner loop has increased, resource stalls decrease by 25.5%. These two account for an 18% reduction in response time.

The mystery is: why did instructions retired increase by 7.9% (which accounts for a 6.51% increase in response time)? One possible reason:

1. Since the number of branch instructions inside the block increases (and the Core architecture has an 8-bit global history), retired branch instructions increased by 2.5% (although mispredictions remained the same! I know, it smells fishy, right?).

I am still missing an explanation for the remaining 5.4%. Could someone please shed some light in any direction? I'm out of options and can't think of anything else. Thanks a lot!
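As a rough cross-check of the `f_l` part of the traffic, the arithmetic can be sketched as below. It assumes, as stated in the question, 8 bytes of `f_l` traffic per voxel per pass over the volume; the function name `f_l_traffic_bytes` is made up for illustration. It covers only `f_l`, not the image reads, so it is not expected to reproduce the measured 40% figure.

```c
#include <stdint.h>

/* Total f_l traffic in bytes: every pass over the n*n*n volume touches
   each voxel once, and one pass handles images_per_pass images. */
uint64_t f_l_traffic_bytes(uint64_t n, uint64_t images,
                           uint64_t images_per_pass, uint64_t bytes_per_voxel)
{
    uint64_t voxels = n * n * n;
    uint64_t passes = (images + images_per_pass - 1) / images_per_pass; /* round up */
    return voxels * bytes_per_voxel * passes;
}
```

With n = 256 and 8 bytes per voxel, this gives 62 GiB for 496 single-image passes and 31 GiB for 248 two-image passes, so the `f_l` traffic alone is halved.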
