What jbanes suggested does yield an improvement.
That's pretty much what I figured. On a modern superscalar, out-of-order CPU, the fewer dependencies there are between instructions, the more of them can run at once. Once the index has been computed, the array reads don't depend on each other, so it's quite possible that they execute in parallel. For what should be the fastest version, try this:
// Compute the base index once; each read below depends only on npp, not on the other reads.
int npp = ((yi<<widthBits)+xi)<<2;
int pixP1 = np[npp];
int pixP2 = np[npp+1];
int pixP3 = np[npp+2];
int pixP4 = np[npp+3];
There are no guarantees, but that might eliminate some of the extra processing you were trying to get rid of.
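
If you want to try the snippet on its own, here is a minimal self-contained sketch. The class name PixelFetch, the fetch helper, and the test data are made up for illustration. It also assumes, going only by the index arithmetic, that np holds four consecutive ints per cell and that the row width is 1 << widthBits; your actual layout may differ.

```java
import java.util.Arrays;

public class PixelFetch {
    // Hypothetical helper wrapping the snippet above.
    // Assumes np stores 4 consecutive ints per cell and rows are (1 << widthBits) cells wide.
    static int[] fetch(int[] np, int xi, int yi, int widthBits) {
        // Base index is computed once; the four reads are independent of each other.
        int npp = ((yi << widthBits) + xi) << 2;
        int pixP1 = np[npp];
        int pixP2 = np[npp + 1];
        int pixP3 = np[npp + 2];
        int pixP4 = np[npp + 3];
        return new int[] { pixP1, pixP2, pixP3, pixP4 };
    }

    public static void main(String[] args) {
        int widthBits = 3;                    // 8 cells per row (made-up test size)
        int[] np = new int[8 * 8 * 4];        // 8x8 cells, 4 ints each
        for (int i = 0; i < np.length; i++) {
            np[i] = i;                        // each element holds its own index
        }
        // Cell (xi=5, yi=2): base index = ((2 << 3) + 5) << 2 = 84
        System.out.println(Arrays.toString(fetch(np, 5, 2, widthBits)));
        // prints [84, 85, 86, 87]
    }
}
```

Whether this actually beats your current version depends on the JIT and the CPU, so time both in your real loop before settling on one.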