I would look at the Red Book's discussion of the different matrices. Even though it's talking about the old API, it explains the concepts really well.

So some explanations:

Any matrix can transform a vector. It helps to think of this transformation as changing coordinate spaces. Usually in a game you have multiple coordinate spaces. There's the one where each model's geometry is defined, which you might call local space. Each object in the world also has a translation, rotation, and scale that together represent its placement in world space.

Multiplying the vertices of the geometry in local space with the object's transform changes the coordinate space to be the world space by computing the equivalent "location".
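As a minimal sketch of that multiplication (using a plain column-major `double[16]` rather than any particular math library; the class and helper names here are made up):

```java
public class LocalToWorld {
    // Multiply a column-major 4x4 matrix by an (x, y, z, w) column vector.
    static double[] transform(double[] m, double[] v) {
        double[] out = new double[4];
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                out[row] += m[col * 4 + row] * v[col];
        return out;
    }

    // A world transform that only translates, placing the object at (tx, ty, tz).
    static double[] translation(double tx, double ty, double tz) {
        return new double[] {
            1, 0, 0, 0,
            0, 1, 0, 0,
            0, 0, 1, 0,
            tx, ty, tz, 1
        };
    }

    public static void main(String[] args) {
        double[] localVertex = {1, 0, 0, 1};      // defined in local space
        double[] model = translation(10, 0, -5);  // object sits at (10, 0, -5) in the world
        double[] worldVertex = transform(model, localVertex);
        // The vertex ends up at (11, 0, -5) in world space.
        System.out.println(worldVertex[0] + ", " + worldVertex[1] + ", " + worldVertex[2]);
    }
}
```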

When you multiply a matrix with another matrix, you're concatenating the transforms. This means that when you use the resulting matrix, it behaves just as if you had transformed the vertex by both. The transforms are applied from right to left. So the following are equivalent:

    Matrix4f c = a x b;
    Vector4f t1 = c x v;

    Vector4f t2 = b x v;
    Vector4f t3 = a x t2;

    t1.equals(t3) == true;

Now to the actual OpenGL concepts. I'll talk about the MODELVIEW matrix first. This is a combination of two conceptual transforms. The model transform represents the transform from an object or light's local space to world space (as I talked about above).

The view transform is the inverse of the camera's world transform. The inverse of a transform changes coordinate spaces in the opposite direction. So the inverse of a world transform moves from world space to local space. Since the camera's world transform can be thought of just like any other world transform, inverting it changes the coordinate space so that everything is seen from the camera's point of view, which is exactly what we want when rendering.
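A sketch of that inversion for a rigid camera transform (rotation plus translation only; with no scale the inverse rotation is just the transpose, and the inverse translation is -Rᵀt; the names here are made up, not from any library):

```java
public class ViewFromCamera {
    // Multiply a column-major 4x4 matrix by an (x, y, z, w) column vector.
    static double[] transform(double[] m, double[] v) {
        double[] out = new double[4];
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                out[row] += m[col * 4 + row] * v[col];
        return out;
    }

    // Invert a column-major rigid transform (rotation + translation, no scale).
    static double[] invertRigid(double[] m) {
        double[] inv = new double[16];
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                inv[c * 4 + r] = m[r * 4 + c];             // transpose the rotation
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                inv[12 + r] -= inv[c * 4 + r] * m[12 + c]; // translation = -R^T * t
        inv[15] = 1;
        return inv;
    }

    public static void main(String[] args) {
        // Camera placed at (0, 2, 10) in the world, no rotation.
        double[] cameraWorld = {
            1, 0, 0, 0,
            0, 1, 0, 0,
            0, 0, 1, 0,
            0, 2, 10, 1
        };
        double[] view = invertRigid(cameraWorld);
        // The camera's own position lands at the origin of camera space.
        double[] p = transform(view, new double[]{0, 2, 10, 1});
        System.out.println(p[0] + ", " + p[1] + ", " + p[2]);
    }
}
```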

In practice the view transform is a little different, because OpenGL looks down the negative z-axis, which complicates things slightly. Any formula you find online for a view matrix built from a position, a camera direction, and an up vector should already take this into account.
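For reference, a sketch of the classic gluLookAt-style construction; the negated forward axis in the third row is exactly where the negative-z convention shows up (class name is invented, the math follows the standard formula):

```java
public class LookAt {
    static double[] sub(double[] a, double[] b) {
        return new double[]{a[0]-b[0], a[1]-b[1], a[2]-b[2]};
    }
    static double dot(double[] a, double[] b) {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    }
    static double[] cross(double[] a, double[] b) {
        return new double[]{a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
    }
    static double[] normalize(double[] v) {
        double len = Math.sqrt(dot(v, v));
        return new double[]{v[0]/len, v[1]/len, v[2]/len};
    }

    // Column-major view matrix looking from eye toward center, like gluLookAt.
    static double[] lookAt(double[] eye, double[] center, double[] up) {
        double[] f = normalize(sub(center, eye)); // forward
        double[] s = normalize(cross(f, up));     // right ("side")
        double[] u = cross(s, f);                 // corrected up
        return new double[]{
            s[0], u[0], -f[0], 0,
            s[1], u[1], -f[1], 0,
            s[2], u[2], -f[2], 0,
            -dot(s, eye), -dot(u, eye), dot(f, eye), 1
        };
    }

    public static void main(String[] args) {
        // Camera at z = 5 looking at the origin: the world origin ends up
        // 5 units down the NEGATIVE z-axis in camera space.
        double[] view = lookAt(new double[]{0, 0, 5},
                               new double[]{0, 0, 0},
                               new double[]{0, 1, 0});
        // The origin maps to the translation column of the view matrix.
        System.out.println(view[14]);
    }
}
```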

The projection matrix is the final matrix. The model and view matrices are affine transforms, which means that no matter how you apply them, parallel lines will remain parallel (basically things aren't distorted). The projection matrix is different. It is responsible for creating the perspective effects that we see when rendering 3D onto a monitor. There are multiple projection types but the most common is the perspective projection matrix.

In camera space (i.e. after applying the view transform), the visible scene is contained in a frustum, a truncated pyramid whose tip sits at the camera's location and which is capped by the near and far planes. The projection matrix transforms every vertex within this frustum so that it lies within the cube spanning -1 to 1 on each axis (after the perspective divide), which makes it really easy to then map onto actual pixels.
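A sketch of a gluPerspective-style matrix, checking that a point on the near plane maps to z = -1 and one on the far plane to z = +1 after dividing by w (the class name is invented; the matrix layout is the standard one):

```java
public class Perspective {
    // Column-major perspective projection, same layout as gluPerspective.
    static double[] perspective(double fovyDeg, double aspect, double near, double far) {
        double f = 1.0 / Math.tan(Math.toRadians(fovyDeg) / 2.0);
        return new double[]{
            f / aspect, 0, 0, 0,
            0, f, 0, 0,
            0, 0, (far + near) / (near - far), -1,
            0, 0, 2 * far * near / (near - far), 0
        };
    }

    // Project an eye-space depth and do the perspective divide by w,
    // yielding the z in normalized device coordinates.
    static double projectZ(double[] proj, double zEye) {
        double zClip = proj[10] * zEye + proj[14];
        double wClip = -zEye; // comes from the -1 in the third column
        return zClip / wClip;
    }

    public static void main(String[] args) {
        double[] proj = perspective(60, 16.0 / 9.0, 0.1, 100.0);
        System.out.println(projectZ(proj, -0.1));   // near plane -> -1
        System.out.println(projectZ(proj, -100.0)); // far plane  -> +1
    }
}
```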

Because of this, you apply the projection matrix last, after computing lighting and other effects. In the older versions of OpenGL, lighting was computed in camera space. This means that the normal vectors of your model have to be transformed, too. However, normals are directions rather than positions, so they must not be translated, and under non-uniform scaling they get skewed if you just multiply them by the model matrix. Instead you use the "normal matrix", the inverse transpose of the upper-left 3x3 of the modelview matrix; I would look up tutorials on it for a good explanation of why this works.
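To see the problem concretely, here is a sketch with a non-uniform scale: transforming a normal by the model matrix breaks perpendicularity, while the inverse transpose of the upper-left 3x3 (the normal matrix) preserves it (class and helper names are made up):

```java
public class NormalMatrix {
    // Multiply a row-major 3x3 matrix by a vector.
    static double[] mul3(double[] m, double[] v) {
        return new double[]{
            m[0]*v[0] + m[1]*v[1] + m[2]*v[2],
            m[3]*v[0] + m[4]*v[1] + m[5]*v[2],
            m[6]*v[0] + m[7]*v[1] + m[8]*v[2]
        };
    }
    static double dot(double[] a, double[] b) {
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    }

    // Inverse of a row-major 3x3 via the adjugate, then transposed:
    // this is the normal matrix for that 3x3.
    static double[] inverseTranspose3(double[] m) {
        double det = m[0]*(m[4]*m[8]-m[5]*m[7])
                   - m[1]*(m[3]*m[8]-m[5]*m[6])
                   + m[2]*(m[3]*m[7]-m[4]*m[6]);
        double[] adj = new double[]{
             (m[4]*m[8]-m[5]*m[7]), -(m[1]*m[8]-m[2]*m[7]),  (m[1]*m[5]-m[2]*m[4]),
            -(m[3]*m[8]-m[5]*m[6]),  (m[0]*m[8]-m[2]*m[6]), -(m[0]*m[5]-m[2]*m[3]),
             (m[3]*m[7]-m[4]*m[6]), -(m[0]*m[7]-m[1]*m[6]),  (m[0]*m[4]-m[1]*m[3])
        };
        double[] out = new double[9];
        for (int r = 0; r < 3; r++)
            for (int c = 0; c < 3; c++)
                out[c * 3 + r] = adj[r * 3 + c] / det; // divide and transpose
        return out;
    }

    public static void main(String[] args) {
        double[] model3 = {2, 0, 0,  0, 1, 0,  0, 0, 1}; // non-uniform scale, x2 in x
        double[] tangent = {1, 1, 0};                    // lies in the surface
        double[] normal  = {1, -1, 0};                   // perpendicular to it

        double[] tWorld = mul3(model3, tangent);                   // (2, 1, 0)
        double[] nWrong = mul3(model3, normal);                    // (2, -1, 0)
        double[] nRight = mul3(inverseTranspose3(model3), normal); // (0.5, -1, 0)

        System.out.println(dot(tWorld, nWrong)); // nonzero: no longer perpendicular
        System.out.println(dot(tWorld, nRight)); // zero: still perpendicular
    }
}
```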

Matrix multiplication is really fast, both on the GPU and the CPU. I have found that multiplying on the CPU was faster than paying the JNI overhead of glPushMatrix and glPopMatrix. I don't think it's necessary to implement your own matrix stack in your shaders. It is still probably helpful to separate the projection, view, and model matrices, though. If you have transform hierarchies, like in a scene graph, you can multiply all of their model matrices together in the correct order to create a single model transform for each object, with the same end result.
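A sketch of collapsing a two-level hierarchy into one model matrix (translations only, made-up class and helper names; the child's local transform sits on the right so it applies first):

```java
public class SceneGraphFlatten {
    // Multiply two column-major 4x4 matrices: out = a * b.
    static double[] mul(double[] a, double[] b) {
        double[] out = new double[16];
        for (int col = 0; col < 4; col++)
            for (int row = 0; row < 4; row++)
                for (int k = 0; k < 4; k++)
                    out[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
        return out;
    }

    static double[] translation(double tx, double ty, double tz) {
        return new double[]{1,0,0,0, 0,1,0,0, 0,0,1,0, tx,ty,tz,1};
    }

    public static void main(String[] args) {
        double[] parent = translation(10, 0, 0); // parent node's world transform
        double[] child  = translation(0, 5, 0);  // child relative to its parent
        // Right-to-left: child's local transform first, then the parent's.
        double[] childWorld = mul(parent, child);
        // The child ends up at (10, 5, 0) in world space.
        System.out.println(childWorld[12] + ", " + childWorld[13] + ", " + childWorld[14]);
    }
}
```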