I'm glad that you are interested in the tech.
Be sure not to miss the video
showing the complete application.
I guess everything should be possible in Java3D, although not as fast as in OpenGL: even if Java3D's Geometry uses OpenGL 2.0's VertexBufferObjects, it is not possible, at least AFAIK, to declare some components static and others dynamic. The positions and normals have to be updated every frame, but the texture coordinates don't vary, so ideally one would send them to graphics memory only once. Anyway, I guess this is unlikely to be the bottleneck.
I plan to release the full source code in September, because I'm busy now preparing stuff for a conference in late August. Unfortunately, the release will be without the models, since I have no rights to them and my sensei from Tokyo won't agree. The Maya exporter is up to date, so you could use it; the 3dsmax exporter lags behind the latest version, which makes it unusable at the moment. In September, I'll write a converter for COLLADA 1.4.1, so everyone can export their own models from most DCC tools.
If you're in a hurry, I could give you the quickly hacked, undocumented source code now, but I guess it will take considerable effort to work through it!
As you are particularly interested in the lip synchronization, I will explain a few things:
1. We use a third-party tool: Annosoft's LipsyncTool (costs ~$500). From an audio file (WAV, MP3) and, optionally, the spoken text, it outputs a time scheme for the phonemes and their intensities in a simple text format.
2. A self-written tool converts such a file into our keyframe-based animation format. The file used in the demo is email.anim.xml.
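To illustrate step 2: the converter essentially reads time-stamped phoneme entries and emits keyframes. Note that the input format below is invented for illustration (Annosoft's actual output format and our real converter are not shown here); it only demonstrates the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative converter for step 2. Each input line is assumed to be
// "<startMillis> <phoneme> <intensity>" -- a made-up format, not
// Annosoft's real one -- and becomes one keyframe in an animation track.
public class PhonemeToKeyframes {
    static final class Keyframe {
        final int timeMillis;
        final String phoneme;
        final float intensity;

        Keyframe(int t, String p, float i) { timeMillis = t; phoneme = p; intensity = i; }

        @Override public String toString() {
            return "<key time=\"" + timeMillis + "\" phoneme=\"" + phoneme
                 + "\" intensity=\"" + intensity + "\"/>";
        }
    }

    static List<Keyframe> convert(List<String> lines) {
        List<Keyframe> keys = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 3) continue;   // skip malformed lines
            keys.add(new Keyframe(Integer.parseInt(parts[0]), parts[1],
                                  Float.parseFloat(parts[2])));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> input = List.of("0 m 0.8", "120 aa 1.0");
        convert(input).forEach(System.out::println);
    }
}
```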
3. Our characters define facial features via Morph-Targets (3dsmax) / BlendShapes (Maya), e.g. LowerLipUp. Small offset vectors determine the displacement for each group of vertices, e.g. some around the lip. The animation parameter for one feature is simply an intensity, typically ranging from 0.0 to 1.0.
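The blending in step 3 boils down to adding intensity-weighted offset vectors onto the base mesh. Here is a minimal sketch of that idea; the class and target names are illustrative, not taken from our actual code:

```java
import java.util.Arrays;

// Minimal sketch of morph-target blending: each target stores sparse
// offset vectors for the vertices it affects, scaled by an intensity.
public class MorphBlendDemo {
    // A morph target, e.g. "LowerLipUp": affected vertex indices plus
    // one (dx,dy,dz) offset per affected vertex.
    static final class MorphTarget {
        final int[] indices;
        final float[] offsets;   // 3 floats per entry in indices

        MorphTarget(int[] indices, float[] offsets) {
            this.indices = indices;
            this.offsets = offsets;
        }
    }

    // result = base + sum_over_targets(intensity_t * offsets_t)
    static float[] blend(float[] basePositions, MorphTarget[] targets, float[] intensities) {
        float[] out = Arrays.copyOf(basePositions, basePositions.length);
        for (int t = 0; t < targets.length; t++) {
            float w = intensities[t];
            if (w == 0f) continue;                 // inactive target
            MorphTarget mt = targets[t];
            for (int i = 0; i < mt.indices.length; i++) {
                int v = mt.indices[i] * 3;
                out[v]     += w * mt.offsets[i * 3];
                out[v + 1] += w * mt.offsets[i * 3 + 1];
                out[v + 2] += w * mt.offsets[i * 3 + 2];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] base = {0f, 0f, 0f,  1f, 0f, 0f};      // two vertices
        MorphTarget lowerLipUp = new MorphTarget(
            new int[]{1}, new float[]{0f, 0.2f, 0f});  // lifts vertex 1
        float[] posed = blend(base, new MorphTarget[]{lowerLipUp}, new float[]{0.5f});
        System.out.println(Arrays.toString(posed));    // vertex 1 moves up by 0.1
    }
}
```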
4. Since we wanted to be independent of the facial features defined for a character (some of our models use only about 20, others the full range defined by MPEG-4's Facial Definition Parameters), we introduce a mapping between them and the phonemes. More exactly, a two-step approach: from phonemes to visemes, and from visemes to facial features. This is defined once per character in an expression file: Tsumabuki.expr.xml
In order to get good values, we simply moved the sliders in Maya/3dsmax until the mouth looked like a particular viseme (the visual counterpart of a phoneme).
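The two-step mapping from step 4 can be sketched as two lookup tables. All phoneme, viseme, and feature names and intensities below are made up for illustration; the real per-character values live in an expression file such as Tsumabuki.expr.xml:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two-step mapping: phoneme -> viseme -> feature intensities.
// All names and numbers are invented examples, not real character data.
public class VisemeMapping {
    // Step A: many phonemes collapse onto fewer visemes.
    static final Map<String, String> PHONEME_TO_VISEME = new HashMap<>();
    // Step B: each viseme drives one or more facial features with a weight.
    static final Map<String, Map<String, Float>> VISEME_TO_FEATURES = new HashMap<>();

    static {
        PHONEME_TO_VISEME.put("p", "bilabial");
        PHONEME_TO_VISEME.put("b", "bilabial");
        PHONEME_TO_VISEME.put("m", "bilabial");
        PHONEME_TO_VISEME.put("aa", "open");

        Map<String, Float> bilabial = new HashMap<>();
        bilabial.put("LowerLipUp", 0.8f);
        bilabial.put("UpperLipDown", 0.6f);
        VISEME_TO_FEATURES.put("bilabial", bilabial);

        Map<String, Float> open = new HashMap<>();
        open.put("JawOpen", 0.9f);
        VISEME_TO_FEATURES.put("open", open);
    }

    // Resolve a phoneme (with its detected intensity) to feature weights.
    static Map<String, Float> featuresFor(String phoneme, float intensity) {
        Map<String, Float> result = new HashMap<>();
        String viseme = PHONEME_TO_VISEME.get(phoneme);
        if (viseme == null) return result;   // unknown phoneme: neutral face
        for (Map.Entry<String, Float> e : VISEME_TO_FEATURES.get(viseme).entrySet()) {
            result.put(e.getKey(), e.getValue() * intensity);
        }
        return result;
    }

    public static void main(String[] args) {
        // "m" at half intensity: LowerLipUp = 0.4, UpperLipDown = 0.3
        System.out.println(featuresFor("m", 0.5f));
    }
}
```

Because characters differ only in the table contents, swapping a character means loading a different expression file, not changing the animation code.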
Note: We plan to eliminate steps 1-2 by grabbing the phoneme information directly from a TTS system in real time.
Please ask if you have any further questions.