This is really far afield.
There are different ways to do all this.
There are two halves. The first half is recording the motions and synchronizing them to the audio track. I don’t know if there is anything out there in the public domain, so I may have to write my own. Eventually this should include a waldo and a graphical editor.
The second half is the playback. The servo motions could be recorded as a multiplexed audio track, much the same as Teddy Ruxpin or the way R/C transmitters work, and a microcontroller could demux the output. I believe there is even Arduino functionality that could achieve this.
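To make the demux idea concrete, here is a rough sketch in Python of how an R/C-style multiplexed pulse train might be split back into per-servo channels. It assumes a PPM-like scheme: the interval between pulses carries one channel's position (roughly 1000 to 2000 microseconds), and a much longer gap marks the start of a new frame. The threshold, the range, and the names `demux_ppm` and `to_angle` are illustrative assumptions, not the actual Teddy Ruxpin format or any particular Arduino library.

```python
from typing import List

# Assumed: any interval at least this long is a sync gap that starts a new frame.
SYNC_GAP_US = 3000


def demux_ppm(intervals_us: List[int], sync_gap_us: int = SYNC_GAP_US) -> List[List[int]]:
    """Split a stream of pulse-to-pulse intervals into frames of channel widths."""
    frames: List[List[int]] = []
    current = None  # stays None until the first sync gap has been seen
    for interval in intervals_us:
        if interval >= sync_gap_us:
            # A sync gap closes the previous frame (if any) and opens a new one.
            if current:
                frames.append(current)
            current = []
        elif current is not None:
            current.append(interval)
        # Intervals seen before the first sync gap are dropped as a partial frame.
    if current:
        # The trailing frame is kept even though it may be incomplete.
        frames.append(current)
    return frames


def to_angle(width_us: int, min_us: int = 1000, max_us: int = 2000,
             max_angle: float = 180.0) -> float:
    """Map a channel width to a servo angle, clamping out-of-range widths."""
    clamped = max(min_us, min(max_us, width_us))
    return (clamped - min_us) * max_angle / (max_us - min_us)


stream = [1500, 9000, 1000, 1500, 2000, 9000, 1200, 1800, 1100]
frames = demux_ppm(stream)
angles = [to_angle(w) for w in frames[0]]
```

On a real microcontroller the intervals would come from a timer or input-capture pin rather than a list, but the frame-splitting logic would be much the same.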
Another way is a custom file format that embeds serial control of servos in it.
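As a rough sketch of what such a file format might look like, the Python below packs timestamped servo commands into fixed-size binary records. Everything here is a made-up illustration: the `SRV1` magic bytes, the record layout (time in milliseconds, servo channel, pulse width in microseconds), and the function names `encode_show` and `decode_show`.

```python
import struct
from typing import Iterable, List, Tuple

# Hypothetical record layout, little-endian:
#   uint32 time in ms from the start of the audio, uint8 servo channel,
#   uint16 pulse width in microseconds. 7 bytes per record.
RECORD = struct.Struct("<IBH")
MAGIC = b"SRV1"  # made-up file signature for this sketch

Event = Tuple[int, int, int]


def encode_show(events: Iterable[Event]) -> bytes:
    """Serialize events, sorted by time, into the sketch file format."""
    body = b"".join(RECORD.pack(t, ch, us) for t, ch, us in sorted(events))
    return MAGIC + body


def decode_show(data: bytes) -> List[Event]:
    """Parse bytes produced by encode_show back into a list of events."""
    if data[:len(MAGIC)] != MAGIC:
        raise ValueError("not a servo show file")
    body = data[len(MAGIC):]
    if len(body) % RECORD.size != 0:
        raise ValueError("truncated record")
    return [RECORD.unpack_from(body, offset)
            for offset in range(0, len(body), RECORD.size)]


events = [(500, 1, 1200), (0, 0, 1500)]
blob = encode_show(events)
decoded = decode_show(blob)
```

A player would read the records in order and send each pulse width to its servo when the audio clock reaches the record's timestamp. Fixed-size records keep the parser simple enough for a small microcontroller.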
Much to mull over. The mechanical design seems easy compared to this. I may be reinventing the wheel here.
Thinking of using PureData http://puredata.info/downloads/ running on a Raspberry Pi as the show control. If you have any notions, please comment.
Looking at an MP3 circuit from SparkFun https://www.sparkfun.com/products/9715
I had considered building a waldo to help animate the features. It could be fairly simple, and animations could be done in layers. A pan-tilt gimbal with rotate and pinch could be used to animate the eye and eyelids in one pass, and the face and handlebars in a second pass.
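Recording in layers implies some way of combining the passes into one timeline. A minimal sketch in Python, assuming each pass is a list of (time in ms, channel, value) tuples and that a later pass replaces anything previously recorded on the same channels; the function name `merge_passes` and the channel numbering are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

Event = Tuple[int, int, int]  # (time in ms, channel, value)


def merge_passes(passes: Sequence[Sequence[Event]]) -> List[Event]:
    """Combine recording passes into one time-sorted timeline.

    A later pass overrides earlier events on any channel it records,
    so a single layer can be redone without touching the others.
    """
    timeline: List[Event] = []
    for recorded in passes:
        channels = {ch for _, ch, _ in recorded}
        # Drop older events on the channels this pass re-records.
        kept = [event for event in timeline if event[1] not in channels]
        timeline = list(heapq.merge(kept, sorted(recorded)))
    return timeline


eyes = [(0, 0, 90), (100, 1, 45)]   # first pass: channels 0 and 1
face = [(50, 2, 10)]                # second pass: channel 2
redo = [(10, 0, 80)]                # a retake of channel 0 only

two_layers = merge_passes([eyes, face])
with_retake = merge_passes([eyes, face, redo])
```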