How does it work?
Speech consists of two types of sound: vowels and consonants. Vowel sounds are actually really simple to make. Each vowel is defined by three characteristic frequencies, called formants. Sound comes from the glottis (voice box) and passes through the vocal tract, where the shape of the throat and mouth amplifies the formant frequencies by resonating at them. Formant F1 tends to be higher for open sounds like "ah", "aw", and "uh", and lower for closed sounds like "ee" and "oo". Formant F2 is higher for front sounds like "ee", "eh", and "aurgh", and lower for back sounds like "oo" and "aw". Formant F3 stays high most of the time but drops for R and L sounds. Rounded sounds, like "oo", "aw", and the French u, tend to lower all three formant frequencies. Consonants, on the other hand, contain more noise than vowels do, but even this noise is shaped by formants. Just as each vowel has a distinct set of frequencies, so does each consonant.
This speech synthesizer works by generating the formant frequencies directly and resetting the waveform at each glottal pulse. For example, if it is singing middle C, a frequency of about 262 hertz (cycles per second), then the three formant oscillators are reset every 1/262 of a second. This makes a sound that we recognize as a vowel. For the consonants, the noise generated is what's called colored noise, which contains more of some frequencies and less of others. It is made by an oscillator that randomly fluctuates its speed instead of holding a pure tone.
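The two ideas above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the synthesizer's actual code: the function names, the 44100 Hz sample rate, the use of sine oscillators, and the formant values for an "ah"-like vowel are all made up for the example.

```python
import math
import random

SAMPLE_RATE = 44100  # samples per second; an assumed value, not from the text

# Rough, illustrative formants (F1, F2, F3) for an "ah"-like vowel.
# These numbers are assumptions for the sketch, not the synthesizer's own.
AH_FORMANTS_HZ = (700.0, 1200.0, 2600.0)


def vowel_samples(pitch_hz, formants_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Sum one sine oscillator per formant, restarting all of them
    at every glottal pulse (once per pitch period)."""
    period = sample_rate / pitch_hz  # samples per glottal pulse, may be fractional
    samples = []
    for n in range(int(duration_s * sample_rate)):
        # Time since the most recent glottal pulse, in seconds.
        t = (n % period) / sample_rate
        total = sum(math.sin(2 * math.pi * f * t) for f in formants_hz)
        samples.append(total / len(formants_hz))  # keep the result within [-1, 1]
    return samples


def noise_samples(center_hz, spread_hz, duration_s, seed=0, sample_rate=SAMPLE_RATE):
    """One oscillator whose frequency is re-drawn at random on every sample,
    giving noise concentrated around center_hz instead of a pure tone."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    phase = 0.0
    samples = []
    for _ in range(int(duration_s * sample_rate)):
        freq = center_hz + rng.uniform(-spread_hz, spread_hz)
        phase += 2 * math.pi * freq / sample_rate
        samples.append(math.sin(phase))
    return samples


if __name__ == "__main__":
    # Half a second of "ah" sung at middle C, then a short hiss-like burst.
    vowel = vowel_samples(262.0, AH_FORMANTS_HZ, 0.5)
    hiss = noise_samples(4000.0, 2000.0, 0.25)
    print(len(vowel), len(hiss))
```

Because every oscillator restarts at each pulse, the output repeats exactly once per pitch period, so the ear hears the pitch of the note while the formant frequencies give it the vowel's color. A real synthesizer would probably also shape the amplitude within each pulse and between sounds; that is left out here to keep the sketch short.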