Eye tracking has become one of the main methods for analyzing recognition issues in human-computer
interaction, and capturing images of the human eye is its key problem. Building on further research, a new
human-computer interaction method is introduced to enrich the forms of speech synthesis. We propose a method of Implicit
Prosody mining based on human eye image capture technology that extracts parameters from images of the eyes
during reading, uses them to control and drive prosody generation in speech synthesis, and establishes a prosodic model with high
simulation accuracy. The duration model is a key issue in prosody generation. For the duration model, this paper puts forward
a new idea: obtaining the gaze duration of the eyes during reading from eye image capture, and
synchronizing this duration with the pronunciation duration in speech synthesis. The movement of human eyes
during reading is a comprehensive, multi-factor interactive process involving gaze fixations, saccades, and regressions. Therefore,
how to extract the appropriate information from eye images needs to be considered, and the gaze regularities
of the eyes need to be obtained as references for modeling. Based on an analysis of three current kinds of eye movement
control model and the characteristics of Implicit Prosody reading, the relative independence between the speech processing
system for text and the eye movement control system was discussed. It was proved that, under the same text familiarity
condition, the gaze duration of the eyes during reading and the internal voice pronunciation duration are synchronous. The eye gaze
duration model based on the Chinese language level prosodic structure was presented as a departure from previous methods based on
machine learning and probability forecasting, in order to obtain readers’ real internal reading rhythm and to synthesize voice with
personalized rhythm. This research will enrich the forms of human-computer interaction and has practical significance and
application prospects for assisted speech interaction for people with disabilities. Experiments show that Implicit Prosody mining
based on human eye image capture technology gives the synthesized speech more flexible expression.
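
As a rough illustration of the duration idea described above, the following minimal Python sketch turns a reader's gaze time on each prosodic unit into target durations for synthesis. It is only a sketch under stated assumptions, not the model presented in this paper: the names Fixation, gaze_durations and synthesis_durations, the millisecond units, the summing of all fixations on a unit, and the proportional scaling rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    """One fixation reported by an eye tracker (hypothetical record layout)."""
    unit: str        # label of the prosodic unit being looked at
    start_ms: float  # fixation onset, in milliseconds
    end_ms: float    # fixation offset, in milliseconds


def gaze_durations(fixations):
    """Total gaze time per unit, in first-seen order.

    Simplifying assumption: every fixation on a unit is summed,
    including fixations made after a regression back to that unit.
    """
    totals = {}
    for f in fixations:
        totals[f.unit] = totals.get(f.unit, 0.0) + (f.end_ms - f.start_ms)
    return totals


def synthesis_durations(gaze_ms, speech_total_ms):
    """Scale gaze durations so they sum to speech_total_ms.

    Assumed rule: the relative rhythm of the gaze durations is kept,
    and only the overall length is stretched or compressed.
    """
    total = sum(gaze_ms.values())
    if total <= 0:
        raise ValueError("no gaze time recorded")
    return {unit: d * speech_total_ms / total for unit, d in gaze_ms.items()}


# Usage: two units, with a regression back to the first one.
fixations = [
    Fixation("unit_a", 0.0, 200.0),
    Fixation("unit_b", 220.0, 520.0),
    Fixation("unit_a", 540.0, 640.0),
]
gaze = gaze_durations(fixations)
targets = synthesis_durations(gaze, speech_total_ms=1200.0)
```

The proportional scaling keeps the reader's relative rhythm while letting the overall speaking rate be set separately; any other mapping from gaze duration to pronunciation duration could be substituted in synthesis_durations.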