L2 Speech Synthesizer

Generate non-native English utterances using the publicly available L2-ARCTIC data.

Fig. 1. A schematic diagram of computer-assisted pronunciation training (CAPT).

Mispronunciation detection and diagnosis (MDD) is one of the pivotal techniques in computer-assisted pronunciation training (CAPT) research. A robust MDD system can help English as a Second Language (ESL) learners improve their spoken English efficiently. Despite these potential benefits, the accuracy of mispronunciation diagnosis in MDD systems remains unsatisfactory, primarily because of data sparsity: there is not enough non-native mispronounced speech available for training.

In response to this challenge, we built an L2 speech synthesizer that can automatically generate mispronounced speech. Below, we provide some preliminary results from our system.

Generate L2 correctly pronounced speech


Word errors evaluation (ASR transcripts generated by "whisper-large-v2")

(PNV_arctic_a0093) for a full minute he crouched and listened

PERCEIVED
for a full minute he is    crowd and   listen
OURS-CORRECT
for a full minute he  * crouched and listened

(BWC_arctic_a0046) the girl faced him her eyes shining with sudden fear

PERCEIVED
the girl faced him her eyes shinning with shoulder fear
OURS-CORRECT
the girl faced him her eyes  shining with   sudden fear
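
The `*` placeholders in the aligned transcripts above mark positions where one sequence has no counterpart in the other. As a minimal sketch of how such an alignment could be produced, the following uses a standard Levenshtein alignment. It is not necessarily the tool used for these results, and the function name `align` is hypothetical.

```python
def align(ref, hyp, gap="*"):
    """Levenshtein-align two token lists, padding gaps with `gap`.

    Returns (aligned_ref, aligned_hyp, edit_distance).
    Hypothetical helper for illustration only.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, preferring match/substitution.
    i, j = n, m
    out_ref, out_hyp = [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            out_ref.append(ref[i - 1])
            out_hyp.append(hyp[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # Token present in ref only (deletion in hyp).
            out_ref.append(ref[i - 1])
            out_hyp.append(gap)
            i -= 1
        else:
            # Token present in hyp only (insertion in hyp).
            out_ref.append(gap)
            out_hyp.append(hyp[j - 1])
            j -= 1
    return out_ref[::-1], out_hyp[::-1], cost[n][m]


# Reference text and PERCEIVED transcript of PNV_arctic_a0093, taken from above.
ref = "for a full minute he crouched and listened".split()
hyp = "for a full minute he is crowd and listen".split()

aligned_ref, aligned_hyp, distance = align(ref, hyp)
print(" ".join(aligned_ref))
print(" ".join(aligned_hyp))
print("word error rate:", distance / len(ref))
```

Dividing the edit distance by the number of reference words gives a word error rate, so the same helper can compare the PERCEIVED and OURS-CORRECT transcripts against the prompt text.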

Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")

(PNV_arctic_a0093) f ao r | ah | f l | m ih n ah t | hh iy * | k r aw ch t | ae n d | l ih s ah n d

PERCEIVED
f ao * | ah | f uw | m ih n ah t | hh iy s | k r aw  * t | ae n d | l ih s ah * * 
OURS-CORRECT
f ao r | ah | f  l | m ih n ah t | hh iy * | k r aw ch t | ae n d | l ih s ah n d

(BWC_arctic_a0046) dh ah | g er l | f ey s t | hh ih m | hh er | ay z | sh ay n ih ng | w ih th | s ah d ah n | f ih r

PERCEIVED
 d ah | g er l | f ey s t | hh iy m | hh er | ay s | sh iy n ih  * | w ih th | sh aw d er * | f ih r
OURS-CORRECT
dh ah | g er l | f ey s t | hh ih m | hh er | ay s | sh ay n ih ng | w ih th |  s ah d ah n | f ih r
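
Because the phone transcripts above keep one `|`-delimited group per word, the words whose phones deviate from the canonical pronunciation can be flagged by comparing groups position by position. The following is a minimal sketch under that assumption (equal word counts, `*` treated as an alignment gap); the names `parse_phones` and `mispronounced_words` are hypothetical.

```python
def parse_phones(transcript, gap="*"):
    """Split a '|'-delimited phone string into per-word phone lists.

    Gap markers (`*`) are alignment padding and are dropped.
    """
    return [[p for p in word.split() if p != gap]
            for word in transcript.split("|")]


def mispronounced_words(canonical, observed):
    """Return indices of words whose phone sequences differ."""
    can, obs = parse_phones(canonical), parse_phones(observed)
    if len(can) != len(obs):
        raise ValueError("transcripts must contain the same number of words")
    return [i for i, (c, o) in enumerate(zip(can, obs)) if c != o]


# Transcripts of BWC_arctic_a0046, copied from the example above.
canonical = ("dh ah | g er l | f ey s t | hh ih m | hh er | ay z | "
             "sh ay n ih ng | w ih th | s ah d ah n | f ih r")
perceived = ("d ah | g er l | f ey s t | hh iy m | hh er | ay s | "
             "sh iy n ih * | w ih th | sh aw d er * | f ih r")
ours_correct = ("dh ah | g er l | f ey s t | hh ih m | hh er | ay s | "
                "sh ay n ih ng | w ih th | s ah d ah n | f ih r")

print(mispronounced_words(canonical, perceived))
print(mispronounced_words(canonical, ours_correct))
```

This word-level view is coarser than a phone error rate, but it shows directly which words in the synthesized speech still differ from the canonical phones.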

Generate L2 mispronounced speech


Word errors evaluation (ASR transcripts generated by "whisper-large-v2")

[ZEROSHOT_ex_1] there was a dog in the afternoon

OURS-CORRECT
there was a dog in the afternoon
OURS-MISPRON
there was a talk in the afternoon

Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")

[ZEROSHOT_ex_1] dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw

OURS-CORRECT
dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw
OURS-MISPRON
dh eh r | w ah z | ah | t ao k | ih n | dh ah | eh f t er n uw

Generate L2 speech with different pauses


Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")

[ZEROSHOT_ex_1] dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw

OURS-SHORT-PAUSE
dh eh r | w ah z | ah | d ao g | sil | ih n | dh iy | eh f t er n uw
OURS-LONG-PAUSE
dh eh r | w ah z | ah | d ao g | sil sil | ih n | dh iy | eh f t er n uw
OURS-INTER-PAUSE
dh eh r | w ah z | sil | ah | d ao g | sil | ih n | dh iy | eh f t er n uw
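
The three pause variants above differ only in where `sil` tokens appear and how many there are. As an illustration of that pattern, the sketch below builds such pause-marked phone sequences from a base sequence. How the synthesizer itself is conditioned on pauses is not described here, so the helper `insert_pauses` and its `pauses` mapping (word index to number of `sil` tokens) are assumptions.

```python
def insert_pauses(transcript, pauses, token="sil"):
    """Insert pause groups after selected words of a '|'-delimited phone string.

    `pauses` maps a zero-based word index to the number of pause tokens
    to place after that word. Hypothetical helper for illustration only.
    """
    words = [w.strip() for w in transcript.split("|")]
    out = []
    for i, word in enumerate(words):
        out.append(word)
        count = pauses.get(i, 0)
        if count > 0:
            # Several pause tokens share one group, e.g. "sil sil".
            out.append(" ".join([token] * count))
    return " | ".join(out)


base = "dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw"

short_pause = insert_pauses(base, {3: 1})        # one pause after "d ao g"
long_pause = insert_pauses(base, {3: 2})         # two pauses after "d ao g"
inter_pause = insert_pauses(base, {1: 1, 3: 1})  # pauses after "w ah z" and "d ao g"

for line in (short_pause, long_pause, inter_pause):
    print(line)
```

With these settings the three strings match the OURS-SHORT-PAUSE, OURS-LONG-PAUSE, and OURS-INTER-PAUSE transcripts listed above.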