L2 Speech Synthesizer
Generate non-native English utterances using the publicly available L2-ARCTIC data.

Mispronunciation detection and diagnosis (MDD) is one of the pivotal techniques in computer-assisted pronunciation training (CAPT) research. A robust MDD system can help English as a Second Language (ESL) learners improve their spoken English efficiently. Despite these potential benefits, the accuracy of mispronunciation diagnosis in MDD systems remains unsatisfactory, primarily because of data sparsity: there is not enough non-native mispronounced speech available for training.
To address this challenge, we built an L2 speech synthesizer that automatically generates mispronounced speech. Below are some preliminary results from our system.
Generate correctly pronounced L2 speech
Word errors evaluation (ASR transcripts generated by "whisper-large-v2")
(PNV_arctic_a0093) for a full minute he crouched and listened
PERCEIVED | for a full minute he is crowd and listen
OURS-CORRECT | for a full minute he * crouched and listened
(BWC_arctic_a0046) the girl faced him her eyes shining with sudden fear
PERCEIVED | the girl faced him her eyes shinning with shoulder fear
OURS-CORRECT | the girl faced him her eyes shining with sudden fear
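
The word-level comparison above can be reproduced with a standard minimum-edit-distance alignment. The following is a minimal sketch, not the scoring script used for these results: the function name `align` and the use of `*` as the gap marker are assumptions chosen to mirror the rows shown here, and ties between equally good alignments may be broken differently from the rows above.

```python
def align(ref, hyp, gap="*"):
    """Align two token lists by minimum edit distance.

    Returns (aligned_ref, aligned_hyp, distance). Gaps are filled with `gap`.
    Hypothetical helper for illustration only.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (token missing in hyp)
                dist[i][j - 1] + 1,         # insertion (extra token in hyp)
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_ref, out_hyp = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]) else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            out_ref.append(ref[i - 1])
            out_hyp.append(hyp[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_ref.append(ref[i - 1])
            out_hyp.append(gap)
            i -= 1
        else:
            out_ref.append(gap)
            out_hyp.append(hyp[j - 1])
            j -= 1
    return out_ref[::-1], out_hyp[::-1], dist[n][m]


ref = "for a full minute he crouched and listened".split()
hyp = "for a full minute he is crowd and listen".split()
aligned_ref, aligned_hyp, distance = align(ref, hyp)

# Word error rate: edits divided by the number of reference words (3 / 8 here).
wer = distance / len(ref)
print(" ".join(aligned_ref))
print(" ".join(aligned_hyp))
print(f"WER = {wer:.3f}")
```

The same function works on phone sequences if each utterance is passed as a flat list of phone symbols instead of words.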
Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")
(PNV_arctic_a0093) f ao r | ah | f l | m ih n ah t | hh iy * | k r aw ch t | ae n d | l ih s ah n d
PERCEIVED | f ao * | ah | f uw | m ih n ah t | hh iy s | k r aw * t | ae n d | l ih s ah * *
OURS-CORRECT | f ao r | ah | f l | m ih n ah t | hh iy * | k r aw ch t | ae n d | l ih s ah n d
(BWC_arctic_a0046) dh ah | g er l | f ey s t | hh ih m | hh er | ay z | sh ay n ih ng | w ih th | s ah d ah n | f ih r
PERCEIVED | d ah | g er l | f ey s t | hh iy m | hh er | ay s | sh iy n ih * | w ih th | sh aw d er * | f ih r
OURS-CORRECT | dh ah | g er l | f ey s t | hh ih m | hh er | ay s | sh ay n ih ng | w ih th | s ah d ah n | f ih r
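
Because the phone rows above are already aligned slot by slot (with `*` marking a gap and `|` marking a word boundary), error types can be tallied by comparing a reference row against a transcript row position by position. This is a sketch under that assumption; `parse_row` and `count_phone_errors` are hypothetical helper names, not part of the released code.

```python
def parse_row(row):
    """Split a '|'-delimited row into words, each a list of phone symbols."""
    return [word.split() for word in row.split("|") if word.strip()]


def count_phone_errors(ref_row, hyp_row, gap="*"):
    """Tally correct phones, substitutions, deletions, and insertions.

    Both rows must already be aligned: same number of words, and the same
    number of slots inside each word, with `gap` filling empty slots.
    """
    counts = {"correct": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    ref_words, hyp_words = parse_row(ref_row), parse_row(hyp_row)
    if len(ref_words) != len(hyp_words):
        raise ValueError("rows must contain the same number of words")
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        if len(ref_word) != len(hyp_word):
            raise ValueError("aligned words must contain the same number of slots")
        for r, h in zip(ref_word, hyp_word):
            if r == gap and h == gap:
                continue                      # empty slot in both rows
            if r == gap:
                counts["insertion"] += 1      # phone present only in the transcript
            elif h == gap:
                counts["deletion"] += 1       # reference phone missing in the transcript
            elif r == h:
                counts["correct"] += 1
            else:
                counts["substitution"] += 1
    return counts


reference = "f ao r | ah | f l | m ih n ah t | hh iy * | k r aw ch t | ae n d | l ih s ah n d"
perceived = "f ao * | ah | f uw | m ih n ah t | hh iy s | k r aw * t | ae n d | l ih s ah * *"
print(count_phone_errors(reference, perceived))
```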
Generate mispronounced L2 speech
Word errors evaluation (ASR transcripts generated by "whisper-large-v2")
[ZEROSHOT_ex_1] there was a dog in the afternoon
OURS-CORRECT | there was a dog in the afternoon
OURS-MISPRON | there was a talk in the afternoon
Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")
[ZEROSHOT_ex_1] dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw
OURS-CORRECT | dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw
OURS-MISPRON | dh eh r | w ah z | ah | t ao k | ih n | dh ah | eh f t er n uw
Generate L2 speech with different pauses
Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")
[ZEROSHOT_ex_1] dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw
OURS-SHORT-PAUSE | dh eh r | w ah z | ah | d ao g | sil | ih n | dh iy | eh f t er n uw
OURS-LONG-PAUSE | dh eh r | w ah z | ah | d ao g | sil sil | ih n | dh iy | eh f t er n uw
OURS-INTER-PAUSE | dh eh r | w ah z | sil | ah | d ao g | sil | ih n | dh iy | eh f t er n uw
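
The three pause patterns above differ only in where `sil` tokens appear between words and how many are used. As an illustration of that layout, the sketch below builds such phone strings from a word-segmented phone sequence. It is an assumption about how a pause specification could be written down, not a description of the synthesizer's actual input interface; `insert_pauses` and its `pauses` argument are hypothetical.

```python
def insert_pauses(words, pauses, pause_token="sil"):
    """Build a '|'-delimited phone string with pause tokens between words.

    words:  list of phone lists, one list per word
    pauses: dict mapping a word index to the number of pause tokens
            placed right after that word (hypothetical format)
    """
    segments = []
    for idx, phones in enumerate(words):
        segments.append(" ".join(phones))
        count = pauses.get(idx, 0)
        if count > 0:
            # Repeated tokens in one segment stand for a longer pause.
            segments.append(" ".join([pause_token] * count))
    return " | ".join(segments)


base = "dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw"
words = [word.split() for word in base.split("|")]

short_pause = insert_pauses(words, {3: 1})        # one sil after "dog"
long_pause = insert_pauses(words, {3: 2})         # two sil after "dog"
inter_pause = insert_pauses(words, {1: 1, 3: 1})  # sil after "was" and after "dog"

for row in (short_pause, long_pause, inter_pause):
    print(row)
```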