L2 Speech Synthesizer

Fig. 1. A schematic diagram of computer-assisted pronunciation training (CAPT).

Mispronunciation detection and diagnosis (MDD) is one of the pivotal techniques in computer-assisted pronunciation training (CAPT) research. A robust MDD system can help English as a Second Language (ESL) learners efficiently improve their English speaking proficiency. Despite these potential benefits, the accuracy of mispronunciation diagnosis in MDD systems remains unsatisfactory, primarily because of data sparsity: there is not a sufficient amount of non-native mispronounced speech available for training.

In response to this challenge, we built an L2 speech synthesizer capable of automatically generating mispronounced speech. Below, we provide some preliminary results from our system.

Generate L2 correctly pronounced speech

Word errors evaluation (ASR transcripts generated by "whisper-large-v2")

(PNV_arctic_a0093) for a full minute he crouched and listened

PERCEIVED		for a full minute he is crowd and listen
OURS-CORRECT		for a full minute he * crouched and listened

(BWC_arctic_a0046) the girl faced him her eyes shining with sudden fear

PERCEIVED		the girl faced him her eyes shinning with shoulder fear
OURS-CORRECT		the girl faced him her eyes shining with sudden fear
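
To put a number on the word-level comparisons above, one option is a word error count based on Levenshtein (edit) distance between the canonical prompt and an ASR transcript. The sketch below is illustrative only and is not the scoring script used for these results; the function name `word_errors` is hypothetical, and it assumes that `*` in a transcript is an alignment placeholder rather than a recognized word.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, float]:
    """Return (edit distance in words, word error rate) for two transcripts.

    Assumption: a "*" token is an alignment placeholder and is dropped.
    """
    ref = [w for w in reference.split() if w != "*"]
    hyp = [w for w in hypothesis.split() if w != "*"]
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    errors = prev[-1]
    return errors, errors / len(ref)


if __name__ == "__main__":
    prompt = "for a full minute he crouched and listened"
    print(word_errors(prompt, "for a full minute he is crowd and listen"))
    print(word_errors(prompt, "for a full minute he * crouched and listened"))
```

Under these assumptions, the PERCEIVED transcript of PNV_arctic_a0093 differs from the prompt by three word edits, while the OURS-CORRECT transcript matches it exactly.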

Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")

PERCEIVED		f ao * | ah | f uw | m ih n ah t | hh iy s | k r aw * t | ae n d | l ih s ah * *
OURS-CORRECT		f ao r | ah | f l | m ih n ah t | hh iy * | k r aw ch t | ae n d | l ih s ah n d

PERCEIVED		d ah | g er l | f ey s t | hh iy m | hh er | ay s | sh iy n ih * | w ih th | sh aw d er * | f ih r
OURS-CORRECT		dh ah | g er l | f ey s t | hh ih m | hh er | ay s | sh ay n ih ng | w ih th | s ah d ah n | f ih r

Generate L2 mispronounced speech

Word errors evaluation (ASR transcripts generated by "whisper-large-v2")

[ZEROSHOT_ex_1] there was a dog in the afternoon

OURS-CORRECT		there was a dog in the afternoon
OURS-MISPRON		there was a talk in the afternoon

Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")

OURS-CORRECT		dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw
OURS-MISPRON		dh eh r | w ah z | ah | t ao k | ih n | dh ah | eh f t er n uw
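
The phone-level differences between the two transcripts above can be listed automatically. The following is a minimal sketch, not part of the described system: the helper names `parse_phones` and `phone_substitutions` are hypothetical, and the code assumes that both transcripts contain the same number of words and the same number of phones per word, so that every difference is a substitution. Insertions or deletions would require a proper alignment instead.

```python
def parse_phones(transcript: str) -> list[list[str]]:
    """Split a pipe-delimited phone transcript into words, then phones."""
    cleaned = transcript.replace("\\", "")  # tolerate escaped pipes ("\|")
    return [word.split() for word in cleaned.split("|")]


def phone_substitutions(correct: str, mispron: str) -> list[tuple[int, str, str]]:
    """Return (word index, correct phone, mispronounced phone) triples.

    Assumption: the two transcripts have identical word and phone counts,
    so position-wise comparison is enough.
    """
    ref_words = parse_phones(correct)
    hyp_words = parse_phones(mispron)
    if len(ref_words) != len(hyp_words):
        raise ValueError("transcripts must contain the same number of words")

    substitutions = []
    for index, (ref, hyp) in enumerate(zip(ref_words, hyp_words)):
        if len(ref) != len(hyp):
            raise ValueError(f"word {index} has a different number of phones")
        for ref_phone, hyp_phone in zip(ref, hyp):
            if ref_phone != hyp_phone:
                substitutions.append((index, ref_phone, hyp_phone))
    return substitutions


if __name__ == "__main__":
    correct = "dh eh r | w ah z | ah | d ao g | ih n | dh iy | eh f t er n uw"
    mispron = "dh eh r | w ah z | ah | t ao k | ih n | dh ah | eh f t er n uw"
    for index, ref_phone, hyp_phone in phone_substitutions(correct, mispron):
        print(f"word {index}: {ref_phone} -> {hyp_phone}")
```

For the ZEROSHOT_ex_1 pair, this reports d -> t and g -> k in the fourth word (voiced stops replaced by their voiceless counterparts, matching the word-level change from "dog" to "talk") and iy -> ah in the sixth word.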

Generate L2 speech with different pauses

Phone errors evaluation (ASR transcripts generated by "wav2vec2-large")

OURS-SHORT-PAUSE

dh eh r | w ah z | ah | d ao g | sil | ih n | dh iy | eh f t er n uw

OURS-LONG-PAUSE

dh eh r | w ah z | ah | d ao g | sil sil | ih n | dh iy | eh f t er n uw

OURS-INTER-PAUSE

dh eh r | w ah z | sil | ah | d ao g | sil | ih n | dh iy | eh f t er n uw
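
The three pause conditions above can be told apart by where the `sil` tokens fall and how many occur in a row. The sketch below shows one way to extract that information from a pipe-delimited phone transcript. The function name `pause_profile` is hypothetical, and the code assumes that `sil` marks silence and that consecutive `sil` tokens within one segment indicate a longer pause.

```python
def pause_profile(transcript: str) -> list[tuple[int, int]]:
    """Return (words spoken before the pause, number of sil tokens) pairs.

    Assumption: a segment made up only of "sil" tokens is a pause, and
    every other non-empty segment is one word.
    """
    pauses = []
    words_seen = 0
    for segment in transcript.split("|"):
        tokens = segment.split()
        if not tokens:
            continue  # ignore empty segments
        if all(token == "sil" for token in tokens):
            pauses.append((words_seen, len(tokens)))
        else:
            words_seen += 1
    return pauses


if __name__ == "__main__":
    examples = {
        "OURS-SHORT-PAUSE": "dh eh r | w ah z | ah | d ao g | sil | ih n | dh iy | eh f t er n uw",
        "OURS-LONG-PAUSE": "dh eh r | w ah z | ah | d ao g | sil sil | ih n | dh iy | eh f t er n uw",
        "OURS-INTER-PAUSE": "dh eh r | w ah z | sil | ah | d ao g | sil | ih n | dh iy | eh f t er n uw",
    }
    for name, transcript in examples.items():
        print(name, pause_profile(transcript))
```

Under these assumptions, the short and long conditions both place a single pause after the fourth word and differ only in its length (one `sil` token versus two), while the inter-pause condition adds a second pause after the second word.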