Fiddling with Google speech-to-text:

johanf@RWAAI-johanf:~/dev/google-cloud-sdk/bin$ ./gcloud ml speech recognize /home/johanf/Downloads/The_Story_of_the_Nocturnal_Insect_mono_16000_20s_denoised.wav --language-code=ms-MY --include-word-time-offsets > /home/johanf/Downloads/The_Story_of_the_Nocturnal_Insect_mono_16000_20s_denoised.json
johanf@RWAAI-johanf:~/dev/google-cloud-sdk/bin$ cat /home/johanf/DoThe_Story_of_the_Nocturnal_Insect_mono_16000_ | jq '.results[] | .alternatives[] | .words[] | [.endTime,127,.word] | @tsv' \
  | sed 's/"//g' | sed 's/s\\t/ /' | sed 's/\\t/ /' > /home/johanf/data/ceqwong/json/The_Story_of_the_Nocturnal_Insect_mono_16000.lab
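The jq/sed step can equally be sketched in Python. This assumes the word-offset layout that `gcloud ml speech recognize` emits (`results[].alternatives[].words[]`, with `endTime` strings like `"1.300s"`); the sample JSON is a made-up miniature, and the constant `127` column is simply carried over from the pipeline above.

```python
import json

def stt_json_to_lab(json_text):
    """Convert gcloud STT word time offsets to .lab lines: '<end> 127 <word>'."""
    doc = json.loads(json_text)
    lines = []
    for result in doc["results"]:
        for alt in result["alternatives"]:
            for w in alt.get("words", []):
                end = w["endTime"].rstrip("s")  # "1.300s" -> "1.300"
                lines.append(f"{end} 127 {w['word']}")
    return "\n".join(lines)

# Hypothetical miniature of the gcloud JSON output, for illustration only.
sample = json.dumps({
    "results": [{"alternatives": [{"words": [
        {"startTime": "0s", "endTime": "0.700s", "word": "cerita"},
        {"startTime": "0.700s", "endTime": "1.300s", "word": "serangga"},
    ]}]}]
})
print(stt_json_to_lab(sample))
```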

bin/mfa_train_and_align --no_dict ~/data/ceq_kruspe_narratives_tradnarr/data-Nocturnal_Insect ceqwong_out2/

~/data/ceq_kruspe_narratives_tradnarr/data-Nocturnal_Insect/98-part.lab ~/dev/montreal-forced-aligner/ceqwong_out2/data-Nocturnal_Insect/98-part.TextGrid

We tried this and that and fiddled with those and the outcome was as follows.

We use interslice.

The toolbox files provide a partitioning into utterance-like segments.

We used the interslice tool from Festvox to produce an initial chunking of the long speech file.

When we have the chunking we can build

The language has no written form, so the transcriptions in the toolbox files are phonetic.

The toolbox files provide a phonetic representation of the spoken material.

One strategy is to use resources and tools for a related language to bootstrap resources and analyses for the under-resourced language. This strategy was used by [1], who used Malay resources to bootstrap ASR for Iban.

Compared to Ceq Wong, it turns out that Iban is relatively well-resourced: [1] found 7k news articles that they could use.

Many of the tools for speech alignment and chunking (e.g., MFA, MAUS) predominantly use a graphemic representation and derive phonetic representations from it.

Since we only have a phonetic representation and no orthography, we try to backtrack from our phonetic representation and create an artificial orthography for our material.

This is somewhat similar to the 'Mismatched Crowdsourcing' approach taken by [2,3] for acquiring speech transcriptions in a foreign language using crowdsourced workers unfamiliar with that language.

We used a p2g (phoneme-to-grapheme) system to produce this. Since this was quite tentative, we used an English pronunciation dictionary to train the p2g system. In the future, more multilingual approaches will be explored (Unitran, Peters).
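As a rough illustration of the p2g idea (not the actual system we used): given phone/grapheme sequences that are already aligned one-to-one, the simplest possible model just picks the most frequent grapheme per phone. The alignment step, which real p2g/g2p training has to solve, is assumed away here, and the toy dictionary entries are invented.

```python
from collections import Counter, defaultdict

def train_p2g(aligned_pairs):
    """aligned_pairs: list of (phones, graphemes) with 1-1 alignment.
    Returns the most frequent grapheme for each phone."""
    counts = defaultdict(Counter)
    for phones, graphemes in aligned_pairs:
        for p, g in zip(phones, graphemes):
            counts[p][g] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

# Toy pre-aligned English-like dictionary entries (illustrative only).
pairs = [
    (["k", "ae", "t"], ["c", "a", "t"]),
    (["k", "ih", "t"], ["k", "i", "t"]),
    (["k", "aa", "r"], ["c", "a", "r"]),
]
p2g = train_p2g(pairs)  # 'k' maps to 'c' (2 votes) rather than 'k' (1 vote)
```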

We also tried an even simpler technique: a few hand-written rules that just map the phonetic symbols to graphemes.
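A minimal sketch of that rule-based mapping; the rule table below is illustrative, not the actual Ceq Wong rule set.

```python
# Illustrative hand-written phone-to-grapheme rules; the real rule set
# for the Ceq Wong material would differ.
RULES = {"ŋ": "ng", "ʔ": "q", "ɲ": "ny", "ə": "e", "ʃ": "sh"}

def phones_to_graphemes(phones):
    """Map each phonetic symbol to a grapheme; pass unknown symbols through."""
    return "".join(RULES.get(p, p) for p in phones)

print(phones_to_graphemes(["m", "ə", "ŋ", "a", "ʔ"]))  # -> mengaq
```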

Evaluation was quite tricky since the phonetic symbols aren't time-aligned per segment.

We used two different approaches: one was to concatenate two consecutive stretches of speech and then use the border between them as our target.
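The concatenation-based evaluation can be sketched as follows; the durations and predicted boundaries are hypothetical numbers, not measurements from our data.

```python
def boundary_errors(pairs):
    """Each pair: (duration of the first stretch in seconds, boundary
    predicted by the aligner on the concatenated audio). The true border
    sits exactly at the end of the first stretch."""
    return [abs(pred - dur_a) for dur_a, pred in pairs]

# Hypothetical alignments of three concatenated utterance pairs.
errs = boundary_errors([(3.20, 3.28), (5.00, 4.91), (2.75, 2.75)])
mean_abs_err = sum(errs) / len(errs)
```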

The other is the approach taken by [4]: build a synthesiser from the utterances and resynthesise the utterances with it. Then compare the synthesised versions with the originals and use that as a measure. Keep the best utterances, rebuild from those, repeat the resynthesis and see if the results improve.

Conceptually, the idea is quite simple: find common substrings in the symbolic representation and note their positions. Substrings of different lengths could probably be used; the optimal length is probably not easy to define. Relations between substrings of different lengths could probably also be utilised.
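Finding the common substrings of a fixed length n, together with their positions, can be sketched as follows (toy symbol sequence, invented for illustration):

```python
from collections import defaultdict

def common_substrings(symbols, n):
    """Collect every substring of length n and the positions where it occurs;
    keep only those occurring more than once."""
    positions = defaultdict(list)
    for i in range(len(symbols) - n + 1):
        positions[tuple(symbols[i:i + n])].append(i)
    return {k: v for k, v in positions.items() if len(v) > 1}

# Toy symbol sequence; ('a','b') recurs at positions 0 and 4,
# ('b','a') at positions 1 and 5.
rep = common_substrings("a b a c a b a".split(), 2)
```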

Then, likewise, find common subsegments in the acoustic representation and note their positions. By matching positions, a useful alignment should then be possible to produce. There is of course a problem: identical subsegments do not exist in natural human speech. We therefore need a way of obtaining a symbolic representation of speech acoustics that normalises the naturally occurring variation in speech but still keeps as much of the phonetic contrast as possible.
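One possibility (a crude sketch, not a representation we have settled on) is to vector-quantise frame-level acoustic features into a small discrete alphabet, e.g. with k-means; identical cluster labels then stand in for "identical" subsegments. Stdlib-only, with toy 2-D "features" in place of real MFCC frames:

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(frames, k):
    """Deterministic farthest-point initialisation."""
    centroids = [frames[0]]
    while len(centroids) < k:
        centroids.append(max(frames, key=lambda f: min(dist2(f, c) for c in centroids)))
    return centroids

def kmeans(frames, k, iters=10):
    """Plain Lloyd iterations; returns the final centroids."""
    centroids = init_centroids(frames, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in frames:
            clusters[min(range(k), key=lambda j: dist2(f, centroids[j]))].append(f)
        centroids = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def quantise(frames, centroids):
    """Replace each frame by the index of its nearest centroid (its 'symbol')."""
    return [min(range(len(centroids)), key=lambda j: dist2(f, centroids[j]))
            for f in frames]

# Toy frames forming two obvious clusters (stand-ins for MFCC vectors).
frames = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
          (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
cents = kmeans(frames, 2)
symbols = quantise(frames, cents)
```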

This task is by no means trivial and we do not claim to have found an ideal representation. But we are exploring different possibilities.

MAUS chunker: Their main insight is that, after performing speech recognition on the signal, the alignment can be performed in the symbolic instead of the signal domain, which is generally less costly. While the resulting symbolic alignment is unlikely to be perfect, there may be stretches where the recognised string and the transcription match for a sufficient number of symbols, meaning that they can be considered aligned (so-called 'anchors'). Any non-aligned stretches can be recursively subjected to the same procedure, taking advantage of the fact that information about their content has become more specific since the last iteration.
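The anchor idea can be sketched in the symbol domain with stdlib `difflib`; a real chunker like MAUS would then re-run recognition on the unaligned gaps, which is glossed over here. Words play the role of symbols and the two strings are invented:

```python
from difflib import SequenceMatcher

def find_anchors(recognised, transcription, min_len=3):
    """Matching stretches of at least min_len symbols count as anchors.
    Returns (position in recognised, position in transcription, length)."""
    sm = SequenceMatcher(a=recognised, b=transcription, autojunk=False)
    return [(m.a, m.b, m.size) for m in sm.get_matching_blocks()
            if m.size >= min_len]

rec = "x kura kura dan y kancil z".split()          # recogniser output
trn = "sang kura kura dan sang kancil pun".split()  # transcription
anchors = find_anchors(rec, trn)  # one anchor: 'kura kura dan' at (1, 1)
```

The stretches between anchors are the gaps that would be fed back into recognition.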

There are in fact two tasks: 1) chunking the recording into units that correspond to the written segmentation, and 2) aligning each unit with its corresponding transcription.


[2] Acquiring Speech Transcriptions Using Mismatched Crowdsourcing
[3] Transcribing Continuous Speech Using Mismatched Crowdsourcing

[4] cmu-wilderness