Full speech recognition is only needed when no transcript is available. Very often in linguistics and phonetics research, however, we do have an orthographic transcription (or, equivalently, the script that was read to make the recording). In that case we can use an adaptation of speech recognition, forced alignment, to time-align the words of the orthographic transcription to the audio. As a by-product, forced alignment also gives us a time-aligned segmental (typically phonemic) transcription of the audio.
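The core idea can be illustrated with a toy dynamic-programming aligner. This is only a minimal sketch under stated assumptions, not how any particular aligner is implemented: the function name `forced_align`, the per-frame log-likelihood dictionaries, and the example scores are all hypothetical. Real systems obtain such scores from trained acoustic models and expand words to phones with a pronunciation dictionary. What the sketch does show is the defining constraint of forced alignment: the label sequence is fixed by the transcript, so the search only decides where each boundary falls.

```python
import math


def forced_align(log_probs, labels):
    """Align a known label sequence to frames (toy Viterbi search).

    log_probs[t][label] is a hypothetical log-likelihood of `label` at frame t.
    labels is the known sequence taken from the transcript.
    Returns (label, start_frame, end_frame) triples; end_frame is exclusive.
    """
    n_frames, n_labels = len(log_probs), len(labels)
    if n_frames < n_labels:
        raise ValueError("need at least one frame per label")

    neg_inf = -math.inf
    # best[t][j]: best score of any path that assigns frame t to labels[j]
    best = [[neg_inf] * n_labels for _ in range(n_frames)]
    # advanced[t][j]: True if labels[j] starts at frame t on the best path
    advanced = [[False] * n_labels for _ in range(n_frames)]
    best[0][0] = log_probs[0][labels[0]]

    for t in range(1, n_frames):
        for j in range(n_labels):
            stay = best[t - 1][j]
            move = best[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best[t][j] = move + log_probs[t][labels[j]]
                advanced[t][j] = True
            else:
                best[t][j] = stay + log_probs[t][labels[j]]

    # Trace back from the last frame, which must carry the last label.
    segments = []
    j, end = n_labels - 1, n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][j]:
            segments.append((labels[j], t, end))
            end = t
            j -= 1
    segments.append((labels[j], 0, end))
    segments.reverse()
    return segments


# Made-up scores for five frames of the word "cat" (phones k, ae, t).
frames = [
    {"k": -0.1, "ae": -3.0, "t": -3.0},
    {"k": -0.2, "ae": -2.0, "t": -3.0},
    {"k": -2.5, "ae": -0.1, "t": -3.0},
    {"k": -3.0, "ae": -0.2, "t": -2.0},
    {"k": -3.0, "ae": -2.0, "t": -0.1},
]
segments = forced_align(frames, ["k", "ae", "t"])
print(segments)  # [('k', 0, 2), ('ae', 2, 4), ('t', 4, 5)]
```

Multiplying the frame indices by the frame step (often 10 ms, though that is an assumption about the front end) turns these segments into times in seconds.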
Two notable systems for forced alignment are:
WebMAUS - from the Institute of Phonetics and Speech Processing, Munich
FAVE-align - an online interface to the Penn Phonetics Laboratory Forced Aligner (P2FA)
Both of these provide force-aligned labels in the form of Praat TextGrid files.
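Once the labels are available, they can be read back for further analysis. The following is a minimal sketch, assuming the labels arrive as a Praat TextGrid in the long text format; the helper name `read_intervals` and the sample content are hypothetical. It collects the intervals of every interval tier into a single list, so it is only suitable for a quick look at the data. For real work a dedicated TextGrid library or Praat itself is the safer choice.

```python
import re

# An interval is an xmin/xmax pair followed directly by a text label.
# File-level and tier-level xmin/xmax pairs are not followed by "text",
# so they are skipped. Praat writes a quote inside a label as "".
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.eE+-]+)\s*'
    r'xmax\s*=\s*([\d.eE+-]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)


def read_intervals(textgrid_text):
    """Return (start_seconds, end_seconds, label) for every interval found."""
    return [
        (float(start), float(end), label.replace('""', '"'))
        for start, end, label in INTERVAL_RE.findall(textgrid_text)
    ]


# Hypothetical sample: one word tier with two intervals.
sample = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = "hello"
        intervals [2]:
            xmin = 0.5
            xmax = 1.5
            text = "world"
'''

for start, end, label in read_intervals(sample):
    print(f"{start:.2f}\t{end:.2f}\t{label}")
```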