I downloaded the TEDLIUM speech corpus last night, and this afternoon I poked around in it and said grr. It uses the NIST Sphere format, which is basically a plain-text header followed by encoded samples... so it's roughly a WAV file, but with far fewer useful tools. Oh well, thinks I, I'll just transcode the files into Opus at a very low bitrate and with some downsampling so that I'm not taking up all of my disk... and then opus-in-gstreamer decided to segfault every time I tried to send it the data. I'm sure it's just missing something, but I have no idea what at the moment.
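For the curious, here is a rough sketch of what reading that plain-text header can look like in Python. It assumes the usual Sphere layout as I understand it: a `NIST_1A` magic line, a line giving the header size in bytes, then `name -type value` lines up to `end_head`, with `;` starting a comment line. The function name `parse_sphere_header` is made up for illustration, and this has not been checked against every file in the corpus.

```python
def parse_sphere_header(data):
    """Parse a NIST Sphere header from raw bytes.

    Returns (fields, header_size), where fields maps names to values.
    Assumes the common layout: magic line, size line, typed fields, end_head.
    """
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or not first_lines[0].startswith(b"NIST_1A"):
        raise ValueError("not a NIST Sphere file")

    # Second line is the total header size in bytes (commonly 1024).
    header_size = int(first_lines[1].strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # blank line or comment
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a "name -type value" line
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN": string of N characters
    return fields, header_size
```

The encoded samples start at byte offset `header_size`, so something like `fields, offset = parse_sphere_header(open(path, "rb").read())` gives both the metadata and the place where the audio begins.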
In the end I just wound up transcoding to WAV files to make it easy to pull the files into gstreamer environments, which is where the final voice dictation will be happening, after all. I'm not sure that's worthwhile unless I do some downsampling, but I was hoping to use Opus and/or Speex to pre-filter the audio (yes, I know, do it myself). It would be really interesting to use the internal model of Opus or Speex as the input to the neural network.
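As an illustration of the WAV step (not the exact tooling used here), this is a minimal sketch of writing raw samples out with Python's standard `wave` module. It assumes the samples are already uncompressed 16-bit PCM, which may not hold for every Sphere file, and the helper name `pcm16_to_wav` and its parameters are hypothetical. It does no downsampling; proper resampling needs a low-pass filter and is better left to a real audio tool.

```python
import array
import wave


def pcm16_to_wav(pcm, sample_rate, channels, big_endian, out_path):
    """Write raw 16-bit PCM bytes to a WAV file (WAV stores little-endian)."""
    if big_endian:
        # Swap the two bytes of every sample to get little-endian data.
        samples = array.array("h")
        samples.frombytes(pcm)
        samples.byteswap()
        pcm = samples.tobytes()

    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

The sample rate, channel count, and byte order would come from the Sphere header fields rather than being hard-coded.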
Not going to have time to play with any of that for a while, though, as I need to get work done.