One of the things that has always bothered me as a Linux user is that some features that are very easy to use on a phone are challenging to figure out on a computer. In particular, I could dictate a Slack message on my phone, but for the life of me could not figure out how to dictate a Slack message on my laptop. As a matter of fact, unless the dictation feature came within the app itself, dictation seemed to be simply unavailable on Linux more broadly. Of course, I knew that wasn't true; there had to be a way. I finally figured it out!
My primary requirement was that, after pressing a custom shortcut key, I could dictate in real time without waiting until the end of my monologue to see the results.
This means the audio needs to be streamed to a speech provider (of course we're going to use Deepgram here), but it also means that the Linux script needs to be able to paste the output from Deepgram into the currently focused input field.
The two commands to grab text and paste are `xclip` and `xdotool`, like:

    echo "some text" | xclip -selection clipboard && xdotool key Control_L+v

The next challenge is getting the system to output the text in real time.
Linux can process streams of text quite easily, but I had issues getting that to work without using a terminal emulator.
Since gnome-terminal doesn't have an obvious flag to hide the terminal on startup, I opted for terminator, which makes that super easy.
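If you would rather drive the copy-and-paste step from Python than from the shell, here is a minimal sketch of the same two commands. The function name `paste_text` and the injectable `run` parameter are my own choices for illustration and are not taken from the repo; the `xclip` and `xdotool` arguments are the ones shown above.

```python
import subprocess


def paste_text(text, run=subprocess.run):
    """Copy `text` to the X clipboard, then send Ctrl+V to the focused window.

    `run` defaults to subprocess.run; it is a parameter only so the function
    can be exercised without xclip/xdotool installed (an illustrative choice).
    """
    # Equivalent to: echo "some text" | xclip -selection clipboard
    # Output is deliberately not captured, because xclip keeps a background
    # process alive to serve the clipboard contents.
    run(["xclip", "-selection", "clipboard"],
        input=text.encode("utf-8"), check=True)
    # Equivalent to: xdotool key Control_L+v
    run(["xdotool", "key", "Control_L+v"], check=True)


if __name__ == "__main__":
    paste_text("some text")
```

Calling `paste_text` once per transcript chunk is one way to get the "type as I speak" effect, assuming the target application treats Ctrl+V as paste.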
- Clone this repo.
- In Settings->Keyboard->Custom Shortcuts (may be named differently in your Linux flavor):
  - Add a custom shortcut (give it any name you like)
  - For the `command` give it: `terminator -H -e "bash -c /full/path/to/linux-stt/speechtotext.sh; exit"`
  - Add a hotkey. I have mine mapped to `<super> + M` for "microphone"
- Add a `.env` file with the `DEEPGRAM_API_KEY` defined.
  - No, it won't be able to read from your system environment variables.
- Update the script `speechtotext.sh` with the full paths to your location.
- Update the `FORMAT`, `CHANNELS`, and `RATE` variables in `microphone.py` to match your microphone input format.
  - If you're having trouble with the speech recognition and figuring out what those fields should be, you can try writing the captured audio to a file. There is a `frames` buffer defined in the file that is ultimately unused. You can write out the frames to a file and then try a few different `ffmpeg` conversions until you figure things out. (I'm sure there's a smarter way to do this, but this is good practice with ffmpeg :D ) Put the below code in the `finally` clause after closing the microphone:

        print(f"Writing frames {len(frames)}")
        with open('microphone.raw', 'wb') as f:
            f.write(b''.join(frames))
        # convert to wav with
        # ffmpeg -ar 44100 -f s16le -ac 2 -i microphone.raw -y test.wav

- Requires a control word to quit; currently this is "exit". This doesn't have to be the way. Check out linux-voice-type for an example of using the custom shortcut key twice: once to start and once to stop. I kind of like the "exit" control word, but it's not for everyone.
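
As a possible alternative to the ffmpeg round trip in the debugging tip above, Python's standard `wave` module can wrap the raw frames in a WAV header directly. This is only a sketch: the helper name `write_wav` is hypothetical, and the defaults (2 channels, 16-bit samples, 44100 Hz) are assumptions taken from the ffmpeg flags `-ac 2 -f s16le -ar 44100`, so adjust them to whatever you are testing for `CHANNELS`, `FORMAT`, and `RATE`.

```python
import wave


def write_wav(frames, path, channels=2, sample_width=2, rate=44100):
    """Write a list of raw PCM byte chunks to `path` as a WAV file.

    The defaults mirror the ffmpeg flags -ac 2 -f s16le -ar 44100 and are
    assumptions, not values read from microphone.py.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(b"".join(frames))


if __name__ == "__main__":
    # Stand-in for the `frames` buffer: one second of 16-bit stereo silence.
    silence = [b"\x00\x00" * 2 * 44100]
    write_wav(silence, "test.wav")
```

If the resulting file plays back too fast, too slow, or as noise, that usually points to a mismatch in the rate, channel count, or sample width, which is the same thing the ffmpeg experiments are meant to uncover.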