Monday, September 13, 2021

DIY Vocoder: Robots Feelin' It

The vocoder.  It's undeniably attention-grabbing.  And, for many, it is also completely off-putting.  It is absolutely the anchovies-on-pizza of the synthesizer world.  I like anchovies.  But would I like a vocoder?  I needed to find out...so, I built my own!


What is a Vocoder?  A vocoder is an effect that makes a synthesizer sound like your voice.  You sing or talk into a microphone and you play your synthesizer.  The vocoder transfers certain qualities of your voice onto your synthesizer.  Like a wah pedal, equalizer, or filter, a vocoder doesn't make its own sound; it manipulates and changes the sound from another device, such as a synthesizer.

Setup.  As shown below, a vocoder needs a microphone and a synthesizer.  The two are plugged into the vocoder.  The output of the vocoder goes out to your PA or DAW.  Pretty easy.  It's what's inside the vocoder that is the magic.

Starting Simple: 1-Band Vocoder.  Before we get too deep, we should look at a simpler example.  Since a vocoder processes many frequency bands in parallel, let's start by looking at a simple 1-band vocoder.  As I've illustrated below, a 1-band vocoder has two elements: a block to sense the instantaneous loudness of the voice ("RMS", which is an envelope follower) and a gain block (a VCA) to change the loudness of the synth in response to the voice's loudness.  This is the core unit from which we will build up a vocoder.
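For readers who think better in code, here is a minimal sketch of the 1-band idea in plain Python.  It is not the Teensy code; the function names, the one-pole smoothing, and the 10 msec time constant are all assumptions chosen for illustration.

```python
import math

def envelope_follower(samples, sample_rate=44100.0, time_constant_s=0.010):
    """Estimate the running RMS level by smoothing the squared samples.

    The one-pole smoother and its 10 ms time constant are illustrative
    assumptions, not values taken from the vocoder described in this post.
    """
    alpha = math.exp(-1.0 / (time_constant_s * sample_rate))
    mean_square = 0.0
    envelope = []
    for x in samples:
        mean_square = alpha * mean_square + (1.0 - alpha) * x * x
        envelope.append(math.sqrt(mean_square))
    return envelope

def one_band_vocoder(voice, synth, sample_rate=44100.0):
    """Scale each synth sample by the voice's current envelope (the 'VCA')."""
    envelope = envelope_follower(voice, sample_rate)
    return [s * e for s, e in zip(synth, envelope)]
```

Feed it a voice signal and a synth signal of the same length, and the synth comes out loud only where the voice is loud.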
 

What Does It Sound Like?  This basic effect changes the overall loudness of the synthesizer.  When the voice is loud, it allows the synth to be loud.  When the voice is quiet, it forces the synth to be quiet.  It responds very quickly and very naturally.  While that is cool, it does not actually sound like the voice.  Just changing the loudness is not enough to sound like your voice.

Tracking Your Voice's Formants.  To add more voice-like qualities, a real vocoder looks at the loudness of many different frequency regions of your voice.  The different frequency regions are chosen to sense the different frequency regions associated with the different vowel sounds.  As you make the different vowel sounds, you change the shape of your mouth (and nasal passages) to enhance or attenuate the different harmonics contained in your voice.  This shaping is why an "A" sounds different from an "E" and from an "O"; the peaks in the frequency response (the "formants") are very different for the different vowel sounds.  A vocoder detects this frequency shaping and applies the same frequency shaping to the synthesizer's audio.  The result is a voice-like quality to the sound of the synth.


Multiband Vocoder.  Our 1-band vocoder changed the loudness of the overall signal; all frequencies were affected equally.  To make the synth more voice-like, we want to sense the frequency shaping (the formants) as it dynamically changes to make the different vowels.  Therefore, we need to break up the audio so that we can control the loudness of individual slices of the frequency spectrum.  To get this finer control, we copy our 1-band vocoder many times in order to get a multiband vocoder.  The illustration below expands our 1-band vocoder into a 3-band vocoder.

Breaking Up the Audio into Frequency Bands.  The core of this system is still the green RMS blocks (envelope followers) and orange gain blocks (VCAs) as discussed before.  This system still controls loudness.  But, note that we precede each of these channels with a bandpass filter.  Therefore, each channel is controlling the loudness of just one frequency region.  In this simple 3-band example, the first filter might isolate the low frequencies so that the loudness of the synth's low frequencies is made to follow the loudness of the voice's low frequencies.  The middle filter and last filter would do the same for the middle and high frequencies.  Now, we are impressing more of the voice's qualities onto the synth.  A real vocoder uses 8-16 of these channels, which gives it much better resolution, making the output even more voice-like.
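The multiband structure can be sketched the same way.  This is a hedged illustration in plain Python, not the Teensy implementation: the biquad bandpass (the widely used "RBJ cookbook" form), the Q of 4, and the envelope time constant are all assumptions made for the example.

```python
import math

def bandpass(samples, center_hz, q, sample_rate):
    """Second-order (biquad) bandpass with 0 dB gain at the center frequency."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def rms_envelope(samples, sample_rate, time_constant_s=0.010):
    """Running RMS estimate (one-pole smoothing of the squared signal)."""
    alpha = math.exp(-1.0 / (time_constant_s * sample_rate))
    mean_square = 0.0
    out = []
    for x in samples:
        mean_square = alpha * mean_square + (1.0 - alpha) * x * x
        out.append(math.sqrt(mean_square))
    return out

def vocoder(voice, synth, centers_hz, q=4.0, sample_rate=44100.0):
    """For each band: measure the voice's level, apply it to the synth, then sum."""
    n = min(len(voice), len(synth))
    out = [0.0] * n
    for fc in centers_hz:
        envelope = rms_envelope(bandpass(voice, fc, q, sample_rate), sample_rate)
        synth_band = bandpass(synth, fc, q, sample_rate)
        for i in range(n):
            out[i] += synth_band[i] * envelope[i]
    return out
```

Calling `vocoder(voice, synth, [200.0, 1000.0, 4000.0])` would give a 3-band version like the illustration; those three center frequencies are placeholders, not recommendations.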

Making My Own Vocoder.  In the old days, the bandpass filters, RMS blocks, and gain blocks would all be implemented by actual electronic circuits.  Today, however, it's much easier to write signal processing software to perform these functions.  That's what I did.  I used an open-source digital audio device from Blackaddr (the "Teensy Guitar Audio Pro", TGA) and wrote software to implement the vocoder.  The TGA uses a Teensy 3.6 as its processor, which can be programmed in the Arduino IDE.  This is great because Arduino is what I use for many of my other synth hacks.  I've shared my Teensy/Arduino vocoder software on my GitHub here.

Filtering and Processing Speed.  Having up to 16 channels, and needing at least two filters per channel (one for the voice and one for the synth), a vocoder needs a lot of filters.  So, your choice of filter is important as it can consume a lot of the available processing power.  I used the multimode filter that comes in the Teensy Audio Library.  They implemented the filter as a time-domain IIR filter with fixed-point operations.  I ended up using two filters in series to sharpen the filter's response (though I'm not sure that was really necessary).  Even with the burden of the extra filtering, the Teensy 3.6 was fast enough to enable a 16-band vocoder.  With the double-filtering, that's 16 bands x 2 signals x 2 filters in series = 64 filters in total!  I was pleased.


Filter Frequencies.  An important choice is which frequency bands to use for your vocoder's filters.  I know that the frequencies should be tailored to the human voice, but I had no specific guidance on what frequencies to use.  After a bit of trial-and-error, I chose to center my first filter at 125 Hz and then step the frequency upward by a factor of 1.319x for each subsequent filter.  This seemingly-bizarre value is partway between half-octave steps (1.414x) and third-octave steps (1.260x).  Using this step size, my filters end up being centered at: 125 Hz, 165 Hz, 218 Hz,...<etc>..., 4595 Hz, 6063 Hz and 8000 Hz.  To me, this felt like a good span for the human voice.
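The step size falls out of the endpoints: spanning 125 Hz to 8000 Hz (six octaves) in 15 equal ratio steps gives 2^(6/15), or roughly 1.3195 per step.  A short sketch of that arithmetic follows; the function name and its defaults are just for illustration.

```python
def band_centers(first_hz=125.0, last_hz=8000.0, n_bands=16):
    """Geometrically spaced center frequencies from first_hz to last_hz."""
    step = (last_hz / first_hz) ** (1.0 / (n_bands - 1))
    centers = [first_hz * step ** k for k in range(n_bands)]
    return step, centers

step, centers = band_centers()
print("step ratio: %.4f" % step)
print("centers (Hz):", [round(f) for f in centers])
```

Changing `first_hz`, `last_hz`, or `n_bands` is an easy way to try a different band layout before committing it to the hardware.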

Make Your Own!  To be clear, when I wrote this vocoder, many of my choices were arbitrary.  Don't be afraid to make your own choices!  Choose a different type of bandpass filter.  Choose different filter frequencies.  Use FFTs instead of time-domain filters.  That's the beauty of hacking using open technology: you can try things out for yourself!  Go and have fun!

Thursday, September 2, 2021

DIY Beat Box to MIDI Converter

It was the best spouse's workplace party ever.  I got to talk with a guy about how he could make his modular synth respond to his voice.  How cool is that!  I was inspired.  So, while I don't have a modular synth, I do have an 808-style drum machine.  Here is a joyous toast to you, the best spouse's workplace party ever: my voice-driven drum machine!


DIY Voice-to-MIDI Converter.  The key to this hack is making a device that can listen to my voice and can distinguish between my different voice sounds.  When I make a low-pitched sound ("boom"), it should issue a MIDI command for a kick drum.  When I make a high-pitched sizzly sound ("tsst") it should command a hi-hat.  And when I make a mid-frequency sound ("kuh"), it should command a snare/clap sound.  I have a mic for my voice.  I have a MIDI-compatible drum machine for drum sounds.  What I needed was a voice-to-MIDI converter.  That is what I made.


Hardware.  For this voice-to-MIDI converter, I used an open-source audio processing device called a "Teensy Guitar Audio Pro" (TGA) from Blackaddr.  As the name implies, it is intended for guitar.  But, knowing that dynamic microphones share many qualities with guitar pick-ups, I knew that the device would work just fine with my old, cheap dynamic mic.  The TGA has tons of hardware features, but what's important for this hack is that it has a MIDI output.  What hackable guitar processor includes MIDI?  This is what makes the TGA the perfect tool for this hack.

Teensy Guitar Adapter Processes the Mic Signal to Generate MIDI Commands

Programmable.  The processor in the TGA is an open-source device that is programmable through the Arduino IDE.  This is perfect for me since many of my hacks are built around Arduino.  I have an old version of the TGA, which is built around a Teensy 3.6, which is a 180 MHz processor with floating point support.  Newer versions of the TGA bump that up to a Teensy 4.1, which is 600 MHz!  Either way, the Teensy is great because there's an audio-processing library and an active community to learn from.  The fine folks at Blackaddr also provide a bunch of example programs to help with TGA-specific features.

The Teensy Guitar Adapter is Powered by a Teensy 3.6

How Are the Sounds Different?  The main challenge for me was designing the audio processing.  How do I get the Teensy to automatically detect which sound I was making with my voice?  The first step is always for you, as the human, to understand what makes the sounds different.  A good way to do this is to make recordings of the three sounds ("boom", "tsst", and "kuh") and to compare the frequency spectrum of each.


Frequency Spectra.  Using the TGA as an audio interface for my computer (thanks to Teensy's USB Audio features!), I plugged in my microphone and recorded myself beat boxing.  I then manually sliced up the recording and excerpted the first 50 msec of each sound sample.  I pulled the excerpts into Matlab and computed the average spectrum for all the kick sounds ("boom"), hi-hat sounds ("tsst") and clap sounds ("kuh") that I made.  You can see the average spectra below.

Lows, Mids, and Highs.  After normalizing all the excerpts to the same loudness (I don't want to throw off my voice classifier just because I happened to be quiet or loud), the plot of the spectra shows what is the same and what is different about my three voice sounds.  In their normalized form, the amount of low frequency sound is similar.  But, the mids and highs are quite different:
  • The "boom" sound for the kick has little mids and little highs. 
  • The "tsst" for the hi-hat has little mids but lots of highs.
  • The "kuh" for the clap has lots of mids and lots of highs.

Classify Based on Mids and Highs.  So, the mids and highs are the key.  The figure below plots a measurement for each individual beatbox sound that I recorded.  I plotted each excerpt's mid-frequency energy on the horizontal axis versus its high-frequency energy on the vertical axis.  Notice how the three types of sounds nicely separate from each other!  This plot is the key to making the voice classifier.

All The Pieces of the Voice Classifier.  Having figured out how to distinguish between the three voice sounds, we know what we need to do.  Once we detect that a voice sound is present at all (such as by simply looking at the overall loudness of my voice), here are the three steps of the classifier:

  1. Measure the mid and high frequency energy (via some sort of filters)
  2. Compare the measured mids and highs to the 2D plot above to classify the sound
  3. Issue a MIDI command for the sound we want

Implementing the Classifier via Filters.  Based on the average spectra, it looks like I want a lowpass filter with a cutoff around 1200 Hz, a bandpass filter passing 1200 Hz to 3000 Hz, and a highpass filter with a cutoff at 3000 Hz.   Given the filter types available in the Teensy Audio library, I used the state-variable filter because it offers lowpass, bandpass, and highpass outputs.  To get a sharper frequency response, each "filter" is actually three of these filters in series.
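To show the general shape of this filter bank, here is a rough Python sketch using the textbook Chamberlin form of a state-variable filter.  It is not the Teensy Audio Library's code, which differs in its details.  The Q values, and the choice to place the bandpass at the geometric mean of 1200 Hz and 3000 Hz, are assumptions made for this example.

```python
import math

class StateVariableFilter:
    """Chamberlin state-variable filter: one update yields low, band, and high outputs."""

    def __init__(self, cutoff_hz, q=0.707, sample_rate=44100.0):
        self.f = 2.0 * math.sin(math.pi * cutoff_hz / sample_rate)
        self.damping = 1.0 / q
        self.low = 0.0
        self.band = 0.0

    def step(self, x):
        self.low += self.f * self.band
        high = x - self.low - self.damping * self.band
        self.band += self.f * high
        return self.low, self.band, high

def cascade(samples, output, cutoff_hz, q, stages=3, sample_rate=44100.0):
    """Run the chosen output ('low', 'band', or 'high') through several stages in series."""
    index = {"low": 0, "band": 1, "high": 2}[output]
    for _ in range(stages):
        svf = StateVariableFilter(cutoff_hz, q, sample_rate)
        samples = [svf.step(x)[index] for x in samples]
    return samples

def split_bands(samples, sample_rate=44100.0):
    """Split a signal into low (< 1200 Hz), mid (1200-3000 Hz), and high (> 3000 Hz) streams."""
    mid_center_hz = math.sqrt(1200.0 * 3000.0)   # assumed placement of the bandpass
    low = cascade(samples, "low", 1200.0, 0.707, sample_rate=sample_rate)
    mid = cascade(samples, "band", mid_center_hz, 1.05, sample_rate=sample_rate)
    high = cascade(samples, "high", 3000.0, 0.707, sample_rate=sample_rate)
    return low, mid, high
```

Note that this simple form of the filter is only well-behaved when the cutoff is a small fraction of the sample rate, which holds for the 1200 Hz and 3000 Hz cutoffs at 44.1 kHz.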

Threshold Detection.  With the signals filtered into three streams (low, mid, high), I wrote a simple audio class to compute the RMS envelope of each signal.  After extracting the envelope, I compare the level to a threshold to detect when the sound is loud enough that we can assume that my voice is present.  That is a "detection".  There is a voice sound that needs to be classified into kick, hi-hat, or clap.
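A hedged sketch of this detection step in Python is shown here.  It computes one RMS value per 128-sample block (the Teensy Audio Library's default block size) and reports a detection when the level first crosses a threshold.  The threshold value and the lower "release" level that re-arms the detector are assumptions for illustration, not the tuned values from this hack.

```python
import math

def block_rms(samples, block_size=128):
    """One RMS level per complete block of samples."""
    levels = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        levels.append(math.sqrt(sum(x * x for x in block) / block_size))
    return levels

def detect_onsets(levels, threshold=0.05, release_ratio=0.5):
    """Return the block indices where the level first rises to the threshold.

    After a detection, the detector stays quiet until the level has dropped
    below threshold * release_ratio, so one sound yields one detection.
    """
    onsets = []
    armed = True
    for i, level in enumerate(levels):
        if armed and level >= threshold:
            onsets.append(i)
            armed = False
        elif not armed and level < threshold * release_ratio:
            armed = True
    return onsets
```

The re-arm level matters in practice: without it, a level hovering near the threshold could fire several detections for a single "boom".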

Classify.  Having run my three filters, I have three measured values for each detection: the low-frequency loudness, the mid-frequency loudness, and the high-frequency loudness.  I normalize for the overall loudness by dividing each value by the sum of the three values.  Then, I take the normalized mid and high frequency values and compare them to my 2D plot shown earlier.  Where do they fall in this plot?  Depending upon where they fall, I issue a MIDI "note on" for the kick (note 36), hi-hat (note 42), or clap (note 39).
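As a rough illustration of this last step, here is a Python sketch of the normalize-then-decide logic and the resulting MIDI message.  The two split values (0.2) are made-up placeholders, not the boundaries from the 2D plot, and the choice of MIDI channel 10 is an assumption.  Only the note numbers come from the text above.

```python
KICK, HIHAT, CLAP = 36, 42, 39   # note numbers named in the post

def classify(low, mid, high, mid_split=0.2, high_split=0.2):
    """Pick a drum from the three band levels; returns None if there is no signal.

    mid_split and high_split are hypothetical decision boundaries.
    """
    total = low + mid + high
    if total <= 0.0:
        return None
    mid_frac = mid / total      # normalizing removes the effect of overall loudness
    high_frac = high / total
    if mid_frac >= mid_split:
        return CLAP             # lots of mids
    if high_frac >= high_split:
        return HIHAT            # little mids, lots of highs
    return KICK                 # little mids, little highs

def note_on(note, velocity=100, channel=9):
    """Build a 3-byte MIDI note-on message (channel 9 is MIDI channel 10)."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

message = note_on(classify(0.9, 0.05, 0.05))   # a bass-heavy "boom"
```

On real hardware those three bytes would be written to the MIDI output; how that is done depends on the MIDI library in use.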

Tuning.  Of course, the development of this hack did not go as smoothly as was implied here.  It took a lot of experimentation and tuning.  The final code is here on my GitHub.


Having Fun. Once I got it working, it was fun to make deep 808 kicks with my voice.  It was fun to use my same voice sounds to trigger other drum sounds (like the cowbell!).  And, since the system simply outputs MIDI notes, I used it to drive my synthesizer.  If I had a sampler, it would have been fun to trigger silly sounds on the sampler using silly sounds from my voice.  If you try this yourself, let me know what fun you have with it!