Thursday, September 2, 2021

DIY Beat Box to MIDI Converter

It was the best spouse's workplace party ever.  I got to talk with a guy about how he could make his modular synth respond to his voice.  How cool is that!  I was inspired.  So, while I don't have a modular synth, I do have an 808-style drum machine.  Here is a joyous toast to you, the best spouse's workplace party ever: my voice-driven drum machine!


DIY Voice-to-MIDI Converter.  The key to this hack is making a device that can listen to my voice and can distinguish between my different voice sounds.  When I make a low-pitched sound ("boom"), it should issue a MIDI command for a kick drum.  When I make a high-pitched sizzly sound ("tsst") it should command a hi-hat.  And when I make a mid-frequency sound ("kuh"), it should command a snare/clap sound.  I have a mic for my voice.  I have a MIDI-compatible drum machine for drum sounds.  What I needed was a voice-to-MIDI converter.  That is what I made.


Hardware.  For this voice-to-MIDI converter, I used an open-source audio processing device called a "Teensy Guitar Audio Pro" (TGA) from Blackaddr.  As the name implies, it is intended for guitar.  But, knowing that dynamic microphones share many qualities with guitar pick-ups, I knew that the device would work just fine with my old, cheap dynamic mic.  The TGA has tons of hardware features, but what's important for this hack is that it has a MIDI output.  What hackable guitar processor includes MIDI?  This is what makes the TGA the perfect tool for this hack.

The TGA Processes the Mic Signal to Generate MIDI Commands

Programmable.  The processor in the TGA is an open-source device that is programmable through the Arduino IDE.  This is perfect for me since many of my hacks are built around Arduino.  I have an old version of the TGA, which is built around a Teensy 3.6, which has a 180 MHz processor with floating point support.  Newer versions of the TGA bump that up to a Teensy 4.1, which runs at 600 MHz!  Either way, the Teensy is great because there's an audio-processing library and an active community to learn from.  The fine folks at Blackaddr also provide a bunch of example programs to help with TGA-specific features.

The TGA is Powered by a Teensy 3.6

How Are the Sounds Different?  The main challenge for me was designing the audio processing.  How do I get the Teensy to automatically detect which sound I am making with my voice?  The first step is always for you, as the human, to understand what makes the sounds different.  A good way to do this is to make recordings of the three sounds ("boom", "tsst", and "kuh") and to compare the frequency spectrum of each.


Frequency Spectra.  Using the TGA as an audio interface for my computer (thanks to Teensy's USB Audio features!), I plugged in my microphone and recorded myself beat boxing.  I then manually sliced up the recording and excerpted the first 50 msec of each sound sample.  I pulled the excerpts into Matlab and computed the average spectrum for all the kick sounds ("boom"), hi-hat sounds ("tsst"), and clap sounds ("kuh") that I made.  You can see the average spectra below.
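As a rough illustration of the averaging step, here is a minimal Python sketch. It is not the Matlab code used for the plots: the function names, the naive DFT, and the 8 kHz synthetic tones standing in for real voice excerpts are all assumptions made for the example.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Naive DFT magnitude for bins 0..N/2 (slow, but fine for 50 ms excerpts)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        spectrum.append(abs(acc) / n)
    return spectrum

def average_spectrum(excerpts):
    """Average the magnitude spectra of several equal-length excerpts."""
    spectra = [magnitude_spectrum(e) for e in excerpts]
    return [sum(column) / len(spectra) for column in zip(*spectra)]

def tone(freq_hz, sample_rate_hz, n_samples):
    """Synthetic sine wave used here as a stand-in for a recorded excerpt."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n_samples)]

# Hypothetical data: two 50 ms "kick-like" excerpts at an assumed 8 kHz rate
fs = 8000
n = fs * 50 // 1000                      # 50 ms -> 400 samples
avg = average_spectrum([tone(200, fs, n), tone(200, fs, n)])
peak_bin = max(range(len(avg)), key=avg.__getitem__)
peak_hz = peak_bin * fs / n              # bin spacing is fs / n = 20 Hz
```

With real recordings, each excerpt would replace the synthetic tones, and one average would be computed per sound type.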

Lows, Mids, and Highs.  After normalizing all the excerpts to the same loudness (I don't want to throw off my voice classifier just because I happened to be quiet or loud), the plot of the spectra shows what is the same and what is different about my three voice sounds.  In their normalized form, the amount of low-frequency sound is similar.  But the mids and highs are quite different:
  • The "boom" sound for the kick has little mids and little highs.
  • The "tsst" sound for the hi-hat has little mids but lots of highs.
  • The "kuh" sound for the clap has lots of mids and lots of highs.
Classify Based on Mids and Highs.  So, the mids and highs are the key.  The figure below plots the measurements for each individual beatbox sound that I recorded.  I plotted each excerpt's mid-frequency energy on the horizontal axis versus its high-frequency energy on the vertical axis.  Notice how the three types of sounds nicely separate from each other!  This plot is the key to making the voice classifier.

All The Pieces of the Voice Classifier.  Having figured out how to distinguish between the three voice sounds, we know what we need to do.  Once we detect that a voice sound is present at all (such as by simply looking at the overall loudness of my voice), here are the three steps of the classifier:

  1. Measure the mid and high frequency energy (via some sort of filters)
  2. Compare the measured mids and highs to the 2D plot above to classify the sound
  3. Issue a MIDI command for the sound we want

Implementing the Classifier via Filters.  Based on the average spectra, it looks like I want a lowpass filter with a cutoff around 1200 Hz, a bandpass filter passing 1200 Hz to 3000 Hz, and a highpass filter with a cutoff at 3000 Hz.   Given the filter types available in the Teensy Audio library, I used the state-variable filter because it offers lowpass, bandpass, and highpass outputs.  To get a sharper frequency response, each "filter" is actually three of these filters in series.
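To make the "three filters in series" idea concrete, here is a minimal Python sketch of a textbook Chamberlin state-variable filter, cascaded three times. It is only an illustration: the Teensy Audio library's own filter may be implemented differently, and the class name, the Q value of 0.707, and the 44.1 kHz sample rate are assumptions.

```python
import math

class StateVariableFilter:
    """Textbook Chamberlin state-variable filter (one 2-pole stage)."""

    def __init__(self, cutoff_hz, sample_rate_hz, q=0.707):
        self.f = 2.0 * math.sin(math.pi * cutoff_hz / sample_rate_hz)
        self.damp = 1.0 / q
        self.low = 0.0
        self.band = 0.0

    def process(self, x):
        """Return (lowpass, bandpass, highpass) outputs for one input sample."""
        self.low += self.f * self.band
        high = x - self.low - self.damp * self.band
        self.band += self.f * high
        return self.low, self.band, high

def cascade(samples, cutoff_hz, sample_rate_hz, output, stages=3):
    """Run samples through `stages` identical filters, keeping one output type."""
    index = {"low": 0, "band": 1, "high": 2}[output]
    filters = [StateVariableFilter(cutoff_hz, sample_rate_hz)
               for _ in range(stages)]
    result = []
    for x in samples:
        for flt in filters:
            x = flt.process(x)[index]
        result.append(x)
    return result

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def sine(freq_hz, sample_rate_hz, n_samples):
    """Synthetic test tone."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n_samples)]

# Hypothetical check: a 200 Hz tone and a 6000 Hz tone, 0.1 s each
fs = 44100
boomy = sine(200, fs, 4410)
sizzly = sine(6000, fs, 4410)
low_of_boomy = rms(cascade(boomy, 1200, fs, "low"))
low_of_sizzly = rms(cascade(sizzly, 1200, fs, "low"))
high_of_boomy = rms(cascade(boomy, 3000, fs, "high"))
high_of_sizzly = rms(cascade(sizzly, 3000, fs, "high"))
```

The low-frequency tone should come through the 1200 Hz lowpass chain almost untouched while the 6000 Hz tone is strongly attenuated, and the reverse should hold for the 3000 Hz highpass chain.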

Threshold Detection.  With the signals filtered into three streams (low, mid, high), I wrote a simple audio class to compute the RMS envelope of each signal.  After extracting the envelope, I compare the level to a threshold to detect when the sound is loud enough that we can assume that my voice is present.  That is a "detection": there is a voice sound that needs to be classified into kick, hi-hat, or clap.
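The envelope-and-threshold idea can be sketched in a few lines of Python. This is a simplified stand-in for the audio class running on the Teensy, not the real thing: the block size of 128 samples, the threshold of 0.1, and the rule of reporting only the first block that crosses the threshold (so one sound gives one detection) are assumptions for the example.

```python
import math

def rms_envelope(samples, window):
    """RMS level of each consecutive, non-overlapping block of `window` samples."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        block = samples[start:start + window]
        levels.append(math.sqrt(sum(x * x for x in block) / window))
    return levels

def detect_onsets(levels, threshold):
    """Return the block indices where the level first rises above the threshold."""
    onsets = []
    above = False
    for i, level in enumerate(levels):
        if level > threshold and not above:
            onsets.append(i)
        above = level > threshold
    return onsets

# Hypothetical test signal: silence, a quiet burst, silence, a louder burst
signal = [0.0] * 256 + [0.5, -0.5] * 128 + [0.0] * 256 + [0.8, -0.8] * 128
levels = rms_envelope(signal, window=128)
onsets = detect_onsets(levels, threshold=0.1)
```

Each entry in `onsets` marks a block where a new sound starts, which is the moment to read the low, mid, and high levels and classify them.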

Classify.  Having run my three filters, I have three measured values for each detection: the low-frequency loudness, the mid-frequency loudness, and the high-frequency loudness.  I normalize for the overall loudness by dividing each value by the sum of the three values.  Then, I take the normalized mid and high values and compare them to my 2D plot shown earlier.  Where do they fall in this plot?  Depending upon where they fall, I issue a MIDI "note on" for the kick (note 36), hi-hat (note 42), or clap (note 39).
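Here is a minimal Python sketch of the normalize-then-classify step. The note numbers come from the text above, but the two decision thresholds of 0.25 are purely hypothetical placeholders; real boundaries would have to be read off the 2D plot and tuned by experiment. The `note_on` helper builds a standard three-byte MIDI note-on message, and its default velocity and channel are also assumptions.

```python
KICK_NOTE, HIHAT_NOTE, CLAP_NOTE = 36, 42, 39   # note numbers from the post

def normalize(low, mid, high):
    """Divide each band level by the sum of all three to remove loudness."""
    total = low + mid + high
    if total <= 0:
        raise ValueError("band levels must sum to a positive value")
    return low / total, mid / total, high / total

def classify(low, mid, high, mid_threshold=0.25, high_threshold=0.25):
    """Pick a drum note from band levels (thresholds are hypothetical)."""
    _, mid_n, high_n = normalize(low, mid, high)
    if high_n < high_threshold:
        return KICK_NOTE       # "boom": little mids, little highs
    if mid_n >= mid_threshold:
        return CLAP_NOTE       # "kuh": lots of mids, lots of highs
    return HIHAT_NOTE          # "tsst": little mids, lots of highs

def note_on(note, velocity=100, channel=0):
    """Build a 3-byte MIDI note-on message (status 0x90 plus channel 0-15)."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

# Hypothetical band levels for one detection of each sound
boom = classify(0.9, 0.05, 0.05)
tsst = classify(0.3, 0.1, 0.6)
kuh = classify(0.2, 0.4, 0.4)
message = note_on(boom)
```

Because the levels are normalized first, scaling all three inputs by the same factor (a louder or quieter voice) does not change the result.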

Tuning.  Of course, the development of this hack did not go as smoothly as I implied here.  It took a lot of experimentation and tuning.  The final code is here on my GitHub.


Having Fun. Once I got it working, it was fun to make deep 808 kicks with my voice.  It was fun to use my same voice sounds to trigger other drum sounds (like the cowbell!).  And, since the system simply outputs MIDI notes, I used it to drive my synthesizer.  If I had a sampler, it would have been fun to trigger silly sounds on the sampler using silly sounds from my voice.  If you try this yourself, let me know what fun you have with it!
