Wednesday, November 28, 2012

The Color of Music

I was recently inspired by two things: a neat TED talk about a colorblind guy who wears a camera device that plays different audio frequencies based on the color he is looking at, letting him "see" (sense) color, and a neat talk at Berry from Robert Schneider (no relation to Rob Schneider, it seems) about some of his math-inspired music projects. I wanted to try working backwards: turning sound into color.

Sound

All sound waves are complex waveforms, which are really just sums of ordinary sinusoidal waves. When you play middle C on any instrument, the sound you make is not a pure 261.63 Hz sine wave. There are overtones and so forth that also sound, based on the shape and design of the instrument, giving the note its timbre. The brilliant Joseph Fourier (the bane of all undergrad EE students taking Intro to Signal Processing) did a lot of work with complex waveforms, eventually leading to a whole branch of mathematics called Fourier analysis. The centerpiece for our purposes is the Fourier transform: given a complex wave function f(x), evaluating the integral

F(ν) = ∫ f(x) e^(−2πiνx) dx    (integrated over all x)

results in a very interesting function F(ν) which is able to pick out the component frequencies in the complex wave. (For sampled data like digital audio, the integral becomes a finite sum, the Discrete Fourier Transform, or DFT.) Here is an example from MathWorks:
[Figure: a small section of a wave which is the sum of a 120 Hz wave with max amplitude 1 and a 50 Hz wave with max amplitude 0.7, plus some random data added as noise.]
[Figure: the result of the FFT. Notice the peak at 120 Hz with a height of about 1 and the peak at 50 Hz with a height of about 0.7. The other peaks are the result of the random data. Notice how we are able to pick out the components even in spite of the added noise.]
There are handy sets of computer algorithms called Fast Fourier Transforms (FFTs) which, as you might have guessed, are fast ways of computing DFTs. The most popular example is FFTW, a C library which, from what I understand, benchmarks several FFT strategies and picks the best one for the size and shape of the data given. (Fittingly, MATLAB's own fft() function is built on FFTW.)

Here is a little example I did in MATLAB. I was able to determine the first note of the Hallelujah Chorus from Handel's Messiah, a clip of which is included with MATLAB as a sample file:

load handel.mat
%The file is now stored in y
sound(y)  %listen to it to make sure it works
%the sound function defaults to a sample rate of 8192 samples per second.
%If we double it, we double the tempo and frequency (chipmunks)
sound(y,16384)
plot(y) %this will show a graph of the waveform like you see in a music editing program

%Let's get just the first note ('HA-'). Look at the plot. The section we
%want is from about 1667 to about 5000, so let's define a new sound array a
a=y(1667:5000,1);
%listen to see that it is just the first note
sound(a)
plot(a)

%now we can find the FFT for a. The following is from the MATLAB guide for
%the fft() function, using this data instead of theirs.
Fs=8192;                            %The sample frequency
T=1/Fs;                             %Period=1/Frequency
L=size(a,1);                        %the length of the signal
NFFT = 2^nextpow2(L);               %the next power of two after L
Y = fft(a,NFFT)/L;                  %our FFT data (a, not y -- we only want the first note)
f = Fs/2*linspace(0,1,NFFT/2+1);    %f is the set of frequencies for the x axis of the graph
p = 2*abs(Y(1:NFFT/2+1));           %p is the set of amplitudes for the y axis of the graph
plot(f,p);                          %isn't it beautiful!?

%now, let's find out what note they are singing.
%find(p==max(p)) will return the set of indexes in p of the maximum value of
%p (which is the one we want, a little after 500 Hz). In this case, there is
%only 1 occurrence, so we can dump it straight into f, as the corresponding
%frequency for this value is then the value of f at this index.
f(find(p==max(p)))
%this returns 572 Hz, which is close to 587.33 Hz: D5. This leads us
%to suspect the chorus opens with a D. If you look at the sheet music, it actually
%opens with an octave of D's: D5 and D6 (the song is in D Major). If you find the
%frequency for that other peak, you will find it is close to 1174.66 Hz, the frequency for D6.
The code above will generate the following graphs:

[Figure: the waveform of the first note of the sound clip]
[Figure: the FFT of the first note]
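
As a bonus (my own addition, not from the MATLAB docs), here is a little sketch that turns a peak frequency into the nearest equal-tempered note name, using the MIDI convention of A4 = 440 Hz:

%map a frequency in Hz to the nearest equal-tempered note (A4 = 440 Hz)
freq = 572;                          %the peak we found above
names = {'C','C#','D','D#','E','F','F#','G','G#','A','A#','B'};
n = round(69 + 12*log2(freq/440));   %nearest MIDI note number
note = [names{mod(n,12)+1} num2str(floor(n/12)-1)]  %displays 'D5'

Running it on our 572 Hz peak prints D5, matching the sheet music.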


Light

Visible light is also a complex wave. Shining light through a prism breaks the light into its component parts, and spectral analysis of light will reveal its component frequencies. Pictured here is an argon tube lamp (which emits a purple-blue color) and its spectrum. Notice that the vertical lines in the purple, blue, and teal sections of the spectrum graph are bright. The brightness corresponds to the intensity, or amount, of that particular frequency.


Putting it together

We have complex sound waves split into component frequencies (notes) and their relative intensities (volume), and complex light waves split into component frequencies (colors) and their relative intensities (brightness). If we shift and translate our sound data from the human audio frequency range (roughly 12 Hz to 20 kHz) to the human visual frequency range (390-750 nm wavelengths at the speed of light equate to roughly 400-770 THz; 1 THz = 1 trillion Hz), then every note gets a color and every chord gets a blend of colors.
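For example, here is one possible translation, just a sketch (the logarithmic mapping is my own arbitrary choice; pitch perception is roughly logarithmic, so it seemed natural):

%sketch: map an audio frequency into the visible range by matching its
%logarithmic position within the hearing range (an assumed mapping)
audioLo = 12;     audioHi = 20e3;    %rough human hearing range, in Hz
lightLo = 400e12; lightHi = 770e12;  %rough visible light range, in Hz
fAudio = 572;                        %our D5 peak from earlier
pos = log(fAudio/audioLo)/log(audioHi/audioLo);  %position from 0 to 1
fLight = lightLo*(lightHi/lightLo)^pos           %the mapped color frequency
lambda = 3e8/fLight*1e9                          %its wavelength in nm

With these numbers, our 572 Hz D lands around 533 nm, somewhere in the green.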

[Figure: the triangle marks the limits of the RGB system, and the "tongue" is the full human visible spectrum in CIE coordinates.]
There is still another problem. All of our displays (TVs, LCD panels, even CRTs) use RGB values to determine color. Linear combinations of red, green, and blue will result in all the colors of the color wheel. However, our displays do not factor intensity into these values. Believe it or not, the screen you are looking at now is not capable of producing every visible color. The graph above is from this amazingly useful site, which also includes source code for converting RGB <=> CIE color data.
CIE stands for the International Commission on Illumination (were you aware there was one!?), which defined the CIE XYZ color space. My knowledge on this is pretty limited. From Wikipedia:
The human eye has photoreceptors (called cone cells) for medium- and high-brightness color vision, with sensitivity peaks in short (S, 420–440 nm), middle (M, 530–540 nm), and long (L, 560–580 nm) wavelengths... These tristimulus values of a color can be conceptualized as the amounts of three primary colors in a tri-chromatic additive color model. 
In this model, Y means luminance, Z is quasi-equal to blue stimulation, or the S cone response, and X is a mix (a linear combination) of cone response curves chosen to be nonnegative. Thus, XYZ may be confused with LMS cone responses. But in the CIE XYZ color space, the tristimulus values are not the L, M, and S responses of the human eye, even if X and Z are roughly red and blue. Rather, they may be thought of as 'derived' parameters from the long-, medium-, and short-wavelength cones.
While I still don't completely understand the coordinate system, the conversion article is quite good at explaining how to normalize color values not covered by RGB coordinates to something RGB can handle (see the section called "Unrepresentable Colors").
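
To make that concrete, here is a rough sketch of the XYZ-to-RGB step, using the standard sRGB (D65) conversion matrix and a crude clamp for out-of-gamut colors (the article above describes a nicer desaturation approach):

%sketch: CIE XYZ -> sRGB using the standard sRGB (D65) matrix, with a
%crude clamp for out-of-gamut colors (save as xyz2srgb.m)
function rgb = xyz2srgb(xyz)
M = [ 3.2406 -1.5372 -0.4986;       %standard XYZ -> linear sRGB matrix
     -0.9689  1.8758  0.0415;
      0.0557 -0.2040  1.0570];
rgb = M*xyz(:);                     %linear RGB
rgb = min(max(rgb,0),1);            %clamp out-of-gamut values into [0,1]
lo = rgb <= 0.0031308;              %apply the sRGB gamma curve for display
rgb(lo) = 12.92*rgb(lo);
rgb(~lo) = 1.055*rgb(~lo).^(1/2.4) - 0.055;

As a sanity check, xyz2srgb([0.9505; 1.0000; 1.0890]) (the D65 white point) comes out to white, [1 1 1].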

The good news is that this project can be done with two existing C libraries. I will work on actually getting this to work, and hopefully have a graph of a song's color over time. Eventually, I would like to make a device with a microphone and a cheap LCD that displays the color of whatever it is hearing. That is a much harder problem. First, you have to sample the sound to have a wave function to do an FFT on. That means the device will have to record a brief snippet of sound and chug through a bunch of math before displaying a color. Google searches suggest people have been able to write watered-down FFT algorithms that run on 8 MHz ATmegas, so it might be possible on an Arduino, but the processing time might be too high for a reasonably small sample time. I would settle for 3 samples per second, and be happy with 30 (a common framerate); 100 samples per second would be ideal. If an Arduino couldn't handle this, I'm sure my RPi (which has a 32-bit ARM processor with support for floats, unlike the ATmegas) can.
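
As a first cut toward that graph of a song's color over time, here is a sketch that ties the pieces together on the handel clip (the half-second window and the pitch-to-hue mapping are my own arbitrary choices):

%sketch: a song's color over time -- windowed FFTs of the handel clip,
%with each window's peak frequency mapped to a hue (an assumed mapping)
load handel.mat                     %gives us y and Fs (8192)
win = 4096;                         %half-second windows at Fs = 8192
nWin = floor(length(y)/win);
colors = zeros(nWin,3);
f = (0:win-1)*Fs/win;               %frequency of each FFT bin
for k = 1:nWin
    seg = y((k-1)*win+1 : k*win);
    P = abs(fft(seg));
    [~,idx] = max(P(2:win/2));      %strongest bin, skipping DC
    pos = log(f(idx+1)/12)/log(20e3/12);            %position in hearing range
    colors(k,:) = hsv2rgb([min(max(pos,0),1) 1 1]); %pitch -> hue
end
image(reshape(colors,[1 nWin 3]))   %one colored stripe per window
axis off

Each stripe is the color of one half-second of the Hallelujah Chorus.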
