COMPUTER SOUND PRINCIPLES AND VOCABULARY
For game audio, it all ends with a single byte stream. It may sound silly, but in reality, the entire focus of this book, and game audio in general, is on generating huge one-dimensional arrays of bytes. However, this simple core idea can sometimes be obfuscated by lots of technical vocabulary. To prevent total confusion, here's a look at some basic audio terms.
Sample Rate and Sample Quality
Pervasive throughout audio programming is the notion of the sample rate. To understand this, you need to learn how digital audio makes its way onto your hard drive in the first place (see Figure 1.1).

Figure 1.1: How your computer records audio.
Suppose you're making an action game and need a bone-chilling scream sound effect. You recruit your friend/sibling/co-worker and convince him to sit down in front of a microphone. You hit the record button in your audio software and cue him to scream.

As soon as you hit the record button, your audio software instructs your sound hardware to begin capturing the sound. Basically, when your system captures sound, it relies on a special chip on your sound card called an ADC, or analog-to-digital converter. Every so often, the ADC looks at the electrical current from the mic to determine how loud it is out there. It then sends this number back to the sound card's device driver, and ultimately through Windows and back to your audio program, which stores it in memory or writes it to disk. This piece of data is called a sample. (Note that oftentimes the entire recording, or collection of samples, is also referred to as a sample, so you have to rely on context to figure out what's being said.)

The number of times per second samples are collected is called the sample rate. Sample rates are measured in Hertz, a unit meaning "per second." That is, 8,000Hz means something happens 8,000 times every second. Just as there are bytes and kilobytes, there are also Hertz and kilohertz; 22KHz means that something is happening twenty-two thousand times per second, and 8,000Hz is equal to 8KHz.

In a way, sample rates are to audio programming what resolution is to graphics programming. Just as more pixels mean higher-quality video, a higher sample rate means higher-quality audio. Typically, you'll record sound effects at 11,025Hz, 22,050Hz, or 44,100Hz. 44,100Hz (44.1KHz) is the sample rate of CD audio; if you're listening to a track on a CD, some computer somewhere measured the loudness in the air 44,100 times every second. That's a lot of data.
AUDIO CLIP | Play Audio Clips 1.1 and 1.2 to hear the difference between a 44.1KHz sample rate and an 8,000Hz sample rate. |
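To make the idea of "huge one-dimensional arrays of bytes" concrete, here's a minimal sketch of generating one second of a pure tone at a given sample rate. The function name makeTone is hypothetical, not from any real API; the loop just evaluates a sine wave once per sample, which is all a sample rate really means.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: generate one second of a pure sine tone as 16-bit mono samples.
// At 44,100Hz (CD quality) this produces 44,100 samples; at 8,000Hz it
// would produce 8,000. Nothing about the loop changes except the count.
std::vector<int16_t> makeTone(int sampleRate, double frequencyHz)
{
    std::vector<int16_t> samples(sampleRate); // one second of audio
    const double twoPi = 6.283185307179586;
    for (int i = 0; i < sampleRate; ++i)
    {
        double t = static_cast<double>(i) / sampleRate; // time in seconds
        double amplitude = std::sin(twoPi * frequencyHz * t); // -1.0 to 1.0
        samples[i] = static_cast<int16_t>(amplitude * 32767.0);
    }
    return samples;
}
```

A higher sample rate simply means more entries in that array per second of sound, which is why it trades quality against storage exactly the way resolution does for images.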
When the time comes to play the sound back, your computer simply reads the numbers and feeds them back out to the sound card (see Figure 1.2).

Figure 1.2: How your computer plays back audio.
There's another facet to the equation, though: the notion of sample quality. Sample quality (also called bits per sample, or sometimes just bits) refers to how accurately each sample is measured. The two common values are 8 bits and 16 bits. Obviously, 16 bits is better than 8. With 8 bits, you can only express 256 distinct levels of loudness, but with 16 bits, you get 65,536 unique levels, making your measurements tremendously more precise. Taking this to imaginary extremes, if you had only 1 bit, you could only express "loud" and "quiet," and that wouldn't make for a very good sample at all. If sample rate is akin to video resolution, then bits per sample is akin to color depth: the more bits, the more precise you can be.
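The precision difference can be seen directly by quantizing the same loudness value at both depths. This is only an illustrative sketch (the function names are made up for this example), but it shows why 16 bits recovers a measured value much more faithfully than 8 bits:

```cpp
#include <cmath>
#include <cstdint>

// Sketch: quantize an "analog" loudness in the range [-1.0, 1.0] at two
// sample depths. An 8-bit sample can take only 256 distinct values; a
// 16-bit sample can take 65,536.
int8_t  quantize8(double v)  { return static_cast<int8_t>(std::lround(v * 127.0));   }
int16_t quantize16(double v) { return static_cast<int16_t>(std::lround(v * 32767.0)); }

// Convert a stored sample back to the loudness it represents:
double decode8(int8_t s)   { return s / 127.0;   }
double decode16(int16_t s) { return s / 32767.0; }
```

Round-tripping any in-between loudness through quantize8/decode8 lands noticeably farther from the original value than the 16-bit pair does; that rounding error is exactly the quality you give up at lower bit depths.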
Why 44,100 Hertz?
Why are CDs sampled at 44,100 Hertz, anyway? It seems like such an odd number. To answer this question, we have to dive back into audio history. Before CDs, digital audio was stored on video tape—a hack rivaling the best of them. The tapes were designed to store and play back 60 frames per second of video data. Each frame of video had 245 lines, and each line had three samples (for red, green, and blue). That gives us 245 × 3 × 60, or 44,100 samples.
AUDIO CLIP | Play Audio Clips 1.3 and 1.4 to hear the difference between 8- and 16-bit samples. Listen closely though; these differences are not nearly as noticeable as the sample rate differences. |
Stereo Sound
Up until now, this chapter has focused on one-channel sound, also called monaural sound, and abbreviated mono.

To capture stereo sound, you use two mics in two different places (one to the left of the sound source, and one to the right). Most of the time, you don't capture audio like this, though; instead, you use a mixing station or mixing software to explicitly control the sound. The mixing station allows you to pull in mono samples and control which speaker they come out of.

Regardless of how you do it, eventually you end up with a stereo sample, which has two channels. CD quality is synonymous with 16-bit stereo sound sampled at 44,100Hz. For every second of CD-quality audio, you have 44,100 samples at 16 bits (2 bytes) per sample, or 88,200 bytes per channel per second. Stereo sound means two channels, so you have 88,200 × 2, or 176,400 (about 176K), bytes for each second of audio. That's why something as huge as a CD, which can store 650MB of data, can only store 74 minutes of CD-quality audio.
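That storage arithmetic is worth writing down as code, since you'll do it constantly when allocating audio buffers. This is just a sketch of the calculation described above; the function name is made up for illustration:

```cpp
// Sketch: storage cost of uncompressed PCM audio.
// CD quality: 44,100 samples/sec x 2 bytes/sample x 2 channels
//           = 176,400 bytes for every second of sound.
long bytesPerSecond(long sampleRate, int bytesPerSample, int channels)
{
    return sampleRate * bytesPerSample * channels;
}

long bytesForSeconds(long sampleRate, int bytesPerSample, int channels, int seconds)
{
    return bytesPerSecond(sampleRate, bytesPerSample, channels) * seconds;
}
```

Plugging in CD-quality numbers for a 3-minute (180-second) song gives a little under 32 million bytes, which is the "roughly 30MB" figure quoted in the next section.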
The Story behind CDs
My father-in-law, a musician, once told me that the reason CDs hold 74 minutes of music is because the powers that be wanted to listen to Beethoven's Ninth Symphony in its entirety, without interruption. The engineers, always anxious to please, calculated the length of this symphony as 74 minutes, and came up with a physical specification for a disc that could store that much audio. I find it fascinating that a classical musician who died nearly two centuries ago had a very large hand in shaping one of today's most omnipresent pieces of audio technology.
Sound Formats, Compression, and Codecs
Of course, the world would be a pretty depressing place if the only way you could store CD-quality sound was by burning through 176KB per second. A typical 3-minute song would burn through 31,680KB (roughly 30MB)! This would make record labels happy, because they wouldn't have to deal with people swapping songs online. Even on a fast broadband connection, 30MB is a lot of bytes to transfer for three minutes of audio.

Clever people have come up with ways around this, though. Several audio compression algorithms have come along. These algorithms represent the same data in more compact ways, allowing you to store the same information in fewer bytes. For example, say you have a mono WAV file with two seconds of silence in it. Ordinarily, if you were using CD quality, you'd write out 88,200 samples with the same "zero" value. However, you could compress that by saying something like, "Okay, the next 88,200 samples are zero," and not bother to write out each one. That's what a common compression algorithm called RLE (run-length encoding) does.

RLE is a very simple algorithm (see Figure 1.3). A more advanced one goes by the name of MPEG Layer 3. In the days of 8.3 file names, sound files compressed using the MPEG Layer 3 algorithm were given the three-character extension MP3, and the rest is history. MPEG itself is an acronym for the Moving Picture Experts Group. For more history on the MP3 format and the people who made it (Fraunhofer IIS-A), check out http://www.iis.fhg.de/amm/techinf/layer3/.

Figure 1.3: The algorithm behind the simplest form of RLE compression
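The simplest form of RLE can be sketched in a few lines. This stores each run of identical bytes as a (count, value) pair; real codecs pack the counts and values into the byte stream more cleverly, but the idea is the same:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of the simplest form of run-length encoding: each run of
// identical bytes becomes one (count, value) pair.
std::vector<std::pair<uint32_t, uint8_t>> rleEncode(const std::vector<uint8_t>& data)
{
    std::vector<std::pair<uint32_t, uint8_t>> runs;
    for (uint8_t b : data)
    {
        if (!runs.empty() && runs.back().second == b)
            ++runs.back().first;        // extend the current run
        else
            runs.push_back({1, b});     // start a new run
    }
    return runs;
}

// Decoding just expands each pair back into a run of bytes.
std::vector<uint8_t> rleDecode(const std::vector<std::pair<uint32_t, uint8_t>>& runs)
{
    std::vector<uint8_t> data;
    for (const auto& run : runs)
        data.insert(data.end(), run.first, run.second);
    return data;
}
```

A long stretch of silence collapses into a single pair, which is exactly the "the next 88,200 samples are zero" trick described above. On typical audio, though, where neighboring samples rarely repeat exactly, RLE buys you almost nothing, which is why perceptual codecs like MP3 exist.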
MP3 is a "lossy" compression algorithm, which means that some information is lost when you compress a WAV into MP3. MP3 works by making sure that the information that's lost is information you can probably live without. For example, a lot of high-frequency things, such as cymbal crashes and the hiss of an "s" sound or a crisp "k" sound, are lost. Usually this loss is imperceptible to the average listener, but with a good pair of speakers and a keen ear, you can hear the difference. Try it sometime: go to a quiet place and listen to your favorite song on CD, then listen to the same song in MP3. If your speakers are good and you're young or have taken good care of your ears, you'll be able to hear the difference.

Other compression formats are out there. There's ADPCM, an acronym for Adaptive Differential Pulse Code Modulation; Ogg Vorbis, an open source, patent-free audio compression algorithm that's quickly gaining popularity; and several lesser-known formats.

The pieces of code that implement these algorithms are called codecs, an acronym for compressor/decompressor (don't you love all these audio acronyms?). Contained in a WAV file is the name of the codec it was compressed with; Windows provides several of the most common codecs, and automatically uses the right one based on the tag in the WAV file. If you try to play a WAV file that contains a tag for a codec not installed on your system, you won't be able to until you hunt down the codec it was made with. Happily, this is not a common occurrence; 99 percent of the WAVs out there are PCM or ADPCM.
Mixing Sound
Of course, there's more to good game audio than just playing WAV files. Often you'll want to play more than one sound effect at once, and that's where audio mixing comes in.

The easiest way to play two sounds at the same time is simply to add their samples together (see Figure 1.4).

Figure 1.4: Simple additive sound mixing
So, if you have a scream sample and a growl sample, and you add them together, you'll hear both the scream and the growl at the same time. This doesn't seem like it should work, but it does.

Of course, getting good results requires more than just adding the sounds together. Your next thought might be, "Well, what about averaging them together?" The problem with that is that it makes each individual sound softer. If you're playing three sound effects at once, each sound effect will have one-third the volume it has when played alone.

There are different solutions to this, some involving logarithmic lookup tables and other advanced math. The good news is that unless you're an expert working on something very advanced, you never have to worry about mixing samples yourself; that's all done by the Windows OS. The closest you'll usually get is to say, "Hey, DirectX Audio (or hey, Windows), mix this sample at this volume with that sample at that other volume," and then through some black magic it works.
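For the curious, here's what simple additive mixing looks like in code. This is a sketch, not what Windows actually does internally: the sums are computed in 32 bits and then clamped ("clipped") back into the 16-bit range, which is only the crudest guard against overflow, not the advanced logarithmic approach alluded to above:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: additive mixing of two 16-bit mono samples. Each output sample
// is the sum of the corresponding input samples, clamped into the legal
// 16-bit range so loud passages clip instead of wrapping around into noise.
std::vector<int16_t> mix(const std::vector<int16_t>& a, const std::vector<int16_t>& b)
{
    std::size_t n = std::max(a.size(), b.size());
    std::vector<int16_t> out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        // Treat the shorter sound as silent once it ends.
        int32_t sum = (i < a.size() ? a[i] : 0) + (i < b.size() ? b[i] : 0);
        sum = std::min<int32_t>(32767, std::max<int32_t>(-32768, sum));
        out[i] = static_cast<int16_t>(sum);
    }
    return out;
}
```

Feeding a scream and a growl through this loop produces one array containing both sounds at once, which is all "mixing" fundamentally is.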