Signal Enhancement for the Human Voice

Tamitha L. Skov and James L. Roeder

Scientists at The Aerospace Corporation routinely use signal analysis to study satellite data of the space environment. The techniques applied to study these sounds from space have also proven beneficial to law enforcement efforts to decipher voices.

Since 1966, Aerospace has analyzed audio-frequency radio signals in the 1 to 20 kilohertz (kHz) range from scientific payloads on Air Force and NASA spacecraft. The primary focus of such analyses has been to study electromagnetic signals generated by charged particles in space to better understand the radiation environment in Earth's magnetosphere and ionosphere, where charged particles abound.

Aerospace has recently been applying these techniques to analyze and enhance audio recordings for various law-enforcement agencies across the United States. In most cases, the task is to improve the intelligibility of recorded voices that have been shrouded in various types of noise. This work is done for the National Law Enforcement and Corrections Technology Center, which Aerospace operates under contract to the National Institute of Justice.

Since 1996, tapes from more than 500 cases have been processed for local police departments and most of the state and federal law-enforcement agencies. Aerospace employees have testified in court as expert witnesses in such high-profile cases as those of Michael Jackson, JonBenét Ramsey, and most recently Elio Carrion, the U.S. Air Force senior airman shot by a San Bernardino, California, sheriff's deputy.

Specific characteristics of audio speech recordings must be taken into consideration when tailoring audio processing algorithms to support the needs of the audio forensics analyst. These include the character of the speech signal and how the vocal tract generates it, the physiology of how the sound is perceived by the listener's ear and brain (known as "psychoacoustics"), and the environment in which the recording is made. The recording environment can include a great variety of ambient noise sources, including wind, other voices, nearby vehicles, music, and electromagnetic interference. All of these subtle (and not so subtle) aspects of the audio forensics analyst's duties are reviewed in the following sections.

Sounds from Space

The physics of collisionless plasma waves is one area in which Aerospace studies "sounds from space." An example of such a phenomenon occurs when lightning strikes and generates what are called "whistler" waves—very low frequency electromagnetic (radio) waves with maximum amplitudes usually at 3 to 5 kilohertz, similar to the frequency band of human speech. Although electromagnetic in nature, these waves can be converted into audio signals using a suitable receiver.

Produced mostly by intracloud and return-path lightning strikes, the initial impulse forming the wave travels away from the source region, up through the ionosphere, into the magnetosphere, and then returns, traveling along closed magnetic field lines. The wave undergoes dispersion of several kilohertz because the lower frequencies travel more slowly through the plasma environments of the ionosphere and magnetosphere. As a result, the wave is perceived as a descending tone, much like a whistle, that lasts for several seconds.
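The descending tone can be illustrated with a rough calculation. The sketch below assumes the Eckersley approximation for whistler dispersion, in which the arrival delay is t = D / sqrt(f). That approximation is a standard textbook simplification and is not a formula given in this article; the dispersion constant D of 50 s Hz^0.5 and the function names are illustrative assumptions only.

```python
import math

def whistler_delay(freq_hz, dispersion=50.0):
    """Arrival delay in seconds under the Eckersley approximation t = D / sqrt(f).

    `dispersion` (D, in s * Hz**0.5) is a hypothetical, illustrative value.
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return dispersion / math.sqrt(freq_hz)

def arrival_order(freqs_hz, dispersion=50.0):
    """Return (delay, frequency) pairs sorted by arrival time, earliest first."""
    return sorted((whistler_delay(f, dispersion), f) for f in freqs_hz)

if __name__ == "__main__":
    for delay, freq in arrival_order([1000, 2000, 5000, 10000]):
        print(f"{freq:>6} Hz arrives after {delay:.2f} s")
```

Under this assumption the highest frequencies arrive first and the lowest last, which is why a receiver would render the event as a falling whistle.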

The "dawn chorus" emission is another space physics electromagnetic wave phenomenon occurring most often near sunrise. The chorus waves exist in the audio-frequency spectrum up to about 10 kilohertz and, when converted to sound, resemble the birds' dawn chorus, hence the name. The chorus is thought to be caused by high-energy electrons that get caught in the Van Allen radiation belts and fall to Earth's surface in the form of audio-frequency radio waves.

Dawn choruses occur more frequently during space weather events called geomagnetic storms. A related phenomenon, which occurs during auroras, is known as an auroral chorus. It has been theorized that similar choral effects can be created by the auroras of Jupiter and Saturn and any other planet that possesses a magnetosphere-ionosphere system capable of creating auroras. In 2002, the Cassini spacecraft recorded an auroral chorus from Saturn.

Techniques used to analyze these kinds of electromagnetic waves include spectral analysis, noise reduction, filtering, and time stretching. Similar digital signal processing techniques may also be used to modify or analyze any other type of signal that carries information. Examples of such signals include housekeeping telemetry signals from a launch vehicle, a whistler-mode plasma wave emission such as those previously described, or an audio recording of people speaking. Once digitized, these very different signals have very similar attributes.

Characteristics of Voice

What is commonly referred to as the human "speech signal" begins as a sound generated from the vocal folds (mistakenly called "vocal cords") in the larynx. This "speech signal" is a continuous waveform with a fundamental frequency of approximately 100 Hz for men and approximately 200 Hz for women. The exact frequency of the fundamental is determined by the vibrational characteristics of the folds and the dimensions of the larynx. Higher-order harmonics of diminishing amplitude are also present in the signal, and their frequencies and amplitudes are coupled to the throat, mouth, and face. Motion of the throat, together with mouth and facial movements, alters the output signal and is crucial to the formation of speech sounds.

The "vocal tract" consists of the laryngeal cavity, the pharynx, the oral cavity (formed by the tongue, palate, jaw, cheek, lips, etc.), and the nasal cavity. Resonances within the vocal tract naturally occur due to the physical shape and size of these cavities. By changing the size and shape of these cavities—for example, by moving the tongue, lips, and jaw—the vocal tract acts as a frequency-dependent amplifier that modulates the speech signal and enables us to produce the sounds we normally associate with voice. Since the speech signal is harmonic, the resonant gain of the vocal tract occurs at multiples of the fundamental pitch frequency. These amplitude peaks in the frequency spectrum of the voice are called the "formants" and are critical to intelligibility. Understanding which formants and which frequencies in the human voice are critical to intelligibility is where signal processing comes into play.
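The idea of the vocal tract as a frequency-dependent amplifier can be sketched numerically. The example below models a single formant as a two-pole digital resonator, which is a common textbook model and an assumption here, not the authors' method. The 100 Hz fundamental comes from the text; the 1/k harmonic rolloff, the 500 Hz formant, the 80 Hz bandwidth, and the 8000 Hz sample rate are illustrative values.

```python
import cmath
import math

def resonator_gain(freq_hz, center_hz, bandwidth_hz, sample_rate_hz):
    """Magnitude response of a two-pole digital resonator at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)  # pole radius
    theta = 2.0 * math.pi * center_hz / sample_rate_hz      # pole angle
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * r) * z_inv ** 2
    return 1.0 / abs(denom)

def shaped_harmonics(f0_hz=100.0, n_harmonics=10, center_hz=500.0,
                     bandwidth_hz=80.0, sample_rate_hz=8000.0):
    """Return (frequency, source amplitude, shaped amplitude) for each harmonic."""
    rows = []
    for k in range(1, n_harmonics + 1):
        freq = k * f0_hz
        source = 1.0 / k  # assumed rolloff: harmonics of diminishing amplitude
        gain = resonator_gain(freq, center_hz, bandwidth_hz, sample_rate_hz)
        rows.append((freq, source, source * gain))
    return rows

if __name__ == "__main__":
    for freq, source, shaped in shaped_harmonics():
        print(f"{freq:6.0f} Hz  source {source:.3f}  after resonance {shaped:.3f}")
```

Although the source harmonics fall off steadily, the harmonic nearest the assumed resonance ends up the strongest in the output, producing the kind of spectral peak the text calls a formant.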

Top: Spectrogram of a male voice saying "Rice University" with red indicating highest energy portions. The time domain signal is shown beneath, with phonetics. There is significant energy in the consonant sounds above 2000 Hz, which is the cutoff for most telephone systems. Middle: A frequency-time spectrogram showing whistler waves with time on the x axis and frequency on the y axis. The color represents the intensity or volume of the waves, with red indicating the largest intensity signals (in this case, loudest sounds). The whistler waves are clearly seen as the curved red lines, which descend in frequency with time. Bottom: Spectrogram of the dawn chorus during a geomagnetic storm. The frequency range of these waves is slightly higher than that of the whistlers, but still well within the audible range.

The majority of the voice power is carried by the vowel sounds, which are the highest amplitude and longest duration of all voice signals. Vowel sounds and their transitions are created by the formants. Usually the first two to four formants are all that are needed to understand the vowel sounds. Consonant sounds are impulsive and far lower in power than the vowels and occur over a smaller, yet higher, frequency range. Most consonants are created by sharp movements in the vocal tract, mouth, and face (usually the tongue, lips, and jaw). Some consonants, however, do not require the speech signal, but are the result of air movements only, for example the "f" as in "fair." These "unvoiced consonants," as they are called, prove very difficult to transmit over communication devices and are often only unambiguously determined using signal analysis techniques.

Psychoacoustics

Looking at the overall characteristics of the human voice and how it is perceived by the listener is the main concern of psychoacoustics. As the typical voice contains significant energy from about 100 Hz to over 5 kHz, psychoacoustics covers various vocal sounds over a large spectral range. As previously mentioned, vowels have more focused energy than consonants, especially at lower frequencies. Surprisingly, the intelligibility of speech is imparted primarily by the consonants, which are nearly 30 decibels lower in amplitude than the vowels. Unvoiced consonants have considerably less energy than any other speech sound and are usually found at frequencies higher than 5 kHz.
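To put the 30 decibel figure in linear terms, the short sketch below converts a level difference in decibels to amplitude and power ratios using the standard definitions (20 log10 for amplitude, 10 log10 for power). The function names are illustrative.

```python
def db_to_amplitude_ratio(level_db):
    """Linear amplitude ratio for a level difference in decibels."""
    return 10.0 ** (level_db / 20.0)

def db_to_power_ratio(level_db):
    """Linear power ratio for a level difference in decibels."""
    return 10.0 ** (level_db / 10.0)

if __name__ == "__main__":
    gap_db = 30.0  # vowel-to-consonant gap quoted in the text
    print("amplitude ratio:", round(db_to_amplitude_ratio(gap_db), 1))
    print("power ratio:", round(db_to_power_ratio(gap_db), 1))
```

A 30 decibel gap corresponds to an amplitude ratio of roughly 32 to 1, or a power ratio of 1000 to 1, which gives a sense of how easily modest noise can bury the consonants.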

Since the spectral shape of the voice and formants are determinable using Fourier analysis (decomposing a musical instrument sound or any other periodic function into its constituent sine or cosine waves), it is possible to isolate the characteristics of vowels and consonants. The same techniques of signal analysis used on space physics wave phenomena commonly analyzed at Aerospace can easily be applied to speech analysis. By tailoring these tools to filter signals in the frequency range required by psychoacoustics, it is possible to filter out noise between the formants and to identify and then amplify the frequencies corresponding to the consonants. Because the resonant frequencies of the lowest two formants are relatively stable regardless of the pitch or inflection of speech, it is possible to isolate and enhance vowel sounds separately from the consonants. This is vital to increasing the intelligibility of voice recordings because consonants require radically different amplification and filtering treatment than do vowels. Capitalizing on the large gap between the second and third formants allows for effective noise reduction using specifically designed bandpass filters, thus further improving intelligibility via signal processing.
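A minimal sketch of this idea follows. It uses a naive discrete Fourier transform to keep only narrow bands around three assumed formant frequencies and to discard an interfering tone placed in the gap between the second and third. The formant frequencies (500, 1500, and 2500 Hz), the 2000 Hz interference, the band edges, and all function names are hypothetical choices for illustration; practical tools would use windowed, overlapping frames and smoother filters.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform (O(n^2)), adequate for short frames."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def inverse_dft(spectrum):
    """Inverse transform; returns the real part, since the input was real."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def keep_bands(samples, sample_rate_hz, bands_hz):
    """Zero every frequency bin that falls outside the listed (low, high) bands."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate_hz / n
        if not any(low <= freq <= high for low, high in bands_hz):
            spectrum[k] = 0j
    return inverse_dft(spectrum)

def tone_amplitude(samples, sample_rate_hz, freq_hz):
    """Amplitude of one sinusoidal component, found by projecting onto it."""
    n = len(samples)
    acc = sum(
        samples[t] * cmath.exp(-2j * math.pi * freq_hz * t / sample_rate_hz)
        for t in range(n)
    )
    return 2.0 * abs(acc) / n

def make_test_signal(sample_rate_hz=8000.0, n=256):
    """Three hypothetical formant tones plus an interfering tone at 2000 Hz."""
    parts = [(500.0, 1.0), (1500.0, 0.5), (2500.0, 0.3), (2000.0, 0.8)]
    return [
        sum(a * math.sin(2.0 * math.pi * f * t / sample_rate_hz) for f, a in parts)
        for t in range(n)
    ]

# Assumed passbands, one around each hypothetical formant.
FORMANT_BANDS_HZ = [(400.0, 600.0), (1400.0, 1600.0), (2400.0, 2600.0)]

if __name__ == "__main__":
    rate = 8000.0
    noisy = make_test_signal(rate)
    cleaned = keep_bands(noisy, rate, FORMANT_BANDS_HZ)
    for f in (500.0, 1500.0, 2000.0, 2500.0):
        before = tone_amplitude(noisy, rate, f)
        after = tone_amplitude(cleaned, rate, f)
        print(f"{f:6.0f} Hz  before {before:.3f}  after {after:.3f}")
```

The frame length and sample rate are chosen so that every tone falls exactly on a frequency bin, which keeps the demonstration clean; real speech would smear across neighboring bins.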

Real-World Applications

Although the theory of psychoacoustics is most often practiced in the laboratory, application of audio signal processing techniques in the real world means dealing with these issues under less-than-ideal circumstances. Factors that affect speech intelligibility and the ability of signal processing to enhance the intelligibility of voice recordings fall under the general categories of physiological stress, transmission systems, masking, and reverberation.

Voices differ from person to person. Frequency and spectral characteristics unique to a particular voice are most notably determined by the gender and age of a person. For example, a child or a woman will have a higher fundamental frequency and higher frequency formants than an adult male. But the exact frequency of formants can also change within a single voice from day to day. Health issues and stress levels can cause significant changes in the frequency spectrum of a voice, especially in the higher order formants. Language, accent, and dialect also cause changes in higher order formants and voiced consonants. Thus, signal processing tools must adapt to these changes similarly to voice recognition systems.

Transmission systems are another factor that affects how audio processing techniques can be applied to improve the intelligibility of voice recordings. In law enforcement in particular, many audio forensics recordings are made over band-limited transmission systems such as telephones, recorders that use lossy compression (which permanently sacrifices some data to make the overall file size smaller), microphones with poor frequency response, and the like. The result is the partial, and oftentimes complete, removal of frequencies above 2000 Hz and below 400 Hz. Systems that severely limit the high-frequency energy of speech (often called the "intelligibility band") cause serious problems in speech recognition devices.

An example of this kind of limitation can be demonstrated using the common telephone. A simple exercise is to say the word "six" over the phone. If the word is said out of context, it will be impossible to tell whether the speaker is saying "six" or "fix" due to the lack of spectral information above 2000 Hz. In this case, the unvoiced consonant "s" has energy well over 3000 Hz, which gets entirely lost in transmission. This example illustrates how difficult reconstruction of the high-frequency spectral range can be over certain transmission systems, and the audio analyst must be careful so that high-frequency artifacts are not introduced during the reconstruction.
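The loss described above can be imitated with a simple low-pass filter. The sketch below is a rough stand-in, not a model of any real telephone channel: it represents the vowel as a 250 Hz tone and the "s" as a weak 4000 Hz tone (a real "s" is noise-like), and it applies a Hamming-windowed sinc FIR filter with the 2000 Hz cutoff quoted in the text. The sample rate, tap count, tone levels, and function names are all assumptions.

```python
import cmath
import math

def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR taps, normalized to unit gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(samples, taps):
    """Direct convolution; the output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, tap in enumerate(taps):
            if i - j >= 0:
                acc += tap * samples[i - j]
        out.append(acc)
    return out

def tone_amplitude(samples, sample_rate_hz, freq_hz):
    """Amplitude of one sinusoidal component, found by projecting onto it."""
    n = len(samples)
    acc = sum(
        samples[t] * cmath.exp(-2j * math.pi * freq_hz * t / sample_rate_hz)
        for t in range(n)
    )
    return 2.0 * abs(acc) / n

def simulate_six(sample_rate_hz=16000.0, n=2000):
    """Crude stand-in for "six": a 250 Hz vowel tone plus a weak 4000 Hz hiss tone."""
    return [
        math.sin(2.0 * math.pi * 250.0 * t / sample_rate_hz)
        + 0.2 * math.sin(2.0 * math.pi * 4000.0 * t / sample_rate_hz)
        for t in range(n)
    ]

if __name__ == "__main__":
    rate = 16000.0
    spoken = simulate_six(rate)
    received = fir_filter(spoken, lowpass_taps(2000.0, rate))
    # Skip the start-up transient and measure over whole periods of both tones.
    for label, sig in (("spoken", spoken[400:2000]), ("received", received[400:2000])):
        low = tone_amplitude(sig, rate, 250.0)
        high = tone_amplitude(sig, rate, 4000.0)
        print(f"{label:9s} 250 Hz: {low:.3f}   4000 Hz: {high:.4f}")
```

In this sketch the low-frequency tone passes nearly unchanged while the high-frequency component is reduced to a small fraction of its original level, so a listener at the far end would have little left to distinguish one unvoiced consonant from another.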

Another factor affecting voice intelligibility is a phenomenon called "masking." This effect is caused by the superimposition of noise in a small frequency range over an otherwise intelligible voice signal. Examples of such band-limited noise are electromagnetic alternating current interference, tape hiss, analog or digital clipping and other distortion, bias offsets, cell phone multiplexing interference, and loud ambient environments. How much band-limited noise affects intelligibility depends on the signal-to-noise ratio (s/n) and the frequency of the noise. Noise at a frequency similar to the voice fundamental masks the voice only at small s/n levels. However, noise at higher frequencies masks the subject voice at much higher s/n. For cases in which the noise frequency significantly overlaps the high-frequency voice band, successfully filtering out the noise between the formants and the consonant bands is critical to enhancing the intelligibility of the speech.
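The signal-to-noise ratio referred to above is conventionally expressed in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below computes it for two synthetic tones; the frequencies, amplitudes, and function names are illustrative assumptions, and it presumes the voice and the noise are available as separate sample sequences, which is rarely the case in casework.

```python
import math

def mean_power(samples):
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from separate signal and noise sequences."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))

def tone(freq_hz, amplitude, sample_rate_hz=8000.0, n=800):
    """Synthetic sine tone; n is chosen to span whole periods of the tones used here."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * t / sample_rate_hz)
        for t in range(n)
    ]

if __name__ == "__main__":
    voice = tone(200.0, 1.0)   # stand-in for a voice fundamental
    hum = tone(1000.0, 0.1)    # stand-in for band-limited interference
    print(f"s/n = {snr_db(voice, hum):.1f} dB")
```

With a noise amplitude one tenth that of the signal, the power ratio is 100 to 1, or 20 decibels.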

The final real-world issue affecting the intelligibility of voice recordings falls under the category of "reverberation." This phenomenon is the persistence of sound in a particular space after the original sound is removed. When sound is produced in an enclosed space, a large number of reflected echoes from the various surfaces within the space build up and then slowly decay as the sound is damped by objects (and air) within the room. The amount of reverberation that occurs in any particular situation is dependent upon the size of the room, the kind of surfaces encountered, and the number of sound sources present. Reverberation is a special case of masking in that it is band-limited noise, but the noise in this case stems from reflections of the voice itself. These echoes mimic or clone the source frequency spectrum, often with greater low-frequency energy, making them particularly difficult to remove without removing the very frequency information that constitutes the source signal. The masking effect of these "distractor" voices on intelligibility is dependent upon the number of voices and the s/n. Intelligibility is adversely affected by multiple distractor voices even at very large s/n ratios. Sufficiently long echoes from hard walls or enclosed spaces act as multiple cloned distractor voices, which are the largest challenge for audio forensics analysts to remove. To make things even more difficult, many law-enforcement audio recordings contain not only masking and reverberation, but additional distractor voices coming from radios and televisions.
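The persistence of sound after the source stops can be illustrated with a toy echo model. The sketch below mixes a few delayed, attenuated copies of a short tone burst back into the recording. The delays and gains are hypothetical, and the model is deliberately simplified: real rooms produce dense, overlapping reflections, often with the extra low-frequency energy noted above, which this example does not attempt to reproduce.

```python
import math

def add_echoes(dry, echoes, tail_len):
    """Mix delayed, attenuated copies of `dry` into a longer output buffer.

    `echoes` is a list of (delay_in_samples, gain) pairs.
    """
    wet = list(dry) + [0.0] * tail_len
    for delay, gain in echoes:
        for i, sample in enumerate(dry):
            if i + delay < len(wet):
                wet[i + delay] += gain * sample
    return wet

def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)

def make_burst(freq_hz=500.0, sample_rate_hz=8000.0, n=160):
    """Short tone burst standing in for a single speech sound."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate_hz) for t in range(n)]

# Hypothetical reflections: each arrives later and weaker than the last.
HYPOTHETICAL_ECHOES = [(400, 0.6), (800, 0.36), (1200, 0.216)]

if __name__ == "__main__":
    dry = make_burst()
    wet = add_echoes(dry, HYPOTHETICAL_ECHOES, tail_len=1400)
    ratio = energy(wet[len(dry):]) / energy(dry)
    print(f"energy arriving after the source stops: {ratio:.2f} x source energy")
```

Because every echo is a scaled copy of the source, it occupies exactly the same frequencies as the voice itself, which is why simple filtering cannot separate the two.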

Conclusion

Tailoring signal processing techniques usually reserved for space applications has proved valuable in the field of audio forensics. The National Law Enforcement and Corrections Technology Center and Aerospace have furthered techniques used in audio forensics casework and have designed a three-day course in signal processing techniques specifically to train law enforcement agency personnel in their use. To date, the center has delivered the course to several hundred law enforcement personnel and continues to receive requests for additional classes and an expanded curriculum. As with most technical analysis, the techniques taught within the course are heavily tailored to the unique needs of the audio forensics analyst, with the primary focus being to improve the intelligibility of voices recorded using a variety of methods in a variety of different environments.

Numerous considerations for tailoring signal processing techniques for audio applications must be taken into account, including the characteristics of the voice and how it is generated by the vocal tract, the psychoacoustics of how the sound is perceived by the listener, and finally the environment in which the recording is made. These real-world issues provide continual challenges, which the audio forensics analyst must be equipped to face. The National Law Enforcement and Corrections Technology Center and Aerospace have met those challenges head on and proved that space science applications and forensics can be used together to solve violent crimes, explain mysterious occurrences, and ultimately improve the ability of law enforcement agencies to establish justice.


To Winter 2008/2009 Table of Contents




This page was last modified on 02/17/09