Discrete Wavelet Transform for Audio Compression

BACKGROUND INFORMATION

Wavelets are a family of basis functions for the space of square-integrable functions. In Fourier analysis, the complex exponentials serve as a basis for this space, so the Fourier transform represents a signal as a sum of sines and cosines of different frequencies. Similarly, the wavelet transform represents a signal in terms of a wavelet basis, whose elements are formed by scaling and shifting a single function called the wavelet function. The wavelet transform provides a combination of time and frequency localization, whereas the Fourier representation provides only frequency analysis. This combination of time and frequency localization is important for coding non-stationary signals such as audio. The discrete wavelet transform (DWT) yields a compact representation of signals without a priori knowledge of the signal statistics.
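To make the time-localization point concrete, here is a minimal sketch of one level of a DWT. It assumes the Haar wavelet, chosen only because it is the simplest case; the text does not say which wavelet the coder uses, and the function names are illustrative.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT. len(x) must be even."""
    if len(x) % 2:
        raise ValueError("input length must be even")
    s = 1.0 / math.sqrt(2.0)
    # Scaled sums give the coarse (low-pass) band, scaled differences the detail band.
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original samples."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) * s)
        x.append((a - d) * s)
    return x

# A single click at sample 5 touches only one coefficient in each band.
signal = [0.0] * 8
signal[5] = 1.0
approx, detail = haar_dwt(signal)
nonzero_detail = [i for i, d in enumerate(detail) if abs(d) > 1e-12]
reconstructed = haar_idwt(approx, detail)
```

The click shows up in exactly one detail coefficient, so its position in time is preserved. A Fourier transform of the same click would spread energy over every frequency bin.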

CODER OVERVIEW

Most current audio coders fall into two groups: adaptive transform coders (ATC) and sub-band coders. Those coders use a fixed basis to represent the signal, whereas this technique uses a combination of sinusoidal and wavelet bases. The technique involves 3 steps. First, the signal is analyzed for harmonic tonal components with strong periodic structure; the clarinet is one example of a signal with such structure. This portion is represented by sinusoidal basis functions. The coder uses Thomson's Harmonic Analysis, which provides a test of whether a harmonic is statistically significant.

[Figure: (a) original signal, (b) reconstructed tonal part, (c) residual signal]

After the tones are removed, the remaining signal, the residual, consists of transient and noise-like features. The wavelet transform is applied to the residual together with edge prediction and noise modeling. For instance, a castanet signal contains multiple transients, and these are represented accurately. The transients are preserved through proper wavelet coding, while the noise-like features are replaced by synthetic noise.

CODER DETAILS

Wavelet transforms are useful for tracking transients. However, they do not provide a compact representation for tonal signals, so wavelet-based coders sometimes perform poorly compared to existing coders that use Fourier transforms. To avoid poor encoding of tones, this coder combines harmonic and wavelet analysis. Sinusoids are well suited to representing the tones in the signal, whereas the wavelet transform is more useful for transients. The coder selects the sinusoids based on the input signal and then takes a wavelet transform of the residual. In addition, this decomposition allows the use of more precise frequency-domain masking models that work in conjunction with temporal-domain masking models. Experiments have shown that this method achieves a bit rate of 1 bit/sample while maintaining near-transparent audio quality.

First, the harmonic analysis/synthesis is performed. The output of this stage is the tonal part of the input. The tonal part, t[n], is then subtracted from the signal s[n]. The residual, s[n] - t[n], consists of transients and noise.
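The subtraction step can be sketched as follows. This is not Thomson's method: as a simplifying assumption, the tone frequency is taken as known and its amplitude and phase are fitted by least squares (a plain projection onto a cosine and a sine). The name split_tone and the test signal are hypothetical.

```python
import math

def split_tone(s, cycles):
    """Split s into (tonal, residual) for one sinusoid that completes
    `cycles` whole periods across len(s) samples (0 < cycles < len(s)/2)."""
    n = len(s)
    w = 2.0 * math.pi * cycles / n
    # With a whole number of periods, cos and sin are orthogonal over the
    # block, so the least-squares fit reduces to two inner products.
    a = 2.0 / n * sum(v * math.cos(w * i) for i, v in enumerate(s))
    b = 2.0 / n * sum(v * math.sin(w * i) for i, v in enumerate(s))
    tonal = [a * math.cos(w * i) + b * math.sin(w * i) for i in range(n)]
    residual = [v - t for v, t in zip(s, tonal)]
    return tonal, residual

# Example: a steady tone plus one click (a transient) at sample 100.
N = 256
s = [0.8 * math.cos(2.0 * math.pi * 8 * i / N + 0.3) for i in range(N)]
s[100] += 1.0
t, r = split_tone(s, cycles=8)
```

After the split, r is close to zero everywhere except at the click, which is the kind of residual the wavelet stage is meant to handle.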

The coder performs the harmonic analysis using Thomson's Harmonic Analysis, a technique that applies a set of orthogonal data windows to obtain a weighted set of estimates of the frequency spectrum. Using point regression on these estimates, we can then find an estimate of the harmonic mean at a given frequency. A "goodness of fit" test on the regression then indicates whether the harmonic is statistically significant. The tonal part is reconstructed by summing the selected sinusoids.

The portion of the signal without the tones is the residual. This portion consists of transients and noise, and is represented using a wavelet transform. A wavelet-packet tree structure is used because its frequency bands correspond to the critical-band structure of the human auditory system. At higher frequencies, the residual is broken down into edge information and background "noise" information. Most of the important edge information is contained in the lower frequency bands, while the higher frequency bands contain a large number of extraneous edges. An edge mask is used to remove edges that are too close together to be distinguished by the human ear: edges less than 3 ms apart cannot be separated. For postmasking, an edge that falls within 100 ms of the masking edge will not be detected and does not need to be preserved. After the transients are removed from the higher frequency bands of the wavelet transform, the bands are fairly uncorrelated, and the histogram of the coefficients in each band appears to be Gaussian. This portion can therefore be modeled by noise.
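One possible reading of the edge mask is sketched below. The text gives only the two time constants (3 ms and 100 ms), so the rest is assumed: edges are (time, magnitude) pairs, the masking edge is the most recent edge that was kept, an edge is postmasked only if it is no stronger than that masker, and of two edges closer than 3 ms the stronger one survives. The name apply_edge_mask is hypothetical.

```python
def apply_edge_mask(edges, min_gap_ms=3.0, postmask_ms=100.0):
    """edges: iterable of (time_ms, magnitude). Returns the edges to keep."""
    kept = []
    for t, mag in sorted(edges):
        if kept:
            last_t, last_mag = kept[-1]
            gap = t - last_t
            if gap < min_gap_ms:
                # Too close to be heard separately: keep only the stronger one.
                if mag > last_mag:
                    kept[-1] = (t, mag)
                continue
            if gap <= postmask_ms and mag <= last_mag:
                # Falls inside the postmasking window of a stronger edge.
                continue
        kept.append((t, mag))
    return kept

edges = [(0.0, 1.0), (2.0, 0.5), (50.0, 0.4), (120.0, 0.3), (400.0, 0.2)]
kept_edges = apply_edge_mask(edges)
```

In this example the edges at 2 ms and 50 ms are dropped (too close, and postmasked, respectively), while the edges at 0, 120, and 400 ms are kept.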

For impulsive signals such as castanets, there is an exponential decay in the signal. Therefore, the noise must be weighted by a decay factor. Signals with sharp attacks also show an exponential decay from the maximum level.
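A minimal sketch of decay-weighted synthetic noise follows. It assumes Gaussian noise described by a mean and standard deviation, and an envelope of the form exp(-decay * n) with the decay constant expressed per sample. These parameter conventions and the name decaying_noise are illustrative, not taken from the coder.

```python
import math
import random

def decaying_noise(n, mean, std, decay, seed=0):
    """Return n Gaussian samples weighted by the envelope exp(-decay * i)."""
    rng = random.Random(seed)  # seeded so encoder and decoder can agree
    return [math.exp(-decay * i) * rng.gauss(mean, std) for i in range(n)]

# 512 samples whose envelope falls to about exp(-5) by the end of the segment.
noise = decaying_noise(512, mean=0.0, std=1.0, decay=0.01)
head_energy = sum(abs(v) for v in noise[:128])
tail_energy = sum(abs(v) for v in noise[-128:])
```

With decay set to 0 the envelope is flat and the function reduces to stationary noise, which would suit bands without a sharp attack.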

The experiments that have been performed suggest that about 80-90% of the coefficients of the wavelet transform can be replaced by a noise model, depending on the type of signal. Tonal signals such as the violin can have up to 98% of the coefficients replaced, and castanets can have about 85% of the coefficients replaced.

 

QUANTIZATION

The harmonic analysis for the tonal signals generates the frequencies, amplitudes, and phases of the harmonics present in the sinusoidal portion of the signal. The frequencies are quantized using 7 or 8 bits through just noticeable differences in frequency (JNDF). The amplitudes are quantized by exploiting the masking properties of the human auditory system. Harmonics that fall below the masking threshold are removed. Because the ear is not completely insensitive to phase, 6 bits are used to encode the phase from -pi to pi.
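The 6-bit phase code can be illustrated with a uniform quantizer over [-pi, pi). Uniform spacing is an assumption here, since the text gives only the bit count and the range, and the function names are hypothetical.

```python
import math

def quantize_phase(phi, bits=6):
    """Map a phase in [-pi, pi] to an integer code in 0 .. 2**bits - 1."""
    levels = 1 << bits
    step = 2.0 * math.pi / levels
    # The modulo wraps +pi onto the code for -pi, which is the same angle.
    return int(round((phi + math.pi) / step)) % levels

def dequantize_phase(code, bits=6):
    """Return the reconstructed phase for a code produced by quantize_phase."""
    step = 2.0 * math.pi / (1 << bits)
    return -math.pi + code * step

code = quantize_phase(1.0)
phi_hat = dequantize_phase(code)
```

With 6 bits the step is 2*pi/64, about 0.098 rad, so the reconstruction error is at most half of that, roughly 0.049 rad.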

Wavelet coefficients in the lower frequency bands are encoded using 2 schemes: 8-bit log PCM and a 3-level Max-Lloyd quantizer. The 3 levels of the Max-Lloyd quantizer and the zeros make four symbols, which are represented using 2 bits. Eight-bit log PCM is used to encode the variances in the frequency bands, and the differences between these variances and the average in each group are quantized using the 3-level Max-Lloyd quantizer.
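For readers unfamiliar with this quantizer, the sketch below designs a 3-level quantizer with the standard Lloyd iteration on a set of training samples: thresholds sit midway between neighboring levels, and each level moves to the mean of the samples in its cell. The training data, the evenly spaced starting levels, and the function names are assumptions for illustration. The coder's actual training procedure is not described in the text.

```python
def lloyd_max(samples, levels=3, iterations=50):
    """Design reconstruction levels (levels >= 2) by Lloyd's iteration."""
    lo, hi = min(samples), max(samples)
    # Start from evenly spaced levels across the data range.
    reps = [lo + (hi - lo) * k / (levels - 1) for k in range(levels)]
    for _ in range(iterations):
        thresholds = [(reps[k] + reps[k + 1]) / 2.0 for k in range(levels - 1)]
        cells = [[] for _ in range(levels)]
        for v in samples:
            # The cell index is the number of thresholds lying below the sample.
            cells[sum(v > t for t in thresholds)].append(v)
        # Move each level to its cell mean; an empty cell keeps its old level.
        new_reps = [sum(c) / len(c) if c else r for c, r in zip(cells, reps)]
        if new_reps == reps:
            break
        reps = new_reps
    return reps

def quantize(v, reps):
    """Return the index of the reconstruction level nearest to v."""
    return min(range(len(reps)), key=lambda k: abs(v - reps[k]))

training = [-2.1, -2.0, -1.9, -0.1, 0.0, 0.1, 1.9, 2.0, 2.1]
reps = lloyd_max(training)
```

For this symmetric training set the levels settle near -2, 0, and 2, so a sample of 0.9 maps to the middle level and 1.2 to the top one.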

At the higher frequencies, we need to encode the edge mask, the noise variances and means, the exponential decay factor for each noise segment, and the edge information. Eight bits can be allocated to the mean, variance, and decay constant. A noise-masking model is generated to ensure that the quantization noise falls below the masking threshold. For blocks of length 2048, the noise information can be coded at a rate of 0.05 bits/sample.

[Alex Chen]   [Nader Shehad]   [Aamir Virani]   [Erik Welsh]