Speex is based on CELP, which stands for Code Excited Linear Prediction. This section attempts to introduce the principles behind CELP, so if you are already familiar with CELP, you can safely skip to section 7. The CELP technique is based on three ideas:

The use of a linear prediction (LP) model to model the vocal tract
The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model
The search performed in closed-loop in a “perceptually weighted domain”

This section describes the basic ideas behind CELP. Note that it’s still incomplete.


Linear Prediction (LPC)

Linear prediction is at the base of many speech coding techniques, including CELP. The idea behind it is to predict the signal $x[n]$ using a linear combination of its past samples:

\begin{displaymath}
y[n]=\sum_{i=1}^{N}a_{i}x[n-i]
\end{displaymath}

where $y[n]$ is the linear prediction of $x[n]$. The prediction error is thus given by:

\begin{displaymath}
e[n]=x[n]-y[n]=x[n]-\sum_{i=1}^{N}a_{i}x[n-i]
\end{displaymath}
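
As a concrete illustration of the two formulas above, here is a minimal Python sketch. The function name `lp_residual` and the convention that samples before the start of the buffer are zero are assumptions made for this example, not something taken from Speex.

```python
def lp_residual(x, a):
    """Return e[n] = x[n] - sum_{i=1..N} a_i * x[n-i].

    `a` holds a_1..a_N (so a[i-1] is a_i). Samples before the start of `x`
    are assumed to be zero, which is a simplification for this sketch.
    """
    order = len(a)
    residual = []
    for n in range(len(x)):
        # y[n]: linear combination of the (up to) N previous samples
        y_n = sum(a[i - 1] * x[n - i] for i in range(1, order + 1) if n - i >= 0)
        residual.append(x[n] - y_n)
    return residual


# x[n] = 0.9 * x[n-1] for n >= 1, so a_1 = 0.9 predicts it exactly
x = [0.9 ** n for n in range(8)]
e = lp_residual(x, [0.9])
# e[0] equals x[0] (there is nothing to predict from); the rest is numerically zero
```

The residual carries only what the predictor could not explain, which is why it is the quantity a codec goes on to encode.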

The goal of the LPC analysis is to find the best prediction coefficients $a_{i}$ which minimize the quadratic error function:

\begin{displaymath}
E=\sum_{n=0}^{L-1}\left[e[n]\right]^{2}=\sum_{n=0}^{L-1}\left[x[n]-\sum_{i=1}^{N}a_{i}x[n-i]\right]^{2}
\end{displaymath}

That can be done by making all derivatives $\frac{\partial E}{\partial a_{i}}$ equal to zero:

\begin{displaymath}
\frac{\partial E}{\partial a_{i}}=\frac{\partial}{\partial a_{i}}\sum_{n=0}^{L-1}\left[x[n]-\sum_{i=1}^{N}a_{i}x[n-i]\right]^{2}=0
\end{displaymath}

The $a_{i}$ filter coefficients are computed using the Levinson-Durbin algorithm, which starts from the auto-correlation $R(m)$ of the signal $x[n]$.

\begin{displaymath}
R(m)=\sum_{i=0}^{N-1}x[i]x[i-m]
\end{displaymath}

For an order $N$ filter, we have:

\begin{displaymath}
\mathbf{R}=\left[\begin{array}{cccc}
R(0) & R(1) & \cdots & R(N-1)\\
R(1) & R(0) & \cdots & R(N-2)\\
\vdots & \vdots & \ddots & \vdots\\
R(N-1) & R(N-2) & \cdots & R(0)\end{array}\right]
\end{displaymath}

\begin{displaymath}
\mathbf{r}=\left[\begin{array}{c}
R(1)\\
R(2)\\
\vdots\\
R(N)\end{array}\right]
\end{displaymath}
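
The step from the zero-derivative condition to this matrix and vector can be written out. The following is a short sketch, using $k$ for the coefficient being differentiated (an index introduced here to keep it apart from the summation index $i$) and assuming the frame is windowed so that sums of lagged products depend only on the lag difference:

\begin{displaymath}
\frac{\partial E}{\partial a_{k}}=-2\sum_{n=0}^{L-1}x[n-k]\left[x[n]-\sum_{i=1}^{N}a_{i}x[n-i]\right]=0
\quad\Rightarrow\quad
\sum_{i=1}^{N}a_{i}\sum_{n}x[n-i]x[n-k]=\sum_{n}x[n]x[n-k]
\end{displaymath}

\begin{displaymath}
\sum_{i=1}^{N}a_{i}R(|i-k|)=R(k),\qquad k=1,\ldots,N
\end{displaymath}

Stacking these $N$ equations gives one linear system whose matrix has $R(|i-k|)$ in row $k$, column $i$, and whose right-hand side is the vector of $R(1)$ to $R(N)$, with the unknowns $a_{1},\ldots,a_{N}$ collected in a vector $\mathbf{a}$.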

The filter coefficients $a_{i}$ are found by solving the system $\mathbf{R}\mathbf{a}=\mathbf{r}$. The Levinson-Durbin algorithm brings the cost of solving it down to $\mathcal{O}\left(N^{2}\right)$ instead of $\mathcal{O}\left(N^{3}\right)$ by exploiting the fact that the matrix $\mathbf{R}$ is Toeplitz Hermitian. Also, it can be proven that all the roots of $A(z)$ are within the unit circle, which means that $1/A(z)$ is always stable. That holds in theory; in practice, because of finite precision, two techniques are commonly used to make sure the filter is stable. First, we multiply $R(0)$ by a number slightly above one (such as 1.0001), which is equivalent to adding noise to the signal. Second, we can apply a window to the auto-correlation, which is equivalent to filtering in the frequency domain, reducing sharp resonances.
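
The recursion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Speex implementation: the function names, the Gaussian shape of the lag window and its width are all hypothetical choices, and the auto-correlation is summed over one whole frame.

```python
import math


def autocorrelation(x, order):
    """R(m) = sum_n x[n] * x[n-m] for m = 0..order, over one frame `x`."""
    return [sum(x[n] * x[n - m] for n in range(m, len(x))) for m in range(order + 1)]


def condition(r, noise_factor=1.0001, lag_window_width=0.01):
    """Apply the two stabilising tweaks: a lag window and a slightly larger R(0).

    The Gaussian window shape and both parameter values are illustrative
    assumptions, not values taken from Speex.
    """
    out = [r_m * math.exp(-0.5 * (2.0 * math.pi * lag_window_width * m) ** 2)
           for m, r_m in enumerate(r)]
    out[0] *= noise_factor  # inflating R(0) acts like a small white-noise floor
    return out


def levinson_durbin(r, order):
    """Solve the Toeplitz system for a_1..a_N in O(N^2).

    Returns (a, err) where a[i-1] is a_i and err is the final prediction error.
    """
    a = [0.0] * order
    err = r[0]
    if err <= 0.0:
        return a, 0.0  # silent frame: nothing to predict
    for m in range(order):
        # reflection coefficient for the step from order m to order m+1
        k = (r[m + 1] - sum(a[i] * r[m - i] for i in range(m))) / err
        prev = a[:]
        a[m] = k
        for i in range(m):
            a[i] = prev[i] - k * prev[m - 1 - i]
        err *= 1.0 - k * k
    return a, err


# Auto-correlation of an ideal first-order process with a_1 = 0.5
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
# a is [0.5, 0.0] and err is 0.75
```

Each pass of the outer loop reuses the solution of the previous order, which is where the saving over a general linear solver comes from.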

The linear prediction model represents each speech sample as a linear combination of past samples, plus an error signal called the excitation (or residual).

\begin{displaymath}
x[n]=\sum_{i=1}^{N}a_{i}x[n-i]+e[n]
\end{displaymath}

In the z-domain, this can be expressed as

\begin{displaymath}
x(z)=\frac{1}{A(z)}\: e(z)
\end{displaymath}

where $A(z)$ is defined as

\begin{displaymath}
A(z)=1-\sum_{i=1}^{N}a_{i}z^{-i}
\end{displaymath}

We usually refer to $A(z)$ as the analysis filter and $1/A(z)$ as the synthesis filter. The whole process is called short-term prediction as it predicts the signal $x[n]$ using only the $N$ past samples, where $N$ is usually around 10.
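
To make the analysis/synthesis pairing concrete, the sketch below runs a short signal through the analysis filter and then through the synthesis filter and recovers the input. The function names, the second-order coefficients and the zero initial state are assumptions chosen for the example.

```python
def analysis(x, a):
    """Apply A(z) = 1 - sum a_i z^-i: signal in, excitation out."""
    return [x[n] - sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n >= i)
            for n in range(len(x))]


def synthesis(e, a):
    """Apply 1/A(z): excitation in, signal out.

    The filter is recursive, so the output is built one sample at a time.
    """
    x = []
    for n in range(len(e)):
        x.append(e[n] + sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n >= i))
    return x


a = [1.2, -0.5]  # illustrative predictor; roots of A(z) have magnitude below one
signal = [1.0, 0.4, -0.7, 0.2, 0.9, -0.1]
rebuilt = synthesis(analysis(signal, a), a)
# rebuilt matches signal up to rounding error
```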

Because LPC coefficients have very little robustness to quantization, they are converted to Line Spectral Pair (LSP) coefficients, which behave much better under quantization. In particular, with LSPs it is easy to keep the filter stable.


Pitch Prediction

During voiced segments, the speech signal is approximately periodic, so it is possible to take advantage of that property by approximating the excitation signal $e[n]$ by a gain times the past of the excitation:

\begin{displaymath}
e[n]\simeq p[n]=\beta e[n-T]
\end{displaymath}

where $T$ is the pitch period and $\beta$ is the pitch gain. We call that long-term prediction since the excitation is predicted from $e[n-T]$ with $T\gg N$.
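
One simple way to estimate $T$ and $\beta$ is a least-squares fit directly on the excitation. The sketch below is an open-loop illustration only: the function name `find_pitch`, the lag range and the choice to ignore negative correlations are assumptions, and a CELP encoder would refine this search in the weighted domain.

```python
def find_pitch(e, start, length, min_lag, max_lag):
    """Least-squares fit of e[n] ~ beta * e[n-T] for n in [start, start+length).

    Returns (T, beta). Requires start >= max_lag so that e[n-T] always exists.
    """
    best_lag, best_gain, best_score = min_lag, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(e[n] * e[n - lag] for n in range(start, start + length))
        energy = sum(e[n - lag] ** 2 for n in range(start, start + length))
        if energy == 0.0 or corr <= 0.0:
            continue  # nothing to predict from, or negative correlation
        score = corr * corr / energy  # error reduction achieved by this lag
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain


# Decaying pulse train: one pulse every 40 samples, each 0.8 times the previous
e = [0.0] * 160
for k in range(4):
    e[40 * k] = 0.8 ** k
T, beta = find_pitch(e, start=120, length=40, min_lag=20, max_lag=70)
# T is 40 and beta is 0.8 (up to rounding)
```

For a fixed lag, the gain that minimizes the squared error is the correlation divided by the energy of the delayed signal, so the search only has to loop over candidate lags.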

Innovation Codebook

The final excitation $e[n]$ will be the sum of the pitch prediction and an innovation signal $c[n]$ taken from a fixed codebook, hence the name Code Excited Linear Prediction. The final excitation is given by:

\begin{displaymath}
e[n]=p[n]+c[n]=\beta e[n-T]+c[n]
\end{displaymath}

The quantization of $c[n]$ is where most of the bits in a CELP codec are allocated. It represents the information that couldn’t be obtained either from linear prediction or pitch prediction. In the z-domain we can represent the final signal $X(z)$ as

\begin{displaymath}
X(z)=\frac{C(z)}{A(z)\left(1-\beta z^{-T}\right)}
\end{displaymath}
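
Read from the decoder side, this expression is a cascade of two recursive filters driven by the innovation. The sketch below follows it literally; the function name `decode_frame`, the fixed pitch parameters and the zero initial state are simplifying assumptions (a real codec carries filter memory across frames and updates $T$ and $\beta$ regularly).

```python
def decode_frame(c, a, pitch_lag, pitch_gain):
    """Build e[n] = beta * e[n-T] + c[n], then x[n] = e[n] + sum a_i x[n-i].

    All state before n = 0 is taken as zero (a simplification for this sketch).
    """
    e, x = [], []
    for n in range(len(c)):
        # long-term (pitch) predictor: 1 / (1 - beta z^-T)
        past = e[n - pitch_lag] if n >= pitch_lag else 0.0
        e.append(pitch_gain * past + c[n])
        # short-term synthesis filter: 1 / A(z)
        x.append(e[n] + sum(a[i - 1] * x[n - i]
                            for i in range(1, len(a) + 1) if n >= i))
    return e, x


c = [1.0] + [0.0] * 8  # innovation: a single unit pulse
e, x = decode_frame(c, a=[0.9], pitch_lag=4, pitch_gain=0.5)
# e is 1.0 at n = 0, 0.5 at n = 4, 0.25 at n = 8, and zero elsewhere
```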


Analysis-by-Synthesis and Error Weighting

Most (if not all) modern audio codecs attempt to “shape” the noise so that it appears mostly in the frequency regions where the ear cannot detect it. For example, the ear is more tolerant to noise in parts of the spectrum that are louder and vice versa. That’s why instead of minimizing the simple quadratic error

\begin{displaymath}
E=\sum_{n}\left(x[n]-\overline{x}[n]\right)^{2}
\end{displaymath}

where $\overline{x}[n]$ is the encoded signal, we minimize the error for the perceptually weighted signal

\begin{displaymath}
X_{w}(z)=W(z)X(z)
\end{displaymath}

where $W(z)$ is the weighting filter, usually of the form

\begin{equation}
W(z)=\frac{A\left(\frac{z}{\gamma_{1}}\right)}{A\left(\frac{z}{\gamma_{2}}\right)}
\end{equation}

with control parameters $\gamma_{1}>\gamma_{2}$. If the noise is white in the perceptually weighted domain, then in the signal domain its spectral shape will be of the form

\begin{displaymath}
A_{noise}(z)=\frac{1}{W(z)}=\frac{A\left(\frac{z}{\gamma_{2}}\right)}{A\left(\frac{z}{\gamma_{1}}\right)}
\end{displaymath}

If a filter $A(z)$ has (complex) roots at $p_{i}$ in the $z$-plane, the filter $A(z/\gamma)$ will have its roots at $p'_{i}=\gamma p_{i}$, making it a flatter version of $A(z)$.
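
In terms of coefficients, replacing $z$ by $z/\gamma$ simply multiplies each $a_{i}$ by $\gamma^{i}$, which makes the weighting filter cheap to build from the LPC coefficients. The sketch below shows this; the function names and the default values 0.9 and 0.6 for the two control parameters are illustrative assumptions.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): each a_i is replaced by a_i * gamma**i."""
    return [a_i * gamma ** (i + 1) for i, a_i in enumerate(a)]


def weighting_filter(x, a, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = A(z/gamma1) / A(z/gamma2) with zero initial state."""
    num = bandwidth_expand(a, gamma1)  # non-recursive part, acts on past inputs
    den = bandwidth_expand(a, gamma2)  # recursive part, acts on past outputs
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n >= i:
                acc += -num[i - 1] * x[n - i] + den[i - 1] * y[n - i]
        y.append(acc)
    return y


expanded = bandwidth_expand([1.0, 0.5], 0.5)
# expanded is [0.5, 0.125]

impulse_response = weighting_filter([1.0, 0.0, 0.0], a=[0.9])
# first three samples: 1.0, -0.27, -0.1458 (up to rounding)
```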

Analysis-by-synthesis refers to the fact that when trying to find the best pitch parameters ($T$, $\beta$) and innovation signal $c[n]$, we do not work by making the excitation $e[n]$ as close as possible to the original one (which would be simpler), but apply the synthesis (and weighting) filter and try making $W(z)X(z)$ as close to the original as possible.
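
A toy closed-loop search can illustrate the idea. In the sketch below every candidate innovation is passed through the synthesis and weighting filters, and the entry whose filtered version best matches the weighted target is kept. The three-entry codebook, the function names, the parameter values and the zero filter state are all assumptions made for the example, and the pitch contribution is left out to keep it short.

```python
def filt(x, num, den):
    """y[n] = x[n] - sum num_i x[n-i] + sum den_i y[n-i], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        acc -= sum(num[i - 1] * x[n - i] for i in range(1, len(num) + 1) if n >= i)
        acc += sum(den[i - 1] * y[n - i] for i in range(1, len(den) + 1) if n >= i)
        y.append(acc)
    return y


def weighted_synthesis(e, a, gamma1, gamma2):
    """Pass an excitation through 1/A(z), then through W(z)."""
    x = filt(e, [], a)
    num = [a_i * gamma1 ** (i + 1) for i, a_i in enumerate(a)]
    den = [a_i * gamma2 ** (i + 1) for i, a_i in enumerate(a)]
    return filt(x, num, den)


def search_codebook(target_w, codebook, a, gamma1=0.9, gamma2=0.6):
    """Return (index, gain) minimising the squared error in the weighted domain."""
    best = (0, 0.0, -1.0)
    for index, c in enumerate(codebook):
        y = weighted_synthesis(c, a, gamma1, gamma2)
        corr = sum(t * v for t, v in zip(target_w, y))
        energy = sum(v * v for v in y)
        # with the optimal gain corr/energy, the error drops by corr^2/energy
        if energy > 0.0 and corr * corr / energy > best[2]:
            best = (index, corr / energy, corr * corr / energy)
    return best[0], best[1]


a = [0.9]
codebook = [[1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 1, 0, 0, -1, 0]]
# Target: the weighted version of a signal actually produced by entry 2 with gain 2
target_w = weighted_synthesis([2.0 * v for v in codebook[2]], a, 0.9, 0.6)
index, gain = search_codebook(target_w, codebook, a)
# index is 2 and gain is 2.0 (up to rounding)
```

The cost of this approach is that every candidate has to be filtered before it can be compared, which is why practical codecs put so much effort into structuring the codebook.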

References:

1 Encyclopedia summary: https://zh.wikipedia.org/wiki/%E7%A0%81%E6%BF%80%E5%8A%B1%E7%BA%BF%E6%80%A7%E9%A2%84%E6%B5%8B
2 Detailed introduction: http://ntools.net/arc/Documents/speex/manual/node8.html