Speaker Adaptation with MLLR Transformation

Unsupervised speaker adaptation for Sphinx4

For building an improved acoustic model there are two methods. The first is to
collect data from a speaker and train the acoustic model set on it. Because the
model then reflects the speaker's characteristics, recognition becomes more
accurate. The disadvantage of this method is that a large amount of data must
be collected to reach sufficient model accuracy.

The other method, used when only a small amount of data is available from a new
speaker, is to collect that data and apply an adaptation technique that adapts
the model set to better fit the speaker's characteristics.

The adaptation technique used is MLLR (maximum likelihood linear regression).
Depending on the available data, it generates one or more transformations that
reduce the mismatch between an initial model set and the adaptation data. When
the amount of available data is very small, a single transformation, called the
global adaptation transform, is estimated and applied to every Gaussian
component in the model set. When the amount of adaptation data is larger, the
number of transformations increases and each transformation is applied to a
particular cluster of Gaussian components.

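To make the transformation concrete, here is a small, self-contained sketch of
what one MLLR transformation does to a Gaussian mean: it replaces the mean mu
with A*mu + b. This is illustrative toy code, not the Sphinx4 implementation;
the class and method names are hypothetical.

```java
// Illustrative sketch of an MLLR mean update: mu' = A * mu + b.
// A, b, and mu are plain arrays; this is not the Sphinx4 API.
public class MllrMeanUpdate {
    static double[] adaptMean(double[][] a, double[] b, double[] mu) {
        int n = mu.length;
        double[] adapted = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = b[i];              // start with the bias term b
            for (int j = 0; j < n; j++) {
                sum += a[i][j] * mu[j];     // add the linear part A * mu
            }
            adapted[i] = sum;
        }
        return adapted;
    }

    public static void main(String[] args) {
        double[][] a = {{1.0, 0.0}, {0.0, 2.0}}; // toy transformation matrix
        double[] b = {0.5, -1.0};                // toy bias vector
        double[] mu = {2.0, 3.0};                // original Gaussian mean
        double[] adapted = adaptMean(a, b, mu);
        System.out.println(adapted[0] + " " + adapted[1]); // 2.5 5.0
    }
}
```

In the global case this single update is applied to every mean in the model;
with more data, a separate (A, b) pair is estimated per cluster of Gaussians.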
To decode with an adapted model, two important classes must be imported:

import edu.cmu.sphinx.decoder.adaptation.Stats;
import edu.cmu.sphinx.decoder.adaptation.Transform;

The Stats class estimates an MLLR transform for each cluster of data, and each
transform is applied to its corresponding cluster. You choose the number of
clusters by passing it as an argument to createStats(nrOfClusters). The method
returns an object that contains the loaded acoustic model and the number of
clusters. It is important to collect counts from each Result object, because
the MLLR transformation is estimated from these counts.

Before counts can be collected, all Gaussians must be clustered. For this,
createStats(nrOfClusters) generates a ClusteredDensityFileData object that
prepares the Gaussians. The ClusteredDensityFileData class performs the
clustering with the k-means algorithm, which aims to partition the Gaussians
into k clusters such that each Gaussian belongs to the cluster with the
nearest mean. Because exact clustering is computationally difficult, the
heuristic uses the Euclidean distance criterion.

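The assignment step of that algorithm can be sketched in a few lines: each
point (standing in for a Gaussian mean) goes to the centroid nearest in
Euclidean distance. This is toy code for illustration, not the
ClusteredDensityFileData implementation.

```java
import java.util.Arrays;

// Toy k-means assignment step using the squared Euclidean distance.
// Not the Sphinx4 ClusteredDensityFileData code; names are illustrative.
public class KMeansAssign {
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;        // squared Euclidean distance
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{1.0, 1.0}, {9.0, 8.0}, {0.5, -0.5}};
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = nearestCentroid(points[i], centroids);
        }
        System.out.println(Arrays.toString(assignment)); // [0, 1, 0]
    }
}
```

A full k-means run alternates this assignment step with recomputing each
centroid as the mean of its assigned points until the assignments stabilize.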
The next step is to collect counts from each Result object and store them
separately for each cluster. At this stage the matrices regLs and regRs used
in computing the transformation are filled. The Transform class performs the
actual transformation for each cluster: given the previously gathered counts
and the number of clusters, it computes the matrix A (the transformation
matrix) and the vector B (the bias vector), which are tied across the
Gaussians of the corresponding cluster. A Transform object contains all the
transformations computed for an utterance. To use the adapted acoustic model,
the Sphinx3Loader, which is responsible for loading the model files, must be
updated. Because the acoustic model is already loaded when the update occurs,
the setTransform(transform) method replaces the old means with the new ones.

Now that we have covered the theory, let's move on to practice. Here is how
you create and use an MLLR transformation:

Stats stats = recognizer.createStats(1);
recognizer.startRecognition(stream);
while ((result = recognizer.getResult()) != null) {
    stats.collect(result);
}
recognizer.stopRecognition();

// Transform represents the speech profile
Transform transform = stats.createTransform();
recognizer.setTransform(transform);

After the transformation is set on the StreamSpeechRecognizer object, the
recognizer is ready to decode using the new means. The recognition process is
the same as decoding with the general acoustic model. Creating and setting a
transformation is effectively like creating a new acoustic model with the
speaker's characteristics, so accuracy improves.

For further decodings you can store a speaker's transformation in a file by
calling store("FilePath", 0) on the Transform object.

If you have your own transformation, known as mllr_matrix, previously
generated with Sphinx4 or with another program, you can load the file by
calling load("FilePath") on a Transform object and then set it on a
Recognizer object.