Speaker Adaptation with MLLR Transformation

Unsupervised speaker adaptation for Sphinx4

There are two ways to build an improved acoustic model. The first is to collect data from a speaker and train the acoustic model set on it; because the model then captures the speaker's characteristics, recognition becomes more accurate. The disadvantage of this method is that a large amount of data must be collected to reach sufficient model accuracy. The second way, used when only a small amount of data is available from a new speaker, is to collect that data and apply an adaptation technique that adjusts the model set to better fit the speaker's characteristics.

The adaptation technique used is the MLLR (maximum likelihood linear regression) transform. Depending on the available data, it generates one or more transformations that reduce the mismatch between the initial model set and the adaptation data. When the amount of available data is very small, only one transformation is estimated; it is called the global adaptation transform and is applied to every Gaussian component in the model set. When the amount of adaptation data is larger, the number of transformations increases and each transformation is applied to a particular cluster of Gaussian components.

To be able to decode with an adapted model, two important classes must be imported:

    import edu.cmu.sphinx.decoder.adaptation.Stats;
    import edu.cmu.sphinx.decoder.adaptation.Transform;

The Stats class estimates an MLLR transform for each cluster of data, and each transform is applied to its corresponding cluster. You choose the number of clusters by passing it as an argument to the recognizer's createStats(nrOfClusters) method, which returns an object containing the loaded acoustic model and the number of clusters. It is important to collect counts from each Result object, because the estimation of the MLLR transformation is based on them. Before collecting counts, all Gaussians must be clustered, so createStats(nrOfClusters) generates a ClusteredDensityFileData object to prepare them. The ClusteredDensityFileData class performs the clustering using the k-means algorithm, which aims to partition the Gaussians into k clusters so that each Gaussian belongs to the cluster with the nearest mean. Exact clustering is computationally hard, so the heuristic used here relies on the Euclidean distance criterion. The next step is to collect counts from each Result object and store them separately for each cluster; at this stage the matrices regLs and regRs, used in computing the transformation, are filled.

The Transform class performs the actual transformation for each cluster. Given the previously gathered counts and the number of clusters, it computes the transformation matrix A and the bias vector b, which are tied across all Gaussians of the corresponding cluster; each adapted mean is obtained as A * mean + b. A Transform object thus contains all the transformations computed for an utterance.

To use the adapted acoustic model, the Sphinx3Loader, which is responsible for loading the model files, must be updated. Since the acoustic model is already loaded when the update occurs, the setTransform(transform) method will simply replace the old means with the new ones.

Now that we have covered the theoretical part, let's look at the practical part.
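The snippet in the next paragraph assumes that a StreamSpeechRecognizer has already been created and that the audio to adapt on is available as an InputStream. As a minimal setup sketch, assuming the default en-us model shipped with Sphinx4 (the model paths and the audio file name are placeholders, not part of the original text), this could look as follows:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;

    Configuration configuration = new Configuration();
    // Placeholder paths; point these at the model files you actually use.
    configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
    configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
    configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    InputStream stream = new FileInputStream(new File("speech.wav"));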
Here is how you create and use an MLLR transformation:

    Stats stats = recognizer.createStats(1);
    recognizer.startRecognition(stream);

    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
        stats.collect(result);
    }
    recognizer.stopRecognition();

    // Transform represents the speech profile
    Transform transform = stats.createTransform();
    recognizer.setTransform(transform);

After setting the transformation on the StreamSpeechRecognizer object, the recognizer is ready to decode using the new means. The recognition process is then the same as when decoding with the general acoustic model. Creating and setting a transformation is like creating a new acoustic model with the speaker's characteristics, so the accuracy will be better. For further decodings you can store a speaker's transformation in a file by calling store("FilePath", 0) on the Transform object. If you have your own transformation, known as mllr_matrix, previously generated with Sphinx4 or with another program, you can load the file by calling load("FilePath") on a Transform object and then set it on a Recognizer object.
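For completeness, here is a hedged sketch of that store/load round trip, reusing the recognizer and the transform objects from the snippet above (the file name is a placeholder, and exception handling is omitted; exact checked exceptions may vary between Sphinx4 versions):

    // Persist the estimated transform so it can be reused in later sessions.
    transform.store("speaker1.mllr_matrix", 0);

    // Later: load a previously saved (or externally generated) mllr_matrix
    // file into a Transform object and set it on the recognizer before decoding.
    transform.load("speaker1.mllr_matrix");
    recognizer.setTransform(transform);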