Kaldi results / training a new model

HARK FORUM Kaldi results / training a new model


      I am using a modified version of the IJCAI-PRICAI network for my robot. However, I am getting poor results with Kaldi using the included sample configuration kaldi_conf and network chain_sample/tdnn_5b/.... Specifically, many of the spoken words are not recognized at all or are recognized incorrectly.

      I made the following changes to use the network on my robot:
      – Changed input to an ALSA 16-microphone array
      – Recorded new transfer function for the operating environment
      – Created a noise correlation matrix for the operating environment
      – Switched to GEVD-MUSIC for Localization

      I have confirmed that the raw .wav data from the microphones is clear, and the separated signals from GHDSS are also recognizable.

      Can you recommend any other things to try to improve recognition accuracy? Can I try and use a smaller model with only a few words?

      Thank you!


      Thank you for your inquiry.

      The model we distribute was trained on separated sound from the TAMAGO-03 microphone array. With a different microphone array, a difference in the volume of the separated sound that is fed to feature extraction is, at the very least, enough to reduce recognition performance. Please compare the volume of your separated sound with that of the original IJCAI-PRICAI evaluation set, and adjust the volume until performance improves.
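
      As a rough way to make this comparison, the following Python sketch measures the RMS level of two 16-bit PCM WAV files and derives the amplitude factor that would match one to the other. It is only an illustration: the function names are hypothetical, and it assumes that the gain you adjust in the network is a linear amplitude multiplier.

```python
import array
import math
import sys
import wave


def rms_dbfs(path):
    """Return the RMS level of a 16-bit PCM WAV file in dB relative to full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("file contains no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / 32768.0 ** 2)


def gain_to_match(reference_db, own_db):
    """Linear amplitude factor that moves a signal at own_db to reference_db.

    Assumption: the gain being tuned is a plain amplitude multiplier.
    """
    return 10.0 ** ((reference_db - own_db) / 20.0)
```

      For example, if a reference file measures about -6 dBFS and your own separated sound measures about -12 dBFS, gain_to_match returns roughly 2.0. Averaging over several utterances gives a more reliable estimate than a single file.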

      Also, if the microphone arrangement differs significantly from the TAMAGO-03 microphone array, recognition performance may deteriorate because the distortion introduced by separation has different characteristics. For best results, you need to create your own model by training on the separated sounds from your microphone array.

      The language model was trained with a large vocabulary, so it can recognize most words other than jargon or slang, and a small-vocabulary language model should not be necessary. If you really need your own language model, please create it with Kaldi’s tools. You then need to run mkgraph.sh, passing as arguments the directory containing the “final.mdl” file and the directory of your language model.
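
      The mkgraph.sh call can be sketched as below. This is a minimal Python wrapper under stated assumptions: it uses the usual argument order of Kaldi's utils/mkgraph.sh (language directory, model directory, output graph directory), it adds the --self-loop-scale 1.0 option that Kaldi's chain recipes pass, and every directory name you give it is your own, not a path from the distributed model.

```python
import subprocess


def mkgraph_command(lang_dir, model_dir, graph_dir, chain_model=True):
    """Build the argument list for Kaldi's utils/mkgraph.sh.

    lang_dir:  directory of the language model (hypothetical path)
    model_dir: directory containing final.mdl (hypothetical path)
    graph_dir: output directory for the decoding graph (hypothetical path)
    """
    cmd = ["utils/mkgraph.sh"]
    if chain_model:
        # Kaldi's chain recipes pass this option when building the graph.
        cmd += ["--self-loop-scale", "1.0"]
    cmd += [lang_dir, model_dir, graph_dir]
    return cmd


def run_mkgraph(recipe_dir, lang_dir, model_dir, graph_dir):
    """Run mkgraph.sh from inside a Kaldi recipe directory.

    Assumes recipe_dir contains utils/ and a working path.sh.
    """
    subprocess.run(
        mkgraph_command(lang_dir, model_dir, graph_dir),
        cwd=recipe_dir,
        check=True,
    )
```

      Building the command separately from running it makes it easy to print and inspect the arguments before starting a long graph compilation.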

      Best regards,
      HARK support team.


        Thank you very much for your help.

        I will check my microphone setup and verify that the sound levels are similar to TAMAGO-03. Right now I have an AudioStreamFromMic node connected to a MultiGain node, and I am adjusting the GAIN parameter in MultiGain. Is this the correct approach to adjust volume?

        Regarding your second suggestion, how can I create a model to learn separated sounds for my microphone array? I have noticed that my separated signals are distorted. Do I need to make changes to sub_separation or is there another process?

        I will continue to use the large vocabulary language model.

        Thank you again for your help.


        Yes, adjusting the MultiGain node connected after AudioStreamFromMic is the correct approach. To check the amplitude of the separated sound, look at the output of SaveWavePCM. The HARK Cookbook has an example of a SaveWavePCM connection. The IJCAI-PRICAI network file differs in that it outputs the separated sound after noise suppression, but it is connected so that the separated sound is output as in “Connection example 2”.

        Note: The IJCAI-PRICAI network file uses the separated sound file written by SaveWavePCM only for confirmation. The features are computed directly from the frequency-domain signal, so be aware that changing the gain parameter of Synthesize does not affect the features used in the recognition process.

        Use the Kaldi toolkit to train the model. Kaldi includes a number of sample recipes for training models on a corpus.
        We use a paid corpus, but free corpora also exist. Fortunately, many free English corpora are available.

        Note that we use MSLS features, not the more common MFCC features. We therefore propose two methods; please choose the one you prefer.
        We confirmed that MSLS features perform better than MFCC features, so we chose method 2, but method 1 takes less work.

        1. Simply replace the MSLSExtraction node in the HARK network file with MFCCExtraction, and match the feature dimensions to those used by the recipe.
        2. Understand the features output by MSLSExtraction and patch the Kaldi training recipe accordingly.

        Work required to prepare training data:
        First, convolve the impulse responses with the corpus data you will use, to prepare input data that simulates the microphone array receiving sound from various source directions.
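
        The convolution step can be illustrated with the short Python sketch below. It is not part of HARK or Kaldi: the function names are hypothetical, samples are plain lists of floats, and the direct-form convolution is written for clarity rather than speed (for real corpora an FFT-based routine would be far faster).

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sequences (direct form, O(N*M))."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        if s == 0:
            continue  # zero samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def simulate_array(dry_speech, impulse_responses):
    """Return one simulated channel per microphone for one source direction.

    impulse_responses holds one impulse response per microphone, all
    measured for the same source direction.
    """
    channels = [convolve(dry_speech, h) for h in impulse_responses]
    length = max(len(c) for c in channels)
    # Zero-pad so that every channel has the same number of samples.
    return [c + [0.0] * (length - len(c)) for c in channels]
```

        Repeating this for each measured direction, and for each utterance in the corpus, yields multichannel data that can be written to WAV files and fed to the network file.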

        For method 1:
        Next, feed the simulated microphone array data into the network file, and use the separated sound files written by SaveWavePCM for training.

        For method 2:
        Next, prepare a network file to obtain the features for training. Duplicate the network file you currently use for speech recognition and make the following modifications. Use AudioStreamFromWave instead of AudioStreamFromMic; when you do, don’t forget to set CONDITION for the EOF terminal. HARK also has a SaveHTKFeature node for saving features in HTK Feature format. Connect the features that are input to SpeechRecognition (SMN) Client to SaveHTKFeature, and set the parameter of SaveHTKFeature to USER.
        Feed the simulated microphone array data prepared in the first step into this network file, and use the feature files it outputs for training.
        Replace the features that the Kaldi recipe trains on with those output by HARK. Also match the number of dimensions in the recipe’s feature-related config files.
        Supplementary explanation: Kaldi can convert PCM data to features during training, but please do not use Kaldi’s feature conversion function; instead, read the HTK Feature format files output by HARK directly for training. This makes it unnecessary to patch Kaldi’s feature conversion program to output MSLS features.
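
        Before training, it can help to check the dimension of the saved feature files against the recipe's config. The Python sketch below reads and writes the standard HTK parameter file layout (a 12-byte big-endian header followed by 32-bit floats). It assumes that the files written by SaveHTKFeature follow this standard layout; the function names are hypothetical.

```python
import struct

HTK_USER = 9  # parameter kind code for USER in the HTK file format


def write_htk(path, frames, frame_period_100ns=100000, parm_kind=HTK_USER):
    """Write frames (a list of equal-length float lists) as an HTK feature file."""
    dim = len(frames[0])
    with open(path, "wb") as f:
        # Header: frame count, frame period (100 ns units), bytes per frame, kind.
        f.write(struct.pack(">iihh", len(frames), frame_period_100ns,
                            dim * 4, parm_kind))
        for frame in frames:
            f.write(struct.pack(">%df" % dim, *frame))


def read_htk(path):
    """Return (frames, frame_period_100ns, parm_kind) from an HTK feature file."""
    with open(path, "rb") as f:
        n_frames, period, frame_bytes, parm_kind = struct.unpack(
            ">iihh", f.read(12))
        dim = frame_bytes // 4  # each value is a 4-byte float
        frames = [list(struct.unpack(">%df" % dim, f.read(frame_bytes)))
                  for _ in range(n_frames)]
    return frames, period, parm_kind
```

        The length of each frame returned by read_htk is the feature dimension that the recipe's config must match. To my knowledge, Kaldi's copy-feats tool has an --htk-in option for reading this format, but please confirm this against your Kaldi version.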

        Best regards,
        HARK support team.


          Thank you very much for the detailed response! I will respond here if I have any more questions.
          Thanks again.
