Building a speech recognizer AI/ML model in Python (Part 6 of 6 — Recognizing spoken words)

Feb 13, 2024

We have covered the techniques needed to analyze speech signals. In this final part, we will learn how to recognize spoken words. A speech recognizer takes an audio signal as input and outputs the recognized words. We will use Hidden Markov Models (HMMs) for this task.

HMMs are well suited to sequential data, and an audio signal is a time series, which is one form of sequential data. The assumption is that the observed outputs are generated by a system moving through a series of hidden states. Our goal is to find out what these hidden states are so we can identify the words in our signal. In practice, we train one HMM per word and label a new recording with the word whose model assigns it the highest likelihood.
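To make the idea of scoring a sequence under hidden states concrete, here is a minimal NumPy sketch of the forward algorithm for an HMM with diagonal-covariance Gaussian outputs. It is only an illustration under stated assumptions: the function names and the toy parameters are made up for this example and are not part of hmmlearn, although the `score` method of an hmmlearn model returns the same kind of quantity (a log-likelihood).

```python
import numpy as np

def log_gaussian_diag(x, means, variances):
    """Log density of one observation under each state's diagonal Gaussian."""
    # x: (n_features,), means and variances: (n_states, n_features)
    return -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )

def hmm_log_likelihood(observations, start_prob, trans_mat, means, variances):
    """Forward algorithm in log space: log P(observations | model)."""
    log_alpha = np.log(start_prob) + log_gaussian_diag(observations[0], means, variances)
    for x in observations[1:]:
        # Sum over all previous states (in log space) for each next state
        scores = log_alpha[:, None] + np.log(trans_mat)
        log_alpha = np.logaddexp.reduce(scores, axis=0) + log_gaussian_diag(x, means, variances)
    return np.logaddexp.reduce(log_alpha)

# Toy two-state models with a single feature (illustrative values only)
start_prob = np.array([0.5, 0.5])
trans_mat = np.array([[0.9, 0.1], [0.1, 0.9]])
variances = np.array([[1.0], [1.0]])
means_near = np.array([[0.0], [5.0]])    # matches the sequence below
means_far = np.array([[10.0], [20.0]])   # does not match it

sequence = np.array([[0.1], [-0.2], [4.9], [5.1]])
ll_near = hmm_log_likelihood(sequence, start_prob, trans_mat, means_near, variances)
ll_far = hmm_log_likelihood(sequence, start_prob, trans_mat, means_far, variances)
print(ll_near > ll_far)  # the better-matching model gets the higher log-likelihood
```

The recognizer in this article relies on the same comparison: every word model scores the same features, and the largest log-likelihood wins.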

We will use the package called hmmlearn to build our speech recognition system.

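To train our speech recognition system, we need a dataset of audio files for each word. We will use the dataset available at https://code.google.com/archive/p/hmm-speech-recognition/downloads. The code below expects one subfolder per word, each containing .wav recordings of that word.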

Let’s create a new Python file and import the packages.

# Packages
import os
import argparse
import warnings
import numpy as np
from scipy.io import wavfile
from hmmlearn import hmm
from python_speech_features import mfcc

Define a function to parse the input arguments. We need to specify the input folder containing the audio files that will be used to train our speech recognition system.

# Build the command-line argument parser
def build_arg_parser():
  parser = argparse.ArgumentParser(description='Trains the speech recognizer')
  parser.add_argument('--input-folder', dest='input_folder', required=True,
      help='Input folder containing the audio files for training')
  return parser
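As a quick check of the parser, the sketch below calls `parse_args` with an explicit argument list, which simulates running the script from the command line. The folder name `data` is a hypothetical placeholder, not a path from the dataset.

```python
import argparse

def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the speech recognizer')
    parser.add_argument('--input-folder', dest='input_folder', required=True,
                        help='Input folder containing the audio files for training')
    return parser

# Passing a list to parse_args() stands in for the real command line;
# 'data' is a placeholder folder name used only for this example
args = build_arg_parser().parse_args(['--input-folder', 'data'])
print(args.input_folder)  # data
```

Because the argument is marked `required=True`, running the script without `--input-folder` makes argparse print a usage message and exit.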

Define a class to train the HMMs.

# Define class to train HMM
class ModelHMM(object):
  def __init__(self, num_components=4, num_iter=1000):
    self.n_components = num_components
    self.n_iter = num_iter
    # Define the covariance and HMM type
    self.cov_type = 'diag'
    self.model_name = 'GaussianHMM'
    # Initialize the variable in which we will store the trained models
    self.models = []
    # Define model using parameters
    self.model = hmm.GaussianHMM(n_components=self.n_components, covariance_type=self.cov_type, n_iter=self.n_iter)

Define a method to train the model.

  # Define method to train model
  def train(self, training_data):
    # Suppress NumPy floating-point warnings raised during fitting
    np.seterr(all='ignore')
    cur_model = self.model.fit(training_data)
    self.models.append(cur_model)

Define a method to compute the score for input data.

  # Run HMM model for inference on input data
  def compute_score(self, input_data):
    return self.model.score(input_data)

Define a function to build a model for each word in the training dataset.

# Define function
def build_models(input_folder):
  speech_models = []
  # Parse the input directory
  for dirname in os.listdir(input_folder):
    subfolder = os.path.join(input_folder, dirname)
    if not os.path.isdir(subfolder):
      continue
    # Extract the label (the subfolder name is the word)
    label = dirname
    # Initialize the variable
    X = np.array([])
    # Create list of files used for training; sort first so that the file
    # left out for testing is always the last one in name order
    training_files = sorted(x for x in os.listdir(subfolder) if x.endswith('.wav'))[:-1]
    # Iterate through the training files and build model
    for filename in training_files:
      # Extract the current filepath
      filepath = os.path.join(subfolder, filename)
      # Read audio file
      sampling_freq, signal = wavfile.read(filepath)
      # Extract the MFCC features
      with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        features_mfcc = mfcc(signal, sampling_freq)
      # Append to the variable X
      if len(X) == 0:
        X = features_mfcc
      else:
        X = np.append(X, features_mfcc, axis = 0)

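Initialize, train and save the HMM model. This code still belongs to the body of the for dirname loop in build_models, so it runs once per word, after all of that word's training files have been read.

    # Create HMM model
    model = ModelHMM()

    # Train HMM model
    model.train(X)

    # Save model
    speech_models.append((model, label))

    # Reset variable
    model = None

  return speech_models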
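Stacking the MFCC matrices row by row, as build_models does, makes the HMM see all recordings of a word as one long sequence. hmmlearn's `fit` and `score` methods accept an optional `lengths` argument that marks where each recording ends. The sketch below is an assumption-laden illustration of how those lengths could be collected: random arrays stand in for real MFCC features, and the frame counts are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-file MFCC matrices: each is (num_frames, 13),
# and the number of frames differs from file to file (invented values)
per_file_features = [rng.normal(size=(n, 13)) for n in (40, 55, 32)]

# Stack all frames into one matrix and remember each file's frame count
X = np.concatenate(per_file_features, axis=0)
lengths = [f.shape[0] for f in per_file_features]

print(X.shape)   # (127, 13)
print(lengths)   # [40, 55, 32]

# With hmmlearn installed, the boundaries could then be passed along:
# model = hmm.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
# model.fit(X, lengths=lengths)
```

Without `lengths`, the transition from the last frame of one recording to the first frame of the next is treated as a real transition, which may or may not matter for a small vocabulary like this one.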

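Define a function to run the trained models on the test files.

# Classify each test file with the trained models
# (relies on the module-level speech_models list built in the main block)
def run_tests(test_files):
  # Classify input data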
  for test_file in test_files:
    sampling_freq, signal = wavfile.read(test_file)
    # Extract MFCC features
    with warnings.catch_warnings():
      warnings.simplefilter('ignore')
      features_mfcc = mfcc(signal, sampling_freq)
    # Define variables
    max_score = -float('inf')
    predicted_label = None
    # Iterate through each model to pick the best one
    # Run the current feature vector
    for item in speech_models:
      model, label = item
      # Evaluate score and compare against maximum score
      score = model.compute_score(features_mfcc)
      if score > max_score:
        max_score = score
        predicted_label = label
    # Print output; the true label is the name of the folder holding the file
    original_label = os.path.basename(os.path.dirname(test_file))
    print('\nOriginal: ', original_label)
    print('Predicted: ', predicted_label)
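The decision rule in run_tests is simply "pick the label whose model scores highest". The sketch below isolates that rule so it can be tried without audio files. The `classify` helper, the three word labels, and the fixed scores are all invented for illustration; in the real script the scores come from `model.compute_score(features_mfcc)`.

```python
def classify(features, scored_models):
    """Return the label (and score) of the model that scores features highest."""
    best_label, best_score = None, -float('inf')
    for score_fn, label in scored_models:
        score = score_fn(features)
        if score > best_score:
            best_score, best_label = score, label
    return best_label, best_score

# Hypothetical stand-ins for trained models: each returns a fixed log-likelihood
toy_models = [
    (lambda feats: -1520.3, 'apple'),
    (lambda feats: -1388.7, 'banana'),
    (lambda feats: -1602.9, 'kiwi'),
]

label, score = classify(None, toy_models)
print(label, score)  # banana -1388.7
```

Log-likelihoods are usually large negative numbers, so "highest" means closest to zero.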

Define the main function and get the input folder from the input parameter.

# Define main function
if __name__=='__main__':
  args = build_arg_parser().parse_args()
  input_folder = args.input_folder

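Build an HMM model for each word, then collect the test files and run the tests.

  # Build HMM model
  speech_models = build_models(input_folder)
  # Collect the test files: one recording per word, identified by '15' in its filename
  test_files = []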
  for root, dirs, files in os.walk(input_folder):
    for filename in (x for x in files if '15' in x):
      filepath = os.path.join(root, filename)
      test_files.append(filepath)
  run_tests(test_files)
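The script holds out one recording per word by dropping the last entry of the training list, and later tests on the files whose names contain '15'. These two rules agree only if the '15' file is the one that was dropped. A minimal sketch that makes the split explicit by filtering on the filename in both places is shown below; the `split_train_test` helper and the `appleNN.wav` naming pattern are assumptions made for this example.

```python
def split_train_test(filenames, test_token='15'):
    """Split .wav filenames into training and held-out test lists by name."""
    wav_files = sorted(f for f in filenames if f.endswith('.wav'))
    test_files = [f for f in wav_files if test_token in f]
    train_files = [f for f in wav_files if test_token not in f]
    return train_files, test_files

# Assumed naming pattern: apple01.wav ... apple15.wav, plus a stray non-audio file
names = ['apple%02d.wav' % i for i in range(1, 16)] + ['notes.txt']
train, test = split_train_test(names)
print(len(train), test)  # 14 ['apple15.wav']
```

Keeping the test recording out of training matters: a model scored on audio it was trained on will look better than it really is.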

In this series, we explored how to work with speech signals. In this final part, we built a simple word recognizer by training one HMM per word on MFCC features and choosing the model with the highest score. In NLP, speech recognition of this kind lets systems accept spoken queries as well as typed ones and return results based on the user's input. It takes a fair amount of code to approximate something that humans do effortlessly.

Written by Burhan Amjad
