Building a speech recognizer AI/ML model in Python (Part 6 of 6 — Recognizing spoken words)

Burhan Amjad
4 min read · Feb 13, 2024

Photo by Jacek Dylag on Unsplash

We have covered the techniques for analyzing speech signals. In this final part, we will learn how to recognize spoken words. A speech recognizer takes an audio signal as input and outputs the recognized words. We will use Hidden Markov Models (HMMs) for this task.

HMMs are well suited to sequential data, and an audio signal is a time series, which is one form of sequential data. The assumption is that the observed outputs are generated by a system moving through a series of hidden states. Our goal is to infer these hidden states so we can identify the words in the signal.
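To make the idea of hidden states generating outputs more concrete, here is a minimal sketch of the forward algorithm, which computes how likely an observation sequence is under a given HMM. It uses a toy discrete model in plain Python rather than the Gaussian HMM from hmmlearn that we train later. The function name and all probabilities are made up for illustration.

```python
from itertools import product


def forward_likelihood(obs, start_prob, trans_prob, emit_prob):
    """Return P(obs) under a discrete HMM using the forward algorithm."""
    n_states = len(start_prob)
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = [start_prob[s] * emit_prob[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[prev] * trans_prob[prev][s] for prev in range(n_states))
            * emit_prob[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model that emits two symbols (0 and 1)
START = [0.6, 0.4]                # P(first hidden state)
TRANS = [[0.7, 0.3], [0.4, 0.6]]  # P(next state | current state)
EMIT = [[0.9, 0.1], [0.2, 0.8]]   # P(symbol | state)

likelihood = forward_likelihood([0, 0, 1], START, TRANS, EMIT)
print(likelihood)

# Sanity check: the likelihoods of all length-3 sequences should sum to 1
total = sum(forward_likelihood(list(seq), START, TRANS, EMIT)
            for seq in product([0, 1], repeat=3))
print(total)
```

A recognizer built on this idea keeps one model per word and picks the word whose model assigns the highest likelihood to the input.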

We will use the package called hmmlearn to build our speech recognition system.

To train our speech recognition system, we need a dataset of audio files for each word. We will use the dataset available at https://code.google.com/archive/p/hmm-speech-recognition/downloads.

Let’s create a new Python file and import the packages.

# Packages
import os
import argparse
import warnings
import numpy as np
from scipy.io import wavfile
from hmmlearn import hmm
from python_speech_features import mfcc

Define a function to parse the input arguments. We need to specify the input folder containing the audio files that will be used to train our speech recognition system.

# Define function
def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the speech recognizer')
    parser.add_argument('--input-folder', dest='input_folder', required=True,
                        help='Input folder containing the audio file for training')
    return parser
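As a quick check of the parser, the sketch below passes an argument list directly to `parse_args` instead of reading it from the command line. The folder name `data` is only a placeholder.

```python
import argparse


def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the speech recognizer')
    parser.add_argument('--input-folder', dest='input_folder', required=True,
                        help='Input folder containing the audio file for training')
    return parser


# 'data' is a placeholder folder name used only for this example
args = build_arg_parser().parse_args(['--input-folder', 'data'])
print(args.input_folder)
```

Because the argument is marked `required=True`, running the script without `--input-folder` makes argparse print a usage message and exit.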

Define a class to train the HMMs.

# Define class to train HMM
class ModelHMM(object):
    def __init__(self, num_components=4, num_iter=1000):
        self.n_components = num_components
        self.n_iter = num_iter
        # Define the covariance and HMM type
        self.cov_type = 'diag'
        self.model_name = 'GaussianHMM'
        # Initialize the variable in which we will store the trained models
        self.models = []
        # Define model using parameters
        self.model = hmm.GaussianHMM(n_components=self.n_components,
                                     covariance_type=self.cov_type,
                                     n_iter=self.n_iter)

Define a method to train the model.

    # Define method to train model (part of the ModelHMM class)
    def train(self, training_data):
        # Suppress NumPy floating-point warnings raised during fitting
        np.seterr(all='ignore')
        cur_model = self.model.fit(training_data)
        self.models.append(cur_model)

Define a method to compute the score for input data.

    # Run HMM model for inference on input data (part of the ModelHMM class)
    def compute_score(self, input_data):
        return self.model.score(input_data)

Define a function to build a model for each word in the training dataset.

# Define function
def build_models(input_folder):
    speech_models = []
    # Parse the input directory
    for dirname in os.listdir(input_folder):
        subfolder = os.path.join(input_folder, dirname)
        if not os.path.isdir(subfolder):
            continue
        # Extract the label (the subfolder name is the word)
        label = os.path.basename(subfolder)
        # Initialize the variable
        X = np.array([])
        # Create the list of training files; sort so the last file is reliably held out for testing
        training_files = sorted(x for x in os.listdir(subfolder) if x.endswith('.wav'))[:-1]
        # Iterate through the training files and build model
        for filename in training_files:
            # Extract the current filepath
            filepath = os.path.join(subfolder, filename)
            # Read audio file
            sampling_freq, signal = wavfile.read(filepath)
            # Extract the MFCC features
            with warnings.catch_warnings():
                warnings.simplefilter('ignore')
                features_mfcc = mfcc(signal, sampling_freq)
            # Append to the variable X
            if len(X) == 0:
                X = features_mfcc
            else:
                X = np.append(X, features_mfcc, axis=0)

Initialize, train and save the HMM model.

        # Create HMM model (still inside the loop over word folders)
        model = ModelHMM()

        # Train HMM model
        model.train(X)

        # Save model
        speech_models.append((model, label))

        # Reset variable
        model = None

    return speech_models
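Note that `build_models` concatenates the MFCC frames of every training file into one array, so the HMM treats them as a single long sequence. hmmlearn's `fit` method also accepts an optional `lengths` argument that marks where each sequence ends. The sketch below shows one way to build that list alongside the stacked array. The helper name `stack_sequences` and the random stand-in features are assumptions for illustration, not part of the original script.

```python
import numpy as np


def stack_sequences(sequences):
    """Concatenate per-file feature matrices and record each file's frame count."""
    X = np.concatenate(sequences, axis=0)
    lengths = [len(seq) for seq in sequences]
    return X, lengths


# Random stand-ins for MFCC matrices: (frames, 13 coefficients) per file
rng = np.random.default_rng(0)
fake_features = [rng.normal(size=(n_frames, 13)) for n_frames in (40, 55, 32)]

X, lengths = stack_sequences(fake_features)
print(X.shape, lengths)

# With hmmlearn installed, the boundaries could then be passed as:
#   hmm.GaussianHMM(n_components=4, covariance_type='diag').fit(X, lengths)
```

Whether this improves recognition on this dataset is not something the original script measures, so treat it as an option to experiment with.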

Define a function to run the trained models on the test dataset.

# Define function
def run_tests(test_files):
    # Classify input data (uses the module-level speech_models list built in the main block)
    for test_file in test_files:
        sampling_freq, signal = wavfile.read(test_file)
        # Extract MFCC features
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            features_mfcc = mfcc(signal, sampling_freq)
        # Define variables
        max_score = -float('inf')
        predicted_label = None
        # Iterate through each model and keep the one with the highest score
        for model, label in speech_models:
            # Evaluate score and compare against maximum score
            score = model.compute_score(features_mfcc)
            if score > max_score:
                max_score = score
                predicted_label = label
        # The true label is the name of the folder containing the file
        original_label = os.path.basename(os.path.dirname(test_file))
        # Print output
        print('\nOriginal: ', original_label)
        print('Predicted: ', predicted_label)

Define the main block and get the input folder from the command-line arguments.

# Define main function
if __name__ == '__main__':
    args = build_arg_parser().parse_args()
    input_folder = args.input_folder

Build an HMM model for each word, then collect the test files and run the tests.

    # Build HMM model (still inside the main block)
    speech_models = build_models(input_folder)
    # Test files: the held-out recordings with '15' in the filename
    test_files = []
    for root, dirs, files in os.walk(input_folder):
        for filename in (x for x in files if '15' in x):
            filepath = os.path.join(root, filename)
            test_files.append(filepath)
    run_tests(test_files)
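The script prints each original and predicted label but does not summarize them. If you collect the pairs in a list, a small helper like the one sketched below can report overall accuracy. The helper name and the sample results are made up for illustration and are not output from the trained models.

```python
def accuracy(pairs):
    """Return the fraction of (original, predicted) label pairs that match."""
    if not pairs:
        return 0.0
    correct = sum(1 for original, predicted in pairs if original == predicted)
    return correct / len(pairs)


# Made-up results purely for illustration
results = [('apple', 'apple'), ('banana', 'banana'),
           ('kiwi', 'lime'), ('lime', 'lime')]
print(f'Accuracy: {accuracy(results):.2f}')
```

With only one held-out file per word, such a number is a rough indication at best.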

Across this series, we explored how to work with speech signals and built a simple word recognizer. Speech recognition lets applications process spoken queries and return results based on what the user said. Our recognizer does this by training one HMM per word on MFCC features and choosing the word whose model scores the input highest, a task that humans perform effortlessly.
