Building a speech recognizer AI/ML model in Python (Part 5 of 6 — Extracting speech features)
So far, we have learned how to handle audio signals and transform them into the frequency domain. Frequency-domain features are what speech recognition systems analyze. The concepts we covered earlier were introductory; real-world frequency-domain features are more involved.
Once we convert a signal into the frequency domain, we need to turn it into a useful feature vector. Mel Frequency Cepstral Coefficients (MFCCs) are the tool we use to extract these features from an audio file.
MFCC extraction first computes the power spectrum of the signal, then applies mel filter banks and a Discrete Cosine Transform (DCT) to produce the features.
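The "Mel" in MFCC refers to the mel scale, a perceptual pitch scale that spaces frequencies the way human hearing does. A minimal sketch of the common conversion formula (the 2595·log10 variant; other variants exist):

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common 2595*log10 variant)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Equal steps in Hz cover fewer and fewer mels as frequency rises,
# mirroring how human pitch perception compresses high frequencies.
for f in [100, 1000, 4000, 8000]:
    print(f, 'Hz ->', round(hz_to_mel(f), 1), 'mel')
```

Notice that the gap between 4000 Hz and 8000 Hz covers fewer mels than the gap between 100 Hz and 4000 Hz; this compression is why the filter banks below are spaced on the mel scale rather than linearly in Hz.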
DCT for beginners:
When you want to compress a series of data (say, a sound recording), you can look for patterns in the data and record the patterns instead of the original measurements. The DCT describes data as a sum of cosine waves vibrating at different frequencies. This turns out to be a great way to describe sounds using just a small set of numbers, which is why it is used in audio compression. It is also used in image and video compression, since pictures tend to have patterns too.
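To see this "small set of numbers" idea in action, here is a sketch using SciPy's dct; the smooth test signal is an arbitrary stand-in for the slowly varying log filter-bank energies that the DCT is applied to inside MFCC:

```python
import numpy as np
from scipy.fft import dct

# A smooth "signal": slow patterns, like the log filter-bank
# energies the DCT operates on inside MFCC extraction.
x = np.cos(np.linspace(0, np.pi, 64)) + 0.5 * np.linspace(0, 1, 64)

coeffs = dct(x, type=2, norm='ortho')

# Most of the energy lands in the first few coefficients,
# which is why keeping only ~13 of them works well for MFCC.
energy_first8 = np.sum(coeffs[:8] ** 2)
energy_total = np.sum(coeffs ** 2)
print('fraction of energy in first 8 of 64 coefficients:',
      round(energy_first8 / energy_total, 4))
```

The first handful of coefficients capture nearly all of the signal's energy, so the remaining ones can be discarded with little loss.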
We will be using a Python package called python_speech_features to extract the MFCC features.
Create a new Python file and import the packages.
# Import packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
In order to extract speech features, we first need to read the file that contains audio.
# Read input audio file
sampling_freq, signal = wavfile.read('audio.wav')
# Take the first 10,000 samples for analysis
signal = signal[:10000]
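If you don't have an audio.wav on disk, you can synthesize a placeholder tone to follow along; the sampling rate and frequency below are arbitrary choices for illustration, not values from the tutorial:

```python
import numpy as np
from scipy.io import wavfile

# Synthesize a 1-second 440 Hz tone as a stand-in for 'audio.wav'
# (16 kHz sampling rate and 440 Hz are arbitrary placeholder choices).
fs = 16000
t = np.arange(fs) / fs
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
wavfile.write('audio.wav', fs, tone)

# Read it back the same way the tutorial does.
sampling_freq, signal = wavfile.read('audio.wav')
print(sampling_freq, signal.shape)
```

Any real speech recording will work just as well; the synthetic tone only exists so the rest of the code runs end to end.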
We then need to extract MFCC and print the MFCC parameters.
# Extract MFCC features
features_mfcc = mfcc(signal, sampling_freq)
# Print MFCC parameters
print('\nMFCC:\nNumber of windows=', features_mfcc.shape[0])
print('Length of each feature =', features_mfcc.shape[1])
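The window count printed above follows from how the library frames the signal. Assuming the python_speech_features defaults (25 ms windows advanced in 10 ms steps) and a 16 kHz sampling rate as an example, a rough sketch of the calculation:

```python
import math

def num_windows(n_samples, fs, winlen=0.025, winstep=0.01):
    """Estimate analysis-window count for 25 ms windows / 10 ms steps
    (the python_speech_features defaults; adjust if you override them)."""
    frame_len = int(round(winlen * fs))
    frame_step = int(round(winstep * fs))
    if n_samples <= frame_len:
        return 1
    return 1 + int(math.ceil((n_samples - frame_len) / frame_step))

# e.g. 10,000 samples at an assumed 16 kHz sampling rate:
# frame_len = 400, frame_step = 160, so 1 + ceil(9600 / 160) = 61 windows
print(num_windows(10000, 16000))
```

The feature length (the second shape dimension) is 13 by default, the number of cepstral coefficients the library keeps after the DCT.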
Let’s plot the MFCC features.
# Plot
features_mfcc = features_mfcc.T
plt.matshow(features_mfcc)
plt.title('MFCC')
Extract the filter bank features.
# Extract filter bank features
features_fb = logfbank(signal, sampling_freq)
# Print filter bank parameters
print('\nFilter bank:\nNumber of windows =', features_fb.shape[0])
print('Length of each feature =', features_fb.shape[1])
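The feature length here is 26 by default, the library's default number of mel filters. A sketch of how such filter center frequencies are spaced, assuming an 8 kHz upper limit (half of a 16 kHz sampling rate, an example choice):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Place 26 filter centers evenly on the mel scale between 0 Hz and
# an assumed 8 kHz upper limit (the two extra points are the band edges).
low, high = hz_to_mel(0), hz_to_mel(8000)
centers_hz = mel_to_hz(np.linspace(low, high, 26 + 2))[1:-1]

# The spacing in Hz widens with frequency: filters are narrow at low
# frequencies and broad at high frequencies, like human hearing.
print(np.round(np.diff(centers_hz)).astype(int))
```

Even spacing on the mel scale becomes uneven spacing in Hz, which is exactly the perceptual warping the filter bank is designed to capture.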
Plot the features.
# Plot
features_fb = features_fb.T
plt.matshow(features_fb)
plt.title('Filter bank')
plt.show()
Converting sound into pictures like these helps us analyze it visually and derive insights from it.
In the next and final part (Part 6), we will recognize the spoken words.
Further reading: MFCC’s Made Easy. An easy explanation of an important… | by Tanveer Singh | Medium