Building a speech recognizer AI/ML model in Python (Part 1 of 6 — Visualizing Audio Signals)

Burhan Amjad
3 min read · Jan 30, 2024



Speech recognition is the process of decoding the words spoken in a language. We capture the audio signal through a microphone, whose built-in dynamic transducer converts sound waves into electrical signals.

Several aspects of speech contribute to its complexity, such as emotion, accent, intonation, affectation, and language.

For humans it is fairly easy to understand speech cues, but for machines it is difficult to define a set of robust rules for analyzing speech signals.

Visualizing audio signals

Let’s visualize an audio signal. We will read an audio file and work with it to understand how an audio signal is structured. Audio signals are continuous analog waves; when we capture them through a digital medium, they are converted into discrete values sampled at a certain frequency.

Speech signals are most commonly sampled at 44,100 Hz, which means each second of audio is broken into 44,100 samples, i.e. one sample every 1/44,100 of a second.
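As a quick sanity check, the sample count follows directly from the sampling rate; a minimal sketch, assuming a hypothetical 3-second clip:

# Relationship between sampling rate, duration, and sample count
sampling_rate = 44100   # samples per second
duration_s = 3          # hypothetical clip length in seconds
num_samples = sampling_rate * duration_s
print(num_samples)      # 132300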

Create a new Python file and import the following packages.

# Importing packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

Read the input audio file using the wavfile.read method. It returns two values: the sampling frequency and the audio signal.

Important: provide your own audio file, and make sure it is in the .wav format.

# Read audio file
sampling_freq, signal = wavfile.read("audio.wav")
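Note that the code that follows assumes a mono recording. For a stereo file, wavfile.read returns a 2-D array of shape (num_samples, 2); a minimal sketch to keep only the first channel so the rest of the code works on a 1-D signal:

# Keep only the first channel if the file is stereo
if signal.ndim > 1:
    signal = signal[:, 0]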

Print the shape of the audio signal, its datatype, and its duration.

# Display signal parameters
print("\nSignal shape:", signal.shape)
print("Datatype:", signal.dtype)
print("Signal duration:", round(signal.shape[0] / float(sampling_freq), 2), "seconds")

The output should look like this:

[Screenshot of the printed output, by the author]
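For example, for a hypothetical 10-second mono recording sampled at 44,100 Hz and stored as 16-bit PCM, the printed values would be:

Signal shape: (441000,)
Datatype: int16
Signal duration: 10.0 seconds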

Normalize the signal.
The samples in a 16-bit PCM .wav file are integers in the range -32,768 to 32,767, so dividing by 2^15 = 32,768 rescales them to floating-point values in the range [-1, 1).

# Normalize the 16-bit samples to the range [-1, 1)
signal = signal / np.power(2, 15)
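Dividing by 2^15 assumes 16-bit samples. As an alternative sketch (not from the original article), you could derive the divisor from the sample datatype and use it in place of the division above, applied to the raw integer signal:

# Derive the divisor from the sample datatype (int16, int32, ...)
max_val = np.iinfo(signal.dtype).max + 1   # 32768 for int16
signal = signal / max_val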

Extract the first 50 values from the NumPy array for plotting.

# Extract the first 50 values
signal = signal[:50]

Construct the time axis in milliseconds.

# Construct the time axis in milliseconds
time_axis = 1000 * np.arange(0, len(signal), 1) / float(sampling_freq)
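At 44,100 Hz, those 50 samples span only 50/44,100 × 1,000 ≈ 1.13 ms, so the plot shows a very short slice of the waveform.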

Plot the audio signal.

# Plot the audio signal
plt.plot(time_axis, signal, color='black')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Input audio signal')
plt.show()
[Plot of the input audio signal, screenshot by the author]
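If you run the script in an environment without a display, you could also save the figure to disk before calling plt.show() (the filename here is just an example):

# Save the plot to a file (optional)
plt.savefig('input_signal.png', dpi=150)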

Finally, we have our audio visualization.

In the next part (Part 2), we will explore how to transform the audio signal into the frequency domain.

All parts in this series are based on AI with Python by Alberto Artasanchez & Prateek Joshi.
