Building a speech recognizer AI/ML model in Python (Part 1 of 6 — Visualizing Audio Signals)

Burhan Amjad
3 min read · Jan 30, 2024



Speech recognition is the process of decoding the words spoken in a language. We capture the audio signal through a microphone, whose built-in dynamic transducer converts sound waves into electrical signals.

Several aspects of speech contribute to its complexity, such as emotion, accent, intonation, affectation, and language.

For humans it is fairly easy to understand speech cues, but for machines it is difficult to define a set of robust rules for analyzing speech signals.

Visualizing audio signals

Let’s visualize an audio signal. We will read an audio file and work with it to understand how an audio signal is structured. Audio signals are continuous analog waves; when we capture them through a digital medium, they are converted into discrete values sampled at a certain frequency.

Speech signals are most commonly sampled at 44,100 Hz, which means each second of audio is broken into 44,100 samples, i.e. one sample every 1/44,100 of a second.
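As a quick sanity check, the sample count follows directly from the sampling rate; a minimal sketch, assuming a hypothetical 3-second clip:

# Relationship between sampling rate, duration, and sample count
sampling_rate = 44100   # samples per second
duration_s = 3          # hypothetical clip length in seconds
num_samples = sampling_rate * duration_s
print(num_samples)      # 132300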

Create a new Python file and import the following packages.

# Importing packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

Read the input audio file using the wavfile.read method. It returns two values: the sampling frequency and the audio signal.

Important: provide your own audio file, and make sure it is in the .wav format.

# Read audio file
sampling_freq, signal = wavfile.read("audio.wav")
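Note that the code that follows assumes a mono recording. For a stereo file, wavfile.read returns a 2-D array of shape (num_samples, 2); a minimal sketch to keep only the first channel so the rest of the code works on a 1-D signal:

# Keep only the first channel if the file is stereo
if signal.ndim > 1:
    signal = signal[:, 0]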

Print the shape of the audio signal, its datatype, and its duration.

# Display signal parameters
print("\nSignal shape:", signal.shape)
print("Datatype:", signal.dtype)
print("Signal duration:", round(signal.shape[0] / float(sampling_freq), 2), "seconds")

The output should look like this:

[Screenshot of the printed output, by the author]
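For example, for a hypothetical 10-second mono recording sampled at 44,100 Hz and stored as 16-bit PCM, the printed values would be:

Signal shape: (441000,)
Datatype: int16
Signal duration: 10.0 seconds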

Normalize the signal.
The samples in a 16-bit PCM .wav file are integers in the range -32,768 to 32,767, so dividing by 2^15 = 32,768 rescales them to floating-point values in the range [-1, 1).

# Normalize the 16-bit samples to the range [-1, 1)
signal = signal / np.power(2, 15)
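Dividing by 2^15 assumes 16-bit samples. As an alternative sketch (not from the original article), you could derive the divisor from the sample datatype and use it in place of the division above, applied to the raw integer signal:

# Derive the divisor from the sample datatype (int16, int32, ...)
max_val = np.iinfo(signal.dtype).max + 1   # 32768 for int16
signal = signal / max_val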

Extract the first 50 values from the NumPy array for plotting.

# Extract the first 50 values
signal = signal[:50]

Construct the time axis in milliseconds.

# Construct the time axis in milliseconds
time_axis = 1000 * np.arange(0, len(signal), 1) / float(sampling_freq)
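At 44,100 Hz, those 50 samples span only 50/44,100 × 1,000 ≈ 1.13 ms, so the plot shows a very short slice of the waveform.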

Plot the audio signal.

# Plot the audio signal
plt.plot(time_axis, signal, color='black')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Input audio signal')
plt.show()
[Plot of the input audio signal, screenshot by the author]
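If you run the script in an environment without a display, you could also save the figure to disk before calling plt.show() (the filename here is just an example):

# Save the plot to a file (optional)
plt.savefig('input_signal.png', dpi=150)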

Finally, we have our audio visualization.

In the next part (Part 2), we will explore how to transform the audio signal into the frequency domain.

All parts in this series are based on AI with Python by Alberto Artasanchez & Prateek Joshi.
