A GUIDE TO BUILDING AN AUTOMATIC SPEECH RECOGNITION SYSTEM
From voice assistants to transcription services, we use this incredible technology called Automatic
Speech Recognition (ASR) in numerous ways to make our lives easier. ASR has completely transformed
the way we interact with technology, and it is therefore now used in a wide range of applications.
This brief guide will help you understand how to build a basic Automatic Speech Recognition system
using PyTorch and the Hugging Face Transformers library.
Automatic Speech Recognition System – Overview
An Automatic Speech Recognition system, sometimes referred to as speech-to-text, is used for a variety of
purposes, including:
Processing human speech, taking the human voice or audio as input
Converting that audio into written text, transforming the spoken words into a format that is easier for a computer to understand
What does that mean?
Well, essentially, an ASR system can be considered a translator between spoken language and
machine language. It uses complex machine learning algorithms and models to analyze, understand,
and interpret human speech effectively.
They consider a lot of factors such as:
Variations in pronunciation
Background noise
Different accents across the globe
The tone and context of spoken words
Applications of Automatic Speech Recognition
An Automatic Speech Recognition System is used for a variety of applications such as:
Siri, Alexa, Google Assistant, & other voice assistants
Different transcription services
Smart devices using voice commands
Different accessibility tools
Customer service applications
To sum up, ASR helps machines ‘listen’ and ‘understand’ human speech so that we can easily interact
with technology in a more natural and intuitive way.
Steps Involved in ASR Pipeline
Building an ASR system involves a few important stages. In this guide, we will learn how to:
Set up an environment for ASR system training
Load and preprocess a speech dataset
Fine-tune a pretrained ASR model
Evaluate the model's performance using word error rate, and
Deploy the model for real-world application.
Since we are going to build a lightweight and efficient model, we will use a small speech dataset.
Step 1: Installing Dependencies
The first step in creating our ASR system is installing the required libraries that can facilitate loading datasets,
processing audio files, and fine-tuning the model.
pip install torch torchaudio transformers datasets soundfile jiwer
Why do we install these libraries? They serve the following purposes:
Transformers - provides pre-trained models for speech recognition, such as Wav2Vec2
Datasets - helps with loading and processing speech datasets
Torchaudio - handles audio processing and manipulation tasks
Soundfile - reads and writes .wav files
Jiwer - computes the word error rate (WER) used to measure the ASR system's performance
Step 2: Loading Speech Dataset
For experimentation purposes, using a large dataset like Common Voice would not be a good choice. Instead,
we will use SUPERB KS (keyword spotting), which contains short spoken commands (yes, no, etc.).
from datasets import load_dataset
dataset = load_dataset("superb", "ks", split="train[:1%]") # Load only 1% of the data for quick testing
print(dataset)
Using this command, we load a tiny subset of the dataset.
Note: the dataset still consumes storage space, so keep this in mind when working with larger splits.
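If you want to check the storage footprint of what was downloaded, the Dataset object exposes its backing cache files and reported size. A small optional check, not part of the original guide (size_in_bytes may be None for some datasets, and it refers to the full configuration rather than the 1% slice):

# Files backing this split in the local Hugging Face cache.
print(dataset.cache_files)

# Reported size of the full dataset configuration in bytes (may be None).
print(dataset.info.size_in_bytes)

# Number of examples actually loaded by the train[:1%] slice.
print(dataset.num_rows)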
Step 3: Preprocessing Audio Data
Correctly formatted audio data ensures efficient training of an ASR model. The Wav2Vec2 model expects a
16 kHz sample rate, with no padding or truncation applied to the raw waveform.
To properly process the audio and extract relevant features, we define the following function.
import torchaudio

def preprocess_audio(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio"]["path"])
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["label"]  # Use the labels as the text output
    return batch

dataset = dataset.map(preprocess_audio)
This will help us get audio files in the correct format so that we can process them further.
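The Speech Commands clips behind SUPERB KS are already recorded at 16 kHz, but if you swap in a dataset with a different sample rate, the waveforms need to be resampled before they reach the model. A minimal sketch using torchaudio's resampler (the resample_if_needed name and the 16 kHz target constant are assumptions for illustration):

import torch
import torchaudio

TARGET_SR = 16_000

def resample_if_needed(batch):
    # Wav2Vec2 expects 16 kHz input; convert anything recorded at another rate.
    if batch["sampling_rate"] != TARGET_SR:
        resampler = torchaudio.transforms.Resample(
            orig_freq=batch["sampling_rate"], new_freq=TARGET_SR
        )
        speech = torch.tensor(batch["speech"], dtype=torch.float32)
        batch["speech"] = resampler(speech).numpy()
        batch["sampling_rate"] = TARGET_SR
    return batch

dataset = dataset.map(resample_if_needed)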
Step 4: Loading Wav2Vec2 Model
Since we are using the Wav2Vec2 pre-trained model from Hugging Face's model hub, which has
already been trained on a large dataset, we only need to fine-tune it for our specific requirements.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
So, in this step, we define both the processor and the model. The processor converts audio data into features
suitable for the model, and the model is a pre-trained Wav2Vec2 network trained on 960 hours of speech.
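To see what the processor and model actually produce, here is a quick, optional sanity check on a single clip. This is not part of the original guide; it assumes the speech column created in Step 3, and the freeze_feature_encoder() call is a common (optional) fine-tuning choice rather than something the guide prescribes:

import torch

# Run one preprocessed clip through the processor and the model.
sample = dataset[0]
inputs = processor(sample["speech"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# input_values: (batch, samples); logits: (batch, time_steps, vocab_size),
# i.e. one character distribution per audio frame.
print(inputs.input_values.shape, logits.shape)

# Optional: freezing the convolutional feature encoder is common when
# fine-tuning on small datasets (older transformers releases expose this
# as freeze_feature_extractor() instead).
model.freeze_feature_encoder()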
Step 5: Preparing Data for the Model
In this step, we convert the audio into the input features the model expects.
def preprocess_for_model(batch):
    inputs = processor(batch["speech"], sampling_rate=16000,
                       return_tensors="pt", padding=True)
    batch["input_values"] = inputs.input_values[0]
    return batch

dataset = dataset.map(preprocess_for_model,
                      remove_columns=["speech", "sampling_rate", "audio"])
This makes our dataset compatible with the model.
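One gap worth flagging: the keyword-spotting labels are integer class ids, while Wav2Vec2ForCTC computes its loss from a labels column of token ids. The guide does not show this step, so here is a hedged sketch (the add_labels name is an illustration) that maps each class id to its word and encodes it with the CTC tokenizer; it also overwrites target_text with the word itself so that Step 8's WER computation compares strings rather than integers:

# Class id -> word mapping comes from the dataset's ClassLabel feature.
label_names = dataset.features["label"].names

def add_labels(batch):
    # The wav2vec2-base-960h vocabulary is upper-case; placeholder classes such
    # as "_unknown_" and "_silence_" would need special handling in practice.
    word = label_names[batch["target_text"]].upper()
    batch["target_text"] = word
    batch["labels"] = processor.tokenizer(word).input_ids
    return batch

dataset = dataset.map(add_labels)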
Step 6: Defining Training Arguments
Training configurations like batch size, learning rate, optimization steps, etc., must be defined before we start
training the model. This is done as follows:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    per_device_train_batch_size=4,
    evaluation_strategy="epoch",  # evaluating each epoch requires an eval_dataset in the Trainer
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=4000,
    save_total_limit=2,
    gradient_accumulation_steps=2,
    fp16=True,  # mixed precision; requires a CUDA GPU, set to False on CPU
    push_to_hub=False,
)
Step 7: Model Training
We will then fine-tune our Wav2Vec2 model using Hugging Face’s trainer as follows:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor,
)

trainer.train()
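One caveat: the Trainer's default collator cannot pad variable-length raw audio, so CTC fine-tuning normally uses a small custom data collator that pads the input values and the label sequences separately. This is a hedged sketch following the common pattern from Hugging Face's Wav2Vec2 fine-tuning examples, and it assumes the labels column added in the Step 5 sketch above:

from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import Trainer, Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    """Pads audio inputs and label sequences to the longest item in each batch."""

    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")

        # Padding positions are set to -100 so the CTC loss ignores them.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor)

# The Trainer above would then be constructed with the collator included:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
    tokenizer=processor,
)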
Step 8: Evaluating Model
Now that we have trained our model, we must check how it performs. We do it by computing WER as follows:
import torch
from jiwer import wer

def transcribe(batch):
    inputs = processor(batch["input_values"], sampling_rate=16000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predicted_text"] = processor.batch_decode(predicted_ids)[0]
    return batch

results = dataset.map(transcribe)

# Note: the references passed to wer() must be strings (the decoded label
# words), not integer class ids.
wer_score = wer(results["target_text"], results["predicted_text"])
print(f"Word Error Rate: {wer_score:.2f}")
A lower WER score indicates better performance.
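WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. A small illustrative check with jiwer (the two example sentences are made up for this sketch):

from jiwer import wer

reference = "turn on the lights"
hypothesis = "turn of the light"

# Two substitutions ("on" -> "of", "lights" -> "light") across four
# reference words: (2 + 0 + 0) / 4 = 0.5
print(wer(reference, hypothesis))  # 0.5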
Step 9: Running Inference on New Audio
As the last step of the process, we can run our model on new audio and use it in a real-world context.
import torchaudio

# Load a new recording; it should be (or be resampled to) 16 kHz mono audio.
speech_array, sampling_rate = torchaudio.load("example.wav")
inputs = processor(speech_array.squeeze().numpy(),
                   sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
*Source of code: KDnuggets
CONCLUSION
That's it, folks! We have successfully developed and fine-tuned a pre-trained ASR model using
PyTorch and Hugging Face for real-world applications.
This is a simple project you can work on to enhance your data science portfolio. Also, consider
mastering new data science skills and working on a variety of such projects through the best data
science certifications from USDSI®.
You May Also Like:
Storytelling with Data: Transforming Raw Information into Narrative Symphonies
Future of Data Science: 10 Predictions You Should Know
Data Science: Unlocking Careers for the Future
Data Science vs. Decision Science
Data Science Skills vs. Tools: What Matters the Most for Data Scientists
Master Data-Driven Decision-Making in 2024
Factsheet: Data Science Career 2025
Top 5 Must-Know Data Science Frameworks
LOCATIONS
info@usdsi.org | www.usdsi.org
Arizona
1345 E. Chandler BLVD., Suite 111-D, Phoenix, AZ 85048
info.az@usdsi.org
Connecticut
680 E Main Street #699, Stamford, CT 06901
info.ct@usdsi.org
Illinois
1 East Erie St, Suite 525, Chicago, IL 60611
info.il@usdsi.org
Singapore
No 7 Temasek Boulevard #12-07, Suntec Tower One, Singapore 038987
info.sg@usdsi.org
United Kingdom
29 Whitmore Road, Whitnash, Leamington Spa, Warwickshire, CV31 2JQ
info.uk@usdsi.org
© Copyright 2025. United States Data Science Institute. All Rights Reserved.
REGISTER NOW
LEARN PYTORCH, HUGGING FACE, AND OTHER TOP LIBRARIES WITH USDSI'S DATA SCIENCE CERTIFICATIONS
