A GUIDE TO BUILDING AN AUTOMATIC SPEECH RECOGNITION SYSTEM
From voice assistants to transcription services, we use this incredible technology called Automatic
Speech Recognition (ASR) in numerous ways to make our lives easier. ASR has completely transformed
the way we interact with technology, and it is therefore now used in a wide range of applications.
This brief guide will help you understand how to build a basic Automatic Speech Recognition system
using PyTorch and the Hugging Face Transformers library.
Automatic Speech Recognition System – Overview
An Automatic Speech Recognition system, sometimes referred to as speech-to-text, is used for a variety of
purposes, including:
Processing human speech, taking the human voice or audio as input
Converting that audio into written text, transforming the spoken words into a format that is easier for a computer to understand
What does that mean?
Well, essentially, an ASR system can be considered a translator between spoken language and
machine language. It uses complex machine learning algorithms and models to analyze, understand,
and interpret human speech effectively.
They consider a lot of factors such as:
Variations in pronunciation
Background noise
Different accents across the globe
The tone and context of spoken words
Applications of Automatic Speech Recognition
An Automatic Speech Recognition System is used for a variety of applications such as:
Siri, Alexa, Google Assistant, & other voice assistants
Different transcription services
Smart devices using voice commands
Different accessibility tools
Customer service applications
To sum up, ASR helps machines ‘listen’ and ‘understand’ human speech so that we can easily interact
with technology in a more natural and intuitive way.
Steps Involved in ASR Pipeline
Building an ASR system involves a few important stages. In this guide, we will learn how to:
Set up an environment for ASR system training
Load and preprocess a speech dataset
Fine-tune a pretrained ASR model
Evaluate the model's performance using word error rate, and
Deploy the model for real-world application.
Since we are going to build a lightweight and efficient model, we will use a small speech dataset.
Step 1: Installing Dependencies
The first step in creating our ASR system is installing the required libraries that can facilitate loading datasets,
processing audio files, and fine-tuning the model.
pip install torch torchaudio transformers datasets soundfile jiwer
Why do we install these libraries? They serve the following purposes:
Transformers - provides pre-trained models for speech recognition, such as Wav2Vec2
Datasets - helps with loading and processing speech datasets
Torchaudio - handles audio processing and manipulation tasks
Soundfile - reads and writes .wav files
Jiwer - computes the word error rate (WER) used to measure the ASR system's performance
Step 2: Loading Speech Dataset
For experimentation purposes, using a large dataset like Common Voice would not be a good choice. Instead,
we will use SUPERB KS (keyword spotting), which contains short spoken commands (yes, no, etc.).
from datasets import load_dataset
dataset = load_dataset("superb", "ks", split="train[:1%]") # Load only 1% of the data for quick testing
print(dataset)
Using this command, we load a tiny subset of the dataset.
Note: the dataset still consumes storage space, so keep this in mind when working with larger splits.
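If you want to check the storage footprint of what was downloaded, the Dataset object exposes its backing cache files and reported size. A small optional check, not part of the original guide (size_in_bytes may be None for some datasets, and it refers to the full configuration rather than the 1% slice):

# Files backing this split in the local Hugging Face cache.
print(dataset.cache_files)

# Reported size of the full dataset configuration in bytes (may be None).
print(dataset.info.size_in_bytes)

# Number of examples actually loaded by the train[:1%] slice.
print(dataset.num_rows)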
Step 3: Preprocessing Audio Data
Correctly formatted audio data ensures efficient training of an ASR model. The Wav2Vec2 model expects a
16 kHz sample rate, with no padding or truncation applied to the raw waveform.
To properly process the audio and extract relevant features, we define the following function.
import torchaudio

def preprocess_audio(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio"]["path"])
    batch["speech"] = speech_array.squeeze().numpy()
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["label"]  # Use the labels as the text output
    return batch

dataset = dataset.map(preprocess_audio)
This will help us get audio files in the correct format so that we can process them further.
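The Speech Commands clips behind SUPERB KS are already recorded at 16 kHz, but if you swap in a dataset with a different sample rate, the waveforms need to be resampled before they reach the model. A minimal sketch using torchaudio's resampler (the resample_if_needed name and the 16 kHz target constant are assumptions for illustration):

import torch
import torchaudio

TARGET_SR = 16_000

def resample_if_needed(batch):
    # Wav2Vec2 expects 16 kHz input; convert anything recorded at another rate.
    if batch["sampling_rate"] != TARGET_SR:
        resampler = torchaudio.transforms.Resample(
            orig_freq=batch["sampling_rate"], new_freq=TARGET_SR
        )
        speech = torch.tensor(batch["speech"], dtype=torch.float32)
        batch["speech"] = resampler(speech).numpy()
        batch["sampling_rate"] = TARGET_SR
    return batch

dataset = dataset.map(resample_if_needed)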
Step 4: Loading Wav2Vec2 Model
Since we are using the Wav2Vec2 pre-trained model from Hugging Face's model hub, which has
already been trained on a large dataset, we only need to fine-tune it for our specific requirements.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
So, in this step, we define both the processor and the model. The processor converts audio data into features
suitable for the model, and the model is a pre-trained Wav2Vec2 network trained on 960 hours of speech.
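To see what the processor and model actually produce, here is a quick, optional sanity check on a single clip. This is not part of the original guide; it assumes the speech column created in Step 3, and the freeze_feature_encoder() call is a common (optional) fine-tuning choice rather than something the guide prescribes:

import torch

# Run one preprocessed clip through the processor and the model.
sample = dataset[0]
inputs = processor(sample["speech"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# input_values: (batch, samples); logits: (batch, time_steps, vocab_size),
# i.e. one character distribution per audio frame.
print(inputs.input_values.shape, logits.shape)

# Optional: freezing the convolutional feature encoder is common when
# fine-tuning on small datasets (older transformers releases expose this
# as freeze_feature_extractor() instead).
model.freeze_feature_encoder()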
Step 5: Preparing Data for the Model
In this step, we convert the audio into the input features the model expects.
def preprocess_for_model(batch):
    inputs = processor(batch["speech"], sampling_rate=16000,
                       return_tensors="pt", padding=True)
    batch["input_values"] = inputs.input_values[0]
    return batch

dataset = dataset.map(preprocess_for_model,
                      remove_columns=["speech", "sampling_rate", "audio"])
This makes our dataset compatible with the model.
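One gap worth flagging: the keyword-spotting labels are integer class ids, while Wav2Vec2ForCTC computes its loss from a labels column of token ids. The guide does not show this step, so here is a hedged sketch (the add_labels name is an illustration) that maps each class id to its word and encodes it with the CTC tokenizer; it also overwrites target_text with the word itself so that Step 8's WER computation compares strings rather than integers:

# Class id -> word mapping comes from the dataset's ClassLabel feature.
label_names = dataset.features["label"].names

def add_labels(batch):
    # The wav2vec2-base-960h vocabulary is upper-case; placeholder classes such
    # as "_unknown_" and "_silence_" would need special handling in practice.
    word = label_names[batch["target_text"]].upper()
    batch["target_text"] = word
    batch["labels"] = processor.tokenizer(word).input_ids
    return batch

dataset = dataset.map(add_labels)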
Step 6: Defining Training Arguments
Training configurations like batch size, learning rate, optimization steps, etc., must be defined before we start
training the model. This is done as follows:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2",
    per_device_train_batch_size=4,
    evaluation_strategy="epoch",  # evaluating each epoch requires an eval_dataset in the Trainer
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=4000,
    save_total_limit=2,
    gradient_accumulation_steps=2,
    fp16=True,  # mixed precision; requires a CUDA GPU, set to False on CPU
    push_to_hub=False,
)
Step 7: Model Training
We will then fine-tune our Wav2Vec2 model using Hugging Face’s trainer as follows:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor,
)

trainer.train()
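One caveat: the Trainer's default collator cannot pad variable-length raw audio, so CTC fine-tuning normally uses a small custom data collator that pads the input values and the label sequences separately. This is a hedged sketch following the common pattern from Hugging Face's Wav2Vec2 fine-tuning examples, and it assumes the labels column added in the Step 5 sketch above:

from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import Trainer, Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    """Pads audio inputs and label sequences to the longest item in each batch."""

    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")

        # Padding positions are set to -100 so the CTC loss ignores them.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor)

# The Trainer above would then be constructed with the collator included:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
    tokenizer=processor,
)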
Step 8: Evaluating Model
Now that we have trained our model, we must check how it performs. We do it by computing WER as follows:
import torch
from jiwer import wer

def transcribe(batch):
    inputs = processor(batch["input_values"], sampling_rate=16000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["predicted_text"] = processor.batch_decode(predicted_ids)[0]
    return batch

results = dataset.map(transcribe)

# Note: the references passed to wer() must be strings (the decoded label
# words), not integer class ids.
wer_score = wer(results["target_text"], results["predicted_text"])
print(f"Word Error Rate: {wer_score:.2f}")
A lower WER score indicates better performance.
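WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of reference words (N): WER = (S + D + I) / N. A small illustrative check with jiwer (the two example sentences are made up for this sketch):

from jiwer import wer

reference = "turn on the lights"
hypothesis = "turn of the light"

# Two substitutions ("on" -> "of", "lights" -> "light") across four
# reference words: (2 + 0 + 0) / 4 = 0.5
print(wer(reference, hypothesis))  # 0.5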
Step 9: Running Inference on New Audio
As the last step of the process, we can run our model on new audio and use it in a real-world context.
import torchaudio

# Load a new recording; it should be (or be resampled to) 16 kHz mono audio.
speech_array, sampling_rate = torchaudio.load("example.wav")
inputs = processor(speech_array.squeeze().numpy(),
                   sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
*Source of code: KDnuggets
CONCLUSION
That's it, folks! We have successfully developed and fine-tuned a pre-trained ASR model using
PyTorch and Hugging Face for real-world applications.
This is a simple project you can work on to enhance your data science portfolio. Also, consider
mastering new data science skills and working on a variety of such projects through the best data
science certifications from USDSI®.
You May Also Like:
Storytelling with Data: Transforming Raw Information into Narrative Symphonies
Future of Data Science: 10 Predictions You Should Know
Data Science: Unlocking Careers for the Future
Data Science vs. Decision Science
Data Science Skills vs. Tools: What Matters the Most for Data Scientists
Master Data-Driven Decision-Making in 2024
Factsheet: Data Science Career 2025
Top 5 Must-Know Data Science Frameworks
LOCATIONS
info@usdsi.org | www.usdsi.org
Arizona
1345 E. Chandler BLVD., Suite 111-D, Phoenix, AZ 85048
info.az@usdsi.org
Connecticut
680 E Main Street #699, Stamford, CT 06901
info.ct@usdsi.org
Illinois
1 East Erie St, Suite 525, Chicago, IL 60611
info.il@usdsi.org
Singapore
No 7 Temasek Boulevard #12-07, Suntec Tower One, Singapore 038987
info.sg@usdsi.org
United Kingdom
29 Whitmore Road, Whitnash, Leamington Spa, Warwickshire, CV31 2JQ
info.uk@usdsi.org
© Copyright 2025. United States Data Science Institute. All Rights Reserved.
REGISTER NOW
LEARN PYTORCH, HUGGING FACE, AND OTHER TOP LIBRARIES WITH USDSI'S DATA SCIENCE CERTIFICATIONS
