audio_course

The Hugging Face Audio Course

https://huggingface.co/learn/audio-course/chapter0/introduction

Unit 1

Notes:

A common sampling rate used in training speech models is 16,000 Hz or 16 kHz.
If you plan to use custom audio data to fine-tune a pre-trained model, the sampling rate of your data should match the sampling rate of the data the model was pre-trained on.
Transformer models that solve audio tasks treat examples as sequences and rely on attention mechanisms to learn audio or multimodal representations. Because the same audio produces different sequences at different sampling rates, it is difficult for models to generalize between sampling rates.
A waveform represents sound as amplitude over time. It is also known as the time-domain representation of sound.
A way to visualize audio data is to plot the frequency spectrum of an audio signal, also known as the frequency domain representation.
The spectrum is computed using the discrete Fourier transform or DFT. It describes the individual frequencies that make up the signal and how strong they are.
A power spectrum measures energy rather than amplitude; it is simply a spectrum with the amplitude values squared.
A spectrogram plots the frequency content of an audio signal as it changes over time. It lets you see time, frequency, and amplitude all on one graph. The algorithm that performs this computation is the short-time Fourier transform (STFT).
Creating a mel spectrogram is a lossy operation as it involves filtering the signal. Converting a mel spectrogram back into a waveform is more difficult than doing this for a regular spectrogram, as it requires estimating the frequencies that were thrown away.
A mel spectrogram is a popular choice in tasks such as speech recognition, speaker identification, and music genre classification.
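
The notes above on the spectrum, the power spectrum, and the spectrogram can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: it uses a naive O(N^2) DFT, and the helper names (`sine_wave`, `dft_magnitudes`, `power_spectrum`, `stft_magnitudes`), the frame length of 160 samples, and the hop of 80 samples are all assumptions chosen for the example. In practice you would more likely reach for an FFT routine such as `numpy.fft.rfft` or `librosa.stft`.

```python
import cmath
import math

SAMPLING_RATE = 16_000  # Hz, the common speech-model rate from the notes


def sine_wave(freq_hz, n_samples, sampling_rate=SAMPLING_RATE):
    """Time-domain waveform: a unit-amplitude sine sampled at sampling_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sampling_rate)
            for n in range(n_samples)]


def dft_magnitudes(frame):
    """Naive DFT; returns the amplitude of bins 0..N/2 (0 Hz up to Nyquist)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))  # magnitude only; the phase is discarded
    return mags


def power_spectrum(frame):
    """Power spectrum: the amplitude spectrum with every value squared."""
    return [m ** 2 for m in dft_magnitudes(frame)]


def stft_magnitudes(signal, frame_len=160, hop=80):
    """Spectrogram sketch: Hann-window overlapping frames and take the DFT of each."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames  # frames[t][k]: amplitude at time step t, frequency bin k


# A 10 ms frame of a 1 kHz tone: the spectrum peaks at the 1 kHz bin.
frame = sine_wave(1000, 160)
spectrum = dft_magnitudes(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin * SAMPLING_RATE / len(frame))  # 1000.0

# A tone that jumps from 1 kHz to 3 kHz: the peak bin moves between frames.
spectrogram = stft_magnitudes(sine_wave(1000, 400) + sine_wave(3000, 400))
```

Each bin `k` corresponds to the frequency `k * SAMPLING_RATE / frame_len`, so with 160-sample frames at 16 kHz the bins are 100 Hz apart. A single spectrum cannot show when the tone changes; the list of per-frame spectra can, which is the point of the spectrogram.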

Questions:

What are transformers?
What is the Nyquist limit?
What are the units of amplitude?
How do you convert from the time domain to the frequency domain?
What does this line mean: "The angle between the real and imaginary components provides the so-called phase spectrum, but this is often discarded in machine learning applications."
What are the Fourier transform and the Laplace transform?
What is the Short-Time Fourier Transform?
What is a vocoder?
What is an example implementation of HiFi-GAN?

Terminology:

Waveform
Sampling Rate: the number of samples taken in one second, measured in hertz (Hz); also called sampling frequency. Sampling itself is the process of measuring the value of a continuous signal at fixed time steps. The sampled waveform is discrete, since it contains a finite number of signal values at uniform intervals. For reference, CD-quality audio has a sampling rate of 44,100 Hz, meaning samples are taken 44,100 times per second. For comparison, high-resolution audio can use a sampling rate as high as 192,000 Hz (192 kHz). A common sampling rate used in training speech models is 16,000 Hz or 16 kHz.
Spectrogram
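
To make the sampling rate definition above concrete, here is a small sketch of sampling a continuous signal at fixed time steps. The function names `sample_signal` and `tone`, and the 440 Hz test signal, are hypothetical choices for illustration only.

```python
import math


def sample_signal(signal_fn, duration_s, sampling_rate):
    """Measure a continuous signal every 1 / sampling_rate seconds."""
    n_samples = round(duration_s * sampling_rate)
    return [signal_fn(n / sampling_rate) for n in range(n_samples)]


def tone(t):
    """A hypothetical continuous signal: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


one_second_cd = sample_signal(tone, 1.0, 44_100)      # CD-quality rate
one_second_speech = sample_signal(tone, 1.0, 16_000)  # common speech-model rate
print(len(one_second_cd), len(one_second_speech))  # 44100 16000
```

The same one second of sound becomes two sequences of very different lengths. This is why data used for fine-tuning should be resampled to the rate the model was pre-trained on.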

References:

Background in transformers

karthikskumar/audio_course
