SPEECH INTERFACE CHALLAN
APPLICATION
A PROJECT REPORT
Submitted by
ABINAYA.K
GAYATHIRI.S
KIRUBAMANOHARI.R
NIVETHA.M
In partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
P. A. COLLEGE OF ENGINEERING AND TECHNOLOGY
(Autonomous)
Pollachi, Coimbatore Dt. – 642 002
AUGUST 2020
P. A. COLLEGE OF ENGINEERING AND TECHNOLOGY
BONAFIDE CERTIFICATE
Certified that this project report “SPEECH INTERFACE CHALLAN
APPLICATION” is the bonafide work of “ABINAYA.K (721716104002),
GAYATHIRI.S (721716104013), KIRUBAMANOHARI.R (721716104022),
NIVETHA.M (721716104035)” who carried out the project work under my
supervision.
SIGNATURE
Dr. D. CHITRA, M.E., Ph.D.,
Professor
HEAD OF THE DEPARTMENT
Department of Computer Science and Engineering
P. A. College of Engineering and Technology
Pollachi.

SIGNATURE
Dr. V. SUJITHA, M.E., Ph.D.,
Assistant Professor
SUPERVISOR
Department of Computer Science and Engineering
P. A. College of Engineering and Technology
Pollachi.
ACKNOWLEDGEMENT
First and foremost, we thank the GOD ALMIGHTY for blessing us with
the mental strength that was needed to carry out the project work. We thank our
Chairman Dr. P. APPUKUTTY, M.E, FIE, FIV., for his extensive support to
successfully carry out this project.
We take privilege in expressing our sincere and heartfelt thanks and
gratitude to our beloved Principal Dr. T. MANIGANDAN, M.E., Ph.D., for
providing us an opportunity to carry out this project work.
We express our heartfelt thanks to Dr. D. CHITRA, M.E., Ph.D.,
Professor and Head, Department of Computer Science and Engineering, for her
technical guidance and constructive suggestions provided throughout the project
work.
We take pride in expressing our sincere and deepest thanks to our project
guide and Coordinator Dr. V. SUJITHA, M.E., Ph.D., Assistant Professor,
Department of Computer Science and Engineering, for her technical guidance,
constructive criticism and many valuable suggestions provided throughout the
project work.
We take this opportunity to thank and pay gratitude to our Project
Coordinator Mrs. K. TAMILSELVI, M.E., Assistant Professor, Department of
Computer Science and Engineering, and all teaching and non-teaching staff
members of our department, for their encouragement and valuable suggestions.
We take this opportunity to express our gratitude to our parents, friends,
family and other members whose blessings and love have always been with us
to carry out this project work successfully.
ABSTRACT
Voice is the future. The world's technology giants are making a
beeline for voice based applications. Communicating with technological
devices via voice has become popular and natural in the engineering world.
Plebeians find difficulty in form filling, and this approach aims to
overcome the challenges they face while filling forms.
The objective of this project is to develop a modularized framework for
the development of a speech interface for a form filling application. The
speech interface provides an integrated framework for developing STT and
TTS systems across languages and dictionaries specific to various forms and
domains. The accurate working of the speech interface is achieved with the
help of dynamic downsampling, denoising and Deep Speech. Speech to text
works in the speech interface using the speech recognition module in
Python. The specific scope of this project is speech-interface-enabled
challan application form filling.
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 SPEECH RECOGNITION
1.2 VARIOUS ALGORITHMS AND METHODS FOR SPEECH RECOGNITION
1.3 SOFTWARE TECHNOLOGIES FOR SPEECH RECOGNITION
1.4 ADVANTAGES OF SPEECH RECOGNITION
1.5 DISADVANTAGES OF SPEECH RECOGNITION
2 SURVEY ON SPEECH RECOGNITION TECHNIQUES
3 SPEECH RECOGNITION
3.1 PROBLEMS IN SPEECH RECOGNITION
4 SYSTEM SPECIFICATION
4.1 HARDWARE SPECIFICATION
4.2 SOFTWARE SPECIFICATION
4.3 PYTHON
4.4 PHP
4.5 HTML
4.6 CSS
4.7 JAVASCRIPT
4.8 MYSQL
4.9 XAMPP
4.10 NOTEPAD++
5 SYNTHESIS OF SPEECH INTERFACE
5.1 OVERVIEW OF SPEECH INTERFACE
5.2 SPEECH INTERFACE SYNTHESIS MODEL
5.3 WORKFLOW OF SPEECH INTERFACE CHALLAN APPLICATION
6 IMPLEMENTATION AND RESULT
6.1 IMPLEMENTATION
6.2 SOURCE CODE
6.3 RESULT
7 CONCLUSION AND FUTURE ENHANCEMENT
7.1 CONCLUSION
7.2 FUTURE ENHANCEMENT
REFERENCES
PUBLICATIONS
LIST OF FIGURES

1.1 SPEECH RECOGNITION SYSTEM
5.1 SPEECH INTERFACE CHALLAN APPLICATION OVERVIEW
5.2 TEXT TO SPEECH SYNTHESIS
5.3 WORKFLOW OF SPEECH INTERFACE CHALLAN APPLICATION
6.1 HOME SCREEN
6.2 CHALLAN FORM
6.3 ADMIN LOGIN
6.4 VOICE GENERATOR
6.5 RECOGNITION OF SPEECH
6.6 RECORDS OF THE USER
LIST OF ABBREVIATIONS
API Application Programming Interface
CER Character Error Rate
CAPFF Context Aware Paper Form Filling
CSS Cascading Style Sheets
CMU Carnegie Mellon University
DSP Digital Signal Processing
ECMA European Computer Manufacturers Association
HTML Hyper Text Markup Language
JS JavaScript
MFCC Mel-Frequency Cepstral Coefficient
NLP Natural Language Processing
PHP Hypertext Preprocessor
RDBMS Relational Database Management System
SQL Structured Query Language
STT Speech To Text
SVM Support Vector Machine
TTS Text To Speech
XAMPP Cross-Platform, Apache, MySQL, PHP and Perl
CHAPTER 1
INTRODUCTION
1.1 SPEECH RECOGNITION
Speech recognition is an interdisciplinary subfield of computational
linguistics that develops methodologies and technologies enabling the
recognition and translation of spoken language into text by computers.
Speech recognition is the ability of a machine or program to identify words
and phrases in spoken language and convert them to a machine-readable format.
Rudimentary speech recognition software has a limited vocabulary of words and
phrases, and may only identify them when they are spoken very clearly. More
sophisticated software has the ability to accept natural speech.
Figure 1.1 Speech recognition system
1.2 VARIOUS ALGORITHMS AND METHODS FOR SPEECH RECOGNITION
a) HIDDEN MARKOV MODEL (HMM)
The Hidden Markov Model is a stochastic approach. It is simple,
computationally practical and can be trained automatically, so it is a very
popular model in speech recognition. The model is characterized by a
finite-state Markov chain and a set of output distributions. Large-vocabulary
speech recognition systems are based on HMMs and are trained automatically on
many hours of speech data.
The advantage of the HMM is that it decreases the complexity and time
needed to train a large vocabulary in a recognition system.
The limitation of the HMM is that it is often complex to examine the errors
of an HMM scheme in an effort to enhance its performance.
b) NEURAL NETWORK (NN)
Neural networks have also been used for speech recognition systems. They
are used for solving complex identification tasks. An NN-based system
provides better accuracy than an HMM when the training data and vocabulary
are limited.
The advantage of this approach is that neural networks can handle
low-quality, noisy data and are speaker independent.
The disadvantage of the NN approach is that selecting an optimal
configuration is not easy.
c) DYNAMIC TIME WARPING (DTW)
Dynamic Time Warping is an algorithm used for calculating the similarity
between two series that may differ in time or speed. It has been applied
to video, audio, graphics and any data that can be turned into a linear
representation and analyzed with DTW. The optimization is performed using
dynamic programming, hence the name Dynamic Time Warping.
The advantage of DTW is that it works well for a small number of
templates and is independent of language.
The disadvantage is that only a limited number of templates can be used.
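A minimal sketch of the classic DTW distance between two feature sequences,
with a Euclidean local cost; both inputs are assumed to be NumPy arrays of
shape (length, dimensions):

import numpy as np

def dtw_distance(a, b):
    # Dynamic-programming table: D[i][j] is the best cost of aligning
    # the first i frames of a with the first j frames of b.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],        # insertion
                                 D[i, j - 1],        # deletion
                                 D[i - 1, j - 1])    # match
    return D[n, m]

# A template matcher picks the stored template nearest to the utterance:
# best_word = min(templates, key=lambda w: dtw_distance(features, templates[w]))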
d) VECTOR QUANTIZATION (VQ)
VQ is a function that maps voice samples from a large vector space to a
finite number of regions in that space. Each region is called a cluster,
and each cluster is represented by its center, called a centroid or code
word. A codebook is a collection of code words. In the VQ method, a
codebook is generated for each speaker. The codebook then serves as the
prerecorded words for that speaker and is used when the speaker is tested
in the system. Voice recognition systems use VQ to attain a high speaker
recognition rate, measured in terms of accuracy, processing time, number of
speakers and size of the training database.
The advantage of this technique is that it saves a lot of time during
the testing phase and decreases the storage and computation effort needed
to resolve the resemblance of spectral analysis vectors.
The disadvantage is that both the storage space and the time needed to
perform the quantization grow exponentially with the number of dimensions.
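A minimal sketch of the codebook idea, using k-means from scikit-learn as
the clustering step; the 32-code-word size and the distortion-based decision
rule are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(features, n_codewords=32):
    # Cluster a speaker's training frames; the cluster centers
    # (centroids) are the speaker's code words.
    km = KMeans(n_clusters=n_codewords, n_init=10).fit(features)
    return km.cluster_centers_

def distortion(codebook, features):
    # Average distance from each test frame to its nearest code word.
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

# The claimed speaker is the one whose codebook quantizes the test
# utterance with the least distortion:
# speaker = min(codebooks, key=lambda s: distortion(codebooks[s], test_mfcc))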
1.3 SOFTWARE TECHNOLOGIES FOR SPEECH RECOGNITION
a) DRAGON NATURALLY SPEAKING
Best as an overall dictation and voice recognition software. It is also
called Dragon for PC. It can be used for personal as well as official
purposes. Dragon Home can be used by anyone, from students to daily
multi-taskers. Dragon Professional Individual is useful for professional
individuals and small businesses.
b) GOOGLE NOW
Best suited for Android mobile devices. Google Now is a feature of Google
Search in the Google app. This feature is available for Android and iOS
devices. Though it is available for iOS devices, it works best on Android
devices.
c) GOOGLE CLOUD SPEECH API
Best at recognizing 120 languages. The Google Cloud Speech API can be used
for short-form and long-form audio. It can be used for the processing of
real-time streaming and pre-recorded audio. It automatically transcribes
proper nouns, dates, and phone numbers.
d) GOOGLE DOCS VOICE TYPING
Best suited for dictation in Google Docs. Google Docs Voice Typing is
integrated with Google Suite and hence is the perfect dictation and voice
recognition tool to pair with Google Suite. It is indeed a very
cost-effective solution.
e) SIRI
Best suited for iOS mobile devices. Siri is the virtual assistant for Apple
devices. 21 languages are supported by Siri. It comes pre-installed on
Apple devices and can respond in its own voice.
f) AMAZON LEX
Best suited for creating a chat-bot. Amazon Lex is used in applications to
build a conversational interface. The developed bot can be used in chat
platforms, IoT devices, and mobile clients.
g) MICROSOFT BING SPEECH API
Best for accuracy and ease of use. The Microsoft Bing Speech API is used to
transcribe speech into text. This transcribed text can be displayed by the
application, or the application can respond or act according to the command.
It can also perform text to speech conversion in many different languages.
h) CORTANA
Best suited for Windows users. Cortana is a virtual assistant that comes
with Windows 10 systems and Windows Phone. It is also available for Android
and iOS devices.
1.4 ADVANTAGES OF SPEECH RECOGNITION
One of the most notable advantages of speech recognition technology is
the dictation ability it provides.
Speech recognition can allow documents to be created faster, because the
software generally produces words as fast as they are spoken, which is
generally much faster than a person can type.
Dictation solutions are used not only by individuals but also by
organizations with heavy transcription tasks, such as healthcare and legal.
With the help of speech recognition technology, callers can input
information such as their name, account number and the reason for their
call without interacting with a live agent, instead of remaining idle on
hold while agents are busy.
1.5 DISADVANTAGES OF SPEECH RECOGNITION
With many speech recognition devices, even after a long training period,
many people find that they still speak in an unnatural way and
over-enunciate words.
While voice recognition technology recognizes most words in the English
language, it still struggles to recognize names and slang words.
These devices often take a moment to register what is being said, which
can be frustrating and interrupt the flow of thought.
CHAPTER 2
SURVEY ON SPEECH RECOGNITION TECHNIQUES
The existing speech recognition techniques focus on bringing forth more
accurate conversion of speech to text with the help of various programming
languages and technologies. The following are some of the existing systems
that explain speech recognition using various techniques.
Akitada Omagari et al. (2019) noted that with the developments in deep
learning, the security of neural networks against vulnerabilities has
become one of the most urgent research topics in deep learning. There are
many types of security countermeasures. Adversarial examples and their
defense methods, in particular, have been well studied in recent years. An
adversarial example is designed to make neural networks misclassify or
produce inaccurate output. Audio adversarial examples are a type of
adversarial example where the main target of attack is a speech-to-text
transcription neural network. The study proposes a new defense method
against audio adversarial examples for speech-to-text transcription neural
networks, since it is difficult to determine whether an input waveform
representing the sound of a voice is an audio adversarial example. The main
framework of the proposed defense method is based on a sandbox approach.
The defense method was evaluated using actual audio adversarial examples
created on Deep Speech, a speech-to-text transcription neural network, and
the authors confirmed that it can identify audio adversarial examples to
protect speech-to-text systems.
Yuan Jiang et al. (2019) presented methods of making use of text
supervision to improve the performance of sequence-to-sequence (seq2seq)
voice conversion. Compared with conventional frame-to-frame voice
conversion approaches, the seq2seq acoustic modeling method proposed in
their previous work achieved higher naturalness and similarity. The paper
further improves its performance by utilizing the text transcriptions of
parallel training data. First, a multi-task learning structure is designed
which adds auxiliary classifiers to the middle layers of the seq2seq model
and predicts linguistic labels as a secondary task. Second, a
data-augmentation method is proposed which utilizes text alignment to
produce extra parallel sequences for model training. Experiments are
conducted to evaluate the proposed method with training sets of different
sizes. Experimental results show that multi-task learning with linguistic
labels is effective at reducing the errors of seq2seq voice conversion. The
data-augmentation method can further improve the performance of seq2seq
voice conversion when only 50 or 100 training utterances are available.
Snezhana et al. (2019) observed that voice command recognition tasks use
limited sets of words, in contrast to universal speech recognition systems
designed to work with the whole vocabulary of one or more natural
languages. These universal speech recognition systems are usually based on
cloud technologies, artificial intelligence and often on neural networks
with deep learning. The main drawback of using such universal systems for
tasks like voice command recognition is the unnecessary search for the
limited set of words in databases containing a very large set of words of a
chosen natural language. The goal of the article is to combine the
advantages of universal speech recognition systems using cloud
technologies, artificial intelligence and neural networks with deep
learning in voice command recognition tasks, while creating and using a
reduced database as an appropriate subset of the large speech recognition
databases existing as cloud databases.
Ananya et al. (2018) proposed a natural language processing technique. It
is a widely used technique by which systems can understand instructions for
manipulating text or speech. In the paper, a text-to-speech synthesizer is
developed that converts text into spoken words by analyzing and processing
it using Natural Language Processing (NLP) and then using Digital Signal
Processing (DSP) technology to convert the processed text into a
synthesized speech representation. The authors developed a useful
text-to-speech synthesizer in the form of a simple application that
converts input text into synthesized speech, reads it out to the user, and
can save it as an mp3 file.
Burhanuddin et al. (2018) proposed an offline voice-to-text transcription
system for healthcare organizations which can be used by counsellors and
NGOs to record conversations during surveys, convert them into text and
save them. The system includes an open-source application. It uses the CMU
Sphinx toolkit for speech recognition and supports multi-language
recognition. The CMU Sphinx toolkit utilizes an acoustic model, a phonetic
dictionary and a language model. The user records his/her voice through the
mobile application; recognition and transcription are then done through the
CMU Sphinx toolkit. The transcription is saved as a text file in the device
memory, which the user can upload to, and retrieve from, the database
server through the application.
Jiangtao Wang et al. (2017) introduced a context-aware system, named
CAPFF, for helping people fill paper forms, mainly in two contexts: 1)
people have no idea of what should be filled in certain form fields; 2)
people are not aware of the mistakes they are likely to commit in entering
information, which may violate data entry constraints. In the offline
phase, CAPFF provides a tool to build knowledge about a given form,
including instructions, field-level examples, and constraints among form
fields. In the online phase, when people set out to fill a paper form, the
video camera of the system determines the position of the pen and then
provides assistance based on the user's form filling context. The authors
evaluated CAPFF's performance through 450 paper form filling activities,
and the results show that the proposed CAPFF is effective in terms of both
accuracy and response time.
Yogita et al. (2016) proposed a multilingual speech-to-text conversion
system. Conversion is based on information in the speech signal. Speech is
the natural and most important form of communication for human beings. A
Speech-To-Text (STT) system takes a human speech utterance as input and
produces a string of words as output. The objective of the system is to
extract, characterize and recognize the information in speech. The proposed
system is implemented using the Mel-Frequency Cepstral Coefficient (MFCC)
feature extraction technique, with Minimum Distance Classifier and Support
Vector Machine (SVM) methods for speech classification. Speech utterances
are pre-recorded and stored in a database, which is mainly divided into two
parts: testing and training. Samples from the training database pass
through the training phase and features are extracted; combining the
features of each sample forms a feature vector which is stored as a
reference. A sample to be tested from the testing part is given to the
system and its features are extracted. The similarity between these
features and the reference feature vectors is computed, and the words with
maximum similarity are given as output. The system is developed in the
MATLAB (R2010a) environment.
Li Deng et al. (2013) presented an overview of the invited and contributed
papers at the ICASSP special session entitled "New Types of Deep Neural
Network Learning for Speech Recognition and Related Applications," as
organized by the authors. They also describe the historical context in
which acoustic models based on deep neural networks were developed. The
technical overview of the papers presented in the special session is
organized into five ways of improving deep learning methods: (1) better
optimization; (2) better types of neural activation functions and better
network architectures; (3) better ways to determine the myriad
hyperparameters of deep neural networks; (4) more appropriate ways to
preprocess speech for deep neural networks; and (5) ways of leveraging
multiple languages or dialects that are more easily achieved with deep
neural networks than with Gaussian mixture models.
Brad Myers et al. (2011) proposed a multimodal approach to interactive
recovery from speech recognition errors for the design of speech user
interfaces. They propose a framework to compare various error recovery
methods, arguing that a rational user will prefer interaction methods which
provide an optimal trade-off between accuracy, speed and naturalness. They
describe a prototypical implementation of multimodal interactive error
recovery and present results from a preliminary evaluation on form filling
and speech-to-speech translation tasks.
Lei Xie et al. (2010) demonstrated recent progress on speech and auditory
technologies for potential ubiquitous, immersive and personalized
applications. The first demo shows an intelligent spoken question answering
system, which enables users to interact with a talking avatar via natural
speech dialogues. The prototype system demonstrates the latest developments
in automatic speech recognition, keyword spotting, personalized
text-to-speech synthesis and visual speech synthesis. The second demo
exhibits a virtual concert with immersive audio effects. Through the
virtual auditory technology, listeners wearing simple earphones are able to
experience immersive concert audio effects from an ordinary music file. The
authors believe the technologies shown in the two demos can be easily
deployed in many significant applications.
Shih Jung Ping et al. (2007) proposed that application systems utilizing
recognition technologies, such as speech recognition, provide a
human-machine interface that could help people operate system devices more
easily, or help those who are physically unable to interact with computers
through traditional input devices such as a mouse or keyboard. Speech
recognition technology is widely used between device interfaces and humans.
The common method of adding speech recognition functionality to devices is
through low-level programmed wrappers; for that, one must first obtain the
source code of the system and have programming knowledge. The speech
commands are pre-defined for particular application systems, so users
cannot set or modify the commands as they wish; even for the designer,
adding or deleting speech commands is not easy and is time consuming. The
research provides a general interfacing framework under which users can set
or modify speech commands conveniently and easily, without needing the
detailed system code, system design or programming knowledge. After speech
commands are set by the end user, the interface stores these commands in a
database. When the end user speaks a command through the interface, the
interface analyzes and recognizes it and then interacts directly with
application systems by calling API functions to control mouse movement and
keyboard pressing. The proposed interfacing framework can be applied to
GUI-based commercial software under a Windows environment; users can
interact with most application systems by controlling mouse moving, mouse
jumping, mouse clicking, keyboard pressing and compound keyboard pressing
through speech commands, just as they would normally. Finally, some
examples demonstrate the applicability and feasibility of the proposed
interfacing framework.
LIMITATIONS
From the survey, two important challenges are identified that have not
been well addressed in current form filling applications. The first
challenge is that the form filling application does not have a speech
interface with voice-based instructions for the user to fill the form
conveniently. The second challenge is that perturbations in the audio used
for speech to text conversion cause the output text to deviate from the
original text.
CHAPTER 3
SPEECH RECOGNITION
3.1 PROBLEMS IN SPEECH RECOGNITION
Smart speakers involving speech recognition applications and systems,
such as Siri, Google Home and Alexa, have gained popularity in recent
years. The increased use of smart speakers has opened up new application
areas, and they are put to practical use in many real-world fields.
However, speech recognition is vulnerable and open to the public:
unauthorized access to personal information, illegal computer access and
unauthorized falsification may occur through smart speakers. Audio
adversarial examples are a type of adversary where the main target of
attack is a speech-to-text transcription system. The main framework of the
defense method considered here is based on the sandbox approach: programs
and data received from outside are executed and used in a restricted area
to prevent invalid operations in the internal system. The defense method
has two main steps: finding perturbations, and a comparison step.
Perturbations added to the input waveform to trick the speech-to-text
transcription sound like slight noise; downsampling significantly weakens
these perturbations. The output of the first step is given as input to the
speech recognizer, and the result is compared with the original input using
the Character Error Rate (CER). If the CER value is larger than a
user-given threshold, the input waveform is detected as an adversarial
example.
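A minimal sketch of the comparison step, where the CER is the Levenshtein
edit distance between the two transcripts normalized by the reference
length; transcribe() and downsample() stand in for the speech-to-text call
and the downsampling step (both hypothetical names), and the 0.3 threshold
is illustrative:

def edit_distance(a, b):
    # Levenshtein distance by dynamic programming over a single row.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (ca != cb))  # substitution
    return d[len(b)]

def character_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def is_adversarial(audio, threshold=0.3):
    original = transcribe(audio)              # hypothetical STT call
    cleaned = transcribe(downsample(audio))   # hypothetical preprocessing
    return character_error_rate(original, cleaned) > threshold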
CHAPTER 4
SYSTEM SPECIFICATION
4.1 HARDWARE SPECIFICATION
Hardware requirements for developing the application:
System : Intel 2.4 GHz
Hard disk : 256 GB
Monitor : 15" VGA colour
Mouse : Logitech
RAM : 2 GB
Keyboard : 110 keys enhanced
Speed : 2.40 GHz

Hardware requirements for running the application:
Device : Phone/Tablet/PC
RAM : 2 GB
Screen : 249*320 mdpi / 15" VGA colour monitor
Keyboard : 110 keys enhanced
Speed : 1 GHz or higher
4.2 SOFTWARE SPECIFICATION
Software requirements for developing the application:
Application server : XAMPP
IDE : Notepad++, Python 3.6.2
Operating system : Windows 7/8/8.1/10
Coding languages : PHP, HTML, CSS, JavaScript, Python
Database : MySQL

Software requirements for running the application:
Operating system : Windows 7/8/8.1/10
Network : Wi-Fi or cellular network
4.3 PYTHON
Python is a high-level, interpreted, general-purpose dynamic programming
language that focuses on code readability. Python's syntax helps
programmers code in fewer steps compared to Java or C++. Python is widely
used in bigger organizations because it supports multiple programming
paradigms.
Speech recognition is the process of converting spoken words to text.
Python supports many speech recognition engines and APIs, including the
Google Speech Engine, Google Cloud Speech API, Microsoft Bing Voice
Recognition and IBM Speech to Text.
Libraries in Python for speech to text conversion:
Pandas - In computer programming, pandas is a software
library written for the Python programming language for data manipulation
and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released
under the three-clause BSD license. The name is derived from the term "panel
data", an econometrics term for data sets that include observations over
multiple time periods for the same individuals.
Sklearn - Scikit-learn (formerly scikits.learn and also known as sklearn)
is a free software machine learning library for the Python programming
language. It features various classification, regression and clustering
algorithms including support vector machines, random forests, gradient
boosting, k-means and DBSCAN, and is designed to interoperate with the
Python numerical and scientific libraries NumPy and SciPy.
Matplotlib - Matplotlib is a plotting library for the Python programming
language and its numerical mathematics extension NumPy. It provides
an object-oriented API for embedding plots into applications using general-
purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also
a procedural "pylab" interface based on a state machine (like OpenGL),
designed to closely resemble that of MATLAB, though its use is
discouraged. SciPy makes use of Matplotlib.
Keras - Keras is an open-source neural-network library written in Python.
Keras contains numerous implementations of commonly used neural-network
building blocks such as layers, objectives, activation functions and
optimizers, and a host of tools to make working with image and text data
easier, simplifying the coding necessary for writing deep neural network
code.
TensorFlow - TensorFlow is a free and open-source software library for
dataflow and differentiable programming across a range of tasks. It is a
symbolic math library, and is used for machine learning applications such
as neural networks. It is used for both research and production at Google.
PyAudio - PyAudio provides Python bindings for PortAudio, the
cross-platform audio I/O library. With it, Python can easily play and
record audio on a variety of platforms. In this project, PyAudio is used to
capture audio at a known sampling frequency and to filter out low-frequency
audio samples.
SpeechRecognition - To convert speech to text, the Recognizer class from
the speech_recognition module is used. To recognize speech from an audio
file, an object of the AudioFile class of the speech_recognition module is
created, and the path of the audio file to be translated into text is
passed to the constructor of AudioFile.
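A minimal sketch of file-based recognition with this module; the file name
challan_input.wav is an illustrative placeholder:

import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("challan_input.wav") as source:
    audio = r.record(source)          # read the whole file into an AudioData
text = r.recognize_google(audio)      # send it to the Google Web Speech API
print(text)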
4.4 PHP
PHP is a widely used, open-source scripting language for scripts that are
executed on the server, and it is freeware. It is a server-side scripting
language used to develop attractive and dynamic web pages. PHP is a
recursive acronym for "PHP: Hypertext Preprocessor". PHP is embedded in
HTML and is used to manage dynamic content, databases and session tracking,
and even to build entire e-commerce sites. It is integrated with a number
of popular databases, including MySQL, PostgreSQL, Oracle, Sybase,
Informix, and Microsoft SQL Server.
To develop and run PHP web pages, three vital components need to be
installed on the computer system.
Web Server − PHP works with virtually all web server software, including
Microsoft's Internet Information Server (IIS), but the most often used is
the freely available Apache server.
Database − PHP works with virtually all database software, including
Oracle and Sybase, but the most commonly used is the freely available
MySQL database.
PHP Parser − To process PHP script instructions, a parser must be
installed to generate HTML output that can be sent to the web browser.
4.5 HTML
HTML (Hypertext Markup Language) is the code that is used to structure a
web page and its content. HTML is not a programming language; it is a
markup language that defines the structure of the content. HTML consists of
a series of elements, which are used to enclose, or wrap, different parts
of the content to make it appear or act a certain way. Attributes contain
extra information about the element that will not appear in the actual
content. The class attribute allows an element to be given an identifier
that can be used later to target the element with style information and
other things.
4.6 CSS
Cascading Style Sheets, fondly referred to as CSS, is a simple design
language intended to simplify the process of making web pages presentable.
CSS handles the look and feel part of a web page. CSS can control the color
of the text, the style of fonts, the spacing between paragraphs, how columns
are sized and laid out, what background images or colors are used, layout
designs and variations in display for different devices and screen sizes as
well as a variety of other effects.
4.7 JAVA SCRIPT
JavaScript, often abbreviated as JS, is a programming language that
conforms to the ECMAScript specification. JavaScript is high-level, often
just-in-time compiled, and multi-paradigm. It has curly-bracket syntax,
dynamic typing, prototype-based object-orientation, and first-class
functions.
4.8 MYSQL
MySQL is the most popular open-source relational SQL database management
system. MySQL is one of the best RDBMSs for developing various web-based
software applications. MySQL is developed, marketed and supported by MySQL
AB, a Swedish company, and is released under an open-source license. MySQL
is a very powerful program in its own right. It handles a large subset of
the functionality of the most expensive and powerful database packages.
MySQL uses a standard form of the well-known SQL data language. MySQL works
on many operating systems and with many languages including PHP, PERL, C,
C++, JAVA, etc. MySQL works very quickly and works well even with large
data sets. MySQL is very friendly to PHP, the most appreciated language for
web development.
4.9 XAMPP
XAMPP stands for Cross-Platform (X), Apache (A), MariaDB (M), PHP (P) and
Perl (P). Since XAMPP is a simple, lightweight Apache distribution, it is
extremely easy for developers to create a local web server for testing and
deployment purposes.
4.10 NOTEPAD++
Notepad++ is a free, open-source text editor available for Windows that
can be used for any type of text editing, coding and web-design-related
source code editing. The default Notepad++ installation comes with a couple
of plugins such as the plugin manager, converter, MIME tools and NPPExport.
TextFX is a highly versatile Notepad++ plugin that simplifies a lot of text
editing tasks while coding or editing text documents; it allows changing
letter cases, converting or replacing characters, and tidying HTML code.
When editing CSS style files, or when the project being worked on involves
colors, the ColorPicker plugin simplifies the task of finding colors and
their codes. ColorPicker has its own color picking interface, but it also
uses the default color selection window of MS Paint.
CHAPTER 5
SYNTHESIS OF SPEECH INTERFACE
5.1 OVERVIEW OF SPEECH INTERFACE
A speech interface is an artificial means of providing an audio
instruction based on the user's form filling context, and of converting the
user's speech into text in the particular text field of the form using the
speech_recognition module in Python. This project is aimed at developing
and deploying a speech interface for a form-filling application, replacing
the traditional keyboard for entering responses. The proposed system
interacts with the user in order to fill in the responses. The speech
interface system prompts the user with queries generated using a
text-to-speech synthesis system, recognizes the user's responses using a
speech recognition system, and fills the form. The audio files are stored
in mp3 format. The audio instruction helps the user know the content to be
filled in the text area near the context.
The speech interface has a modularized framework for filling different
fields in forms, like account number, account holder name, amount etc., so
it can be easily deployed across domains. The user's speech is converted to
text using the speech_recognition module in Python. An Automatic Speech
Recognition (ASR) system has mainly three tasks −
Speech recognition, which allows the machine to catch the words, phrases
and sentences the user speaks;
Natural language processing, which allows the machine to understand the
words spoken by the user; and
Speech to text, which allows the machine to convert the processed audio
inputs into text.
The user's speech signals are captured with the help of a microphone and
then have to be understood by the system. To capture the microphone input
and down-sample the audio input, the PyAudio package is imported. PyAudio
is an open-source, cross-platform package that allows audio to be played
and recorded; it provides Python bindings for PortAudio. PyAudio is used
here to read audio at a known sampling frequency and to filter
low-frequency audio samples.
Recognizing speech requires audio input, and the SpeechRecognition
library makes retrieving text from that audio input straightforward. The
library acts as a wrapper for several popular speech APIs. Speech
recognition happens with the Recognizer class, whose primary purpose is to
recognize speech. A recognize_*() method throws an UnknownValueError
exception when the audio input cannot be recognized, and a RequestError
when the speech API cannot be reached. In this way the user's input is
filled into the appropriate text field with accurate conversion of speech
to text. On completion of all the text fields in the form, the user can
submit the details, which are then stored in the database.
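A minimal sketch of this capture-and-recognize step, including the two
exception types named above; the ambient-noise calibration call is an
illustrative addition:

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)   # calibrate against background noise
    audio = r.listen(source)             # record until the speaker pauses
try:
    text = r.recognize_google(audio)     # text to fill into the form field
except sr.UnknownValueError:
    text = ""                            # the speech was unintelligible
except sr.RequestError:
    text = ""                            # the speech API was unreachable
print(text)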
The TTS system works through the Google Text to Speech API. Google Text
to Speech is one of the best TTS APIs, because it generates audio
approximately similar to a human voice. Some STT systems require
"training", also called "enrollment", where an individual speaker reads
text or isolated vocabulary into the system. The system analyzes the
person's specific voice and uses it to fine-tune the recognition of that
person's speech, resulting in increased accuracy. Systems that do not use
training are called "speaker independent" systems; systems that use
training are called "speaker dependent".
An STT system consists of one interface as the front end and another as
the back end. The first interface divides the single audio input into
several fragments, and each fragmented audio input then undergoes dynamic
downsampling. The second interface divides the output of the first
interface into small fragments and removes the low-frequency sound. The
perturbations are thereby eliminated, and the conversion of speech to text
becomes accurate.
5.2 SPEECH INTERFACE SYNTHESIS MODEL
The speech interface first provides the audio instruction to the user and
then gets speech as input from the user. The conversion of speech to text
takes place in several steps.
Figure 5.1 Speech Interface challan application overview
User: Plebeians, or persons who have difficulty in filling the form, need
either a human assistant or the speech interface.
Challan application: The speech interface is designed for the challan
form application. The challan application has three pages, namely the home
page, the challan form and the admin page.
Home page: A home page is generally the main web page a visitor sees when
navigating to a website. The home page is designed with HTML, PHP and CSS.
The page displays the welcome greeting for working with the speech
interface.
Challan form: The challan form is the front end shown to the user in a
clearly understandable format. The form used here is an amount-depositing
challan form. The challan form consists of fields like:
Account holder name: The name of the person who accepts legal
responsibility for handling the account.
Account number: A unique string of numbers, letters and characters that
identifies the owner of a service and permits access to it.
Date: The date on which the person deposits or remits the amount.
Branch: The physical location of a banking corporation.
Phone number: The sequence of digits that helps to communicate with
people.
Amount: The cash to be deposited, entered in Indian rupees.
Remitter's name: The name of the person who sends the payment.
Admin login: The admin login has a set of credentials used to
authenticate a user; it consists of a username and a password. The admin
registers a username and password. A username is a name that uniquely
identifies someone on a computer system, and it is almost always paired
with a password. While logging in, the username may appear on the screen,
but the password is kept secret. By keeping the password private, the
details of the users are secured. The admin has login access to check the
details of the users who have used the speech interface challan
application.
Voice generator: The man-machine communication in the speech interface
challan application is initiated by providing an audio instruction to the
user. The voice generator helps in generating the appropriate audio
instruction for the particular field. Google Text to Speech is one of the
best TTS APIs, because it generates audio approximately similar to a human
voice, while other APIs generate audio with a metallic or robotic voice.
Figure 5.2 Text to Speech Synthesis
A TTS system or "voice generation engine" is composed of one interface as
a front end and another as a back end. The first interface converts raw
text containing alphanumeric symbols, like numbers and abbreviations, into
the equivalent written-out words. Here, the analysis of text includes
various features such as recognition of text units, normalization of text
unit patterns, pre-processing, etc. The front end assigns a phonetic
transcription to each unit, then divides and marks the text to form a
speech tree or pattern tree; using the speech units, it configures the tune
and rhythm through phrases, clauses and sentences. This process of
transcription is known as text-to-phoneme (TTP) or grapheme-to-phoneme
(GTP) conversion. Together, these conversions produce the symbolic
linguistic representation that is the desired output of the front end. The
back end then converts the symbolic linguistic representation into sound;
sometimes this back end also computes pitch analysis, contour analysis,
rhythm analysis, etc., for the output speech.
Figure 5.2 explains the system of text to speech conversion. This system
accepts text units as input and produces DSP units as output. The text to
speech system consists of:
Text analysis: Text analysis is about parsing texts in order to extract
machine-readable facts from them. The purpose of text analysis is to create
structured data out of free text content. The process can be thought of as
slicing and dicing heaps of unstructured, heterogeneous documents into
easy-to-manage and interpretable data pieces. The text analysis consists of
a preprocessor, morphological analysis, contextual analysis and a syntactic
prosodic parser.
Preprocessor: A function that takes text and returns text. Its goal is to
modify the text, for example correcting pronunciation, and/or to prepare
the text for proper tokenization, for example ensuring spacing after
certain characters.
Morphological analysis: Morphological analysis is the elimination of
contradictory statements from a large space of possibilities by systematic
search.
Morphology is the study of morphemes and their arrangements in forming
words. Morphemes are the minimal meaningful units; they may constitute
words or parts of words.
Contextual analysis:Contextual analysis is a method of studying text and
its cultural, social, or political context. It is often used by historians, art critics, or
sociologists. Context analysis in NLP involves breaking down sentences into n-
grams and noun phrases to extract the themes and facets within a collection of
unstructured text documents.
Syntactic prosodic parser:Prosody is concerned with those elements of
speech that are not individual phonetic segments (vowels and consonants) but
are properties of syllables and larger units of speech, including linguistic
functions such as intonation, tone, stress, and rhythm. Such elements are known
as suprasegmentals.
Letter sound unit: The discrete units that occur in a sequence of sounds
can be broken down into phonemes, syllables or words in spoken language.
The letter sound unit holds the data of audio sounds matched with the
phonemes.
Prosody generator: The prosody generator generates prosody information
for implementing highly natural speech synthesis without unnecessarily
collecting large quantities of learning data. The prosody information is
generated using one of two methods: the first generates the prosody
information using a statistical technique; the second generates it using
rules based on heuristics.
Speech to Text (STT): The user response is taken as input to the STT. The
audio input is converted to text in two processes, with the help of Python
libraries, for accurate conversion. The speech recognition system works on
the structure of audio signals. The steps followed when working with audio
signals are as follows −
Recording: Read the audio signal from a file, or record it using a
microphone.
Sampling: When recording with a microphone, the signals are stored in a
digitized form. But to work on them, the machine needs them in discrete
numeric form. Hence, the signal is sampled at a certain frequency and
converted into discrete numerical form. Choosing a high sampling frequency
implies that when humans listen to the signal, they perceive it as a
continuous audio signal. Characterizing an audio signal involves converting
the time-domain signal into the frequency domain and understanding its
frequency components; a mathematical tool like the Fourier Transform is
used to perform this transformation. This is the most important step in
building a speech recognizer, because after converting the speech signal
into the frequency domain it must be converted into a usable feature
vector, using feature extraction techniques like MFCC, PLP, PLP-RASTA etc.
The Google Speech API in Python makes this happen. The following packages
are installed for this −
PyAudio − can be installed using the pip install PyAudio command.
SpeechRecognition − can be installed using pip install SpeechRecognition.
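A minimal sketch of the frequency-domain step with NumPy's Fourier
transform; the 16 kHz sampling rate and the 440 Hz test tone are
illustrative:

import numpy as np

rate = 16000                                    # samples per second
t = np.arange(rate) / rate                      # one second of sample times
signal = np.sin(2 * np.pi * 440 * t)            # a 440 Hz test tone

spectrum = np.abs(np.fft.rfft(signal))          # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), 1 / rate)  # frequency of each FFT bin
print(freqs[np.argmax(spectrum)])               # strongest component: ~440 Hz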
In this speech interface application, perturbation elimination is done by
two processes, namely dynamic downsampling and denoising.
Dynamic downsampling: Dynamic downsampling is the process of making the
digital audio signal smaller by lowering its sampling rate or sample size
(bits per sample).
1. The audio input waveform is considered as 'x', and the window frame
size is considered as 'n'.
2. The input 'x' is divided into smaller fragments based on the window
size: Sx = len(x)/n.
3. The fragmented audio samples are subjected to downsampling. The
downsampled audio fragments are denoted as dSxi.
4. The output of the dynamic downsampling is obtained by the XOR
operation on the downsampled audio fragments:
Y = dSx1 ⊕ dSx2 ⊕ dSx3 ⊕ … ⊕ dSxn
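A minimal sketch of these four steps on a one-dimensional array of integer
PCM samples (XOR requires integer data); the frame size and the
factor-of-two decimation are illustrative choices:

import numpy as np

def dynamic_downsample(x, n):
    # Step 2: split x into Sx = len(x)/n fragments of n samples each.
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    # Step 3: downsample each fragment (here by keeping every 2nd sample).
    down = [f[::2] for f in frames]
    # Step 4: Y = dSx1 xor dSx2 xor ... xor dSxn.
    y = np.zeros_like(down[0])
    for d in down:
        y = np.bitwise_xor(y, d)
    return y

# Usage with 16-bit PCM samples:
# x = np.frombuffer(wave_bytes, dtype=np.int16)
# y = dynamic_downsample(x, n=512)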
Denoising: Denoising is any signal processing method that reconstructs a
signal from a noisy one. Its goal is to remove noise and preserve useful
information. In this process:
1. Divide the downsampled audio fragments into two parts.
2. The first audio part is taken as c(r1) and the second audio part is
taken as c(r2); compare c(r1) with c(r2), and if c(r1) is longer than
c(r2), the audio of c(r1) is again divided into two parts.
3. The division step is repeated until the downsampled audio is divided
into many small fragments.
4. In all the tiny fragments, the low frequency sounds are eliminated.
Thus the noise in the audio is eliminated, and the tiny fragments are fused
together.
The downsampled and denoised audio is converted into text. These
processes are done with the PyAudio and SpeechRecognition packages in
Python.
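A minimal sketch of this fragment-filter-fuse process, using recursive
halving for the split and an FFT-based cut for the low-frequency removal;
the 256-sample fragment limit and the 100 Hz cut-off are illustrative
assumptions:

import numpy as np

def split(audio, limit=256):
    # Steps 1-3: recursively halve until every fragment is small enough.
    if len(audio) <= limit:
        return [audio]
    mid = len(audio) // 2
    return split(audio[:mid], limit) + split(audio[mid:], limit)

def remove_low_freq(fragment, rate=16000, cutoff=100.0):
    # Step 4: zero out the low-frequency components of one fragment.
    spectrum = np.fft.rfft(fragment)
    freqs = np.fft.rfftfreq(len(fragment), 1 / rate)
    spectrum[freqs < cutoff] = 0
    return np.fft.irfft(spectrum, len(fragment))

def denoise(audio):
    # Filter each tiny fragment, then fuse the fragments back together.
    return np.concatenate([remove_low_freq(f) for f in split(audio)])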
Database: A database is an organised collection of data. Python, PHP and
MySQL are used to design the database in the speech interface challan
application. The required tables with input fields are created for managing
the records efficiently.
5.3 WORKFLOW OF SPEECH INTERFACE CHALLAN APPLICATION
The speech interface is a modularized framework for the challan
application. It is designed to be comfortable for the user. Paper form
filling can be difficult for plebeians, whereas the speech interface guides
them through form filling in an easy way. The speech interface provides
voice-based instructions for the user before filling each text field. The
user fills the form by providing voice as input to the speech interface;
the speech recognition system in the speech interface converts the speech
to text, and the text is filled into the appropriate text box field. Once
all the necessary fields are filled, the user can submit the form.
The first page of this website is the home screen. The home screen
displays the welcome message for the speech interface challan application.
The top of the screen has two buttons that navigate to the challan form
page and the user details page.
The user clicks the challan button on the home screen and navigates to
the challan form page. The challan form (an amount depositing form)
displays on the user's screen with the fields account holder name, account
number, date, place, phone number, amount in words, and remitter's name.
The user clicks the play button to hear the voice-based instructions,
which provide the details of what is to be filled in the field.
[Figure 5.3 is a flowchart: the user accesses the speech interface challan
application website; the challan form displays on the user's screen; the
speech interface plays a welcome message and the user specifies the task;
the user clicks the play button and the voice generator generates the
instruction; speech input is provided by clicking the speak button; once
all 7 fields are filled, an acknowledgement is displayed and the user's
data is stored in the database, where the admin can view it.]
Figure 5.3 Workflow of speech interface challan application
Figure 5.3 represents the workflow of the speech interface challan
application.
Once the user understands the detail to be filled in the field, the user
clicks the speak button and provides the speech as input. The speech
recognition system in the speech interface starts converting the speech
into text, and the text is displayed in the appropriate field.
In case any wrong details are filled into the form, the user can click
the refresh button at the bottom of the screen, which empties all the text
fields.
On completion of all the fields in the form with correct details, the
user clicks the deposit button at the bottom of the page.
On successful submission, an acknowledgement dialog box appears with the
message "Amount is deposited successfully".
The user details are saved in the database and can be viewed by the admin
by navigating to the admin page from the home screen. Only the admin knows
the user id and password to open the admin page, so the user details are
kept secure. On entering the admin page, a table is displayed with the user
details stored in the database.
CHAPTER 6
IMPLEMENTATION AND RESULT
6.1 IMPLEMENTATION
Our designed web application is called the speech interface challan
application, with the speech to text functionality. The system was developed
using PHP, Python 3.6.2, HTML, CSS, Java script.
The application is divided into two main modules. The first module
which includes the basic GUI components which handles the basic operations
of the application such as generating audio instruction, receiving user speech
(audio) as input. The second module, the main conversion engine which
integrated into the main module is for the acceptance of audio input, hence
the conversion.
Speech interface (STT) converts speech to text by receiving the audio input
from the user’s speech. The python library packages start the conversion of
speech to text. The recognition of speech takes 20 to 30 seconds and finally
the text will be displayed in the appropriate text field. STT shows an
exceptional error when the speech cannot be recognized by the STT engine.
6.2 SOURCE CODE
Home screen

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Home</title>
<link rel="stylesheet" href="css/bootstrap.min.css" type="text/css">
<link type="text/css" rel="stylesheet" href="css/style.css">
<link rel="stylesheet" href="font-awesome/css/font-awesome.min.css" type="text/css">
<link rel="stylesheet" href="css/animate.min.css" type="text/css">
<link rel="stylesheet" href="css/creative.css" type="text/css">
</head>
<body>
<!-- Fixed top navigation bar with links to the three pages. -->
<nav id="mainNav" class="navbar navbar-default navbar-fixed-top" style="background:black">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse"
        data-target="#bs-example-navbar-collapse-1">
<span class="icon-bar"></span>
</button>
</div>
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
<li><a class="page-scroll" href="index.php">Home</a></li>
<li><a class="page-scroll" href="challan_home.php">CHALLAN</a></li>
<li><a class="page-scroll" href="admin.php">Admin Login</a></li>
</ul></div></div></nav>
<header style="background-color:#82E0AA;">
<div class="header-content"><div class="header-content-inner">
<h1 style="font-size:50px;font-family:Century Gothic;color:black;">SPEECH INTERFACE</h1><br>
<h1 style="font-size:50px;font-family:Century Gothic;color:black;">Challan</h1>
Challan form

<?php
include('connect.php');
?>
<!DOCTYPE html>
<html lang="en">
<head>
<link rel="shortcut icon" href="favicon.ico" type="image/icon">
<link rel="icon" href="favicon.ico" type="image/icon">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Challan Filling</title>
<meta name="description" content="">
<meta name="author" content="templatemo">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/js/bootstrap.min.js"></script>
<link href="js/validation.js" rel="stylesheet">
<style>
audio {
  filter: sepia(20%) saturate(50%) grayscale(1) contrast(99%) invert(12%);
  width: 100px;
  height: 20px;
}
</style>
</head>
<body style="background-color:#82E0AA;">
<?php
$accname = "";
$accnum = "";
$accdate = "";
$branchspeech = "";
$mob = "";
$amt = "";
$remname = "";
/* Each speak button posts its field name; the matching branch runs the
   Python speech-to-text script and captures its printed transcript. */
if(isset($_POST['accname']))
{ $command = escapeshellcmd('sp_tx.py');
  $accname = shell_exec($command);
}
if(isset($_POST['accnum']))
{ $command = escapeshellcmd('sp_tx.py');
  $accnum = shell_exec($command);
}
if(isset($_POST['accdate']))
{ $command = escapeshellcmd('sp_tx.py');
  $accdate = shell_exec($command);
}
if(isset($_POST['branchspeech']))
{ $command = escapeshellcmd('sp_tx.py');
  $branchspeech = shell_exec($command);
}
if(isset($_POST['mob']))
{ $command = escapeshellcmd('sp_tx.py');
  $mob = shell_exec($command);
}
if(isset($_POST['amt']))
{ $command = escapeshellcmd('sp_tx.py');
  $amt = shell_exec($command);
}
if(isset($_POST['remname']))
{ $command = escapeshellcmd('sp_tx.py');
  $remname = shell_exec($command);
}
?>
<?php
/* On submit, insert the record only when every field is non-empty. */
if(isset($_POST['submit']))
{ $AccountName = $_POST['AccountName'];
  $AccountNumber = $_POST['AccountNumber'];
  $date = $_POST['date'];
  $branch = $_POST['branch'];
  $Num = $_POST['Num'];
  $Amount = $_POST['Amount'];
  $Remitters = $_POST['Remitters'];
  if($AccountName != "" && $AccountNumber != "" && $date != "" &&
     $branch != "" && $Num != "" && $Amount != "" && $Remitters != "")
  { $sql = "INSERT INTO challan_details
      (AccountHolderName, accountnumber, date, branch, mobile, amount, depositorname)
      VALUES ('$AccountName', '$AccountNumber', '$date', '$branch',
              '$Num', '$Amount', '$Remitters')";
    $query = mysqli_query($connect, $sql);
    echo "<center style='font-size:30px;'>Amount Deposited Successfully...</center>";
  }
  else
  { echo "<center style='font-size:30px;'>Please Fill the details</center>"; }
}
?>
Speech to Text

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    audio = r.listen(source)      # record one utterance from the microphone
a = r.recognize_google(audio)     # convert the recorded speech to text
print(a)                          # printed text is captured by shell_exec()
User details
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<h2 class="margin-bottom-10">Challan Details</h2>
<div class="box-body table-responsive no-padding">
<table class="table table-hover" border="1">
<tr>
<th style="text-align:center;">S.No</th>
<th style="text-align:center;">Acc Holder Name</th>
<th style="text-align:center;">Acc Number</th>
<th style="text-align:center;">Date</th>
<th style="text-align:center;">Branch Name</th>
<th style="text-align:center;">Mobile</th>
<th style="text-align:center;">Amount (Words)</th>
<th style="text-align:center;">Depositor Name</th>
<th style="text-align:center;">Delete</th>
</tr>
<?php
// Render one table row per stored challan record.
$sql = "SELECT * FROM challan_details";
$result = mysqli_query($connect, $sql);
$count = 1;
if(mysqli_num_rows($result) > 0)
{
    while($row = mysqli_fetch_array($result))
    {
        echo "<tr><td> ".$count." </td>
        <td> ".$row['AccountHolderName']." </td>
        <td> ".$row['accountnumber']." </td>
        <td> ".$row['date']." </td>
        <td> ".$row['branch']." </td>
        <td> ".$row['mobile']." </td>
        <td> ".$row['amount']." </td>
        <td> ".$row['depositorname']." </td>
        <td><a href='admin_home.php?id=".$row['id']."'>
        <button type='button' class='btn btn-primary' name='delete'
        style='color:white;'>Delete</button></a></td></tr>";
        $count++;
    }
}
echo "</table>";
?>
</div>
</body>
</html>
XAMPP (connect.php)
<?php
// Open the MySQL connection used by every page of the application.
session_start();
$connect = mysqli_connect("localhost", "root", "", "challan");
?>
6.3 RESULT
The implementation of the speech interface in the challan application provides
a solution to the challenges faced by the plebeians while filling the form.
Figure 6.1 Home screen
Figure 6.1 shows the home screen of the speech interface challan application.
This is the first page of the web application, and it appears in full-screen
mode when the application is launched. Home, Challan and Admin buttons are
present in the top right corner of the application window, each with a
different function. Clicking Challan moves to the challan form page.
Figure 6.2 Challan form
Figure 6.2 shows the challan form in the speech interface challan application.
Figure 6.3 Admin login
Figure 6.3 shows the admin login in the speech interface challan application.
Figure 6.4 Voice generator
Figure 6.4 shows the voice generator in the speech interface challan
application. The user clicks the play button to hear the audio instruction,
and the volume button adjusts the volume of the audio played. After listening
to the instruction, the user understands what is to be filled in the field and
provides the audio/speech as input.
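The report does not list the voice-generating code itself. The following is a minimal sketch, assuming the gTTS package, of how one field's instruction clip could be produced; the instruction text and file name are illustrative:

from gtts import gTTS

# Illustrative instruction text for the account-name field.
instruction = "Please say the account holder name after the beep."

# Synthesize the instruction and save it as an MP3 file that the
# page's audio element can play back to the user.
gTTS(text=instruction, lang="en").save("accname_instruction.mp3")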
Figure 6.5 Recognition of speech
Figure 6.5 shows the conversion of speech to text in the speech interface
challan application. The audio input is provided to the STT by clicking the
speak button once. The Python program recognizes the speech in 20 to 30
seconds and displays the text in the appropriate text field. In this way all
the text boxes can be filled; Figure 6.5 shows the challan form with user
details filled into every text box.
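The 20 to 30 second recognition time is dominated by how long the recognizer keeps listening and by the round trip to the recognition service. A sketch, with illustrative timeout values rather than the project's settings, of bounding the recording phase with the speech_recognition package:

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Give up if no speech starts within 5 seconds, and cap a
    # single utterance at 10 seconds so one field never blocks
    # the form for long (both values are illustrative).
    audio = r.listen(source, timeout=5, phrase_time_limit=10)
print(r.recognize_google(audio))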
Figure 6.6 Records of the user
Figure 6.6 shows the details of the users of the speech interface challan
application. Once the amount is deposited successfully, an acknowledgement is
shown to the user and the details are stored automatically in the database.
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT
7.1 CONCLUSION
Voice based applications are at the forefront of any human machine interaction
environment, and in this setting speech interface form filling is especially
useful for plebeians. The achievable speech recognition accuracy was
established by reviewing the existing literature. Speech recognition in the
speech interface is performed with Python packages, and the accuracy of the
speech-to-text conversion is high enough for the software to work in a
real-life environment. The speech interface is implemented for challan
application form filling. The speech interface challan application helps the
plebeians fill the form by themselves in the banking sector, without sharing
their personal information with third persons. The expansion of the speech
interface to other applications is discussed in the following section.
7.2 FUTURE ENHANCEMENT
This section discusses the limitations identified in the speech recognition.
The major limitation is that the speech interface is designed for one
particular language. Future work plans to address this limitation by extending
the speech interface to multiple or indigenous languages, as sketched below.
The proposed STS and TTS systems have great potential for developing voice
enabled applications in a wide variety of service sectors, such as Aadhaar
card, PAN card, gas connection, education, health care, tourism, and other
public and private sectors.
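With the speech_recognition package used above, supporting another language is mainly a matter of passing a BCP-47 language tag to the recognizer. A minimal sketch, assuming the recognition service supports the target language; the Tamil tag "ta-IN" is an illustrative choice, not a project decision:

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    audio = r.listen(source)

# recognize_google() accepts a BCP-47 language tag; "ta-IN"
# (illustrative) asks the service to transcribe Indian Tamil.
print(r.recognize_google(audio, language="ta-IN"))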
REFERENCES
1. Keiichi Tamura et al., "Novel Defense Method against Audio Adversarial
Example for Speech-to-Text Transcription Neural Networks", in Proceedings
of the IEEE 11th International Workshop on Computational Intelligence and
Applications, November 9-10, 2019, Hiroshima, Japan.
2. Yuan Jiang et al., "Improving Sequence-to-Sequence Voice Conversion by
Adding Text-Supervision", National Engineering Laboratory for Speech and
Language Information Processing, University of Science and Technology of
China, 2019.
3. Snezhana Pleshkova et al., "Reduced Database for Voice Commands
Recognition Using Cloud Technologies, Artificial Intelligence and Deep
Learning", in Proceedings of the XVI International Conference on Electrical
Machines, Drives and Power Systems (ELMA), June 6-8, 2019, Varna, Bulgaria.
4. Atma Prakash Singh et al., "A Survey: Speech Recognition Approaches and
Techniques", in Proceedings of the IEEE Uttar Pradesh Section International
Conference on Electrical, Electronics and Computer Engineering, 2018.
5. Ananya Paul et al., "Development of GUI for Text-to-Speech Recognition
using Natural Language Processing", IEEE, 2018.
6. Farhan Khan et al., "Voice to Text Transcription using CMU Sphinx", IEEE
Transactions on Human-Machine Systems, vol. 47, no. 6, December 2017.
7. Jiangtao Wang et al., "CAPFF: A Context-Aware Assistant for Paper Form
Filling", IEEE Transactions on Human-Machine Systems, vol. 47, no. 6,
December 2017.
8. Jayashri Vajpai et al., "Industrial Applications of Automatic Speech
Recognition Systems", International Journal of Engineering Research and
Applications, vol. 6, issue 3 (part 1), March 2016, pp. 88-95.
9. Yogita H. Ghadage, "Speech to Text Conversion for Multilingual
Languages", in Proceedings of the International Conference on Communication
and Signal Processing, April 6-8, 2016, India.
10. Vishnudas Raveendran et al., "An Approach to File Manager Design for
Speech Interfaces", 2016, India.
11. YoungJae Song et al., "Classifying Speech Related vs. Idle State Towards
Onset Detection in Brain-Computer Interfaces", 2014.
12. Li Deng et al., "New Types of Deep Neural Network Learning for Speech
Recognition and Related Applications: An Overview", IEEE, 2013.
13. Lei Xie et al., "Speech and Auditory Interfaces for Ubiquitous, Immersive
and Personalized Applications", in Proceedings of the Symposia and Workshops
on Ubiquitous, Autonomic and Trusted Computing, 2010.
14. Sunil Issar, "A Speech Interface for Forms on WWW", Carnegie Mellon
University, 2009.
15. Shih-Jung Peng et al., "A Generic Interface Methodology for Bridging
Application Systems and Speech Recognizers", IEEE, 2007.
16. Tomoki Toda et al., "Voice Conversion Based on Maximum-Likelihood
Estimation of Spectral Parameter Trajectory", 2007.
17. Meinard Müller, "Dynamic Time Warping", 2007.
18. Bernhard Suhm, "Interactive Recovery from Speech Recognition Errors in
Speech User Interfaces", IEEE, 2006.
19. Sadaoki Furui, "Automatic Speech Recognition and Its Application to
Information Extraction", 2005.
20. Nobuo Hataoka et al., "Robust Speech Dialog Interface for Car Telematics
Service", IEEE, 2004.
PUBLICATIONS
1. K. Abinaya, S. Gayathiri, R. Kirubamanohari, M. Nivetha, "Development of
Speech Interface for Challan Application using Speech Recognition",
International Journal of Advanced Research Trends in Engineering and
Technology (A Science and Technology Journal), Volume 7, Issue 6, June 2020.