By
Chellammal K (41130094)
Dhansila H (41130115)
Deepika S (41130112)
Tharani P (41130118)
Bhavadhareni (41130065)
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
CATEGORY-1 UNIVERSITY BY UGC
Accredited with Grade “A++” by NAAC| 12B status by UGC | Approved by AICTE
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600 119
AUGUST - 2024
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Chellammal K
(41130094), Dhansila H (41130115), Deepika S (41130112), Tharani P (41130118) who
carried out the project entitled “TEXT RECOGNITION IN IMAGES AND
CONVERTING RECOGNIZED TEXT TO SPEECH” under my supervision from June
2024 to August 2024.
Faculty Incharge
DECLARATION
DATE:
ACKNOWLEDGEMENT
We convey our thanks to Dr. N. M. NANDHITHA, M.E., Ph.D., Professor & Dean,
School of Electrical and Electronics and Dr. T. RAVI, M.E., Ph.D., Professor &
Head, Department of Electronics and Communication Engineering for providing
us necessary support during the progressive reviews.
We would like to express our sincere and deep sense of gratitude to Dr. P.
Chitra, M.E., Ph.D., Professor, Department of Electronics and Communication
Engineering, whose valuable guidance, suggestions and constant encouragement
paved the way for the successful completion of our project work.
We wish to express our thanks to all Teaching and Non-teaching staff members of
the Department of Electronics and Communication Engineering who were helpful in
many ways for the completion of the project.
ABSTRACT
This project focuses on developing a system for text recognition in images and
converting the recognized text to speech using MATLAB Image Processing Toolbox.
The primary objective is to create an application that can accurately extract text from
various types of images and then synthesize the extracted text into audible speech,
enhancing accessibility for visually impaired individuals and providing a versatile tool
for text-to-speech applications.
The system leverages Optical Character Recognition (OCR) techniques to
identify and extract text from images. Pre-processing steps, including image
binarization, noise reduction, and edge detection, are implemented to enhance the
accuracy of the OCR process. The extracted text is then processed and fed into a
Text-to-Speech (TTS) engine, which converts the textual data into spoken words.
The system is designed to handle various fonts, sizes, and orientations of text,
making it robust and adaptable to different use cases.
The developed system is evaluated on a diverse dataset comprising different
types of images, ensuring its robustness and generalizability. The results
demonstrate the system's ability to effectively recognize text from images and
produce high-quality speech output, showcasing its potential applications in various
domains such as accessibility, automated document processing, and multimedia
content analysis.
The implementation in MATLAB offers a flexible and powerful environment for
combining image processing and TTS functionalities. Additionally, the project
incorporates real-time processing capabilities, enabling dynamic text recognition
and immediate speech output. It also features a user-friendly interface, allowing
users to easily upload images and control the text-to-speech conversion process.
This project has potential applications in assistive technologies, automated reading
systems, and interactive educational tools.
TABLE OF CONTENTS
1. CHAPTER 1
INTRODUCTION 1
2. CHAPTER 2
LITERATURE SURVEY 2
3. CHAPTER 3
4. CHAPTER 4
PROPOSED SYSTEM 6
5. CHAPTER 5
6. CHAPTER 6
6.1 CONCLUSION 14
7. CHAPTER 7
REFERENCES 16-18
List of Figures
4.4 Flowchart 11
CHAPTER-1
INTRODUCTION
The advancement of digital technology has revolutionized the way we interact with
information, and one key area of innovation is the extraction of text from images.
This process, known as Optical Character Recognition (OCR), utilizes image
processing techniques to identify and convert textual information from images into
machine-readable text. MATLAB, a powerful computational environment and
programming language, offers a versatile platform for implementing OCR through
its robust image processing toolbox. By leveraging MATLAB's capabilities, one can
efficiently preprocess images, enhance text clarity, and accurately recognize
characters, facilitating a wide range of applications from document digitization to
automated data entry.
Once the text is successfully extracted using OCR in MATLAB, the next step
involves converting the recognized text into speech using Text-to-Speech (TTS)
technology. TTS systems generate natural-sounding speech from textual input,
providing a vital tool for creating accessible content for visually impaired individuals
or for developing interactive voice applications. The integration of OCR and TTS
within MATLAB allows for a seamless workflow, where image data can be
processed to extract text and then immediately transformed into spoken words. This
combined approach not only enhances data accessibility and usability but also
opens new opportunities for interactive multimedia applications, educational tools,
and assistive technologies.
CHAPTER-2
LITERATURE SURVEY
Zen, H., et al. (2016). This paper explores the use of deep neural networks for
statistical parametric speech synthesis, focusing on advancements in generating
natural-sounding speech from text. These techniques are crucial for improving the
naturalness and clarity of TTS systems.
Chung, J. S., et al. (2016). This paper explores the application of Convolutional
Recurrent Neural Networks (CRNNs) for text recognition in natural images. The
authors propose a model that combines convolutional and recurrent layers to
effectively handle text recognition in varied and complex scenes, enhancing OCR
accuracy in real-world conditions.
Liu, X., et al. (2018). Liu et al. present a benchmark for deep text recognition
methods, providing a comprehensive evaluation of various OCR techniques on
standardized datasets. The paper highlights advancements in deep learning
approaches for text recognition and offers valuable insights into the performance of
different models.
Li, Y., et al. (2018). Li and colleagues introduce a transformer-based model for end-
to-end text recognition. The paper discusses how transformers, which have
achieved significant success in natural language processing, can be applied to text
recognition tasks to improve performance and efficiency in OCR systems.
Zhang, S., et al. (2019). Zhang et al. address the challenges of recognizing text in
multiple languages and scripts within natural images. The paper proposes a unified
model that handles diverse linguistic and script variations, improving OCR
capabilities across different languages and writing systems.
Cheng, Y., et al. (2019). Cheng et al. provide an extensive review of text detection
and recognition techniques in scene images. The paper covers both traditional and
deep learning methods, addressing challenges and advancements in OCR
technologies.
Yin, X., et al. (2020). This survey paper reviews advancements in text-to-speech
synthesis using image-based inputs, discussing methodologies, challenges, and
research directions. It provides insights into integrating OCR with TTS technologies.
Graves, A., & Schmidhuber, J. (2021). Graves and Schmidhuber's work on offline
handwriting recognition using multidimensional RNNs improves the accuracy of
OCR systems for handwritten text, contributing to advancements in text recognition
methodologies.
Wang, X., et al. (2022). This survey provides an overview of methods for text
recognition in natural scenes, including recent advances in deep learning and neural
network-based approaches. It discusses applications and future directions, offering
insights into OCR and TTS integration.
Huang, J., et al. (2023). Huang et al. introduce an end-to-end system for
recognizing text and synthesizing speech directly from images. The paper presents
a unified approach combining OCR and TTS technologies, highlighting
improvements in system efficiency and output quality.
Liu, Y., et al. (2024). Liu and colleagues explore recent advances in integrating
OCR and TTS technologies for multimodal applications. The paper reviews state-
of-the-art techniques and their applications in various domains, offering a
comprehensive look at current research trends.
Shi, B., et al. (2024). Shi et al. propose an end-to-end neural network model
designed for image-based sequence recognition, which is particularly effective for
scene text recognition. The paper presents a novel architecture that integrates
convolutional and recurrent layers, enhancing the model's ability to handle complex
text sequences in images.
Jiang, L., et al. (2024). Jiang and colleagues investigate the use of Generative
Adversarial Networks (GANs) for enhancing text-to-speech synthesis. The paper
discusses how GANs can be employed to generate high-quality, natural-sounding
speech from text, contributing to advancements in TTS technology.
Zhou, W., et al. (2024). Zhou and colleagues present a unified framework that
integrates both text recognition and text-to-speech synthesis directly from visual
data. The paper introduces an innovative model that simultaneously performs OCR
and generates speech, aiming to streamline the process and improve efficiency.
The study discusses the model's architecture, training methodology, and
performance across various datasets, offering significant advancements in merging
OCR and TTS technologies into a cohesive system.
CHAPTER – 3
The scope of this project includes several critical aspects, starting with image
processing and OCR. This involves using MATLAB's tools to preprocess images,
such as reducing noise, enhancing contrast, and converting images to binary format
to improve text recognition accuracy. Following text extraction, the project
implements TTS systems within MATLAB to synthesize speech, ensuring that the
output is both clear and natural-sounding. This includes supporting multiple
languages and voices to cater to a diverse range of users.
This project's innovation lies not only in integrating OCR and TTS technologies but
also in utilizing MATLAB as a development platform, which offers a comprehensive
suite of tools and functionalities for image and signal processing. MATLAB's rich
library of algorithms and user-friendly interface make it an ideal choice for rapid
prototyping and experimentation. Moreover, the flexibility of MATLAB's TTS tools
enables fine-tuning of voice output for better naturalness and intelligibility. By
exploiting these advanced features, the project aims to push the boundaries of what
is possible in automated text recognition and speech synthesis, setting a new
standard for accessibility technologies. This integration not only improves the user
experience by providing more accurate and natural outputs but also lays the
groundwork for future innovations in the field of assistive technology, potentially
benefiting a broader audience beyond those with visual impairments.
CHAPTER - 4
PROPOSED SYSTEM
4.1 Optical Character Recognition
OCR recognizes text characters, printed or handwritten, from a scanned image.
Templates of each character are used during the scanning process to recognize
the characters; each recognized character image is then translated into its
ASCII code, which is used in further data processing. OCR follows a basic
architectural framework, as shown in the figure below. It starts with image
acquisition, where each character from the image is passed on for recognition.
Pre-processing steps such as binarization, skew correction and normalization
are applied to make the image noise-free [8], along with enhancements such as
noise filtering and contrast correction. The core OCR framework then begins
with segmentation: as the name suggests, the characters are separated so that
each can be recognized individually. The image is segmented into text lines,
words and characters, a stage assisted by connected-component analysis and
projection analysis. Segmentation is followed by feature extraction, which is
concerned with building a representation of each character object. Since most
of the real work of OCR lies in this stage, feature extraction can be called
the 'heart' of OCR.
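The pipeline described above can be sketched in a few lines of MATLAB (a minimal sketch assuming the Image Processing and Computer Vision Toolboxes; 'page.jpg' is a placeholder input file, not one used in this project):

```matlab
% Minimal OCR pipeline sketch: acquire, pre-process, binarize, recognize.
img  = imread('page.jpg');                  % placeholder input image
gray = rgb2gray(img);                       % drop hue/saturation, keep luminance
gray = medfilt2(gray, [3 3]);               % median filter suppresses noise
bw   = imbinarize(gray, graythresh(gray));  % Otsu global threshold
results = ocr(bw);                          % built-in OCR (segmentation,
                                            % feature extraction, recognition)
disp(results.Text);                         % recognized text as characters
```

The built-in `ocr` function internally performs the segmentation and feature-extraction stages described in this section.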
4.2 Text Recognition
The initial step is to capture the image using a webcam or camera, followed by
text pre-processing. Once the text area has been located, the internal
processing framework starts: the text is segmented, and the recognized
characters are then reassembled into the original text. This information is
used to reconstruct the numbers and words of the original text. For recognition
of the captured characters, the proposed work proceeds through optical
scanning, binarization, segmentation, feature extraction and character
recognition.
4.2.2. Binarization
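Binarization maps each grayscale pixel to black or white by comparing it with a threshold; the implementation later in this report chooses the threshold automatically with Otsu's method (graythresh). A minimal core-MATLAB sketch with a hand-picked threshold for a toy matrix:

```matlab
% Global thresholding of a tiny grayscale matrix (values 0-255).
% A fixed threshold is used for illustration; graythresh/imbinarize perform
% the same mapping with an automatically chosen (Otsu) level.
gray = [ 10 200;
        180  30];
th = 128;            % hand-picked threshold for this toy example
bw = gray > th;      % logical image: true (1) = foreground
disp(bw)             % displays [0 1; 1 0]
```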
4.2.3. Segmentation
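Segmentation of the binarized image into candidate characters can be sketched with connected-component labelling (a sketch assuming the Image Processing Toolbox; the tiny 2x5 matrix stands in for a real page):

```matlab
% Label connected components (candidate characters) in a binary image.
bw = logical([1 1 0 0 1;
              1 1 0 0 1]);
[labels, num] = bwlabel(bw);                 % 8-connected labelling
stats = regionprops(labels, 'BoundingBox');  % one bounding box per component
fprintf('%d components found\n', num);       % prints: 2 components found
```

Each bounding box can then be cropped out and passed individually to the recognition stage.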
4.2.4. Feature Extraction
In feature extraction, the main task is to extract the essential
characteristics of the character formed by the set of joined pixels. Pattern
recognition is where the inevitable difficulties arise, since the character is
mostly described only by its raw raster image. Feature extraction can be
carried out in five different ways, including:
2. Feature-based techniques.
3. Point distribution.
5. Structural analysis.
4.2.5. Recognition
Once a character is segmented, it is stored in a variable so that it can be
compared with the stored template forms. The preliminary data are stored as
templates in which every character of the known font and size is available. The
data contain further information: the ASCII value of the character, the name of
the character, the character image in JPG format, the character width, and so
on. For every identified character, all of this information is captured and the
recognized character is compared with the predefined characters stored in the
system. Because the same font and size are used for the identified character,
the comparison yields one unique match. If the size of the character varies, it
is first scaled to the known standard, thus completing the recognition process.
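The template comparison described above amounts to picking the stored template with the highest similarity to the segmented character. A core-MATLAB sketch using normalised correlation over same-sized binary patterns (the 3x3 'templates' here are illustrative toys, not the real font templates):

```matlab
% Match a segmented character against stored templates by correlation.
templates = cat(3, ...
    [1 1 1; 1 0 1; 1 1 1], ...   % crude 'O'-like pattern (template 1)
    [0 1 0; 0 1 0; 0 1 0]);      % crude 'I'-like pattern (template 2)
names = {'O', 'I'};
seg = [0 1 0; 0 1 0; 0 1 0];     % segmented character to identify
scores = zeros(1, size(templates, 3));
for k = 1:size(templates, 3)
    t = templates(:, :, k);
    % normalised correlation between the two zero-mean patterns
    a = seg(:) - mean(seg(:));
    b = t(:)  - mean(t(:));
    scores(k) = (a' * b) / (norm(a) * norm(b));
end
[~, best] = max(scores);
fprintf('recognized: %s\n', names{best});   % prints: recognized: I
```

An exact match scores 1.0; in practice each cropped character is first rescaled to the template size before the comparison.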
Speech is the artificial production of the human voice, and the mechanism used
to generate it is known as a speech synthesizer. The synthesized voice can be
made to sound either human or robotic; generally, a robotic voice is used here.
A text-to-speech system is divided into two sections, namely a front-end and a
back-end. The front-end has two vital tasks: the process starts with the
conversion of raw text containing abbreviations, symbols and numbers into the
equivalent written-out words.
FIG:4.3 TTS system
1. Text Analysis
2. Phonetic Conversion
3. Prosody Generation
4. Speech Synthesis
TTS systems are widely used in various applications, including accessibility tools
for visually impaired individuals, virtual assistants, automated customer service, and
language learning tools. They enhance user experience by providing auditory
access to written content, facilitating communication, and improving interaction with
digital devices. By integrating sophisticated text analysis, phonetic conversion,
prosody generation, and advanced synthesis methods, TTS systems offer a
powerful means of converting text into natural, intelligible, and contextually
appropriate speech.
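The speech-synthesis stage can be driven from MATLAB through the Windows .NET System.Speech API, which is the same mechanism used in the code listing at the end of this report (a Windows-only sketch; the spoken string is an arbitrary example):

```matlab
% Speak a string through the Windows .NET System.Speech API.
NET.addAssembly('System.Speech');
sp = System.Speech.Synthesis.SpeechSynthesizer;
sp.Volume = 100;                        % volume of voice (range: 0-100)
sp.Rate   = 0;                          % speaking rate, -10 (slow) to 10 (fast)
Speak(sp, 'Text to speech in MATLAB');  % synchronous playback
```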
4.4 Flowchart
Step 1: The image is captured using a webcam and stored in the (.jpg) file
format. The image is then read and displayed using the imread command, which
reads the image from the stored file.
FIG:4.4 Flowchart
Step 2: In pre-processing, the original RGB image is converted to grayscale
with the rgb2gray command; as discussed above, this is where pixels are made
set and unset. rgb2gray converts an RGB image or colormap to grayscale by
removing the hue and saturation information while retaining the luminance. The
command fid = fopen(filename) opens the file whose name is given by filename,
and the recognized characters are stored in an initially empty matrix.
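rgb2gray forms a weighted sum of the R, G and B channels using the ITU-R BT.601 luminance weights, which can be reproduced directly for a single example pixel:

```matlab
% Reproduce rgb2gray's luminance formula on one pixel:
% Y = 0.2989*R + 0.5870*G + 0.1140*B  (ITU-R BT.601 weights)
rgb = [100 150 200];                 % an example R, G, B pixel
y = 0.2989*rgb(1) + 0.5870*rgb(2) + 0.1140*rgb(3);
fprintf('%.2f\n', y);                % prints: 140.74
```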
Step 3: Image filtering. Unwanted noise is removed, smoothing the image for
further processing.
Steps 4-10: The process performs segmentation of every line and every letter,
followed by a correlation step in which the template file is loaded so that
each letter can be matched against the stored templates.
Step 11: The final step is the conversion of text to speech: the recognized
text is first analysed and then converted into speech using MATLAB.
CHAPTER – 5
The system works in two stages: first, text is recognized using OCR, and
second, the text is converted to audio using MATLAB. It is a real-time system:
the image is first captured through the webcam and converted into a text file,
and the text file is then converted into speech. An advantage of this system is
that it does not need internet connectivity, which would otherwise be a basic
necessity.
Fig 4.3 Grayscale image
Fig 4.5 Recognized Text Part of Image
CHAPTER – 6
6.1 CONCLUSION
The integration of OCR and TTS technologies in this project represents a robust
solution that bridges the gap between visual and auditory information, effectively
transforming written content into spoken words. This integration not only highlights
the technical advancements achieved but also opens up various practical
applications, including automated content processing and interactive systems.
Looking ahead, there are several promising directions for future work. Expanding
the system’s multilingual capabilities will be essential for reaching a broader
audience, accommodating different languages and accents. Additionally, improving
the system’s handling of complex text scenarios, such as multi-column layouts and
handwritten documents, will enhance its versatility. Personalizing the system
through adaptive learning algorithms can further tailor the recognition and synthesis
processes to individual user needs, while integrating with Augmented Reality (AR)
and Virtual Reality (VR) environments could offer immersive experiences.
Optimizing the system for mobile and embedded devices will ensure its
performance across diverse platforms, and user feedback studies will provide
valuable insights for refining and enhancing the system’s overall usability. These
future endeavors aim to build upon the project’s success, driving further innovation
and expanding the technology’s impact in various domains.
Future work for this project presents several exciting opportunities to further
enhance and refine the system for recognizing text in images and converting it to
speech. First, expanding the system’s multilingual capabilities will be crucial. By
incorporating support for a wider range of languages and accents, the technology
can become more inclusive, catering to a global audience. This expansion would
involve adapting both the Optical Character Recognition (OCR) and Text-to-Speech
(TTS) components to handle linguistic diversity and regional nuances effectively.
Another important area for development is improving the system's performance with
complex text scenarios. Enhancing the OCR model to better recognize and process
text in multi-column layouts, handwritten documents, and low-resolution images
could significantly broaden the system's applicability.
Optimization for mobile and embedded devices is also essential. Ensuring the
system is resource-efficient and performs well on various platforms, including
smartphones and low-power devices, would enhance its accessibility and usability
in different settings. Finally, conducting user feedback studies will be valuable.
Gathering insights from real-world usage will help refine the system, addressing
practical challenges and improving overall user experience. These future work
areas aim to build on the project’s success, advancing the technology and
broadening its impact across various applications.
CHAPTER – 7
REFERENCES
3. Graves A, Mohamed AR, Hinton GE. Speech to Text with Deep Neural
Pattern Recognition.
10. Yamashita R, Nishida K, Matsumoto T. High-Quality Text-to-Speech
Conference.
Systems.
Language Processing.
Learning.
19. Miao Y, Metze F, Jansen A. End-to-End Speech Recognition Using
Processing.
22. Kim H, Kim D, Kim Y. Efficient Neural Network Models for Large-Scale
MATLAB code:

%% TEXT TO SPEECH %%
%==================%
clc;
clear;

i = imread('pro.jpg');      % put the photo you want to read here (MAIN INPUT)
figure
imshow(i)

gray = rgb2gray(i);         % convert the RGB input to grayscale
figure
imshow(gray);

th = graythresh(gray);      % Otsu threshold level
bw = imbinarize(gray, th);  % binarize the grayscale image at that level
figure
imshow(bw);                 % check that the image has been processed correctly;
                            % a bad threshold here gives garbage OCR output

ocrResults = ocr(bw)        % Optical Character Recognition on the binary image
recognizedText = ocrResults.Text;

figure;
Iocr = insertObjectAnnotation(i, 'rectangle', ...
    ocrResults.WordBoundingBoxes, ...
    ocrResults.WordConfidences);
imshow(Iocr);               % show word bounding boxes with confidence scores

NET.addAssembly('System.Speech');   % Windows .NET speech API
mysp = System.Speech.Synthesis.SpeechSynthesizer;
mysp.Volume = 100;          % volume of voice (range: 0-100)
for n = 1:numel(ocrResults.Words)   % speak every word recognized in the photo
    Speak(mysp, ocrResults.Words{n});
end