Università degli studi di Bari “Aldo Moro”
                         Dipartimento di Informatica




      A Run Length Smoothing-Based Algorithm
     for non-Manhattan Document Segmentation
                           S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito
                               Via Orabona, 4 - 70126 Bari – Italy
                                   {ferilli, esposito}@di.uniba.it
                              {fabio.leuzzi, fulvio.rotella}@uniba.it
L.A.C.A.M. - http://lacam.di.uniba.it
Introduction
● Automatic document processing a hot topic
  ― Layout analysis a fundamental step

    ● Identification of frames (relevant components in the document)

    ● Performance can determine quality and feasibility of the whole process

● Two different…

    ● Kinds of sources: Digitized (scanned) vs. Natively digital documents

    ● Categories of layouts: Manhattan vs. Non-Manhattan

    ● Types of algorithms: Top-down vs. Bottom-up




● Run Length Smoothing Algorithm
    ● Manhattan Layout

● Other works exploit or try to improve the RLSA by setting its parameters

● Many works on Manhattan layout

  ― Top-down strategies

● Fewer works on non-Manhattan layouts

  ― Bottom-up strategies




●   The Manhattan assumption holds for many typeset documents, simplifies
    document processing…BUT cannot be assumed in general
RLSO
                   Application to scanned images
RLSO (Run Length Smoothing with OR)
1) horizontal smoothing with threshold th, row by row
2) vertical smoothing with threshold tv, column by column
3) logical OR of the images obtained in steps 1 and 2
[Figure: smoothing example with th = 5, tv = 4; one panel is labelled (AND)]
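
The three steps can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: pixels are 1 for black and 0 for white, a white run is filled only when it is enclosed by black pixels on both sides, and the comparison is "length <= threshold". The names smooth_runs and rlso are illustrative.

```python
import numpy as np

def smooth_runs(line: np.ndarray, threshold: int) -> np.ndarray:
    """Fill short runs of white (0) pixels in a 1-D binary array with black (1)."""
    out = line.copy()
    n = len(line)
    i = 0
    while i < n:
        if line[i] == 0:
            j = i
            while j < n and line[j] == 0:
                j += 1
            # Assumption: runs touching the image border are left untouched,
            # and the boundary convention is "<=" rather than "<".
            if i > 0 and j < n and (j - i) <= threshold:
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def rlso(image: np.ndarray, th: int, tv: int) -> np.ndarray:
    """Horizontal smoothing, vertical smoothing, then a pixel-wise logical OR."""
    horizontal = np.array([smooth_runs(row, th) for row in image])
    vertical = np.array([smooth_runs(col, tv) for col in image.T]).T
    return horizontal | vertical

page = np.array([[1, 0, 0, 1],
                 [0, 0, 0, 0],
                 [1, 0, 0, 1]], dtype=np.uint8)
smoothed = rlso(page, th=2, tv=1)
```

In this toy image the horizontal pass fills the two-pixel gaps in the first and last rows, the vertical pass fills the one-pixel gaps in the first and last columns, and the OR keeps every pixel that either pass set to black.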
RLSO
              Application to born-digital documents
●   Set horizontal/vertical distance thresholds t_h / t_v
●   Build a frame for each basic block
●   H = {(d_h, b', b'') | b' and b'' are horizontally adjacent basic blocks
                          and d_h is the horizontal distance between them}
●   For all (d_h, b', b'') ∈ H s.t. d_h ≤ t_h, merge the frames to which b' and b''
    belong

●   V = {(d_v, b', b'') | b' and b'' are vertically adjacent basic blocks
                          and d_v is the vertical distance between them}
●   For all (d_v, b', b'') ∈ V s.t. d_v ≤ t_v, merge the frames to which b' and b'' belong


[Figure legend: reference block, adjacent blocks, non-adjacent blocks, horizontal distance, vertical distance]
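
The merging procedure can be sketched with a union-find structure over bounding boxes. This is a minimal sketch, not the authors' code. The adjacency test is an assumption: two blocks count as horizontally adjacent when their vertical extents overlap, and as vertically adjacent when their horizontal extents overlap, with no check for blocks lying in between. Block, gap, overlaps and merge_frames are hypothetical names.

```python
from itertools import combinations
from typing import List, NamedTuple

class Block(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float

def gap(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> float:
    """Distance between two 1-D intervals (0 when they overlap)."""
    return max(0.0, max(a_lo, b_lo) - min(a_hi, b_hi))

def overlaps(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> bool:
    """True when the two 1-D intervals share more than a single point."""
    return min(a_hi, b_hi) > max(a_lo, b_lo)

def merge_frames(blocks: List[Block], th: float, tv: float) -> List[List[int]]:
    """Return frames as lists of block indices, merging blocks closer than th/tv."""
    parent = list(range(len(blocks)))  # initially one frame per basic block

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(blocks)), 2):
        a, b = blocks[i], blocks[j]
        # Assumed adjacency: overlap of the projections on the other axis.
        h_close = overlaps(a.y0, a.y1, b.y0, b.y1) and gap(a.x0, a.x1, b.x0, b.x1) <= th
        v_close = overlaps(a.x0, a.x1, b.x0, b.x1) and gap(a.y0, a.y1, b.y0, b.y1) <= tv
        if h_close or v_close:
            parent[find(i)] = find(j)

    frames = {}
    for i in range(len(blocks)):
        frames.setdefault(find(i), []).append(i)
    return sorted(frames.values())

blocks = [Block(0, 0, 10, 5), Block(12, 0, 20, 5),
          Block(0, 8, 10, 12), Block(100, 100, 110, 105)]
frames = merge_frames(blocks, th=3, tv=2)
```

Here the first two blocks lie on the same line with a horizontal gap of 2, so they end up in one frame; the third block is 3 units below the first, which exceeds tv = 2, so it stays separate until tv is raised.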
RLSO
●   Run Length Smoothing algorithms are based on thresholds
    ―   Hard to set properly by hand (not a typical human activity)
    ―   Heuristic approaches (ad hoc)
    ―   Manual tuning undermines the idea of automatic processing
    ―   Fixed thresholds not suitable for documents with several different
        spacings




                   Automatic assessment of RLSO thresholds
RLSO
                   Automatic threshold assessment
●   Study of run-length behavior
    ―   Histogram very irregular
            ● Peaks = most frequent spacings
            ● Peak clusters = equally spaced components
          ― Hard to exploit by automatic techniques
    ―   Cumulative histograms more regular
          ― Bar b = number of runs longer than or equal to b:
            H'(i) = Σ_{j ≥ i} H(j)
        ● Monotonically non-increasing
          ― Flat zones = lengths for which no runs are present
        ● Scaled down to 10%
          ― Reduces variability
[Figure 1: a fragment of a scientific paper]
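
The run-length histogram H and its cumulative version H' can be computed as follows. This is a minimal sketch with assumptions: white pixels are 0, runs touching the image border are counted like any other run, and runs longer than max_len are clamped into the last bar. The scaling to 10% mentioned above is not reproduced. All function names are illustrative.

```python
import numpy as np

def white_run_lengths(line: np.ndarray) -> np.ndarray:
    """Lengths of the maximal runs of white (0) pixels in a 1-D binary array."""
    padded = np.concatenate(([1], line, [1]))          # black sentinels at both ends
    changes = np.diff((padded == 0).astype(np.int8))   # +1 at run start, -1 at run end
    starts = np.flatnonzero(changes == 1)
    ends = np.flatnonzero(changes == -1)
    return ends - starts

def run_histogram(image: np.ndarray, max_len: int) -> np.ndarray:
    """H[b] = number of horizontal white runs of length b (clamped at max_len)."""
    hist = np.zeros(max_len + 1, dtype=np.int64)
    for row in image:
        lengths = white_run_lengths(row)
        np.add.at(hist, np.minimum(lengths, max_len), 1)
    return hist

def cumulative_histogram(hist: np.ndarray) -> np.ndarray:
    """H'(i) = sum of H(j) for all j >= i."""
    return np.cumsum(hist[::-1])[::-1]

row_image = np.array([[1, 0, 0, 1, 0, 1, 0, 0, 0]], dtype=np.uint8)
H = run_histogram(row_image, max_len=4)
H_cum = cumulative_histogram(H)
```

For vertical runs the same functions can be applied to the transposed image.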
RLSO
                    Automatic threshold assessment
●   Select threshold on flat zones
    ― Derivative a good indicator
      ● Slope = 0
      ● Discrete approximation on bar b: [formula not recoverable]
    ― Tolerance possible
      ● Slope = –30
    ― Skip starting and trailing flat zones
      ● Starting zone = missing small run lengths
      ● Trailing zone = merge whole content

●   Iteration of technique on previously smoothed image
    ― Finds progressively more spaced components
[Figures 1-a and 1-b: successive application of RLSO with automatic threshold assessment on Figure 1]
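
One possible reading of the flat-zone selection is sketched below. The slide's own discrete slope formula is not available, so this sketch assumes a simple forward difference H'(b+1) − H'(b) and treats a bar as flat when that difference is no steeper than a tolerance expressed in run counts (not the angle-like "–30" quoted above). select_threshold is a hypothetical name.

```python
from typing import Optional, Sequence

def select_threshold(cum_hist: Sequence[int], tolerance: int = 0) -> Optional[int]:
    """Return the first bar of the first interior flat zone, or None if there is none."""
    # Assumed discrete slope at bar b: forward difference of the cumulative histogram.
    slopes = [cum_hist[b + 1] - cum_hist[b] for b in range(len(cum_hist) - 1)]
    flat = [s >= -tolerance for s in slopes]

    # Skip the starting flat zone (small run lengths that never occur).
    start = 0
    while start < len(flat) and flat[start]:
        start += 1

    # Skip the trailing flat zone (a threshold there would merge the whole content).
    end = len(flat)
    while end > start and flat[end - 1]:
        end -= 1

    for b in range(start, end):
        if flat[b]:
            return b
    return None

# Cumulative histogram of H = [0, 0, 5, 3, 0, 0, 2, 0, 0]
cum = [10, 10, 10, 5, 2, 2, 2, 0, 0]
t = select_threshold(cum)
```

With this input the starting zone covers bars 0 and 1, the trailing zone starts at bar 7, and the first interior flat bar is 4: no run has length 4 or 5, so a threshold of 4 fills the runs of length 2 and 3 while leaving the runs of length 6 open. Iterating, as the slide describes, would mean recomputing the histogram on the smoothed image and calling the function again.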
Sample Evaluation
Conclusions
●   RLSO (Run Length Smoothing with OR) identifies runs of white pixels in the
    document image and fills them with black pixels whenever they are shorter than a
    given threshold
     –   Both Manhattan and Non-Manhattan Layout
     –   Version for natively digital documents
●   Automatic thresholding effective on documents having
     –   single character size
     –   different spacings

●   Good baseline towards more complex documents
     –   different character sizes
     –   graphics
●   Current and future work
     –   Stop criterion for iteration
     –   Clustering based on positioning and spacing
