Study of Document Layout Analysis Algorithms

Comparative Study of Document Layout Analysis Algorithms for Printed Document Images

  • Divya Kamat, Divya Sharma, Parag Chitale, Prateek Dasgupta

ABSTRACT

In this survey paper, different algorithms that can be used for document layout analysis are studied and their results are compared. For the removal of the image mask, Bloomberg's algorithm and CRLA are described. For text segmentation, we analyze the recursive XY cut algorithm, RLSA and RLSO.

  1. Introduction

Physical layout analysis of printed document images is the first step of OCR conversion. For the OCR to work well, we have to provide an input in which no images are present, i.e. the image contains only text. If this is not done properly, the OCR will return garbage values. To avoid this, we discuss two algorithms, Bloomberg's algorithm and CRLA, that can be used to remove images from document images.

The next step is text segmentation, in which we find the text blocks inside the document. The coordinates of the text blocks are then passed as input to the OCR. To perform this segmentation, we have studied the recursive XY cut algorithm and the RLSA and RLSO algorithms.

  1. Removal of Image from Document

The first step in document layout analysis is to remove the images present in the original document. We discuss Bloomberg's algorithm along with its variants and the CRLA algorithm for image removal.

  1. Bloomberg's Algorithm
Bloomberg's algorithm is mainly used to obtain the image mask of halftone images. Its implementation uses basic morphological operations. The algorithm has the following steps:
  1. First, the input image is binarized.
  2. Next, 4x1 threshold reduction is performed twice using threshold T=1.
  3. 4x1 threshold reduction is performed using T=4.
  4. 4x1 threshold reduction is performed using T=3.
  5. The image is opened with a structuring element of size 5x5.
  6. Next, 1x4 expansion of the image is performed twice.
  7. Next, the union of the overlapping components of the seed image obtained from step 6 with the image from step 2 is computed.
  8. Dilation with a 3x3 structuring element is performed, followed by 1x4 expansion performed twice.
  9. The halftone mask obtained from step 8 is then subtracted from the binarized input image.
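
The threshold reduction used in steps 2 to 4 can be illustrated with a minimal sketch. It assumes that a single reduction works on 2x2 tiles of a binary image (1 = black) and that T is the minimum number of black pixels a tile needs for its output pixel to be black. The tile size, the function name `threshold_reduce_2x` and the sample image are assumptions made for illustration; they are not details given in the steps above.

```python
import numpy as np

def threshold_reduce_2x(img, T):
    """Halve a binary image: a 2x2 tile maps to 1 if it holds at least T black pixels."""
    h, w = img.shape
    h2, w2 = h - h % 2, w - w % 2            # drop an odd trailing row or column
    tiles = img[:h2, :w2].astype(np.uint8).reshape(h2 // 2, 2, w2 // 2, 2)
    counts = tiles.sum(axis=(1, 3))           # black pixels in each 2x2 tile
    return (counts >= T).astype(np.uint8)

# Hypothetical 4x4 input: one isolated pixel on top, a dense region below.
sample = np.array([[1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [1, 1, 1, 1],
                   [1, 1, 1, 0]], dtype=np.uint8)
reduced_t1 = threshold_reduce_2x(sample, T=1)
reduced_t3 = threshold_reduce_2x(sample, T=3)
reduced_t4 = threshold_reduce_2x(sample, T=4)
```

In this sketch a low threshold such as T=1 keeps any tile that contains ink, while T=4 keeps only tiles that are completely black, so only dense regions survive the stricter reductions.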

The main drawback of Bloomberg's algorithm is that it is unable to distinguish between text and drawings (i.e. line drawings) in a printed document image.

  1. Enhanced CRLA Algorithm

CRLA stands for Constrained Run Length Algorithm. In this algorithm we apply horizontal and vertical smoothing to the document image to obtain a clear separation between text and images in the document.

Enhanced CRLA smooths only the text part of the image and avoids smoothing the non-textual part of the document image.

Algorithm
  1. Label the connected components in the document image.
  2. Classify the components with respect to their heights as follows:
     • Height less than or equal to 1 cm: label it as 1
     • Height between 1 and 3 cm: label it as 2
     • Height greater than 3 cm: label it as 3
  3. Apply horizontal smoothing to the components with label 1 only.
  4. Apply vertical smoothing to the components with label 1 only.
  5. Logically AND the two images obtained previously.
  6. Apply horizontal smoothing to the output image of the AND operation.
  7. Calculate the Mean Black Run Length (MBRL):
     • Calculate the Black Run Length (BRL) row-wise for the region under consideration.
     • Maintain a Black-White Transition Count (TC) for the region.
     • Calculate the mean BRL as MBRL = BRL/TC.
  8. Calculate the Mean Transition Count (MTC):
     • Maintain a Black-White Transition Count (TC) for the region.
     • Calculate W, the width of the region.
     • Calculate the mean TC as MTC = TC/W.
  9. Extract the components from the image with label 1 whose values of MBRL and MTC lie in the acceptable range for a normal document image.
  10. Apply horizontal smoothing to the components with label 2 only.
  11. Apply vertical smoothing to the components with label 2 only.
  12. Logically AND the two images obtained previously.
  13. Apply horizontal smoothing to the output image of the AND operation.
  14. Calculate MBRL and MTC.
  15. Extract the components from the image with labels 2 and 3 whose values of MBRL and MTC lie in the acceptable range for a normal document image.

At step 9 we extract the text area of the document image, and at step 15 we extract the non-text part of the document image.
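
The height classification and the two region measures can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the scan resolution (`dpi=300`), the function names, the choice to count both black-to-white and white-to-black changes along each row as transitions, and the handling of a region with no transitions are all assumptions. The middle height class is given label 2, which is the label the later smoothing steps refer to.

```python
import numpy as np

def height_label(height_px, dpi=300):
    """Map a component height in pixels to label 1, 2 or 3 (limits at 1 cm and 3 cm)."""
    height_cm = height_px / dpi * 2.54     # dpi is an assumed scan resolution
    if height_cm <= 1:
        return 1
    if height_cm <= 3:                     # assumption: exactly 3 cm falls in class 2
        return 2
    return 3

def mbrl_and_mtc(region):
    """Return (MBRL, MTC) for a binary region in which 1 is black."""
    region = np.asarray(region, dtype=np.int8)
    brl = int(region.sum())                            # total black run length over all rows
    tc = int(np.abs(np.diff(region, axis=1)).sum())    # colour changes along the rows
    width = region.shape[1]
    mbrl = brl / max(tc, 1)                # assumption: avoid dividing by zero
    mtc = tc / width
    return mbrl, mtc

# Hypothetical 2x4 region used only to exercise the functions.
mbrl, mtc = mbrl_and_mtc([[0, 1, 1, 0],
                          [1, 1, 0, 0]])
labels = [height_label(h) for h in (100, 200, 400)]
```

A region would then be kept as text when both values fall inside ranges chosen for normal document images; those ranges are not specified here, so the sketch stops at computing the measures.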

The main advantage of the CRLA algorithm is the clear separation of the text and non-text areas of the document image. It also works effectively for drawings as well as halftones. It has considerably lower complexity, since the smoothing is selective.

However, after the removal of the non-textual area of the document image, some stray pixels remain in the image. The connected components in the halftone image whose height is less than 1 cm are assumed to be text elements by the algorithm. This results in unwanted components in the final image.

  1. Text Segmentation
The next step in document layout analysis is the segmentation of the text into text blocks that can be provided as input to the OCR. The following algorithms have been studied for this purpose:
  1. Recursive XY Cut algorithm
The recursive XY cut algorithm is used to obtain text blocks from a document image from which the images of the original printed document have been removed. It works in the following way:
  1. The bounding boxes of the image are computed.
  2. Next, we compute the horizontal and vertical projections of the image.
  3. After computing the projections, we perform X cuts at all the valleys in the horizontal projection whose width is greater than the threshold th.
  4. Next, we perform Y cuts between these X cuts at all the valleys in the vertical projection whose width is greater than the threshold tv.
  5. We repeat steps 3 and 4 until no more X or Y cuts are possible in a region.
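
The recursion can be sketched in a few lines. This is a simplified illustration that works directly on pixel projections of a binary image (1 = ink) rather than on bounding boxes of connected components. The function names, the treatment of a valley as a run of completely empty rows or columns, and the thresholds being measured in pixels are assumptions.

```python
import numpy as np

def segments(profile, min_gap):
    """Split a projection into (start, end) spans separated by empty valleys wider than min_gap."""
    filled = np.flatnonzero(profile)
    if filled.size == 0:
        return []
    spans, start, prev = [], filled[0], filled[0]
    for i in filled[1:]:
        if i - prev - 1 > min_gap:          # valley of (i - prev - 1) empty lines
            spans.append((int(start), int(prev) + 1))
            start = i
        prev = i
    spans.append((int(start), int(prev) + 1))
    return spans

def xy_cut(img, th, tv, top=0, left=0):
    """Recursively cut a binary image and return blocks as (top, bottom, left, right)."""
    boxes = []
    rows = segments(img.sum(axis=1), th)            # valleys of the horizontal projection
    for r0, r1 in rows:
        strip = img[r0:r1]
        cols = segments(strip.sum(axis=0), tv)      # valleys of the vertical projection
        for c0, c1 in cols:
            if len(rows) == 1 and len(cols) == 1:
                # No cut was possible in either direction: this is a final block.
                boxes.append((top + r0, top + r1, left + c0, left + c1))
            else:
                boxes.extend(xy_cut(strip[:, c0:c1], th, tv, top + r0, left + c0))
    return boxes

# Hypothetical page: one full-width block above two side-by-side blocks.
page = np.zeros((10, 10), dtype=np.uint8)
page[0:3, :] = 1
page[6:10, 0:4] = 1
page[6:10, 7:10] = 1
blocks = xy_cut(page, th=2, tv=2)
```

Each recursive call receives a strictly smaller sub-image, so the recursion stops once a region admits no further cut in either direction.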

One of the problems with the XY cut algorithm is that there is no method to find a threshold that works for all documents. Instead, a new threshold needs to be determined for every document, and this cannot be done without manual intervention.

Another major issue with the recursive XY cut algorithm is its time complexity: it takes a long time to complete execution. Despite these drawbacks, the algorithm separates the text blocks effectively, provided that a manual threshold is supplied.

  1. RLSA
The Run-Length Smoothing Algorithm (RLSA) works on black and white scanned images of documents. It detects runs of white pixels and changes them into black pixels whenever they are shorter than a given threshold. The RLSA works in four steps:
  1. In the first step, we perform horizontal smoothing. For this, we scan the image row-wise and replace runs of white pixels with black pixels if they are shorter than a threshold th.
  2. In the second step, we perform vertical smoothing. For this, we scan the image column-wise and replace runs of white pixels with black pixels if they are shorter than a threshold tv.
  3. Next, we perform a logical AND of the images obtained from the first and second steps.
  4. Then we perform horizontal smoothing on the image obtained from step 3 with a threshold ta.
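
The four steps can be sketched as follows. This is a minimal illustration, assuming a binary NumPy image in which 1 is black and assuming that only white runs enclosed by black pixels on both sides are filled (runs touching the image border are left alone). The function names are hypothetical.

```python
import numpy as np

def smooth_rows(img, threshold):
    """Fill white runs shorter than threshold that lie between two black pixels of a row."""
    out = img.copy()
    for row in out:                               # each row is a view into out
        black = np.flatnonzero(row)
        for a, b in zip(black[:-1], black[1:]):
            if b - a - 1 < threshold:             # length of the white run between a and b
                row[a + 1:b] = 1
    return out

def rlsa(img, th, tv, ta):
    """Horizontal and vertical smoothing, logical AND, then a final horizontal pass."""
    horizontal = smooth_rows(img, th)
    vertical = smooth_rows(img.T, tv).T           # column-wise smoothing via the transpose
    combined = horizontal & vertical
    return smooth_rows(combined, ta)

# Hypothetical 3x5 image with small gaps in both directions.
image = np.array([[1, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0]], dtype=np.uint8)
result = rlsa(image, th=2, tv=2, ta=2)
```

In this example the empty middle row stays white: vertical smoothing fills it only in two columns and horizontal smoothing does not fill it at all, so the AND removes it.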
  1. RLSO
A simplified version of the RLSA, RLSO (Run-Length Smoothing with OR) works as follows:
  1. In the first step, we perform horizontal smoothing. For this, we scan the image row-wise and replace runs of white pixels with black pixels if they are shorter than a threshold th.
  2. In the second step, we perform vertical smoothing. For this, we scan the image column-wise and replace runs of white pixels with black pixels if they are shorter than a threshold tv.
  3. Next, we perform a logical OR operation on the images obtained from the first and second steps.
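
A minimal sketch of the three RLSO steps is given below, under the same kind of assumptions as before: a binary NumPy image in which 1 is black, white runs filled only when enclosed by black pixels on both sides, and hypothetical function names.

```python
import numpy as np

def smooth_rows(img, threshold):
    """Fill white runs shorter than threshold that lie between two black pixels of a row."""
    out = img.copy()
    for row in out:                               # each row is a view into out
        black = np.flatnonzero(row)
        for a, b in zip(black[:-1], black[1:]):
            if b - a - 1 < threshold:             # length of the white run between a and b
                row[a + 1:b] = 1
    return out

def rlso(img, th, tv):
    """Horizontal and vertical smoothing combined with a logical OR."""
    horizontal = smooth_rows(img, th)
    vertical = smooth_rows(img.T, tv).T           # column-wise smoothing via the transpose
    return horizontal | vertical

# Hypothetical 3x5 image with small gaps in both directions.
image = np.array([[1, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0]], dtype=np.uint8)
merged = rlso(image, th=2, tv=2)
```

Because the OR keeps every pixel filled by either pass, the result here joins the top and bottom rows through the two filled columns, and no final smoothing pass is needed.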

The RLSA algorithm returns rectangular frames for documents with Manhattan layouts. The RLSO algorithm, on the other hand, also works well with non-Manhattan layouts. The problem with both RLSA and RLSO is that the smoothing thresholds need to be determined manually. In addition, the thresholds required differ for each document image and are very difficult to choose by hand.

  1. Conclusion

We have compared the algorithms described above for document layout analysis. During our study we found that, while Bloomberg's algorithm has problems with images that contain drawings, CRLA has problems with images that contain very small non-textual elements.

We also observed that neither the recursive XY cut algorithm nor RLSA works on documents with non-Manhattan layouts. The RLSO algorithm, on the other hand, gives comparatively better results for Manhattan as well as non-Manhattan layouts. However, all three algorithms share the problem of manual threshold determination, which is document specific.

  1. References
  1. Syed Saqib Bukhari, Faisal Shafait and Thomas M. Breuel, "Improved Document Image Segmentation Algorithm using Multiresolution Morphology"
  2. Jaekyu Ha, Robert M. Haralick and Ihsin T. Phillips, "Recursive X-Y Cut using Bounding Boxes of Connected Components", Third International Conference on Document Analysis and Recognition, ICDAR, 1995
  3. Stefano Ferilli, Teresa M. A. Basile and Floriana Esposito, "A Histogram-based Technique for Automatic Threshold Assessment in a Run Length Smoothing-based Algorithm", ACM, 2010
  4. Hung-Ming Sun, "Enhanced Constrained Run-Length Algorithm for Complex Layout Document Processing", International Journal of Applied Science and Engineering, 2006
