Abstract
Optical Character Recognition (OCR) enables the automatic extraction of text from scanned documents and is widely used to digitize historical and administrative records. However, OCR errors, often caused by poor image quality or font variability, make the automatic classification of the extracted text considerably harder. This study applies Machine Learning methods to the classification of OCR-extracted text, using the categorization of digitized examination papers as a case study. The primary objective is to develop a robust pipeline that both improves recognition quality and assigns each text to its correct category. We first apply image preprocessing steps, such as grayscale conversion and noise reduction, to mitigate the impact of OCR errors. We then compare the performance of several classification algorithms, including Logistic Regression, Support Vector Machines (SVM), and neural network-based approaches, specifically Transformer models such as BERT. The results show that Transformer-based models, combined with a post-OCR correction stage, perform significantly better than the other approaches compared, because they reduce the effect of recognition errors on classification quality. The study thus contributes to the application of machine learning to the classification of OCR-extracted data, specifically in the domain of examination papers, and proposes a selection of techniques for improving model accuracy in this context.
Keywords: Text Classification, Optical Character Recognition, Machine Learning, Data Preprocessing, Natural Language Processing.
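The abstract names grayscale conversion and noise reduction as preprocessing steps but does not specify how they are implemented. The following is a minimal, dependency-free sketch of what such steps could look like, assuming ITU-R BT.601 luminance weights for the grayscale conversion and a median filter for the noise reduction. Both choices, as well as the function names `to_grayscale` and `median_denoise`, are illustrative assumptions and not the method used in the study.

```python
from statistics import median_low


def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples into rows of 0-255 luminance values.

    Assumption: ITU-R BT.601 weights; the study does not state its formula.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def median_denoise(gray, radius=1):
    """Replace each pixel by the (low) median of its square neighbourhood.

    The window is clipped at the image borders. Assumption: a median filter
    stands in for the unspecified noise-reduction step.
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(median_low(window))
        out.append(row)
    return out


# A white 3x3 page with one black speck in the centre (hypothetical input).
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page = [
    [WHITE, WHITE, WHITE],
    [WHITE, BLACK, WHITE],
    [WHITE, WHITE, WHITE],
]
cleaned = median_denoise(to_grayscale(page))
```

In this toy input the isolated speck is outvoted by its white neighbours, which is the kind of salt-and-pepper noise a median filter targets. A real pipeline would more likely rely on an image library for both steps before passing the page to the OCR engine.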