Sarbhanu Baidya

Python Tool: OCR for Indic Languages

GPLv3 License

Introduction:

This tool uses OpenCV and Google’s Tesseract to perform optical character recognition (OCR) in Python.

Other Libraries: PyTesseract (Python-tesseract, a Python wrapper for Google’s Tesseract-OCR), NumPy, Pillow.

  • Tesseract Documentation
    • Original Repository: Tesseract at UB Mannheim
    • The Mannheim University Library (UB Mannheim) uses Tesseract to perform OCR (optical character recognition) of historical German newspapers (Allgemeine Preußische Staatszeitung, Deutscher Reichsanzeiger). The latest results with OCR from more than 360,000 scans are available online.
  • OpenCV Documentation
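
The combination described in this introduction can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `tesseract_langs` and `ocr_image` are hypothetical, and the preprocessing shown (grayscale conversion followed by Otsu thresholding) is a common choice that is assumed here rather than taken from the repository. It also assumes the Tesseract binary and the relevant language data (for example `hin` for Hindi or `ben` for Bengali) are installed.

```python
from pathlib import Path


def tesseract_langs(*codes: str) -> str:
    """Join Tesseract language codes into one string, e.g. "hin+eng".

    Tesseract accepts several languages at once when their codes are
    joined with "+".
    """
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_image(path: str, langs: str = "hin+eng") -> str:
    """Read an image, binarize it with OpenCV, and run Tesseract on it.

    Hypothetical helper: the preprocessing steps are an assumption,
    not the author's documented method.
    """
    # Imported here so tesseract_langs() stays usable without the OCR stack.
    import cv2
    import pytesseract

    image = cv2.imread(str(Path(path)))
    if image is None:
        # cv2.imread returns None instead of raising when loading fails.
        raise FileNotFoundError(f"could not read image: {path}")

    # Grayscale plus Otsu thresholding gives Tesseract a clean binary image.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )

    # pytesseract accepts NumPy arrays as well as Pillow images.
    return pytesseract.image_to_string(binary, lang=langs)
```

With a placeholder file name, a call such as `ocr_image("page.png", tesseract_langs("ben", "eng"))` would return the recognized text as a string, provided the Bengali and English language data are available to Tesseract.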

Source Code:

View on GitHub

Deployment:

  • I used PyCharm for most of the development. The tool can also be run in JupyterLab and Jupyter Notebooks.
  • Notebook environments do not support OpenCV’s ‘imshow()’ function, so use a different way to display images there.
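
One possible way to view images in a notebook is to convert the OpenCV array to a Pillow image, which notebooks render inline. The sketch below is an assumption about how this could be done, not the project's own approach; the helper name `bgr_to_pil` is hypothetical. It relies on the fact that OpenCV stores colour images in BGR channel order, while Pillow expects RGB.

```python
import numpy as np
from PIL import Image


def bgr_to_pil(image: np.ndarray) -> Image.Image:
    """Convert an OpenCV-style uint8 array to a Pillow image.

    Hypothetical helper: accepts either a 2-D grayscale array or a
    3-channel BGR array, as returned by cv2.imread.
    """
    if image.ndim == 2:
        # Grayscale images need no channel reordering.
        return Image.fromarray(image)
    if image.ndim == 3 and image.shape[2] == 3:
        # Reverse the last axis to turn BGR into RGB.
        rgb = np.ascontiguousarray(image[:, :, ::-1])
        return Image.fromarray(rgb)
    raise ValueError(f"unsupported image shape: {image.shape}")
```

In a notebook cell, leaving `bgr_to_pil(img)` as the last expression shows the image below the cell, where `img` is any array loaded with OpenCV.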

Acknowledgements: