To extract text from an image using Optical Character Recognition (OCR) in Python, you can use Tesseract, one of the most popular open-source OCR engines, through the pytesseract wrapper. Here's a step-by-step guide on how to do this:
Install Required Libraries: First, you need to install the required libraries. Open your terminal or command prompt and run the following command:
pip install pytesseract pillow
This will install the pytesseract library (a Python wrapper for Tesseract) and the Pillow library (used for image handling).
Install Tesseract: Tesseract itself is not a Python library, but a standalone OCR engine. You need to install it separately. You can download the installer from the official Tesseract GitHub repository: https://github.com/tesseract-ocr/tesseract
Import Libraries: In your Python script, import the required libraries:
python
from PIL import Image
import pytesseract
Load and Process the Image: Load the image using the Image class from the Pillow library:
python
image = Image.open('your_image_path.jpg')
Perform OCR: Use the pytesseract.image_to_string function to perform OCR on the loaded image:
python
text = pytesseract.image_to_string(image)
Print Extracted Text: Finally, you can print the extracted text:
python
print(text)
Here's the complete code snippet:
python
from PIL import Image
import pytesseract
# Load the image
image = Image.open('your_image_path.jpg')
# Perform OCR
text = pytesseract.image_to_string(image)
# Print extracted text
print(text)
Replace 'your_image_path.jpg' with the actual path to your image file.
Remember to adjust the parameters of the OCR function as needed. For example, you can specify the language of the text using the lang parameter if the text in the image is not in English. The matching Tesseract language data must be installed. For German, for instance:
python
text = pytesseract.image_to_string(image, lang='deu')
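The lang parameter can also take several language codes joined with a plus sign, which helps when an image mixes languages. Here is a small sketch of building that string. The helper name build_lang_argument is hypothetical, and each language's data is assumed to be installed for Tesseract:

```python
def build_lang_argument(codes):
    """Join Tesseract language codes into a single lang string."""
    # Ignore empty entries and surrounding whitespace.
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


lang = build_lang_argument(["eng", "deu"])
print(lang)  # eng+deu
```

The result can then be passed along as pytesseract.image_to_string(image, lang=lang).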
Make sure the Tesseract executable is in your system's PATH, or provide the path explicitly by setting the pytesseract.pytesseract.tesseract_cmd variable before performing OCR:
python
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
Replace the path with the actual path to your Tesseract executable.
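If you are not sure whether Tesseract is already on your PATH, you can check from Python before hard-coding a location. The following is only a sketch: resolve_tesseract_cmd is a made-up helper name, and the Windows path is the same example location used above, which may differ on your machine:

```python
import os
import shutil


def resolve_tesseract_cmd(candidates=()):
    """Return a usable Tesseract path, or None if none is found."""
    # Prefer an executable that is already reachable through PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise fall back to the first candidate file that exists on disk.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


cmd = resolve_tesseract_cmd([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
print(cmd or "Tesseract not found")
```

When cmd is not None, you can assign it to pytesseract.pytesseract.tesseract_cmd as shown above.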
Keep in mind that the accuracy of the OCR process heavily depends on the quality and clarity of the image. Pre-processing the image (e.g., resizing, enhancing contrast) might help improve OCR results.
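As an illustration of that kind of pre-processing, here is a minimal sketch using only Pillow: it converts the image to grayscale, enlarges it, and stretches the contrast. The function name preprocess_for_ocr and the default scale factor of 2 are arbitrary choices for this example, and whether these steps actually help depends on the image:

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2):
    """Return a grayscale, upscaled, contrast-stretched copy of image."""
    gray = image.convert("L")  # drop color information
    width, height = gray.size
    # Upscale so that small text covers more pixels.
    enlarged = gray.resize((width * scale, height * scale), Image.LANCZOS)
    # Stretch the histogram so the darkest and lightest pixels span the full range.
    return ImageOps.autocontrast(enlarged)
```

You could then run OCR on the result with pytesseract.image_to_string(preprocess_for_ocr(image)) and compare the output against the unprocessed image.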