How to upload a new .traineddata file for pytesseract to perform OCR

Hello, I have hosted a web service that performs simple OCR, but because its output is not 100% accurate, I'm trying to upload a .traineddata file for a similar font to improve it.

Normally we can add the .traineddata file to C:\Program Files (x86)\Tesseract-OCR\tessdata

But what about on the PythonAnywhere website?

I'm not a Tesseract expert, but I think you can just upload it to any directory inside your home directory, then use the TESSDATA_PREFIX environment variable to tell the system where it is.

I don't understand how to use the TESSDATA_PREFIX environment variable on PythonAnywhere. Can you please elaborate?

You set an environment variable in Bash by using this command:

export TESSDATA_PREFIX=/home/sodmzs/something

...where the /home/sodmzs/something is the value you're setting it to. Any other commands you run in there (for example python) will have access to that environment variable. So if you upload the data to a directory inside your home directory, then set the environment variable, Tesseract should look there for its data.
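To illustrate the inheritance point in Python terms, here is a minimal sketch: a variable placed in a process's environment is visible to the child processes it starts. The path is the placeholder from above, and the helper name child_sees is made up for this example.

```python
import os
import subprocess
import sys


def child_sees(name, value):
    """Start a child Python process with `name` set and return what it reads."""
    # Copy of the current environment plus the one extra variable.
    env = dict(os.environ, **{name: value})
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Placeholder path from the post above.
    print(child_sees("TESSDATA_PREFIX", "/home/sodmzs/something"))
```

The same holds for pytesseract, which runs the tesseract binary as a subprocess, so the variable has to be present in the environment of the process that calls it.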

On Windows, this command worked for me:

tesseract.exe input.jpg stdout --tessdata-dir d:\tessdata config_file

So you can try adding --tessdata-dir with the full path to the Tesseract command line.
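From pytesseract, the same flag can be passed through the config argument. This is a minimal sketch, assuming pytesseract splits the config string the way a shell would (I believe it uses shlex.split, but check your installed version). The function names are made up, and the path is a placeholder.

```python
import shlex


def tessdata_dir_config(tessdata_dir):
    """Build a config string for pytesseract's `config` argument."""
    # The space after --tessdata-dir matters: without it the flag and the
    # path are glued into a single token.
    return '--tessdata-dir "{}"'.format(tessdata_dir)


def ocr_with_custom_data(image_path, lang, tessdata_dir):
    """Run OCR using traineddata from a custom directory."""
    # Imported here so the helper above can be used without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=lang,
        config=tessdata_dir_config(tessdata_dir),
    )


if __name__ == "__main__":
    config = tessdata_dir_config("/home/sodmzs/something")
    print(shlex.split(config))  # ['--tessdata-dir', '/home/sodmzs/something']
```

Printing the split tokens is a quick way to confirm that the flag and the path come out as two separate arguments.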

Thank you so much, giles, for your quick reply! I have set the environment variable with export TESSDATA_PREFIX=/home/sodmzs/something and this is my code:

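    # Note the space between --tessdata-dir and the quoted path.
    tessdata_dir_config = '--tessdata-dir "/home/sodmzs/something"'
    # lang and config are arguments to image_to_string, not to Image.open.
    text = pytesseract.image_to_string(Image.open('/home/sodmzs/something/temp/image.jpg'), lang='ocr', config=tessdata_dir_config)
    return jsonify(response=text)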

Here lang='ocr' refers to the ocr.traineddata file I uploaded to /home/sodmzs/something.

But it is not working! Can you please figure out what I am doing wrong?

How is it not working? What happens? Do you actually have a directory called "something" in your home directory? Is that where your image is?

I have set TESSDATA_PREFIX=/home/sodmzs/something, but when I look in /var/log to figure out what the problem is, it shows me:

raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Tesseract Open Source OCR Engine v3.04.01 with Leptonica Error opening data file /usr/share/tesseract-ocr/tessdata/ocr.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language \'ocr\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

Please help me out with this.
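The error message suggests a layout check is worth doing. As far as I understand it (an assumption to verify against the Tesseract docs), Tesseract 3.x looks for TESSDATA_PREFIX/tessdata/<lang>.traineddata, while 4.x and later expect TESSDATA_PREFIX to be the tessdata directory itself. Note also that the path in the error is the system default, /usr/share/tesseract-ocr/tessdata/, which suggests the variable may not have been visible to the process running the web app at all. Here is a minimal sketch that reports which of the two layouts holds the file. The function names are made up.

```python
import os


def candidate_paths(prefix, lang):
    """Return the places the traineddata file could sit relative to prefix."""
    filename = lang + ".traineddata"
    return [
        os.path.join(prefix, "tessdata", filename),  # prefix is the parent of tessdata
        os.path.join(prefix, filename),              # prefix is the data directory itself
    ]


def find_traineddata(prefix, lang):
    """Return the candidate paths that actually exist on disk."""
    return [p for p in candidate_paths(prefix, lang) if os.path.isfile(p)]


if __name__ == "__main__":
    # Falls back to the placeholder path from this thread if the variable is unset.
    prefix = os.environ.get("TESSDATA_PREFIX", "/home/sodmzs/something")
    found = find_traineddata(prefix, "ocr")
    print(found or "ocr.traineddata not found under " + prefix)
```

If the file only shows up in the second location while the installed Tesseract is a 3.x build, moving it into a tessdata subdirectory may be all that is needed.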

See http://help.pythonanywhere.com/pages/environment-variables-for-web-apps/ for how to set environment variables for web apps.
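The reason this matters: as I understand it, the web app does not run inside your Bash console, so an export typed there is not visible to it. One common approach is to set the variable at the top of the WSGI file, before the app is imported. This is a minimal sketch. The module name in the commented import is hypothetical, and the path is the placeholder used in this thread.

```python
import os

# Set the variable before importing the app, so that any code running at
# import time (and any subprocess started later) sees it.
os.environ["TESSDATA_PREFIX"] = "/home/sodmzs/something"  # placeholder path

# Hypothetical import; keep whatever your WSGI file already has here.
# from flask_app import app as application
```

The web app needs to be reloaded after editing the WSGI file for the change to take effect.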

Thank you, glenn, for your reply. I followed the link exactly and successfully set TESSDATA_PREFIX, but unfortunately I am now stuck on a new error.

File "/home/sodmzs/.virtualenvs/something/lib/python3.7/site-packages/pytesseract/pytesseract.py", line 194, in run_tesseract
raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "Tesseract Open Source OCR Engine v3.04.01 with Leptonica read_params_file: Can't open txt Failed loading language 'ocrb' Tesseract couldn't load any languages! Could not initialize tesseract.")

Please help me out with this.

All I can tell from that is that either your TESSDATA_PREFIX is not set to the correct directory or perhaps the directory that you're setting it to is not laid out the way that tesseract is expecting it to be.

None of us here are Tesseract experts, so perhaps you can check the Tesseract docs to see if there is a way to get more detail about the error, or you could ask on their forums, where someone more knowledgeable about Tesseract can help you.
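One way to get more detail, offered as a sketch rather than a tested recipe: the tesseract command line has a --list-langs option that prints the languages it can find, and it can be combined with --tessdata-dir. Running it from Python with the same directory your app uses shows whether Tesseract sees the file at all. The function names are made up, and list_langs returns None when the tesseract binary is not on the PATH.

```python
import shutil
import subprocess


def list_langs_command(tessdata_dir=None):
    """Build the tesseract command that lists available languages."""
    cmd = ["tesseract", "--list-langs"]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def list_langs(tessdata_dir=None):
    """Run the command and return its combined output, or None if tesseract is missing."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        list_langs_command(tessdata_dir),
        capture_output=True,
        text=True,
    )
    # Some versions print the list to stderr, so both streams are returned.
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    # Placeholder path from this thread.
    print(list_langs("/home/sodmzs/something"))
```

If the language from the error message is missing from the output, the problem is with the directory or the file name rather than with pytesseract.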