hpr3998 :: Using open source OCR to digitize my mom's book

How I used open source tools such as gphoto2 and the OCR software tesseract to digitize pages

Hosted by Deltaray on Wednesday, 2023-11-29. Flagged as Clean and released under a CC-BY-SA license.
Tags: ocr, opensource, grep, scripts, programming. Comments: 3.

Duration: 00:30:47

Part of the series: general.

To speed up my workflow, I wrote a bash script that uses the open source programs gphoto2, tesseract, grep and ImageMagick to digitize my mom's 338-page book. Here is the link to the script: https://github.com/deltaray/ocr-script
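
For readers who want a feel for how these four tools can be chained, here is a minimal sketch of a per-page pipeline. It is not the script linked above: the function name digitize_page, the page-NNN file naming, the choice of preprocessing options, and the way grep is used as a sanity check are all assumptions made for illustration. It also assumes a camera that gphoto2 can control over USB and an ImageMagick install that provides the convert command (newer releases also ship it as magick).

```shell
#!/usr/bin/env bash
# Hypothetical per-page pipeline: photograph, clean up, OCR, sanity-check.
# NOT the script from the show; names and options here are illustrative.

digitize_page() {
    local page="$1"                      # plain integer page number, e.g. 7
    local base
    base=$(printf 'page-%03d' "$page")   # zero-pad so files sort in page order

    # Trigger the tethered camera and save the photo under our own file name.
    gphoto2 --capture-image-and-download --force-overwrite \
            --filename "$base.jpg" || return 1

    # Convert to grayscale and straighten the photo before OCR (ImageMagick).
    convert "$base.jpg" -colorspace Gray -deskew 40% "$base.png" || return 1

    # tesseract appends ".txt" to the output base name, producing "$base.txt".
    tesseract "$base.png" "$base" -l eng || return 1

    # Rough check with grep: warn if the OCR output contains no letters at all.
    if ! grep -q '[[:alpha:]]' "$base.txt"; then
        echo "warning: no text recognized on page $page" >&2
    fi
}

# Example (assumed usage): digitize_page 7   # writes page-007.jpg/.png/.txt
```

Keeping each page as its own function call makes it easy to retake a single bad photo without rerunning the whole book.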

Comments

Comment #1 posted on 2023-11-29 20:39:52 by brian-in-ohio

good show

Enjoyed every minute of this show. It's something I've wanted to try, and now I think I will. Nice little rant at the end, hit the nail on the head. Keep the shows coming.

Comment #2 posted on 2023-12-03 13:19:03 by Deltaray

Thanks

Thanks, I appreciate that feedback and good luck with your endeavors.

Comment #3 posted on 2024-11-29 04:52:38 by SolusSpider - Peter Paterson

Experience with Tesseract OCR software

Greetings Deltaray, so pleased to meet you.
My own experience with Tesseract OCR software is via my volunteer work with MissionAssist.

MissionAssist is a UK-based charity.
I volunteer for them as a Digitisation Keyboarder, receiving PDF scans of Bibles and other books, from people groups all over the world, and typing the chapter text into a structured text file.
https://missionassist.org.uk/services/digitisation/bible-digitisation-project/

Tesseract is a wonderful tool that helps me with a lot of the process, obtaining a text file and then working directly on it.

Since I run KDE, I use Spectacle to highlight the area of the PDF I want to convert into a PNG file for tesseract to read.
A lot of the scans we receive are not exactly straight, are often in columns, and have ink marks and bleed-through from the other side. So it is not always a straightforward OCR process.
I save these files with chapter and verse references in the title.
Once I have a set of PNG files from my allocated chapter, I simply run tesseract per file to create the text file.
I then use cat to collect the text files into one file to work on.
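
The last two steps above can be sketched roughly as follows. This is only an illustration of the idea, not the commenter's actual commands: the function name ocr_chapter and the directory layout are assumptions. It relies on the shell expanding *.png and *.txt in sorted order, so the chapter and verse references in the file names need to sort the way the text should read (for example with zero-padded numbers).

```shell
#!/usr/bin/env bash
# Hypothetical helper: OCR every PNG in a directory, then merge the results.
# The name ocr_chapter and its arguments are illustrative assumptions.

ocr_chapter() {
    local dir="$1"   # directory holding the PNG crops for one chapter
    local out="$2"   # merged text file; keep it outside "$dir" so that
                     # cat does not read its own output
    local png

    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue              # skip if the glob matched nothing
        # tesseract appends ".txt" to the output base name itself.
        tesseract "$png" "${png%.png}" || return 1
    done

    # Globs expand in sorted order, so the file names decide the text order.
    cat "$dir"/*.txt > "$out"
}

# Example (assumed usage): ocr_chapter ./genesis-01 ./genesis-01-merged.txt
```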

Your show was really more about using bash and especially the grep command to process your project.
I learned a lot from that alone! Thanks for the education.

Checked your HPR profile and was not surprised you are the guy behind @climagic
I did follow you on Twitter, but left at the buyout.
So glad to know you are on Mastodon, and I followed that account today.

I do plan to record my own show about my use of tesseract as a volunteer with MissionAssist, but given my current workload and other reasons, I am looking at sometime in the new year of 2025.
