To speed up my workflow, I wrote a bash script that uses the
open source programs gphoto2, tesseract,
grep and ImageMagick to digitize my mom's 338-page
book. Here is the link to the script:
https://github.com/deltaray/ocr-script
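For a rough feel of how these four tools can fit together before opening the repository, here is a minimal sketch. It is not the linked script: the function name digitize_page, the file naming scheme, the grayscale conversion, and the line-count check with grep are all assumptions made for illustration.

```shell
# Hypothetical sketch, NOT the linked script: capture one page, clean it up,
# OCR it, and use grep as a quick sanity check on the result.
digitize_page() {
    local page="$1"                       # page label such as 001 (assumed naming)
    local raw="page-${page}.jpg"
    local clean="page-${page}-gray.png"

    # gphoto2: trigger the tethered camera and download the shot.
    # Its status messages are sent to stderr so stdout carries only the count.
    gphoto2 --capture-image-and-download --filename "$raw" >&2 || return 1

    # ImageMagick: convert to grayscale (an assumed preprocessing step).
    # On ImageMagick 7 the preferred command name is "magick".
    convert "$raw" -colorspace Gray "$clean" || return 1

    # tesseract IMAGE OUTPUTBASE writes its text to OUTPUTBASE.txt.
    tesseract "$clean" "page-${page}" >&2 || return 1

    # grep: print how many lines contain at least one letter.
    # grep exits non-zero when the count is 0, so that case is tolerated.
    grep -c '[[:alpha:]]' "page-${page}.txt" || true
}
```

Calling digitize_page 001 with a camera attached would leave page-001.jpg, page-001-gray.png and page-001.txt in the current directory and print the number of lines that contain letters; a count of 0 suggests the page needs to be photographed again.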
Comment #1 posted on 2023-11-29 20:39:52 by brian-in-ohio
good show
Enjoyed every minute of this show. It's something I've wanted to try, and now I think I will. Nice little rant at the end, hit the nail on the head. Keep the shows coming.
Comment #2 posted on 2023-12-03 13:19:03 by Deltaray
Thanks
Thanks, I appreciate that feedback and good luck with your endeavors.
Comment #3 posted on 2024-11-29 04:52:38 by SolusSpider - Peter Paterson
Experience with Tesseract OCR software
Greetings Deltaray, so pleased to meet you.
My own experience with Tesseract OCR software is via my volunteer work with MissionAssist.
MissionAssist is a UK-based charity.
I volunteer for them as a Digitisation Keyboarder, receiving PDF scans of Bibles and other books, from people groups all over the world, and typing the chapter text into a structured text file.
https://missionassist.org.uk/services/digitisation/bible-digitisation-project/
Tesseract is a wonderful tool that helps me with a lot of the process, obtaining a text file and then working directly on it.
Since I run KDE, I use Spectacle to highlight the area of the PDF I want to convert into a PNG file for tesseract to read.
A lot of the scans we receive are not exactly straight, are often in columns, and have ink marks and bleed-through from the other side. So, not always a straightforward OCR process.
I save these files with chapter and verse references in the title.
Once I have a set of PNG files from my allocated chapter, I simply run tesseract per file to create the text file.
I then use cat to collect the text files into one file to work on.
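The per-file OCR step and the final merge described above can be sketched as a small shell function. This is an illustration under assumptions, not the commenter's actual commands: the function name ocr_chapter, the directory layout, and the output file name are hypothetical.

```shell
# Hypothetical sketch: OCR every PNG in a directory, then collect the text.
ocr_chapter() {
    local dir="$1"        # directory holding the chapter's PNG files
    local out="$2"        # combined text file to create
    local png base

    : > "$out"            # start with an empty output file

    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue          # no PNG files: skip the literal glob
        base="${png%.png}"
        # tesseract IMAGE OUTPUTBASE writes its text to OUTPUTBASE.txt.
        tesseract "$png" "$base" >&2 || return 1
        # cat appends each page's text in glob (sorted) order.
        cat "$base.txt" >> "$out"
    done
}
```

For example, ocr_chapter ./genesis-01 genesis-01-combined.txt would process the PNG files in sorted name order, so zero-padded chapter and verse numbers in the file names keep the combined text in reading order.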
Your show was really more about using bash and especially the grep command to process your project.
I learned a lot from that alone! Thanks for the education.
I checked your HPR profile and was not surprised that you are the guy behind @climagic.
I did follow you on Twitter, but left at the buyout.
So glad to know you are on Mastodon, and I followed that account today.
I do plan to record my own show about how I use tesseract in my volunteer work with MissionAssist, but given my current workload and other reasons, I am looking at sometime in the new year of 2025.