
Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes every weekday Monday through Friday.

hpr2637 :: Convert it to Text

This episode will make you want to TXT all the things.

Hosted by Mr. Young on Tuesday, 2018-09-11 is flagged as Clean and is released under a CC-BY-SA license.
Tags: Unix, text, portability, ranger. Comments: 5.
The show is available on the Internet Archive at: https://archive.org/details/hpr2637

Duration: 00:16:21

Why use plain text?

  • Portability
  • Use with Unix tools
  • Use with Ranger

Ranger for the win

  • Ranger is a free console file manager that gives you greater flexibility and a good overview of your files without having to leave your *nix console. It visualizes the directory tree in two dimensions: the directory hierarchy on one, lists of files on the other, with a preview to the right so you know where you’ll be going.
  • The scope functionality is where converting to text pays off. Located at $HOME/.config/ranger/scope.sh, scope is the feature that provides file previews from inside the console. Text files are highlighted based on their file extension; for non-text files, different converters can be used to coerce the file into a text representation. Some converters are configured out of the box, but the script is written in such a way that any text output can be shown in the preview pane.
  • The basic format of the scope case statement is as follows:
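case "$extension" in
    # OpenDocument text and presentation files: convert with odt2txt,
    # then print, truncate and wrap the result (try, dump and trim are scope.sh helpers)
    odt|odp)
        try odt2txt "$path" && { dump | trim | fmt -s -w "$width"; exit 0; };;
esac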
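
The try, dump and trim calls in the case statement are helper functions that scope.sh defines for itself. A minimal sketch of how such helpers might look, assuming they capture a converter's output and cap the length of the preview. The function bodies, the maxln and width defaults, the preview_text wrapper and the cat fallback are illustrative assumptions, not copied from Ranger:

```shell
#!/bin/sh
# Assumed defaults; scope.sh receives its real values from Ranger.
maxln=200   # maximum number of preview lines
width=80    # width of the preview pane in columns

# Run a command and capture its output; succeeds only if the command does.
try() { output=$("$@"); }

# Print whatever the last successful try captured.
dump() { printf '%s\n' "$output"; }

# Keep only the first $maxln lines of the preview.
trim() { head -n "$maxln"; }

# Hypothetical wrapper: pick a converter by file extension.
# It uses return instead of exit so it can be sourced and called repeatedly.
preview_text() {
    path=$1
    extension=$(printf '%s' "${path##*.}" | tr '[:upper:]' '[:lower:]')
    case "$extension" in
        odt|odp)
            try odt2txt "$path" && { dump | trim | fmt -s -w "$width"; return 0; };;
        *)
            # Assumed fallback: treat anything else as plain text.
            try cat "$path" && { dump | trim; return 0; };;
    esac
    return 1
}
```

Because try only succeeds when the converter does, a missing odt2txt simply makes the branch fall through and the function return 1, rather than printing an error into the preview.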

Tools in the toolset

  • atool
  • caca-utils
  • poppler-utils
  • catdoc
  • catppt
  • odt2txt
  • ods2tsv
  • docx2txt
  • xlsx2csv
  • mediainfo
  • lynx/w3m/elinks
  • highlight
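
Most of the tools above correspond to one or more file extensions in the scope case statement. As a rough sketch, the mapping below pairs extensions with converter commands and reports which ones are not installed. The extension-to-tool pairing and both function names are assumptions for illustration (pdftotext ships in poppler-utils; lynx or elinks could stand in for w3m):

```shell
#!/bin/sh
# Print the converter command assumed for a given file extension.
# Returns non-zero for extensions with no known converter.
converter_for() {
    case "$1" in
        pdf)            echo pdftotext ;;   # from poppler-utils
        doc)            echo catdoc ;;
        ppt)            echo catppt ;;
        odt|odp)        echo odt2txt ;;
        ods)            echo ods2tsv ;;
        docx)           echo docx2txt ;;
        xlsx)           echo xlsx2csv ;;
        htm|html|xhtml) echo w3m ;;         # lynx or elinks also work
        *)              return 1 ;;
    esac
}

# List the converters from the mapping that are not on the PATH.
missing_converters() {
    for ext in pdf doc ppt odt ods docx xlsx html; do
        tool=$(converter_for "$ext") || continue
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_converters before editing scope.sh shows which previews would fail silently for lack of a tool.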

Bonus tools

  • q
  • jq
  • xmlstarlet
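
Once data is in a text format, the bonus tools can query it directly. A small sketch using jq, assuming a JSON array of objects with id and title fields; the function name and the sample data are made up for illustration:

```shell
#!/bin/sh
# Read a JSON array of objects on stdin and print the title of the
# object whose id matches the first argument (field names are assumed).
title_for_id() {
    jq -r --argjson id "$1" '.[] | select(.id == $id) | .title'
}

# Example use with made-up data:
#   printf '%s' '[{"id":1,"title":"first"},{"id":2,"title":"second"}]' | title_for_id 2
```

The -r flag prints the raw string without JSON quoting, which keeps the output easy to pass on to other Unix tools.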

Comments

Comment #1 posted on 2018-09-13 05:00:37 by Ken Fallon

WOW

Those that I know I use literally every day. Can't wait to try the rest out.

Please do a deep dive series on each. No pressure.

Comment #2 posted on 2018-09-14 11:23:51 by Beeza

Value of text conversion

I'm a big fan of plain text and CSV files, as they are probably the formats that will last conceptually forever - unlike the Office formats we use today (including ODS/ODT etc). You may lose the layout information but the "meat" is always preserved.

The PDF to Text converters only work with documents which have been generated from a WP application. Scans of a printed document generally only produce an embedded JPG image.

A few years ago I created a system that employed many of the commands you mention in your episode to convert a document into pure ASCII text, then create a non-repeating list of all the words it contains, along with an instance count (using SQL). By applying this to the contents of a document library the database was used to populate a "search by keyword" system for that library.

Populating the database from several hundred Word and PDF documents took only a couple of minutes. The subsequent keyword searches were very fast and produced a list of relevant documents ranked by the number of instances of the keyword. It was very easy to combine keywords using SQL "AND" and "OR" qualifiers.

Comment #3 posted on 2018-09-15 11:49:36 by Jonas

Ranger, etc.

I'm a die hard vimmer and have never heard of Ranger. I'm looking forward to using it more. I asked a couple of my online Linuxey buddies and they used it years ago when they had less substantial machines. I still love the command line stuff even with my best machines. Everything is super quick in the terminal.
Thanks for the mention and your great shows.
I need to explore jq for sure. I work with a database that saves a couple columns in JSON. It would be nice to query the exports in a more friendly way.

Comment #4 posted on 2018-09-15 15:34:53 by Dave Morriss

Great show

I installed Ranger after listening to your show 1756 (https://hackerpublicradio.org/eps.php?id=1756) but never used it and completely forgot about it. I was surprised to find it on my system and have been playing about with it a lot since listening to this show.

I'm a long-time text and command-line user but I tend to use Midnight Commander for the times I want to do a lot of file searching and manipulation, though I have to admit I use Dolphin sometimes in two-pane mode when I'm doing things like copying files off an SD card. I shall add Ranger to the mix too I think.

I agree with Ken: we need shows about all of the tools in your list!

Anyway, this was a very welcome episode. Thanks.

Comment #5 posted on 2018-09-20 03:07:44 by clacke

Q

Never heard of Q before. Very cool! I will very likely find use for this.

Not a very googlable name, but I found it here: https://harelba.github.io/q/
