hpr2091 :: Everyday Unix/Linux Tools for data processing
In this episode, I give some examples of common and uncommon tools for processing data files.
Hosted by Mr. Young on Monday, 2016-08-08. The episode is flagged as Clean and is released under a CC-BY-SA license.
linux, unix, data, command-line.
The show is available on the Internet Archive at: https://archive.org/details/hpr2091
Duration: 00:30:15
general.
Here are some of the tools I use to process and clean data from all manner of customers:
The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It can also translate or clean up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.
See other episodes for great sed information. I like to remove DOS end-of-line and end-of-file characters:
sed -i 's/^M//g' *.txt
(here ^M stands for a literal carriage return, typed as Ctrl-V Ctrl-M)
or
sed -i 's/\r//g' *.txt
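As a minimal sketch of the cleanup described above, the following wraps the sed call in a small helper and also strips the DOS end-of-file marker (Ctrl-Z, byte 0x1A), which the one-liners above do not touch. The helper name `strip_dos` and the throwaway demo file are made up for illustration, and the `-i` option and the `\r` / `\x1a` escapes assume GNU sed. Unlike the commands above, this variant only removes a carriage return at the end of a line.

```shell
#!/bin/sh
# Hypothetical helper: strip trailing DOS carriage returns and any
# Ctrl-Z (0x1A) end-of-file markers from the files given as arguments.
# Assumption: GNU sed (for -i and the \r / \x1a escapes).
strip_dos() {
    sed -i -e 's/\r$//' -e 's/\x1a//g' "$@"
}

# Demo on a throwaway file with CRLF line endings and a trailing Ctrl-Z.
demo=$(mktemp)
printf 'alpha\r\nbeta\r\n\032' > "$demo"
strip_dos "$demo"
od -c "$demo"   # inspect the bytes that remain
rm -f "$demo"
```

Running `od -c` before and after is a quick way to confirm which control characters a customer file actually contains.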
Command-line tools
- ack
- awk
- detox
- grep
- pandoc
- pdftotext -layout
- sed
- unix2dos and dos2unix
- wget
- curl
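As a rough illustration of how a couple of these tools combine in a pipeline, here is a minimal sketch that filters comma-separated lines with grep and totals a column with awk. The sample data, the column layout, and the helper name `sum_matching` are invented for the example and are not from the episode.

```shell
#!/bin/sh
# Hypothetical helper: read comma-separated lines on stdin, keep those
# matching the pattern in $1, and print the sum of their second field.
sum_matching() {
    grep -- "$1" | awk -F, '{ total += $2 } END { print total + 0 }'
}

# Made-up sample data: item name, quantity.
printf 'apples,3\npears,5\napples,4\n' | sum_matching apples
```

The `total + 0` forces a numeric result, so the helper prints 0 rather than an empty line when nothing matches.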
R libraries
- RCurl
- XML
- rvest
- tm
- xlsx
Python libraries
- beautifulsoup
- csv
- nltk (YouTube series)
- rdflib
- re
Vim tricks
- buffer searches (:vim /pattern/ ##)
- Ack plugin
- bufdo (:bufdo %s/pattern/replace/ge | update)