
Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes every weekday Monday through Friday.


hpr4341 :: Transferring Large Data Sets

How to transfer large data sets using tar and Blu-ray discs while preserving metadata.


Hosted by hairylarry on Monday, 2025-03-24. This episode is flagged as Clean and is released under a CC-BY-SA license.
Tags: libreindiearchive, archival, data, bash, tar.

Listen in ogg, opus, or mp3 format.

Duration: 00:10:30
Download the transcription and subtitles.

Programming 101.

A series focusing on concepts and the basics of programming

Transferring Large Data Sets

Very large data sets present their own problems. Not everyone has directories with hundreds of gigabytes of project files, but I do, and I assume I'm not the only one.

For instance, I have a directory with over 700 radio shows; many of these show directories also contain a podcast, along with pictures and text files.

Doing a properties check on the directory I see 450 gigabytes of data.

When I started envisioning Libre Indie Archive I wanted to move the directories into archival storage on optical discs. My first attempt at this didn't work: copying the files straight onto the discs lost their metadata, and since optical discs are read only there was no way to fix it afterwards.

After further work and study I learned that tar files can preserve metadata if they are created and extracted as root. In fact, if you are running tar as root, preserving file ownership and permissions is the default.
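As a quick sanity check, here is a minimal sketch of that round trip using throwaway temp files (the paths are made up for illustration). Run as a normal user it preserves file modes; run as root it also restores ownership, since --same-owner is the default for root.

```shell
# Minimal sketch: tar round-trips file modes; run as root it also
# restores ownership (--same-owner is root's default). Throwaway paths.
work=$(mktemp -d)
mkdir -p "$work/src"
echo "show notes" > "$work/src/episode.txt"
chmod 640 "$work/src/episode.txt"

tar -C "$work" -cf "$work/archive.tar" src        # create the archive
mkdir "$work/restore"
tar -C "$work/restore" -xpf "$work/archive.tar"   # -p restores modes

stat -c '%a' "$work/restore/src/episode.txt"      # prints 640
```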

So this means that optical discs are an option if you write tar archives onto them.

I have better success rates with 25 GB Blu-ray discs than with the 50 GB discs. So, if your directory breaks up into projects that fit on 25 GB discs, that's great.

My data did not divide that easily, but tar has an option to write a data set to multiple tar files, each with a maximum size, labeling them -0, -1, and so on.

When using this multi volume feature you cannot use compression. So you will get tar files, not tar.gz files.

It's better to break the file sets up into more reasonable sizes, so I decided to divide the shows alphabetically by title: all the shows starting with the letter a would be one data set, and then on down the alphabet, one letter at a time.

Most of the letters would result in a single tar file labeled -0 that would fit on the 25 GB disc. Many letters, however, took two or even three tar files that would have to be written on different discs and then concatenated on the primary system before they are extracted to the correct location in primaryfiles.

There is a companion program to tar, called tarcat, that I used to combine 2 or 3 tar files split by length into a single tar file that could be extracted.

I ran engrampa as root to extract the files.

So, I used a tar command on the working system where my Something Blue radio shows are stored. Then I used K3b to burn these files onto 25 GB Blu-ray discs, carefully labeling the discs and keeping a text file that tracked which files I had already copied to disc.

Then on the Libre Indie Archive primary system I copied the file or files for that data set from the Blu-ray disc to the boot drive. Then I would use tarcat to combine the files if there was more than one file for that data set. And finally I would extract the files to primaryfiles by running engrampa as root.

Now I'm going to go into details on each of these steps.

First make sure that the Libre Indie Archive program, prep.sh, is in your home directory on your workstation. Then, from the data directory to be archived (in my case the something_blue directory), run prep.sh like this:

~/prep.sh

This will create a file named IA_Origin.txt that lists the date, the computer and directory being archived, and the users and userids on that system. All very helpful information to have if at some time in the future you need to do a restore.

Next create a tar data set for each letter of the alphabet. (You may want to divide your data set in a different way.)

Open a terminal in the same directory as the data directory, my something_blue directory, so that ls displays something_blue (your data directory). I keep the Something Blue shows and podcasts in subdirectories in the something_blue directory.

Here's the tar command.

Example a:

sudo tar -cv --tape-length=20000000 --file=somethingblue-a-{0..50}.tar /home/larry/delta/something_blue/a*

This is for the letter a, so the --file parameter includes the letter a. The numbers 0..50 in the curly braces are the sequence numbers for the files. I only had one file for the letter a, somethingblue-a-0.tar.

The last parameter is the source for the tar files, in this case

/home/larry/delta/something_blue/a*

All of the files and directories in the something_blue directory that start with the letter a.

You may want to change the --tape-length parameter. It counts units of 1024 bytes, so 20000000 stores up to about 19.1 GiB per volume. The usable data capacity of a 25 GB Blu-ray disc is about 23.3 GiB.
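If you want to estimate ahead of time how many volumes a data set will need, a small helper like this works. It is my own sketch, not part of the original workflow; volumes_needed is a hypothetical function name and the example path is illustrative.

```shell
# Sketch: estimate how many multi-volume tar files a directory will need.
# tape_kib is in units of 1 KiB, matching tar's --tape-length.
volumes_needed() {
  local dir=$1 tape_kib=$2 used_kib
  used_kib=$(du -sk "$dir" | awk '{print $1}')   # total size in KiB
  echo $(( (used_kib + tape_kib - 1) / tape_kib ))  # ceiling division
}

# Example (path illustrative):
# volumes_needed /home/larry/delta/something_blue/b 20000000
```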

Example b:

For the letter b, I ended up with three tar files.

somethingblue-b-0.tar
somethingblue-b-1.tar
somethingblue-b-2.tar

I will use these files in the example below using tarcat to combine the files.

I use K3b to burn Blu-ray data discs. Besides installing K3b you have to install some other programs, and there is a particular setup that needs to be done, including selecting cdrecord and turning off multisession. Here's an excellent article that goes step by step through the installation and setup.

How to burn Blu-ray discs on Ubuntu and derivatives using K3b?

https://en.ubunlog.com/how-to-burn-blu-ray-discs-on-ubuntu-and-derivatives-using-k3b/

I also always check Verify data, and I use the Linux/Unix file system, not Windows, which will rename your files if the filenames are too long.

I installed a Blu-ray reader in the primary system and used Thunar to copy the files from the Blu-ray disc to the boot drive.

In the primaryfiles directory I make a subdirectory, something_blue, to hold the archived shows.

If there is only one file, like in example a above, you can skip the concatenation step.

If there is more than one file, like Example b above, you use tarcat to concatenate these files into one tar file.

You have to do this. If you try to extract from just one of the numbered files when there is more than one, you will get an error. So if I try to extract from somethingblue-b-0.tar and get an error, it doesn't mean there's anything wrong with that file. It just has to be concatenated with the other b files before it can be extracted.

Here's the tarcat command I used for Example b, above.

tarcat somethingblue-b-0.tar somethingblue-b-1.tar somethingblue-b-2.tar > sb-b.tar

This will concatenate the three smaller tar files into one bigger tar file named sb-b.tar.
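Before extracting, it's worth a quick integrity check. This is my suggestion, not a step from the original workflow, and the filename is illustrative; tar exits non-zero if the archive is truncated or the pieces were concatenated out of order.

```shell
# Sketch: integrity-check a combined archive by listing it without
# extracting anything. Filename is illustrative.
if tar -tvf sb-b.tar > /dev/null; then
  echo "sb-b.tar looks intact"
else
  echo "sb-b.tar is damaged or incomplete"
fi
```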

To preserve the metadata you have to extract the files as root. To make it easier to select the files to be extracted and where to store them, I use the GUI archive manager, engrampa. To run engrampa as root, open a terminal with Ctrl-Alt-T and use this command:

sudo -H engrampa

Click Open and select the tar file to extract. Then follow the path until you are in the something_blue directory and you are seeing the folders and files you want to extract. (Instead of the something_blue directory, you will go to your data directory.) Type Ctrl-A to select them all.

Then click Extract at the top of the window. Open the directory where you want the files to go. In my case, primaryfiles/something_blue.

Then click Extract again in the lower right.
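If you'd rather skip the GUI, the same extraction can be done with tar directly. This is a sketch: the destination path is illustrative, and because the archive was created from an absolute path, tar stores the files under a home/larry/delta/something_blue prefix that --strip-components can drop (the count of 4 matches that example path; adjust it for yours).

```shell
# Sketch: command-line equivalent of the engrampa extraction. Run as
# root so ownership is restored (--same-owner is root's default).
# Paths are illustrative; --strip-components=4 drops the stored
# home/larry/delta/something_blue prefix.
sudo tar -xpf sb-b.tar -C /home/archive/primaryfiles/something_blue \
         --strip-components=4
```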

After the files are extracted go to your data directory in primaryfiles and check that the directories and files are where you expect them to be.

You can also open a terminal in that directory and type

ls -l

to review the meta data.

When dealing with data chunks of 20 GB or more, each one of these steps takes time. The reason I like using an optical disc backup to transfer the files from the working system to Libre Indie Archive is that it gives me an easy-to-store backup that is not on a spinning drive and cannot be overwritten. Still, optical disc storage is not perfect either. It's just another belt to go with your suspenders.

Another way to transfer directories into the primaryfiles directory is with ssh over the network. This is not as safe as using optical discs, and it does not provide the extra snapshot backup. It also takes a long time, but it is not as labor intensive.
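For what it's worth, a tar pipe over ssh preserves the same metadata as the disc workflow, as long as the receiving tar runs as root. This is only a sketch; the hostname and paths are made-up assumptions, not the author's actual setup.

```shell
# Sketch: stream a directory over ssh, preserving permissions and
# (with root on the receiving end) ownership. Host and paths are
# illustrative assumptions.
tar -C /home/larry/delta -cf - something_blue \
  | ssh root@primary.local 'tar -C /home/archive/primaryfiles -xpf -'
```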

After I spend some more time thinking about this and testing I will do a podcast about transferring large data sets with ssh.

Although I am transferring large data sets to move them into archival storage using Libre Indie Archive, there are many other situations where you might want to move a large data set while preserving the metadata. So what I have written about tar files, optical discs, and running Thunar and engrampa as root is generally applicable.

As always, comments are appreciated. You can comment on Hacker Public Radio or on Mastodon. Visit my blog at home.gamerplus.org where I will post the show notes and embed the Mastodon thread for comments about this podcast.

Thanks



