Copy & md5 Shell Script

Dustin Cross · Mar 11, 2011

Anyone who tries out the new percent script, please let me know how it works for you. It seems to be working fine for me.

Dusty

Dustin Cross · Mar 11, 2011

I did some testing today on the new version that tells you percent complete. Everything seems to be working fine. I uploaded a new version with a couple speed improvements.

I did some speed testing today, comparing ways of copying an 8GB Red CF card. Here are my results.

3:00 min - Finder 1 copy
11:30 min - Finder 4 copies
7:00 min - Finder 1 copy, then 3 copies from first

5:20 min - ShotPutPro 1 copy (byte checks)
37:00 min - ShotPutPro 4 copies (byte checks)

3:45 min - R3D Data Manager 1 copy (md5 checksums)
7:45 min - R3D Data Manager 4 copies (md5 checksums)

3:45 min - CopyScript 1 copy (md5 checksums)
8:00 min - CopyScript 4 copies (md5 checksums)

All copies were from the same CF card to the same destinations. The computer was shut down between each copy. This wasn't too scientific, just a quick test to give me an idea how this script compares.

I was pretty happy with these results for my little CopyScript. Not much slower than a straight Finder copy with no checksums.

For R3Ds RDM is still the best option, but my script works on almost any footage.

ShotPutPro surprised me. Having never compared it to something else I thought it would be much faster. When making four copies, I thought it was gonna kill the CF card and reader. I need to run that test again to make sure something odd didn't happen, but I didn't want to wait another 40 minutes.

Dusty

Jonathan Carbonaro · Mar 12, 2011

Dusty,

Thanks for a good script! This is great because you're not limited to R3D's!
I used it to back up audio as well, and it ran great.
The only thing I noticed was that when it wrote the Checksum, the percent complete stayed at 0% until it got to 99% and then it updated and finished.
Other than that, it worked perfectly!
I tested on my 15" MacBook Pro 2.53 dual core using OS X 10.5.8
Going to test on the Tower tomorrow..

Jonathan

Dustin Cross · Mar 12, 2011

Jonathan,

Were you copying a single file? That would give you 0% and then 100% with the way I have it set up.

The way I am calculating percent complete is to add the size of each file completed and compare that to the size of all files being copied. So if there was only one file, it would read 0% until it finished that file and then it would be 100%.
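
The calculation described above could look roughly like the sketch below. It is not the actual CopyScript code: the function names are made up for illustration, it only handles the regular files directly inside one folder, and it uses wc -c to get byte counts because that works the same on OS X and Linux.

```shell
#!/bin/sh
# Sketch only: percent complete = bytes of files finished / bytes of all files.
# file_bytes, percent_done and copy_with_percent are hypothetical helper names.

# Print the size of one file in bytes.
file_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Print the integer percentage that done_bytes ($1) is of total_bytes ($2).
percent_done() {
    if [ "$2" -eq 0 ]; then
        echo 100
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# Copy every regular file in src ($1) to dst ($2), reporting after each file.
copy_with_percent() {
    src=$1
    dst=$2
    total=0
    for f in "$src"/*; do
        [ -f "$f" ] && total=$(( total + $(file_bytes "$f") ))
    done
    copied=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        cp "$f" "$dst"/
        copied=$(( copied + $(file_bytes "$f") ))
        echo "$(percent_done "$copied" "$total")% complete"
    done
}
```

With a single file the only report is 100%, printed after that file finishes, which matches the 0% then 100% behaviour Jonathan saw.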

Or your files may have been so small that the copy finished before the checksum caught up. I have a slight delay built in so it doesn't error out because the previous step hasn't finished.

If that is not your situation, I will have to figure out why.

Dusty

Jonathan Carbonaro · Mar 12, 2011

Nope, that was it!
The file was a 5gig file, but yes, it was only one. I had actually just realized that as I did further testing with multiple files within a folder.
Thanks, that makes sense

Jonathan

Richard Goodwin · Mar 12, 2011

You can do a binary compare of directories with the diff command (as a secondary check on top of the md5 checksum).

There is also ditto, which is OSX specific but preserves metadata and whatnot.

Dustin Cross · Mar 12, 2011

Richard,

Never heard of ditto. I'll have to look into it.

Not sure what binary compare of directories would give over md5 of each file. How are you thinking binary compare would improve? I could do a diff -r and compare each destination directory to the source directory, excluding the md5 files the script creates.

Would this add anything that makes it worth the extra time?

I just did a quick test and this was very fast. Much faster than a full read or making md5s, so that makes me think it is not comparing files as completely as md5. I'll do more research into what diff is actually doing on directories.
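
For reference, the diff -r comparison mentioned above might be wrapped like this. This is a sketch, assuming the checksum files the script creates end in .md5 (the thread does not say what they are called); compare_trees is a made-up name.

```shell
#!/bin/sh
# Sketch: compare a destination tree to the source, ignoring checksum files.
# Assumes the checksum files end in .md5; compare_trees is a hypothetical name.
compare_trees() {
    # -r  recurse into subdirectories
    # -q  only report whether files differ, not how
    # -x  skip any file whose name matches the pattern
    diff -r -q -x '*.md5' "$1" "$2"
}
```

The exit status is 0 when the two trees match and non-zero when any file differs or is missing, so it can be used directly in an if statement.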

Thanks,
Dusty

Dustin Cross · Mar 17, 2011

Has anyone tried the CopyScript_percent script? What did you think? Anything you would like to see changed?

I am thinking about adding stuff to the log file about where files were copied from and all destinations, what time each step starts and finishes, and a name for the person doing the work. Basically a bunch of stuff that can be used as a paper trail. What do you guys think? Anything you would like to see in there? Or do you think you wouldn't use something like that?
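
A paper-trail entry of that kind could be written with something like the sketch below. The field layout, the log_step function, and the LOG_FILE and OPERATOR variables are all assumptions for illustration, not what the script actually writes.

```shell
#!/bin/sh
# Sketch of a timestamped log entry for a paper trail.
# LOG_FILE, OPERATOR and log_step are hypothetical names.
LOG_FILE=${LOG_FILE:-/var/tmp/copyscript.log}
OPERATOR=${OPERATOR:-unknown}

log_step() {
    # $1 = step name, remaining arguments = free-form details
    step=$1
    shift
    printf '%s | %s | %s | %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$OPERATOR" "$step" "$*" >> "$LOG_FILE"
}

# Example calls (paths are placeholders):
#   log_step "copy start"  "/Volumes/A001 -> /Volumes/RAID1"
#   log_step "copy finish" "/Volumes/A001 -> /Volumes/RAID1"
```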

Dusty

MasonJames · Mar 31, 2011

+1 for the log file creation. Haven't been able to test this script just yet, but looks very exciting. Thanks for sharing!

Dustin Cross · Mar 31, 2011

Mason,

Thanks for the feedback. What would you like to see in the log file?

Let me know how things go when you test.

Dusty

Dustin Cross · Apr 19, 2011

Just updated the script on the first post. Everything seems to be working great. The new log file is written to /var/tmp/copyscript.log. Let me know if there is anything else you want to see in it.

Dusty

jamie parry · Apr 20, 2011

hey Dusty
great work on the script.
if you want a more accurate % monitor and a progress bar then check out gooey gadgets http://sibr.com/blog/?p=104
and use du -ck to get a total of the bytes you want to transfer across
then every 2 seconds do du -ck on your destination (to see how many bytes are in at the moment)
and then make a percentage out of $copy_size/$expected_total_size*100
you will have to do bash math with bc and printf to get nice round numbers, but those values go into gooey gadgets to make your progress bar. cool huh?
i've done all this in my software (alexicc) and it is the first time i've got a progress bar to actually work from a bash script.
gooey gadgets will need to be on the machine you run it on but I make that happen by putting the binaries i need inside the app folder then use the path to those binaries in the shell script app.
it all sounds more complicated than it is !!!
cheers
jamie
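
The du polling loop described in this post could be sketched as follows. It leaves out gooey gadgets entirely and just prints the number; awk stands in for the bc and printf math, and the function names are made up. As written, the loop only ends when the destination reaches the source size, so it assumes both sides report the same sizes.

```shell
#!/bin/sh
# Sketch of du-based progress polling. Function names are hypothetical.

# Total size in kilobytes of a file or folder (the last line of du -ck is the total).
tree_kb() {
    du -ck "$1" | tail -n 1 | awk '{print $1}'
}

# Percentage that copied ($1) is of expected ($2), rounded to a whole number.
percent_of() {
    awk -v c="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print 100 } else { printf "%.0f\n", c / t * 100 }
    }'
}

# Poll the destination ($2) every 2 seconds until it reaches the source ($1) size.
watch_copy() {
    expected=$(tree_kb "$1")
    while :; do
        current=$(tree_kb "$2")
        echo "$(percent_of "$current" "$expected")%"
        [ "$current" -ge "$expected" ] && break
        sleep 2
    done
}
```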

Dustin Cross · Apr 20, 2011

Jamie,

Thanks. I will have to look into that. Sounds like it does the same thing I am doing: du on the source and destination and some bc math.

The problem with the du approach is different file systems have different sizes for some things. That is why I had to make the script stop counting at 90%. Had a couple times with SxS cards and CF cards where the destination copy never got over 95% of the source and my percent complete loop never stopped.

I am still interested in your script for transcoding Alexa footage with a LUT.

The next thing I want to work on is something that syncs sound and video based on timecode and a script that pulls scene and take metadata out of sound and transcodes dailies with scene and take in the filename.

Dusty

Tim Sutherland · Apr 20, 2011

Dustin,

Any chance you can change the order of operations slightly so that it makes the first copy, then checksums the first copy, then makes all other copies and does other checksums? That way when the source footage has been copied and checksummed it can be ejected before any other copies are made.

I just used your script on a 3 week, 3 camera job, and mags start stacking up pretty quickly so it would be nice if you could eject the cards/drives as soon as possible to start the next copy.

Also, I've had about 8 instances of the original script open at once, so I don't know what you changed to make that possible, but it seemed to be working fine for me before.

Also, it would be nice if when you chose a source folder, it would check for checksum text files from previous copy operations that would be in the same parent folder of the source folder, so that you could use that for comparisons instead of checksumming the source again, which would save time for additional copies, for example when post makes copies for visual f/x.
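
One way the checksum reuse suggested above might work is sketched here. It assumes the earlier checksum list sits beside the source folder and is named after it with a .md5 extension, and that md5sum from GNU coreutils is installed; neither is confirmed anywhere in the thread, and source_checksums is a made-up name.

```shell
#!/bin/sh
# Sketch: reuse a checksum list from an earlier copy if one sits beside the source.
# Assumes the list is called <folder>.md5 in the parent folder and is in md5sum
# format. Both are assumptions, and source_checksums is a hypothetical name.

source_checksums() {
    src=${1%/}
    list="$(dirname "$src")/$(basename "$src").md5"
    if [ -f "$list" ]; then
        echo "reusing $list" >&2
    else
        echo "creating $list" >&2
        # Hash every regular file, with paths relative to the source folder.
        ( cd "$src" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$list"
    fi
    # Print the path of the list so the caller can compare destinations against it.
    echo "$list"
}
```

A reused list is only as trustworthy as the copy it was made from, so this trades a second read of the source for trust in the earlier checksum run.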

All in all, it already works great, and I'm thankful that you've spent so much time making it even better.

Tim

Dustin Cross · Apr 20, 2011

Tim,

Thanks for the feedback. Did you use the version that tells you percent complete or the old version?

Right now the script does the source checksum and second copy in parallel. I normally have my source media and destination media on different buses, so they both run at max speed. The checksum is usually faster than the second copies, so as soon as it says the source checksums are done you could eject the source drive while it is still doing the extra copies. I could have the script eject the source media as soon as it finishes the source checksum. Would that work for you?
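
The ordering described above could be sketched like this. It assumes the extra copies are read from the first copy rather than from the source, which is what lets the source be ejected early; the post implies that but does not state it. It also assumes md5sum is installed, and the function names are made up.

```shell
#!/bin/sh
# Sketch: first copy, then source checksum and remaining copies in parallel.
# checksum_tree and copy_and_checksum are hypothetical names; md5sum is assumed.

checksum_tree() {
    # Write an md5 list for every regular file under $1 to the file $2.
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$2"
}

copy_and_checksum() {
    # $1 = source folder, $2 = checksum list to write, $3... = destination folders
    src=${1%/}
    sums=$2
    shift 2
    first=$1
    shift

    cp -R "$src" "$first"/              # the first copy runs on its own

    checksum_tree "$src" "$sums" &      # source checksum in the background
    sum_pid=$!

    for dest in "$@"; do                # extra copies read from the first copy
        cp -R "$first/$(basename "$src")" "$dest"/ &
    done

    wait "$sum_pid"
    echo "source checksums done"        # the source could be ejected here
    wait                                # wait for the remaining copies
    echo "all copies done"
}
```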

The change to allow multiple instances was because of the percent complete temp files and the new log it is writing. The percent complete temp files all had the same name, so multiple instances would conflict with each other. Also, I wanted things to stay organized in the log file, so one drive's log is all together and you don't have a copy line from source A mixed with a checksum line from source B. Check out the new log and let me know what you think. /var/tmp/copyscript.log
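
A common way to get both of those properties is sketched below: mktemp gives each instance its own temp files, and each instance collects its own log lines and appends them to the shared log in one go. This is an illustration, not the script's actual approach; only the log path comes from the thread, and the append does not lock the log, so it groups lines rather than guaranteeing it.

```shell
#!/bin/sh
# Sketch: per-instance temp files plus a per-job log that is appended as one block.
# All names are hypothetical; only the log path comes from the thread.
LOG_FILE=${LOG_FILE:-/var/tmp/copyscript.log}

# mktemp creates uniquely named files, so two instances never share progress data.
progress_file=$(mktemp "${TMPDIR:-/tmp}/copyscript_progress.XXXXXX")
job_log=$(mktemp "${TMPDIR:-/tmp}/copyscript_job.XXXXXX")

# Remove the temp files when this instance exits.
trap 'rm -f "$progress_file" "$job_log"' EXIT

# Record one line for this job only.
job_note() {
    echo "$*" >> "$job_log"
}

# Append this job's lines to the shared log together, then start a fresh block.
flush_job_log() {
    cat "$job_log" >> "$LOG_FILE"
    : > "$job_log"
}
```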

I will look into making the script a little smarter about checksums.

p.s. - that was you I ran into at the 3cP booth with Scott Mason right?

Dusty

jamie parry · Apr 21, 2011

good call on the du. I'd forgotten about copying to ntfs and other systems! that's me being mr Mac only in my house!
my workaround(s) for that are
ls -l reports the same values on different filesystems as it doesn't bother with block sizes but i think that only works on FILES not folders
so to get the size of all the movs in a folder i do this

find /Users/jamie/Desktop/movz -iname \*.mov -exec ls -l {} \;|awk '{print $5}'| awk '{ sum+=$1} END {print sum}'

find is the command then the folder you're interested in then the name of the things (i've chosen any .mov file here) then the command to EXECcute on each thing found.
all that mess at the end takes the 5th column of ls -l output (which is the file size in bytes on my os) and then adds all the columns to make a total byte size
hopefully irrespective of Block size which varies with filesystem like you said.

the other workaround is to use gnu du (google coreutils for os x) which has some options for ignoring blocksize as well
I think its gdu -k --apparent-size or something like that
these 2 hacky workarounds get over the weirdo problem of block size overhead. Why is nothing simple any more eh?

the alexicc thing is available as an app thingy from lightillusion.com
it's priced as it is cos steve shaw's icc profile service is NOT free and he wanted to share proceeds with other developers of Spaceman
it does seem to work and Job ter Burg and wouter have done over 1500 ish clips with it in their movie.
it has some issues (ie don't use it with prores422 log C !!!) and is not a pretty looking app but it gets people home quicker and that's why i made my bit of it!
all the best
jamie

Dustin Cross · Apr 21, 2011

Jamie,

I am 100% Mac too, but the media I have to deal with is primarily FAT32, and SxS cards are UDF. They both do things a little differently than HFS+.

I should change the du command to look at files only like I do with the md5 command. Something really simple like:

find -s * -type f

That way find ignores anything that is not a file (-type f). The only problem with that is I have to find all files, run du on each to get its size, then add all the sizes to get the total. It was so much easier to change 99% to 90% and be done.
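
The find, size, and add steps could be collapsed into one pipeline, roughly as in this sketch. It uses wc -c in place of du so the result is a byte count that does not depend on the filesystem's block size; total_file_bytes is a made-up name.

```shell
#!/bin/sh
# Sketch: total size in bytes of regular files only, ignoring directories.
# wc -c is swapped in for du here; total_file_bytes is a hypothetical name.
total_file_bytes() {
    # wc -c prints "<bytes> <path>" for each file; awk adds up the first column.
    find "$1" -type f -exec wc -c {} \; | awk '{ sum += $1 } END { print sum + 0 }'
}
```

Running the same function on the source and on each destination gives numbers that can be compared directly for the percent complete.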

Dusty

Tim Sutherland · Apr 21, 2011

Dustin, checksum and second copy in parallel is great, no need to auto eject. I was using the very first version before the percentage, so that's why I had multiple instances.

And yes, that was me at NAB. Next time you find yourself in LA we should hang out.

I'll look at the logs tomorrow.

I think this now might be the fastest way to copy data to multiple destinations when the destinations are different speeds.
Tim

Richard Goodwin · Apr 21, 2011

Dustin Cross said:
Richard,

I just did a quick test and this was very fast. Much faster than a full read or making md5s, so that makes me think it is not comparing files as completely as md5. I'll do more research into what diff is actually doing on directories.

Thanks,
Dusty

Have not been on this list in a bit.

md5 does not compare files but uses an algorithm to calculate a unique "fingerprint" for the file (see: http://en.wikipedia.org/wiki/Md5sum). The diff program just compares the files byte for byte. That is why diff is much faster (just a byte comparison vs. the computer actually doing the work of calculating a hash). If you are using GNU diff, that is highly tested and should be OK.

md5 is usually used when you are transporting data across a network and so the user at the destination end will not be able to compare your data directly. In this case since you have direct access to both copies you don't necessarily need to take the time to do checksums; they provide a nice double verification.

(also have you looked at rsync?)

Dustin Cross · Apr 21, 2011

Richard,

I understand md5 and what diff does for text files, but these are binary files and I can't find anything that really says what diff does with binary files. My concern with diff was it was so fast it could not have read all the data. diff on a file seemed to be faster than the drive could read the file, so it is only reading parts of the file and that is a problem. We have to know that every bit is the same for what we do, so we must have checksums, therefore the diff was kinda pointless. I have access to the source media on set, but the guys I send the drives to do not have that access. Those checksums are how we verify the files they are working with six months from now are the exact same as what came off the camera.
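
The workflow described above, writing checksums on set and checking a drive against them later, might look like this sketch. It assumes md5sum from GNU coreutils rather than the md5 command that ships with OS X, and the function names are made up.

```shell
#!/bin/sh
# Sketch: write a checksum list on set, verify a drive against it later.
# md5sum (GNU coreutils) is assumed; function names are hypothetical.

make_manifest() {
    # $1 = folder to hash, $2 = absolute path of the list to write
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$2"
}

verify_manifest() {
    # $1 = folder to check, $2 = absolute path of the list
    # Exit status 0 means every listed file is present and matches.
    ( cd "$1" && md5sum -c "$2" > /dev/null 2>&1 )
}
```

Because the list stores paths relative to the folder, it can travel with the drive and be checked on any machine without access to the original card.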

I did test with rsync, cpio, and every other way to copy files I could find on the internet. cp ended up being the fastest. There were other benefits to other ways to copy the files, but speed was what I wanted.

Dusty
