Wednesday, September 8, 2010

Diff between two files (unordered and with a different number of lines)

This program finds the differences between two files. It is meant for the following situations:
  • The files have a different number of lines
  • The files are unordered
  • `diff` fails to get the differences
  • The files are large (the reason for using the shell `sort` command)
  • Each line holds a single record
For example, imagine you have two files, each with a list of users taken from a database at a different time. The listings differ because users were added to and removed from the database in between. This script helps you find those differences when `diff` can't resolve them.
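
To make the user-list example concrete, here is a minimal sketch with two made-up snapshots held in memory. The user names and the helper `only_in_first` are hypothetical, and the hash lookup it uses only suits lists small enough to fit in memory; the script below sorts the files with `sort` instead.

```perl
use strict;
use warnings;

# Hypothetical snapshots of the same user table taken at two different times.
my @monday = qw(carol alice bob);
my @friday = qw(bob dave alice erin);

# Return the entries of the first list that do not appear in the second,
# comparing case-insensitively.
sub only_in_first {
    my ($first, $second) = @_;
    my %seen;
    $seen{ lc $_ } = 1 for @$second;
    my @missing = grep { !$seen{ lc $_ } } @$first;
    return sort @missing;
}

print "<$_\n" for only_in_first(\@monday, \@friday);   # users removed since Monday
print ">$_\n" for only_in_first(\@friday, \@monday);   # users added since Monday
```

With these two lists, `carol` is reported as removed and `dave` and `erin` as added, regardless of the order in which the names were stored.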

#!/usr/bin/perl -w

use strict;

my $file1= $ARGV[0];
my $file2= $ARGV[1];

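# Both file names are required; report problems on STDERR with a non-zero exit status.
unless (defined $file1 and defined $file2) {
    print STDERR "Usage: $0 <file1> <file2>\n\n";
    exit 1;
}
unless (-f $file1) {
    print STDERR "File 1 does not exist: [$file1]\n\n";
    exit 1;
}
unless (-f $file2) {
    print STDERR "File 2 does not exist: [$file2]\n\n";
    exit 1;
}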


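use File::Temp qw(tempfile);

# Private temporary files instead of fixed names in /tmp, so two runs
# cannot clobber each other. File::Temp removes them when the program exits.
my ($tmp_fh1, $tmp_file1) = tempfile(UNLINK => 1);
my ($tmp_fh2, $tmp_file2) = tempfile(UNLINK => 1);

# Sort in the C locale and fold case (-f) so the order on disk matches the
# uc() comparison below. Passing a list to system() avoids shell quoting problems.
{
    local $ENV{LC_ALL} = 'C';
    system('sort', '-f', '-o', $tmp_file1, '--', $file1) == 0 or die "sort failed for [$file1]\n";
    system('sort', '-f', '-o', $tmp_file2, '--', $file2) == 0 or die "sort failed for [$file2]\n";
}

open(my $f1, '<', $tmp_file1) or die "Cannot open $tmp_file1: $!";
open(my $f2, '<', $tmp_file2) or die "Cannot open $tmp_file2: $!";

# Read the next line without its newline; returns undef at end of file.
sub next_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

my $s1 = next_line($f1);
my $s2 = next_line($f2);

# Walk both sorted files in step, advancing whichever side is behind.
while (defined $s1 and defined $s2) {
    my $cmp = uc($s1) cmp uc($s2);
    if ($cmp == 0) {          # same record in both files
        $s1 = next_line($f1);
        $s2 = next_line($f2);
    } elsif ($cmp > 0) {      # record exists only in file 2
        print ">$s2\n";
        $s2 = next_line($f2);
    } else {                  # record exists only in file 1
        print "<$s1\n";
        $s1 = next_line($f1);
    }
}

# One file is exhausted: everything left in the other one is unmatched,
# including the line that was still waiting to be compared.
while (defined $s1) { print "<<$s1\n"; $s1 = next_line($f1); }
while (defined $s2) { print ">>$s2\n"; $s2 = next_line($f2); }

close $f1;
close $f2;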
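
The comparison loop can be tried without any temporary files by running the same kind of merge over two in-memory arrays. This is only a sketch, assuming both arrays are already sorted case-insensitively; `merge_diff` is a hypothetical name, and the markers simply mirror the ones the script prints.

```perl
use strict;
use warnings;

# Walk two case-insensitively sorted lists in step and collect the entries
# that appear in only one of them.
sub merge_diff {
    my ($left, $right) = @_;
    my ($i, $j) = (0, 0);
    my @out;
    while ($i < @$left and $j < @$right) {
        my $cmp = uc($left->[$i]) cmp uc($right->[$j]);
        if ($cmp == 0) {            # same record on both sides
            $i++;
            $j++;
        } elsif ($cmp < 0) {        # only in the first list
            push @out, "<$left->[$i]";
            $i++;
        } else {                    # only in the second list
            push @out, ">$right->[$j]";
            $j++;
        }
    }
    # Whatever remains after one list runs out has no counterpart.
    push @out, map { "<<$_" } @{$left}[$i .. $#{$left}];
    push @out, map { ">>$_" } @{$right}[$j .. $#{$right}];
    return @out;
}

print "$_\n" for merge_diff([qw(alice bob carol)], [qw(Alice dave)]);
```

Each list is read exactly once and only one entry per side is held at a time, which is why the same approach works on files too large to load into memory once they have been sorted.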