2000-12-09..2011-02-23 Hisashi MORITA
See the ChangeLog for detail.
Compares two text files by word, by character, or by line
DocDiff compares two text files and shows the differences. It can compare files word by word, character by character, or line by line. It has several output formats such as HTML, tty, Manued, or user-defined markup.
It supports several character encodings (ASCII and other single-byte encodings such as ISO-8859-*, UTF-8, EUC-JP, and Shift_JIS) and end-of-line characters (CR, LF, and CRLF).
Note that installation requires appropriate permissions (you may need root/administrator privileges).
# cp -r docdiff /usr/lib/ruby/1.9.1
# cp docdiff.rb /usr/bin/
# mv /usr/bin/docdiff.rb /usr/bin/docdiff
# ln -s /usr/bin/docdiff.rb /usr/bin/chardiff.rb
# ln -s /usr/bin/docdiff.rb /usr/bin/worddiff.rb
# chmod +x /usr/bin/docdiff.rb
# cp docdiff.conf.example /etc/docdiff.conf
# $EDITOR /etc/docdiff.conf
% cp docdiff.conf.example ~/etc/docdiff.conf
% $EDITOR ~/etc/docdiff.conf
docdiff [options] oldfile newfile
e.g. % docdiff old.txt new.txt > diff.html
See the help message for detail (docdiff --help).
% cat sample/01.en.ascii.lf
Hello, my name is Watanabe.
I am just another Ruby porter.
% cat sample/02.en.ascii.lf
Hello, my name is matz.
It's me who has created Ruby. I am a Ruby hacker.
% docdiff sample/01.en.ascii.lf sample/02.en.ascii.lf
Hello, my name is <del>Watanabe.</del><ins>matz.</ins>
<ins>It's me who has created Ruby. </ins>I am <del>just another </del><ins>a </ins>Ruby <del>porter.</del><ins>hacker.</ins>
%
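You can place configuration files at the locations used during installation: /etc/docdiff.conf (site-wide) and ~/etc/docdiff.conf (per user).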
Notation is as follows (also refer to the file docdiff.conf.example included in the distribution archive):
# comment
key1 = value
key2 = value
...
Every value is treated as a string unless it looks like a number, in which case it is treated as a number (usually an integer).
Sometimes DocDiff fails to auto-recognize the encoding and/or end-of-line character. In that case you may get an error like this:
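The notation above can be read with a few lines of Ruby. The following is only a minimal sketch of the rule described here, not DocDiff's actual parser: the helper name parse_config is hypothetical, and it assumes that "#" starts a comment anywhere on a line and that only plain decimal integers and decimals count as numbers.

```ruby
# Hypothetical helper for illustration; not DocDiff's actual parser.
def parse_config(text)
  config = {}
  text.each_line do |line|
    # Assumption: "#" starts a comment anywhere on the line.
    line = line.sub(/#.*/, "").strip
    next if line.empty?

    key, value = line.split("=", 2).map(&:strip)
    next if value.nil? # skip lines that have no "="

    config[key] =
      case value
      when /\A[+-]?\d+\z/      then value.to_i # looks like an integer
      when /\A[+-]?\d+\.\d+\z/ then value.to_f # looks like a decimal
      else value                               # anything else stays a string
      end
  end
  config
end

sample = "# comment\nkey1 = value\nkey2 = 42\n"
p parse_config(sample)
```

Here key1 ends up as the string "value" and key2 as the integer 42. Refer to docdiff.conf.example for the keys DocDiff really understands.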
charstring.rb:47:in `extend': wrong argument type nil (expected Module) (TypeError)
In such a case, try explicitly specifying encoding and end-of-line character (e.g. docdiff --utf8 --crlf).
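If you are not sure which end-of-line option to pass, you can inspect the file first. The following Ruby sketch simply counts the three end-of-line sequences and reports the most frequent one. It is an illustration only, not the detection logic DocDiff uses, and the helper name guess_eol is hypothetical.

```ruby
# Hypothetical helper for illustration; not DocDiff's own detection code.
# Returns "CRLF", "CR", "LF", or nil when the text has no line ending.
def guess_eol(text)
  bytes = text.b                         # work on raw bytes, ignoring encoding
  crlf  = bytes.scan("\r\n").size
  cr    = bytes.count("\r") - crlf       # lone CR
  lf    = bytes.count("\n") - crlf       # lone LF
  counts = { "CRLF" => crlf, "CR" => cr, "LF" => lf }
  name, n = counts.max_by { |_, c| c }   # most frequent sequence wins
  n.zero? ? nil : name
end

puts guess_eol("foo\r\nbar\r\n")
puts guess_eol("foo\nbar\n")
```

In practice you would pass the result of File.binread(path) to the helper and then choose the matching option (for example --crlf).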
When comparing space-separated texts (such as English or program source code), the word next to the end of a line is sometimes unnecessarily deleted and inserted. This is due to a limitation of DocDiff's word splitter, which splits strings into words as follows.
text 1:
foo bar
("foo bar" => ["foo ", "bar"])
text 2:
foo
bar
("foo\nbar" => ["foo", "\n", "bar"])
comparison result:
foofoo
bar
("<del>foo </del><ins>foo</ins><ins>\n</ins>bar")
"foo" is (unnecessarily) deleted and inserted at the same time, because "foo " (with a trailing space) and "foo" are different elements.
I would like to fix this sometime, but it's not easy. If each single space were split off as its own element (i.e. ["foo", " ", "bar"]), the word order of the comparison result would be less natural. Suggestions are welcome.
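The effect can be reproduced with two small splitters in Ruby. This is a sketch that mimics the behaviour described above, not DocDiff's actual code. Both method names are hypothetical, and Array#& (set intersection) is used only as a rough stand-in for "elements a diff could treat as common".

```ruby
# Hypothetical splitters for illustration; not DocDiff's actual code.

# A word keeps its trailing spaces; a newline is an element of its own.
def split_words(text)
  text.scan(/\n|[^ \n]+ *| +/)
end

# Alternative: every run of spaces becomes a separate element.
def split_words_with_spaces(text)
  text.scan(/\n| +|[^ \n]+/)
end

text1 = "foo bar"
text2 = "foo\nbar"

p split_words(text1)                       # ["foo ", "bar"]
p split_words(text2)                       # ["foo", "\n", "bar"]

# Only "bar" is shared, so "foo " is deleted and "foo" is inserted.
p split_words(text1) & split_words(text2)  # ["bar"]

# With spaces split off, "foo" is shared as well.
p split_words_with_spaces(text1) & split_words_with_spaces(text2)  # ["foo", "bar"]
```

The second splitter removes the spurious change in this case, at the cost of the less natural word order mentioned above.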
If you want to use DocDiff as an external diff program from VCSs, the following may work.
% svn diff --diff-cmd=docdiff --extensions "--ascii --lf --tty --digest"
% GIT_EXTERNAL_DIFF=~/bin/gitdocdiff.sh git diff
~/bin/gitdocdiff.sh:
#!/bin/sh
docdiff --ascii --lf --tty --digest "$2" "$5"
With zsh, you can use DocDiff or other utilities to compare arbitrary sources. In the following examples, we compare a specific revision of foo.html in a repository with the one on a website.
CVS:
% docdiff =(cvs -Q update -p -r 1.3 foo.html) =(curl --silent http://www.example.org/foo.html)
Subversion:
% docdiff =(svn cat -r3 http://svn.example.org/repos/foo.html) =(curl --silent http://www.example.org/foo.html)
You can compare files other than plain text, such as HTML and Microsoft Word documents, if you use an appropriate converter.
Comparing the content of two HTML documents (without tags):
% docdiff =(w3m -dump -cols 10000 foo.html) =(w3m -dump -cols 10000 http://www.example.org/foo.html)
Comparing the content of two Microsoft Word documents:
% docdiff =(wvWare foo.doc | w3m -T text/html -dump -cols 10000) =(wvWare bar.doc | w3m -T text/html -dump -cols 10000)
If you want to compare Latin-* (ISO-8859-*) texts, try using ASCII as their encoding. When ASCII is specified, DocDiff assumes single-byte characters.
Comparing Latin-1 texts:
% docdiff --encoding=ASCII latin-1-old.txt latin-1-new.txt
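The reason this works is that in a single-byte encoding every character is exactly one byte, so comparing byte by byte gives the same units as comparing character by character. The Ruby sketch below illustrates this with the word "café". It is an illustration only and says nothing about how DocDiff handles encodings internally; the helper name unit_count is hypothetical.

```ruby
# Hypothetical helper for illustration: number of characters a string has
# when its bytes are interpreted in the given encoding.
def unit_count(text, encoding)
  text.dup.force_encoding(encoding).chars.size
end

latin1 = "caf\xE9".b   # "café" encoded as ISO-8859-1: 4 bytes
utf8   = "caf\u00E9"   # "café" encoded as UTF-8: 5 bytes

# Single-byte text: bytes and characters coincide.
puts unit_count(latin1, "ISO-8859-1")  # 4
puts unit_count(latin1, "ASCII-8BIT")  # 4

# Multi-byte text: treating it as single bytes splits the last character.
puts unit_count(utf8, "UTF-8")         # 4
puts unit_count(utf8, "ASCII-8BIT")    # 5
```

For the same reason, do not use the ASCII setting for UTF-8, EUC-JP, or Shift_JIS texts that contain multi-byte characters.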
This software is distributed under the so-called modified BSD-style license (http://www.opensource.org/licenses/bsd-license.php, without the advertising clause). By contributing to this software, you agree that your contribution may be incorporated under the same license.
Copyright and conditions of use for the main portion of the source:
Copyright (C) Hisashi MORITA. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The diff library (docdiff/diff.rb and docdiff/diff/*) was originally part of Ruby/CVS by Akira TANAKA. Ruby/CVS is licensed under a modified BSD-style license. See the following for details.
Several other programs can compare text word by word and/or character by character.