can now generate all pairs $i,j$ for which some $x_i^\pi$ value is present in both of their sketches. From these we can compute, for each pair $i,j$ with non-zero sketch overlap, a count of the number of $x_i^\pi$ values they have in common. By applying a preset threshold, we identify which pairs $i,j$ have heavily overlapping sketches. For instance, if sketches contain 200 values and the threshold were 80%, the count would need to be at least 160 for any pair $i,j$. As we identify such pairs, we run union-find to group documents into near-duplicate ``syntactic clusters''. This is essentially a variant of the single-link clustering algorithm introduced in Section 17.2 (page [*]).
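The pairwise counting and union-find step above can be sketched as follows. This is a minimal illustration, not the book's implementation; it assumes each document's sketch has already been computed as a set of min-hash values (the document IDs and sketch contents below are hypothetical):

```python
from collections import defaultdict
from itertools import combinations

def cluster_near_duplicates(sketches, threshold=0.8):
    """Group documents whose sketches overlap in at least
    `threshold` of their values.

    sketches: dict mapping document id -> set of sketch values
    (e.g. 200 min-hash values per document).
    """
    # Invert the sketches: for each value, the documents containing it.
    postings = defaultdict(list)
    for doc, sketch in sketches.items():
        for value in sketch:
            postings[value].append(doc)

    # Count, for each pair of documents, how many values they share.
    overlap = defaultdict(int)
    for docs in postings.values():
        for i, j in combinations(sorted(docs), 2):
            overlap[(i, j)] += 1

    # Union-find structure to merge pairs exceeding the threshold.
    parent = {d: d for d in sketches}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sketch_size = len(next(iter(sketches.values())))  # e.g. 200
    for (i, j), count in overlap.items():
        if count >= threshold * sketch_size:
            parent[find(i)] = find(j)

    # Collect the resulting syntactic clusters.
    clusters = defaultdict(set)
    for d in sketches:
        clusters[find(d)].add(d)
    return list(clusters.values())
```

Because only pairs with non-zero overlap ever appear in `overlap`, the quadratic all-pairs comparison is avoided for documents that share no sketch values at all.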
BibTeX output now includes tags in a field called "keywords", which is more common than "tags". When importing BibTeX, both fields are merged.
If you post a single BibTeX snippet that you already have, the duplicate is now shown on the edit BibTeX page. The postBookmark button has also been updated: text selected on the page is now automatically included in the description/comment field. On the settings page you can now update your email address, as well as your homepage and real name.
How do you find similar pictures in a large collection when they come in different formats, resolutions, and rotations?
How do you find all the duplicates in a huge collection of music files in different formats?
How do you find duplicate text files or binary files on your computer?
Do you get a separate program to handle each case, or would you rather have one program that does it all?
Here is my solution:
DupeFinder is a simple application for locating, moving, renaming and deleting duplicate files in a directory structure. It's perfect both for users who haven't kept their hard drives very well organized and need to do some cleaning to free space, and for users who like to keep lots of backup copies of important data "just in case" something bad should happen.
Duplicate Files Finder is a cross-platform application for finding and removing duplicate files by deleting, creating hardlinks or creating symbolic links. A special algorithm minimizes the amount of data read from disk, so the program is very fast.
dupeGuru is a tool to find duplicate files on your computer. It can scan either filenames or contents. The filename scan features a fuzzy matching algorithm that can find duplicate filenames even when they are not exactly the same. dupeGuru runs on Windows, Mac OS X and Linux.
dupeGuru is efficient. Find your duplicate files in minutes, thanks to its quick fuzzy matching algorithm. dupeGuru not only finds filenames that are the same, but it also finds similar filenames.
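dupeGuru's own matching engine is proprietary to the project, but the general idea of fuzzy filename matching can be illustrated with Python's standard-library `difflib`, which scores how similar two strings are; the filenames and threshold below are made up for the example:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_filenames(names, threshold=0.8):
    """Return pairs of filenames whose similarity ratio is at
    least `threshold` (1.0 means identical)."""
    pairs = []
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs
```

With this kind of scoring, `"vacation_2011.jpg"` and `"vacation_2011 (1).jpg"` are flagged as likely duplicates even though the strings are not equal, which is exactly the behavior the filename scan describes.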
dupeGuru is customizable. You can tweak its matching engine to find exactly the kind of duplicates you want to find. The Preference page of the help file lists all the scanning engine settings you can change.
dupeGuru is safe. Its engine has been especially designed with safety in mind. Its reference directory system as well as its grouping system prevent you from deleting files you didn't mean to delete.
Do whatever you want with your duplicates. Not only can you delete the duplicate files dupeGuru finds, but you can also move or copy them elsewhere. There are also multiple ways to filter and sort your results to easily weed out false duplicates (for low-threshold scans).
Supported languages: English, French.
Requirements
Mac OS X: 10.5 and up (Leopard, Snow Leopard or Lion). PowerPC or Intel. (Last version to support Tiger: v2.8.2)
Windows: 2k/XP/Vista/Win7.
Linux: Ubuntu 10.04