BibSonomy :: user :: stroeh :: shingle

bookmarks (4)


publications (1)

• can now generate all pairs $i,j$ for which $x_i^\pi$ is present in both their sketches. From these we can compute, for each pair $i,j$ with non-zero sketch overlap, a count of the number of $x_i^\pi$ values they have in common. By applying a preset threshold, we know which pairs $i,j$ have heavily overlapping sketches. For instance, if the threshold were 80%, we would need the count to be at least 160 for any $i,j$. As we identify such pairs, we run the union-find algorithm to group documents into near-duplicate "syntactic clusters". This is essentially a variant of the single-link clustering algorithm introduced in Section 17.2. · http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
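The procedure in the quoted passage (generate pairs that share a sketch value, count the shared values per pair, apply a threshold, merge with union-find) can be illustrated with a minimal Python sketch. It is not the book's implementation. The function name `cluster_near_duplicates`, the `sketches` dictionary, and the `min_shared` parameter are hypothetical names chosen for this example, and each sketch is assumed to already be available as a set of hashed values.

```python
from collections import defaultdict
from itertools import combinations


def cluster_near_duplicates(sketches, min_shared):
    """Group documents whose sketches share at least `min_shared` values.

    `sketches` maps a document id to its sketch, given as a set of values.
    Both names are assumptions made for this illustration.
    """
    # Invert the sketches: sketch value -> documents whose sketch contains it.
    postings = defaultdict(list)
    for doc, sketch in sketches.items():
        for value in sketch:
            postings[value].append(doc)

    # For every pair with non-zero sketch overlap, count the shared values.
    shared = defaultdict(int)
    for docs in postings.values():
        for i, j in combinations(sorted(docs), 2):
            shared[(i, j)] += 1

    # Union-find with path compression; every document starts as its own root.
    parent = {doc: doc for doc in sketches}

    def find(doc):
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    # Merge the clusters of every pair that meets the threshold.
    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)

    # Collect documents by their root to obtain the clusters.
    clusters = defaultdict(set)
    for doc in sketches:
        clusters[find(doc)].add(doc)
    return list(clusters.values())
```

The figures in the passage (an 80% threshold giving a count of 160) imply sketches of 200 values each, so under that assumption the call would use `min_shared=160`. Because merging is transitive, two documents can land in the same cluster without overlapping heavily with each other directly, which is the single-link behaviour the passage mentions.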
2 years and 2 months ago
by stroeh
BibSonomy is offered by the KDE group of the University of Kassel, the DMIR group of the University of Würzburg, and the L3S Research Center, Germany. Privacy & Terms of Use - Contact