Massively scalable near duplicate detection in streams of documents using MDSH.
Paul Logasa Bogen, Christopher T. Symons, Amber McKenzie, Robert M. Patton, Robert E. Gillen
Published in: IEEE BigData (2013)
Keyphrases
- document collections
- information retrieval
- legal documents
- document classification
- xml documents
- massively parallel
- information retrieval systems
- web documents
- relevant documents
- document retrieval
- document clustering
- keywords
- document representation
- vector space model
- data streams
- text documents
- retrieval systems
- retrieved documents
- highly scalable
- web scale
- document analysis
- transactional data
- vector space
- text retrieval
- web data
- sliding window
- website
- clustering algorithm
- document content
- real time