Publication: Web document text and images extraction using DOM analysis and natural language processing.