Document structure meets page layout: loopy random fields for web news content extraction.
Alex Spengler, Patrick Gallinari
Published in: ACM Symposium on Document Engineering (2010)
Keyphrases
- content extraction
- web news
- structured documents
- HTML documents
- conditional random fields
- Markov random field
- graphical models
- maximum entropy
- XML documents
- multimedia information retrieval
- text content
- digital archives
- parameter estimation
- probabilistic model
- information retrieval
- machine learning
- query language
- multimedia
- news articles
- semi-structured
- named entities
- data mining
- information retrieval systems