Integrated multi-strategic Web document pre-processing for sentence and word boundary detection.
Junhyeok Shim, Dongseok Kim, Jeongwon Cha, Gary Geunbae Lee, Jungyun Seo
Published in: Inf. Process. Manag. (2002)
Keyphrases
- boundary detection
- web documents
- preprocessing
- n-gram
- keywords
- sentence level
- information extraction
- noun phrases
- part of speech
- prefetching
- image segmentation
- syntactic information
- natural language
- web pages
- object detection and recognition
- syntactic analysis
- sentence similarity
- word level
- text corpus
- syntactic categories
- berkeley segmentation dataset
- co-occurrence
- web content
- word frequency
- vector space model
- training corpus
- closed contours
- ambiguous words
- textual information
- web data
- information retrieval
- occlusion boundaries
- sentiment analysis
- feature extraction
- stop words
- word sense disambiguation
- three dimensional
- data analysis
- text mining
- web logs
- text summarization
- structured data