Abstract
Finding useful information on the Web becomes increasingly difficult as the volume of Web data grows rapidly. To facilitate effective browsing, Web designers usually display the same type of information with a consistent layout, referred to as a Web pattern. Discovering Web patterns can benefit many applications, such as structured data extraction. This paper presents a generic framework, based on graph grammars, for discovering Web patterns and recognizing their instances (i.e., structured data). In our framework, a Web pattern is specified visually yet formally as a graph grammar, which is induced automatically by a grammar induction engine. The key feature of the engine is that it converts the problem of (2-dimensional) graph grammar induction into (1-dimensional) string induction. Based on the induced pattern, matching instances are recognized in Web pages through a graph parsing process. We have evaluated the framework on twenty-one e-commerce Web sites, and the results are promising, with a high F1-score.
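To make the idea of reducing a 2-dimensional layout to a 1-dimensional string more concrete, the following is a minimal sketch and not the induction engine described in the paper. It assumes a hypothetical `Box` record holding an element label and its position, flattens the boxes into a label sequence in reading order, and then searches for the consecutive repeated unit that covers the most tokens. The names `Box`, `linearize`, and `most_covering_repeat` are illustrative assumptions, and the brute-force repeat search stands in for whatever string induction technique the actual framework uses.

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """Hypothetical layout element: a type label and its top-left position."""
    label: str
    x: int
    y: int


def linearize(boxes: List[Box]) -> List[str]:
    """Flatten a 2-D layout into a 1-D label sequence in reading order
    (top to bottom, then left to right)."""
    return [b.label for b in sorted(boxes, key=lambda b: (b.y, b.x))]


def most_covering_repeat(seq: List[str]) -> Optional[Tuple[List[str], int]]:
    """Return (unit, count) for the consecutive repeat covering the most
    tokens, or None if nothing repeats at least twice in a row."""
    best: Optional[Tuple[List[str], int]] = None
    best_cover = 0
    n = len(seq)
    for start in range(n):
        for size in range(1, (n - start) // 2 + 1):
            unit = seq[start:start + size]
            count = 1
            # Count how many times the unit repeats back to back.
            while seq[start + count * size:start + (count + 1) * size] == unit:
                count += 1
            if count >= 2 and count * size > best_cover:
                best, best_cover = (unit, count), count * size
    return best


# Toy page: a header, three product rows (image, title, price), a footer.
page = [Box("header", 0, 0), Box("footer", 0, 400)]
for row_y in (100, 200, 300):
    page += [Box("img", 0, row_y), Box("title", 100, row_y), Box("price", 200, row_y)]

labels = linearize(page)
pattern = most_covering_repeat(labels)
print(pattern)  # (['img', 'title', 'price'], 3)
```

In this toy setting the repeated unit plays the role of a candidate Web pattern and each repetition is one instance of it. A real page would need a richer encoding of spatial relations than simple reading order, which is what the graph grammar formalism provides.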
| Original language | English |
|---|---|
| Pages (from-to) | 528-545 |
| Number of pages | 18 |
| Journal | Information Sciences |
| Volume | 328 |
| Publication status | Published - 20 Jan 2016 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2015 Elsevier Inc. All rights reserved.
Keywords
- Graph grammar induction
- Spatial graph grammar
- Web patterns