Pattern-based extraction of addresses from web page content

Saeid Asadi*, Guowei Yang, Xiaofang Zhou, Yuan Shi, Boxuan Zhai, Wendy Wen Rong Jiang

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report, Conference Paper published in a book, peer-review

Abstract

Extraction of addresses and location names from Web pages is a challenging task for search engines. Traditional information extraction and natural language processing models remain unsuccessful in the context of the Web because of the uncontrolled, heterogeneous nature of Web resources as well as the effects of HTML and other markup tags. We describe a new pattern-based approach for extraction of addresses from Web pages. Both HTML and vision-based segmentation are used to increase the quality of address extraction. The proposed system uses several address patterns and a small table of geographic knowledge to locate addresses and then itemize them into smaller components. The experiments show that this model can extract and itemize different addresses effectively without large gazetteers or human supervision.
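
As a rough illustration of the idea of combining an address pattern with a small table of geographic knowledge, the following Python sketch strips markup, matches one address pattern, and itemizes each match into components. It is a minimal sketch under stated assumptions, not the authors' system: the names STATES, STREET_TYPES, ADDRESS_RE, strip_tags and extract_addresses, the Australian-style address layout, and the table contents are all hypothetical, and the HTML and vision-based segmentation described in the abstract is replaced here by naive tag stripping.

```python
import re

# Hypothetical miniature "geographic knowledge table" (assumption, not from the paper).
STATES = {"QLD": "Queensland", "NSW": "New South Wales", "VIC": "Victoria"}
# Hypothetical list of street-type keywords (assumption).
STREET_TYPES = ["Street", "St", "Road", "Rd", "Avenue", "Ave", "Drive", "Dr"]

# One illustrative pattern: number, street name, street type, suburb, state, postcode.
ADDRESS_RE = re.compile(
    r"(?P<number>\d+[A-Za-z]?)\s+"
    r"(?P<street>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*?)\s+"
    r"(?P<street_type>" + "|".join(STREET_TYPES) + r")\b[\s,.]+"
    r"(?P<suburb>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*?)[\s,]+"
    r"(?P<state>" + "|".join(STATES) + r")\s+"
    r"(?P<postcode>\d{4})\b"
)


def strip_tags(html):
    """Replace markup tags with spaces and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def extract_addresses(html):
    """Return one dict of itemized components per address matched in the page text."""
    results = []
    for match in ADDRESS_RE.finditer(strip_tags(html)):
        parts = match.groupdict()
        # Use the knowledge table to expand the state abbreviation.
        parts["state_name"] = STATES[parts["state"]]
        results.append(parts)
    return results


if __name__ == "__main__":
    page = "<p>Visit us at <b>12 Main Street</b>,<br>St Lucia QLD 4072 today.</p>"
    for address in extract_addresses(page):
        print(address)
```

Stripping tags before matching lets an address that is split across inline elements, as in the sample page, still match a single pattern. A fuller system would need several patterns and a segmentation step to keep unrelated page blocks from being joined.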

Original language: English
Title of host publication: Progress in WWW Research and Development - 10th Asia-Pacific Web Conference, APWeb 2008, Proceedings
Pages: 407-418
Number of pages: 12
Publication status: Published - 2008
Externally published: Yes
Event: 10th Asia Pacific Conference on Web Technology, APWeb 2008 - Shenyang, China
Duration: 26 Apr 2008 to 28 Apr 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4976 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 10th Asia Pacific Conference on Web Technology, APWeb 2008
Country/Territory: China
City: Shenyang
Period: 26/04/08 to 28/04/08

Keywords

  • Address Extraction
  • Address Itemization
  • Web page Analysis
