Efficient Regular Expression Matching Based on Positional Inverted Index

Tao Qiu, Xiaochun Yang*, Bin Wang, Wei Wang

*Corresponding author for this work

Research output: Contribution to journalJournal Articlepeer-review

5 Citations (Scopus)

Abstract

We study the efficient regular expression (regex) matching problem. Existing algorithms are the scanning-based algorithms which typically use an equivalent automaton compiled from the regex query to verify a document. Although some works propose various strategies to quickly jump to candidate locations in a document where a query result may appear, they still need to utilize the scanning-based method to verify these candidate locations. These methods become inefficient when there are still many candidate locations needed to be verified. In this article, we propose a novel approach to efficiently compute all matching positions for a regex query purely based on a positional $q$q-gram inverted index. We propose a gram-driven NFA to represent the language of a regex and show all regex matching locations can be obtained by finding positions on $q$q-grams of GNFA that satisfy certain positional constraints. Then we propose several GNFA-based query plans to answer the query using the positional inverted index. In order to improve the query efficiency, we design the algorithm to build a tree-based query plan by carefully choosing a checking order for positional constraints. Experimental results on real-world datasets show that our method outperforms state-of-the-art methods by up to an order of magnitude in query efficiency.

Original languageEnglish
Pages (from-to)1133-1148
Number of pages16
JournalIEEE Transactions on Knowledge and Data Engineering
Volume34
Issue number3
DOIs
Publication statusPublished - 1 Mar 2022
Externally publishedYes

Bibliographical note

Publisher Copyright:
© 2020 IEEE.

Keywords

  • Positional inverted index
  • Query plan
  • Regular expression matching

Fingerprint

Dive into the research topics of 'Efficient Regular Expression Matching Based on Positional Inverted Index'. Together they form a unique fingerprint.

Cite this