Package 'regextable' reference manual

Title:	Pattern-Based Text Extraction and Standardization with Lookup Tables
Description:	Extracts information from text using lookup tables of regular expressions. Each text entry is compared against patterns in a lookup table, returning the extracted substrings alongside optional metadata. Users can choose to extract all matching patterns (generating multiple rows per text entry) or limit extraction to the first match. This approach enables comprehensive, standardized pattern coverage when processing large or complex text datasets.
Authors:	Shirlyn Dong [aut, cre], Devin Judge-Lord [aut]
Maintainer:	Shirlyn Dong <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.2
Built:	2026-05-20 09:59:02 UTC
Source:	https://github.com/judgelord/regextable

Clean Text

Description

Cleans a character vector by converting to lowercase, removing or replacing specific punctuation, normalizing commas, and squishing excess whitespace.

Usage

clean_text(text)

Arguments

text

Character vector to clean.

Value

A cleaned character vector.

Examples

clean_text(c("Hello  World!?", "This--is\tR.\nTesting: 1, 2, , 3;"))
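The cleaning steps listed in the Description can be approximated in base R. The sketch below is not the package's implementation: the function name clean_text_sketch and the exact punctuation rules (dashes become spaces, sentence punctuation is dropped) are assumptions made for illustration, so clean_text() itself may return different results.

```r
# Illustrative approximation of the documented cleaning steps.
# clean_text_sketch is a hypothetical name; the punctuation rules are assumed.
clean_text_sketch <- function(text) {
  x <- tolower(text)                                  # convert to lowercase
  x <- gsub("-+", " ", x, perl = TRUE)                # assumed: dashes become spaces
  x <- gsub("[!?.;:]", "", x, perl = TRUE)            # assumed: drop sentence punctuation
  x <- gsub("\\s*,(?:\\s*,)*\\s*", ", ", x, perl = TRUE)  # collapse repeated commas
  x <- gsub("\\s+", " ", x, perl = TRUE)              # squish runs of whitespace
  trimws(x)                                           # drop leading/trailing spaces
}

cleaned <- clean_text_sketch(
  c("Hello  World!?", "This--is\tR.\nTesting: 1, 2, , 3;")
)
print(cleaned)
```

Under these assumed rules the two inputs become "hello world" and "this is r testing 1, 2, 3".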

cr2007_03_01 dataset

Description

Sample text dataset used for demonstration of regextable.

Format

A tibble with 5 columns:

date: Date of the record (YYYY-MM-DD)
speaker: Speaker name in the text
header: Header or title of the speech
url: Original URL of the source text
url_txt: Full text content from the source

Source

Generated for the regextable package.

Extract Regex Pattern Matches from Text Data

Description

Matches text against a table of regular expressions and returns extracted matches with optional metadata.

Usage

extract(
  data,
  regex_table,
  col_name = "text",
  pattern_col = "pattern",
  typo_table = NULL,
  typo_from_col = "typo",
  typo_to_col = "correction",
  date_col = NULL,
  date_start = NULL,
  date_end = NULL,
  data_return_cols = NULL,
  regex_return_cols = NULL,
  remove_acronyms = FALSE,
  do_clean_text = TRUE,
  unique_match = FALSE,
  use_ner = FALSE,
  ner_timing = "after",
  ner_entity_types = c("ORG"),
  verbose = TRUE,
  cl = NULL
)

Arguments

data

A data frame or character vector containing the text to search. If a character vector is provided, it is internally converted to a data frame and col_name is ignored.

regex_table

A data frame containing regular expression patterns and optional metadata columns.

col_name

Character string specifying the column in data that contains text to search. Default is "text".

pattern_col

Character string specifying the column in regex_table that contains regex patterns. Default is "pattern".

typo_table

Optional data frame of typo corrections. Each replacement is applied to the text in sequence, as a regex wrapped in word boundaries, before pattern matching.

typo_from_col

Optional column in typo_table with text to replace. Default is "typo".

typo_to_col

Optional column in typo_table with replacement text. Default is "correction".

date_col

Optional column in data for date filtering. If provided, rows are filtered by date_start and date_end before pattern matching.

date_start

Optional start date (Date object or string like "YYYY-MM-DD") for filtering data when date_col is specified.

date_end

Optional end date (Date object or string like "YYYY-MM-DD") for filtering data when date_col is specified.

data_return_cols

Optional vector of column names to include from data. Default is NULL (only row_id is returned).

regex_return_cols

Optional vector of column names to include from regex_table. Default is NULL (no metadata columns added).

remove_acronyms

Logical; if TRUE, removes patterns consisting only of uppercase letters (2 or more characters) from regex_table.

do_clean_text

Logical; if TRUE, applies basic text cleaning to the input before matching.

unique_match

Logical; if TRUE, stops at the first matching pattern, so each row yields at most one match (patterns are tried in the order they appear in regex_table). If FALSE, returns all matches for all patterns.

use_ner

Logical; if TRUE, uses the 'spacyr' package to validate that matches are actual Named Entities (e.g., organizations). Note: spacyr must be initialized (e.g., via spacyr::spacy_initialize()) before calling this function.

ner_timing

Character string; either "after" or "before". If "after" (default), regex matches are found first, then validated with NER. If "before", NER extracts entities first, and regex searches only within those entities.

ner_entity_types

Character vector; the types of Named Entities to keep if use_ner is TRUE. Default is "ORG".

verbose

Logical; if TRUE, displays progress messages.

cl

A cluster object created by parallel::makeCluster(), or an integer giving the number of child processes (integer values are ignored on Windows). Passed to pbapply::pblapply().
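The pre-matching arguments above (date_col with date_start and date_end, typo_table, and remove_acronyms) can be pictured with a base R sketch. It is a rough illustration and not the package's code: the helper apply_typos, the sample data, the inclusive date bounds, and the order in which the steps run are all assumptions.

```r
# Hypothetical sample inputs for illustration only.
docs <- data.frame(
  date = as.Date(c("2007-02-27", "2007-03-01", "2007-03-05")),
  text = c("The comittee met.", "The comittees disagreed.",
           "The comittee adjourned."),
  stringsAsFactors = FALSE
)
typos <- data.frame(typo = "comittee", correction = "committee",
                    stringsAsFactors = FALSE)
patterns <- data.frame(
  pattern = c("EPA", "environmental protection agency", "US", "Usda"),
  stringsAsFactors = FALSE
)

# Date filtering: keep rows inside the window (inclusive bounds are assumed).
in_window <- docs$date >= as.Date("2007-03-01") &
  docs$date <= as.Date("2007-03-31")
docs_filtered <- docs[in_window, , drop = FALSE]

# Typo correction: apply each replacement in table order, with word boundaries.
apply_typos <- function(text, typo_table, from = "typo", to = "correction") {
  for (i in seq_len(nrow(typo_table))) {
    pat <- paste0("\\b", typo_table[[from]][i], "\\b")
    text <- gsub(pat, typo_table[[to]][i], text, perl = TRUE)
  }
  text
}
docs_filtered$text <- apply_typos(docs_filtered$text, typos)

# Acronym removal: drop patterns made only of 2+ uppercase letters.
is_acronym <- grepl("^[A-Z]{2,}$", patterns$pattern, perl = TRUE)
patterns_kept <- patterns[!is_acronym, , drop = FALSE]

print(docs_filtered)
print(patterns_kept)
```

Because of the word boundaries, "comittee" is corrected but "comittees" is left alone; a typo table would need a separate entry for the plural.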

Details

Pattern matching uses R's regular expression engine and is case-insensitive by default. For each input row, the function tests the patterns in regex_table in table order and returns either every match (unique_match = FALSE, the default) or only the first match (unique_match = TRUE).
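The difference between the two unique_match settings can be sketched in base R for a single text entry. The helper match_row is hypothetical and only mimics the documented behavior; in particular, keeping just the first occurrence of each pattern within a text is an assumption.

```r
# Hypothetical helper: test patterns in order against one text entry.
match_row <- function(text, patterns, unique_match = FALSE) {
  hits <- list()
  for (p in patterns) {
    # Case-insensitive search; the returned substring keeps original casing.
    m <- regmatches(text, regexpr(p, text, ignore.case = TRUE, perl = TRUE))
    if (length(m) == 1) {
      hits[[length(hits) + 1]] <- data.frame(
        pattern = p, match = m, stringsAsFactors = FALSE
      )
      if (unique_match) break  # stop after the first matching pattern
    }
  }
  if (length(hits) == 0) {
    return(data.frame(pattern = character(), match = character(),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, hits)
}

pats <- c("apples", "bananas", "oranges")
all_hits  <- match_row("Oranges and apples", pats)
first_hit <- match_row("Oranges and apples", pats, unique_match = TRUE)
print(all_hits)
print(first_hit)
```

With unique_match = TRUE the result is the "apples" row, not "Oranges", because "apples" comes first in the pattern list even though "Oranges" appears earlier in the text.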

Value

A tibble with the following columns:

row_id: Integer identifier corresponding to rows in the input data.
Additional columns from data if data_return_cols is specified.
Additional columns from regex_table if regex_return_cols is specified.
pattern: The matched regular expression pattern(s).
match: The extracted text from the data (original casing preserved).

Examples

# Create sample data
data <- data.frame(
  id = 1:3,
  text = c("I love apples", "Bananas are great", "Oranges and apples"),
  stringsAsFactors = FALSE
)

# Create regex patterns
patterns <- data.frame(
  pattern = c("apples", "bananas", "oranges"),
  category = c("fruit", "fruit", "fruit")
)

# Extract all matches
extract(data, patterns)

# Extract one match per row
extract(data, patterns, unique_match = TRUE)
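As a further illustration, the documented data_return_cols and regex_return_cols arguments can carry the id and category columns through to the result. This is a sketch that uses only arguments listed in the Usage section; the call is guarded so the script still runs when regextable is not installed, and the namespace prefix avoids clashes with other packages that export a function named extract.

```r
# Same toy inputs as in the examples above.
data <- data.frame(
  id = 1:3,
  text = c("I love apples", "Bananas are great", "Oranges and apples"),
  stringsAsFactors = FALSE
)
patterns <- data.frame(
  pattern = c("apples", "bananas", "oranges"),
  category = c("fruit", "fruit", "fruit"),
  stringsAsFactors = FALSE
)

result <- NULL
if (requireNamespace("regextable", quietly = TRUE)) {
  # Keep `id` from the data and `category` from the lookup table.
  result <- regextable::extract(
    data,
    patterns,
    col_name = "text",
    data_return_cols = "id",
    regex_return_cols = "category",
    verbose = FALSE
  )
  print(result)
}
```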

members dataset

Description

Lookup table of member names and metadata for regex matching.

Format

A tibble with 9 columns:

congress: Congress number (numeric)
chamber: Chamber (House/President/Senate)
bioname: Full bio name of the member
pattern: Regex pattern to match this member's name
icpsr: Numeric ICPSR identifier
state_abbrev: Two-letter state abbreviation
district_code: District number (0 for President)
first_name: First name of the member
last_name: Last name of the member

Source

Generated for the regextable package.
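The two bundled datasets are designed to be used together: members supplies the patterns and cr2007_03_01 supplies the text. The sketch below shows one plausible way to combine them using the column names documented in the Format sections. It assumes both datasets can be loaded by name with data(); the choice of returned columns is illustrative.

```r
member_hits <- NULL
if (requireNamespace("regextable", quietly = TRUE)) {
  # Assumption: both datasets are available under these names via data().
  env <- new.env()
  utils::data(list = c("cr2007_03_01", "members"),
              package = "regextable", envir = env)

  member_hits <- regextable::extract(
    data = env$cr2007_03_01,
    regex_table = env$members,
    col_name = "url_txt",                    # full text column of the sample data
    pattern_col = "pattern",                 # regex column of the lookup table
    data_return_cols = c("date", "speaker"),
    regex_return_cols = c("bioname", "icpsr", "state_abbrev"),
    verbose = FALSE
  )
  print(head(member_hits))
}
```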
