| Title: | Pattern-Based Text Extraction and Standardization with Lookup Tables |
|---|---|
| Description: | Extracts information from text using lookup tables of regular expressions. Each text entry is compared against patterns in a lookup table, returning the extracted substrings alongside optional metadata. Users can choose to extract all matching patterns (generating multiple rows per text entry) or limit extraction to the first match. This approach enables comprehensive, standardized pattern coverage when processing large or complex text datasets. |
| Authors: | Shirlyn Dong [aut, cre], Devin Judge-Lord [aut] |
| Maintainer: | Shirlyn Dong <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.2 |
| Built: | 2026-05-20 09:59:02 UTC |
| Source: | https://github.com/judgelord/regextable |
Cleans a character vector by converting to lowercase, removing or replacing specific punctuation, normalizing commas, and squishing excess whitespace.
clean_text(text)
| Argument | Description |
|---|---|
| text | Character vector to clean. |
A cleaned character vector.
clean_text(c("Hello World!?", "This--is\tR.\nTesting: 1, 2, , 3;"))
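The cleaning steps named in the description (lowercasing, punctuation handling, comma normalization, whitespace squishing) can be approximated in base R. The sketch below is illustrative only: `clean_text_sketch()` is a hypothetical name, and the exact punctuation set it removes or replaces is an assumption, so its output may differ from what `clean_text()` returns.

```r
# Hypothetical re-implementation for illustration; not the package's source.
clean_text_sketch <- function(text) {
  x <- tolower(text)
  # Replace dashes, tabs, and line breaks with a space (assumed replacement set)
  x <- gsub("[-\t\n\r]+", " ", x, perl = TRUE)
  # Drop a few punctuation marks (assumed removal set: ! ? ; :)
  x <- gsub("[!?;:]", "", x, perl = TRUE)
  # Normalize commas: collapse runs of commas into one, followed by one space
  x <- gsub("\\s*,(\\s*,)*\\s*", ", ", x, perl = TRUE)
  # Squish: collapse repeated whitespace and trim both ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

cleaned <- clean_text_sketch(c("Hello World!?", "This--is\tR.\nTesting: 1, 2, , 3;"))
print(cleaned)
```

Each step is a separate `gsub()` call so that the order of operations stays visible; comma normalization runs before squishing so that the empty slot in `2, , 3` collapses cleanly.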
Sample text dataset used for demonstration of regextable.
A tibble with 5 columns:
Date of the record (YYYY-MM-DD)
Speaker name in the text
Header or title of the speech
Original URL of the source text
Full text content from the source
Generated for the regextable package.
Matches text against a table of regular expressions and returns extracted matches with optional metadata.
extract( data, regex_table, col_name = "text", pattern_col = "pattern", typo_table = NULL, typo_from_col = "typo", typo_to_col = "correction", date_col = NULL, date_start = NULL, date_end = NULL, data_return_cols = NULL, regex_return_cols = NULL, remove_acronyms = FALSE, do_clean_text = TRUE, unique_match = FALSE, use_ner = FALSE, ner_timing = "after", ner_entity_types = c("ORG"), verbose = TRUE, cl = NULL )
| Argument | Description |
|---|---|
| data | A data frame or character vector containing the text to search. If a character vector is provided, it is internally converted to a data frame and |
| regex_table | A data frame containing regular expression patterns and optional metadata columns. |
| col_name | Character string specifying the column in |
| pattern_col | Character string specifying the column in |
| typo_table | Optional data frame with text replacements for corrections. Replacements are applied sequentially to the text using regex (with word boundaries) before pattern matching. |
| typo_from_col | Optional column in |
| typo_to_col | Optional column in |
| date_col | Optional column in |
| date_start | Optional start date (Date object or string like "YYYY-MM-DD") for filtering |
| date_end | Optional end date (Date object or string like "YYYY-MM-DD") for filtering |
| data_return_cols | Optional vector of column names to include from |
| regex_return_cols | Optional vector of column names to include from |
| remove_acronyms | Logical; if TRUE, removes patterns consisting only of uppercase letters (2 or more characters) from |
| do_clean_text | Logical; if TRUE, applies basic text cleaning to the input before matching. |
| unique_match | Logical; if TRUE, stops searching after the first match to find at most one match per row (evaluated in the order patterns appear in |
| use_ner | Logical; if TRUE, uses the 'spacyr' package to validate that matches are actual Named Entities (e.g., organizations). Note: |
| ner_timing | Character string; either "after" or "before". If "after" (default), regex matches are found first, then validated with NER. If "before", NER extracts entities first, and regex searches only within those entities. |
| ner_entity_types | Character vector; the types of Named Entities to keep if |
| verbose | Logical; if TRUE, displays progress messages. |
| cl | A cluster object created by |
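Three of the preprocessing options described above (sequential typo correction with word boundaries, acronym-pattern removal, and date filtering) are simple enough to sketch in base R. The helpers below are hypothetical and are not part of regextable; details such as case-sensitive typo matching, inclusive date bounds, and dropping rows with missing dates are assumptions made for the illustration.

```r
# Hypothetical helpers for illustration; these names are not regextable functions.

# Apply corrections one row at a time, wrapping each typo in word boundaries
apply_typos <- function(text, typo_table, from = "typo", to = "correction") {
  for (i in seq_len(nrow(typo_table))) {
    pattern <- paste0("\\b", typo_table[[from]][i], "\\b")
    text <- gsub(pattern, typo_table[[to]][i], text, perl = TRUE)
  }
  text
}

# Drop patterns that consist only of two or more uppercase letters
drop_acronyms <- function(regex_table, pattern_col = "pattern") {
  is_acronym <- grepl("^[A-Z]{2,}$", regex_table[[pattern_col]], perl = TRUE)
  regex_table[!is_acronym, , drop = FALSE]
}

# Keep rows whose date falls inside an optional window (bounds assumed inclusive)
filter_dates <- function(data, date_col, start = NULL, end = NULL) {
  d <- as.Date(data[[date_col]])
  keep <- !is.na(d)
  if (!is.null(start)) keep <- keep & d >= as.Date(start)
  if (!is.null(end)) keep <- keep & d <= as.Date(end)
  data[keep, , drop = FALSE]
}

typos <- data.frame(
  typo = c("teh", "senater"),
  correction = c("the", "senator"),
  stringsAsFactors = FALSE
)
fixed <- apply_typos("teh senater spoke; tehran is a city", typos)

lookup <- data.frame(
  pattern = c("EPA", "environmental protection agency", "US", "A"),
  stringsAsFactors = FALSE
)
kept <- drop_acronyms(lookup)

records <- data.frame(
  id = 1:3,
  date = c("2020-01-15", "2021-06-01", "2022-03-10"),
  stringsAsFactors = FALSE
)
recent <- filter_dates(records, "date", start = "2021-01-01")
```

The word boundaries keep `teh` from being rewritten inside `tehran`, which is the main reason to wrap typo patterns this way.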
Pattern matching is performed using R's regular expression engine and is
case-insensitive by default. For each input row, the function checks patterns
in regex_table and returns matches based on the unique_match parameter.
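To make the matching behavior concrete, here is a minimal base R sketch of case-insensitive matching over a set of patterns, with an option to stop at the first matching pattern per row. `match_patterns()` is a hypothetical function, not the package's implementation, and it assumes that at most one match is recorded per pattern per row.

```r
# Hypothetical sketch; match_patterns() is not a regextable function.
match_patterns <- function(text, patterns, unique_match = FALSE) {
  out <- list()
  for (row_id in seq_along(text)) {
    for (p in patterns) {
      # Case-insensitive search for the first occurrence of this pattern
      m <- regexpr(p, text[row_id], ignore.case = TRUE, perl = TRUE)
      if (is.na(m) || m == -1) next
      out[[length(out) + 1]] <- data.frame(
        row_id = row_id,
        pattern = p,
        # regmatches() returns the substring with its original casing
        match = regmatches(text[row_id], m),
        stringsAsFactors = FALSE
      )
      # With unique_match, keep only the first pattern that matches this row
      if (unique_match) break
    }
  }
  if (length(out) == 0) {
    return(data.frame(row_id = integer(), pattern = character(),
                      match = character(), stringsAsFactors = FALSE))
  }
  do.call(rbind, out)
}

texts <- c("I love apples", "Bananas are great", "Oranges and apples")
pats <- c("apples", "bananas", "oranges")

all_matches <- match_patterns(texts, pats)
first_only <- match_patterns(texts, pats, unique_match = TRUE)
```

In this sketch the third text yields two rows when all matches are requested and one row (for `apples`, the earlier pattern) when `unique_match = TRUE`, which shows why pattern order matters in that mode.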
A tibble with the following columns:
row_id: Integer identifier corresponding to rows in the input data.
Additional columns from data if data_return_cols is specified.
Additional columns from regex_table if regex_return_cols is specified.
pattern: The matched regular expression pattern(s).
match: The extracted text from the data (original casing preserved).
# Create sample data
data <- data.frame(
  id = 1:3,
  text = c("I love apples", "Bananas are great", "Oranges and apples"),
  stringsAsFactors = FALSE
)
# Create regex patterns
patterns <- data.frame(
  pattern = c("apples", "bananas", "oranges"),
  category = c("fruit", "fruit", "fruit")
)
# Extract all matches
extract(data, patterns)
# Extract one match per row
extract(data, patterns, unique_match = TRUE)
Lookup table of member names and metadata for regex matching.
A tibble with 9 columns:
Congress number (numeric)
Chamber (House/President/Senate)
Full bio name of the member
Regex pattern to match this member's name
Numeric ICPSR identifier
Two-letter state abbreviation
District number (0 for President)
First name of the member
Last name of the member
Generated for the regextable package.