BigQuery is taking the analytics industry by storm. One of the most important use cases is to extract and transform parts of a string. Below are some of the powerful constructs provided by BigQuery.
BigQuery provides the following constructs for regex operations.
Construct | Arguments | Use |
REGEXP_CONTAINS | REGEXP_CONTAINS(value, regexp) | Checks if the string contains the regex |
REGEXP_EXTRACT | REGEXP_EXTRACT(value, regexp[, position[, occurrence]]) | Extract the substring that matches the regex. |
REGEXP_EXTRACT_ALL | REGEXP_EXTRACT_ALL(value, regexp) | Return all substrings that match the regex |
REGEXP_INSTR | REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]]) | Return 1-based index for the first occurrence of the regex |
REGEXP_REPLACE | REGEXP_REPLACE(value, regexp, replacement) | Replace all substrings in value |
Source: Google Cloud Platform
Table Setup
Let’s create a simple table with some string data to play around with.
-- Create Store
CREATE OR REPLACE TABLE `gcp-tour-276710.thefellowcoder.store`
(
date DATE,
details STRING,
address STRING
)
-- Add values
INSERT INTO `gcp-tour-276710.thefellowcoder.store`
VALUES
('2021-07-14', '<h1>Nike Shoes</h1><p>Base: $450 Addon: $20</p><p>Support email: help@nike.com</p>','Los Angeles, California 90001 '),
('2021-06-13', '<h1>Rebook Shoes</h1><p>Base: $150 Addon: $10</p><p>Support email: help@rebook.com</p>','Sacramento, California 94203 '),
('2021-07-11', '<h1>Adidas Shoes</h1><p>Base: $350 Addon: $30</p><p>Support email: help@adidas.com</p>','Beverly Hills, California 90209 '),
('2021-07-11', '<h1>Carnival All Rounder Shoes</h1><p>Base: $250 Addon: $0</p>','Beverly Hills, California 90209 ')
Sample Queries
Let's see each of the constructs in action
- REGEXP_CONTAINS example
-- All stores that have a support email
select * from `gcp-tour-276710.thefellowcoder.store`
where REGEXP_CONTAINS(details, r'\S+@\S+\.\S+')
- REGEXP_EXTRACT example
-- Extract Product Names
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>')
from `gcp-tour-276710.thefellowcoder.store`
- REGEXP_EXTRACT_ALL example
-- Extract all prices
select REGEXP_EXTRACT_ALL(details, r'\$(.\d+)')
from `gcp-tour-276710.thefellowcoder.store`
- REGEXP_INSTR example
-- Get index of email
select details, REGEXP_INSTR(details, r'\S+@\S+\.\S+')
from `gcp-tour-276710.thefellowcoder.store`
- REGEXP_REPLACE example
-- Replace all email-ids
select REGEXP_REPLACE(details, r'\S+@\S+\.\S+', 'admin@thefellowcoder.com') from `gcp-tour-276710.thefellowcoder.store`
Common Use-cases:
Extract part of the string with regex
This is one of the most common use cases wherein we must extract a substring that matches a regex.
-- Extract product names from all details
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>')
from `gcp-tour-276710.thefellowcoder.store`
Replace regex with string
Let’s say you want to sanitize the string and remove all special characters from the string.
-- remove special characters
select details,REGEXP_REPLACE(details, r'[^0-9a-zA-Z]+',' ')
from `gcp-tour-276710.thefellowcoder.store`
Matches any of the regex patterns
check for multiple regex patterns for a string
-- check for pincodes starting with 90/94
select address,REGEXP_EXTRACT(address, r'9[0|4]\d+') as pincode
from `gcp-tour-276710.thefellowcoder.store`
Useful regex examples for data cleaning.
String | Regex | Output |
<h1>Hello World!</h1> | <h1>([^<]+)<\/h1> | Hello World! |
<p>Support email: thefellowcoder@gmail.com</p> | \w+@\w+\.\w+ | thefellowcoder@gmail.com |
Rebook ShOes | (?i)shoes | ShOes |
This,contains$special\characters | [^0-9a-zA-Z]+ | This contains special characters |
String needs to be trimmed. | ^(\s+).*?(\s+)$ | String needs to be trimmed. |