How to remove or replace non English characters from Google BigQuery

When you table has non ascii characters, you may stumble on the below error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in 
position 38: ordinal not in range(128)

Solution:

Use REGEXP REPLACE to remove non-ascii chars

REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '')

Example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '12 - Table - Стол - test' AS field1 UNION ALL
  SELECT '23 - Table - الطاولة' UNION ALL
  SELECT '34 - Table - שולחן' 
)
SELECT 
  REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '') AS ascii_only,
  field1
FROM `project.dataset.table` 

Result

Row ascii_only          field1   
1   12 - Table - - test 12 - Table - Стол - test     
2   23 - Table -        23 - Table - الطاولة     
3   34 - Table -        34 - Table - שולחן   

DataFreak

posted on 01 Mar 21

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds