How to remove or replace non English characters from Google BigQuery

When you table has non ascii characters, you may stumble on the below error:

\n\n
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in \nposition 38: ordinal not in range(128)
\n\n

Solution:

Use REGEXP REPLACE to remove non-ascii chars

\n\n
REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '')\n
\n\n

Example

\n\n
#standardSQL\nWITH `project.dataset.table` AS (\n  SELECT '12 - Table - Стол - test' AS field1 UNION ALL\n  SELECT '23 - Table - الطاولة' UNION ALL\n  SELECT '34 - Table - שולחן' \n)\nSELECT \n  REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '') AS ascii_only,\n  field1\nFROM `project.dataset.table` \n
\n\n

Result

\n\n
Row ascii_only          field1   \n1   12 - Table - - test 12 - Table - Стол - test     \n2   23 - Table -        23 - Table - الطاولة     \n3   34 - Table -        34 - Table - שולחן   \n

DataFreak

posted on

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds