How to remove or replace non English characters from Google BigQuery
When you table has non ascii characters, you may stumble on the below error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
Solution:
Use REGEXP REPLACE to remove non-ascii chars
REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '')
Example
#standardSQL
WITH `project.dataset.table` AS (
SELECT '12 - Table - Стол - test' AS field1 UNION ALL
SELECT '23 - Table - الطاولة' UNION ALL
SELECT '34 - Table - שולחן'
)
SELECT
REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '') AS ascii_only,
field1
FROM `project.dataset.table`
Result
Row ascii_only field1
1 12 - Table - - test 12 - Table - Стол - test
2 23 - Table - 23 - Table - الطاولة
3 34 - Table - 34 - Table - שולחן
DataFreak
posted on 01 Mar 21Enjoy great content like this and a lot more !
Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds
Post Comment