How to remove or replace non English characters from Google BigQuery
When you table has non ascii characters, you may stumble on the below error:
\n\nUnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in \nposition 38: ordinal not in range(128)\n\n
Solution:
Use REGEXP REPLACE to remove non-ascii chars
\n\nREGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '')\n
\n\nExample
\n\n#standardSQL\nWITH `project.dataset.table` AS (\n SELECT '12 - Table - Стол - test' AS field1 UNION ALL\n SELECT '23 - Table - الطاولة' UNION ALL\n SELECT '34 - Table - שולחן' \n)\nSELECT \n REGEXP_REPLACE(field1, r'([^\p{ASCII}]+)', '') AS ascii_only,\n field1\nFROM `project.dataset.table` \n
\n\nResult
\n\nRow ascii_only field1 \n1 12 - Table - - test 12 - Table - Стол - test \n2 23 - Table - 23 - Table - الطاولة \n3 34 - Table - 34 - Table - שולחן \n
DataFreak
posted onEnjoy great content like this and a lot more !
Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds
Post Comment