How to efficiently sample a fixed number of rows in Google BigQuery
I have a large data set of size N, and want to get a (uniformly) random sample of size n. There are two possible solutions:
```sql
SELECT foo FROM mytable WHERE RAND() < n/N
```

This is fast, but doesn't give me exactly n rows (only approximately).
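For instance (a sketch, not from the original post; n = 1000 is an assumed target size), the fraction n/N can be computed inline with a scalar subquery so the table size doesn't have to be hard-coded. The number of rows returned is Binomial(N, n/N), so you'd typically get n ± √n rows:

```sql
-- Sketch of approach 1 with an assumed target of n = 1000 rows:
-- the scalar subquery computes N, so each row is kept independently
-- with probability n/N. The result size is Binomial(N, n/N), i.e.
-- roughly 1000 rows, give or take ~sqrt(1000) ≈ 32.
SELECT foo
FROM mytable
WHERE RAND() < 1000 / (SELECT COUNT(*) FROM mytable)
```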
```sql
SELECT foo, RAND() AS r FROM mytable ORDER BY r LIMIT n
```

This requires sorting all N rows, which seems unnecessary and wasteful (especially when n << N).
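One way to get exactly n rows without sorting all N is to combine the two approaches: filter first so the ORDER BY only has to sort a small superset, then trim to exactly n. This is a sketch under my own assumptions (n = 1000, and a 2x oversampling factor chosen so the filter is very unlikely to keep fewer than n rows; if it ever does, rerun with a larger factor):

```sql
-- Sketch: approach 1 trims the table down to roughly 2n rows, then
-- approach 2 runs on that small subset to return exactly n = 1000.
SELECT foo
FROM mytable
WHERE RAND() < 2 * 1000 / (SELECT COUNT(*) FROM mytable)  -- keep ~2n rows
ORDER BY RAND()                                           -- sort only those
LIMIT 1000                                                -- exactly n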
The best and easiest way to get a random sample in BigQuery:

If you execute the following query several times without using cached results, you will get different results each time.

```sql
SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5
```

Therefore, depending on how random you need the samples to be, this may be a good enough solution.
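Applied to the asker's table, the same pattern would look like the sketch below (n = 1000 assumed). One caveat: without an ORDER BY, BigQuery returns an arbitrary set of rows, not a statistically uniform sample, so this only fits when "any n rows" is good enough:

```sql
-- Same idea on the asker's table: any 1000 rows, fast, but the choice
-- is arbitrary (whatever BigQuery happens to read first), not a
-- uniform random sample.
SELECT foo
FROM mytable
LIMIT 1000
```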