How to do efficient sampling of a fixed number of rows in Google BigQuery
I have a large data set of size N, and want to get a (uniformly) random sample of size n. There are two possible solutions:
SELECT foo FROM mytable WHERE RAND() < n/N
This is fast, but doesn't give me exactly n rows (only approximately).
SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n
This requires sorting all N rows, which seems unnecessary and wasteful (especially when n << N).
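A possible compromise, sketched below under the assumption that a tiny chance of under-sampling is acceptable, is to combine both ideas: pre-filter with RAND() so only about 2n rows survive, then sort just that small subset. The values are illustrative (n = 1000 out of N = 1,000,000; substitute your own counts), and `mytable`/`foo` are the placeholders from above:

-- Sketch: oversample with RAND(), then sort only the survivors.
SELECT foo
FROM (
  SELECT foo, RAND() AS r
  FROM mytable
) AS t
WHERE r < 2 * 1000 / 1000000  -- 2*n/N: keeps ~2n rows in expectation
ORDER BY r                    -- sorts only the ~2n surviving rows
LIMIT 1000                    -- n: then take exactly n of them

The factor 2 is a safety margin; for n in the hundreds or more, the probability that fewer than n rows pass the filter (and the query therefore returns fewer than n rows) is negligible.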
The best and easiest way to get a random sample from BigQuery:
If you execute the following query several times without using cached results, you will get different results.
SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5
Therefore, depending on how random you need the sample to be, this may be a better solution. Note that LIMIT without ORDER BY simply returns whichever rows the engine happens to produce first, so the result is nondeterministic rather than uniformly random.
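If repeatability matters more than randomness, a hash-based filter is a common alternative (my own sketch, not part of the answer above): hashing a key column selects the same rows on every run. Here `title`, a column of the sample table, is hashed with FARM_FINGERPRINT purely for illustration:

-- Deterministic ~0.1% sample: the same rows come back on every run.
SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
WHERE MOD(ABS(FARM_FINGERPRINT(title)), 1000) = 0
LIMIT 5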
Asran
posted on 31 May '19