How to use clustering without a partition key in a Bigquery table?

Currently, you can partition only on a date or a timestamp column in bigquery. But, what if your table doesnt have a date or timestamp column ? Well, its going to do a full table scan. Consider the below example,

SELECT ID,Key FROM mydataset.mytable where ID = 100077113;
(3.3s elapsed, 72.1 GB processed)
\n \n

How can we use the clustering capabilities in BigQuery without a partition key on the table?

\n Solution:

Our example table has 12m rows and 76 GB of data. This table has no timestamp column.

\n \n

This is how to cluster said table - while creating a fake date column for fake partitioning:

\n \n
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered` \n   (id STRING, size INT64, content STRING, binary BOOL\n    , copies INT64, sample_repo_name STRING, sample_path STRING\n    , fake_date DATE) \n PARTITION BY fake_date \n CLUSTER BY id AS (\n   SELECT *, DATE('1980-01-01') fake_date \n   FROM `fh-bigquery.github_extracts.contents_java`\n )
\n \n

Did it work?

\n \n
# original table\n \n SELECT *\n FROM `fh-bigquery.github_extracts.contents_java`\n WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'\n (3.3s elapsed, 72.1 GB processed)\n \n # clustered table\n SELECT *\n FROM `fh-bigquery.public_dump.github_java_clustered2`\n WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'\n (2.4s elapsed, 232 MB processed)
\n \n

What we learned here:

\n \n
    \n
  • Clustering can work with unique ids, even for tables without a date to partition by.
  • \n
  • Prefer using a fake date instead of a null date
  • \n
  • Clustering made my query 99.6% cheaper when looking for rows by id

DataFreak

posted on

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds