22 Dec 20 · npack · #Bigquery

Google BigQuery Compression and Columnar Storage Overview

BigQuery is a columnar analytics database. Behind the scenes, all data is compressed by default. Although the data is stored compressed, Google charges customers for the uncompressed data size (for both storage and data scanned).

How BigQuery stores the data internally

With a columnar format, column-level analytical operations are much faster. To understand how BigQuery organizes data, we first need to become familiar with two concepts:

  • Repetition level: the level at which the value repeats, for repeated column types
  • Definition level: tells us whether the field has been assigned a value or is null

For example, consider the following table schema:

message Book {
  required string title;
  repeated string author;
  repeated group price {
    optional int64 discount;
    optional int64 usd;
    optional int64 eur;
  }
}

And suppose the table holds three records:

Book1:
  author: "AAA"
  title: "firstTitle"
  price:
    discount: 0
    eur: 11
    usd: 12
Book2:
  author: "BBB"
  author: "CCC"
  author: "DDD"
  title: "secondTitle"
Book3:
  title: "thirdTitle"
  price:
    discount: 0
    eur: 11
  price:
    discount: 1
    eur: 11

Let’s compute the repetition and definition levels for each value, adding explicit null values for the missing optional fields.

Book1:
  author: "AAA"         R: 0, D: 1
  title: "firstTitle"   R: 0, D: 1
  price:
    discount: 0         R: 0, D: 2
    eur: 11             R: 0, D: 2
    usd: 12             R: 0, D: 2
Book2:
  author: "BBB"         R: 0, D: 1
  author: "CCC"         R: 1, D: 1
  author: "DDD"         R: 1, D: 1
  title: "secondTitle"  R: 0, D: 1
  (price):
    (discount: null)    R: 0, D: 0
    (eur: null)         R: 0, D: 0
    (usd: null)         R: 0, D: 0
Book3:
  title: "thirdTitle"   R: 0, D: 1
  (author: null)        R: 0, D: 0
  price:
    discount: 0         R: 0, D: 2
    eur: 11             R: 0, D: 2
    (usd: null)         R: 0, D: 1
  price:
    discount: 1         R: 1, D: 2
    eur: 11             R: 1, D: 2
    (usd: null)         R: 1, D: 1

Repetition levels are always zero when there is no repetition. When a field is repeated, such as author in the second record, R is 1 because the repetition happens at the first repeated level; the same holds for price in the third record.

Definition levels are quite straightforward: in the first record, price.discount has D equal to 2 because both price and discount are defined. On the other hand, in the third record, the last null price.usd has D equal to 1, because price is defined but price.usd is not.

Each column is stored as a set of blocks like:

compressed value, R, D

R and D are stored only when they cannot be inferred. Null values, for instance, can be inferred: for them, D is always lower than the total number of repeated and optional fields in the field path (as can be seen from the example).

From the stored information, each record can be easily reconstructed in parallel for each queried column.

For example, let’s consider the price.eur column. On disk we will have:

11 R: 0, D: 2
NULL R: 0, D: 0
11 R: 0, D: 2
11 R: 1, D: 2
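As a sketch of how this reassembly could work (this is an illustration, not BigQuery's actual implementation), the price.eur column above can be rebuilt record by record from its stored (value, R, D) entries. R = 0 starts a new record, and the maximum definition level (2 here: repeated price plus optional eur) marks a real value:

```python
# Stored entries for the price.eur column: (value, repetition, definition).
stored = [
    (11, 0, 2),    # Book1: price.eur = 11
    (None, 0, 0),  # Book2: no price group at all
    (11, 0, 2),    # Book3: first price.eur = 11
    (11, 1, 2),    # Book3: second price.eur = 11
]

MAX_D = 2  # repeated `price` + optional `eur` on the field path


def reassemble(entries, max_d):
    records = []
    for value, r, d in entries:
        if r == 0:              # repetition level 0 -> a new record begins
            records.append([])
        if d == max_d:          # fully defined -> a real value
            records[-1].append(value)
        elif d >= 1:            # price group exists but eur is null
            records[-1].append(None)
        # d == 0: the whole price group is absent -> record stays empty
    return records


print(reassemble(stored, MAX_D))  # [[11], [], [11, 11]]
```

Note how the null in Book2 never needs to be materialized: D = 0 alone tells us the entire price group is missing from that record.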

Columnar storage and compression

Columnar storage lets us run extremely efficient compression algorithms. Two classic solutions are bitmaps and run-length encoding (RLE).

Let’s imagine you have a column of n rows with k distinct values. Extending the previous example, suppose the price.eur column holds the following values (n = 10, k = 5):

[10.0, 10.0, 8.99, 8.99, 7.0, 6.0, 7.0, 6.0, 2.0, 2.0]

This column can be compressed with k bitmaps (one for each distinct value) of length n (the row count), where a bit is set at a given position if that value appears in the corresponding row.

price.eur: 10.0 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
price.eur: 8.99 [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
price.eur: 7.0 [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
price.eur: 6.0 [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]
price.eur: 2.0 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
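Building these bitmaps is straightforward. A minimal Python sketch, one bitmap per distinct value:

```python
# The price.eur column from the example above (n = 10, k = 5).
column = [10.0, 10.0, 8.99, 8.99, 7.0, 6.0, 7.0, 6.0, 2.0, 2.0]


def build_bitmaps(values):
    """Return one length-n bitmap per distinct value in the column."""
    bitmaps = {}
    for pos, v in enumerate(values):
        # set the bit at `pos` in the bitmap belonging to value `v`
        bitmaps.setdefault(v, [0] * len(values))[pos] = 1
    return bitmaps


bitmaps = build_bitmaps(column)
print(bitmaps[7.0])  # [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
```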

The bitmaps can be stored instead of the actual column. The advantage is even bigger when you consider selection, aggregation and projection patterns. In analytics databases, queries like the one below

SELECT * FROM TABLE WHERE Col1 < 3.0 and Col2 = 4.0

can be answered directly by loading the bitmaps for the values matching = 4.0 and < 3.0 and performing a bitwise AND.
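To make this concrete, here is a small sketch of that evaluation strategy (the column values below are made up for illustration, not taken from the table above): OR together the bitmaps of every Col1 value satisfying < 3.0, then AND the result with the Col2 = 4.0 bitmap.

```python
# Illustrative columns; each row index is one record.
col1 = [1.0, 5.0, 2.0, 4.0]
col2 = [4.0, 4.0, 4.0, 9.0]


def bitmaps_for(values):
    out = {}
    for i, v in enumerate(values):
        out.setdefault(v, [0] * len(values))[i] = 1
    return out


b1, b2 = bitmaps_for(col1), bitmaps_for(col2)

# OR together the bitmaps of every col1 value matching `< 3.0` ...
lt3 = [0] * len(col1)
for v, bm in b1.items():
    if v < 3.0:
        lt3 = [a | b for a, b in zip(lt3, bm)]

# ... then AND with the bitmap for col2 = 4.0
result = [a & b for a, b in zip(lt3, b2[4.0])]
print(result)  # rows 0 and 2 qualify -> [1, 0, 1, 0]
```

The predicate is answered entirely with bitwise operations; no row data is touched until the matching rows are materialized.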

The compression can be improved even further with RLE, which represents the bitmap as the lengths of its runs of 0s and 1s. As an example, the first three bitmaps turn into:

price.eur: 10.0 – 0,2 (0 0s, 2 1s, rest 0s)
price.eur: 8.99 – 2,2 (2 0s, 2 1s, rest 0s)
price.eur: 7.0 – 4,1,1,1 (4 0s, 1 1, 1 0, 1 1, rest 0s)
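A minimal sketch of this encoding, assuming the convention used above: the first number counts leading 0s, runs alternate from there, and the trailing run of 0s is dropped.

```python
def rle(bitmap):
    """Encode a bitmap as alternating run lengths, starting with 0s."""
    runs, current, count = [], 0, 0
    for bit in bitmap:
        if bit == current:
            count += 1
        else:
            runs.append(count)      # close the previous run
            current, count = bit, 1
    if current == 1:                # keep the final run only if it is 1s
        runs.append(count)
    return runs                     # trailing 0s are implied ("rest 0s")


print(rle([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # [0, 2]
print(rle([0, 0, 1, 1, 0, 0, 0, 0, 0, 0]))  # [2, 2]
print(rle([0, 0, 0, 0, 1, 0, 1, 0, 0, 0]))  # [4, 1, 1, 1]
```

Since analytical bitmaps are typically sparse, long runs of 0s collapse to a single count, which is where most of the saving comes from.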

How BigQuery calculates the storage size of a table

When you load data into BigQuery or query it, you're charged according to the data size. Data size is calculated from the size of each column's data type (the uncompressed size).

For example, the sizes of BigQuery's data types are as follows:

Data type        Size
INT64/INTEGER    8 bytes
FLOAT64/FLOAT    8 bytes
NUMERIC          16 bytes
BOOL/BOOLEAN     1 byte
STRING           2 bytes + the UTF-8 encoded string size
BYTES            2 bytes + the number of bytes in the value
DATE             8 bytes
DATETIME         8 bytes
TIME             8 bytes
TIMESTAMP        8 bytes
STRUCT/RECORD    0 bytes + the size of the contained fields
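Using those per-type sizes, you can estimate the billed size of a row yourself. A small sketch (the schema and row below are made up for illustration):

```python
# Fixed sizes in bytes for BigQuery's fixed-width types, per the table above.
TYPE_SIZE = {
    "INT64": 8, "FLOAT64": 8, "NUMERIC": 16, "BOOL": 1,
    "DATE": 8, "DATETIME": 8, "TIME": 8, "TIMESTAMP": 8,
}


def field_size(bq_type, value):
    if bq_type == "STRING":
        return 2 + len(value.encode("utf-8"))  # 2 bytes + UTF-8 size
    if bq_type == "BYTES":
        return 2 + len(value)                  # 2 bytes + raw byte count
    return TYPE_SIZE[bq_type]


# A hypothetical row: one STRING, one INT64, one DATE field.
row = [
    ("title", "STRING", "firstTitle"),
    ("usd", "INT64", 12),
    ("published", "DATE", "2020-12-22"),
]
total = sum(field_size(t, v) for _, t, v in row)
print(total)  # (2 + 10) + 8 + 8 = 28 bytes
```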

You can find more details on Google's official page: BigQuery table space calculation