BigQuery Create table
Use the Create table statement to create a new permanent table in BigQuery
Create table example
CREATE TABLE mydataset.Employee ( Id INT64 NOT NULL, Name STRING(300) NOT NULL, Join_Date DATE NOT NULL, Branch INT64, Salary NUMERIC ) PARTITION BY Join_Date CLUSTER BY Branch, Id
Create or replace table, if exists
CREATE OR REPLACE TABLE mydataset.Employee ( Id INT64 NOT NULL, Name STRING(300) NOT NULL, Join_Date DATE NOT NULL, Branch INT64, Salary NUMERIC ) PARTITION BY Join_Date CLUSTER BY Branch, Id
Create table, if not exists
CREATE TABLE IF NOT EXISTS mydataset.Employee ( Id INT64 NOT NULL, Name STRING(300) NOT NULL, Join_Date DATE NOT NULL, Branch INT64, Salary NUMERIC ) PARTITION BY Join_Date CLUSTER BY Branch, Id
Partition table by month
CREATE TABLE mydataset.Employee ( Id INT64 NOT NULL, Name STRING(300) NOT NULL, Join_Date DATE NOT NULL, Branch INT64, Salary NUMERIC ) PARTITION BY DATE_TRUNC(Join_Date, MONTH) CLUSTER BY Branch, Id
Partition by Timestamp column
CREATE TABLE mydataset.Employee ( Id INT64 NOT NULL, Name STRING(300) NOT NULL, Join_Date DATE NOT NULL, Branch INT64, Salary NUMERIC, Create_TS Timestamp ) PARTITION BY TIMESTAMP_TRUNC(Join_Date, DAY) CLUSTER BY Branch, Id
Partition by Integer column
CREATE TABLE mydataset.Employee ( Id INT64 NOT NULL, Name STRING(300) NOT NULL, Join_Date DATE NOT NULL, Branch INT64, Salary NUMERIC, Create_TS Timestamp ) PARTITION BY RANGE_BUCKET(Id, GENERATE_ARRAY(1, 100000, 1000)) CLUSTER BY Branch, Id;
Syntax for Partition by
- DATE({ timestamp_column | datetime_column }). Partition by a TIMESTAMP or DATETIME column with daily partitions.
- DATETIME_TRUNC(datetime_column, { DAY | HOUR | MONTH | YEAR }). Partition by a DATETIME column with the specified partitioning type.
- TIMESTAMP_TRUNC(timestamp_column, { DAY | HOUR | MONTH | YEAR }). Partition by a TIMESTAMP column with the specified partitioning type.
- DATE_TRUNC(date_column, { MONTH | YEAR }). Partition by a DATE column with the specified partitioning type.
- RANGE_BUCKET(int64_column, GENERATE_ARRAY(start, end[, interval])). Partition by an integer column with the specified range, where: start is the start of range partitioning, inclusive. end is the end of range partitioning, exclusive. interval is the width of each range within the partition. Defaults to 1.
Explanation
This BigQuery CREATE TABLE example creates a new table called Employee which has 5 columns.
- The first column is called "Id" which is created as a number datatype and can not contain null values
- The second column is called Name which is a Varchar datatype (300 maximum characters in length) and also can not contain null values.
- and so on...
Choosing Partition and Cluster keys: tips
You can select 1 column as part of parition key and upto 4 columns as part of cluster keys
The Partition column is used to split the data and store in respective chunks
- Choose a column that gets you even data distribution. eg. the partition key results in equal sized partitions
- Make sure the partition key doesn't result in too many partitions. Having lesser than 4000 partitions is recommended
- The partition key is frequently used in the filter clause during data retrieval
The cluster by columns are used to sort the data within the parititions
- Make sure the most popular filter column is specified in the leftmost followed by the less frequently used ones. The column order matters
- There is no performance or cost overhead for using cluster keys, so feel free to define them in all your tables
Syntax reference
CREATE [ OR REPLACE ] [ TEMP | TEMPORARY ] TABLE [ IF NOT EXISTS ] [[project_name.]dataset_name.]table_name [( column[, ...] )] [PARTITION BY partition_expression] [CLUSTER BY clustering_column_list] [OPTIONS(table_option_list)] [AS query_statement]