Mastering BigQuery: Working with Arrays, Nested, and Repeated Fields

Google BigQuery is a powerhouse for managing and analyzing large-scale datasets. Among its standout features are nested and repeated fields, which allow you to effectively represent complex, hierarchical, or denormalized data. This article dives into the details of working with these fields, explores practical examples, and highlights best practices.


What Are Nested and Repeated Fields in BigQuery?

BigQuery’s support for nested and repeated fields enables you to represent structured data hierarchies efficiently. These features reduce the need for traditional normalization while optimizing query performance for hierarchical or parent-child relationships.

1. Nested Fields

Nested fields in BigQuery are created using the STRUCT (or RECORD) type. A STRUCT is a container that groups multiple fields into a single column.

  • Example of a Nested Field:
    A table where customer contact information is grouped under a single column called contact_info.
CREATE TABLE `project.dataset.nested_table` (
  customer_id INT64,
  customer_name STRING,
  contact_info STRUCT<
    email STRING,
    phone STRING
  >
);

2. Repeated Fields

Repeated fields allow you to store arrays of values in a single column. These fields are denoted with the ARRAY type.

  • Example of a Repeated Field:
    A table where a customer has multiple orders stored in a single column.
CREATE TABLE `project.dataset.repeated_table` (
  customer_id INT64,
  order_history ARRAY<
    STRUCT<
      order_id INT64,
      order_date DATE,
      total_amount FLOAT64
    >
  >
);

3. Combined Nested and Repeated Fields

BigQuery also supports a combination of nested and repeated fields, allowing for highly complex data structures.

  • Example:
    Nested STRUCT fields that contain repeated ARRAY fields for customer order history.
CREATE TABLE `project.dataset.combined_table` (
  customer_id INT64,
  customer_name STRING,
  contact_info STRUCT<
    email STRING,
    phone STRING
  >,
  order_history ARRAY<
    STRUCT<
      order_id INT64,
      order_date DATE,
      total_amount FLOAT64
    >
  >
);

Why Use Nested and Repeated Fields?

Advantages

  1. Denormalization for Performance
  • BigQuery excels with denormalized data, minimizing the need for JOIN operations and improving query speed.
  1. Hierarchical Relationships
  • Ideal for parent-child relationships like orders and customers.
  1. Efficient Storage
  • Compact storage for structured data, reducing redundancy.

Best Practices

  1. Use nested and repeated fields for frequently queried hierarchical data.
  2. Avoid nesting more than 15 levels, as this is BigQuery’s schema limit.
  3. Use these fields to maintain denormalization without impacting query performance.

Loading Data with Nested and Repeated Fields

Loading JSON Data

To load data with nested and repeated fields, ensure your JSON file is formatted appropriately and a schema file is defined.

Example JSON File (sample_data.json):

{
  "customer_id": 1,
  "customer_name": "John Doe",
  "contact_info": {
    "email": "john.doe@example.com",
    "phone": "123-456-7890"
  },
  "order_history": [
    {
      "order_id": 101,
      "order_date": "2023-11-01",
      "total_amount": 250.75
    },
    {
      "order_id": 102,
      "order_date": "2023-11-05",
      "total_amount": 100.50
    }
  ]
}

Schema File (schema.json):

[
  {"name": "customer_id", "type": "INT64"},
  {"name": "customer_name", "type": "STRING"},
  {"name": "contact_info", "type": "RECORD", "fields": [
    {"name": "email", "type": "STRING"},
    {"name": "phone", "type": "STRING"}
  ]},
  {"name": "order_history", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "INT64"},
    {"name": "order_date", "type": "DATE"},
    {"name": "total_amount", "type": "FLOAT64"}
  ]}
]

Command to Load Data

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --schema=schema.json \
  project.dataset.table_name \
  sample_data.json

For detailed guidance, check out the BigQuery Loading JSON Documentation.


Querying Nested and Repeated Fields

Flattening Repeated Fields with UNNEST

BigQuery requires the UNNEST function to work with repeated fields.

Example Query:

Retrieve all orders for each customer:

SELECT 
  customer_id, 
  customer_name, 
  order_history.order_id, 
  order_history.total_amount
FROM `project.dataset.combined_table`,
UNNEST(order_history) AS order_history;

Filtering Specific Orders:

SELECT 
  customer_id, 
  customer_name, 
  order_history.order_id, 
  order_history.total_amount
FROM `project.dataset.combined_table`,
UNNEST(order_history) AS order_history
WHERE order_history.order_id = 101;

For more examples, see the BigQuery Nested and Repeated Fields Documentation.


Best Practices for Nested and Repeated Fields

  1. Optimize for Read Performance
    Use nested and repeated fields for hierarchical data that is often queried together.
  2. Avoid Over-Normalization
    Normalize only when absolutely necessary to prevent redundancy.
  3. Leverage UNNEST for Analysis
    Simplify data flattening during queries.

Conclusion

Using nested and repeated fields in BigQuery can dramatically improve the efficiency of your queries and data management. These features allow for highly structured data storage while maintaining performance. By following best practices and leveraging tools like UNNEST, you can harness the full power of BigQuery for your analytics needs.

For more information, explore:


Leave a Reply

Your email address will not be published. Required fields are marked *