Content
- What are Wildcard Tables in BigQuery
- When to Use Wildcard Tables in BigQuery
- Comparing Wildcard Tables and Partitioned Tables
- How to Create Wildcard Tables in BigQuery
- Query Tables with Wildcard Tables
- Best Practices for Using Wildcard Tables
- Limitations of Using Wildcard Tables
- Discover the Potential of Related BigQuery Functions
- Gain Advanced Insights with the OWOX BI BigQuery Reports Extension
Using Wildcard Tables in Google BigQuery
Anna Panchenko, Senior Digital Analyst @ OWOX
Are you looking to optimize your queries across multiple tables in Google BigQuery? Wildcard tables provide a powerful solution, allowing you to run queries over multiple tables using concise SQL patterns.
This guide introduces you to wildcard tables, demonstrating their utility in handling extensive datasets that span numerous time-partitioned tables.
By mastering this feature, you’ll enhance query efficiency and simplify your data analysis process. Whether you’re a data engineer or an analyst, learn how to tap into the full potential of wildcard tables to make your BigQuery tasks more effective and time-efficient.
What are Wildcard Tables in BigQuery
Wildcard tables in BigQuery are sets of tables that share a common naming convention but differ by suffix. These suffixes typically represent dates in the YYYYMMDD format.
Google BigQuery allows data professionals to query these tables collectively by substituting wildcard characters for parts or the entire suffix.
Common examples of wildcard tables in data processing environments include:
- GA4 BigQuery Export Tables: These tables often have prefixes indicating their type. For instance, session streaming tables might be named with a prefix like ‘events_’, followed by a date-specific suffix, resulting in names like ‘events_20240825’. Similarly, user tables often start with a prefix such as ‘users_’, creating names like ‘users_20240825’.
- Cost Data Tables: These tables, used for storing cost data, typically begin with a prefix like ‘FacebookAds_’, followed by a date suffix, leading to full names such as ‘FacebookAds_20240801’.
- Standard Export Tables of Analytics Platforms: These tables are usually named with a prefix that indicates their function, like ‘analytics_sessions_’, followed by a date suffix. An example would be ‘analytics_sessions_20200101’.
These naming conventions allow for efficient querying and management of data spread across multiple tables related by time or type.
When to Use Wildcard Tables in BigQuery
Wildcard tables are most useful for datasets that contain many similarly named tables with uniform schemas, such as data segmented by time (e.g., days, months, years). They let you query all such tables through a common pattern, enabling efficient analysis of large, date-sharded datasets.
For example, BigQuery's NOAA Global Surface Summary of the Day dataset includes annual tables from 1929 onwards.
Querying such data without wildcards involves listing each table explicitly, which can be time-consuming as you'd typically need to specify each table in the FROM clause, resulting in a long and complex query structure like the following.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) as celsius,
mo,
da,
year
FROM (
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1929`
UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1930`
UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1931`
# ... additional tables omitted for brevity
UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1940`
)
WHERE
max != 9999.9 # code for missing data
ORDER BY
max DESC
Using wildcards simplifies this process, allowing for concise and manageable queries.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '29' AND '40'
ORDER BY
max DESC
This approach reduces the complexity of the query and makes the SQL code more concise and easier to maintain, especially when working with large datasets spanning multiple tables.
Comparing Wildcard Tables and Partitioned Tables
Wildcard Tables and Partitioned Tables offer distinct approaches for managing and optimizing data in large datasets. This comparison highlights their key differences in data structure, data access, and flexibility to help you choose the best fit for your data processing needs.
Data Reading Efficiency
Wildcard tables allow for efficient data access by querying multiple tables with similar names using a common pattern, particularly when analyzing daily data additions.
Partitioned tables, on the other hand, offer optimized access not just by date but also by integer values, enabling more efficient queries and data manipulation across larger datasets.
Query Complexity
Queries using wildcard tables can be more complex due to the need to construct dynamic queries, especially when using the _TABLE_SUFFIX for date-specific data access.
Partitioned tables simplify query construction, especially in Standard SQL, where the partitioning criterion is predefined.
Data Storage and Management
Wildcard tables create virtual sections automatically as new data is added, making them suitable for daily data accumulation but potentially limiting for retrospective updates.
In contrast, partitioned tables require pre-defined partitions, which can hold up to 4000 partitions, with the partitioning field occupying space within the data, impacting storage and access efficiency.
Cost and Performance Optimization
Wildcard tables can be cost-effective for querying data by specific dates without needing an additional date field, but the performance may degrade if the queries become too complex.
Partitioned tables provide better performance for large datasets, allowing for cost-efficient queries that leverage the partitioning strategy, especially when the dataset is large and retrospective updates are frequent.
Support and Compatibility
Wildcard tables are supported in Legacy SQL and Standard SQL, though the SQL queries may become more complex than flat tables.
Partitioned tables, however, are supported only in Standard SQL, making them incompatible with Legacy SQL, which limits their usage in systems reliant on the older query syntax.
Use Cases and Flexibility
Wildcard tables are ideal for accumulating daily data, especially when ease of section creation and simple date-based access are priorities.
Partitioned tables are better suited for scenarios requiring optimized data access and flexibility in applying Data Manipulation Language (DML) operations across multiple partitions, such as in complex reporting schemes that involve retrospective data updates.
Advantages and Disadvantages
Wildcard tables offer simplicity in data management and are cost-effective for daily data accumulation. However, they can be limited in flexibility and performance for complex queries and updates.
Partitioned tables provide robust performance and flexibility for large datasets with complex access patterns, but they require more upfront setup and are only compatible with Standard SQL.
Aspect | Wildcard Tables | Partitioned Tables |
Data Reading Efficiency | Efficient for querying multiple tables with similar names, particularly for daily data additions. | Optimized access by date and integer values, enabling efficient queries across large datasets. |
Query Complexity | Queries can be more complex due to the need to construct dynamic queries with _TABLE_SUFFIX. | Simplifies query construction with predefined partitions in Standard SQL. |
Data Storage and Management | Virtual sections are created automatically with new data, suitable for daily data but limited for updates. | Requires pre-defined partitions (up to 4,000), with space occupied by the partitioning field. |
Cost and Performance Optimization | Cost-effective for date-specific queries, but performance may degrade with complex queries. | Provides better performance for large datasets and is cost-efficient with optimized queries across multiple partitions. |
Support and Compatibility | Supported in Legacy SQL and Standard SQL, though queries may be more complex. | Supported only in Standard SQL; incompatible with Legacy SQL. |
Use Cases and Flexibility | Ideal for daily data accumulation with simple date-based access. | Better for scenarios requiring optimized access and flexibility in DML operations across partitions. |
Advantages and Disadvantages | Simple data management, cost-effective for daily data, but limited in flexibility for updates and performance. | Robust performance and flexibility, especially for large datasets, requires more setup and is Standard SQL only. |
How to Create Wildcard Tables in BigQuery
To create wildcard tables, simply create multiple tables that share a common prefix and differ by suffix. BigQuery automatically recognizes this naming convention and lets you query the tables as a wildcard group.
Example of Creating Wildcard Table
Suppose you have weather data in the NOAA GSOD dataset from 1929 to 1935, with each year's data stored in a separate table. The tables are named as follows:
- gsod1929
- gsod1930
- gsod1931
- gsod1932
- gsod1933
- gsod1934
- gsod1935
You can create a wildcard table using the common prefix gsod19 and a wildcard * to represent the suffix, which covers all these years.
SELECT
max,
min
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX BETWEEN '29' AND '35'
Here:
- Table Naming: Each table is named with the common prefix gsod19 followed by the year as the suffix.
- Wildcard Usage: The wildcard * is used in the FROM clause to query all tables with names starting with gsod19, covering the years 1929 to 1935.
- WHERE Clause: The _TABLE_SUFFIX pseudo column is used to filter the query only to include tables from 1929 to 1935.
This setup allows you to efficiently query multiple tables at once without listing each table individually.
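To see how the suffix filter behaves, here is a minimal plain-Python sketch (not the BigQuery API; table names are illustrative): _TABLE_SUFFIX matching is a lexicographic string comparison on whatever follows the wildcard prefix.

```python
# Plain-Python illustration of how a BETWEEN filter on _TABLE_SUFFIX selects
# tables: BigQuery compares the text after the wildcard prefix as strings.
tables = [f"gsod19{yy}" for yy in range(29, 41)]  # gsod1929 .. gsod1940
prefix = "gsod19"

# Equivalent of: WHERE _TABLE_SUFFIX BETWEEN '29' AND '35'
matched = [t for t in tables if "29" <= t[len(prefix):] <= "35"]
print(matched)  # gsod1929 through gsod1935
```

Because the comparison is on strings, suffixes must be zero-padded consistently for ranges like this to behave as expected.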
Query Tables with Wildcard Tables
Using wildcard tables in queries enhances the ability to manage and analyze large datasets across multiple tables efficiently.
By employing wildcards, you can perform comprehensive analyses over a range of tables without specifying each one individually, significantly streamlining the querying process and reducing both complexity and execution time.
Querying All Tables in a Dataset
To query all tables within a dataset, you can utilize a wildcard in the FROM clause with an empty prefix.
This approach leverages the _TABLE_SUFFIX pseudo column, which captures the full table names.
Example 1:
For instance, to examine every table in the GSOD dataset, you would write:
SELECT * FROM `bigquery-public-data.noaa_gsod.*`
Here:
- FROM: Specifies the source of the data.
- bigquery-public-data: Project ID on Google BigQuery, indicating the data comes from Google's public datasets.
- noaa_gsod: Dataset ID for NOAA's Global Surface Summary of the Day, containing daily weather summaries.
- *: Wildcard character used to include all tables within the noaa_gsod dataset, enabling queries across the entire dataset without specifying individual tables.
Example 2:
With an empty prefix, the _TABLE_SUFFIX pseudo column contains the complete table names.
For instance, the following query also finds the highest temperature from 1929 to 1935, but it filters on full table names in the WHERE clause:
#standardSQL
SELECT
max,
ROUND((max-32)*5/9, 1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.*`
WHERE
max != 9999.9 # filter out missing data
AND _TABLE_SUFFIX BETWEEN 'gsod1929' AND 'gsod1935'
ORDER BY
max DESC
Here:
- #standardSQL: Specifies the use of standard SQL syntax in BigQuery.
- SELECT clause: Selects maximum temperature (max), converts it to Celsius (celsius), and retrieves the month (mo), day (da), and year from the dataset.
- FROM clause: Queries all tables in the noaa_gsod dataset using a wildcard *.
- WHERE clause: Excludes records with missing data (max != 9999.9) and limits the query to tables from 1929 to 1935 using _TABLE_SUFFIX.
- ORDER BY clause: Sorts the results by maximum temperature in descending order.
This query efficiently scans all tables from 1929 to 1935 in the NOAA GSOD dataset to find the maximum temperatures, simplifying access to a broad dataset.
Filtering Specific Tables Using _TABLE_SUFFIX
The _TABLE_SUFFIX pseudo column in BigQuery allows you to filter specific tables within a wildcard table by matching suffix patterns in the table names.
This is particularly useful when querying multiple tables with a common prefix, enabling efficient data retrieval from the desired subset.
Filtering by Specific Years (1940 and 1944):
The _TABLE_SUFFIX pseudo column takes the values '0' through '9' for the tables gsod1940 through gsod1949. You can use these values in a WHERE clause to filter for specific tables within that range.
For instance, to filter for the maximum temperature between 1940 and 1944, use the values '0' and '4' for _TABLE_SUFFIX.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9, 1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.gsod194*`
WHERE
max != 9999.9 # filter out missing data
AND (_TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4')
ORDER BY
max DESC
Here:
- SELECT: Retrieves maximum temperature (max), converts it to Celsius (celsius), and selects month (mo), day (da), and year.
- FROM: Targets NOAA GSOD tables from the 1940s using the wildcard gsod194*.
- WHERE: Filters out missing data (max != 9999.9) and limits results to the years 1940 (_TABLE_SUFFIX = '0') and 1944 (_TABLE_SUFFIX = '4').
- ORDER BY: Sorts results by descending maximum temperature.
Using _TABLE_SUFFIX can significantly reduce the number of bytes scanned, lowering the cost of your queries.
Filtering with Dynamic Value:
However, filters on _TABLE_SUFFIX that involve conditions without constant expressions do not limit the number of tables scanned in a wildcard query.
For instance, the following query does not reduce the tables scanned for the wildcard bigquery-public-data.noaa_gsod.gsod19* because it derives the suffix from a dynamic value in the table_name column:
#standardSQL
# Scans all tables that match the prefix `gsod19`
SELECT
ROUND((max-32)*5/9, 1) as celsius
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX = (
SELECT SUBSTR(MAX(table_name), LENGTH('gsod19') + 1)
FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'gsod194%'
)
Here:
- SELECT: Converts the maximum temperature to Celsius (celsius).
- FROM: Queries all NOAA GSOD tables from the 1900s using the wildcard gsod19*.
- WHERE: Dynamically filters _TABLE_SUFFIX based on table_name values from INFORMATION_SCHEMA.TABLES. Because this filter is not a constant expression, it does not limit the number of tables scanned.
This query scans all tables matching gsod19* due to the dynamic filter in the WHERE clause, leading to higher costs and processing time. For better efficiency, use constant expressions in _TABLE_SUFFIX filters.
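If you need the newest matching table, one common pattern is to resolve its suffix client-side first and inline it into the query as a literal, so BigQuery can prune tables before scanning. A minimal sketch, assuming the table list comes from a prior INFORMATION_SCHEMA query (the helper name is ours):

```python
def latest_suffix(table_names, prefix):
    """Return the largest suffix among table names starting with the prefix."""
    candidates = [t[len(prefix):] for t in table_names if t.startswith(prefix)]
    return max(candidates) if candidates else None

# Stand-in for the result of a cheap INFORMATION_SCHEMA.TABLES query:
names = ["gsod1929", "gsod1940", "gsod1944"]
suffix = latest_suffix(names, "gsod19")

# The suffix is now a constant in the final query, so BigQuery can prune:
sql = (
    "SELECT max FROM `bigquery-public-data.noaa_gsod.gsod19*` "
    f"WHERE _TABLE_SUFFIX = '{suffix}'"
)
print(sql)
```

The metadata query itself is cheap; the savings come from the main query scanning one table instead of every table matching the prefix.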
Scanning a Range of Tables Using _TABLE_SUFFIX
To query a specific range of tables, leverage the _TABLE_SUFFIX pseudo-column in conjunction with the BETWEEN clause.
Example 1:
For instance, to determine the highest temperature from 1929 to 1935, you can use a wildcard to match the last two digits of the year.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
max != 9999.9 # filter out missing data
AND _TABLE_SUFFIX BETWEEN '29' AND '35'
ORDER BY
max DESC
Here:
- #standardSQL: Indicates the use of standard SQL dialect in BigQuery.
- SELECT clause: Retrieves maximum temperatures, converts them to Celsius, and selects date components (month, day, year).
- FROM clause: Targets all tables from the 1900s in the NOAA GSOD dataset that match the wildcard pattern.
- WHERE clause: Excludes entries with missing temperature data and limits the range of years from 1929 to 1935 using the table suffix.
- ORDER BY clause: Sorts results by descending maximum temperatures, showing the highest first.
This query efficiently pulls and processes temperature data across a specified range of years using table suffixes to streamline the selection.
Example 2:
To query a dynamic date range using the _TABLE_SUFFIX in BigQuery, you can utilize a combination of functions to filter tables based on the current date or a specific range of dates.
Here's an example that retrieves data for the last 7 days:
#standardSQL
SELECT
event_name,
COUNT(*) AS event_count,
event_date
FROM
`owox-analytics.analytics_123456789.events_*`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY))
AND FORMAT_TIMESTAMP('%Y%m%d', CURRENT_TIMESTAMP())
GROUP BY
event_date, event_name
ORDER BY
event_date DESC, event_count DESC
Here:
- #standardSQL: Indicates the use of standard SQL dialect in BigQuery.
- SELECT clause: Retrieves the event name, counts the occurrences, and returns the event date.
- FROM clause: Targets tables with a wildcard pattern that includes the date in the table name.
- WHERE clause: Filters tables to the last 7 days using the _TABLE_SUFFIX and the FORMAT_TIMESTAMP function.
- GROUP BY clause: Groups the results by the formatted event date and event name.
- ORDER BY clause: Sorts results by date in descending order, then by event count.
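When you build such queries programmatically, the same suffix bounds can be computed client-side. A minimal Python sketch (the helper name is ours) mirroring the FORMAT_TIMESTAMP('%Y%m%d', ...) logic above:

```python
from datetime import datetime, timedelta, timezone

def suffix_bounds(days_back, now=None):
    """Return (start, end) _TABLE_SUFFIX strings in YYYYMMDD form, mirroring
    FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL n DAY))."""
    now = now or datetime.now(timezone.utc)
    start = (now - timedelta(days=days_back)).strftime("%Y%m%d")
    return start, now.strftime("%Y%m%d")

# Fixed date for reproducibility; omit `now` to use the current UTC time.
start, end = suffix_bounds(7, now=datetime(2024, 8, 8, tzinfo=timezone.utc))
print(start, end)  # 20240801 20240808
```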
Delete a Date Range from Wildcard Tables
To delete data within a date range in wildcard tables, you must delete each table individually.
While wildcard tables allow you to access a range of tables using _TABLE_SUFFIX, it’s important to remember that each table in the range is still a separate entity.
Therefore, deleting data from a specific date range requires removing each corresponding table one by one.
Example:
For instance, suppose you need to delete event data from the analytics_123456789 dataset for the dates 20240801 through 20240803. The tables are named:
- events_20240801
- events_20240802
- events_20240803
To delete all these tables, you would need to issue a separate DROP TABLE command for each one.
DROP TABLE `owox-analytics.analytics_123456789.events_20240801`;
DROP TABLE `owox-analytics.analytics_123456789.events_20240802`;
DROP TABLE `owox-analytics.analytics_123456789.events_20240803`;
Here:
- Table Identification: Each table is individually identified by its unique suffix, corresponding to a specific date.
- Separate Deletion: Despite the wildcard query capability, each table must be deleted with its own DROP TABLE statement, as wildcard deletion is not supported.
- Manual Process: You must manually execute the deletion for each date within the specified range.
This ensures that the data for each date is completely removed, but it requires handling each table separately.
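When the range is long, generating the statements with a short script is less error-prone than writing them by hand. A minimal Python sketch (dataset path as in the example above; the helper name is ours):

```python
from datetime import date, timedelta

def drop_statements(dataset, prefix, start, end):
    """Emit one DROP TABLE statement per daily shard in [start, end]."""
    stmts = []
    d = start
    while d <= end:
        stmts.append(f"DROP TABLE `{dataset}.{prefix}{d.strftime('%Y%m%d')}`;")
        d += timedelta(days=1)
    return stmts

stmts = drop_statements("owox-analytics.analytics_123456789", "events_",
                        date(2024, 8, 1), date(2024, 8, 3))
for stmt in stmts:
    print(stmt)
```

Each generated statement can then be executed individually, since BigQuery does not accept a wildcard in a DROP TABLE target.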
Best Practices for Using Wildcard Tables
Using wildcard tables in BigQuery can significantly improve querying efficiency across multiple tables, but it's crucial to follow best practices to keep performance on track and avoid issues.
This means keeping your schemas consistent, limiting wildcard usage to when needed, and being aware of the limitations of wildcard queries in specific situations.
Enclose Table Names with Wildcards in Backticks
When using wildcard table names that include special characters like (*), enclosing the table name in backticks (`) is essential.
Example:
For example, the following query is valid because it properly uses backticks:
#standardSQL
SELECT
max
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # filter out missing data
AND _TABLE_SUFFIX = '1929'
ORDER BY
max DESC
Here:
- Backticks (`) around the table name: Used to enclose the table name containing the wildcard (*), which is necessary for BigQuery to recognize it correctly.
- SELECT max: Retrieves the maximum temperature value.
- FROM bigquery-public-data.noaa_gsod.gsod*: Targets all tables in the dataset that match the pattern gsod*, using a wildcard to include tables with similar names.
- WHERE max != 9999.9: Excludes rows where the maximum temperature is recorded as 9999.9, which denotes missing data.
- AND _TABLE_SUFFIX = '1929': Filters the query to only include tables with a suffix of 1929, focusing on that specific year.
- ORDER BY max DESC: Sorts the results by the maximum temperature in descending order, showing the highest value first.
Use the Most Granular Prefix Possible
When querying wildcard tables, using the most specific prefix available is crucial. More granular prefixes improve performance by limiting the number of tables that match the wildcard.
For instance, using FROM `bigquery-public-data.noaa_gsod.gsod19*` is more efficient than FROM `bigquery-public-data.noaa_gsod.*`, as it targets a smaller, more precise set of tables.
Avoid Excessive Wildcard Usage
Minimize the use of wildcard queries that span a large number of tables. Place the wildcard (*) after the longest common table prefix you can, and for large or frequently queried datasets consider partitioning and clustering strategies to optimize performance and reduce costs.
Ensure Schema Consistency
All tables matched by a wildcard query should have consistent schemas. Inconsistent schemas can lead to query failures and performance issues.
When executing a GoogleSQL query with a wildcard table, BigQuery automatically determines the schema based on the most recently created table that matches the wildcard. This schema is applied to the wildcard table, even if you limit the number of tables using the _TABLE_SUFFIX pseudo column in the WHERE clause.
If a column from the inferred schema is missing in one of the matched tables, BigQuery will return NULL for that column in the rows from the table where it is absent.
However, if the schema varies across the tables matched by the wildcard query – such as differences in column data types or missing columns that cannot be assumed to have NULL values – BigQuery will return an error.
Example:
Suppose you have a series of tables in the NOAA GSOD dataset, named gsod1929, gsod1930, gsod1931, and so on, representing yearly weather data from 1929 to 1935.
The most recently created table, gsod1935, includes an additional column called humidity, which is not present in the earlier tables.
SELECT
max,
min,
humidity
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX BETWEEN '29' AND '35'
Here:
- SELECT: Retrieves max, min, and humidity from each table.
- FROM: Queries all tables in the NOAA GSOD dataset starting with gsod19*, covering years 1929-1935.
- WHERE: Filters to include only tables with suffixes between '29' and '35' (1929-1935).
Scenario 1: Missing Column
- BigQuery applies the schema from gsod1935 (the most recent table) across all matched tables.
- Since gsod1929, gsod1930, and other earlier tables do not have the humidity column, BigQuery returns NULL for that column in rows from these tables.
Scenario 2: Inconsistent Schema
- If gsod1933 had the humidity column but with a different data type (e.g., STRING instead of FLOAT as in gsod1935), BigQuery would return an error because the schema is inconsistent across the matched tables.
- Similarly, if there’s a column in one of the earlier tables missing in the most recent schema, and it cannot be assumed to have a NULL value, an error would occur.
This example demonstrates the need to ensure consistent schemas across all tables when using wildcard queries in BigQuery, particularly when working with datasets like NOAA GSOD spanning multiple years.
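You can catch such conflicts before querying by comparing the table schemas yourself. A minimal Python sketch, representing each schema as a column-to-type dict (the helper name is ours); in practice, the schemas could come from INFORMATION_SCHEMA.COLUMNS:

```python
def schema_conflicts(schemas):
    """Given {table: {column: type}}, return columns with conflicting types."""
    seen = {}       # column -> first observed type
    conflicts = {}  # column -> set of differing types
    for table, columns in schemas.items():
        for col, typ in columns.items():
            if col in seen and seen[col] != typ:
                conflicts.setdefault(col, {seen[col]}).add(typ)
            seen.setdefault(col, typ)
    return conflicts

# Mirrors Scenario 2: humidity is STRING in one table, FLOAT in another.
schemas = {
    "gsod1933": {"max": "FLOAT", "humidity": "STRING"},
    "gsod1935": {"max": "FLOAT", "humidity": "FLOAT"},
}
conflicts = schema_conflicts(schemas)
print(conflicts)  # humidity appears with both STRING and FLOAT
```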
Why Switch to Partitioned Tables
Switching to partitioned tables enhances data management by improving query performance and reducing costs.
By organizing data into partitions, such as by date or region, only relevant data is scanned during queries, leading to faster execution and lower processing costs.
This approach also simplifies data maintenance, making it easier to manage and archive specific subsets of your dataset, especially as data volumes grow.
When querying a partitioned table, use the following columns to filter partitions:
- For ingestion-time partitioned tables, use the _PARTITIONTIME pseudocolumn.
- Use the respective partitioning column for time-unit, column-based, and integer-range partitioned tables. In time-unit partitioned tables, you can filter data by specifying a date or date range.
Example:
For instance, the WHERE clause below uses the _PARTITIONTIME pseudocolumn to filter partitions between January 1, 2016, and January 31, 2016.
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")
Here:
- _PARTITIONTIME: Filters data based on the partition time.
- BETWEEN: Limits results to the specified range.
- TIMESTAMP("2016-01-01") to TIMESTAMP("2016-01-31"): Filters partitions between January 1 and January 31, 2016.
This query processes only the data in the specified date range, improving query performance and reducing costs by limiting the amount of data scanned.
Limitations of Using Wildcard Tables
Wildcard tables in BigQuery offer great flexibility, but they come with a few important limitations you should be aware of before diving in.
As we explore these constraints, you'll see how they can affect your query performance, cost, and overall efficiency. Let's dive into the key limitations you'll need to keep in mind.
Unsupported Views
Wildcard table functionality does not support views. If your wildcard table pattern matches any view in the dataset, the query will return an error, even if you try to exclude the view using a WHERE clause with the _TABLE_SUFFIX pseudo-column.
Unsupported Cached Results
Cached results are not supported for queries that use wildcards across multiple tables, even if the "Use Cached Results" option is enabled. Each time you run the same wildcard query, you will be billed for the query.
Example:
Let's say you have several weather tables named gsod1929, gsod1930, and so on, each corresponding to a different year. You want to analyze weather data across all these tables using a wildcard table pattern in BigQuery.
The workaround is to use INFORMATION_SCHEMA to enumerate the matching tables and build a single deterministic query.
Here is an example for the weather dataset:
from google.cloud import bigquery

client = bigquery.Client()

# Specify the project, dataset, and wildcard pattern for the NOAA GSOD dataset
project_id = "bigquery-public-data"
dataset_id = "noaa_gsod"
wildcard_pattern = "gsod19"

# Query the INFORMATION_SCHEMA to get matching table names
query = f"""
SELECT table_name
FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{wildcard_pattern}%'
ORDER BY table_name
"""

# Execute the query to get the list of matching table names
rows = list(client.query(query))

# If no tables match the pattern, report and stop
if not rows:
    print("No tables found matching the pattern.")
else:
    # Initialize the combined query with the first table
    view_query = f"SELECT * FROM `{project_id}.{dataset_id}.{rows[0].table_name}`"

    # Loop through the remaining table names and add them using UNION ALL
    for row in rows[1:]:
        view_query += f"""
UNION ALL
SELECT * FROM `{project_id}.{dataset_id}.{row.table_name}`"""

    # view_query now combines all tables matching the pattern;
    # you can use it directly or create a view in BigQuery
    print("Generated Query:")
    print(view_query)

    # Example: execute the combined query (without creating a view)
    result = client.query(view_query)

    # Process the result (for example, printing the first few rows)
    for row in result:
        print(row)
Here:
- Project, Dataset, and Wildcard Pattern: The project_id is set to bigquery-public-data and the dataset_id to noaa_gsod, the public dataset for NOAA GSOD weather data. The wildcard_pattern gsod19 matches tables from the 1900s, such as gsod1929, gsod1930, etc.
- Querying INFORMATION_SCHEMA: The script queries the INFORMATION_SCHEMA.TABLES within the NOAA GSOD dataset to retrieve all tables that match the gsod19% pattern. The tables are ordered to ensure cache consistency.
- Building the Union All Query: The script constructs a query that unites the relevant tables using UNION ALL, ensuring that the data from all matching tables is included in the query.
- Cache Consideration: By querying tables in a consistent and ordered manner, BigQuery can better use caching mechanisms when running repeated queries, as the table order and structure remain the same.
Cannot Query External Tables
Wildcard tables in BigQuery are a powerful tool that allows you to query multiple tables that share a common naming pattern with a single query. However, it's important to note that wildcard tables are supported only for tables stored within BigQuery's native storage. This means you cannot use wildcard patterns to query external tables (e.g., tables linked to external data sources like Google Cloud Storage, Google Sheets, or Bigtable) or views.
Compatible with Identical Clustering Only
You cannot use wildcard queries on tables with different partitioning schemes or a mix of partitioned and non-partitioned tables. Additionally, the queried tables must have identical clustering specifications.
To work around this limitation, ensure that all tables share the same partitioning and clustering setup, ideally enforced when the tables are created, before running your wildcard query.
No Cluster Pruning for Unpartitioned Wildcard Tables
You can use wildcard queries with partitioned tables; both partition and cluster pruning are supported. However, if the tables are clustered but not partitioned, you won't benefit from cluster pruning when using wildcards.
To maximize wildcard query efficiency, apply both partitioning and clustering to your tables before running queries.
Once your tables are correctly partitioned and clustered, you can run a wildcard query that benefits from both partition and cluster pruning.
DML Statements Restrictions
Queries that include Data Manipulation Language (DML) statements cannot target a wildcard table as the query's destination.
To address the limitation that Data Manipulation Language (DML) statements (such as INSERT, UPDATE, DELETE, MERGE) cannot target a wildcard table as the query's destination in BigQuery, you can do the following:
- Direct your DML operations to specific tables rather than using a wildcard table.
- Use scripting to automate applying DML operations across multiple tables.
- If needed, aggregate data into a single table for more straightforward DML operations.
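The scripting workaround can be as simple as emitting one DML statement per shard. A minimal Python sketch (dataset, table names, and condition are illustrative; the helper name is ours):

```python
def dml_per_table(dataset, prefix, suffixes, condition):
    """Wildcard tables cannot be a DML target, so emit one DELETE per shard."""
    return [
        f"DELETE FROM `{dataset}.{prefix}{s}` WHERE {condition};"
        for s in suffixes
    ]

stmts = dml_per_table("owox-analytics.analytics_123456789", "events_",
                      ["20240801", "20240802"], "event_name = 'test_event'")
for s in stmts:
    print(s)
```

Each statement targets a single concrete table, which BigQuery accepts; the same pattern works for UPDATE and MERGE.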
JavaScript Filters Don't Limit Tables Scanned
Filters on the _TABLE_SUFFIX or _PARTITIONTIME pseudocolumns that include JavaScript user-defined functions do not reduce the number of tables scanned in a wildcard query.
To ensure that filters effectively reduce the number of tables scanned in a wildcard query, avoid using JavaScript user-defined functions (UDFs) in your filters on the _TABLE_SUFFIX or _PARTITIONTIME pseudo columns. Instead, use standard SQL functions and direct comparisons, which BigQuery can optimize to prune tables and partitions.
Here’s how to structure your query for optimal performance:
- Use Direct Comparisons: Filter tables or partitions using direct comparisons on _TABLE_SUFFIX or _PARTITIONTIME. Avoid using JavaScript UDFs in these filters.
- Use SQL Functions Instead of JavaScript UDFs: If you require transformation or calculation on _TABLE_SUFFIX or _PARTITIONTIME, use BigQuery's built-in SQL functions rather than JavaScript UDFs.
Customer-Managed Encryption Keys (CMEK)
Wildcard queries in BigQuery provide a powerful way to query multiple tables at once, but they are not compatible with tables that are protected by customer-managed encryption keys (CMEK). This limitation arises because CMEK adds an additional layer of security, requiring specific encryption and decryption processes that are not supported by the wildcard query mechanism.
As a result, if your datasets include CMEK-protected tables, you won't be able to leverage wildcard queries to query across them, potentially limiting your ability to perform comprehensive analyses or aggregations across multiple secured tables. It's important to plan your data architecture accordingly if you rely on CMEK for enhanced security.
Wildcard Queries Require Consistent Tags
When using wildcard queries in BigQuery, it's essential that all referenced tables have identical tag keys and values. This uniformity is necessary because wildcard queries treat the tables as a single logical entity, and any discrepancy in tags can lead to query failures. If the tables differ in their tags, the query will not execute, as BigQuery expects consistency across all tables involved in the wildcard operation.
To avoid this issue, ensure that all tables in the wildcard query are tagged with the same keys and values before running your query. You can standardize tags across your tables by manually updating them or by implementing automated processes that enforce consistent tagging during table creation or update stages.
Wildcard Queries Scan All Tables Matching the Prefix
When using wildcard tables in BigQuery, all tables in the dataset that match the prefix before the * in your query are scanned, even if you apply a regular expression to _TABLE_SUFFIX using REGEXP_CONTAINS.
This means that while the regular expression can filter the results that are ultimately returned by the query, it doesn't reduce the number of tables scanned.
As a result, even tables that don't match the regular expression are processed, potentially leading to unnecessary data scanning and increased costs.
To mitigate this issue, you can optimize your approach by carefully designing your table naming conventions to narrow down the initial set of tables included in the wildcard match. Instead of relying solely on REGEXP_CONTAINS, consider using a more specific prefix before the * to limit the number of tables that are scanned.
Additionally, you can structure your data into smaller, more targeted datasets or use partitioned tables to control and minimize the scope of data scanning. By strategically organizing your tables and refining your query patterns, you can reduce unnecessary data scans, improving query performance and cost-efficiency.
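The difference is easiest to see side by side. In this hypothetical example (dataset and table names are assumptions), both queries return counts for 2024 shards only, but they are billed very differently:

```sql
-- Scans EVERY table starting with 'events_', even shards from
-- other years; the regex only filters rows after the scan.
SELECT COUNT(*) AS events_2024
FROM `my_project.analytics.events_*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^2024');

-- Better: move the constant part into the prefix so only 2024
-- shards are matched, and therefore scanned, in the first place.
SELECT COUNT(*) AS events_2024
FROM `my_project.analytics.events_2024*`;
```

As a rule of thumb, put as much of the table name as possible before the `*` and keep _TABLE_SUFFIX filters to simple comparisons on what remains.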
Discover the Potential of Related BigQuery Functions
Explore BigQuery's powerful features for complex data analysis and efficient management of large datasets with these essential functions:
- Conditional Expressions: Conditional expressions like CASE and IF in SQL enable you to return different results based on specific conditions, allowing dynamic decision-making within queries.
- String Functions: String functions manipulate text data in SQL, handling tasks like trimming, concatenating, and extracting substrings for effective text management.
- Conversion Functions: Conversion functions change data types in SQL, ensuring data is in the correct format for analysis or processing, such as converting strings to numbers or dates.
- Array Functions: Array functions in SQL manage collections of elements within a field, allowing for manipulation and transformation of array data.
- Numbering Functions: Numbering functions like ROW_NUMBER() assign unique identifiers or ranks to rows, essential for pagination and ordering.
- Navigation Functions: Navigation functions like LEAD() and LAG() access data from adjacent rows, useful for calculations across sequences in SQL.
These functions boost BigQuery’s analytical power, simplifying the management, transformation, and analysis of large datasets, allowing you to efficiently extract meaningful insights.
Gain Advanced Insights with the OWOX BI BigQuery Reports Extension
Mastering the use of wildcard tables in BigQuery is crucial for optimizing your data querying processes and ensuring efficient data analysis. Wildcard tables allow you to query multiple tables with similar names in a single operation, streamlining your workflow and enhancing query performance.
For those looking to take their data analytics to the next level, consider using advanced tools like the OWOX Reports Extension for BigQuery and Sheets.
This tool integrates seamlessly with BigQuery, helping you create detailed reports and visualizations directly from your data.
By leveraging OWOX Reports, you can gain deeper insights, simplify complex data analysis tasks, and make more informed, data-driven decisions.
FAQ
How do wildcard tables in BigQuery differ from partitioned tables?
Wildcard tables allow you to query multiple tables with similar names using a single query, whereas partitioned tables organize data within a single table based on a specific column, such as date. Wildcard tables are useful for querying across multiple tables, while partitioned tables are more efficient for handling large datasets within a single table structure.
What are the limitations of using wildcard tables in BigQuery?
Wildcard tables cannot be used with views, external tables, or tables protected by customer-managed encryption keys (CMEK). Additionally, wildcard queries may scan more data than necessary if not used carefully, leading to higher costs. They also require consistent schemas across all referenced tables.
How can I create and manage wildcard tables in BigQuery?
To create wildcard tables, structure your table names with a common prefix followed by a variable suffix, such as a date. You can then use a wildcard (*) in your queries to reference these tables. Managing wildcard tables involves ensuring that the schemas are consistent across all tables and using filters like _TABLE_SUFFIX to query specific tables efficiently.
When should I use wildcard tables in BigQuery?
Wildcard tables are ideal when you need to query multiple tables with similar structures, such as daily logs or monthly reports. They are particularly useful when data is stored in separate tables over time, allowing you to run aggregated queries across multiple tables without manually specifying each one.
What best practices should I follow when using wildcard tables in BigQuery?
Use the most granular prefix possible to limit the number of tables scanned. Ensure that all tables have consistent schemas to avoid errors. Avoid excessive wildcard usage by considering partitioning and clustering strategies for more efficient data management. Finally, be mindful of the cost implications of scanning large datasets with wildcard queries.
How does the _TABLE_SUFFIX pseudocolumn work with wildcard tables?
The _TABLE_SUFFIX pseudocolumn represents the suffix of table names in wildcard queries. It allows you to filter and query specific tables within the wildcard range, such as selecting tables from certain dates or years. This helps narrow down the data scanned and improves query performance.