Are you looking to optimize your queries across multiple tables in Google BigQuery? Wildcard tables provide a powerful solution, allowing you to run queries over multiple tables using concise SQL patterns.
This guide introduces you to wildcard tables, demonstrating their utility in handling extensive datasets that span numerous time-partitioned tables.
By mastering this feature, you’ll enhance query efficiency and simplify your data analysis process. Whether you’re a data engineer or an analyst, learn how to tap into the full potential of wildcard tables to make your BigQuery tasks more effective and time-efficient.
Wildcard tables in BigQuery are groups of tables that share a common naming prefix but differ by suffix; the suffixes typically represent dates in the YYYYMMDD format.
Google BigQuery allows data professionals to query these tables collectively by substituting a wildcard character for part or all of the suffix.
Common examples of wildcard tables in data processing environments include date-sharded event exports such as events_20240801, events_20240802, and so on, and yearly tables such as gsod1929 through gsod1935.
These naming conventions allow for efficient querying and management of data spread across multiple tables related by time or type.
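For example, a date-sharded naming scheme can be sketched in Python (the events prefix and dates here are illustrative):

```python
from datetime import date, timedelta

def sharded_table_names(prefix: str, start: date, end: date) -> list[str]:
    """Generate date-sharded table names like prefix_YYYYMMDD."""
    names = []
    d = start
    while d <= end:
        names.append(f"{prefix}_{d.strftime('%Y%m%d')}")
        d += timedelta(days=1)
    return names

print(sharded_table_names("events", date(2024, 8, 1), date(2024, 8, 3)))
# → ['events_20240801', 'events_20240802', 'events_20240803']
```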
Wildcard tables are most useful for datasets made up of many similarly named tables with uniform schemas, such as data segmented by time (e.g., days, months, years). A single pattern lets you query all of these tables at once, enabling efficient analysis of large, sharded datasets.
For example, BigQuery's NOAA Global Surface Summary of the Day dataset includes annual tables from 1929 onwards.
Querying such data without wildcards means listing each table explicitly in the FROM clause, producing a long and repetitive query like the following.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) as celsius,
mo,
da,
year
FROM (
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1929`
UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1930`
UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1931`
# ... additional tables omitted for brevity
UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod1940`
)
WHERE
max != 9999.9 # code for missing data
ORDER BY
max DESC
Using wildcards simplifies this process, allowing for concise and manageable queries.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '29' AND '40'
ORDER BY
max DESC
This approach reduces the complexity of the query and makes the SQL code more concise and easier to maintain, especially when working with large datasets spanning multiple tables.
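As a sketch of what the wildcard plus suffix filter does, the matching logic can be emulated locally in Python (the table list is generated for illustration):

```python
# Emulate how BigQuery matches tables for `gsod19*`
# with _TABLE_SUFFIX BETWEEN '29' AND '40'.
tables = [f"gsod19{y:02d}" for y in range(29, 50)]  # gsod1929 .. gsod1949

prefix = "gsod19"
matched = [
    t for t in tables
    if t.startswith(prefix) and "29" <= t[len(prefix):] <= "40"
]

print(matched[0], matched[-1], len(matched))
# → gsod1929 gsod1940 12
```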
Wildcard Tables and Partitioned Tables offer distinct approaches for managing and optimizing data in large datasets. This comparison highlights their key differences in data structure, data access, and flexibility to help you choose the best fit for your data processing needs.
Wildcard tables allow for efficient data access by querying multiple tables with similar names using a common pattern, particularly when analyzing daily data additions.
Partitioned tables, on the other hand, offer optimized access not just by date but also by integer values, enabling more efficient queries and data manipulation across larger datasets.
Queries using wildcard tables can be more complex due to the need to construct dynamic queries, especially when using the _TABLE_SUFFIX for date-specific data access.
Partitioned tables simplify query construction, especially in Standard SQL, where the partitioning criterion is predefined.
Wildcard tables gain new shards automatically as new tables are created, making them suitable for daily data accumulation but potentially limiting for retrospective updates.
In contrast, partitioned tables require pre-defined partitions, which can hold up to 4000 partitions, with the partitioning field occupying space within the data, impacting storage and access efficiency.
Wildcard tables can be cost-effective for querying data by specific dates without needing an additional date field, but the performance may degrade if the queries become too complex.
Partitioned tables provide better performance for large datasets, allowing for cost-efficient queries that leverage the partitioning strategy, especially when the dataset is large and retrospective updates are frequent.
Wildcard tables are supported in Legacy SQL and Standard SQL, though the SQL queries may become more complex than flat tables.
Partitioned tables, however, are supported only in Standard SQL, making them incompatible with Legacy SQL, which limits their usage in systems reliant on the older query syntax.
Wildcard tables are ideal for accumulating daily data, especially when ease of shard creation and simple date-based access are priorities.
Partitioned tables are better suited for scenarios requiring optimized data access and flexibility in applying Data Manipulation Language (DML) operations across multiple partitions, such as in complex reporting schemes that involve retrospective data updates.
Wildcard tables offer simplicity in data management and are cost-effective for daily data accumulation. However, they can be limited in flexibility and performance for complex queries and updates.
Partitioned tables provide robust performance and flexibility for large datasets with complex access patterns, but they require more upfront setup and are only compatible with Standard SQL.
To create wildcard tables, simply create multiple tables with a common prefix followed by a variable suffix. BigQuery automatically recognizes this naming convention and treats these tables as a wildcard group.
Suppose you have weather data in the NOAA GSOD dataset from 1929 to 1935, with each year's data stored in a separate table: gsod1929, gsod1930, gsod1931, gsod1932, gsod1933, gsod1934, and gsod1935.
You can query them all through a wildcard table using the common prefix gsod19 and a wildcard * to represent the suffix, covering all these years.
SELECT
max,
min
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX BETWEEN '29' AND '35'
This setup allows you to efficiently query multiple tables at once without listing each table individually.
Using wildcard tables in queries enhances the ability to manage and analyze large datasets across multiple tables efficiently.
By employing wildcards, you can perform comprehensive analyses over a range of tables without specifying each one individually, significantly streamlining the querying process and reducing both complexity and execution time.
To query all tables within a dataset, you can utilize a wildcard in the FROM clause with an empty prefix.
This approach leverages the _TABLE_SUFFIX pseudo column, which captures the full table names.
Example 1:
For instance, to examine every table in the GSOD dataset, you would write:
SELECT * FROM `bigquery-public-data.noaa_gsod.*`
Example 2:
When the wildcard uses an empty prefix, the _TABLE_SUFFIX pseudo-column contains the complete table names.
For instance, the following query finds the highest temperature from 1929 to 1935, but it specifies full table names in the WHERE clause:
#standardSQL
SELECT
max,
ROUND((max-32)*5/9, 1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.*`
WHERE
max != 9999.9 # filter out missing data
AND _TABLE_SUFFIX BETWEEN 'gsod1929' AND 'gsod1935'
ORDER BY
max DESC
This query efficiently scans all tables from 1929 to 1935 in the NOAA GSOD dataset to find the maximum temperatures, simplifying access to a broad dataset.
The _TABLE_SUFFIX pseudo-column in BigQuery allows you to filter specific tables within a wildcard table by matching suffix patterns in the table names.
This is particularly useful when querying multiple tables with a common prefix, enabling efficient data retrieval from the desired subset.
Filtering by Specific Years (1940 and 1944):
For the wildcard gsod194*, the _TABLE_SUFFIX pseudo-column contains the values '0' through '9', representing tables gsod1940 through gsod1949. You can use these values in a WHERE clause to filter for specific tables within that range.
For instance, to find the maximum temperature in the years 1940 and 1944, filter _TABLE_SUFFIX for the values '0' and '4'.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9, 1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.gsod194*`
WHERE
max != 9999.9 # filter out missing data
AND (_TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4')
ORDER BY
max DESC
Using _TABLE_SUFFIX can significantly reduce the number of bytes scanned, lowering the cost of your queries.
Filtering with Dynamic Value:
However, filters on _TABLE_SUFFIX that involve conditions without constant expressions do not limit the number of tables scanned in a wildcard query.
For instance, the following query does not reduce the tables scanned for the wildcard bigquery-public-data.noaa_gsod.gsod19* because it uses a dynamic value from the table_id column:
#standardSQL
# Scans all tables that match the prefix `gsod19`
SELECT
ROUND((max-32)*5/9, 1) as celsius
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX = (
SELECT SUBSTR(MAX(table_name), LENGTH('gsod19') + 1)
FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'gsod194%'
)
This query scans all tables matching gsod19* due to the dynamic filter in the WHERE clause, leading to higher costs and processing time. For better efficiency, use constant expressions in _TABLE_SUFFIX filters.
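The recommended fix is to resolve the suffix first and then embed it as a constant. A local Python sketch of that two-step approach, with a hypothetical table list standing in for the INFORMATION_SCHEMA result:

```python
# Two-step sketch: resolve the latest suffix locally first, then embed it
# as a constant so BigQuery can prune tables.
table_names = ["gsod1940", "gsod1941", "gsod1945", "gsod1949"]  # hypothetical

prefix = "gsod19"
latest = max(t for t in table_names if t.startswith("gsod194"))
# Same as SUBSTR(MAX(table_name), LENGTH('gsod19') + 1) in the subquery above
suffix = latest[len(prefix):]

query = (
    "SELECT ROUND((max-32)*5/9, 1) AS celsius\n"
    f"FROM `bigquery-public-data.noaa_gsod.{prefix}*`\n"
    f"WHERE _TABLE_SUFFIX = '{suffix}'  -- constant, so only one table is scanned"
)
print(suffix)
# → 49
```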
To query a specific range of tables, leverage the _TABLE_SUFFIX pseudo-column in conjunction with the BETWEEN clause.
Example 1:
For instance, to determine the highest temperature from 1929 to 1935, you can use a wildcard to match the last two digits of the year.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) as celsius,
mo,
da,
year
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
max != 9999.9 # filter out missing data
AND _TABLE_SUFFIX BETWEEN '29' AND '35'
ORDER BY
max DESC
This query efficiently pulls and processes temperature data across a specified range of years using table suffixes to streamline the selection.
Example 2:
To query a dynamic date range using the _TABLE_SUFFIX in BigQuery, you can utilize a combination of functions to filter tables based on the current date or a specific range of dates.
Here's an example that retrieves data for the last 7 days:
#standardSQL
SELECT
event_name,
COUNT(*) AS event_count,
event_date
FROM
`owox-analytics.analytics_123456789.events_*`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY))
AND FORMAT_TIMESTAMP('%Y%m%d', CURRENT_TIMESTAMP())
GROUP BY
event_date, event_name
ORDER BY
event_date DESC, event_count DESC
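The suffix bounds in the query above are just formatted date strings; the equivalent computation can be sketched in Python (the reference date is arbitrary):

```python
from datetime import date, timedelta

def suffix_bounds(today: date, days: int) -> tuple[str, str]:
    """Compute _TABLE_SUFFIX bounds, mirroring
    FORMAT_TIMESTAMP('%Y%m%d', TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL days DAY))
    and FORMAT_TIMESTAMP('%Y%m%d', CURRENT_TIMESTAMP())."""
    start = today - timedelta(days=days)
    return start.strftime("%Y%m%d"), today.strftime("%Y%m%d")

print(suffix_bounds(date(2024, 8, 8), 7))
# → ('20240801', '20240808')
```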
To delete data within a date range in wildcard tables, you must delete each table individually.
While wildcard tables allow you to access a range of tables using _TABLE_SUFFIX, it’s important to remember that each table in the range is still a separate entity.
Therefore, deleting data from a specific date range requires removing each corresponding table one by one.
Example:
For instance, suppose you need to delete event data from the 123456789 dataset for the dates 20240801 to 20240803. The tables are named events_20240801, events_20240802, and events_20240803.
To delete all these tables, you would need to issue a separate DROP TABLE command for each one.
DROP TABLE `owox-analytics.analytics_123456789.events_20240801`;
DROP TABLE `owox-analytics.analytics_123456789.events_20240802`;
DROP TABLE `owox-analytics.analytics_123456789.events_20240803`;
This ensures that the data for each date is completely removed, but it requires handling each table separately.
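Since each shard must be dropped individually, the statements can be generated programmatically; a Python sketch, reusing the hypothetical dataset above:

```python
from datetime import date, timedelta

def drop_statements(dataset: str, prefix: str, start: date, end: date) -> list[str]:
    """Build one DROP TABLE statement per date-sharded table in [start, end]."""
    stmts = []
    d = start
    while d <= end:
        stmts.append(f"DROP TABLE `{dataset}.{prefix}{d.strftime('%Y%m%d')}`;")
        d += timedelta(days=1)
    return stmts

for stmt in drop_statements("owox-analytics.analytics_123456789", "events_",
                            date(2024, 8, 1), date(2024, 8, 3)):
    print(stmt)
```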
Using wildcard tables in BigQuery can significantly improve querying efficiency across multiple tables, but it's crucial to follow best practices to keep performance on track and avoid issues.
This means keeping your schemas consistent, limiting wildcard usage to when needed, and being aware of the limitations of wildcard queries in specific situations.
Because wildcard table names include the special character *, enclosing the table name in backticks (`) is essential.
Example:
For example, the following query is valid because it properly uses backticks:
#standardSQL
SELECT
max
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # filter out missing data
AND _TABLE_SUFFIX = '1929'
ORDER BY
max DESC
When querying wildcard tables, using the most specific prefix available is crucial. More granular prefixes improve performance by limiting the number of tables that match the wildcard.
For instance, using FROM `bigquery-public-data.noaa_gsod.gsod19*` is more efficient than FROM `bigquery-public-data.noaa_gsod.*`, as it targets a smaller, more precise set of tables.
Minimize the use of wildcard queries that span a large number of tables; place the wildcard (*) after the longest common table prefix you can. For data you control, consider partitioning and clustering strategies to manage it more efficiently and reduce costs.
All tables matched by a wildcard query should have consistent schemas. Inconsistent schemas can lead to query failures and performance issues.
When executing a GoogleSQL query with a wildcard table, BigQuery automatically determines the schema based on the most recently created table that matches the wildcard. This schema is applied to the wildcard table, even if you limit the number of tables using the _TABLE_SUFFIX pseudo column in the WHERE clause.
If a column from the inferred schema is missing in one of the matched tables, BigQuery will return NULL for that column in the rows from the table where it is absent.
However, if the schema varies across the tables matched by the wildcard query – such as differences in column data types or missing columns that cannot be assumed to have NULL values – BigQuery will return an error.
Example:
Suppose you have a series of tables in the NOAA GSOD dataset, named gsod1929, gsod1930, gsod1931, and so on, representing yearly weather data from 1929 to 1935.
The most recently created table, gsod1935, includes an additional column called humidity, which is not present in the earlier tables.
SELECT
max,
min,
humidity
FROM
`bigquery-public-data.noaa_gsod.gsod19*`
WHERE
_TABLE_SUFFIX BETWEEN '29' AND '35'
Scenario 1: Missing Column. The query runs, and humidity is returned as NULL for rows coming from the older tables (gsod1929 through gsod1934), since the inferred schema is taken from the most recently created table, gsod1935.
Scenario 2: Inconsistent Schema. If one of the older tables defined humidity with a different data type, or the schemas otherwise conflicted in a way that cannot be resolved with NULLs, the query would fail with an error.
This example demonstrates the need to ensure consistent schemas across all tables when using wildcard queries in BigQuery, particularly when working with datasets like NOAA GSOD spanning multiple years.
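BigQuery's NULL-filling behavior for a missing column can be sketched locally in Python (the rows and values are invented for illustration):

```python
# Sketch of BigQuery's NULL-filling behavior: the inferred (newest) schema
# includes `humidity`; rows from older tables missing it get NULL (None).
inferred_schema = ["max", "min", "humidity"]

gsod1929_row = {"max": 80.0, "min": 55.2}                        # no humidity column
gsod1935_row = {"max": 91.4, "min": 60.1, "humidity": 0.43}      # newest schema

def conform(row: dict, schema: list[str]) -> dict:
    """Project a row onto the inferred schema, filling missing columns with None."""
    return {col: row.get(col) for col in schema}

print(conform(gsod1929_row, inferred_schema))
# → {'max': 80.0, 'min': 55.2, 'humidity': None}
```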
Switching to partitioned tables enhances data management by improving query performance and reducing costs.
By organizing data into partitions, such as by date or region, only relevant data is scanned during queries, leading to faster execution and lower processing costs.
This approach also simplifies data maintenance, making it easier to manage and archive specific subsets of your dataset, especially as data volumes grow.
When querying a partitioned table, use the _PARTITIONTIME (or _PARTITIONDATE) pseudocolumn to filter partitions.
Example:
For instance, the WHERE clause below uses the _PARTITIONTIME pseudocolumn to filter partitions between January 1, 2016, and January 31, 2016.
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")
This query processes only the data in the specified date range, improving query performance and reducing costs by limiting the amount of data scanned.
Wildcard tables in BigQuery offer great flexibility, but they come with a few important limitations you should be aware of before diving in.
As we explore these constraints, you'll see how they can affect your query performance, cost, and overall efficiency. Let's dive into the key limitations you'll need to keep in mind.
Wildcard table functionality does not support views. If your wildcard table pattern matches any view in the dataset, the query will return an error, even if you try to exclude the view using a WHERE clause with the _TABLE_SUFFIX pseudo-column.
Cached results are not supported for queries that use wildcards across multiple tables, even if the "Use Cached Results" option is enabled. Each time you run the same wildcard query, you will be billed for the query.
Example:
Let's say you have several weather tables named gsod1929, gsod1930, and so on, each corresponding to a different year. You want to analyze weather data across all these tables using a wildcard table pattern in BigQuery.
Because wildcard query results cannot be cached, a workaround is to use INFORMATION_SCHEMA to generate an explicit query over the matching tables, whose results can be cached.
Here is an example of how to do this for the weather dataset.
from google.cloud import bigquery

client = bigquery.Client()

# Specify the project, dataset, and wildcard pattern for the NOAA GSOD dataset
project_id = "bigquery-public-data"
dataset_id = "noaa_gsod"
wildcard_pattern = "gsod19"

# Query the INFORMATION_SCHEMA to get matching table names
query = f"""
SELECT table_name
FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{wildcard_pattern}%'
ORDER BY table_name
"""

# Execute the query to get the list of matching table names
rows = list(client.query(query))

# If no tables match the pattern, report and stop
if not rows:
    print("No tables found matching the pattern.")
else:
    # Initialize the combined query with the first table
    view_query = f"SELECT * FROM `{project_id}.{dataset_id}.{rows[0].table_name}`"

    # Loop through the remaining table names and add them using UNION ALL
    for row in rows[1:]:
        view_query += (
            f"\nUNION ALL\nSELECT * FROM `{project_id}.{dataset_id}.{row.table_name}`"
        )

    # view_query now combines all tables matching the pattern;
    # you can use this query directly or create a view in BigQuery
    print("Generated Query:")
    print(view_query)

    # Example: execute the combined query (not creating a view)
    result = client.query(view_query)

    # Process the result (for example, printing the first few rows)
    for row in result:
        print(row)
Wildcard tables in BigQuery are a powerful tool that allows you to query multiple tables that share a common naming pattern with a single query. However, it's important to note that wildcard tables are supported only for tables stored within BigQuery's native storage. This means you cannot use wildcard patterns to query external tables (e.g., tables linked to external data sources like Google Cloud Storage, Google Sheets, or Bigtable) or views.
You cannot use wildcard queries on tables with different partitioning schemes or a mix of partitioned and non-partitioned tables. Additionally, the queried tables must have identical clustering specifications.
To work around this limitation, ensure that all tables share the same partitioning and clustering setup before you create them and before running your wildcard query.
You can use wildcard queries with partitioned tables; both partition and cluster pruning are supported. However, if the tables are clustered but not partitioned, you won't benefit from cluster pruning when using wildcards.
To maximize query efficiency with wildcard queries, it's recommended to use both partitioning and clustering on your tables.
Once your tables are correctly partitioned and clustered, you can run a wildcard query that benefits from both partition and cluster pruning.
Queries that include Data Manipulation Language (DML) statements cannot target a wildcard table as the query's destination.
To address the limitation that Data Manipulation Language (DML) statements (such as INSERT, UPDATE, DELETE, MERGE) cannot target a wildcard table as the query's destination in BigQuery, run the DML statement against each concrete table individually, or migrate the sharded tables to a single partitioned table, which supports DML across partitions.
Filters on the _TABLE_SUFFIX or _PARTITIONTIME pseudocolumns that include JavaScript user-defined functions do not reduce the number of tables scanned in a wildcard query.
To ensure that filters effectively reduce the number of tables scanned in a wildcard query, avoid using JavaScript user-defined functions (UDFs) in your filters on the _TABLE_SUFFIX or _PARTITIONTIME pseudo columns. Instead, use standard SQL functions and direct comparisons, which BigQuery can optimize to prune tables and partitions.
Here’s how to structure your query for optimal performance:
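For instance, a filter built only from standard SQL date functions (reusing the hypothetical events dataset from the earlier example) lets BigQuery evaluate the bounds up front and prune tables:

```sql
#standardSQL
SELECT
  event_name,
  COUNT(*) AS event_count
FROM
  `owox-analytics.analytics_123456789.events_*`
WHERE
  -- Standard SQL functions, no JavaScript UDFs: BigQuery can evaluate
  -- these bounds up front and scan only the matching tables.
  _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
  AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY
  event_name
```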
Wildcard queries in BigQuery provide a powerful way to query multiple tables at once, but they are not compatible with tables that are protected by customer-managed encryption keys (CMEK). This limitation arises because CMEK adds an additional layer of security, requiring specific encryption and decryption processes that are not supported by the wildcard query mechanism.
As a result, if your datasets include CMEK-protected tables, you won't be able to leverage wildcard queries to query across them, potentially limiting your ability to perform comprehensive analyses or aggregations across multiple secured tables. It's important to plan your data architecture accordingly if you rely on CMEK for enhanced security.
When using wildcard queries in BigQuery, it's essential that all referenced tables have identical tag keys and values. This uniformity is necessary because wildcard queries treat the tables as a single logical entity, and any discrepancy in tags can lead to query failures. If the tables differ in their tags, the query will not execute, as BigQuery expects consistency across all tables involved in the wildcard operation.
To avoid this issue, ensure that all tables in the wildcard query are tagged with the same keys and values before running your query. You can standardize tags across your tables by manually updating them or by implementing automated processes that enforce consistent tagging during table creation or update stages.
When using wildcard tables in BigQuery, all tables in the dataset that match the prefix before the * in your query are scanned, even if you apply a regular expression to _TABLE_SUFFIX using REGEXP_CONTAINS.
This means that while the regular expression can filter the results that are ultimately returned by the query, it doesn't reduce the number of tables scanned.
As a result, even tables that don't match the regular expression are processed, potentially leading to unnecessary data scanning and increased costs.
To mitigate this issue, you can optimize your approach by carefully designing your table naming conventions to narrow down the initial set of tables included in the wildcard match. Instead of relying solely on REGEXP_CONTAINS, consider using a more specific prefix before the * to limit the number of tables that are scanned.
Additionally, you can structure your data into smaller, more targeted datasets or use partitioned tables to control and minimize the scope of data scanning. By strategically organizing your tables and refining your query patterns, you can reduce unnecessary data scans, improving query performance and cost-efficiency.
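A local Python sketch of the difference (the shard list is illustrative): the prefix determines how many tables are scanned, while the regex only filters afterward.

```python
import re

# Every table matching the literal prefix before `*` is scanned;
# REGEXP_CONTAINS on _TABLE_SUFFIX only filters rows afterwards.
tables = [f"gsod{y}" for y in range(1929, 2000)]  # gsod1929 .. gsod1999

def scanned(prefix: str) -> list[str]:
    """Tables BigQuery would scan for the wildcard `<prefix>*`."""
    return [t for t in tables if t.startswith(prefix)]

# Broad prefix: 71 tables scanned, even though the regex (standing in for
# REGEXP_CONTAINS(_TABLE_SUFFIX, r'3[0-5]$')) keeps rows from only 6 of them.
broad = scanned("gsod19")
kept = [t for t in broad if re.search(r"3[0-5]$", t)]

# Narrower prefix: only 10 tables scanned before any filtering.
narrow = scanned("gsod193")

print(len(broad), len(kept), len(narrow))
# → 71 6 10
```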
Beyond wildcards, BigQuery offers a rich set of functions for complex data analysis and efficient management of large datasets.
These functions boost BigQuery's analytical power, simplifying the management, transformation, and analysis of large datasets so you can efficiently extract meaningful insights.
Mastering the use of wildcard tables in BigQuery is crucial for optimizing your data querying processes and ensuring efficient data analysis. Wildcard tables allow you to query multiple tables with similar names in a single operation, streamlining your workflow and enhancing query performance.
For those looking to take their data analytics to the next level, consider using advanced tools like the OWOX Reports Extension for Google Sheets.
This tool integrates seamlessly with BigQuery, helping you create detailed reports and visualizations directly from your data.
By leveraging OWOX Reports, you can gain deeper insights, simplify complex data analysis tasks, and make more informed, data-driven decisions.
Wildcard tables allow you to query multiple tables with similar names using a single query, whereas partitioned tables organize data within a single table based on a specific column, such as date. Wildcard tables are useful for querying across multiple tables, while partitioned tables are more efficient for handling large datasets within a single table structure.
Wildcard tables cannot be used with views, external tables, or tables protected by customer-managed encryption keys (CMEK). Additionally, wildcard queries may scan more data than necessary if not used carefully, leading to higher costs. They also require consistent schemas across all referenced tables.
To create wildcard tables, structure your table names with a common prefix followed by a variable suffix, such as a date. You can then use a wildcard (*) in your queries to reference these tables. Managing wildcard tables involves ensuring that the schemas are consistent across all tables and using filters like _TABLE_SUFFIX to query specific tables efficiently.
Wildcard tables are ideal when you need to query multiple tables with similar structures, such as daily logs or monthly reports. They are particularly useful when data is stored in separate tables over time, allowing you to run aggregated queries across multiple tables without manually specifying each one.
Use the most granular prefix possible to limit the number of tables scanned. Ensure that all tables have consistent schemas to avoid errors. Avoid excessive wildcard usage by considering partitioning and clustering strategies for more efficient data management. Finally, be mindful of the cost implications of scanning large datasets with wildcard queries.
The _TABLE_SUFFIX pseudocolumn represents the suffix of table names in wildcard queries. It allows you to filter and query specific tables within the wildcard range, such as selecting tables from certain dates or years. This helps narrow down the data scanned and improves query performance.