Wondering how to turn your data into decisions quickly and easily? BigQuery ML gives data analysts and business intelligence teams the ability to build, train, and deploy machine learning models directly within their data environment, streamlining the process from raw data to actionable insights - without the need for complex setups.
By integrating with Vertex AI, BigQuery ML improves predictive accuracy, simplifies data workflows, and helps organizations efficiently manage machine learning projects at scale.
BigQuery ML integrates machine learning directly into Google BigQuery, allowing users to create and deploy models without needing to move data to external services. This setup speeds up development by letting data analysts and machine learning practitioners work within the familiar BigQuery environment.
By eliminating data transfers to other Google Cloud AI services, BigQuery ML simplifies workflows, boosts productivity, and streamlines the process of developing and deploying machine learning models.
BigQuery ML simplifies machine learning by enabling data analysts to build models inside Google BigQuery using SQL, working directly on large datasets where they already live.
BigQuery ML supports various model types for different use cases, including linear and logistic regression, time series forecasting (ARIMA), k-means clustering, matrix factorization, and PCA, as well as remote models trained through Vertex AI.
A model in BigQuery ML represents learned patterns from training data. It analyzes data, identifies trends, and makes predictions or classifications based on historical data to inform future outcomes.
Prerequisites for BigQuery ML:
#1: GCP Account: Sign up and create a project.
#2: BigQuery API: Enable the BigQuery API in your project via the API Library.
Additionally, ensure that billing is active for your project and that you have the necessary roles – such as BigQuery Data Editor and BigQuery Job User – to create and manage models effectively.
These roles provide the permissions required to handle data and execute machine learning jobs within BigQuery.
In the BigQuery console, start by selecting or creating a dataset to hold your data and machine learning models.
If you need to create a new one, click your project name and choose the "Create dataset" option.
For this article, we'll use the pre-cleaned and processed 'Google Analytics Sample' dataset from the BigQuery Public Datasets, adding it to our project for convenience.
To create a logistic regression model in BigQuery, use the CREATE MODEL statement.
Example:
This example demonstrates how to define and train a model to predict whether a transaction occurred based on user session data from the Google Analytics sample dataset.
CREATE OR REPLACE MODEL `bqml.new_sample_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['purchase_made']  -- tell BigQuery ML which column is the label
) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS purchase_made,  -- Target variable (binary classification)
  IFNULL(device.browser, "") AS browser,                   -- Input feature: Browser type
  device.deviceCategory AS device_category,                -- Input feature: Device category (mobile, desktop, tablet)
  IFNULL(geoNetwork.region, "") AS region,                 -- Input feature: Region
  IFNULL(totals.hits, 0) AS total_hits                     -- Input feature: Total hits count
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170101' AND '20171231'
LIMIT 150000;
Here:
- CREATE OR REPLACE MODEL defines the model bqml.new_sample_model, and model_type='logistic_reg' selects binary classification.
- purchase_made is the target label, derived from whether totals.transactions is present in the session.
- browser, device_category, region, and total_hits are the input features.
- The WHERE clause limits training data to sessions from 2017, and LIMIT caps the sample at 150,000 rows.
To train and deploy a BigQuery ML model, first prepare and load your dataset. Then, use SQL to create and train the model. Once trained, evaluate its performance and deploy it for predictions.
Before training your model, ensure your dataset is clean and organized. Create a BigQuery dataset to store your data and models, load the relevant data (e.g., Google Analytics 4 exports), and make sure the dataset's data location (US/EU) meets your compliance requirements.
After loading the data, you can perform exploratory data analysis to gain insights or visualize it using Google Data Studio to better understand it. This prepares your dataset for the next steps in model training.
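If you prefer SQL over the console, a dataset can also be created with a DDL statement. A minimal sketch, assuming a project named my-project and a dataset named bqml (both hypothetical placeholders):
-- Create a dataset (schema) to hold training data and models,
-- pinning the data location for compliance.
CREATE SCHEMA IF NOT EXISTS `my-project.bqml`
OPTIONS (
  location = 'US',
  description = 'Data and models for BigQuery ML experiments'
);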
In BigQuery ML, you can enhance model training using the TRANSFORM clause for feature engineering and data preprocessing. Instead of relying solely on raw data, the model is trained on features transformed directly in SQL: the TRANSFORM clause is part of the CREATE MODEL statement and defines those transformations inline.
This enables both feature selection and feature engineering, providing the model with more relevant and refined inputs. The TRANSFORM clause helps data analysts and BI professionals ensure that their models use the most meaningful data, leading to more accurate predictions and analysis.
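As a minimal sketch based on the logistic regression example above (the bucket boundaries and feature choices are illustrative assumptions, not recommendations):
CREATE OR REPLACE MODEL `bqml.sample_model_transformed`
TRANSFORM (
  -- Bucketize total hits into coarse engagement levels (boundaries are arbitrary)
  ML.BUCKETIZE(total_hits, [5.0, 20.0, 50.0]) AS hits_bucket,
  -- Pass the remaining features and the label through unchanged
  browser,
  device_category,
  purchase_made
)
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['purchase_made']
) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS purchase_made,
  IFNULL(device.browser, "") AS browser,
  device.deviceCategory AS device_category,
  IFNULL(totals.hits, 0) AS total_hits
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170101' AND '20171231';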
To train a model in BigQuery ML, start by defining it with a standard SQL query. First, specify the model's name, then define the features and any necessary data transformations using the TRANSFORM clause.
Next, select the model type and set parameters, including the target column. Finally, execute an SQL query to select the input data from a table.
Model training time varies depending on complexity, from minutes for simple models to hours for more advanced models.
Track progress in the BigQuery UI's TRAINING tab.
BigQuery automatically calculates performance metrics based on the model type, such as precision, recall, accuracy, F1 score, log loss, or ROC AUC. You can view these metrics in the EVALUATION tab within the BigQuery console, or access them via SQL queries like ML.EVALUATE.
These metrics help assess model performance. If the results are unsatisfactory, you can return to the model training step to fine-tune and improve performance.
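For example, the logistic regression model created earlier could be evaluated against a holdout period along these lines (a sketch; the 2016 date range is an assumption chosen to fall outside the training window):
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml.new_sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS purchase_made,
      IFNULL(device.browser, "") AS browser,
      device.deviceCategory AS device_category,
      IFNULL(geoNetwork.region, "") AS region,
      IFNULL(totals.hits, 0) AS total_hits
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20160801' AND '20161031'
  ));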
Once you're satisfied with your model's performance, it's time to use it for predictions on new, unseen data. In BigQuery, prepare a dataset with new data that wasn’t used during training, such as using a date range outside the training period.
For example, predict the probability of the addedToCart event using data from December 31, 2020. Alternatively, you can export the model for deployment in production systems to generate online predictions.
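With the sample logistic regression model from above, a prediction query could look like this (a sketch; the date range is an assumption standing in for "new" data):
SELECT
  predicted_purchase_made,        -- predicted class (0 or 1)
  predicted_purchase_made_probs   -- per-class probabilities
FROM
  ML.PREDICT(MODEL `bqml.new_sample_model`, (
    SELECT
      IFNULL(device.browser, "") AS browser,
      device.deviceCategory AS device_category,
      IFNULL(geoNetwork.region, "") AS region,
      IFNULL(totals.hits, 0) AS total_hits
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20161101' AND '20161231'
  ))
LIMIT 10;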
BigQuery ML offers two model types: built-in and external. Built-in models, like linear regression and K-means, are trained within BigQuery, while external models (e.g., deep neural networks and AutoML) are trained using Vertex AI.
Pricing depends on the model type and usage, either through editions or on-demand. Prediction and evaluation functions are executed within BigQuery, with costs based on the model type and resources used.
BigQuery ML's Enterprise and Enterprise Plus Editions offer a slot-based pricing model on dedicated resources, ideal for customers who want predictable costs compared to the on-demand model. Pricing and resource management differ for built-in and external models.
Built-in Models
Built-in models, such as linear regression or K-means, use BigQuery reservations that share resources with analytical queries, ensuring efficient slot utilization for both machine learning and data analysis tasks.
External Models
External models, trained outside BigQuery, require ML_EXTERNAL reservations to avoid resource conflicts. This ensures external workloads do not affect other processes, maintaining cost parity with BigQuery slots and external Google Cloud services.
BigQuery ML on-demand pricing is determined by the type of operation, including model creation, evaluation, inspection, and prediction. For iterative models, the CREATE MODEL statement is limited to 50 iterations, applying to both on-demand and edition pricing.
For time series models with auto-arima enabled, multiple candidate models are evaluated during training. The number of models can be controlled by the AUTO_ARIMA_MAX_ORDER setting. This impacts model creation costs, but for evaluation, inspection, and prediction, only the selected model is used, following standard query pricing.
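As an illustrative sketch, the limit is simply set as a CREATE MODEL option (the table demo.daily_ridership, the column names, and the value 3 are all assumptions):
-- Cap the auto.ARIMA search space to reduce model creation cost.
CREATE OR REPLACE MODEL `demo.ridership_forecast_capped`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'service_date',
  time_series_data_col = 'total_rides',
  auto_arima_max_order = 3
) AS
SELECT
  service_date,
  total_rides
FROM
  `demo.daily_ridership`;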
BigQuery ML allows customers to create remote models targeting Vertex AI, Cloud AI APIs (e.g., Vision, NLP, Translate), or prediction endpoints. Pricing includes charges for BigQuery processing and additional costs for remote endpoints, such as text generation, translation, or image annotation.
For supervised tuning with Vertex AI LLMs, costs include BigQuery data processing, but GPU/TPU usage is billed separately. Use billing labels like billing_service = 'bigquery_ml' to track these charges across services.
For certain BigQuery ML model types, due to the complexity of their underlying algorithms, the bytes processed for billing may not be calculated until after the training process is complete.
This is because estimating the initial resource usage is more complex for these models, resulting in billing adjustments being made once training has finished.
When using BigQuery ML, it's essential to optimize costs while maintaining performance. This can be achieved by selecting the appropriate pricing model, leveraging built-in features like dry runs for accurate cost estimation, managing model training efficiently, and using editions for predictable pricing.
Opt for BigQuery ML Enterprise or Enterprise Plus Editions for predictable, capacity-based pricing instead of on-demand pricing. These editions allocate compute slots, avoiding unexpected costs during frequent, large-scale model training.
Built-in models use shared BigQuery reservations, while external models require ML_EXTERNAL reservations, ensuring efficient resource management and cost parity with Google Cloud services.
Use dry runs to estimate data processed and potential costs before executing queries or training models. This allows you to identify costly mistakes in advance, manage resources efficiently, and optimize BigQuery ML usage without incurring unexpected expenses.
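For instance, the bq command-line tool can report the bytes a query would process without actually running it (a sketch; substitute the query you plan to run):
bq query --use_legacy_sql=false --dry_run \
  'SELECT IF(totals.transactions IS NULL, 0, 1) AS purchase_made,
          IFNULL(totals.hits, 0) AS total_hits
   FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
   WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20171231"'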
Register your models in the Vertex AI Model Registry to monitor, compare, and manage versions efficiently. This prevents unnecessary retraining, reduces additional costs, and ensures better control over model lifecycle management within BigQuery ML.
Vertex AI is Google’s platform for developing, deploying, and managing AI/ML models. It integrates with BigQuery to prepare training data, manage features in the Vertex AI Feature Store, and support various training methods like AutoML and custom training. Models trained in BigQuery ML can be registered in Vertex AI for easier management and deployment.
Benefits of Managing BigQuery ML Models in Vertex AI
You can register BigQuery ML models with the Vertex AI Model Registry to manage them alongside Vertex AI models without exporting. This allows you to version, evaluate, and deploy models for online predictions through a unified interface, eliminating the need for a separate serving container or complex setup.
Before registering BigQuery ML models in the Vertex AI Model Registry, you must enable the Vertex AI API in your Google Cloud project. This can be done through the APIs & Services page in the Google Cloud Console, or by running the following Google Cloud CLI command:
gcloud --project PROJECT_ID services enable aiplatform.googleapis.com
To register BigQuery ML models in the Vertex AI Model Registry, the service or user account must have the Vertex AI Administrator role (roles/aiplatform.admin). You can grant this role using the Google Cloud CLI.
For a service account, use the following command:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
  --role=roles/aiplatform.admin --condition=None
For a user account, use the following command:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=user:USER_ACCOUNT \
  --role=roles/aiplatform.admin \
  --condition=None
When creating a BigQuery ML model, use the MODEL_REGISTRY option in the CREATE MODEL statement to register it in the Vertex AI Model Registry. You can specify model IDs and version aliases with the VERTEX_AI_MODEL_ID and VERTEX_AI_MODEL_VERSION_ALIASES options. Once registered, models can be deployed, managed, and monitored, with Vertex Explainable AI offering feature-based insights.
Example:
We'll create a BigQuery time series forecasting model and register it to the Vertex AI Model Registry during its creation.
Here's a step-by-step guide:
Step 1: Import Time Series Data
First, upload the dataset to BigQuery and create a table. You can also use an existing dataset.
Step 2: Create a Time Series Model
Using BigQuery ML syntax, we will create an ARIMA time series model:
-- The model gets its own name so it does not collide with the source table.
CREATE OR REPLACE MODEL `demo.cta_ridership_forecast`
OPTIONS (
  MODEL_TYPE = 'ARIMA',
  TIME_SERIES_TIMESTAMP_COL = 'service_date',
  TIME_SERIES_DATA_COL = 'total_rides',
  HOLIDAY_REGION = 'us',
  MODEL_REGISTRY = 'vertex_ai',
  VERTEX_AI_MODEL_ID = 'cta_ridership_forecast'
) AS
SELECT
  service_date,
  total_rides
FROM
  `demo.transportation_revenue_insights`;
Here:
- MODEL_TYPE='ARIMA' selects a time series forecasting model.
- TIME_SERIES_TIMESTAMP_COL and TIME_SERIES_DATA_COL identify the date column (service_date) and the value to forecast (total_rides).
- HOLIDAY_REGION='us' lets the model account for US holiday effects.
- MODEL_REGISTRY='vertex_ai' registers the model in the Vertex AI Model Registry at creation time, and VERTEX_AI_MODEL_ID sets its registry ID to cta_ridership_forecast.
Step 3: Evaluate the Model
Once the model is trained, evaluate it with this query:
SELECT
  *
FROM
  ML.EVALUATE(MODEL `demo.cta_ridership_forecast`);
This returns model parameters, evaluation metrics, and the best ARIMA model based on AIC.
Step 4: Forecast with the Model
Use the ML.FORECAST function to forecast ridership for the next 7 days:
SELECT
  *
FROM
  ML.FORECAST(MODEL `demo.cta_ridership_forecast`,
              STRUCT(7 AS horizon));
This query generates a 7-day forecast along with confidence intervals.
By following these steps, you not only create and evaluate a time series model but also register it in Vertex AI for easy management and deployment.
To register an existing BigQuery ML model in the Vertex AI Model Registry, use the ALTER MODEL statement.
ALTER MODEL [IF EXISTS] <model_name>
SET OPTIONS (vertex_ai_model_id = "VERTEX_AI_MODEL_ID");
Once your time series forecasting model is registered in the Vertex AI Model Registry, you can deploy it for predictions on new time series data, monitor the model’s performance and predictions, and use Vertex Explainable AI to gain insights into the key factors influencing the forecasts. This ensures better understanding and control over the model’s outputs.
When working with BigQuery ML models registered in Vertex AI, you can assign a model ID and aliases to simplify management and versioning.
Specify a Vertex AI Model ID
When creating a model, use the VERTEX_AI_MODEL_ID option to assign a model ID. Ensure it's lowercase. If unspecified, the BigQuery ML model ID will be used. To update, you must delete and recreate the model.
Specify a Vertex AI Model Alias
Use VERTEX_AI_MODEL_VERSION_ALIASES to create model aliases. These allow easy reference to specific model versions without needing the exact version ID.
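A brief sketch of both options in a CREATE MODEL statement (the model ID and alias values are assumptions for illustration):
CREATE OR REPLACE MODEL `bqml.new_sample_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['purchase_made'],
  model_registry = 'vertex_ai',
  vertex_ai_model_id = 'sample_purchase_model',   -- must be lowercase
  vertex_ai_model_version_aliases = ['staging']   -- illustrative alias
) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS purchase_made,
  IFNULL(device.browser, "") AS browser,
  IFNULL(totals.hits, 0) AS total_hits
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20170101' AND '20171231';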
BigQuery ML streamlines model building by integrating with BigQuery, while traditional ML requires more complex workflows and external tools.
Here is a comparison of the two:
- Environment: BigQuery ML works entirely inside BigQuery using SQL, while traditional ML relies on external tools and frameworks.
- Data movement: BigQuery ML trains models where the data lives; traditional workflows require extracting and transferring data first.
- Skills: BigQuery ML builds on SQL skills analysts already have; traditional ML typically requires programming and ML framework expertise.
- Deployment: BigQuery ML models can serve predictions with SQL or be registered in Vertex AI; traditional ML usually needs separate serving infrastructure.
By simplifying model creation and deployment within a familiar environment, BigQuery ML offers a more efficient, accessible approach to machine learning compared to traditional workflows.
The OWOX Reports Extension for Google Sheets empowers users to transform raw data into actionable insights with ease. By connecting Google Sheets directly to your data sources, such as Google BigQuery and Google Analytics, it enables users to automate report generation, streamline data analysis, and eliminate the need for manual data imports.
With intuitive dashboards, charts, and reports, business users and data analysts can track key metrics, uncover trends, and make data-driven decisions without leaving Google Sheets. The extension simplifies data management, making it accessible to non-technical users while providing in-depth insights.
BigQuery ML allows users to build and deploy machine learning models directly within Google BigQuery using SQL. It simplifies the machine learning process by letting data analysts and engineers create models on large datasets without moving data, leveraging SQL knowledge to perform predictive analytics.
With BigQuery ML, you can create a variety of models including linear and logistic regression, time series (ARIMA), k-means clustering, matrix factorization for recommendations, PCA for dimensionality reduction, and custom models through external training with Vertex AI for more advanced use cases.
Yes, BigQuery ML integrates with Vertex AI for advanced machine learning tasks. You can register BigQuery ML models in the Vertex AI Model Registry to use features like online predictions, model monitoring, automated retraining, and MLOps capabilities, enhancing the scalability and management of machine learning workflows.
Built-in models are trained within BigQuery using SQL, like linear regression or ARIMA, while external models are trained outside BigQuery using services like Vertex AI. External models offer more customization and advanced capabilities, but require more setup and integration with other Google Cloud services.
BigQuery ML offers two pricing models: on-demand (based on bytes processed) and editions (slot-based). Model creation, evaluation, and prediction incur costs based on the data processed and model complexity. External models have additional costs for resources like Vertex AI, with flexible options based on usage.
To reduce costs in BigQuery ML, consider using dry runs to estimate data processed, selecting appropriate pricing models (on-demand or editions), using built-in models when possible, and optimizing queries. Additionally, leveraging Vertex AI's features like model monitoring and retraining can help manage resources efficiently.