BigQuery, Google's robust data warehouse, changes the game in data analysis and reporting with its speed and scalability. This guide breaks down BigQuery's essential numbering functions, making complex data tasks more manageable and insightful.
These functions fall under the category of window functions in SQL, which allow you to perform calculations across sets of rows that are related to the current row in a way that goes beyond the capabilities of standard SQL functions. Whether you're looking to order your data, rank items without gaps, or assign unique identifiers to each row, BigQuery's numbering functions offer a versatile toolkit.
Numbering functions like ROW_NUMBER(), RANK(), and DENSE_RANK() are central to advanced data analysis in BigQuery. They assign identifiers to rows based on specific criteria, supporting tasks such as ranking and segmentation, which are essential for organizing and extracting insights from large datasets efficiently.
Understanding the syntax and application of these functions is important for efficient query design, whether you are sorting data or performing windowed analysis.
For example, ROW_NUMBER() assigns a unique number to each row based on the order specified in the OVER() clause. We can grasp how these functions transform data analysis by examining real-world scenarios, such as ranking sales data or analyzing time series.
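As a minimal sketch of this behavior, using inline literal values rather than a real table:

```sql
SELECT name, ROW_NUMBER() OVER (ORDER BY name) AS rn
FROM UNNEST(['Ann', 'Bob', 'Cy']) AS name;
-- Ann → 1, Bob → 2, Cy → 3
```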
CUME_DIST calculates the cumulative distribution of a value within a group of values, essentially showing the relative standing of a specific value. The formula is the number of rows with values less than or equal to the current row's value divided by the total number of rows. As a result, CUME_DIST returns a value greater than 0 and at most 1.
It's beneficial for understanding the position or percentile rank of data points within a dataset, aiding in statistical analysis and decision-making processes. If the partition clause is not specified, the function treats the entire result set as a single partition, calculating the distribution across all rows.
CUME_DIST() OVER (
[PARTITION BY partition_expression, ...]
ORDER BY sort_expression [ASC|DESC], ...
)
Consider a dataset of exam scores for a class of students, where you want to find out the percentile ranking of each score.
The SQL query might look like this:
SELECT score, CUME_DIST() OVER (ORDER BY score) AS percentile_rank
FROM student_grades;
Here:
The output will consist of two columns: score and percentile_rank. The score column lists the exam scores, and percentile_rank shows the cumulative percentage of students with scores less than or equal to each score. The values in percentile_rank will vary depending on the distribution of exam scores in the dataset.
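For a concrete sketch, consider four inline scores (illustrative literal values, not a real table):

```sql
SELECT score, CUME_DIST() OVER (ORDER BY score) AS percentile_rank
FROM UNNEST([70, 80, 80, 90]) AS score;
-- 70 → 0.25, 80 → 0.75, 80 → 0.75, 90 → 1.0
```

One row of four is at or below 70 (1/4 = 0.25), three are at or below 80 (3/4 = 0.75), and all four are at or below 90 (4/4 = 1.0).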
The DENSE_RANK function is particularly useful when you need to understand the relative standing of items in a dataset, such as products in sales data or students in exam scores. DENSE_RANK works like RANK, except that it leaves no gaps after ties.
All rows with the same value receive the same rank, and the next distinct value gets a rank that is higher by exactly one. This produces a continuous sequence of ranks regardless of ties, which keeps rankings easy to read and supports accurate, data-driven decisions when analyzing sales data or exam scores.
DENSE_RANK() OVER (
[PARTITION BY column_name(s)]
ORDER BY column_name(s)
)
Imagine a scenario where we want to rank sales employees based on their total sales, without skipping ranks for ties.
The SQL query might look like this:
SELECT employee_name, total_sales,
DENSE_RANK() OVER (ORDER BY total_sales DESC) as sales_rank
FROM sales_records
ORDER BY total_sales DESC;
Here:
The output of the given SQL query will show a list of sales employees ranked by their total sales in descending order, using the DENSE_RANK() function. This function assigns a unique rank to each total sales amount, without skipping ranks for ties.
The NTILE function divides a sorted dataset into a specified number of groups of roughly equal size. For instance, NTILE(4) will create four groups representing quartiles.
Using NTILE helps in evenly distributing data, which is important for tasks like segmenting customers into groups based on spending or categorizing students based on grades. It offers a clear, ordered partitioning of data, which simplifies the analysis and helps in identifying trends or anomalies efficiently.
NTILE(number_of_buckets) OVER ([PARTITION BY partition_expression] ORDER BY sort_expression)
Suppose a teacher wants to divide a class of students into 4 performance groups based on their scores in order to allocate different levels of additional support and resources.
SELECT
student_name,
score,
NTILE(4) OVER (ORDER BY score DESC) AS performance_group
FROM
students;
Here:
The query generates a list of students along with their scores and assigns them to one of four performance groups. Higher scores will be in Group 1, indicating top performers, and lower scores will be in Group 4, indicating those who may need the most additional support.
PERCENT_RANK is a window function that assigns a relative rank to each row in a dataset as a percentage from 0 to 1, with 0 representing the first row and 1 the last. It's used to understand the standing of a particular value within a dataset, such as determining a product's sales performance compared to others.
This function is beneficial for identifying outliers, evaluating relative performance, and conducting percentile-based analysis.
PERCENT_RANK() OVER ([PARTITION BY partition_expression] ORDER BY sort_expression)
Assume we have sales data for different sales agents and want to find out the percentage rank of their sales amounts.
The SQL query will look like this:
SELECT sales_agent, sales_amount, PERCENT_RANK() OVER (ORDER BY sales_amount DESC) AS sales_percent_rank
FROM sales_agent_data;
Here:
The output would display each sales agent's name, their sales amount, and their percentage rank, where the percentage rank is calculated based on their sales amount compared to other agents. The ranking is done in descending order of sales amounts, with the highest sales figure getting the lowest percentage rank.
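To see the formula (rank − 1) / (rows − 1) at work, here is a sketch with inline literal amounts:

```sql
SELECT amount, PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank
FROM UNNEST([100, 200, 200, 300]) AS amount;
-- 100 → 0.0, 200 → 0.333…, 200 → 0.333…, 300 → 1.0
```

The two tied values share rank 2, giving (2 − 1) / (4 − 1) = 1/3 for both.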
The RANK function assigns a rank to each row in a dataset based on the specified ordering, with equal values getting the same rank and creating gaps for subsequent ranks. It is used extensively in leaderboard creation, sales performance analysis, and anywhere relative standings are needed.
The benefit of using RANK is its ability to show the relative position of items within a set, accommodating ties naturally. This is important for accurately reflecting performance or standings in competitive contexts, enabling clear comparisons and facilitating strategic decision-making.
RANK() OVER ([PARTITION BY partition_expression] ORDER BY sort_expression)
Suppose we have a sales dataset and want to rank salespersons based on their sales amounts.
The SQL query will look like this:
SELECT salesperson, sales_amount, RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank
FROM salesData
ORDER BY sales_rank;
Here:
The output of the query will list the salesperson names, their sales_amount, and their sales_rank, where the rank is determined by the sales_amount in descending order. Salespersons with the same sales amount will have the same rank, and the ranking will skip the next number(s) accordingly to reflect ties.
For example, if two salespersons are tied for second place, the next rank will be fourth, not third. This method effectively illustrates the hierarchical structure of sales performance.
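The tie behavior can be sketched side by side with DENSE_RANK using inline literal data (names and amounts here are illustrative):

```sql
WITH sales AS (
  SELECT * FROM UNNEST([
    STRUCT('Ana' AS salesperson, 300 AS sales_amount),
    ('Ben', 200),
    ('Cal', 200),
    ('Dee', 100)
  ])
)
SELECT salesperson, sales_amount,
  RANK() OVER (ORDER BY sales_amount DESC) AS sales_rank,
  DENSE_RANK() OVER (ORDER BY sales_amount DESC) AS dense_sales_rank
FROM sales;
-- sales_rank: 1, 2, 2, 4   dense_sales_rank: 1, 2, 2, 3
```

Ben and Cal tie at 200, so RANK jumps from 2 to 4 while DENSE_RANK continues with 3.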
ROW_NUMBER provides a straightforward method to assign a unique identifier to each row in a result set, based on the order specified. This function is incredibly useful for tasks that require row-level differentiation, such as generating rankings or identifying the sequence of events.
It is often used in data analysis for creating top-N or bottom-N reports, pagination to display results, or simply to count rows within partitions.
ROW_NUMBER() OVER ([PARTITION BY partition_expression] ORDER BY sort_expression)
Suppose we are generating a list of individuals from the people table, sorted by their age. Each person is assigned a unique rank based on their age, with the youngest person receiving a rank of 1. This ranking helps in quickly identifying the order of individuals from youngest to oldest, useful for scenarios where age-related prioritization is needed, such as targeting specific age groups for marketing or organizing age-appropriate activities.
The SQL query will look like this:
SELECT
name,
age,
ROW_NUMBER() OVER (ORDER BY age) AS age_rank
FROM
people;
Here:
The query ranks each person by age from youngest to oldest in the entire dataset. It assigns a rank with no gaps, meaning that even if two people have the same age, they will receive consecutive numbers (e.g., 1, 2, 3, not 1, 1, 3), and the person with the next distinct age will receive the next sequential number.
Numbering functions in BigQuery facilitate complex data manipulations, enabling analysts to perform tasks like partitioning datasets, calculating running totals, or creating histograms. Advanced techniques involve nesting functions, using partitions wisely, and understanding the impact of ordering on function results.
The ROW_NUMBER() function in BigQuery assigns each row a unique number based on the order you specify. When you use it with partitions, the numbering restarts within each group and is recalculated every time the query runs, so rankings stay current as your data changes. This is especially useful in situations where you need to keep track of rankings over time or among different groups, like in competitive rankings or time-series data analysis.
Unlike simpler uses of ROW_NUMBER() that just count rows, using it with partitions allows for more targeted and dynamic analyses that adapt as new data comes in.
Syntax:
ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name [ASC|DESC])
Example:
Consider a dataset sales_data with monthly sales figures per salesperson across different regions.
To rank these salespersons within each region by their sales amounts, you would use the following syntax.
SELECT salesperson, region, sales_amount,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales_amount DESC) AS sales_rank
FROM sales_region;
Here:
This approach enables dynamic and contextual ranking, making ROW_NUMBER() a vital tool for analyzing trends and performances across different segments of your data.
NTILE is a window function in BigQuery that divides ordered dataset rows into a specified number of roughly equal groups or 'tiles'. For standard applications, NTILE simply segments the entire dataset into equal parts, which is ideal for straightforward statistical analyses and examining overall data distributions.
In more advanced scenarios, NTILE can be used with the PARTITION BY clause to perform more complex segmentations. This approach allows for the analysis of data distribution within specific subsets or categories of the data, enhancing its utility for detailed, context-specific analytical tasks.
Syntax:
NTILE(n) OVER ([PARTITION BY column] ORDER BY column)
Example:
A business analyst at a retail company wants to segment monthly sales data into quartiles across different regions to analyze regional performance variations and identify underperforming areas.
SELECT
region,
month,
sales_amount,
NTILE(4) OVER (PARTITION BY region ORDER BY sales_amount DESC) AS sales_quartile
FROM
monthly_sales;
Here:
The query outputs a list that includes regions, months, sales amounts, and quartile rankings for each region. Each region’s sales are categorized into four quartiles; the top 25% of the highest sales are in Quartile 1, and the lowest 25% are in Quartile 4. This helps in identifying both high-performing and underperforming months within each region.
This advanced use of NTILE() is dynamic and multifaceted, considering both geographical (region) and quantitative (sales) dimensions, which offers a nuanced view of performance and helps in targeted decision-making.
RANK() and DENSE_RANK() are BigQuery functions used for ranking records. They differ in how they handle ties within the data.
RANK() assigns a unique rank to each record, skipping ranks after ties (e.g., a tie for first results in the next rank being third). DENSE_RANK() also ranks uniquely but without gaps, ensuring a continuous sequence even after ties. Both functions are useful for comparing data points where duplicates may occur, like in sales or academic grading.
ROW_NUMBER() assigns a distinct sequential integer to rows within a partition of a result set, starting at 1 for the first row in each partition. Unlike RANK() and DENSE_RANK(), ROW_NUMBER() does not consider the values in the columns when assigning ranks. This means that even if two rows have the same value in the ordered column, they will still receive different row numbers based on their order in the result set.
Syntax:
SELECT column_name,
RANK() OVER (PARTITION BY partition_column ORDER BY order_column ASC|DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY partition_column ORDER BY order_column ASC|DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY partition_column ORDER BY order_column ASC|DESC) AS row_number
FROM table_name
Example:
To illustrate, consider ranking employees by sales within their departments:
SELECT employee_id, department, sales,
RANK() OVER (PARTITION BY department ORDER BY sales DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY department ORDER BY sales DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY sales DESC) AS row_number
FROM employee_sales;
Here:
The rank and dense_rank columns display each employee's rank within their department, with RANK() showing gaps for ties and DENSE_RANK() providing a consecutive ranking.
This method reveals the comparative performance of employees in each department, highlighting the ranking even when there are ties in sales figures.
Optimal use of numbering functions in BigQuery can significantly enhance query performance. Strategies like minimizing the window frame, using approximate functions, and ensuring efficient partitioning and ordering can reduce computation time and resource usage. Performance can be further optimized by understanding the trade-offs between different numbering functions and their impact on execution plans, particularly in large-scale data environments.
Partitioning data effectively is important when utilizing Numbering functions in BigQuery to enhance query performance and manageability. By segmenting data into relevant groups, such as by time periods or geographic areas, BigQuery can process and analyze the data more efficiently.
Example:
If you frequently analyze sales data by quarter, partitioning the data by time frame enables BigQuery to scan only the relevant subset of data for each query.
For example, partitioning a sales table by quarter would make queries filtering on a specific quarter much faster as BigQuery would only process the data for that quarter, rather than scanning the entire dataset.
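A sketch of this setup, with hypothetical dataset and table names. BigQuery's native time-unit partitioning for DATE columns supports day, month, and year granularity (quarter is not a native unit), so monthly partitions are a reasonable choice that quarterly filters can still prune:

```sql
-- Hypothetical names; monthly time-unit partitioning on a DATE column.
CREATE TABLE mydataset.sales_partitioned
PARTITION BY DATE_TRUNC(sale_date, MONTH)
AS
SELECT * FROM mydataset.sales_raw;

-- A quarter filter now scans only three monthly partitions:
SELECT SUM(amount) AS q1_sales
FROM mydataset.sales_partitioned
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31';
```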
The ROW_NUMBER and NTILE examples in the previous sections are real examples of how partitioning can be used for efficiency.
In BigQuery, the ordering of data within numbering functions like RANK(), DENSE_RANK(), and ROW_NUMBER() is important. The sequence defined by the ORDER BY clause directly influences the analytical significance of the resulting numbers and can affect query performance.
Ordering can also prioritize data based on business rules or analytical needs. For example, in a sales dataset, ordering by the sales amount before applying a ranking function helps identify top-performing products or sales regions.
Example:
SELECT ticket_id, received_time, priority,
RANK() OVER (PARTITION BY priority ORDER BY received_time) AS rank
FROM support_tickets;
Here:
This approach illustrates how careful ordering can provide meaningful sequences that enhance the understanding of data trends, prioritize actions, and facilitate detailed performance analysis.
Selecting the right operators in SQL queries, especially when using numbering functions in BigQuery, can greatly enhance query performance. Efficient operators reduce the amount of work BigQuery must do, indirectly speeding up the execution of numbering functions.
Example:
In BigQuery, JOIN and INTERSECT are used to structure and refine datasets that can then be processed with numbering functions such as ROW_NUMBER(), RANK(), and DENSE_RANK(). Although these set operations and join techniques are not embedded within numbering functions themselves, they play a crucial role in preparing the data.
For instance, a JOIN can be used to combine data from different tables, allowing a comprehensive dataset to be formed which is then subject to ranking or ordering through numbering functions. This is particularly useful for creating rankings within specific groups or partitions that are defined by the joined tables.
Conversely, INTERSECT might be employed in more complex queries to filter datasets to only include records common to multiple data sources or conditions, upon which numbering functions can then be applied to assign ranks or sequence numbers based on specified criteria.
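A sketch of this pattern follows; the table and column names are hypothetical, and note that BigQuery requires the DISTINCT keyword with INTERSECT:

```sql
-- Rank only customers who appear in both channels.
WITH shared_customers AS (
  SELECT customer_id, total_spend FROM online_customers
  INTERSECT DISTINCT
  SELECT customer_id, total_spend FROM store_customers
)
SELECT customer_id, total_spend,
  ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS spend_rank
FROM shared_customers;
```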
When working with BigQuery, especially with numbering functions, it's important to keep an eye on the resources your queries use. BigQuery allows a maximum of 1000 unique resources like tables, views, and functions in a single query after it's fully expanded. Staying within this limit ensures your queries are optimized for better performance.
To maintain efficiency, focus on minimizing the number of tables and functions referenced in your queries. This reduction can lead to less complex queries, which are faster to execute and easier to maintain.
Instead of querying many small tables, consider consolidating data into fewer, larger tables where appropriate. This approach can reduce the number of resources accessed during query execution.
Example:
Instead of querying multiple tables like this:
SELECT RANK() OVER (ORDER BY total_sales DESC) AS rank
FROM (
SELECT * FROM sales_region1
UNION ALL
SELECT * FROM sales_region2
UNION ALL
SELECT * FROM sales_region3
);
Consolidate data beforehand and query a single, comprehensive table:
SELECT RANK() OVER (ORDER BY total_sales DESC) AS rank
FROM consolidated_sales_data;
The initial approach unions three regional tables at query time, forcing BigQuery to reference and scan every one of them. The optimized approach queries a single consolidated table, consolidated_sales_data, which implies that the data from the various regions has been pre-consolidated into this table, simplifying and potentially speeding up the query process.
Cross joins create a combination of every row from two datasets, often leading to a massive increase in data volume. This can negatively impact the performance of Numbering functions by forcing them to operate on much larger datasets than necessary.
To ensure that numbering functions like ROW_NUMBER(), RANK(), and DENSE_RANK() are efficient and resource-friendly, it's important to avoid unnecessary cross joins. Instead, use more selective joins, such as inner or left joins, based on relevant keys or conditions. This approach helps focus on the most relevant data, reducing the computational load and improving query performance.
Example:
Consider you have two tables, employees and departments, and you want to rank employees within their departments. Instead of using a cross join, which would pair every employee with every department, use an inner join to combine records only where department IDs match:
Syntax:
SELECT e.name, d.name as department, RANK() OVER (PARTITION BY e.department_id ORDER BY e.salary DESC) as rank
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
This SQL query efficiently ranks employees within their respective departments, limiting the operation to relevant data pairs and avoiding the performance hit associated with cross joins.

💡 If combining datasets manually is causing you trouble, leverage SQL functions like JOIN and UNION to streamline your data merging tasks. For a detailed understanding of these powerful SQL functions, check out our complete guide on JOIN and UNION in SQL.
To enhance query performance in BigQuery, it's important to minimize the use of functions or operations that generate non-cacheable results.
Common challenges with BigQuery's numbering functions include handling large data volumes, managing partition skew, and ensuring consistent performance across varying datasets. Addressing these challenges requires a deep understanding of how numbering functions operate and their interaction with BigQuery's architecture.
Understanding and resolving errors with the NTILE function in BigQuery usually comes down to ensuring that the number of tiles supplied is a positive integer.
⚠️ Error:
An NTILE error occurs when the number of tiles is NULL, zero, or negative. Specifying more tiles than there are rows in the partition is not an error, but it leaves some bucket numbers unused.
✅ Solution:
Make sure the integer passed to the NTILE function is positive. Use validation checks or the COALESCE function to guard against NULL or otherwise invalid values.
Syntax:
NTILE(COALESCE(your_value, default_positive_value))
The use of COALESCE in this syntax is a safeguard against potential issues such as your_value being NULL or not a positive integer, ensuring that NTILE always has a valid and appropriate number of groups to divide the data into.
Example:
NTILE(COALESCE(your_value, 1))
By using NTILE(COALESCE(your_value, 1)), we ensure that NTILE will not fail due to an invalid number of groups and will default to creating one single group if no valid number is provided, ensuring the stability and reliability of your SQL query.
The "Resources Exceeded" error in BigQuery typically occurs when processing very large datasets, especially when using the ROW_NUMBER function with an ORDER BY clause. This situation can overload the system's capacity to handle the data in a single go.
⚠️ Error:
BigQuery raises "Resources Exceeded" when it attempts to sort and assign row numbers to a vast number of records, typically when ROW_NUMBER() with an ORDER BY clause runs over a large dataset without partitioning or filtering.
✅ Solution:
Employ the ARRAY_AGG() function combined with ORDER BY and LIMIT within the aggregation to minimize the data load. This technique selects a smaller, manageable subset of data, like the top-n records per group, which alleviates the system's resource strain.
Syntax:
SELECT category,
ARRAY_AGG(data ORDER BY some_metric DESC LIMIT 1)[OFFSET(0)] AS top_record
FROM large_dataset
GROUP BY category
Example:
SELECT product_type,
ARRAY_AGG(sales_data ORDER BY total_sales DESC LIMIT 1)[OFFSET(0)] AS top_selling_product
FROM sales_records_table
GROUP BY product_type
Here:
By using ARRAY_AGG with ORDER BY and LIMIT, BigQuery can efficiently process large datasets without exceeding resource limits.
To address data skew in partitioned queries, first identify skew by analyzing partition size distribution, such as a region-based partition where one region's data significantly outweighs others. Skew often results from uneven distribution or partitioning on non-uniform columns, impacting query performance and resource utilization.
⚠️ Error:
Data skew in BigQuery occurs when data distribution across partitions is imbalanced, notably with numbering functions like ROW_NUMBER(), RANK(), or NTILE(), causing uneven workload distribution.
✅ Solution:
To address data skew in BigQuery, especially with numbering functions, consider strategies such as partitioning on a more evenly distributed column, pre-filtering or pre-aggregating oversized partitions, and splitting heavily skewed groups into smaller ones.
There is no single SQL syntax for these strategies because they focus on general data management practices rather than a particular query change.
When using the NTILE function, one common mistake is incorrectly determining the number of groups.
⚠️ Error:
If the number of rows in the set does not divide evenly into the number of groups, some groups will contain one more row than others. BigQuery does not raise a warning in this case; it simply assigns the extra rows to the earliest groups.
✅ Solution:
Recalculate the number of rows or adjust the number of groups to ensure an even distribution.
Syntax:
SELECT value, NTILE(num_groups) OVER (ORDER BY value) AS group_number
FROM dataset
Example:
SELECT full_name,
NTILE(2) OVER (ORDER BY full_name) AS group_number_2,
NTILE(3) OVER (ORDER BY full_name) AS group_number_3
FROM scores_student
Here:
If you have 10 rows and divide them into 3 groups, the groups will contain 4, 3, and 3 rows; adjusting to 2 or 5 groups yields a perfectly even distribution.
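A quick way to see the uneven split is to run NTILE over ten generated rows; in BigQuery the extra row goes to the first group:

```sql
SELECT n, NTILE(3) OVER (ORDER BY n) AS group_number
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS n;
-- n 1-4 → group 1, n 5-7 → group 2, n 8-10 → group 3
```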
Another common mistake with the NTILE function is related to improper data sorting.
⚠️ Error:
If the data is not properly sorted before applying NTILE, the results can be unpredictable. For example, if the data is sorted in descending order, the first group will contain rows with the highest values, which may be unexpected.
✅ Solution:
Ensure data is correctly sorted before applying NTILE to get the desired grouping.
Syntax:
SELECT column, NTILE(num_groups) OVER (ORDER BY column ASC) AS group_number
FROM table_name
Example:
Sort data in ascending order to distribute the rows evenly across groups based on the sorted values.
SELECT sales_agent, sales_amount, NTILE(4) OVER (ORDER BY sales_amount ASC) AS group_number
FROM sales_agent_data
Here:
By sorting in ascending order, the rows are evenly distributed from the lowest to the highest sales amounts across the 4 groups.
The NTILE function can consume a significant amount of memory, especially when dealing with large datasets and dividing them into a large number of groups.
⚠️ Error:
If the dataset you are working with is too large, the NTILE function can consume a large amount of memory, leading to an increase in query execution time or even its unsuccessful completion due to lack of memory.
✅ Solution:
Use APPROX_QUANTILES instead. This function allows for more efficient query execution since it does not require a global ORDER BY for all rows in the table.
Syntax:
SELECT APPROX_QUANTILES(column, num_groups) AS quantile_boundaries
FROM table_name
Example:
WITH random_values AS (
SELECT
transaction_id,
RAND() AS random_value
FROM
`owox-analytics.dataset.transaction_table`
),
quantile_boundaries AS (
SELECT
APPROX_QUANTILES(random_value, 100) AS quantiles
FROM
random_values
)
SELECT
transaction_id,
ARRAY_LENGTH(
ARRAY(
SELECT boundary
FROM UNNEST(quantiles) AS boundary
WHERE random_value >= boundary
)
) + 1 AS group_number
FROM
random_values,
quantile_boundaries
ORDER BY
group_number;
Here:
By using APPROX_QUANTILES, you can achieve a similar distribution of data without the heavy memory usage that NTILE might cause.
If you aim to advance your skills in Google BigQuery, it's crucial to understand its more complex functions and capabilities.
The OWOX BI BigQuery Reports Extension enhances data analysis by offering seamless integration with BigQuery, allowing for the creation of detailed reports and dashboards directly from your datasets. This tool simplifies the data analysis process, making it more efficient and effective in driving decision-making.
By automating report generation and enabling real-time analytics, the OWOX BI BigQuery Add-on empowers organizations to quickly and accurately derive actionable insights, leading to better business decisions and outcomes.
Window functions allow for more complex calculations across a set of table rows related to the current row. Using ROW_NUMBER() with window functions enhances data analysis by:
- Providing more detailed and flexible analysis over partitions of data.
- Allowing for running totals, moving averages, and other cumulative metrics without the need for subqueries or complex joins.
- Improving performance and readability of queries compared to traditional methods.
A "Resources Exceeded" error when using ROW_NUMBER() can be resolved by:
- Filtering data before applying ROW_NUMBER().
- Breaking down large queries into smaller chunks.
- Increasing the BigQuery slot allocation if possible.
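The filter-first approach can be sketched like this (the orders table and its columns are hypothetical):

```sql
-- Filter in WHERE first, so the window function runs over fewer rows.
SELECT order_id, customer_id, order_date,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_seq
FROM orders
WHERE order_date >= '2024-01-01';
```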
CUME_DIST() can be used to identify outliers in your dataset in BigQuery. The CUME_DIST() function computes the cumulative distribution of a value in a set of values. Here's how you can use it to identify outliers:

SELECT *
FROM (
SELECT value, CUME_DIST() OVER (ORDER BY value) AS cume_dist
FROM student_grades
)
WHERE ROUND(cume_dist, 2) > 0.83 OR ROUND(cume_dist, 2) < 0.33

In this query, values significantly higher or lower than the majority can be considered outliers. We round the cumulative distribution to two decimal places and filter out values that fall in the top 17% or bottom 33% of the distribution.

However, it's important to note that this function can be used in the SELECT block but not directly in the WHERE or HAVING blocks; it can appear there only after the metric has been calculated. That is why the query uses a subquery to compute the cumulative distribution first and then applies the filtering condition.
PERCENT_RANK() calculates the relative rank of a row within a partition as a percentage. Common use cases include:
- Determining sales performance relative to peers.
- Ranking employees based on performance metrics.
- Analyzing customer behavior against the broader dataset.
The NTILE() function divides sorted rows into a specified number of approximately equal buckets. It's useful for data segmentation, such as:

SELECT NTILE(4) OVER (ORDER BY column_name) AS quartile, *
FROM table_name_sales
WHERE demographic_column = 'demographic_1'

This query segments the data into quartiles based on the specified column, which can be adjusted for different demographics.
- ROW_NUMBER(): Assigns a unique number to each row starting from 1, based on the order specified in the ORDER BY clause. If two rows are equal based on the ORDER BY criteria, they will still receive different row numbers, essentially making it a sequential numbering.
- RANK(): Assigns a rank to each row, giving tied values the same rank, but the next rank is incremented based on the total number of ties. For example, if two rows tie for rank 1, the next row will have rank 3.
- DENSE_RANK(): Similar to RANK(), but without gaps in the rank values. Following the previous example, if two rows tie for rank 1, the next row will have rank 2.