The initial DEFINE MEASURE part can be useful to define measures that are local to the query. It becomes useful when we are debugging formulas because we can define a local measure, test it, and then deploy the code in the model once it behaves as expected. Most of the syntax is optional. Indeed, the simplest query one can author retrieves all the rows and columns from an existing table, as shown in Figure 3-1:

EVALUATE 'Product'
FIGURE 3-1 The result of the query execution in DAX Studio.
The ORDER BY clause controls the sort order:

EVALUATE
FILTER (
    'Product',
    'Product'[Unit Price] > 3000
)
ORDER BY
    'Product'[Color],
    'Product'[Brand] ASC,
    'Product'[Class] DESC
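As a sketch of how the optional parts of a query fit together, the following query defines a local measure, evaluates it by brand, and sorts the result. The measure name Local Sales Amount and the use of SUMMARIZECOLUMNS are our own choices for illustration; the columns are the ones used elsewhere in this chapter.

```dax
DEFINE
    -- Local to this query only; the measure name is hypothetical
    MEASURE Sales[Local Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Sales", [Local Sales Amount]
)
ORDER BY 'Product'[Brand] ASC
```

Because the measure exists only for the duration of the query, it can be changed and rerun freely without touching the model.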
Note The Sort By Column property defined in a model does not affect the sort order in a DAX query. The ORDER BY clause of an EVALUATE statement can only use columns included in the result. Thus, a client that generates a dynamic DAX query should read the Sort By Column property in a model’s metadata, include the column for the sort order in the query, and then generate a corresponding ORDER BY condition.

EVALUATE is not a powerful statement by itself. The power of querying with DAX comes from the many DAX table functions that are available in the language. In the next sections, you learn how to create advanced calculations by using and combining different table functions.
Understanding FILTER

Now that we have introduced what table functions are, it is time to describe in full the basic table functions. Indeed, by combining and nesting the basic functions, you can already compute many powerful expressions. The first function you learn is FILTER. The syntax of FILTER is the following:

FILTER ( <table>, <condition> )
FILTER receives a table and a logical condition as parameters. As a result, FILTER returns all the rows satisfying the condition. FILTER is both a table function and an iterator at the same time. In order to return a result, it scans the table evaluating the condition on a row-by-row basis. In other words, it iterates the table. For example, the following calculated table returns the Fabrikam products (Fabrikam being a brand).

FabrikamProducts =
FILTER (
    'Product',
    'Product'[Brand] = "Fabrikam"
)
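The same FILTER expression can also be tried as a query before it is turned into a calculated table. The following is a minimal sketch, assuming a tool such as DAX Studio is connected to the model; the ORDER BY on Unit Price is our own addition to make the output easier to scan.

```dax
EVALUATE
FILTER (
    'Product',
    'Product'[Brand] = "Fabrikam"
)
ORDER BY 'Product'[Unit Price] DESC
```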
FILTER is often used to reduce the number of rows in iterations. For example, if a developer wants to compute the sales of red products, they can author a measure like the following one:

RedSales :=
SUMX (
    FILTER (
        Sales,
        RELATED ( 'Product'[Color] ) = "Red"
    ),
    Sales[Quantity] * Sales[Net Price]
)
You can see the result in Figure 3-2, along with the total sales.
FIGURE 3-2 RedSales shows the amount of sales of only red products.
CHAPTER 3
Using basic table functions
61
The RedSales measure iterated over a subset of the Sales table, namely the set of sales that are related to a red product. FILTER adds a condition to the existing conditions. For example, RedSales in the Audio row shows the sales of products that are both of Audio category and of Red color.

It is possible to nest FILTER in another FILTER function. In general, nesting two filters produces the same result as combining the conditions of the two FILTER functions with an AND function. In other words, the following two expressions produce the same result:

FabrikamHighMarginProducts =
FILTER (
    FILTER (
        'Product',
        'Product'[Brand] = "Fabrikam"
    ),
    'Product'[Unit Price] > 'Product'[Unit Cost] * 3
)

FabrikamHighMarginProducts =
FILTER (
    'Product',
    AND (
        'Product'[Brand] = "Fabrikam",
        'Product'[Unit Price] > 'Product'[Unit Cost] * 3
    )
)
However, performance might be different on large tables depending on the selectivity of the conditions. If one condition is more selective than the other, applying the most selective condition first by using a nested FILTER function is considered best practice. For example, if there are many products with the Fabrikam brand, but few products priced at more than three times their cost, then the following expression applies the filter over Unit Price and Unit Cost in the innermost FILTER. By doing so, the formula applies the most restrictive filter first, in order to reduce the number of iterations needed to check for the brand:

FabrikamHighMarginProducts =
FILTER (
    FILTER (
        'Product',
        'Product'[Unit Price] > 'Product'[Unit Cost] * 3
    ),
    'Product'[Brand] = "Fabrikam"
)
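One way to convince yourself that the nested and the combined formulations return the same rows is to compare them with EXCEPT in a query. This is only a sketch: the variable names and the output column names are hypothetical, and the && operator is used in place of the AND function.

```dax
EVALUATE
VAR NestedResult =
    FILTER (
        FILTER ( 'Product', 'Product'[Brand] = "Fabrikam" ),
        'Product'[Unit Price] > 'Product'[Unit Cost] * 3
    )
VAR CombinedResult =
    FILTER (
        'Product',
        'Product'[Brand] = "Fabrikam"
            && 'Product'[Unit Price] > 'Product'[Unit Cost] * 3
    )
RETURN
    ROW (
        -- Rows present in one result but missing from the other
        "Only in nested", COUNTROWS ( EXCEPT ( NestedResult, CombinedResult ) ),
        "Only in combined", COUNTROWS ( EXCEPT ( CombinedResult, NestedResult ) )
    )
```

If the two formulations are equivalent, both columns should be blank, because COUNTROWS returns blank for an empty table.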
Using FILTER, a developer can often produce code that is easier to read and to maintain over time. For example, imagine you need to compute the number of red products. Without using table functions, one possible implementation might be the following:
NumOfRedProducts :=
SUMX (
    'Product',
    IF ( 'Product'[Color] = "Red", 1, 0 )
)
The inner IF returns either 1 or 0 depending on the color of the product, and summing this expression returns the number of red products. Although it works, this code is somewhat tricky. A better implementation of the same measure is the following:

NumOfRedProducts :=
COUNTROWS (
    FILTER ( 'Product', 'Product'[Color] = "Red" )
)
This latter expression better shows what the developer wanted to obtain. Moreover, not only is the code easier to read for a human being, but the DAX optimizer is also better able to understand the developer’s intention. Therefore, the optimizer produces a better query plan, leading in turn to better performance.
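To compare the two implementations side by side, one option is to define both as query measures and evaluate them by category. The sketch below uses hypothetical measure names and SUMMARIZECOLUMNS as the grouping function.

```dax
DEFINE
    -- Both measure names are hypothetical and local to this query
    MEASURE 'Product'[Red Products SUMX] =
        SUMX ( 'Product', IF ( 'Product'[Color] = "Red", 1, 0 ) )
    MEASURE 'Product'[Red Products COUNTROWS] =
        COUNTROWS ( FILTER ( 'Product', 'Product'[Color] = "Red" ) )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Category],
    "SUMX version", [Red Products SUMX],
    "COUNTROWS version", [Red Products COUNTROWS]
)
ORDER BY 'Product'[Category]
```

The two columns should agree wherever red products exist. For a category with no red products, the SUMX version returns 0 whereas the COUNTROWS version returns blank, which is a small difference worth keeping in mind when choosing between them.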
Introducing ALL and ALLEXCEPT

In the previous section you learned FILTER, which is a useful function whenever we want to restrict the number of rows in a table. Sometimes we want to do the opposite; that is, we want to extend the number of rows to consider for a certain calculation. In that case, DAX offers a set of functions designed for that purpose: ALL, ALLEXCEPT, ALLCROSSFILTERED, ALLNOBLANKROW, and ALLSELECTED. In this section, you learn ALL and ALLEXCEPT, whereas the latter two are described later in this chapter and ALLCROSSFILTERED is introduced in Chapter 14, “Advanced DAX concepts.”

ALL returns all the rows of a table or all the values of one or more columns, depending on the parameters used. For example, the following DAX expression returns a ProductCopy calculated table with a copy of all the rows in the Product table:

ProductCopy = ALL ( 'Product' )
Note ALL is not necessary in a calculated table because there are no report filters influencing it. However, ALL is useful in measures, as shown in the next examples.

ALL is extremely useful whenever we need to compute percentages or ratios because it ignores the filters automatically introduced by a report. Imagine we need a report like the one in Figure 3-3, which shows on the same row both the sales amount and the percentage of the given amount against the grand total.
FIGURE 3-3 The report shows the sales amounts and each percentage against the grand total.
The Sales Amount measure computes a value by iterating over the Sales table and performing the multiplication of Sales[Quantity] by Sales[Net Price]:

Sales Amount :=
SUMX (
    Sales,
    Sales[Quantity] * Sales[Net Price]
)
To compute the percentage, we divide the sales amount by the grand total. Thus, the formula must compute the grand total of sales even when the report is deliberately filtering one given category. This can be obtained by using the ALL function. Indeed, the following measure produces the total of all sales, no matter what filter is being applied to the report:

All Sales Amount :=
SUMX (
    ALL ( Sales ),
    Sales[Quantity] * Sales[Net Price]
)
In the formula we replaced the reference to Sales with ALL ( Sales ), making good use of the ALL function. At this point, we can compute the percentage by performing a simple division:

Sales Pct := DIVIDE ( [Sales Amount], [All Sales Amount] )
Figure 3-4 shows the result of the three measures together.

The parameter of ALL cannot be a table expression. It needs to be either a table name or a list of column names. You have already learned what ALL does with a table. What is its result if we use a column instead? In that case, ALL returns all the distinct values of the column in the entire table. The Categories calculated table is obtained from the Category column of the Product table:

Categories = ALL ( 'Product'[Category] )
Figure 3-5 shows the result of the Categories calculated table.
FIGURE 3-4 The All Sales Amount measure always produces the grand total as a result.
FIGURE 3-5 Using ALL with a column produces the list of distinct values of that column.
We can specify multiple columns from the same table in the parameters of the ALL function. In that case, ALL returns all the existing combinations of values in those columns. For example, we can obtain the list of all categories and subcategories by adding the Product[Subcategory] column to the list of values, obtaining the result shown in Figure 3-6:

Categories =
ALL (
    'Product'[Category],
    'Product'[Subcategory]
)
Throughout all its variations, ALL ignores any existing filter in order to produce a result. We can use ALL as an argument of an iteration function, such as SUMX and FILTER, or as a filter argument in a CALCULATE function. You learn the CALCULATE function in Chapter 5. If we want to include most, but not all the columns of a table in an ALL function call, we can use ALLEXCEPT instead. The syntax of ALLEXCEPT requires a table followed by the columns we want to exclude. As a result, ALLEXCEPT returns a table with a unique list of existing combinations of values in the other columns of the table.
FIGURE 3-6 The list contains the distinct, existing values of category and subcategory.
ALLEXCEPT is a way to write a DAX expression that will automatically include in the result any additional columns that could appear in the table in the future. For example, if we have a Product table with five columns (ProductKey, Product Name, Brand, Class, Color), the following two expressions produce the same result:

ALL ( 'Product'[Product Name], 'Product'[Brand], 'Product'[Class] )

ALLEXCEPT ( 'Product', 'Product'[ProductKey], 'Product'[Color] )
However, if we later add the two columns Product[Unit Cost] and Product[Unit Price], then the result of ALL will ignore them, whereas ALLEXCEPT will return the equivalent of:

ALL (
    'Product'[Product Name],
    'Product'[Brand],
    'Product'[Class],
    'Product'[Unit Cost],
    'Product'[Unit Price]
)
In other words, with ALL we declare the columns we want, whereas with ALLEXCEPT we declare the columns that we want to remove from the result. ALLEXCEPT is mainly useful as a parameter of CALCULATE in advanced calculations, and it is seldom adopted with simpler formulas. Thus, even if we included its description here for completeness, it will become useful only later in the learning path.
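To see which columns ALLEXCEPT actually returns in a given model, the expression can be evaluated as a query. This is a minimal sketch that reuses the two excluded columns from the example above; the ORDER BY on Brand is our own addition and assumes Brand is among the columns that remain.

```dax
EVALUATE
ALLEXCEPT (
    'Product',
    'Product'[ProductKey],
    'Product'[Color]
)
ORDER BY 'Product'[Brand]
```

Rerunning the query after adding columns to the Product table is a quick way to observe that the new columns appear in the result without any change to the expression.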
Top categories and subcategories

As an example of using ALL as a table function, imagine we want to produce a dashboard that shows the category and subcategory of products that sold more than twice the average sales amount. To produce this report, we need to first compute the average sales per subcategory and then, once the value has been determined, retrieve from the list of subcategories the ones that have a sales amount larger than twice that average.
The following code produces that table, and it is worth examining deeper to get a feeling of the power of table functions and variables:

BestCategories =
VAR Subcategories =
    ALL ( 'Product'[Category], 'Product'[Subcategory] )
VAR AverageSales =
    AVERAGEX (
        Subcategories,
        SUMX ( RELATEDTABLE ( Sales ), Sales[Quantity] * Sales[Net Price] )
    )
VAR TopCategories =
    FILTER (
        Subcategories,
        VAR SalesOfCategory =
            SUMX ( RELATEDTABLE ( Sales ), Sales[Quantity] * Sales[Net Price] )
        RETURN
            SalesOfCategory >= AverageSales * 2
    )
RETURN
    TopCategories
The first variable (Subcategories) stores the list of all categories and subcategories. Then, AverageSales computes the average of the sales amount for each subcategory. Finally, TopCategories removes from Subcategories the subcategories that do not have a sales amount larger than twice the value of AverageSales. The result of this table is visible in Figure 3-7.
FIGURE 3-7 These are the top subcategories that sold more than twice the average.
Once you master CALCULATE and filter contexts, you will be able to author the same calculations with a shorter and more efficient syntax. Nevertheless, in this example you can already appreciate how combining table functions can produce powerful results, which are useful for dashboards and reports.
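The BestCategories code in this section computes the sales of each subcategory twice: once for the average and once inside FILTER. A possible variation, shown here only as a sketch, stores the amount in a temporary column with ADDCOLUMNS and reuses it. The table name BestCategoriesWithSales and the column name @Sales are hypothetical, and unlike the original the result carries the extra @Sales column.

```dax
BestCategoriesWithSales =
VAR SubcategoriesWithSales =
    ADDCOLUMNS (
        ALL ( 'Product'[Category], 'Product'[Subcategory] ),
        -- Sales amount of the subcategory currently iterated
        "@Sales", SUMX ( RELATEDTABLE ( Sales ), Sales[Quantity] * Sales[Net Price] )
    )
VAR AverageSales =
    AVERAGEX ( SubcategoriesWithSales, [@Sales] )
RETURN
    FILTER (
        SubcategoriesWithSales,
        [@Sales] >= AverageSales * 2
    )
```

Keeping the amount in the result can be convenient for checking the figures by eye, at the cost of one more column in the table.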
Understanding VALUES, DISTINCT, and the blank row

In the previous section, you saw that ALL used with one column returns a table with all its unique values. DAX provides two other similar functions that return a list of unique values for a column: VALUES and DISTINCT. These two functions look almost identical, the only difference being in how they handle the blank row that might exist in a table. You will learn about the optional blank row later in this section; for now let us focus on what these two functions perform.

ALL always returns all the distinct values of a column. On the other hand, VALUES returns only the distinct visible values. You can appreciate the difference between the two behaviors by looking at the two following measures:

NumOfAllColors := COUNTROWS ( ALL ( 'Product'[Color] ) )

NumOfColors := COUNTROWS ( VALUES ( 'Product'[Color] ) )
NumOfAllColors counts all the colors of the Product table, whereas NumOfColors counts only the ones that—given the filter in the report—are visible. The result of these two measures, sliced by category, is visible in Figure 3-8.
FIGURE 3-8 For a given category, only a subset of the colors is returned by VALUES.
Because the report slices by category, each given category contains products with some, but not all, the colors. VALUES returns the distinct values of a column evaluated in the current filter. If we use VALUES or DISTINCT in a calculated column or in a calculated table, then their behavior is identical to that of ALL because there is no active filter. On the other hand, when used in a measure, these two functions compute their result considering the existing filters, whereas ALL ignores any filter.

As you read earlier, the two functions are nearly identical. It is now important to understand why VALUES and DISTINCT are two variations of the same behavior. The difference is the way they consider the presence of a blank row in the table. First, we need to understand how a blank row might appear in our table if we did not explicitly create one. The engine automatically creates a blank row in any table that is on the one-side of a relationship when the relationship is invalid. To demonstrate the behavior, we removed all the silver-colored products from the Product table. Since there were 16 distinct colors initially and we removed one color, one would expect the total number of colors to be 15. Instead, the report in Figure 3-9 shows something unexpected: NumOfAllColors is still 16 and the report shows a new row at the top, with no name.
FIGURE 3-9 The first row shows a blank for the category, and the total number of colors is 16 instead of 15.
Because Product is on the one-side of a relationship with Sales, for each row in the Sales table there is a related row in the Product table. Nevertheless, because we deliberately removed all the products with one color, there are now many rows in Sales that no longer have a valid relationship with the Product table. Be mindful, we did not remove any row from Sales; we removed a color with the intent of breaking the relationship. To guarantee that these rows are considered in all the calculations, the engine automatically added to the Product table a row containing blank in all its columns. All the orphaned rows in Sales are linked to this newly introduced blank row.
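When a blank row shows up unexpectedly, it can help to list the keys in Sales that have no matching product. The query below is a sketch that assumes the relationship is defined between Sales[ProductKey] and Product[ProductKey]. DISTINCT is used on the Product side so that the automatically added blank row is not part of the comparison.

```dax
EVALUATE
EXCEPT (
    -- Keys referenced by at least one sale
    DISTINCT ( Sales[ProductKey] ),
    -- Keys that actually exist in the Product table
    DISTINCT ( 'Product'[ProductKey] )
)
```

An empty result would suggest that the relationship is valid and that no blank row is needed.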
Important Only one blank row is added to the Product table, despite the fact that multiple different products referenced in the Sales table no longer have a corresponding ProductKey in the Product table.

Indeed, in Figure 3-9 you can see that the first row shows a blank for the Category and accounts for one color. The number comes from a row containing blank in the category, blank in the color, and blank in all the columns of the table. You will not see the row if you inspect the table because it is an automatic row created during the loading of the data model. If, at some point, the relationship becomes valid again (if you were to add the silver products back), then the blank row will disappear from the table.

Certain functions in DAX consider the blank row as part of their result, whereas others do not. Specifically, VALUES considers the blank row as a valid row, and it returns it. On the other hand, DISTINCT does not return it. You can appreciate the difference by looking at the following new measure, which counts the DISTINCT colors instead of VALUES:

NumOfDistinctColors := COUNTROWS ( DISTINCT ( 'Product'[Color] ) )
The result is visible in Figure 3-10.
FIGURE 3-10 NumOfDistinctColors shows a blank for the blank row, and its total shows 15 instead of 16.
A well-designed model should not present any invalid relationships. Thus, if your model is perfect, then the two functions always return the same values. Nevertheless, when dealing with invalid relationships, you need to be aware of this behavior because otherwise you might end up writing incorrect calculations. For example, imagine that we want to compute the average sales per product. A possible solution is to compute the total sales and divide that by the number of products, by using this code:

AvgSalesPerProduct :=
DIVIDE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),
    COUNTROWS ( VALUES ( 'Product'[Product Code] ) )
)
The result is visible in Figure 3-11. It is obviously wrong because the first row is a huge, meaningless number.
FIGURE 3-11 The first row shows a huge value accounted for a category with no name.
The number shown in the first row, where Category is blank, corresponds to the sales of all the silver products, which no longer exist in the Product table. This blank row associates all the products that were silver and are no longer in the Product table. The numerator of DIVIDE considers all the sales of silver products. The denominator of DIVIDE counts a single blank row returned by VALUES. Thus, a single non-existing product (the blank row) is cumulating the sales of many other products referenced in Sales and not available in the Product table, leading to a huge number.

Here, the problem is the invalid relationship, not the formula by itself. Indeed, no matter what formula we create, there are many sales of products in the Sales table for which the database has no information. Nevertheless, it is useful to look at how different formulations of the same calculation return different results. Consider these two other variations:

AvgSalesPerDistinctProduct :=
DIVIDE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),
    COUNTROWS ( DISTINCT ( 'Product'[Product Code] ) )
)

AvgSalesPerDistinctKey :=
DIVIDE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),
    COUNTROWS ( VALUES ( Sales[ProductKey] ) )
)
In the first variation, we used DISTINCT instead of VALUES. As a result, COUNTROWS returns a blank and the result will be a blank. In the second variation, we still used VALUES, but this time we are counting the number of Sales[ProductKey]. Keep in mind that there are many different Sales[ProductKey] values, all related to the same blank row. The result is visible in Figure 3-12.
FIGURE 3-12 In the presence of invalid relationships, the measures are most likely wrong—each in their own way.
It is interesting to note that AvgSalesPerDistinctKey is the only correct calculation. Since we sliced by Category, each category had a different number of invalid product keys, all of which collapsed to the single blank row. However, the correct approach should be to fix the relationship so that no sale is orphaned of its product. The golden rule is to not have any invalid relationships in the model. If, for any reason, you have invalid relationships, then you need to be extremely cautious in how you handle the blank row, as well as how its presence might affect your calculations.

As a final note, consider that the ALL function always returns the blank row, if present. In case you need to remove the blank row from the result, then ALLNOBLANKROW is the function you will want to use.
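The four functions discussed in this section can be compared in a single query. The sketch below counts the colors returned by each of them; the output column names are hypothetical.

```dax
EVALUATE
ROW (
    -- ALL and VALUES include the blank row, if one exists
    "ALL", COUNTROWS ( ALL ( 'Product'[Color] ) ),
    "VALUES", COUNTROWS ( VALUES ( 'Product'[Color] ) ),
    -- DISTINCT and ALLNOBLANKROW leave the blank row out
    "DISTINCT", COUNTROWS ( DISTINCT ( 'Product'[Color] ) ),
    "ALLNOBLANKROW", COUNTROWS ( ALLNOBLANKROW ( 'Product'[Color] ) )
)
```

In a model with an invalid relationship such as the one described here, and with no filter applied by the query, one would expect the first two counts to be one higher than the last two. In a model with valid relationships, all four should match.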
VALUES of multiple columns

The functions VALUES and DISTINCT only accept a single column as a parameter. There is no corresponding version for two or more columns, as there is for ALL and ALLNOBLANKROW. In case we need to obtain the distinct, visible combinations of values from different columns, then VALUES is of no help. Later in Chapter 12 you will learn that the result one would expect from the following expression, which is not valid syntax:

VALUES ( 'Product'[Category], 'Product'[Subcategory] )

can be obtained by writing:

SUMMARIZE ( 'Product', 'Product'[Category], 'Product'[Subcategory] )
Later, you will see that VALUES and DISTINCT are often used as a parameter of iterator functions. There are no differences in their results whenever the relationships are valid. In such a case, when you iterate over the values of a column, you need to consider the blank row as a valid row, in order to make sure that you iterate all the possible values. As a rule of thumb, VALUES should be your default choice, only leaving DISTINCT to cases when you want to explicitly exclude the possible blank value. Later in this book, you will also learn how to leverage DISTINCT instead of VALUES to avoid circular dependencies. We will cover it in Chapter 15, “Advanced relationships handling.”

VALUES and DISTINCT also accept a table as an argument. In that case, they exhibit different behaviors:

■ DISTINCT returns the distinct values of the table, not considering the blank row. Thus, duplicated rows are removed from the result.
■ VALUES returns all the rows of the table, without removing duplicates, plus the additional blank row if present. Duplicated rows, in this case, are kept untouched.
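The different behavior of the two functions with a table argument can be observed by counting the rows each one returns. The following sketch uses the Sales table; the output column names are hypothetical.

```dax
EVALUATE
ROW (
    -- Keeps duplicated rows (and the blank row, if Sales had one)
    "Rows returned by VALUES", COUNTROWS ( VALUES ( Sales ) ),
    -- Removes duplicated rows and ignores the blank row
    "Rows returned by DISTINCT", COUNTROWS ( DISTINCT ( Sales ) )
)
```

If the two numbers differ, the table contains fully duplicated rows, a blank row, or both.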
Using tables as scalar values

Although VALUES is a table function, we will often use it to compute scalar values because of a special feature in DAX: a table with a single row and a single column can be used as if it were a scalar value. Imagine we produce a report like the one in Figure 3-13, reporting the number of brands sliced by category and subcategory.
FIGURE 3-13 The report shows the number of brands available for each category and subcategory.
One might also want to see the names of the brands beside their number. One possible solution is to use VALUES to retrieve the different brands and, instead of counting them, return their value. This is possible only in the special case when there is only one value for the brand. Indeed, in that case it is possible to return the result of VALUES and DAX automatically converts it into a scalar value. To make sure that there is only one brand, one needs to protect the code with an IF function:

Brand Name :=
IF (
    COUNTROWS ( VALUES ( 'Product'[Brand] ) ) = 1,
    VALUES ( 'Product'[Brand] )
)
The result is visible in Figure 3-14. When the Brand Name column contains a blank, it means that there are two or more different brands.
FIGURE 3-14 When VALUES returns a single row, we can use it as a scalar value, as in the Brand Name measure.
The Brand Name measure uses COUNTROWS to check whether the Brand column of the Product table only has one value selected. Because this pattern is frequently used in DAX code, there is a simpler function that checks whether a column only has one visible value: HASONEVALUE. The following is a better implementation of the Brand Name measure, based on HASONEVALUE:

Brand Name :=
IF (
    HASONEVALUE ( 'Product'[Brand] ),
    VALUES ( 'Product'[Brand] )
)
Moreover, to make the lives of developers easier, DAX also offers a function that automatically checks if a column contains a single value and, if so, it returns the value as a scalar. In case there are multiple values, it is also possible to define a default value to be returned. That function is SELECTEDVALUE. The previous measure can also be defined as

Brand Name := SELECTEDVALUE ( 'Product'[Brand] )
By including the second optional argument, one can provide a message stating that the result contains multiple values:

Brand Name := SELECTEDVALUE ( 'Product'[Brand], "Multiple brands" )
The result of this latest measure is visible in Figure 3-15.
FIGURE 3-15 SELECTEDVALUE returns a default value in case there are multiple rows for the Brand Name column.
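It can help to read SELECTEDVALUE with a default value as shorthand for the HASONEVALUE pattern shown earlier, with the default placed in the third argument of IF. The following measure is a sketch of that expanded form and should behave like the SELECTEDVALUE version:

```dax
Brand Name :=
IF (
    HASONEVALUE ( 'Product'[Brand] ),
    -- Exactly one brand is visible: return it as a scalar
    VALUES ( 'Product'[Brand] ),
    -- Zero or several brands are visible: return the default
    "Multiple brands"
)
```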
What if, instead of returning a message like “Multiple brands,” one wants to list all the brands? In that case, an option is to iterate over the VALUES of Product[Brand] and use the CONCATENATEX function, which produces a good result even if there are multiple values:

Brand Name :=
CONCATENATEX (
    VALUES ( 'Product'[Brand] ),
    'Product'[Brand],
    ", "
)
Now the result contains the different brands separated by a comma instead of the generic message, as shown in Figure 3-16.
FIGURE 3-16 CONCATENATEX builds strings out of tables, concatenating expressions.
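CONCATENATEX also accepts optional arguments that control the order in which the values are concatenated. As a sketch, the following variation of the measure sorts the brands alphabetically; whether an explicit sort order is needed depends on the report.

```dax
Brand Name :=
CONCATENATEX (
    VALUES ( 'Product'[Brand] ),
    'Product'[Brand],
    ", ",
    -- Optional order-by expression and direction
    'Product'[Brand], ASC
)
```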
Introducing ALLSELECTED

The last table function that belongs to the set of basic table functions is ALLSELECTED. Actually, ALLSELECTED is a very complex table function, probably the most complex table function in DAX. In Chapter 14, we will uncover all the secrets of ALLSELECTED. Nevertheless, ALLSELECTED is useful even in its basic implementation. For that reason, it is worth mentioning in this introductory chapter.

ALLSELECTED is useful when retrieving the list of values of a table, or a column, as visible in the current report and considering all and only the filters outside of the current visual. To see when ALLSELECTED becomes useful, look at the report in Figure 3-17.
FIGURE 3-17 The report contains a matrix and a slicer, on the same page.
The value of Sales Pct is computed by the following measure:

Sales Pct :=
DIVIDE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),
    SUMX ( ALL ( Sales ), Sales[Quantity] * Sales[Net Price] )
)
Because the denominator uses the ALL function, it always computes the grand total of all sales, regardless of any filter. As such, if one uses the slicer to reduce the number of categories shown, the report still computes the percentage against all the sales. For example, Figure 3-18 shows what happens if one selects some categories with the slicer.
FIGURE 3-18 Using ALL, the percentage is still computed against the grand total of all sales.
Some rows disappeared as expected, but the amounts reported in the remaining rows are unchanged. Moreover, the grand total of the matrix no longer accounts for 100%. If this is not the expected result, meaning that you want the percentage to be computed not against the grand total of sales but rather only on the selected values, then ALLSELECTED becomes useful.

Indeed, by writing the code of Sales Pct using ALLSELECTED instead of ALL, the denominator computes the sales of all categories considering all and only the filters outside of the matrix. In other words, it returns the sales of all categories except Audio, Music, and TV.

Sales Pct :=
DIVIDE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),
    SUMX ( ALLSELECTED ( Sales ), Sales[Quantity] * Sales[Net Price] )
)
The result of this latter version is visible in Figure 3-19.
FIGURE 3-19 Using ALLSELECTED, the percentage is computed against the sales only considering outer filters.
The total is now 100% and the numbers reported reflect the percentage against the visible total, not against the grand total of all sales. ALLSELECTED is a powerful and useful function. Unfortunately, to achieve this purpose, it ends up being an extraordinarily complex function too. Only much later in the book will we be able to explain it in full. Because of its complexity, ALLSELECTED sometimes returns unexpected results. By unexpected we do not mean wrong, but rather, ridiculously hard to understand even for seasoned DAX developers. When used in simple formulas like the one we have shown here, ALLSELECTED proves to be particularly useful, anyway.
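When a percentage does not look right, it can help to expose the denominator as a measure of its own and place it in the matrix next to the All Sales Amount measure defined earlier in this chapter. The following sketch uses two hypothetical measure names, Selected Sales Amount and Sales Pct Selected:

```dax
-- Denominator that respects only the filters outside of the visual
Selected Sales Amount :=
SUMX (
    ALLSELECTED ( Sales ),
    Sales[Quantity] * Sales[Net Price]
)

-- Same percentage as the ALLSELECTED version of Sales Pct, built from measures
Sales Pct Selected :=
DIVIDE ( [Sales Amount], [Selected Sales Amount] )
```

With a slicer selection active, Selected Sales Amount should match the total row of the matrix, whereas All Sales Amount keeps showing the grand total of all sales.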
Conclusions

As you have seen in this chapter, basic table functions are already immensely powerful, and they allow you to start creating many useful calculations. FILTER, ALL, VALUES and ALLSELECTED are extremely common functions that appear in many DAX formulas. Learning how to mix table functions to produce the result you want is particularly important because it will allow you to seamlessly achieve advanced calculations. Moreover, when mixed with the power of CALCULATE and of context transition, table functions produce compact, neat, and powerful calculations. In the next chapters, we introduce evaluation contexts and the CALCULATE function. After having learned CALCULATE, you will probably revisit this chapter to use table functions as parameters of CALCULATE, thus leveraging their full potential.
CHAPTER 4
Understanding evaluation contexts

At this point in the book, you have learned the basics of the DAX language. You know how to create calculated columns and measures, and you have a good understanding of common functions used in DAX. This is the chapter where you move to the next level in this language: After learning a solid theoretical background of the DAX language, you become a real DAX champion.

With the knowledge you have gained so far, you can already create many interesting reports, but you need to learn evaluation contexts in order to create more complex formulas. Indeed, evaluation contexts are the basis of all the advanced features of DAX.

We want to give a few words of warning to our readers. The concept of evaluation contexts is simple, and you will learn and understand it soon. Nevertheless, you need to thoroughly understand several subtle considerations and details. Otherwise, you will feel lost at a certain point on your DAX learning path. We have been teaching DAX to thousands of users in public and private classes, so we know that this is normal. At a certain point, you have the feeling that formulas work like magic because they work, but you do not understand why. Do not worry: you will be in good company. Most DAX students reach that point, and many others will reach it in the future. It simply means that evaluation contexts are not clear enough to them. The solution, at that point, is easy: Come back to this chapter, read it again, and you will probably find something new that you missed during your first read.

Moreover, evaluation contexts play an important role when using the CALCULATE function, which is probably the most powerful and hard-to-learn DAX function. We introduce CALCULATE in Chapter 5, “Understanding CALCULATE and CALCULATETABLE,” and then we use it throughout the rest of the book. Understanding CALCULATE without having a solid understanding of evaluation contexts is problematic. On the other hand, understanding the importance of evaluation contexts without having ever tried to use CALCULATE is nearly impossible. Thus, in our experience with previous books we have written, this chapter and the subsequent one are the two that are always marked up and have the corners of pages folded over.

In the rest of the book we will use these concepts. Then in Chapter 14, “Advanced DAX concepts,” you will complete your learning of evaluation contexts with expanded tables. Beware that the content of this chapter is not the definitive description of evaluation contexts just yet. A more detailed description of evaluation contexts is the description based on expanded tables, but it would be too hard to learn about expanded tables before having a good understanding of the basics of evaluation contexts. Therefore, we introduce the whole theory in different steps.
Introducing evaluation contexts There are two evaluation contexts: the filter context and the row context. In the next sections, you learn what they are and how to use them to write DAX code. Before learning what they are, it is important to state one point: They are different concepts, with different functionalities and a completely different usage. The most common mistake of DAX newbies is that of confusing the two contexts as if the row context was a slight variation of a filter context. This is not the case. The filter context filters data, whereas the row context iterates tables. When DAX is iterating, it is not filtering; and when it is filtering, it is not iterating. Even though this is a simple concept, we know from experience that it is hard to imprint in the mind. Our brain seems to prefer a short path to learning—when it believes there are some similarities, it uses them by merging the two concepts into one. Do not be fooled. Whenever you have the feeling that the two evaluation contexts look the same, stop and repeat this sentence in your mind like a mantra: “The filter context filters, the row context iterates, they are not the same.” An evaluation context is the context under which a DAX expression is evaluated. In fact, any DAX expression can provide different values in different contexts. This behavior is intuitive, and this is the reason why one can write DAX code without learning about evaluation contexts in advance. You probably reached this point in the book having authored DAX code without learning about evaluation contexts. Because you want more, it is now time to be more precise, to set up the foundations of DAX the right way, and to prepare yourself to unleash the full power of DAX.
Understanding filter contexts Let us begin by understanding what an evaluation context is. All DAX expressions are evaluated inside a context. The context is the “environment” within which the formula is evaluated. For example, consider a measure such as Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
This formula computes the sum of quantity multiplied by price in the Sales table. We can use this measure in a report and look at the results, as shown in Figure 4-1.
FIGURE 4-1 The measure Sales Amount, without a context, shows the grand total of sales.
This number alone does not look interesting. However, if you think carefully, the formula computes exactly what one would expect: the sum of all sales amounts. In a real report, one is likely to slice the value by a certain column. For example, we can select the product brand, use it on the rows, and the matrix report starts to reveal interesting business insights as shown in Figure 4-2.
CHAPTER 4
Understanding Evaluation Contexts
FIGURE 4-2 Sum of Sales Amount, sliced by brand, shows the sales of each brand in separate rows.
The grand total is still there, but now it is the sum of smaller values. Each value, together with all the others, provides more detailed insights. However, you should note that something weird is happening: The formula is not computing what we apparently asked. In fact, inside each cell of the report, the formula is no longer computing the sum of all sales. Instead, it computes the sales of a given brand. Finally, note that nowhere in the code does it say that it can (or should) work on subsets of data. This filtering happens outside of the formula. Each cell computes a different value because of the evaluation context under which DAX executes the formula. You can think of the evaluation context of a formula as the surrounding area of the cell where DAX evaluates the formula. DAX evaluates all formulas within a respective context. Even though the formula is the same, the result is different because DAX executes the same code against different subsets of data. This context is named Filter Context and, as the name suggests, it is a context that filters tables. Any formula ever authored will have a different value depending on the filter context used to perform its evaluation. This behavior, although intuitive, needs to be well understood because it hides many complexities. Every cell of the report has a different filter context. You should consider that every cell has a different evaluation—as if it were a different query, independent from the other cells in the same report. The engine might perform some level of internal optimization to improve computation speed, but you should assume that every cell has an independent and autonomous evaluation of the underlying DAX expression. Therefore, the computation of the Total row in Figure 4-2 is not computed by summing the other rows of the report. It is computed by aggregating all the rows of the Sales table, although this means other iterations were already computed for the other rows in the same report. Consequently,
depending on the DAX expression, the result in the Total row might display a different result, unrelated to the other rows in the same report.
Note In these examples, we are using a matrix for the sake of simplicity. We can define an evaluation context with queries too, and you will learn more about it in future chapters. For now, it is better to keep it simple and only think of reports, to have a simplified and visual understanding of the concepts.
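As a hedged illustration of the note above, the matrix of Figure 4-2 can be approximated with a query. This is only a sketch: it assumes that Sales is related to 'Product' in the model, and it uses SUMMARIZECOLUMNS, a table function that has not been described at this point. Each row of the result evaluates the measure in a filter context that contains one brand.

```dax
DEFINE
    -- Local copy of the measure used in this section
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],                -- one result row per brand, like the rows of the matrix
    "Sales Amount", [Sales Amount]   -- evaluated once per brand, in that brand's filter context
)
ORDER BY 'Product'[Brand]
```

Unlike the matrix, this query does not return a total row; it only shows that the same measure code produces a different value for each brand.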
When Brand is on the rows, the filter context filters one brand for each cell. If we increase the complexity of the matrix by adding the year on the columns, we obtain the report in Figure 4-3.
FIGURE 4-3 Sales amount is sliced by brand and year.
Now each cell shows a subset of data pertinent to one brand and one year. The reason for this is that the filter context of each cell now filters both the brand and the year. In the Total row, the filter is only on the brand, whereas in the Total column the filter is only on the year. The grand total is the only cell that computes the sum of all sales because—there—the filter context does not apply any filter to the model. The rules of the game should be clear at this point: The more columns we use to slice and dice, the more columns are being filtered by the filter context in each cell of the matrix. If one adds the Store[Continent] column to the rows, the result is—again—different, as shown in Figure 4-4.
FIGURE 4-4 The context is defined by the set of fields on rows and on columns.
Now the filter context of each cell is filtering brand, continent, and year. In other words, the filter context contains the complete set of fields that one uses on rows and columns of the report.
Note Whether a field is on the rows or on the columns of the visual, or on the slicer and/or page/report/visual filter, or in any other kind of filter we can create with a report—all this is irrelevant. All these filters contribute to define a single filter context, which DAX uses to evaluate the formula. Displaying a field on rows or columns is useful for aesthetic purposes, but nothing changes in the way DAX computes values.
Visual interactions in Power BI compose a filter context by combining different elements from a graphical interface. Indeed, the filter context of a cell is computed by merging together all the filters coming from rows, columns, slicers, and any other visual used for filtering. For example, look at Figure 4-5.
FIGURE 4-5 In a typical report, the context is defined in many ways, including slicers, filters, and other visuals.
The filter context of the top-left cell (A.Datum, CY 2007, 57,276.00) not only filters the row and the column of the visual, but it also filters the occupation (Professional) and the continent (Europe), which are coming from different visuals. All these filters contribute to the definition of a single filter context valid for one cell, which DAX applies to the whole data model prior to evaluating the formula. A more formal definition of a filter context is to say that a filter context is a set of filters. A filter, in turn, is a list of tuples, and a tuple is a set of values for some defined columns. Figure 4-6 shows a visual representation of the filter context under which the highlighted cell is evaluated. Each element of the report contributes to creating the filter context, and every cell in the report has a different filter context.
Calendar Year: CY 2007
Education: High School, Partial College
Brand: Contoso
FIGURE 4-6 The figure shows a visual representation of a filter context in a Power BI report.
The filter context of Figure 4-6 contains three filters. The first filter contains a tuple for Calendar Year with the value CY 2007. The second filter contains two tuples for Education with the values High School and Partial College. The third filter contains a single tuple for Brand, with the value Contoso. You might
notice that each filter contains tuples for one column only. You will learn how to create tuples with multiple columns later. Multi-column tuples are both powerful and complex tools in the hands of a DAX developer. Before leaving this introduction, let us recall the measure used at the beginning of this section: Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
Here is the correct way of reading the previous measure: The measure computes the sum of Quantity multiplied by Net Price for all the rows in Sales which are visible in the current filter context. The same applies to simpler aggregations. For example, consider this measure: Total Quantity := SUM ( Sales[Quantity] )
It sums the Quantity column of all the rows in Sales that are visible in the current filter context. You can better understand its working by considering the corresponding SUMX version: Total Quantity := SUMX ( Sales, Sales[Quantity] )
Looking at the SUMX definition, we might consider that the filter context affects the evaluation of the Sales expression, which only returns the rows of the Sales table that are visible in the current filter context. This is true, but you should consider that the filter context also applies to the following measures, which do not have a corresponding iterator:
Customers := DISTINCTCOUNT ( Sales[CustomerKey] )    -- Count customers in filter context
Colors :=
VAR ListColors = DISTINCT ( 'Product'[Color] )       -- Unique colors in filter context
RETURN COUNTROWS ( ListColors )                      -- Count unique colors
It might look pedantic, at this point, to spend so much time stressing the concept that a filter context is always active, and that it affects the formula result. Nevertheless, keep in mind that DAX requires you to be extremely precise. Most of the complexity of DAX is not in learning new functions. Instead, the complexity comes from the presence of many subtle concepts. When these concepts are mixed together, what emerges is a complex scenario. Right now, the filter context is defined by the report. As soon as you learn how to create filter contexts by yourself (a critical skill described in the next chapter), being able to understand which filter context is active in each part of your formula will be of paramount importance.
Understanding the row context In the previous section, you learned about the filter context. In this section, you now learn the second type of evaluation context: the row context. Remember, although both the row context and the filter context are evaluation contexts, they are not the same concept. As you learned in the previous section, the purpose of the filter context is, as its name implies, to filter tables. On the other hand, the row context is not a tool to filter tables. Instead, it is used to iterate over tables and evaluate column values.
This time we use a different formula for our considerations, defining a calculated column to compute the gross margin: Sales[Gross Margin] = Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] )
There is a different value for each row in the resulting calculated column, as shown in Figure 4-7.
FIGURE 4-7 There is a different value in each row of Gross Margin, depending on the value of other columns.
As expected, for each row of the table there is a different value in the calculated column. Indeed, because there are given values in each row for the three columns used in the expression, it comes as a natural consequence that the final expression computes different values. As it happened with the filter context, the reason is the presence of an evaluation context. This time, the context does not filter a table. Instead, it identifies the row for which the calculation happens.
Note The row context references a row in the result of a DAX table expression. It should not be confused with a row in the report. DAX does not have a way to directly reference a row or a column in the report. The values displayed in a matrix in Power BI and in a PivotTable in Excel are the result of DAX measures computed in a filter context, or are values stored in the table as native or calculated columns.
In other words, we know that a calculated column is computed row by row, but how does DAX know which row it is currently iterating? It knows the row because there is another evaluation context providing the row: the row context. When we create a calculated column over a table with one million rows, DAX creates a row context that evaluates the expression iterating over the table row by row, using the row context as the cursor.
When we create a calculated column, DAX creates a row context by default. In that case, there is no need to manually create a row context: A calculated column is always executed in a row context. You have already learned how to create a row context manually—by starting an iteration. In fact, one can write the gross margin as a measure, like in the following code: Gross Margin := SUMX ( Sales, Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] ) )
In this case, because the code is for a measure, there is no automatic row context. SUMX, being an iterator, creates a row context that starts iterating over the Sales table, row by row. During the iteration, it executes the second expression of SUMX inside the row context. Thus, during each step of the iteration, DAX knows which value to use for the three column names used in the expression. The row context exists when we create a calculated column or when we are computing an expression inside an iteration. There is no other way of creating a row context. Moreover, it helps to think that a row context is needed whenever we want to obtain the value of a column for a certain row. For example, the following measure definition is invalid. Indeed, it tries to compute the value of Sales[Net Price] and there is no row context providing the row for which the calculation needs to be executed: Gross Margin := Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] )
This same expression is valid when executed for a calculated column, and it is invalid if used in a measure. The reason is not that measures and calculated columns have different ways of using DAX. The reason is that a calculated column has an automatic row context, whereas a measure does not. If one wants to evaluate an expression row by row inside a measure, one needs to start an iteration to create a row context.
Note A column reference requires a row context to return the value of the column from a table. A column reference can also be used as an argument for several DAX functions without a row context. For example, DISTINCT and DISTINCTCOUNT can have a column reference as a parameter, without defining a row context. Nonetheless, a column reference in a DAX expression requires a row context to be evaluated.
At this point, we need to repeat one important concept: A row context is not a special kind of filter context that filters one row. The row context is not filtering the model in any way; the row context only indicates to DAX which row to use out of a table. If one wants to apply a filter to the model, the tool to use is the filter context. On the other hand, if the user wants to evaluate an expression row by row, then the row context will do the job.
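The two uses of a column reference described in the note can be put side by side in a small query. This is a minimal sketch, assuming the 'Product' table has the Color and Unit Price columns used elsewhere in this chapter; the result column names are made up for the illustration.

```dax
EVALUATE
ROW (
    -- Column reference used as a function argument: no row context is needed
    "Colors", DISTINCTCOUNT ( 'Product'[Color] ),
    -- Column reference used as a value: evaluated in the row context created by MAXX
    "Highest Unit Price", MAXX ( 'Product', 'Product'[Unit Price] )
)
```

Writing 'Product'[Unit Price] alone as the second expression, outside of MAXX, would raise an error because no row context would be available to provide the row.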
Testing your understanding of evaluation contexts Before moving on to more complex descriptions about evaluation contexts, it is useful to test your understanding of contexts with a couple of examples. Please do not look at the explanation immediately; stop after the question and try to answer it. Then read the explanation to make sense of it. As a hint, try to remember, while thinking, “The filter context filters; the row context iterates. This means that the row context does not filter, and the filter context does not iterate.”
Using SUM in a calculated column The first test uses an aggregator inside a calculated column. What is the result of the following expression, used in a calculated column, in Sales? Sales[SumOfSalesQuantity] = SUM ( Sales[Quantity] )
Remember, this internally corresponds to this equivalent syntax: Sales[SumOfSalesQuantity] = SUMX ( Sales, Sales[Quantity] )
Because it is a calculated column, it is computed row by row in a row context. What number do you expect to see? Choose from these three answers:
■ The value of Quantity for that row, that is, a different value for each row.
■ The total of Quantity for all the rows, that is, the same value for all the rows.
■ An error; we cannot use SUM inside a calculated column.
Stop reading, please, while we wait for your educated guess before moving on. Here is the correct reasoning. You have learned that the formula means, “the sum of quantity for all the rows visible in the current filter context.” Moreover, because the code is executed for a calculated column, DAX evaluates the formula row by row, in a row context. Nevertheless, the row context is not filtering the table. The only context that can filter the table is the filter context. This turns the question into a different one: What is the filter context, when the formula is evaluated? The answer is straightforward: The filter context is empty. Indeed, the filter context is created by visuals or by queries, and a calculated column is computed at data refresh time when no filtering is happening. Thus, SUM works on the whole Sales table, aggregating the value of Sales[Quantity] for all the rows of Sales. The correct answer is the second answer. This calculated column computes the same value for each row, that is, the grand total of Sales[Quantity] repeated for all the rows. Figure 4-8 shows the result of the SumOfSalesQuantity calculated column.
FIGURE 4-8 SUM ( Sales[Quantity] ), in a calculated column, is computed against the entire database.
This example shows that the two evaluation contexts exist at the same time, but they do not interact. The evaluation contexts both work on the result of a formula, but they do so in different ways. Aggregators like SUM, MIN, and MAX only use the filter context, and they ignore the row context. If you have chosen the first answer, as many students typically do, it is perfectly normal. The thing is that you are still confusing the filter context and the row context. Remember, the filter context filters; the row context iterates. The first answer is the most common, when using intuitive logic, but it is wrong— now you know why. However, if you chose the correct answer ... then we are glad this section helped you in learning the important difference between the two contexts.
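The same behavior can be observed in a query, without adding a column to the model. The following is only a sketch: it uses ADDCOLUMNS, an iterator that adds columns to a table, and it assumes the query runs with no filters applied from outside. ADDCOLUMNS creates a row context on each brand, yet SUM ignores it.

```dax
EVALUATE
ADDCOLUMNS (
    VALUES ( 'Product'[Brand] ),       -- ADDCOLUMNS iterates one row per brand
    "Total Quantity",
        SUM ( Sales[Quantity] )        -- ignores the row context: same grand total on every row
)
ORDER BY 'Product'[Brand]
```

Every row of the result shows the grand total of Sales[Quantity], because the row context on the brand iterates but does not filter Sales.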
Using columns in a measure The second test is slightly different. Imagine we define the formula for the gross margin in a measure instead of in a calculated column. We have a column with the net price, another column for the product cost, and we write the following expression: GrossMargin% := ( Sales[Net Price] - Sales[Unit Cost] ) / Sales[Unit Cost]
What will the result be? As it happened earlier, choose among the three possible answers:
■ The expression works correctly; time to test the result in a report.
■ An error; we should not even write this formula.
■ We can define the formula, but it will return an error when used in a report.
As in the previous test, stop reading, think about the answer, and then read the following explanation.
The code references Sales[Net Price] and Sales[Unit Cost] without any aggregator. As such, DAX needs to retrieve the value of the columns for a certain row. DAX has no way of detecting which row the formula needs to be computed for because there is no iteration happening and the code is not in a calculated column. In other words, DAX is missing a row context that would make it possible to retrieve a value for the columns that are part of the expression. Remember that a measure does not have an automatic row context; only calculated columns do. If we need a row context in a measure, we should start an iteration. Thus, the second answer is the correct one. We cannot write the formula because it is syntactically wrong, and we get an error when trying to enter the code.
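One possible way to make the measure valid is to start an iteration, so that each column reference is evaluated in a row context. The sketch below is an assumption about the intended business logic, not a formulation given in the text: it weights each row by Sales[Quantity] and divides the total margin by the total cost. The variable names are introduced here only for illustration.

```dax
GrossMargin% :=
VAR TotalMargin =
    SUMX ( Sales, Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] ) )   -- row context from SUMX
VAR TotalCost =
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )                           -- row context from SUMX
RETURN
    DIVIDE ( TotalMargin, TotalCost )   -- DIVIDE returns BLANK instead of an error when TotalCost is zero
```

A different requirement, such as the average of the row-level percentages, would call for a different iterator (for example AVERAGEX), but the principle is the same: the column values are read inside an iteration.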
Using the row context with iterators You learned that DAX creates a row context whenever we define a calculated column or when we start an iteration with an X-function. When we use a calculated column, the presence of the row context is simple to use and understand. In fact, we can create simple calculated columns without even knowing about the presence of the row context. The reason is that the row context is created automatically by the engine. Therefore, we do not need to worry about the presence of the row context. On the other hand, when using iterators we are responsible for the creation and the handling of the row context. Moreover, by using iterators we can create multiple nested row contexts; this increases the complexity of the code. Therefore, it is important to understand more precisely the behavior of row contexts with iterators. For example, look at the following DAX measure: IncreasedSales := SUMX ( Sales, Sales[Net Price] * 1.1 )
Because SUMX is an iterator, SUMX creates a row context on the Sales table and uses it during the iteration. The row context iterates the Sales table (first parameter) and provides the current row to the second parameter during the iteration. In other words, DAX evaluates the inner expression (the second parameter of SUMX) in a row context containing the currently iterated row on the first parameter. Please note that the two parameters of SUMX use different contexts. In fact, any piece of DAX code works in the context where it is called. Thus, when the expression is executed, there might already be a filter context and one or many row contexts active. Look at the same expression with comments:
SUMX (
    Sales,                     -- External filter and row contexts
    Sales[Net Price] * 1.1     -- External filter and row contexts + new row context
)
The first parameter, Sales, is evaluated using the contexts coming from the caller. The second parameter (the expression) is evaluated using both the external contexts plus the newly created row context.
All iterators behave the same way:
1. Evaluate the first parameter in the existing contexts to determine the rows to scan.
2. Create a new row context for each row of the table evaluated in the previous step.
3. Iterate the table and evaluate the second parameter in the existing evaluation context, including the newly created row context.
4. Aggregate the values computed during the previous step.
Be mindful that the original contexts are still valid inside the expression. Iterators add a new row context; they do not modify existing filter contexts. For example, if the outer filter context contains a filter for the color Red, that filter is still active during the whole iteration. Besides, remember that the row context iterates; it does not filter. Therefore, no matter what, we cannot override the outer filter context using an iterator. This rule is always valid, but there is an important detail that is not trivial. If the previous contexts already contained a row context for the same table, then the newly created row context hides the previous existing row context on the same table. For DAX newbies, this is a possible source of mistakes. Therefore, we discuss row context hiding in more detail in the next two sections.
Nested row contexts on different tables The expression evaluated by an iterator can be very complex. Moreover, the expression can, on its own, contain further iterations. At first sight, starting an iteration inside another iteration might look strange. Still, it is a common DAX practice because nesting iterators produces powerful expressions. For example, the following code contains three nested iterators, and it scans three tables: Product Category, Product, and Sales.
SUMX (
    'Product Category',                      -- Scans the Product Category table
    SUMX (                                   -- For each category
        RELATEDTABLE ( 'Product' ),          -- Scans the category products
        SUMX (                               -- For each product
            RELATEDTABLE ( Sales ),          -- Scans the sales of that product
            Sales[Quantity]
                * 'Product'[Unit Price]      -- Computes the sales amount of that sale
                * 'Product Category'[Discount]
        )
    )
)
The innermost expression, the multiplication of three factors, references three tables. In fact, three row contexts are opened during that expression evaluation: one for each of the three tables that are currently being iterated. It is also worth noting that the two RELATEDTABLE functions return the rows of a related table starting from the current row context. Thus, RELATEDTABLE ( 'Product' ), being executed in a row context from the Product Category table, returns the products of the given category. The same reasoning applies to RELATEDTABLE ( Sales ), which returns the sales of the given product. The previous code is suboptimal in terms of both performance and readability. As a rule, it is fine to nest iterators provided that the number of rows to scan is not too large: hundreds is good, thousands is fine, millions is bad. Otherwise, we may easily hit performance issues. We used the previous code to demonstrate that it is possible to create multiple nested row contexts; we will see more useful examples of nested iterators later in the book. One can express the same calculation in a much faster and more readable way by using the following code, which relies on one individual row context and the RELATED function:
SUMX (
    Sales,
    Sales[Quantity]
        * RELATED ( 'Product'[Unit Price] )
        * RELATED ( 'Product Category'[Discount] )
)
Whenever there are multiple row contexts on different tables, one can use them to reference the iterated tables in a single DAX expression. There is one scenario, however, which proves to be challenging. This happens when we nest multiple row contexts on the same table, which is the topic covered in the following section.
Nested row contexts on the same table The scenario of having nested row contexts on the same table might seem rare. However, it does happen quite often, and more frequently in calculated columns. Imagine we want to rank products based on the list price. The most expensive product should be ranked 1, the second most expensive product should be ranked 2, and so on. We could solve the scenario using the RANKX function. But for educational purposes, we show how to solve it using simpler DAX functions. To compute the ranking, for each product we can count the number of products whose price is higher than the current product’s. If there is no product with a higher price than the current product price, then the current product is the most expensive and its ranking is 1. If there is only one product with a higher price, then the ranking is 2. In fact, what we are doing is computing the ranking of a product by counting the number of products with a higher price and adding 1 to the result. Therefore, one can author a calculated column using this code, where we used PriceOfCurrentProduct as a placeholder to indicate the price of the current product.
1. 'Product'[UnitPriceRank] =
2. COUNTROWS (
3.     FILTER (
4.         'Product',
5.         'Product'[Unit Price] > PriceOfCurrentProduct
6.     )
7. ) + 1
FILTER returns the products with a price higher than the current product’s price, and COUNTROWS counts the rows of the result of FILTER. The only remaining issue is finding a way to express the price of the current product, replacing PriceOfCurrentProduct with a valid DAX syntax. By “current,” we mean the value of the column in the current row when DAX computes the column. It is harder than you might expect. Focus your attention on line 5 of the previous code. There, the reference to Product[Unit Price] refers to the value of Unit Price in the current row context. What is the active row context when DAX executes line 5? There are two row contexts. Because the code is written in a calculated column, there is a default row context automatically created by the engine that scans the Product table. Moreover, FILTER being an iterator, there is the row context generated by FILTER that scans the Product table again. This is shown graphically in Figure 4-9.
[Figure 4-9: the UnitPriceRank formula with an outer box labeled “Row context of the calculated column” and an inner box labeled “Row context of the FILTER function”.]
FIGURE 4-9 During the evaluation of the innermost expression, there are two row contexts on the same table.
The outer box includes the row context of the calculated column, which is iterating over Product. However, the inner box shows the row context of the FILTER function, which is iterating over Product too. The expression Product[Unit Price] depends on the context. Therefore, a reference to Product[Unit Price] in the inner box can only refer to the row currently iterated by FILTER. The problem is that, in that box, we need the value of Unit Price in the row context of the calculated column, which is now hidden. Indeed, when one does not create a new row context using an iterator, the value of Product[Unit Price] is the desired value, that is, the value in the current row context of the calculated column, as in this simple piece of code: Product[Test] = Product[Unit Price]
To further demonstrate this, let us evaluate Product[Unit Price] in the two boxes, with some dummy code. What comes out are different results as shown in Figure 4-10, where we added the evaluation of Product[Unit Price] right before COUNTROWS, only for educational purposes.
[Figure 4-10: the UnitPriceRank formula with Product[Unit Price] added right before COUNTROWS. The reference outside FILTER is annotated “This is the value of the current product in the calculated column”; the reference inside FILTER is annotated “This is the value of the product iterated by FILTER”.]
FIGURE 4-10 Outside of the iteration, Product[Unit Price] refers to the row context of the calculated column.
Here is a recap of the scenario so far:
■ The inner row context, generated by FILTER, hides the outer row context.
■ We need to compare the inner Product[Unit Price] with the value of the outer Product[Unit Price].
■ If we write the comparison in the inner expression, we are unable to access the outer Product[Unit Price].
Because we can retrieve the current unit price if we evaluate it outside of the row context of FILTER, the best approach to this problem is saving the value of Product[Unit Price] in a variable. Indeed, one can evaluate the variable in the row context of the calculated column using this code:
'Product'[UnitPriceRank] =
VAR PriceOfCurrentProduct = 'Product'[Unit Price]
RETURN
    COUNTROWS (
        FILTER (
            'Product',
            'Product'[Unit Price] > PriceOfCurrentProduct
        )
    ) + 1
Moreover, it is even better to write the code in a more descriptive way by using more variables to separate the different steps of the calculation. This way, the code is also easier to follow:
'Product'[UnitPriceRank] =
VAR PriceOfCurrentProduct = 'Product'[Unit Price]
VAR MoreExpensiveProducts =
    FILTER (
        'Product',
        'Product'[Unit Price] > PriceOfCurrentProduct
    )
RETURN
    COUNTROWS ( MoreExpensiveProducts ) + 1
Figure 4-11 shows a graphical representation of the row contexts of this latter formulation of the code, which makes it easier to understand in which row context DAX computes each part of the formula.
Product[UnitPriceRank] =
VAR PriceOfCurrentProduct = Product[Unit Price]    -- This is the value of the current product in the calculated column
VAR MoreExpensiveProducts =
    FILTER (
        Product,
        Product[Unit Price] > PriceOfCurrentProduct    -- This is the value of the product iterated by FILTER
    )
RETURN
    COUNTROWS ( MoreExpensiveProducts ) + 1
FIGURE 4-11 The value of PriceOfCurrentProduct is evaluated in the outer row context.
Figure 4-12 shows the result of this calculated column.
FIGURE 4-12 UnitPriceRank is a useful example of how to use variables to navigate within nested row contexts.
Because there are 14 products with the same unit price, their rank is always 1; the fifteenth product has a rank of 15, shared with other products with the same price. It would be great if we could rank 1, 2, 3 instead of 1, 15, 19 as is the case in the figure. We will fix this soon but, before that, it is important to make a small digression.
To solve a scenario like the one proposed, it is necessary to have a solid understanding of what a row context is, to be able to detect which row context is active in different parts of the formula and, most importantly, to conceive how the row context affects the value returned by a DAX expression. It is worth stressing that the same expression Product[Unit Price], evaluated in two different parts of the formula, returns different values because of the different contexts under which it is evaluated. When one does not have a solid understanding of evaluation contexts, it is extremely hard to work on such complex code.
As you have seen, a simple ranking expression with two row contexts proves to be a challenge. Later in Chapter 5 you learn how to create multiple filter contexts. At that point, the complexity of the code increases a lot. However, if you understand evaluation contexts, these scenarios are simple. Before moving to the next level in DAX, you need to understand evaluation contexts well. This is the reason why we urge you to read this whole section again (and maybe the whole chapter so far) until these concepts are crystal clear. It will make reading the next chapters much easier and your learning experience much smoother.
Before leaving this example, we need to solve the last detail: ranking using a sequence of 1, 2, 3 instead of the sequence obtained so far. The solution is easier than expected. In fact, in the previous code we focused on counting the products with a higher price. By doing that, the formula counted 14 products ranked 1 and assigned 15 to the second ranking level.
However, counting products is not very useful. If the formula counted the prices higher than the current price, rather than the products, then all 14 products would be collapsed into a single price.
'Product'[UnitPriceRankDense] =
VAR PriceOfCurrentProduct = 'Product'[Unit Price]
VAR HigherPrices =
    FILTER (
        VALUES ( 'Product'[Unit Price] ),
        'Product'[Unit Price] > PriceOfCurrentProduct
    )
RETURN
    COUNTROWS ( HigherPrices ) + 1
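One way to check the dense ranking without adding a column to the model is to run a query that lists each distinct price next to its rank. The following is a sketch under a few assumptions: it uses ADDCOLUMNS, which this chapter has not introduced, and the column name "Dense Rank" is arbitrary.

```dax
-- Sketch of a query that returns one row per distinct price with its dense rank.
-- A query has no incoming filter context, so VALUES returns all the prices.
EVALUATE
ADDCOLUMNS (
    VALUES ( 'Product'[Unit Price] ),
    "Dense Rank",
        -- Saved in the row context introduced by ADDCOLUMNS
        VAR PriceOfCurrentRow = 'Product'[Unit Price]
        VAR HigherPrices =
            FILTER (
                VALUES ( 'Product'[Unit Price] ),
                'Product'[Unit Price] > PriceOfCurrentRow
            )
        RETURN
            COUNTROWS ( HigherPrices ) + 1
)
ORDER BY 'Product'[Unit Price] DESC
```

The variable plays the same role as in the calculated column: it captures the price of the outer row before FILTER starts its own iteration.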
Figure 4-13 shows the new calculated column, along with UnitPriceRank.
FIGURE 4-13 UnitPriceRankDense returns a more useful ranking because it counts prices, not products.
This final small step is counting prices instead of counting products, and it might seem harder than expected. The more you work with DAX, the easier it will become to start thinking in terms of ad hoc temporary tables created for the purpose of a calculation. In this example you learned that the best technique to handle multiple row contexts on the same table is by using variables. Keep in mind that variables were introduced in the DAX language as late as 2015. You might find existing DAX code—written before the age of variables—that uses another technique to access outer row contexts: the EARLIER function, which we describe in the next section.
Using the EARLIER function
DAX provides a function that accesses the outer row contexts: EARLIER. EARLIER retrieves the value of a column by using the previous row context instead of the last one. Therefore, we can express the value of PriceOfCurrentProduct using EARLIER ( Product[Unit Price] ).
Many DAX newbies feel intimidated by EARLIER because they do not understand row contexts well enough and they do not realize that they can nest row contexts by creating multiple iterations over the same table. EARLIER is a simple function, once you understand the concept of row context and nesting. For example, the following code solves the previous scenario without using variables:
'Product'[UnitPriceRankDense] =
COUNTROWS (
    FILTER (
        VALUES ( 'Product'[Unit Price] ),
        'Product'[Unit Price] > EARLIER ( 'Product'[Unit Price] )
    )
) + 1
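For comparison, the non-dense ranking can be written with EARLIER in the same way. This is only a sketch for reading older code; the column name UnitPriceRankEarlier is hypothetical.

```dax
-- Same logic as UnitPriceRank, written without variables.
-- Inside FILTER, 'Product'[Unit Price] is the price of the row iterated by FILTER,
-- whereas EARLIER ( 'Product'[Unit Price] ) is the price of the row
-- of the calculated column (the outer row context).
'Product'[UnitPriceRankEarlier] =
COUNTROWS (
    FILTER (
        'Product',
        'Product'[Unit Price] > EARLIER ( 'Product'[Unit Price] )
    )
) + 1
```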
Note EARLIER accepts a second parameter, which is the number of steps to skip, so that one can skip two or more row contexts. Moreover, there is also a function named EARLIEST that lets a developer access the outermost row context defined for a table. In the real world, neither EARLIEST nor the second parameter of EARLIER is used often. Though having two nested row contexts is a common scenario in calculated columns, having three or more of them is something that rarely happens. Besides, since the advent of variables, EARLIER has virtually become useless because variable usage superseded EARLIER. The only reason to learn EARLIER is to be able to read existing DAX code. There are no further reasons to use EARLIER in newer DAX code because variables are a better way to save the required value when the right row context is accessible. Using variables for this purpose is a best practice and results in more readable code.
Understanding FILTER, ALL, and context interactions
In the preceding examples, we used FILTER as a convenient way of filtering a table. FILTER is a common function to use whenever one wants to apply a filter that further restricts the existing filter context. Imagine that we want to create a measure that counts the number of red products. With the knowledge gained so far, the formula is easy:
NumOfRedProducts :=
VAR RedProducts =
    FILTER (
        'Product',
        'Product'[Color] = "Red"
    )
RETURN
    COUNTROWS ( RedProducts )
We can use this formula inside a report. For example, put the product brand on the rows to produce the report shown in Figure 4-14.
FIGURE 4-14 We can count the number of red products using the FILTER function.
Before moving on with this example, stop for a moment and think carefully about how DAX computed these values. Brand is a column of the Product table. Inside each cell of the report, the filter context filters one given brand. Therefore, each cell shows the number of products of the given brand that are also red. The reason for this is that FILTER iterates the Product table as it is visible in the current filter context, which only contains products with that specific brand. It might seem trivial, but it is better to repeat this a few times than there being a chance of forgetting it. This is more evident if we add a slicer to the report filtering the color. In Figure 4-15 there are two identical reports with two slicers filtering color, where each slicer only filters the report on its immediate right. The report on the left filters Red and the numbers are the same as in Figure 4-14, whereas the report on the right is empty because the slicer is filtering Azure.
FIGURE 4-15 DAX evaluates NumOfRedProducts taking into account the outer context defined by the slicer.
In the report on the right, the Product table iterated by FILTER only contains Azure products, and, because FILTER can only return Red products, there are no products to return. As a result, the NumOfRedProducts measure always evaluates to blank.
The important part of this example is the fact that in the same formula, there are both a filter context coming from the outside (the cell in the report, which is affected by the slicer selection) and a row context introduced in the formula by the FILTER function. Both contexts work at the same time and modify the result. DAX uses the filter context to evaluate the Product table, and the row context to evaluate the filter condition row by row during the iteration made by FILTER.
We want to repeat this concept again: FILTER does not change the filter context. FILTER is an iterator that scans a table (already filtered by the filter context) and it returns a subset of that table, according to the filtering condition. In Figure 4-14, the filter context is filtering the brand and, after FILTER returned the result, it still only filtered the brand. Once we added the slicer on the color in Figure 4-15, the filter context contained both the brand and the color. For this reason, in the left-hand side report FILTER returned all the products iterated, and in the right-hand side report it did not return any product. In both reports, FILTER did not change the filter context. FILTER only scanned a table and returned a filtered result.
At this point, one might want to define another formula that returns the number of red products regardless of the selection done on the slicer. In other words, the code needs to ignore the selection made on the slicer and must always return the number of all the red products. To accomplish this, the ALL function comes in handy. ALL returns the content of a table ignoring the filter context. We can define a new measure, named NumOfAllRedProducts, by using this expression:
NumOfAllRedProducts :=
VAR AllRedProducts =
    FILTER (
        ALL ( 'Product' ),
        'Product'[Color] = "Red"
    )
RETURN
    COUNTROWS ( AllRedProducts )
This time, FILTER does not iterate Product. Instead, it iterates ALL ( Product ). ALL ignores the filter context and always returns all the rows of the table, so that FILTER returns the red products even if products were previously filtered by another brand or color. The result shown in Figure 4-16—although correct—might be surprising.
FIGURE 4-16 NumOfAllRedProducts returns strange results.
There are a couple of interesting things to note here, and we want to describe both in more detail:
■ The result is always 99, regardless of the brand selected on the rows.
■ The brands in the left matrix are different from the brands in the right matrix.
First, 99 is the total number of red products, not the number of red products of any given brand. ALL, as expected, ignores the filters on the Product table. It not only ignores the filter on the color, but it also ignores the filter on the brand. This might be an undesired effect. ALL is easy and powerful, but it is an all-or-nothing function. If used, ALL ignores all the filters applied to the table specified as its argument. With the knowledge you have gained so far, you cannot yet choose to only ignore part of the filter. In the example, it would have been better to only ignore the filter on the color. Only after the next chapter, with the introduction of CALCULATE, will you have better options to achieve the selective ignoring of filters.
Let us now describe the second point: The brands on the two reports are different. Because the slicer is filtering one color, the full matrix is computed with the filter on the color. On the left the color is Red, whereas on the right the color is Azure. This determines two different sets of products, and consequently, of brands. The list of brands used to populate the axis of the report is computed in the original filter context, which contains a filter on color. Once the axes have been computed, then DAX computes values for the measure, always returning 99 as a result regardless of the brand and color. Thus, the report on the left shows the brands of red products, whereas the report on the right shows the brands of azure products, although in both reports the measure shows the total of all the red products, regardless of their brand.
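The right-hand report can be approximated with a query, which makes the two steps (computing the brands, then computing the measures) easier to see. This is a sketch, not the exact query a client tool generates: the measures are redefined locally with DEFINE MEASURE, and the slicer is simulated with a filter argument on Product[Color].

```dax
-- Local definitions of the two measures discussed in this section.
DEFINE
    MEASURE 'Product'[NumOfRedProducts] =
        COUNTROWS ( FILTER ( 'Product', 'Product'[Color] = "Red" ) )
    MEASURE 'Product'[NumOfAllRedProducts] =
        COUNTROWS ( FILTER ( ALL ( 'Product' ), 'Product'[Color] = "Red" ) )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],                                       -- rows of the report
    FILTER (                                                -- simulates the slicer on Azure
        ALL ( 'Product'[Color] ),
        'Product'[Color] = "Azure"
    ),
    "NumOfRedProducts", [NumOfRedProducts],
    "NumOfAllRedProducts", [NumOfAllRedProducts]
)
ORDER BY 'Product'[Brand]
```

If the model behaves as described, the query should list only the brands that have azure products, with a blank NumOfRedProducts and the same NumOfAllRedProducts value on every row.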
Note The behavior of the report is not specific to DAX, but rather to the SUMMARIZECOLUMNS function used by Power BI. We cover SUMMARIZECOLUMNS in Chapter 13, “Authoring queries.”
We do not want to further explore this scenario right now. The solution comes later when you learn CALCULATE, which offers a lot more power (and complexity) for the handling of filter contexts. As of now, we used this example to show that you might find unexpected results from relatively simple formulas because of context interactions and the coexistence, in the same expression, of filter and row contexts.
Working with several tables
Now that you have learned the basics of evaluation contexts, we can describe how the context behaves when it comes to relationships. In fact, few data models contain just one single table. There would most likely be several tables, linked by relationships. If there is a relationship between Sales and Product, does a filter context on Product filter Sales, too? And what about a filter on Sales, is it filtering Product? Because there are two types of evaluation contexts (the row context and the filter context) and relationships have two sides (a one-side and a many-side), there are four different scenarios to analyze.
The answer to these questions is already found in the mantra you are learning in this chapter, “The filter context filters; the row context iterates” and in its consequence, “The filter context does not iterate; the row context does not filter.” To examine the scenario, we use a data model containing six tables, as shown in Figure 4-17.
FIGURE 4-17 Data model used to learn the interaction between contexts and relationships.
The model presents a couple of noteworthy details:
■ There is a chain of relationships starting from Sales and reaching Product Category, through Product and Product Subcategory.
■ The only bidirectional relationship is between Sales and Product. All remaining relationships are set to be single cross-filter direction.
This model is going to be useful when looking at the details of evaluation contexts and relationships in the next sections.
Row contexts and relationships
The row context iterates; it does not filter. Iteration is the process of scanning a table row by row and of performing an operation in the meantime. Usually, one wants some kind of aggregation like sum or average. During an iteration, the row context is iterating an individual table, and it provides a value to all the columns of the table, and only that table. Other tables, although related to the iterated table, do not have a row context on them. In other words, the row context does not interact automatically with relationships.
Consider as an example a calculated column in the Sales table containing the difference between the unit price stored in the fact table and the unit price stored in the Product table. The following DAX code does not work because it uses the Product[Unit Price] column and there is no row context on Product:
Sales[UnitPriceVariance] = Sales[Unit Price] - 'Product'[Unit Price]
This being a calculated column, DAX automatically generates a row context on the table containing the column, which is the Sales table. The row context on Sales provides a row-by-row evaluation of expressions using the columns in Sales. Even though Product is on the one-side of a one-to-many relationship with Sales, the iteration is happening on the Sales table only. When we are iterating on the many-side of a relationship, we can access columns on the one-side of the relationship, but we must use the RELATED function. RELATED accepts a column reference as the parameter and retrieves the value of the column in the corresponding row in the target table. RELATED can only reference one column and multiple RELATED functions are required to access more than one column on the one-side of the relationship. The correct version of the previous code is the following: Sales[UnitPriceVariance] = Sales[Unit Price] - RELATED ( 'Product'[Unit Price] )
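As a small illustration of the one-column-per-call rule, the following sketch builds a text label from two columns of Product. The column name ProductDescription is hypothetical; Brand and Color are the Product columns used elsewhere in this chapter.

```dax
-- Hypothetical calculated column in Sales.
-- Each column read from Product needs its own RELATED call.
Sales[ProductDescription] =
RELATED ( 'Product'[Brand] ) & " - " & RELATED ( 'Product'[Color] )
```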
RELATED requires a row context (that is, an iteration) on the table on the many-side of a relationship. If the row context were active on the one-side of a relationship, then RELATED would no longer be useful because RELATED would find multiple rows by following the relationship. In this case, that is, when iterating the one-side of a relationship, the function to use is RELATEDTABLE. RELATEDTABLE returns all the rows of the table on the many-side that are related to the currently iterated row. For example, if one wants to compute the number of sales of each product, the following formula defined as a calculated column on Product solves the problem:
Product[NumberOfSales] =
VAR SalesOfCurrentProduct = RELATEDTABLE ( Sales )
RETURN
    COUNTROWS ( SalesOfCurrentProduct )
This expression counts the number of rows in the Sales table that correspond to the current product. The result is visible in Figure 4-18.
FIGURE 4-18 RELATEDTABLE is useful in a row context on the one-side of the relationship.
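Because RELATEDTABLE returns a table, its result can be passed to any iterator, not only to COUNTROWS. The following sketch sums the quantity sold for each product; the column name QuantitySold is made up, and Sales[Quantity] is assumed to be the quantity column of the Sales table.

```dax
-- Hypothetical calculated column in Product.
-- RELATEDTABLE returns the Sales rows of the current product;
-- SUMX then iterates those rows in a new row context on Sales.
'Product'[QuantitySold] =
VAR SalesOfCurrentProduct = RELATEDTABLE ( Sales )
RETURN
    SUMX ( SalesOfCurrentProduct, Sales[Quantity] )
```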
Both RELATED and RELATEDTABLE can traverse a chain of relationships; they are not limited to a single hop. For example, one can create a column with the same code as before but, this time, in the Product Category table: 'Product Category'[NumberOfSales] = VAR SalesOfCurrentProductCategory = RELATEDTABLE ( Sales ) RETURN COUNTROWS ( SalesOfCurrentProductCategory )
The result is the number of sales for the category, which traverses the chain of relationships from Product Category to Product Subcategory, then to Product to finally reach the Sales table. In a similar way, one can create a calculated column in the Product table that copies the category name from the Product Category table. 'Product'[Category] = RELATED ( 'Product Category'[Category] )
In this case, a single RELATED function traverses the chain of relationships from Product to Product Subcategory to Product Category.
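The same idea applies from other starting points in the chain. As a sketch, with hypothetical column names on the left-hand side, the first column below walks three relationships from Sales up to Product Category, and the second one walks two relationships from Product Category down to Product.

```dax
-- From the many-side (Sales) up to Product Category:
-- Sales -> Product -> Product Subcategory -> Product Category
Sales[Category] = RELATED ( 'Product Category'[Category] )

-- From the one-side (Product Category) down to Product:
-- Product Category -> Product Subcategory -> Product
'Product Category'[NumberOfProducts] =
COUNTROWS ( RELATEDTABLE ( 'Product' ) )
```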
Note The only exception to the general rule of RELATED and RELATEDTABLE is for one-to-one relationships. If two tables share a one-to-one relationship, then both RELATED and RELATEDTABLE work in both tables and they result either in a column value or in a table with a single row, depending on the function used.
Regarding chains of relationships, all the relationships need to be of the same type, that is, one-to-many or many-to-one. If the chain links two tables through a one-to-many relationship to a bridge table, followed by a many-to-one relationship to the second table, then neither RELATED nor RELATEDTABLE works with single-direction filter propagation. Only RELATEDTABLE can work using bidirectional filter propagation, as explained later. On the other hand, a one-to-one relationship behaves as a one-to-many and as a many-to-one relationship at the same time. Thus, there can be a one-to-one relationship in a chain of one-to-many (or many-to-one) without interrupting the chain.
For example, in the model we chose as a reference, Customer is related to Sales and Sales is related to Product. There is a one-to-many relationship between Customer and Sales, and then a many-to-one relationship between Sales and Product. Thus, a chain of relationships links Customer to Product. However, the two relationships are not in the same direction. This scenario is known as a many-to-many relationship. A customer is related to many products bought and a product is in turn related to many customers who bought that product. We cover many-to-many relationships later in Chapter 15, “Advanced relationships”; let us focus on row context, for the moment. If one uses RELATEDTABLE through a many-to-many relationship, the result would be wrong. Consider a calculated column in Product with this formula:
Product[NumOfBuyingCustomers] =
VAR CustomersOfCurrentProduct = RELATEDTABLE ( Customer )
RETURN
    COUNTROWS ( CustomersOfCurrentProduct )
The result of the previous code is not the number of customers who bought that product. Instead, the result is the total number of customers, as shown in Figure 4-19.
FIGURE 4-19 RELATEDTABLE does not work over a many-to-many relationship.
RELATEDTABLE cannot follow the chain of relationships because they are not going in the same direction. The row context from Product does not reach Customers. It is worth noting that if we try the formula in the opposite direction, that is, if we count the number of products bought for each customer, the result is correct: a different number for each row representing the number of products bought by the customer. The reason for this behavior is not the propagation of a row context but, rather, the context transition generated by RELATEDTABLE. We added this final note for full disclosure. It is not time to elaborate on this just yet. You will have a better understanding of this after reading Chapter 5.
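Staying within the functions seen so far, one possible workaround is to stop at Sales, which RELATEDTABLE can reach from Product, and count the distinct customer keys found there. This is a sketch rather than the approach covered in Chapter 15, and it assumes that Sales[CustomerKey] is the column linking Sales to Customer.

```dax
-- Sketch: count the distinct customer keys in the sales of the current product.
-- Assumption: Sales[CustomerKey] identifies the customer of each sale.
'Product'[NumOfBuyingCustomers] =
VAR SalesOfCurrentProduct = RELATEDTABLE ( Sales )
VAR BuyingCustomerKeys =
    SUMMARIZE ( SalesOfCurrentProduct, Sales[CustomerKey] )
RETURN
    COUNTROWS ( BuyingCustomerKeys )
```

For a product without sales, RELATEDTABLE returns an empty table and the column is blank.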
Filter context and relationships
In the previous section, you learned that the row context iterates and, as such, that it does not use relationships. The filter context, on the other hand, filters. A filter context is not applied to an individual table. Instead, it always works on the whole model. At this point, you can update the evaluation context mantra to its complete formulation: The filter context filters the model; the row context iterates one table.
Because a filter context filters the model, it uses relationships. The filter context interacts with relationships automatically, and it behaves differently depending on how the cross-filter direction of the relationship is set. The cross-filter direction is represented with a small arrow in the middle of a relationship, as shown in Figure 4-20.
FIGURE 4-20 Behavior of filter context and relationships.
The filter context uses a relationship by going in the direction allowed by the arrow. In all relationships the arrow allows propagation from the one-side to the many-side, whereas when the cross-filter direction is BOTH, propagation is allowed from the many-side to the one-side too. A relationship with a single cross-filter is a unidirectional relationship, whereas a relationship with BOTH cross-filter directions is a bidirectional relationship. This behavior is intuitive. Although we have not explained this sooner, all the reports we have used so far relied on this behavior. Indeed, in a typical report filtering by Product[Color] and aggregating the Sales[Quantity], one would expect the filter from Product to propagate to Sales. This is exactly what happens: Product is on the one-side of a relationship; thus a filter on Product propagates to Sales, regardless of the cross-filter direction.
Because our sample data model contains both a bidirectional relationship and many unidirectional relationships, we can demonstrate the filtering behavior by using three different measures that count the number of rows in the three tables: Sales, Product, and Customer.
[NumOfSales] := COUNTROWS ( Sales )
[NumOfProducts] := COUNTROWS ( Product )
[NumOfCustomers] := COUNTROWS ( Customer )
The report contains the Product[Color] on the rows. Therefore, each cell is evaluated in a filter context that filters the product color. Figure 4-21 shows the result.
FIGURE 4-21 This shows the behavior of filter context and relationships.
In this first example, the filter is always propagating from the one-side to the many-side of relationships. The filter starts from Product[Color]. From there, it reaches Sales, which is on the many-side of the relationship with Product, and Product, because it is the very same table. On the other hand, NumOfCustomers always shows the same value—the total number of customers. This is because the relationship between Customer and Sales does not allow propagation from Sales to Customer. The filter is moved from Product to Sales, but from there it does not reach Customer. You might have noticed that the relationship between Sales and Product is a bidirectional relationship. Thus, a filter context on Customer also filters Sales and Product. We can prove it by changing the report, slicing by Customer[Education] instead of Product[Color]. The result is visible in Figure 4-22.
FIGURE 4-22 Filtering by customer education, the Product table is filtered too.
This time the filter starts from Customer. It can reach the Sales table because Sales is on the many-side of the relationship. Furthermore, it propagates from Sales to Product because the relationship between Sales and Product is bidirectional: its cross-filter direction is BOTH. Beware that a single bidirectional relationship in a chain does not make the whole chain bidirectional. In fact, a similar measure that counts the number of subcategories, such as the following one, demonstrates that the filter context starting from Customer does not reach Product Subcategory:
NumOfSubcategories := COUNTROWS ( 'Product Subcategory' )
Adding the measure to the previous report produces the results shown in Figure 4-23, where the number of subcategories is the same for all the rows.
FIGURE 4-23 If the relationship is unidirectional, customers cannot filter subcategories.
Because the relationship between Product and Product Subcategory is unidirectional, the filter does not propagate to Product Subcategory. If we update the relationship, setting the cross-filter direction to BOTH, the result is different as shown in Figure 4-24.
FIGURE 4-24 If the relationship is bidirectional, customers can filter subcategories too.
With the row context, we use RELATED and RELATEDTABLE to propagate the row context through relationships. On the other hand, with the filter context, no functions are needed to propagate the filter. The filter context filters the model, not a table. As such, once one applies a filter context, the entire model is subject to the filter according to the relationships.
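The reports used in this section can be approximated with a query, which is a convenient way to experiment with filter propagation. The sketch below redefines the four measures locally (hosting them in the Sales table is an arbitrary choice) and groups by Customer[Education], as in Figure 4-23.

```dax
DEFINE
    MEASURE Sales[NumOfSales] = COUNTROWS ( Sales )
    MEASURE Sales[NumOfProducts] = COUNTROWS ( 'Product' )
    MEASURE Sales[NumOfCustomers] = COUNTROWS ( Customer )
    MEASURE Sales[NumOfSubcategories] = COUNTROWS ( 'Product Subcategory' )
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Education],              -- the filter starts from Customer
    "NumOfSales", [NumOfSales],
    "NumOfProducts", [NumOfProducts],
    "NumOfCustomers", [NumOfCustomers],
    "NumOfSubcategories", [NumOfSubcategories]
)
ORDER BY Customer[Education]
```

No RELATED or RELATEDTABLE appears anywhere: each measure is evaluated in a filter context that reaches the other tables only through the relationships and their cross-filter direction.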
Important From the examples, it may look like enabling bidirectional filtering on all the relationships is a good option to let the filter context propagate to the whole model. This is definitely not the case. We will cover advanced relationships in depth later, in Chapter 15. Bidirectional filters come with a lot more complexity than what we can share with this introductory chapter, and you should not use them unless you have a clear idea of the consequences. As a rule, you should enable bidirectional filters in specific measures by using the CROSSFILTER function, and only when strictly required.
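As a preview of what enabling a bidirectional filter in a single measure might look like, here is a sketch based on CROSSFILTER. It relies on CALCULATE, which is introduced only in the next chapter, and the key column names are assumptions made for illustration; they are not taken from the sample model.

```dax
-- Sketch only: the two key column names below are assumed, not confirmed.
NumOfSubcategoriesBoth :=
CALCULATE (
    COUNTROWS ( 'Product Subcategory' ),
    CROSSFILTER (
        'Product'[ProductSubcategoryKey],               -- assumed key on the many-side
        'Product Subcategory'[ProductSubcategoryKey],   -- assumed key on the one-side
        BOTH                                            -- bidirectional for this measure only
    )
)
```

The relationship in the model stays unidirectional; the bidirectional behavior applies only while this measure is being evaluated.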
Using DISTINCT and SUMMARIZE in filter contexts
Now that you have a solid understanding of evaluation contexts, we can use this knowledge to solve a scenario step-by-step. In the meantime, we provide the analysis of a few details that, hopefully, will shed more light on the fundamental concepts of row context and filter context. Besides, in this example we also further describe the SUMMARIZE function, briefly introduced in Chapter 3, “Using basic table functions.”
Before going into more details, please note that this example shows several inaccurate calculations before reaching the correct solution. The purpose is educational because we want to teach the process of writing DAX code rather than give a solution. In the process of authoring a measure, it is likely you will make several initial errors. In this guided example, we describe the correct way of reasoning, which helps you solve similar errors by yourself.
The requirement is to compute the average age of customers of Contoso. Even though this looks like a legitimate requirement, it is not complete. Are we speaking about their current age or their age at the time of the sale? If a customer buys three times, should it count as one event or as three events in the average? What if they buy three times at different ages? We need to be more precise. Here is the more complete requirement: “Compute the average age of customers at the time of sale, counting each customer only once if they made multiple purchases at the same age.”
The solution can be split into two steps:
■ Computing the age of the customer when the sale happened
■ Averaging it
The age of the customer changes for every sale. Thus, the age needs to be stored in the Sales table. For each row in Sales, one can compute the age of the customer at the time when the sale happened. A calculated column perfectly fits this need:
Sales[Customer Age] =
DATEDIFF (                              -- Compute the difference between
    RELATED ( Customer[Birth Date] ),   -- the customer’s birth date
    Sales[Order Date],                  -- and the date of the sale
    YEAR                                -- in years
)
Because Customer Age is a calculated column, it is evaluated in a row context that iterates Sales. The formula needs to access Customer[Birth Date], which is a column in Customer, on the one-side of a relationship with Sales. In this case, RELATED is needed to let DAX access the target table. In the sample database Contoso, there are many customers for whom the birth date is blank. DATEDIFF returns blank if the first parameter is blank. Because the requirement is to provide the average, a first—and inaccurate—solution might be a measure that averages this column: Avg Customer Age Wrong := AVERAGE ( Sales[Customer Age] )
The result is incorrect because Sales[Customer Age] contains multiple rows with the same age if a customer made multiple purchases at a certain age. The requirement is to compute each customer only once, and this formula is not following such a requirement. Figure 4-25 shows the result of this last measure side-by-side with the expected result.
FIGURE 4-25 A simple average computes the wrong result for the customer’s age.
Here is the problem: The age of each customer must be counted only once. A possible solution, still inaccurate, would be to perform a DISTINCT of the customer ages and then average it, with the following measure:
Avg Customer Age Wrong Distinct :=
AVERAGEX (                              -- Iterate on the distinct values of
    DISTINCT ( Sales[Customer Age] ),   -- Sales[Customer Age] and compute the
    Sales[Customer Age]                 -- average of the customer’s age
)
This solution is not the correct one yet. In fact, DISTINCT returns the distinct values of the customer age. Two customers with the same age would be counted only once by this formula. The requirement is to count each customer once, whereas this formula is counting each age once. In fact, Figure 4-26 shows the report with the new formulation of Avg Customer Age. You see that this solution is still inaccurate.
FIGURE 4-26 The average of the distinct customer ages still provides a wrong result.
In the last formula, one might try to replace Customer Age with CustomerKey as the parameter of DISTINCT, as in the following code:
Avg Customer Age Invalid Syntax :=
AVERAGEX (                              -- Iterate on the distinct values of
    DISTINCT ( Sales[CustomerKey] ),    -- Sales[CustomerKey] and compute the
    Sales[Customer Age]                 -- average of the customer’s age
)
This code contains an error and DAX will not accept it. Can you spot the reason, without reading the solution we provide in the next paragraph?
AVERAGEX generates a row context that iterates a table. The table provided as the first parameter to AVERAGEX is DISTINCT ( Sales[CustomerKey] ). DISTINCT returns a table with one column only, and all the unique values of the customer key. Therefore, the row context generated by AVERAGEX only contains one column, namely Sales[CustomerKey]. DAX cannot evaluate Sales[Customer Age] in a row context that only contains Sales[CustomerKey].
What is needed is a row context that has the granularity of Sales[CustomerKey] but that also contains Sales[Customer Age]. SUMMARIZE, introduced in Chapter 3, can generate the existing unique combinations of two columns. Now we can finally show a version of this code that implements all the requirements:
Correct Average :=
AVERAGEX (                          -- Iterate on all the existing
    SUMMARIZE (                     -- combinations that exist
        Sales,                      -- in Sales
        Sales[CustomerKey],         -- of the customer key
        Sales[Customer Age]         -- and the customer age
    ),
    Sales[Customer Age]             -- and average the customer’s age
)
As usual, it is possible to use a variable to split the calculation into multiple steps. Note that the access to the Customer Age column still requires a reference to the Sales table name in the second argument of the AVERAGEX function. A variable can contain a table, but it cannot be used as a table reference.
Correct Average :=
VAR CustomersAge =
    SUMMARIZE (                 -- Existing combinations
        Sales,                  -- that exist in Sales
        Sales[CustomerKey],     -- of the customer key
        Sales[Customer Age]     -- and the customer age
    )
RETURN
    AVERAGEX (                  -- Iterate on list of
        CustomersAge,           -- Customers/age in Sales
        Sales[Customer Age]     -- and average the customer's age
    )
SUMMARIZE generates all the combinations of customer and age available in the current filter context. Thus, multiple customers with the same age will duplicate the age, once per customer. AVERAGEX ignores the presence of CustomerKey in the table; it only uses the customer age. CustomerKey is only needed to count the correct number of occurrences of each age. It is worth stressing that the full measure is executed in the filter context generated by the report. Thus, only the customers who bought something are evaluated and returned by SUMMARIZE. Every cell of the report has a different filter context, only considering the customers who purchased at least one product of the color displayed in the report.
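One way to convince yourself of this behavior is to look at the table that SUMMARIZE produces and to compare the two averages side by side. The query below is only a sketch: it assumes a client that runs DAX queries (such as DAX Studio) and the Sales columns used in this section. The measure names Avg Age Distinct and Avg Age Per Customer are hypothetical and local to the query.

```dax
DEFINE
    -- Hypothetical query-local measures, named only for this illustration
    MEASURE Sales[Avg Age Distinct] =
        AVERAGEX ( DISTINCT ( Sales[Customer Age] ), Sales[Customer Age] )
    MEASURE Sales[Avg Age Per Customer] =
        AVERAGEX (
            SUMMARIZE ( Sales, Sales[CustomerKey], Sales[Customer Age] ),
            Sales[Customer Age]
        )

-- First result: one row per existing customer/age combination in Sales
EVALUATE
SUMMARIZE ( Sales, Sales[CustomerKey], Sales[Customer Age] )
ORDER BY Sales[CustomerKey]

-- Second result: the two averages computed over the whole model
EVALUATE
ROW (
    "Avg Age Distinct", [Avg Age Distinct],
    "Avg Age Per Customer", [Avg Age Per Customer]
)
```

In the first result, an age shared by several customers appears once per customer, which is exactly what makes the second average weigh each customer once.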
Conclusions
It is time to recap the most relevant topics you learned in this chapter about evaluation contexts.
■ There are two evaluation contexts: the filter context and the row context. The two evaluation contexts are not variations of the same concept: the filter context filters the model; the row context iterates one table.
■ To understand a formula’s behavior, you always need to consider both evaluation contexts because they operate at the same time.
■ DAX creates a row context automatically for a calculated column. One can also create a row context programmatically by using an iterator. Every iterator defines a row context.
■ You can nest row contexts and, in case they are on the same table, the innermost row context hides the previous row contexts on the same table. Variables are useful to store values retrieved when the required row context is accessible. In earlier versions of DAX where variables were not available, the EARLIER function was used to get access to the previous row context. As of today, using EARLIER is discouraged.
■ When iterating over a table that is the result of a table expression, the row context only contains the columns returned by the table expression.
■ Client tools like Power BI create a filter context when you use fields on rows, columns, slicers, and filters. A filter context can also be created programmatically by using CALCULATE, which we introduce in the next chapter.
■ The row context does not propagate through relationships automatically. One needs to force the propagation by using RELATED and RELATEDTABLE. You need to use these functions in a row context on the correct side of a one-to-many relationship: RELATED on the many-side, RELATEDTABLE on the one-side.
■ The filter context filters the model, and it uses relationships according to their cross-filter direction. It always propagates from the one-side to the many-side. In addition, if you use the cross-filtering direction BOTH, then the propagation also happens from the many-side to the one-side.
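As a reminder of the point about RELATED and RELATEDTABLE, here is a minimal sketch of two calculated columns, one on each side of the relationship between Sales and Product. The column names Product Brand and Number Of Sales are hypothetical; only the Sales and Product tables and the Product[Brand] column come from the examples in the text.

```dax
-- Calculated column in Sales (the many-side):
-- RELATED follows the relationship to the single related product.
Sales[Product Brand] = RELATED ( 'Product'[Brand] )

-- Calculated column in Product (the one-side):
-- RELATEDTABLE returns the Sales rows related to the current product.
'Product'[Number Of Sales] = COUNTROWS ( RELATEDTABLE ( Sales ) )
```

Both columns rely on the row context that DAX creates automatically for a calculated column.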
At this point, you have learned the most complex conceptual topics of the DAX language. These points rule all the evaluation flows of your formulas, and they are the pillars of the DAX language. Whenever you encounter an expression that does not compute what you want, there is a good chance that the cause is an incomplete understanding of these rules. As we said in the introduction, at first glance all these topics look simple. In fact, they are. What makes them complex is the fact that in a DAX expression you might have several evaluation contexts active in different parts of the formula. Mastering evaluation contexts is a skill that you will gain with experience, and we will try to help you on this by showing many examples in the next chapters. After writing some DAX formulas of your own, you will intuitively know which contexts are used and which functions they require, and you will finally master the DAX language.
CHAPTER 5
Understanding CALCULATE and CALCULATETABLE
In this chapter we continue our journey in discovering the power of the DAX language with a detailed explanation of a single function: CALCULATE. The same considerations apply for CALCULATETABLE, which evaluates and returns a table instead of a scalar value. For simplicity’s sake, we will refer to CALCULATE in the examples, but remember that CALCULATETABLE displays the same behavior.
CALCULATE is the most important, useful, and complex function in DAX, so it deserves a full chapter. The function itself is simple to learn; it only performs a few tasks. Complexity comes from the fact that CALCULATE and CALCULATETABLE are the only functions in DAX that can create new filter contexts. Thus, although they are simple functions, using CALCULATE or CALCULATETABLE in a formula instantly increases its complexity.
This chapter is as tough as the previous chapter was. We suggest you carefully read it once, get a general feeling for CALCULATE, and move on to the remaining part of the book. Then, as soon as you feel lost in a specific formula, come back to this chapter and read it again from the beginning. You will probably discover new information each time you read it.
Introducing CALCULATE and CALCULATETABLE
The previous chapter described the two evaluation contexts: the row context and the filter context. The row context automatically exists for a calculated column, and one can create a row context programmatically by using an iterator. The filter context, on the other hand, is created by the report, and we have not described yet how to programmatically create a filter context. CALCULATE and CALCULATETABLE are the only functions required to operate on the filter context. Indeed, CALCULATE and CALCULATETABLE are the only functions that can create a new filter context by manipulating the existing one. From here onwards, we will show examples based on CALCULATE only, but remember that CALCULATETABLE performs the same operation for DAX expressions returning a table. Later in the book there are more examples using CALCULATETABLE, as in Chapter 12, “Working with tables,” and in Chapter 13, “Authoring queries.”
Creating filter contexts
Here we will introduce the reason why one would want to create new filter contexts with a practical example. As described in the next sections, writing code without being able to create new filter contexts results in verbose and unreadable code. What follows is an example of how creating a new filter context can drastically improve code that, at first, looked rather complex.
Contoso is a company that sells electronic products all around the world. Some products are branded Contoso, whereas others have different brands. One of the reports requires a comparison of the gross margins, both as an amount and as a percentage, of Contoso-branded products against their competitors. The first part of the report requires the following calculations:
Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
Gross Margin := SUMX ( Sales, Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] ) )
GM % := DIVIDE ( [Gross Margin], [Sales Amount] )
One beautiful aspect of DAX is that you can build more complex calculations on top of existing measures. In fact, you can appreciate this in the definition of GM %, the measure that computes the percentage of the gross margin against the sales. GM % simply invokes the two original measures as it divides them. If you already have a measure that computes a value, you can call the measure instead of rewriting the full code. Using the three measures defined above, one can build the first report, as shown in Figure 5-1.
FIGURE 5-1 The three measures provide quick insights in the margin of different categories.
The next step in building the report is more intricate. In fact, the final report we want is the one in Figure 5-2 that shows two additional columns: the gross margin for Contoso-branded products, both as amount and as percentage.
FIGURE 5-2 The last two columns of the report show gross margin amount and gross margin percentage for Contoso-branded products.
With the knowledge acquired so far, you are already capable of authoring the code for these two measures. Indeed, because the requirement is to restrict the calculation to only one brand, a solution is to use FILTER to restrict the calculation of the gross margin to Contoso products only:
Contoso GM :=
VAR ContosoSales =              -- Saves the rows of Sales which are related
    FILTER (                    -- to Contoso-branded products into a variable
        Sales,
        RELATED ( 'Product'[Brand] ) = "Contoso"
    )
VAR ContosoMargin =             -- Iterates over ContosoSales
    SUMX (                      -- to only compute the margin for Contoso
        ContosoSales,
        Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] )
    )
RETURN
    ContosoMargin
The ContosoSales variable contains the rows of Sales related to all the Contoso-branded products. Once the variable is computed, SUMX iterates on ContosoSales to compute the margin. Because the iteration is on the Sales table and the filter is on the Product table, one needs to use RELATED to retrieve the related product for each row in Sales. In a similar way, one can compute the gross margin percentage of Contoso by iterating the ContosoSales variable twice:
Contoso GM % :=
VAR ContosoSales =              -- Saves the rows of Sales which are related
    FILTER (                    -- to Contoso-branded products into a variable
        Sales,
        RELATED ( 'Product'[Brand] ) = "Contoso"
    )
VAR ContosoMargin =             -- Iterates over ContosoSales
    SUMX (                      -- to only compute the margin for Contoso
        ContosoSales,
        Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] )
    )
VAR ContosoSalesAmount =        -- Iterates over ContosoSales
    SUMX (                      -- to only compute the sales amount for Contoso
        ContosoSales,
        Sales[Quantity] * Sales[Net Price]
    )
VAR Ratio =
    DIVIDE ( ContosoMargin, ContosoSalesAmount )
RETURN
    Ratio
The code for Contoso GM % is a bit longer but, from a logical point of view, it follows the same pattern as Contoso GM. Although these measures work, it is easy to note that the initial elegance of DAX is lost. Indeed, the model already contains one measure to compute the gross margin and another measure to compute the gross margin percentage. However, because the new measures needed to be filtered, we had to rewrite the expression to add the condition.
It is worth stressing that the basic measures Gross Margin and GM % can already compute the values for Contoso. In fact, from Figure 5-2 you can note that the gross margin for Contoso is equal to 3,877,070.65 and the percentage is equal to 52.73%. One can obtain the very same numbers by slicing the base measures Gross Margin and GM % by Brand, as shown in Figure 5-3.
FIGURE 5-3 When sliced by brand, the base measures compute the value of Gross Margin and GM % for Contoso.
In the highlighted cells, the filter context created by the report is filtering the Contoso brand. The filter context filters the model. Therefore, a filter context placed on the Product[Brand] column filters the Sales table because of the relationship linking Sales to Product. Using the filter context, one can filter a table indirectly because the filter context operates on the whole model.
Thus, if we could make DAX compute the Gross Margin measure by creating a filter context programmatically, which only filters the Contoso-branded products, then our implementation of the last two measures would be much easier. This is possible by using CALCULATE. The complete description of CALCULATE comes later in this chapter. First, we examine the syntax of CALCULATE:
CALCULATE ( Expression, Condition1, … ConditionN )
CALCULATE can accept any number of parameters. The only mandatory parameter is the first one, that is, the expression to evaluate. The conditions following the first parameter are called filter arguments. CALCULATE creates a new filter context based on the set of filter arguments. Once the new filter context is computed, CALCULATE applies it to the model, and it proceeds with the evaluation of the expression. Thus, by leveraging CALCULATE, the code for Contoso GM and Contoso GM % becomes much simpler:
Contoso GM :=
CALCULATE (
    [Gross Margin],                 -- Computes the gross margin
    'Product'[Brand] = "Contoso"    -- In a filter context where brand = Contoso
)

Contoso GM % :=
CALCULATE (
    [GM %],                         -- Computes the gross margin percentage
    'Product'[Brand] = "Contoso"    -- In a filter context where brand = Contoso
)
Welcome back, simplicity and elegance! By creating a filter context that forces the brand to be Contoso, one can rely on existing measures and change their behavior without having to rewrite the code of the measures. CALCULATE lets you create new filter contexts by manipulating the filters in the current context. As you have seen, this leads to simple and elegant code. In the next sections we provide a complete and more formal definition of the behavior of CALCULATE, describing in detail what CALCULATE does and how to take advantage of its features. Indeed, so far we have kept the example rather high-level when, in fact, the initial definition of the Contoso measures is not semantically equivalent to the final definition. There are some differences that one needs to understand well.
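The difference between the two definitions becomes visible as soon as the report slices by brand. The following query is a sketch under a few assumptions: it is run in a query client such as DAX Studio, the measure names Contoso GM Filter and Contoso GM Calculate are hypothetical, and SUMMARIZECOLUMNS (a query function not covered in this section) is used only to reproduce a report sliced by Product[Brand].

```dax
DEFINE
    MEASURE Sales[Gross Margin] =
        SUMX ( Sales, Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] ) )
    -- FILTER-based version: it iterates the Sales rows visible in the
    -- current filter context, so the brand on the row still applies.
    MEASURE Sales[Contoso GM Filter] =
        SUMX (
            FILTER ( Sales, RELATED ( 'Product'[Brand] ) = "Contoso" ),
            Sales[Quantity] * ( Sales[Net Price] - Sales[Unit Cost] )
        )
    -- CALCULATE-based version: it replaces the filter on Product[Brand].
    MEASURE Sales[Contoso GM Calculate] =
        CALCULATE ( [Gross Margin], 'Product'[Brand] = "Contoso" )

EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Gross Margin", [Gross Margin],
    "Contoso GM Filter", [Contoso GM Filter],
    "Contoso GM Calculate", [Contoso GM Calculate]
)
ORDER BY 'Product'[Brand]
```

On a row for a brand other than Contoso, the FILTER-based measure should find no matching rows and return blank, whereas the CALCULATE-based measure should return the Contoso margin on every row. This is the kind of semantic difference the next sections examine.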
Introducing CALCULATE
Now that you have had an initial exposure to CALCULATE, it is time to start learning the details of this function. As introduced earlier, CALCULATE is the only DAX function that can modify the filter context; and remember, when we mention CALCULATE, we also include CALCULATETABLE. Strictly speaking, CALCULATE does not modify a filter context: it creates a new filter context by merging its filter parameters with the existing filter context. Once CALCULATE ends, its filter context is discarded and the previous filter context becomes effective again.
We have introduced the syntax of CALCULATE as
CALCULATE ( Expression, Condition1, … ConditionN )
The first parameter is the expression that CALCULATE will evaluate. Before evaluating the expression, CALCULATE computes the filter arguments and uses them to manipulate the filter context.
The first important thing to note about CALCULATE is that the filter arguments are not Boolean conditions: The filter arguments are tables. Whenever you use a Boolean condition as a filter argument of CALCULATE, DAX translates it into a table of values. In the previous section we used this code:
Contoso GM :=
CALCULATE (
    [Gross Margin],                 -- Computes the gross margin
    'Product'[Brand] = "Contoso"    -- In a filter context where brand = Contoso
)
Using a Boolean condition is only a shortcut for the complete CALCULATE syntax. This is known as syntactic sugar. It reads this way:
Contoso GM :=
CALCULATE (
    [Gross Margin],                     -- Computes the gross margin
    FILTER (                            -- Using as valid values for Product[Brand]
        ALL ( 'Product'[Brand] ),       -- any value for Product[Brand]
        'Product'[Brand] = "Contoso"    -- which is equal to "Contoso"
    )
)
The two syntaxes are equivalent, and there are no performance or semantic differences between them. That being said, particularly when you are learning CALCULATE for the first time, it is useful to always read filter arguments as tables. This makes the behavior of CALCULATE more apparent. Once you get used to CALCULATE semantics, the compact version of the syntax is more convenient. It is shorter and easier to read.
A filter argument is a table, that is, a list of values. The table provided as a filter argument defines the list of values that will be visible, for the column, during the evaluation of the expression. In the previous example, FILTER returns a table with one row only, containing a value for Product[Brand] that equals "Contoso". In other words, "Contoso" is the only value that CALCULATE will make visible for the Product[Brand] column. Therefore, CALCULATE filters the model including only products of the Contoso brand. Consider these two definitions:
Sales Amount :=
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )

Contoso Sales :=
CALCULATE (
    [Sales Amount],
    FILTER (
        ALL ( 'Product'[Brand] ),
        'Product'[Brand] = "Contoso"
    )
)
The FILTER used as the filter argument of CALCULATE in Contoso Sales scans ALL ( 'Product'[Brand] ); therefore, any previously existing filter on the product brand is overwritten by the new filter. This is more evident when you use the measures in a report that slices by brand. You can see in Figure 5-4 that Contoso Sales shows on every row, that is, for every brand, the same value that Sales Amount shows for Contoso.
In every row, the report creates a filter context containing the relevant brand. For example, in the row for Litware the original filter context created by the report contains a filter that only shows Litware products. Then, CALCULATE evaluates its filter argument, which returns a table containing only Contoso. The newly created filter overwrites the previously existing filter on the same column. You can see a graphic representation of the process in Figure 5-5.
FIGURE 5-4 Contoso Sales overwrites the existing filter with the new filter for Contoso.
FIGURE 5-5 The filter with Litware is overwritten by the filter with Contoso evaluated by CALCULATE. (Diagram of the Contoso Sales measure: the report filter Brand = Litware and the CALCULATE filter Brand = Contoso are combined through OVERWRITE into the final filter Brand = Contoso.)
CALCULATE does not overwrite the whole original filter context. It only replaces previously existing filters on the columns contained in the filter argument. In fact, if one changes the report to now slice by Product[Category], the result is different, as shown in Figure 5-6.
FIGURE 5-6 If the report filters by Category, the filter on Brand will be merged and no overwrite happens.
Now the report is filtering Product[Category], whereas CALCULATE applies a filter on Product[Brand] to evaluate the Contoso Sales measure. The two filters do not work on the same column of the Product table. Therefore, no overwriting happens, and the two filters work together as a new filter context. As a result, each cell is showing the sales of Contoso for the given category. The scenario is depicted in Figure 5-7.
FIGURE 5-7 CALCULATE overwrites filters on the same column. It merges filters if they are on different columns. (Diagram of the Contoso Sales measure: the report filter Category = Cell phones and the CALCULATE filter Brand = Contoso are merged into a final filter context with Category = Cell phones and Brand = Contoso.)
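The overwrite and merge behaviors can be reproduced with two queries, one slicing by brand and one slicing by category. This is a sketch that assumes a query client such as DAX Studio; SUMMARIZECOLUMNS is used only to mimic the rows of the report, and both measures are redefined locally so the query is self-contained.

```dax
DEFINE
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
    MEASURE Sales[Contoso Sales] =
        CALCULATE (
            [Sales Amount],
            FILTER ( ALL ( 'Product'[Brand] ), 'Product'[Brand] = "Contoso" )
        )

-- Sliced by brand: the filter on Brand coming from the row is overwritten,
-- so Contoso Sales is expected to show the same value on every row.
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Sales Amount", [Sales Amount],
    "Contoso Sales", [Contoso Sales]
)
ORDER BY 'Product'[Brand]

-- Sliced by category: the filters are on different columns and are merged,
-- so Contoso Sales is expected to show the Contoso sales of each category.
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Category],
    "Sales Amount", [Sales Amount],
    "Contoso Sales", [Contoso Sales]
)
ORDER BY 'Product'[Category]
```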
Now that you have seen the basics of CALCULATE, we can summarize its semantics:
■ CALCULATE makes a copy of the current filter context.
■ CALCULATE evaluates each filter argument and produces, for each condition, the list of valid values for the specified columns.
■ If two or more filter arguments affect the same column, they are merged together using an AND operator (or using the set intersection in mathematical terms).
■ CALCULATE uses the new condition to replace existing filters on the columns in the model. If a column already has a filter, then the new filter replaces the existing one. On the other hand, if the column does not have a filter, then CALCULATE adds the new filter to the filter context.
■ Once the new filter context is ready, CALCULATE applies the filter context to the model, and it computes the first argument: the expression. In the end, CALCULATE restores the original filter context, returning the computed result.
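The rule about several filter arguments on the same column can be illustrated with a small measure. This is a minimal sketch: the measure name Sales Amount Blue Only and the specific color values are hypothetical, and the Boolean conditions use the IN operator with a list of values.

```dax
Sales Amount Blue Only :=
CALCULATE (
    [Sales Amount],
    'Product'[Color] IN { "Red", "Blue" },      -- first list of valid colors
    'Product'[Color] IN { "Blue", "White" }     -- second list of valid colors
)
-- The two arguments filter the same column, so they are intersected:
-- only Blue belongs to both lists. The resulting filter then replaces
-- any filter already present on Product[Color] in the outer filter context.
```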
Note CALCULATE does another very important task: It transforms any existing row context into an equivalent filter context. You find a more detailed discussion on this topic later in this chapter, under “Understanding context transition.” Should you do a second reading of this section, do remember: CALCULATE creates a filter context out of the existing row contexts.
CALCULATE accepts filters of two types:
■ Lists of values, in the form of a table expression. In that case, you provide the exact list of values you want to make visible in the new filter context. The filter can be a table with any number of columns. Only the existing combinations of values in different columns will be considered in the filter.
■ Boolean conditions, such as Product[Color] = "White". These filters need to work on a single column because the result needs to be a list of values for a single column. This type of filter argument is also known as predicate.
If you use the syntax with a Boolean condition, DAX transforms it into a list of values. Thus, whenever you write this code:
Sales Amount Red Products :=
CALCULATE (
    [Sales Amount],
    'Product'[Color] = "Red"
)
DAX transforms the expression into this:
Sales Amount Red Products :=
CALCULATE (
    [Sales Amount],
    FILTER (
        ALL ( 'Product'[Color] ),
        'Product'[Color] = "Red"
    )
)
For this reason, you can only reference one column in a filter argument with a Boolean condition. DAX needs to detect the column to iterate in the FILTER function, which is generated in the background automatically. If the Boolean expression references two or more columns, then you must explicitly write the FILTER iteration, as you learn later in this chapter.
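As a preview of the explicit form, the following sketch filters on a condition that involves two columns of Sales. The measure name Big Sales Amount and the threshold of 1000 are hypothetical values chosen only for illustration; the columns are the ones used by Sales Amount.

```dax
Big Sales Amount :=
CALCULATE (
    [Sales Amount],
    FILTER (
        -- ALL with two columns returns the existing combinations of
        -- quantity and net price, ignoring any filter on them.
        ALL ( Sales[Quantity], Sales[Net Price] ),
        Sales[Quantity] * Sales[Net Price] >= 1000
    )
)
```

Here the table to iterate is written out by hand, which is exactly the part that DAX generates automatically when the condition references a single column.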
Using CALCULATE to compute percentages
Now that we have introduced CALCULATE, we can use it to define several calculations. The goal of this section is to bring your attention to some details about CALCULATE that are not obvious at first sight. Later in this chapter, we will cover more advanced aspects of CALCULATE. For now, we focus on some of the issues you might encounter when you start using CALCULATE.
A pattern that appears often is that of percentages. When working with percentages, it is very important to define exactly the calculation required. In this set of examples, you learn how different uses of CALCULATE and ALL functions provide different results.
We can start with a simple percentage calculation. We want to build the following report showing the sales amount along with the percentage over the grand total. You can see in Figure 5-8 the result we want to obtain.
FIGURE 5-8 Sales Pct shows the percentage of the current category against the grand total.
To compute the percentage, one needs to divide the value of Sales Amount in the current filter context by the value of Sales Amount in a filter context that ignores the existing filter on Category. In fact, the value of 1.26% for Audio is computed as 384,518.16 divided by 30,591,343.98.
In each row of the report, the filter context already contains the current category. Thus, for Sales Amount, the result is automatically filtered by the given category. The denominator of the ratio needs to ignore the current filter context, so that it evaluates the grand total. Because the filter arguments of CALCULATE are tables, it is enough to provide a table function that ignores the current filter context on the category and always returns all the categories, regardless of any filter. You previously learned that this function is ALL. Look at the following measure definition:
All Category Sales :=
CALCULATE (                         -- Changes the filter context of
    [Sales Amount],                 -- the sales amount
    ALL ( 'Product'[Category] )     -- making ALL categories visible
)
ALL removes the filter on the Product[Category] column from the filter context. Thus, in any cell of the report, it ignores any filter existing on the categories. The effect is that the filter on the category applied by the row of the report is removed. Look at the result in Figure 5-9. You can see that each row of the report for the All Category Sales measure returns the same value all the way through—the grand total of Sales Amount.
FIGURE 5-9 ALL removes the filter on Category, so CALCULATE defines a filter context without any filter on Category. (Diagram of the All Category Sales measure: ALL ( 'Product'[Category] ) removes the current filter Category = Audio through REMOVE FILTER, leaving no filter on Category.)
The All Category Sales measure is not useful by itself. It is unlikely a user would want to create a report that shows the same value on all the rows. However, that value is perfect as the denominator of the percentage we are looking to compute. In fact, the formula computing the percentage can be written this way:
Sales Pct :=
VAR CurrentCategorySales =              -- CurrentCategorySales contains
    [Sales Amount]                      -- the sales in the current context
VAR AllCategoriesSales =                -- AllCategoriesSales contains
    CALCULATE (                         -- the sales amount in a filter context
        [Sales Amount],                 -- where all the product categories
        ALL ( 'Product'[Category] )     -- are visible
    )
VAR Ratio =
    DIVIDE ( CurrentCategorySales, AllCategoriesSales )
RETURN
    Ratio
As you have seen in this example, mixing table functions and CALCULATE makes it possible to author useful measures easily. We use this technique a lot in the book because it is the primary calculation tool in DAX.
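To inspect the numerator, the denominator, and the ratio side by side, one can run a query like the following. It is a sketch that assumes a query client such as DAX Studio; the measures are redefined locally (Sales Pct in a compact form without variables) and SUMMARIZECOLUMNS is used only to mimic a report sliced by category.

```dax
DEFINE
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
    MEASURE Sales[All Category Sales] =
        CALCULATE ( [Sales Amount], ALL ( 'Product'[Category] ) )
    -- Compact form of Sales Pct, equivalent to the version with variables
    MEASURE Sales[Sales Pct] =
        DIVIDE ( [Sales Amount], [All Category Sales] )

EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Category],
    "Sales Amount", [Sales Amount],
    "All Category Sales", [All Category Sales],
    "Sales Pct", [Sales Pct]
)
ORDER BY 'Product'[Category]
```

With no other filter in place, the All Category Sales column should repeat the grand total on every row, and each Sales Pct value is the row's Sales Amount divided by that total.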
Note ALL has specific semantics when used as a filter argument of CALCULATE. In fact, it does not replace the filter context with all the values. Instead, CALCULATE uses ALL to remove the filter on the category column from the filter context. The side effects of this behavior are somewhat complex to follow and do not belong in this introductory section. We will cover them in more detail later in this chapter.
As we said in the introduction of this section, it is important to pay attention to small details when authoring percentages like the one we are currently writing. In fact, the percentage works fine if the report is slicing by category. The code removes the filter from the category, but it does not touch any other existing filter. Therefore, if the report adds other filters, the result might not be exactly what one wants to achieve. For example, look at the report in Figure 5-10 where we added the Product[Color] column as a second level of detail in the rows of the report.
FIGURE 5-10 Adding the color to the report produces unexpected results at the color level.
Looking at percentages, the value at the category level is correct, whereas the value at the color level looks wrong. In fact, the color percentages do not add up—neither to the category level nor to 100%. To understand the meaning of these values and how they are evaluated, it is always of great help to focus on one cell and understand exactly what happened to the filter context. Focus on Figure 5-11.
FIGURE 5-11 ALL on Product[Category] removes the filter on category, but it leaves the filter on color intact. (Diagram of the Sales Pct measure: from the report filters Category = Audio and Color = Black, REMOVE FILTER on Category leaves only the filter Color = Black.)
The original filter context created by the report contained both a filter on category and a filter on color. The filter on Product[Color] is not overwritten by CALCULATE, which only removes the filter from Product[Category]. As a result, the final filter context only contains the color. Therefore, the denominator of the ratio contains the sales of all the products of the given color—Black—and of any category. The calculation being wrong is not an unexpected behavior of CALCULATE. The problem here is that the formula has been designed to specifically work with a filter on a category, leaving any other filter untouched. The same formula makes perfect sense in a different report. Look at what happens if one switches the order of the columns, building a report that slices by color first and category second, as in Figure 5-12.
FIGURE 5-12 The result looks more reasonable once color and category are interchanged.
The report in Figure 5-12 makes a lot more sense. The measure computes the same result, but it is more intuitive thanks to the layout of the report. The percentage shown is the percentage of the category inside the given color. Color by color, the percentage always adds up to 100%.
In other words, when the user is required to compute a percentage, they should pay special attention in determining the denominator of the percentage. CALCULATE and ALL are the primary tools to use, but the specification of the formula depends on the business requirements.
Back to the example: The goal is to fix the calculation so that it computes the percentage against a filter on either the category or the color. There are multiple ways of performing the operation, all leading to slightly different results that are worth examining deeper.
One possible solution is to let CALCULATE remove the filter from both the category and the color. Adding multiple filter arguments to CALCULATE accomplishes this goal:
Sales Pct :=
VAR CurrentCategorySales =
    [Sales Amount]
VAR AllCategoriesAndColorSales =
    CALCULATE (
        [Sales Amount],
        ALL ( 'Product'[Category] ),    -- The two ALL conditions could also be replaced
        ALL ( 'Product'[Color] )        -- by ALL ( 'Product'[Category], 'Product'[Color] )
    )
VAR Ratio =
    DIVIDE ( CurrentCategorySales, AllCategoriesAndColorSales )
RETURN
    Ratio
This latter version of Sales Pct works fine with the report containing the color and the category, but it still suffers from limitations similar to the previous versions. In fact, it produces the right percentage with color and category—as you can see in Figure 5-13—but it will fail as soon as one adds other columns to the report.
FIGURE 5-13 With ALL on product category and color, the percentages now sum up correctly.
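For completeness, here is a sketch of the same measure written with a single ALL that receives both columns, as suggested by the comment in the code. The measure name Sales Pct Category Color is hypothetical; the behavior is expected to match the version with two separate ALL arguments.

```dax
Sales Pct Category Color :=
VAR CurrentSales =
    [Sales Amount]
VAR AllCategoriesAndColorSales =
    CALCULATE (
        [Sales Amount],
        -- One ALL with two columns removes the filters from both of them
        ALL ( 'Product'[Category], 'Product'[Color] )
    )
VAR Ratio =
    DIVIDE ( CurrentSales, AllCategoriesAndColorSales )
RETURN
    Ratio
```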
Adding another column to the report would create the same inconsistency noticed so far. If the user wants to create a percentage that removes all the filters on the Product table, they could still use the ALL function passing a whole table as an argument:
Sales Pct All Products :=
VAR CurrentCategorySales =
    [Sales Amount]
VAR AllProductSales =
    CALCULATE (
        [Sales Amount],
        ALL ( 'Product' )
    )
VAR Ratio =
    DIVIDE ( CurrentCategorySales, AllProductSales )
RETURN
    Ratio
ALL on the Product table removes any filter on any column of the Product table. In Figure 5-14 you can see the result of that calculation.
FIGURE 5-14 ALL used on the product table removes the filters from all the columns of the Product table.
So far, you have seen that by using CALCULATE and ALL together, you can remove filters—from a column, from multiple columns, or from a whole table. The real power of CALCULATE is that it offers many options to manipulate a filter context, and its capabilities do not end there. In fact, one might want to analyze the percentages by also slicing columns from different tables. For example, if the report is sliced by product category and customer continent, the last measure we created is not perfect yet, as you can see in Figure 5-15.
FIGURE 5-15 Slicing with columns of multiple tables still shows unexpected results.
At this point, the problem might be evident to you. The measure at the denominator removes any filter from the Product table, but it leaves the filter on Customer[Continent] intact. Therefore, the denominator computes the total sales of all products in the given continent. As in the previous scenario, the filter can be removed from multiple tables by putting several filters as arguments of CALCULATE:
    Sales Pct All Products and Customers :=
    VAR CurrentCategorySales =
        [Sales Amount]
    VAR AllProductAndCustomersSales =
        CALCULATE (
            [Sales Amount],
            ALL ( 'Product' ),
            ALL ( Customer )
        )
    VAR Ratio =
        DIVIDE ( CurrentCategorySales, AllProductAndCustomersSales )
    RETURN
        Ratio
By using ALL on two tables, now CALCULATE removes the filters from both tables. The result, as expected, is a percentage that adds up correctly, as you can appreciate in Figure 5-16.
FIGURE 5-16 Using ALL on two tables removes the filter context on both tables at the same time.
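The difference between clearing the filter on one table and on both can be checked with a toy model. The following Python sketch is not DAX and does not describe how the engine works: it assumes that a filter context can be represented as a dictionary mapping a column name to the set of allowed values, and the four sales rows are hypothetical.

```python
# Toy model of a filter context (an assumption for illustration, not the DAX engine):
# a dict mapping a column name to the set of values allowed for that column.
SALES = [  # hypothetical data
    {"Category": "Audio", "Continent": "Europe", "Amount": 10.0},
    {"Category": "Audio", "Continent": "Asia", "Amount": 30.0},
    {"Category": "Computers", "Continent": "Europe", "Amount": 20.0},
    {"Category": "Computers", "Continent": "Asia", "Amount": 40.0},
]


def sales_amount(ctx):
    """Sum Amount over the rows that satisfy every filter in ctx."""
    return sum(
        row["Amount"]
        for row in SALES
        if all(row[col] in allowed for col, allowed in ctx.items())
    )


def remove_filters(ctx, columns):
    """Mimic ALL: return a copy of ctx without the filters on the given columns."""
    return {col: allowed for col, allowed in ctx.items() if col not in columns}


def pct(ctx, columns_to_clear):
    """Ratio between the current cell and the same cell with some filters cleared."""
    return sales_amount(ctx) / sales_amount(remove_filters(ctx, columns_to_clear))


# One filter context per cell of a Category x Continent report.
cells = [
    {"Category": {cat}, "Continent": {cont}}
    for cat in ("Audio", "Computers")
    for cont in ("Europe", "Asia")
]

# Clearing only the product filter: each continent adds up to 100% on its own.
total_products_only = sum(pct(cell, {"Category"}) for cell in cells)

# Clearing both filters: the four cells add up to 100% together.
total_both = sum(pct(cell, {"Category", "Continent"}) for cell in cells)

print(round(total_products_only, 6), round(total_both, 6))
```

In this model, the percentages obtained by clearing only the product filter add up to 100% within each continent, so the whole report totals 200%. Clearing both filters yields percentages that add up to 100% across the report.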
As with two columns, the same challenge comes up with two tables. If a user adds another column from a third table to the context, the measure will not remove the filter from the third table. If the goal is to remove the filter from any table that might affect the calculation, one possible solution is to remove any filter from the fact table itself. In our model the fact table is Sales. Here is a measure that computes an additive percentage no matter what filter is interacting with the Sales table:

    Pct All Sales :=
    VAR CurrentCategorySales =
        [Sales Amount]
    VAR AllSales =
        CALCULATE ( [Sales Amount], ALL ( Sales ) )
    VAR Ratio =
        DIVIDE ( CurrentCategorySales, AllSales )
    RETURN
        Ratio
This measure leverages relationships to remove the filter from any table that might filter Sales. At this stage, we cannot explain the details of how it works because it leverages expanded tables, which we introduce in Chapter 14, “Advanced DAX concepts.” You can appreciate its behavior by inspecting Figure 5-17, where we removed the amount from the report and added the calendar year on the columns. Please note that the Calendar Year belongs to the Date table, which is not used in the measure. Nevertheless, the filter on Date is removed as part of the removal of filters from Sales.
FIGURE 5-17 ALL on the fact table removes any filter from related tables as well.
Before leaving this long exercise with percentages, we want to show another final example of filter context manipulation. As you can see in Figure 5-17, the percentage is always against the grand total, exactly as expected. What if the goal is to compute a percentage over the grand total of only the current year? In that case, the new filter context created by CALCULATE needs to be prepared carefully. Indeed, the denominator needs to compute the total of sales regardless of any filter apart from the current year. This requires two actions:
■ Removing all filters from the fact table
■ Restoring the filter for the year
Beware that the two conditions are applied at the same time, although it might look like the two steps come one after the other. You have already learned how to remove all the filters from the fact table. The last step is learning how to restore an existing filter.
Note The goal of this section is to explain basic techniques for manipulating the filter context. Later in this chapter you will see another, easier approach to solve this specific requirement (percentage over the visible grand total) by using ALLSELECTED.

In Chapter 3, “Using basic table functions,” you learned the VALUES function. VALUES returns the list of values of a column in the current filter context. Because the result of VALUES is a table, it can be used as a filter argument for CALCULATE. As a result, CALCULATE applies a filter on the given column, restricting its values to those returned by VALUES. Look at the following code:

    Pct All Sales CY :=
    VAR CurrentCategorySales =
        [Sales Amount]
    VAR AllSalesInCurrentYear =
        CALCULATE (
            [Sales Amount],
            ALL ( Sales ),
            VALUES ( 'Date'[Calendar Year] )
        )
    VAR Ratio =
        DIVIDE ( CurrentCategorySales, AllSalesInCurrentYear )
    RETURN
        Ratio
Once used in the report, the measure adds up to 100% for every year, because the denominator ignores every filter except the one on the year. You see this in Figure 5-18.
FIGURE 5-18 By using VALUES, you can restore part of the filter context, reading it from the original filter context.
Figure 5-19 depicts the full behavior of this complex formula.
FIGURE 5-19 The key of this diagram is that VALUES is still evaluated in the original filter context.
Here is a review of the diagram:
■ The cell containing 4.22% (sales of Cell Phones for Calendar Year 2007) has a filter context that filters Cell phones for CY 2007.
■ CALCULATE has two filter arguments: ALL ( Sales ) and VALUES ( Date[Calendar Year] ).
  • ALL ( Sales ) removes the filter from the Sales table.
  • VALUES ( Date[Calendar Year] ) evaluates the VALUES function in the original filter context, still affected by the presence of CY 2007 on the columns. As such, it returns the only year visible in the current filter context, that is, CY 2007.

The two filter arguments of CALCULATE are applied to the current filter context, resulting in a filter context that only contains a filter on Calendar Year. The denominator computes the total sales in a filter context with CY 2007 only.

It is of paramount importance to understand clearly that the filter arguments of CALCULATE are evaluated in the original filter context where CALCULATE is called. In fact, CALCULATE changes the filter context, but this only happens after the filter arguments are evaluated. Using ALL over a table followed by VALUES over a column is a technique used to replace the filter context with a filter over that same column.
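The two-step behavior of Pct All Sales CY (filter arguments evaluated first in the original filter context, new filter context built afterward) can be mimicked outside of DAX. The following Python sketch is only a toy model under stated assumptions: the filter context is a dictionary from column name to allowed values, the Sales rows are hypothetical and already carry the category and the year, and the helper names are invented for the illustration.

```python
# Toy model (an assumption for illustration, not the DAX engine).
SALES = [  # hypothetical data, with Category and Year stored on each row
    {"Category": "Cell phones", "Year": "CY 2007", "Amount": 40.0},
    {"Category": "Audio", "Year": "CY 2007", "Amount": 60.0},
    {"Category": "Cell phones", "Year": "CY 2008", "Amount": 30.0},
    {"Category": "Audio", "Year": "CY 2008", "Amount": 70.0},
]


def visible_rows(ctx):
    """Rows of SALES that satisfy every filter in the context."""
    return [
        row for row in SALES
        if all(row[col] in allowed for col, allowed in ctx.items())
    ]


def sales_amount(ctx):
    return sum(row["Amount"] for row in visible_rows(ctx))


def values(column, ctx):
    """Mimic VALUES: distinct values of a column visible in the given context."""
    return {row[column] for row in visible_rows(ctx)}


def pct_all_sales_cy(ctx):
    # Step 1: the filter arguments are evaluated in the ORIGINAL context.
    year_filter = values("Year", ctx)
    # Step 2: the new context drops every filter (ALL on the fact table)
    # and then contains only the restored filter on the year.
    new_ctx = {"Year": year_filter}
    return sales_amount(ctx) / sales_amount(new_ctx)


cell = {"Category": {"Cell phones"}, "Year": {"CY 2007"}}
print(pct_all_sales_cy(cell))  # 40 / (40 + 60)
```

If `values` were evaluated after the filters had been removed, it would return both years and the denominator would become the grand total. Evaluating it first is what restores the year.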
Note The previous example could also have been obtained by using ALLEXCEPT. The semantics of ALL/VALUES is different from ALLEXCEPT. In Chapter 10, “Working with the filter context,” you will see a complete description of the differences between the ALLEXCEPT and the ALL/VALUES techniques.

As you have seen in these examples, CALCULATE, in itself, is not a complex function. Its behavior is simple to describe. At the same time, as soon as you start using CALCULATE, the complexity of the code becomes much higher. Indeed, you need to focus on the filter context and understand exactly how CALCULATE generates the new filter context. A simple percentage hides a lot of complexity, and that complexity is all in the details. Before one really masters the handling of evaluation contexts, DAX is a bit of a mystery. The key to unlocking the full power of the language is all in mastering evaluation contexts. Moreover, in all these examples we only had to manage one CALCULATE. In a complex formula, having four or five different contexts in the same code is not unusual because of the presence of many instances of CALCULATE.

It is a good idea to read this whole section about percentages at least twice. In our experience, a second read is much easier and lets you focus on the important aspects of the code. We wanted to show this example to stress the importance of theory when it comes to CALCULATE. A small change in the code has an important effect on the numbers computed by the formula. After your second read, proceed with the next sections where we focus more on theory than on practical examples.
Introducing KEEPFILTERS
You learned in the previous sections that the filter arguments of CALCULATE overwrite any previously existing filter on the same column. Thus, the following measure returns the sales of Audio regardless of any previously existing filter on Product[Category]:

    Audio Sales :=
    CALCULATE (
        [Sales Amount],
        'Product'[Category] = "Audio"
    )
As you can see in Figure 5-20, the value of Audio is repeated on all the rows of the report.
FIGURE 5-20 Audio Sales always shows the sales of Audio products, regardless of the current filter context.
CALCULATE overwrites the existing filters on the columns where a new filter is applied. All the remaining columns of the filter context are left intact. In case you do not want to overwrite existing filters, you can wrap the filter argument with KEEPFILTERS. For example, if you want to show the amount of Audio sales when Audio is present in the filter context and a blank value if Audio is not present in the filter context, you can write the following measure:

    Audio Sales KeepFilters :=
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS ( 'Product'[Category] = "Audio" )
    )
KEEPFILTERS is the second CALCULATE modifier that you learn—the first one was ALL. We further cover CALCULATE modifiers later in this chapter. KEEPFILTERS alters the way CALCULATE applies a filter to the new filter context. Instead of overwriting an existing filter over the same column, it adds the new filter to the existing ones. Therefore, only the cells where the filtered category was already included in the filter context will produce a visible result. You see this in Figure 5-21.
FIGURE 5-21 Audio Sales KeepFilters shows the sales of Audio products only for the Audio row and for the Grand Total.
KEEPFILTERS does exactly what its name implies. Instead of overwriting the existing filter, it keeps the existing filter and adds the new filter to the filter context. We can depict the behavior with Figure 5-22. Because KEEPFILTERS avoids overwriting, the new filter generated by the filter argument of CALCULATE is added to the context. If we look at the cell for the Audio Sales KeepFilters measure in the Cell Phones row, there the resulting filter context contains two filters: one filters Cell Phones; the other filters Audio. The intersection of the two conditions results in an empty set, which produces a blank result.
FIGURE 5-22 The filter context generated with KEEPFILTERS filters both Cell phones and Audio at the same time.
The behavior of KEEPFILTERS is clearer when there are multiple elements selected in a column. For example, consider the following measures; they filter Audio and Computers with and without KEEPFILTERS:

    Always Audio-Computers :=
    CALCULATE (
        [Sales Amount],
        'Product'[Category] IN { "Audio", "Computers" }
    )

    KeepFilters Audio-Computers :=
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS ( 'Product'[Category] IN { "Audio", "Computers" } )
    )
The report in Figure 5-23 shows that the version with KEEPFILTERS only computes the sales amount values for Audio and for Computers, leaving all other categories blank. The Total row only takes Audio and Computers into account.
FIGURE 5-23 Using KEEPFILTERS, the original and the new filter contexts are merged together.
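The contrast between overwriting and keeping a filter can be reproduced with sets. The following Python sketch is a toy model, not DAX: it assumes that the filter on Category is simply a set of category names, that KEEPFILTERS corresponds to a set intersection, and that `None` stands in for a blank result. The three rows and the helper names are hypothetical.

```python
# Toy model (an assumption for illustration, not the DAX engine).
SALES = [("Audio", 10.0), ("Computers", 20.0), ("Cell phones", 30.0)]  # hypothetical
ALL_CATEGORIES = {category for category, _ in SALES}


def sales_amount(categories):
    """Sum the amounts of the visible categories; None plays the role of BLANK."""
    amounts = [amount for category, amount in SALES if category in categories]
    return sum(amounts) if amounts else None


def calculate_category(current, new, keepfilters=False):
    """Apply a new filter on Category on top of the current one.

    Without keepfilters the new filter overwrites the current one;
    with keepfilters the two filters are intersected.
    """
    effective = (current & new) if keepfilters else new
    return sales_amount(effective)


new_filter = {"Audio", "Computers"}
report_rows = [
    ("Audio", {"Audio"}),
    ("Cell phones", {"Cell phones"}),
    ("Total", ALL_CATEGORIES),
]
for label, current in report_rows:
    always = calculate_category(current, new_filter)
    kept = calculate_category(current, new_filter, keepfilters=True)
    print(label, always, kept)
```

In this model the overwriting version returns the same amount on every row, whereas the intersecting version returns a value only where Audio or Computers was already visible, and the total considers only those two categories.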
KEEPFILTERS can be used either with a predicate or with a table. Indeed, the previous code could also be written in a more verbose way:

    KeepFilters Audio-Computers :=
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS (
            FILTER (
                ALL ( 'Product'[Category] ),
                'Product'[Category] IN { "Audio", "Computers" }
            )
        )
    )
This is just an example for educational purposes. You should use the simplest predicate syntax available for a filter argument. When filtering a single column, you can avoid writing the FILTER explicitly. Later however, you will see that more complex filter conditions require an explicit FILTER. In those cases, the KEEPFILTERS modifier can be used around the explicit FILTER function, as you see in the next section.
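Filtering a single column
In the previous section, we introduced filter arguments referencing a single column in CALCULATE. It is important to note that you can have multiple references to the same column in one expression. For example, the following is a valid syntax because it references the same column (Sales[Net Price]) twice:

    Sales 10-100 :=
    CALCULATE (
        [Sales Amount],
        Sales[Net Price] >= 10 && Sales[Net Price] <= 100
    )

The same range can also be expressed with two separate filter arguments on the same column:

    Sales 10-100 :=
    CALCULATE (
        [Sales Amount],
        Sales[Net Price] >= 10,
        Sales[Net Price] <= 100
    )

A predicate that references more than one column is a different matter. Consider the following measure, which attempts to filter the transactions whose amount is at least 1,000:

    Sales Large Amount :=
    CALCULATE (
        [Sales Amount],
        Sales[Quantity] * Sales[Net Price] >= 1000
    )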
This code is not valid because the filter argument references two different columns in the same expression. As such, it cannot be converted automatically by DAX into a suitable FILTER condition. The best way to write the required filter is by using a table that only has the existing combinations of the columns referenced in the predicate:

    Sales Large Amount :=
    CALCULATE (
        [Sales Amount],
        FILTER (
            ALL ( Sales[Quantity], Sales[Net Price] ),
            Sales[Quantity] * Sales[Net Price] >= 1000
        )
    )
This results in a filter context that has a filter with two columns and a number of rows that correspond to the unique combinations of Quantity and Net Price that satisfy the filter condition. This is shown in Figure 5-24.
    Quantity   Net Price
    1          1000.00
    1          1001.00
    1          1199.00
    …          …
    2          500.00
    2          500.05
    …          …
    3          333.34
    …

FIGURE 5-24 The multi-column filter only includes combinations of Quantity and Net Price producing a result greater than or equal to 1,000.
This filter produces the result in Figure 5-25.
FIGURE 5-25 Sales Large Amount only shows sales of transactions with a large amount.
Be mindful that the slicer in Figure 5-25 is not filtering any value: The two displayed values are the minimum and the maximum values of Net Price. The next step is showing how the measure is interacting with the slicer. In a measure like Sales Large Amount, you need to pay attention when you overwrite existing filters over Quantity or Net Price. Indeed, because the filter argument uses ALL on the two columns, it ignores any previously existing filter on the same columns including, in this example, the filter of the slicer. The report in Figure 5-26 is the same as Figure 5-25 but, this time, the slicer filters for net prices between 500 and 3,000. The result is surprising.
FIGURE 5-26 There are no sales for Audio in the current price range; still Sales Large Amount is showing a result.
The presence of a value of Sales Large Amount for Audio and for Music, Movies and Audio Books is unexpected. Indeed, for these two categories there are no sales in the net price range between 500 and 3,000, which is the filter context generated by the slicer. Still, the Sales Large Amount measure is showing a result. The reason is that the filter context of Net Price created by the slicer is ignored by the Sales Large Amount measure, which overwrites the existing filter over both Quantity and Net Price. If you carefully compare Figures 5-25 and 5-26, you will notice that the value of Sales Large Amount is identical, as if the slicer had not been added to the report. Indeed, Sales Large Amount is completely ignoring the slicer. If you focus on a cell, like the value of Sales Large Amount for Audio, the code executed to compute its value is the following:

    Sales Large Amount :=
    CALCULATE (
        CALCULATE (
            [Sales Amount],
            FILTER (
                ALL ( Sales[Quantity], Sales[Net Price] ),
                Sales[Quantity] * Sales[Net Price] >= 1000
            )
        ),
        'Product'[Category] = "Audio",
        Sales[Net Price] >= 500
    )
From the code, you can see that the innermost ALL ignores the filter on Sales[Net Price] set by the outer CALCULATE. In that scenario, you can use KEEPFILTERS to avoid the overwrite of existing filters:

    Sales Large Amount KeepFilter :=
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS (
            FILTER (
                ALL ( Sales[Quantity], Sales[Net Price] ),
                Sales[Quantity] * Sales[Net Price] >= 1000
            )
        )
    )
The new Sales Large Amount KeepFilter measure produces the result shown in Figure 5-27.
FIGURE 5-27 Using KEEPFILTERS, the calculation takes into account the outer slicer too.
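The interaction between the slicer and the two versions of the measure can be approximated with a small simulation. The Python sketch below is a toy model and not the DAX engine: it assumes the multi-column filter is a set of (Quantity, Net Price) pairs built from the whole table, that overwriting discards the slicer on Net Price, and that KEEPFILTERS keeps it. The rows, the price range, and the function names are hypothetical.

```python
# Toy model (an assumption for illustration, not the DAX engine).
SALES = [  # hypothetical rows
    {"Category": "Audio", "Quantity": 1, "Net Price": 1200.0},
    {"Category": "Audio", "Quantity": 4, "Net Price": 300.0},
    {"Category": "Computers", "Quantity": 2, "Net Price": 600.0},
]


def amount(row):
    return row["Quantity"] * row["Net Price"]


def slicer(net_price):
    """Outer filter set by the slicer: net price between 500 and 3,000."""
    return 500 <= net_price <= 3000


def sales_large_amount(category, keepfilters):
    # Pairs are computed over ALL ( Quantity, Net Price ): the whole table,
    # regardless of the slicer.
    large_pairs = {
        (row["Quantity"], row["Net Price"]) for row in SALES if amount(row) >= 1000
    }
    rows = [
        row for row in SALES
        if row["Category"] == category
        and (row["Quantity"], row["Net Price"]) in large_pairs
    ]
    if keepfilters:
        # The slicer on Net Price survives and is intersected with the new filter.
        rows = [row for row in rows if slicer(row["Net Price"])]
    return sum(amount(row) for row in rows)


print(sales_large_amount("Audio", keepfilters=False))  # slicer ignored
print(sales_large_amount("Audio", keepfilters=True))   # slicer respected
```

With these rows, the version without KEEPFILTERS also counts the Audio transaction priced at 300, which lies outside the slicer range, while the KEEPFILTERS version excludes it.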
Another way of specifying a complex filter is by using a table filter instead of a column filter. This is one of the preferred techniques of DAX newbies, although it is very dangerous to use. In fact, the previous measure can be written using a table filter:

    Sales Large Amount Table :=
    CALCULATE (
        [Sales Amount],
        FILTER (
            Sales,
            Sales[Quantity] * Sales[Net Price] >= 1000
        )
    )
As you may remember, all the filter arguments of CALCULATE are evaluated in the filter context that exists outside of the CALCULATE itself. Thus, the iteration over Sales only considers the rows filtered in the existing filter context, which contains a filter on Net Price. Therefore, the semantics of the Sales Large Amount Table measure correspond to those of the Sales Large Amount KeepFilter measure.

Although this technique looks easy, you should be careful in using it because it could have serious consequences on performance and on result accuracy. We will cover the details of these issues in Chapter 14. For now, just remember that the best practice is to always use a filter with the smallest possible number of columns. Moreover, you should avoid table filters because they usually are more expensive. The Sales table might be very large, and scanning it row by row to evaluate a predicate can be a time-consuming operation. The filter in Sales Large Amount KeepFilter, on the other hand, only iterates the unique combinations of Quantity and Net Price. That number is usually much smaller than the number of rows of the entire Sales table.
Evaluation order in CALCULATE
Whenever you look at DAX code, the natural order of evaluation is innermost first. For example, look at the following expression:

    Sales Amount Large :=
    SUMX (
        FILTER ( Sales, Sales[Quantity] >= 100 ),
        Sales[Quantity] * Sales[Net Price]
    )
DAX needs to evaluate the result of FILTER before starting the evaluation of SUMX. In fact, SUMX iterates a table. Because that table is the result of FILTER, SUMX cannot start executing before FILTER has finished its job. This rule is true for all DAX functions, except for CALCULATE and CALCULATETABLE. Indeed, CALCULATE evaluates its filter arguments first and only at the end does it evaluate the first parameter, which is the expression to evaluate to provide the CALCULATE result.

Moreover, things are a bit more intricate because CALCULATE changes the filter context. All the filter arguments are executed in the filter context outside of CALCULATE, and each filter is evaluated independently. The order of filters within the same CALCULATE does not matter. Consequently, all the following measures are completely equivalent:

    Sales Red Contoso :=
    CALCULATE (
        [Sales Amount],
        'Product'[Color] = "Red",
        KEEPFILTERS ( 'Product'[Brand] = "Contoso" )
    )

    Sales Red Contoso :=
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS ( 'Product'[Brand] = "Contoso" ),
        'Product'[Color] = "Red"
    )

    Sales Red Contoso :=
    VAR ColorRed =
        FILTER (
            ALL ( 'Product'[Color] ),
            'Product'[Color] = "Red"
        )
    VAR BrandContoso =
        FILTER (
            ALL ( 'Product'[Brand] ),
            'Product'[Brand] = "Contoso"
        )
    VAR SalesRedContoso =
        CALCULATE (
            [Sales Amount],
            ColorRed,
            KEEPFILTERS ( BrandContoso )
        )
    RETURN
        SalesRedContoso
The version of Sales Red Contoso defined using variables is more verbose than the other versions, but you might want to use it in case the filters are complex expressions with explicit filters. This way, it is easier to understand that the filter is evaluated “before” CALCULATE.

This rule becomes more important in case of nested CALCULATE statements. In fact, the outermost filters are applied first, and the innermost are applied later. Understanding the behavior of nested CALCULATE statements is important, because you encounter this situation every time you nest measure calls. For example, consider the following measures, where Green calling Red calls Sales Red:

    Sales Red :=
    CALCULATE (
        [Sales Amount],
        'Product'[Color] = "Red"
    )

    Green calling Red :=
    CALCULATE (
        [Sales Red],
        'Product'[Color] = "Green"
    )
To make the nested measure call more evident, we can expand Green calling Red this way:

    Green calling Red Exp :=
    CALCULATE (
        CALCULATE (
            [Sales Amount],
            'Product'[Color] = "Red"
        ),
        'Product'[Color] = "Green"
    )
The order of evaluation is the following:
1. First, the outer CALCULATE applies the filter, Product[Color] = "Green".
2. Second, the inner CALCULATE applies the filter, Product[Color] = "Red". This filter overwrites the previous filter.
3. Last, DAX computes [Sales Amount] with a filter for Product[Color] = "Red".
Therefore, both Sales Red and Green calling Red return the sales of red products, as shown in Figure 5-28.
FIGURE 5-28 The last three measures return the same result, which is always the sales of red products.
Note The description we provided is for educational purposes only. In reality the engine uses lazy evaluation for the filter context. So, in the presence of filter argument overwrites such as the previous code, the outer filter might never be evaluated because it would have been useless. Nevertheless, this behavior is for optimization only. It does not change the semantics of CALCULATE in any way.

We can review the order of the evaluation and how the filter context is evaluated with another example. Consider the following measure:

    Sales YB :=
    CALCULATE (
        CALCULATE (
            [Sales Amount],
            'Product'[Color] IN { "Yellow", "Black" }
        ),
        'Product'[Color] IN { "Black", "Blue" }
    )
The evaluation of the filter context produced by Sales YB is visible in Figure 5-29. As seen before, the innermost filter over Product[Color] overwrites the outermost filters. Therefore, the result of the measure shows the sum of products that are Yellow or Black.

By using KEEPFILTERS in the innermost CALCULATE, the filter context is built by keeping the two filters instead of overwriting the existing filter:

    Sales YB KeepFilters :=
    CALCULATE (
        CALCULATE (
            [Sales Amount],
            KEEPFILTERS ( 'Product'[Color] IN { "Yellow", "Black" } )
        ),
        'Product'[Color] IN { "Black", "Blue" }
    )
FIGURE 5-29 The innermost filter overwrites the outer filter.
The evaluation of the filter context produced by Sales YB KeepFilters is visible in Figure 5-30.
FIGURE 5-30 By using KEEPFILTERS, CALCULATE does not overwrite the previous filter context.
Because the two filters are kept together, they are intersected. Therefore, in the new filter context the only visible color is Black because it is the only value present in both filters.
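The evaluation order of nested CALCULATE functions can be traced with a few lines of Python. This is a toy model under stated assumptions, not DAX: the filter context is a dictionary from column name to a set of values, the outer filter is applied first, and the hypothetical `apply_filter` helper either overwrites or intersects the filter on a column.

```python
# Toy model (an assumption for illustration, not the DAX engine).
def apply_filter(ctx, column, new_values, keepfilters=False):
    """Return a new context with a filter on one column applied.

    Without keepfilters the new filter replaces any existing filter on the column;
    with keepfilters it is intersected with the existing one.
    """
    result = dict(ctx)
    if keepfilters and column in ctx:
        result[column] = ctx[column] & set(new_values)
    else:
        result[column] = set(new_values)
    return result


# The outer CALCULATE is applied first...
outer = apply_filter({}, "Color", {"Black", "Blue"})

# ...then the inner CALCULATE, which overwrites the filter (Sales YB)...
inner_overwrite = apply_filter(outer, "Color", {"Yellow", "Black"})

# ...or keeps it (Sales YB KeepFilters).
inner_keep = apply_filter(outer, "Color", {"Yellow", "Black"}, keepfilters=True)

print(sorted(inner_overwrite["Color"]))
print(sorted(inner_keep["Color"]))
```

In the model, the overwriting version ends with Yellow and Black visible, whereas the KEEPFILTERS version ends with Black only, matching the two figures.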
However, the order of the filter arguments within the same CALCULATE is irrelevant because they are applied to the filter context independently.
Understanding context transition
In Chapter 4, “Understanding evaluation contexts,” we stressed multiple times that the row context and the filter context are different concepts. This still holds true. However, there is one operation performed by CALCULATE that can transform a row context into a filter context. It is the operation of context transition, defined as follows:

    CALCULATE invalidates any row context. It automatically adds as filter arguments all the columns that are currently being iterated in any row context, filtering their actual value in the row being iterated.

Context transition is hard to understand at the beginning, and even seasoned DAX coders find it complex to follow all the implications of context transition. We are more than confident that the previous definition does not suffice to fully understand context transition. Therefore, we are going to describe context transition through several examples of increasing complexity. But before discussing such a delicate concept, let us make sure we thoroughly understand row context and filter context.
Row context and filter context recap
We can recap some important facts about row context and filter context with the aid of Figure 5-31, which shows a report with the Brand on the rows and a diagram describing the evaluation process. Products and Sales in the diagram are not displaying real data. They only contain a few rows to make the points clearer.
    Filter context: Brand = Contoso

    Products
    Product   Brand
    A         Contoso
    B         Litware

    Sales Amount = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )

    Sales
    Product   Quantity   Net Price
    A         1          11.00
    B         2          25.00
    A         2          10.99

    SUMX iterations (row context)
    Iteration   Operation   Result
    1           1*11.00     11.00
    2           2*10.99     21.98

FIGURE 5-31 The diagram depicts the full flow of execution of a simple iteration with SUMX.
The following comments on Figure 5-31 are helpful to monitor your understanding of the whole process for evaluating the Sales Amount measure for the Contoso row:
■ The report creates a filter context containing a filter for Product[Brand] = "Contoso".
■ The filter works on the entire model, filtering both the Product and the Sales tables.
■ The filter context reduces the number of rows iterated by SUMX while scanning Sales. SUMX only iterates the Sales rows that are related to a Contoso product.
■ In the figure there are two rows in Sales with product A, which is branded Contoso.
■ Consequently, SUMX iterates two rows. In the first row it computes 1*11.00 with a partial result of 11.00. In the second row it computes 2*10.99 with a partial result of 21.98.
■ SUMX returns the sum of the partial results gathered during the iteration.
■ During the iteration of Sales, SUMX only scans the visible portion of the Sales table, generating a row context for each visible row.
■ When SUMX iterates the first row, Sales[Quantity] equals 1, whereas Sales[Net Price] equals 11. On the second row, the values are different. Columns have a current value that depends on the iterated row. Potentially, each row iterated has a different value for all the columns.
■ During the iteration, there is a row context and a filter context. The filter context is still the same that filters Contoso because no CALCULATE has been executed to modify it.
Speaking about context transition, the last statement is the most important. During the iteration the filter context is still active, and it filters Contoso. The row context, on the other hand, is currently iterating the Sales table. Each column of Sales has a given value. The row context is providing the value via the current row. Remember that the row context iterates; the filter context does not.

This is an important detail. We invite you to double-check your understanding in the following scenario. Imagine you create a measure that simply counts the number of rows in the Sales table, with the following code:

    NumOfSales := COUNTROWS ( Sales )
Once used in the report, the measure counts the number of Sales rows that are visible in the current filter context. The result shown in Figure 5-32 is as expected: a different number for each brand.
FIGURE 5-32 NumOfSales counts the number of rows visible in the current filter context in the Sales table.
Because there are 37,984 rows in Sales for the Contoso brand, this means that an iteration over Sales for Contoso will iterate exactly 37,984 rows. The Sales Amount measure we used so far would complete its execution after 37,984 multiplications. With the understanding you have obtained so far, can you guess the result of the following measure on the Contoso row?

    Sum Num Of Sales := SUMX ( Sales, COUNTROWS ( Sales ) )
Do not rush in deciding your answer. Take your time, study this simple code carefully, and make an educated guess. In the following paragraph we provide the correct answer. The filter context is filtering Contoso. From the previous examples, it is understood that SUMX iterates 37,984 times. For each of these 37,984 rows, SUMX computes the number of rows visible in Sales in the current filter context. The filter context is still the same, so for each row the result of COUNTROWS is always 37,984. Consequently, SUMX sums the value of 37,984 for 37,984 times. The result is 37,984 squared. You can confirm this by looking at Figure 5-33, where the measure is displayed in the report.
FIGURE 5-33 Sum Num Of Sales computes NumOfSales squared because it counts all the rows for each iteration.
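The squared result is easy to reproduce with a loop. The following Python sketch is a toy model, not DAX: `sumx` and `countrows` are hypothetical helpers that mimic the iteration, and the list of visible rows is invented. The point it illustrates is that the inner expression does not depend on the row being iterated, so it returns the same count on every iteration.

```python
# Toy model (an assumption for illustration, not the DAX engine).
# Rows of Sales visible in the current filter context (hypothetical data).
visible_sales = [{"Quantity": 1}, {"Quantity": 2}, {"Quantity": 2}]


def countrows(table):
    return len(table)


def sumx(table, expression):
    """Evaluate expression once per row of the table and sum the results."""
    return sum(expression(row) for row in table)


# The expression ignores the current row: the filter context has not changed,
# so COUNTROWS sees the same visible rows on every iteration.
sum_num_of_sales = sumx(visible_sales, lambda row: countrows(visible_sales))

print(sum_num_of_sales)
```

With three visible rows the result is 3 added three times, that is, the number of visible rows squared.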
Now that we have refreshed the main ideas about row context and filter context, we can further discuss the impact of context transition.
Introducing context transition
A row context exists whenever an iteration is happening on a table. Inside an iteration are expressions that depend on the row context itself. The following expression, which you have studied multiple times by now, comes in handy:

    Sales Amount :=
    SUMX (
        Sales,
        Sales[Quantity] * Sales[Unit Price]
    )
The two columns Quantity and Unit Price have a value in the current row context. In the previous section we showed that if the expression used inside an iteration is not strictly bound to the row context, then it is evaluated in the filter context. As such the results are surprising, at least for beginners. Nevertheless, one is completely free to use any function inside a row context. Among the many functions available, one appears to be more special: CALCULATE.

If executed in a row context, CALCULATE invalidates the row context before evaluating its expression. Inside the expression evaluated by CALCULATE, all the previous row contexts will no longer be valid. Thus, the following code produces a syntax error:

    Sales Amount :=
    SUMX (
        Sales,
        CALCULATE ( Sales[Quantity] )    -- No row context inside CALCULATE, ERROR !
    )
The reason is that the value of the Sales[Quantity] column cannot be retrieved inside CALCULATE because CALCULATE invalidates the row context that exists outside of CALCULATE itself. Nevertheless, this is only part of what context transition performs. The second, and most relevant, operation is that CALCULATE adds as filter arguments all the columns of the current row context with their current value. For example, look at the following code:

    Sales Amount :=
    SUMX (
        Sales,
        CALCULATE ( SUM ( Sales[Quantity] ) )    -- SUM does not require a row context
    )
There are no filter arguments in CALCULATE. The only CALCULATE argument is the expression to evaluate. Thus, it looks like CALCULATE will not overwrite the existing filter context. The point is that CALCULATE, because of context transition, is silently creating many filter arguments. It creates a filter for each column in the iterated table. You can use Figure 5-34 to obtain a first look at the behavior of context transition. We used a reduced set of columns for visual purposes.

    Row Context Test := SUMX ( Sales, CALCULATE ( SUM ( Sales[Quantity] ) ) )

    Sales
    Product   Quantity   Net Price
    A         1          11.00
    B         2          25.00
    A         2          10.99

    Filter context (first iteration)
    Product   Quantity   Net Price
    A         1          11.00

    SUMX iteration
    Row iterated   Sales[Quantity] value   Row result
    1              1                       1
    2              2                       2
    3              2                       2

    The result of SUMX is 5

FIGURE 5-34 When CALCULATE is executed in a row context, it creates a filter context with a filter for each of the columns in the currently iterated table.
During the iteration CALCULATE starts on the first row, and it computes SUM ( Sales[Quantity] ). Even though there are no filter arguments, CALCULATE adds one filter argument for each of the columns of the iterated table. Namely, there are three columns in the example: Product, Quantity, and Net Price. As a result, the filter context generated by the context transition contains the current value (A, 1, 11.00) for each of the columns (Product, Quantity, Net Price). The process, of course, continues for each one of the three rows during the iteration made by SUMX.
CHAPTER 5
Understanding CALCULATE and CALCULATETABLE
In other words, the execution of the previous SUMX results in these three CALCULATE executions:

    CALCULATE (
        SUM ( Sales[Quantity] ),
        Sales[Product] = "A",
        Sales[Quantity] = 1,
        Sales[Net Price] = 11
    ) +
    CALCULATE (
        SUM ( Sales[Quantity] ),
        Sales[Product] = "B",
        Sales[Quantity] = 2,
        Sales[Net Price] = 25
    ) +
    CALCULATE (
        SUM ( Sales[Quantity] ),
        Sales[Product] = "A",
        Sales[Quantity] = 2,
        Sales[Net Price] = 10.99
    )
These filter arguments are hidden. They are added by the engine automatically, and there is no way to avoid them. In the beginning, context transition seems very strange. Nevertheless, once one gets used to context transition, it is an extremely powerful feature. Hard to master, but extremely powerful. We summarize the considerations presented earlier, before we further discuss a few of them specifically:

■ Context transition is expensive. If context transition is used during an iteration on a table with 10 columns and one million rows, then CALCULATE needs to apply 10 filters, one million times. No matter what, it will be a slow operation. This is not to say that relying on context transition should be avoided. However, it does make CALCULATE a feature that needs to be used carefully.

■ Context transition does not only filter one row. The original row context existing outside of CALCULATE always only points to one row. The row context iterates on a row-by-row basis. When the row context is moved to a filter context through context transition, the newly created filter context filters all the rows with the same set of values. Thus, you should not assume that the context transition creates a filter context with one row only. This is very important, and we will return to this topic in the next sections.

■ Context transition uses columns that are not present in the formula. Although the columns used in the filter are hidden, they are part of the expression. This makes any formula with CALCULATE much more complex than it first seems. If a context transition is used, then all the columns of the table are part of the expression as hidden filter arguments. This behavior might create unexpected dependencies. This topic is also described later in this section.

■ Context transition creates a filter context out of a row context. You might remember the evaluation context mantra, “the row context iterates a table, whereas the filter context filters the model.” Once context transition transforms a row context into a filter context, it changes the nature of the filter. Instead of iterating a single row, DAX filters the whole model; relationships become part of the equation. In other words, context transition happening on one table might propagate its filtering effects far from the table the row context originated from.

■ Context transition is invoked whenever there is a row context. For example, if one uses CALCULATE in a calculated column, context transition occurs. There is an automatic row context inside a calculated column, and this is enough for context transition to occur.

■ Context transition transforms all the row contexts. When nested iterations are being performed on multiple tables, context transition considers all the row contexts. It invalidates all of them and adds filter arguments for all the columns that are currently being iterated by all the active row contexts.

■ Context transition invalidates the row contexts. Though we have repeated this concept multiple times, it is worth bringing to your attention again. None of the outer row contexts are valid inside the expression evaluated by CALCULATE. All the outer row contexts are transformed into equivalent filter contexts.
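The second consideration can be observed directly. The following query is a minimal sketch, meant to be run in a client tool such as DAX Studio against the Sales table of the sample model; the column name @Matching Rows is only an illustrative choice. For each row of Sales, CALCULATE performs a context transition and counts how many rows of Sales share exactly the same values in every column. Because it performs one context transition per row, it can be slow on a large table.

```dax
EVALUATE
FILTER (
    ADDCOLUMNS (
        Sales,
        -- Context transition: counts the rows identical to the current one
        "@Matching Rows", CALCULATE ( COUNTROWS ( Sales ) )
    ),
    [@Matching Rows] > 1    -- keeps only the rows that have at least one duplicate
)
```

An empty result suggests that every row of the table is unique. Any row returned is one that context transition would filter together with its duplicates.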
As anticipated earlier in this section, most of these considerations require further explanation. In the remaining part of this section about context transition, we provide a deeper analysis of these main points. Although all these considerations are shown as warnings, in reality they are important features. Being ignorant of certain behaviors can lead to surprising results. Nevertheless, once you master the behavior, you start leveraging it as you see fit. The only difference between a strange behavior and a useful feature, at least in DAX, is your level of knowledge.
Context transition in calculated columns

A calculated column is evaluated in a row context. Therefore, using CALCULATE in a calculated column triggers a context transition. We use this feature to create a calculated column in Product that marks as “High Performance” all the products that, alone, sold more than 1% of the total sales of all the products. To produce this calculated column, we need two values: the sales of the current product and the total sales of all the products. The former requires filtering the Sales table so that it only computes sales amount for the current product, whereas the latter requires scanning the Sales table with no active filters. Here is the code:

    'Product'[Performance] =
    VAR TotalSales =                                -- Sales of all the products
        SUMX (
            Sales,                                  -- Sales is not filtered
            Sales[Quantity] * Sales[Net Price]      -- thus here we compute all sales
        )
    VAR CurrentSales =
        CALCULATE (                                 -- Performs context transition
            SUMX (
                Sales,                              -- Sales of the current product only
                Sales[Quantity] * Sales[Net Price]  -- thus here we compute sales of the
            )                                       -- current product only
        )
    VAR Ratio = 0.01                                -- 1% expressed as a real number
    VAR Result =
        IF (
            CurrentSales >= TotalSales * Ratio,
            "High Performance product",
            "Regular product"
        )
    RETURN
        Result
Note that there is only one difference between the two variables: TotalSales is executed as a regular iteration, whereas CurrentSales computes the same DAX code within a CALCULATE function. Because this is a calculated column, the row context is transformed into a filter context. The filter context propagates through the model and it reaches Sales, only filtering the sales of the current product. Thus, even though the two variables look similar, their content is completely different. TotalSales computes the sales of all the products because the filter context in a calculated column is empty and does not filter anything. CurrentSales computes the sales of the current product only thanks to the context transition performed by CALCULATE. The remaining part of the code is a simple IF statement that checks whether the condition is met and marks the product appropriately. One can use the resulting calculated column in a report like the one visible in Figure 5-35.
FIGURE 5-35 Only four products are marked High Performance.
In the code of the Performance calculated column, we used CALCULATE and context transition as a feature. Before moving on, we must check that we considered all the implications. The Product table is small, containing just a few thousand rows. Thus, performance is not an issue. The filter context generated by CALCULATE filters all the columns. Do we have a guarantee that CurrentSales only contains the sales of the current product? In this special case, the answer is yes. The reason is that each row of Product is unique because Product contains a column with a different value for each row: ProductKey. Consequently, the filter context generated by the context transition is guaranteed to only filter one product. In this case, we could rely on context transition because each row of the iterated table is unique. Beware that this is not always true. We want to demonstrate that with an example that is purposely wrong. We create a calculated column, in Sales, containing this code:

    Sales[Wrong Amt] =
    CALCULATE (
        SUMX (
            Sales,
            Sales[Quantity] * Sales[Net Price]
        )
    )
Being a calculated column, it runs in a row context. CALCULATE performs the context transition, so SUMX iterates all the rows in Sales with an identical set of values corresponding to the current row in Sales. The problem is that the Sales table does not have any column with unique values. Therefore, there is a chance that multiple identical rows exist and, if they exist, they will be filtered together. In other words, there is no guarantee that SUMX always iterates only one row in the Wrong Amt column. If you are lucky, there are many duplicated rows, and the value computed by this calculated column is totally wrong. This way, the problem would be clearly visible and immediately recognized. In many real-world scenarios, the number of duplicated rows in tables is tiny, making these inaccurate calculations hard to spot and debug. The sample database we use in this book is no exception. Look at the report in Figure 5-36 showing the correct value for Sales Amount and the wrong value computed by summing the Wrong Amt calculated column.
FIGURE 5-36 Most results are correct; only two rows have different values.
You can see that the difference only exists at the total level and for the Fabrikam brand. There are some duplicates in the Sales table—related to some Fabrikam product—that perform the calculation twice. The presence of these rows might be legitimate: The same customer bought the same product in the same store on the same day in the morning and in the afternoon, but the Sales table only stores the date and not the time of the transaction. Because the number of duplicates is small, most numbers look correct. However, the calculation is wrong because it depends on the content of the table. Inaccurate numbers might appear at any time because of duplicated rows. The more duplicates there are, the worse the result turns out. In this case, relying on context transition is the wrong choice. Because the table is not guaranteed to only have unique rows, context transition is not safe to use. An expert DAX coder should know this in advance. Besides, the Sales table might contain millions of rows; thus, this calculated column is not only wrong, it is also very slow.
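When the goal is simply the amount of each individual row, a calculated column does not need context transition at all. The following sketch reads the values directly from the row context, so duplicated rows cannot be merged together. The names Line Amount and Sales Amount From Column are hypothetical and are not part of the sample model.

```dax
-- Calculated column in Sales: evaluated in the row context, no CALCULATE involved
Sales[Line Amount] = Sales[Quantity] * Sales[Net Price]

-- Measure that sums the column: each row of Sales contributes exactly once
Sales Amount From Column := SUM ( Sales[Line Amount] )
```

Because no filter context is generated, this version does not depend on the uniqueness of the rows, and it avoids the cost of one context transition per row.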
Context transition with measures

Understanding context transition is very important because of another important aspect of DAX. Every measure reference always has an implicit CALCULATE surrounding it. Because of CALCULATE, a measure reference generates an implicit context transition if executed in the presence of any row context. This is why in DAX, it is important to use the correct naming convention when writing column references (always including the table name) and measure references (always without the table name). You want to be aware of any implicit context transition when writing and reading a DAX expression. This simple initial definition deserves a longer explanation with several examples. The first one is that translating a measure reference always requires wrapping the expression of the measure within a CALCULATE function. For example, consider the following definition of the Sales Amount measure and of the Product Sales calculated column in the Product table:

    Sales Amount :=
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )

    'Product'[Product Sales] = [Sales Amount]
The Product Sales column correctly computes the sum of Sales Amount only for the current product in the Product table. Indeed, expanding the Sales Amount measure in the definition of Product Sales requires the CALCULATE function that wraps the definition of Sales Amount:

    'Product'[Product Sales] =
    CALCULATE (
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
    )
Without CALCULATE, the result of the calculated column would produce the same value for all the products. This would correspond to the sales amount of all the rows in Sales without any filtering by product. The presence of CALCULATE means that context transition occurs, producing in this case the desired result. A measure reference always calls CALCULATE. This is very important and can be used to write short and powerful DAX expressions. However, it could also lead to big mistakes if you forget that the context transition takes place every time the measure is called in a row context. As a rule of thumb, you can always replace a measure reference with the expression that defines the measure wrapped inside CALCULATE. Consider the following definition of a measure called Max Daily Sales, which computes the maximum value of Sales Amount computed day by day:

    Max Daily Sales :=
    MAXX ( 'Date', [Sales Amount] )
This formula is intuitive to read. However, Sales Amount must be computed for each date, only filtering the sales of that day. This is exactly what context transition performs. Internally, DAX replaces the Sales Amount measure reference with its definition wrapped by CALCULATE, as in the following example:

    Max Daily Sales :=
    MAXX (
        'Date',
        CALCULATE (
            SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
        )
    )
We will use this feature extensively in Chapter 7, “Working with iterators and CALCULATE,” when we start writing complex DAX code to solve specific scenarios. This initial description just completes the explanation of context transition, which happens in these cases:

■ When a CALCULATE or CALCULATETABLE function is called in the presence of any row context.

■ When there is a measure reference in the presence of any row context because the measure reference internally executes its DAX code within a CALCULATE function.
This powerful behavior might lead to mistakes, mainly due to the incorrect assumption that you can replace a measure reference with the DAX code of its definition. You cannot. This could work when there are no row contexts, like in a measure, but this is not possible when the measure reference appears within a row context. It is easy to forget this rule, so we provide an example of what could happen by making an incorrect assumption.
You may have noticed that in the previous example, we wrote the code for a calculated column repeating the iteration over Sales twice. Here is the code we already presented in the previous example:

    'Product'[Performance] =
    VAR TotalSales =                                -- Sales of all the products
        SUMX (
            Sales,                                  -- Sales is not filtered
            Sales[Quantity] * Sales[Net Price]      -- thus here we compute all sales
        )
    VAR CurrentSales =
        CALCULATE (                                 -- Performs the context transition
            SUMX (
                Sales,                              -- Sales of the current product only
                Sales[Quantity] * Sales[Net Price]  -- thus here we compute sales of the
            )                                       -- current product only
        )
    VAR Ratio = 0.01                                -- 1% expressed as a real number
    VAR Result =
        IF (
            CurrentSales >= TotalSales * Ratio,
            "High Performance product",
            "Regular product"
        )
    RETURN
        Result
The iteration executed by SUMX is the same code for the two variables: One is surrounded by CALCULATE, whereas the other is not. It might seem like a good idea to rewrite the code and use a measure to host the code of the iteration. This could be even more relevant in case the expression is not a simple SUMX but, rather, some more complex code. Unfortunately, this approach will not work because a measure reference always includes a CALCULATE around the expression that replaces it. Imagine creating a measure, Sales Amount, and then a calculated column that references the measure twice: once surrounded by CALCULATE and once without it.

    Sales Amount :=
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )

    'Product'[Performance] =
    VAR TotalSales = [Sales Amount]
    VAR CurrentSales = CALCULATE ( [Sales Amount] )
    VAR Ratio = 0.01
    VAR Result =
        IF (
            CurrentSales >= TotalSales * Ratio,
            "High Performance product",
            "Regular product"
        )
    RETURN
        Result
Though it looked like a good idea, this calculated column does not compute the expected result. The reason is that both measure references will have their own implicit CALCULATE around them. Thus, TotalSales does not compute the sales of all the products. Instead, it only computes the sales of the current product because the hidden CALCULATE performs a context transition. CurrentSales computes the same value. In CurrentSales, the extra CALCULATE is redundant. Indeed, CALCULATE is already there, simply because the code references a measure. This is more evident by looking at the code resulting from expanding the Sales Amount measure:

    'Product'[Performance] =
    VAR TotalSales =
        CALCULATE (
            SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
        )
    VAR CurrentSales =
        CALCULATE (
            CALCULATE (
                SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
            )
        )
    VAR Ratio = 0.01
    VAR Result =
        IF (
            CurrentSales >= TotalSales * Ratio,
            "High Performance product",
            "Regular product"
        )
    RETURN
        Result
Whenever you read a measure call in DAX, you should always read it as if CALCULATE were there. Because it is there. We introduced a rule in Chapter 2, “Introducing DAX,” where we said that it is a best practice to always use the table name in front of columns, and never use the table name in front of measures. The reason is what we are discussing now. When reading DAX code, it is of paramount importance that the user be immediately able to understand whether the code is referencing a measure or a column. The de facto standard that nearly every DAX coder adopts is to omit the table name in front of measures. The automatic CALCULATE makes it easy to author formulas that perform complex calculations with iterations. We will use this feature extensively in Chapter 7 when we start writing complex DAX code to solve specific scenarios.
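If the intent is to reuse the Sales Amount measure and still obtain the total over all the products, the filter created by the implicit context transition has to be removed explicitly. One possible sketch, not necessarily the formulation the authors would choose, uses ALL on the Product table as a CALCULATE modifier. It assumes that no other filter is active while the calculated column is computed, which is the normal situation for a calculated column.

```dax
'Product'[Performance] =
VAR TotalSales =
    CALCULATE (
        [Sales Amount],
        ALL ( 'Product' )               -- removes the filters created by context transition
    )
VAR CurrentSales = [Sales Amount]       -- implicit CALCULATE: sales of the current product
VAR Ratio = 0.01                        -- 1% expressed as a real number
VAR Result =
    IF (
        CurrentSales >= TotalSales * Ratio,
        "High Performance product",
        "Regular product"
    )
RETURN
    Result
```

Here the measure is written once and referenced twice. The only difference between the two variables is the ALL modifier, which makes the removal of the product filter visible to whoever reads the code.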
Understanding circular dependencies

When you design a data model, you should pay attention to the complex topic of circular dependencies in formulas. In this section, you learn what circular dependencies are and how to avoid them in your model. Before introducing circular dependencies, it is worth discussing simple, linear dependencies with the aid of an example. Look at the following calculated column:

    Sales[Margin] = Sales[Net Price] - Sales[Unit Cost]
The new calculated column depends on two columns: Net Price and Unit Cost. This means that to compute the value of Margin, DAX needs to know in advance the values of the two other columns. Dependencies are an important part of the DAX model because they drive the order in which calculated columns and calculated tables are processed. In the example, Margin can only be computed after Net Price and Unit Cost already have a value. The coder does not need to worry about dependencies. Indeed, DAX handles them gracefully, building a complex graph that drives the order of evaluation of all its internal objects. However, it is possible to write code in such a way that circular dependencies appear in the graph. Circular dependencies happen when DAX cannot determine the order of evaluation of an expression because there is a loop in the chain of dependencies. For example, consider two calculated columns with the following formulas:

    Sales[MarginPct] = DIVIDE ( Sales[Margin], Sales[Unit Cost] )
    Sales[Margin] = Sales[MarginPct] * Sales[Unit Cost]
In this code, MarginPct depends on Margin and, at the same time, Margin depends on MarginPct. There is a loop in the chain of dependencies. In that scenario, DAX refuses to accept the last formula and raises the error, “A circular dependency was detected.” Circular dependencies do not happen frequently because as humans we understand the problem well. B cannot depend on A if, at the same time, A depends on B. Nevertheless, there is a scenario where a circular dependency occurs, not because one intends to create it, but only because one does not consider certain implications when reading DAX code. This scenario includes the use of CALCULATE. Imagine a calculated column in Sales with the following code:

    Sales[AllSalesQty] = CALCULATE ( SUM ( Sales[Quantity] ) )
The interesting question is, which columns does AllSalesQty depend on? Intuitively, one would answer that the new column depends solely on Sales[Quantity] because it is the only column used in the expression. However, it is all too easy to forget the real semantics of CALCULATE and context transition. Because CALCULATE runs in a row context, all current values of all the columns of the table are included in the expression, though hidden. Thus, the real expression evaluated by DAX is the following:

    Sales[AllSalesQty] =
    CALCULATE (
        SUM ( Sales[Quantity] ),
        Sales[ProductKey] = <current value of ProductKey>,
        Sales[StoreKey] = <current value of StoreKey>,
        ...,
        Sales[Margin] = <current value of Margin>
    )
As you see, the list of columns AllSalesQty depends on is actually the full set of columns of the table. Once CALCULATE is being used in a row context, the calculation suddenly depends on all the columns of the iterated table. This is much more evident in calculated columns, where the row context is present by default. If one authors a single calculated column using CALCULATE, everything still works fine. The problem appears if one tries to author two separate calculated columns in a table, with both columns using CALCULATE, thus firing context transition in both cases. In fact, the following new calculated column will fail:

    Sales[NewAllSalesQty] = CALCULATE ( SUM ( Sales[Quantity] ) )
The reason for this is that CALCULATE adds all the columns of the table as filter arguments. Adding a new column to a table changes the definition of existing columns too. If one were able to create NewAllSalesQty, the code of the two calculated columns would look like this:

    Sales[AllSalesQty] =
    CALCULATE (
        SUM ( Sales[Quantity] ),
        Sales[ProductKey] = <current value of ProductKey>,
        ...,
        Sales[Margin] = <current value of Margin>,
        Sales[NewAllSalesQty] = <current value of NewAllSalesQty>
    )

    Sales[NewAllSalesQty] =
    CALCULATE (
        SUM ( Sales[Quantity] ),
        Sales[ProductKey] = <current value of ProductKey>,
        ...,
        Sales[Margin] = <current value of Margin>,
        Sales[AllSalesQty] = <current value of AllSalesQty>
    )

You can see that the last filter argument of each formula references the other column. AllSalesQty depends on the value of NewAllSalesQty and, at the same time, NewAllSalesQty depends on the value of AllSalesQty. Although very well hidden, a circular dependency does exist. DAX detects the circular dependency, preventing the code from being accepted.
The problem, although somewhat complex to detect, has a simple solution. If the table on which CALCULATE performs the context transition contains one column with unique values and DAX is aware of that, then the context transition only filters that column from a dependency point of view. For example, consider a calculated column in the Product table with the following code:

    'Product'[ProductSales] = CALCULATE ( SUM ( Sales[Quantity] ) )
In this case, there is no need to add all the columns as filter arguments. In fact, Product contains one column that has a unique value for each row of the Product table: ProductKey. This is well-known by the DAX engine because that column is on the one-side of a one-to-many relationship. Consequently, when the context transition occurs, the engine knows that it would be pointless to add a filter to each column. The code would be translated into the following:

    'Product'[ProductSales] =
    CALCULATE (
        SUM ( Sales[Quantity] ),
        'Product'[ProductKey] = <current value of ProductKey>
    )
As you can see, the ProductSales calculated column in the Product table depends solely on ProductKey. Therefore, one could create many calculated columns using CALCULATE because all of them would only depend on the column with unique values.
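For instance, the following two calculated columns should be able to coexist in Product, assuming that ProductKey is on the one-side of the relationship with Sales, as it is in the sample model. The column name ProductSalesRows is hypothetical and is used here only for illustration.

```dax
-- Both columns rely on context transition over Product.
-- Each one is expected to depend on 'Product'[ProductKey] only, not on the other.
'Product'[ProductSales]     = CALCULATE ( SUM ( Sales[Quantity] ) )
'Product'[ProductSalesRows] = CALCULATE ( COUNTROWS ( Sales ) )
```

The same pair of formulas written as calculated columns in Sales, which has no row identifier, is the situation that raises the circular dependency error described earlier.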
Note The last CALCULATE equivalent statement for the context transition is not totally accurate. We used it for educational purposes only. CALCULATE adds all the columns of the table as filter arguments, even if a row identifier is present. Nevertheless, the internal dependency is only created on the unique column. The presence of the unique column lets DAX evaluate multiple columns with CALCULATE. Still, the semantics of CALCULATE is the same with or without the unique column: All the columns of the iterated table are added as filter arguments.

We already discussed the fact that relying on context transition on a table that contains duplicates is a serious problem. The presence of circular dependencies is another very good reason why one should avoid using CALCULATE and context transition whenever the uniqueness of rows is not guaranteed. Resorting to a column with unique values for each row is not enough to ensure that CALCULATE only depends on it for the context transition. The data model must be aware of that. How does DAX know that a column contains unique values? There are multiple ways to provide this information to the engine:

■ When a table is the target (one-side) of a relationship, then the column used to build the relationship is marked as unique. This technique works in any tool.

■ When a column is selected in the Mark As Date Table setting, then the column is implicitly unique. More on this in Chapter 8, “Time intelligence calculations.”

■ You can manually set the property of a row identifier for the unique column by using the Table Behavior properties. This technique only works in Power Pivot for Excel and Analysis Services Tabular; it is not available in Power BI at the time of writing.
Any one of these operations informs the DAX engine that the table has a row identifier. As a consequence, the engine refuses to process a table whose content does not respect that constraint. When a table has a row identifier, you can use CALCULATE without worrying about circular dependencies. The reason is that the context transition depends on the key column only.
Note Though described as a feature, this behavior is actually a side effect of an optimization. The semantics of DAX require the dependency from all the columns. A specific optimization introduced very early in the engine only creates the dependency on the primary key of the table. Because many users rely on this behavior today, it has become part of the language. Still, it remains an optimization. In borderline scenarios—for example when using USERELATIONSHIP as part of the formula—the optimization does not kick in, thus recreating the circular dependency error.
CALCULATE modifiers

As you have learned in this chapter, CALCULATE is extremely powerful and produces complex DAX code. So far, we have only covered filter arguments and context transition. There is still one concept required to provide the set of rules to fully understand CALCULATE. It is the concept of CALCULATE modifier. We introduced two modifiers earlier, when we talked about ALL and KEEPFILTERS. While ALL can be both a modifier and a table function, KEEPFILTERS is always a filter argument modifier, meaning that it changes the way one filter is merged with the original filter context. CALCULATE accepts several different modifiers that change how the new filter context is prepared. However, the most important of all these modifiers is a function that you already know very well: ALL. When ALL is directly used in a CALCULATE filter argument, it acts as a CALCULATE modifier instead of being a table function. Other important modifiers include USERELATIONSHIP, CROSSFILTER, and ALLSELECTED, which have separate descriptions. The ALLEXCEPT, ALLSELECTED, ALLCROSSFILTERED, and ALLNOBLANKROW modifiers have the same precedence rules as ALL. In this section we introduce these modifiers; then we will discuss the order of precedence of the different CALCULATE modifiers and filter arguments. At the end, we will present the final schema of CALCULATE rules.
Understanding USERELATIONSHIP

The first CALCULATE modifier you learn is USERELATIONSHIP. CALCULATE can activate a relationship during the evaluation of its expression by using this modifier. A data model might contain both active and inactive relationships. One might have inactive relationships in the model because there are several relationships between two tables, and only one of them can be active. As an example, one might have order date and delivery date stored in the Sales table for each order. Typically, the requirement is to perform sales analysis based on the order date, but one might need to consider the delivery date for some specific measures. In that scenario, an option is to create two relationships between Sales and Date: one based on Order Date and another one based on Delivery Date. The model looks like the one in Figure 5-37.
FIGURE 5-37 Sales and Date are linked through two relationships, although only one can be active.
Only one of the two relationships can be active at a time. For example, in this demo model the relationship with Order Date is active, whereas the one linked to Delivery Date is kept inactive. To author a measure that shows the delivered value in a given time period, the relationship with Delivery Date needs to be activated for the duration of the calculation. In this scenario, USERELATIONSHIP is of great help as in the following code:

    Delivered Amount :=
    CALCULATE (
        [Sales Amount],
        USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
    )
The relationship between Delivery Date and Date is activated during the evaluation of Sales Amount. In the meantime, the relationship with Order Date is deactivated. Keep in mind that at a given point in time, only one relationship can be active between any two tables. Thus, USERELATIONSHIP temporarily activates one relationship, deactivating the one active outside of CALCULATE. Figure 5-38 shows the difference between Sales Amount based on the Order Date, and the new Delivered Amount measure.
FIGURE 5-38 The figure illustrates the difference between ordered and delivered sales.
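A comparison like the one in the figure can also be produced with a query, using a measure that is local to the query, before deploying anything to the model. The following is a minimal sketch for a client tool such as DAX Studio. It assumes the sample model with its Sales Amount measure and the inactive relationship on Delivery Date; the measure name Delivered Amount Test and the result column names are hypothetical.

```dax
DEFINE
    -- Query-local measure: the model itself is left untouched
    MEASURE Sales[Delivered Amount Test] =
        CALCULATE (
            [Sales Amount],
            USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
        )
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Calendar Year],
    "Ordered", [Sales Amount],              -- uses the active relationship (Order Date)
    "Delivered", [Delivered Amount Test]    -- uses the relationship on Delivery Date
)
ORDER BY 'Date'[Calendar Year]
```

The two value columns are expected to differ for the years in which orders are delivered after the year end.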
When using USERELATIONSHIP to activate a relationship, you need to be aware of an important aspect: Relationships are defined when a table reference is used, not when RELATED or other relational functions are invoked. We will cover the details of this in Chapter 14 by using expanded tables. For now, an example should suffice. To compute all amounts delivered in 2007, the following formula will not work:

    Delivered Amount 2007 v1 :=
    CALCULATE (
        [Sales Amount],
        FILTER (
            Sales,
            CALCULATE (
                RELATED ( 'Date'[Calendar Year] ),
                USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
            ) = "CY 2007"
        )
    )
In fact, CALCULATE would invalidate the row context generated by the FILTER iteration. Thus, inside the CALCULATE expression, one cannot use the RELATED function at all. One option to author the code would be the following:

    Delivered Amount 2007 v2 :=
    CALCULATE (
        [Sales Amount],
        CALCULATETABLE (
            FILTER (
                Sales,
                RELATED ( 'Date'[Calendar Year] ) = "CY 2007"
            ),
            USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
        )
    )
In this latter formulation, Sales is referenced after CALCULATETABLE has activated the required relationship. Therefore, the use of RELATED inside FILTER happens with the relationship with Delivery Date active. The Delivered Amount 2007 v2 measure works, but a much better formulation of the same measure relies on default filter context propagation rather than relying on RELATED:

    Delivered Amount 2007 v3 :=
    CALCULATE (
        [Sales Amount],
        'Date'[Calendar Year] = "CY 2007",
        USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
    )
When you use USERELATIONSHIP in a CALCULATE statement, all the filter arguments are evaluated using the relationship modifiers that appear in the same CALCULATE statement—regardless of their order. For example, in the Delivered Amount 2007 v3 measure, the USERELATIONSHIP modifier affects the predicate filtering Calendar Year, although it is the previous parameter within the same CALCULATE function call. This behavior makes the use of nondefault relationships a complex operation in calculated column expressions. The invocation of the table is implicit in a calculated column definition. Therefore, you do not have control over it, and you cannot change that behavior by using CALCULATE and USERELATIONSHIP. One important note is the fact that USERELATIONSHIP does not introduce any filter by itself. Indeed, USERELATIONSHIP is not a filter argument. It is a CALCULATE modifier. It only changes the way other filters are applied to the model. If you carefully look at the definition of Delivered Amount in 2007 v3, you might notice that the filter argument applies a filter on the year 2007, but it does not indicate
which relationship to use. Is it using Order Date or Delivery Date? The relationship to use is defined by USERELATIONSHIP. Thus, CALCULATE first modifies the structure of the model by activating the relationship, and only later does it apply the filter argument. If that were not the case—that is, if the filter argument were always evaluated on the current relationship architecture—then the calculation would not work. There are precedence rules in the application of filter arguments and of CALCULATE modifiers. The first rule is that CALCULATE modifiers are always applied before any filter argument, so that the effect of filter arguments is applied on the modified version of the model. We discuss precedence of CALCULATE arguments in more detail later.
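As a minimal sketch of this precedence rule, the following query defines two query-local measures that differ only in the position of USERELATIONSHIP. The measure names Delivered 2007 A and Delivered 2007 B are hypothetical; the Sales Amount measure and the Delivery Date relationship are the ones used earlier in this section. Under the rule described above, both measures are expected to return the same value.

```dax
DEFINE
    -- Filter argument first, modifier last
    MEASURE Sales[Delivered 2007 A] =
        CALCULATE (
            [Sales Amount],
            'Date'[Calendar Year] = "CY 2007",
            USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
        )
    -- Modifier first, filter argument last
    MEASURE Sales[Delivered 2007 B] =
        CALCULATE (
            [Sales Amount],
            USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] ),
            'Date'[Calendar Year] = "CY 2007"
        )
EVALUATE
ROW (
    "Modifier last", [Delivered 2007 A],
    "Modifier first", [Delivered 2007 B]
)
```

Running a comparison like this in a query tool is a quick way to confirm that the position of a CALCULATE modifier among the arguments does not matter.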
Understanding CROSSFILTER
The next CALCULATE modifier you learn is CROSSFILTER. CROSSFILTER is somewhat similar to USERELATIONSHIP because it manipulates the architecture of the relationships in the model. Nevertheless, CROSSFILTER can perform two different operations:
■ It can change the cross-filter direction of a relationship.
■ It can disable a relationship.
USERELATIONSHIP lets you activate a relationship while disabling the active relationship, but it cannot disable a relationship without activating another one between the same tables. CROSSFILTER works in a different way. CROSSFILTER accepts two parameters, which are the columns involved in the relationship, and a third parameter that can be either NONE, ONEWAY, or BOTH. For example, the following measure computes the distinct count of product colors after activating the relationship between Sales and Product as a bidirectional one:
NumOfColors :=
CALCULATE (
    DISTINCTCOUNT ( 'Product'[Color] ),
    CROSSFILTER ( Sales[ProductKey], 'Product'[ProductKey], BOTH )
)
As is the case with USERELATIONSHIP, CROSSFILTER does not introduce filters by itself. It only changes the structure of the relationships, leaving to other filter arguments the task of applying filters. In the previous example, the modified relationship only affects the DISTINCTCOUNT function because CALCULATE has no further filter arguments.
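The second operation, disabling a relationship, can be sketched in the same way. The measure name below is hypothetical, and the sketch assumes that no other relationship path connects Product and Sales; under that assumption, a filter on Product should no longer reach Sales while the expression is evaluated.

```dax
-- Hypothetical measure: evaluates Sales Amount with the
-- Sales-Product relationship disabled
Sales Ignoring Product Filters :=
CALCULATE (
    [Sales Amount],
    CROSSFILTER ( Sales[ProductKey], 'Product'[ProductKey], NONE )
)
```

Note that filters placed directly on Sales columns, or arriving through other tables, are left untouched because CROSSFILTER only changes the relationship.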
Understanding KEEPFILTERS
We introduced KEEPFILTERS earlier in this chapter as a CALCULATE modifier. Technically, KEEPFILTERS is not a CALCULATE modifier; it is a filter argument modifier. Indeed, it does not change the entire evaluation of CALCULATE. Instead, it changes the way one individual filter argument is applied to the final filter context generated by CALCULATE.
We already discussed in depth the behavior of CALCULATE in the presence of calculations like the following one:
Contoso Sales :=
CALCULATE (
    [Sales Amount],
    KEEPFILTERS ( 'Product'[Brand] = "Contoso" )
)
The presence of KEEPFILTERS means that the filter on Brand does not overwrite a previously existing filter on the same column. Instead, the new filter is added to the filter context, leaving the previous one intact. KEEPFILTERS is applied to the individual filter argument where it is used, and it does not change the semantics of the whole CALCULATE function. There is another way to use KEEPFILTERS that is less obvious. One can use KEEPFILTERS as a modifier for the table used for an iteration, like in the following code:
ColorBrandSales :=
SUMX (
    KEEPFILTERS ( ALL ( 'Product'[Color], 'Product'[Brand] ) ),
    [Sales Amount]
)
The presence of KEEPFILTERS as the top-level function used in an iteration forces DAX to use KEEPFILTERS on the implicit filter arguments added by CALCULATE during a context transition. In fact, during the iteration over the values of Product[Color] and Product[Brand], SUMX invokes CALCULATE as part of the evaluation of the Sales Amount measure. At that point, the context transition occurs, and the row context becomes a filter context by adding a filter argument for Color and Brand. Because the iteration started with KEEPFILTERS, context transition will not overwrite existing filters. It will intersect the existing filters instead. It is uncommon to use KEEPFILTERS as the top-level function in an iteration. We will cover some examples of this advanced use later in Chapter 10.
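To make the difference visible, the following query is a sketch that compares the iteration with and without KEEPFILTERS under an outer filter on Color. The two measure names are hypothetical, and the color "Red" is only an assumed value in the Product table. Following the description above, the first measure should ignore the outer filter for every iterated combination, whereas the second should keep only the combinations compatible with it.

```dax
DEFINE
    -- Context transition overwrites the outer filter on Color
    MEASURE Sales[ColorBrandSales Overwrite] =
        SUMX (
            ALL ( 'Product'[Color], 'Product'[Brand] ),
            [Sales Amount]
        )
    -- Context transition intersects with the outer filter on Color
    MEASURE Sales[ColorBrandSales Keep] =
        SUMX (
            KEEPFILTERS ( ALL ( 'Product'[Color], 'Product'[Brand] ) ),
            [Sales Amount]
        )
EVALUATE
CALCULATETABLE (
    ROW (
        "Overwrite", [ColorBrandSales Overwrite],
        "Keep", [ColorBrandSales Keep]
    ),
    'Product'[Color] = "Red"
)
```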
Understanding ALL in CALCULATE
ALL is a table function, as you learned in Chapter 3. Nevertheless, ALL acts as a CALCULATE modifier when used as a filter argument in CALCULATE. The function name is the same, but the semantics of ALL as a CALCULATE modifier are slightly different from what one would expect. Looking at the following code, one might think that ALL returns all the years, and that it changes the filter context making all years visible:
All Years Sales :=
CALCULATE (
    [Sales Amount],
    ALL ( 'Date'[Year] )
)
However, this is not true. When used as a top-level function in a filter argument of CALCULATE, ALL removes an existing filter instead of creating a new one. A proper name for ALL would have been
REMOVEFILTER. For historical reasons, the name remained ALL and it is a good idea to know exactly how the function behaves. If one considers ALL as a table function, they would interpret the CALCULATE behavior like in Figure 5-39.
CALCULATE (
    CALCULATE (
        …
        ALL ( 'Date'[Year] )
    ),
    'Date'[Year] = 2007
)
FIGURE 5-39 It looks like ALL returns all the years and uses the list to overwrite the previous filter context.
The innermost ALL over Date[Year] is a top-level ALL function call in CALCULATE. As such, it does not behave as a table function. It should really be read as REMOVEFILTER. In fact, instead of returning all the years, in that case ALL acts as a CALCULATE modifier that removes any filter from its argument. What really happens inside CALCULATE is the diagram of Figure 5-40.
CALCULATE (
    CALCULATE (
        …
        ALL ( 'Date'[Year] )
    ),
    'Date'[Year] = 2007
)
FIGURE 5-40 ALL removes a previously existing filter from the context, when used as REMOVEFILTER.
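More recent versions of the DAX engine include a function named REMOVEFILTERS, which can only be used as a CALCULATE modifier. Assuming it is available in the engine version you are using, the following sketch expresses the same intent as the inner CALCULATE of the figures with a name that matches the behavior. The measure name is hypothetical.

```dax
-- Hypothetical measure: same intent as ALL ( 'Date'[Year] )
-- used as a top-level filter argument of CALCULATE
All Years Sales Explicit :=
CALCULATE (
    [Sales Amount],
    REMOVEFILTERS ( 'Date'[Year] )
)
```

If REMOVEFILTERS is not available in your environment, ALL used as a top-level filter argument remains the way to obtain this behavior.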
The difference between the two behaviors is subtle. In most calculations, the slight difference in semantics will go unnoticed. Nevertheless, when we start authoring more advanced code, this small difference will make a big impact. For now, the important detail is that when ALL is used as REMOVEFILTER, it acts as a CALCULATE modifier instead of acting as a table function. This is important because of the order of precedence of filters in CALCULATE. The CALCULATE modifiers are applied to the final filter context before explicit filter arguments. Thus, when ALL is applied to a column and KEEPFILTERS is used on another explicit filter over that same column, the result is the same as applying the filter to that column without KEEPFILTERS. In other words, the following definitions of the Sales Red measure produce the same result:
Sales Red :=
CALCULATE (
    [Sales Amount],
    'Product'[Color] = "Red"
)
Sales Red :=
CALCULATE (
    [Sales Amount],
    KEEPFILTERS ( 'Product'[Color] = "Red" ),
    ALL ( 'Product'[Color] )
)
The reason is that ALL is a CALCULATE modifier. Therefore, ALL is applied before KEEPFILTERS. Moreover, the same precedence rule of ALL is shared by other functions with the same ALL prefix: These are ALL, ALLSELECTED, ALLNOBLANKROW, ALLCROSSFILTERED, and ALLEXCEPT. We generally refer to these functions as the ALL* functions. As a rule, ALL* functions are CALCULATE modifiers when used as top-level functions in CALCULATE filter arguments.
Introducing ALL and ALLSELECTED with no parameters
We introduced ALLSELECTED in Chapter 3. We introduced it early on, mainly because of how useful it is. Like all the ALL* functions, ALLSELECTED acts as a CALCULATE modifier when used as a top-level function in CALCULATE. Moreover, when introducing ALLSELECTED, we described it as a table function that can return the values of either a column or a table. The following code computes a percentage over the total number of colors selected outside of the current visual. The reason is that ALLSELECTED restores the filter context outside of the current visual on the Product[Color] column.
SalesPct :=
DIVIDE (
    [Sales],
    CALCULATE (
        [Sales],
        ALLSELECTED ( 'Product'[Color] )
    )
)
One achieves a similar result using ALLSELECTED ( Product ), which executes ALLSELECTED on top of a whole table. Nevertheless, when used as a CALCULATE modifier, both ALL and ALLSELECTED can also work without any parameter. Thus, the following is a valid syntax:
SalesPct :=
DIVIDE (
    [Sales],
    CALCULATE (
        [Sales],
        ALLSELECTED ( )
    )
)
As you can easily notice, in this case ALLSELECTED cannot be a table function. It is a CALCULATE modifier that instructs CALCULATE to restore the filter context that was active outside of the current visual. The way this whole calculation works is rather complex. We will take the behavior of ALLSELECTED to the next level in Chapter 14. Similarly, ALL with no parameters clears the filter context from all the tables in the model, restoring a filter context with no filters active. Now that we have completed the overall structure of CALCULATE, we can finally discuss in detail the order of evaluation of all the elements involving CALCULATE.
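The parameterless form of ALL can be sketched along the same lines. The measure name below is hypothetical; the sketch divides the value in the current filter context by the value computed with every filter removed from the model.

```dax
-- Hypothetical measure: percentage over the grand total,
-- computed with no filter active on any table
Pct of Grand Total :=
DIVIDE (
    [Sales Amount],
    CALCULATE (
        [Sales Amount],
        ALL ( )
    )
)
```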
CALCULATE rules
In this final section of a long and difficult chapter, we are now able to provide the definitive guide to CALCULATE. You might want to reference this section multiple times, while reading the remaining part of the book. Whenever you need to recall the complex behavior of CALCULATE, you will find the answer in this section. Do not fear coming back here multiple times. We started working with DAX many years ago, and we must still remind ourselves of these rules for complex formulas. DAX is a clean and powerful language, but it is easy to forget small details here and there that are actually crucial in determining the calculation outcome of particular scenarios. To recap, this is the overall picture of CALCULATE:
■ CALCULATE is executed in an evaluation context, which contains a filter context and might contain one or more row contexts. This is the original context.
■ CALCULATE creates a new filter context, in which it evaluates its first argument. This is the new filter context. The new filter context only contains a filter context. All the row contexts disappear in the new filter context because of the context transition.
■ CALCULATE accepts three kinds of parameters:
    • One expression that will be evaluated in the new filter context. This is always the first argument.
    • A set of explicit filter arguments that manipulate the original filter context. Each filter argument might have a modifier, such as KEEPFILTERS.
    • A set of CALCULATE modifiers that can change the model and/or the structure of the original filter context, by removing some filters or by altering the relationships architecture.
■ When the original context includes one or more row contexts, CALCULATE performs a context transition adding implicit and hidden filter arguments. The implicit filter arguments obtained by row contexts iterating table expressions marked as KEEPFILTERS are also modified by KEEPFILTERS.
When using all these parameters, CALCULATE follows a very precise algorithm. It needs to be well understood if the developer hopes to be able to make sense of certain complex calculations.
1. CALCULATE evaluates all the explicit filter arguments in the original evaluation context. This includes both the original row contexts (if any) and the original filter context. All explicit filter arguments are evaluated independently in the original evaluation context. Once this evaluation is finished, CALCULATE starts building the new filter context.
2. CALCULATE makes a copy of the original filter context to prepare the new filter context. It discards the original row contexts because the new evaluation context will not contain any row context.
3. CALCULATE performs the context transition. It uses the current value of columns in the original row contexts to provide a filter with a unique value for all the columns currently being iterated in the original row contexts. This filter may or may not contain one individual row. There is no guarantee that the new filter context contains a single row at this point. If there are no row contexts active, this step is skipped. Once all implicit filters created by the context transition are applied to the new filter context, CALCULATE moves on to the next step.
4. CALCULATE evaluates the CALCULATE modifiers USERELATIONSHIP, CROSSFILTER, and ALL*. This step happens after step 3. This is very important because it means that one can remove the effects of the context transition by using ALL, as described in Chapter 10. The CALCULATE modifiers are applied after the context transition, so they can alter the effects of the context transition.
5. CALCULATE evaluates all the explicit filter arguments in the original filter context. It applies their result to the new filter context generated after step 4. These filter arguments are applied to the new filter context once the context transition has happened so they can overwrite it, after filter removal (their filter is not removed by any ALL* modifier), and after the relationship architecture has been updated. However, the evaluation of filter arguments happens in the original filter context, and it is not affected by any other modifier or filter within the same CALCULATE function.
The filter context generated after point (5) is the new filter context used by CALCULATE in the evaluation of its expression.
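The interaction between steps 3 and 4 can be illustrated with a small sketch. The calculated column below is hypothetical. It relies only on the Product and Sales tables used throughout the chapter and on the relationship between them. In the numerator, the context transition restricts Sales to the current product; in the denominator, ALL is applied after the context transition and removes the filters that the transition placed on Product.

```dax
-- Hypothetical calculated column in the Product table
'Product'[Pct of All Quantity] =
DIVIDE (
    -- Step 3 only: the row context on Product becomes a filter,
    -- so only the sales of the current product are visible
    CALCULATE ( SUM ( Sales[Quantity] ) ),
    -- Steps 3 and 4: ALL is applied after the context transition
    -- and removes the filter on Product it generated
    CALCULATE ( SUM ( Sales[Quantity] ), ALL ( 'Product' ) )
)
```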
CHAPTER 6
Variables
Variables are important for at least two reasons: code readability and performance. In this chapter, we provide detailed information about variables and their usage, whereas considerations about performance and readability are found all around the book. Indeed, we use variables in almost all the code examples, sometimes showing the version with and without variables to let you appreciate how using variables improves readability. Later in Chapter 20, “Optimizing DAX,” we will also show how the use of variables can dramatically improve the performance of your code. In this chapter, we are mainly interested in providing all the useful information about variables in a single place.
Introducing VAR syntax
A variable is introduced in an expression by the keyword VAR, which defines the variable, followed by the RETURN part, which defines the result. You can see a typical expression containing a variable in the following code:
VAR SalesAmt =
    SUMX (
        Sales,
        Sales[Quantity] * Sales[Net Price]
    )
RETURN
    IF (
        SalesAmt > 100000,
        SalesAmt,
        SalesAmt * 1.2
    )
Adding more VAR definitions within the same block allows for the definition of multiple variables, whereas the RETURN block needs to be unique. It is important to note that the VAR/RETURN block is, indeed, an expression. As such, a variable definition makes sense wherever an expression can be used. This makes it possible to define variables during an iteration, or as part of more complex expressions, like in the following example:
VAR SalesAmt =
    SUMX (
        Sales,
        VAR Quantity = Sales[Quantity]
        VAR Price = Sales[Net Price]
        RETURN
            Quantity * Price
    )
RETURN
    ...
Variables are commonly defined at the beginning of a measure definition and then used throughout the measure code. Nevertheless, this is only a writing habit. In complex expressions, defining local variables deeply nested inside other function calls is common practice. In the previous code example, the Quantity and Price variables are assigned for every row of the Sales table iterated by SUMX. These variables are not available outside of the expression executed by SUMX for each row.
A variable can store either a scalar value or a table. The variables can be (and often are) of a different type than the expression returned after RETURN. Multiple variables in the same VAR/RETURN block can be of different types too, either scalar values or tables.
A very frequent usage of variables is to divide the calculation of a complex formula into logical steps, by assigning the result of each step to a variable. For example, in the following code variables are used to store partial results of the calculation:
Margin% :=
VAR SalesAmount =
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
VAR TotalCost =
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )
VAR Margin =
    SalesAmount - TotalCost
VAR MarginPerc =
    DIVIDE ( Margin, TotalCost )
RETURN
    MarginPerc
The same expression without variables takes a lot more attention to read:
Margin% :=
DIVIDE (
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
        - SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] ),
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )
)
Moreover, the version with variables has the advantage that each variable is only evaluated once. For example, TotalCost is used in two different parts of the code but, because it is defined as a variable, DAX guarantees that its evaluation only happens once. You can write any expression after RETURN. However, using a single variable for the RETURN part is considered best practice. For example, in the previous code, it would be possible to remove the MarginPerc variable definition by writing DIVIDE right after RETURN. However, using RETURN followed by a single variable (like in the example) allows for an easy change of the value returned by the measure. This is useful when inspecting the value of intermediate steps. In our example, if the total is not correct, it would be a good idea to check the value returned in each step, by using a report that includes the measure. This means replacing MarginPerc with Margin, then with TotalCost, and then with SalesAmount in the final RETURN. You would execute the report each time to see the result produced in the intermediate steps.
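As a sketch of this debugging technique, the following version of the Margin% measure from the previous example temporarily returns an intermediate variable. Only the final line changes; the variable definitions are left untouched.

```dax
Margin% :=
VAR SalesAmount =
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
VAR TotalCost =
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )
VAR Margin =
    SalesAmount - TotalCost
VAR MarginPerc =
    DIVIDE ( Margin, TotalCost )
RETURN
    Margin -- Temporarily inspect Margin; restore MarginPerc when done
```

Because unused variables are harmless, MarginPerc can stay in place while the intermediate step is being inspected.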
Understanding that variables are constant
Despite its name, a DAX variable is a constant. Once assigned a value, the variable cannot be modified. For example, if a variable is assigned within an iterator, it is created and assigned for every row iterated. Moreover, the value of the variable is only available within the expression of the iterator it is defined in.
Amount at Current Price :=
SUMX (
    Sales,
    VAR Quantity = Sales[Quantity]
    VAR CurrentPrice = RELATED ( 'Product'[Unit Price] )
    VAR AmountAtCurrentPrice = Quantity * CurrentPrice
    RETURN
        AmountAtCurrentPrice
)
-- Any reference to Quantity, CurrentPrice, or AmountAtCurrentPrice
-- would be invalid outside of SUMX
Variables are evaluated once in the scope of the definition (VAR) and not when their value is used. For example, the following measure always returns 100% because the SalesAmount variable is not affected by CALCULATE. Its value is only computed once. Any reference to the variable name returns the same value regardless of the filter context where the variable value is used.
% of Product :=
VAR SalesAmount =
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
RETURN
    DIVIDE (
        SalesAmount,
        CALCULATE (
            SalesAmount,
            ALL ( 'Product' )
        )
    )
In this latter example, we used a variable where we should have used a measure. Indeed, if the goal is to avoid the duplication of the code of SalesAmount in two parts of the expression, the right solution requires using a measure instead of a variable to obtain the expected result. In the following code, the correct percentage is obtained by defining two measures:
Sales Amount :=
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
% of Product :=
DIVIDE (
    [Sales Amount],
    CALCULATE (
        [Sales Amount],
        ALL ( 'Product' )
    )
)
In this case the Sales Amount measure is evaluated twice, in two different filter contexts—leading as expected to two different results.
Understanding the scope of variables
Each variable definition can reference the variables previously defined within the same VAR/RETURN statement. All the variables already defined in outer VAR statements are also available. A variable definition can access the variables defined in previous VAR statements, but not the variables defined in following statements. Thus, this code works fine:
Margin :=
VAR SalesAmount =
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
VAR TotalCost =
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )
VAR Margin = SalesAmount - TotalCost
RETURN
    Margin
Whereas if one moves the definition of Margin at the beginning of the list, as in the following example, DAX will not accept the syntax. Indeed, Margin references two variables that are not yet defined (SalesAmount and TotalCost):
Margin :=
VAR Margin = SalesAmount - TotalCost -- Error: SalesAmount and TotalCost are not defined
VAR SalesAmount =
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
VAR TotalCost =
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )
RETURN
    Margin
Because it is not possible to reference a variable before its definition, it is also impossible to create either a circular dependency between variables, or any sort of recursive definition. It is possible to nest VAR/RETURN statements inside each other, or to have multiple VAR/RETURN blocks in the same expression. The scope of variables differs in the two scenarios. For example, in the following measure the two variables LineAmount and LineCost are defined in two different scopes that are not nested. Thus, at no point in the code can LineAmount and LineCost both be accessed within the same expression:
Margin :=
SUMX (
    Sales,
    (
        VAR LineAmount = Sales[Quantity] * Sales[Net Price]
        RETURN
            LineAmount
    ) -- The parenthesis closes the scope of LineAmount
      -- The LineAmount variable is not accessible from here on in
    -
    (
        VAR LineCost = Sales[Quantity] * Sales[Unit Cost]
        RETURN
            LineCost
    )
)
Clearly, this example is only for educational purposes. A better way of defining the two variables and of using them is the following definition of Margin:
Margin :=
SUMX (
    Sales,
    VAR LineAmount = Sales[Quantity] * Sales[Net Price]
    VAR LineCost = Sales[Quantity] * Sales[Unit Cost]
    RETURN
        LineAmount - LineCost
)
As a further educational example, it is interesting to consider the real scope where a variable is accessible when the parentheses are not used and an expression defines and reads several variables in separate VAR/RETURN statements. For example, consider the following code:
Margin :=
SUMX (
    Sales,
    VAR LineAmount = Sales[Quantity] * Sales[Net Price]
    RETURN LineAmount
        - VAR LineCost = Sales[Quantity] * Sales[Unit Cost]
    RETURN LineCost -- Here LineAmount is still accessible
)
The entire expression after the first RETURN is part of a single expression. Thus, the LineCost definition is nested within the LineAmount definition. Using the parentheses to delimit each RETURN expression and indenting the code appropriately makes this concept more visible:
Margin :=
SUMX (
    Sales,
    VAR LineAmount = Sales[Quantity] * Sales[Net Price]
    RETURN (
        LineAmount
            - VAR LineCost = Sales[Quantity] * Sales[Unit Cost]
              RETURN (
                  LineCost -- Here LineAmount is still accessible
              )
    )
)
As shown in the previous example, because a variable can be defined for any expression, a variable can also be defined within the expression assigned to another variable. In other words, it is possible to define nested variables. Consider the following example:
Amount at Current Price :=
SUMX (
    'Product',
    VAR CurrentPrice = 'Product'[Unit Price]
    RETURN
        -- CurrentPrice is available within the inner SUMX
        SUMX (
            RELATEDTABLE ( Sales ),
            VAR Quantity = Sales[Quantity]
            VAR AmountAtCurrentPrice = Quantity * CurrentPrice
            RETURN
                AmountAtCurrentPrice
        )
        -- Any reference to Quantity, or AmountAtCurrentPrice
        -- would be invalid outside of the innermost SUMX
)
-- Any reference to CurrentPrice
-- would be invalid outside of the outermost SUMX
The rules pertaining to the scope of variables are the following:
■ A variable is available in the RETURN part of its VAR/RETURN block. It is also available in all the variables defined after the variable itself, within that VAR/RETURN block. The VAR/RETURN block replaces any DAX expression, and in such expression the variable can be read. In other words, the variable is accessible from its declaration point until the end of the expression following the RETURN statement that is part of the same VAR/RETURN block.
■ A variable is never available outside of its own VAR/RETURN block definition. After the expression following the RETURN statement, the variables declared within the VAR/RETURN block are no longer visible. Referencing them generates a syntax error.
Using table variables
A variable can store either a table or a scalar value. The type of the variable depends on its definition; for instance, if the expression used to define the variable is a table expression, then the variable contains a table. Consider the following code:
Amount :=
IF (
    HASONEVALUE ( Slicer[Factor] ),
    VAR Factor = VALUES ( Slicer[Factor] )
    RETURN
        DIVIDE ( [Sales Amount], Factor )
)
If Slicer[Factor] is a column with a single value in the current filter context, then it can be used as a scalar expression. The Factor variable stores a table because it contains the result of VALUES, which is a table function. If the user does not check for the presence of a single row with HASONEVALUE, the variable assignment works fine; the line raising an error is the second parameter of DIVIDE, where the variable is used, and conversion fails.
When a variable contains a table, it is likely because one wants to iterate on it. It is important to note that, during such iteration, one should access the columns of a table variable by using their original names. In other words, a variable name is not an alias of the underlying table in column references:
Filtered Amount :=
VAR MultiSales =
    FILTER ( Sales, Sales[Quantity] > 1 )
RETURN
    SUMX (
        MultiSales,
        -- MultiSales is not a table name for column references
        -- Trying to access MultiSales[Quantity] would generate an error
        Sales[Quantity] * Sales[Net Price]
    )
Although SUMX iterates over MultiSales, you must use the Sales table name to access the Quantity and Net Price columns. A column reference such as MultiSales[Quantity] is invalid. One current DAX limitation is that a variable cannot have the same name as any table in the data model. This prevents the possible confusion between a table reference and a variable reference. Consider the following code:
SUMX (
    LargeSales,
    Sales[Quantity] * Sales[Net Price]
)
A human reader immediately understands that LargeSales should be a variable because the column references in the iterator reference another table name: Sales. However, DAX disambiguates at the language level through the distinctiveness of the name. A certain name can be either a table or a variable, but not both at the same time. Although this looks like a convenient limitation because it reduces confusion, it might be problematic in the long run. Indeed, whenever you define the name of a variable, you should use a name that will never be used as a table name in the future. Otherwise, if at some point you create a new table whose name conflicts with variables used in any measure, you will obtain an error. Any syntax limitation that requires you to predict what will happen in the future—like choosing the name of a table—is an issue to say the least. For this reason, when Power BI generates DAX queries, it uses variable names adopting a prefix with two underscores (__). The rationale is that a user is unlikely to use the same name in a data model.
Note This behavior could change in the future, thus enabling a variable name to override the name of an existing table. When this change is implemented, there will no longer be a risk of breaking an existing DAX expression by giving a new table the name of a variable. When a variable name overrides a table name, the disambiguation will be possible by using the single quote to delimit the table identifier using the following syntax:
variableName
'tableName'
Should a developer design a DAX code generator to be injected in existing expressions, they can use the single quote to disambiguate table identifiers. This is not required in regular DAX code, if the code does not include ambiguous names between variables and tables.
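The naming convention mentioned above can be adopted in hand-written code too. The following sketch rewrites the earlier Filtered Amount measure with a variable name that starts with two underscores; the specific name __MultiSales is only an illustration of the convention.

```dax
Filtered Amount :=
-- The double-underscore prefix makes a future clash
-- with a table name unlikely
VAR __MultiSales =
    FILTER ( Sales, Sales[Quantity] > 1 )
RETURN
    SUMX (
        __MultiSales,
        Sales[Quantity] * Sales[Net Price]
    )
```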
Understanding lazy evaluation
As you have learned, DAX evaluates the variable within the evaluation context where it is defined, and not where it is being used. Still, the evaluation of the variable itself is delayed until its first use. This technique is known as lazy evaluation. Lazy evaluation is important for performance reasons: a variable that is never used in an expression will never be evaluated. Moreover, once a variable is computed for the first time, it will never be computed again in the same scope. For example, consider the following code:
Sales Amount :=
VAR SalesAmount =
    SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
VAR DummyError =
    ERROR ( "This error will never be displayed" )
RETURN
    SalesAmount
The variable DummyError is never used, so its expression is never executed. Therefore, the error never happens and the measure works correctly. Obviously, nobody would ever write code like this. The goal of the example is to show that DAX does not spend precious CPU time evaluating a variable if it is not useful to do so, and you can rely on this behavior when writing code.
If a sub-expression is used multiple times in a complex expression, then creating a variable to store its value is always a best practice. This guarantees that evaluation only happens once. Performance-wise, this is more important than you might think. We will discuss this in more detail in Chapter 20, but we cover the general idea here. The DAX optimizer features a process called sub-formula detection. In a complex piece of code, sub-formula detection checks for repeating sub-expressions that should only be computed once. For example, look at the following code:
SalesAmount := SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
TotalCost := SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] )
Margin := [SalesAmount] - [TotalCost]
Margin% := DIVIDE ( [Margin], [TotalCost] )
The TotalCost measure is called twice—once in Margin and once in Margin%. Depending on the quality of the optimizer, it might be able to detect that both measure calls refer to the same value, so it might be able to compute TotalCost only once. Nevertheless, the optimizer is not always able to detect that a sub-formula exists and that it can be evaluated only once. As a human, and being the author of your own code, you always have a much better understanding of when part of the code can be used in multiple parts of your formula. If you get used to using variables whenever you can, defining sub-formulas as variables will come naturally. When you use their value multiple times, you will greatly help the optimizer in finding the best execution path for your code.
Common patterns using variables
In this section, you find practical uses of variables. It is not an exhaustive list of scenarios where variables become useful, and although there are many other situations where a variable would be a good fit, these are relevant and frequent uses. The first and most relevant reason to use variables is to provide documentation in your code. A good example is when you need to use complex filters in a CALCULATE function. Using variables as CALCULATE filters only improves readability. It does not change semantics or performance. Filters would be executed outside of the context transition triggered by CALCULATE in any case, and DAX also
uses lazy evaluation for filter contexts. Nevertheless, improving readability is an important task for any DAX developer. For example, consider the following measure definition:
Sales Large Customers :=
VAR LargeCustomers =
    FILTER ( Customer, [Sales Amount] > 10000 )
VAR WorkingDaysIn2008 =
    CALCULATETABLE (
        ALL ( 'Date'[IsWorkingDay], 'Date'[Calendar Year] ),
        'Date'[IsWorkingDay] = TRUE (),
        'Date'[Calendar Year] = "CY 2008"
    )
RETURN
    CALCULATE (
        [Sales Amount],
        LargeCustomers,
        WorkingDaysIn2008
    )
Using the two variables for the filtered customers and the filtered dates splits the full execution flow into three distinct parts: the definition of what a large customer is, the definition of the period one wants to consider, and the actual calculation of the measure with the two filters applied. Although it might look like we are only talking about style, you should never forget that a more elegant and simple formula is more likely to also be an accurate formula. Writing a simpler formula, the author is more likely to have understood the code and fixed any possible flaws. Whenever an expression takes more than 10 lines of code, it is time to split its execution path with multiple variables. This allows the author to focus on smaller fragments of the full formula. Another scenario where variables are important is when nesting multiple row contexts on the same table. In this scenario, variables let you save data from hidden row contexts and avoid the use of the EARLIER function: 'Product'[RankPrice] = VAR CurrentProductPrice = 'Product'[Unit Price] VAR MoreExpensiveProducts = FILTER ( 'Product', 'Product'[Unit Price] > CurrentProductPrice ) RETURN COUNTROWS ( MoreExpensiveProducts ) + 1
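For comparison, here is a minimal sketch of how the same calculated column could be written with EARLIER instead of a variable. The column name RankPriceEarlier is hypothetical and only used for illustration; the logic mirrors the RankPrice column above.

```dax
-- Calculated column: same ranking, written with EARLIER instead of a variable.
-- The column name RankPriceEarlier is hypothetical.
'Product'[RankPriceEarlier] =
COUNTROWS (
    FILTER (
        'Product',
        -- EARLIER reads Unit Price from the outer row context,
        -- that is, from the row being ranked
        'Product'[Unit Price] > EARLIER ( 'Product'[Unit Price] )
    )
) + 1
```

The variable-based version reads more naturally because the name CurrentProductPrice states which row context the value comes from, whereas EARLIER requires the reader to count nested row contexts.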
Filter contexts can be nested too. Nesting multiple filter contexts does not create syntax problems as it does with multiple row contexts. One frequent scenario with nested filter contexts is needing to save the result of a calculation to use it later in the code when the filter context changes.
For example, if one needs to search for the customers who bought more than the average customer, this code is not going to work: AverageSalesPerCustomer := AVERAGEX ( Customer, [Sales Amount] ) CustomersBuyingMoreThanAverage := COUNTROWS ( FILTER ( Customer, [Sales Amount] > [AverageSalesPerCustomer] ) )
The reason is that the AverageSalesPerCustomer measure is evaluated inside an iteration over Customer. As such, there is a hidden CALCULATE around the measure that performs a context transition. Thus, AverageSalesPerCustomer evaluates the sales of the current customer inside the iteration every time, instead of the average over all the customers in the filter context. There is no customer whose sales amount is strictly greater than the sales amount itself. The measure always returns blank. To obtain the correct behavior, one needs to evaluate AverageSalesPerCustomer outside of the iteration. A variable fits this requirement perfectly: AverageSalesPerCustomer := AVERAGEX ( Customer, [Sales Amount] ) CustomersBuyingMoreThanAverage := VAR AverageSales = [AverageSalesPerCustomer] RETURN COUNTROWS ( FILTER ( Customer, [Sales Amount] > AverageSales ) )
In this example DAX evaluates the variable outside of the iteration, computing the correct average sales for all the selected customers. Moreover, the optimizer knows that the variable can (and must) be evaluated only once, outside of the iteration. Thus, the code is likely to be faster than any other possible implementation.
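One way to observe the difference is to place the two formulations side by side in a query, for example in DAX Studio. The following is a sketch that assumes the model already contains the Sales Amount measure; the query measure names Wrong and Right are hypothetical.

```dax
DEFINE
    MEASURE Customer[AverageSalesPerCustomer] =
        AVERAGEX ( Customer, [Sales Amount] )
    -- Measure reference inside the iteration: context transition
    -- restricts the average to the currently iterated customer
    MEASURE Customer[Wrong] =
        COUNTROWS (
            FILTER ( Customer, [Sales Amount] > [AverageSalesPerCustomer] )
        )
    -- Variable evaluated once, outside the iteration
    MEASURE Customer[Right] =
        VAR AverageSales = [AverageSalesPerCustomer]
        RETURN
            COUNTROWS (
                FILTER ( Customer, [Sales Amount] > AverageSales )
            )
EVALUATE
ROW (
    "Wrong", [Wrong],
    "Right", [Right]
)
```

Following the reasoning above, the Wrong column is expected to be blank, while the Right column should contain the number of customers above the average.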
Conclusions
Variables are useful for multiple reasons: readability, performance, and elegance of the code. Whenever you need to write a complex formula, split it into multiple variables. You will appreciate having done so the next time you review your code.
It is true that expressions using variables tend to be longer than the same expressions without variables. A longer expression is not a bad thing if it means that each part is easier to understand. Unfortunately, in several tools the user interface to author DAX code makes it hard to write expressions over 10 lines long. You might think that a shorter formulation of the same code without variables is preferable because it is easier to author in a specific tool—for example Power BI. That is incorrect. We certainly need better tools to author longer DAX code that includes comments and many variables. These tools will come eventually. In the meantime, rather than authoring shorter and confusing code directly into a small text box, it is wiser to use external tools like DAX Studio to author longer DAX code. You would then copy and paste the resulting code into Power BI or Visual Studio.
CHAPTER 7
Working with iterators and with CALCULATE
In previous chapters we provided the theoretical foundations of DAX: row context, filter context, and context transition. These are the pillars any DAX expression is built on. We already introduced iterators, and we used them in many different formulas. However, the real power of iterators starts to show when they are being used in conjunction with evaluation contexts and context transition. In this chapter we take iterators to the next level, by describing the most common uses of iterators and by introducing many new iterators. Learning how to leverage iterators in your code is an important skill to acquire. Indeed, using iterators and context transition together is a feature that is unique to the DAX language. In our teaching experience, students usually struggle with learning the power of iterators. But that does not mean that the use of iterators is difficult to understand. The concept of iteration is simple, as is the usage of iterators in conjunction with context transition. What is hard is realizing that the solution to a complex calculation is resorting to an iteration. For this reason, we provide several examples of calculations that are simple to create with the help of iterators.
Using iterators
Most iterators accept at least two parameters: the table to iterate and an expression that the iterator evaluates on a row-by-row basis, in the row context generated during the iteration. A simple expression using SUMX will support our explanation:
Sales Amount :=
SUMX (
    Sales,                                -- Table to iterate
    Sales[Quantity] * Sales[Net Price]    -- Expression to evaluate row by row
)
SUMX iterates the Sales table, and for each row it computes the expression by multiplying quantity by net price. Iterators differ from one another in the use they make of the partial results gathered during the iteration. SUMX is a simple iterator that aggregates these results using sum. It is important to understand the difference between the two parameters. The first argument is the value resulting from a table expression to iterate. Being a value parameter, it is evaluated before the iteration starts. The second parameter, on the other hand, is an expression that is not evaluated before the execution of SUMX. Instead, the iterator evaluates the expression in the row context of the iteration.
The official Microsoft documentation does not provide an accurate classification of the iterator functions. More specifically, it does not indicate which parameters represent a value and which parameters represent an expression evaluated during the iteration. On https://dax.guide all the functions that evaluate an expression in a row context have a special marker (ROW CONTEXT) to identify the argument executed in a row context. Any function that has an argument marked with ROW CONTEXT is an iterator. Several iterators accept additional arguments after the first two. For example, RANKX is an iterator that accepts many arguments, whereas SUMX, AVERAGEX and simple iterators only use two arguments. In this chapter we describe many iterators individually. But first, we go deeper on a few important aspects of iterators.
Understanding iterator cardinality
The first important concept to understand about iterators is the iterator cardinality. The cardinality of an iterator is the number of rows being iterated. For example, in the following iteration if Sales has one million rows, then the cardinality is one million:
Sales Amount :=
SUMX (
    Sales,                                -- Sales has 1M rows, as a consequence
    Sales[Quantity] * Sales[Net Price]    -- the expression is evaluated one million times
)
When speaking about cardinality, we seldom use numbers. In fact, the cardinality of the previous example depends on the number of rows of the Sales table. Thus, we prefer to say that the cardinality of the iterator is the same as the cardinality of Sales. The more rows in Sales, the higher the number of iterated rows. In the presence of nested iterators, the resulting cardinality is a combination of the cardinality of the two iterators—up to the product of the two original tables. For example, consider the following formula: Sales at List Price 1 := SUMX ( 'Product', SUMX ( RELATEDTABLE ( Sales ), 'Product'[Unit Price] * Sales[Quantity] ) )
In this example there are two iterators. The outer iterates Product. As such, its cardinality is the cardinality of Product. Then for each product the inner iteration scans the Sales table, limiting its iteration to the rows in Sales that have a relationship with the given product. In this case, because each row in Sales is pertinent to only one product, the full cardinality is the cardinality of Sales. If the inner table expression is not related to the outer table expression, then the cardinality becomes much higher.
For example, consider the following code. It computes the same value as the previous code, but instead of relying on relationships, it uses an IF function to filter the sales of the current product: Sales at List Price High Cardinality := SUMX ( VALUES ( 'Product' ), SUMX ( Sales, IF ( Sales[ProductKey] = 'Product'[ProductKey], 'Product'[Unit Price] * Sales[Quantity], 0 ) ) )
In this example the inner SUMX always iterates over the whole Sales table, relying on the internal IF statement to check whether the product should be considered or not for the calculation. In this case, the outer SUMX has the cardinality of Product, whereas the inner SUMX has the cardinality of Sales. The cardinality of the whole expression is Product times Sales; much higher than the first example. Be mindful that this example is for educational purposes only. It would result in bad performance if one ever used such a pattern in a DAX expression. A better way to express this code is the following: Sales at List Price 2 := SUMX ( Sales, RELATED ( 'Product'[Unit Price] ) * Sales[Quantity] )
The cardinality of the entire expression is the same as in the Sales at List Price 1 measure, but the latter has a better execution plan. Indeed, it avoids nested iterators. Nested iterations mostly happen because of context transition. In fact, by looking at the following code, one might think that there are no nested iterators: Sales at List Price 3 := SUMX ( 'Product', 'Product'[Unit Price] * [Total Quantity] )
However, inside the iteration there is a reference to a measure (Total Quantity) which we need to consider. In fact, here is the expanded definition of Total Quantity:
Total Quantity :=
SUM ( Sales[Quantity] )    -- Internally translated into SUMX ( Sales, Sales[Quantity] )

Sales at List Price 4 :=
SUMX (
    'Product',
    'Product'[Unit Price]
        * CALCULATE (
            SUMX ( Sales, Sales[Quantity] )
        )
)
You can now see that there is a nested iteration—that is, a SUMX inside another SUMX. Moreover, the presence of CALCULATE, which performs a context transition, is also made visible. From a performance point of view, when there are nested iterators, only the innermost iterator can be optimized with the more efficient query plan. The presence of outer iterators requires the creation of temporary tables in memory. These temporary tables store the intermediate result produced by the innermost iterator. This results in slower performance and higher memory consumption. As a consequence, nested iterators should be avoided if the cardinality of the outer iterators is very large—in the order of several million rows. Please note that in the presence of context transition, unfolding nested iterations is not as easy as it might seem. In fact, a typical mistake is to obtain nested iterators by writing a measure that is supposed to reuse an existing measure. This could be dangerous when the existing logic of a measure is reused within an iterator. For example, consider the following calculation: Sales at List Price 5 := SUMX ( 'Sales', RELATED ( 'Product'[Unit Price] ) * [Total Quantity] )
The Sales at List Price 5 measure seems identical to Sales at List Price 3. Unfortunately, Sales at List Price 5 violates several of the rules of context transition outlined in Chapter 5, “Understanding CALCULATE and CALCULATETABLE”: It performs context transition on a large table (Sales), and worse, it performs context transition on a table where the rows are not guaranteed to be unique. Consequently, the formula is slow and likely to produce incorrect results. This is not to say that nested iterations are always bad. There are various scenarios where the use of nested iterations is convenient. In fact, in the rest of this chapter we show many examples where nested iterators are a powerful tool to use.
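To get a feel for the cardinalities discussed in this section, one could run a small query like the following sketch, for example in DAX Studio. It only relies on the Product and Sales tables used above; the column labels are arbitrary.

```dax
EVALUATE
ROW (
    -- Cardinality of the outer iterator in Sales at List Price 1
    "Product rows", COUNTROWS ( 'Product' ),
    -- Cardinality of Sales at List Price 2
    "Sales rows", COUNTROWS ( Sales ),
    -- Upper bound for the High Cardinality version, which scans
    -- the whole Sales table once for each product
    "Product x Sales", COUNTROWS ( 'Product' ) * COUNTROWS ( Sales )
)
```

Comparing the second and third columns gives a rough idea of how much extra work the unrelated nested iteration performs.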
Leveraging context transition in iterators
A calculation might require nested iterators, usually when it needs to compute a measure in different contexts. These are the scenarios where using context transition is powerful and allows for the concise, efficient writing of complex calculations. For example, consider a measure that computes the maximum daily sales in a time period. The definition of the measure is important because it defines the granularity right away. Indeed, one needs to first compute the daily sales in the given period, then find the maximum value in the list of computed values. Even though it would seem intuitive to create a table containing daily sales and then use MAX on it, in DAX you are not required to build such a table. Instead, iterators are a convenient way of obtaining the desired result without any additional table. The idea of the algorithm is the following:
■ Iterate over the Date table.
■ Compute the sales amount for each day.
■ Find the maximum of all the values computed in the previous step.
You can write this measure by using the following approach: Max Daily Sales 1 := MAXX ( 'Date', VAR DailyTransactions = RELATEDTABLE ( Sales ) VAR DailySales = SUMX ( DailyTransactions, Sales[Quantity] * Sales[Net Price] ) RETURN DailySales )
However, a simpler approach is the following, which leverages the implicit context transition of the measure Sales Amount: Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) Max Daily Sales 2 := MAXX ( 'Date', [Sales Amount] )
In both cases there are two nested iterators. The outer iteration happens on the Date table, which is expected to contain a few hundred rows. Moreover, each row in Date is unique. Thus, both calculations are safe and quick. The former version is more complete, as it outlines the full algorithm. On the other hand, the second version of Max Daily Sales hides many details and makes the code more readable, leveraging context transition to move the filter from Date over to Sales. You can view the result of this measure in Figure 7-1 that shows the maximum daily sales for each month.
FIGURE 7-1 The report shows the Max Daily Sales measure computed by month and year.
By leveraging context transition and an iteration, the code is usually more elegant and intuitive to write. The only issue you should be aware of is the cost involved in context transition: it is a good idea to avoid measure references in large iterators. By looking at the report in Figure 7-1, a logical question is: When did sales hit their maximum? For example, the report indicates that on one particular day in January 2007, Contoso sold 92,244.07 USD. But on which day did it happen? Iterators and context transition are powerful tools to answer this question. Look at the following code: Date of Max := VAR MaxDailySales = [Max Daily Sales] VAR DatesWithMax = FILTER ( VALUES ( 'Date'[Date] ), [Sales Amount] = MaxDailySales ) VAR Result = IF ( COUNTROWS ( DatesWithMax ) = 1, DatesWithMax, BLANK () ) RETURN Result
The formula first stores the value of the Max Daily Sales measure into a variable. Then, it creates a temporary table containing the dates where sales equals MaxDailySales. If there is only one date when this happened, then the result is the only row which passed the filter. If there are multiple dates, then the formula blanks its result, showing that a single date cannot be determined. You can look at the result of this code in Figure 7-2.
FIGURE 7-2 The Date of Max measure makes it clear which unique date generated the maximum sales.
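If returning blank on ties is not desirable, a possible variation is to list every date that reached the maximum. The following is only a sketch: the measure name Dates of Max List is hypothetical, and it assumes a Max Daily Sales measure defined as MAXX ( 'Date', [Sales Amount] ).

```dax
Dates of Max List :=
VAR MaxDailySales = [Max Daily Sales]
VAR DatesWithMax =
    FILTER (
        VALUES ( 'Date'[Date] ),
        [Sales Amount] = MaxDailySales
    )
VAR Result =
    -- Guard against periods without sales: a blank maximum would
    -- otherwise match every date with blank sales
    IF (
        NOT ISBLANK ( MaxDailySales ),
        CONCATENATEX (
            DatesWithMax,
            FORMAT ( 'Date'[Date], "Short Date" ),
            ", ",
            'Date'[Date], ASC
        )
    )
RETURN
    Result
```

The result is a string rather than a date, so it suits a label or a table cell better than further date calculations.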
The use of iterators in DAX requires you to always define, in this order:
■ The granularity at which you want the calculation to happen,
■ The expression to evaluate at the given granularity,
■ The kind of aggregation to use.
In the previous example (Max Daily Sales 2) the granularity is the date, the expression is the amount of sales, and the aggregation to use is MAX. The result is the maximum daily sales. There are several scenarios where the same pattern can be useful. Another example could be displaying the average customer sales. If you think about it in terms of iterators using the pattern described above, you obtain the following: Granularity is the individual customer, the expression to use is sales amount, and the aggregation is AVERAGE. Once you follow this mental process, the formula is short and easy: Avg Sales by Customer := AVERAGEX ( Customer, [Sales Amount] )
With this simple formula, one can easily build powerful reports like the one in Figure 7-3 that shows the average sales per customer by continent and year.
FIGURE 7-3 The Avg Sales by Customer measure computed by year and by continent.
Context transition in iterators is a powerful tool. It can also be expensive, so always checking the cardinality of the outer iterator is a good practice. This will result in more efficient DAX code.
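The granularity, expression, and aggregation pattern described in this section can be applied to other questions as well. As an illustration only, the following sketch computes the lowest monthly sales; the measure name Min Monthly Sales is hypothetical, and it assumes the Date table contains a Calendar Year Month column with one value for each year and month.

```dax
Min Monthly Sales :=
MINX (                                          -- Aggregation: minimum
    VALUES ( 'Date'[Calendar Year Month] ),     -- Granularity: one row per year and month
    [Sales Amount]                              -- Expression, evaluated through context transition
)
```

The outer iterator scans only as many rows as there are months in the filter context, so the cost of the context transition stays small.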
Using CONCATENATEX
In this section, we show a convenient usage of CONCATENATEX to display the filters applied to a report in a user-friendly way. Suppose you build a simple visual that shows sales sliced by year and continent, and you put it in a more complex report where the user has the option of filtering colors using a slicer. The slicer might be near the visual or it might be in a different page. If the slicer is in a different page, then looking at the visual, it is not clear whether the numbers displayed are a subset of the whole dataset or not. In that case it would be useful to add a label to the report, showing the selection made by the user in textual form as in Figure 7-4.
FIGURE 7-4 The label at the bottom of the visual indicates which filters are being applied.
One can inspect the values of the selected colors by querying the VALUES function. Nevertheless, CONCATENATEX is required to convert the resulting table into a string. Look at the definition of the Selected Colors measure, which we used to show the colors in Figure 7-4: Selected Colors := "Showing " & CONCATENATEX ( VALUES ( 'Product'[Color] ), 'Product'[Color], ", ", 'Product'[Color], ASC ) & " colors."
CONCATENATEX iterates over the values of product color and creates a string containing the list of these colors separated by a comma. As you can see, CONCATENATEX accepts multiple parameters. As usual, the first two are the table to scan and the expression to evaluate. The third parameter is the string to use as the separator between expressions. The fourth and the fifth parameters indicate the sort order and its direction (ASC or DESC). The only drawback of this measure is that if there is no selection on the color, it produces a long list with all the colors. Moreover, in the case where there are more than five colors, the list would be too long anyway and the user experience sub-optimal. Nevertheless, it is easy to fix both problems by making the code slightly more complex to detect these situations: Selected Colors := VAR Colors = VALUES ( 'Product'[Color] ) VAR NumOfColors = COUNTROWS ( Colors ) VAR NumOfAllColors = COUNTROWS ( ALL ( 'Product'[Color] ) ) VAR AllColorsSelected = NumOfColors = NumOfAllColors VAR SelectedColors = CONCATENATEX ( Colors, 'Product'[Color], ", ", 'Product'[Color], ASC ) VAR Result = IF ( AllColorsSelected, "Showing all colors.", IF ( NumOfColors > 5, "More than 5 colors selected, see slicer page for details.", "Showing " & SelectedColors & " colors." ) ) RETURN Result
In Figure 7-5 you can see two results for the same visual, with different selections for the colors. With this latter version, it is much clearer whether the user needs to look at more details or not about the color selection.
FIGURE 7-5 Depending on the filters, the label now shows user-friendly descriptions of the filtering.
This latter version of the measure is not perfect yet. In the case where the user selects five colors, but only four are present in the current selection because other filters hide some colors, then the measure does not report the complete list of colors. It only reports the existing list. In Chapter 10, “Working with the filter context,” we describe a different version of this measure that addresses this last detail. In fact, to author the final version, we first need to describe a set of new functions that aim at investigating the content of the current filter context.
Iterators returning tables
So far, we have described iterators that aggregate an expression. There are also iterators that return a table produced by merging a source table with one or more expressions evaluated in the row context of the iteration. ADDCOLUMNS and SELECTCOLUMNS are the most interesting and useful. They are the topic of this section. As its name implies, ADDCOLUMNS adds new columns to the table expression provided as the first parameter. For each added column, ADDCOLUMNS requires knowing the column name and the expression that defines it. For example, you can add two columns to the list of colors, showing for each color the number of products and the value of Sales Amount: Colors = ADDCOLUMNS ( VALUES ( 'Product'[Color] ), "Products", CALCULATE ( COUNTROWS ( 'Product' ) ), "Sales Amount", [Sales Amount] )
The result of this code is a table with three columns: the product color, which is coming from the values of Product[Color], and the two new columns added by ADDCOLUMNS as you can see in Figure 7-6.
FIGURE 7-6 The Sales Amount and Products columns are computed by ADDCOLUMNS.
ADDCOLUMNS returns all the columns of the table expression it iterates, adding the requested columns. To keep only a subset of the columns of the original table expression, an option is to use SELECTCOLUMNS, which only returns the requested columns. For instance, you can rewrite the previous example of ADDCOLUMNS by using the following calculated table: Colors = SELECTCOLUMNS ( VALUES ( 'Product'[Color] ), "Color", 'Product'[Color], "Products", CALCULATE ( COUNTROWS ( 'Product' ) ), "Sales Amount", [Sales Amount] )
The result is the same, but you need to explicitly include the Color column of the original table to obtain the same result. SELECTCOLUMNS is useful whenever you need to reduce the number of columns of a table, oftentimes resulting from some partial calculations.
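The same table expression can also be tested as a query before creating a calculated table. The following sketch assumes the model contains the Sales Amount measure; the ORDER BY clause refers to the Color column produced by SELECTCOLUMNS.

```dax
EVALUATE
SELECTCOLUMNS (
    VALUES ( 'Product'[Color] ),
    "Color", 'Product'[Color],                           -- Must be requested explicitly
    "Products", CALCULATE ( COUNTROWS ( 'Product' ) ),   -- CALCULATE triggers context transition
    "Sales Amount", [Sales Amount]                       -- Measure reference, implicit CALCULATE
)
ORDER BY [Color] ASC
```

Without the CALCULATE around COUNTROWS, every color would show the total number of products, because the row context on the color would not be turned into a filter.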
ADDCOLUMNS and SELECTCOLUMNS are useful to create new tables, as you have seen in this first example. These functions are also often used when authoring measures to make the code easier and faster. As an example, look at the measure, defined earlier in this chapter, that aims at finding the date with the maximum daily sales: Max Daily Sales := MAXX ( 'Date', [Sales Amount] ) Date of Max := VAR MaxDailySales = [Max Daily Sales] VAR DatesWithMax = FILTER ( VALUES ( 'Date'[Date] ), [Sales Amount] = MaxDailySales ) VAR Result = IF ( COUNTROWS ( DatesWithMax ) = 1, DatesWithMax, BLANK () ) RETURN Result
If you look carefully at the code, you will notice that it is not optimal in terms of performance. In fact, as part of the calculation of the variable MaxDailySales, the engine needs to compute the daily sales to find the maximum value. Then, as part of the second variable evaluation, it needs to compute the daily sales again to find the dates when the maximum sales happened. Thus, the engine performs two iterations on the Date table, and each time it computes the sales amount for each date. The DAX optimizer might be smart enough to understand that it can compute the daily sales only once, and then use the previous result the second time you need it, but this is not guaranteed to happen. Nevertheless, by refactoring the code leveraging ADDCOLUMNS, one can write a faster version of the same measure. This is achieved by first preparing a table with the daily sales and storing it into a variable, then using this first, partial result to compute both the maximum daily sales and the date with the maximum sales: Date of Max := VAR DailySales = ADDCOLUMNS ( VALUES ( 'Date'[Date] ), "Daily Sales", [Sales Amount] ) VAR MaxDailySales = MAXX ( DailySales, [Daily Sales] ) VAR DatesWithMax = SELECTCOLUMNS ( FILTER ( DailySales, [Daily Sales] = MaxDailySales ), "Date", 'Date'[Date] ) VAR Result = IF ( COUNTROWS ( DatesWithMax ) = 1, DatesWithMax, BLANK () ) RETURN Result
The algorithm is close to the previous one, with some noticeable differences:
■ The DailySales variable contains a table with the date and the sales amount on each given date. This table is created by using ADDCOLUMNS.
■ MaxDailySales no longer computes the daily sales. It scans the precomputed DailySales variable, resulting in faster execution time.
■ The same happens with DatesWithMax, which scans the DailySales variable. Because after that point the code only needs the date and no longer the daily sales, we used SELECTCOLUMNS to remove the daily sales from the result.
This latter version of the code is more complex than the original version. This is often the price to pay when optimizing code: Worrying about performance means having to write more complex code. You will see ADDCOLUMNS and SELECTCOLUMNS in more detail in Chapter 12, “Working with tables,” and in Chapter 13, “Authoring queries.” There are many details that are important there, especially if you want to use the result of SELECTCOLUMNS in other iterators that perform context transition.
Solving common scenarios with iterators
In this section we continue to show examples of known iterators and we also introduce a common and useful one: RANKX. You start by learning how to compute moving averages and the difference between using an iterator or a straight calculation for the average. Later in this section, we provide a complete description of the RANKX function, which is extremely useful to compute ranking based on expressions.
Computing averages and moving averages
You can calculate the mean (arithmetic average) of a set of values by using one of the following DAX functions:
■ AVERAGE: returns the average of all the numbers in a numeric column.
■ AVERAGEX: calculates the average on an expression evaluated over a table.
Note DAX also provides the AVERAGEA function, which returns the average of all the numbers in a text column. However, you should not use it. AVERAGEA only exists in DAX for Excel compatibility. The main issue of AVERAGEA is that when you use a text column as an argument, it does not try to convert each text row to a number as Excel does. Instead, if you pass a string column as an argument, you always obtain 0 as a result. That is quite useless. On the other hand, AVERAGE would return an error, clearly indicating that it cannot average strings.
We discussed how to compute regular averages over a table earlier in this chapter. Here we want to show a more advanced usage, that is a moving average. For example, imagine that you want to analyze the daily sales of Contoso. If you just build a report that plots the sales amount sliced by day, the result is hard to analyze. As you can see in Figure 7-7, the value obtained has strong daily variations.
FIGURE 7-7 Plotting the sales amount on a daily basis produces a report that is hard to read.
To smooth out the chart, a common technique is to compute the average over a certain period greater than just the day level. In our example, we decided to use 30 days as our period. Thus, on each day the chart shows the average over the last 30 days. This technique helps in removing peaks from the chart, making it easier to detect a trend.
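As a rough sketch of this idea, a 30-day moving average could be written as follows. The measure name MovingAvg30Sketch is hypothetical, and the use of AVERAGEX for the final step is an assumption made for illustration; the sketch also assumes the Sales Amount measure and a Date table with one row per day.

```dax
MovingAvg30Sketch :=
VAR LastVisibleDate = MAX ( 'Date'[Date] )      -- Last date in the current filter context
VAR NumberOfDays = 30
VAR PeriodToUse =
    FILTER (
        ALL ( 'Date' ),                          -- Ignore the current filter on dates
        AND (
            'Date'[Date] > LastVisibleDate - NumberOfDays,
            'Date'[Date] <= LastVisibleDate
        )
    )
VAR Result =
    -- Context transition computes Sales Amount for each day in the window;
    -- days with blank sales are not counted by AVERAGEX
    AVERAGEX ( PeriodToUse, [Sales Amount] )
RETURN
    Result
```

Because AVERAGEX skips blanks, this sketch averages over the days that have sales rather than over all 30 days. Dividing the total of the window by the number of days is an alternative when days without sales should count as zero.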
The following calculation provides the average at the date cardinality, over the last 30 days: AvgXSales30 := VAR LastVisibleDate = MAX ( 'Date'[Date] ) VAR NumberOfDays = 30 VAR PeriodToUse = FILTER ( ALL ( 'Date' ), AND ( 'Date'[Date] > LastVisibleDate - NumberOfDays, 'Date'[Date] [...]
[...] LastVisibleDate - NumberOfDays && 'Date'[Date] [...]
[...] 0, [NumOfWorkingDays] ) ) VAR Result = DIVIDE ( [Sales Amount], WorkingDays ) RETURN Result
This new version of the code provides an accurate result at the year level, as shown in Figure 7-21, though it is still not perfect.
FIGURE 7-21 Using an iterator, the total at the year level is now accurate.
When performing the calculation at a different granularity, one needs to ensure the correct level of granularity. The iteration started by SUMX iterates the values of the month column, which are January through December. At the year level everything is working correctly, but the value is still incorrect at the grand total. You can observe this behavior in Figure 7-22.
FIGURE 7-22 Every yearly total is above 35,000 and the grand total is—again—surprisingly low.
When the filter context contains the year, an iteration of months works fine because—after the context transition—the new filter context contains both a year and a month. However, at the grand total level, the year is no longer part of the filter context. Consequently, the filter context only contains the currently iterated month, and the formula does not check if there are sales in that year and month. Instead, it checks if there are sales in that month for any year. The problem of this formula is the iteration over the month column. The correct granularity of the iteration is not the month; it is the pair of year and month together. The best solution is to iterate over a column containing a different value for each year and month. It turns out that we have such a column in the data model: the Calendar Year Month column. To fix the code, it is enough to iterate over the Calendar Year Month column instead of over Month: SalesPerWorkingDay := VAR WorkingDays = SUMX ( VALUES ( 'Date'[Calendar Year Month] ), IF ( [Sales Amount] > 0, [NumOfWorkingDays] ) ) VAR Result = DIVIDE ( [Sales Amount], WorkingDays ) RETURN Result
This final version of the code works fine because it computes the total using an iteration at the correct level of granularity. You can see the result in Figure 7-23.
FIGURE 7-23 Applying the calculation at the correct level of granularity returns accurate values also at the Total level.
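When a model has no single column that identifies the year and month together, the same granularity can be obtained by iterating the pair of columns. The following is only a sketch: the measure name SalesPerWorkingDayPair is hypothetical, and the 'Date'[Month] column name is an assumption about the model.

```dax
SalesPerWorkingDayPair :=
VAR WorkingDays =
    SUMX (
        -- One row for each existing combination of year and month
        -- ('Date'[Month] is an assumed column name)
        SUMMARIZE ( 'Date', 'Date'[Calendar Year], 'Date'[Month] ),
        -- Context transition filters both the year and the month
        IF ( [Sales Amount] > 0, [NumOfWorkingDays] )
    )
VAR Result =
    DIVIDE ( [Sales Amount], WorkingDays )
RETURN
    Result
```

Iterating a single Calendar Year Month column, as in the final version above, remains simpler to read when such a column is available.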
Conclusions
As usual, let us conclude this chapter with a recap of the important concepts you learned here:
■ Iterators are an important part of DAX, and you will find yourself using them more, the more you use DAX.
■ There are mainly two kinds of iterations in DAX: iterations to perform simple calculations on a row-by-row basis and iterations that leverage context transition. The definition of Sales Amount we used so far in the book uses an iteration to compute the quantity multiplied by the net price, on a row-by-row basis. In this chapter, we introduced iterators with a context transition, a powerful tool to compute more complex expressions.
■ Whenever using an iterator with context transition, you must check the cardinality the iteration should happen at; it should be quite small. You also need to check that the rows in the table are guaranteed to be unique. Otherwise, the code is at risk of being slow or of computing bad results.
■ When computing averages over time, you always should check whether an iterator is the correct solution or not. AVERAGEX does not consider blanks as part of its calculation and, when using time, this could be wrong. Nevertheless, always double-check the formula requirements; each scenario is unique.
■ Iterators are useful to compute values at a different granularity, as you learned in the last example. When dealing with calculations at different granularities, it is of paramount importance to check the correct granularity to avoid errors in the code.
You will see many more examples of iterators in the remaining part of the book. Starting from the next chapter, when dealing with time intelligence calculations, you will see different calculations, most of which rely on iterations.
CHAPTER 8
Time intelligence calculations
Almost any data model includes some sort of calculation related to dates. DAX offers several functions to simplify these calculations, which are useful if the underlying data model follows certain specific requirements. On the other hand, if the model contains peculiarities in the handling of time that would prevent the use of standard time intelligence functions, then writing custom calculations is always an option. In this chapter, you learn how to implement common date-related calculations such as year-to-date, year-over-year, and other calculations over time including nonadditive and semi-additive measures. You learn both how to use specific time intelligence functions and how to rely on custom DAX code for nonstandard calendars and week-based calculations.
Introducing time intelligence
Typically, a data model contains a date table. In fact, when slicing data by year and month, it is preferable to use the columns of a table specifically designed to slice dates. Extracting the date parts from a single column of type Date or DateTime in calculated columns is a less desirable approach. There are several reasons for this choice. By using a date table, the model becomes easier to browse, and you can use specific DAX functions that perform time intelligence calculations. In fact, in order to work properly, most of the time intelligence functions in DAX require a separate date table.
If a model contains multiple dates, like the order date and the delivery date, then one can either create multiple relationships with a single date table or duplicate the date table. The resulting models are different, and so are the calculations. Later in this chapter, we will discuss these two alternatives in more detail. In any case, one should always create at least one date table whenever there are one or more date columns in the data.
Power BI and Power Pivot for Excel offer embedded features to automatically create tables or columns to manage dates in the model, whereas Analysis Services has no specific feature for the handling of time intelligence. However, the implementation of these features does not always follow the best practice of keeping a single date table in the data model. Also, because these features come with several restrictions, it is usually better to use your own date table. The next sections expand on this last statement.
Automatic Date/Time in Power BI
Power BI has a feature called Auto Date/Time, which can be configured through the options in the Data Load section (see Figure 8-1).
FIGURE 8-1 The Auto Date/Time setting is enabled by default in a new model.
When the setting is enabled (it is by default), Power BI automatically creates a date table for each Date or DateTime column in the model. We will call it a “date column” from here on. This makes it possible to slice each date by year, quarter, month, and day. These automatically created tables are hidden from the user and cannot be modified. Connecting to the Power BI Desktop file with DAX Studio makes them visible to any developers curious about their structure. The Auto Date/Time feature comes with two major drawbacks:
■ Power BI Desktop generates one table per date column. This creates an unnecessarily high number of date tables in the model, unrelated to one another. Building a simple report presenting the amount ordered and the amount sold in the same matrix proves to be a real challenge.
■ The tables are hidden and cannot be modified by the developer. Consequently, if one needs to add a column for the weekday, they cannot.
Building a proper date table for complete freedom is a skill that you learn in the next few pages, and it only requires a few lines of DAX code. Forcing your model to follow bad practices in data modeling just to save a couple of minutes when building the model for the first time is definitely a bad choice.
Automatic date columns in Power Pivot for Excel
Power Pivot for Excel also has a feature to handle the automatic creation of data structures, making it easier to browse dates. However, it uses a different technique that is even worse than that of Power BI. In fact, when one uses a date column in a pivot table, Power Pivot automatically creates a set of calculated columns in the same table that contains the date column. Thus, it creates one calculated column for the year, one for the month name, one for the quarter, and one for the month number (required for sorting). In total, it adds four columns to your table. As a bad practice, it shares all the bad features of Power BI and it adds a new one. In fact, if there are multiple date columns in a single table, then the number of these calculated columns will start to increase. There is no way to use the same set of columns to slice different dates, as is the case with Power BI. Finally, if the date column is in a table with millions of rows, as is often the case, these calculated columns increase the file size and the memory footprint of the model. This feature can be disabled in the Excel options, as you can see in Figure 8-2.
FIGURE 8-2 The Excel options contain a setting to disable automatic grouping of DateTime columns.
Date table template in Power Pivot for Excel
Excel offers another feature that works much better than the previous one. Indeed, since 2017 there has been an option in Power Pivot for Excel to create a date table, which can be activated through the Power Pivot window, as shown in Figure 8-3.
FIGURE 8-3 Power Pivot for Excel lets you create a new date table through a menu option.
In Power Pivot, clicking on New creates a new table in the model with a set of calculated columns that include year, month, and weekday. It is up to the developer to create the correct set of relationships in the model. Also, if needed, one has the option to modify the names and the formulas of the calculated columns, as well as adding new ones. There is also the option of saving the current table as a new template, which will be used in the future for newly created date tables. Overall, this technique works well. The table generated by Power Pivot is a regular date table that fulfills all the requirements of a good date table. This, in conjunction with the fact that Power Pivot for Excel does not support calculated tables, makes the feature useful.
Building a date table
As you have learned, the first step for handling date calculations in DAX is to create a date table. Because of its relevance, one should pay attention to some details when creating the date table. In this section, we provide the best practices regarding the creation of a date table. There are two different aspects to consider: a technical aspect and a data modeling aspect. From a technical point of view, the date table must follow these guidelines:
■ The date table contains all dates included in the period to analyze. For example, if the minimum and maximum dates contained in Sales are July 3, 2016, and July 27, 2019, respectively, the range of dates of the table is between January 1, 2016, and December 31, 2019. In other words, the date table needs to contain all the days for all the years containing sales data. There can be no gaps in the sequence of dates. All dates need to be present, regardless of whether there are transactions or not on each date.
■ The date table contains one column of DateTime type, with unique values. The Date data type is a better choice because it guarantees that the time part is empty. If the DateTime column also contains a time part, then all the times of the day need to be identical throughout the table.
■ It is not necessary that the relationship between Sales and the date table be based on the DateTime column. One can use an integer to relate the two tables, yet the DateTime column needs to be present.
■ The table should be marked as a Date table. Though this is not a strictly mandatory step, it greatly helps in writing correct code. We will cover the details of this feature later in this chapter.
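The technical guidelines above can be checked with a short query. The following is a minimal sketch, assuming the date table is named Date and its date column is 'Date'[Date] (the names used elsewhere in this chapter); the output column names are only illustrative.

```dax
EVALUATE
VAR FirstDate = MIN ( 'Date'[Date] )
VAR LastDate = MAX ( 'Date'[Date] )
-- Number of days a gap-free table should contain between its boundaries
VAR ExpectedDays = DATEDIFF ( FirstDate, LastDate, DAY ) + 1
RETURN
    ROW (
        "First Date", FirstDate,
        "Last Date", LastDate,
        "Expected Days", ExpectedDays,
        "Rows", COUNTROWS ( 'Date' ),
        "Distinct Dates", DISTINCTCOUNT ( 'Date'[Date] )
    )
```

If the table has no gaps and no duplicates, the last three columns should show the same number. The first and last dates should fall on January 1 and December 31 of the first and last year with data.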
Important It is common for newbies to create a huge date table with many more years than needed. That is a mistake. For example, one might create a date table with two hundred years ranging from 1900 to 2100, just in case. Technically the date table works fine, but there will be serious performance issues whenever it is used in calculations. Using a table with only the relevant years is a best practice.
From the technical point of view, a table containing a single date column with all the required dates is enough. Nevertheless, a user typically wants to analyze information slicing by year, month, quarter, weekday, and many other attributes. Consequently, a good date table should include a rich set of columns that, although not used by the engine, greatly improve the user experience.
If you are loading the date table from an existing data source, then it is likely that all the columns describing a date are already present in the source date table. If necessary, additional columns can be created as calculated columns or by changing the source query. Performing simple calculations in the data source is preferable whenever possible, limiting calculated columns to the cases where they are strictly required. Alternatively, you can create the date table by using a DAX calculated table. We describe the calculated table technique along with the CALENDAR and CALENDARAUTO functions in the next sections.
Note The term “Date” is a reserved keyword in DAX; it corresponds to the DATE function. Therefore, you should embed the Date name in quotes when referring to the table name, despite the fact that there are no spaces or special characters in that name. You might prefer using Dates instead of Date as the name of the table to avoid this requirement. However, it is better to be consistent in table names, so if you use the singular form for all the other table names, it is better to keep it singular for the date table too.
Using CALENDAR and CALENDARAUTO
If you do not have a date table in your data source, you can create the date table by using either CALENDAR or CALENDARAUTO. These functions return a table of one column, of DateTime data type. CALENDAR requires you to provide the upper and lower boundaries of the set of dates. CALENDARAUTO scans all the date columns across the entire data model, finds the minimum and maximum years referenced, and finally generates the set of dates between these years. For example, a simple calendar table containing all the dates in the Sales table can be created using the following code: Date = CALENDAR ( DATE ( YEAR ( MIN ( Sales[Order Date] ) ), 1, 1 ), DATE ( YEAR ( MAX ( Sales[Order Date] ) ), 12, 31 ) )
In order to force all dates from the first of January up to the end of December, the code only extracts the minimum and maximum years, forcing day and month to be the first and last of the year. A similar result can be obtained by using the simpler CALENDARAUTO: Date = CALENDARAUTO ( )
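Because a date table is more useful with descriptive columns, the CALENDAR expression can be wrapped in ADDCOLUMNS. The following calculated table is a sketch only: the added column names (Year, Month Number, Month, Calendar Year Month) and the format strings are choices made for this illustration, not requirements of the engine.

```dax
Date =
VAR MinYear = YEAR ( MIN ( Sales[Order Date] ) )
VAR MaxYear = YEAR ( MAX ( Sales[Order Date] ) )
RETURN
    ADDCOLUMNS (
        -- Full years, from January 1 of the first year to December 31 of the last
        CALENDAR ( DATE ( MinYear, 1, 1 ), DATE ( MaxYear, 12, 31 ) ),
        "Year", YEAR ( [Date] ),
        "Month Number", MONTH ( [Date] ),              -- useful as a sort column
        "Month", FORMAT ( [Date], "mmmm" ),            -- illustrative column name
        "Calendar Year Month", FORMAT ( [Date], "mmmm yyyy" )
    )
```

The text columns sort alphabetically unless a sort column is configured in the model, which is why the sketch also produces a numeric month column.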
CALENDARAUTO scans all the date columns, except for calculated columns. For example, if one uses CALENDARAUTO to create a Date table in a model that contains sales between 2007 and 2011 and has an AvailableForSaleDate column in the Product table starting in 2004, the result is the set of all the days between January 1, 2004, and December 31, 2011. However, if the data model contains other date columns, they affect the date range considered by CALENDARAUTO. Storing dates that are not useful to slice and dice is very common. For example, if among the many dates a model also contains the customers’ birthdates, then the result of CALENDARAUTO starts from the oldest year of birth of any customer. This produces a large date table, which in turn negatively affects performance. CALENDARAUTO accepts an optional parameter that represents the final month number of a fiscal year. If provided, CALENDARAUTO generates dates from the first day of the following month to the last day of the month indicated as an argument. This is useful when you have a fiscal year that ends in a month other than December. For example, the following expression generates a Date table for fiscal years starting on July 1 and ending on June 30: Date = CALENDARAUTO ( 6 )
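To inspect the boundaries that the fiscal-year argument produces, one can evaluate the function in a query instead of creating the table right away. This is a small sketch; the variable and output column names are illustrative.

```dax
EVALUATE
VAR FiscalDates = CALENDARAUTO ( 6 )    -- fiscal year ending in June
RETURN
    ROW (
        "First Date", MINX ( FiscalDates, [Date] ),
        "Last Date", MAXX ( FiscalDates, [Date] )
    )
```

According to the description above, the first date should be a July 1 and the last date a June 30. The years depend on the date columns found in the model.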
CALENDARAUTO is slightly easier to use than CALENDAR because it automatically determines the boundaries of the set of dates. However, it might extend this set by considering unwanted columns. One can obtain the best of both worlds by restricting the result of CALENDARAUTO to only the desired set of dates, as follows: Date = VAR MinYear = YEAR ( MIN ( Sales[Order Date] ) ) VAR MaxYear = YEAR ( MAX ( Sales[Order Date] ) ) RETURN FILTER ( CALENDARAUTO ( ), YEAR ( [Date] ) >= MinYear && YEAR ( [Date] ) <= MaxYear )
[...]
CALCULATE ( [Sales Amount], Sales[Quantity] > 1 ) CALCULATE ( [Sales Amount], FILTER ( Sales, Sales[Quantity] > 1 ) )
The two definitions are actually very different. One is filtering a column; the other is filtering a table. Even though the two versions of the code provide the same result in several scenarios, they are, in fact, computing a completely different expression. To demonstrate their behavior, we included the two definitions in a query: EVALUATE ADDCOLUMNS ( VALUES ( 'Product'[Brand] ), "FilterCol", CALCULATE ( [Sales Amount], Sales[Quantity] > 1 ), "FilterTab", CALCULATE ( [Sales Amount], FILTER ( Sales, Sales[Quantity] > 1 ) ) )
The result is surprising to say the least, as we can see in Figure 14-7. FilterCol returns the expected values, whereas FilterTab always returns the same number that corresponds to the grand total of all the brands. Expanded tables play an important role in understanding the reason for this result. We can examine the behavior of the FilterTab calculation in detail. The filter argument of CALCULATE iterates over Sales and returns all the rows of Sales with a quantity greater than 1. The result of FILTER is a subset of rows of the Sales table. Remember: In DAX a table reference always references the expanded table. Because Sales has a relationship with Product, the expanded table of Sales contains the whole Product table too. Among the many columns, it also contains Product[Brand].
CHAPTER 14
Advanced DAX concepts
FIGURE 14-7 The first column computes the correct results, whereas the second column always shows a higher number corresponding to the grand total.
The filter arguments of CALCULATE are evaluated in the original filter context, ignoring the context transition. The filter on Brand comes into effect after CALCULATE has performed the context transition. Consequently, the result of FILTER contains the values of all the brands related to rows with a quantity greater than 1. Indeed, there are no filters on Product[Brand] during the iteration made by FILTER. When generating the new filter context, CALCULATE performs two consecutive steps:
1. It performs the context transition.
2. It applies the filter arguments.
Therefore, filter arguments might override the effects of context transition. Because ADDCOLUMNS is iterating over the product brand, the effects of context transition on each row should be that of filtering an individual brand. Nevertheless, because the result of FILTER also contains the product brand, it overrides the effects of the context transition. The net result is that the value shown is always the total of Sales Amount for all the transactions whose quantity is greater than 1, regardless of the product brand. Using table filters is always challenging because of table expansion. Whenever one applies a filter to a table, the filter is really applied to the expanded table, and this can cause several side effects. The golden rule is simple: Try to avoid using table filters whenever possible. Working with columns leads to simpler calculations, whereas working with tables is much more problematic.
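The reason a column filter does not suffer from this problem becomes clearer once the predicate is written in its expanded form: a Boolean condition on a column is equivalent to a FILTER over ALL of that single column. The following query is a sketch that places the compact and the expanded forms side by side; the FilterColExpanded column name is made up for this illustration.

```dax
EVALUATE
ADDCOLUMNS (
    VALUES ( 'Product'[Brand] ),
    -- Compact syntax: a predicate on one column
    "FilterCol", CALCULATE ( [Sales Amount], Sales[Quantity] > 1 ),
    -- Equivalent expanded syntax: the filter contains only Sales[Quantity],
    -- so it cannot override the filter on Product[Brand]
    "FilterColExpanded",
        CALCULATE (
            [Sales Amount],
            FILTER ( ALL ( Sales[Quantity] ), Sales[Quantity] > 1 )
        )
)
```

Both columns should return the same value for each brand, because the filter argument only contains the Sales[Quantity] column and leaves the brand filter produced by the context transition untouched.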
Note The example shown in this section might not be easily applied to a measure defined in a data model. This is because the measure is always executed in an implicit CALCULATE to produce the context transition. For example, consider the following measure: Multiple Sales := CALCULATE ( [Sales Amount], FILTER ( Sales, Sales[Quantity] > 1 ) )
When executed in a report, a possible DAX query could be: EVALUATE ADDCOLUMNS ( VALUES ( 'Product'[Brand] ), "FilterTabMeasure", [Multiple Sales] )
The expansion of the table drives the execution of this corresponding query: EVALUATE ADDCOLUMNS ( VALUES ( 'Product'[Brand] ), "FilterTabMeasure", CALCULATE ( CALCULATE ( [Sales Amount], FILTER ( Sales, Sales[Quantity] > 1 ) ) ) )
The first CALCULATE performs the context transition that affects both arguments of the second CALCULATE, including the FILTER argument. Even though this produces the same result as FilterCol, the use of a table filter has a negative impact on performance. Therefore, it is always better to use column filters whenever possible.
Using table filters in measures
In the previous section, we showed a first example where being familiar with expanded tables helped make sense of a result. However, there are several other scenarios where expanded tables prove to be useful. Besides, in previous chapters we used the concept of expanded tables multiple times, although we could not describe what was happening in detail just yet.
For example in Chapter 5, “Understanding CALCULATE and CALCULATETABLE,” while explaining how to remove all the filters applied to the model, we used the following code in a report that was slicing measures by category: Pct All Sales := VAR CurrentCategorySales = [Sales Amount] VAR AllSales = CALCULATE ( [Sales Amount], ALL ( Sales ) ) VAR Result = DIVIDE ( CurrentCategorySales, AllSales ) RETURN Result
Why does ALL ( Sales ) remove any filter? If one does not think in terms of expanded tables, ALL should only remove filters from the Sales table, keeping any other filter untouched. In fact, using ALL on the Sales table means removing any filter from the expanded Sales table. Because Sales expands to all the related tables, including Product, Customer, Date, Store, and any other related tables, using ALL ( Sales ) removes any filter from the entire data model used by that example. Most of the time this behavior is the one desired and it works intuitively. Still, understanding the internal behavior of expanded tables is of paramount importance; failing to gain that understanding might be a root cause for inaccurate calculations. In the next example, we demonstrate how a simple calculation can fail simply due to a subtlety of expanded tables. We will see why it is better to avoid using table filters in CALCULATE statements, unless the developer is purposely looking to take advantage of the side effects of expanded tables. The latter are described in the following sections. Consider the requirements of a report like the one in Figure 14-8. The report contains a slicer that filters the Category, and a matrix showing the sales of subcategories and their respective percentage against the total.
FIGURE 14-8 The Pct column shows the percentage of a subcategory against the total sales.
Because the percentage needs to divide the current Sales Amount by the corresponding Sales Amount for all the subcategories of the selected category, a first (inaccurate) solution might be the following: Pct := DIVIDE ( [Sales Amount], CALCULATE ( [Sales Amount], ALL ( 'Product Subcategory' ) ) )
The idea is that by removing the filter on Product Subcategory, DAX retains the filter on Category and produces the correct result. However, the result is wrong, as we can see in Figure 14-9.
FIGURE 14-9 The first implementation of Pct produces the wrong result.
The problem with this formula is that ALL ( 'Product Subcategory' ) refers to the expanded Product Subcategory table. Product Subcategory expands to Product Category. Consequently, ALL removes the filter not only from the Product Subcategory table, but also from the Product Category table. Therefore, the denominator returns the grand total of all the categories, in turn calculating the wrong percentage. There are multiple solutions available. In the current report, they all compute the same value, even though they use slightly different approaches. For example, the following Pct Of Categories measure computes the percentage of the selected subcategories compared to the total of the related categories. After removing the filter from the expanded table of Product Subcategory, VALUES restores the filter of the Product Category table: Pct Of Categories := DIVIDE ( [Sales Amount], CALCULATE ( [Sales Amount], ALL ( 'Product Subcategory' ), VALUES ( 'Product Category' ) ) )
Another possible solution is the Pct Of Visual Total measure, which uses ALLSELECTED without an argument. ALLSELECTED restores the filter context of the slicers outside the visual, without the developer having to worry about expanded tables: Pct Of Visual Total := DIVIDE ( [Sales Amount], CALCULATE ( [Sales Amount], ALLSELECTED () ) )
ALLSELECTED is attractive because of its simplicity. However, in a later section of this chapter we introduce shadow filter contexts. These will provide the reader with a fuller understanding of ALLSELECTED. ALLSELECTED can be powerful, but it is also a complex function that must be used carefully in convoluted expressions. Finally, another solution is available using ALLEXCEPT, thus comparing the selected subcategories with the categories selected in the slicer: Pct := DIVIDE ( [Sales Amount], CALCULATE ( [Sales Amount], ALLEXCEPT ( 'Product Subcategory', 'Product Category' ) ) )
This last formula leverages a particular ALLEXCEPT syntax that we have never used so far in the book: ALLEXCEPT with two tables, instead of a table and a list of columns. ALLEXCEPT removes filters from the source table, with the exception of any columns provided as further arguments. That list of columns can include any column (or table) belonging to the expanded table of the first argument. Because the expanded table of Product Subcategory contains the whole Product Category table, the code provided is a valid syntax. It removes any filter from the whole expanded table of Product Subcategory, except for the columns of the expanded table of Product Category. It is worth noting that expanded tables tend to cause more issues when the data model is not correctly denormalized. As a matter of fact, in most of this book we use a version of Contoso where Category and Subcategory are stored as columns in the Product table, instead of being tables by themselves. In other words, we denormalized the category and subcategory tables as attributes of the Product table. In a correctly denormalized model, table expansion takes place between Sales and Product in a more natural way. So as it often happens, putting some thought into the model makes the DAX code easier to author.
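The variants discussed in this section can be compared in a single query. The sketch below defines them as query measures and simulates the slicer with a filter on one category; the category value "Audio" and the query measure names are assumptions made for this illustration. The ALLSELECTED variant is left out because its behavior inside an iteration depends on the shadow filter contexts described later in the chapter.

```dax
DEFINE
    -- ALL on the table also clears the category filter (expanded table)
    MEASURE Sales[Pct Wrong] =
        DIVIDE (
            [Sales Amount],
            CALCULATE ( [Sales Amount], ALL ( 'Product Subcategory' ) )
        )
    -- VALUES restores the filter on the category table
    MEASURE Sales[Pct Values] =
        DIVIDE (
            [Sales Amount],
            CALCULATE (
                [Sales Amount],
                ALL ( 'Product Subcategory' ),
                VALUES ( 'Product Category' )
            )
        )
    -- ALLEXCEPT keeps the filter on the category table
    MEASURE Sales[Pct AllExcept] =
        DIVIDE (
            [Sales Amount],
            CALCULATE (
                [Sales Amount],
                ALLEXCEPT ( 'Product Subcategory', 'Product Category' )
            )
        )
EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        VALUES ( 'Product Subcategory'[Subcategory] ),
        "Pct Wrong", [Pct Wrong],
        "Pct Values", [Pct Values],
        "Pct AllExcept", [Pct AllExcept]
    ),
    'Product Category'[Category] = "Audio"    -- assumed category value
)
```

If the reasoning above holds, the last two columns should match and add up to 100 percent over the listed subcategories, whereas the first column should add up to less because its denominator is the grand total of all the categories.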
Understanding active relationships
When working with expanded tables, another important aspect to consider is the concept of active relationships. It is easy to get confused in a model with multiple relationships. In this section, we want to share an example where the presence of multiple relationships proves to be a real challenge. Imagine needing to compute Sales Amount and Delivered Amount. These two measures can be computed by activating the correct relationship with USERELATIONSHIP. The following two measures work: Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) Delivered Amount := CALCULATE ( [Sales Amount], USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] ) )
The result is visible in Figure 14-10.
FIGURE 14-10 Sales Amount and Delivered Amount use different relationships.
It is interesting to see a variation of the Delivered Amount measure that does not work because it uses a table filter: Delivered Amount = CALCULATE ( [Sales Amount], CALCULATETABLE ( Sales, USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] ) ) )
This new—and unfortunate—formulation of the measure produces a blank result, as we can see in Figure 14-11.
FIGURE 14-11 Using a table filter, Delivered Amount only produces a blank value.
We now investigate why the result is a blank. This requires paying a lot of attention to expanded tables. The result of CALCULATETABLE is the expanded version of Sales, and among other tables it contains the Date table. When Sales is evaluated by CALCULATETABLE, the active relationship is the one with Sales[Delivery Date]. CALCULATETABLE therefore returns all the sales delivered in a given year, as an expanded table. When CALCULATETABLE is used as a filter argument by the outer CALCULATE, the result of CALCULATETABLE filters Sales and Date through the Sales expanded table, which uses the relationship between Sales[Delivery Date] and Date[Date]. Nevertheless, once CALCULATETABLE ends its execution, the default relationship between Sales[Order Date] and Date[Date] becomes the active relationship again. Therefore, the dates being filtered are now the order dates, not the delivery dates anymore. In other words, a table containing delivery dates is used to filter order dates. At this point, the only rows that remain visible are the ones where Sales[Order Date] equals Sales[Delivery Date]. There are no rows in the model that satisfy this condition; consequently, the result is blank.
To further clarify the concept, imagine that the Sales table contains just a few rows, like the ones in Table 14-2.
TABLE 14-2 Example of Sales table with only two rows
Order Date    Delivery Date    Quantity
12/31/2007    01/07/2008       100
01/05/2008    01/10/2008       200
If the year 2008 is selected, the inner CALCULATETABLE returns the expanded version of Sales, containing, among many others, the columns shown in Table 14-3.
TABLE 14-3 The result of CALCULATETABLE is the expanded Sales table, including Date[Date] using the Sales[Delivery Date] relationship
Order Date    Delivery Date    Quantity    Date
12/31/2007    01/07/2008       100         01/07/2008
01/05/2008    01/10/2008       200         01/10/2008
When this table is used as a filter, the Date[Date] column uses the active relationship, which is the one between Date[Date] and Sales[Order Date]. At this point, the expanded table of Sales appears as in Table 14-4.
TABLE 14-4 The expanded Sales table using the default active relationship using the Sales[Order Date] column
Order Date    Delivery Date    Quantity    Date
12/31/2007    01/07/2008       100         12/31/2007
01/05/2008    01/10/2008       200         01/05/2008
The rows visible in Table 14-3 try to filter the rows visible in Table 14-4. However, the Date column is always different in the two tables, for each corresponding row. Because they do not have the same value, the first row will be removed from the active set of rows. Following the same reasoning, the second row is excluded too. At the end, only the rows where Sales[Order Date] equals Sales[Delivery Date] survive the filter; they produce the same value in the Date[Date] column of the two expanded tables generated for the different relationships. This time, the complex filtering effect comes from the active relationship. Changing the active relationship inside a CALCULATE statement only affects the computation inside CALCULATE, but when the result is used outside of CALCULATE, the relationship goes back to the default. As usual, it is worth pointing out that this behavior is the correct one. It is complex, but it is correct. There are good reasons to avoid table filters as much as possible. Using table filters might result in the correct behavior, or it might turn into an extremely complex and unpredictable scenario. Moreover, the measure with a column filter instead of a table filter works fine and it is easier to read. The golden rule with table filters is to avoid them. The price to pay for developers who do not follow this simple suggestion is twofold: A significant amount of time will be spent understanding the filtering behavior, and performance becomes the worst it could possibly be.
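The conclusion that only rows with identical order and delivery dates survive the table filter can be checked directly against the data. The following query is a minimal sketch; the output column name is illustrative.

```dax
EVALUATE
ROW (
    -- Rows that would survive the table filter in the broken Delivered Amount measure
    "Same Day Deliveries",
        COUNTROWS (
            FILTER ( Sales, Sales[Order Date] = Sales[Delivery Date] )
        )
)
```

A blank result means that no such row exists, which is consistent with the blank shown by the measure. In a model where some orders are delivered on the same day, the measure with the table filter would return only the amount of those rows.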
Difference between table expansion and filtering
As explained earlier, table expansion solely takes place from the many-side to the one-side of a relationship. Consider the model in Figure 14-12, where we enabled bidirectional filtering in all the relationships of the data model.
FIGURE 14-12 All the relationships in this model are set with bidirectional cross-filter.
Though the relationship between Product and Product Subcategory is set with bidirectional filtering, the expanded Product table contains subcategories, whereas the expanded Product Subcategory table does not contain Product. The DAX engine injects filtering code in the expressions to make bidirectional filtering work as if the expansion went both ways. A similar behavior happens when using the CROSSFILTER function. Therefore, in most cases a measure works just as if table expansion took place in both directions. However, be mindful that table expansion actually does not go in the many-side direction. The difference becomes important with the use of SUMMARIZE or RELATED. If a developer uses SUMMARIZE to perform a grouping of a table based on another table, they have to use one of the columns of the expanded table. For example, the following SUMMARIZE statement works well: EVALUATE SUMMARIZE ( 'Product', 'Product Subcategory'[Subcategory] )
Whereas the next one—which tries to summarize subcategories based on product color—does not work: EVALUATE SUMMARIZE ( 'Product Subcategory', 'Product'[Color] )
The error is “The column ‘Color’ specified in the ‘SUMMARIZE’ function was not found in the input table,” meaning that the expanded version of Product Subcategory does not contain Product[Color]. Like SUMMARIZE, RELATED also works with columns that belong to the expanded table exclusively. Similarly, one cannot group the Date table by using columns from other tables, even when these tables are linked by a chain of bidirectional relationships: EVALUATE SUMMARIZE ( 'Date', 'Product'[Color] )
There is only one special case where table expansion goes in both directions, which is the case of a relationship defined as one-to-one. If a relationship is a one-to-one relationship, then both tables are expanded one into the other. This is because a one-to-one relationship makes the two tables semantically identical: Each row in one table has a direct relationship with a single row in the other table. Therefore, it is fair to think of the two tables as being one, split into two sets of columns.
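The direction of table expansion also determines which function retrieves related data. The two queries below are a sketch based on the model used in this section: RELATED follows the expansion from the many-side to the one-side, whereas RELATEDTABLE relies on a context transition to go the opposite way. The output column names are illustrative.

```dax
-- Many-side to one-side: the subcategory belongs to the expanded Product table
EVALUATE
SELECTCOLUMNS (
    'Product',
    "Product Name", 'Product'[Product Name],
    "Subcategory", RELATED ( 'Product Subcategory'[Subcategory] )
)

-- One-side to many-side: Product is not part of the expanded Product Subcategory
-- table, so RELATEDTABLE retrieves the related rows through context transition
EVALUATE
ADDCOLUMNS (
    'Product Subcategory',
    "Products", COUNTROWS ( RELATEDTABLE ( 'Product' ) )
)
```

Replacing RELATEDTABLE with RELATED in the second query is expected to raise an error for the same reason the SUMMARIZE example fails: Product is not part of the expanded Product Subcategory table.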
Context transition in expanded tables
The expanded table also influences context transition. The row context converts into an equivalent filter context for all the columns that are part of the expanded table. For example, consider the following query returning the category of a product using two techniques: the RELATED function in a row context and the SELECTEDVALUE function with a context transition: EVALUATE SELECTCOLUMNS ( 'Product', "Product Key", 'Product'[ProductKey], "Product Name", 'Product'[Product Name], "Category RELATED", RELATED ( 'Product Category'[Category] ), "Category Context Transition", CALCULATE ( SELECTEDVALUE ( 'Product Category'[Category] ) ) ) ORDER BY [Product Key]
The result of the query includes two identical columns, Category RELATED and Category Context Transition, as shown in Figure 14-13.
FIGURE 14-13 The category of each product is displayed in two columns computed with different techniques.
The Category RELATED column shows the category corresponding to the product displayed on the same line of the report. This value is retrieved by using RELATED when the row context on Product is available. The Category Context Transition column uses a different approach, generating a context transition by invoking CALCULATE. The context transition filters just one row in the Product table; this filter is also applied to Product Subcategory and Product Category, filtering the corresponding rows for the product. Because at this point the filter context only filters one row in Product Category, SELECTEDVALUE returns the value of the Category column in the only row filtered in the Product Category table. While this side effect is well known, it is not efficient to rely on this behavior when wanting to retrieve a value from a related table. Even though the result is identical, performance could be very different. The solution using a context transition is particularly expensive if used for many rows in Product. Context transition comes at a significant computational cost. Thus, as we will see later in the book, reducing the number of context transitions is important in order to improve performance. Therefore, RELATED is a better solution to this specific problem; it avoids the context transition required for SELECTEDVALUE to work.
Understanding ALLSELECTED and shadow filter contexts ALLSELECTED is a handy function that hides a giant trap. In our opinion, ALLSELECTED is the most complex function in the whole DAX language, even though it looks harmless. In this section we provide an exhaustive technical description of the ALLSELECTED internals, along with a few suggestions on when to use and when not to use ALLSELECTED. ALLSELECTED, as any other ALL* function, can be used in two different ways: as a table function or as a CALCULATE modifier. Its behavior differs in these two scenarios. Moreover, ALLSELECTED is the only DAX function that leverages shadow filter contexts. In this section, we first examine the behavior of ALLSELECTED, then we introduce shadow filter contexts, and finally we provide a few tips on using ALLSELECTED optimally. ALLSELECTED can be used quite intuitively. For example, consider the requirements for the report in Figure 14-14.
FIGURE 14-14 The report shows the sales amount of a few selected brands, along with their percentages.
The report uses a slicer to filter certain brands. It shows the sales amount of each brand, along with the percentage of each given brand over the total of all selected brands. The percentage formula is simple: Pct := DIVIDE ( [Sales Amount], CALCULATE ( [Sales Amount], ALLSELECTED ( 'Product'[Brand] ) ) )
Intuitively, our reader likely knows that ALLSELECTED returns the values of the brands selected outside of the current visual, that is, the brands selected between Adventure Works and Proseware. But what Power BI sends to the DAX engine is a single DAX query that does not have any concept of “current visual.” How does DAX know what is selected in the slicer and what is selected in the matrix? The answer is that it does not. ALLSELECTED does not return the values of a column (or table) filtered outside a visual. It performs a totally different task, which as a side effect returns the same result most of the time. The correct definition of ALLSELECTED consists of the following two statements:
■ When used as a table function, ALLSELECTED returns the set of values as visible in the last shadow filter context.
■ When used as a CALCULATE modifier, ALLSELECTED restores the last shadow filter context on its parameter.
These last two sentences deserve a much longer explanation.
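Before going into the details, the two usages can be placed side by side in a minimal sketch. The measure names Visible Brands and Selected Brands Sales are hypothetical, and the three brands in the filter are only an example selection; the query assumes the Sales Amount measure of the sample model.

```dax
DEFINE
    -- Hypothetical measure: ALLSELECTED used as a table function.
    -- Its result is a table of brands that COUNTROWS can scan.
    MEASURE Sales[Visible Brands] =
        COUNTROWS ( ALLSELECTED ( 'Product'[Brand] ) )
    -- Hypothetical measure: ALLSELECTED used as a CALCULATE modifier.
    -- It changes the filter context in which [Sales Amount] is evaluated.
    MEASURE Sales[Selected Brands Sales] =
        CALCULATE ( [Sales Amount], ALLSELECTED ( 'Product'[Brand] ) )
EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        VALUES ( 'Product'[Brand] ),
        "Visible Brands", [Visible Brands],
        "Selected Brands Sales", [Selected Brands Sales]
    ),
    'Product'[Brand] IN { "Contoso", "Fabrikam", "Litware" }
)
```

Following the two statements above, both measures are expected to see the three iterated brands on every row, rather than only the brand of the current row.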
Introducing shadow filter contexts In order to introduce shadow filter contexts, it is useful to look at the query that is executed by Power BI to produce the result shown in Figure 14-14: DEFINE VAR __DS0FilterTable = TREATAS ( { "Adventure Works", "Contoso", "Fabrikam", "Litware", "Northwind Traders", "Proseware" }, 'Product'[Brand] )
EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), __DS0FilterTable, "Sales_Amount", 'Sales'[Sales Amount], "Pct", 'Sales'[Pct] ), [IsGrandTotalRowTotal], 0, 'Product'[Brand], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
The query is a bit too complex to analyze—not because of its inherent complexity but because it is generated by an engine and is thus not designed to be human-readable. The following is a version of the formula that is close enough to the original, but easier to understand and describe: EVALUATE VAR Brands = FILTER ( ALL ( 'Product'[Brand] ), 'Product'[Brand] IN { "Adventure Works", "Contoso", "Fabrikam", "Litware", "Northwind Traders", "Proseware" } ) RETURN CALCULATETABLE ( ADDCOLUMNS ( VALUES ( 'Product'[Brand] ), "Sales_Amount", [Sales Amount], "Pct", [Pct] ), Brands )
The result of this latter query is nearly the same as the report we examined earlier, with the noticeable difference that it is missing the total. We see this in Figure 14-15.
FIGURE 14-15 The query provides almost the same result as the prior report. The only missing part is the total.
Here are some useful notes about the query:
■ The outer CALCULATETABLE creates a filter context containing six brands.
■ ADDCOLUMNS iterates over the six brands visible inside the CALCULATETABLE.
■ Both Sales Amount and Pct are measures executed inside an iteration. Therefore, a context transition is taking place before the execution of both measures, and the filter context of each of the two measures only contains the currently iterated brand.
■ Sales Amount does not change the filter context, whereas Pct uses ALLSELECTED to modify the filter context.
■ After ALLSELECTED modifies the filter context inside Pct, the updated filter context shows all six brands instead of the currently iterated brand.
The last point is the most helpful point in order to understand what a shadow filter context is and how DAX uses it in ALLSELECTED. Indeed, the key is that ADDCOLUMNS iterates over six brands, the context transition makes only one of them visible, and ALLSELECTED needs a way to restore a filter context containing the six iterated brands. Here is a more detailed description of the query execution, where we introduce shadow filter contexts in step 3:
1. The outer CALCULATETABLE creates a filter context with six brands.
2. VALUES returns the six visible brands and returns the result to ADDCOLUMNS.
3. Being an iterator, ADDCOLUMNS creates a shadow filter context containing the result of VALUES, right before starting the iteration.
   • The shadow filter context is like a filter context, but it remains dormant, not affecting the evaluation in any way.
   • A shadow filter context can only be activated by ALLSELECTED, as we are about to explain. For now, just remember that the shadow filter context contains the six iterated brands.
   • We distinguish between a shadow filter context and a regular filter context by calling the latter an explicit filter context.
4. During the iteration, the context transition occurs on one given row. Therefore, the context transition creates a new explicit filter context containing solely the iterated brand.
5. When ALLSELECTED is invoked during the evaluation of the Pct measure, ALLSELECTED does the following:
   • ALLSELECTED restores the last shadow filter context on the column or table passed as parameter, or on all the columns if ALLSELECTED has no arguments. (The behavior of ALLSELECTED without parameters is explained in the following section.)
   • Because the last shadow filter context contained six brands, the selected brands become visible again.
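The steps above can be mapped back onto the simplified query. The following sketch repeats that query, defines Pct locally with the formula given earlier, and marks in comments where each step takes place; the step numbers in the comments are only annotations, not engine output.

```dax
DEFINE
    MEASURE Sales[Pct] =
        DIVIDE (
            [Sales Amount],
            CALCULATE (
                [Sales Amount],
                ALLSELECTED ( 'Product'[Brand] )   -- Step 5: restores the last shadow filter context
            )
        )
EVALUATE
VAR Brands =
    FILTER (
        ALL ( 'Product'[Brand] ),
        'Product'[Brand]
            IN { "Adventure Works", "Contoso", "Fabrikam", "Litware", "Northwind Traders", "Proseware" }
    )
RETURN
    CALCULATETABLE (                          -- Step 1: explicit filter context with six brands
        ADDCOLUMNS (                          -- Step 3: shadow filter context on the iterated brands
            VALUES ( 'Product'[Brand] ),      -- Step 2: the six visible brands
            "Sales_Amount", [Sales Amount],   -- Step 4: context transition, one brand visible
            "Pct", [Pct]                      -- Step 4, then step 5 inside the measure
        ),
        Brands
    )
```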
This simple example allowed us to introduce the concept of shadow filter context. The previous query shows how ALLSELECTED takes advantage of shadow filter contexts to retrieve the filter context outside of the current visual. Please note that the description of the execution does not use the Power BI visuals anywhere. Indeed, the DAX engine is not cognizant of which visual it is helping to produce. All it receives is a DAX query. Most of the time ALLSELECTED retrieves the correct filter context; indeed, all the visuals in Power BI and, in general, most of the visuals produced by any client tool generate the same kind of query. Those auto-generated queries always include a top-level iterator that generates a shadow filter context on the items it is displaying. This is the reason why ALLSELECTED seems to restore the filter context outside of the visual. Having taken our readers one step further in their understanding of ALLSELECTED, we now need to examine more closely the conditions required for ALLSELECTED to work properly:
■ The query needs to contain an iterator. If there is no iterator, then no shadow filter context is present, and ALLSELECTED does not perform any operation.
■ If there are multiple iterators before ALLSELECTED is executed, then ALLSELECTED restores the last shadow filter context. In other words, nesting ALLSELECTED inside an iteration in a measure will most likely produce unwanted results because the measure is almost always executed in another iteration of the DAX query produced by a client tool.
■ If the columns passed to ALLSELECTED are not filtered by a shadow filter context, then ALLSELECTED does not do anything.
At this point, our readers can see more clearly that the behavior of ALLSELECTED is quite complex. Developers predominantly use ALLSELECTED to retrieve the outer filter context of a visualization. We also used ALLSELECTED previously in the book for the very same purpose. In doing so, we always double-checked that ALLSELECTED was used in the correct environment, even though we did not explain in detail what was happening. The fuller semantics of ALLSELECTED are related to shadow filter contexts, and merely by chance (or, to be honest, by careful and masterful design) does its effect entail the retrieving of the filter context outside of the current visual.
A good developer knows exactly what ALLSELECTED does and only uses it in the scenarios where ALLSELECTED works the right way. Overusing ALLSELECTED by relying on it in conditions where it is not expected to work can only produce unwanted results, at which point the developer is to blame, not ALLSELECTED. The golden rule for ALLSELECTED is quite simple: ALLSELECTED can be used to retrieve the outer filter context if and only if it is being used in a measure that is directly projected in a matrix or in a visual. By no means should the developer expect to obtain correct results by using a measure containing ALLSELECTED inside an iteration, as we are going to demonstrate in the following sections. Because of this, we, as DAX developers, use a simple rule: If a measure contains ALLSELECTED anywhere in the code, then that measure cannot be called by any other measure. This is to avoid the risk that in the chain of measure calls, a developer could start an iteration that includes a call to a measure containing ALLSELECTED.
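To make the risk concrete, here is a minimal sketch of a measure that breaks the rule. The measure Avg Pct is hypothetical and exists only for illustration; Pct is defined locally with the formula given earlier, and the three brands are an example selection.

```dax
DEFINE
    MEASURE Sales[Pct] =
        DIVIDE (
            [Sales Amount],
            CALCULATE ( [Sales Amount], ALLSELECTED ( 'Product'[Brand] ) )
        )
    -- Hypothetical measure that calls [Pct] inside an iteration.
    -- AVERAGEX is an iterator, so it creates its own shadow filter context
    -- on the brands it iterates.
    MEASURE Sales[Avg Pct] =
        AVERAGEX ( VALUES ( 'Product'[Brand] ), [Pct] )
EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        VALUES ( 'Product'[Brand] ),
        "Pct", [Pct],
        "Avg Pct", [Avg Pct]
    ),
    'Product'[Brand] IN { "Contoso", "Fabrikam", "Litware" }
)
```

Following the rules above, inside Avg Pct the last shadow filter context is the one created by AVERAGEX, which at that point iterates only the current brand. ALLSELECTED is therefore expected to restore a single brand, so Avg Pct should show 100 percent on every row instead of the share over the three selected brands.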
ALLSELECTED returns the iterated rows To further demonstrate the behavior of ALLSELECTED, we make a small change to the previous query. Instead of iterating over VALUES ( Product[Brand] ), we make ADDCOLUMNS iterate over ALL ( Product[Brand] ): EVALUATE VAR Brands = FILTER ( ALL ( 'Product'[Brand] ), 'Product'[Brand] IN { "Adventure Works", "Contoso", "Fabrikam", "Litware", "Northwind Traders", "Proseware" } ) RETURN CALCULATETABLE ( ADDCOLUMNS ( ALL ( 'Product'[Brand] ), "Sales_Amount", [Sales Amount], "Pct", [Pct] ), Brands )
In this new scenario, the shadow filter context created by ADDCOLUMNS before the iteration contains all the brands—not simply the selected brands. Therefore, when called in the Pct measure, ALLSELECTED restores the shadow filter context, thus making all brands visible. The result shown in Figure 14-16 is different from that of the previous query shown in Figure 14-15.
FIGURE 14-16 ALLSELECTED restores the currently iterated values, not the previous filter context.
As you can see, all the brands are visible (and this is expected), but the numbers are different than before, even though the code computing them is the same. The behavior of ALLSELECTED in this scenario is correct. Developers might think that it behaves unexpectedly because the filter context defined by the Brands variable is ignored by the Pct measure; however, ALLSELECTED is indeed behaving as it was designed to. ALLSELECTED returns the last shadow filter context; in this latter version of the query, the last shadow filter context contains all brands, not only the filtered ones. Indeed, ADDCOLUMNS introduced a shadow filter context on the rows it is iterating, which includes all brands. If one needs to retain the previous filter context, they cannot rely solely on ALLSELECTED. The CALCULATE modifier that retains the previous filter context is KEEPFILTERS. It is interesting to see the result when KEEPFILTERS comes into play: EVALUATE VAR Brands = FILTER ( ALL ( 'Product'[Brand] ), 'Product'[Brand] IN { "Adventure Works", "Contoso", "Fabrikam", "Litware", "Northwind Traders", "Proseware" } ) RETURN CALCULATETABLE ( ADDCOLUMNS ( KEEPFILTERS ( ALL ( 'Product'[Brand] ) ), "Sales_Amount", [Sales Amount], "Pct", [Pct] ), Brands )
When used as a modifier of an iterator, KEEPFILTERS does not change the result of the iterated table. Instead, it instructs the iterator to apply KEEPFILTERS as an implicit CALCULATE modifier whenever context transition occurs while iterating on the table. As a result, ALL returns all the brands and the shadow filter context also contains all the brands. When the context transition takes place, the previous filter applied by the outer CALCULATETABLE with the Brands variable is kept. Thus, the query returns all the brands, but values are computed considering only the selected brands, as we can see in Figure 14-17.
FIGURE 14-17 ALLSELECTED with KEEPFILTERS produces another result, containing many blanks.
ALLSELECTED without parameters As the name suggests, ALLSELECTED belongs to the ALL* family. As such, when used as a CALCULATE modifier, it acts as a filter remover. If the column used as a parameter is included in any shadow filter context, then it restores the last shadow filter context on that column only. Otherwise, if there is no shadow filter context then it does not do anything. When used as a CALCULATE modifier, ALLSELECTED, like ALL, can also be used without any parameter. In that case, ALLSELECTED restores the last shadow filter context on any column. Remember that this happens if and only if the column is included in any shadow filter context. If a column is filtered through explicit filters only, then its filter remains untouched.
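As a hedged illustration of the parameterless form, the following sketch iterates brand and color combinations and uses ALLSELECTED with no arguments in the denominator. The measure name Pct Of Selection is hypothetical, and the two brands in the filter are only an example selection.

```dax
DEFINE
    -- Hypothetical measure: ALLSELECTED () restores the last shadow filter
    -- context on every column that is part of a shadow filter context.
    MEASURE Sales[Pct Of Selection] =
        DIVIDE (
            [Sales Amount],
            CALCULATE ( [Sales Amount], ALLSELECTED () )
        )
EVALUATE
CALCULATETABLE (
    ADDCOLUMNS (
        -- The iterated table contains two columns, so the shadow filter
        -- context created by ADDCOLUMNS covers both Brand and Color.
        SUMMARIZE ( 'Product', 'Product'[Brand], 'Product'[Color] ),
        "Sales_Amount", [Sales Amount],
        "Pct Of Selection", [Pct Of Selection]
    ),
    'Product'[Brand] IN { "Contoso", "Fabrikam" }
)
```

According to the description above, the denominator is expected to be the total over all the iterated brand and color combinations, because both columns belong to the shadow filter context created by ADDCOLUMNS.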
The ALL* family of functions
Because of the complexity of the ALL* family of functions, in this section we provide a summary of their behavior. Every ALL* function behaves slightly differently, so mastering them takes time and experience. In this chapter about advanced DAX concepts, it is time to sum up the main concepts. The ALL* family includes the following functions: ALL, ALLEXCEPT, ALLNOBLANKROW, ALLCROSSFILTERED, and ALLSELECTED. All these functions can be used as CALCULATE modifiers, and all of them except ALLCROSSFILTERED can also be used as table functions. When used as table functions, they are much easier to understand than when used as CALCULATE modifiers. Indeed, when used as CALCULATE modifiers, they might produce unexpected results because they act as filter removers. Table 14-5 provides a summary of the ALL* functions. In the remaining part of this section we provide a more complete description of each function.
TABLE 14-5 Summary of the ALL* family of functions
Function | Table function | CALCULATE modifier
ALL | Returns all the distinct values of a column or of a table. | Removes any filter from columns or expanded tables. It never adds a filter; it only removes them if present.
ALLEXCEPT | Returns all the distinct values of a table, ignoring filters on some of the columns of the expanded table. | Removes filters from an expanded table, except from the columns (or tables) passed as further arguments.
ALLNOBLANKROW | Returns all the distinct values of a column or table, ignoring the blank row added for invalid relationships. | Removes any filter from columns or expanded tables; also adds a filter that only removes the blank row. Thus, even if there are no filters, it actively adds one filter to the context.
ALLSELECTED | Returns the distinct values of a column or a table, as they are visible in the last shadow filter context. | Restores the last shadow filter context on tables or columns, if a shadow filter context is present. Otherwise, it does not do anything. It always adds filters, even in the case where the filter shows all the values.
ALLCROSSFILTERED | Not available as a table function. | Removes any filter from an expanded table, including also the tables that can be reached directly or indirectly through bidirectional cross-filters. ALLCROSSFILTERED never adds a filter; it only removes filters if present.
The “Table function” column in Table 14-5 corresponds to the scenario where the ALL* function is being used in a DAX expression, whereas the “CALCULATE modifier” column is the specific case when the ALL* function is the top-level function of a filter argument in CALCULATE. Another significant difference between the two usages is that when one retrieves the result of these ALL* functions through an EVALUATE statement, the result contains only the base table columns and not the expanded table. Nevertheless, internal calculations like the context transition always use the corresponding expanded table. The following examples of DAX code show the different uses of the ALL function. The same concepts can be applied to any function of the ALL* family. In the following example, ALL is used as a simple table function.
    SUMX (
        ALL ( Sales ),                        -- ALL is a table function
        Sales[Quantity] * Sales[Net Price]
    )
In the next example there are two formulas, involving iterations. In both cases the Sales Amount measure reference generates the context transition, and the context transition happens on the expanded table. When used as a table function, ALL returns the whole expanded table.
    FILTER (
        Sales,
        [Sales Amount] > 100    -- The context transition takes place
                                -- over the expanded table
    )

    FILTER (
        ALL ( Sales ),          -- ALL is a table function
        [Sales Amount] > 100    -- The context transition takes place
                                -- over the expanded table anyway
    )
In the next example we use ALL as a CALCULATE modifier to remove any filter from the expanded version of Sales:
    CALCULATE (
        [Sales Amount],
        ALL ( Sales )    -- ALL is a CALCULATE modifier
    )
This latter example, although similar to the previous one, is indeed very different. ALL is not used as a CALCULATE modifier; instead, it is used as an argument of FILTER. In such a case, ALL behaves as a regular table function returning the entire expanded Sales table.
    CALCULATE (
        [Sales Amount],
        FILTER ( ALL ( Sales ), Sales[Quantity] > 0 )    -- ALL is a table function
                                                         -- The filter context receives the
                                                         -- expanded table as a filter anyway
    )
The following are more detailed descriptions of the functions included in the ALL* family. These functions look simple, but they are rather complex. Most of the time, their behavior is exactly what is needed, but they might produce undesired effects in boundary cases. It is not easy to remember all these rules and all the specific behaviors. We hope our reader finds Table 14-5 useful when unsure about an ALL* function.
ALL When used as a table function, ALL is a simple function. It returns all the distinct values of one or more columns, or all the values of a table. When used as a CALCULATE modifier, it acts as a hypothetical REMOVEFILTER function. If a column is filtered, it removes the filter. It is important to note that if a column is cross-filtered, then the filter is not removed. Only direct filters are removed by ALL. Thus, using ALL ( Product[Color] ) as a CALCULATE modifier might still leave Product[Color] cross-filtered in case there is a filter on another column of the Product table. ALL operates on the expanded table. This is why ALL ( Sales ) removes any filter from the tables in the sample model: the expanded Sales table includes all the tables of the entire model. ALL with no arguments removes any filter from the entire model.
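The difference between a direct filter and a cross-filter can be observed with a small query. This is only a sketch: it assumes that "Contoso" is a brand and "Red" is a color in the sample model, and the column names in the result are made up for the example.

```dax
EVALUATE
CALCULATETABLE (
    ROW (
        -- Colors visible with both the Brand and the Color filters active.
        "Colors in context", COUNTROWS ( VALUES ( 'Product'[Color] ) ),
        -- ALL as a CALCULATE modifier removes the direct filter on Color,
        -- but Color is still cross-filtered by the filter on Brand.
        "Colors after ALL modifier",
            CALCULATE (
                COUNTROWS ( VALUES ( 'Product'[Color] ) ),
                ALL ( 'Product'[Color] )
            ),
        -- ALL as a table function returns every color in the model.
        "Colors in the model", COUNTROWS ( ALL ( 'Product'[Color] ) )
    ),
    'Product'[Brand] = "Contoso",   -- assumed brand
    'Product'[Color] = "Red"        -- assumed color
)
```

If the assumed brand does not sell every color, the second column is expected to be smaller than the third one, which shows that the cross-filter coming from Brand survives ALL ( 'Product'[Color] ).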
ALLEXCEPT When used as a table function, ALLEXCEPT returns all the distinct values of the columns in a table, except the columns listed. If used as a filter, the result includes the full expanded table. When used as a filter argument in CALCULATE, ALLEXCEPT acts exactly as an ALL, but it does not remove the filter from the columns provided as arguments. It is important to remember that using ALL/VALUES is not the same as ALLEXCEPT. ALLEXCEPT only removes filters, whereas ALL removes filters while VALUES retains cross-filtering by imposing a new filter. Though subtle, this difference is important.
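A minimal sketch of the difference follows. It assumes that "Red" is a color in the sample model and uses a filter context that filters Color only, leaving Brand without any direct filter; the column names in the result are made up for the example.

```dax
EVALUATE
CALCULATETABLE (
    ROW (
        -- ALLEXCEPT only removes filters: Brand has no direct filter to keep,
        -- so every filter on Product is removed.
        "ALLEXCEPT",
            CALCULATE (
                [Sales Amount],
                ALLEXCEPT ( 'Product', 'Product'[Brand] )
            ),
        -- VALUES is evaluated in the original filter context, where Brand is
        -- cross-filtered by Color; its result is applied as a new filter
        -- after ALL removed the filters on Product.
        "ALL and VALUES",
            CALCULATE (
                [Sales Amount],
                ALL ( 'Product' ),
                VALUES ( 'Product'[Brand] )
            )
    ),
    'Product'[Color] = "Red"   -- assumed color
)
```

The first column is expected to return the sales of all products, whereas the second one returns the sales of the brands that have at least one product of the assumed color. The two numbers coincide only if every brand sells that color.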
ALLNOBLANKROW When used as a table function, ALLNOBLANKROW behaves like ALL, but it does not return the blank row potentially added because of invalid relationships. ALLNOBLANKROW can still return a blank row, if blanks are present in the table. The only row that is never returned is the one added automatically by the engine to fix invalid relationships. When used as a CALCULATE modifier, ALLNOBLANKROW replaces all the filters with a new filter that only removes the blank row. Therefore, all the columns will only filter out the blank value.
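As a quick check of the table-function behavior, the following sketch counts the brands returned by the two functions. The column names in the result are made up for the example.

```dax
EVALUATE
ROW (
    -- Includes the blank row, if the engine added one to Product because
    -- of an invalid relationship (for example, sales of unknown products).
    "Brands with ALL", COUNTROWS ( ALL ( 'Product'[Brand] ) ),
    -- Never includes the blank row added by the engine.
    "Brands with ALLNOBLANKROW", COUNTROWS ( ALLNOBLANKROW ( 'Product'[Brand] ) )
)
```

The two counts are expected to differ by one only when the relationships pointing to Product are invalid for some rows; in a model with valid relationships they should be identical.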
ALLSELECTED When used as a table function, ALLSELECTED returns the values of a table (or column) as filtered in the last shadow filter context. When used as a CALCULATE modifier, it restores the last shadow filter context on each column. If multiple columns are present in different shadow filter contexts, it uses the last shadow filter context for each column.
ALLCROSSFILTERED ALLCROSSFILTERED can be used only as a CALCULATE modifier and cannot be used as a table function. ALLCROSSFILTERED has only one argument that must be a table. ALLCROSSFILTERED removes all the filters on an expanded table (like ALL) and on columns and tables that are cross-filtered because of bidirectional cross-filters set on relationships directly or indirectly connected to the expanded table.
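The following sketch shows ALLCROSSFILTERED in its only supported role, as a CALCULATE modifier. The measure name Sales All CrossFiltered is hypothetical.

```dax
DEFINE
    -- Hypothetical measure: removes every filter on the expanded Sales table
    -- and on tables reached through bidirectional cross-filters.
    MEASURE Sales[Sales All CrossFiltered] =
        CALCULATE ( [Sales Amount], ALLCROSSFILTERED ( Sales ) )
EVALUATE
ADDCOLUMNS (
    VALUES ( 'Product'[Brand] ),
    "Sales_Amount", [Sales Amount],
    "Sales All CrossFiltered", [Sales All CrossFiltered]
)
```

Because Product belongs to the expanded Sales table, the filter on Brand produced by the context transition is removed, and the second column is expected to show the same grand total on every row.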
Understanding data lineage
We introduced data lineage in Chapter 10, “Working with the filter context,” and we have shown our readers how to control data lineage using TREATAS. In Chapter 12, “Working with tables,” and Chapter 13, “Authoring queries,” we described how certain table functions can manipulate the data lineage of the result. This section is a summary of the rules to remember about data lineage, with additional information we could not cover in previous chapters. Here are the basic rules of data lineage:
■ Each column of a table in a data model has a unique data lineage.
■ When a filter context filters the model, it filters the model column with the same data lineage of the columns included in the filter context.
■ Because a filter is the result of a table, it is important to know how a table function may affect the data lineage of the result:
   • In general, columns used to group data keep their data lineage in the result.
   • Columns containing the result of an aggregation always have a new data lineage.
   • Columns created by ROW and ADDCOLUMNS always have a new data lineage.
   • Columns created by SELECTCOLUMNS keep the data lineage of the original column whenever the expression is just a copy of a column in the data model; otherwise, they have a new data lineage.
For example, the following code seems to produce a table where each product color has a corresponding Sales Amount value summing all the sales for that color. Instead, because C2 is a column created by ADDCOLUMNS, it does not have the same lineage as Product[Color], even though it has the same content. Please note that we had to use several steps: first, we create the C2 column; then we select that column only. If other columns remained in the same table, then the result would be very different. DEFINE MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) EVALUATE VAR NonBlueColors = FILTER ( ALL ( 'Product'[Color] ), 'Product'[Color] <> "Blue" ) VAR AddC2 = ADDCOLUMNS ( NonBlueColors, "[C2]", 'Product'[Color] ) VAR SelectOnlyC2 = SELECTCOLUMNS ( AddC2, "C2", [C2] ) VAR Result = ADDCOLUMNS ( SelectOnlyC2, "Sales Amount", [Sales Amount] ) RETURN Result ORDER BY [C2]
The previous query produces a result where the Sales Amount column always has the same value, corresponding to the sum of all the rows in the Sales table. This is shown in Figure 14-18.
FIGURE 14-18 The C2 column does not have the same data lineage as Product[Color].
TREATAS can be used to transform the data lineage of a table. For example, the following code restores the data lineage to Product[Color] so that the last ADDCOLUMNS computes Sales Amount leveraging the context transition over the Color column: DEFINE MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) EVALUATE VAR NonBlueColors = FILTER ( ALL ( 'Product'[Color] ), 'Product'[Color] <> "Blue" ) VAR AddC2 = ADDCOLUMNS ( NonBlueColors, "[C2]", 'Product'[Color] ) VAR SelectOnlyC2 = SELECTCOLUMNS ( AddC2, "C2", [C2] ) VAR TreatAsColor = TREATAS ( SelectOnlyC2, 'Product'[Color] ) VAR Result = ADDCOLUMNS ( TreatAsColor, "Sales Amount", [Sales Amount] ) RETURN Result ORDER BY 'Product'[Color]
As a side effect, TREATAS also changes the column name, which must be correctly referenced in the ORDER BY condition. The result is visible in Figure 14-19.
FIGURE 14-19 The Color column in the result has the same data lineage as Product[Color].
Conclusions In this chapter we introduced two complex concepts: expanded tables and shadow filter contexts. Expanded tables are at the core of DAX. It takes some time before one gets used to thinking in terms of expanded tables. However, once the concept of expanded tables has become familiar, they are much simpler to work with than relationships. Only rarely does a developer have to deal with expanded tables, but knowing about them proves to be invaluable when they are the only way to make sense of a result. In this regard, shadow filter contexts are like expanded tables: They are hard to see and understand, but when they come into play in the evaluation of a formula, they explain exactly how the numbers were computed. Making sense of a complex formula that uses ALLSELECTED without first mastering shadow filter contexts is nearly impossible. However, both concepts are so complex that the best thing to do is to try to avoid them. We do show a few examples of expanded tables being useful in Chapter 15. Shadow filter contexts are useless in code; they are merely a technical means for DAX to let developers compute totals at the visual level.
Try to avoid using expanded tables by only using column filters and not table filters in CALCULATE filter arguments. Doing this, the code will be much easier to understand. Usually, it is possible to ignore expanded tables, as long as they are not required for some complex measure. Try to avoid shadow filter context by never letting ALLSELECTED be called inside an iteration. The only iteration before ALLSELECTED needs to be the outermost iteration created by the query engine, mostly Power BI. Calling a measure containing ALLSELECTED from inside an iteration makes the calculation more complex. When you follow these two pieces of advice, your DAX code will be correct and easy to understand. Remember that experts can appreciate complexity, but they also understand when it is better to stay away from complexity. Avoiding table filters and ALLSELECTED inside iterations does not make a developer look uneducated. Rather, it puts the developer in the category of experts that want their code to always work smoothly.
CHAPTER 15
Advanced relationships At this point in the book, there are no more DAX secrets to share. In previous chapters we covered all there is to know about the syntax and the functionalities of DAX. Still, there is a long way to go. There are another two chapters dedicated to DAX, and then we will talk about optimization. The next chapter is dedicated to advanced DAX calculations. In this chapter we describe how to leverage DAX to create advanced types of relationships. These include calculated physical relationships and virtual relationships. Then, while on the topic of relationships, we want to share a few considerations about different types of physical relationships: one-to-one, one-to-many, and many-to-many. Each of these types of relationships is worth describing in its peculiarities. Moreover, a topic that still needs some attention is ambiguity. A DAX model can be—or become—ambiguous; this is a serious problem you need to be aware of, in order to handle it well. At the end of this chapter we cover a topic that is more relevant to data modeling than to DAX, which is relationships with different granularity. When a developer needs to analyze budget and sales, they are likely working with multiple tables with different granularity. Knowing how to manage them properly is a useful skill for DAX developers.
Implementing calculated physical relationships The first set of relationships we describe is calculated physical relationships. In scenarios where the relationship cannot be set because a key is missing, or when one needs to compute the key with complex formulas, a good option is to leverage calculated columns to set the relationship. The result is still a physical relationship; the only difference with a standard relationship is that the relationship key is a calculated column instead of being a column from the data source.
Computing multiple-column relationships
A Tabular model allows the creation of relationships based on a single column only. It does not support relationships based on multiple columns. Nevertheless, relationships based on multiple columns are useful when they appear in data models that cannot be changed. Here are two methods to work with relationships based on multiple columns:
■ Define a calculated column containing the composition of the keys; then use it as the new key for the relationship.
■ Denormalize the columns of the target table (the one-side in a one-to-many relationship) using the LOOKUPVALUE function.
As an example, consider the case of Contoso offering a “Products of the Day” promotion. On certain days, a discount is offered on a set of products. The model is visible in Figure 15-1.
FIGURE 15-1 The Discounts table needs a relationship based on two columns with Sales.
The Discounts table contains three columns: Date, ProductKey, and Discount. If a developer needs this information in order to compute the amount of the discount, they are faced with a problem: for any given sale, the discount depends on ProductKey and Order Date. Thus, it is not possible to create the relationship between Sales and Discounts; it would involve two columns, and DAX only supports relationships based on a single column. The first option is to create a new column in both Discounts and Sales, containing the combination of the two columns: Sales[DiscountKey] = COMBINEVALUES ( "-", Sales[Order Date], Sales[ProductKey] ) Discounts[DiscountKey] = COMBINEVALUES ( "-", Discounts[Date], Discounts[ProductKey] )
The calculated columns use the COMBINEVALUES function. COMBINEVALUES requires a separator and a set of expressions that are concatenated as strings, separated by the separator provided. One could obtain the same result in terms of column values by using a simpler string concatenation, but COMBINEVALUES offers a few advantages. Indeed, COMBINEVALUES is particularly useful when creating relationships based on calculated columns if the model uses DirectQuery. COMBINEVALUES assumes, but does not validate, that when the input values are different, the output strings are also different. Based on this assumption, when COMBINEVALUES is used to create calculated columns to build a relationship that joins multiple columns from two DirectQuery tables, an optimized join condition is generated at query time.
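To see why a separator matters when a key is composed from several columns, consider the following Python sketch. It is not DAX and it does not reproduce the internals of COMBINEVALUES; the helper name combine_values and the sample values are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical helper that mimics the string composition of a multi-column key.
# It only illustrates the role of the separator; it is not how the engine works.
def combine_values(separator, *values):
    return separator.join(str(v) for v in values)

# Two different pairs of values, written as plain strings for brevity.
pair_a = ("2019011", "5")
pair_b = ("201901", "15")

# With a separator, different inputs produce different keys.
key_a = combine_values("-", *pair_a)
key_b = combine_values("-", *pair_b)

# Without a separator, the two pairs collapse into the same string.
plain_a = combine_values("", *pair_a)
plain_b = combine_values("", *pair_b)

print(key_a != key_b, plain_a == plain_b)
```

A separator that also appears inside the values could still produce collisions, which is in line with the function assuming, rather than validating, that different inputs give different outputs.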
Note More details about optimizations obtained by using COMBINEVALUES with DirectQuery are available at https://www.sqlbi.com/articles/using-combinevalues-to-optimize-directquery-performance/.
Once the two columns are in place, one can finally create the relationship between the two tables. Indeed, a relationship can be safely created on top of calculated columns. This solution is straightforward and works well. Yet there are scenarios where this is not the best option because it requires the creation of two calculated columns with potentially many different values. As you learn in later chapters about optimization, this might have a negative impact on both model size and query speed.
The second option is to use the LOOKUPVALUE function. Using LOOKUPVALUE, one can denormalize the discount in the Sales table by defining a new calculated column containing the discount:
Sales[Discount] =
LOOKUPVALUE (
    Discounts[Discount],
    Discounts[ProductKey], Sales[ProductKey],
    Discounts[Date], Sales[Order Date]
)
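As a rough illustration of what this lookup does, the following Python sketch denormalizes a discount onto each sales row by searching on two columns at once. The sample rows are hypothetical, and the dictionary is only a stand-in for the search performed by LOOKUPVALUE; None plays the role of a blank when no matching row exists.

```python
from datetime import date

# Hypothetical sample rows; the real Discounts and Sales tables live in the model.
discounts = [
    {"Date": date(2019, 1, 15), "ProductKey": 10, "Discount": 0.10},
    {"Date": date(2019, 1, 15), "ProductKey": 20, "Discount": 0.25},
]
sales = [
    {"Order Date": date(2019, 1, 15), "ProductKey": 10, "Quantity": 2},
    {"Order Date": date(2019, 1, 16), "ProductKey": 10, "Quantity": 1},
]

# Index the lookup table by the pair of columns that identifies a discount.
discount_by_key = {(d["ProductKey"], d["Date"]): d["Discount"] for d in discounts}

# Denormalize: store the matching discount on each sales row, or None when
# no discount exists for that product on that date.
for row in sales:
    row["Discount"] = discount_by_key.get((row["ProductKey"], row["Order Date"]))

print([row["Discount"] for row in sales])
```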
Following this second pattern, no relationship is created. Instead, the Discount value is denormalized in the Sales table by performing a lookup.
Both options work well, and picking the right one depends on several factors. If Discount is the only column needed, then denormalization is the best option because it makes the code simple to author, and it reduces memory usage. Indeed, it requires a single calculated column with fewer distinct values compared to the two calculated columns required for a relationship. On the other hand, if the Discounts table contains many columns needed in the code, then each of them should be denormalized in the Sales table. This results in a waste of memory and possibly in decreased processing performance. In that case, the calculated column with the new composite key might be preferable.
This simple first example is important because it demonstrates a common and important feature of DAX: the ability to create relationships based on calculated columns. This demonstrates that a user can create a new relationship, provided that they can compute and materialize the key in a calculated column. The next example demonstrates how to create relationships based on static ranges. By extending the concept, it is possible to create several kinds of relationships.
CHAPTER 15 Advanced relationships
Implementing relationships based on ranges In order to show why calculated physical relationships are a useful tool, we examine a scenario where one needs to perform a static segmentation of products based on their list price. The price of a product has many different values and performing an analysis slicing by price does not provide useful insights. In that case, a common technique is to partition the different prices into separate buckets, using a configuration table like the one in Figure 15-2.
FIGURE 15-2 This is the Configuration table for the price ranges.
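The bucketing idea can be sketched outside of DAX as a lookup that reads the range boundaries from a configuration table. The following Python sketch is only an illustration: the boundaries and keys are invented and are not those of Figure 15-2, and the function name price_range_key is hypothetical.

```python
from bisect import bisect_right

# Hypothetical configuration table: (PriceRangeKey, MinPrice, MaxPrice),
# sorted by MinPrice, with each range covering MinPrice <= price < MaxPrice.
# These boundaries are invented for illustration only.
price_ranges = [
    (1, 0, 10),
    (2, 10, 100),
    (3, 100, 1000),
]
lower_bounds = [min_price for _, min_price, _ in price_ranges]

def price_range_key(net_price):
    """Return the key of the range containing net_price, or None if no range matches."""
    position = bisect_right(lower_bounds, net_price) - 1
    if position < 0:
        return None
    key, min_price, max_price = price_ranges[position]
    return key if min_price <= net_price < max_price else None

print([price_range_key(p) for p in (5, 10, 99.99, 2500)])
```

Because the boundaries live in the table rather than in the function, changing a range only requires editing the configuration data.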
As was the case in the previous example, it is not possible to create a direct relationship between the Sales table and the Configuration table. The reason is that the key in the configuration table depends on a relationship based on a range of values (also known as a between condition), which is not supported by DAX. We could compute a key in the Sales table by using nested IF statements; however, this would require including the values of the configuration table in the formula like in the following example, which is not the suggested solution:
Sales[PriceRangeKey] = SWITCH ( TRUE (), Sales[Net Price] Sales[Net Price] Sales[Net Price] Sales[Net Price] 5 )
70
VAR IsExpensive = IsLargeTransaction || IsLargePrice
RETURN IsExpensive
ExpensiveTransactions := CALCULATE ( COUNTROWS ( Sales ), Sales[IsExpensive] = TRUE )
The calculated column containing a logical value (TRUE or FALSE) usually benefits from good compression and a low memory cost. It is also very effective at execution time because it applies a direct filter to the scan of the Sales table required to count the rows. In this case, the benefit at query time is usually evident. Just consider if it is worth the longer processing time for the column; that processing time must be measured before making a final decision.
598 CHAPTER 18 Optimizing VertiPaq
Processing of calculated columns The presence of one or more calculated columns slows down the refresh of any part of a table that is somewhat related to the calculated column. This section describes the reasons for that; it also provides background information on why an incremental refresh operation can be very expensive because of the presence of calculated columns. Any refresh operation of a table requires recomputing all the calculated columns in the entire data model referencing any column of that table. For example, refreshing a partition of a table—as during any incremental refresh—requires a complete update of all the calculated columns stored in the table. Such a calculation is performed for all the rows of the table, even though the refresh only affects a single partition of the table. It does not matter whether the expression of the calculated column only depends on other columns of the same table; the calculated column is always computed for the entire table and not for a single partition. Moreover, the expression of a calculated column might depend on the content of other tables. In this case, the calculated columns referencing a partially refreshed table must also be recalculated to guarantee the consistency of the data model. The cost for computing a calculated column usually depends on the number of rows of the table where the column is stored. The process of a calculated column is a single-thread job, which iterates all the rows of the table to compute the column expression. In case there are several calculated columns, they are evaluated one at a time, making the entire operation a process bottleneck for large tables. For these reasons, creating a calculated column in a large table with hundreds of millions of rows is not a good idea. Creating tens of calculated columns in a large table can result in a very long processing time, adding minutes to the time required to process the native data.
Choosing the right columns to store The previous section about calculated columns explained that storing a column that can be computed row-by-row using other columns of the same table is not always an advantage. The same consideration is also valid for native columns of the table. When choosing the columns to store in a table, consider the memory size and the query performance. Good optimizations of resource allocation (and memory in particular) are possible by doing the right evaluation in this area. We consider the following types of columns in a table:
■ Primary or alternate keys: The column contains a unique value for each row of the table.
■ Qualitative attributes: The column can be text or number, used to group and/or filter rows in a table; for instance, name, color, city, country.
■ Quantitative attributes: The number is a value used both as a filter (for example, less than a certain value) and as an argument in a calculation, such as price, amount, quantity.
■ Descriptive attributes: The column contains text providing additional information about a row, but its content is never used to filter or to aggregate rows (for example, notes, comments).
■ Technical attributes: Information recorded in the database for technical reasons, without a business value, such as username of last update, timestamp, GUID for replication.
The general principle is to try to minimize the cardinality of the columns imported into a table, not importing columns that have a high cardinality and that are not relevant for the analysis. However, every type of column deserves additional considerations.
The columns for primary or alternate keys are necessary if there are one or more one-to-many relationships with other tables. For instance, the product code and the product key columns of a table of products are certainly required columns. However, a table should not include a primary or alternate key column not used in a relationship with other tables. For example, the Sales table might have a unique identifier for each row in the original table. Such a column has a cardinality that corresponds to the number of rows of the Sales table. Moreover, a unique identifier is not necessary for relationships because no tables target Sales for a relationship. For these reasons, it is a very expensive column in terms of memory, and it should not be imported in memory. In a composite data model, a similar high-granularity column could be accessed only through DirectQuery without being stored in memory, as described later in the “Optimizing column storage” section of this chapter.
A table should always include qualitative attributes that have a low cardinality because they have good compression and might be useful for the analysis. For example, the product category is a column that has a low cardinality, related to the Product table. In case there is a high cardinality, we should consider carefully whether to import the column or not because its storage memory cost can be high. The high selectivity might justify the cost, but we should check that filters in queries usually select a low number of values in that column. For instance, the production lot number might be a piece of information included in the Sales table that users want to filter at query time. Its high cost might be justified by a business need to apply this filter in certain queries.
All the quantitative attributes are generally imported to guarantee any calculation, although we might consider skipping columns providing redundant information. Consider the Quantity, Price, and Amount columns of a Sales table, where the Amount column contains the result of the product between Quantity and Price. We probably want to create measures that aggregate each of these columns; yet we will probably calculate the price as a weighted average considering the sum of amount and quantity, instead of a simple average of the price considering each transaction at the same level. This is an example of the measures we want to define:
Sum of Quantity := SUM ( Sales[Quantity] )
Sum of Amount := SUM ( Sales[Amount] )
Average Price := DIVIDE ( [Sum of Amount], [Sum of Quantity] )
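The difference between the weighted average and the simple average of the price can be checked with a small numeric example. The following Python sketch uses two hypothetical transactions; only the arithmetic mirrors the Average Price measure, and the variable names are illustrative.

```python
# Hypothetical transactions: (Quantity, Price). Amount is Quantity * Price.
transactions = [(1, 10.0), (9, 20.0)]

sum_of_quantity = sum(q for q, _ in transactions)
sum_of_amount = sum(q * p for q, p in transactions)

# Weighted average, same arithmetic as DIVIDE ( [Sum of Amount], [Sum of Quantity] )
weighted_average_price = sum_of_amount / sum_of_quantity

# Simple average of the Price column, treating every transaction equally
simple_average_price = sum(p for _, p in transactions) / len(transactions)

print(weighted_average_price, simple_average_price)  # 19.0 15.0
```

The weighted average is closer to 20 because nine of the ten units were sold at that price.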
By looking at these measures, we might say that we only need to import Quantity and Amount in the data model, without importing the Price column, which is not used by these measures. However, if we consider the cardinality of the columns, we start to have doubts. If there are 100 unique values in the Quantity column, and there are 10,000 unique values in the Price column, we might have up to 1,000,000 unique values in the Amount column. At this point, we might consider importing only the Quantity and Price columns, using the following definition of the measures in the data model; only Sum of Amount changes, the other two measures do not change:
Sum of Quantity := SUM ( Sales[Quantity] )
Sum of Amount := SUMX ( Sales, Sales[Quantity] * Sales[Price] )
Average Price := DIVIDE ( [Sum of Amount], [Sum of Quantity] )
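The upper bound on the cardinality of Amount is simply the product of the cardinalities of Quantity and Price: 100 × 10,000 = 1,000,000 in the example above. The following Python sketch checks the same reasoning on a much smaller, hypothetical pair of columns; the actual number of distinct amounts depends on the data and is usually below the bound.

```python
# Hypothetical columns: 10 distinct quantities and 50 distinct prices (in cents).
quantities = range(1, 11)
prices = range(101, 151)

# Every (quantity, price) pair may yield its own amount, so the number of
# distinct amounts is bounded by the product of the two cardinalities.
upper_bound = len(quantities) * len(prices)
distinct_amounts = {q * p for q in quantities for p in prices}

print(len(quantities), len(prices), len(distinct_amounts), upper_bound)
```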
The new definition of the Sum of Amount measure might be slower because it has to scan two columns instead of one. However, these columns might be smaller than the original Amount. Trying to predict the faster option is very hard because we should also consider the distribution of the values in the table, and not only the cardinality of the column. We suggest measuring the memory used and the performance in both scenarios before making a final decision. Based on our experience, removing the Amount column in a small data model can be more important for Power BI and Power Pivot. Indeed, the available memory in personal computers is usually more limited than that of a server, and a smaller memory footprint also produces a faster loading time opening the smaller file. At any rate, in a large table with billions of rows stored in an Analysis Services Tabular model, the performance penalty of the multiplication between two columns (Quantity and Price) could be larger than the increased memory scan time for the Amount column. In this case, the better response time for the queries justifies the higher memory cost to store the Amount column. Regardless, we should measure size and performance in each specific case because the distribution of data plays a key role in compression and affects any decision pertaining to it.
Note Storing Quantity and Price instead of Amount is an advantage if the table is stored in VertiPaq, whereas it is not the suggested best practice for DirectQuery models. Moreover, if the table in VertiPaq contains billions of rows in memory, the Amount column can provide better query performance and it is compatible with future Aggregations over VertiPaq. More details in the section “Managing VertiPaq Aggregations” later in this chapter.
We should consider whether to import descriptive attributes or not. In general, they have a high storage cost for the dictionary of the column when imported in memory. A few examples of descriptive attributes are the Notes field in an invoice and the Description column in the Product table. Usually, these attributes are mainly used to provide additional information about a specific entity. Users hardly use this type of column to group or filter data; the typical use case is to get detailed drill-through information. The only issue with including these columns in the data model is their memory storage cost, mainly related to the column dictionary. If the column has many blank values and a low number of unique nonblank values in the table, then its dictionary will be small and the column cost will be more acceptable. Nevertheless, a column containing the transcription of conversations made in a call center is probably too expensive for a Service Calls table containing date, time, duration, and operator who managed the call. When the cost of storing descriptive attributes in memory is too expensive, we can consider only accessing them through DirectQuery in a composite data model.
A particular type of descriptive attribute is the information provided as detail for transactions in a drill-through operation. For example, the invoice number or the order number of a transaction is an attribute that has a high cardinality, but that could be important for some reports. In this case, we should consider the particular optimizations for drill-through attributes described in the next section, “Optimizing column storage.” Most of the time, there is no reason to import columns for technical attributes, such as timestamp, date, time, and operator of the last update. This information is mainly for auditing and forensic requirements. Unless we have a data model specifically built for auditing requirements, the need for this information is usually low in an analytical solution. However, technical attributes are good candidates for columns accessed only through DirectQuery in a composite data model.
Optimizing column storage The best optimization for a column is to remove the column from a table entirely. In the previous section, we described when this decision makes sense based on the type of columns in a table. Once we define the set of columns that are part of the data model, we can still use optimization techniques in order to reduce the amount of memory used, even though each optimization comes with side effects. In case the composite data model feature is available, an additional option is that of keeping a column in the data source, only making it accessible through DirectQuery.
Using column split optimization The memory footprint of a column can be lowered by reducing the column cardinality. In certain conditions, we can achieve this result by splitting the column into two or more parts. The column split cannot be obtained with calculated columns because that would require storing the original column in memory. We show examples of the split operation in SQL, but any other transformation tool (such as Power Query) can obtain the same result. For instance, if there is a 10-character string (such as the values in TransactionID), we can split the column in two parts, five characters each (as in TransactionID_High and TransactionID_Low):
SELECT
    LEFT ( TransactionID, 5 ) AS TransactionID_High,
    SUBSTRING ( TransactionID, 6, LEN ( TransactionID ) - 5 ) AS TransactionID_Low,
    ...
In case of an integer value, we can use division and modulo for a number that creates an even distribution between the two columns. If there is an integer TransactionID column with numbers between 0 and 100 million, we can divide them by 10,000 as in the following example:
SELECT
    TransactionID / 10000 AS TransactionID_High,
    TransactionID % 10000 AS TransactionID_Low,
    ...
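The arithmetic of the integer split, and the fact that the original value can be restored from the two parts, can be verified with a short Python sketch. The function names split_id and restore_id are hypothetical; the divisor is the one used in the SQL example.

```python
SPLIT_FACTOR = 10_000  # same divisor as in the SQL example

def split_id(transaction_id):
    """Split a non-negative integer ID into (high, low) parts using division and modulo."""
    return transaction_id // SPLIT_FACTOR, transaction_id % SPLIT_FACTOR

def restore_id(high, low):
    """Rebuild the original ID from its two parts."""
    return high * SPLIT_FACTOR + low

high, low = split_id(98_765_432)
print(high, low, restore_id(high, low))  # 9876 5432 98765432

# For IDs between 0 and 100 million, each part has roughly 10,000 distinct
# values at most, instead of up to 100 million for the original column.
max_high, _ = split_id(100_000_000)
print(max_high + 1, SPLIT_FACTOR)  # 10001 10000
```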
We can use a similar technique for decimal numbers. An easy split is separating the integer from the decimal part, although this might not produce an even distribution. For example, we can transform a UnitPrice decimal number column into UnitPrice_Integer and UnitPrice_Decimal columns:
SELECT
    FLOOR ( UnitPrice ) AS UnitPrice_Integer,
    UnitPrice - FLOOR ( UnitPrice ) AS UnitPrice_Decimal,
    ...
We can use the result of a column split as is in simple details reports or measures that restore the original value during the calculation. If available in the client tool, the Detail Rows feature allows us to control the drill-through operation, showing to the client the original column and hiding the presence of the two split columns.
Important The column split can optimize numbers aggregated in measures, using the separation between integer and decimal parts as in the previous example or similar techniques. However, consider that the aggregation operation will have to scan more than one column, and the total time of the operation is usually larger than with a single column. When optimizing for performance, saving memory might not be effective in this case, unless the dictionary is removed by enforcing value encoding instead of hash encoding for a currency or integer data type. A specific measurement is always required for a data model to validate whether such optimization also works from a performance point of view.
Optimizing high-cardinality columns A column with a high cardinality has a high cost because of a large dictionary, a large hierarchy structure, and a lower compression in encoding. The attribute hierarchy structure can be expensive and may be disabled under certain conditions. We describe how to disable attribute hierarchies in the next section. If it is not possible to disable the hierarchy, or if this reduction is not enough for memory optimization, then consider the column split optimization for a high-cardinality column used in a measure. We can hide this optimization from the user by hiding the split columns and by adapting the calculation in measures. For example, if we optimize UnitPrice using the column split, we can create the Sum of Amount measure this way:
Sum of Amount :=
SUMX (
    Sales,
    Sales[Quantity] * ( Sales[UnitPrice_Integer] + Sales[UnitPrice_Decimal] )
)
Remember that the calculation will be more expensive, and only an accurate measurement of the performance of the two models (with and without column split optimization) can establish which one is better for a specific data model.
Disabling attribute hierarchies The attribute hierarchy structure is required by MDX queries that reference the column as an MDX attribute hierarchy. This structure contains a sorted list of all the values of the column, and its creation might require a large amount of time during a refresh operation, including incremental ones. The size of this structure is measured in the Columns Hierarchies Size column of VertiPaq Analyzer. If a column is only used by measures and in drill-through results, and it is not shown to the user as an attribute to filter or group data, then the attribute hierarchy structure is not necessary because it is never used. The Available In MDX property of a column disables the creation of the attribute hierarchy structure when set to False. By default, this property is True. The name of this property in TMSL and TOM is isAvailableInMdx. Depending on the development tool and on the compatibility level of the data model, this property might not be available. A tool that shows this property is Tabular Editor: https://github.com/otykier/TabularEditor/releases/latest. The attribute hierarchy structure is also used in DAX to optimize sorting and filter operations. It is safe to set the isAvailableInMdx property to False when a column is only used in a measure expression, it is not visible, and it is never used to filter or sort data. This property is also documented at https://docs.microsoft.com/en-us/dotnet/api/microsoft.analysisservices.tabular.column.isavailableinmdx.
Optimizing drill-through attributes If a column contains data used only for drill-through operations, there are two possible optimizations. The first is the column split optimization; the second is keeping the columns accessible only through DirectQuery in a composite data model. When the column is not being used in measures, there are no concerns about possible costs of the materialization of the original values. By leveraging the Detail Rows feature, it is possible to show the original column in the result of a drill-through operation, hiding the presence of the two split columns. However, it is not possible to use the original value as a filter or group-by column. In a composite data model, the entire table can be made accessible through a DirectQuery request, whereas the columns used by relationships and measures can be included in an in-memory aggregation managed by the VertiPaq engine. This way, it is possible to get the best performance when aggregating data, whereas the query execution time will be longer when the drill-through attributes are requested to the data source via DirectQuery. The next section, “Managing VertiPaq Aggregations,” provides more details about that feature.
Managing VertiPaq Aggregations The VertiPaq storage engine can be used for managing aggregations over DirectQuery data sources (and in the future, also over large VertiPaq tables). Aggregations were initially introduced in late 2018 as a Power BI feature. That same feature could later be adopted by other products. The purpose of Aggregations is to reduce the cost of a storage engine request, removing the need for an expensive DirectQuery request in case the data is available in a smaller table containing aggregated data.
The Aggregations feature is not necessarily related to VertiPaq: it is possible to define aggregations in a DirectQuery model so that different tables are queried on the data source, depending on the granularity of a client request. However, the typical use case for Aggregations is defining them in a composite data model, where each table has three possible storage modes:
■ Import: The table is stored in memory and managed by the VertiPaq storage engine.
■ DirectQuery: The data is kept in the data source; at runtime, every DAX query might generate one or more requests to the data source, typically sending SQL queries.
■ Dual: The table is stored in memory by VertiPaq and can also be used in DirectQuery, typically joining other tables stored in DirectQuery or Dual mode.
The principle of aggregations is to provide different options to solve a storage engine request. For example, a Sales table can store the details of each transaction, such as product, customer, and date. When one creates an aggregation by product and month, the aggregated table has a much smaller number of rows. The Sales table could also have more than one aggregation, each one with a precedence used in case of multiple aggregations compatible with the same request. Consider a case where the following aggregations are available in a model with Sales, Product, Date, and Store:
■ Product and Date: precedence 50
■ Store and Date: precedence 20
If a query required the total of sales by product brand and year, it would use the first aggregation. The same aggregation would be used when drilling down at the month or day level. Indeed, the aggregation that has the Sales data at the Product and Date granularity can solve any query that groups rows by using attributes included in these tables. With the same logic, a query aggregating data by store country and year will use the second aggregation created at the granularity of Store and Date. However, a query aggregating data by store country and product brand cannot use any existing aggregation. Such queries must use the Sales table that has all the details because none of the aggregations available have a granularity compatible with the request. If two or more aggregations are compatible with the request, the choice is made based on the precedence setting defined for each aggregation: The engine chooses the aggregation with the highest precedence. Table 18-6 recaps the aggregations used based on the query request in the examples described.
TABLE 18-6 Examples of aggregation used, based on query request
Query Request | Aggregation Used
Group by product brand and year | Product and Date
Group by product brand and month | Product and Date
Group by store country and year | Store and Date
Group by store country and month | Store and Date
Group by year | Product and Date (highest precedence)
Group by month | Product and Date (highest precedence)
Group by store country and product brand | No aggregation (query Sales table at detail level)
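The selection rule summarized in Table 18-6 can be mimicked with a few lines of Python. This is a deliberately simplified sketch: it only models granularity (as a set of table names) and precedence, and it ignores the summarization types and the other matching conditions. The data structure and the function name choose_aggregation are hypothetical.

```python
# Hypothetical description of the two aggregations from the example:
# each one lists the tables that define its granularity and its precedence.
aggregations = [
    {"name": "Product and Date", "tables": {"Product", "Date"}, "precedence": 50},
    {"name": "Store and Date", "tables": {"Store", "Date"}, "precedence": 20},
]

def choose_aggregation(group_by_tables):
    """Return the compatible aggregation with the highest precedence, or None."""
    compatible = [a for a in aggregations if set(group_by_tables) <= a["tables"]]
    if not compatible:
        return None  # the request must be solved by the detailed Sales table
    return max(compatible, key=lambda a: a["precedence"])["name"]

print(choose_aggregation({"Product", "Date"}))   # product brand and year
print(choose_aggregation({"Store", "Date"}))     # store country and month
print(choose_aggregation({"Date"}))              # year only: both are compatible
print(choose_aggregation({"Store", "Product"}))  # no compatible aggregation
```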
The engine chooses the aggregation to use only considering the precedence order, regardless of the aggregation storage mode. Indeed, every aggregation has an underlying table that can be stored either in VertiPaq or in DirectQuery. Common sense would suggest that a VertiPaq aggregation should be preferred over a DirectQuery aggregation. Nevertheless, the DAX engine only follows precedence rules. If a DirectQuery aggregation has a higher precedence over a VertiPaq aggregation, and both are candidates to speed up a request, the engine chooses the DirectQuery aggregation. It is up to the developer to define a good set of precedence rules. An aggregation can match a storage engine request depending on several conditions:
■ Granularity of the relationships involved in the storage engine request.
■ Matching of columns defined as GroupBy in the summarization type of the aggregation.
■ Summarization corresponding to a simple aggregation of a single column.
■ Presence of a Count summarization of the detail table.
These conditions might have an impact on the data model design. A model that imports all the tables in VertiPaq usually is designed to minimize the memory requirements. As described in the previous section, “Choosing the right columns to store,” storing the Quantity and Price columns allows the developer to compute the Amount at query time using a measure such as:
Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Price] )
This version of the Sales Amount measure might not use an aggregation with a Sum summarization type because the Sum summarization only references a single column. However, an aggregation could match the request if Sales[Quantity] and Sales[Price] have the GroupBy summarization and if there is a Count summarization of the Sales table. For complex expressions it could be hard to define an efficient aggregation, and this could impact the model and aggregation design. Consider the following code as an educational example. If there are two Sum aggregations for the Sales[Amount] and Sales[Cost] columns, then a Margin measure should be implemented using the difference between two aggregations (Margin1 and Margin2), instead of aggregating the difference computed row-by-row (Margin3).
Sales Amount := SUM ( Sales[Amount] )                          -- Can use Sum aggregations
Total Cost   := SUM ( Sales[Cost] )                            -- Can use Sum aggregations
Margin1      := [Sales Amount] - [Total Cost]                  -- Can use Sum aggregations
Margin2      := SUM ( Sales[Amount] ) - SUM ( Sales[Cost] )    -- Can use Sum aggregations
Margin3      := SUMX ( Sales, Sales[Amount] - Sales[Cost] )    -- CANNOT use Sum aggregations
However, the Margin3 measure could match an aggregation that defines the GroupBy summarization for the Sales[Amount] and Sales[Cost] columns and that also includes a Count summarization of the Sales table. Such aggregation would potentially also be useful for the previous definitions of the Sales Amount and Total Cost measures, even though it would be less efficient than a Sum aggregation on the specific column.
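The reasoning about the Margin measures can be followed on a handful of rows. The Python sketch below uses hypothetical Sales rows and is not a model of the engine: it only shows that the difference of two totals, the row-by-row difference, and a computation over rows grouped by (Amount, Cost) with a row count all produce the same margin, while each needs a different shape of pre-aggregated data.

```python
from collections import Counter

# Hypothetical Sales rows: (Amount, Cost).
sales = [(100, 60), (100, 60), (250, 200), (40, 10)]

# Margin2: difference between two column totals, each of which a Sum
# aggregation could provide without scanning the detail rows.
margin_from_sums = sum(a for a, _ in sales) - sum(c for _, c in sales)

# Margin3: row-by-row difference, which needs both columns on every row.
margin_row_by_row = sum(a - c for a, c in sales)

# GroupBy on (Amount, Cost) plus a Count of rows: one entry per distinct pair.
grouped = Counter(sales)
margin_from_grouped = sum(count * (a - c) for (a, c), count in grouped.items())

print(margin_from_sums, margin_row_by_row, margin_from_grouped, len(grouped))
# 160 160 160 3
```

The grouped version holds three entries instead of four rows here; on real data its size depends on how many distinct (Amount, Cost) pairs exist, which is why it tends to be less compact than two plain totals.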
As of April 2019, the Aggregations feature is available for DirectQuery tables. While it is not possible to define aggregations for a table imported in memory, that feature might be implemented in the near future. At that point, all these combinations will become possible:
■ DirectQuery aggregation over a DirectQuery table
■ VertiPaq aggregation over a DirectQuery table
■ VertiPaq aggregation over a VertiPaq table (not available as of April 2019)
The ability to create a VertiPaq aggregation over VertiPaq tables will provide a tool to optimize two scenarios for models imported in memory: very large tables (billions of rows) and relationships with a high cardinality (millions of unique values). These two scenarios can be managed by manually modifying the data model and the DAX code as described in the “Denormalization” section earlier in this chapter. The aggregations over VertiPaq tables will automate this process, resulting in better performance, reduced maintenance, and decreased development costs.
Conclusions In this chapter we focused on how to optimize a data model imported in memory using the VertiPaq storage engine. The goal is to reduce the memory required for a data model, obtaining as a side effect an improvement in query performance. VertiPaq can also be used to store aggregations in composite models, combining the use of the DirectQuery and VertiPaq storage engines in a single model. The main takeaways of this chapter are:
■ Only import in memory the columns required for the analysis.
■ Control column cardinality, as a low-cardinality column has better compression.
■ Manage date and time in separate tables and store them at the proper granularity level for the analysis. Storing a precision higher than required (e.g., milliseconds) consumes memory and lowers query performance.
■ Consider using VertiPaq to store in-memory aggregations for DirectQuery data sources in composite models.
CHAPTER 19
Analyzing DAX query plans DAX is a functional language with an advanced query engine that can use different storage engines. As is the case with many query languages, it is usually possible to get the same result using different DAX expressions, each one performing differently. Optimizing a measure or a query requires finding the most efficient way to obtain the desired result. In order to find a more efficient implementation for an expression, the first step is to identify the bottlenecks of the existing code. This chapter describes the components of the DAX query engine in more detail, explaining how to obtain information about query plans and performance counters related to a particular DAX expression using DAX Studio. This knowledge is fundamental to optimize any DAX formula.
Capturing DAX queries In order to analyze a query plan, it is necessary to execute a DAX query. A report in Power BI or Excel automatically generates queries that invoke measures included in the data model. Thus, optimizing a DAX measure requires analyzing and optimizing the DAX query that invokes that measure. Collecting the queries generated for a report is the first step in the DAX optimization journey. Indeed, a single slow report is likely to generate dozens of queries. The careful developer should find the slowest query out of them all, thus focusing on the biggest bottleneck first.
DAX Studio (http://daxstudio.org/) is a free open-source tool that offers several useful features to capture and analyze DAX queries. The following example shows how DAX Studio connects to a Power BI data model to capture the queries generated by a report page. The Power BI report shown in Figure 19-1 contains one visual that is slower to display. The table in the bottom-left corner with two columns (Product Name and Customers) requires a few seconds to be updated when the page is first opened and when the user changes the Continent slicer selection. We know this because we created the report on purpose. But how would one uncover the slowest visual in a report? DAX Studio proves to be very helpful in this.
FIGURE 19-1 A Power BI report with many visuals, one of which is slower to display.
DAX Studio can connect to a Power BI model by selecting the name of a Power BI Desktop file already opened on the same computer. This is shown in Figure 19-2.
FIGURE 19-2 DAX Studio can connect to multiple types of Tabular models, including Power BI.
Once connected, DAX Studio can start capturing all the queries sent to the Tabular engine after the user activates the All Queries button in the Traces tab of the Home ribbon. This is visible in Figure 19-3.
FIGURE 19-3 The All Queries feature captures all the queries sent to the Tabular engine.
At this point, every action in the client might produce one or more queries. For example, Power BI generates at least one DAX query for every visual in the page. Figure 19-4 shows the queries captured in the sample from Figure 19-1 when selecting the Asia continent in the Continent slicer.
FIGURE 19-4 The All Queries pane shows all the queries captured by DAX Studio.
Note DAX Studio listens to all the queries sent to the Tabular server. By connecting DAX Studio to Power BI Desktop, the queries are always executed by the same user on the same database. Different Power BI files require different connections and a different window in DAX Studio. However, a connection to Analysis Services (which requires administrative rights) will show queries executed by different users and on different databases. The query type will be MDX for any queries generated by a client like Excel.
The Duration column shows the execution time in milliseconds, and the Query column contains the complete text of the query executed on the server. You can easily check that the first query has a duration of around three seconds. All the remaining queries are very fast, thus not worth any further attention. In a real-world report you likely will notice more than one slow query. DAX Studio lets you quickly discover the slowest queries, focusing the attention on those and avoiding any waste of time on measures and queries that are quick enough.
When you double-click on a line in the All Queries list, the query is copied into the editor window. For example, Figure 19-5 shows the complete text of the first query in the previous list. When you press the highlighted Format Query button on the Home tab, the query is also formatted using the DAX Formatter web service.
Once a slow query is identified following these steps, it can be executed in DAX Studio multiple times. One would analyze its query plan and other metrics to evaluate the bottlenecks and to try changes that could improve performance. The following sections analyze very simple queries created from scratch for educational reasons, although the end goal is to also analyze queries captured from a real workload.
FIGURE 19-5 The Format Query button invokes DAX Formatter to format the DAX code in the editor.
Introducing DAX query plans The DAX engine provides several details about how it executes a query in the query plan. However, “query plan” is a generic definition for a set of information including two different types of query plans (logical and physical) and a list of storage engine queries used by the physical query plan. Unless otherwise specified, the generic term “query plan” references the whole set of details available. These are introduced in this section and explained in more detail in the following part of the chapter. In Chapter 17, “The DAX engines,” we explained that there are two layers in the DAX query engine: the formula engine (FE) and the storage engine (SE). Every query result is produced by executing the following steps:
1. Building an Expression Tree. The engine transforms the query from a string to an expression tree, a data structure that is easier to manipulate for further optimization.
2. Building a Logical Query Plan. The engine produces a list of the logical operations required to execute the query. This tree of logical operators resembles the original query syntax. It is easy to find a correspondence between a DAX function and a similar operation in the logical query plan.
3. Building a Physical Query Plan. The engine transforms the logical query plan into a set of physical operations. A physical query plan is still a tree of operators, but the resulting tree can be different from the logical query plan.
4. Executing the Physical Query Plan. The engine finally executes the physical query plan, retrieving data from the SE and computing the query calculations.
The first step is not relevant to performance analysis. Steps 2 and 3 involve the formula engine, whereas step 4 also involves the storage engine (SE). Technically, step 3 is the most important for determining how the query works, even though the physical query plan is available only after the actual execution of a query (step 4). Therefore, it is necessary to wait for the execution of a query before being able to see its physical query plan. However, during the execution of step 4, there are other interesting pieces of information (SE requests) that are easier to read compared to the physical query plan. For this reason, we will see how the analysis of a query often starts from the analysis of the SE requests generated at step 4.
Note Tabular can be queried in both MDX and DAX, even though its natural language is DAX. Nevertheless, the engine does not translate MDX into DAX. MDX queries generate both a logical and a physical query plan just as DAX queries do. Keep in mind that the same query written in DAX or in MDX typically produces different query plans despite returning similar results. Here the focus is on the DAX language; however, the information provided in this chapter is useful to analyze how Tabular handles MDX queries as well.
Collecting query plans As explained in the previous section, a DAX query generates both a logical and a physical query plan. These plans describe the operations performed by the query engine in detail. Unfortunately, the query plan is only available in textual representation, not graphical visualization. Because of the complexity and length of a typical query plan, other tools and techniques should be used to optimize a DAX expression before starting to analyze the query plan in detail. However, it is important to understand the basics of a DAX query plan in order to both understand the behavior of the engine and quickly spot potential bottlenecks in longer and more complex query plans. We will now describe in greater detail the different parts of a query plan using a simple query. As you will see, even the simplest query produces rather complex plans. As an example, consider this query executed in DAX Studio: EVALUATE { SUM ( Sales[Quantity] ) }
The result of the table constructor is a table with one row and one column (Value), filled with the sum of the Quantity column for all the rows of the Sales table, as shown in Figure 19-6.
FIGURE 19-6 The result of a query with a simple table constructor with one row and one column.
The next sections describe the query plans generated and executed by this DAX query. Later on we will see how to obtain this information for any query. At this stage, just focus your attention on the role of the query plans, how they are structured, and the information they provide.
Introducing logical query plans The logical query plan is a close representation of the DAX query expression tree. Figure 19-7 shows the logical query plan of the previous query.
FIGURE 19-7 The logical query plan of a simple query.
Each line is an operator, and the following lines, indented, are the parameters of the operator. By ignoring the parameters for each operator for a moment, it is possible to envision a simpler structure:
AddColumns:
    Sum_Vertipaq:
        Scan_Vertipaq:
        'Sales'[Quantity]:
The outermost operator is AddColumns. It creates the one-row table with the Value column containing the value returned by the DAX query. The Sum_Vertipaq operator scans the Sales table and sums the Sales[Quantity] column. The two operators included within Sum_Vertipaq are Scan_Vertipaq and a reference to the scanned column. This query plan in plain English would be: “Create a table with a column named Value, filled with the content of a SUM operation, performed by the storage engine by scanning the Quantity column in the Sales table.” The logical query plan shows what the DAX query engine plans to do in order to compute the results. Not surprisingly, it scans Sales summarizing Quantity using SUM. Clearly, more complex query plans will be harder to decode.
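The operator-and-parameters structure described above can be made explicit with a few lines of code. The following Python sketch is only an illustration: it assumes that the plan text uses one tab per nesting level, and the names PlanNode and parse_plan are hypothetical helpers, not part of DAX Studio or of the Tabular engine.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One operator line of a query plan, with its nested parameters."""
    text: str
    children: list["PlanNode"] = field(default_factory=list)


def parse_plan(plan_text: str) -> list[PlanNode]:
    """Turn tab-indented plan text into a list of root operator nodes."""
    roots: list[PlanNode] = []
    stack: list[tuple[int, PlanNode]] = []  # (depth, node) of the open ancestors
    for line in plan_text.splitlines():
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip("\t"))  # assumption: one tab per level
        node = PlanNode(line.strip())
        # Close every operator that is not an ancestor of the current line.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        else:
            roots.append(node)
        stack.append((depth, node))
    return roots


# Simplified logical query plan of EVALUATE { SUM ( Sales[Quantity] ) }.
LOGICAL_PLAN = (
    "AddColumns:\n"
    "\tSum_Vertipaq:\n"
    "\t\tScan_Vertipaq:\n"
    "\t\t'Sales'[Quantity]:\n"
)

root = parse_plan(LOGICAL_PLAN)[0]
sum_op = root.children[0]
print(root.text, "->", sum_op.text, "->", [c.text for c in sum_op.children])
```

Reading the parsed tree from the root mirrors the plain-English description: AddColumns has Sum_Vertipaq as its parameter, which in turn has Scan_Vertipaq and the column reference as parameters.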
Introducing physical query plans The physical query plan has a similar format to the logical query plan. Each line is an operator and its parameters are in subsequent lines, indented with one tab. Apart from this aesthetic similarity, the two query plans use completely different operators. Figure 19-8 shows the physical query plan generated by the previous DAX query.
FIGURE 19-8 The physical query plan of a simple query.
Again, a simplified version of the query plan is possible by removing the parameters of each operator:
AddColumns:
    SingletonTable:
    SpoolLookup: LookupPhyOp
        ProjectionSpool: SpoolPhyOp
            Cache: IterPhyOp
The first operator, AddColumns, builds the result table. Its first parameter is a SingletonTable, which is an operator returning a single-row table generated by the table constructor. The second parameter, SpoolLookup, searches for a value in the datacache obtained by a query sent to the storage engine. This is the most intricate part of DAX query plans. The physical query plan shows that it uses some data that was previously spooled by other SE queries, but it does not show exactly from which one. In other words, the code of an SE query cannot be obtained by reading the DAX query plan. It is possible to retrieve the queries sent to the storage engine, but matching them with the exact point in the query plan is only possible in simple DAX queries. In more complex, yet realistic, DAX operations, this association might require a longer analysis. Before moving forward, it is important to highlight some important information included in the query plan:
ProjectionSpool: SpoolPhyOp #Records=1
    Cache: IterPhyOp #FieldCols=0 #ValueCols=1
Note In former versions of the Tabular engine that did not support composite models, the ProjectionSpool and Cache operators were called AggregationSpool and VertiPaqResult, respectively. Besides some differences in operator names, the structure of the physical query plan did not change much, and the same logic described in this chapter can be applied to older Tabular engines.
The ProjectionSpool operator represents a query sent to the storage engine; the next section will describe storage engine requests. The ProjectionSpool operator iterates the result of the query, showing the total number of rows iterated in the #Records=1 parameter. The number of records also represents the number of rows returned by the nested Cache operator.
The number of records is important for two reasons:
■ It provides the size (in rows) of the datacache created by VertiPaq or DirectQuery. A large datacache consumes more memory at query time and takes more time to scan.
■ The iteration performed by ProjectionSpool in the formula engine runs in a single thread. When a query is slow and this number is large, it could indicate a bottleneck in the query execution.
Because of the importance of the number of records, DAX Studio reports it in the Records column of the query plan. We sometimes refer to the number of records as the cardinality of the operator.
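When the physical query plan is available only as text, the cardinality of each spool operator can be read from the #Records parameter. The following Python sketch shows one possible way to do so; the function name spool_cardinalities is hypothetical, and the sample text is the simplified plan shown earlier, assumed to be tab-indented.

```python
import re

# Matches the #Records=<n> parameter reported by spool operators.
RECORDS = re.compile(r"#Records=(\d+)")


def spool_cardinalities(plan_text: str) -> list[tuple[str, int]]:
    """Return (operator name, records) for each plan line carrying #Records."""
    result = []
    for line in plan_text.splitlines():
        match = RECORDS.search(line)
        if match:
            operator = line.strip().split(":", 1)[0]
            result.append((operator, int(match.group(1))))
    return result


PHYSICAL_PLAN = (
    "AddColumns:\n"
    "\tSingletonTable:\n"
    "\tSpoolLookup: LookupPhyOp\n"
    "\t\tProjectionSpool: SpoolPhyOp #Records=1\n"
    "\t\t\tCache: IterPhyOp #FieldCols=0 #ValueCols=1\n"
)

cardinalities = spool_cardinalities(PHYSICAL_PLAN)
print(cardinalities)
```

Sorting the returned list by the number of records is a quick way to spot the operators that iterate the largest datacaches in a longer plan.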
Introducing storage engine queries The previous physical query plan includes a ProjectionSpool operator that represents an internal query sent to the storage engine (SE). Because the model is in Import mode, DAX uses the VertiPaq SE, which receives queries in xmSQL. The following is the xmSQL query generated during the execution of the DAX query analyzed in the previous sections: SET DC_KIND="AUTO"; SELECT SUM ( 'DaxBook Sales'[Quantity] ) FROM 'DaxBook Sales'; 'Estimated size ( volume, marshalling bytes ) : 1, 16'
The preceding code is a simplified version shown in DAX Studio, which removes a few internal details that are not relevant in performance analysis. The original xmSQL visible in SQL Server Profiler is the following: SET DC_KIND="AUTO"; SELECT SUM([DaxBook Sales (905)].[Quantity (923)]) AS [$Measure0] FROM [DaxBook Sales (905)]; [Estimated size (volume, marshalling bytes): 1, 16]
This query aggregates all the rows of the Sales table, returning a single column with the sum of Quantity. The SE executes the entire aggregation operation, returning a small datacache (one row, one column) regardless of the size of the Sales table. The materialization required for this datacache is minimal. Moreover, the only data structures read by this query are those storing the Quantity column in the Sales table. A Sales table with hundreds of other columns would not affect the performance of this xmSQL query. The VertiPaq SE only scans columns included in the xmSQL query. If the model had been using DirectQuery, the query generated would have been a SQL query like the following one: SELECT SUM ( [Quantity] ) FROM Sales
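To relate the two xmSQL forms shown earlier, the following Python sketch rewrites the raw text visible in SQL Server Profiler into something close to the simplified form. It is only an approximation written for this example: the function simplify_xmsql and its regular expressions are assumptions, they do not reproduce the actual logic used by DAX Studio, and they do not restore the spacing inside parentheses.

```python
import re


def simplify_xmsql(raw: str) -> str:
    """Approximate the simplified xmSQL form from the raw Profiler text."""
    text = re.sub(r" \(\d+\)\]", "]", raw)      # drop internal IDs such as " (905)"
    text = re.sub(r" AS \[\$\w+\]", "", text)    # drop aliases such as [$Measure0]
    # Rewrite [Table].[Column] as 'Table'[Column].
    text = re.sub(r"\[([^\]]+)\]\.\[([^\]]+)\]", r"'\1'[\2]", text)
    # Rewrite a table reference after FROM as 'Table'.
    text = re.sub(r"FROM \[([^\]]+)\]", r"FROM '\1'", text)
    return text


RAW = (
    "SELECT SUM([DaxBook Sales (905)].[Quantity (923)]) AS [$Measure0] "
    "FROM [DaxBook Sales (905)];"
)

simplified = simplify_xmsql(RAW)
print(simplified)
```

The numeric identifiers and the measure alias are the internal details that do not matter for performance analysis, which is why removing them makes the query easier to read.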
Note From here on out, we will not cover the details of query plans using DirectQuery. As discussed in Chapter 17, optimizing DirectQuery requires an optimization of the data source. However, changes to the DAX query can improve the SQL code sent to the DirectQuery data source, so the same techniques for analyzing a query plan described for VertiPaq can also be applied to DirectQuery, even though the assumptions on the speed of the storage engine are no longer valid for DirectQuery.
Later in the chapter we will explain why measuring the execution time of each SE query is an important part of the optimization process. Keep in mind that VertiPaq performance is related to the size of the columns involved in a query, and not only to the number of rows of the table. Different columns can have different compression rates and different sizes in memory, resulting in different scan times.
Capturing profiling information The previous section introduced the DAX query plans. This section describes the tools to capture these events and how to measure their duration, which are the first steps in DAX optimization. The DAX engine has grown as part of Microsoft SQL Server Analysis Services. Analysis Services provides trace events that can be captured with the SQL Server Profiler tool or by intercepting extended events (xEvents). Other products such as Power Pivot and Power BI use the same engine, although these products do not have the same tools available as for Analysis Services to capture trace or extended events. For example, Power Pivot for Excel and Power BI Desktop have diagnostic options that save trace events on a file, which can be opened later with the same SQL Server Profiler tool. However, the events generated by the engine require some massage to be useful for performance analysis; the SQL Server Profiler is a general-purpose tool that is not designed specifically for this task. On the other hand, DAX Studio reads and interprets Analysis Services events, summarizing relevant information in an easier way. This is why we strongly suggest using DAX Studio as a primary tool to edit, test, and optimize DAX queries and expressions. A later section includes a description of SQL Server Profiler, providing more details to the readers interested in understanding the internal details. DAX Studio collects the same events as SQL Server Profiler, processing them and displaying summarized information in a very efficient way.
Using DAX Studio As explained at the beginning of this chapter, DAX Studio can also capture DAX queries sent to the Tabular engine. Indeed, DAX Studio can execute any valid DAX query, including those captured by DAX Studio itself. The DAX query syntax is explained in Chapter 13, “Authoring queries.” DAX Studio collects trace events generated by one or more queries executed from within DAX Studio and displays the relevant information about the query plans and storage engine. DAX Studio can connect to Power BI, Analysis Services, and Power Pivot for Excel. Before analyzing a query in DAX Studio, we must enable the Query Plan and Server Timings options in the Traces tab of the Home ribbon, as shown in Figure 19-9.
FIGURE 19-9 The Query Plan and Server Timings options enable the tracing features in DAX Studio.
When the user enables these options, DAX Studio shows the Query Plan and Server Timings panes next to the Output and Results pane, which is visible by default. DAX Studio connects to the DAX engine as if it were a profiler, and it captures the trace events described in the next section. It automatically keeps only the events related to the executed query, so we do not have to worry if there are other concurrent users active on the same server.
The Query Plan pane displays the two query plans generated by the query, as shown in Figure 19-10. The physical query plan is in the upper half of the pane, and the logical query plan is in the lower half. The physical query plan is usually the most important to analyze when looking for a performance bottleneck in the formula engine. For this reason, this list also provides a column containing the number of records iterated by a spool operation (which is an iteration performed by the formula engine, usually over a datacache). This way, we can easily recognize which operations iterate over a large number of records in a complex query plan. We will describe how to use this information later in Chapter 20, “Optimizing DAX.”
FIGURE 19-10 The Query Plan pane displays the Physical Query Plan and the Logical Query Plan.
The Server Timings pane in Figure 19-11 shows information related to SE queries and how the execution time splits between FE and SE.
FIGURE 19-11 The Server Timings pane displays a summary of timings information and the details of the storage engine queries.
Note The SE query displayed in Figure 19-11 is applied to a model with 4 billion rows to show high CPU consumption. The model used for this example is not included in the companion files for the book.
The following metrics are found on the left side of the Server Timings pane:
■ Total: Elapsed time for the complete DAX query. It corresponds to the Duration of the Query End event.
■ SE CPU: Sum of the CPU Time value for all the VertiPaq scan events. It also reports the degree of parallelism of VertiPaq operations (number of cores used in parallel).
■ FE: Time elapsed in the formula engine, in milliseconds and as a percentage of the Total time.
■ SE: Time elapsed in the storage engine, in milliseconds and as a percentage of the Total time.
■ SE Queries: Number of queries sent to the storage engine.
■ SE Cache: Number of storage engine queries resolved by the storage engine cache, displayed as an absolute number and as a percentage of the SE Queries value.
The list in the center shows the SE queries executed, and the panel on the right side displays the complete code of the SE query selected in the center list. By default, the list includes only one row for each query, hiding the VertiPaq Scan Internal and other cache events that are always visible in SQL Server Profiler. We can show or hide these more detailed events by toggling the Cache, Internal, and Batch buttons of the Server Timings group on the Home tab from Figure 19-9. However, these events are usually not necessary in the performance analysis and are thus hidden by default.
A DAX performance analysis usually starts from the results displayed in the Server Timings pane. If the query spent more than 50% of the execution time in FE, then we might analyze the query plans first, looking for the most expensive operations in the FE. Otherwise, when most of the execution time is spent in SE, then we will look for the most expensive SE queries in the center list of the Server Timings pane.
Information provided in the Duration and CPU columns is helpful to identify performance bottlenecks in a query. Both values are in milliseconds. The Duration is the time elapsed between the start and the end of the request made to the SE. The CPU column shows the total amount of CPU time consumed, summed across all the cores used. If the CPU number is larger than Duration, it means that more cores have been used in parallel to complete the operation. The parallelism of an operation is obtained by dividing CPU by Duration. When this number is close to the total number of cores in the server, we cannot improve performance by increasing the parallelism. In this example, we used a system with eight cores. Thus, with a parallelism of 7.5, the query has reached the limits of the hardware. A concurrent user would not be able to get optimal performance executing a long-running query and would also slow down other users. In this condition, more cores would improve the speed of the query. In case the parallelism of a query is much smaller than the number of cores available, there would not be any benefit from providing more cores to the Tabular engine.
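The two rules of thumb described above (the 50% threshold for choosing where to start, and parallelism computed as CPU divided by Duration) can be sketched in Python as follows. The function names are hypothetical, and the timings are invented for illustration; they are not the values displayed in Figure 19-11.

```python
def parallelism(cpu_ms: float, duration_ms: float) -> float:
    """Degree of parallelism of an SE query: CPU time divided by Duration."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return cpu_ms / duration_ms


def where_to_start(fe_ms: float, total_ms: float) -> str:
    """Apply the 50% rule: FE-bound queries start from the query plans."""
    return "query plans" if fe_ms / total_ms > 0.5 else "SE queries"


# Hypothetical timings in milliseconds, chosen only for illustration.
CORES = 8
cpu_ms, duration_ms = 7500, 1000

degree = parallelism(cpu_ms, duration_ms)
print(f"parallelism {degree:.1f} on {CORES} cores")
print("start from:", where_to_start(fe_ms=50, total_ms=1050))
```

With these numbers the degree of parallelism is 7.5 on eight cores, which is the situation described in the text where the query has reached the limits of the hardware.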
The parallelism is computed only for SE operations because the FE runs in a single thread. Formula engine operations cannot benefit from parallel execution. The Rows and KB columns show the estimated number of rows and size of the result (datacache) provided by each SE query. Because every datacache must be consumed by the FE in a single thread, a datacache with a large cardinality might be responsible for a slow FE operation. Moreover, the size of a datacache represents the memory cost required by the materialization of a set of data in an uncompressed format; indeed, the FE only consumes uncompressed data. The SE cost to create a large datacache is usually caused by the need to allocate and write uncompressed data in memory. Therefore, reducing the need for the materialization of a datacache is important to lower the volume of data exchanged between SE and FE, reducing memory pressure and improving both query performance and scalability.
Note The Rows and KB columns show an estimated value that can sometimes be wrong. The exact number of rows returned by an SE query is available in the physical query plan. It is reported in the Records column of the ProjectionSpool event consuming a Cache element. The exact size of a datacache is not available, but it can be approximated proportionally to the ratio between Records in the query plan and the estimated Rows of the SE query.
DAX Studio allows sorting of the queries by any column, making it easy to find the most expensive queries when they are sorted by CPU, Duration, Rows, or KB, depending on the ongoing investigation. DAX Studio makes finding the bottlenecks in a DAX query more productive. It does not optimize DAX by itself, but it simplifies the optimization task. In the remaining part of the book we will use DAX Studio as a reference. However, the same information could also be obtained by using SQL Server Profiler, which would require more manual work.
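The proportional approximation of the datacache size mentioned in the note can be written as a one-line formula: the estimated KB multiplied by the ratio between the exact Records and the estimated Rows. The following Python sketch applies it to invented numbers; the function name and the sample values are assumptions made for this example.

```python
def approximate_datacache_kb(
    estimated_kb: float, estimated_rows: int, records: int
) -> float:
    """Scale the estimated KB by the ratio between actual Records and estimated Rows."""
    if estimated_rows <= 0:
        raise ValueError("estimated_rows must be positive")
    return estimated_kb * records / estimated_rows


# Hypothetical values: the SE query estimated 1,000 rows and 16 KB,
# whereas the physical query plan reports 2,500 records.
size_kb = approximate_datacache_kb(estimated_kb=16, estimated_rows=1000, records=2500)
print(size_kb)
```

The result is only as good as the assumption that the size grows linearly with the number of rows, which is the assumption implied by the proportional approximation.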
Using the SQL Server Profiler The SQL Server Profiler tool is installed as part of the SQL Server Management environment, which can be freely downloaded from https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms. SQL Server Profiler can be connected to an Analysis Services instance and collects all the events related to a DAX query execution. SQL Server Profiler can also load a file containing a trace session produced by the same SQL Server Profiler, or by other services such as Power Pivot for Excel and Power BI Desktop. This section explains how to use SQL Server Profiler in case DAX Studio cannot be used for any reason. However, you can skip this section if DAX Studio is available. We provide it as a reference because it can be interesting to understand the underlying behavior of the events involved in performance analysis. In order to catch DAX query plans and storage engine queries, it is necessary to configure a new trace session selecting the interesting events for a DAX query. This is shown in Figure 19-12.
FIGURE 19-12 SQL Server Profiler settings to capture DAX query plans and SE queries.
There are five classes of events required to collect the same information used by DAX Studio:
■ Query End: Event fired at the end of a query. One might include the Query Begin event too, but we suggest only catching Query End because it contains the execution time.
■ DAX Query Plan: Event fired after the query engine has computed the query plan. It contains a textual representation of the query plan. This event class includes two different subclasses, Logical Plan and Physical Plan. For each query, the engine generates both classes: one logical query plan and one physical query plan.
■ DirectQuery End: Event fired when the DirectQuery engine answers a query. As with the Query End event, to gather timing information we suggest including the end event of the queries executed by the DirectQuery engine.
■ VertiPaq SE Query Cache Match: Event fired when a VertiPaq query is resolved by looking at the cache data. It is useful in order to see how much of your query performs real computations and how much of it just does cache lookups.
■ VertiPaq SE Query End: Event fired when the VertiPaq engine answers a query. As with the Query End event, to gather timing information, we suggest including the end event of the queries executed by the VertiPaq storage engine.
Tip Once you select the events needed, it is a good idea to organize columns (clicking the Organize Columns button you see in Figure 19-12), and to save a template of the selections made, so you do not have to repeat the same selection every time you start a new session. You can save a trace template by using the File / Templates / New Template menu in SQL Server Profiler.
Note In a production environment, one should filter the events of a single user session. Otherwise, all the events of different queries executed at the same time would be visible, which makes it harder to analyze events related to a single query. By running the Profiler in a development or test environment where there are no other active users, only the events related to the query executed for the performance tests would be visible without any background noise. DAX Studio automatically filters the events related to a single query analyzed, removing any background noise without requiring any further actions.
In order to see the sequence of events fired, we analyze what has happened by running the query used to generate the SE query displayed in Figure 19-11 using DAX Studio over a large table (over 4 billion rows): EVALUATE ROW ( "Result", SUM ( Audience[Weight] ) )
The log window of the SQL Server Profiler shows the result, visible in Figure 19-13.
FIGURE 19-13 Trace events captured in a SQL Server Profiler session for a simple DAX query.
Even for such a simple query, the DAX engine fires five different events:
1. A DAX VertiPaq Logical Plan event, which is the logical query plan.
2. An Internal VertiPaq Scan event, which corresponds to an SE query. There could be more than one internal event (subclass 10) for each VertiPaq Scan event (subclass 0).
3. A VertiPaq Scan event, which describes a single SE query received by the FE.
4. A DAX VertiPaq Physical Plan event, which is the physical query plan.
5. A final Query End event, which returns the query duration of the complete DAX query. The CPU time reported by this event should be ignored. It should be close to the time spent in the FE but is not as accurate as the calculation explained later.
All the events show both CPU time and duration, expressed in milliseconds. CPU Time is the amount of CPU time consumed to answer the query, whereas Duration is the time the user has had to wait for their result. When Duration is lower than CPU Time, the operation has been executed in parallel on many cores. When Duration is greater than CPU Time, the operation had to wait for other operations (usually logged in different events) to be completed.
Note The accuracy of the CPU Time and Duration columns is not very reliable for values lower than 16 milliseconds, and CPU Time can be less accurate than that in conditions of high parallelism. Moreover, these timings might depend on other operations in progress on the same server. It is a common practice to run the same test multiple times in order to create an average of the execution time of single operations, especially when one needs accurate numbers. However, if only looking for an order of magnitude, one might just ignore differences under 100 milliseconds.
Considering the sequence of events, the logical query plan precedes all the SE queries (VertiPaq scans), and only after their execution is the physical query plan raised. In other words, the physical query plan is an actual query plan and not an estimated one. Indeed, it contains the number of rows processed by any iteration in the FE, though it does not provide information about the CPU time and duration of each step in the query plan. Logical and physical query plans do not provide any timing information, which is only available in the other events gathered by the Profiler.
Information provided in the CPU Time and Duration columns is the same shown in CPU and Duration by DAX Studio for SE queries. However, the calculation of the time spent in the FE displayed in DAX Studio requires some more work using SQL Server Profiler. The Query End event only provides the total elapsed time for a DAX query in the Duration column, summing both the FE and SE durations. The VertiPaq scan events provide the time spent in the SE. The elapsed time in FE is obtained by subtracting the duration of all the SE queries from the duration of the entire DAX query provided in the Query End event. As shown in Figure 19-13, the Query End event had a Duration of 844 milliseconds. The time spent in the SE was 838 milliseconds. There was only one SE query, which lasted 838 milliseconds; only consider the VertiPaq Scan event, ignoring internal ones. The difference is 6 milliseconds, which is the amount of time spent in the FE. In case of multiple SE queries, their execution time must be aggregated to calculate the total amount of time spent in the SE, which must be subtracted from the total duration to get the amount of time spent in the FE.
Finally, the SQL Server Profiler can save and load a trace session. SQL Server Profiler cannot connect to Power Pivot for Excel, but it can open a trace file saved by Power Pivot for Excel or Power BI Desktop. Power Pivot for Excel has an Enable Power Pivot Tracing check box in the Settings dialog box that generates a TRC file (TRC is the extension for trace files). The events captured in the profiler session saved this way cannot be customized; they also usually include more event types than those required to analyze DAX query plans. DAX Studio cannot load a trace session but can connect directly to all the tools including Power Pivot for Excel without any limitation.
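The calculation of the FE time described for Figure 19-13 can be sketched in Python as follows. The TraceEvent structure is a hypothetical representation of the rows captured by SQL Server Profiler. The durations of the Query End and VertiPaq Scan events come from the text, whereas the duration of the internal scan event is not stated there and is an assumed value included only to show that internal events are ignored.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    event_class: str   # e.g. "Query End" or "VertiPaq SE Query End"
    subclass: int      # 0 = VertiPaq Scan, 10 = internal VertiPaq scan
    duration_ms: int


def formula_engine_ms(events: list[TraceEvent]) -> int:
    """FE time = Query End duration minus the sum of the SE query durations."""
    total = sum(e.duration_ms for e in events if e.event_class == "Query End")
    storage = sum(
        e.duration_ms
        for e in events
        if e.event_class == "VertiPaq SE Query End" and e.subclass == 0
    )
    return total - storage


events = [
    TraceEvent("VertiPaq SE Query End", 10, 838),  # internal event, assumed duration
    TraceEvent("VertiPaq SE Query End", 0, 838),   # the single SE query
    TraceEvent("Query End", 0, 844),               # complete DAX query
]

fe_ms = formula_engine_ms(events)
print(fe_ms)
```

Filtering on subclass 0 is what prevents the internal events from being counted twice; with several SE queries, their durations are summed before the subtraction, as described in the text. The sketch assumes that the list contains the events of a single DAX query.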
Analyzing DAX query plans
623
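The subtraction used to obtain the FE time can be written down as a small DAX query. This is only a sketch of the arithmetic, not something the Profiler or DAX Studio runs: the variable names are illustrative, and the SeQueries table is assumed to hold one row per VertiPaq Scan event (here the single 838-millisecond scan of Figure 19-13). With several SE queries, one row per scan would be added.

```dax
DEFINE
    -- Duration of the Query End event, in milliseconds (Figure 19-13)
    VAR TotalDuration = 844
    -- One row per VertiPaq Scan event; internal events are ignored
    VAR SeQueries =
        DATATABLE ( "Duration", INTEGER, { { 838 } } )
    -- Total time spent in the SE is the sum of all scan durations
    VAR SeDuration = SUMX ( SeQueries, [Duration] )
EVALUATE
ROW (
    "Total ms", TotalDuration,
    "SE ms", SeDuration,
    -- Time spent in the FE: total duration minus SE duration
    "FE ms", TotalDuration - SeDuration
)
```

With the numbers above, the FE ms column returns 6, which matches the value derived in the text.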
Reading VertiPaq storage engine queries In the previous sections, we described some details of the physical and logical query plans. Although these plans are useful in some scenarios, the most interesting part of a query plan is the set of VertiPaq SE queries. In this section we describe how to read the VertiPaq SE queries and understand what happens in VertiPaq to execute an xmSQL query. This information is useful to solve a bottleneck in the VertiPaq storage engine. However, reading these queries is useful to also understand what happens in the FE: If a calculation is not performed by the SE, it must be computed in the FE. Because the number of SE queries is usually smaller than the rows in the query plan, it is more productive to always start analyzing the SE queries regardless of the detected bottleneck type.
Introducing xmSQL syntax In the previous section, we introduced a simple SE query described in a simplified xmSQL syntax, which is the same as displayed by DAX Studio: SELECT SUM ( Sales[Quantity] ) FROM Sales;
This syntax would be quite similar in standard ANSI SQL: SELECT SUM ( Quantity ) FROM Sales;
Every xmSQL query involves a GROUP BY condition, even if this is not explicitly stated as part of its syntax. For example, the following DAX query returns the list of unique values of the Color column in the Product table: EVALUATE VALUES ( 'Product'[Color] )
It results in this xmSQL query; note that no GROUP BY appears in the query: SELECT Product[Color] FROM Product;
The corresponding query in ANSI SQL would have a GROUP BY condition: SELECT Color FROM Product GROUP BY Color
The reason we compare the xmSQL to an ANSI SQL query with GROUP BY instead of DISTINCT— which would be possible for the previous example—is that most of the time xmSQL queries also include aggregated calculations. For example, consider the following DAX query: EVALUATE SUMMARIZECOLUMNS ( Sales[Order Date], "Revenues", CALCULATE ( SUM ( Sales[Quantity] ) ) )
This is the corresponding xmSQL query sent to the SE: SELECT Sales[Order Date], SUM ( Sales[Quantity] ) FROM Sales;
In ANSI SQL there would be a GROUP BY condition for the Order Date column: SELECT [Order Date], SUM ( Quantity ) FROM Sales GROUP BY [Order Date]
An xmSQL query never returns duplicated rows. When a DAX query runs over a table that does not have a unique key, the corresponding xmSQL query includes a special RowNumber column that keeps the rows unique. However, the RowNumber column is not accessible in DAX. For example, consider this DAX query: EVALUATE Sales
It generates the following xmSQL code: SELECT Sales[RowNumber], Sales[column1], Sales[column2], ... ,Sales[columnN] FROM Sales
Aggregation functions
xmSQL includes the following aggregation operations:
■ SUM sums the values of a column.
■ MIN returns the minimum value of a column.
■ MAX returns the maximum value of a column.
■ COUNT counts the number of rows in the current GROUP BY.
■ DCOUNT counts the number of distinct values of a column.
The behavior of SUM, MIN, MAX, and DCOUNT is similar. For example, the following DAX query returns the number of unique customers for each order date: EVALUATE SUMMARIZECOLUMNS ( Sales[Order Date], "Customers", DISTINCTCOUNT ( Sales[CustomerKey] ) )
It generates the following xmSQL code: SELECT Sales[Order Date], DCOUNT ( Sales[CustomerKey] ) FROM Sales;
Which corresponds to this ANSI SQL query: SELECT [Order Date], COUNT ( DISTINCT CustomerKey ) FROM Sales GROUP BY [Order Date]
The COUNT function does not have an argument. Indeed, it computes the number of rows for the current group. For example, consider the following DAX query that counts the number of products for each color: EVALUATE SUMMARIZECOLUMNS ( 'Product'[Color], "Products", COUNTROWS ( 'Product' ) )
This is the xmSQL code sent to the SE: SELECT Product[Color], COUNT ( ) FROM Product;
A corresponding ANSI SQL query could be the following: SELECT Color, COUNT ( * ) FROM Product GROUP BY Color
Other aggregation functions in DAX do not have a corresponding xmSQL aggregation function. For example, consider the following DAX query using AVERAGE: EVALUATE SUMMARIZECOLUMNS ( 'Product'[Color], "Average Unit Price", AVERAGE ( 'Product'[Unit Price] ) )
The corresponding xmSQL code includes two aggregations: one for the numerator and one for the denominator of the division that will compute a simple average in the FE: SELECT Product[Color], SUM ( Product[Unit Price] ), COUNT ( ) FROM Product WHERE Product[Unit Price] IS NOT NULL;
Converting the xmSQL query to ANSI SQL, we would write: SELECT Color, SUM ( [Unit Price] ), COUNT ( * ) FROM Product WHERE [Unit Price] IS NOT NULL GROUP BY Color
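The split of AVERAGE into a sum and a count can also be spelled out in DAX. The following is a sketch that is expected to return the same values as the AVERAGE query, written with the numerator and the denominator as separate aggregations; it illustrates the decomposition and makes no claim that the engine generates an identical query plan for both forms.

```dax
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Color],
    "Average Unit Price",
        DIVIDE (
            SUM ( 'Product'[Unit Price] ),      -- numerator: SUM in xmSQL
            COUNT ( 'Product'[Unit Price] )     -- denominator: counts non-blank values only
        )
)
```

COUNT ignores blank values, which plays the same role as the IS NOT NULL condition in the xmSQL code, and DIVIDE returns a blank when there is nothing to count.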
Arithmetical operations xmSQL includes simple arithmetical operations: +, −, *, / (sum, subtraction, multiplication, division). These operations work on single rows, whereas the FE usually performs arithmetical operations between the results of aggregations. It is common to see arithmetical operations in the expression used by an aggregation function. For example, the following DAX query returns the sum of the product of Quantity by Unit Price calculated row-by-row for the Sales table: EVALUATE { SUMX ( Sales, Sales[Quantity] * Sales[Unit Price] ) }
It generates the following xmSQL code: WITH $Expr0 := ( Sales[Quantity] * Sales[Unit Price] ) SELECT SUM ( @$Expr0 ) FROM Sales;
The WITH statement introduces expressions associated with symbolic names (starting with the $Expr prefix) that are referenced later in the remaining part of the query. For example, in the previous code the $Expr0 expression corresponds to the multiplication between Quantity and Unit Price that is later evaluated for each row of the Sales table, summing the result in the aggregated value. The previous xmSQL code corresponds to this ANSI SQL query: SELECT SUM ( [Quantity] * [Unit Price] ) FROM Sales
xmSQL can also execute casts between data types to perform arithmetical operations. It is important to remember that these operations only happen within a row context, from the point of view of a DAX expression.
Filter operations An xmSQL query can include filters in a WHERE condition. The performance of a filter depends on the cardinality of the conditions applied (this will be discussed in more detail later in the section “Understanding scan time”). For example, consider the following query that returns the sum of the Quantity column for all sales with a unit price equal to 42: EVALUATE CALCULATETABLE ( ROW ( "Result", SUM ( Sales[Quantity] ) ), Sales[Unit Price] = 42 )
The resulting xmSQL query is the following: SELECT SUM ( Sales[Quantity] ) FROM Sales WHERE Sales[Unit Price] = 420000;
Note The reason why the value in the WHERE condition is multiplied by 10,000 is that the Unit Price column is stored as a Currency data type (also known as Fixed Decimal Number in Power BI). That number is stored as an Integer in VertiPaq, so the FE performs the conversion to a decimal number by dividing the result by 10,000. Such division is visible neither in the query plan nor in the xmSQL code.
The WHERE condition might include a test with more than one value. For example, consider a small variation of the previous query that sums the quantity of the sales with a unit price equal to 16 or 42. You see this in the following DAX query: EVALUATE CALCULATETABLE ( ROW ( "Result", SUM ( Sales[Quantity] ) ), OR ( Sales[Unit Price] = 16, Sales[Unit Price] = 42 ) )
The xmSQL uses the IN operator to include a list of values: SELECT SUM ( Sales[Quantity] ) FROM Sales WHERE Sales[Unit Price] IN ( 160000, 420000 );
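The scaling applied to Currency values can be checked with a small worked example. The query below is only an illustration of the arithmetic described in the note about the Currency data type: it takes the two prices used in the DAX filter and multiplies each by 10,000. The column name Stored Integer is made up for this sketch.

```dax
EVALUATE
ADDCOLUMNS (
    { 16, 42 },                          -- prices as written in the DAX filter
    "Stored Integer", [Value] * 10000    -- integer form expected in the xmSQL filter
)
```

The result pairs 16 with 160000 and 42 with 420000, which are the values one should expect to read in the WHERE condition of the xmSQL query.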
Any filter condition in xmSQL only includes existing values of the column. If a DAX condition references a value that does not exist in the column, the resulting xmSQL code will include a condition that will filter out all the rows. For example, if neither 16 nor 42 existed in the Sales table, the previous xmSQL query might not be invoked at all by the FE, or it would become something like: SELECT SUM ( Sales[Quantity] ) FROM Sales WHERE Sales[Unit Price] IN ( );
The result of such an xmSQL query will always be empty. It is important to remember that xmSQL is a textual representation of an SE query. The actual structure is more optimized. For example, when the list of values allowed for a column is very long, the xmSQL reports a few values, highlighting the total number of values passed internally to the query. This happens quite often for time intelligence functions. For example, consider the following DAX query that returns the sum of the quantity for one year of sales: EVALUATE CALCULATETABLE ( ROW ( "Result", SUM ( Sales[Quantity] ) ), Sales[Order Date] >= DATE ( 2006, 1, 1 ) && Sales[Order Date] […]
[…] = 38718.000000 VAND Sales[Order Date] […] = 38718.000000 VAND Sales[Order Date] […] 1000 ) ORDER BY Product[Color]
The result visible in Figure 19-22 includes all the unique values of Color, including those without any unit sold. In order to do that, the approach of the DAX engine is different from the one we would expect in plain SQL language; this is because of the different technique used to join tables in the SE. We will highlight this difference later; pay attention to the process for now.
FIGURE 19-22 The result of ADDCOLUMNS includes rows with a blank value in the Units column.
The logical query plan shown in Figure 19-23 includes three Scan_Vertipaq operations, two of which correspond to two datacaches provided by SE queries.
FIGURE 19-23 Logical query plan of a simple DAX query.
The two Scan_Vertipaq operations at lines 4 and 6 require different sets of columns. The third Scan_Vertipaq operation at line 9 is used for a filter, and it does not generate a separate datacache. Its logic is included in one of the other two SE queries generated. The Scan_Vertipaq at line 4 only uses the product color, whereas the Scan_Vertipaq at line 6 includes product color and sales quantity, which are two columns in two different tables. When this happens, a join between two or more tables is required. After the logical query plan, the profiler receives the events from the SE. The corresponding xmSQL queries are the following: SELECT Product[Color], SUM ( Sales[Quantity] ) FROM Sales LEFT OUTER JOIN Product ON Sales[ProductKey] = Product[ProductKey] WHERE Sales[Net Price] > 1000; SELECT Product[Color] FROM Product;
The first SE query retrieves a table containing one row for each color that has at least one unit sold at a price greater than 1,000 in the Sales table. In order to do that, the query joins Sales and Product using the ProductKey column. The second xmSQL statement returns the list of all the product colors, independent of the Sales table. These two queries generate two different datacaches, one with two columns (product color and sum of quantity) and another with only one column (the product color). At this point, we might wonder why a second query is required. Why is the first xmSQL not enough? The reason is that the LEFT JOIN in xmSQL has Sales on the left side and Product on the right side. In plain SQL code, we would have written another query:
SELECT Product.Color, SUM ( Sales.Quantity ) FROM Product LEFT OUTER JOIN Sales ON Sales.ProductKey = Product.ProductKey AND Sales.NetPrice > 1000 GROUP BY Product.Color ORDER BY Product.Color;
Having the Product table on the left side of a LEFT JOIN would produce a result that includes all the product colors. However, the SE can only generate queries between tables with a relationship in the data model, and the resulting join in xmSQL always puts the table that is on the many-side of the relationship on the left side of the join condition. This guarantees that even though there are missing product keys in the Product table, the result will also include sales for those missing products; these sales will be included in a row with a blank value for all the product attributes, in this case the product color. Now that we have seen why the DAX engine produces two SE queries for the initial DAX query, we can analyze the physical query plan shown in Figure 19-24, where we can find more information about the query execution.
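To check whether a model actually contains sales for product keys that are missing from the Product table, one can query the blank row directly. The following is a minimal sketch, assuming the regular one-to-many relationship between 'Product'[ProductKey] and Sales[ProductKey] used in this chapter and assuming that no product has a genuinely blank key; the name of the result column is only illustrative.

```dax
EVALUATE
CALCULATETABLE (
    ROW ( "Units without a matching product", SUM ( Sales[Quantity] ) ),
    -- Keeps only the blank row added to Product for unmatched Sales rows
    ISBLANK ( 'Product'[ProductKey] )
)
```

If every sale has a matching product, the query returns a single row with a blank value; otherwise it returns the quantity that ends up under the blank product attributes.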
FIGURE 19-24 Physical query plan of a simple DAX query.
The physical query plan uses the Cache operator (lines 6 and 9) to indicate where it is consuming a datacache provided by the SE. Unfortunately, it is not possible to see the corresponding SE query for each operation. Nevertheless, at least in simple cases like the one considered, we can figure out this association by looking at other pieces of information. For example, one Cache only has one column obtained with a group operation, whereas the other Cache has two columns: one that is the result of a group operation and the other that is the result of an aggregation (the sum of the quantity). In the physical query plan, #ValueCols reports the number of columns that are the result of an aggregation, whereas #FieldCols reports the number of other columns used to group the result. By looking at the columns consumed by each Cache node, it is often possible to identify the corresponding xmSQL query even though it is a time-consuming process in complex query plans. In this example, the Cache node at line 6 returns a column with 16 product color names; on the other hand, the Cache node at line 9 only returns 10 rows and two columns, only with the product color names that have at least one transaction in Sales within the condition specified for Net Price (which must be greater than 1,000).
The ProjectionSpool operation consumes the datacaches corresponding to Cache nodes in the physical query plan. Here we can find an important piece of information: the number of records iterated, which corresponds to the number of rows in the datacache used. This number follows the #Records attribute, which is also reported in the Records column in DAX Studio. We can find the same #Records attribute in parent nodes of the query plan, a place where the type of aggregation performed by the engine is also available if there is one. In this example, the Cache at line 9 has two columns: one is Product[Color] and the other is the result of a sum aggregation. This information is available in the LogOp argument of the Spool_Iterator and SpoolLookup nodes at lines 4 and 7, respectively. At this point, we can recap what we are reading in the query plans and the SE queries:
1. The FE consumes two datacaches, corresponding to Cache nodes in the physical query plan.
2. The FE iterates over the list of product colors, which is a table containing 16 rows and one column. This is the datacache obtained by the second SE query. Do not make assumptions about the order of the SE queries in the profiler.
3. For each row of this datacache (a product color), the FE executes a lookup in the other datacache containing the product colors and the quantity sold for each color; this is a table with two columns and 10 rows.
The entire process executed by the FE is sequential and single-threaded. The FE sends one request at a time to the SE. The SE might parallelize the query, but the FE does not send multiple requests in parallel to the SE.
Note The FE and the SE are subject to optimizations and improvements made in new releases. The behavior described might be different in newer versions of the DAX engine.
The FE can combine different results by using the lookup operation described in the previous query plan or other set operators. In any case, the FE executes this operation sequentially. For this reason, we might expect longer execution times by combining large datacaches or by performing a lookup for millions of rows in a large lookup datacache. A simple and effective way to identify these potential bottlenecks is to look for the highest number of records in the operators of the physical query plan. For this reason, DAX Studio extracts that number from the query plan, making it easier to sort query plan operators by using the number of records iterated. It is possible to sort the rows by this number by clicking the Records column shown in Figure 19-24. We will show a more detailed example of this approach in Chapter 20.
The presence of relationships in the data model is important in order to obtain better performance. We can examine the behavior of a join between two tables when a relationship is not available. For example, consider a query returning the same result as the previous example, but operating in a data model that does not have a relationship between the Product and Sales tables. We need a DAX query such as the following; it uses the virtual relationship pattern shown in Chapter 15, “Advanced relationships,” in the section “Transferring a filter using INTERSECT”:
DEFINE MEASURE Sales[Units] = CALCULATE ( SUM ( Sales[Quantity] ), INTERSECT ( ALL ( Sales[ProductKey] ), VALUES ( 'Product'[ProductKey] ) ), -- Disable the existing relationship between Sales and Product CROSSFILTER ( Sales[ProductKey], 'Product'[ProductKey], NONE ) ) EVALUATE ADDCOLUMNS ( ALL ( 'Product'[Color] ), "Units", [Units] ) ORDER BY 'Product'[Color]
The INTERSECT function in the Units measure definition is equivalent to a relationship between Sales and Product. The resulting query plan is more complex than the previous one because there are many more operations in both the logical and the physical query plans. Without doing a dump of the complete query plan, which would be too long for a book, we can summarize the behavior of the query plan in these logical steps:
1. Retrieves the list of ProductKey values for each product color.
2. Sums the Quantity value for each ProductKey.
3. For each color, aggregates the Quantity of the related ProductKey values.
The FE executes four SE queries, as shown in Figure 19-25.
FIGURE 19-25 SE queries executed for a DAX calculation using a virtual relationship with INTERSECT.
The following are the complete xmSQL statements of the four SE queries: SELECT Sales[ProductKey] FROM Sales; SELECT Product[Color] FROM Product;
SELECT Product[ProductKey], Product[Color] FROM Product; SELECT Sales[ProductKey], SUM ( Sales[Quantity] ) FROM Sales WHERE Sales[ProductKey] IN ( 490, 479, 528, 379, 359, 332, 374, 597, 387, 484..[158 total values, not all displayed] );
The WHERE condition highlighted in the last SE query might seem useless because the DAX query does not apply a filter over products. However, usually in the real world there are other filters active on products or other tables. The query plan tries to only extract the quantities sold of products that are relevant to the query, lowering the size of the datacache returned to the FE. When there are similar WHERE conditions in the SE, the only concern is the size of the corresponding bitmap index moved back and forth between the FE and the SE. The FE has to group all the products belonging to each color. The performance of this join performed at the FE level mainly depends on the number of products and secondarily on the number of colors. Once again, the size of a datacache is the first and most important element to consider when we look for a performance bottleneck in the FE. We considered the virtual relationship using INTERSECT for educational purposes. We wanted to display the SE queries required for a join condition resolved mainly by the FE. However, whenever possible, if a physical relationship is not available, TREATAS should be considered as a more optimized alternative. Consider this alternative implementation of the previous DAX query: DEFINE MEASURE Sales[Units] = CALCULATE ( SUM ( Sales[Quantity] ), TREATAS ( VALUES ( 'Product'[ProductKey] ), Sales[ProductKey] ), -- Disable the existing relationship between Sales and Product CROSSFILTER ( Sales[ProductKey], 'Product'[ProductKey], NONE ) ) EVALUATE ADDCOLUMNS ( ALL ( 'Product'[Color] ), "Units", [Units] ) ORDER BY 'Product'[Color]
As shown in Figure 19-26, there are only three SE queries generated instead of four. Remember that Batch is just a recap of the previous Scan events. Moreover, the size of the datacaches is smaller because one result alone has 2,517 rows corresponding to the number of products in the Product table. In the previous implementation using INTERSECT, there was a larger number of queries returning thousands of rows. All of these datacaches must be consumed by the FE.
FIGURE 19-26 SE queries executed for a DAX calculation using a virtual relationship with TREATAS.
The following is the content of the Batch event at line 5, which includes the first two Scan events (lines 2 and 4): DEFINE TABLE '$TTable3' := SELECT 'Product'[ProductKey], 'Product'[Color] FROM 'Product', CREATE SHALLOW RELATION '$TRelation1' MANYTOMANY FROM 'Sales'[ProductKey] TO '$TTable3'[Product$ProductKey], DEFINE TABLE '$TTable1' := SELECT '$TTable3'[Product$Color], SUM ( '$TTable2'[$Measure0] ) FROM '$TTable2' INNER JOIN '$TTable3' ON '$TTable2'[Sales$ProductKey]='$TTable3'[Product$ProductKey] REDUCED BY '$TTable2' := SELECT 'Sales'[ProductKey], SUM ( 'Sales'[Quantity] ) AS [$Measure0] FROM 'Sales';
The performance advantage of TREATAS is that it moves the execution of the operation to the SE, thanks to the CREATE SHALLOW RELATION statement highlighted in the previous code. This way, there is no need to materialize more data for the FE. Indeed, the join is executed within the SE, which reduces the number of lines of the physical query plan from the 37 required by INTERSECT (not displayed in the book for brevity) to the 10 required by TREATAS. This results in a query plan very similar to the one shown in Figure 19-24. Analyzing complex and longer query plans would require another book, considering the length of the query plans involved. More details about the internals of the query plans are available in the white papers “Understanding DAX Query Plans” (http://www.sqlbi.com/articles/understanding-dax-query-plans/) and “Understanding Distinct Count in DAX Query Plans” (http://www.sqlbi.com/articles/understanding-distinct-count-in-dax-query-plans/).
Conclusions
As you have seen, diving into the complexity of query plans opens up a whole new world. In this chapter we barely scratched the surface of query plans, and a deeper analysis would require twice the size of this book. The good news is that in most, if not all, scenarios, going into more detail turns out to be useless.
An experienced DAX developer who aims to write optimal code should be able to focus their attention on the low-hanging fruit that can be discovered very quickly by looking at the most relevant parts of the query plan:
■ In the physical query plan, the presence of a large number of rows scanned indicates the materialization of large datasets. This suggests that the query is memory-hungry and potentially slow.
■ Most of the time, the VertiPaq queries include enough information to figure out the overall algorithm of the calculation. Whatever is not computed in a VertiPaq query must be computed by the formula engine. Knowing this enables you to get a clear idea of the whole query process.
■ CallbackDataID presence indicates iterations at the row level where your code requires calculations that are too complex for the VertiPaq storage engine. CallbackDataIDs by themselves are not totally bad. Nevertheless, removing them almost always results in better performance.
■ VertiPaq and DirectQuery models are different. When using DirectQuery, the performance of DAX is strongly connected to the performance of the data source. It makes sense to use DirectQuery if and only if the underlying data source is specifically optimized for the kind of queries generated by the DirectQuery storage engine.
In the next chapter, we are going to use the knowledge gained in this and previous chapters to provide a few guided optimization processes.
CHAPTER 20
Optimizing DAX This is the last chapter of the book, and it is time to use all the knowledge you have gained so far to explore the most fascinating DAX topic: optimizing formulas. You have learned how the DAX engines work, how to read a query plan, and the internals of the formula engine and of the storage engine. Now all the pieces are in place and you are ready to learn how to use that information to write faster code. There is one very important warning before approaching this chapter. Do not expect to learn best practices or a simple way to write fast code. Simply stated: There is no way in DAX to write code that is always the fastest. The speed of a DAX formula depends on many factors, the most important of which unfortunately is not in the DAX code itself: It is data distribution. You have already learned that VertiPaq compression strongly depends on data distribution. The size of a column (hence, the speed to scan it) depends on its cardinality: the larger, the slower. Thus, the very same formula might behave differently when executed on one column or another. You will learn how to measure the speed of a formula, and we will provide you with several examples where rewriting the expression differently leads to a faster execution time. Learn all these examples for what they are—examples that might help you in finding new ideas for your code. Do not take them as golden rules, because they are not. We are not teaching you rules; we are trying to teach you how to find the best rules in the very specific scenario that is your data model. Be prepared to change them when the data model changes or when you approach a new scenario. Flexibility is key when optimizing DAX code: flexibility, a deep technical knowledge of the engine, and a good amount of creativity, to be prepared to test formulas and expressions that might be not so intuitive. Finally, all the information we provide in this book is valid at the time of printing. 
New versions of the engine come on the market every month, and the development team is always working on improving the DAX engine. So be prepared to measure different numbers for the examples of the book in the version of the engine you will be running and be prepared to use different optimization methods if necessary. If one day you measure your code and reach the educated conclusion that “Marco and Alberto are wrong; this code runs much faster than their suggested code,” that will be our brightest day, because we will have been able to teach you all that we know, and you are moving forward in writing better DAX code than ours.
Defining optimization strategies
The optimization process for a DAX query, expression, or measure requires a strategy to reproduce a performance issue, identify the bottleneck, and remove it. Initially, you always observe a slowness in a complex query, but optimizing a complicated expression including several DAX measures is more involved than optimizing one measure at a time. For this reason, the approach we suggest is to isolate the slowest measure or expression first, and optimize it in a simpler query that reproduces the issue with a shorter query plan. This is a simple to-do list you should follow every time you want to optimize DAX:
1. Identify a single DAX expression to optimize.
2. Create a query that reproduces the issue.
3. Analyze server timings and query plan information.
4. Identify bottlenecks in the storage engine or formula engine.
5. Implement changes and rerun the test query.
You can see a more complete description of each of these steps in the following sections.
Identifying a single DAX expression to optimize
If you have already found the slowest measure in your model, you probably can skip this section and move to the following one. However, it is common to get a performance issue in a report that might generate several queries. Each of these queries might include several measures. The first step is to identify a single DAX expression to optimize. Doing this, you reduce the reproduction steps to a single query and possibly to a single measure returned in the result. A complete refresh of a report in Power BI or Reporting Services or of a Microsoft Excel workbook typically generates several queries in either DAX or MDX (PivotTables and charts in Excel always generate the latter). When a report generates several queries, you have to identify the slowest query first. In Chapter 19, “Analyzing DAX query plans,” you saw how DAX Studio can intercept all the queries sent to the DAX engine and identify the slowest query looking at the largest Duration amount. If you are using Excel, you can also use a different technique to isolate a query. You can extract the MDX query it generates by using OLAP PivotTable Extensions, a free Excel add-in available at https://olappivottableextensions.github.io/. Once you extract the slowest DAX or MDX query, you have to further restrict your focus and isolate the DAX expression that is causing the slowness. This way, you will concentrate your efforts on the right area. You can reduce the measures included in a query by modifying and executing the query interactively in DAX Studio. For example, consider the following table result in Power BI with four expressions (two distinct counts and two measures) grouped by product brand, as shown in Figure 20-1.
FIGURE 20-1 Simple visualization in Power BI generated by a DAX query with four expressions.
The report generates the following DAX query, captured by using DAX Studio: EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "DistinctCountProductKey", CALCULATE ( DISTINCTCOUNT ( 'Product'[ProductKey] ) ), "Sales_Amount", 'Sales'[Sales Amount], "Margin__", 'Sales'[Margin %], "DistinctCountOrder_Number", CALCULATE ( DISTINCTCOUNT ( 'Sales'[Order Number] ) ) ), [IsGrandTotalRowTotal], 0, 'Product'[Brand], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
You should reduce the query by trying one calculation at a time, to locate the slowest one. If you can manipulate the report, you might just include one calculation at a time. By accessing the DAX code, it is enough to comment out or remove three of the four columns calculated in the SUMMARIZECOLUMNS function (DistinctCountProductKey, Sales_Amount, Margin__, and DistinctCountOrder_Number), finding the slowest one before proceeding. In this case, the most expensive calculation is the last one. The following query takes up 80% of the time required to compute the original query, meaning that the distinct count over Sales[Order Number] is the most expensive operation in the entire report:
EVALUATE
TOPN (
    502,
    SUMMARIZECOLUMNS (
        ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ),
//      "DistinctCountProductKey",
//          CALCULATE ( DISTINCTCOUNT ( 'Product'[ProductKey] ) ),
//      "Sales_Amount", 'Sales'[Sales Amount],
//      "Margin__", 'Sales'[Margin %],
        "DistinctCountOrder_Number",
            CALCULATE ( DISTINCTCOUNT ( 'Sales'[Order Number] ) )
    ),
    [IsGrandTotalRowTotal], 0,
    'Product'[Brand], 1
)
ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
Another example is the following MDX query generated by the pivot table in Excel as seen in Figure 20-2: SELECT { [Measures].[Sales Amount], [Measures].[Total Cost], [Measures].[Margin], [Measures].[Margin %] } DIMENSION PROPERTIES PARENT_UNIQUE_NAME, HIERARCHY_UNIQUE_NAME ON COLUMNS, NON EMPTY HIERARCHIZE( DRILLDOWNMEMBER( { { DRILLDOWNMEMBER( { { DRILLDOWNLEVEL( { [Date].[Calendar].[All] },,, include_calc_members ) } }, { [Date].[Calendar].[Year].&[CY 2008] },,, include_calc_members ) } }, { [Date].[Calendar].[Quarter].&[Q4-2008] },,, include_calc_members ) ) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON ROWS FROM [Model] CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
FIGURE 20-2 Simple pivot table in Excel that generates an MDX query with four measures.
You can reduce the measures either in the pivot table or directly in the MDX code. You can manipulate the MDX code by reducing the list of measures in braces. For example, you reduce the code to only the Sales Amount measure by modifying the list, as in the following initial part of the query: SELECT { [Measures].[Sales Amount] } DIMENSION PROPERTIES PARENT_UNIQUE_NAME, HIERARCHY_UNIQUE_NAME ON COLUMNS, ...
Regardless of the technique you use, once you identify the DAX expression (or measure) that is responsible for a performance issue, you need a reproduction query to use in DAX Studio.
Creating a reproduction query
The optimization process requires a query that you can execute several times, possibly changing the definition of the measure in order to evaluate different levels of performance. If you captured a query in DAX or MDX, you already have a good starting point for the reproduction (repro) query. You should try to simplify the query as much as you can, so that it becomes easier to find the bottleneck. You should only keep a complex query structure when it is fundamental in order to observe the performance issue.
Creating a reproduction query in DAX
When a measure is consistently slow, you should be able to create a repro query producing a single value as a result. Using CALCULATE or CALCULATETABLE, you can apply all the filters you need. For example, you can execute the Sales Amount measure for November 2008 using the following code, obtaining the same result ($96,777,975.30) you see in Figure 20-2 for that month: EVALUATE { CALCULATE ( [Sales Amount], 'Date'[Calendar Year] = "CY 2008", 'Date'[Calendar Year Quarter] = "Q4-2008", 'Date'[Calendar Year Month] = "November 2008" ) }
You can also write the previous query using CALCULATETABLE instead of CALCULATE: EVALUATE CALCULATETABLE ( { [Sales Amount] }, 'Date'[Calendar Year] = "CY 2008", 'Date'[Calendar Year Quarter] = "Q4-2008", 'Date'[Calendar Year Month] = "November 2008" )
The two approaches produce the same result. You should consider CALCULATETABLE when the query you use to test the measure is more complex than a simple table constructor.
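As an illustration of a test query that is more complex than a table constructor, the following sketch applies the same November 2008 filters to a grouped result. The grouping by 'Product'[Brand] is an arbitrary choice made for this example and is not taken from the original report:

```dax
// Sketch: CALCULATETABLE applies the filters to a whole table expression.
// Grouping by 'Product'[Brand] is illustrative only.
EVALUATE
CALCULATETABLE (
    SUMMARIZECOLUMNS (
        'Product'[Brand],
        "Sales Amount", [Sales Amount]
    ),
    'Date'[Calendar Year] = "CY 2008",
    'Date'[Calendar Year Quarter] = "Q4-2008",
    'Date'[Calendar Year Month] = "November 2008"
)
```

With CALCULATE you would have to repeat the filters around every measure reference, whereas CALCULATETABLE states them once for the entire query.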
Once you have a repro query for a specific measure defined in the data model, you should consider writing the DAX expression of the measure as local in the query, using the MEASURE syntax. For example, you can transform the previous repro query into the following one: DEFINE MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) EVALUATE CALCULATETABLE ( { [Sales Amount] }, 'Date'[Calendar Year] = "CY 2008", 'Date'[Calendar Year Quarter] = "Q4-2008", 'Date'[Calendar Year Month] = "November 2008" )
At this point, you can apply changes to the DAX expression assigned to the measure directly into the query statement. This way, you do not have to deploy a change to the data model before executing the query again. You can change the query, clear the cache, and run the query in DAX Studio, immediately measuring the performance results of the modified expression.
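A possible way to use this technique is to keep the current definition and a candidate definition side by side, checking that they return the same value before comparing timings. The following is only a sketch: the name Sales Amount (test) and the alternative formulation are hypothetical, chosen to show the mechanics and not suggested as an improvement:

```dax
DEFINE
    // Current definition, copied from the model.
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
    // Hypothetical candidate: iterates the unique quantities and multiplies
    // each one by the sum of Net Price for the rows with that quantity.
    MEASURE Sales[Sales Amount (test)] =
        SUMX (
            VALUES ( Sales[Quantity] ),
            Sales[Quantity] * CALCULATE ( SUM ( Sales[Net Price] ) )
        )
EVALUATE
CALCULATETABLE (
    ROW (
        "Original", [Sales Amount],
        "Candidate", [Sales Amount (test)]
    ),
    'Date'[Calendar Year] = "CY 2008",
    'Date'[Calendar Year Quarter] = "Q4-2008",
    'Date'[Calendar Year Month] = "November 2008"
)
```

Use a query like this one to validate the results only. To compare performance, evaluate one measure per query and clear the cache before each execution, so that the timings of the two versions do not influence each other.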
Creating query measures with DAX Studio
DAX Studio can generate the MEASURE syntax for a measure defined in the model by using the Define Measure context menu item. The latter is available by selecting a measure in the Metadata pane, as shown in Figure 20-3.
FIGURE 20-3 Screenshot of how a user would access the “Define Measure” menu item.
If a measure references other measures, all of them should be included as query measures in order to consider any possible change to the repro query. The Define Dependent Measures feature includes the definition of all the measures that are referenced by the selected measure, whereas Define and Expand Measure replaces any measure reference with the corresponding measure expression. For example, consider the following query that just evaluates the Margin % measure: EVALUATE { [Margin %] }
By clicking Define Measure on Margin %, you get the following code, where there are two other references to Sales Amount and Margin measures: DEFINE MEASURE Sales[Margin %] = DIVIDE ( [Margin], [Sales Amount] ) EVALUATE { [Margin %] }
Instead of repeating the Define Measure action on all the other measures, you can click on Define Dependent Measures on Margin %, obtaining the definition of all the other measures required; this includes Total Cost, which is used in the Margin definition: DEFINE MEASURE Sales[Margin] = [Sales Amount] - [Total Cost] MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) MEASURE Sales[Total Cost] = SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] ) MEASURE Sales[Margin %] = DIVIDE ( [Margin], [Sales Amount] ) EVALUATE { [Margin %] }
You can also obtain a single DAX expression without measure references by clicking Define and Expand Measure on Margin %: DEFINE MEASURE Sales[Margin %] = DIVIDE ( CALCULATE ( CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) ) - CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Unit Cost] ) ) ), CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) ) ) EVALUATE { [Margin %] }
This latter technique can be useful to quickly evaluate whether a measure includes nested iterators or not, though it could generate very verbose results.
Creating a reproduction query in MDX
In certain conditions, you have to use an MDX query to reproduce a problem that only happens in MDX and not in DAX. The same DAX measure, executed in a DAX or in an MDX query, generates different query plans; it might display a different behavior depending on the language of the query. However, in
this case too, you can define the DAX measure local to the query. That way, it is more efficient to edit and run again. For instance, you can define the Sales Amount measure local to the MDX query using the WITH MEASURE syntax: WITH MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Unit Price] ) SELECT { [Measures].[Sales Amount], [Measures].[Total Cost], [Measures].[Margin], [Measures].[Margin %] } DIMENSION PROPERTIES PARENT_UNIQUE_NAME, HIERARCHY_UNIQUE_NAME ON COLUMNS, NON EMPTY HIERARCHIZE( DRILLDOWNMEMBER( { { DRILLDOWNMEMBER( { { DRILLDOWNLEVEL( { [Date].[Calendar].[All] },,, include_calc_members ) } }, { [Date].[Calendar].[Year].&[CY 2008] },,, include_calc_members ) } }, { [Date].[Calendar].[Quarter].&[Q4-2008] },,, include_calc_members ) ) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON ROWS FROM [Model] CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
As you see, in MDX you must use WITH instead of DEFINE. Therefore, when you optimize an MDX query, you replace DEFINE with WITH in the syntax generated by DAX Studio. The syntax after MEASURE is always DAX code, so you follow the same optimization process for an MDX query. Regardless of the repro query language (either DAX or MDX), you always have a DAX expression to optimize, which you can define within a local MEASURE definition.
Analyzing server timings and query plan information
Once you have a repro query, you run it and collect information about execution time and query plan. You saw in Chapter 19 how to read the information provided by DAX Studio or SQL Server Profiler. In this section, we recap the steps required to analyze a simple query in DAX Studio. For example, consider the following DAX query: DEFINE MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Unit Price] ) EVALUATE ADDCOLUMNS ( VALUES ( 'Date'[Calendar Year] ), "Result", [Sales Amount] )
If you execute this query in DAX Studio after clearing the cache and enabling Query Plan and Server Timings, you obtain a result with one row for each year in the Date table, and the total of Sales Amount for sales made in that year. The starting point for an analysis is always the Server Timings pane, which displays information about the entire query, as shown in Figure 20-4.
FIGURE 20-4 Server Timings pane after a simple query execution.
Our query returned the result in 25 ms (Total), and it spent 72 percent of this time in the storage engine (SE), whereas the formula engine (FE) only used up 7 ms of the total time. This pane does not provide much information about the formula engine internals, but it is rich in details on storage engine activity. For example, there were two storage engine queries (SE Queries) that consumed a total of 94 ms of processing time (SE CPU). The CPU time can be larger than Duration thanks to the parallelism of the storage engine. Indeed, the engine used 94 ms of logical processors working in parallel, so that the duration time is a fraction of that number. The hardware used in this test had 8 logical processors, and the parallelism degree of this query (ratio between SE CPU and SE) is 5.2. The parallelism cannot be higher than the number of logical processors you have. The storage engine queries are available in the list, and you can see that a single storage engine operation (the first one) consumes the entire duration and CPU time. By enabling the display of Internal and Cache subclass events, you can see in Figure 20-5 that the two storage engine queries were actually executed by the storage engine.
FIGURE 20-5 Server Timings pane with internal subclass events visible.
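The figures quoted for the first execution (25 ms total, 7 ms in the formula engine, 94 ms of SE CPU) are related by simple arithmetic. As a worked example, the following query recomputes the derived values from those three numbers; the column names are chosen only for this illustration:

```dax
// Worked arithmetic on the timings quoted in the text:
// Total = 25 ms, FE = 7 ms, SE CPU = 94 ms.
EVALUATE
ROW (
    "SE ms", 25 - 7,                       // storage engine duration: 18 ms
    "SE share", DIVIDE ( 25 - 7, 25 ),     // 18 / 25 = 0.72
    "Parallelism", DIVIDE ( 94, 25 - 7 )   // 94 / 18, approximately 5.2
)
```

The same three inputs are available in the Server Timings pane of any query, so you can repeat the calculation to check how well a storage engine request uses the available logical processors.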
If you execute the same query again without clearing the cache, you see the results in Figure 20-6. Both storage engine queries retrieved the values from the cache (SE cache), and the storage engine queries resolved in the cache are visible in the Subclass column.
FIGURE 20-6 Server Timings pane with cache subclass events visible, after second execution of the same DAX query.
Usually, we will use the repro query with a cold cache (clearing the cache before the execution), but in some cases it is important to evaluate whether a given DAX expression can leverage the cache in an upcoming request or not. For this reason, the Cache visualization in DAX Studio is disabled by default, and you enable it on demand. At this point, you can start looking at the query plans. In Figure 20-7 you see the physical and logical query plans of the query used in the previous example. The physical query plan is the one you will use more often. In the query of the previous example, there are two datacaches—one for each storage engine query. Every Cache row in the physical query plan consumes one of the datacaches available. However, there is no simple way to match the correspondence between a query plan operation and a datacache. You can infer the datacache by looking at the columns used in the operations requiring a Cache result (the Spool_Iterator and SpoolLookup rows, in Figure 20-7).
FIGURE 20-7 Query Plan pane showing physical and logical query plans.
An important piece of information available in the physical query plan is the column showing the number of records processed. As you will see, when optimizing bottlenecks in the formula engine, it might be useful to identify the slowest operation in the formula engine by searching for the line with the largest number of records. You can sort the rows by clicking the Records column header, as you see in Figure 20-8. You restore the original sort order by clicking the Line column header.
FIGURE 20-8 Steps in physical query plan sorted by Records column.
Identifying bottlenecks in the storage engine or formula engine
There are usually many possible optimizations available for any query. The first and most important step is to identify whether a query spends most of the time in the formula engine or in the storage engine. A first indication is available in the percentages provided by DAX Studio for FE and SE. Usually, this is a good starting point, but you also have to identify the distribution of the workload in both the formula engine and the storage engine. In complex queries, a large amount of time spent in the storage engine might correspond to a large number of small storage engine queries or to a small number of storage engine queries that concentrate most of the workload. As you will see, these differences require different approaches in your optimization strategy.
When you identify the execution bottleneck of a query, you should also prioritize the optimization areas. For example, there might be different inefficiencies in the query plan resulting in a large formula engine execution time. You should identify the most important inefficiency and concentrate on that first. If you do not follow this approach, you might end up spending time optimizing an expression that only marginally affects the execution time. Sometimes the most effective optimizations are simple but hidden in counterintuitive context transitions or in other details of the DAX syntax. You should always measure the execution time before and after each optimization attempt, making sure that you obtain a real advantage and that you are not just applying some optimization pattern you found on the web or in this book without any real benefit.
Finally, remember that even if you have an issue in the formula engine, you should always start your analysis by looking at the storage engine queries. They provide valuable information about the content and size of the datacaches used by the formula engine. Reading the query plan that describes the operations made by the formula engine is a very complex process. It is easier to consider that the formula engine will use the content of datacaches and will have to do all the operations required to produce the result of a DAX query that has not already been produced by the storage engine. This approach is especially efficient for large and complex DAX queries. Indeed, these might generate thousands of lines in a query plan, but a relatively small number of datacaches produced by storage engine queries.
Implementing changes and rerunning the test query
Once the bottlenecks have been identified, the next step is to change the DAX expressions and/or the data model, so that the query plan is more efficient. By running the test query again, you can verify that the improvement is effective. You then start the search for the next bottleneck, restarting the loop at the step “Analyzing server timings and query plan information.” This process continues until the performance is optimal or there are no further possible improvements that are worth the effort.
Optimizing bottlenecks in DAX expressions
A longer execution time in the storage engine is usually the consequence of one or more of the following causes (explained in more detail in Chapter 19):
■ Longer scan time. Even for a simple aggregation, a DAX query must scan one or more columns. The cost for this scan depends on the size of the column, which depends on the number of unique values and on the data distribution. Different columns in the same table can have very different execution times.
■ Large cardinality. A large number of unique values in a column affects the DISTINCTCOUNT calculation and the filter arguments of the CALCULATE and CALCULATETABLE functions. A large cardinality can also affect the scan time of a column, but it could be an issue by itself regardless of the column data size.
■ High frequency of CallbackDataID. A large number of calls made by the storage engine to the formula engine can affect the overall performance of a query.
■ Large materialization. If a storage engine query produces a large datacache, its generation requires time (allocating and writing RAM). Moreover, its consumption (by the formula engine) is also another potential bottleneck.
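To judge whether cardinality might be involved, it can help to count the unique values of the columns used by a slow calculation. The following query is a minimal sketch; the columns are picked from the earlier examples in this chapter only for illustration:

```dax
// Sketch: unique values of columns involved in a calculation,
// compared with the number of rows in the Sales table.
// The choice of columns is illustrative.
EVALUATE
ROW (
    "Order Number values", DISTINCTCOUNT ( Sales[Order Number] ),
    "ProductKey values", DISTINCTCOUNT ( 'Product'[ProductKey] ),
    "Sales rows", COUNTROWS ( Sales )
)
```

A column whose number of unique values approaches the number of rows in its table is a likely candidate for expensive DISTINCTCOUNT calculations and for large filters.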
In the following sections, you will see several examples of optimization. Starting with the concepts you learned in previous chapters, you will see a typical problem reproduced in a simpler query and optimized.
Optimizing filter conditions
Whenever possible, a filter argument of a CALCULATE/CALCULATETABLE function should always filter columns rather than tables. The DAX engine has improved over the years, and several simple table
filters are relatively well optimized in 2019 or newer engine versions. However, expressing a filter condition by columns rather than by tables is always a best practice. For example, consider the report in Figure 20-9 that compares the total of Sales Amount with the sum of the sales transactions larger than $1,000 (Big Sales Amount) for each product brand.
FIGURE 20-9 Sales Amount and Big Sales Amount reported by product brand.
Because the filter condition in the Big Sales Amount measure requires two columns, a trivial way to define the filter is by using a filter over the Sales table. The following query computes just the Big Sales Amount measure in the previous report, generating the server timings results visible in Figure 20-10: DEFINE MEASURE Sales[Big Sales Amount (slow)] = CALCULATE ( [Sales Amount], FILTER ( Sales, Sales[Quantity] * Sales[Net Price] > 1000 ) ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "Big_Sales_Amount", 'Sales'[Big Sales Amount (slow)] )
FIGURE 20-10 Server Timings running the query for the Big Sales Amount (slow) measure.
Because FILTER is iterating a table, this query is generating a larger datacache than necessary. The result in Figure 20-9 only displays 11 brands and one additional row for the grand total. Nevertheless, the query plan estimates that the first two datacaches return 3,937 rows, which is the same number as reported also in the Query Plan pane visible in Figure 20-11.
FIGURE 20-11 Query Plan pane running the query for Big Sales Amount (slow) measure.
The formula engine receives a much larger datacache than the one required for the query result because there are two additional columns. Indeed, the xmSQL query at line 2 is the following: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Product'[Brand], 'DaxBook Sales'[Quantity], 'DaxBook Sales'[Net Price], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] WHERE ( COALESCE ( ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) ) > COALESCE ( 1000.000000 ) );
The structure of the xmSQL query at line 4 in Figure 20-10 is similar to the previous one, just without the SUM aggregation. The presence of a table filter in CALCULATE results in this side effect in the query plan because the semantics of the filter include all the columns of the Sales expanded table (expanded tables are described in Chapter 14, “Advanced DAX concepts”). The optimization of the measure only requires a column filter. Because the filter expression uses two columns, a row context requires a table with just those two columns to produce a corresponding and more efficient filter argument to CALCULATE. The following query implements the column filter, adding KEEPFILTERS to keep the same semantics as the previous version, and generates the server timings results visible in Figure 20-12:
DEFINE MEASURE Sales[Big Sales Amount (fast)] = CALCULATE ( [Sales Amount], KEEPFILTERS ( FILTER ( ALL ( Sales[Quantity], Sales[Net Price] ), Sales[Quantity] * Sales[Net Price] > 1000 ) ) ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "Big_Sales_Amount", 'Sales'[Big Sales Amount (fast)] )
FIGURE 20-12 Server Timings when running the query for the Big Sales Amount (fast) measure.
The DAX query runs faster, but what is more important is that there is only one datacache for the rows of the result, excluding the grand total, which still has a separate xmSQL query. The materialization of the datacache at line 2 in Figure 20-12 only returns 14 estimated rows, when there are only 11 in the actual count visible in the Query Plan pane in Figure 20-13.
FIGURE 20-13 Query Plan pane running the query for Big Sales Amount (fast) measure.
The reason for this optimization is that the query plan can create a much more efficient calculation in the storage engine, without returning to the formula engine the additional data that the semantics of a table filter would require. The following is the xmSQL query at line 2 in Figure 20-12:
WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Product'[Brand], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] WHERE ( COALESCE ( ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) ) > COALESCE ( 1000.000000 ) );
The datacache no longer includes the Quantity and Net Price columns, and its cardinality corresponds to the cardinality of the DAX result. This is an ideal condition for minimal materialization. Keeping the filter conditions using columns rather than tables is an important effort to achieve this goal. The important takeaway of this section is that you should always pay attention to the rows returned by storage engine queries. When their number is much bigger than the rows included in the result of a DAX query, there might be some overhead caused by the additional work performed by the storage engine to materialize datacaches and by the formula engine to consume such datacaches. Table filters are one of the most common reasons for excessive materialization, though they are not always responsible for bad performance.
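The same principle applies when the condition involves a single column, in which case the column filter can be written as a simple predicate. The following sketch compares the two forms; the measure names and the threshold on Sales[Quantity] are hypothetical and serve only to illustrate the syntax:

```dax
DEFINE
    // Table filter: iterates the Sales table visible in the filter context.
    MEASURE Sales[Large Qty Sales (table filter)] =
        CALCULATE (
            [Sales Amount],
            FILTER ( Sales, Sales[Quantity] > 3 )
        )
    // Column filter: the predicate only involves Sales[Quantity];
    // KEEPFILTERS preserves any existing filter on that column.
    MEASURE Sales[Large Qty Sales (column filter)] =
        CALCULATE (
            [Sales Amount],
            KEEPFILTERS ( Sales[Quantity] > 3 )
        )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Table filter", [Large Qty Sales (table filter)],
    "Column filter", [Large Qty Sales (column filter)]
)
```

This query is meant to confirm that the two measures return the same values. To compare their materialization, evaluate one measure per query and look at the rows returned by the storage engine queries in the Server Timings pane.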
Note When you write a DAX filter, consider the cardinality of the resulting filter. If the cardinality using a table filter is identical to a column filter and the table filter does not expand to other tables, then the table filter can be used safely. For example, there is not usually much difference between filtering a Date table versus the Date[Date] column.
Optimizing context transitions
The storage engine can only compute simple aggregations and simple grouping over columns of the model. Anything else must be computed by the formula engine. Every time there is an iteration and a corresponding context transition, the storage engine materializes a datacache at the granularity level of the iterated table. If the expression computed during the iteration is simple enough to be solved by the storage engine, the performance is typically good. Otherwise, if the expression is too complex, a large materialization and/or a CallbackDataID might occur as we demonstrate in the following example. In these scenarios, simplifying the code by reducing the number of context transitions and by reducing the granularity of the iterated table greatly helps in improving performance.
For example, consider a Cashback measure that multiplies the Sales Amount by the Cashback % attribute assigned to each Customer based on an algorithm defined by the marketing department. The report in Figure 20-14 displays the Cashback amount for each country.
FIGURE 20-14 Cashback reported by customer country.
The easiest and most intuitive way to create the Cashback measure is also the slowest, which multiplies the Cashback % by the Sales Amount for each customer, summing the result. The following query computes the slowest Cashback measure in the previous report, generating the server timings results visible in Figure 20-15: DEFINE MEASURE Sales[Cashback (slow)] = SUMX ( Customer, [Sales Amount] * Customer[Cashback %] ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Customer'[Country], "IsGrandTotalRowTotal" ), "Cashback", 'Sales'[Cashback (slow)] )
FIGURE 20-15 Server Timings running the query for the Cashback (slow) measure reported by country.
The queries at lines 2 and 4 of Figure 20-15 compute the result at the Country level, whereas the queries at lines 6 and 8 run the same task for the grand total. We will focus exclusively on the first two storage engine queries. In order to check whether the estimation for the rows materialized is correct, you can look at the query plan in Figure 20-16. This could be surprising, because it seems that a few storage engine queries are not used at all.
FIGURE 20-16 Query Plan pane running the query for the Cashback (slow) measure reported by country.
The query plan in Figure 20-16 only reports two Cache nodes, which correspond to lines 4 and 8 of the Server Timings pane in Figure 20-15. This is another example of why looking at the query plan could be confusing. The formula engine is actually doing some other work, but the execution within a CallbackDataID is not always reported in the query plan, and this is one of those cases. This is the xmSQL query at line 4 of Figure 20-15, which returns 29 effective rows instead of the estimated 32: WITH $Expr0 := ( [CallbackDataID ( SUMX ( Sales, Sales[Quantity]] * Sales[Net Price]] ) ) ] ( PFDATAID ( 'DaxBook Customer'[CustomerKey] ) ) * PFCAST ( 'DaxBook Customer'[Cashback %] AS REAL ) ) SELECT 'DaxBook Customer'[Country], SUM ( @$Expr0 ) FROM 'DaxBook Customer';
The DAX code passed to CallbackDataID must be computed for each customer by the formula engine, which receives the CustomerKey as argument. You can see the additional storage engine queries, but the corresponding query plan is not visible in this case. Therefore, we can only imagine what the query plan does by looking at the other storage engine query at line 2 of Figure 20-15: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Customer'[CustomerKey], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey];
The result of this xmSQL query only contains two columns: the CustomerKey and the result of the Sales Amount measure for that customer. Thus, the formula engine uses the result of this query to provide a result to the CallbackDataID request of the former query.
Once again, instead of trying to describe the exact sequence of operations performed by the engine, it is easier to analyze the result of the storage engine queries, checking whether the materialization is larger than what is required for the query result. In this case the answer is yes: the DAX query returns only 6 visible countries, whereas a total of 29 countries were computed by the formula engine. In any case, there is a huge difference with the materialization of 18,872 customers produced by the latter xmSQL query analyzed. Is it possible to push more workload to the storage engine, aggregating the data by country instead of by customer? The answer is yes, by reducing the number of context transitions. Consider the original Cashback measure: the expression executed in the row context depends on a single column of the Customer table (Cashback %): Sales[Cashback (slow)] := SUMX ( Customer, [Sales Amount] * Customer[Cashback %] )
Because the Sales Amount measure can be computed for a group of customers that have the same Cashback %, the optimal cardinality for the SUMX iterator is defined by the unique values of the Cashback % column. The following optimized version just replaces the first argument of SUMX using the unique values of Cashback % visible in the filter context: DEFINE MEASURE Sales[Cashback (fast)] = SUMX ( VALUES ( Customer[Cashback %] ), [Sales Amount] * Customer[Cashback %] ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Customer'[Country], "IsGrandTotalRowTotal" ), "Cashback", 'Sales'[Cashback (fast)] )
This way, the materialization is much smaller, as visible in Figure 20-17. However, even though the number of rows materialized is significantly smaller, the overall execution time is similar if not larger; remember that a difference of a few milliseconds should not be considered relevant.
FIGURE 20-17 Server Timings running the query for Cashback (fast) reported by country.
This time there is a single xmSQL query to compute the amount by country. This is the xmSQL query at line 2 of Figure 20-17: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Customer'[Country], 'DaxBook Customer'[Cashback %], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey];
The result of this query contains three columns: Country, Cashback %, and the corresponding Sales Amount value. Thus, the formula engine multiplies Cashback % by Sales Amount for each row, aggregating the rows belonging to the same country. The result presents an estimated count of 288 rows, whereas there are only 65 rows consumed by the formula engine. This is visible in the query plan in Figure 20-18.
FIGURE 20-18 Query Plan pane running the query for Cashback (fast) reported by country.
Even though it is not evident, this measure is faster than the original measure. Having a smaller footprint in memory, it performs better in more complex reports. This is immediately visible by using a slightly different report like the one in Figure 20-19, grouping the Cashback measure by product brand instead of by customer country.
FIGURE 20-19 Cashback reported by product brand.
The following query computes the slowest Cashback measure in the report shown in Figure 20-19, generating the server timings results visible in Figure 20-20: DEFINE MEASURE Sales[Cashback (slow)] = SUMX ( Customer, [Sales Amount] * Customer[Cashback %] ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( Product[Brand], "IsGrandTotalRowTotal" ), "Cashback", 'Sales'[Cashback (slow)] )
FIGURE 20-20 Server Timings running the query for Cashback (slow) reported by brand.
There are a few differences in this query plan, but we focus on the materialization of 192,514 rows produced by the following xmSQL query at line 2 of Figure 20-20: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Customer'[CustomerKey], 'DaxBook Product'[Brand], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
The reason for the larger materialization is that now, the inner calculation computes Sales Amount for each combination of CustomerKey and Brand. The estimated count of 192,514 rows is confirmed by the actual count visible in the query plan in Figure 20-21.
FIGURE 20-21 Query Plan pane running the query for the Cashback (slow) measure reported by brand.
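One way to anticipate the size of such a materialization is to count the combinations that the iteration has to produce for a given report. The following query is a sketch under the assumption that counting the combinations existing in Sales is a reasonable approximation; the numbers it returns are not guaranteed to match the rows reported by the storage engine:

```dax
// Sketch: existing combinations in Sales at two different granularities
// when the report groups by 'Product'[Brand].
EVALUATE
ROW (
    // Granularity produced by iterating the Customer table.
    "CustomerKey x Brand",
        COUNTROWS ( SUMMARIZE ( Sales, Customer[CustomerKey], 'Product'[Brand] ) ),
    // Granularity produced by iterating the unique Cashback % values.
    "Cashback % x Brand",
        COUNTROWS ( SUMMARIZE ( Sales, Customer[Cashback %], 'Product'[Brand] ) )
)
```

A large gap between the two counts suggests that iterating the column with the lower cardinality reduces the materialization accordingly.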
When the test query is using the faster measure, the materialization is much smaller and the query response time is also much faster. The execution of the following DAX query produces the server timings results visible in Figure 20-22: DEFINE MEASURE Sales[Cashback (fast)] = SUMX ( VALUES ( Customer[Cashback %] ), [Sales Amount] * Customer[Cashback %] ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( Product[Brand], "IsGrandTotalRowTotal" ), "Cashback", 'Sales'[Cashback (fast)] )
FIGURE 20-22 Server Timings running the query for Cashback (fast) reported by brand.
The materialization is three orders of magnitude smaller (126 rows instead of 192,000), and the total execution time is 9 times faster than the slow version (it was 415 milliseconds and it is 48 milliseconds with the fast version). Because these differences depend on the cardinality of the report, you should focus on the formula that minimizes the work in the formula engine by computing most of the aggregations in the storage engine. Reducing the number of context transitions is an important step to achieve this goal.
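Before replacing a measure with an optimized version, it is worth checking that the two versions return the same values at the granularity of the report. The following query is a possible sketch that evaluates both Cashback measures side by side; the Difference column is added here only for the comparison:

```dax
DEFINE
    MEASURE Sales[Cashback (slow)] =
        SUMX ( Customer, [Sales Amount] * Customer[Cashback %] )
    MEASURE Sales[Cashback (fast)] =
        SUMX (
            VALUES ( Customer[Cashback %] ),
            [Sales Amount] * Customer[Cashback %]
        )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Slow", [Cashback (slow)],
    "Fast", [Cashback (fast)],
    // Expected to be zero or negligible; tiny differences can come from
    // floating-point rounding when the data type is not fixed decimal.
    "Difference", [Cashback (slow)] - [Cashback (fast)]
)
```

This query is not suitable for measuring performance, because it executes both versions in the same request. Use it to validate the results, then time each measure in its own query with a cold cache.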
Note Excessive materialization generated by unnecessary context transitions is the most common performance issue in DAX measures. Using table filters instead of column filters is the second most common performance issue. Therefore, making sure that your DAX measures do not have these two problems should be your priority in an optimization effort. By inspecting the server timings, you should be able to quickly see the symptoms by looking at the materialization size.
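As a minimal sketch of the second issue mentioned in the note, the following query defines two versions of a hypothetical Red Sales measure. The measure names and the "Red" value are assumptions introduced only for illustration; they are not part of the examples analyzed in this chapter. The first version passes the whole Product table as a filter argument, whereas the second version filters only the Color column:

```dax
DEFINE
    -- Hypothetical measure: table filter. FILTER iterates the Product
    -- table, and the resulting table is applied as a filter argument.
    MEASURE Sales[Red Sales (table filter)] =
        CALCULATE (
            [Sales Amount],
            FILTER ( 'Product', 'Product'[Color] = "Red" )
        )
    -- Hypothetical measure: column filter. Only the Color column is
    -- filtered; KEEPFILTERS preserves any existing filter on Color.
    MEASURE Sales[Red Sales (column filter)] =
        CALCULATE (
            [Sales Amount],
            KEEPFILTERS ( 'Product'[Color] = "Red" )
        )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Brand],
    "Red Sales (table filter)", [Red Sales (table filter)],
    "Red Sales (column filter)", [Red Sales (column filter)]
)
ORDER BY 'Product'[Brand]
```

Both measures are expected to return the same numbers in this report. To compare their cost, it is better to run each measure in a separate query and look at the materialization size in the Server Timings pane.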
Optimizing IF conditions
An IF function is always executed by the formula engine. When there is an IF function within an iteration, there could be a CallbackDataID involved in the execution. Moreover, the engine might evaluate the arguments of the IF regardless of the result of the condition in the first argument. Even though the result is correct, you might pay the full cost of processing every possible branch. As usual, there could be different behaviors depending on the version of the DAX engine used.
Optimizing IF in measures
Conditional statements in a measure could trigger a dangerous side effect in the query plan, generating the calculation of every conditional branch regardless of whether it is needed or not. In general, it is a good idea to avoid or at least reduce the number of conditional statements in expressions evaluated for measures, applying filters through the filter context whenever possible. For example, the report in Figure 20-23 displays a Fam. Sales measure that only considers customers with at least one child at home. Because the goal is to display the value for individual customers, the first implementation (slow) does not work for aggregations of two or more customers (Total row is blank), whereas the alternative, faster implementation also works at aggregated levels.
FIGURE 20-23 Fam. Sales reported by product brand.
The following query computes the Fam. Sales (slow) measure in a report similar to the one in Figure 20-23. For each customer, an IF statement checks the number of children at home to filter customers classified as a family. The execution of the following DAX query produces the server timings results visible in Figure 20-24: DEFINE MEASURE Sales[Fam. Sales (slow)] = VAR ChildrenAtHome = SELECTEDVALUE ( Customer[Children At Home] ) VAR Result = IF ( ChildrenAtHome > 0, [Sales Amount] ) RETURN Result EVALUATE CALCULATETABLE ( SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( ROLLUPGROUP ( 'Customer'[CustomerKey], 'Customer'[Name] ), "IsGrandTotalRowTotal" ), "Fam__Sales__slow_", 'Sales'[Fam. Sales (slow)] ), 'Product Category'[Category] = "Home Appliances", 'Product'[Manufacturer] = "Northwind Traders", 'Product'[Class] = "Regular", DATESBETWEEN ( 'Date'[Date], DATE ( 2007, 5, 10 ), DATE ( 2007, 5, 10 ) ) ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[CustomerKey], 'Customer'[Name]
FIGURE 20-24 Server Timings running the query for Fam. Sales (slow) reported by customer.
The query is not that slow, but we wanted a query result with a small number of rows because the focus is mainly on the materialization required. We can avoid looking at the query plan, which is already 62 lines long, because the information provided in the Server Timings pane already highlights several facts:
■ Even though the DAX result only has 7 rows, the datacaches materialized by three xmSQL queries have more than 18,000 rows, a number close to the number of customers.
■ The materialization produced by the storage engine query at line 4 in Figure 20-24 includes information about the number of children at home computed for each customer.
■ The materialization produced by the storage engine query at line 9 in Figure 20-24 includes the Sales Amount measure computed for each customer.
■ The grand total is not computed by any storage engine query, so it is the formula engine that aggregates the customers to obtain that number.
This is the storage engine query at line 4 in Figure 20-24. It provides the information required by the formula engine to filter customers based on the number of children at home: SELECT 'DaxBook Customer'[CustomerKey], SUM ( ( PFDATAID ( 'DaxBook Customer'[Children At Home] ) <> 2 ) ), MIN ( 'DaxBook Customer'[Children At Home] ), MAX ( 'DaxBook Customer'[Children At Home] ), COUNT ( ) FROM 'DaxBook Customer';
This result is used as an argument to the following storage engine query at line 9 in Figure 20-24 in order to filter an estimate of 7,368 customers that have at least one child at home: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Customer'[CustomerKey], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Date' ON 'DaxBook Sales'[OrderDateKey]='DaxBook Date'[DateKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] LEFT OUTER JOIN 'DaxBook Product Subcategory' ON 'DaxBook Product'[ProductSubcategoryKey] ='DaxBook Product Subcategory'[ProductSubcategoryKey] LEFT OUTER JOIN 'DaxBook Product Category' ON 'DaxBook Product Subcategory'[ProductCategoryKey] ='DaxBook Product Category'[ProductCategoryKey] WHERE 'DaxBook Customer'[CustomerKey] IN ( 2241, 13407, 5544, 7787, 11090, 7368, 17055, 16636, 1329, 12914.. [7368 total values, not all displayed] ) VAND 'DaxBook Date'[Date] = 39212.000000 VAND 'DaxBook Product'[Manufacturer] = 'Northwind Traders' VAND 'DaxBook Product'[Class] = 'Regular' VAND 'DaxBook Product Category'[Category] = 'Home Appliances';
The estimated number of rows in this result is wrong, because there are only 7 rows received in the previous storage engine query. This is visible in the query plan; however, it might not be trivial to find the corresponding xmSQL query for each Cache node in the query plan shown in Figure 20-25.
FIGURE 20-25 Query Plan pane running the query for the Fam. Sales (slow) measure reported by customer.
The previous storage engine query receives a filter over the CustomerKey column. The formula engine requires a materialization of such a list of values in CustomerKey in order to provide the corresponding filter in a storage engine query. However, the materialization of a large number of customers in the formula engine is likely to be the largest cost for this query. The size of this materialization depends on the number of customers. Therefore, a model with hundreds of thousands or millions of customers would make the performance issue evident. In this case you should look at the size of the materialization rather than just the execution time. The latter is still relatively quick. Understanding whether the materialization is efficient is important to create a formula that scales up well with a growing number of rows in the model.
The IF statement in the measure can only be evaluated by the formula engine. This requires either materialization like in this example, or CallbackDataID calls, which we describe later. A better approach is to apply a filter to the filter context using CALCULATE. This removes the need to evaluate an IF condition for every cell of the query result.
When the test query is using the faster measure, the materialization is much smaller and the query response time is also much shorter. The execution of the following DAX query produces the server timings results visible in Figure 20-26: DEFINE MEASURE Sales[Fam. Sales (fast)] = CALCULATE ( [Sales Amount], KEEPFILTERS ( Customer[Children At Home] > 0 ) ) EVALUATE CALCULATETABLE ( SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( ROLLUPGROUP ( 'Customer'[CustomerKey], 'Customer'[Name] ), "IsGrandTotalRowTotal" ), "Fam__Sales__fast_", 'Sales'[Fam. Sales (fast)] ), 'Product Category'[Category] = "Home Appliances", 'Product'[Manufacturer] = "Northwind Traders", 'Product'[Class] = "Regular", DATESBETWEEN ( 'Date'[Date], DATE ( 2007, 5, 10 ), DATE ( 2007, 5, 10 ) ) ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[CustomerKey], 'Customer'[Name]
FIGURE 20-26 Server Timings running the query for Fam. Sales (fast) reported by customer.
Even though there are still four storage engine queries, the query at line 4 in Figure 20-24 is no longer used. The query at line 4 in Figure 20-26 corresponds to the query at line 9 in Figure 20-24. It includes the filter over the number of children, highlighted in the last two lines of the following xmSQL query: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Customer'[CustomerKey], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Date' ON 'DaxBook Sales'[OrderDateKey]='DaxBook Date'[DateKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] LEFT OUTER JOIN 'DaxBook Product Subcategory' ON 'DaxBook Product'[ProductSubcategoryKey] ='DaxBook Product Subcategory'[ProductSubcategoryKey] LEFT OUTER JOIN 'DaxBook Product Category' ON 'DaxBook Product Subcategory'[ProductCategoryKey] ='DaxBook Product Category'[ProductCategoryKey] WHERE 'DaxBook Date'[Date] = 39212.000000 VAND 'DaxBook Product'[Manufacturer] = 'Northwind Traders' VAND 'DaxBook Product'[Class] = 'Regular' VAND 'DaxBook Product Category'[Category] = 'Home Appliances' VAND ( PFCASTCOALESCE ( 'DaxBook Customer'[Children At Home] AS INT ) > COALESCE ( 0 ) );
This different query plan has pros and cons. The advantage is that the formula engine bears a lower workload, not having to transfer the filter of customers back and forth between storage engine queries. The price to pay is that the filters are applied at the storage engine level, which increases the SE CPU time from the former 32 ms to the current 94 ms. Another side effect of the new query plan is the additional storage engine query at line 8 in Figure 20-26; this query computes the aggregation at the grand total without having to perform such aggregation in the formula engine, as was the case in the slower measure. The code is similar to the previous xmSQL query, without the aggregation by CustomerKey.
As a rule of thumb, replacing a conditional statement with a filter argument in CALCULATE is usually a good idea, prioritizing a smaller materialization rather than looking at the execution time for small queries. This way, the expression is usually more scalable with larger data models. However, you should always evaluate the performance in specific conditions, analyzing the metrics provided by DAX Studio using different implementations; you might otherwise choose an implementation that, in a particular scenario, turns out to be slower rather than faster.
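Before comparing the performance of two implementations, it is worth checking that they return the same values where both are expected to work. The following query is a sketch that defines the two Fam. Sales measures described in this section side by side, reusing the same filters as the test queries. It is meant only to compare the results, not the timings, because running both measures in the same query mixes their storage engine queries:

```dax
DEFINE
    -- IF-based version: blank when more than one customer is visible
    MEASURE Sales[Fam. Sales (slow)] =
        VAR ChildrenAtHome = SELECTEDVALUE ( Customer[Children At Home] )
        VAR Result = IF ( ChildrenAtHome > 0, [Sales Amount] )
        RETURN Result
    -- Filter-based version: the condition is applied to the filter context
    MEASURE Sales[Fam. Sales (fast)] =
        CALCULATE (
            [Sales Amount],
            KEEPFILTERS ( Customer[Children At Home] > 0 )
        )
EVALUATE
CALCULATETABLE (
    SUMMARIZECOLUMNS (
        ROLLUPADDISSUBTOTAL (
            ROLLUPGROUP ( 'Customer'[CustomerKey], 'Customer'[Name] ),
            "IsGrandTotalRowTotal"
        ),
        "Fam. Sales (slow)", [Fam. Sales (slow)],
        "Fam. Sales (fast)", [Fam. Sales (fast)]
    ),
    'Product Category'[Category] = "Home Appliances",
    'Product'[Manufacturer] = "Northwind Traders",
    'Product'[Class] = "Regular",
    DATESBETWEEN ( 'Date'[Date], DATE ( 2007, 5, 10 ), DATE ( 2007, 5, 10 ) )
)
ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[CustomerKey], 'Customer'[Name]
```

The rows for individual customers should show the same value in both columns, whereas the grand total row is expected to be blank for the slow version only. The performance comparison should then be done by running each measure in its own query.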
Choosing between IF and DIVIDE
A very common use of the IF statement is to make sure that an expression is only evaluated with valid arguments. For example, an IF function can validate the denominator of a division to avoid a division by zero. For this specific condition, the DIVIDE function provides a faster alternative. It is interesting to consider why the code is faster by analyzing the different executions with DAX Studio. The report in Figure 20-27 displays an Average Price measure by customer and brand.
FIGURE 20-27 Average Price reported by product brand and customer.
The following query computes the Average Price (slow) measure in the report shown in Figure 20-27. For each combination of product brand and customer, it divides the sales amount by the sum of quantity, but only if the latter is not equal to zero. The execution of this DAX query produces the server timings results visible in Figure 20-28:
DEFINE MEASURE Sales[Average Price (slow)] = VAR Quantity = SUM ( Sales[Quantity] ) VAR SalesAmount = [Sales Amount] VAR Result = IF ( Quantity <> 0, SalesAmount / Quantity ) RETURN Result
EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( ROLLUPGROUP ( 'Customer'[CustomerKey], 'Product'[Brand] ), "IsGrandTotalRowTotal" ), "Average_Price__slow_", 'Sales'[Average Price (slow)] ), [IsGrandTotalRowTotal], 0, 'Customer'[CustomerKey], 1, 'Product'[Brand], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[CustomerKey], 'Product'[Brand]
FIGURE 20-28 Server Timings running the query for Average Price (slow) reported by product brand and customer.
Though the result of the query is limited to 500 rows, the materialization of the datacaches returned by the storage engine queries is much larger. The following xmSQL query is executed at line 2 in Figure 20-28, and returns one row for each combination of customer and brand: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Customer'[CustomerKey], 'DaxBook Product'[Brand], SUM ( @$Expr0 ), SUM ( 'DaxBook Sales'[Quantity] ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
The query does not have any filter; therefore, the formula engine evaluates every row returned by this datacache, sorting the result and choosing the first 500 rows to return. This is certainly the most expensive part of the storage engine execution, which consumes 90% of the query duration time. The other three storage engine queries return the list of product brands (line 4), the list of customers (line 6), and the value of sales amount and quantity at the grand total level (line 8). However, these queries are less important in the optimization process. What matters is the formula engine cost required to execute the IF condition on more than 190,000 rows. The query plan resulting from the slow version of the measure has more than 80 lines (not reported here), and it consumes every datacache multiple times. This is a side effect of having different execution branches in an IF statement.
The optimization of the Average Price measure is based on replacing the IF function with DIVIDE. The execution of the following DAX query produces the server timings results visible in Figure 20-29:
DEFINE MEASURE Sales[Average Price (fast)] = VAR Quantity = SUM ( Sales[Quantity] ) VAR SalesAmount = [Sales Amount] VAR Result = DIVIDE ( SalesAmount, Quantity ) RETURN Result
EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( ROLLUPGROUP ( 'Customer'[CustomerKey], 'Product'[Brand] ), "IsGrandTotalRowTotal" ), "Average_Price__fast_", 'Sales'[Average Price (fast)] ), [IsGrandTotalRowTotal], 0, 'Customer'[CustomerKey], 1, 'Product'[Brand], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[CustomerKey], 'Product'[Brand]
FIGURE 20-29 Server Timings running the query for Average Price (fast) reported by product brand and customer.
The query now runs in 413 milliseconds, saving more than 80% of the execution time. At first sight, there being only two storage engine queries instead of four might seem like a good reason for the improved performance. However, this is not really the case. Overall, the SE CPU time did not change significantly, and the larger materialization is still there. The optimization is obtained by a shorter and more efficient query plan, which has only 36 lines instead of more than 80 generated by the slower query. In other words, DIVIDE reduces the size and complexity of the query plan, saving time in the formula engine execution by almost one order of magnitude.
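As a small illustration of the behavior that makes DIVIDE a drop-in replacement for the IF test on the denominator, the following query evaluates DIVIDE with constant arguments. The column names are arbitrary labels chosen for this sketch. When the denominator is zero, DIVIDE returns a blank unless the optional third argument provides an alternate result:

```dax
EVALUATE
ROW (
    "Regular division", DIVIDE ( 10, 4 ),      -- returns 2.5
    "Zero denominator", DIVIDE ( 10, 0 ),      -- returns BLANK
    "Alternate result", DIVIDE ( 10, 0, 0 )    -- returns 0
)
```

The default blank result matches the IF version of Average Price, which also returns a blank when Quantity is zero because the IF function has no third argument.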
Optimizing IF in iterators
Using the IF statement within a large iterator might create expensive callbacks to the formula engine. For example, consider a Discounted Sales measure that applies a 10% discount to every transaction that has a quantity greater than or equal to 3. The report in Figure 20-30 displays the Discounted Sales amount for each product brand.
FIGURE 20-30 Discounted Sales reported by product brand.
The following query computes the slower Discounted Sales measure in the previous report, generating the server timings results visible in Figure 20-31: DEFINE MEASURE Sales[Discounted Sales (slow)] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] * IF ( Sales[Quantity] >= 3, .9, 1 ) ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "Sales_Amount", 'Sales'[Sales Amount], "Discounted_Sales__slow_", 'Sales'[Discounted Sales (slow)] ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
FIGURE 20-31 Server Timings running the query for Discounted Sales (slow) reported by product brand.
The IF statement executed in the SUMX iterator produces two storage engine queries with a CallbackDataID call. The following is the xmSQL query at line 2 of Figure 20-31: WITH $Expr0 := (
( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) * [CallbackDataID ( IF ( Sales[Quantity]] >= 3, .9, 1 ) ) ] ( PFDATAID ( 'DaxBook Sales'[Quantity] ) ) ) , $Expr1 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) SELECT 'DaxBook Product'[Brand], SUM ( @$Expr0 ), SUM ( @$Expr1 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
The presence of a CallbackDataID comes with two consequences: a slower execution time compared to the storage engine performance and the unavailability of the storage engine cache. The datacache must be computed every time and cannot be retrieved from the cache in subsequent requests. The second issue could be more important than the first one, as is the case for this example.
The CallbackDataID can be removed by rewriting the measure in a different way, summing the value of two CALCULATE statements with different filters. For example, the Discounted Sales measure can be rewritten using two CALCULATE functions, one for each percentage, filtering the transactions that share the same multiplier. The following DAX query implements a version of Discounted Sales that does not rely on any CallbackDataID. The code is longer and requires KEEPFILTERS to provide the same semantics as in the original measure, producing the server timings results visible in Figure 20-32: DEFINE MEASURE Sales[Discounted Sales (scalable)] = CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) * .9, KEEPFILTERS ( Sales[Quantity] >= 3 ) ) + CALCULATE ( SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ), KEEPFILTERS ( NOT ( Sales[Quantity] >= 3 ) ) ) EVALUATE SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "Sales_Amount", 'Sales'[Sales Amount], "Discounted_Sales__slow_", 'Sales'[Discounted Sales (scalable)] )
FIGURE 20-32 Server Timings running the query for Discounted Sales (scalable) by product brand for the first time.
Actually, in this simple query the result is not faster at all. The query required 159 milliseconds instead of the 142 milliseconds of the “slow” version. However, we called this measure “scalable.” Indeed, the important advantage is that a second execution of the last query with a warm cache produces the results visible in Figure 20-33, whereas multiple executions of the query for the “slow” version always produce a result similar to the one shown in Figure 20-31.
FIGURE 20-33 Server Timings running the query for Discounted Sales (scalable) by product brand a second time.
The Server Timings in Figure 20-33 show that there is no SE CPU cost after the first execution of the query. This is important when a model is published on a server and many users open the same reports: users experience a faster response time, and the memory and CPU workload on the server side is reduced. This optimization is particularly relevant in environments with a fixed reserved capacity, such as Power BI Premium and Power BI Report Server.
The rule of thumb is to carefully consider the IF function in the expression of an iterator with a large cardinality because of the possible presence of CallbackDataID in the storage engine queries. The next section includes a deeper discussion on the impact of CallbackDataID, which might be required by many other DAX functions used in iterators.
Note The SWITCH function in DAX is similar to a series of nested IF functions and can be optimized in a similar way.
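As a sketch of how the same technique could apply to SWITCH, consider a hypothetical Tiered Sales measure with two discount levels. The measure names and the thresholds (20% from 5 units, 10% from 3 units) are assumptions made only for this example. The first version uses SWITCH inside the iterator, whereas the second version sums one CALCULATE for each tier, each one filtering the transactions that share the same multiplier:

```dax
DEFINE
    -- Hypothetical measure: SWITCH evaluated for every row of Sales
    MEASURE Sales[Tiered Sales (SWITCH)] =
        SUMX (
            Sales,
            Sales[Quantity] * Sales[Net Price]
                * SWITCH (
                    TRUE (),
                    Sales[Quantity] >= 5, .8,
                    Sales[Quantity] >= 3, .9,
                    1
                )
        )
    -- Hypothetical measure: one CALCULATE per tier, no conditional
    -- logic inside the iterators
    MEASURE Sales[Tiered Sales (filters)] =
        CALCULATE (
            SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) * .8,
            KEEPFILTERS ( Sales[Quantity] >= 5 )
        )
            + CALCULATE (
                SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) * .9,
                KEEPFILTERS ( Sales[Quantity] >= 3 && Sales[Quantity] < 5 )
            )
            + CALCULATE (
                SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ),
                KEEPFILTERS ( Sales[Quantity] < 3 )
            )
EVALUATE
SUMMARIZECOLUMNS (
    ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ),
    "Tiered Sales (SWITCH)", [Tiered Sales (SWITCH)],
    "Tiered Sales (filters)", [Tiered Sales (filters)]
)
ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
```

The tiers must be mutually exclusive and cover every quantity, otherwise a transaction would be counted twice or not at all. Whether the SWITCH version actually produces a CallbackDataID depends on the engine version, so the xmSQL queries of both measures should be checked in DAX Studio, running each measure in a separate query.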
Reducing the impact of CallbackDataID
In Chapter 19, you saw that the CallbackDataID function in a storage engine query can have a huge performance impact. This is because it slows down the storage engine execution, and it disables the use of the storage engine cache for the datacache produced. Identifying the CallbackDataID is important because this is often the reason behind a bottleneck in the storage engine, especially for models that only have a few million rows in their largest table (scan time should typically be in the order of magnitude of 10–100 milliseconds). For example, consider the following query where the Rounded Sales measure computes its result by rounding Net Price to the nearest integer. The report in Figure 20-34 displays the Rounded Sales amount for each product brand.
FIGURE 20-34 Rounded Sales reported by product brand.
The simplest implementation of Rounded Sales applies the ROUND function to every row of the Sales table. This results in a CallbackDataID call, which slows down the execution, thus lowering performance. The following query computes the slowest Rounded Sales measure in the previous report, generating the server timings results visible in Figure 20-35: DEFINE MEASURE Sales[Rounded Sales (slow)] = SUMX ( Sales, Sales[Quantity] * ROUND ( Sales[Net Price], 0 ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "Rounded_Sales", 'Sales'[Rounded Sales (slow)] ), [IsGrandTotalRowTotal], 0, 'Product'[Brand], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
FIGURE 20-35 Server Timings running the query for Rounded Sales (slow).
The two storage engine queries at lines 2 and 4 compute the value for each brand and for the grand total, respectively. This is the xmSQL query at line 2 of Figure 20-35: WITH $Expr0 := ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * [CallbackDataID ( ROUND ( Sales[Net Price]], 0 ) ) ] ( PFDATAID ( 'DaxBook Sales'[Net Price] ) ) ) SELECT 'DaxBook Product'[Brand], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
The Sales table contains more than 12 million rows, and each storage engine query makes an equivalent number of CallbackDataID calls to execute the ROUND function. Indeed, the formula engine executes the ROUND operation to round the Net Price value to the nearest integer. Based on the Server Timings report, we can estimate that the formula engine executes around 7,000 ROUND functions per millisecond. It is important to keep these numbers in mind, so that you can evaluate whether or not the cardinality of an iterator generating CallbackDataID calls would benefit from some amount of optimization. If the table contained 12,000 rows instead of 12 million rows, the priority would be to optimize something else. However, optimizing the measure in the current model requires reducing the number of CallbackDataID calls.
We aim to reduce the number of CallbackDataID calls by refactoring the measure. By looking at the information provided by VertiPaq Analyzer, we know that the Sales table has more than 12 million rows, whereas the Net Price column in the Sales table has fewer than 2,500 unique values. Accordingly, the formula can compute the same result by multiplying the rounded value of each unique Net Price value by the sum of Quantity for all the Sales transactions with the same Net Price.
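The two cardinalities used in this reasoning can also be retrieved with a simple query. The following sketch, whose column names are arbitrary labels, returns the number of rows in Sales and the number of unique values in the Net Price column:

```dax
EVALUATE
ROW (
    -- Number of rows iterated by the slow version of the measure
    "Rows in Sales", COUNTROWS ( Sales ),
    -- Number of values iterated by a version based on unique prices
    "Unique Net Price values", DISTINCTCOUNT ( Sales[Net Price] )
)
```

The larger the gap between the two numbers, the more an iteration over the unique values is likely to reduce the number of ROUND executions.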
Note You should always use the statistics of your data model during DAX optimization. A quick way to obtain these numbers for a data model is by using VertiPaq Analyzer (http://www.sqlbi.com/tools/vertipaq-analyzer/).
The following optimized version of Rounded Sales materializes up to 2,500 rows computing the sum of Quantity iterating the unique values of Net Price: DEFINE MEASURE Sales[Rounded Sales (fast)] = SUMX ( VALUES ( Sales[Net Price] ), CALCULATE ( SUM ( Sales[Quantity] ) ) * ROUND ( Sales[Net Price], 0 ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Brand], "IsGrandTotalRowTotal" ), "Rounded_Sales", 'Sales'[Rounded Sales (fast)] ), [IsGrandTotalRowTotal], 0, 'Product'[Brand], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Brand]
This way, the formula engine executes the ROUND function using the result of the datacache returning the sum of Quantity for each Net Price. Despite a larger materialization compared to the slow version, the time required to obtain the solution is reduced by almost one order of magnitude. Moreover, the results provided by the storage engine queries can be reused in following executions because the storage engine cache will store the result of xmSQL queries that do not have any CallbackDataID calls.
FIGURE 20-36 Server Timings running the query for Rounded Sales (fast).
The following is the xmSQL query at line 2 of Figure 20-36. This query returns the Net Price and the sum of the Quantity for each brand and does not have any CallbackDataID calls:
SELECT 'DaxBook Product'[Brand], 'DaxBook Sales'[Net Price], SUM ( 'DaxBook Sales'[Quantity] ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
In this latter version, the formula engine executes the rounding on the content of the datacache, instead of being called by the storage engine through CallbackDataID for every row scanned. Be mindful that a very large number of unique values in Net Price would require a bigger materialization, up to the point where the previous version could be faster with a different data distribution. If Net Price had millions of unique values, a benchmark comparison between the two solutions would be required in order to determine the optimal solution. Moreover, the result could be different depending on the hardware. Rather than assuming that one technique is better than another, you should always evaluate the performance using a real database and not just a sample before making a decision.
Finally, remember that most of the scalar DAX functions that do not aggregate data require a CallbackDataID if executed in an iterator. For example, DATE, VALUE, most of the type conversions, IFERROR, DIVIDE, and all the rounding, mathematical, and date/time functions are only implemented in the formula engine. Most of the time, their presence in an iterator generates a CallbackDataID call. However, you always have to check the xmSQL query to verify whether a CallbackDataID is present or not.
Optimizing nested iterators
Nested iterators in DAX cannot be merged into a single storage engine query. Only the innermost iterator can be executed using a storage engine query, whereas the outer iterators typically require either a larger materialization or additional storage engine queries. For example, consider another Cashback measure named “Cashback Sim.” that simulates a cashback for each customer using the current price of each product multiplied by the historical quantity and the cashback percentage of each customer. The report in Figure 20-37 displays the Cashback Sim. amount for each country.
FIGURE 20-37 Cashback Sim. reported by customer country.
The first and slowest implementation iterates the Customer and Product tables in order to retrieve the cashback percentage of the customer and the current price of the product, respectively. The innermost iterator retrieves the quantity sold for each combination of customer and product, multiplying it by Unit Price and Cashback %. The following query computes the slowest Cashback Sim. measure in the previous report, generating the server timings results visible in Figure 20-38: DEFINE MEASURE Sales[Cashback Sim. (slow)] = SUMX ( Customer, SUMX ( 'Product', SUMX ( RELATEDTABLE ( Sales ), Sales[Quantity] * 'Product'[Unit Price] * Customer[Cashback %] ) ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Customer'[Country], "IsGrandTotalRowTotal" ), "Cashback Sim. (slow)", 'Sales'[Cashback Sim. (slow)] ), [IsGrandTotalRowTotal], 0, 'Customer'[Country], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[Country]
FIGURE 20-38 Server Timings running the query for the Cashback Sim. (slow) measure reported by country.
The execution cost is split between the storage engine and the formula engine. The former pays a big price to produce a large materialization, whereas the latter spends time consuming that large set of materialized data. The storage engine queries at lines 2 and 10 of Figure 20-38 are identical and materialize the following columns for the entire Sales table: CustomerKey, ProductKey, Quantity, and RowNumber: SELECT 'DaxBook Customer'[CustomerKey], 'DaxBook Product'[ProductKey], 'DaxBook Sales'[RowNumber], 'DaxBook Sales'[Quantity] FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
The RowNumber is a special column inaccessible to DAX that is used to uniquely identify a row in a table. These four columns are used in the formula engine to compute the formula in the innermost iterator, which considers the sales for each combination of Customer and Product. The query at line 2 creates the datacache that is also returned at line 10, hitting the cache. The presence of this second storage engine query is caused by the need to compute the grand total in SUMMARIZECOLUMNS. Without the two levels of granularity in the result, half the query plan and half the storage engine queries would not be necessary.
The DAX measure iterates two tables (Customer and Product) producing all the possible combinations. For each combination of customer and product, the innermost SUMX function iterates only the corresponding rows in Sales. The formula also considers the combinations of Customer and Product that do not have any rows in the Sales table, potentially wasting precious CPU time. The query plan shows that there are 2,517 products and 18,869 customers; these are the same numbers estimated for the storage engine queries at lines 4 and 6 in Figure 20-38, respectively. Therefore, the formula engine performs 1,326,280 aggregations of the rows materialized from the Sales table, as shown in the excerpt of the query plan in Figure 20-39. The Records column shows the number of rows iterated, which are either read from the datacaches returned by storage engine queries (see the Cache nodes at lines 28, 33, and 36) or computed by other formula engine operations (see the CrossApply node at line 23).
FIGURE 20-39 Query Plan pane running the query for the Cashback Sim. (slow) measure reported by country.
Although the DAX code iterates the tables, the xmSQL code only retrieves the columns of the tables uniquely representing one row of each table. This reduces the number of columns materialized, even though the cardinality of the tables iterated is larger than necessary. At this point, there are two important considerations:
■ The cardinality of the iterators is larger than required. Thanks to the context transition, it is possible to reduce the cardinality of the outer iterators; that way, the query context considers all the rows in Sales for a given combination of Unit Price and Cashback %, instead of each combination of product and customer.
■ Removing nested iterators would produce a better query plan, also removing expensive materialization.
The first consideration should suggest applying the technique previously described to optimize the context transitions. Indeed, the RELATEDTABLE function is like a CALCULATETABLE without filter arguments that only performs a context transition. The first variation of the DAX measure is a “medium” version that iterates the Cashback % and Unit Price columns, instead of iterating by Customer and Product. The semantics of the query are still the same because the innermost expression only depends on these columns: DEFINE MEASURE Sales[Cashback Sim. (medium)] = SUMX ( VALUES ( Customer[Cashback %] ), SUMX ( VALUES ( 'Product'[Unit Price] ), SUMX ( RELATEDTABLE ( Sales ), Sales[Quantity] * 'Product'[Unit Price] * Customer[Cashback %] ) ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Customer'[Country], "IsGrandTotalRowTotal" ), "Cashback Sim. (medium)", 'Sales'[Cashback Sim. (medium)] ), [IsGrandTotalRowTotal], 0, 'Customer'[Country], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[Country]
Figure 20-40 shows that the execution of the “medium” version is orders of magnitude faster than the “slow” version, thanks to a smaller granularity and a simpler dependency between tables iterated and columns referenced.
FIGURE 20-40 Server Timings running the query for the Cashback Sim. (medium) measure reported by country.
The two storage engine queries provide a result for each of the cardinalities of the result. The following is the storage query at line 2, whereas the similar query at line 4 does not include the Country column and is used for the grand total: WITH $Expr0 := (
( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Product'[Unit Price] AS REAL ) ) * PFCAST ( 'DaxBook Customer'[Cashback] AS REAL ) )
SELECT 'DaxBook Customer'[Country], 'DaxBook Customer'[Cashback], 'DaxBook Product'[Unit Price], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
The “medium” version of the Cashback Sim. measure still contains the same number of nested iterators, potentially considering all the possible combinations between the values of the Unit Price and Cashback % columns. In this simple measure, the query plan is able to establish the dependencies on the Sales table, reducing the calculation to the existing combinations. However, there is an alternative DAX syntax to explicitly instruct the engine to only consider the existing combinations. Instead of using nested iterators, a single iterator over the result of a SUMMARIZE enforces a query plan that does not compute calculations over non-existing combinations. The following version named “improved” could produce a more efficient query plan in complex scenarios, even though in this example it generates the same result and query plan: MEASURE Sales[Cashback Sim. (improved)] = SUMX ( SUMMARIZE ( Sales, 'Product'[Unit Price], Customer[Cashback %] ), CALCULATE ( SUM ( Sales[Quantity] ) ) * 'Product'[Unit Price] * Customer[Cashback %] )
The “medium” and “improved” versions of the Cashback Sim. measure can easily be adapted to use existing measures in the innermost calculations. Indeed, the “improved” version uses a CALCULATE function to compute the sum of Sales[Quantity] for a given combination of Unit Price and Cashback %, just like a measure reference would. You should consider this approach to write efficient code that is easier to maintain. However, a more efficient version is possible by removing any nested iterators.
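As a sketch of the adaptation just described, the following query rewrites the “improved” version so that the innermost calculation is a measure reference. The Total Quantity measure and the name of the new Cashback measure are assumptions made for this illustration; they are not part of the original model.

```dax
DEFINE
    -- Hypothetical base measure, assumed for this example only
    MEASURE Sales[Total Quantity] =
        SUM ( Sales[Quantity] )
    -- Same structure as the "improved" version: the measure reference
    -- performs the context transition that CALCULATE made explicit
    MEASURE Sales[Cashback Sim. (improved with measure)] =
        SUMX (
            SUMMARIZE ( Sales, 'Product'[Unit Price], Customer[Cashback %] ),
            [Total Quantity] * 'Product'[Unit Price] * Customer[Cashback %]
        )
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Cashback Sim.", [Cashback Sim. (improved with measure)]
)
ORDER BY Customer[Country]
```

The business logic of the quantity calculation stays in one place, which is the maintainability benefit mentioned above.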
Note A measure definition often includes aggregation functions such as SUM. With the exception of DISTINCTCOUNT, simple aggregation functions are just a shorter syntax for an iterator. For example, SUM internally invokes SUMX. Hence, a measure reference in an iterator often implies the execution of another nested iterator with a context transition in the middle. When this is required by the nature of the calculation, this is a necessary computational cost. When the nested iterators are additive like the two nested SUMX/SUM of the Cashback Sim. (improved) measure, then a consolidation of the calculation may be considered to optimize the performance; however, this could affect the readability and reusability of the measure. The following “fast” version of the Cashback Sim. measure optimizes the performance, at the cost of reducing the ability to reuse the business logic of existing measures: DEFINE MEASURE Sales[Cashback Sim. (fast)] = SUMX ( Sales, Sales[Quantity] * RELATED ( 'Product'[Unit Price] ) * RELATED ( Customer[Cashback %] ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Customer'[Country], "IsGrandTotalRowTotal" ), "Cashback Sim. (fast)", 'Sales'[Cashback Sim. (fast)] ), [IsGrandTotalRowTotal], 0, 'Customer'[Country], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Customer'[Country]
Figure 20-41 shows the server timings information of the “fast” version, which saves more than 50% of the execution time compared to the “medium” and “improved” versions.
FIGURE 20-41 Server Timings running the query for the Cashback Sim. (fast) measure reported by country.
The measure with a single iterator without context transitions generates the following simple storage engine query, reported at line 2 of Figure 20-41: WITH $Expr0 := (
( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Product'[Unit Price] AS REAL ) ) * PFCAST ( 'DaxBook Customer'[Cashback] AS REAL ) )
SELECT 'DaxBook Customer'[Country], SUM ( @$Expr0 ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Customer' ON 'DaxBook Sales'[CustomerKey]='DaxBook Customer'[CustomerKey] LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey];
Using the RELATED function does not require any CallbackDataID. Indeed, the only consequence of RELATED is that it enforces a join in the storage engine to enable the access to the related column, which typically has a smaller performance impact compared to a CallbackDataID. However, the “fast” version of the measure is not suggested unless it is critical to obtain the last additional performance improvement and to keep the materialization at a minimal level.
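Before replacing one version with another, it is worth checking that they return the same numbers. The following query is a sketch that evaluates the “medium,” “improved,” and “fast” versions side by side; the column labels and the simplified grouping (no grand total row) are choices made for this illustration.

```dax
DEFINE
    MEASURE Sales[Cashback Sim. (medium)] =
        SUMX (
            VALUES ( Customer[Cashback %] ),
            SUMX (
                VALUES ( 'Product'[Unit Price] ),
                SUMX (
                    RELATEDTABLE ( Sales ),
                    Sales[Quantity] * 'Product'[Unit Price] * Customer[Cashback %]
                )
            )
        )
    MEASURE Sales[Cashback Sim. (improved)] =
        SUMX (
            SUMMARIZE ( Sales, 'Product'[Unit Price], Customer[Cashback %] ),
            CALCULATE ( SUM ( Sales[Quantity] ) )
                * 'Product'[Unit Price] * Customer[Cashback %]
        )
    MEASURE Sales[Cashback Sim. (fast)] =
        SUMX (
            Sales,
            Sales[Quantity]
                * RELATED ( 'Product'[Unit Price] )
                * RELATED ( Customer[Cashback %] )
        )
EVALUATE
-- One row per country, one column per version of the measure
SUMMARIZECOLUMNS (
    Customer[Country],
    "Medium", [Cashback Sim. (medium)],
    "Improved", [Cashback Sim. (improved)],
    "Fast", [Cashback Sim. (fast)]
)
ORDER BY Customer[Country]
```

The three columns are expected to match; if the columns involved use a floating-point data type, small differences in the last decimal digits are possible because the values are aggregated in a different order.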
Avoiding table filters for DISTINCTCOUNT We already mentioned that filter arguments in CALCULATE/CALCULATETABLE functions should be applied to columns instead of tables. The goal of this example on the same topic is to show you an additional query plan pattern that you might find in server timings. A side effect of a table filter is that it requires a large materialization to the storage engine, to enable the formula engine to compute the result. However, for non-additive expressions, the query plan might generate one storage engine query for each element included in the granularity of the result. The DISTINCTCOUNT aggregation is a simple and common example of a non-additive expression. For example, consider the report in Figure 20-42 that shows the number of customers that made purchases over $1,000 (Customers 1k) for each product name.
FIGURE 20-42 Customers with purchase amounts over $1,000 for each product.
The filter condition in the Customers 1k measure requires two columns. The less efficient way to implement such a condition is by using a filter over the Sales table. The following query computes the Customers 1k measure in the previous report, generating the server timings results visible in Figure 20-43: DEFINE MEASURE Sales[Customers 1k (slow)] = CALCULATE ( DISTINCTCOUNT ( Sales[CustomerKey] ), FILTER ( Sales, Sales[Quantity] * Sales[Net Price] > 1000 ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Product Name], "IsGrandTotalRowTotal" ), "Customers_1k__slow_", 'Sales'[Customers 1k (slow)] ), [IsGrandTotalRowTotal], 0, 'Product'[Product Name], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Product Name]
FIGURE 20-43 Server Timings running the query for the Customers 1k (slow) measure.
This query generates a large number of storage engine queries, one query for each product included in the result. Because each storage engine query requires 100 to 200 milliseconds, there are a total of several minutes of CPU cost, and the latency is below one minute just because of the parallelism of the storage engine. The first xmSQL query at line 2 of Figure 20-43 returns the list of product names, including Quantity and Net Price for the sales transactions of that product. Indeed, even though there are only 1,091 products used at least once in the Sales table in transactions with an amount greater than $1,000, the granularity of the datacache is larger because it also includes additional details other than the product name, returning more rows for the same product: SELECT 'DaxBook Product'[Product Name], 'DaxBook Sales'[Quantity], 'DaxBook Sales'[Net Price] FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] WHERE ( COALESCE ( ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) ) > COALESCE ( 1000.000000 ) );
There are 1,091 xmSQL queries that are very similar to the one at line 6 of Figure 20-43 and return a single value obtained with a distinct count aggregation. In this case, the filter condition has all the combinations of Quantity and Net Price that return a value greater than 1,000 for the Adventure Works 52″ LCD HDTV X790W Silver product: SELECT DCOUNT ( 'DaxBook Sales'[CustomerKey] ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] WHERE ( COALESCE ( ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) ) > COALESCE ( 1000.000000 ) ) VAND ( 'DaxBook Product'[Product Name], 'DaxBook Sales'[Quantity], 'DaxBook Sales'[Net Price] ) IN { ( 'Adventure Works 52" LCD HDTV X790W Silver', 2, 1592.200000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 4, 1432.980000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 1, 1273.760000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 3, 1480.746000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 4, 1512.590000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 3, 1592.200000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 3, 1353.370000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 4, 1273.760000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 1, 1480.746000 ) , ( 'Adventure Works 52" LCD HDTV X790W Silver', 1, 1592.200000 ) ..[24 total tuples, not all displayed]};
Indeed, the following xmSQL query at line 10 of Figure 20-43 only differs from the latter in the final filter condition, which includes valid combinations of Quantity and Net Price for the Contoso Washer & Dryer 21in E210 Blue product: SELECT DCOUNT ( 'DaxBook Sales'[CustomerKey] ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] WHERE ( COALESCE ( ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) ) > COALESCE ( 1000.000000 ) ) VAND ( 'DaxBook Product'[Product Name], 'DaxBook Sales'[Quantity], 'DaxBook Sales'[Net Price] ) IN { ( 'Contoso Washer & Dryer 21in E210 Blue', 2, 1519.050000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 2, 1279.200000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 2, 1359.150000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 4, 1487.070000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 3, 1439.100000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 3, 1519.050000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 3, 1359.150000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 2, 1599.000000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 1, 1439.100000 ) , ( 'Contoso Washer & Dryer 21in E210 Blue', 3, 1279.200000 ) ..[24 total tuples, not all displayed]};
The presence of multiple similar storage engine queries is also visible in the Query Plan pane shown in Figure 20-44. Each row starting at line 15 corresponds to a single datacache with just one column produced by one of the storage engine queries described before.
FIGURE 20-44 Query Plan pane running the query for Customers 1k (slow).
The presence of the table filter applied to the filter context forces a query plan that is not efficient. In this case, a table filter produces multiple storage engine queries instead of a single large materialization. However, the optimization required is always the same: Column filters are better than table filters in CALCULATE and CALCULATETABLE. The optimized version of the Customers 1k measure applies a filter over the two columns Quantity and Net Price, using KEEPFILTERS in order to preserve the filter semantics of the original measure. The following query produces the Server Timings results visible in Figure 20-45: DEFINE MEASURE Sales[Customers 1k (fast)] = CALCULATE ( DISTINCTCOUNT ( Sales[CustomerKey] ), KEEPFILTERS ( FILTER ( ALL ( Sales[Quantity], Sales[Net Price] ), Sales[Quantity] * Sales[Net Price] > 1000 ) ) ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( 'Product'[Product Name], "IsGrandTotalRowTotal" ), "Customers_1k__fast_", 'Sales'[Customers 1k (fast)] ), [IsGrandTotalRowTotal], 0, 'Product'[Product Name], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Product'[Product Name]
FIGURE 20-45 Server Timings running the query for Customers 1k (fast).
The column filter in CALCULATE simplifies the query plan, which now only requires two storage engine queries, one for each granularity level of the result (one product versus total of all products). The following is the xmSQL query at line 4 in Figure 20-45: SELECT 'DaxBook Product'[Product Name], DCOUNT ( 'DaxBook Sales'[CustomerKey] ) FROM 'DaxBook Sales' LEFT OUTER JOIN 'DaxBook Product' ON 'DaxBook Sales'[ProductKey]='DaxBook Product'[ProductKey] WHERE ( COALESCE ( ( CAST ( PFCAST ( 'DaxBook Sales'[Quantity] AS INT ) AS REAL ) * PFCAST ( 'DaxBook Sales'[Net Price] AS REAL ) ) ) > COALESCE ( 1000.000000 ) );
The datacache obtained corresponds to the result of the DAX query. The formula engine does not have to do any further processing. This is an optimal condition for the performance of this query. The lesson here is that the number of storage engine queries can also matter. A large number of storage engine queries might be the result of a bad query plan. Non-additive measures combined with table filters or bidirectional filters could be one of the reasons for this behavior, impacting performance in a negative way.
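When applying this optimization, one might want to confirm that the two versions return the same counts without paying the full cost of the slow query. The following sketch compares them on a subset of products; restricting the query to the Fabrikam brand through TREATAS is an assumption made here only to keep the slow measure manageable, and the column labels are arbitrary.

```dax
DEFINE
    -- Table filter: one storage engine query per product
    MEASURE Sales[Customers 1k (slow)] =
        CALCULATE (
            DISTINCTCOUNT ( Sales[CustomerKey] ),
            FILTER ( Sales, Sales[Quantity] * Sales[Net Price] > 1000 )
        )
    -- Column filter over Quantity and Net Price only
    MEASURE Sales[Customers 1k (fast)] =
        CALCULATE (
            DISTINCTCOUNT ( Sales[CustomerKey] ),
            KEEPFILTERS (
                FILTER (
                    ALL ( Sales[Quantity], Sales[Net Price] ),
                    Sales[Quantity] * Sales[Net Price] > 1000
                )
            )
        )
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Product Name],
    -- Assumed filter: limits the comparison to a single brand
    TREATAS ( { "Fabrikam" }, 'Product'[Brand] ),
    "Slow", [Customers 1k (slow)],
    "Fast", [Customers 1k (fast)]
)
ORDER BY 'Product'[Product Name]
```

The Server Timings pane for this query still shows the many small storage engine queries generated by the slow measure, so the two patterns can be observed next to each other.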
Avoiding multiple evaluations by using variables When a DAX expression evaluates the same subexpression multiple times, it is usually a good idea to store the result of the subexpression in a variable, referencing the variable name in following parts of the original DAX expression. The use of variables is a best practice which improves code readability and can provide a better and more efficient query plan—with just some exceptions described later in this section. For example, the report in Figure 20-46 shows a Sales YOY % measure computing the percentage difference between the value of Sales Amount displayed in the row of the report and the corresponding value in the previous year.
FIGURE 20-46 Difference in sales year over year reported by year and month.
The Sales YOY % measure uses other measures internally. In order to be able to modify each part of the calculation, it is useful to include all the underlying measures using the Define Dependent Measure feature in DAX Studio. The following query computes the original Sales YOY % (slow) measure in the previous report, generating the server timings results visible in Figure 20-47: DEFINE MEASURE Sales[Sales PY] = CALCULATE ( [Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) ) MEASURE Sales[Sales YOY (slow)] = IF ( NOT ISBLANK ( [Sales Amount] ) && NOT ISBLANK ( [Sales PY] ), [Sales Amount] - [Sales PY] ) MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) MEASURE Sales[Sales YOY % (slow)] = DIVIDE ( [Sales YOY (slow)], [Sales PY] ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( ROLLUPGROUP ( 'Date'[Calendar Year Month], 'Date'[Calendar Year Month Number] ), "IsGrandTotalRowTotal" ), "Sales_YOY____slow_", 'Sales'[Sales YOY % (slow)] ), [IsGrandTotalRowTotal], 0, 'Date'[Calendar Year Month Number], 1, 'Date'[Calendar Year Month], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Date'[Calendar Year Month Number], 'Date'[Calendar Year Month]
FIGURE 20-47 Server Timings running the query for the Sales YOY % (slow) measure.
The description of the query plan includes 1,819 rows, not reported here. Moreover, there are four storage engine queries retrieved by the storage engine cache (SE Cache), even though we executed a clear cache command before running the query. This indicates that different parts of the query plan generate different requests for the same storage engine query. Although the cache improves the performance of the storage engine request, the presence of such redundancy in the query plan is an indicator that there is room for further improvements. When a query plan is so complex and there are many storage engine queries, it is a good idea to review the DAX code and reduce redundant evaluations by using variables. Indeed, redundant evaluations could be responsible for these duplicated requests. In general, the DAX engine should be able to locate similar subexpressions executed within the same filter context, and reuse their results without multiple evaluations. However, the presence of logical conditions such as IF and SWITCH creating different branches of execution can easily stop this internal optimization. For example, consider the Sales YOY (slow) measure implementation: the Sales Amount and Sales PY measures are executed in different branches of the evaluation. The first argument of the IF function must always be evaluated, whereas the second argument should only be evaluated whenever the first argument evaluates to TRUE. A DAX expression that is present in both the first and the second argument might be evaluated twice in the query plan, which might not consider the result obtained for the first argument as something that can be reused when evaluating the second argument. The technical reasons why this happens and when it turns out to be preferable are outside the scope of this book. The following excerpt of the previous query highlights the measure references that might be evaluated twice because they are in both the first and the second argument: MEASURE Sales[Sales YOY (slow)] = IF ( NOT ISBLANK ( [Sales Amount] ) && NOT ISBLANK ( [Sales PY] ), [Sales Amount] - [Sales PY] )
By storing the values returned by the two measures Sales Amount and Sales PY in two variables, it is possible to instruct the DAX engine to enforce a single evaluation of the two measures before the IF condition, reusing the result in both the first and the second argument. The following excerpt of the Sales YOY (fast) measure shows how to implement this technique in the DAX code: MEASURE Sales[Sales YOY (fast)] = VAR SalesPY = [Sales PY] VAR SalesAmount = [Sales Amount] RETURN IF ( NOT ISBLANK ( SalesAmount ) && NOT ISBLANK ( SalesPY ), SalesAmount - SalesPY )
The following query includes a full implementation of the Sales YOY % (fast) measure, which internally relies on Sales YOY (fast) instead of Sales YOY (slow). The execution of the query produces the server timings results visible in Figure 20-48: DEFINE MEASURE Sales[Sales PY] = CALCULATE ( [Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) ) MEASURE Sales[Sales YOY (fast)] = VAR SalesPY = [Sales PY] VAR SalesAmount = [Sales Amount] RETURN IF ( NOT ISBLANK ( SalesAmount ) && NOT ISBLANK ( SalesPY ), SalesAmount - SalesPY ) MEASURE Sales[Sales Amount] = SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) MEASURE Sales[Sales YOY % (fast)] = DIVIDE ( [Sales YOY (fast)], [Sales PY] ) EVALUATE TOPN ( 502, SUMMARIZECOLUMNS ( ROLLUPADDISSUBTOTAL ( ROLLUPGROUP ( 'Date'[Calendar Year Month], 'Date'[Calendar Year Month Number] ), "IsGrandTotalRowTotal" ), "Sales_YOY____fast_", 'Sales'[Sales YOY % (fast)] ), [IsGrandTotalRowTotal], 0, 'Date'[Calendar Year Month Number], 1, 'Date'[Calendar Year Month], 1 ) ORDER BY [IsGrandTotalRowTotal] DESC, 'Date'[Calendar Year Month Number], 'Date'[Calendar Year Month]
FIGURE 20-48 Server Timings running the query for Sales YOY % (fast).
The description of the query plan includes 488 rows (not reported here), reducing the complexity of the query plan by 73%; the previous query plan was 1,819 rows long. The new query plan reduces the cost for the storage engine in terms of both execution time and number of queries, and it also reduces the execution time in the formula engine. Overall, the optimized measure reduces the execution time by about 50%, but the optimization could be even bigger in more complex models and expressions. If the same optimization were applied to nested measures, the improvement might be exponential. However, pay attention to possible side effects of assigning variables before conditional statements. Only the subexpressions used in the first argument can be assigned to variables defined before an IF or SWITCH statement; otherwise, the effect could be the opposite, enforcing the evaluation of expressions that would otherwise be ignored. You should follow these guidelines:
■ When the same DAX expression is evaluated multiple times within the same filter context, assign it to a variable and reference the variable instead of the DAX expression.
■ When a DAX expression is evaluated within the branches of an IF or SWITCH, whenever necessary assign the expression to a variable within the conditional branch.
■ Do not assign a variable outside an IF or SWITCH statement if the variable is only used within the conditional branch.
■ The first argument of IF and SWITCH can use variables defined before IF and SWITCH without it affecting performance.
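The guidelines can be illustrated with a small sketch. The Sales Growth (example) measure below is hypothetical and written only for this purpose; it is not one of the measures analyzed in this section, and its condition is intentionally simpler than the one used by Sales YOY. The variable used by the condition is defined before IF, whereas the variable used only by the conditional branch is defined inside that branch.

```dax
DEFINE
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
    MEASURE Sales[Sales PY] =
        CALCULATE ( [Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
    -- Hypothetical measure, written only to illustrate the guidelines
    MEASURE Sales[Sales Growth (example)] =
        VAR SalesAmount = [Sales Amount]    -- used by the condition: defined before IF
        RETURN
            IF (
                NOT ISBLANK ( SalesAmount ),
                VAR SalesPY = [Sales PY]    -- used only in this branch: defined inside it
                RETURN
                    DIVIDE ( SalesAmount - SalesPY, SalesPY )
            )
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Calendar Year Month],
    'Date'[Calendar Year Month Number],
    "Sales Growth", [Sales Growth (example)]
)
ORDER BY 'Date'[Calendar Year Month Number]
```

Defining SalesPY before the IF would still return the same result, but it could enforce the evaluation of Sales PY for the cells where the branch is never taken.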
More examples about these guidelines are included in this article: https://www.sqlbi.com/articles/optimizing-if-and-switch-expressions-using-variables/
Implementing alternative conditional statements In the last example we used a simple IF statement to show a possible optimization using variables. While using variables is a best practice, it is worth mentioning that there are alternative ways to express the same conditional logic in DAX. For example, whenever an IF function returns a numeric value and the expression of the second argument does not raise an execution error when the condition of the first argument is FALSE, it is possible to convert this code: IF ( <condition>, <expression> )
Into: <expression> * <condition>
For example, the Sales YOY (fast) measure can be implemented using this expression: MEASURE Sales[Sales YOY (fast)] = ( [Sales Amount] - [Sales PY] ) * ( NOT ISBLANK ( [Sales Amount] ) && NOT ISBLANK ( [Sales PY] ) )
The result produces only 208 rows in the query plan, despite a very similar query duration. Nevertheless, in more complex models the reduction of the query plan might have more visible benefits. However, different versions of the engine will tend to produce different results. Consider this alternative coding style one of the options available in case you need to further optimize your code. Do not apply such techniques without checking the effects on performance and query plans, verifying whether they improve performance and whether they are worth reducing the readability of your code.
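One way to check the effects of this alternative coding style is to evaluate the two formulations side by side before adopting one of them. The following query is a sketch: the measure names Sales YOY (if) and Sales YOY (multiply) are assumptions introduced here to keep the two versions distinct.

```dax
DEFINE
    MEASURE Sales[Sales Amount] =
        SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )
    MEASURE Sales[Sales PY] =
        CALCULATE ( [Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
    -- Version based on IF and variables
    MEASURE Sales[Sales YOY (if)] =
        VAR SalesPY = [Sales PY]
        VAR SalesAmount = [Sales Amount]
        RETURN
            IF (
                NOT ISBLANK ( SalesAmount ) && NOT ISBLANK ( SalesPY ),
                SalesAmount - SalesPY
            )
    -- Version that multiplies the expression by the condition
    MEASURE Sales[Sales YOY (multiply)] =
        ( [Sales Amount] - [Sales PY] )
            * ( NOT ISBLANK ( [Sales Amount] ) && NOT ISBLANK ( [Sales PY] ) )
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Calendar Year Month],
    'Date'[Calendar Year Month Number],
    "YOY (if)", [Sales YOY (if)],
    "YOY (multiply)", [Sales YOY (multiply)]
)
ORDER BY 'Date'[Calendar Year Month Number]
```

Where the condition is FALSE, the two columns may differ (for example, a zero instead of a blank), so the comparison should cover both the numbers returned and the query plans produced.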
Conclusions The lesson in this last chapter (to be honest, in the entire book) is that you must consider all the factors that affect a query plan in order to find the real bottleneck. Looking at the percentages of FE and SE shown in server timings is a good starting point, but you should always investigate the reason behind the numbers. Tools like DAX Studio and VertiPaq Analyzer provide you with the ability to measure the effects of a bad query plan, but these are only clues and pieces of evidence pointing to the reasons for a slow query. Welcome to the DAX world!
Index Numbers 1:1 relationships (data models), 2
A active relationships ambiguity, 514–515 CALCULATETABLE function, 451–453 expanded tables and, 450–453 USERELATIONSHIP function, 450–451 ADDCOLUMNS function, 223–224, 366–369, 371–372 ADDCOLUMNS iterators, 196–199 ADDMISSINGITEMS function authoring queries, 419–420, 432–433 auto-exists feature (queries), 432–433 aggregation functions, xmSQL queries, 625–627 aggregations, 568–571 in data models, 587–588, 647–648 SE, 548 VertiPaq aggregations, managing, 604–607 aggregators, 42, 43, 44, 45–46 AVERAGE function, 43–44 AVERAGEX function, 44 COUNT function, 46 COUNTA function, 46 COUNTBLANK function, 46 COUNTROWS function, 46 DISTINCTCOUNT function, 46 DISTINCTCOUNTNOBLANK function, 46 MAX function, 43 MIN function, 43 SUM function, 42–43, 44–45 SUMX function, 45 ALL function, 464–465 ALLEXCEPT function versus, 326–328 CALCULATE function and, 125–132, 164, 169–172
calculated physical relationships, circular dependencies, 478 columns and, 64–65 computing percentages, 125–132 context transitions, avoiding, 328–330 evaluation contexts, 100–101 filter contexts, 324–326, 327–330 measures and, 63–64 nonworking days between two dates, computing, 523–525 percentages, computing, 63–64 syntax of, 63 top categories/subcategories example, 66–67 VALUES function and, 67, 327–328 ALL* functions, 462–464 ALLCROSSFILTERED function, 464, 465 ALLEXCEPT function, 65–66, 464, 465 ALL function versus, 326–328 computing percentages, 135 filter contexts, 326–328 VALUES function versus, 326–328 ALLNOBLANKROW function, 464, 465, 478 ALLSELECTED function, 74–75, 76, 455–457, 464, 465 CALCULATE function and, 171–172 computing percentages, 75–76 iterated rows, returning, 460–462 shadow filter contexts, 459–462 alternate/primary keys column (tables), 599, 600 ambiguity in relationships, 512–513 active relationships, 514–515 non-active relationships, 515–517 Analysis Services 2012/2014 and CallbackDataID function, 644 annual totals (moving), computing, 243–244 arbitrarily shaped filters, 336 best practices, 343 building, 338–343
711
arbitrarily shaped filters column filters versus, 336 defined, 337–338 simple filters versus, 337 uses of, 343 arithmetic operators, 23 error-handling division by zero, 32–33 empty/missing values, 33–35 xmSQL queries, 627 arrows (cross filter direction), 3 attributes, data model optimization disabling attribute hierarchies, 604 optimizing drill-through attributes, 604 authoring queries, 395 ADDMISSINGITEMS function, 419–420, 432–433 auto-exists feature, 428–434 DAX Studio, 395 DEFINE sections MEASURE keyword in, 399 VAR keyword in, 397–399 EVALUATE statements ADDMISSINGITEMS function, 419–420, 432–433 example of, 396 expression variables and, 398 GENERATE function, 414–417 GENERATEALL function, 417 GROUPBY function, 420–423 ISONORAFTER function, 417–419 NATURALINNERJOIN function, 423–425 NATURALLEFTOUTERJOIN function, 423–425 query variables and, 398 ROW function, 400–401 SAMPLE function, 427–428 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function, 401–403, 433–434 SUMMARIZECOLUMNS function, 403–409, 429–434 syntax of, 396–399 TOPN function, 409–414 TOPNSKIP function, 420 expression variables, 397–399 GENERATE function, 414–417 GENERATEALL function, 417 GROUPBY function, 420–423 ISONORAFTER function, 417–419 MEASURE in DEFINE sections, 399
712
measures query measures, 399 testing, 399–401 NATURALINNERJOIN function, 423–425 NATURALLEFTOUTERJOIN function, 423–425 query variables, 397–399 ROW function, testing measures, 400–401 SAMPLE function, 427–428 shadow filter contexts, 457–462 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function, 401–403, 433–434 SUMMARIZECOLUMNS function, 403–409, 429–434 TOPN function, 409–414 TOPNSKIP function, 420 VAR in DEFINE sections, 397–399 Auto Date/Time (Power BI), 218–219 auto-exists feature (queries), 428–434 automatic date columns (Power Pivot for Excel), 219 AVERAGE function, 43–44, 199 AVERAGEA function, returning averages, 199 averages (means) computing averages, AVERAGEX function, 199–201 moving averages, 201–202 returning averages AVERAGE function, 199 AVERAGEA function, 199 AVERAGEX function, 44 computing averages, 199–201 filter contexts, 111–112 AVERAGEX iterators, 188
B batch events (xmSQL queries), 630–632 bidirectional cross-filter direction (physical relationships), 490, 491–493, 507 bidirectional filtering (relationships), 3–4 bidirectional relationships, 106, 109 Binary data type, 23 BLANK function, 36 blank rows, invalid relationships, 68–71 Boolean calculated columns, data model optimization, 597–598 Boolean conditions, CALCULATE function, 119–120, 123–124 Boolean data type, 22
calculation items Boolean logic, 23 bottlenecks, DAX optimization, 667–668 identifying SE/FE bottlenecks, 667–668 optimizing bottlenecks, 668 bridge tables, MMR (Many-Many Relationships), 494–499 budget/sales information (calculations), showing together, 527–530
C CALCULATE function, 115 ALL function, 125–132, 164, 169–172 ALLSELECTED function, 171–172 Boolean conditions, 119–120, 123–124 calculated physical relationships, circular dependencies, 478–480 calculation items, applying to expressions, 291–299 circular dependencies, 161–164 computing percentages, 124, 135 ALL function, 125–132 ALLEXCEPT function, 135 VALUES function, 133–134 context transitions, 148, 151–154 calculated columns, 154–157 measures, 157–160 CROSSFILTER function, 168 evaluation contexts, 79 evaluation order, 144–148 filter arguments, 118–119, 122, 123, 445–447 filter contexts, 148–151 filtering multiple columns, 140–143 a single column, 138–140 KEEPFILTERS function, 135–138, 139–143, 164, 168–169 evaluation order, 146–148 filtering multiple columns, 142–143 moving averages, 201–202 numbering sequences of events (calculations), 537–538 overwriting filters, 120–122, 136 Precedence calculation group, 299–304 range-based relationships (calculated physical relationships), 474–476 RELATED function and, 443–444 row contexts, 148–151 rules for, 172–173
semantics of, 122–123 syntax of, 118, 119–120 table filters, 382–384, 445–447 time intelligence calculations, 228–232 transferring filters, 482–483, 484–485 UNION function and, 376–378 USERELATIONSHIP function, 164–168 calculated columns, 25–26 Boolean calculated columns, data model optimization, 597–598 context transitions, 154–157 data model optimization, 595–599 DISTINCT function, 68 expressions, 29 measures, 42 choosing between calculated columns and measures, 29–30 differences between calculated columns and measures, 29 using measures in calculated columns, 30 processing, 599 RELATED function, 443–444 SUM function, evaluation contexts, 88–89 table functions, 59 VALUES function, 68 calculated physical relationships, 471 circular dependencies, 476–480 multiple-column relationships, 471–473 range-based relationships, 474–476 calculated tables, 59 creating, 390–391 DISTINCT function, 68 SELECTCOLUMNS function, 390–391 VALUES function, 68 CALCULATETABLE function, 115, 363 active relationships, 451–453 FILTER function versus, 363–365 time intelligence functions, 259, 260–261 calculation granularity and iterators, 211–214 calculation groups, 279–281 calculation items and, 288 creating, 281–288 defined, 288 Name calculation group, 288 Precedence calculation group, 288, 299–304 calculation items applying to expressions, 291 CALCULATE function, 291–299
DATESYTD function, 293–296 YTD calculations, 294 best practices, 311 calculation groups and, 288 Expression calculation item, 289 format strings, 289–291 including/excluding measures from calculation items, 304–306 Name calculation item, 288 Ordinal values, 289 properties of, 288–289 sideways recursion, 306–311 YOY calculation item, 289–290 YOY% calculation item, 289–290 calculations budget/sales information (calculations), showing together, 527–530 nonworking days between two dates, computing, 523–525 precomputing values (calculations), computing work days between two dates, 525–527 sales computing previous year sales up to last day sales (calculations), 539–544 computing same-store sales, 530–536 showing budget/sales information together, 527–530 syntax of, 17–18 work days between two dates, computing, 519–523 nonworking days, 523–525 precomputing values (calculations), 525–527 CALENDAR function, building date tables, 222 CALENDARAUTO function, building date tables, 222–224 calendars (custom), time intelligence calculations, 272 DATESYTD function, 276–277 weeks, 272–275 CallbackDataID function Analysis Services 2012/2014 and, 644 DAX optimization, 690–693 parallelism and, 641 VertiPaq and, 640–644 capturing DAX queries, 609–611 cardinality columns (tables) data model optimization, 591–592 optimizing high-cardinality columns, 603
iterators, 188–190 relationships (data models), 489–490, 586–587, 590–591 Cardinality column (VertiPaq Analyzer), 581, 583 categories/subcategories example, ALL function and, 66–67 cells (Excel), 5 chains (relationships), 3 circular dependencies CALCULATE function and, 161–164 calculated physical relationships, 476–480 code documentation, variables, 183–184 code maintenance/readability, FILTER function, 62–63 column filters arbitrarily shaped filters versus, 336 defined, 336 columnar databases, 550–553 columns (tables), 5–7 ADDCOLUMNS function, 223–224, 366–369, 371–372 ADDCOLUMNS iterators, 196–199 ALL function and, 64–65 ALLEXCEPT function and, 65–66 automatic date columns (Power Pivot for Excel), 219 Boolean calculated columns, data model optimization, 597–598 calculated columns, 25–26, 42, 443–444 Boolean calculated columns, 597–598 choosing between calculated columns and measures, 29–30 context transitions, 154–157 data model optimization, 595–599 differences between calculated columns and measures, 29 DISTINCT function, 68 expressions, 29 processing, 599 SUM function, 88–89 table functions, 59 using measures in calculated columns, 30 VALUES function, 68 cardinality data model optimization, 591–592 optimizing high-cardinality columns, 603 Date column, data model optimization, 592–595 defined, 2 descriptive attributes column (tables), 600, 601–602 filtering
CALCULATE function, 138–140 multiple columns, 140–143 a single column, 138–140 table filters versus, 444–447 measures, evaluation contexts, 89–90 multiple columns DISTINCT function and, 71 VALUES function and, 71 primary/alternate keys column (tables), 599, 600 qualitative attributes column (tables), 599, 600 quantitative attributes column (tables), 599, 600–601 referencing, 17–18 relationships, 3 row contexts, 87 SELECTCOLUMNS function, 390–391, 393–394 SELECTCOLUMNS iterators, 196, 197–199 split optimization, 602–603 storage optimization, 602 column split optimization, 602–603 high-cardinality columns, 603 storing, 601–602 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function and, 401 SUMMARIZECOLUMNS function, 403–409, 429–434 technical attributes column (tables), 600, 602 Time column, data model optimization, 592–595 VertiPaq Analyzer, 580–583 Columns # column (VertiPaq Analyzer), 582 Columns Hierarchies Size column (VertiPaq Analyzer), 582 Columns Total Size column (VertiPaq Analyzer), 581 COMBINEVALUES function, multiple-column relationships (calculated physical relationships), 472–473 comments at the end of expressions, 18 expressions, comment placement in expressions, 18 multi-line comments, 18 single-line comments, 18 comparison operators, 23 composite data models, 646–647 DirectQuery mode, 488 VertiPaq mode, 488 compression (VertiPaq), 553–554 hash encoding, 555–556 re-encoding, 559
RLE, 556–559 value encoding, 554–555 CONCATENATEX function iterators and, 194–196 tables as scalar values, 74 conditional statements, 24–25, 708–709 conditions DAX, 11 SQL, 11 CONTAINS function tables and, 387–388 transferring filters, 481–482, 484–485 CONTAINSROW function and tables, 387–388 context transitions, 148 ALL function and, 328–330 CALCULATE function and, 151–154 calculated columns, 154–157 DAX optimization, 672–678 expanded tables, 454–455 iterators, leveraging context transitions, 190–194 measures, 157–160 time intelligence functions, 260 conversion functions, 51 CURRENCY function, 51 DATE function, 51, 52 DATEVALUE function, 51 FORMAT function, 51 INT function, 51 TIME function, 51, 52 VALUE function, 51 conversions, error-handling, 31–32 cores (number of), VertiPaq hardware selection, 574, 576 COUNT function, 46 COUNTA function, 46 COUNTBLANK function, 46 COUNTROWS function, 46 filter contexts and relationships, 109 nested row contexts on the same table, 92–95 tables as scalar values, 73 CPU model, VertiPaq hardware selection, 574–575 cross-filter directions (physical relationships), 3, 490 bidirectional cross-filter direction, 490, 491–493, 507 single cross-filter direction, 490 cross-filtering, data model optimization, 590 cross-island relationships, 489 CROSSFILTER function bidirectional relationships, 109 CALCULATE function and, 168
CROSSJOIN function and tables, 372–374, 383–384 Currency data type, 21 CURRENCY function, 51 custom calendars, time intelligence calculations, 272 DATESYTD function, 276–277 weeks, 272–275 customers (new), computing (tables), 380–381, 386–387
D Daily AVG calculation group precedence, 299–303 calculation items, including/excluding measures, 304–306 data lineage, 332–336, 465–468 data models aggregations, 647–648 composite data models, 646–647 DirectQuery mode, 488 VertiPaq mode, 488 defined, 1–2 optimizing with VertiPaq, 579 aggregations, 587–588, 604–607 calculated columns, 595–599 choosing columns for storage, 599–602 column cardinality, 591–592 cross-filtering, 590 Date column, 592–595 denormalizing data, 584–591 disabling attribute hierarchies, 604 gathering data model information, 579–584 optimizing column storage, 602–603 optimizing drill-through attributes, 604 relationship cardinality, 586–587, 590–591 Time column, 592–595 relationships, 2 1:1 relationships, 2 active relationships, 450–453 bidirectional filtering, 3–4 cardinality, 586–587, 590–591 chains, 3 columns, 3 cross filter direction, 3 DAX and SQL, 9 directions of, 3–4 many-sided relationships, 2, 3
one-sided relationships, 2, 3 Relationship reports (VertiPaq Analyzer), 584 unidirectional filtering, 4 weak relationships, 2 single data models DirectQuery mode, 488 VertiPaq mode, 488 tables, defined, 2 weak relationships, 439 data refreshes, SSAS (SQL Server Analysis Services), 549–550 Data Size column (VertiPaq Analyzer), 581 data types, 19 Binary data type, 23 Boolean data type, 22 Currency data type, 21 DateTime data type, 21–22 Decimal data type, 21 Integer data type, 21 operators, 23 arithmetic operators, 23 comparison operators, 23 logical operators, 23 overloading, 19–20 parenthesis operators, 23 text concatenation operators, 23 string/number conversions, 19–21 strings, 22 Variant data type, 22 Database Size % column (VertiPaq Analyzer), 582 databases (columnar), 550–553 datacaches FE, 547 SE, 547 VertiPaq, 549, 635–637 DATATABLE function, creating static tables, 392–393 Date column, data model optimization, 592–595 DATE function, 51, 52 date table templates (Power Pivot for Excel), 220 date tables building, 220–221 ADDCOLUMNS function, 223–224 CALENDAR function, 222 CALENDARAUTO function, 222–224 date templates, 224 duplicating, 227 loading from other data sources, 221
Mark as Date Table, 232–233 multiple dates, managing, 224 multiple date tables, 226–228 multiple relationships to date tables, 224–226 naming, 221 date templates, 224 date/time-related calculations, 217 Auto Date/Time (Power BI), 218–219 automatic date columns (Power Pivot for Excel), 219 basic calculations, 228–232 basic functions, 233–235 CALCULATE function, 228–232 CALCULATETABLE function, 259, 260–261 context transitions, 260 custom calendars, 272 DATESYTD function, 276–277 weeks, 272–275 date tables ADDCOLUMNS function, 223–224 building, 220–224 CALENDAR function, 222 CALENDARAUTO function, 222–224 date table templates (Power Pivot for Excel), 220 date templates, 224 duplicating, 227 loading from other data sources, 221 managing multiple dates, 224–228 Mark as Date Table, 232–233 multiple date tables, 226–228 multiple relationships to date tables, 224–226 naming, 221 DATEADD function, 237–238, 262–269 DATESINPERIOD function, 243–244 DATESMTD function, 259, 276–277 DATESQTD function, 259, 276–277 DATESYTD function, 259, 260, 261–262, 276–277 differences over previous periods, computing, 241–243 drillthrough operations, 271 FILTER function, 228–232 FIRSTDATE function, 269, 270 FIRSTNONBLANK function, 256–257, 270–271 LASTDATE function, 248–249, 254, 255, 269–270 LASTNONBLANK function, 250–254, 255, 270–271 mixing functions, 239–241
moving annual totals, computing, 243–244 MTD calculations, 235–236, 259–262, 276–277 nested functions, call order of, 245–246 NEXTDAY function, 245–246 nonworking days between two dates, computing, 523–525 opening/closing balances, 254–258 PARALLELPERIOD function, 238–239 periods to date, 259–262 PREVIOUSMONTH function, 239 QTD calculations, 235–236, 259–262, 276–277 SAMEPERIODLASTYEAR function, 237, 245–246 semi-additive calculations, 246–248 STARTOFQUARTER function, 256–257 time periods, computing from prior periods, 237–239 work days between two dates, computing, 519–523 nonworking days, 523–525 precomputing values (calculations), 525–527 YTD calculations, 235–236, 259–262, 276–277 DATEADD function, time intelligence calculations, 237–238, 262–269 DATESINPERIOD function, computing moving annual totals, 243–244 DATESMTD function, time intelligence calculations, 259, 276–277 DATESQTD function, time intelligence calculations, 259, 276–277 DATESYTD function calculation items, applying to expressions, 293–296 time intelligence calculations, 259, 260, 261–262, 276–277 DateTime data type, 21–22 DATEVALUE function, 51 DAX (Data Analysis eXpressions), 1 conditions, 11 data models defined, 1–2 relationships, 2–4 tables, 2 date templates, 224 DAX and, cells and tables, 5–7 Excel and functional languages, 7 theories, 8–9 expressions
identifying a single DAX expression for optimization, 658–661 optimizing bottlenecks, 668 as functional language, 10 functions, 6–7 iterators, 8 MDX, 12 hierarchies, 13–14 leaf-level calculations, 14 multidimensional versus tabular space, 12 as programming language, 12–13 as querying language, 12–13 queries, 613 optimizing, 657 bottlenecks, 668 CallbackDataID function, 690–693 change implementation, 668 conditional statements, 708–709 context transitions, 672–678 creating reproduction queries, 661–664 DISTINCTCOUNT function, 699–704 to-do list, 658 filter conditions, 668–672 identifying a single DAX expression for optimization, 658–661 identifying SE/FE bottlenecks, 667–668 IF conditions, 678–690 multiple evaluations, avoiding with variables, 704–708 nested iterators, 693–699 query plans, 664–667 rerunning test queries, 668 server timings, 664–667 variables, 704–708 Power BI and, 14–15 as programming language, 10–11 queries capturing, 609–611 creating reproduction queries, 661–662 DISTINCTCOUNT function, 634–635 executing, 546 query plans, 612–613 collecting, 613–614 DAX Studio, 617–620 logical query plans, 612, 614 physical query plans, 612–613, 614–616 SQL Server Profiler, 620–623 as querying language, 10–11
SQL and, 9 subqueries, 11 DAX engines DirectQuery, 546, 548, 549 FE, 546, 547 datacaches, 547 operators of, 547 single-threaded implementation, 547 SE, 546 aggregations, 548 datacaches, 547 DirectQuery, 548, 549 operators of, 547 parallel implementations, 548 VertiPaq, 547–549, 550–577 Tabular model and, 545–546 VertiPaq, 546, 547–548, 550. See also data models, optimizing with VertiPaq aggregations, 571–573 columnar databases, 550–553 compression, 553–562 datacaches, 549 DMV, 563–565 hardware selection, 573–577 hash encoding, 555–556 hierarchies, 561–562 materialization, 568–571 multithreaded implementations, 548 partitioning, 562–563 processing tables, 550 re-encoding, 559 relationships (data models), 561–562, 565–568 RLE, 556–559 scan operations, 549 segmentation, 562–563 sort orders, 560–561 value encoding, 554–555 DAX Studio, 395 capturing DAX queries, 609–611 Power BI and, 609–611 query measures, creating, 662–663 query plans, capturing profiling information, 617–620 VertiPaq caches, 639–640 DAXFormatter.com, 41 Decimal data type, 21 DEFINE MEASURE clauses in EVALUATE statements, 59
DEFINE sections (authoring queries) MEASURE keyword in, 399 VAR keyword in, 397–399 denormalizing data and data model optimization, 584–591 descriptive attributes column (tables), 600, 601–602 DETAILROWS function, reusing table expressions, 388–389 dictionary encoding. See hash encoding Dictionary Size column (VertiPaq Analyzer), 581 DirectQuery, 488–489, 546, 548, 549, 617 calculated columns, 25–26 composite data models, 488 End events (SQL Server Profiler), 621 SE, 549 composite data models, 646–647 reading, 645–646 single data models, 488 Disk I/O performance, VertiPaq hardware selection, 574, 576–577 DISTINCT function, 71 blank rows and invalid relationships, 68, 70–71 calculated columns, 68 calculated physical relationships circular dependencies, 477–478 range-based relationships, 476 filter contexts, 111–112 multiple columns, 71 UNION function and, 375–378 VALUES function versus, 68 DISTINCTCOUNT function, 46 DAX optimization, 699–704 same-store sales (calculations), computing, 535–536 table filters, avoiding, 699–704 VertiPaq SE queries, 634–635 DISTINCTCOUNTNOBLANK function, 46 DIVIDE function, DAX optimization, 684–687 division by zero, arithmetic operators, 32–33 DMV (Dynamic Management Views) and SSAS, 563–565 documenting code, variables, 183–184 drill-through attributes, optimizing, 604 drillthrough operations, time intelligence calculations, 271 duplicating, date tables, 227 duration of an order example, 26 dynamic segmentation, virtual relationships and, 485–488
E EARLIER function, evaluation contexts, 97–98 editing text, formatting DAX code, 42 empty/missing values, error-handling, 33–35 Encoding column (VertiPaq Analyzer), 582, 583 error-handling BLANK function, 36 Excel, empty/missing values, 35 expressions, 31 arithmetic operator errors, 32–35 conversion errors, 31–32 generating errors, 38–39 IF function, 36, 37 IFERROR function, 35–36, 37–38 ISBLANK function, 36 ISERROR function, 36, 38 SQRT function, 36 variables, 37 EVALUATE statements ADDMISSINGITEMS function, 419–420, 432–433 DEFINE MEASURE clauses, 59 example of, 396 expression variables and, 398 GENERATE function, 414–417 GENERATEALL function, 417 GROUPBY function, 420–423 ISONORAFTER function, 417–419 NATURALINNERJOIN function, 423–425 NATURALLEFTOUTERJOIN function, 423–425 ORDER BY clauses, 60 query variables and, 398 ROW function, 400–401 SAMPLE function, 427–428 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function, 401–403, 433–434 SUMMARIZECOLUMNS function, 403–409, 429–434 syntax of, 59–60, 396–399 TOPN function, 409–414 TOPNSKIP function, 420 evaluation contexts, 79 ALL function, 100–101 AVERAGEX function, filter contexts, 111–112 CALCULATE function, 79 columns in measures, 89–90 COUNTROWS function, filter contexts and relationships, 107–108 defined, 80
DISTINCT function, filter contexts, 111–112 EARLIER function, 97–98 filter contexts, 80, 109–110 AVERAGEX function, 111–112 CALCULATE function, 118–119 CALCULATE function and, 148–151 creating, 115–119 DISTINCT function, 111–112 examples of, 80–85 filter arguments, 118–119 relationships and, 106–109 row contexts versus, 85 SUMMARIZE function, 112 FILTER function, 92–93, 94–95, 98–101 multiple tables, working with, 101–102 filter contexts and relationships, 106–109 row contexts and relationships, 102–105 RELATED function filter contexts and relationships, 109 nested row contexts on different tables, 92 row contexts and relationships, 103–105 RELATEDTABLE function filter contexts and relationships, 109 nested row contexts on different tables, 91–92 row contexts and relationships, 103–105 relationships and, 101–102 filter contexts, 106–109 row contexts, 102–105 row contexts, 80 CALCULATE function and, 148–151 column references, 87 examples of, 86–87 filter contexts versus, 85 iterators and, 90–91 nested row contexts on different tables, 91–92 nested row contexts on the same table, 92–97 relationships and, 102–105 SUM function, in calculated columns, 88–89 SUMMARIZE function, filter contexts, 112 evaluations (multiple), avoiding with variables, 704–708 events (calculations), numbering sequences of, 536–539 Excel calculations, 8 cells, 5 columns, 5–7
DAX and cells and tables, 5–7 functional languages, 7 theories, 8–9 error-handling, empty/missing values, 35 formulas, 6 functions, 6–7 Power Pivot for Excel automatic date columns, 219 date table templates, 220 EXCEPT function, tables and, 379–381 expanded tables active relationships, 450–453 column filters versus table filters, 444–447 context transitions, 454–455 filter contexts, 439–441 filtering, 444–447 active relationships and, 450–453 differences between table filters and expanded tables, 453–454 RELATED function, 441–444 relationships, 437–441 table filters column filters versus, 444–447 in measures, 447–450 Expression calculation item, 289 Expression Trees, 612 expressions calculated columns, 29 calculation items, applying to expressions, 291 CALCULATE function, 291–299 DATESYTD function, 293–296 YTD calculations, 294 comments, placement in expressions, 18 DAX optimization, 658–661, 668 error-handling, 31 arithmetic operator errors, 32–35 conversion errors, 31–32 formatting, 39–40, 42 MDX DAX and, 12–13, 14 queries, 546, 604, 613, 663–664 query measures, 399 scalar expressions, 57–58 table expressions EVALUATE statements, 59–60 reusing, 388–389 variables, 30–31, 397–399
F FE (Formula Engines), 546, 547 bottlenecks, identifying, 667–668 datacaches, 547 operators of, 547 query plans, reading, 652–653, 654–655 single-threaded implementation, 547, 642 filter arguments CALCULATE function, 118–119, 122, 123, 445–447 defined, 120 multiple column references, 140 SUMMARIZECOLUMNS function, 406–409 filter contexts, 80, 109–110, 313, 343–344 ALL function, 324–326, 327–330 ALLEXCEPT function, 326–328 arbitrarily shaped filters, 336 best practices, 343 building, 338–343 column filters versus, 336 defined, 337–338 simple filters versus, 337 uses of, 343 AVERAGEX function, 111–112 CALCULATE function, 148–151 filter arguments, 118–119 overwriting filters, 120–122 column filters arbitrarily shaped filters versus, 336 defined, 336 creating, 115–119 data lineage, 332–336 DISTINCT function, 111–112 examples of, 80–85 expanded tables, 439–441 FILTERS function, 322–324 HASONEVALUE function, 314–318 ISCROSSFILTERED function, 319–322 ISEMPTY function, 330–332 ISFILTERED function, 319, 320–322 nesting in variables, 184–185 relationships and, 106–109 row contexts versus, 85 SELECTEDVALUE function, 318–319 simple filters arbitrarily shaped filters versus, 337
defined, 337 SUMMARIZE function, 112 TREATAS function, 334–336 VALUES function, 322–324, 327–328 FILTER function, 57–58 CALCULATETABLE function versus, 363–365 code maintenance/readability, 62–63 evaluation contexts, 98–101 as iterator, 60–61 nested row contexts on the same table, 92–93, 94–95 nesting, 61–62 range-based relationships (calculated physical relationships), 474–476 syntax of, 60 time intelligence calculations, 228–232 transferring filters, 481–482, 484–485 filter operations, xmSQL queries, 628–630 filtering ALLCROSSFILTERED function, 464, 465 columns (tables) versus table filters, 444–447 DAX optimization, filter conditions, 668–672 expanded tables differences between table filters and expanded tables, 453–454 table filters and active relationships, 450–453 FILTER function range-based relationships (calculated physical relationships), 474–476 transferring filters, 484–485 KEEPFILTERS function, 461–462, 482–483, 484 relationships bidirectional filtering, 3–4 unidirectional filtering, 4 shadow filter contexts, 457–462 tables, 381 CALCULATE function and, 445–447 column filters versus, 444–447 differences between table filters and expanded tables, 453–454 DISTINCTCOUNT function, 699–704 in measures, 447–450 OR conditions, 381–384 table filters and active relationships, 450–453 transferring filters, 480–481 CALCULATE function, 482
CONTAINS function, 481–482 FILTER function, 481–482, 484–485 INTERSECT function, 483–484 TREATAS function, 482–483, 484 FILTERS function filter contexts, 322–324 VALUES function versus, 322–324 FIRSTDATE function, time intelligence calculations, 269, 270 FIRSTNONBLANK function, time intelligence calculations, 256–257, 270–271 FORMAT function, 51 format strings calculation items and, 289–291 defined, 291 SELECTEDMEASUREFORMATSTRING function, 291 formatting DAX code, 39, 41–42 DAXFormatter.com, 41 editing text, 42 expressions, 39–40, 42 formulas, 42 help, 42 variables, 40–41 formulas Excel, 6 formatting, 42 IN function, tables and, 387–388 functions ADDCOLUMNS function, 223–224, 366–369, 371–372 ADDMISSINGITEMS function authoring queries, 419–420, 432–433 auto-exists feature (queries), 432–433 aggregation functions, xmSQL queries, 625–627 aggregators, 42, 44, 45–46 AVERAGE function, 43–44 AVERAGEX function, 44 COUNT function, 46 COUNTA function, 46 COUNTBLANK function, 46 COUNTROWS function, 46 DISTINCTCOUNT function, 46 DISTINCTCOUNTNOBLANK function, 46 MAX function, 43 MIN function, 43
SUM function, 42–43, 44–45 SUMX function, 45 ALL function, 464–465 ALLEXCEPT function versus, 326–328 CALCULATE function and, 164, 169–172 calculated physical relationships and circular dependencies, 478 computing nonworking days between two dates, 523–525 computing percentages, 125–132 context transitions, 328–330 evaluation contexts, 100–101 filter contexts, 324–326, 327–330 VALUES function and, 327–328 ALL* functions, 462–464 ALLCROSSFILTERED function, 464, 465 ALLEXCEPT function, 464, 465 ALL function versus, 326–328 computing percentages, 135 filter contexts, 326–328 VALUES function versus, 326–328 ALLNOBLANKROW function, 464, 465, 478 ALLSELECTED function, 455–457, 464, 465 CALCULATE function and, 171–172 returning iterated rows, 460–462 shadow filter contexts, 459–462 AVERAGE function, returning averages, 199 AVERAGEA function, returning averages, 199 AVERAGEX function computing averages, 199–201 filter contexts, 111–112 Boolean conditions, 123–124 CALCULATE function, 115 ALL function, 125–132, 164, 169–172 ALLSELECTED function, 171–172 Boolean conditions, 119–120 calculated physical relationships and circular dependencies, 478–480 calculation items, applying to expressions, 291–299 circular dependencies, 161–164 computing percentages, 124–135 context transitions, 148, 151–160 CROSSFILTER function, 168 evaluation contexts, 79 evaluation order, 144–148
filter arguments, 118–119, 122, 123, 445–447 filter contexts, 148–151 filtering a single column, 138–140 filtering multiple columns, 140–143 KEEPFILTERS function, 135–138, 139–143, 164, 168–169 KEEPFILTERS function and, 146–148 moving averages, 201–202 numbering sequences of events (calculations), 537–538 overwriting filters, 120–122 Precedence calculation group, 299–304 range-based relationships (calculated physical relationships), 474–476 RELATED function and, 443–444 row contexts, 148–151 rules for, 172–173 semantics of, 122–123 syntax of, 118, 119–120 table filters, 445–447 tables as filters, 382–384 time intelligence calculations, 228–232 transferring filters, 482–483, 484–485 UNION function and, 376–378 USERELATIONSHIP function, 164–168 CALCULATETABLE function, 115, 363 active relationships, 451–453 FILTER function versus, 363–365 time intelligence functions, 259, 260–261 CALENDAR function, date tables, 222 CALENDARAUTO function, date tables, 222–224 CallbackDataID function Analysis Services 2012/2014 and, 644 DAX optimization, 690–693 parallelism and, 641 VertiPaq and, 640–644 COMBINEVALUES function, multiple-column relationships (calculated physical relationships), 472–473 CONCATENATEX function iterators and, 194–196 tables as scalar values, 74 CONTAINS function tables and, 387–388 transferring filters, 481–482, 484–485 CONTAINSROW function, tables and, 387–388 conversion functions, 51
COUNTROWS function filter contexts and relationships, 107–108 nested row contexts on the same table, 92–95 tables as scalar values, 73 CROSSFILTER function bidirectional relationships, 109 CALCULATE function and, 168 CROSSJOIN function, tables and, 372–374, 383–384 CURRENCY function, 51 DATATABLE function, creating static tables, 392–393 DATE function, 51, 52 DATEADD function, time intelligence calculations, 237–238, 262–269 DATESINPERIOD function, moving annual totals, 243–244 DATESMTD function, time intelligence calculations, 259, 276–277 DATESQTD function, time intelligence calculations, 259, 276–277 DATESYTD function calculation items, applying to expressions, 293–296 time intelligence calculations, 259, 260, 261–262, 276–277 DATEVALUE function, 51 DETAILROWS function, reusing table expressions, 388–389 DISTINCT function calculated physical relationships and circular dependencies, 477–478 filter contexts, 111–112 range-based relationships (calculated physical relationships), 476 UNION function and, 375–378 DISTINCTCOUNT function avoiding table filters, 699–704 computing same-store sales, 535–536 DAX optimization, 699–704 DIVIDE function, DAX optimization, 684–687 EARLIER function, evaluation contexts, 97–98 Excel, 6–7 EXCEPT function, tables and, 379–381 FILTER function CALCULATETABLE function versus, 363–365 evaluation contexts, 98–101
nested row contexts on the same table, 92–93, 94–95 range-based relationships (calculated physical relationships), 474–476 time intelligence calculations, 228–232 transferring filters, 481–482, 484–485 FILTERS function filter contexts, 322–324 VALUES function versus, 322–324 FIRSTDATE function, time intelligence calculations, 269, 270 FIRSTNONBLANK function, time intelligence calculations, 256–257, 270–271 FORMAT function, 51 IN function, tables and, 387–388 GENERATE function, authoring queries, 414–417 GENERATEALL function, authoring queries, 417 GENERATESERIES function, tables and, 393–394 GROUPBY function authoring queries, 420–423 SUMMARIZE function and, 420–423 HASONEVALUE function filter contexts, 314–318 tables as scalar values, 73 information functions, 48–49 INT function, 51 INTERSECT function tables and, 378–379 transferring filters, 483–484 ISCROSSFILTERED function, filter contexts, 319–322 ISEMPTY function, filter contexts, 330–332 ISFILTERED function filter contexts, 319, 320–322 time intelligence calculations, 268–269 ISNUMBER function, 48–49 ISONORAFTER function authoring queries, 417–419 TOPN function and, 417–419 ISSELECTEDMEASURE function, including/excluding measures from calculation items, 304–306 ISSUBTOTAL function and SUMMARIZE function, 402–403 KEEPFILTERS function, 461–462 CALCULATE function and, 135–138, 142–143, 146–148, 164, 168–169 evaluation order, 146–148 transferring filters, 482–483, 484
LASTDATE function, time intelligence calculations, 248–249, 254, 255, 269–270 LASTNONBLANK function, 250–254, 255, 270–271 logical functions IF function, 46–47 IFERROR function, 47 SWITCH function, 47–48 LOOKUPVALUE function, 444, 473 mathematical functions, 49 NATURALINNERJOIN function, authoring queries, 423–425 NATURALLEFTOUTERJOIN function, authoring queries, 423–425 nested functions, call order of time intelligence functions, 245–246 NEXTDAY function, call order of nested time intelligence functions, 245–246 PARALLELPERIOD function, time intelligence calculations, 238–239 PREVIOUSMONTH function, time intelligence calculations, 239 RANK.EQ function, 210 RANKX function, numbering sequences of events (calculations), 538–539 RELATED function CALCULATE function and, 443–444 calculated columns, 443–444 context transitions in expanded tables, 455 expanded tables, 441–444 filter contexts and relationships, 109 nested row contexts on different tables, 92 row contexts and relationships, 103–105 table filters and expanded tables, 454 RELATEDTABLE function filter contexts and relationships, 109 nested row contexts on different tables, 91–92 row contexts and relationships, 103–105 relational functions, 53–54 ROLLUP function, 401–402, 403 ROW function creating static tables, 391–392 testing measures, 400–401 SAMEPERIODLASTYEAR function call order of nested time intelligence functions, 245–246 computing previous year sales up to last day sales (calculations), 540–544 time intelligence calculations, 237
SAMPLE function, authoring queries, 427–428 SELECTCOLUMNS function, 390–391, 393–394 SELECTEDMEASURE function, including/excluding measures from calculation items, 304–306 SELECTEDMEASUREFORMATSTRING function, 291 SELECTEDVALUE function calculated physical relationships and circular dependencies, 479–480 computing same-store sales, 533–534 context transitions in expanded tables, 454–455 filter contexts, 318–319 tables as scalar values, 73–74 STARTOFQUARTER function, time intelligence calculations, 256–257 SUBSTITUTEWITHINDEX function, authoring queries, 425–427 SUM function in calculated columns, 88–89 SUMMARIZE function authoring queries, 401–403, 433–434 auto-exists feature (queries), 433–434 columns (tables) and, 401 filter contexts, 112 GROUPBY function and, 420–423 ISSUBTOTAL function and, 402–403 ROLLUP function and, 401–402, 403 table filters and expanded tables, 453–454 tables and, 369–372, 373–374, 383–384 transferring filters, 484–485 SUMMARIZECOLUMNS function authoring queries, 403–409, 429–434 auto-exists feature (queries), 429–434 filter arguments, 406–409 IGNORE modifier, 403–404 ROLLUPADDISSUBTOTAL modifier, 404–406 ROLLUPGROUP modifier, 406 TREATAS function and, 407–408 table functions, 57 ALL function, 63–65, 66–67 ALLEXCEPT function, 65–66 ALLSELECTED function, 74–76 calculated columns and, 59 calculated tables, 59 DISTINCT function, 68, 70–71 FILTER function, 57–58, 60–63 measures and, 59 nesting, 58–59
RELATEDTABLE function, 58–59 VALUES function, 67–74 text functions, 50–51 TIME function, 51, 52 time intelligence functions (nested), call order of, 245–246 TOPN function authoring queries, 409–414 ISONORAFTER function and, 417–419 sort order, 410 TOPNSKIP function, authoring queries, 420 TREATAS function, 378 data lineage, 467–468 filter contexts and data lineage, 334–336 SUMMARIZECOLUMNS function and, 407–408 transferring filters, 482–483, 484 UNION function and, 377–378 trigonometric functions, 50 UNION function CALCULATE function and, 376–378 DISTINCT function and, 375–378 tables and, 374–378 TREATAS function and, 377–378 USERELATIONSHIP function active relationships, 450–451 CALCULATE function and, 164–168 non-active relationships and ambiguity, 516–517 VALUE function, 51 VALUES function ALL function and, 327–328 ALLEXCEPT function versus, 326–328 calculated physical relationships and circular dependencies, 477–480 computing percentages, 133–134 filter contexts, 322–324, 327–328 FILTERS function versus, 322–324 range-based relationships (calculated physical relationships), 474–476
G GENERATE function, authoring queries, 414–417 GENERATEALL function, authoring queries, 417 GENERATESERIES function, tables and, 393–394 generating errors (error-handling), 38–39 granularity calculations and iterators, 211–214 relationships (data models), 507–512
GROUPBY function authoring queries, 420–423 SUMMARIZE function and, 420–423
H hash encoding (VertiPaq compression), 555–556 HASONEVALUE function filter contexts, 314–318 tables as scalar values, 73 help, formatting DAX code, 42 hierarchies, 345, 362 attribute hierarchies (data model optimization), disabling, 604 Columns Hierarchies Size column (VertiPaq Analyzer), 582 DAX, 13–14 MDX, 13–14 P/C (Parent/Child) hierarchies, 350–361, 362 percentages, computing, 345 IF conditions, 349 PercOnCategory measures, 348 PercOnParent measures, 346–349 ratio to parent calculations, 345 SSAS and, 561–562 Use Hierarchies Size column (VertiPaq Analyzer), 582
I IF conditions computing percentages over hierarchies, 349 DAX optimization, 678–679 DIVIDE function and, 684–687 iterators, 687–690 in measures, 679–683 IF function, 36, 37, 46–47 IFERROR function, 35–36, 37–38, 47 IGNORE modifier, SUMMARIZECOLUMNS function, 403–404 information functions, 48–49 INT function, 51 Integer data type, 21 INTERSECT function tables and, 378–379 transferring filters, 483–484 intra-island relationships, 489 invalid relationships, blank rows and, 68–71
ISBLANK function, 36 ISCROSSFILTERED function, filter contexts, 319–322 ISEMPTY function, filter contexts, 330–332 ISERROR function, 36, 38 ISFILTERED function filter contexts, 319, 320–322 time intelligence calculations, 268–269 ISNUMBER function, 48–49 ISONORAFTER function authoring queries, 417–419 TOPN function and, 417–419 ISSELECTEDMEASURE function, including/excluding measures from calculation items, 304–306 ISSUBTOTAL function, 402–403 iterators, 8, 43, 44, 209–215 ADDCOLUMNS iterators, 196–199 averages (means) computing with AVERAGEX function, 199–201 moving averages, 201–202 returning with AVERAGE function, 199 returning with AVERAGEA function, 199 AVERAGEX iterators, 188 behavior of, 91 calculation granularity, 211–214 cardinality, 188–190 CONCATENATEX function and, 194–196 context transitions, leveraging, 190–194 DAX optimization IF conditions, 687–690 nested iterators, 693–699 FILTER function as, 60–61 nested iterators DAX optimization, 693–699 leveraging context transitions, 190–194 parameters of, 187–188 RANK.EQ function, 210 RANKX iterators, 188, 202–210 ROW CONTEXT iterators, 187–188 row contexts and, 90–91 SELECTCOLUMNS iterators, 196, 197–199 SUMX iterators, 187–188 tables, returning, 196–199
J join operators, xmSQL queries, 628–630
K KEEPFILTERS function, 461–462 CALCULATE function and, 135–138, 139–143, 164, 168–169 evaluation order, 146–148 filtering multiple columns, 142–143 transferring filters, 482–483, 484
L last day sales (calculations), computing previous year sales up to, 539–544 LASTDATE function, time intelligence calculations, 248–249, 254, 255, 269–270 LASTNONBLANK function, time intelligence calculations, 250–254, 255, 270–271 lazy evaluations, variables, 181–183 leaf-level calculations DAX, 14 MDX, 14 leap year bug, 22 list of values. See filter arguments logical functions IF function, 46–47 IFERROR function, 47 SWITCH function, 47–48 logical operators, 23 logical query plans, 612, 614, 650–651 LOOKUPVALUE function, 444, 473
M maintenance (code), FILTER function, 62–63 many-sided relationships (data models), 2, 3 many-to-many relationships. See MMR Mark as Date Table, 232–233 materialization (queries), 568–571 mathematical functions, 49 MAX function, 43 MDX (Multidimensional Expressions) DAX and, 12 hierarchies, 13–14 leaf-level calculations, 14 multidimensional versus tabular space, 12 as programming language, 12–13 as querying language, 12–13 queries, 546
attribute hierarchies (data model optimization), disabling, 604 DAX and, 613 executing, 546 reproduction queries, creating, 663–664 means (averages) computing averages, AVERAGEX function, 199–201 moving averages, 201–202 returning averages AVERAGE function, 199 AVERAGEA function, 199 MEASURE keyword, DEFINE sections (authoring queries), 399 measures, 26–28 ALL function and, 63–64 calculated columns, 42 choosing between calculated columns and measures, 29–30 differences between calculated columns and measures, 29 using measures in calculated columns, 30 calculation items, including/excluding measures from, 304–306 columns in, evaluation contexts, 89–90 context transitions, 157–160 DEFINE MEASURE clauses in EVALUATE statements, 59 defining in tables, 29 expressions, 29 IF conditions, DAX optimization, 679–683 ISSELECTEDMEASURE function, including/excluding measures from calculation items, 304–306 PercOnCategory measures, computing percentages over hierarchies, 348 PercOnParent measures, computing percentages over hierarchies, 346–349 query measures, 399, 662–663 SELECTEDMEASURE function, including/excluding measures from calculation items, 304–306 table filters in, 447–450 table functions, 59 testing, 399–401 VALUES function and, 67–68 memory size, VertiPaq hardware selection, 574, 576 memory speed, VertiPaq hardware selection, 574, 575–576 MIN function, 43
MMR (Many-Many Relationships), 489, 490, 494, 507 bridge tables, 494–499 common dimensionality, 500–504 weak relationships, 504–506 moving annual totals, computing, 243–244 moving averages, CALCULATE function, 201–202 MTD (Month-to-Date) calculations, time intelligence calculations, 235–236, 259–262, 276–277 multi-line comments, 18 multiple columns DISTINCT function and, 71 multiple-column relationships (calculated physical relationships), 471–473 VALUES function and, 71 MultipleItemSales variable, 58
N Name calculation group, 288 Name calculation item, 288 naming variables, 182 narrowing table computations, 384–386 NATURALINNERJOIN function, authoring queries, 423–425 NATURALLEFTOUTERJOIN function, authoring queries, 424–425 nested functions, call order of time intelligence functions, 245–246 nested iterators DAX optimization, 693–699 leveraging context transitions, 190–194 nesting filter contexts, in variables, 184–185 FILTER functions, 61–62 multiple rows, in variables, 184 row contexts on different tables, 91–92 on the same table, 92–97 table functions, 58–59 VAR/RETURN statements, 179–180 new customers, computing (tables), 380–381, 386–387 NEXTDAY function, call order of nested time intelligence functions, 245–246 non-active relationships, ambiguity, 515–517 nonworking days between two dates, computing, 523–525 numbering sequences of events (calculations), 536–539 numbers, conversions, 19–21
O one-sided relationships (data models), 2, 3 one-to-many relationships. See SMR one-to-one relationships. See SSR opening/closing balances (time intelligence calculations), 254–258 operators, 23 arithmetic operators, 23 division by zero, 32–33 empty/missing values, 33–35 error-handling, 32–35 comparison operators, 23 logical operators, 23 overloading, 19–20 parenthesis operators, 23 text concatenation operators, 23 optimizing columns high-cardinality columns, 603 split optimization, 602–603 storage optimization, 602–603 data models with VertiPaq, 579 aggregations, 587–588 cross-filtering, 590 denormalizing data, 584–591 gathering data model information, 579–584 relationship cardinality, 586–587 DAX, 657 bottlenecks, 668 CallbackDataID function, 690–693 change implementation, 668 conditional statements, 708–709 context transitions, 672–678 DISTINCTCOUNT function, 699–704 expressions, identifying a single DAX expression for optimization, 658–661 filter conditions, 668–672 IF conditions, 678–683, 684–690 multiple evaluations, avoiding with variables, 704–708 nested iterators, 693–699 query plans, 664–667 reproduction queries, creating, 661–664 SE/FE bottlenecks, identifying, 667–668 server timings, 664–667
test queries, rerunning, 668 to-do list, 658 variables, 704–708 OR conditions, tables as filters, 381–384 ORDER BY clauses in EVALUATE statements, 60 orders (example), computing duration of, 26 Ordinal values, calculated items, 289 overwriting filters, CALCULATE function, 120–122, 136
P P/C (Parent/Child) hierarchies, 350–361, 362 paging, VertiPaq hardware selection, 576–577 parallelism CallbackDataID function, 641 VertiPaq SE queries, 641 PARALLELPERIOD function, time intelligence calculations, 238–239 parenthesis operators, 23 partitioning and SSAS, 562–563 Partitions # column (VertiPaq Analyzer), 582 percentages, computing, 135 ALL function, 63–64 ALLSELECTED function, 75–76 CALCULATE function, 124 ALL function, 125–132 ALLEXCEPT function, 135 VALUES function, 133–134 hierarchies, 345 IF conditions, 349 PercOnCategory measures, 348 PercOnParent measures, 346–349 ratio to parent calculations, 345 PercOnCategory measures, computing percentages over hierarchies, 348 PercOnParent measures, computing percentages over hierarchies, 346, 348–349 PercOnSubcategory measures, computing percentages over hierarchies, 346–348 physical query plans, 612–613, 614–616, 651–652 physical relationships calculated physical relationships, 471–473 circular dependencies, 476–480 range-based relationships, 474–476 cardinality, 489–490 choosing, 506–507 cross-filter directions, 490
bidirectional cross-filter direction, 490, 491–493, 507 single cross-filter direction, 490 cross-island relationships, 489 intra-island relationships, 489 MMR, 489, 490, 494, 507 bridge tables, 494–499 common dimensionality, 500–504 weak relationships, 504–506 SMR, 489, 490, 493, 507 SSR, 489, 490, 493–494 strong relationships, 488 virtual relationships versus, 506–507 weak relationships, 488, 489, 504–506 Power BI Auto Date/Time, 218–219 DAX and, 14–15 DAX Studio and, 609–611 filter contexts, 84–85 Power BI reports and DAX queries, 609–610 Power Pivot for Excel automatic date columns, 219 date table templates, 220 Precedence calculation group, 288, 299–304 precomputing values (calculations), computing work days between two dates, 525–527 previous year sales up to last day sales (calculations), computing, 539–544 PREVIOUSMONTH function, time intelligence calculations, 239 Primary/Alternate Keys column (tables), 599 primary/alternate keys column (tables), 600 processing tables, 550 PYTD (Previous Year-To-Date) calculations, calculation items and sideways recursion, 307–308
Q QTD (Quarter-to-Date) calculations, time intelligence calculations, 235–236, 259–262, 276–277 qualitative attributes column (tables), 599, 600 quantitative attributes column (tables), 599, 600–601 queries DAX queries capturing, 609–611 DISTINCTCOUNT function, 634–635 executing, 546 DAX query plans, 612–613
DirectQuery, 546, 548, 549, 617 DirectQuery SE queries composite data models, 646–647 reading, 645–646 Expression Trees, 612 FE, 546, 547 datacaches, 547 operators of, 547 single-threaded implementation, 547 materialization, 568–571 MDX queries, 546 DAX and, 613 disabling attribute hierarchies (data model optimization), 604 executing, 546 query measures, creating with DAX Studio, 662–663 reproduction queries, creating creating query measures with DAX Studio, 662–663 in DAX, 661–662 in MDX, 663–664 SE, 546, 616–617 aggregations, 548 datacaches, 547 DirectQuery, 548 operators of, 547 parallel implementations, 548 VertiPaq, 547–549, 550–577 test queries, rerunning (DAX optimization), 668 VertiPaq, 546, 547–548, 550. See also data models, optimizing with VertiPaq aggregations, 571–573 columnar databases, 550–553 compression, 553–562 datacaches, 549 DMV, 563–565 hardware selection, 573–577 hash encoding, 555–556 hierarchies, 561–562 materialization, 568–571 multithreaded implementations, 548 partitioning, 562–563 processing tables, 550 re-encoding, 559 relationships (data models), 561–562, 565–568 RLE, 556–559
scan operations, 549 segmentation, 562–563 sort orders, 560–561 value encoding, 554–555 VertiPaq SE queries, 624 composite data models, 646–647 datacaches and parallelism, 635–637 DISTINCTCOUNT function, 634–635 scan time, 632–634 xmSQL queries and, 624–632 xmSQL queries, 624 aggregation functions, 625–627 arithmetical operations, 627 batch events, 630–632 filter operations, 628–630 join operators, 630 queries, authoring, 395 ADDMISSINGITEMS function, 419–420, 432–433 auto-exists feature, 428–434 DAX Studio, 395 DEFINE sections MEASURE keyword in, 399 VAR keyword in, 397–399 EVALUATE statements ADDMISSINGITEMS function, 419–420, 432–433 example of, 396 expression variables and, 398 GENERATE function, 414–417 GENERATEALL function, 417 GROUPBY function, 420–423 ISONORAFTER function, 417–419 NATURALINNERJOIN function, 423–425 NATURALLEFTOUTERJOIN function, 423–425 query variables and, 398 ROW function, 400–401 SAMPLE function, 427–428 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function, 401–403, 433–434 SUMMARIZECOLUMNS function, 403–409, 429–434 syntax of, 396–399 TOPN function, 409–414 TOPNSKIP function, 420 expression variables, 397–399 GENERATE function, 414–417
GENERATEALL function, 417 GROUPBY function, 420–423 ISONORAFTER function, 417–419 MEASURE in DEFINE sections, 399 measures query measures, 399 testing, 399–401 NATURALINNERJOIN function, 423–425 NATURALLEFTOUTERJOIN function, 423–425 query variables, 397–399 ROW function, testing measures, 400–401 SAMPLE function, 427–428 shadow filter contexts, 457–462 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function, 401–403, 433–434 SUMMARIZECOLUMNS function, 403–409, 429–434 TOPN function, 409–414 TOPNSKIP function, 420 VAR in DEFINE sections, 397–399 Query End events (SQL Server Profiler), 621 query plans capturing queries DAX Studio, 617–620 SQL Server Profiler, 620–623 collecting, 613–614 DAX optimization, 664–667 logical query plans, 612, 614, 650–651 physical query plans, 612–613, 614–616, 651–652 reading, 649–655 query variables, 397–399
R range-based relationships (calculated physical relationships), 474–476 RANK.EQ function, 210 RANKX function, numbering sequences of events (calculations), 538–539 RANKX iterators, 188, 202–210 ratio to parent calculations, computing percentages over hierarchies, 345 readability (code), FILTER function, 62–63 recursion (sideways), calculation items, 306–311 re-encoding SSAS and, 559 VertiPaq, 559
referencing columns in tables, 17–18 refreshing data, SSAS (SQL Server Analysis Services), 549–550 RELATED function CALCULATE function and, 443–444 calculated columns, 443–444 context transitions in expanded tables, 455 expanded tables, 441–444 filter contexts, relationships and, 109 nested row contexts on different tables, 92 row contexts and relationships, 103–105 table filters and expanded tables, 454 RELATEDTABLE function, 58–59 filter contexts, relationships and, 109 nested row contexts on different tables, 91–92 row contexts and relationships, 103–105 relational functions, 53–54 relationships (data models), 2 1:1 relationships, 2 active relationships ambiguity, 514–515 CALCULATETABLE function, 451–453 expanded tables and, 450–453 USERELATIONSHIP function, 450–451 ambiguity, 512–513 active relationships, 514–515 non-active relationships, 515–517 bidirectional filtering, 3–4 bidirectional relationships, 106, 109 calculated physical relationships, 471 circular dependencies, 476–480 multiple-column relationships, 471–473 range-based relationships, 474–476 cardinality, 489–490, 586–587, 590–591 chains, 3 columns, 3 cross-filter directions, 3, 490 bidirectional cross-filter direction, 490, 491–493, 507 single cross-filter direction, 490 cross-island relationships, 489 DAX and SQL, 9 directions of, 3–4 evaluation contexts and, 101–102 filter contexts, 106–109 row contexts, 102–105 expanded tables, 437–441
granularity, 507–512 intra-island relationships, 489 invalid relationships and blank rows, 68–71 many-sided relationships, 2, 3 MMR, 489, 490, 494, 507 bridge tables, 494–499 common dimensionality, 500–504 weak relationships, 504–506 non-active relationships, ambiguity, 515–517 one-sided relationships, 2, 3 performance, 507 physical relationships calculated physical relationships, 471–480 cardinality, 489–490 choosing, 506–507 cross-filter directions, 490–493 cross-island relationships, 489 intra-island relationships, 489 MMR, 489, 490, 494–506, 507 SMR, 489, 490, 493, 507 SSR, 489, 490, 493–494 strong relationships, 488 virtual relationships versus, 506–507 weak relationships, 488, 489, 504–506 Relationship reports (VertiPaq Analyzer), 584 Relationship Size column (VertiPaq Analyzer), 582 relationships, expanded tables, 437–441 shallow relationships in batch events (xmSQL queries), 630–632 SMR, 489, 490, 493, 507 SSAS and, 561–562 SSR, 489, 490, 493–494 strong relationships, 488 transferring filters, 480–481 CALCULATE function, 482 CONTAINS function, 481–482 FILTER function, 481–482, 484–485 INTERSECT function, 483–484 TREATAS function, 482–483, 484 unidirectional filtering, 4 USERELATIONSHIP function, non-active relationships and ambiguity, 516–517 VertiPaq and, 565–568 virtual relationships, 480, 507 dynamic segmentation, 485–488 physical relationships versus, 506–507
transferring filters, 480–485 weak relationships, 2, 439, 488, 489, 504–506 reproduction queries, creating in DAX, 661–662 in MDX, 663–664 query measures, creating with DAX Studio, 662–663 reusing table expressions, 388–389 RLE (Run Length Encoding), VertiPaq, 556–559 ROLLUP function, 401–402, 403 ROLLUPADDISSUBTOTAL modifier, SUMMARIZECOLUMNS function, 404–406 ROLLUPGROUP modifier, SUMMARIZECOLUMNS function, 406 ROW CONTEXT iterators, 187–188 row contexts, 80 CALCULATE function and, 148–151 column references, 87 examples of, 86–87 filter contexts versus, 85 iterators and, 90–91 nested row contexts on different tables, 91–92 on the same table, 92–97 relationships and, 102–105 ROW function static tables, creating, 391–392 testing measures, 400–401 rows (tables) ALLNOBLANKROW function, 464, 465 blank rows, invalid relationships, 68–71 CONTAINSROW function, 387–388 DETAILROWS function, 388–389 nesting in variables, 184 SAMPLE function, 427–428 TOPN function, 409–414 Rows column (VertiPaq Analyzer), 581, 583
S sales budget/sales information (calculations), showing together, 527–530 previous year sales up to last day sales (calculations), computing, 539–544 same-store sales (calculations), computing, 530–536 same-store sales (calculations), computing, 530–536 SAMEPERIODLASTYEAR function
computing previous year sales up to last day sales (calculations), 540–544 nested time intelligence functions, call order of, 245–246 time intelligence calculations, 237 SAMPLE function, authoring queries, 427–428 scalar expressions, 57–58 scalar values storing in variables, 176, 181 tables as, 71–74 SE (Storage Engines), 546 aggregations, 548 bottlenecks, identifying, 667–668 datacaches, 547 DirectQuery, 548, 549 operators of, 547 parallel implementations, 548 queries, 616–617 SE queries, copy VertiPaq SE queries entries VertiPaq, 547–548, 550. See also data models, optimizing with VertiPaq aggregations, 571–573 columnar databases, 550–553 compression, 553–562 datacaches, 549 DMV, 563–565 hardware selection, 573–577 hash encoding, 555–556 hierarchies, 561–562 materialization, 568–571 multithreaded implementations, 548 partitioning, 562–563 processing tables, 550 re-encoding, 559 relationships (data models), 561–562, 565–568 RLE, 556–559 scan operations, 549 segmentation, 562–563 sort orders, 560–561 value encoding, 554–555 VertiPaq SE queries, 624–632 segmentation dynamic segmentation and virtual relationships, 485–488 SSAS and, 562–563 Segments # column (VertiPaq Analyzer), 582 SELECTCOLUMNS function, 390–391, 393–394
SELECTCOLUMNS iterators, 196, 197–199 SELECTEDMEASURE function, including/excluding measures from calculation items, 304–306 SELECTEDMEASUREFORMATSTRING function, 291 SELECTEDVALUE function calculated physical relationships, circular dependencies, 479–480 context transitions in expanded tables, 454–455 filter contexts, 318–319 same-store sales (calculations), computing, 533–534 tables as scalar values, 73–74 semi-additive calculations, time intelligence calculations, 246–248 sequences of events (calculations), numbering, 536–539 server timings, DAX optimization, 664–667 shadow filter contexts, 457–462 shallow relationships in batch events (xmSQL queries), 630–632 sideways recursion, calculation items, 306–311 simple filters arbitrarily shaped filters versus, 337 defined, 337 single cross-filter direction (physical relationships), 490 single data models DirectQuery mode, 488 VertiPaq mode, 488 single-line comments, 18 SMR (Single-Many Relationships), 489, 490, 493, 507 sort order, determining, ORDER BY clauses, 60 sort orders SSAS and, 560–561 VertiPaq, 560–561 SQL (Structured Query Language) conditions, 11 DAX and, 9 as declarative language, 10 error-handling, empty/missing values, 35 subqueries, 11 SQL Server Profiler DirectQuery End events, 621 Query End events, 621 query plans, capturing profiling information, 620–623 VertiPaq SE Query Cache Match events, 621 VertiPaq SE Query End events, 621
SQRT function, 36 SSAS (SQL Server Analysis Services) data refreshes, 549–550 DMV, 563–565 hierarchies, 561–562 partitioning, 562–563 processing tables, 550 re-encoding, 559 relationships (data models), 561–562 segmentation, 562–563 sort orders, 560–561 SSR (Single-Single Relationships), 489, 490, 493–494 star schemas, denormalizing data and data model optimization, 586 STARTOFQUARTER function, time intelligence calculations, 256–257 static tables, creating DATATABLE function, 392–393 ROW function, 391–392 storing blocks, in variables, 176, 181 columns (tables), 601–602 partial results of calculations, in variables, 176–177 scalar values, in variables, 176, 181 tables, in variables, 58 string conversions, 19–21 strong relationships, 488 subcategories/categories example, ALL function and, 66–67 subqueries DAX, 11 SQL, 11 SUBSTITUTEWITHINDEX function, authoring queries, 425–427 SUM function, 42–43, 44–45, 88–89 SUMMARIZE function authoring queries, 401–403, 433–434 auto-exists feature (queries), 433–434 columns (tables) and, 401 filter contexts, 112 GROUPBY function and, 420–423 ISSUBTOTAL function and, 402–403 ROLLUP function and, 401–402, 403 table filters and expanded tables, 453–454 tables and, 369–372, 373–374, 383–384 transferring filters, 484–485 SUMMARIZECOLUMNS function
authoring queries, 403–409, 429–434 auto-exists feature (queries), 429–434 filter arguments, 406–409 IGNORE modifier, 403–404 ROLLUPADDISSUBTOTAL modifier, 404–406 ROLLUPGROUP modifier, 406 TREATAS function and, 407–408 SUMX function, 45 SUMX iterators, 187–188 SWITCH function, 47–48
T table constructors, 24 table expressions, EVALUATE statements, 59–60 table filters, DISTINCTCOUNT function, 699–704 table functions, 57 ALL function columns and, 64–65 computing percentages, 63–64 measures and, 63–64 syntax of, 63 top categories/subcategories example, 66–67 VALUES function versus, 67 ALLEXCEPT function, 65–66 ALLSELECTED function, 74–76 calculated columns and, 59 calculated tables, 59 DISTINCT function, 71 blank rows and invalid relationships, 68, 70–71 calculated columns, 68 multiple columns, 71 VALUES function versus, 68 FILTER function, 57–58 code maintenance/readability, 62–63 as iterator, 60–61 nesting, 61–62 syntax of, 60 measures and, 59 nesting, 58–59 RELATEDTABLE function, 58–59 VALUES function, 71 ALL function versus, 67 blank rows and invalid relationships, 68–71
calculated columns, 68 calculated tables, 68 DISTINCT function versus, 68 measures and, 67–68 multiple columns, 71 tables as scalar values, 71–74 Table Size % column (VertiPaq Analyzer), 582 Table Size column (VertiPaq Analyzer), 581 table variables, 181–182 tables, 363 ADDCOLUMNS function, 366–369, 371–372 blank rows, invalid relationships, 68–71 bridge tables, MMR, 494–499 CALCULATE function, tables as filters, 382–384 calculated columns, 25–26, 42 choosing between calculated columns and measures, 29–30 differences between calculated columns and measures, 29 expressions, 29 using measures in calculated columns, 30 calculated tables, 59 creating, 390–391 DISTINCT function, 68 SELECTCOLUMNS function, 390–391 VALUES function, 68 CALCULATETABLE function, 363–365 columns ADDCOLUMNS function, 366–369, 371–372 Boolean calculated columns, 597–598 calculated columns and data model optimization, 595–599 calculated columns, RELATED function, 443–444 cardinality, 603 cardinality and data model optimization, 591–592 Date column, 592–595 defined, 2 descriptive attributes column (tables), 600, 601–602 filtering, 444–447 optimizing high-cardinality columns, 603 Primary/Alternate Keys column (tables), 599 primary/alternate keys column (tables), 600 qualitative attributes column (tables), 599, 600 quantitative attributes column (tables), 599, 600–601 referencing, 17–18
relationships, 3 SELECTCOLUMNS function, 390–391, 393–394 storage optimization, 602–603 storing, 601–602 SUBSTITUTEWITHINDEX function, 425–427 SUMMARIZE function and, 401 SUMMARIZECOLUMNS function, 403–409, 429–434 technical attributes column (tables), 600, 602 Time column, 592–595 VertiPaq Analyzer, 580–583 computing new customers, 380–381, 386–387 CONTAINS function, 387–388 CONTAINSROW function, 387–388 CROSSJOIN function, 372–374, 383–384 date tables ADDCOLUMNS function, 223–224 building, 220–224 CALENDAR function, 222 CALENDARAUTO function, 222–224 date table templates (Power Pivot for Excel), 220 date templates, 224 duplicating, 227 loading from other data sources, 221 managing multiple dates, 224–228 Mark as Date Table, 232–233 multiple date tables, 226–228 multiple relationships to date tables, 224–226 naming, 221 defined, 2 DETAILROWS function, 388–389 EXCEPT function, 379–381 expanded tables active relationships, 450–453 column filters versus table filters, 444–447 context transitions, 454–455 differences between table filters and expanded tables, 453–454 filter contexts, 439–441 filtering, 444–447, 450–453 RELATED function, 441–444 relationships, 437–441 table filters in measures, 447–450 table filters versus column filters, 444–447
expressions, reusing, 388–389 FILTER function versus CALCULATETABLE function, 363–365 filtering CALCULATE function and, 445–447 column filters versus, 444–447 in measures, 447–450 as filters, 381–384 GENERATESERIES function, 393–394 IN function, 387–388 INTERSECT function, 378–379 iterators, returning tables with, 196–199 measures, defining in tables, 29 narrowing computations, 384–386 NATURALINNERJOIN function, 423–425 NATURALLEFTOUTERJOIN function, 423–425 processing, 550 records, 2 reusing expressions, 388–389 rows ALLNOBLANKROW function, 464, 465 CONTAINSROW function, 387–388 DETAILROWS function, 388–389 SAMPLE function, 427–428 TOPN function, 409–414 as scalar values, 71–74 SELECTCOLUMNS function, 390–391, 393–394 static tables creating with DATATABLE function, 392–393 creating with ROW function, 391–392 storing in variables, 176, 181 SUMMARIZE function, 369–372, 373–374, 383–384 temporary tables in batch events (xmSQL queries), 630–632 TOPN function, 409–414 UNION function, 374–378 variables, storing tables in, 58 Tabular model calculation groups, creating, 281–288 DAX engines and, 545–546 DAX queries, executing, 546 DirectQuery, 546 MDX queries, executing, 546 VertiPaq, 546 technical attributes column (tables), 600, 602 templates date table templates (Power Pivot for Excel), 220 date templates, 224
temporary tables in batch events (xmSQL queries), 630–632 test queries, rerunning (DAX optimization), 668 text concatenation operators, 23 editing, formatting DAX code, 42 text functions, 50–51 Time column, data model optimization, 592–595 TIME function, 51, 52 time intelligence calculations, 217 Auto Date/Time (Power BI), 218–219 automatic date columns (Power Pivot for Excel), 219 basic calculations, 228–232 basic functions, 233–235 CALCULATE function, 228–232 CALCULATETABLE function, 259, 260–261 context transitions, 260 custom calendars, 272 DATESYTD function, 276–277 weeks, 272–275 date tables ADDCOLUMNS function, 223–224 building, 220–224 CALENDAR function, 222 CALENDARAUTO function, 222–224 date table templates (Power Pivot for Excel), 220 date templates, 224 duplicating, 227 loading from other data sources, 221 managing multiple dates, 224–228 Mark as Date Table, 232–233 multiple date tables, 226–228 multiple relationships to date tables, 224–226 naming, 221 DATEADD function, 237–238, 262–269 DATESINPERIOD function, 243–244 DATESMTD function, 259, 276–277 DATESQTD function, 259, 276–277 DATESYTD function, 259, 260, 261–262, 276–277 differences over previous periods, computing, 241–243 drillthrough operations, 271 FILTER function, 228–232 FIRSTDATE function, 269, 270 FIRSTNONBLANK function, 256–257, 270–271
LASTDATE function, 248–249, 254, 255, 269–270 LASTNONBLANK function, 250–254, 255, 270–271 mixing functions, 239–241 moving annual totals, computing, 243–244 MTD calculations, 235–236, 259–262, 276–277 nested functions, call order of, 245–246 NEXTDAY function, 245–246 opening/closing balances, 254–258 PARALLELPERIOD function, 238–239 periods to date, 259–262 PREVIOUSMONTH function, 239 QTD calculations, 235–236, 259–262, 276–277 SAMEPERIODLASTYEAR function, 237, 245–246 semi-additive calculations, 246–248 STARTOFQUARTER function, 256–257 time periods, computing from prior periods, 237–239 YTD calculations, 235–236, 259–262, 276–277 time periods, computing from prior periods, 237–239 top categories/subcategories example, ALL function and, 66–67 TOPN function authoring queries, 409–414 ISONORAFTER function and, 417–419 sort order, 410 TOPNSKIP function, authoring queries, 420 transferring filters, 480–481 CALCULATE function, 482 CONTAINS function, 481–482 FILTER function, 481–482, 484–485 INTERSECT function, 483–484 TREATAS function, 482–483, 484 TREATAS function, 378 data lineage, 467–468 filter contexts and data lineage, 334–336 SUMMARIZECOLUMNS function and, 407–408 transferring filters, 482–483, 484 UNION function and, 377–378 trigonometric functions, 50
U unary operators, P/C (Parent/Child) hierarchies, 362 unidirectional filtering (relationships), 4
UNION function CALCULATE function and, 376–378 DISTINCT function and, 375–378 tables and, 374–378 TREATAS function and, 377–378 Use Hierarchies Size column (VertiPaq Analyzer), 582 USERELATIONSHIP function active relationships, 450–451 CALCULATE function and, 164–168 non-active relationships and ambiguity, 516–517
V value encoding (VertiPaq compression), 554–555 VALUE function, 51 values, list of. See filter arguments VALUES function, 71 ALL function and, 327–328 ALL function versus, 67 ALLEXCEPT function versus, 326–328 blank rows and invalid relationships, 68–71 calculated columns, 68 calculated physical relationships circular dependencies, 477–480 range-based relationships, 474–476 calculated tables, 68 computing percentages, 133–134 DISTINCT function versus, 68 filter contexts, 322–324, 327–328 FILTERS function versus, 322–324 measures and, 67–68 multiple columns, 71 tables as scalar values, 71–74 VAR keyword, DEFINE sections (authoring queries), 397–399 variables, 30–31, 175 as a constant, 177–178 defining, 176, 178–180 documenting code, 183–184 error-handling, 37 expression variables, 397–399 formatting, 40–41 lazy evaluations, 181–183 multiple evaluations, avoiding with variables, 704–708
MultipleItemSales variable, 58 names, 182 nesting filter contexts, 184–185 multiple rows, 184 query variables, 397–399 scalar values, 58 scope of, 178–180 storing partial results of calculations, 176–177 scalar values, 176, 181 tables, 176, 181 table variables, 181–182 tables, storing, 58 VAR syntax, 175–177 VAR/RETURN blocks, 175–177, 180 VAR/RETURN statements, nesting, 179–180 Variant data type, 22 VertiPaq, 546, 547–548, 550 aggregations, 571–573, 604–607 caches, 637–640 CallbackDataID function, 640–644 columnar databases, 550–553 compression, 553–554 hash encoding, 555–556 re-encoding, 559 RLE, 556–559 value encoding, 554–555 data model optimization, 579 aggregations, 587–588, 604–607 calculated columns, 595–599 choosing columns for storage, 599–602 column cardinality, 591–592 cross-filtering, 590 Date column, 592–595 denormalizing data, 584–591 disabling attribute hierarchies, 604 gathering data model information, 579–584 optimizing column storage, 602–603 optimizing drill-through attributes, 604 relationship cardinality, 586–587, 590–591 Time column, 592–595 datacaches, 549 DMV, 563–565 hardware selection, 573 best practices, 577 CPU model, 574–575
Disk I/O performance, 574, 576–577 memory size, 574, 576 memory speed, 574, 575–576 number of cores, 574, 576 as an option, 573–574 paging, 576–577 setting priorities, 574–576 hierarchies, 561–562 materialization, 568–571 multithreaded implementations, 548 partitioning, 562–563 processing tables, 550 relationships (data models), 561–562, 565–568 row-level security, 639 scan operations, 549 segmentation, 562–563 sort orders, 560–561 VertiPaq Analyzer columns (tables), 580–583 gathering data model information, 579–584 VertiPaq Analyzer, Relationship reports, 584 VertiPaq mode, 488–489 composite data models, 488 single data models, 488 VertiPaq SE queries, 624 composite data models, 646–647 datacaches, parallelism and, 635–637 DISTINCTCOUNT function, 634–635 scan time, 632–634 xmSQL queries and, 624 aggregation functions, 625–627 arithmetical operations, 627 batch events, 630–632 filter operations, 628–630 join operators, 630 VertiPaq SE Query Cache Match events (SQL Server Profiler), 621 VertiPaq SE Query End events (SQL Server Profiler), 621 virtual relationships, 480, 507 dynamic segmentation, 485–488 physical relationships versus, 506–507 transferring filters, 480–481 CALCULATE function, 482 CONTAINS function, 481–482 FILTER function, 481–482, 484–485 INTERSECT function, 483–484 TREATAS function, 482–483, 484
W weak relationships, 2, 439, 488, 489, 504–506 weeks (custom calendars), time intelligence calculations, 272–275 work days between two dates, computing, 519–523 nonworking days, 523–525 precomputing values (calculations), 525–527
X xmSQL CallbackDataID function parallelism and, 641 VertiPaq and, 640–644 VertiPaq queries, 548 xmSQL queries, 624 aggregation functions, 625–627
arithmetic operations, 627 batch events, 630–632 filter operations, 628–630 join operators, 630
Y YOY (Year-Over-Year) calculation item, 289–290 YOY% (Year-Over-Year Percentage) calculation item, 289–290 YTD (Year-to-Date) calculations calculation group precedence, 299–303 calculation items applying to expressions, 294 sideways recursion, 307 time intelligence calculations, 235–236, 259–262, 276–277
Marco Russo and Alberto Ferrari are the founders of sqlbi.com, where they regularly publish articles about Microsoft Power BI, Power Pivot, DAX, and SQL Server Analysis Services. They have worked with DAX since the first beta version of Power Pivot in 2009 and, during these years, sqlbi.com became one of the major sources for DAX articles and tutorials. Their courses, both in-person and online, are the major source of learning for many DAX enthusiasts. They both provide consultancy and mentoring on business intelligence (BI) using Microsoft technologies. They have written several books and papers about Power BI, DAX, and Analysis Services. They constantly help the community of DAX users by providing content for the websites daxpatterns.com, daxformatter.com, and dax.guide. Marco and Alberto are also regular speakers at major international conferences, including Microsoft Ignite, PASS Summit, and SQLBits. Contact Marco at [emailprotected], and contact Alberto at [emailprotected].