...


Sections 1 & 2 model answers:

Question 1:

Select
Customer_Name
,Count(1) As Order_Count

From dbo.Orders

Group By Customer_Name

Having Count(1) > 1


Question 2:

Rank() or Row_Number() will help you here

Select
a.Customer_Name
,a.Order_ID
,a.Order_Date
,a.Order_Value

From (
   Select
   Customer_Name
   ,Order_ID
   ,Order_Date
   ,Order_Value
   ,Row_Number() Over(Partition By Customer_Name Order By Order_Date Asc) As first_sq
   ,Row_Number() Over(Partition By Customer_Name Order By Order_Date Desc) As last_sq

   From dbo.Orders
)a
Where a.first_sq = 1 Or a.last_sq = 1


Question 3:

The better method will be:

Select
a.Customer_Name
,a.Order_Value As Latest_Value
,a.First_Order As Earliest_Value
,((Nullif(a.Order_Value,0) - NullIf(a.First_Order,0)) / NullIf(a.First_Order,0)) * 100 As Perc_Diff

From (
	Select
	Customer_Name
	,Order_Date
	,Order_Value
	,First_Value(Order_Value) Over(Partition By Customer_Name Order By Order_Date Asc) As First_Order -- earliest order value per customer
	,Row_Number() Over(Partition By Customer_Name Order By Order_Date Desc) As Customer_Sq

	From ResultSet
)a
Where a.Customer_sq = 1

Order By a.Customer_Name

The above method identifies the first and last orders rather than simply the highest and lowest values, which helps with understanding whether performance has increased or decreased between the two orders. For a simple max vs min comparison, however, you could also use:

Select
Customer_Name
,Max(Order_Value) As Max_Value
,Min(Order_Value) As Min_Value
,((Max(Order_Value) - Min(Order_Value)) / NullIf(Min(Order_Value),0)) * 100 As Perc_Diff

From Resultset
Group By
Customer_Name


Question 4:

Query 2 will be far more performant than Query 1, as the index on the Order_Date field can be used, whereas Query 1 prevents any index from being used.

Applying a function to a stored field (a table column) anywhere outside the projection forces the database engine to ‘touch’ every record in the table to evaluate the function before the filter can be tested. As such, no index can be used: if your table has 1 trillion records but the date filter would return just 500,000 of them, all 1 trillion records still have to be processed.

It is far better to load the records you need into a temporary source or derived table, applying only the filters that do not alter the stored data, and to perform the function in the projection or against that outer, derived result.
In doing so, the index on the date field can be used to build the temporary table, and the upper-case filter is applied only to the much smaller outer set, as sketched below.
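
A minimal sketch of that pattern, assuming the dbo.Orders table and the Order_Date / Customer_Name fields used elsewhere on this page (the original pair of queries is not reproduced here):

Select
a.Customer_Name
,a.Order_ID
,a.Order_Value

From (
	Select
	Customer_Name
	,Order_ID
	,Order_Value

	From dbo.Orders
	Where Order_Date >= '2024-01-01' -- sargable filter: the Order_Date index can be used here
)a
Where Upper(a.Customer_Name) = 'ACME LTD' -- the function now runs only against the pre-filtered rows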


Question 5a:

Execution Order

From

Join

Where

Group By

Having

Select

Order By
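
As a quick illustration, the same order annotated against a simple query (dbo.Customers is a hypothetical table here):

Select c.Customer_Name, Count(1) As Order_Count          -- 6. Select
From dbo.Orders o                                        -- 1. From
Join dbo.Customers c On c.Customer_ID = o.Customer_ID    -- 2. Join
Where o.Order_Date >= '2024-01-01'                       -- 3. Where
Group By c.Customer_Name                                 -- 4. Group By
Having Count(1) > 1                                      -- 5. Having
Order By c.Customer_Name                                 -- 7. Order By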

Question 5b:

Unions stack sets of data one atop the other to form a single unified output. The field names and data-types of the first result-set define the output table (whether real, or virtual in temporary memory).

Union: filters the sets as they are being stacked, ensuring each row in the output is unique.

Union All: does not filter the stacks for uniqueness, so when unioning the output of the same table against itself (assume it has 1,000 rows), the Union will consist of only 1,000 rows while the Union All will consist of 2,000 rows.
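
Illustrated with the hypothetical dbo.Orders table (assume 1,000 unique rows):

Select Order_ID From dbo.Orders
Union
Select Order_ID From dbo.Orders     -- duplicates removed: 1,000 rows

Select Order_ID From dbo.Orders
Union All
Select Order_ID From dbo.Orders     -- duplicates kept: 2,000 rows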


Join: Explanation

Inner: returns rows where the related field exists in both tables.

Left: returns all data from the primary table, and data from the secondary table only where the related field exists; the unmatched secondary-table fields are nulled.

Right: functionally the same as a left join, just reversed.

Full: equivalent to a left and right join combined; all data from both tables is returned, and where the related field exists in both tables the row will be complete.

Cross: every row of each table is joined to every row of the other, with no join clause. Often cited as great for creating test data and little else. However, with a little creativity this can generate null entries where data does not exist, rather than that data simply being filtered away, such as when no sales data exists for a customer (see the sketch below).
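
A sketch of that ‘creative’ cross-join use: build a scaffold of every customer/month combination, then left-join the sales so that missing months appear as nulls instead of disappearing. The table and field names here are assumptions.

Select
c.Customer_Name
,m.Sales_Month
,s.Sales_Value       -- null where no sales exist for that customer/month

From dbo.Customers c
Cross Join (Select Distinct Sales_Month From dbo.Sales) m
Left Join dbo.Sales s
	On s.Customer_Name = c.Customer_Name
	And s.Sales_Month = m.Sales_Month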


Question 6:

This is more abstract and has no real definitive answer given the lack of information. That said, I would probably look at the number of statuses each transaction generates, perform a count of transactions for each of the past few months, identify a possible multiplier for percentage growth or shrinkage, and apply this to forecast how many transactions we could expect this month. Multiplying that forecast by the number of statuses per transaction gives an estimate of the number of rows we would expect the log to hold.
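
A rough sketch of that estimate, assuming hypothetical dbo.Transactions and dbo.Transaction_Status_Log tables and an illustrative 5% growth multiplier:

Select
s.Avg_Statuses * m.Avg_Monthly_Txns * 1.05 As Estimated_Log_Rows_This_Month -- 1.05 = assumed growth multiplier

From (
	Select Avg(Cast(Status_Count As Float)) As Avg_Statuses
	From (
		Select Transaction_ID, Count(1) As Status_Count
		From dbo.Transaction_Status_Log
		Group By Transaction_ID
	)x
)s
Cross Join (
	Select Avg(Cast(Txn_Count As Float)) As Avg_Monthly_Txns
	From (
		Select Year(Created_Date) As Yr, Month(Created_Date) As Mth, Count(1) As Txn_Count
		From dbo.Transactions
		Where Created_Date >= DateAdd(Month, -3, GetDate())
		Group By Year(Created_Date), Month(Created_Date)
	)y
)m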


Question 7:

Problem: Possible resolution

Using calculations to adjust attribute names: use aliases.

Including a fixed title in a calculation: use text boxes.

Using calculations for grouping: use the group field.

Using nested calculations: create the calculations separately and use the results in forward calcs.

Quick filters: try to limit to fewer than six quick filters and/or make use of parameters.


Despite the above, arguably the best action a developer can take is to identify the source of the bottleneck, which can only really be achieved through analysis of the SQL that Tableau has generated and sent to the data server for processing.

There are two ways of getting to this: use the Performance Recorder, or access the Tableau logs. I tend to use both, the Performance Recorder to get the initial analysis underway, followed by the logs, using this pathway:

(users do not need to follow this, although this method will make their lives easier)

  1. Open the workbook
  2. Create a new blank worksheet
  3. Unhide ALL worksheets, ensuring you remain on or navigate back to the blank worksheet
  4. Save, and close.


We have done this so that when we re-open the workbook, we can get straight on with the analysis rather than waiting for the workbook to render.

I would recommend waiting around 30 mins at this point if possible, to allow the data server time to clear its query cache.

  1. Navigate to your Tableau logs: (default) Documents\My Tableau Repository\Logs and delete or move them.
  2. Now re-open the workbook, activate the Performance Recorder, and open the newly created log in a text editor such as Notepad++, Sublime or Visual Studio; be ready to keep refreshing it.
  3. Move through each sheet, allowing time to render, so the queries can be captured both to the recorder (which is currently generating its own temp log), and to the main log

At the completion of this exercise, the user can use both the log and the Performance Recorder to try to identify the causes of the bottlenecks.

It’s a good idea to test the generated SQL with the database’s show/explain plan so we can get an understanding of how the workbook is impacting the server; this is, more often than not, the usual cause of bottlenecks. From here (a short sketch follows the list below), users can:

  • Identify possible table and field optimisations to ensure the data can be retrieved quicker
  • Identify additional additive and semi-additive fields that could be added to the table, such as logic expressions, and numerators and denominators
  • Identify very complex queries resulting from equally complex calculations, ready to optimise and/or rebuild in the front-end
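
For example, once a query has been copied out of the log or the Performance Recorder, it can be prefixed with the database’s plan command. The exact syntax varies by engine (EXPLAIN in PostgreSQL/MySQL, SET SHOWPLAN_XML ON in SQL Server, EXPLAIN PLAN FOR in Oracle); the query below is only a placeholder for whatever Tableau generated:

Explain
Select Segment, Category, Sum(Sales) As Sales
From Superstore
Group By Segment, Category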


Question 8:

The Table Calculation is the more performant of the two, but the use case needs to be considered too:

Table Calculations are ultimately more performant than Level-of-Detail Expressions, although LODx are generally easier to use, as the scoping information is contained within the expression, and LODx offer much more flexibility than Table Calcs.

The TC performs a basic summation, with the scoping handled by Tableau, whereas the LODx must pull through the row-level detail and create one or more derived tables that handle the grouping, before joining the aggregated results back to the main row data.

As Tableau converts your expressions to SQL queries, knowing the expected queries can help us greatly reduce our performance bottlenecks.

These are the two queries generated by the calculations:

Note

Tableau writes its SQL from pre-defined templates, so the SQL it produces may not be as performant as SQL you would have written yourself; it may also perform additional unnecessary actions that we cannot directly adjust and that can greatly affect server performance.


Table Calculation

Tableau:

Window_Sum(Sum(Sales))

SQL:

Select
Segment
,Category
,Region
,Sum(Sales) As Sales

From Superstore
Where Segment = 'Consumer'
Group By
Segment, Category, Region 


Level-of-Detail Expression

Tableau:

Sum({Fixed Segment, Category : Sum(Sales)})

SQL:

Select
t1.Segment
,t1.Category
,Sum(t2.Sales) As Sales

From (
	Select
	Segment
	,Category
	,Region

	From Superstore
	Where Segment = 'Consumer'
	Group By Segment, Category, Region
)t1

Join (
	Select
	Segment
	,Category
	,Sum(Sales) As Sales

	From Superstore
	Group By Segment, Category
)t2

On t1.Segment = t2.Segment And t1.Category = t2.Category

Group By 
t1.Segment, t1.Category



Question 9:

The better method here will be Custom SQL (cSQL).

Tableau has a really useful feature called ‘join-culling’: when a large model like this is loaded, Tableau analyses the model to determine where the required data lives and generates a query that uses only the tables needed to retrieve it, effectively ignoring or culling the other, unnecessary tables.

This can produce really high-performing queries and is great when users want to analyse different parts of the model in the same workbook, however:

Note

Relational Integrity needs to be defined between each of the tables for Tableau to utilise join-culling

If RI is not defined, as is more often the case in highly denormalised reporting databases and warehouses (as explained in the question), then Tableau must process the entire model: every table and every join, returning every field from every table in the model.

Furthermore, users are unable to optimise the query the model sends out, whereas cSQL is functionally the same as a view: although Tableau needs to run the same query for every calculation that is used, you, as the SQL analyst, can optimise the query before adding it to the Tableau data model.
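
A sketch of the kind of hand-optimised Custom SQL this implies, selecting only the fields the workbook needs and joining only the necessary tables (the table and field names here are assumptions):

Select
o.Order_ID
,o.Order_Date
,o.Order_Value
,c.Customer_Name

From dbo.Orders o
Join dbo.Customers c
	On c.Customer_ID = o.Customer_ID
Where o.Order_Date >= '2024-01-01' -- pre-filter to the reporting period before Tableau sees the data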

Tip

In an ideal world, the best option would be to create a view, but as views are permanent objects, users usually require DDL (Data Definition Language) permissions to create them.

DDL permissions often extend to db-admin functions too; as such, DDL permissions are more often denied, preventing developers from being able to create views.
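
For reference, the view equivalent of the Custom SQL sketch above; creating it requires the CREATE VIEW (DDL) permission described in this tip, and the names remain assumptions:

Create View dbo.vw_Customer_Orders As
Select
o.Order_ID
,o.Order_Date
,o.Order_Value
,c.Customer_Name
From dbo.Orders o
Join dbo.Customers c On c.Customer_ID = o.Customer_ID;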


...