New Hire Process: Technical Assessment

Congratulations: you have made it to the technical testing stage of the interview process. Your suitability for the role is already confirmed.

We appreciate that tests are never easy to face, and can often be seen as meaningless, so we want to assure you that this test is not an assessment of your suitability for the role - you have already demonstrated your suitability by getting this far.
The purpose of this test is to enable us to understand your skill level, and to provide further training (where necessary) should you join us.

The following test is divided into 3 sections:

  1. The SQL test: Often undervalued or misunderstood. For us, strong SQL skills are vital for all visualisation work, from modelling, to calculations, to performance tuning.
  2. Tableau: A deep assessment of your Tableau product knowledge.
  3. A collection of Tableau charts for you to reproduce; you will need access to a full version of Tableau (either Desktop or Public) to undertake this. Follow this link to get the latest version of Tableau Public.


Section 1: SQL

Question 1: 

dbo.orders
Customer_Name      Category
Claire Gute        Furniture
Claire Gute        Furniture
Darrin Van Huff    Office Supplies
Sean O'Donnell     Furniture
Sean O'Donnell     Office Supplies
Brosina Hoffman    Furniture
Brosina Hoffman    Office Supplies
Brosina Hoffman    Technology
Brosina Hoffman    Office Supplies
Brosina Hoffman    Office Supplies
Brosina Hoffman    Furniture
Brosina Hoffman    Technology
Andrew Allen       Office Supplies
Irene Maddox       Office Supplies
Harold Pawlan      Office Supplies
Harold Pawlan      Office Supplies

Each row in the above table represents an order. Please can you write the SQL to return the customer name, and the count of orders, for all customers who have more than one order.


Question 2:

dbo.Orders

Customer_Name      Order_ID          Order_Date    Order_Value
Aaron Bergman      CA-2017-140935    2017-11-10    563.94
Aaron Bergman      CA-2015-156587    2015-03-07    309.592
Aaron Bergman      CA-2015-152905    2015-02-18    12.624
Aaron Hawkins      CA-2018-164000    2018-12-18    18.704
Aaron Hawkins      CA-2017-162747    2017-03-20    86.45
Aaron Hawkins      CA-2016-130113    2016-12-27    991.26
Aaron Hawkins      CA-2015-157644    2015-12-31    53.67
Aaron Smayling     CA-2018-101749    2018-10-03    171.288
Aaron Smayling     US-2018-147655    2018-09-04    88.074
Aaron Smayling     CA-2018-162691    2018-08-01    1476.27
Aaron Smayling     CA-2017-162901    2017-03-28    31.4
Aaron Smayling     US-2015-150126    2015-07-27    65.78
Adam Bellavance    CA-2018-107174    2018-11-06    2595.388
Adam Bellavance    CA-2018-118213    2018-11-05    240.15
Adam Bellavance    CA-2018-134173    2018-09-16    20.736

From the above table, please can you write the code to return:

  • Customer Name
  • First and last Order_ID
  • First and last Order_Date
  • First and last Order_Value

Please ensure you are using ANSI SQL for this task

Expected result [tbl_output]:

Customer_Name      Order_ID          Order_Date    Order_Value
Aaron Bergman      CA-2017-140935    2017-11-10    563.94
Aaron Bergman      CA-2015-152905    2015-02-18    12.624
Aaron Hawkins      CA-2018-164000    2018-12-18    18.704
Aaron Hawkins      CA-2015-122070    2015-04-22    257.752
Aaron Smayling     CA-2018-101749    2018-10-03    171.288
Aaron Smayling     US-2015-150126    2015-07-27    65.78
Adam Bellavance    CA-2018-107174    2018-11-06    2595.388
Adam Bellavance    CA-2016-150511    2016-09-18    18.54

Question 3:

Using the result from question 2 [tbl_output], please write a query that returns, on the same row, the:

  • Customer Name
  • First Order Value
  • Last Order Value
  • Percentage difference between the first and last orders

Expectation:

Customer_Name      Latest_Value    Earliest_Value    Perc_Diff
Aaron Bergman      563.94          12.624            4367.20532319392
Aaron Hawkins      18.704          257.752           -92.7434122722617
Aaron Smayling     171.288         65.78             160.395256916996
Adam Bellavance    2595.388        18.54             13898.8565264293

Question 4:

dbo.Orders

First_Name     Last_Name    Order_Date    Order_ID
Chuck          Clark        2018-12-30    CA-2018-126221
Erica          SMITH        2018-12-30    CA-2018-115427
Jill           Matthias     2018-12-30    CA-2018-156720
Patrick        O'Donnell    2018-12-30    CA-2018-143259
Michael        Chen         2018-12-29    US-2018-102638
James          Galang       2018-12-29    CA-2018-118885
Katherine      hughes       2018-12-29    US-2018-158526
Ken            Brennan      2018-12-29    CA-2018-158673
Bruce          Stewart      2018-12-29    CA-2018-130631
Ben            Peterman     2018-12-29    CA-2018-146626
Christopher    Schild       2018-12-28    CA-2018-135111
Erica          Hernandez    2018-12-28    CA-2018-127516
Greg           Hansen       2018-12-28    CA-2018-136539
Harry          Marie        2018-12-28    CA-2018-129805
Kristen        Smith        2018-12-30    CA-2018-163979
Lori           Olson        2018-12-28    CA-2018-163860
Jason          Gross        2018-12-28    CA-2018-101322
Jennifer       smith        2018-12-30    CA-2018-164826
Pamela         Coakley      2018-12-28    US-2018-162068
Susan          SMITH        2018-12-28    CA-2018-122798

This table contains 1 trillion rows of data, and has indexes on the Last_Name and Order_Date columns.
Last_Name values are stored inconsistently, meaning entries can be UPPERCASE, lowercase, or MiXed-cASe.

You are looking for all customers whose Last_Name is 'smith' (in any case). Text searches are case-sensitive, meaning that searching in just one case may not return all customers. You are also only interested in orders made on 2018-12-30.

An initial analysis of the data shows that of the 1 trillion rows, 2018-12-30 contains just 500,000 (five-hundred thousand) rows of data.

Which of these queries is the more efficient way to fetch the required data, considering the performance enhancements available? Explain why.

Query 1

Query 2

Question 5a:

Please can you put these query operations into the correct execution order:

Query Order    Execution Order
Select
From
Join
Where
Group By
Having
Order By


Question 5b:

What is a Union, and what is a Union All? How do they differ? Can you provide an example of their use?

Please can you name the five join types, and provide an example of their use.


Question 6:

This table is a transaction stream log:

Row_ID    Txn_ID    Account_ID    Txn_DateTime           Status
1         123abc    112233        2000-01-01 08:08:45    Initiated
2         895jkl    112233        2000-01-01 08:09:03    Initiated
3         747ghd    456789        2000-01-01 08:12:21    Initiated
4         123abc    112233        2000-01-01 08:13:15    In Process
5         696pol    456892        2000-01-01 08:14:37    Initiated
6         123abc    112233        2000-01-01 08:17:43    Successful
7         895ghd    784521        2000-01-01 08:22:51    Initiated
8         696pol    456892        2000-01-01 08:23:04    In Process
9         696pol    456892        2000-01-01 08:23:08    Failed

Can you estimate how many entries we would expect to capture in a month? Explain.

Section 2: Advanced Tableau Questions

Question 7:

Thinking about Tableau, you have a workbook connected to a data-source that is more than 3 billion rows and is around 180 columns wide.
This data needs to be refactored each day, so the entire table must be dropped and rebuilt in a job that takes more than 12 hours each night.
Data this large is too much to be loaded into an extract, and too much for Tableau Server to unpack into temp memory.

Due to these size constraints, and because the number of concurrent users averages around 120 each day, the workbook can only use a live, real-time connection.

The workbook visuals aren’t overly complex, nor do they plot a lot of marks; however, they perform a lot of look-up calculations, as they are measuring multi-variate tests.

The dashboard currently takes around 5 minutes to render after loading and after each filter change; this is unacceptable and needs to come down to below 20 seconds.

Please can you describe some of the methods you would use to identify the performance bottlenecks, and how you would go about resolving them?


Question 8:

Looking at this table:

Which of these two calculations is the more performant in achieving the highlighted row? Explain.

Table Calculation           Level-of-detail Expression
Window_Sum(Sum(Sales))      Sum({Fixed Segment, Category : Sum(Sales)})

Question 9:

The version 2020.2 changes to the Tableau data model have superseded this, but it is still pertinent for a Tableau developer to understand.


Consider this ER diagram:

You are connecting Tableau to this reporting database.
Whilst there are fields called ‘foreign key’ within each table, there is no referential integrity between the tables; these fields are provided for lookups and joins.

You are building a report that analyses the relationship between the numbers of products ordered and their destination city.

How would you model the data?

Option 1: Tableau's Data Model
Use Tableau’s data-model window, adding each field and joining based on the ER diagram.

Option 2: CustomSQL
Write your own query using SQL that simply processes the necessary joins, and add the data to a CustomSQL object.

Section 3: Tableau Task

Looking at this dashboard (a collection of charts), we would like you to try to replicate it using Superstore Sales (download below). We have provided the fields used on each chart, and many of the charts have clues as to what their values represent.
You need to build us 6 usable charts that tell us something about the performance of the Superstore; we would like you to attempt to replicate all the charts.

We also appreciate that you may be unfamiliar with charts 3, 4 & 5. If you are unable to build charts 3, 4 & 5, please build replacements that you feel will complement the rest of the charts in the collection.


It will be awesome if you could add some interactivity into this too.

As with all parts of this assessment, if you get stuck, please reach out to us; remember, this role is collaborative, and we want to help you as much as we can.

Please ensure you read all parts of the dashboard before setting off.

Should the interview over-run, you can complete this task in your own time if you like. Should you choose to do so, please ensure you send us a version saved no later than the previous year's first sub-version (xxxx.01); so if the latest version is 2020.02, then we want no later than 2019.01.


Workbook not loading? Open the image here: Superstore Sales



 <...>

Sections 1 & 2 model answers:

Question 1:
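
One possible solution (a sketch in ANSI SQL, assuming the dbo.orders table as shown in the question):

Select
 Customer_Name
,Count(*) As Order_Count                -- each row represents one order
From dbo.orders
Group By Customer_Name
Having Count(*) > 1                     -- keep only customers with more than 1 order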

Question 2:

Rank() or Row_Number() will help you here
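
For example, a sketch using Row_Number() against the dbo.Orders table (ANSI SQL; exact syntax may vary slightly by platform):

Select
 Customer_Name
,Order_ID
,Order_Date
,Order_Value
From (
	Select
	 Customer_Name
	,Order_ID
	,Order_Date
	,Order_Value
	,Row_Number() Over (Partition By Customer_Name Order By Order_Date Asc)  As rn_first   -- 1 = earliest order
	,Row_Number() Over (Partition By Customer_Name Order By Order_Date Desc) As rn_last    -- 1 = latest order
	From dbo.Orders
) t
Where rn_first = 1 Or rn_last = 1
Order By Customer_Name, Order_Date Desc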

Question 3:

The better method will be:
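
A sketch of one way this can be written against tbl_output from question 2, pivoting the first and last orders by date onto a single row:

Select
 Customer_Name
,Latest_Value
,Earliest_Value
,(Latest_Value - Earliest_Value) / Earliest_Value * 100 As Perc_Diff
From (
	Select
	 t.Customer_Name
	,Max(Case When t.Order_Date = d.Last_Date  Then t.Order_Value End) As Latest_Value   -- value of the last order
	,Max(Case When t.Order_Date = d.First_Date Then t.Order_Value End) As Earliest_Value -- value of the first order
	From tbl_output t
	Join (
		Select
		 Customer_Name
		,Min(Order_Date) As First_Date
		,Max(Order_Date) As Last_Date
		From tbl_output
		Group By Customer_Name
	) d On t.Customer_Name = d.Customer_Name
	Group By t.Customer_Name
) x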

The above method will allow you to identify the first and last orders rather than simply the highest and lowest; this helps with understanding whether performance has increased or decreased between the two orders. For a simple max vs min comparison, you could also use:
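
For example (a sketch against tbl_output, comparing the highest and lowest order values rather than the first and last):

Select
 Customer_Name
,Max(Order_Value) As Highest_Value
,Min(Order_Value) As Lowest_Value
,(Max(Order_Value) - Min(Order_Value)) / Min(Order_Value) * 100 As Perc_Diff
From tbl_output
Group By Customer_Name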

Question 4:

Query 2 will be far more performant than Query 1, as it allows the index on the Order_Date field to be used, whereas Query 1 prevents any index from being used.

Applying a function to a stored field (table column) anywhere outside the projection causes the db engine to ‘touch’ every record in the table to evaluate that function. As such, no index can be used; so if your table has 1 trillion records, but the date filter would return just 500,000 records, all 1 trillion records still need to be processed.

It is far better to load the records you need into a temporary source or derived table, applying only the filters that do not alter the stored data, and to perform the function in the projection of the outer query.
In so doing, the index on the date field can be used to build the temporary table, and the upper-case filter is performed on the outer table.
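
As an illustration of the two patterns only (a sketch; not the exact text of Query 1 and Query 2 shown above):

-- Pattern 1: the function wraps the stored column, so no index can be used
Select First_Name, Last_Name, Order_Date, Order_ID
From dbo.Orders
Where Upper(Last_Name) = 'SMITH'
  And Order_Date = '2018-12-30'

-- Pattern 2: the indexed date filter builds a small derived table first,
-- and the case-folding function is applied only to that much smaller result
Select First_Name, Last_Name, Order_Date, Order_ID
From (
	Select First_Name, Last_Name, Order_Date, Order_ID
	From dbo.Orders
	Where Order_Date = '2018-12-30'
) t
Where Upper(Last_Name) = 'SMITH'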


Question 5a:

Execution Order

From
Join
Where
Group By
Having
Select
Order By

Question 5b:

Unions stack sets of data one atop the other to form a single unified output. The field names and data-types of the first result-set define the table (whether real or virtual - in temp memory).

Union: filters the sets as they are being stacked, ensuring each row in the output is unique.
Union All: does not filter the stacks for uniqueness; so when unioning the output of the same table with itself (assume it has 1,000 rows), a Union will consist of only 1,000 rows, whereas a Union All will consist of 2,000 rows.
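
A brief example of each (a sketch, reusing the dbo.orders table from question 1):

-- Union: duplicate rows across the two result-sets are removed
Select Customer_Name From dbo.orders Where Category = 'Furniture'
Union
Select Customer_Name From dbo.orders Where Category = 'Technology'

-- Union All: every row from both result-sets is kept, duplicates included
Select Customer_Name From dbo.orders Where Category = 'Furniture'
Union All
Select Customer_Name From dbo.orders Where Category = 'Technology'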
Inner: Returns data from both tables only where a related field exists in both tables.

Left: Returns all data from the primary table, and data from the secondary table only where the related field exists; all other fields are nulled.

Right: Functionally the same as a left join, just reversed.

Full: Equivalent to a left and right join combined; all data from both tables is returned, and where a related field exists in both tables, the row will be complete.

Cross: All rows of both tables are joined to one another with no join clause. Often cited as great for creating test data and little else; however, with a little creativity, this can generate null entries where data does not exist rather than simply being filtered away, such as when no sales data exists for a customer.
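
A couple of brief examples of their use (a sketch, assuming hypothetical dbo.Customers and dbo.Categories tables alongside dbo.Orders, keyed on Customer_ID):

-- Left join: every customer, with their orders where they exist (otherwise null)
Select c.Customer_Name, o.Order_ID, o.Order_Value
From dbo.Customers c
Left Join dbo.Orders o
  On o.Customer_ID = c.Customer_ID

-- Cross join: every customer paired with every category,
-- useful for surfacing customer/category combinations that have no sales
Select c.Customer_Name, cat.Category
From dbo.Customers c
Cross Join dbo.Categories cat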

Question 6:

This is more abstract and has no real definitive answer given the lack of information. That said, I would probably look at the number of statuses each transaction generates, then perform a count of transactions for the past few months, identify a possible multiplier for percentage growth or shrinkage, and apply this to forecast how many transactions we could expect this month. We then simply multiply the forecast by the number of statuses per transaction to get an idea of the estimated number of rows we would expect the log to hold.
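
A sketch of the sort of query that would feed the estimate, assuming the log is stored in a table called dbo.Txn_Log (a hypothetical name):

Select
 Count(Distinct Txn_ID)                              As Monthly_Txns            -- transactions in the month
,Count(*)                                            As Monthly_Rows            -- log entries in the month
,Cast(Count(*) As Float) / Count(Distinct Txn_ID)    As Avg_Statuses_Per_Txn    -- statuses generated per transaction
From dbo.Txn_Log
Where Txn_DateTime >= '2000-01-01'
  And Txn_DateTime <  '2000-02-01'

Run over several months, the transaction counts give the growth multiplier, and the average statuses per transaction turns a forecast of transactions into a forecast of log rows.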

Question 7:

Problem                                         Possible Resolution
Using calculations to adjust attribute names    Use aliases
Including a fixed title in a calculation        Use text boxes
Using calculations for grouping                 Use the group field
Using nested calculations                       Create calculations and use the results in forward calcs
Quickfilters                                    Try to limit to fewer than 6 quickfilters and/or make use of parameters

Despite the above, arguably the best action a developer can take will be to identify the source of the bottleneck, which can only really be achieved through analysis of the SQL that Tableau has generated and sent to the data-server for processing.

There are two ways of getting to this: either use the Performance Recorder, or access the Tableau logs. I tend to use both, with the Performance Recorder getting the initial analysis underway, followed by the logs, using this pathway:

(users do not need to follow this, although this method will make their lives easier)

  1. Open the workbook
  2. Create a new blank worksheet
  3. Unhide ALL worksheets, ensuring you remain on or navigate back to the blank worksheet
  4. Save, and close.


We have done this so that when we re-open the workbook, we can get straight on with the analysis rather than waiting for the workbook to render.

I would recommend waiting around 30 mins at this point if possible, to allow the data server time to clear its query cache.

  1. Navigate to your Tableau logs: (default) Documents\My Tableau Repository\Logs and delete or move them.
  2. Now re-open the workbook, activate the Performance Recorder, and open the newly created log in a text editor such as Notepad++, Sublime, Visual Studio, etc.; be ready to keep refreshing
  3. Move through each sheet, allowing time to render, so the queries can be captured both to the recorder (which is currently generating its own temp log), and to the main log

At the completion of this exercise, the user can use both the log and the performance recorder to try to identify the causes of the bottle-necks.

It’s a good idea to test the generated SQL with the show/explain plan so we can get an understanding of how the workbook is impacting the server; more often than not, this is the usual cause of bottlenecks. From here, users can:

  • Identify possible table and field optimisations to ensure the data can be retrieved quicker
  • Identify additional additive and semi-additive fields that can be added to the table, such as logic expressions, and numerators and denominators
  • Identify very complex queries resulting from equally complex calculations, ready to optimise and/or rebuild in the front-end


Question 8:

The Table Calculation is the more performant, but the use case needs to be considered too:

Table Calculations are ultimately more performant than Level-of-Detail Expressions, although LODx are generally easier to use, as the scoping information is contained within the expression, and LODx can offer much more flexibility than Table Calcs.

The TC will perform a basic summation, with the scoping handled by Tableau, whereas the LODx must pull through the row-level detail and create one or more derived tables that handle the grouping, before joining the aggregated results back to the main row-data.

As Tableau converts your expressions to SQL queries, knowing the expected queries can help us greatly reduce our performance bottlenecks.

These are the two queries generated by the calculations:

Tableau is pre-programmed to write SQL from pre-defined templates, so the SQL produced may not be as performant as SQL you have defined yourself, may perform additional unnecessary actions that we cannot directly adjust, and can greatly affect server performance.


Table Calculation

Tableau:

Window_Sum(Sum(Sales))

SQL:

Select
Segment
,Category
,Region
,Sum(Sales) As Sales

From Superstore
Where Segment = 'Consumer'
Group By
Segment, Category, Region 

Level-of-detail Expression

Tableau:

Sum({Fixed Segment, Category : Sum(Sales)})

SQL:

Select
t1.Segment
,t1.Category
,Sum(t2.Sales) As Sales

From (
	Select
	Segment
	,Category
	,Region

	From Superstore
	Where Segment = 'Consumer'
	Group By Segment, Category, Region
)t1

Join (
	Select
	Segment
	,Category
	,Sum(Sales) As Sales

	From Superstore
	Group By Segment, Category
)t2

On t1.Segment = t2.Segment And t1.Category = t2.Category

Group By 
t1.Segment, t1.Category


Question 9:

The better method here will be CustomSQL (cSQL).

Tableau has a really fancy method called ‘join-culling’: when a large model like this is loaded, Tableau analyses the model to determine where the data is, and generates a query that only uses the tables needed to get the data, effectively ignoring, or culling, the other unnecessary tables.

This can produce really performant queries and is great for when users want to analyse different parts of the model in the same workbook; however:

Referential integrity (RI) needs to be defined between each of the tables for Tableau to utilise join-culling.

If RI is not defined, as is more often the case in highly denormalised reporting databases and warehouses (as explained in the question), then Tableau must process the entirety of the model: every table and every join, returning every field from every table in the model.

Furthermore, users are unable to optimise the outgoing model, whereas the cSQL is functionally the same as a view; so whilst Tableau needs to run the same query for every calculation that is used, as the SQL analyst you can optimise the query before you add it to the Tableau data-model.
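
For example, the CustomSQL for the products-ordered-versus-destination-city analysis might need only a handful of tables (a sketch; the table and column names here are illustrative, as the ER diagram is not reproduced):

Select
 a.City
,Sum(oi.Quantity) As Products_Ordered
From Order_Items oi
Join Orders    o On o.Order_ID   = oi.Order_ID
Join Addresses a On a.Address_ID = o.Ship_Address_ID
Group By a.City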

In an ideal world, the best option would be to create a view, but as views are permanent objects, users usually require DDL (Data Definition Language) permissions to create them.

DDL permissions often extend to db-admin functions too; as such, DDL permissions are more often denied, thus preventing developers from being able to create views.


