Congratulations on making it to the technical testing stage of the interview process; your suitability for the role is already apparent.
We appreciate that tests are never easy to encounter and can often feel meaningless, so we want to assure you that this test is not an assessment of your suitability for the role - you have already demonstrated that by getting this far.
Instead, the purpose of this test is to help us understand your skill level, and to shape any further training (where necessary) should you join us.
The following test is divided into 3 sections:
Contents
Question 1:
dbo.orders
Customer_Name | Category |
---|---|
Claire Gute | Furniture |
Claire Gute | Furniture |
Darrin Van Huff | Office Supplies |
Sean O'Donnell | Furniture |
Sean O'Donnell | Office Supplies |
Brosina Hoffman | Furniture |
Brosina Hoffman | Office Supplies |
Brosina Hoffman | Technology |
Brosina Hoffman | Office Supplies |
Brosina Hoffman | Office Supplies |
Brosina Hoffman | Furniture |
Brosina Hoffman | Technology |
Andrew Allen | Office Supplies |
Irene Maddox | Office Supplies |
Harold Pawlan | Office Supplies |
Harold Pawlan | Office Supplies |
Each row in the above table represents an order. Please can you write the code to return the customer name and the count of orders for every customer who has more than 1 order?
Question 2:
dbo.Orders
Customer_Name | Order_ID | Order_Date | Order_Value |
---|---|---|---|
Aaron Bergman | CA-2017-140935 | 2017-11-10 | 563.94 |
Aaron Bergman | CA-2015-156587 | 2015-03-07 | 309.592 |
Aaron Bergman | CA-2015-152905 | 2015-02-18 | 12.624 |
Aaron Hawkins | CA-2018-164000 | 2018-12-18 | 18.704 |
Aaron Hawkins | CA-2017-162747 | 2017-03-20 | 86.45 |
Aaron Hawkins | CA-2016-130113 | 2016-12-27 | 991.26 |
Aaron Hawkins | CA-2015-157644 | 2015-12-31 | 53.67 |
Aaron Smayling | CA-2018-101749 | 2018-10-03 | 171.288 |
Aaron Smayling | US-2018-147655 | 2018-09-04 | 88.074 |
Aaron Smayling | CA-2018-162691 | 2018-08-01 | 1476.27 |
Aaron Smayling | CA-2017-162901 | 2017-03-28 | 31.4 |
Aaron Smayling | US-2015-150126 | 2015-07-27 | 65.78 |
Adam Bellavance | CA-2018-107174 | 2018-11-06 | 2595.388 |
Adam Bellavance | CA-2018-118213 | 2018-11-05 | 240.15 |
Adam Bellavance | CA-2018-134173 | 2018-09-16 | 20.736 |
From the above table, please can you write the code to return each customer's latest and earliest orders, as shown in the expected result below?
Please ensure you are using ANSI SQL for this task.
Expected result [tbl_output]:
Customer_Name | Order_ID | Order_Date | Order_Value |
---|---|---|---|
Aaron Bergman | CA-2017-140935 | 2017-11-10 | 563.94 |
Aaron Bergman | CA-2015-152905 | 2015-02-18 | 12.624 |
Aaron Hawkins | CA-2018-164000 | 2018-12-18 | 18.704 |
Aaron Hawkins | CA-2015-122070 | 2015-04-22 | 257.752 |
Aaron Smayling | CA-2018-101749 | 2018-10-03 | 171.288 |
Aaron Smayling | US-2015-150126 | 2015-07-27 | 65.78 |
Adam Bellavance | CA-2018-107174 | 2018-11-06 | 2595.388 |
Adam Bellavance | CA-2016-150511 | 2016-09-18 | 18.54 |
Question 3:
Using the result from question 2 [tbl_output], please write a query that will return, on the same row, the customer name, the latest order value, the earliest order value, and the percentage difference between the two:
Expectation:
Customer_Name | Latest_Value | Earliest_Value | Perc_Diff |
---|---|---|---|
Aaron Bergman | 563.94 | 12.624 | 4367.20532319392 |
Aaron Hawkins | 18.704 | 257.752 | -92.7434122722617 |
Aaron Smayling | 171.288 | 65.78 | 160.395256916996 |
Adam Bellavance | 2595.388 | 18.54 | 13898.8565264293 |
Question 4:
dbo.Orders
First_Name | Last_Name | Order_Date | Order_ID |
---|---|---|---|
Chuck | Clark | 2018-12-30 | CA-2018-126221 |
Erica | SMITH | 2018-12-30 | CA-2018-115427 |
Jill | Matthias | 2018-12-30 | CA-2018-156720 |
Patrick | O'Donnell | 2018-12-30 | CA-2018-143259 |
Michael | Chen | 2018-12-29 | US-2018-102638 |
James | Galang | 2018-12-29 | CA-2018-118885 |
Katherine | hughes | 2018-12-29 | US-2018-158526 |
Ken | Brennan | 2018-12-29 | CA-2018-158673 |
Bruce | Stewart | 2018-12-29 | CA-2018-130631 |
Ben | Peterman | 2018-12-29 | CA-2018-146626 |
Christopher | Schild | 2018-12-28 | CA-2018-135111 |
Erica | Hernandez | 2018-12-28 | CA-2018-127516 |
Greg | Hansen | 2018-12-28 | CA-2018-136539 |
Harry | Marie | 2018-12-28 | CA-2018-129805 |
Kristen | Smith | 2018-12-30 | CA-2018-163979 |
Lori | Olson | 2018-12-28 | CA-2018-163860 |
Jason | Gross | 2018-12-28 | CA-2018-101322 |
Jennifer | smith | 2018-12-30 | CA-2018-164826 |
Pamela | Coakley | 2018-12-28 | US-2018-162068 |
Susan | SMITH | 2018-12-28 | CA-2018-122798 |
This table contains 1 trillion rows of data, and has indexes on the Last_Name and Order_Date columns.
Last_Name is not stored in a consistent case, meaning entries can be UPPERCASE, lowercase, or MiXed-cASe.
You are looking for all customers whose Last_Name is 'smith' (in any case). Text searches are case-sensitive, meaning that searching in just one case may not return all customers. You are also only interested in orders made on 2018-12-30.
An initial analysis of the data shows that of the 1 trillion rows, 2018-12-30 contains just 500,000 (five-hundred thousand) rows of data.
Which of these queries is the more efficient way to fetch the required data, considering the performance enhancements available? Explain why.
Query 1 | Query 2 |
---|---|
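The two candidate queries are not reproduced in this copy of the test. Purely as a hypothetical sketch of the kind of contrast intended (consistent with the model answer later in this document), they might differ roughly like this:

```sql
-- Hypothetical reconstruction only; the original Query 1 and Query 2 are not shown above.

-- Query 1 (sketch): functions are applied to the stored columns inside the WHERE clause,
-- so neither the Last_Name nor the Order_Date index can be used and every row is evaluated.
SELECT First_Name, Last_Name, Order_Date, Order_ID
FROM dbo.Orders
WHERE LOWER(Last_Name) = 'smith'
  AND CAST(Order_Date AS DATE) = DATE '2018-12-30';

-- Query 2 (sketch): filter on the untouched, indexed Order_Date column in a derived table,
-- then apply the case-folding function only to the ~500,000 rows that remain.
SELECT First_Name, Last_Name, Order_Date, Order_ID
FROM (
    SELECT First_Name, Last_Name, Order_Date, Order_ID
    FROM dbo.Orders
    WHERE Order_Date = DATE '2018-12-30'
) day_orders
WHERE LOWER(Last_Name) = 'smith';
```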
Question 5a:
Please can you put these query operations into the correct execution order?
Query Order | Execution Order |
---|---|
Select | |
From | |
Join | |
Where | |
Group By | |
Having | |
Order By | |
Question 5b: What is a Union, and a Union All? How do they differ? And can you provide an example of their use?
Please can you name the five join types, and provide an example of their use.
Question 6:
This table is a transaction stream log:
Row_ID | Txn_ID | Account_ID | Txn_DateTime | Status |
---|---|---|---|---|
1 | 123abc | 112233 | 2000-01-01 08:08:45 | Initiated |
2 | 895jkl | 112233 | 2000-01-01 08:09:03 | Initiated |
3 | 747ghd | 456789 | 2000-01-01 08:12:21 | Initiated |
4 | 123abc | 112233 | 2000-01-01 08:13:15 | In Process |
5 | 696pol | 456892 | 2000-01-01 08:14:37 | Initiated |
6 | 123abc | 112233 | 2000-01-01 08:17:43 | Successful |
7 | 895ghd | 784521 | 2000-01-01 08:22:51 | Initiated |
8 | 696pol | 456892 | 2000-01-01 08:23:04 | In Process |
9 | 696pol | 456892 | 2000-01-01 08:23:08 | Failed |
Can you estimate how many entries we would expect to capture in a month? Explain your reasoning.
Question 7:
Thinking about Tableau, you have a workbook connected to a data-source that is more than 3 billion rows and is around 180 columns wide.
This data needs to be refactored each day so the entirety of the table must be dropped and rebuilt in a job taking more than 12 hours each night.
Data this large is too much to be loaded into an extract, and too much for Tableau Server to unpack into temp memory.
Due to these size constraints, and because the number of concurrent users averages around 120 each day, the workbook can only connect live to the source.
The workbook visuals aren’t overly complex, nor do they plot a lot of marks; however, they perform a lot of look-up calculations as they are measuring multi-variate tests.
The dashboard currently takes around 5 minutes to render after loading and after each filter change; this is unacceptable and needs to be brought below 20 seconds.
Please can you describe some of the methods you would use to identify the performance bottlenecks, and the methods you would use to resolve them?
Question 8:
Looking at this table:
Which of these two calculations is the more performant in achieving the highlighted row? Explain.
Table Calculation | Level-of-detail Expression |
---|---|
Question 9:
The version 2020.2 changes to the Tableau data model have superseded this, but it is still pertinent for a Tableau developer to understand.
Consider this ER diagram:
You are connecting Tableau to this reporting database.
Whilst there are fields designated as foreign keys within each table, there is no relational integrity between the tables; these fields are provided for lookups and joins.
You are building a report that analyses the relationship between the numbers of products ordered and their destination city.
How would you model the data?
Option 1: Tableau's Data Model | Option 2: CustomSQL |
---|---|
Use Tableau’s data-model window, adding each field and joining based on the ER diagram | Write your own SQL query that processes only the necessary joins, and add the data to a CustomSQL object |
Looking at this dashboard (a collection of charts), we would like you to try to replicate it using Superstore Sales (download below). We have provided the fields used on each chart, and many of the charts have clues as to what their values represent.
You need to build us 6 usable charts that tell us something about the performance of the Superstore; we would like you to attempt to replicate all the charts.
We also appreciate that you may be unfamiliar with charts 3, 4 & 5; if you are unable to build them, please build replacements that you feel will complement the rest of the charts in the collection.
It would be awesome if you could add some interactivity into this too.
As with all parts of this assessment, if you get stuck, please reach out to us; remember, this role is collaborative, and we want to help you as much as we can.
Please ensure you read all parts of the dashboard before setting off.
Should the interview over-run, you can complete this task in your own time if you like. Should you choose to do so, please ensure you save the workbook as a version no later than the previous year's first sub-version (xxxx.1); so if the latest version is 2020.2, then we want no later than 2019.1.
Workbook not loading? Open the image here: | Superstore Sales (download) |
---|---|
Sections 1 & 2 model answers:
Question 1:
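The original model answer code is not included in this copy; a minimal sketch of the expected approach, using GROUP BY and HAVING, might look like this:

```sql
-- Minimal sketch (not the original model answer): customers with more than one order
SELECT Customer_Name,
       COUNT(*) AS Order_Count
FROM dbo.orders
GROUP BY Customer_Name
HAVING COUNT(*) > 1;
```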
Question 2: RANK() or ROW_NUMBER() will help you here.
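The answer code itself is not reproduced here; a hedged sketch using ROW_NUMBER() to return each customer's latest and earliest orders might be:

```sql
-- Sketch only (not the original model answer): latest and earliest order per customer
WITH ranked AS (
    SELECT Customer_Name,
           Order_ID,
           Order_Date,
           Order_Value,
           ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date DESC) AS rn_latest,
           ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date ASC)  AS rn_earliest
    FROM dbo.Orders
)
SELECT Customer_Name, Order_ID, Order_Date, Order_Value
FROM ranked
WHERE rn_latest = 1
   OR rn_earliest = 1
ORDER BY Customer_Name, Order_Date DESC;
```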
Question 3: The better method will be:
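The answer code is missing from this copy; a hedged sketch of the date-based approach, working from [tbl_output], might be:

```sql
-- Sketch only: pivot each customer's earliest and latest order values onto one row
WITH ranked AS (
    SELECT Customer_Name,
           Order_Value,
           ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date DESC) AS rn_latest,
           ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date ASC)  AS rn_earliest
    FROM tbl_output
)
SELECT Customer_Name,
       MAX(CASE WHEN rn_latest   = 1 THEN Order_Value END) AS Latest_Value,
       MAX(CASE WHEN rn_earliest = 1 THEN Order_Value END) AS Earliest_Value,
       (MAX(CASE WHEN rn_latest   = 1 THEN Order_Value END)
      - MAX(CASE WHEN rn_earliest = 1 THEN Order_Value END))
      / MAX(CASE WHEN rn_earliest = 1 THEN Order_Value END) * 100.0 AS Perc_Diff
FROM ranked
GROUP BY Customer_Name;
```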
The above method allows you to identify the first and last orders rather than simply the highest and lowest values, which helps with understanding whether performance has increased or decreased between the two orders. For a simple max vs min, however, you could also use:
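Again as a hedged sketch rather than the original answer, the simpler max-vs-min version could be:

```sql
-- Sketch only: highest vs lowest order value per customer (not necessarily latest vs earliest)
SELECT Customer_Name,
       MAX(Order_Value) AS Latest_Value,
       MIN(Order_Value) AS Earliest_Value,
       (MAX(Order_Value) - MIN(Order_Value)) / MIN(Order_Value) * 100.0 AS Perc_Diff
FROM tbl_output
GROUP BY Customer_Name;
```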
Question 4: Query 2 will be far more performant than Query 1, as the index on the Order_Date field can be used, whereas Query 1 prevents any index from being used. Performing a function on any stored field (table field) anywhere outside the projection forces the database engine to ‘touch’ every record in the table in order to evaluate that function. As such, no index can be used: if your table has 1 trillion records but the date filter would return just 500,000 of them, all 1 trillion records still need to be processed. It is far better to load the records you need into a temporary source or derived table, applying the filters that do not alter the stored data, and to perform the function in the projection.
Question 5a:
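The original answer table is not included in this copy; as a reference sketch, the written order versus the logical execution order can be annotated like this:

```sql
-- Reference sketch: the clause numbers show the logical execution order,
-- i.e. FROM/JOIN, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
SELECT Customer_Name, COUNT(*) AS Order_Count   -- 5
FROM dbo.orders                                 -- 1 (including any JOINs)
WHERE Category = 'Furniture'                    -- 2
GROUP BY Customer_Name                          -- 3
HAVING COUNT(*) > 1                             -- 4
ORDER BY Order_Count DESC;                      -- 6
```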
Question 5b: Unions stack sets of data one atop the other to form a single unified output; Union removes duplicate rows from the combined result, whereas Union All keeps every row. The field names and data-types of the first result-set define the output table (whether real or virtual - in temp memory).
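A short hypothetical example (the yearly Orders tables are assumed purely for illustration):

```sql
-- Hypothetical example: dbo.Orders_2017 and dbo.Orders_2018 are assumed tables
SELECT Customer_Name FROM dbo.Orders_2017
UNION
SELECT Customer_Name FROM dbo.Orders_2018;      -- duplicates removed across both sets

SELECT Customer_Name FROM dbo.Orders_2017
UNION ALL
SELECT Customer_Name FROM dbo.Orders_2018;      -- every row kept, duplicates included
```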
Question 6: This is more abstract and has no real definitive answer given the lack of information. That said, I would probably look at the number of statuses each transaction generates, then perform a count of transactions for the past few months, identify a possible multiplier for percentage growth or shrinkage, and apply this to forecast how many transactions we could expect this month. We then simply multiply the forecast by the number of statuses per transaction to estimate the number of rows we would expect the log to hold.
Question 7:
Despite the above, arguably the best action a developer can take is to identify the source of the bottleneck, which can only really be achieved through analysis of the SQL that Tableau has generated and sent to the data-server for processing. There are two ways of getting to this: use the Performance Recorder, or access the Tableau logs - although I tend to use both, the Performance Recorder to get the initial analysis underway followed by the logs, following this pathway (users do not need to follow this, although this method will make their lives easier):
We have done this so that when we re-open the workbook, we can get straight on with the analysis rather than waiting for the workbook to render. I would recommend waiting around 30 mins at this point if possible, to allow the data server time to clear its query cache.
At the completion of this exercise, the user can use both the log and the Performance Recorder to try to identify the causes of the bottlenecks. It’s a good idea to test the generated SQL with the show/explain plan so we can get an understanding of how the workbook is impacting the server; this is more often than not the usual cause of bottlenecks. From here, users can:
Question 8: The Table Calculation is the more performant, but the use case needs to be considered too. Table Calculations are ultimately more performant than Level-of-Detail Expressions, although LODx are generally easier to use, as the scoping information is contained within the expression, and LODx can offer much more flexibility than Table Calcs. The TC will perform a basic summation, with the scoping handled by Tableau, whereas the LODx must pull through the row-level detail and create one or more derived tables that handle the grouping before joining the aggregated results back to the main row-data. As Tableau converts your expressions to SQL queries, knowing the expected queries can help us greatly reduce our performance bottlenecks. These are the two queries generated by the calculations:
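The generated queries are not reproduced in this copy; as a rough, hypothetical illustration of the two query shapes described above (field names borrowed from the earlier questions), they would look something like this:

```sql
-- Hypothetical illustration only; not the actual Tableau-generated SQL.

-- Table-calculation style: a single aggregate query, with the scoping handled by Tableau
SELECT Customer_Name,
       SUM(Order_Value) AS sum_order_value
FROM dbo.Orders
GROUP BY Customer_Name;

-- LOD-expression style: row-level detail joined back to a derived table of aggregates
SELECT o.Customer_Name,
       o.Order_Value,
       lod.sum_order_value
FROM dbo.Orders o
JOIN (
    SELECT Customer_Name,
           SUM(Order_Value) AS sum_order_value
    FROM dbo.Orders
    GROUP BY Customer_Name
) lod
  ON lod.Customer_Name = o.Customer_Name;
```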
Question 9: The better method here will be CustomSQL (cSQL). Tableau has a really fancy method called ‘join-culling’: when a large model like this is loaded, Tableau analyses the model to determine where the data is, and generates a query that uses only the tables needed to get the data, effectively ignoring or culling the other, unnecessary tables. This can produce really high-performing queries and is great for when users want to analyse different parts of the model in the same workbook; however:
If referential integrity (RI) is not defined - as is more often the case in highly denormalised reporting databases and warehouses (as explained in the question) - then Tableau must process the entirety of the model: every table and every join, returning every field from every table in the model. Furthermore, users are unable to optimise the outgoing model, whereas cSQL is functionally the same as a view: whilst Tableau needs to run the same query for every calculation that is used, as the SQL analyst you can optimise the query before you add it to the Tableau data-model.
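As a minimal, hypothetical sketch (table and column names are assumed, since the ER diagram is not reproduced here), the Custom SQL would touch only the tables the report needs:

```sql
-- Hypothetical Custom SQL sketch; table and column names are assumed, not taken from the ER diagram
SELECT o.Order_ID,
       oi.Product_ID,
       oi.Quantity,
       c.City AS Destination_City
FROM Orders o
JOIN Order_Items oi ON oi.Order_ID = o.Order_ID
JOIN Cities      c  ON c.City_ID   = o.Destination_City_ID;
```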