Congratulations - you have made it to the technical testing stage of the interview process; your suitability for the role is already confirmed.
We appreciate that tests are never easy to encounter and can often be thought of as meaningless, so we want to assure you that this test is not an assessment of your suitability for the role - you have already demonstrated your suitability by getting this far.
The purpose of this test is to enable us to understand your skill level, and to provide further training (where necessary) should you join us.
The following test is divided into 3 sections: SQL, Tableau theory, and a dashboard-building exercise.
Question 1:
dbo.orders
Customer_Name | Category |
---|---|
Claire Gute | Furniture |
Claire Gute | Furniture |
Darrin Van Huff | Office Supplies |
Sean O'Donnell | Furniture |
Sean O'Donnell | Office Supplies |
Brosina Hoffman | Furniture |
Brosina Hoffman | Office Supplies |
Brosina Hoffman | Technology |
Brosina Hoffman | Office Supplies |
Brosina Hoffman | Office Supplies |
Brosina Hoffman | Furniture |
Brosina Hoffman | Technology |
Andrew Allen | Office Supplies |
Irene Maddox | Office Supplies |
Harold Pawlan | Office Supplies |
Harold Pawlan | Office Supplies |
Each row in the above table represents an order. Please can you write the code to return the customer name, and the count of orders for all those who have more than 1 order.
Question 2:
dbo.Orders
Customer_Name | Order_ID | Order_Date | Order_Value |
---|---|---|---|
Aaron Bergman | CA-2017-140935 | 2017-11-10 | 563.94 |
Aaron Bergman | CA-2015-156587 | 2015-03-07 | 309.592 |
Aaron Bergman | CA-2015-152905 | 2015-02-18 | 12.624 |
Aaron Hawkins | CA-2018-164000 | 2018-12-18 | 18.704 |
Aaron Hawkins | CA-2017-162747 | 2017-03-20 | 86.45 |
Aaron Hawkins | CA-2016-130113 | 2016-12-27 | 991.26 |
Aaron Hawkins | CA-2015-157644 | 2015-12-31 | 53.67 |
Aaron Smayling | CA-2018-101749 | 2018-10-03 | 171.288 |
Aaron Smayling | US-2018-147655 | 2018-09-04 | 88.074 |
Aaron Smayling | CA-2018-162691 | 2018-08-01 | 1476.27 |
Aaron Smayling | CA-2017-162901 | 2017-03-28 | 31.4 |
Aaron Smayling | US-2015-150126 | 2015-07-27 | 65.78 |
Adam Bellavance | CA-2018-107174 | 2018-11-06 | 2595.388 |
Adam Bellavance | CA-2018-118213 | 2018-11-05 | 240.15 |
Adam Bellavance | CA-2018-134173 | 2018-09-16 | 20.736 |
From the above table, please can you write the code to return the earliest and the latest order for each customer, as shown in the result below.
Please ensure you are using ANSI SQL for this task.
Result:
Customer_Name | Order_ID | Order_Date | Order_Value |
---|---|---|---|
Aaron Bergman | CA-2017-140935 | 2017-11-10 | 563.94 |
Aaron Bergman | CA-2015-152905 | 2015-02-18 | 12.624 |
Aaron Hawkins | CA-2018-164000 | 2018-12-18 | 18.704 |
Aaron Hawkins | CA-2015-122070 | 2015-04-22 | 257.752 |
Aaron Smayling | CA-2018-101749 | 2018-10-03 | 171.288 |
Aaron Smayling | US-2015-150126 | 2015-07-27 | 65.78 |
Adam Bellavance | CA-2018-107174 | 2018-11-06 | 2595.388 |
Adam Bellavance | CA-2016-150511 | 2016-09-18 | 18.54 |
Question 3:
Using the result from question 2, please write a query that will return, on the same row, the customer name, the latest order value, the earliest order value, and the percentage difference between the two.
Result:
Customer_Name | Latest_Value | Earliest_Value | Perc_Diff |
---|---|---|---|
Aaron Bergman | 563.94 | 12.624 | 4367.20532319392 |
Aaron Hawkins | 18.704 | 257.752 | -92.7434122722617 |
Aaron Smayling | 171.288 | 65.78 | 160.395256916996 |
Adam Bellavance | 2595.388 | 18.54 | 13898.8565264293 |
Question 4:
dbo.Orders
First_Name | Last_Name | Order_Date | Order_ID |
---|---|---|---|
Chuck | Clark | 2018-12-30 | CA-2018-126221 |
Erica | SMITH | 2018-12-30 | CA-2018-115427 |
Jill | Matthias | 2018-12-30 | CA-2018-156720 |
Patrick | O'Donnell | 2018-12-30 | CA-2018-143259 |
Michael | Chen | 2018-12-29 | US-2018-102638 |
James | Galang | 2018-12-29 | CA-2018-118885 |
Katherine | hughes | 2018-12-29 | US-2018-158526 |
Ken | Brennan | 2018-12-29 | CA-2018-158673 |
Bruce | Stewart | 2018-12-29 | CA-2018-130631 |
Ben | Peterman | 2018-12-29 | CA-2018-146626 |
Christopher | Schild | 2018-12-28 | CA-2018-135111 |
Erica | Hernandez | 2018-12-28 | CA-2018-127516 |
Greg | Hansen | 2018-12-28 | CA-2018-136539 |
Harry | Marie | 2018-12-28 | CA-2018-129805 |
Kristen | Smith | 2018-12-30 | CA-2018-163979 |
Lori | Olson | 2018-12-28 | CA-2018-163860 |
Jason | Gross | 2018-12-28 | CA-2018-101322 |
Jennifer | smith | 2018-12-30 | CA-2018-164826 |
Pamela | Coakley | 2018-12-28 | US-2018-162068 |
Susan | SMITH | 2018-12-28 | CA-2018-122798 |
The above table contains 1 trillion rows of data, and has indexes on the Last_Name and Order_Date columns.
The Last_Name field is not stored in a consistent case, meaning some entries are in UPPERCASE, some in lowercase, and some in MiXed-mODe.
You are looking for all customers with the Last_Name of 'smith'. Text searches are case-sensitive, meaning that searching in just one case may not return the complete result. You are also only interested in orders made on 2018-12-30.
An initial analysis of the data shows that of the 1 trillion rows, 2018-12-30 contains just 500,000 (five-hundred thousand) rows of data.
In order to fetch the required data, please can you identify which of these queries is the more efficient, considering the performance enhancements available (the indexes), and explain why.
Query 1 | Query 2 |
---|---|
Question 5a:
Please can you put these query operations into the correct execution order.
Query Order | Execution Order |
---|---|
Select, From, Join, Where, Group By, Having, Order By | |
Question 5b: Please explain what a Union is, and what a Union All is. Can you give an example of their use?
There are five types of join; can you name them and provide an example of their use?
Question 6:
The table below is a transaction stream log:
dbo.txn
Row_ID | Txn_ID | Account_ID | Txn_DateTime | Status |
---|---|---|---|---|
1 | 123abc | 112233 | 2000-01-01 08:08:45 | Initiated |
2 | 895jkl | 112233 | 2000-01-01 08:09:03 | Initiated |
3 | 747ghd | 456789 | 2000-01-01 08:12:21 | Initiated |
4 | 123abc | 112233 | 2000-01-01 08:13:15 | In Process |
5 | 696pol | 456892 | 2000-01-01 08:14:37 | Initiated |
6 | 123abc | 112233 | 2000-01-01 08:17:43 | Successful |
7 | 895ghd | 784521 | 2000-01-01 08:22:51 | Initiated |
8 | 696pol | 456892 | 2000-01-01 08:23:04 | In Process |
9 | 696pol | 456892 | 2000-01-01 08:23:08 | Failed |
Can you estimate how many entries we would expect to capture in a month? Explain your reasoning.
Question 7:
Thinking about Tableau, you have a workbook connected to a data-source that is more than 3.2 billion rows and 120 columns wide.
This data needs to be refactored each day so the entirety of the table must be dropped and rebuilt in a job taking more than 12 hours each night.
As a result, this is too much data to be loaded into an extract, and too much for Tableau Server to unpack into temp memory.
Due to these size constraints, and because the number of concurrent users averages around 120 each day, the workbook can only use a live (real-time) connection.
The workbook visuals aren’t overly complex, nor do they plot a lot of marks; however, they perform a lot of look-up calculations as they are measuring multi-variate tests.
The dashboard currently takes around 5 minutes to render after loading and after each filter change; this is unacceptable and needs to be brought below 20 seconds.
Please can you describe some of the methods you would use to identify the performance bottlenecks, and the methods you would use to resolve them.
Question 8:
Looking at this table:
Which of the following expressions will be the more performant in achieving the highlighted row?
Table Calculation | Level-of-detail Expression |
---|---|
Question 9:
Consider this ER diagram:
You are connecting Tableau to this reporting database.
Whilst there are fields called ‘foreign key’ within each table, there is no referential integrity between the tables; these fields are provided for lookups and joins.
You are building a report that analyses the relationship between the numbers of products ordered and their destination city.
How would you model the data?
Option 1: Tableau's Data Model | Option 2: CustomSQL |
---|---|
Use Tableau’s data-model window, adding each field and joining based on the ER diagram | Write your own SQL query that simply processes the necessary joins, and add the data to a CustomSQL object |
Looking at this dashboard (a collection of charts), we would like you to try to replicate it using Superstore Sales (download below). We have provided the fields used on each chart, and many of the charts have clues as to what their values represent.
You need to build us 6 usable charts that tell us something about the performance of the Superstore; we would like you to attempt to replicate all of the charts.
We also appreciate that you may be unfamiliar with charts 3, 4 & 5; if you are unable to build them, please build replacements that you feel will complement the rest of the charts in the collection.
It will be awesome if you could add some interactivity into this too.
As with all parts of this assessment, if you get stuck, please reach out to us; remember, this role is collaborative, and we want to help you as much as we can.
Please ensure you read all parts of the dashboard before setting off.
Should the interview over-run, you can complete this task in your own time if you like. Should you choose to do so, please ensure you send us a version saved no later than the first sub-version of the previous year (xxxx.01); so if the latest version is 2020.02, then we want no later than 2019.01.
Workbook not loading? Open the image here: | Superstore Sales |
---|---|
Sections 1 & 2 model answers:
Question 1:
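A minimal sketch of one possible answer, assuming the table and column names shown in the question:

```sql
-- Count orders per customer and keep only those with more than one order
SELECT
    Customer_Name,
    COUNT(*) AS Order_Count
FROM dbo.orders
GROUP BY Customer_Name
HAVING COUNT(*) > 1;
```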
Question 2: Rank() or Row_Number() will help you here
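For example, a sketch using Row_Number() to pick each customer's earliest and latest orders (the column names come from the question; the exact ranking logic is one possible approach):

```sql
-- Rank each customer's orders from latest to earliest and earliest to latest,
-- then keep the first row of each ranking (i.e. the latest and earliest orders)
SELECT Customer_Name, Order_ID, Order_Date, Order_Value
FROM (
    SELECT
        Customer_Name,
        Order_ID,
        Order_Date,
        Order_Value,
        ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date DESC) AS rn_latest,
        ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date ASC)  AS rn_earliest
    FROM dbo.Orders
) ranked
WHERE rn_latest = 1 OR rn_earliest = 1
ORDER BY Customer_Name, Order_Date DESC;
```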
Question 3: The better method will be:
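A sketch of the intended approach, re-using the ranking idea from question 2 (the exact shape is an assumption):

```sql
-- Pivot each customer's first and last order values onto a single row,
-- then calculate the percentage difference between them
SELECT
    Customer_Name,
    MAX(CASE WHEN rn_latest = 1   THEN Order_Value END) AS Latest_Value,
    MAX(CASE WHEN rn_earliest = 1 THEN Order_Value END) AS Earliest_Value,
    (MAX(CASE WHEN rn_latest = 1   THEN Order_Value END)
   - MAX(CASE WHEN rn_earliest = 1 THEN Order_Value END))
   / MAX(CASE WHEN rn_earliest = 1 THEN Order_Value END) * 100 AS Perc_Diff
FROM (
    SELECT
        Customer_Name,
        Order_Value,
        ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date DESC) AS rn_latest,
        ROW_NUMBER() OVER (PARTITION BY Customer_Name ORDER BY Order_Date ASC)  AS rn_earliest
    FROM dbo.Orders
) ranked
GROUP BY Customer_Name;
```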
The above method will allow you to identify the first and last orders rather than simply the highest and lowest; this helps with understanding whether performance has increased or decreased between the two orders. For a simple max vs min, however, you could also use:
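Again a sketch, assuming the same dbo.Orders columns:

```sql
-- Compare the highest and lowest order values per customer
-- (note: these are not necessarily the latest and earliest orders)
SELECT
    Customer_Name,
    MAX(Order_Value) AS Latest_Value,
    MIN(Order_Value) AS Earliest_Value,
    (MAX(Order_Value) - MIN(Order_Value)) / MIN(Order_Value) * 100 AS Perc_Diff
FROM dbo.Orders
GROUP BY Customer_Name;
```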
Question 4: Query 2 will be far more performant than Query 1, as the index on the Order_Date field can be used, whereas Query 1 prevents any index from being used. Performing a function on any stored field (table field) anywhere outside the projection forces the db engine to ‘touch’ every record in the table to evaluate the function against that field. As such, no index can be used; so if your table has 1 trillion records, but the date filter would return just 500,000 records, all 1 trillion records still need to be processed. It is far better to load the records you need into a temporary source or derived table, adding the filters that can be applied without altering the stored data, and to perform the function in the projection.
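Since the two candidate queries themselves are not reproduced above, here is a hedged illustration of the principle, assuming Query 1 wrapped the stored columns in functions and Query 2 filtered the raw Order_Date column first:

```sql
-- Illustrative only: functions wrapped around the stored columns defeat the
-- indexes, so every one of the 1 trillion rows must be evaluated
SELECT First_Name, Last_Name, Order_ID
FROM dbo.Orders
WHERE LOWER(Last_Name) = 'smith'
  AND CAST(Order_Date AS CHAR(10)) = '2018-12-30';

-- Illustrative only: filtering on the raw, indexed Order_Date column first lets
-- the engine seek straight to the ~500,000 rows for that day; the case handling
-- is then applied to that much smaller derived set
SELECT First_Name, Last_Name, Order_ID
FROM (
    SELECT First_Name, Last_Name, Order_ID
    FROM dbo.Orders
    WHERE Order_Date = '2018-12-30'
) day_orders
WHERE LOWER(Last_Name) = 'smith';
```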
Question 5a: The logical execution order is: From, Join, Where, Group By, Having, Select, Order By.
Question 5b: Unions stack sets of data one atop the other to form a single unified output. The field names and data-types of the first result-set define the table (whether real or virtual - in temp memory).
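A small hedged example, using two hypothetical order tables:

```sql
-- UNION removes duplicate rows across the two result-sets
SELECT Customer_Name, Order_ID FROM online_orders
UNION
SELECT Customer_Name, Order_ID FROM store_orders;

-- UNION ALL keeps every row, duplicates included, and skips the
-- de-duplication step, so it is usually cheaper
SELECT Customer_Name, Order_ID FROM online_orders
UNION ALL
SELECT Customer_Name, Order_ID FROM store_orders;
```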
Question 6: This is more abstract and has no real definitive answer given the lack of information. That said, I would probably look at the number of statuses each transaction generates, then perform a count of transactions for the past few months, identify a possible multiplier for percentage growth or shrinkage, and apply this to forecast how many transactions we could expect this month. We then simply multiply the forecast by the average number of statuses per transaction to estimate the number of rows we would expect the log to hold.
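A hedged sketch of how those inputs could be gathered from the log itself (the EXTRACT-based month bucketing is an assumption about the SQL dialect available):

```sql
-- Average number of status rows generated per transaction
SELECT COUNT(*) * 1.0 / COUNT(DISTINCT Txn_ID) AS avg_statuses_per_txn
FROM dbo.txn;

-- Distinct transactions per month, to spot growth or shrinkage
SELECT
    EXTRACT(YEAR  FROM Txn_DateTime) AS txn_year,
    EXTRACT(MONTH FROM Txn_DateTime) AS txn_month,
    COUNT(DISTINCT Txn_ID)           AS txn_count
FROM dbo.txn
GROUP BY EXTRACT(YEAR FROM Txn_DateTime), EXTRACT(MONTH FROM Txn_DateTime)
ORDER BY txn_year, txn_month;
```

Multiplying the forecast transaction count for the coming month by the average number of statuses per transaction then gives the estimated number of log rows.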
Question 7: Despite the above, arguably the best action a developer can take is to identify the source of the bottleneck, which can only really be achieved through analysis of the SQL that Tableau has generated and sent to the data-server for processing. There are two ways of getting to this: either use the Performance Recorder, or access the Tableau logs. I tend to use both - the performance recorder to get the initial analysis underway, followed by the logs - following this pathway (users do not need to follow this, although this method will make their lives easier):
We have done this so that when we re-open the workbook, we can get straight on with the analysis rather than waiting for the workbook to render. I would recommend waiting around 30 mins at this point if possible, to allow the data server time to clear its query cache.
At the completion of this exercise, the user can use both the log and the performance recorder to try to identify the causes of the bottlenecks. It’s a good idea to test the generated SQL with the show/explain plan so we can get an understanding of how the workbook is impacting the server; this is, more often than not, the usual cause of bottlenecks. From here, users can:
Question 8: The Table Calculation is the most performant, but the use case needs to be considered too. Table Calculations are ultimately more performant than Level-of-Detail Expressions, although LODx are generally easier to use, as the scoping information is contained within the expression, and LODx can offer much more flexibility than Table Calcs. The TC will perform a basic summation, with the scoping handled by Tableau, whereas the LODx must pull through the row-level detail and create one or more derived tables that handle the grouping, before joining the aggregated results back to the main row-data. As Tableau converts your expressions to SQL queries, knowing the expected queries can help us greatly reduce our performance bottlenecks. These are the two queries generated by the calculations:
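The literal queries are not reproduced here; the following is an illustrative sketch of their shapes only, using a hypothetical sales table, not the exact SQL Tableau emits:

```sql
-- Table Calculation style: Tableau asks only for the aggregate at the
-- visualisation's level of detail and handles the scoping itself
SELECT Category, SUM(Sales) AS sum_sales
FROM sales
GROUP BY Category;

-- LOD expression style: a derived table handles the grouping at the declared
-- dimensionality and is joined back to the row-level data
SELECT s.Category, s.Sales, lod.sum_sales
FROM sales s
INNER JOIN (
    SELECT Category, SUM(Sales) AS sum_sales
    FROM sales
    GROUP BY Category
) lod
    ON s.Category = lod.Category;
```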
Question 9: The better method here will be CustomSQL (cSQL). Tableau has a really fancy method called ‘join-culling’: when a large model like this is loaded, Tableau analyses the model to determine where the data is and generates a query that uses only the tables needed to get the data, effectively ignoring or culling the other, unnecessary tables. This can produce really high-performing queries and is great for when users want to analyse different parts of the model in the same workbook; however:
If RI (referential integrity) is not defined – as is more often the case in highly denormalised reporting databases and warehouses (as explained in the question) – then Tableau must process the entirety of the model: every table and every join, and return every field from every table in the model. Furthermore, users are unable to optimise the outgoing model, whereas the cSQL is functionally the same as a view; so whilst Tableau needs to run the same query for every calculation that is used, as the SQL analyst you can optimise the query before you add it to the Tableau data-model.
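For illustration only - the real table and key names come from the ER diagram, which is not reproduced here, so every name below is an assumption - the cSQL object might look something like:

```sql
-- Hypothetical table and column names: only the joins and fields the report
-- needs (product quantities and destination city), nothing else
SELECT
    ci.City_Name     AS Destination_City,
    pr.Product_Name,
    SUM(od.Quantity) AS Products_Ordered
FROM order_details od
INNER JOIN orders   o  ON o.Order_ID    = od.Order_ID
INNER JOIN products pr ON pr.Product_ID = od.Product_ID
INNER JOIN cities   ci ON ci.City_ID    = o.Ship_City_ID
GROUP BY ci.City_Name, pr.Product_Name;
```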