New Hire Process: Technical Assessment

Congratulations: you have made it to the technical testing stage of the interview process. Your suitability for the role is already confirmed.

We appreciate that tests are never easy to face, and can often be seen as meaningless, so we want to assure you that this test is not an assessment of your suitability for the role - you have already demonstrated your suitability by getting this far.
The purpose of this test is to enable us to understand your skill level, and to provide further training (where necessary) should you join us.

The following test is divided into 3 sections:

  1. The SQL test: Often undervalued or misunderstood. For us, strong SQL skills are vital for all visualisation work, from modelling, to calculations, to performance tuning.
  2. Tableau: A deep assessment of your Tableau product knowledge.
  3. A collection of Tableau charts for you to reproduce; you will need access to a full version of Tableau (either Desktop or Public) to undertake this. Follow this link to get the latest version of Tableau Public.


Section 1: SQL

Question 1: 

dbo.orders
Customer_Name      Category
Claire Gute        Furniture
Claire Gute        Furniture
Darrin Van Huff    Office Supplies
Sean O'Donnell     Furniture
Sean O'Donnell     Office Supplies
Brosina Hoffman    Furniture
Brosina Hoffman    Office Supplies
Brosina Hoffman    Technology
Brosina Hoffman    Office Supplies
Brosina Hoffman    Office Supplies
Brosina Hoffman    Furniture
Brosina Hoffman    Technology
Andrew Allen       Office Supplies
Irene Maddox       Office Supplies
Harold Pawlan      Office Supplies
Harold Pawlan      Office Supplies

Each row in the above table represents an order. Please can you write the SQL to return the customer name, and the count of orders, for all customers who have more than one order.


Question 2:

dbo.Orders

Customer_Name      Order_ID          Order_Date    Order_Value
Aaron Bergman      CA-2017-140935    2017-11-10    563.94
Aaron Bergman      CA-2015-156587    2015-03-07    309.592
Aaron Bergman      CA-2015-152905    2015-02-18    12.624
Aaron Hawkins      CA-2018-164000    2018-12-18    18.704
Aaron Hawkins      CA-2017-162747    2017-03-20    86.45
Aaron Hawkins      CA-2016-130113    2016-12-27    991.26
Aaron Hawkins      CA-2015-157644    2015-12-31    53.67
Aaron Smayling     CA-2018-101749    2018-10-03    171.288
Aaron Smayling     US-2018-147655    2018-09-04    88.074
Aaron Smayling     CA-2018-162691    2018-08-01    1476.27
Aaron Smayling     CA-2017-162901    2017-03-28    31.4
Aaron Smayling     US-2015-150126    2015-07-27    65.78
Adam Bellavance    CA-2018-107174    2018-11-06    2595.388
Adam Bellavance    CA-2018-118213    2018-11-05    240.15
Adam Bellavance    CA-2018-134173    2018-09-16    20.736

From the above table, please can you write the code to return:

  • Customer Name
  • First and last Order_ID
  • First and last Order_Date
  • First and last Order_Value

Please ensure you are using ANSI SQL for this task

Expected result [tbl_output]:

Customer_Name      Order_ID          Order_Date    Order_Value
Aaron Bergman      CA-2017-140935    2017-11-10    563.94
Aaron Bergman      CA-2015-152905    2015-02-18    12.624
Aaron Hawkins      CA-2018-164000    2018-12-18    18.704
Aaron Hawkins      CA-2015-122070    2015-04-22    257.752
Aaron Smayling     CA-2018-101749    2018-10-03    171.288
Aaron Smayling     US-2015-150126    2015-07-27    65.78
Adam Bellavance    CA-2018-107174    2018-11-06    2595.388
Adam Bellavance    CA-2016-150511    2016-09-18    18.54

Question 3:

Using the result from question 2 [tbl_output], please write a query that returns, on the same row, the:

  • Customer Name
  • First Order Value
  • Last Order Value
  • Percentage difference between the first and last orders

Expectation:

Customer_Name      Latest_Value    Earliest_Value    Perc_Diff
Aaron Bergman      563.94          12.624            4367.20532319392
Aaron Hawkins      18.704          257.752           -92.7434122722617
Aaron Smayling     171.288         65.78             160.395256916996
Adam Bellavance    2595.388        18.54             13898.8565264293

Question 4:

dbo.Orders

First_Name     Last_Name    Order_Date    Order_ID
Chuck          Clark        2018-12-30    CA-2018-126221
Erica          SMITH        2018-12-30    CA-2018-115427
Jill           Matthias     2018-12-30    CA-2018-156720
Patrick        O'Donnell    2018-12-30    CA-2018-143259
Michael        Chen         2018-12-29    US-2018-102638
James          Galang       2018-12-29    CA-2018-118885
Katherine      hughes       2018-12-29    US-2018-158526
Ken            Brennan      2018-12-29    CA-2018-158673
Bruce          Stewart      2018-12-29    CA-2018-130631
Ben            Peterman     2018-12-29    CA-2018-146626
Christopher    Schild       2018-12-28    CA-2018-135111
Erica          Hernandez    2018-12-28    CA-2018-127516
Greg           Hansen       2018-12-28    CA-2018-136539
Harry          Marie        2018-12-28    CA-2018-129805
Kristen        Smith        2018-12-30    CA-2018-163979
Lori           Olson        2018-12-28    CA-2018-163860
Jason          Gross        2018-12-28    CA-2018-101322
Jennifer       smith        2018-12-30    CA-2018-164826
Pamela         Coakley      2018-12-28    US-2018-162068
Susan          SMITH        2018-12-28    CA-2018-122798

This table contains 1 trillion rows of data, and has indexes on the Last_Name and Order_Date columns.
Last_Name values are stored inconsistently, meaning entries can be UPPERCASE, lowercase, or MiXed-cASe.

You are looking for all customers whose Last_Name is 'smith' (in any case). Text searches are case-sensitive, meaning that searching in just one case may not return all customers. You are also only interested in orders made on 2018-12-30.

An initial analysis of the data shows that of the 1 trillion rows, 2018-12-30 contains just 500,000 (five-hundred thousand) rows of data.

Which of these queries is the more efficient way to fetch the required data, considering the performance enhancements available? Explain why.

Query 1

Query 2

Question 5a:

Please can you put these query operations into the correct execution order:

Query Order    Execution Order
Select
From
Join
Where
Group By
Having
Order By


Question 5b:

What is a Union, and what is a Union All? How do they differ? Can you provide an example of their use?

Please can you name the five join types, and provide an example of their use.


Question 6:

This table is a transaction stream log:

Row_ID    Txn_ID    Account_ID    Txn_DateTime           Status
1         123abc    112233        2000-01-01 08:08:45    Initiated
2         895jkl    112233        2000-01-01 08:09:03    Initiated
3         747ghd    456789        2000-01-01 08:12:21    Initiated
4         123abc    112233        2000-01-01 08:13:15    In Process
5         696pol    456892        2000-01-01 08:14:37    Initiated
6         123abc    112233        2000-01-01 08:17:43    Successful
7         895ghd    784521        2000-01-01 08:22:51    Initiated
8         696pol    456892        2000-01-01 08:23:04    In Process
9         696pol    456892        2000-01-01 08:23:08    Failed

Can you estimate how many entries we would expect to capture in a month? Explain.

Section 2: Advanced Tableau Questions

Question 7:

Thinking about Tableau, you have a workbook connected to a data-source that is more than 3 billion rows and is around 180 columns wide.
This data needs to be refactored each day, so the entire table must be dropped and rebuilt in a job that takes more than 12 hours each night.
Data this large is too much to be loaded into an extract, and too much for Tableau Server to unpack into temp memory.

Due to these size constraints, and because the number of concurrent users averages around 120 each day, the workbook can only use a live, real-time connection.

The workbook visuals aren’t overly complex, nor do they plot a lot of marks; however, they perform a lot of look-up calculations, as they are measuring multi-variate tests.

The dashboard currently takes around 5 minutes to render after loading and after each filter change; this is unacceptable and needs to come down to below 20 seconds.

Please can you describe some of the methods you would use to identify the performance bottlenecks, and how you would go about resolving them?


Question 8:

Looking at this table:

Which of these two calculations is the more performant in achieving the highlighted row? Explain.

Table Calculation           Level-of-detail Expression
Window_Sum(Sum(Sales))      Sum({Fixed Segment, Category : Sum(Sales)})

Question 9:

The version 2020.2 changes to the Tableau data model have superseded this, but it is still pertinent for a Tableau developer to understand.


Consider this ER diagram:

You are connecting Tableau to this reporting database.
Whilst there are fields called ‘foreign key’ within each table, there is no referential integrity between the tables; these fields are provided for lookups and joins.

You are building a report that analyses the relationship between the numbers of products ordered and their destination city.

How would you model the data?

Option 1: Tableau's Data Model
Use Tableau’s data-model window, adding each field and joining based on the ER diagram.

Option 2: CustomSQL
Write your own query using SQL that simply processes the necessary joins, and add the data to a CustomSQL object.

Section 3: Tableau Task

Looking at this dashboard (a collection of charts), we would like you to try to replicate it using Superstore Sales (download below). We have provided the fields used on each chart, and many of the charts have clues as to what their values represent.
You need to build us 6 usable charts that tell us something about the performance of the Superstore; we would like you to attempt to replicate all the charts.

We also appreciate that you may be unfamiliar with charts 3, 4 & 5. If you are unable to build charts 3, 4 & 5, please build replacements that you feel will complement the rest of the charts in the collection.


It will be awesome if you could add some interactivity into this too.

As with all parts of this assessment, if you get stuck, please reach out to us; remember, this role is collaborative, and we want to help you as much as we can.

Please ensure you read all parts of the dashboard before setting off.

Should the interview over-run, you can complete this task in your own time if you like. Should you choose to do so, please ensure you send us a version saved no later than the previous year's first sub-version (xxxx.01); so if the latest version is 2020.02, then we want no later than 2019.01.


Workbook not loading? Open the image here: Superstore Sales



 <...>

Sections 1 & 2 model answers:

Question 1:
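
One possible solution (a sketch in ANSI SQL, assuming the dbo.orders table as shown in the question):

Select
 Customer_Name
,Count(*) As Order_Count                -- each row represents one order
From dbo.orders
Group By Customer_Name
Having Count(*) > 1                     -- keep only customers with more than 1 order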

Question 2:

Rank() or Row_Number() will help you here
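
For example, a sketch using Row_Number() against the dbo.Orders table (ANSI SQL; exact syntax may vary slightly by platform):

Select
 Customer_Name
,Order_ID
,Order_Date
,Order_Value
From (
	Select
	 Customer_Name
	,Order_ID
	,Order_Date
	,Order_Value
	,Row_Number() Over (Partition By Customer_Name Order By Order_Date Asc)  As rn_first   -- 1 = earliest order
	,Row_Number() Over (Partition By Customer_Name Order By Order_Date Desc) As rn_last    -- 1 = latest order
	From dbo.Orders
) t
Where rn_first = 1 Or rn_last = 1
Order By Customer_Name, Order_Date Desc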

Question 3:

The better method will be:
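
A sketch of one way this can be written against tbl_output from question 2, pivoting the first and last orders by date onto a single row:

Select
 Customer_Name
,Latest_Value
,Earliest_Value
,(Latest_Value - Earliest_Value) / Earliest_Value * 100 As Perc_Diff
From (
	Select
	 t.Customer_Name
	,Max(Case When t.Order_Date = d.Last_Date  Then t.Order_Value End) As Latest_Value   -- value of the last order
	,Max(Case When t.Order_Date = d.First_Date Then t.Order_Value End) As Earliest_Value -- value of the first order
	From tbl_output t
	Join (
		Select
		 Customer_Name
		,Min(Order_Date) As First_Date
		,Max(Order_Date) As Last_Date
		From tbl_output
		Group By Customer_Name
	) d On t.Customer_Name = d.Customer_Name
	Group By t.Customer_Name
) x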

The above method will allow you to identify the first and last orders rather than simply the highest and lowest; this helps with understanding whether performance has increased or decreased between the two orders. For a simple max vs min comparison, you could also use:
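
For example (a sketch against tbl_output, comparing the highest and lowest order values rather than the first and last):

Select
 Customer_Name
,Max(Order_Value) As Highest_Value
,Min(Order_Value) As Lowest_Value
,(Max(Order_Value) - Min(Order_Value)) / Min(Order_Value) * 100 As Perc_Diff
From tbl_output
Group By Customer_Name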

Question 4:

Query 2 will be far more performant than Query 1, as it allows the index on the Order_Date field to be used, whereas Query 1 prevents any index from being used.

Applying a function to a stored field (table column) anywhere outside the projection causes the db engine to ‘touch’ every record in the table to evaluate that function. As such, no index can be used; so if your table has 1 trillion records, but the date filter would return just 500,000 records, all 1 trillion records still need to be processed.

It is far better to load the records you need into a temporary source or derived table, applying only the filters that do not alter the stored data, and to perform the function in the projection of the outer query.
In so doing, the index on the date field can be used to build the temporary table, and the upper-case filter is performed on the outer table.
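
As an illustration of the two patterns only (a sketch; not the exact text of Query 1 and Query 2 shown above):

-- Pattern 1: the function wraps the stored column, so no index can be used
Select First_Name, Last_Name, Order_Date, Order_ID
From dbo.Orders
Where Upper(Last_Name) = 'SMITH'
  And Order_Date = '2018-12-30'

-- Pattern 2: the indexed date filter builds a small derived table first,
-- and the case-folding function is applied only to that much smaller result
Select First_Name, Last_Name, Order_Date, Order_ID
From (
	Select First_Name, Last_Name, Order_Date, Order_ID
	From dbo.Orders
	Where Order_Date = '2018-12-30'
) t
Where Upper(Last_Name) = 'SMITH'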


Question 5a:

Execution Order

From
Join
Where
Group By
Having
Select
Order By

Question 5b:

Unions stack sets of data one atop the other to form a single unified output. The field names and data-types of the first result-set define the table (whether real or virtual - in temp memory).

Union: filters the sets as they are being stacked, ensuring each row in the output is unique.
Union All: does not filter the stacks for uniqueness; so when unioning the output of the same table with itself (assume it has 1,000 rows), a Union will consist of only 1,000 rows, whereas a Union All will consist of 2,000 rows.
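
A brief example of each (a sketch, reusing the dbo.orders table from question 1):

-- Union: duplicate rows across the two result-sets are removed
Select Customer_Name From dbo.orders Where Category = 'Furniture'
Union
Select Customer_Name From dbo.orders Where Category = 'Technology'

-- Union All: every row from both result-sets is kept, duplicates included
Select Customer_Name From dbo.orders Where Category = 'Furniture'
Union All
Select Customer_Name From dbo.orders Where Category = 'Technology'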
Inner: Returns data from both tables only where a related field exists in both tables.

Left: Returns all data from the primary table, and data from the secondary table only where the related field exists; all other fields are nulled.

Right: Functionally the same as a left join, just reversed.

Full: Equivalent to a left and right join combined; all data from both tables is returned, and where a related field exists in both tables, the row will be complete.

Cross: All rows of both tables are joined to one another with no join clause. Often cited as great for creating test data and little else; however, with a little creativity, this can generate null entries where data does not exist rather than simply being filtered away, such as when no sales data exists for a customer.
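
A couple of brief examples of their use (a sketch, assuming hypothetical dbo.Customers and dbo.Categories tables alongside dbo.Orders, keyed on Customer_ID):

-- Left join: every customer, with their orders where they exist (otherwise null)
Select c.Customer_Name, o.Order_ID, o.Order_Value
From dbo.Customers c
Left Join dbo.Orders o
  On o.Customer_ID = c.Customer_ID

-- Cross join: every customer paired with every category,
-- useful for surfacing customer/category combinations that have no sales
Select c.Customer_Name, cat.Category
From dbo.Customers c
Cross Join dbo.Categories cat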

Question 6:

This is more abstract and has no real definitive answer given the lack of information. That said, I would probably look at the number of statuses each transaction generates, then perform a count of transactions for the past few months, identify a possible multiplier for percentage growth or shrinkage, and apply this to forecast how many transactions we could expect this month. We then simply multiply the forecast by the number of statuses per transaction to get an idea of the estimated number of rows we would expect the log to hold.
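
A sketch of the sort of query that would feed the estimate, assuming the log is stored in a table called dbo.Txn_Log (a hypothetical name):

Select
 Count(Distinct Txn_ID)                              As Monthly_Txns            -- transactions in the month
,Count(*)                                            As Monthly_Rows            -- log entries in the month
,Cast(Count(*) As Float) / Count(Distinct Txn_ID)    As Avg_Statuses_Per_Txn    -- statuses generated per transaction
From dbo.Txn_Log
Where Txn_DateTime >= '2000-01-01'
  And Txn_DateTime <  '2000-02-01'

Run over several months, the transaction counts give the growth multiplier, and the average statuses per transaction turns a forecast of transactions into a forecast of log rows.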

Question 7:

Problem                                         Possible Resolution
Using calculations to adjust attribute names    Use aliases
Including a fixed title in a calculation        Use text boxes
Using calculations for grouping                 Use the group field
Using nested calculations                       Create calculations and use the results in forward calcs
Quickfilters                                    Try to limit to fewer than 6 quickfilters and/or make use of parameters

Despite the above, arguably the best action a developer can take will be to identify the source of the bottleneck, which can only really be achieved through analysis of the SQL that Tableau has generated and sent to the data-server for processing.

There are two ways of getting to this: either use the Performance Recorder, or access the Tableau logs. I tend to use both, with the Performance Recorder getting the initial analysis underway, followed by the logs, using this pathway:

(users do not need to follow this, although this method will make their lives easier)

  1. Open the workbook
  2. Create a new blank worksheet
  3. Unhide ALL worksheets, ensuring you remain on or navigate back to the blank worksheet
  4. Save, and close.


We have done this so that when we re-open the workbook, we can get straight on with the analysis rather than waiting for the workbook to render.

I would recommend waiting around 30 mins at this point if possible, to allow the data server time to clear its query cache.

  1. Navigate to your Tableau logs: (default) Documents\My Tableau Repository\Logs and delete or move them.
  2. Now re-open the workbook, activate the Performance Recorder, and open the newly created log in a text editor such as Notepad++, Sublime, Visual Studio, etc.; be ready to keep refreshing
  3. Move through each sheet, allowing time to render, so the queries can be captured both to the recorder (which is currently generating its own temp log), and to the main log

At the completion of this exercise, the user can use both the log and the performance recorder to try to identify the causes of the bottle-necks.

It’s a good idea to test the generated SQL with the show/explain plan so we can get an understanding of how the workbook is impacting the server; more often than not, this is the usual cause of bottlenecks. From here, users can:

  • Identify possible table and field optimisations to ensure the data can be retrieved quicker
  • Identify additional additive and semi-additive fields that can be added to the table, such as logic expressions, and numerators and denominators
  • Identify very complex queries resulting from equally complex calculations, ready to optimise and/or rebuild in the front-end


Question 8:

The Table Calculation is the more performant, but the use case needs to be considered too:

Table Calculations are ultimately more performant than Level-of-Detail Expressions, although LODx are generally easier to use, as the scoping information is contained within the expression, and LODx can offer much more flexibility than Table Calcs.

The TC will perform a basic summation, with the scoping handled by Tableau, whereas the LODx must pull through the row-level detail and create one or more derived tables that handle the grouping, before joining the aggregated results back to the main row-data.

As Tableau converts your expressions to SQL queries, knowing the expected queries can help us greatly reduce our performance bottlenecks.

These are the two queries generated by the calculations:

Tableau is pre-programmed to write SQL from pre-defined templates, so the SQL produced may not be as performant as SQL you have defined yourself, may perform additional unnecessary actions that we cannot directly adjust, and can greatly affect server performance.


Table Calculation

Tableau:

Window_Sum(Sum(Sales))

SQL:

Select
Segment
,Category
,Region
,Sum(Sales) As Sales

From Superstore
Where Segment = 'Consumer'
Group By
Segment, Category, Region 

Level-of-detail Expression

Tableau:

Sum({Fixed Segment, Category : Sum(Sales)})

SQL:

Select
t1.Segment
,t1.Category
,Sum(t2.Sales) As Sales

From (
	Select
	Segment
	,Category
	,Region

	From Superstore
	Where Segment = 'Consumer'
	Group By Segment, Category, Region
)t1

Join (
	Select
	Segment
	,Category
	,Sum(Sales) As Sales

	From Superstore
	Group By Segment, Category
)t2

On t1.Segment = t2.Segment And t1.Category = t2.Category

Group By 
t1.Segment, t1.Category


Question 9:

The better method here will be CustomSQL (cSQL).

Tableau has a really fancy method called ‘join-culling’: when a large model like this is loaded, Tableau analyses the model to determine where the data is, and generates a query that only uses the tables needed to get the data, effectively ignoring, or culling, the other unnecessary tables.

This can produce really performant queries and is great for when users want to analyse different parts of the model in the same workbook; however:

Referential integrity (RI) needs to be defined between each of the tables for Tableau to utilise join-culling.

If RI is not defined, as is more often the case in highly denormalised reporting databases and warehouses (as explained in the question), then Tableau must process the entirety of the model: every table and every join, returning every field from every table in the model.

Furthermore, users are unable to optimise the outgoing model, whereas the cSQL is functionally the same as a view; so whilst Tableau needs to run the same query for every calculation that is used, as the SQL analyst you can optimise the query before you add it to the Tableau data-model.
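
For example, the CustomSQL for the products-ordered-versus-destination-city analysis might need only a handful of tables (a sketch; the table and column names here are illustrative, as the ER diagram is not reproduced):

Select
 a.City
,Sum(oi.Quantity) As Products_Ordered
From Order_Items oi
Join Orders    o On o.Order_ID   = oi.Order_ID
Join Addresses a On a.Address_ID = o.Ship_Address_ID
Group By a.City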

In an ideal world, the best option would be to create a view, but as views are permanent objects, users usually require DDL (Data Definition Language) permissions to create them.

DDL permissions often extend to db-admin functions too; as such, DDL permissions are more often denied, thus preventing developers from being able to create views.


