Its a sad fact of life but Tableau probably won't be around forever
Certainly as your company seeks out newer tools and data continues to grow exponentially
The problem is, your company knows this and has built, and continues to build systems in order to progress
It is very rare, and in most cases, improbable that your Tableau reports and analyse's will consume data from a dedicated for-Tableau optimised data-warehouse, at best, you can hope that a warehouse for visualisation be made available, although most of the time, you will be consuming the same data as everyone else, such that unless you are a member of the BI team, or have ddl (Data Definition Language) access to the databases (required to create and optimise tables, views, etl etc), the chances are, you may not have a chance to optimise the back-end.
Best practice for performance is key to everything we do as Tableau developers; if our reports slow-down or, our reports create bottle-necks further down the chain, you can be guaranteed of the following (in this order):
- Users will tire of waiting, ultimately leading them to find alternatives for their answers
- Our reports may find themselves permanently disabled
- Who needs Tableau when we can write our own SQL and present in Excel? Sure, Tableau is prettier and makes it easier but those reports that slow everything down cannot be used
I actually experienced this myself several years ago: I was hired to fix the reports and improve their performance down from more than 5 minutes to do anything. Things had become so bad that the team had taken to Excel and had ditched the dedicated Tableau reports.
It took 2 months of investigation and re-building from scratch to fix the underlying problems in performance to achieve rendering times of under 10 seconds on a billion-row, 200-column mixed nvarchar / bigint table. Paid-off too as the report is still in use now (4 years later), which is great really given that this is the main R&D platform
I remember my Family Law lecturer when studying for my degree making the point that really, the subject should be called anti-family law as we spend more time trying to find the best approach to amicably break the family apart rather than creating one and/or keeping it together.
Using Tableau is no different: In order to ensure we as developers are getting the best from Tableau, we really need to be asking "is this method the best way to achieve this" or "is Tableau the best tool for the job" or "do we actually need this" and finally "I'm sure we can demonstrate this more simply than that"
Data drives everything.
Including your reports, but data is not always optimised for the job in-hand, and is often unable to be, so really you should see data and data performance as the main drivers behind your reports, don't try and over-wrangle Tableau.
So this where the following guide steps in, and a little more information can be found on the following pages:
Finally, this is my list of items to consider for best practice for all of your analyses and dashboards:
(1.) Only use calculations when you need them:
But calculations are Tableau's bread-and-butter aren't they? What can I possibly do if I cannot write calculations?
... you will need to create calculations in order for you to provide your analyses however, there are right ways and wrong ways to create calculations with the wrong ways having a detrimental effect to the back-end; however, only through the of the performance recorder will you know for sure the true impacts that you calculations are having.
Classically, consider this calculation:
This is the query and execution plan from SQL Server:
And now in Tableau:
|And now in Tableau||How Tableau has decided to interpret the calculation||And the plan of this new expression|
Notice the additional operation in the plan?
In this instance, this has resulted in a minimal change on a 9994 row set, but can you imagine the impact this could have on a million / billion or even trillion row set? Moreover, this example uses a table with no performance enhancements which has resulted in a full table scan and hash-match in both queries but crucially, if the table was indexed, the first query would likely be able to use the index to reduce the number of rows to be processed whereas the resultant query Tableau has produced, would mean the index would discarded and a full table scan would be required.
Moral of the story, yes, calculations are needed, you will be lost without them, but, just ensure the calculations you write perform to their best with minimal impact eg:
- Try to avoid huge nested case statements by getting to understand the relationship with the logic, this will allow the data-engine to quickly evaluate the statement regardless as to how Tableau re-writes your seemingly beautiful statement
- Utilise the RAW_ calculations wherever possible, these are designed to prevent Tableau from re-writing your query
- Create and test you calculations against the source to ensure they are optimally performant before moving them into Tableau
- Where possible, utilise external Python and R for complex calculations or when using a database, try to have complicated expressions built to the server a user-defined calculations to move some of the load from the main engine
- Only create the calculations you need
- And Never, be afraid to create multiple calculations that make slight alterations - Tableau will only use materialised calculations in its final query so try to avoid "one-size fit's all calculations"
(2.) Wherever possible ALWAYS use CustomSQL
Hold on a moment, the majority of the articles available online and other Tableau users generally advise to avoid CustomSQL and utilise the Tableau data-model, but you are telling me the exact reverse - why?
... that you will benefit from the best experience.
See my article Lets Talk: Custom SQL an in-depth explanation but CustomSQL offers the best performance: As an SQL analyst, you have the tools to be able to write much more performant code, and to test this code and where necessary, gain advice from others about how best to tune the query whereas, simply dragging and dropping the table objects and relying on Tableau to process the fields and connections, means that no-one will be able to optimise the code.
Also, most BI teams make data available through the use of optimised view's rather than direct access to the table so when Tableau executes its connection such as:
The query that Tableau is sending to the database engine will be a variation of:
this view for example:
will generate a query similar to:
But, where view's are used as is the case here, the actual query that needs to be processed by the database engine will actually look more like:
Which you can see is functionally identical to oft quoted reasons for NOT using CustomSQL.
However, two major points not covered which are generally the most important reasons for using CustomSQL:
- Notice that although Tableau is selective as to what data it is requesting, the view unfortunately will need to return all the data and perform in-memory filtering to get to the required data as a view, is simply the result-set of a query
- You on the other hand will be able to alter the selection in your query to get only the data you need to satisfy your requirements, even if this means creating multiple bespoke connections; keep in mind that a connection is nothing more than a semantic layer, and will only be made when the data is required from that connection.
- Parameterised queries: This neat little function allows for user selection driven data or rather, being able to write dynamic queries that change based on the selections of parameters, means you can create dynamic case statements, have a much selective where clause, filtering the source data before other filters are applied on the final result-set etc.
In short, CustomSQL allows you to harness the best from your database with the least amount of impact and to be able better control you data
(3.) Avoid nesting and materialise wherever possible
Ok, I think I understand this one, instead of writing nested queries, you suggest I should create separate calculations for what would be the nested calculations? Why, surely this will lead to more processing
... a marginal rise, although, there are two more important reasons for this:
1.Reading point 1 above, Tableau has a tendency to convert even the most simplest of queries into considerably more complicated queries which can, and generally speaking do impact data-server resource, and more frustratingly, cause the data server to disregard indexes and other performance optimisations that may have nee applied to the table, so writing individual calculations, will cause these to be processed in a separate transaction.
2.Secondly, nesting calculations rather than pre-materialising the outputs can lead to calculation errors, as Tableau is only able to consider the data in the view as being the partition eg:
Assume you are wanting the average sum across the complete set of 200M rows of data, but your sample contains only 20,000, this average could therefore be incorrect. Now, this could be easily resolved with a correctly-scoped level-of-detail calculation, however, this would mean your calculation is unnecessarily unique, making your documentation unnecessarily complicated too
(4.) Use Tableau built-in functions (aliases, groups, sets, bins etc) wherever possible instead of calculations or make the changes in CustomSQL
But calculations are easier and aliases / groups / bins etc may be single-use
... consider this calculation to alter the alias of Tenure Type : (running over a 30M row optimised set:
|Calc||Tableau Interpreted||Results in this plan|
Where aliases, being a Tableau object result in no execution time in order to generate the outputs, result is, far faster performance and zero impact to the underlying source.
is the generally the cause of over 9 in 10 bottle-necks, slowdowns and server impacts yet is also one of the most easiest you can control.
(5.) Always test the performance using the performance recorder
About the Performance Recorder
The recorder simply records all Tableau events to help you identify performance issues.
The recorder is NOT a screen-capture device: it will not record screen activity, webcam or microphone and all recorded activity is saved locally.
... will tell you everything you need to know about what Tableau and thereby, the data-server is needing to do to build your visual. It can be accessed from Help > Settings and Performance > Start Performance Recording to start recording actions, and then returning to the same location and selecting Stop Performance Recording when you are ready to review.
I would recommend you also use the Tableau Server performance recorder when publishing extracts after you have tuned using the desktop version, to ensure your server and web pipeline are also optimal. This is reached by first activating the toolbar with this url command:
Open the workbook you want to test onto the first chart you want eg:
and replace the ":iid=1" with ":record_performance=yes"
Once active, you will be able to start and stop recording at will and recording is continuous unlike the desktop recorder where only actions that take place between explicit start and stop points are recorded, allowing for better traceback facility.
(6.) Understand your use-case carefully and consider the best methods for connecting to your data
- Hyper Extract (Tableau Data Extract (tde) for pre v10.5)
- Published Extract
Understanding usage frequency, data-size and complexity are the three considerations you need to make when choosing the best connection method.
Generally speaking the Direct connection is often the best as you can ensure:
- Data security and governance - Tableau will only read the data rather than extract it
- You can leverage the power of your data-server / processing farm to reach the best performance
- Not limited by Tableau Server memory
- Tableau does not need to unpack the data into memory before use
- As the data changes, so does your report
Although, this too has other considerations:
- The report cannot leave company premises unless the data is cloud-based as it contains no physical data
- The report needs to be fully-optimised to reduce server impacts, which might not suit where tactical unblocking reports have been deployed ahead of a completed optimised strategic solutions
- High user activity could cause data-server throttling of a single pipeline
- In the event of a catastrophic data failure, the report will remain blank until data is restored (this should be protected through back-ups and correctly defined etl processes)
Hyper extracts by comparison offer a safe medium however, by offering mostly the reverse of the direct connect method.
However, keep in mind that extracts:
- Take considerably longer to load and are affected by pipe size, lag and network use
- Can suffer performance drawbacks at even small sizes.
- On a personal note, I have noticed on many occasions that the extracts begins to lag at anything more than 5 millions rows and 20 columns wide of mixed alpha-numeric data; you will need to test this
- Are formed of in-memory column-store data-sets:
- This method is considerably faster than traditional row-based storage as data is not stored together on single pages however, in order to achieve optimal performance, one or more indexes still need to be used. As Tableau has no indexing facility, it must still touch ever row in the column in order to generate the result-set.
- Can only be queried by the workbook to which it is attached, users wanting to use the same data must create their own extracts in their own workbook(s) resulting in unnecessary reloads of the same data
- The extract must be refreshed to remain current
- Data alterations and/or additions or deletions of fields require a complete new extract to be generated before these changes can be used, which can prove very time consuming even when empty extracts (Create an Empty Extract) are used
Finally, Published extracts.
Published extracts can exist as either extracted data-sets like Hyper Extracts or can simply be a published data-source holding the connectivity information to a Direct-connect data-source. They function near identical to their workbook-base counterparts only being server based, they provide the addition of accessibility allowing all users to connect and use the same source as you are using.
But, published extracts and sources are not also without their issues and should still only be used following careful consideration.
Seemingly offering the best method of connectivity for on-prem workbooks, published sources appear to be the most appealing method however, their usage will generally double, triple and sometimes quadruple and more query load times.
Tableau Software are consistently improving the connection method however, as recently as November/December 2018 I personally worked with two Tableau Software 2nd-line engineers on a performance problem that was attributed entirely to the use of the published extract:
When using a hyper extract of 300 million rows of data at 1.7GB in size, the dashboard turn-around speeds were around 3.2 seconds however, once loaded to Tableau Server this slowed down to around 53 seconds. It was during this deep investigation we found that Tableau Server was converting the 200-line sql code into a 2000-line xml document, passing this to the data-source over a secure-http pipe (despite the source being on the same server) and waiting for the results to be returned (Tableau server decoding the xml into usable code, executing and returning the results back as xml for the workbook side to decode into usable data).
Switching back to a published extract did decrease performance to around 12 seconds rather than the desktop 3.2 but still, this was substantially quicker than 53 seconds.
This investigation was reported back to Tableau Software for further investigation and remedy.
(7.) Additive Calculations belong in the source
... should exist in the source.
Remembering the maxim: Less is More
And expanding on point 1 above:
Regardless as to the source, be it Excel or a flat-file, or a database table, calculations that operate on a row-by-row basis and these even include numerators and denominators of ratio calculations, can and therefore should be materialised in the source.
If you are unable to modify the source table, this is the perfect time to use CustomSQL.
By pre-materialising your base numbers at the source level, you are significantly increasing performance on the presentation layer (Tableau).
Examples of such calculations include:
- Sums: Addition, Subtraction, Division (with zero/null protection), Multiplication
- Aliasing / renaming, case adjustment, concatenation
- Date/Time adjustment and logic
- Year-over-year logic
- Some basic groups
(8.) Understand Tableau's role
.. a presentation layer.
All too often, Tableau is given Excel-centric nicknames "Excel on steroids" or "a more powerful Excel tool", and many of Tableau's user-base began there analytical lives with Excel or a similar spreadsheet tool, as such, most people therefore try to gain the same results from Tableau that they see with Excel, which often leads to terrible results further down the line.
Tableau is NOT Excel and it does not behave like Excel either.
The problem comes from the ease of use that Tableau provides, and users failing to understand the architecture of both tools.
Only when explained that Tableau is nothing more than a presentation-layer with the bulk of the heavy-lifting taking place outside of Tableau, compared to Excel which does process every action, do users begin to see the bigger picture.
This is discussed in greater detail in Tableau Overview: for the Excel User
(9.) Limit your use of filters
Are you kidding? If I don't have filters, my charts and dashboards will become unusable! It was only a matter of time before these ended-up on your list - killjoy!
The ideal tool to slice and dice your data.
They can be formed from dimensions, measures and calculations. They can use top n, bottom n, wildcard matching, type in and multi-select. Far more powerful (at least from a basic sense), than parameters, these filters really bring your visualisations to life.
It's true, but these are amongst the biggest culprits for terrible data performance. Yet another one that fails to convey the impact to performance, you see, in order to build a quick-filter, Tableau needs to execute a Select Distinct on the entirety of the column, it matters not whether the table contains a thousand rows, or one-hundred trillion, it still needs to perform this. Net result, as many users routinely plug Tableau into tens and one-hundreds of millions of rows of data, do you have any idea how much processing just one quick-filter takes to build.
And then, the source-engine must start again for the next quick-filter. So if your table contains 10M rows and you have just five quick-filters, the data engine must run this query once for each filter, touching every record in the set:
So essentially, the data engine will have touched 50 million records from a 10 million record set. Now do you understand?
There are however, two type of filter that don't touch any records:
- type-in allowing users to type their filter:
- Although you must be very careful with this as this results in a full-text search to filter the records which can be very slow - imagine string matching a partial index to 10 million records.
- The trusty parameter:
- A static list that the developer must maintain, having this as a multi-select is just about the only thing that would truly make this but alas, being only single-select is a major turn-off for most users and for good reason too.
I can offer two further solutions that will enable yo to go crazy with your filters: (1.) Either use a parameter to filter your initial set or (2.) Have your filters on just a single page
(1.) The quote I put at the top of this page regarding needing to re-build the r&d report, well, part of the problem was that this was plugged into a 2 billion row set with around 8 quick filters that needed to be generated causing huge slow-downs on what was already a 650ms network lag.
However, the r&d entries were all Jira tickets, and frustratingly, users were only interested in a single ticket id each time they were using the report, with the average data size per ticket being around 3 million rows. Ok, so part of the back-end rebuild I mentioned was to partition the data by ticket id which hugely improved performance but, from the front-end, simply converting the connection to CustomSQL with a type-in parameter for the where-clause (this was eventually populated over a url-call) resolved the problem.
.. Tableau will not attempt to render any item including quick-filters without data, trouble is, an empty parameter meant that no data was being served which means, the report had a chance to open and render (as nothing but headers/footers titles and parameters would still be drawn), and then, once the user enters their ticket id which filters the data from the outset down from 2 billion rows to 3 million rows, the quick-filters are then built using the same Select distinct only against a considerably smaller set.
Therefore, by using CustomSQL and a parameter to reduce your set to that which is of interest from the outset, your quick-filters can be built against a much smaller set.
(2.) The second alternative I can offer cam about from a pure accident:
A second major drawback of the quick-filter is that if the same filter is placed on more than one dashboard or chart page but, enough time passes between initially building the filters on page/dashboard 1, where other actions have taken place enough to clear the cache for the filters, when moving to the alternate pages, the quick-filters must be re-generated with the same Select distinct query.
I provided some consultancy to a company who just because they had 20 filters which took-up huge amounts of screen-space so, to save on space, we created a page of filters; it was awkward for the end-user to have to go back to the filters page when they wanted to adjust something, and I had to create an indicator to sit on every page to flag when and which filters were applied, but, the users were happy with it.
Moreover, by moving the quick-filters to the one page, this meant that despite their number, they would only have to query the data just once per session, which allowed for an improved experience.
(10.) Keep your visuals clean and your dashboards focused
Unfortunately, development requests from another user can often cause this very important point to fall by the wayside.
Dashboards need to be as clean as possible - the maxim Less is More is absolutely necessary here.
Tools such as Excel and to a lessor degree, PowerPoint have enabled users to attempt to cram as much information as possible into the tiniest of spaces. Users have been brought-up on a diet of "All on one page". This is terrible, for a number of reasons:
- First of all, this is far too much information for users to consume and keep track of
- Dashboards quickly lose focus so end-up a mess of disorganised charts that hide or worse, lose them message
- Imagine you have an odd data point that needs investigation, trying to trace this through multiple charts can often lead to losing the point or, by the time the user reaches chart 9, they are more likely to have forgotten the reason for the investigation
- Finally and most importantly, much charts on a screen quickly leads to performance degradation:
- At best, a chart may be made of two components: a dimension and a measure. These need to be executed as a query, and plotted taking at least 5 operations:
- Tableau generates the initial query and send to the data-server
- The data engine executes the query
- Data is returned back to Tableau
- Tableau reads the data and identifies which objects are to be plotted where
- Tableau plots the data
- Tableau must follow the above steps regardless as to the number of marks to plot and furthermore, need to do this for every chart so if you have 15 charts on a page, then this needs to happen 15 times, which is a huge amount of work
- Worse (what could be worse than this?), as you build interactivity into your dashboard such as filters, parameters and action filters, each of these items generate a new query each time a change needs to be computed
- At best, a chart may be made of two components: a dimension and a measure. These need to be executed as a query, and plotted taking at least 5 operations:
Visuals (charts and tables) also need to be as clean as possible. The less ink on the page, the easier a chart is to read:
- Use the most simplest design to get the message across: there's no denying a beautifully crafted info-chart can look like a work of art in its own right, however, such charts are designed to beautify data with a single message in mind such that they often lose out to more simplistic bars and line charts
- Colour can allow you to add another dimension to your data
- But keep your design clean and simple; the more data you cram onto your chart, the easier your audience will lose track of the message
- Most importantly, the more data that has been crammed onto a chart, the more processing that needs to be done in order to build the chart, so only include that which is necessary for you to present the message.
(11.) Keep your work-space tidy
- Keep calculation and parameter names short, descriptive and where necessary, replace spaces with underscores:
- Default database behaviour is to surround all field names in square brackets, just in case there's a space in there. Sure, this isn't a problem, but when none of your fields contain spaces, you can do away withe the brackets in calculations, which improves readability (and helps with documentation too).
Remember to add comments in your calculations: Single-line comments follow two forward-strokes eg:
- Annotations (object comments) are really useful in understanding the purpose of a calculation or object (like a field, dimension or parameter etc) without needing to open it. Annotations use rich-text so you are free to format as you want. To add an annotation simply:
- Add your comments and press OK to save them
- Add your comments and press OK to save them
- Folders. Folders allow you to organise your work-space, however, you will need to alter the view model in order to access them which in so doing, will remove data-model groupings, so fields that are grouped by table from a joined data-model, will now lose their groupings to be listed alphabetically. No matter, if you want to add this arrangement back, simply create folders to represent the tables instead.
To switch to folders:
- Select the menu option and alter to Group by Folder
- You will now be able to:
- Create new folders
- Delete existing folders
- Select a group of dimensions (or measures or parameters etc) and choose either to create a new folder (and insert into the folder) or to add to an existing folder
- Select the menu option and alter to Group by Folder