Part 4: A Data and workbook refresh: complete build
In this article, I shall demonstrate including an automated Tableau workbook refresh as part of a data transformation refresh process
Whilst Tableau Server does have a built-in scheduler for users to schedule refreshes of data whether they be workbook extracts, or published extracts, sadly, the scheduler is quite limited in function.
As long as the server is able to reach the data source, then the server and workbook are able to report that the refresh attempt was successful. Attempt - this is the key word, so even if no data had been transferred, or even if the table had been truncated before hand, Tableau Server is still correct in marking the refresh as successful.
Reports and dashboards require data in order to function, but also require the latest data in order to be useful. A dashboard that does not contain the latest available data is about as useful as a waterproof teabag. So a scheduler that is not capable of profiling the data source ahead of a refresh, is also not very useful as it could end-up just refreshing the same data yet reporting that the latest data has been loaded, or worse, in the event that the source table has been truncated, Tableau will actually import No data.
Import no data
Import no data is not the same as no import or importing nothing, if the data source is empty as it has been truncated or rebuilt following corruption, then this is what Tableau shall import: scrubbing the existing data in the extract.
In essence, the extract will still exist, but shall be empty!
One potential method of avoiding such a problem would be to use an incremental refresh, though this requires a field to be targeted such as a date or row number; but what happens if you have a less-than-straightforward setup? Maybe this is a rolling year-and-a-day dashboard: as data is reprocessed, you can perform a daily refresh just to bring through the latest data, but, on the first of each month, the whole data-set needs to be dropped and the entire last 12 months needs to be brought in.
Tools like Alteryx and Tableau Prep can perform similar tasks like this, though they too are still limited to only bringing through data when it is ready: there is no failsafe if the data simply isn't ready yet as its still in transit; the only alternative in this situation is "wait until tomorrow" - not great for mission-critical data.
This is exactly the problem encountered by Madhu Latha Doddala in their real-world Tableau Forum post: schedule a incremental refresh daily and full refresh once a month to the same data extract (tableau.com) perfect timing for this article, so I used this article to help him tackle his more complex requirements; with a couple of added extras of my own:
So including the refresh as part of the etl build is the safest method to ensuring that the data is refreshed just-in-time. What's more is, that by building the refresh process directly into the etl mechanism, we can go beyond the basics of the Tableau Server refresh and even the notifications.
This full build will demonstrate:
using SQL Server Integration Services to control a data transformation refresh
logging refresh stages and using them for further control mechanisms
initiating the workbook refresh
exporting a post-refresh image and emailing it to a distribution-list
All in a complete package that can be easily deployed and later modified.
This complete guide has been broken-down to the following sections:
Part 1: The Code
We will be creating several types of code as we go, though the pieces we want to focus on here are:
the sql for the stored procedure(s) that will move and/or transform the data from the source(s) to the destination
the code to execute the workbook refresh and the code to export the screenshot ready for forwarding
Part 1a: The Log Table
It all begins with a log. The log controls everything: it determines whether the process can initiate, and in some bigger etl's, I use a log for stage processing; this way, if a stage fails to complete, or needs to be reprocessed, I can simply kick-off the whole etl process knowing only the failed stages will be processed. Furthermore, as emails are also included in the process, the tech team are only notified if part or all of a process has failed, and using this log, I can include the tables that failed.
So as a minimum, our log will need:
An autoincrementing id - this is so we can always check the last entry
Process Start Date
Process Start DateTime
Process End Date
Process Complete Flag
Process End DateTime
Processing Time
Record Count
Data Process Start DateTime
Data Process End DateTime
Tableau Process Start DateTime
Tableau Process End DateTime
RecordSet Max Date
Well, it is a log after all. Sure, there are a number of fields that may not seem too relevant for controlling an etl process, and you'd be correct; only the fields coloured orange are all that are required, but for the sake of a few extra lines of code, we can plug the log into another dashboard (tool of choice but if Tableau is in your arsenal, it may as well be this), and this way, you can keep a proactive analysis of the etl performance: sudden spikes in record counts or processing times etc allow you the time to look for the cause before they become a real problem.
The necessary fields - the orange ones:
an ID allow us to identify the latest log entry easier than running max dates especially in multi-process solutions
The next step - the start and end dates: this can become harder for processing that takes place at the end of day 0 and usually ends in the small hours of day 1, this is something you will need to factor in, though for the most part, the question is: is the End Date the same as today (or yesterday depending as to when the processing is set for) if yes then the refresh has already completed for today and if no, then the refresh needs to go ahead.
This seems a bit daft doesn't it, if the process has already completed, why is it running again? Simple, it is usually good practice to set a refresh window and frequency eg every 30 minutes between 4am and 7am this failsafes against time-outs such as an excessively long running query, terminated by the server.Next up, what if the Process End Date is Null, but the Process Start Date is today? We then check the Data Process End DateTime, if null, the data is still being refreshed and we can terminate our etl, if the same Date as the start date, then the data has finished processing and it is Tableau
Finally, the Recordset Max Date: we use this to determine which data should be truncated out of our processed table, and to decided which data needs to be pulled across from the source
And the record count, this can be used a number of ways though the most basic is greater than zero. Its one thing to check that the data and or refresh have completed today, but if the record count is zero eg, no records were processed, then there is no point initiating a Tableau refresh if the extract data will be the same
Finally, unless you have a multi-refresh set-up in which case the date_time fields will become so much more important, it is the case that there should be just one completed entry per refresh, as a new record is only created and populated when the etl begins.
All the fields need to be defined as null, as they shall be updated as each process stage completes.
And, although I have written this for SQL Server, I have kept it as close to SQL ANSI-11 as possible, the field "Process_Complete_Flag" should be a boolean field (bit in SQL Server with a data size of 1 byte), I have defined this to a TinyInt which in SQL Server can hold a value of between 0 to 255 but is also 1 byte in size, so has the same data size as bit which can only hold 0 or 1 (False or True)
Part 1b: The Table
Lets create the table now. I have created this as a sproc, this is so I can include it as a failsafe in the main etl code: If the table is not present, then recreate it so as to prevent the main etl bugging-out when it comes to load the data to a table that does not exist. Also, you will notice that this script creates two identical tables - the actual table, and a staging table. This is so we can load the data to the staging table first, check to ensure the data is present before then loading into the main table. This process allows for a failsafe, to retain the existing data in the event that no newer data is available at the time of the refresh.
We won't create this straight away, we'll let the etl script handle this later; this way, we know the failsafe is working:
Part 1c: The ETL Script
So now we have the log table, lets now get on with the code.
This code doesn't do anything special, simply selects some data from one table and moves it to another, the only difference is how much data is being moved: On the 1st of the month, the table is truncated and the last 365 days are transferred, whereas any other day, we are just transferring the number of days that have lapsed and become available since the last run, which we would normally expect to be one day.
So now we have the code to run the etl process, and the log file that controls it, we now have two final scripts to write: the Tableau ones.
Part 1d: The tabcmd scripts
The first, the refresh script is simply a login to Tableau Server, and initiate the workbook refresh, which will refresh all the extracts contained therein:
(I'm assuming that you have installed tabcmd, and moved it to the root directory of the pc/server that shall be executing the etl, and that you have created a scripts folder for your tabcmd scripts as detailed here: Tableau Server Automation - On a Tableau Quest... - Confluence (atlassian.net))
Always remember the logoff to close the session when you have completed your commands
Save this to the scripts directory (C:\tabcmd\scripts) as company_kpi_refresh.bat
Now the print-out, I would recommend another directory within tabcmd called "prints", and whilst its possible to print the entire workbook to pdf - useful for leaders who need a complete overview before they reach the office network, and a high-level summary just won't do, in this instance, I shall only export the Summary dashboard page, and I shall assume you are using only a single site, ,otherwise you would need the -t switch immediately before the workbook location
Save this as company_kpi_daily_print.bat