Part 4: A data and workbook refresh: complete build


In this article, I shall demonstrate how to include an automated Tableau workbook refresh as part of a data transformation refresh process.

Whilst Tableau Server does have a built-in scheduler for users to schedule data refreshes, whether of workbook extracts or published extracts, the scheduler is sadly quite limited in function.

As long as the server is able to reach the data source, the server and workbook will report that the refresh attempt was successful. Attempt is the key word: even if no data was transferred, or even if the table had been truncated beforehand, Tableau Server is still correct in marking the refresh as successful.

Reports and dashboards require data in order to function, but they also require the latest data in order to be useful. A dashboard that does not contain the latest available data is about as useful as a waterproof teabag. So a scheduler that is not capable of profiling the data source ahead of a refresh is also not very useful: it could end up refreshing the same data while reporting that the latest data has been loaded, or worse, in the event that the source table has been truncated, Tableau will actually import no data.

Import no data

Importing no data is not the same as no import, or importing nothing: if the data source is empty because it has been truncated, or rebuilt following corruption, then that is exactly what Tableau shall import, scrubbing the existing data in the extract.
In essence, the extract will still exist, but it will be empty!

One potential method of avoiding such a problem would be to use an incremental refresh, though this requires a field to be targeted, such as a date or row number; but what happens if you have a less-than-straightforward setup? Maybe this is a rolling year-and-a-day dashboard: as data is reprocessed, you can perform a daily refresh just to bring through the latest data, but on the first of each month the whole data-set needs to be dropped and the entire last 12 months brought in again.
Tools like Alteryx and Tableau Prep can perform similar tasks, though they too are limited to only bringing through data when it is ready: there is no failsafe if the data simply isn't ready yet because it is still in transit; the only alternative in this situation is "wait until tomorrow" - not great for mission-critical data.

This is exactly the problem encountered by Madhu Latha Doddala in their real-world Tableau Forum post: schedule a incremental refresh daily and full refresh once a month to the same data extract (tableau.com). Perfect timing for this article, so I used it to help them tackle their more complex requirements, with a couple of added extras of my own.

So, including the refresh as part of the ETL build is the safest way of ensuring that the data is refreshed just-in-time. What's more, by building the refresh process directly into the ETL mechanism, we can go beyond the basics of the Tableau Server refresh, and even the notifications.

This full build will demonstrate:

  • using SQL Server Integration Services to control a data transformation refresh
  • logging refresh stages and using them for further control mechanisms
  • initiating the workbook refresh
  • exporting a post-refresh image and emailing it to a distribution-list

All in a complete package that can be easily deployed and later modified.

This complete guide has been broken down into the following sections:

  • Part 1: The Code (1a: The Log Table, 1b: The Table, 1c: The ETL Script, 1d: The tabcmd scripts)
  • Part 2: Package Build

Part 1: The Code

We will be creating several types of code as we go, though the pieces we want to focus on here are:

  • the SQL for the stored procedure(s) that will move and/or transform the data from the source(s) to the destination
  • the code to execute the workbook refresh, and the code to export the screenshot ready for forwarding

Part 1a: The Log Table

It all begins with a log. The log controls everything: it determines whether the process can initiate, and in some bigger ETLs I use a log for stage processing; this way, if a stage fails to complete, or needs to be reprocessed, I can simply kick off the whole ETL process knowing only the failed stages will be processed. Furthermore, as emails are also included in the process, the tech team are only notified if part or all of a process has failed, and using this log, I can include the tables that failed.

So as a minimum, our log will need:

  • An autoincrementing id - this is so we can always check the last entry
  • Process Start Date
  • Process Start DateTime
  • Process End Date
  • Process Complete Flag
  • Process End DateTime
  • Processing Time
  • Record Count
  • Data Process Start DateTime
  • Data Process End DateTime
  • Tableau Process Start DateTime
  • Tableau Process End DateTime
  • RecordSet Max Date

Well, it is a log after all. Sure, there are a number of fields that may not seem too relevant for controlling an ETL process, and you'd be correct; only a handful of these fields (listed below) are actually required, but for the sake of a few extra lines of code, we can plug the log into another dashboard (tool of your choice, but if Tableau is in your arsenal, it may as well be this), and this way you can maintain a proactive analysis of the ETL's performance: sudden spikes in record counts, processing times etc. allow you the time to look for the cause before they become a real problem.

The necessary fields:

  • an ID allows us to identify the latest log entry more easily than querying max dates, especially in multi-process solutions
  • The next step - the start and end dates: this can become harder for processing that starts at the end of day 0 and usually ends in the small hours of day 1, which is something you will need to factor in. For the most part, though, the question is: is the End Date the same as today (or yesterday, depending on when the processing is set for)? If yes, the refresh has already completed for today; if no, the refresh needs to go ahead.
    This seems a bit daft, doesn't it: if the process has already completed, why is it running again? Simple - it is usually good practice to set a refresh window and frequency, e.g. every 30 minutes between 4am and 7am; this failsafes against time-outs such as an excessively long-running query being terminated by the server.
  • Next up, what if the Process End Date is null, but the Process Start Date is today? We then check the Data Process End DateTime: if it is null, the data is still being refreshed and we can terminate our ETL; if it is the same date as the start date, then the data has finished processing and it is the Tableau refresh that needs to run
  • Then the Recordset Max Date: we use this to determine which data should be truncated out of our processed table, and to decide which data needs to be pulled across from the source
  • Finally, the record count: this can be used in a number of ways, though the most basic check is "greater than zero". It's one thing to check that the data and/or refresh have completed today, but if the record count is zero, i.e. no records were processed, then there is no point initiating a Tableau refresh as the extract data will be the same

Finally, unless you have a multi-refresh set-up (in which case the datetime fields become much more important), there should be just one completed entry per refresh, as a new record is only created and populated when the ETL begins.

All of the fields need to be defined as nullable, as they shall be updated as each process stage completes.

And, although I have written this for SQL Server, I have kept it as close to ANSI SQL (SQL-2011) as possible. The field "Process_Complete_Flag" should really be a boolean field (bit in SQL Server, with a data size of 1 byte); I have defined it as a TinyInt, which in SQL Server can hold a value between 0 and 255 but is also 1 byte in size, so it has the same data size as bit, which can only hold 0 or 1 (false or true).
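The original create script isn't reproduced here, but a minimal sketch of the log table could look like the following; the table and column names are my own assumptions, so rename them to suit your conventions:

    CREATE TABLE dbo.Company_KPI_Refresh_Log (
        Log_ID                          int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        Process_Start_Date              date NULL,
        Process_Start_DateTime          datetime NULL,
        Process_End_Date                date NULL,
        Process_Complete_Flag           tinyint NULL,
        Process_End_DateTime            datetime NULL,
        Processing_Time_Secs            int NULL,      -- elapsed seconds, start to finish
        Record_Count                    int NULL,
        Data_Process_Start_DateTime     datetime NULL,
        Data_Process_End_DateTime       datetime NULL,
        Tableau_Process_Start_DateTime  datetime NULL,
        Tableau_Process_End_DateTime    datetime NULL,
        RecordSet_Max_Date              date NULL
    );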

Part 1b: The Table

Let's create the table now. I have created this as a sproc, so I can include it as a failsafe in the main ETL code: if the table is not present, then recreate it, to prevent the main ETL bugging out when it comes to load the data into a table that does not exist. Also, you will notice that this script creates two identical tables - the actual table, and a staging table. This is so we can load the data to the staging table first and check that the data is present before loading it into the main table. This acts as a failsafe, retaining the existing data in the event that no newer data is available at the time of the refresh.
We won't run this straight away; we'll let the ETL script handle it later, and that way we know the failsafe is working:
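A sketch of what that script could look like; the procedure, table and KPI column names here are hypothetical placeholders - your real columns will differ:

    CREATE PROCEDURE dbo.sp_Company_KPI_Create_Tables
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Main reporting table: only created if it does not already exist
        IF OBJECT_ID('dbo.Company_KPI', 'U') IS NULL
            CREATE TABLE dbo.Company_KPI (
                KPI_Date      date          NOT NULL,
                KPI_Category  varchar(50)   NOT NULL,
                KPI_Value     decimal(18,2) NULL
            );

        -- Identical staging table: data lands here first and is checked
        -- before being moved into the main table
        IF OBJECT_ID('dbo.Company_KPI_Staging', 'U') IS NULL
            CREATE TABLE dbo.Company_KPI_Staging (
                KPI_Date      date          NOT NULL,
                KPI_Category  varchar(50)   NOT NULL,
                KPI_Value     decimal(18,2) NULL
            );
    END;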

Part 1c: The ETL Script

So now we have the log table, let's get on with the code.

This code doesn't do anything special: it simply selects some data from one table and moves it to another; the only difference is how much data is being moved. On the 1st of the month, the table is truncated and the last 365 days are transferred, whereas on any other day we transfer just the days that have elapsed and become available since the last run, which we would normally expect to be one day.
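As the original procedure isn't shown here, this is a simplified sketch of what sp_Company_KPI_Build could look like, reusing the hypothetical tables from the earlier sketches and an assumed source table dbo.Source_KPI; treat it as a shape to build from rather than the finished article:

    CREATE PROCEDURE dbo.sp_Company_KPI_Build
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Failsafe: recreate the destination tables if they are missing
        EXEC dbo.sp_Company_KPI_Create_Tables;

        -- Last date already processed, taken from the log
        DECLARE @MaxDate date =
            (SELECT MAX(RecordSet_Max_Date) FROM dbo.Company_KPI_Refresh_Log);

        -- Stage the new data first
        TRUNCATE TABLE dbo.Company_KPI_Staging;

        IF DAY(GETDATE()) = 1
            -- 1st of the month: pull the full rolling 365 days
            INSERT INTO dbo.Company_KPI_Staging (KPI_Date, KPI_Category, KPI_Value)
            SELECT KPI_Date, KPI_Category, KPI_Value
            FROM   dbo.Source_KPI
            WHERE  KPI_Date >= DATEADD(DAY, -365, CAST(GETDATE() AS date));
        ELSE
            -- Any other day: just the days that have arrived since the last run
            INSERT INTO dbo.Company_KPI_Staging (KPI_Date, KPI_Category, KPI_Value)
            SELECT KPI_Date, KPI_Category, KPI_Value
            FROM   dbo.Source_KPI
            WHERE  KPI_Date > @MaxDate;

        -- Failsafe: only touch the main table if the staging load returned rows
        IF EXISTS (SELECT 1 FROM dbo.Company_KPI_Staging)
        BEGIN
            IF DAY(GETDATE()) = 1
                TRUNCATE TABLE dbo.Company_KPI;
            ELSE
                DELETE FROM dbo.Company_KPI
                WHERE  KPI_Date IN (SELECT KPI_Date FROM dbo.Company_KPI_Staging);

            INSERT INTO dbo.Company_KPI (KPI_Date, KPI_Category, KPI_Value)
            SELECT KPI_Date, KPI_Category, KPI_Value
            FROM   dbo.Company_KPI_Staging;
        END;
    END;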

So now that we have the code to run the ETL process, and the log table that controls it, we have two final scripts to write: the Tableau ones.

Part 1d: The tabcmd scripts

The first, the refresh script, simply logs in to Tableau Server and initiates the workbook refresh, which will refresh all the extracts contained therein:
(I'm assuming that you have installed tabcmd and moved it to the root directory of the pc/server that shall be executing the ETL, and that you have created a scripts folder for your tabcmd scripts as detailed here: Tableau Server Automation - On a Tableau Quest... - Confluence (atlassian.net))
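A minimal sketch of that script - the server URL, credentials and workbook name below are placeholders, not values from the original:

    @echo off
    rem Assumes tabcmd is on the PATH (or run this from the tabcmd folder)
    tabcmd login -s https://your-tableau-server -u your_user -p your_password
    tabcmd refreshextracts --workbook "Company KPI"
    tabcmd logout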

Always remember to log off, to close the session, when you have completed your commands.

Save this to the scripts directory (C:\tabcmd\scripts) as company_kpi_refresh.bat

Now the print-out. I would recommend another directory within tabcmd called "prints". Whilst it's possible to print the entire workbook to PDF - useful for leaders who need a complete overview before they reach the office network, and a high-level summary just won't do - in this instance I shall only export the Summary dashboard page. I shall also assume you are using only a single site, otherwise you would need the -t switch immediately before the workbook location:
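Again a sketch with placeholder values; the workbook/view path is the one shown in the view's URL on the server, and --fullpdf could be swapped in if you did want the whole workbook as a PDF:

    @echo off
    rem Placeholder server, credentials and view path - substitute your own
    tabcmd login -s https://your-tableau-server -u your_user -p your_password
    tabcmd export "CompanyKPI/Summary" --png --filename "C:\tabcmd\prints\company_kpi_summary.png"
    tabcmd logout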

Save this as company_kpi_daily_print.bat

Now that we have finished the major scripting, we can get on with the rest of the automation.

Part 2: Package Build

Open SQL Server Integration Services in MS Visual Studio, create a new project, give it a name and press OK to begin the build; this is covered in Step 2 here: Programmatic Workbook Refresh (10 Minute Read) - On a Tableau Quest... - Confluence (atlassian.net)

  1. Drag an Execute SQL Task object to the canvas - this first step determines whether the package has already completed for today
  2. Double-Click the object to edit it
  3. Now go to the Parameter Mapping page to add a new parameter, so that we can use it later to control the flow. NB: you will need to define the Connection Type and Connection info on the General page before proceeding
    1. Click Add
    2. Change VariableName to RefreshState
    3. Direction to Output
    4. Data Type to Int16
    5. Parameter Name to @RefreshState
  4. Then open the General page
    1. Rename to "Package Start Check"
    2. Connection Type to ADO.Net
    3. Set the Connection to <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement: press the ellipses to expand the window and use this code:

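      The original statement isn't shown, but against the log table sketched in Part 1a (table and column names are my assumptions) it would look something like this, returning 1 when the refresh still needs to run and 0 when it has already completed today:

        SET @RefreshState =
            CASE WHEN EXISTS (
                    SELECT 1
                    FROM   dbo.Company_KPI_Refresh_Log
                    WHERE  Process_End_Date = CAST(GETDATE() AS date)
                      AND  Process_Complete_Flag = 1
                 )
                 THEN 0   -- already refreshed today
                 ELSE 1   -- not yet refreshed today, so carry on
            END;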
  5. Then press OK to save
  6. Drag another Execute SQL Task onto the canvas and set:
    1. Name to "New Log Entry"
    2. Connection Type to ADO.Net
    3. Connection to <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement to

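      A sketch of the insert, again assuming the log table from Part 1a; it opens a new log row and stamps the process and data-process start times:

        INSERT INTO dbo.Company_KPI_Refresh_Log
            (Process_Start_Date, Process_Start_DateTime, Data_Process_Start_DateTime)
        VALUES
            (CAST(GETDATE() AS date), GETDATE(), GETDATE());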
    6. IsQueryStoredProcedure to False
    7. Press OK
  7. Now join the Package Start Check object to the New Log Entry object with the green connector
  8. Double-click the green connector to enter the Precedence Constraint Editor and set
    1. Evaluation operation to Expression
    2. Expression - you can type in your variable here, or search for it and build the expression (preferred) by clicking the ellipses; the output should resemble:
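      Assuming the variable name used earlier, the expression would be along these lines:

        @[User::RefreshState] == 1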

      What this does is progress the package to the data refresh only if the log check determined that the refresh hadn't already taken place today, i.e. it returned 1

    3. Press OK

  9. Now drag another Execute SQL Task onto the canvas - this will execute the data refresh:
    1. Name to Refresh Source
    2. Connection Type to ADO.Net
    3. Connection: <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement to sp_Company_KPI_Build
    6. IsQueryStoredProcedure to True
    7. Press OK
    8. Connect New Log Entry to Refresh Source
  10. Another Execute SQL Task to update the log:
    1. Name: Post Data Refresh Log Update
    2. Connection Type: ADO.Net
    3. Connection: <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement:

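      A sketch of the update, assuming the table names used earlier; it stamps the data-process end time and captures the record count and max date from the freshly loaded table:

        UPDATE L
        SET    Data_Process_End_DateTime = GETDATE(),
               Record_Count       = (SELECT COUNT(*)      FROM dbo.Company_KPI),
               RecordSet_Max_Date = (SELECT MAX(KPI_Date) FROM dbo.Company_KPI)
        FROM   dbo.Company_KPI_Refresh_Log AS L
        WHERE  L.Log_ID = (SELECT MAX(Log_ID) FROM dbo.Company_KPI_Refresh_Log);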
    6. Press OK
    7. Connect Refresh Source to Post Data Refresh Log Update
  11. Drag another Execute SQL Task onto the canvas - I should have said at the beginning that this package is mostly Execute SQL Task objects. This one tests whether the workbook is to be refreshed
    We'll be defining another output parameter here, although we shall be using the same User Variable as before. And just like earlier, you'll need to define your connection settings first, before heading to the parameter page:
    1. Go to the Parameter Mapping page
    2. Add a new variable, but this time, scroll the list to find the variable you previously defined
    3. Direction: Output
    4. Data Type: Int16
    5. Parameter Name: @RefreshState
  12. Onto the General page:
    1. Name: Tableau Refresh Start Check
    2. Connection Type: ADO.Net
    3. Connection: <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement:

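      The check could look something like this (again against the assumed log table): it returns 1 only if the latest log entry shows that records were actually loaded, so an empty run skips the workbook refresh:

        SET @RefreshState =
            CASE WHEN EXISTS (
                    SELECT 1
                    FROM   dbo.Company_KPI_Refresh_Log
                    WHERE  Log_ID = (SELECT MAX(Log_ID) FROM dbo.Company_KPI_Refresh_Log)
                      AND  Record_Count > 0
                 )
                 THEN 1   -- new data arrived: refresh the workbook
                 ELSE 0   -- nothing new: skip straight to the export
            END;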
    6. Press OK
    7. Connect the Post Data Refresh Log Update to the Tableau Refresh Start Check
  13. Drag another Execute SQL Task onto the canvas: this will update the log with the workbook refresh start dateTime
    1. Name: Update the Tableau Refresh Start
    2. Connection Type: ADO.Net
    3. Connection: <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement:

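      A sketch of the update against the assumed log table:

        UPDATE dbo.Company_KPI_Refresh_Log
        SET    Tableau_Process_Start_DateTime = GETDATE()
        WHERE  Log_ID = (SELECT MAX(Log_ID) FROM dbo.Company_KPI_Refresh_Log);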
    6. Press OK
    7. Now connect the Tableau Refresh Start Check to the Update the Tableau Refresh Start
    8. Double-click the connection and repeat Step 8 here: Set the Evaluation Operation to Expression and set the Expression to:

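      As before, assuming the same variable name:

        @[User::RefreshState] == 1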

  14. Now, drag an Execute Process Task onto the canvas: this will execute the workbook refresh tabcmd script we created earlier:
    1. Name: Refresh the workbook
    2. On the Process page
    3. Executable: C:\tabcmd\scripts\company_kpi_refresh.bat
    4. WindowStyle: Minimized
    5. Press OK
    6. Then connect the Update the Tableau Refresh Start to the Refresh the workbook
  15. Drag an Execute SQL Task onto the canvas: this will update the log with the workbook refresh end time
    1. Name: Update the Tableau Refresh End
    2. Connection Type: ADO.Net
    3. Connection: <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement:

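      Again, a sketch against the assumed log table:

        UPDATE dbo.Company_KPI_Refresh_Log
        SET    Tableau_Process_End_DateTime = GETDATE()
        WHERE  Log_ID = (SELECT MAX(Log_ID) FROM dbo.Company_KPI_Refresh_Log);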
    6. And then connect the Refresh the workbook to the Update the Tableau Refresh End
  16. Drag another Execute Process Task onto the canvas: this shall run the second batch file to export the summary to the C:\tabcmd\prints folder. It is usually best to use SSIS to create the batch file here, so you can encode the date into the export filename, but for these purposes we shall simply execute the batch file we created earlier.
    1. Name: Print workbook summary page
    2. On the Process page
    3. Executable: C:\tabcmd\scripts\company_kpi_daily_print.bat
    4. WindowStyle: Minimized
    5. Press OK
  17. We now need to make two connections to the workbook print object:
    1. Connect the Update the Tableau Refresh End to the Print workbook summary page
    2. And then connect the Tableau Refresh Start Check to the Print workbook summary page:
      1. Next, double click the connection made in 17b above
      2. Evaluation Operation: Expression
      3. Expression:

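        Assuming the same variable, this time we test for the opposite outcome:

          @[User::RefreshState] == 0

        Note that, because the print object now has two inbound constraints, you would also typically select the "Logical OR. One constraint must evaluate to True" option in the Precedence Constraint Editor, so that either path can trigger the export.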
        Therefore, if no data was transferred during step 9, this shall be detected during step 12 above, redirecting the flow straight to generating the workbook image and avoiding refreshing the workbook.

      4. Press OK
  18. Now drag a Send Mail Task object onto the canvas: this will send the email to your recipients. You can use SSIS to create detailed messages containing any error codes you have recorded based on your logging, but for this example we'll stick with hard-coding the fields
    1. Name: Send the summary page to the DL - you don't need a distribution list here, though it is more usual within a business to maintain them
    2. Onto the Mail page
    3. SmtpConnection: You will need to speak to your IT team as regards these details
    4. From: You
    5. To: Recipients and/or distribution list
    6. Cc / BCc / Subject: Populate with your requirements
    7. MessageSource: <Message Body>
    8. Attachments: Navigate to the print produced by Tableau Server
    9. Press OK
    10. Connect the Print workbook summary page to the Send the summary page to the DL
  19. Drag one final Execute SQL Task onto the canvas: this updates the log with the process end date, datetime and the process complete flag:
    1. Name: Complete Process Log
    2. Connection Type: ADO.Net
    3. Connection: <your server>
    4. SQLSourceType to Direct input
    5. SQLStatement:

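      A final sketch against the assumed log table; Processing_Time_Secs is my own column name from the Part 1a sketch:

        UPDATE dbo.Company_KPI_Refresh_Log
        SET    Process_End_Date      = CAST(GETDATE() AS date),
               Process_End_DateTime  = GETDATE(),
               Processing_Time_Secs  = DATEDIFF(SECOND, Process_Start_DateTime, GETDATE()),
               Process_Complete_Flag = 1
        WHERE  Log_ID = (SELECT MAX(Log_ID) FROM dbo.Company_KPI_Refresh_Log);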
    6. Press OK

This completes the build side; the final step is to deploy the package to SQL Server and configure the Agent scheduler, which is covered from Step 3 in the shorter guide here: Programmatic Workbook Refresh (10 Minute Read) - On a Tableau Quest... - Confluence (atlassian.net)