Where did my arrows go?

Quick hitter on this one.  I recently pushed some reports to the Power BI Service and noticed that I suddenly didn’t have the ability to drill down into my matrix visualizations.  I could right-click a row in the visualization and get the option to drill down, but I am used to being able to hover over the visual and see the following arrows pop up:

It’s a quick, easy fix to get these back.  Starting with the March 2018 release of Power BI Desktop, you have the ability to turn off the visual header.  This means that while the report is in reading mode, the headers won’t appear when you mouse over the visualization.

Sure enough, when I went into the Reports Section of the workspace and checked the settings, I saw this:

Not sure how that got toggled on for my reports, but since I had built a bunch of them from the same template, it’s easy to see how they all ended up with that setting applied.  Toggle that switch and it’s back to normal.

Simple fix.

Using Azure and Power BI to mine weather station data

I really started to get obsessed with data back when I was supporting SharePoint.  SharePoint used to (maybe still does?) have this huge diagnostics database that logged amazing amounts of data by default.  I am sure the product team had imagined writing all sorts of cool stuff around it, but that feature must have been cut because the database appeared in SharePoint almost as an afterthought.  I spent many hours trying to use all sorts of tools to pry knowledge from that database, but I never really got too far.

My other data obsession started around the same time: digging through historical weather data.  Ever since seeing this episode of The Simpsons, I have wondered what data was being collected by those mysterious weather stations.  Years later, I finally found my answer when I stumbled across the National Centers for Environmental Information website, found here:  https://www.ncdc.noaa.gov/

After some poking around, I found an FTP site where the mother lode of weather information was stored, and I have poked and prodded at it ever since.  In part one of this series, I will cover how I automated pulling these files into an Azure Data Lake using Azure Data Factory.  As the series progresses, I will cover ETL/ELT and building out a model for Power BI, as well as building some visualizations.

Fits and Starts

In the past, I had used an SSIS package with an FTP task that downloaded all of the files to my desktop.  I used a For-Each container to spawn an instance of 7-Zip to unzip each of the files and then used an Azure Blob task to upload the files into Azure.  This was an effective approach, but it hardly seemed enterprise-worthy, so I started looking into using Azure Data Factory (ADF) to do the same thing.

The obvious approach with ADF would be to create an FTP data set and use a Copy Data activity to pull the files from the FTP server and move them to my desired location.  I hit an immediate brick wall, however.  The NOAA FTP site wasn’t truly anonymous: to log in, you use the user name ‘anonymous’ and pass your email address as the password.  The FTP data set in ADF only allows you to choose Anonymous or Basic authentication, and the FTP server did not like the fact that ADF automatically encrypted the password.  Honestly, I got stuck here for a while.

What finally worked

Eventually, I found that NOAA had the same data stored on a website that allowed anonymous access.  It was off to the races at this point; here is what I put together in ADF.

The website, https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/, looks like this:

Each file consists of a year’s worth of data in CSV format that has been GZipped.  The URL of each file is just the URL listed above plus the name of the file.

I wanted to be able to run a Copy Data action against each file, so I brute-force created a list of the filenames in Excel.  I then created my pipeline in ADF and stored the file names as an array in a pipeline parameter.  I first entered a few into the dialog as shown below:

I then used the code button to enter them directly into the JSON file rather than trusting myself to paste all of those names into that little box.
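
For reference, here is roughly what that pipeline parameter looks like in the JSON view.  The pipeline name, the parameter name, and the handful of file names below are just placeholders; the real files on the NOAA site follow a YYYY.csv.gz naming pattern and the full list covers every available year.

{
    "name": "CopyWeatherData",
    "properties": {
        "parameters": {
            "FileList": {
                "type": "Array",
                "defaultValue": [
                    "2016.csv.gz",
                    "2017.csv.gz",
                    "2018.csv.gz"
                ]
            }
        },
        "activities": []
    }
}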

I then created my HTTP linked service, using the URL above as the base URL:
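
In JSON terms, the linked service is roughly the following (the name is a placeholder):

{
    "name": "NOAAWeatherHttp",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "url": "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/",
            "authenticationType": "Anonymous",
            "enableServerCertificateValidation": true
        }
    }
}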

Next, I created my HTTP data set.

Note that I have the compression type set to GZip and that the relative URL uses a dataset parameter.  During each iteration of the ‘For-each’ activity, I will pass the copy activity a value containing the file name that completes the URL needed to get the file.  I store that file name in the data set as shown in the Parameters section of the HTTP data set properties:
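
Sketched out in JSON, the HTTP data set looks something like this (the data set and linked service names are placeholders; the parameter is the ‘PipelineFileList’ parameter I mention below):

{
    "name": "WeatherHttpFile",
    "properties": {
        "type": "HttpFile",
        "linkedServiceName": {
            "referenceName": "NOAAWeatherHttp",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "PipelineFileList": { "type": "String" }
        },
        "typeProperties": {
            "relativeUrl": {
                "value": "@dataset().PipelineFileList",
                "type": "Expression"
            },
            "requestMethod": "Get",
            "format": { "type": "TextFormat" },
            "compression": { "type": "GZip" }
        }
    }
}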

I created the target data set to point to Azure Data Lake; nothing out of the ordinary there.
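
Assuming an Azure Data Lake Store linked service is already in place, the target data set is roughly this (the names and folder path are placeholders):

{
    "name": "WeatherDataLakeFile",
    "properties": {
        "type": "AzureDataLakeStoreFile",
        "linkedServiceName": {
            "referenceName": "MyDataLakeStore",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "weather/ghcn-daily/by_year",
            "format": { "type": "TextFormat" }
        }
    }
}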

Ok, finally I created the ‘For-each’ activity that drives the copy in my pipeline.  I specified that each time it loops, it iterates through the list of file names stored in the pipeline parameter by entering the following in the settings of the ‘For-each’ activity:
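
The relevant piece of the ‘For-each’ activity JSON looks roughly like this, assuming the pipeline parameter is named FileList as in the earlier sketch (the ‘Copy Data’ activity described next goes inside the empty activities array):

{
    "name": "ForEachWeatherFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@pipeline().parameters.FileList",
            "type": "Expression"
        },
        "activities": []
    }
}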

I then created the activity that I wanted my ‘For-each’ activity to run.  This is the ‘Copy Data’ activity that moves the data from the web site to Azure.  It is pretty straightforward except when specifying the source.  If you remember, I created a parameter on my data set called ‘PipelineFileList’.  When you specify that data set as the source in the copy activity, because of that parameter, you get a chance to tell it what value to store in that parameter, as shown:

This is what finally ties it all together.  By specifying @item() as the value to pass down to the data set, each time the ‘For-each’ loop iterates, it takes the current file name and passes it to the source data set, which in turn pulls it from the parameter and uses it as the file name when it makes the request to its HTTP source.
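
Put together as JSON, the ‘Copy Data’ activity inside the ‘For-each’ loop binds @item() to the data set parameter roughly like this (the activity and data set names match the placeholder sketches above):

{
    "name": "CopyWeatherFile",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "WeatherHttpFile",
            "type": "DatasetReference",
            "parameters": {
                "PipelineFileList": {
                    "value": "@item()",
                    "type": "Expression"
                }
            }
        }
    ],
    "outputs": [
        {
            "referenceName": "WeatherDataLakeFile",
            "type": "DatasetReference"
        }
    ],
    "typeProperties": {
        "source": { "type": "HttpSource" },
        "sink": { "type": "AzureDataLakeStoreSink" }
    }
}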

From here, it was just a matter of saving, publishing and letting it rip.

The speed involved was breathtaking.  Remember where I specified GZip as the compression method?  In roughly 12 minutes, the pipeline downloaded, decompressed, and copied over 94GB of data to Azure Data Lake.

The next post in the series will look at how I start chewing through the data and getting it ready to be pulled into my data model.


The anatomy of a DAX mistake – Analyzing DAX Performance

I have been working with a retail customer that uses a 445 date table.  Matt Allington over at PowerPivotPro wrote a great blog post on what a 445 date table is and how to create one, but creating one wasn’t my issue because the customer provided me with the one that they use (whew).

Instead, I was doing some custom time intelligence with DAX, since I couldn’t use the built-in functions due to the wacky date table.  Specifically, the customer wanted a lot of their metrics expressed in terms of the last week, the last four weeks, the last 13 weeks, and the total.

As a side note, the interesting thing about DAX is that there are usually all sorts of ways to do the same thing.  I am not presenting the following approach as a template for how to solve this problem; I am just using what I did to show you how I found my first (hah, hah) performance problem and solved it.

I started by defining a simple measure that would give me the latest date for the data that I had in my data set.

CURRENT WEEK ID = MAX ( 'Date'[DateID] )

Since my Date table only has rows for weeks that exist in my fact table, this gives me the max DateID in my data set.  With this information in hand, I created some custom helper tables in my model.  The example below is the helper table for the previous thirteen weeks.

Rolling_13_Date =
VAR maxID = [CURRENT WEEK ID]
RETURN
    FILTER (
        'Date',
        'Date'[DateID] >= ( maxID - 12 )
            && 'Date'[DateID] <= maxID
    )

I made a helper table for each of the custom time periods: 13 weeks, 4 weeks, and 1 week.  These tables are created dynamically when the data model is refreshed, and they are joined back to the main data table as shown:

So the thinking was this: to make life easy for myself, any time I had to present a measure over one of these time periods, I would just do the following:

My4WkMeasure = CALCULATE ( [MyMeasure], Rolling_4_Date )

Initial testing went great: it worked as expected, and the approach was super easy to remember.

Of course, I did mention that when I am doing development, it is against a single week of data, right?  Yeah, I probably should have mentioned that.  Once I pulled in all of the data, here is a sample of the blazing fast speed that my measure was cranking out:


That is the ‘Server Timings’ readout from DAX Studio.  My measure, just one single measure using this approach, took 4.5 seconds to load.  For such a simple measure (basically total sales), it actually would have taken almost 15 seconds if it hadn’t spread the love out over multiple cores.  Even better, it’s not a lot of data being materialized; the time is all CPU time, so I am making the query engine think WAY too hard for such a simple calculation.

When I started digging into the query, I quickly saw my mistake.  I don’t quite understand the full details of it because I am not Marco Russo or Alberto Ferrari, but I could see enough from the queries shown in DAX Studio to understand what I did wrong.  Because I specified an entire table as a filter instead of a single column, the query engine performed a massive join that pulled in my entire rollup table, the main date table, and the other connected rollup tables following the 1:1 relationships.  All of the columns in the resulting mashed-up table were then used in a WHERE clause so massive that it actually overran the buffer in DAX Studio before it captured all of the text.  Usually DAX Studio will cut off the text and tell you how many more lines remained, but I think I overwhelmed it with my nonsense.  The resulting WHERE clause looked like it was filtering on every available value in all of the date columns in all of the joined date tables.

That is only a short piece of the never-ending WHERE clause (yes, it goes on and on, my friend…).  Wow, I am not doing something right.

To try to limit the join, I changed my measure to the following, this time specifying a column:

My4WkMeasure = CALCULATE ( [MyMeasure], Rolling_4_Date[DateID] )

That’s more like it!  Total time: 48ms.  Again, I am not exactly sure why it’s so much faster, but if you look at the queries, the WHERE clause is much simpler.

Ok, wrapping up, what did I learn here?  Always specify a column in your filter expression when using CALCULATE?  Always test your measures against production data?  Always keep DAX Studio handy?

I guess so; these are all good things to know.  More importantly, I got to immediately put to work the mind-bending stuff that I am learning in the SQLBI.com course Optimizing DAX.  I am only halfway through the course and am already using way more of DAX Studio than I knew even existed before I started.  This is not a paid endorsement, btw; I ponied up out of my own pocket for a bunch of the videos that these two mad scientists cooked up, and they have been worth every penny.

Using DAX Studio, I was able to performance-test a measure by itself, quantify how bad a job I did with it, see where the slowness was coming from (yes, it’s the WHERE clause that doesn’t end…), make a change, and immediately quantify how fast the measure was once I fixed it.

I will try to blog more perf scenarios as I come across them, but for now, if you don’t know DAX Studio and you are responsible for DAX/Power BI/SSAS Tabular performance, you need to learn it now.

And in case I didn’t firmly stick Lambchop into your ear with this post, enjoy:

Let it go, let it go, can’t import any more!

Ran into this today.  For once I was being patient and didn’t kill Power BI in its tracks and start over, but maybe I should have?

I was doing a full import of my customer’s data set (24+ million rows) and, knowing it would take a while, I was doing my normal shiny-object hopping around the Power BI world on the internet.  Somewhere along the line, my import seemed to freeze and the rows-imported count stopped increasing.

Since I live out in the East Texas Outback, you can bet that it took a while to get the 21 million rows it had loaded before it hung, so I started through the stages of grief about this import.

  • Shock
    • Whaaa?
  • Denial
    • This can’t be happening, it’s not happening!
  • Anger
    • Quick route here too btw, dang it Microsoft!
  • Bargaining
    • Ok, let me try killing off all these other processes.  Here computer, please, take more memory and CPU and make it work again, please!
  • Depression
    • This sucks man, I have lost so much time.  Why does this ALWAYS (I have a teenager, sorry) happen to me!  Pout, whine, complain.
  • Testing
    • Task Manager says that it’s still pulling down data, still churning away with the CPU.  Oh well, time to give up and hit cancel.  Wait, I can’t; it just made a noise like it’s got a window open somewhere (cue sinking feeling…).  Let me try ALT-TAB and see if there is a hidden notification window or something… oh rats.
  • Acceptance
    • Time to kill it and start over.

I am not exactly sure how I managed to fill up a drive, but it was suspicious that Carbonite was working away in the background during all of this.  It wouldn’t surprise me if Carbonite had locked a file that Power BI was trying to update, but I’m not sure.

My point is, always check for those hidden windows.  That could have saved me a big chunk of time.


Implementing slicer driven currency calculations

My customer recently wanted to add international currency support to their Power BI reports.  He gave me an Excel file with conversion rates for the current year and asked that users be allowed to choose which currency format they wanted to see their data displayed in.

There is no easy ‘out of the box’ way to implement this.  The slicer needs to slice the result of a measure, which of course needs to be written generically.  There was a great video on the Curbal YouTube channel that walked through a method of using slicers to change the format of currency values.  I originally used this approach but ran into problems displaying the data in a matrix visualization.  Since we were formatting the data as strings, it would not sort properly in a matrix; the values were being sorted as strings instead of being treated as numbers.

Therefore, I went with the following approach that combined her very clever way of using slicers and helper tables with keeping the values as numbers so that we could sort the data properly.

I imported the Excel file as a standalone table with no relationships to any of the existing tables in the model.  It exists simply as a lookup table.

Notice that since this is a US company, the raw data is stored in US dollars, so no conversion is needed for USD.

The first step is to make a slicer based on the 'Exchange Rates'[Currency Type] column.

Any measure that returns a value in currency is built around the following DAX pattern:

RTL SALES TY =
VAR cur =
    IF (
        HASONEVALUE ( 'Exchange Rates'[Currency Type] ),
        VALUES ( 'Exchange Rates'[Currency Type] ),
        "USD"
    )
VAR res = 0 -- replace 0 with the actual measure expression
RETURN
    IF ( cur = "USD", res, res * VALUES ( 'Exchange Rates'[Exchange Rate] ) )

What happens is that when the measure is calculated, we first check whether the Exchange Rates table is being filtered down to one row.  If so, we store that row’s currency type in the variable cur; otherwise, we store the default value of “USD”.  We implement our measure and store its value in a variable called res.  The final line of the measure checks the value of cur, and if a currency type other than “USD” has been selected, we multiply the result of the measure by the exchange rate.

The drawback of this approach is that we lose the ability to have a currency symbol associated with our measure results.  Since we don’t know what currency the user is going to select, we can’t format the measure as a specific currency.  The customer wanted to ensure that users could easily see what type of currency was being displayed, so I created a simple measure that shows which currency type is selected.

Selected Currency =
SELECTEDVALUE ( 'Exchange Rates'[Currency Type], "Multiple Selected" )

SELECTEDVALUE is very similar to HASONEVALUE.  If a slicer in a visual limits the 'Exchange Rates' table to just one row, the measure displays the Currency Type of that row.  If not, it displays ‘Multiple Selected’.

The following animated GIF is a quick demo of the approach with Worldwide Importers data.