I started to really get obsessed with data back when I was supporting SharePoint. SharePoint used to (maybe still does?) have this huge diagnostics database that logged amazing amounts of data by default. I am sure the product team had imagined writing all sorts of cool stuff around it but that feature must have been cut because the database appeared in SharePoint almost as an afterthought. I spent many hours trying to use all sorts of tools to pry knowledge from that database but I never really got too far.
My other data obsession, which started around the same time, was digging through historical weather data. Ever since seeing this episode of The Simpsons, I wondered what data was being collected by those mysterious weather stations. Years later I finally found my answer when I stumbled across the National Centers for Environmental Information website, found here: https://www.ncdc.noaa.gov/
After some poking around, I found an FTP site where the mother lode of weather information was stored, and I have poked and prodded at it ever since. In part one of this series, I will cover how I automated pulling these files into an Azure Data Lake using Azure Data Factory. As the series progresses, I will cover ETL/ELT, building out a model for Power BI, and building some visualizations.
Fits and Starts
In the past, I had used an SSIS package with an FTP task that downloaded all of the files to my desktop. I used a For-Each container to spawn an instance of 7-Zip to unzip each file, then used an Azure Blob task to upload the files into Azure. This approach worked, but it hardly seemed enterprise worthy, so I started looking into doing the same thing with Azure Data Factory (ADF).
The obvious approach with ADF would be to create an FTP data set and use a Copy Data activity to pull the files from the FTP server and move them to my desired location. However, I hit an immediate brick wall. The NOAA FTP site wasn’t truly anonymous: to log in, you use the user name ‘anonymous’ and pass your email address as the password. The FTP data set in ADF only lets you choose Anonymous or Basic authentication, and the FTP server did not like that the ADF FTP data set automatically encrypted the password. Honestly, I was stuck here for a while.
What finally worked
Eventually, I found that NOAA had the same data stored on a website that allowed anonymous access. At that point it was off to the races; here is what I put together in ADF.
The website, https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/, looks like this:
Each file consists of a year’s worth of data in CSV format that has been GZipped. The URL of each file is just the URL listed above plus the name of the file.
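Outside of ADF, that pattern is easy to sanity-check with a few lines of Python. This is just a sketch, assuming each file is named `<year>.csv.gz` as in the directory listing; `build_url` and `fetch_year` are hypothetical helpers of mine, not part of the pipeline:

```python
import gzip
import urllib.request

BASE_URL = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"

def build_url(year: int) -> str:
    """Each file is named <year>.csv.gz, so the full URL is base + file name."""
    return f"{BASE_URL}{year}.csv.gz"

def fetch_year(year: int) -> str:
    """Download one year's GZipped CSV and return the decompressed text."""
    with urllib.request.urlopen(build_url(year)) as resp:
        return gzip.decompress(resp.read()).decode("utf-8")

if __name__ == "__main__":
    print(build_url(2018))
    # https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2018.csv.gz
```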
I wanted to be able to run a Copy Data activity against each file, so I brute-forced a list of the file names in Excel. I then created my pipeline in ADF and stored the file names as an array in a pipeline parameter. I first entered a few into the dialog as shown below:
I then used the code button to enter the rest directly into the pipeline’s JSON rather than trusting myself to paste all of those names into that little box.
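As an aside, the brute-force Excel step could also be scripted. A minimal sketch, assuming the yearly files run from 1763 through 2018 (adjust the range to whatever the directory actually lists):

```python
import json

# Assumed year range -- check it against the files actually listed
# at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
file_names = [f"{year}.csv.gz" for year in range(1763, 2019)]

# A JSON array ready to paste into the pipeline parameter's value
print(json.dumps(file_names))
```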
I then created my HTTP linked service, which uses that URL as its base URL:
Next, I created my HTTP data set.
Note that I have the compression type set to GZip and that the relative URL is using a data set parameter. During each iteration of the ‘For-each’ activity, I will pass a value down to the copy activity containing the file name that completes the URL needed to get the file. That file name is stored in the data set, as shown in the Parameters section of the HTTP data set properties:
I created the target data set to point to Azure Data Lake, nothing out of the ordinary there.
OK, finally I created the ‘For-each’ activity in my pipeline. I specified that on each loop it will iterate through the list of file names stored in the pipeline parameter by entering the following in the settings of the ‘For-each’ activity:
I then created the activity that I wanted my ‘For-each’ activity to run. This is the ‘Copy Data’ activity that moves the data from the web site to Azure. It is pretty straightforward except for specifying the source. If you remember, I created a parameter on my data set called ‘PipelineFileList’. Because of that parameter, when you specify that data set as the source in the copy activity, you get a chance to tell it what to store in that parameter, as shown:
This is what finally ties it all together. By specifying @item() as the value to pass down to the data set, each time the ‘For-each’ loop iterates, it takes the current file name and passes it to the source data set, which in turn pulls it from the parameter and uses it as the file name when making the request to its HTTP source.
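Stripped of the ADF machinery, the ‘For-each’/@item() handoff is just a loop over the file list. Here is a rough Python equivalent of the behavior, where `copy_file` is a stand-in for one Copy Data run; the injectable `fetch` argument is my own addition so the loop can be exercised without a network call:

```python
import gzip
import urllib.request

BASE_URL = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"

def copy_file(file_name, fetch=None):
    """Stand-in for one Copy Data run: request base URL + relative URL
    (the file name passed down via the data set parameter), then decompress."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return resp.read()
    return gzip.decompress(fetch(BASE_URL + file_name))

def run_pipeline(file_list, fetch=None):
    """Mirror of the 'For-each' loop: each @item() becomes one copy run."""
    return {name: copy_file(name, fetch) for name in file_list}
```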
From here, it was just a matter of saving, publishing and letting it rip.
The speed involved was breathtaking. Remember where I specified GZip as the compression method? The pipeline ran for roughly 12 minutes and downloaded, decompressed, and copied over 94 GB of data to Azure Data Lake.
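As a back-of-the-envelope check on that claim (treating 94 GB as the amount landed in the lake):

```python
# Rough throughput: 94 GB moved in about 12 minutes
gb_moved = 94
seconds = 12 * 60
mb_per_second = gb_moved * 1024 / seconds
print(f"{mb_per_second:.0f} MB/s")  # roughly 134 MB/s
```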
The next post in the series will look at how I start chewing through the data and getting it ready to pull into my data model.