Global Warming: self-learning journey to build the story with data science

The first in a series where I explore NOAA climate data to demonstrate data science workflows.

When I was working at MathWorks, I had the opportunity to create a MATLAB demo that walks through a data science workflow on Domino Data Lab's MLOps platform. The demo showed how to use climate data from NOAA to build a simple prediction tool based on machine learning (ML) regression.

With the goal of telling you whether you should consider buying an air conditioner, the model predicts how many hot days upcoming years hold in store. A hot day is defined as one with a temperature over 29 °C (about 84 °F). You name a location, and the model predicts the number of hot days.
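
To make that concrete, here is a minimal sketch of the idea in Python, assuming a pandas DataFrame of daily observations with hypothetical DATE and TMAX (°C) columns; the plain linear fit stands in for whatever regression the finished demo ends up using.

```python
# Minimal sketch: count the days per year above the 29 °C threshold and
# extrapolate the yearly count with a simple regression.
# The DATE/TMAX column names are illustrative assumptions, not NOAA's schema.
import pandas as pd
from sklearn.linear_model import LinearRegression

def predict_hot_days(daily: pd.DataFrame, year: int, threshold_c: float = 29.0) -> float:
    """daily: one row per day, with a DATE column and a TMAX column in degrees Celsius."""
    df = daily.copy()
    df["year"] = pd.to_datetime(df["DATE"]).dt.year
    df["hot"] = df["TMAX"] > threshold_c

    # number of hot days observed in each year
    hot_per_year = df.groupby("year", as_index=False)["hot"].sum()

    # fit hot-day counts against the year and extrapolate
    model = LinearRegression().fit(hot_per_year[["year"]], hot_per_year["hot"])
    return float(model.predict(pd.DataFrame({"year": [year]}))[0])
```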

More than two years later, I am now a member of the Domino team, and I am embarking on a new project to demonstrate our deep integration with a major data store partner, this time building the demo in Python. This post and the ones that follow will act as my travelogue for the project.

Data: NOAA is pretty amazing

The US National Oceanic and Atmospheric Administration (NOAA) provides interested parties with climate data collected across the globe. Not being a climatologist, I look to glom onto data that is big and that I can understand (and I do believe it is far easier for a subject matter expert to learn data science techniques than the other way around…). NOAA has plenty of data for climate neophytes. The tool I found most useful is the Global Historical Climatology Network, or GHCN to its friends. GHCN provides location- and date-based climate records for thousands of weather stations around the world. The data includes:

  • Minimum temperature
  • Maximum temperature
  • Precipitation
  • Snowfall
  • Wind

and much more, as listed in section III of the GHCN documentation.
It is truly a treasure trove of data for you to reach out and look at. Thanks, NOAA!
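
As a taste of how approachable this is, here is a hedged sketch of pulling one station's daily records with pandas. The per-station CSV URL pattern and the station id below are assumptions on my part (check the GHCN-Daily documentation for the authoritative access paths); GHCN-Daily reports temperatures in tenths of a degree Celsius, hence the division by ten.

```python
# Hedged sketch: read one station's daily records straight from NOAA.
# The URL pattern and station id are assumptions used for illustration only.
import pandas as pd

GHCN_ACCESS = "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access"
STATION_ID = "GME00111445"  # hypothetical placeholder for a Berlin-area station

daily = pd.read_csv(f"{GHCN_ACCESS}/{STATION_ID}.csv", low_memory=False)

# GHCN-Daily stores temperatures in tenths of a degree Celsius
daily["TMAX_C"] = daily["TMAX"] / 10.0
print(daily[["DATE", "TMAX_C", "PRCP", "SNOW"]].tail())
```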

In the two years since I worked on the original demo, NOAA has spiffed up the way it shares its data. And since I am working with a data store partner and want to show both how it crushes big data and how we work together, the workflow I will illustrate is:

  1. Preliminaries – Data Engineering
    1. Find data
    2. Understand the data
    3. Get the data
    4. Wrangle data into a format usable for analysis
  2. Analysis
    1. Look at the data – normally using a subset of the complete dataset
    2. Clean up the data – deal with missing and errant data
    3. Identify the features (variables) that you believe matter for your prediction to work
  3. Model development
    1. Try out several algorithms to determine which one produces the best results (a sketch of this step follows the list)
    2. Save the training function
  4. Model Training
    1. Run the model training function on the complete dataset
    2. Collect the model
    3. Test again
  5. Model Monitoring
    1. Ensure the model remains valid given new data and ground truths
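
For the model development step, trying out several algorithms can be as plain as the sketch below: fit a few candidate regressors on a held-out split and compare an error metric. X and y stand for whatever features and target the analysis phase produces, and the specific models here are illustrative, not the demo's final choice.

```python
# Sketch of "try out several algorithms": fit candidates, score on held-out data.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def compare_models(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    candidates = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(random_state=0),
        "gradient_boosting": GradientBoostingRegressor(random_state=0),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    return scores  # lower mean absolute error is better
```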

NOAA made it interesting

My original demo used historical data from a single station, which recorded the weather for every day of the last century or so. To simplify things, I used data from the last 20 years and focused on Berlin, where the friend who prompted the idea for the demo lives. In those two years, the weather station I relied on shut down, together with the airport that housed it, Tegel. Which teaches you something about weather stations: they are not immortal. And because this is Berlin, there are also gaps in the record (world wars will do that to you), so you need to keep a watchful eye on the data.

The station data file consists of one row per day, each containing the data points I listed above. That makes it easy to review the data and understand what's going on. Moving to the larger, more complete data set, NOAA offers a carrot and a stick of sorts: you can now download the entire historical, global data set in one file (12 GB as a gzipped tar, roughly 103 GB uncompressed). But that file has a different structure: each row holds a single observation, consisting of station id, date, element (data type), and value, e.g. the station in Potsdam, Germany, January 9, 1960, maximum temperature, 15 °C, followed by metadata about the observation.
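
To make the difference concrete, here is a sketch of reshaping that long, one-observation-per-row layout back into the familiar one-row-per-day table with pandas. The column names follow my reading of the GHCN-Daily readme, the file is a hypothetical local extract, and I am assuming the bulk CSVs ship without a header row.

```python
# Sketch: pivot the long "one observation per row" format into one row per
# station-day, with TMAX, TMIN, PRCP, ... as columns.
import pandas as pd

cols = ["station", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
obs = pd.read_csv("ghcn_daily_subset.csv", names=cols)  # hypothetical local extract, no header

wide = (
    obs.pivot_table(index=["station", "date"], columns="element", values="value")
    .reset_index()
)
print(wide.head())
```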

While this format is more confusing, it lays a foundation for collecting data incrementally from this point on, because NOAA also provides a daily diff file. So if you want to maintain the data on your own server, all you need to do is download the daily updates and changes. Which is pretty nifty, and something I plan to use to demonstrate Domino's model monitoring capability.
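
I have not worked out the exact mechanics of NOAA's diff files yet, but the incremental idea looks roughly like this heavily hedged sketch: treat (station, date, element) as the key and let the newest file win. The file names and the schema of the update file are assumptions; the real diff semantics need checking against the GHCN documentation.

```python
# Heavily hedged sketch: merge a daily update file of new/changed observations
# into a local copy, keyed on (station, date, element).
import pandas as pd

cols = ["station", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
current = pd.read_csv("ghcn_local_copy.csv", names=cols)    # hypothetical local store
update = pd.read_csv("ghcn_daily_update.csv", names=cols)   # hypothetical daily diff file

merged = (
    pd.concat([current, update])
    .drop_duplicates(subset=["station", "date", "element"], keep="last")  # updated rows win
)
merged.to_csv("ghcn_local_copy.csv", index=False, header=False)
```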

In any case, I am still trying to make sense of the data. Future posts will discuss things in more detail as I figure them out.

Next chapter: Data Loading into Snowflake
