Selecting lakes – and the story of creating “a poor woman’s supercomputer”

A true story of an impossible mission

Imagine that it’s your first day at a new job. After slogging away as a graduate student for years, what a relief to land your first real job!


An email suddenly arrives from your new boss:


Fleeing this situation would be a natural reaction. (For appropriate destinations, we suggest you read our previous blog post from March 10th, 2017.)

We’d now like to introduce you to the “poor woman” who was assigned this task on her first day of work: Geneviève Potvin, our new GIS specialist at Lake Pulse (the supercomputer part comes later, so keep reading).


A second email quickly follows from the boss:


If you know of a better way to set someone up for failure when they start a new job, please send us a message. We’re always looking for ways to improve.

At this point, if you’re Geneviève, you either take up the challenge or, let’s see, McMurdo Station might start to sound like a viable alternative. Although Geneviève may seem pretty normal, she actually isn’t: she’s in the geomatics program at the Université de Sherbrooke (I expect a pay raise for this shout-out), and she’s extremely determined!

So, what did we want Geneviève to do and why?

Lake Pulse aims to provide an assessment of the “health status” of Canadian lakes, that is, how different a lake is now from how it functioned before human impacts. For a medical analogy, think of human impacts as a disease and the lake as a human body. Our lake measurements, like the tests a doctor would perform (such as taking your temperature to detect a fever), allow us to make a diagnosis. However, because we’re interested in the health of “Canadian lakes” – and not just one lake – we must sample many lakes across Canada. Furthermore, because we want to extrapolate our results to most Canadian lakes, we need to sample randomly (a basic tenet of statistics).

This all comes with a few caveats.

For example, if we were to call 500 people across Canada to ask whether they’re feeling ill today, we could come up with a fraction of Canadians who are feeling sick (say 1/10), and then easily extrapolate to the whole of the Canadian population (1/10 * 35 million = 3.5 million people feeling sick). This would be a fairly good estimate and probably be correct within a few percent.

But, if we’re interested in knowing how many people are feeling sick in the Quebec countryside, this approach could be very inaccurate. Even if we knew the exact population size for the Quebec countryside, what if there is a much lower fraction of people feeling sick there compared to cities? Indeed, because we randomly chose the sample, most of the people we called would live in cities (over 80% of the Canadian population lives in urban areas). Therefore, we would have collected very little data on people living in the Quebec countryside.

The Lake Pulse approach

The vast majority of lakes in Canada are located on the Canadian Shield. So, by randomly choosing lakes, we would mostly sample lakes in this region and would be hard pressed to say anything about the rest of the country.

Similarly, there are many more small lakes than large lakes, so randomly choosing lakes would mostly yield small lakes. It would be as if every Canadian family had 20 kids on average and we interviewed whoever answered the phone… we would mostly interview kids.

The Lake Pulse Network also aims to examine lakes that are strongly impacted by humans. We thus need to sample a range of altered lakes from nearly pristine conditions to destroyed lakes that are essentially “dead”.

  1. We solved the first problem regarding the uneven distribution of lakes by deciding to have a set number of lakes in different regions of Canada called ecozones.
  2. To solve the problems related to the size of lakes, we decided to choose an equal number of lakes in three size classes.
  3. The necessity of sampling different levels of lake alteration was addressed by choosing lakes within three classes of human impacts in the watersheds.
  4. We also limited the smallest lakes to 0.1 km² (there are too many smaller lakes) and the largest lakes to 100 km² (the largest lakes require different sampling strategies).
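To make the four rules above concrete, here is a minimal sketch in Python of how a stratified selection like this could work. The size-class boundaries (1 km² and 10 km²), the field names, and the number of lakes per stratum are invented for illustration – they are not the actual Lake Pulse criteria:

```python
import random
from collections import defaultdict

def select_lakes(lakes, per_stratum=3, seed=42):
    """Randomly pick up to `per_stratum` lakes from every
    (ecozone, size class, impact class) combination.

    Each lake is a dict with an ecozone, a surface area in km²,
    and a human-impact class such as "low", "medium" or "high".
    """
    def size_class(area_km2):
        # Only lakes between 0.1 and 100 km² are eligible (rule 4).
        if area_km2 < 0.1 or area_km2 > 100:
            return None
        # Three size classes (rule 2); the cut-offs here are assumptions.
        if area_km2 < 1:
            return "small"
        if area_km2 < 10:
            return "medium"
        return "large"

    # Group lakes into strata: ecozone (rule 1) x size (rule 2) x impact (rule 3).
    strata = defaultdict(list)
    for lake in lakes:
        cls = size_class(lake["area_km2"])
        if cls is not None:
            strata[(lake["ecozone"], cls, lake["impact"])].append(lake)

    # Draw the same number of lakes at random from each stratum.
    rng = random.Random(seed)
    selection = []
    for group in strata.values():
        selection.extend(rng.sample(group, min(per_stratum, len(group))))
    return selection
```

The key design point is that randomness happens *within* each stratum, so rare combinations (say, large, heavily impacted lakes outside the Canadian Shield) are guaranteed representation instead of being swamped by the many small Shield lakes.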

To understand our lake selection strategy, imagine that for each ecozone we placed an equal number of lakes in each of the orange or red squares in the figure above and to the right.

Simplifying the problem further

Let’s get back to our story… Of course, we never really wanted Geneviève to fail (we quite like Geneviève in our neck of the woods in Sherbrooke, Quebec). How did we simplify the problem?

  • Reducing the size range of lakes actually left her with just 274 173 lakes to work with.
  • We further decided to limit ourselves, just this summer, to ecozones in Eastern Canada (well, that was mostly for logistical reasons and not so much to help Geneviève). Instead of having millions of lakes to sift through, she now had only about 180 690 lakes.
  • Next, we decided that, at least for this year’s sampling, we should select only lakes accessible by road. This further reduced the number to about 50 000 lakes (a bit of geomatics magic can tell you this).

You might think to yourself – well, that is completely manageable! Uh, wait a minute… There’s a saying that goes: “Everything multiplied by 50 000 is still a big number”. This saying (that we made up) may not always be true… unfortunately, in this case it is…

Indeed, delineating a watershed once you have all the data in place takes about 50 seconds on a decent computer. A bit of math will show that it would take you about 29 days just to do that part of the computation. Clearly, having taken about a month and a half to prepare all the data, there was no time left to run the calculations. Unless, of course, you have access to a supercomputer!
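The “bit of math” is simple enough to sketch (using the numbers from this post; 86 400 is the number of seconds in a day):

```python
# Rough serial-runtime estimate from the figures quoted above.
SECONDS_PER_WATERSHED = 50   # time to delineate one watershed
N_LAKES = 50_000             # road-accessible lakes to process

total_seconds = SECONDS_PER_WATERSHED * N_LAKES
days = total_seconds / 86_400  # seconds in a day
print(f"{days:.0f} days")      # about 29 days on a single computer
```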

Super solutions

It turns out that access to supercomputers is actually pretty easy in Canadian academia. After contacting Calcul Québec (the group providing supercomputing facilities in Quebec), Geneviève had everything figured out: if we ran the computations on 100 processors, we would get the results in about 7 hours, with enough time to spare to finish the lake selection and perhaps even sleep a few hours before the 2 months were up. As we like to say around here, “easy peasy lemon squeezy.” That was until she received an email a week later saying that it would take a few months to install the software required to run the watershed analysis on the supercomputer. To avoid panicking, we tried to process this news rationally… Weren’t we in fact quite lucky? Apparently, software installation time, just like processing power, increases with the number of processor cores. If this weren’t the case, conversations like this would be the norm:

“Wassup, bro?”

“Yo, I just installed Pac Man on my new laptop.”

“How’d it go?”

“This new laptop is blazing fast, it only took 51 years!”

As such, laptops would only be useful for bristlecone pines or Greenland sharks.

We’re being facetious here! Of course we understand that software installation and licensing for supercomputers is another beast altogether.

(Calcul Québec: we’d still love to work with you on this, honestly!)

Plan C, anyone?

Back to square one, with one week less to go! Time for “plan C” and it better be a good one! While struggling to come up with “plan C”, we wondered: “what is a supercomputer nowadays anyway?” Isn’t it simply a bunch of normal computers strung together with appropriate software to efficiently dispatch the work to multiple cores running in parallel? University computer labs contain a bunch of normal computers… What if Geneviève could act as the “software to efficiently dispatch the work …”? Perhaps the “Disciple” character from the “Léonard” comic books comes to mind (image at right).

At least we now had a plan: take over a computer lab for a weekend and split the lakes into small batches to run on 17 computers. Geneviève said, “I’ll sleep there if I have to!” (Note to the Université de Sherbrooke: Of course she didn’t! Otherwise, we would have completed all the necessary paperwork.)

Et voilà… three days later the computations were done! Geneviève only had to “put the lake maps back together again” (and she didn’t even need all the King’s horses and all the King’s men).

Here is “the poor woman’s supercomputer”:

(pssst…what did Geneviève do after accomplishing this impossible mission, you ask?
Pump up the volume before you hit the link!)

The rest is history, as they say…

Here is the first map showing the initial set of lakes selected. This selection is now being refined, but it allows us to make much better plans for sampling this summer!

P.S. While Geneviève was the superhero of this blog post, she did have a sidekick in this adventure… if you will, the Watson to her Sherlock, the Robin to her Batman or an Obelix (without the extra weight) to her Asterix. We’re talking about the newly hired Jelena Juric, our database and informatics guru, who was steadfastly fighting the good fight.
(We’re told that while Geneviève was fighting Evilsupercomputer, Jelena was squashing bugs that were attacking the software).