Solving a “gaps and islands” problem using LAG [SQL]

At my work, we have a lot with sensors installed all over the city, which take measurements at certain intervals, e.g. once per minute. For example, a water sensor, which every minute measures the current level of rain water in an underground basin.

Most of the time the sensor will measure zero amount of water. So our dataset can look something like this, during a day with a couple of showers:

The water sensor works as a tipping bucket; it reports the amount of water in the bucket, and after a while it will empty the bucket. This means that the mm of water represents the total, accumulated amount of water in the bucket. As such, you will never see the amount decreasing; it will go from the max value to zero.

I am interested in understanding two things:
1. How many rain events occur per day
2. How much water is registered in each rain event

My data is stored in a PostgreSQL database (version 13). First, let’s define the data I have available:

Column nameData type
sensor_idvarchar
timestamptimestamp
no_measurementint
valuenumeric

Here’s an example of what the data can look like:

sensor_idtimestampno_measurementvalue
SPILD_SIFIX012023-07-03 07:35:00.00010
SPILD_SIFIX012023-07-03 07:36:00.00020
SPILD_SIFIX012023-07-03 07:37:00.00030.1
SPILD_SIFIX012023-07-03 07:38:00.00040
SPILD_SIFIX012023-07-03 07:39:00.00050.1
SPILD_SIFIX012023-07-03 07:40:00.00060.2
SPILD_SIFIX012023-07-03 07:41:00.00070
SPILD_SIFIX012023-07-03 07:42:00.00080
SPILD_SIFIX012023-07-03 07:43:00.00090.2
SPILD_SIFIX012023-07-03 07:44:00.000100.4
SPILD_SIFIX012023-07-03 07:45:00.000110.4

The first problem I need to solve is how to isolate rain events and figure out if a rain event has occurred over multiple measurements (aka through time).

By removing the zeros, I get the rain events. But how do I figure out whether a rain event continues over multiple measurements? My solution is to use the LAG function. This function looks backwards 1 row (1 is the default, it can be set to any amount of rows) and gives you the value of that row for a specific column. I can use this function, because I am given a measurement count column (“no_measurement”). This means that any concurrent measurement will have a no_measurement value the same as the previous + 1.

For example, a rain event is recorded with no_measurement = 8. If this measurement is part of a continuing rain event, the previous measurement should have no_measurement = 7 if it is the same rain event.

SELECT *,
  LAG(no_measurement) OVER(ORDER BY no_measurement) AS prev_meas
  FROM MyTable 
WHERE value!= 0::numeric
ORDER BY no_measurement

Now I have the expected previous no_measurement. Next step is to group the rain events (with a unique ID per group), by checking if the previous no_measurement fits with the current:

SELECT 
  sensor_id, 
  timestamp_meas, 
  value, 
  no_measurement,
  prev_meas,
  CASE WHEN m.prev_meas = no_measurement - 1 THEN currval('seq') ELSE nextval('seq') END AS group_id
FROM 
(SELECT *,  
  LAG(no_measurement) OVER(ORDER BY no_measurement) AS prev_meas
  FROM MyTable 
WHERE value != 0::numeric
ORDER BY no_measurement) m

Note that I am using a function called CURRVAL and NEXTVAL to give the groups a unique ID. I create a temporary sequence (called “seq”) from which I can get a unique ID using the CURRVAL/NEXTVAL. You create the sequence like this:

CREATE TEMPORARY SEQUENCE seq START 1;

Now I have given the 3 rain events unique IDs:

The final step is to use GROUP BY on the date, group_id, take the maximum measured value (because it is an accumulated value), add a start and end timestamp and calculate the duration of each event.

SELECT 
  DATE(timestamp_meas) AS date,
  sensor_id, 
  group_id, 
  ROUND(MAX(value),2) AS mm_in_total, 
  MIN(timestamp_meas) rain_start, 
  MAX(timestamp_meas) rain_end, 
  (EXTRACT(HOUR FROM MAX(timestamp_meas))-EXTRACT(HOUR FROM MIN(timestamp_meas))) AS duration_hour,
  (EXTRACT(MINUTE FROM MAX(timestamp_meas))-EXTRACT(MINUTE FROM MIN(timestamp_meas))) AS duration_minute
FROM (
  SELECT 
  sensor_id, 
  timestamp_meas, 
  value, 
  no_measurement, 
  prev_meas,
  CASE WHEN m.prev_meas = no_measurement - 1 THEN CURRVAL('seq') ELSE NEXTVAL('seq') END AS group_id
FROM 
(SELECT *, 
  LAG(no_measurement) OVER(ORDER BY no_measurement) AS prev_meas
  FROM MyTable 
WHERE value != 0::numeric
ORDER BY no_measurement) m) t
GROUP BY DATE(timestamp_meas), sensor_id, group_id
ORDER BY group_id

Reflecting on this solution, this query could also have been written as a CTE (common table expression), which I think would have made the query more readable. Here’s a nice example of such a solution.

Time series for SSM, ET and precipitation for Myanmar [Google Earth Engine]

Back in December 2018, I was in Yangon (Myanmar) to help facilitate a workshop on satellite-based remote sensing to support water professionals in the country. For this workshop, I created a script to calculate a time series for surface soil moisture, evapotranspiration and precipitation.

The data for this script are:

The first thing to do is to load the shapefile of Myanmar, using the LSIB dataset. We need to deal with the auto generated system:index, which otherwise causes some visualisation issues later on. The auto generated ID for the Feature could be something like 000000000000000092.

My solution is a bit of a ‘hack’, because in this specific case, I only have a single Feature in my FeatureCollection, i.e. the shapefile of Myanmar. That means that I can use .first() and .set() to change the system:index of the single Feature. In this instance, I change the system:index to Myanmar.

// Select country from list
var countries = ee.FeatureCollection("USDOS/LSIB/2017");
var fc = countries.filter(ee.Filter.inList('COUNTRY_NA',['Burma']));
print('Old system:index', fc);

// Change system:index from the auto generated name
var feature = ee.Feature(fc.first().set('system:index', 'Myanmar')); // .first() is used as there is just one Feature
var Myanmar = ee.FeatureCollection(feature);
print('New system:index', Myanmar);

// Add Myanmar and center map
Map.centerObject(Myanmar);
Map.addLayer(Myanmar, {}, 'Myanmar shapefile');

This gives the following output:

Now I load the imageCollections of the three datasets.

// Filter datasets for bands and period
var SSM = ee.ImageCollection('NASA_USDA/HSL/SMAP_soil_moisture')
      .select('ssm')
      .filterBounds(Myanmar)
      .filterDate(startdate, enddate);

var ET = ee.ImageCollection('MODIS/006/MOD16A2')
      .select('ET')
      .filterBounds(Myanmar)
      .filterDate(startdate, enddate);
      
var PRCP = ee.ImageCollection('UCSB-CHG/CHIRPS/DAILY')
      .select('precipitation')
      .filterBounds(Myanmar)
      .filterDate(startdate, enddate);

Next, I create a single image consisting of the mean of each imageCollection for the given time period. This is purely for visualisation purposes as you will see next.

// Calculate means for visualisation
var SSM_mean = SSM.mean()
  .clip(Myanmar);

var ET_mean = ET.mean()
  .clip(Myanmar);

var PRCP_mean = PRCP.mean()
  .clip(Myanmar);

I set the visualisation parameters, and add the data to my map.

// Set visualisation parameters    
var soilMoistureVis = {
  min: 10.0,
  max: 28.0,
  palette: ['0300ff', '418504', 'efff07', 'efff07', 'ff0303'],
};

var evapotranspirationVis = {
  min: 0.0,
  max: 300.0,
  palette: [
    'ffffff', 'fcd163', '99b718', '66a000', '3e8601', '207401', '056201',
    '004c00', '011301'
  ],
};

var precipitationVis = {
  min: 1.0,
  max: 30.0,
  palette: ['#ffffcc','#a1dab4','#41b6c4','#2c7fb8','#253494'],
};

// Visualize maps
Map.addLayer(SSM_mean, soilMoistureVis, 'SSM');
Map.addLayer(ET_mean, evapotranspirationVis, 'ET');
Map.addLayer(PRCP_mean, precipitationVis, 'PRCP');

That gives us the following output:


Now what we really want is a graph showing the development of the data through time. Due to the limited memory capacity of Google Earth Engine, I can’t get more than 1.5 month of data, but that is enough to have a look at the onset of the monsoon season in May/June.

I use the ui.Chart.image.seriesByRegion where the arguments are: imageCollection, regions, reducer, band, scale, xProperty, seriesProperty (last four are optional). In addition, I use .setOptions() to add a main title and axis titles.

// Define graphs
var SSMGraph = ui.Chart.image.seriesByRegion({
  imageCollection: SSM, 
  regions: Myanmar, 
  band: 'ssm',
  reducer: ee.Reducer.mean(),
  scale: 25000, // 25km resolution
}).setOptions({
        title: 'Average soil moisture',
        hAxis: {title: 'Date'},
        vAxis: {title: 'Soil moisture [mm]'}});
        
var ETGraph = ui.Chart.image.seriesByRegion({
  imageCollection: ET, 
  regions: Myanmar, 
  band: 'ET',
  reducer: ee.Reducer.mean(),
  scale: 500, // 500m resolution
}).setOptions({
        title: 'Average evapotranspiration',
        hAxis: {title: 'Date'},
        vAxis: {title: 'Evapotranspiration [kg/m^2]'}});

var PRCPGraph = ui.Chart.image.seriesByRegion({
  imageCollection: PRCP, 
  regions: Myanmar, 
  band: 'precipitation',
  reducer: ee.Reducer.mean(),
  scale: 5000, // ~5km resolution
}).setOptions({
        title: 'Average precipitation',
        hAxis: {title: 'Date'},
        vAxis: {title: 'Precipitation [mm/day]'}});

// Display graphs
print(SSMGraph);
print(ETGraph);
print(PRCPGraph);

Which gives us the following output:

Note that “Myanmar” appears in the legend of the graph. If we had not changed the system:index in the beginning, you would instead see something like “000000000000000092“.

Link to the entire script: https://code.earthengine.google.com/e9042e4ac5fc902d1ff58da10856965f