SentinelBot, the Twitter account I created for putting together and sharing tiles from the ESA Sentinel-2 satellites, recently received another upgrade, which I didn't really get around to advertising – mainly because I planned to add some more features, but ran out of time in the end.
The new functionality allows users to request an image around a given lat,lng coordinate. It’s actually very simple, with the steps being:
Listen for tweets with @sentinel_bot mentions – this is simple using the Twitter developer platform. It essentially just listens for all activity on the timeline and parses the text. If a given activity mentions the bot, it moves to step 2.
Unpack the coordinates – I could have used a regex here, but instead I just cut off the bot mention and split the lat,lng coordinates on the comma separator – any formatting issue here will cause an error. If we get the coordinates, move to step 3.
Figure out which available images intersect with the given point – this is done using earth-search on AWS. Within this query, we look for images with less than 5 percent cloud cover.
Reproject our lat,lng coordinate into the coordinate reference system of the target image
Create a geojson polygon around this point, extending a preset amount in each direction – for this step I take 250 pixels in each direction from the point (forming a 500 x 500 pixel image) so I can run an instance with a small amount of memory
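The parsing and polygon steps above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the bot's actual code – the function names, the 10 m pixel size, and the projected example coordinates are my own, and the earth-search query (step 3) and the reprojection (step 4, e.g. with pyproj) are omitted since they need network access and extra dependencies:

```python
def parse_coords(tweet_text):
    """Step 2: cut off the bot mention and split lat,lng on the comma.
    Any formatting issue raises ValueError, as described above."""
    body = tweet_text.replace("@sentinel_bot", "").strip()
    lat, lng = (float(part) for part in body.split(","))
    return lat, lng

def bbox_polygon(x, y, pixels=250, res_m=10.0):
    """Step 5: a GeoJSON polygon extending `pixels` in each direction
    around an already-reprojected point (Sentinel-2 visible bands are
    10 m/pixel, so 250 pixels = 2500 m)."""
    half = pixels * res_m
    ring = [
        [x - half, y - half], [x + half, y - half],
        [x + half, y + half], [x - half, y + half],
        [x - half, y - half],  # GeoJSON rings are explicitly closed
    ]
    return {"type": "Polygon", "coordinates": [ring]}

print(parse_coords("@sentinel_bot 52.205, 0.119"))  # → (52.205, 0.119)
```

Keeping the window to 500 x 500 pixels, as mentioned, is what lets the whole thing run on an instance with a small amount of memory.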
After I finished the PhD that most of this blog is dedicated to, I began my first industry job at KisanHub in 2018. Since then, I have spent over three great years learning about web services, building modern applications, and applying the science from my Remote Sensing PhD. It has been great to work with such a forward-looking company, and I have had great mentors (more from industry than academia) along the way. It's an altogether different skillset from academic research. To mark the occasion, I thought I would write up some of the bigger lessons from working at a product company, and give an idea of what to expect if you are moving into one from academia. Obviously this is the tip of the iceberg, and many blog posts like this exist, but it is worth reiterating nonetheless!
1. Practice defining the scope of standalone ‘features’
Discussing implementation of seemingly small features can seem like time poorly spent, but it is worth its weight in gold
By a “feature” here I mean anything within an application that does something – for example, subscribes to a newsfeed, stores user settings, or creates a satellite image of a specific AOI. Thinking about these small things on websites we use every day will get you in the right mentality for building better software.
From time to time, something seemingly small would make its way onto our backlog. Think, for example (not a real-world example here), that we want to build a feature which recommends an action to a user, based on a single previous action. What can we do?
Maybe we could investigate the user's session for the latest “action” (loosely defined here) and have some sort of lookup table to select the next recommended action? Seems simple enough: we would need some logic somewhere to include the lookup, and then a prompt for the user. Something seemingly so benign can spiral out of control quite rapidly, however. Unless you are willing to track the feature's usage, ensure that it is useful (as opposed to gimmicky), and subsequently ensure that feature creep does not set in where it's not useful (e.g., improved recommendations based on the two previous actions), the feature will sit there in the application, just existing in perpetuity.
In the earlier days, I would have been inclined to just do it quickly and forget about it (“fail fast”), but you quickly realise that is not a good choice. A better one is to engage users and take your time in the design phase, to make sure that zombie features do not end up littering the application. Resist seemingly small features, as they will hang over you (and become technical debt) if they are not required. It is an art I am yet to master.
2. Product management is a key ingredient to any good software product
Having a good product manager makes work easy
During my PhD I worked with just a handful of people – my two supervisors and a few others in my PhD cohort. It was all project-based, and while I did develop and share small pieces of software, none of it was ever productised.
Many companies, on the other hand, follow some flavour of agile product/project management. This is quite a different approach to work, and in the teams I was part of, an integral role was that of the product manager – in short, the person who empathises with the users and brings potential enhancements to the company for discussion. It was particularly important in my earlier days, when I had loads of ideas I thought would be useful, but which just had no place in the context of the application we were building. During my PhD, I'd often spend time just doing those things anyway, so this degree of friction was new to me.
Good product people, however, ensure users are getting value from the software, and so don't just let engineers build whatever they want, as much as they might want to! In return, they take the majority of product research work away from developers, who can then focus solely on nailing the implementation. I love being in that position, where the work is laid out in front of you and you can just focus on getting it right.
3. Be prepared for the long haul with software – technical debt is part of software lifecycles
Prevention is better than cure – build relevant features, and build them to last
My PhD was mostly spent on tightly defined questions, developing proof-of-concept pieces of software to test hypotheses. Moving into industry, the applications we develop need to be more accessible than this in many ways, but the principal difference is that the software needs to be maintained.
I have learned so much about software lifecycles during my time at KisanHub. One topic is seemingly as old as software itself: technical debt, or how to handle existing/defunct software and old features that perhaps only a handful of users engage with. How do you decide when to kill a feature off? How do you decide when to fix a bug associated with an old feature? How do you ensure that bugs aren't introduced elsewhere as a result of this feature (“regressions”)?
These are tricky questions at the best of times, but in a small company with limited resources, the decision cycle time is amplified, and it is best to be consistent with your decision making. The best approach is of course to not be in the situation in the first place, by doing one’s best to ensure regression issues (where future changes can render previously developed software buggy) do not arise. Experience has taught me that this is impossible.
4. Start writing tests, but don’t go overboard
It’s never too early to start writing tests – learn what to test and when, and just do that
Tests are something I did not really write during my PhD. In that simpler time, I would write a script, check that the output seemed in some way sensible, and carry on. When you are working on an app, however, you need to ensure that bugs are not introduced when changing features which interact with other parts of the app (the “regressions” mentioned under technical debt). Tests ensure this by checking that independent, standalone features show consistent behaviour under test conditions.
However, some might be tempted to write a test for every scenario which exists or could possibly ever exist, or to independently test each method call in each library used by a service. There is a strange sort of paranoia that can set in around testing, which I have gotten used to over the years. My advice would be to think about the main use cases and write a reasonable number of tests for the conditions you know the user is intended to hit, plus several which can be abstracted out and shared across methods (e.g., passing invalid strings).
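As a sketch of that balance, here is a hypothetical example (the function and names are my own): a couple of tests for the main intended use cases, plus one shared invalid-input case, rather than a test per method call:

```python
def parse_percentage(value):
    """Parse a user-supplied percentage such as '45%' or ' 100 '."""
    pct = float(value.strip().rstrip("%"))  # ValueError on non-numeric input
    if not 0 <= pct <= 100:
        raise ValueError("percentage out of range: %r" % pct)
    return pct

def test_main_use_cases():
    # the conditions users are actually intended to hit
    assert parse_percentage("45%") == 45.0
    assert parse_percentage(" 100 ") == 100.0

def test_invalid_string():
    # one abstracted 'invalid string' case, shareable across similar parsers
    try:
        parse_percentage("not-a-number")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")

test_main_use_cases()
test_invalid_string()
```

Three focused tests like these cover the behaviour users depend on without re-testing the standard library.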
5. Don’t technobabble
Practice using clear language with non-experts in mind
Moving from a PhD, where attending conferences and having extremely focused meetings on tiny aspects of work is the norm, it really pays to tone it down once you leave the academic environment. At least towards the beginning, I would describe a piece of work I was doing, only to be met with “I have no idea what that is, but it sounds really interesting”. I've learned that using inaccessible language does nobody any favours, and while it may temporarily massage your ego, it's completely unsustainable long term.
Let’s take an example from the Remote Sensing world:
The NDVI response from annual arable crops generally fits a double logistic curve over a single growing season
Instead, we could try something as simple as:
A crop’s growth curve, as represented by NDVI (a chlorophyll metric) should have a single hump
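For reference, the NDVI mentioned in both versions is just a normalised band ratio; a minimal sketch (the reflectance values below are made up):

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Healthy vegetation reflects strongly in near-infrared and absorbs red,
# so NDVI climbs towards 1 over a growing season and falls back at harvest.
print(ndvi(0.5, 0.1))  # → 0.666...
```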
Moving from a PhD to KisanHub was a huge step change in the pace and cycles of work. When I first entered the company, I had no idea what it takes to bring an idea, or a piece of academic research, into an application that can serve an arbitrary number of users. I really didn't have any proper experience in database management or exotic, jargony things like DevOps. That's all second nature to me now – so definitely don't let lots of software jargon put you off applying for roles.
In my years at KisanHub I've helped contribute to a really great software product which serves a very important niche in the food supply chain ecosystem. We built software which delivers real-time satellite and weather insights to our global userbase, and helps them manage their crops from field to fork. It has been a really great experience.
I've been following the RadiantEarth project for months now, after having originally seen the CEO, Anne Hale Miglarese, speak at RSPSoc's conference in Birmingham in 2018.
They've released an initial version of mlhub, which anyone can open an account with. To celebrate, I've put together a done-in-20-minutes notebook which displays the data shapes/labels on a map, available for all to see.
Definitely going to keep a close eye on this in the future!
I recently contributed to a paper reviewing structure from motion photogrammetry in forestry. It was great to get back to my PhD roots, and I learned lots about surveying in forest environments over the course of its development.
I went to Mat Disney's inaugural lecture at UCL last Tuesday. Mat was (is?) the course coordinator for the Masters in Remote Sensing at UCL, and was reflecting on his career, how he got to where he is, and what the future might hold. I really enjoyed it, as there's often a veil of mystery over senior academics. I'll summarise the core points, as they're definitely of interest to wider audiences!
One of the first ports of call was a general discussion of trees. Their great diversity is worth celebrating – from the tall (up to 120 m!) redwoods of the west coast of the US to stumpy, flat trees on the sides of windswept valleys. Our scientific understanding of trees may be limited (and, we find out later, is), but appreciating them as amazing organisms is worth doing in the first instance.
But trees are hard to weigh
Carbon estimates for trees are a crucial input to the models driving climate change predictions, and Mat succinctly summarised the major gaps in knowledge associated with them. Firstly, to get a real measurement of the amount of carbon stored in a tree, you have no choice but to chop it down and weigh it. This is a huge and gruelling effort, so it's no wonder that, as of 2015, only 4,000 or so trees had been felled in tropical forests – the extrapolation of which gives our estimates of the amount of carbon in tropical forests. Obviously, this has huge implications for the accuracy of these models, as the size and diversity of the sample is minuscule when scaled globally. Even in the UK, where you would expect the measurements to be more refined than in the harsh environments of the tropics, we found out that carbon estimates are based on a sample of 60 or so trees from a paper written in the late 60s, with a simple linear relationship used to extrapolate to the whole of the UK! In data science we make lots of assumptions, but this is up there as a massive howler. So how can we hope to get more ground truth?
Enter our hero, the RIEGL laser scanner, which has gone on tours of tropical forests across the globe, taking 3D images of trees to weigh them where they stand. Mat has used these 3D images to redefine the principles of allometry – the science of the relative size of measurements (such as brain size vs body weight) – when it comes to trees. He revealed that allometric relationships underestimate carbon in tropical forests by as much as 20%! In the UK, he revisited the 60-odd samples on which all UK forestry estimates are based, and showed that these estimates can be off by as much as 120%! These are really incredible figures that show how far wrong we've been going so far.
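To make “allometry” concrete: these models typically relate an easy measurement to a hard one via a fitted power law. A toy sketch with entirely made-up coefficients (real models are fitted per region and species, and use more predictors than diameter alone):

```python
def biomass_kg(diameter_cm, a=0.1, b=2.5):
    """Toy allometric model: above-ground biomass as a power law of
    trunk diameter. The coefficients a and b here are invented."""
    return a * diameter_cm ** b

# If direct laser-scanning measurement shows the model underestimates
# by 20%, every extrapolated estimate needs scaling up accordingly:
naive = biomass_kg(50.0)
corrected = naive * 1.2
```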
The GEDI (recently launched LiDAR) and BIOMASS (PolInSAR) missions hope to tighten the link between the ground truth data recorded by the likes of Mat and satellite data, which should vastly improve our ability to estimate carbon stores in tropical forests. This, combined with Mat's clear communication of his methods and of the distinct gap in knowledge, makes it very important and interesting research!
Lastly, I’d like to give a big congratulations to Mat on the chair, it was well earned!
Zappa is a Python library which hugely simplifies the deployment of web apps by using AWS Lambda functions ('serverless'). In essence, the library packages up an existing app, for example a Flask application, and generates the endpoints required as Lambda functions.
Why is this useful?
Running servers, at least at a hobbyist level, can be pricey, especially if the app requires lots of resources. Lambda functions are perfect for demo applications, or things which are only infrequently needed, as the consumer pays only for the time the code is running, billed by the millisecond on AWS.
Generally speaking, Lambda functions have a slower start-up time than a 24/7 server. When a request comes in for a given function, if the function has not been called recently, it will need to be created before the request can be processed. This can be quite a high overhead for functions with many dependencies. Zappa helps with this by keeping the function 'warm' – periodically sending a request to keep it from being torn down. If a Lambda function gets bursts of requests, it can still take time to spin up clones of the function, limiting its effectiveness in production environments.
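The keep-warm behaviour is configurable in Zappa's settings file; a minimal, hypothetical `zappa_settings.json` might look like this (the app path, region, and bucket name are placeholders):

```json
{
    "production": {
        "app_function": "app.app",
        "aws_region": "eu-west-1",
        "s3_bucket": "my-zappa-deployments",
        "keep_warm": true,
        "keep_warm_expression": "rate(4 minutes)"
    }
}
```

With that in place, `zappa deploy production` packages the app and wires up the scheduled keep-warm pings.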
The terracotta library, which I have mentioned before on here, is a great example of how effective lambda functions can be. NDVI time series by Vincent Sarago is another great example.
Radiant Earth, whose CEO Anne Hale Miglarese I was lucky enough to see speak at the RSPSoc conference last year, has partnered with Amazon to provide more 'geodiverse' training data for machine learning models. I think this is timely, as the AI4EO paradigm sets in. The availability of Sentinel-2 Analysis Ready Data from S3, as well as the ability to partially read this data using gdal, makes it the preferred option over Google Earth Engine for me for geodevelopment, so I'm delighted about these continuing data releases. I've been reading about rastervision, and look forward to sinking my teeth into this data with that as a supporting tool to see what kind of learning can be done!
Geodiversity is required for reliable modelling (source)
Beyond Sentinel-2 data, there's so much opportunity to shift thinking on how to develop AI4EO models, extending to other metrics such as air quality (for instance from Sentinel-3 SLSTR).
In the air quality world, we would do well to better value data gathered and research done in “data-gap places.” Otherwise, we are at high risk of convincing ourselves that a parochial view projected onto the globe represents true scientific understanding. https://t.co/IBq62L20Js
The BBC have released the first of a documentary series focusing on Remote Sensing, and what it can teach us about our changing planet. It's definitely a tough subject to fill whole episodes with, so the style blends satellite imagery with storytelling on the ground, which makes for a very different kind of wildlife documentary experience.
I'm particularly curious as to how they produced the 'superzooms', which involve both zooming into, and out from, individual elephants in Africa to a continent-wide view, as they're extremely well done. I'm a bit skeptical as to how much space cameras are involved in videoing Shaolin monks, and am curious which satellites would even have the capability for this – maybe Vivid-i could capture a short video sequence, but the resolution wouldn't really be high enough to discern individuals, and the recently defunct WorldView-4 would only be able to capture stills. Regardless, it's a really well-paced, emotional episode which I enjoyed immensely.
EGU this year was a bittersweet affair, as I actually didn’t make the conference myself, despite having two posters presented on my behalf. I enjoy EGU, but this year my aim is to get to a few new conferences, and having already attended the amazing big data from space conference (BiDS) in Munich in February, I’m hungry to branch out as much as possible. Also on the agenda this year is FOSS4G (I have always wanted to go!) and RSPSoc’s conference in Oxford (this is one I think I will go to every year).
That being said, I did still submit two abstracts, both for poster sessions, with colleagues of mine presenting on my behalf. The first was another extension of my PhD work, which focused principally on the image quality of data collected in the field for photogrammetric work, and its effect on the accuracy and precision of photogrammetric products.
This extension used new innovations within the field to dive further into this relationship, using Mike James' precision maps (James et al., 2017). In essence, it investigates how stable sparse point clouds are when systematically corrupted with noise (in all of the camera positions, camera parameters, and points within the cloud). His research tries to refine a big unknown within bundle adjustment using structure-from-motion: how do we account for variability in the precision of measurement when presenting results? Due to bundle adjustment's stochasticity, we can never guarantee that our point cloud accurately reflects real life, but by simulating this sensor variation, we can get an idea of how stable it is.
In all, the research points to the fact that compressing data is generally a bad thing, causing point clouds to be relatively imprecise and inaccurate when compared with uncompressed data. It would be interesting to extend this to other common degradations to image data (blur, over/underexposure, noise) to see how each of those influences the eventual precision of the cloud.
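The Monte Carlo intuition behind those precision maps can be sketched in a few lines. To be clear, this is only the idea, not the method – the real approach perturbs observations inside the bundle adjustment itself, and the numbers here are made up:

```python
import random
import statistics

def precision_estimate(true_z, noise_sd, runs=1000):
    """Toy precision estimate: repeatedly perturb a measurement with
    Gaussian noise and report the spread of the results."""
    rng = random.Random(42)  # seeded so the sketch is repeatable
    samples = [true_z + rng.gauss(0.0, noise_sd) for _ in range(runs)]
    return statistics.stdev(samples)

# A noisier observation (e.g. from compressed imagery) yields a less
# precise point estimate:
assert precision_estimate(10.0, 0.05) < precision_estimate(10.0, 0.2)
```

In the real precision maps, this per-point spread is mapped over the whole sparse cloud, showing where the reconstruction is trustworthy and where it is not.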
Secondly, I submitted a poster about a simple app I made to present Sentinel-2 data to a user. It uses data from an area in Greece, with GeoServer serving the imagery behind a docker-compose network on an AWS server. It's very simple, but after attending BiDS, I think there is an emerging niche for rapid delivery of specific types of data at regional scales, at the cost of some generality. Many of the solutions at BiDS were fully general, allowing arbitrary scripts to be run on raw data on servers – comparable to what Sentinel Hub offers. By pruning this back, and using tools like docker-compose, we can speed up the spin-up and delivery of products, and offer solutions that don't need HPCs to run.
Sample of the app
Lastly, I’ve simplified my personal website massively in an attempt to declutter. I’ve just pinched a template from Github in order to not sink too much time into it, so many thanks to Ryan Fitzgerald for his great work.
That’s all for now, I’ll be writing about KisanHub in the next blog!
I think it's high time I restarted this blog, rather than let it disappear into oblivion. For the last year I've been working in Cambridge with an agricultural technology company called KisanHub, which aims to introduce a new wave of efficiency into crop monitoring and the food supply chain. I'll do a separate post on the company, but in this post I'm going to give general updates on the skills I've acquired over the year, and recommend skills budding EO web developers should accrue, in an attempt to demystify some of the jargon in the webdev/EO worlds.
Python is the de facto standard language for geoscientific computing, and so it makes sense to learn a web framework in this language. Django and Flask are both good options for common tasks in web development – a great example of Flask's power is terracotta, where you can go from local files to a full-blown interactive environment in one command.
Terracotta lets you make an XYZ tile server from a directory of GeoTIFFs
By far the biggest revelation in the way I work: docker takes virtual environments to an extreme. In a naive sense, it lets you download and run a different computer – stripping away all the barriers of system dependencies (gdal!), environment dependencies (pygdal!) and operating systems. The docker-compose command lets you run many of these computers in a network, exchanging information with one another. The kartoza docker-geoserver project is a great place to start for a demonstration of how easy it is to get a niche piece of software up and running. I generated a demonstration project here based on this network (source)!
Docker–geoserver based project
The natural extension of docker-compose, kubernetes lets you efficiently run the aforementioned docker-compose network by declaring how much resource (cpu/gpu/memory) is given to each part of the network, and defining rules for how the network scales/shrinks under certain conditions. It takes away most of the headache of managing servers and network configuration (my nginx config knowledge needs some work!), for which I am very grateful. Minikube can be used to run kubernetes networks locally, but it seems to consume far more resources than docker-compose, so I usually only use it in the final stages.
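Those per-service resource declarations live on each container in the kubernetes manifest; a minimal, illustrative fragment (the values are placeholders, not a recommendation):

```yaml
# requests are what the scheduler guarantees; limits are the hard ceiling
resources:
  requests:
    cpu: "250m"      # a quarter of a CPU core
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

Autoscaling rules (e.g. a HorizontalPodAutoscaler) are then defined against these figures, which is how the network grows and shrinks under load.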
The grandfather of tile servers. I hadn't used GeoServer much before this year, but I'm impressed with the very active community (I'm on the mailing list) and, once it's bolted onto docker, how easy it is to get started. I quickly learned it's very easy to misuse, and so have spent the last few months properly learning what it can and can't do, and its REST interface. I think it's a good starting point for geoscientists with little development experience, as everything can be fully controlled from the GUI, which forms a good base for beginning to manage it through REST.
I think these are the most significant technologies I’ve adopted in the last year, and would encourage any budding (or budded) geoweb developers to invest time into them. In the future, I’ll be writing about my job, my continuing research interest and my now significant commute (spoiler: it’s London -> Cambridge).