When you write IoT firmware, there are a number of things you must never forget to think about… The following 10 things you can’t ignore come from my experience creating smart objects and operating them in the field.
Unfortunately, the field is where you really improve your firmware and discover everything you forgot while writing it and testing it in your laboratory. In the laboratory, everything is perfect.
The following 10 things you can’t ignore when writing firmware form a non-exhaustive checklist of points to verify before pushing your code to the field. It is also a list of test conditions you can run to validate a firmware or device made by a third party.
1. The battery will be low
Even if this sounds obvious, most developers think of a battery as a binary component with a high level (power on) and a low level (power down). Unfortunately, a battery has many levels between these two, and the circuit’s behavior becomes unpredictable once a certain low limit has been reached.
One side effect is a device rebooting hundreds or thousands of times. If the device communicates right after each reset, it can fire a large number of messages over the air before the batteries are completely flat. This totally disregards the duty cycle and the regulations, and it floods the backend. A good practice is to avoid any communication after a reset until the duty-cycle time has elapsed. Another good practice is to detect a low-battery situation and switch to specific code that manages the battery end of life.
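The two good practices above can be combined in a single gate that every uplink goes through. This is only a minimal sketch: the names `tx_gate`, `TX_HOLDOFF_AFTER_RESET_S` and `BATTERY_LOW_MV`, as well as the threshold values, are assumptions for illustration and must be adapted to your radio, your region and your battery.

```c
#include <stdint.h>

/* Hypothetical values: tune them for your radio, region and battery. */
#define TX_HOLDOFF_AFTER_RESET_S  600u   /* no uplink during the first 10 minutes */
#define BATTERY_LOW_MV            2400u  /* below this, switch to end-of-life code */

typedef enum { TX_ALLOWED, TX_HOLDOFF, TX_BATTERY_EOL } tx_decision_t;

/* Decide whether an uplink may be sent.
 * uptime_s:   seconds elapsed since the last reset
 * battery_mv: battery voltage in millivolts */
tx_decision_t tx_gate(uint32_t uptime_s, uint16_t battery_mv)
{
    if (battery_mv < BATTERY_LOW_MV) {
        return TX_BATTERY_EOL;  /* hand over to the end-of-life code path */
    }
    if (uptime_s < TX_HOLDOFF_AFTER_RESET_S) {
        return TX_HOLDOFF;      /* a reboot loop can no longer flood the network */
    }
    return TX_ALLOWED;
}
```

With such a gate, a device stuck in a reboot loop stays silent, because it never reaches the hold-off time before the next reset.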
Another consequence is the change in the MCU VDD voltage. When the battery voltage is higher than the MCU voltage, you use a DC-DC converter to reduce it to the MCU target (e.g. 3.3 V). Once the battery drops below this voltage, the MCU supply also falls below the expected level, so the voltage reference decreases. As a consequence, ADC values measured against VDD will be wrong. The MCU temperature reading will also be wrong in many cases… This can lead the firmware to make wrong decisions and report wrong information. Sensors may also report invalid data in such a situation.
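One common way to limit this problem, assuming your MCU has an internal voltage reference that the ADC can sample, is to estimate the real VDD from that reference and scale every other reading with it. The sketch below assumes a 12-bit ADC and a nominal 1.2 V reference; both constants and the function names are illustrative and must be taken from your datasheet.

```c
#include <stdint.h>

#define ADC_FULL_SCALE    4095u  /* assumption: 12-bit ADC */
#define VREF_INTERNAL_MV  1200u  /* assumption: nominal internal reference */

/* Estimate VDD in mV from the raw ADC reading of the internal reference.
 * The reference is constant, so its raw value rises when VDD falls.
 * Returns 0 for an invalid (zero) reading. */
uint32_t vdd_from_vref_mv(uint16_t raw_vref)
{
    if (raw_vref == 0u) {
        return 0u;
    }
    return ((uint32_t)VREF_INTERNAL_MV * (uint32_t)ADC_FULL_SCALE) / raw_vref;
}

/* Convert a raw reading to mV using the estimated VDD, not a fixed 3300. */
uint32_t adc_to_mv(uint16_t raw, uint32_t vdd_mv)
{
    return ((uint32_t)raw * vdd_mv) / ADC_FULL_SCALE;
}
```

The estimated VDD is also a cheap battery indicator once the supply is no longer regulated.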
2. Transmitted messages you will lose
Whatever the technology used, some communications will be lost. This is particularly true with LPWAN technologies, which use a unidirectional protocol for most exchanges; the loss rate grows when the device is moving.
As a consequence, you need to report your data more frequently than you expect to receive it. For example, if I want to receive a minimum of 10 messages per day, I need to transmit 15, 20 or 30 messages per day depending on the technology and usage conditions. This has a direct impact on battery cost and size.
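The arithmetic behind these numbers can be sketched as follows, assuming you have an estimate of the loss rate for your technology and usage conditions. The function name `uplinks_needed` is illustrative, and the result is an average: it does not guarantee the minimum on any given day.

```c
#include <stdint.h>

/* Number of uplinks to send so that, on average, `target_rx` of them are
 * received, given an estimated loss rate in percent (0..99).
 * Returns 0 when the loss rate is 100% or more (no value can be computed). */
uint32_t uplinks_needed(uint32_t target_rx, uint32_t loss_pct)
{
    if (loss_pct >= 100u) {
        return 0u;
    }
    uint32_t success_pct = 100u - loss_pct;
    /* Ceiling division: round up to a whole number of messages. */
    return (target_rx * 100u + success_pct - 1u) / success_pct;
}
```

For 10 received messages per day, a loss rate of about 33% leads to 15 transmissions and a loss rate of 50% leads to 20.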
As a consequence, for any important communication, I need to consider an acknowledgement or find a way to improve the chance of delivery (waiting for a low speed or a stop, reducing the frame size, sending the message twice… there are many options).
As a consequence, in bidirectional communications you have no way to be sure that a communication has failed. With several frame exchanges you can confirm that it succeeded, but when the exchange appears to fail (for example, a missing acknowledgement), you can never be sure the message was not received. It means your code needs to consider both outcomes until one of them has been verified.
3. In the field, bugs you will have
For sure, once in the field you will have various bugs. You should consider the ability to fix devices in the field as a last-chance option. That said, you need to consider a way for the end user to update the device firmware. JTAG or serial are not really an option if you can’t provide this kind of costly interface. BLE and USB sound like the only viable solutions. Updating over the network is not realistic on LPWAN networks.
As a consequence, you need to envision options other than firmware updates. I won’t tell you that you need to test: that is just the bare minimum. The other option to consider is the ability to remotely control the behaviour of your device so that it avoids the buggy branch. As this buggy branch can’t be identified in advance, the more options and bypasses you implement, the better your chance of finding a workaround. The ability to remotely reset a device and restore its initial configuration is therefore the minimum viable operation you should have.
This also requires you to implement a function that reports the firmware version and the precise build used (because we all make a lot of changes to the build between two committed versions during the pilot phase; we all do that, and an 8-bit version ID is never enough). The firmware must support a remote request for a full configuration report, and must be able to change its configuration in a single step after receiving several sequential configuration requests for the different parameters. In other words, since you will have a lot of parameters, you will need many messages to set them, but the new settings must be committed only at the end of the exchange.
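The "set many parameters, commit once" behaviour can be sketched with a staged copy of the configuration. Everything here is an assumption made for illustration: the `config_t` layout, the parameter count and the idea of dedicated "commit" and "abort" downlinks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CFG_PARAM_COUNT 8u  /* hypothetical number of parameters */

typedef struct {
    uint16_t active[CFG_PARAM_COUNT];  /* configuration currently in use */
    uint16_t staged[CFG_PARAM_COUNT];  /* copy being edited by downlinks */
    bool     dirty;                    /* true while an update is pending */
} config_t;

void cfg_init(config_t *c) { memset(c, 0, sizeof *c); }

/* Downlink "set parameter": only the staged copy is modified. */
bool cfg_stage(config_t *c, uint8_t id, uint16_t value)
{
    if (id >= CFG_PARAM_COUNT) {
        return false;
    }
    if (!c->dirty) {  /* first change: start from the active values */
        memcpy(c->staged, c->active, sizeof c->staged);
        c->dirty = true;
    }
    c->staged[id] = value;
    return true;
}

/* Downlink "commit": all staged values become active in a single step. */
void cfg_commit(config_t *c)
{
    if (c->dirty) {
        memcpy(c->active, c->staged, sizeof c->active);
        c->dirty = false;
    }
}

/* Downlink "abort" (or exchange timeout): drop the pending changes. */
void cfg_abort(config_t *c) { c->dirty = false; }
```

If a downlink of the sequence is lost, the device keeps running on a consistent configuration instead of a half-applied one.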
As a consequence, you need different ways to reboot your device: remotely, when the device is still working but incorrectly; locally, by asking an end user to perform a special manipulation (with no specific tool); and automatically, with a working watchdog and detection of functional infinite loops.
Do not forget that any possible infinite loop will eventually happen! For example, if you send a command to a sensor and actively wait for the result, you can be sure that at some point (usually in the field, for a big client, far away from any civilisation…) the sensor will never respond, leaving your firmware in an infinite waiting loop. Watchdogs are like the Java garbage collector: you may think they will solve all these problems, but that is totally wrong if you have a functional infinite loop in which the watchdog is rearmed. Such loops need to be identified and protected by a maximum iteration count, after which a software reset is fired.
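A bounded wait can be sketched like this. The helper `wait_ready_bounded` and the stand-in sensor `fake_sensor_ready` are hypothetical names; on a real device the callback would read the sensor status, and the caller would fire a software reset (or report a fault) when the function returns false.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*ready_fn_t)(void *ctx);

/* Poll `is_ready` at most `max_polls` times.
 * Returns true as soon as the sensor is ready, false when giving up. */
bool wait_ready_bounded(ready_fn_t is_ready, void *ctx, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (is_ready(ctx)) {
            return true;
        }
        /* The watchdog may be rearmed here: the loop bound, not the
         * watchdog, is what guarantees that this wait terminates. */
    }
    return false;
}

/* Stand-in for a real sensor status read: ready after *ctx polls. */
bool fake_sensor_ready(void *ctx)
{
    uint32_t *remaining = ctx;
    if (*remaining == 0u) {
        return true;
    }
    (*remaining)--;
    return false;
}
```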
4. Bugs you’ll never be able to reproduce will be your nightmare
Once in the field you will have bugs, and once the device is back on your desk and reset, you will never see the bug again until the device returns to the field.
As a consequence, you need as much traceability as possible about what is happening on the device. The more indicators you have, the better. Of course an IoT device does not have much memory or EEPROM available, but you can store information in an optimized way, for example using a single bit to store a status or a condition over time. A rolling storage you can download on request over the network is also a way to cope with limited memory. Counters are also important for understanding what is happening in the field, with the ability to download them on a regular basis for debugging and behaviour analysis.
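A rolling storage plus counters fits in a few dozen bytes. The sketch below is one possible layout, not a reference design: the event codes, the buffer size and the `trace_t` structure are all hypothetical, and the structure is assumed to be zero-initialised at boot (or restored from EEPROM).

```c
#include <stdint.h>

#define LOG_SIZE 16u  /* hypothetical: number of events kept */

/* Hypothetical event codes, one byte each. */
enum { EVT_RESET = 0, EVT_TX_FAIL, EVT_GPS_TIMEOUT, EVT_COUNT };

typedef struct {
    uint8_t  ring[LOG_SIZE];       /* last LOG_SIZE events, oldest overwritten */
    uint16_t head;                 /* next write position in the ring */
    uint16_t counters[EVT_COUNT];  /* saturating count of each event */
} trace_t;

/* Record one event: store it in the rolling log and count it. */
void trace_event(trace_t *t, uint8_t evt)
{
    if (evt >= EVT_COUNT) {
        return;  /* unknown code: ignore rather than corrupt the counters */
    }
    t->ring[t->head] = evt;
    t->head = (uint16_t)((t->head + 1u) % LOG_SIZE);
    if (t->counters[evt] < UINT16_MAX) {
        t->counters[evt]++;
    }
}
```

The counters can then be sent in a periodic status frame, and the ring only when the backend asks for it.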
For example, this was really useful for me to understand field conditions and optimise my GPS positioning algorithm. Coverage conditions can have a huge impact on power consumption when the algorithm does not take them into consideration. Regularly reporting the reception conditions, time to fix, satellites in view and other indicators was a great help. This information can be reported automatically during the pilot phase and enabled on request in production firmware.
When more memory is available, saving a trace to flash lets you detail the firmware behaviour in the field, even weeks later, when correctly anticipated.
A basic serial dump of the current state and configuration is also a good practice. It can be used for in-field analysis, and it can be plugged into an external logger to make up for the memory missing in the device.
5. An unexpected behaviour you will have
The device will sometimes behave unexpectedly; this can be caused by unexpected device usage or situations, or by a bug. The firmware has to be robust enough against such situations.
It means you must be able to detect such a situation and then switch back to a normal state automatically (with a reset, for example). You also need to be able to analyse why it happened. We have already covered traceability above, so here we can explore other solutions.
Many things can be imagined, like a reset programmed on a regular basis, running in parallel with the normal functional process. This can be considered a kind of functional watchdog that you can hold off remotely with a downlink communication => if no communication occurred in the last 24h, the device resets.
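Such a functional watchdog needs very little code. In this sketch the names `link_wdg_t`, `link_wdg_kick` and `link_wdg_expired` are hypothetical, and the time base is assumed to be a 32-bit seconds counter that may wrap around.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINK_TIMEOUT_S (24UL * 3600UL)  /* 24h without downlink => reset */

typedef struct {
    uint32_t last_downlink_s;  /* time of the last downlink, in seconds */
} link_wdg_t;

/* Call this whenever a valid downlink has been received. */
void link_wdg_kick(link_wdg_t *w, uint32_t now_s)
{
    w->last_downlink_s = now_s;
}

/* Call this periodically: true means the device should reset itself.
 * The unsigned subtraction stays correct when the counter wraps. */
bool link_wdg_expired(const link_wdg_t *w, uint32_t now_s)
{
    return (uint32_t)(now_s - w->last_downlink_s) >= LINK_TIMEOUT_S;
}
```

Keep in mind that each downlink has an energy cost, so the timeout is a trade-off between recovery time and battery life.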
Reporting invalid data does not make sense and requires complex correction in the backend. For example, if you measure -254°C, you can consider that the measurement is garbage and that your system has a problem. Instead of reporting this value, the device can send a message with a fault indication and request an action such as stopping for a while, resetting, or continuing normally…
The best way to start managing unexpected behaviour is to think about it: instead of designing the firmware only for the happy path, you need to consider all the other paths and design for failure. Imagine a system measuring temperature every 5 minutes and reporting an alert when it is lower than 0°C. You could simply send a message whenever this condition is met (if T < 0°C then report alarm). This implementation would be really bad: during winter the trigger condition will last for a long time and you will flood the system with messages, which will drain the battery and spam your backend. So you need to invest a bit in your algorithm to correctly manage situations other than the happy path. This example is deliberately simple, but real cases are far more complex, so during your design you really need to ask: what are all the situations I have forgotten to take into account?
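One possible way to fix the temperature example is to report state changes instead of states, with a little hysteresis so that a temperature hovering around 0°C does not toggle the alarm at every measurement. The 1°C clearing threshold and all the names below are assumptions chosen for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Temperatures are in tenths of a degree Celsius. */
#define ALARM_ON_BELOW_DC   0   /* raise the alarm below 0.0°C */
#define ALARM_OFF_ABOVE_DC  10  /* hypothetical: clear it above 1.0°C */

typedef enum { REPORT_NONE, REPORT_ALARM_RAISED, REPORT_ALARM_CLEARED } report_t;

typedef struct {
    bool in_alarm;  /* current alarm state, false at start */
} temp_alarm_t;

/* Call after each measurement: a message is due only on a state change. */
report_t temp_alarm_update(temp_alarm_t *a, int16_t temp_dc)
{
    if (!a->in_alarm && temp_dc < ALARM_ON_BELOW_DC) {
        a->in_alarm = true;
        return REPORT_ALARM_RAISED;
    }
    if (a->in_alarm && temp_dc > ALARM_OFF_ABOVE_DC) {
        a->in_alarm = false;
        return REPORT_ALARM_CLEARED;
    }
    return REPORT_NONE;
}
```

A whole winter below 0°C then costs one alarm message instead of one message every 5 minutes. Since that single message can be lost, the alarm state is also worth repeating in the periodic status frame.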
6. An unexpected device usage you will discover
The device will be used in conditions you never predicted: an indoor device will be exposed to rain, high temperature, low temperature. The device will take shocks. The button will be held down permanently because of the way the device is attached. It will be subject to strong vibrations…
As a consequence, you can try to test all these conditions in a laboratory at a really high cost, and you will still discover situations you did not plan for. So you need to define the normal usage conditions and think about how a change in these conditions affects your firmware, because the firmware needs to identify them.
Basically, recording the temperature, the shocks, the humidity, the permanently pushed button, the bad reception conditions, the unplanned resets… provides valuable information for detecting unexpected usage and acting accordingly. This can prevent reporting invalid values, reduce the risk of battery destruction, anticipate a future customer complaint and save long analysis time when a defect is reported and you have no contextual elements.
This is really important during a pilot phase, when you do not know exactly the diversity of usage situations and conditions. Even if the cost of such sensors can impact the business model, you can solder them only for the pilot and later, in the production phase, produce devices without them.
7. Your battery consumption higher than expected will be
The battery life is never the one you predicted, for many reasons: temperature, age, current drawn during spikes you missed; because your firmware execution time in the field is not the one you planned, because of coverage, because of missed downlinks that consume more than expected… because you did not consider the right low-battery level. The list of reasons is really long.
As a consequence, you need to oversize your battery, at least during your pilot phase, and regularly measure the battery level together with the temperature. On a low-power device, the battery level can vary a lot between sleep and transmission. The right time to measure it is at the end of a period of maximum power demand.
Another consequence is that the autonomy in the field could be lower than expected. As this can have an impact on your client’s TCO or organisation, it is important to be able to change the device behaviour remotely in order to reduce power consumption. This can be, for example, a reduced number of communications to save energy. The idea is to always be able to provide a lower level of service rather than no service at all.
You can also have a specific end-of-battery process that lets the product keep running with reduced features when the battery reaches its last 10%. This can help to gain, for instance, 30% more lifetime from the last 10% of battery.
8. Your client, a new feature, will always request
Your client is “Mister Plus”: a new feature or a new sensor will be the reason why he decides to buy your product or not. Even if in many situations this is not justified, or a wrong reason for not buying, you will prefer to say “yes we can… for that price” rather than “no we can’t without breaking everything we made”.
Depending on the way the communication protocol is designed and the way the firmware is implemented, you will or will not be able to add a feature easily. Once again, most requests can be solved by changing some settings, if these settings exist in the firmware code.
Basically, we should never see a literal constant in the code, only defines that you can simply change; the best is when all these settings can be changed remotely over the network.
On top of this, having predefined points where you can hook actions into your firmware will speed up future customisation. Using a software state machine also makes evolution simpler than with classical sequential code.
Using an SDK to build the firmware also helps. The more you separate your functional code from the MCU technical layer, the better your technical visibility will be and the easier modifications will become.
9. Your MCU choice you will challenge sooner than planned
When you create a device, you spend a lot of time selecting its heart: the MCU. You need to invest a lot in learning about it, making tools and creating the usual bricks for logging, low-power operation… You think you are doing it for a decade, but in the IoT area the truth is that the silicon ecosystem is currently moving really fast, and the best solution, the best platform, will move from one silicon vendor to another. In a market where you need to save 1€ to have a competitive product, this choice is critical.
You need to anticipate this: rewriting the low-level part of your firmware has a cost, but this cost is low compared to all the energy you spend testing your functional (high-level) code. As a consequence, the best approach is to be able to easily port your high-level code to any platform you may have to use later. It means you need to create your own SDK or find an MCU-independent SDK. At the very least, you need to create your own wrapper to ensure the functional code has no reference to a vendor-dependent library.
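Such a wrapper can be as small as a table of function pointers that each MCU port fills in. The interface below (`board_api_t`, its two functions and the 2-byte frame) is purely hypothetical; the "fake" port shown with it is a host-side implementation that lets the functional code be tested on a PC.

```c
#include <stdint.h>

/* Hypothetical board interface: the only thing functional code may use. */
typedef struct {
    int32_t (*read_temperature_dc)(void);                    /* tenths of °C */
    int     (*radio_send)(const uint8_t *buf, uint8_t len);  /* 0 on success */
} board_api_t;

/* Functional code: no vendor header, no register access. */
int report_temperature(const board_api_t *board)
{
    int32_t t = board->read_temperature_dc();
    /* Big-endian signed 16-bit value. */
    uint8_t frame[2] = { (uint8_t)((uint16_t)t >> 8), (uint8_t)t };
    return board->radio_send(frame, sizeof frame);
}

/* Example port: a host-side fake. A real port would call the vendor
 * library here, and only here. */
static uint8_t fake_last_frame[2];

static int32_t fake_read_temperature(void) { return -15; }  /* -1.5°C */

static int fake_radio_send(const uint8_t *buf, uint8_t len)
{
    if (len != 2u) {
        return -1;
    }
    fake_last_frame[0] = buf[0];
    fake_last_frame[1] = buf[1];
    return 0;
}

const board_api_t fake_board = { fake_read_temperature, fake_radio_send };
```

Moving to another MCU then means writing a new `board_api_t` instance, while `report_temperature` and its tests stay untouched.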
This will make you agile enough to adapt your solution over the years, providing firmware that improves over time while running on the most efficient MCUs.
10. Your communication protocol will evolve
The last tip is about the communication protocol you are using. I’ve seen many devices that do not implement a protocol but just report a frame of a single type. This is an easy way to evaluate the work put into the firmware: with no communication protocol, you can be quite sure that none of the 9 previous things has been taken into account by the solution provider.
As we have seen, the communication needs multiple message types for sensor reporting, error reporting, status reporting, configuration reporting, reset indications… and the downlink also requires managing complex behaviour. The sum of these exchanges makes up a communication protocol.
This communication protocol will evolve later, and its evolution must not break the backend decoding, as that is a mess to manage. So the protocol is a durable API that your object exposes. You need to think about this and reserve room in your different frames for future evolution.
Usually a header helps to identify the type of frame; the frame size can also be a key. LoRaWAN also uses a channel (it really looks like a header with fewer options). The LPWAN frame size is limited, and the way you manage the header size is important to save space and energy. That said, your protocol description needs to leave room for adding new frame types later, of any size. You also need to think about adding a new frame to replace a previous one that has become obsolete, while limiting the impact on your backend. All these future situations need to be evaluated and anticipated.
To manage your communication protocol correctly, you also need to version it and report this version so that your backend can handle the versioning.
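As an illustration, a single header byte can carry both the protocol version and the frame type while leaving room for future frames. The bit layout (3 bits of version, 5 bits of type) and the frame type values are assumptions made for this sketch, not a standard.

```c
#include <stdint.h>

/* Hypothetical 1-byte header:
 *   bits 7..5 = protocol version (0..7)
 *   bits 4..0 = frame type (0..31) */
enum {
    FRAME_SENSOR = 0,
    FRAME_STATUS = 1,
    FRAME_CONFIG = 2,
    FRAME_RESET  = 3
    /* 4..31 are reserved for future frame types */
};

/* Build the header byte; out-of-range bits are masked off. */
uint8_t header_pack(uint8_t version, uint8_t type)
{
    return (uint8_t)(((version & 0x07u) << 5) | (type & 0x1Fu));
}

/* Split a received header byte into version and frame type. */
void header_unpack(uint8_t header, uint8_t *version, uint8_t *type)
{
    *version = (uint8_t)(header >> 5);
    *type    = (uint8_t)(header & 0x1Fu);
}
```

A backend that decodes the version first can keep the old decoder for devices already in the field and add a new one for the next protocol revision.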
Hello Paul,
The points you explain are very good and valuable to share.
I take it English is not your mother tongue, nor is it mine.
The number of English mistakes in the article makes it less readable.
That is a shame. Let somebody help you to make the article even better.
Regards
Thank you for your feedback. My mother tongues are C, Java, ASM and possibly Bash. I wish you could write French as well as I write English 😉
Just to clarify things for readers: I spend time sharing my experience, and doing it for free requires a large effort. So I ask my readers for a little effort => cope with a few language bugs.
So feel free to make a bigger effort and propose some text fixes; I’ll be happy to merge them to improve readability.
Here you go gist.github.com/tamberg/…
(And thanks for the great article.)
Thank you so much for your proposed text corrections. I’ll soon merge them.