Critical Analysis of the Meshtastic Protocol

Meshtastic is a mesh protocol (peer-to-peer, network by proximity) based on LoRa technology. LoRa is not LoRaWAN, just as WiFi is not IP. It is therefore possible to use LoRa for networks without infrastructure.

Meshtastic was designed for communication outside of any public infrastructure, with a survivalist spirit of autonomous and (more or less) secure communication.

Due to its structure, it is difficult to estimate the size of such a community. The map seems to indicate more than 10,000 active nodes, but the actual figure appears to be around 40,000, with strong participation from the global ham radio community. In practice, the network is composed of clusters of nodes communicating locally with each other and expanding as clusters become visible to one another. Without linking infrastructure, it is not possible to connect one cluster to another, although some MQTT relay features exist.

The use and development of the network require very few resources, as simple DIY nodes based on widely available devkits, such as the T-Beam, are sufficient. The user interface is a mobile application that talks to the node over Bluetooth. The investment is just a few dozen euros. In a previous Meshtastic blog post, I detailed its implementation with small LoRa modules.

LoRa Mesh Use-Case

Mesh networks allow for the construction of networks with wide coverage (within the limits of what the technology can achieve point-to-point) without the overhead of infrastructure costs (at least reduced or distributed across the nodes). In a solution with an architecture like LoRaWAN, there are simple and autonomous nodes (costing a few euros) and data-collecting gateways, which are more expensive (a few hundred euros) and require energy and a network to transmit information to a network core.

There are two main families of mesh network use cases:

  • Where the cost of deploying infrastructure would be too predominant—this applies to large areas that require a lot of infrastructure for a low density of objects (for example, in agriculture, environmental data collection, extensive industrial settings, mines…) where there is generally difficulty or a high cost to install and operate infrastructure elements (lack of anchor points, energy, or backhaul network costs).
  • Where it is not possible to deploy infrastructure, either because the area is mobile or because infrastructure cannot be reliably or sustainably deployed. This is the case, for example, on a battlefield where the infrastructure (or rather the network) must follow the moving front lines and connect to a communication infrastructure that can be deployed on the rear lines. Another use case could be monitoring wild or farm animals moving in vast areas.

The use cases for mesh networks are vast. We see implementations in our homes with connected light bulbs and alarms, for example, relying on Zigbee. Bluetooth supports Bluetooth Mesh or Wirepas, which also offers solutions on DECT bands (Wirepas 5G).

Each mesh network improves the performance of the underlying technology, generally by extending its coverage while optimizing bandwidth and energy usage.

In the case of LoRa, a technology allowing communications on the order of 10 km outdoors under nominal conditions, and approximately 500 meters to 1 km indoors, using a mesh network can extend coverage to the size of a city or even a region with well-positioned density and relays.

Even though these technologies optimize energy and bandwidth, there will still be a significant impact on these two criteria for devices compared to an architecture with infrastructure.

Protocol Operation

The Meshtastic protocol is very (perhaps overly) basic. To put it simply, a message is transmitted by a node and is repeated by surrounding nodes to propagate it further. This leads to two consequences:

  • A lot of unnecessary retransmissions, which will saturate the frequency band or rather limit scalability.
  • Significant energy consumption due to almost continuous reception by each node, which limits the ability to operate on battery power.

To address this, the protocol proposes several solutions, which are interesting but, in my opinion, not entirely sufficient:

  • A node listens before transmitting (CSMA/CA, or rather CAD, Channel Activity Detection) to avoid collisions and prevent unnecessary transmissions.
  • A listening system that prioritizes distant nodes in retransmissions (though this could create blind spots despite coverage).
  • Long preambles to allow for partial sleep during reception.
  • Configuration profiles, some of which disable the relay function (but this deprives the network of possible coverage).

The protocol incorporates two important mechanisms to prevent message looping:

  1. A message that has already been transmitted is not retransmitted.
  2. A message is sent with a maximum number of hops and will not exceed this number of hops (7), which leads to a typical expected range of 70 to 150 km. (Of course, there are records for single-hop distances or more in the hundreds or thousands of kilometers, but this should not be considered a common situation.)
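The two loop-prevention rules above can be sketched as follows. This is a minimal illustration, not the Meshtastic firmware: the `Packet` and `FloodingNode` names are hypothetical, and I assume a node recognizes an already-handled message by its (source, packet ID) pair.

```python
from dataclasses import dataclass, replace
from typing import Optional

MAX_HOPS = 7  # maximum hop count cited in the text


@dataclass(frozen=True)
class Packet:
    source: int      # 32-bit sender ID
    packet_id: int   # 32-bit packet ID
    hop_limit: int   # hops remaining
    payload: bytes = b""


class FloodingNode:
    """Hypothetical node applying both loop-prevention rules."""

    def __init__(self) -> None:
        # Assumption: duplicates are recognized by (source, packet_id).
        self.seen = set()

    def on_receive(self, pkt: Packet) -> Optional[Packet]:
        """Return the packet to rebroadcast, or None to drop it."""
        key = (pkt.source, pkt.packet_id)
        if key in self.seen:        # rule 1: already handled, do not repeat
            return None
        self.seen.add(key)
        if pkt.hop_limit <= 0:      # rule 2: hop budget exhausted
            return None
        return replace(pkt, hop_limit=pkt.hop_limit - 1)
```

A real node would also have to bound the size of `seen`, since memory on these devices is limited.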

Frame Format

The Meshtastic full frame format

Messages are based on a LoRa frame with a network identification (sync word 0x2B) that allows filtering of non-Meshtastic traffic. Overall, the protocol uses long frames.

There are 4 communication identifiers:

  • Source – who sent the message – based on BLE EUI
  • Destination – to whom it has been sent – based on BLE EUI
  • Packet ID – randomly generated 32-bit number (I’m not sure how random it is across the same device family – I did not find a randomSeed using the device ID, for example)
  • Channel (similar to an IRC channel) – a single byte computed by XORing together all bytes of the channel name (a local string chosen by the user), XORing together all bytes of the pre-shared key, then XORing the two results: XOR( XOR(channel_name_bytes), XOR(pre-shared_key_bytes) )
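The channel byte described above can be sketched in a few lines. The function names are mine, and I assume the channel name is encoded as UTF-8 before folding.

```python
from functools import reduce


def xor_bytes(data: bytes) -> int:
    """Fold a byte string into a single byte by XORing all of its bytes."""
    return reduce(lambda acc, b: acc ^ b, data, 0)


def channel_hash(channel_name: str, psk: bytes) -> int:
    """One-byte channel identifier: XOR(name bytes) XOR XOR(key bytes)."""
    # Assumption: the name is encoded as UTF-8.
    return xor_bytes(channel_name.encode("utf-8")) ^ xor_bytes(psk)


# Illustrative values only: a made-up name and a 3-byte stand-in for a key.
name, key = "demo", bytes([0x01, 0x02, 0x04])
h = channel_hash(name, key)

# Anyone who knows (or guesses) the channel name can recover the XOR of all
# key bytes from the value sent in clear in the header.
leaked = h ^ xor_bytes(name.encode("utf-8"))
```

Here `leaked` equals `xor_bytes(key)`, which is one byte of information about the key that is visible to any listener.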

Three of them occupy 32 bits each and the last one 8 bits, for a total of 104 bits. I see some issues:

  1. The source and destination IDs are generated in a way that I consider random. The protocol indicates that 32 bits of the BLE address should be used, which theoretically come from an IEEE range, but we may face collisions between manufacturers since a MAC address is more than 32 bits. Furthermore, not all nodes necessarily have BLE.
  2. The channel is associated with an encryption key that allows reading the messages, and since the network does not offer a routing table, all communication is broadcast on a channel. Therefore, the concept of destination does not really matter, except for using 32 bits unnecessarily and increasing the risk of non-reception or collision.
  3. The way the channel byte is derived from the PSK publicly leaks some information about the PSK.

In fact, the notion of destination is only useful for acknowledgment requests, where the recipient is supposed to confirm reception. In my opinion, this would have been better handled by a higher-level protocol in the payload. Local acknowledgment within the mesh can simply be done by listening for its own message being rebroadcast, since the protocol broadcasts over the air. End-to-end acknowledgment may be managed by a broadcast response message. My opinion is that the acknowledgment present in the protocol belongs to a higher network layer because Meshtastic does not build dynamic routing tables like those found in Wirepas, for example. Therefore, the need is possibly for point-to-point acknowledgment, which could be done through simple listening. The services above Meshtastic that require acknowledgment could handle it at their own layer, which would save 4 bytes of destination data in each frame.

Additionally, Meshtastic does not include any means of message signing to prevent identity spoofing, message replay… making these fields seem rather useless.

The concept of the channel is more interesting, as it allows the identification of a decryption key, enabling communication that only those who know the PSK can read. The encryption used is up to AES-256 CTR (128-bit would have sufficed and could be shared with other stacks since it’s the standard for LoRaWAN). The size of the PSK in use determines whether AES-256 CTR or AES-128 CTR is used. The encryption uses, among other things, the packet ID as a nonce to generate the XOR mask for the CTR.
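To make the key-size selection and the role of the packet ID concrete, here is a small sketch. The key-length rule comes from the paragraph above; the nonce layout (64-bit packet ID, 32-bit source ID, 32-bit block counter, little-endian) is my assumption about what "among other things" covers and should be checked against the firmware before relying on it.

```python
import struct


def aes_variant(psk: bytes) -> str:
    """Pick the cipher from the key length, as described in the text."""
    if len(psk) == 16:
        return "AES-128-CTR"
    if len(psk) == 32:
        return "AES-256-CTR"
    raise ValueError("expected a 16- or 32-byte key")


def build_nonce(packet_id: int, source: int) -> bytes:
    """Assumed 16-byte initial counter block for CTR mode.

    Layout (an assumption, little-endian): packet ID on 64 bits,
    source node ID on 32 bits, block counter on 32 bits starting at 0.
    """
    return struct.pack("<QII", packet_id, source, 0)
```

Whatever the exact layout, the security of CTR mode depends on never reusing the same (key, nonce) pair, which is why the uniqueness of the packet ID matters beyond duplicate detection.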

Thus, you need to exchange a PSK with a third party to communicate. This could be an opportunity for a second-level protocol, though it doesn’t seem to exist yet. As a result, an infrastructure is necessary to exchange keys. Consequently, the network is mostly used in an unencrypted mode (or encrypted with publicly known keys) for its primary use case.

Nothing prevents usage collisions on these channels, as there is no reservation mechanism beyond one “public” channel used for metric collection. Furthermore, there is no signature or CRC to verify proper decryption of the payload, so validation will need to occur at the functional level to ensure that the decrypted data is correct. The payload is in protobuf (a fairly efficient format, but still too resource-intensive for this type of network), which offers some data validation but is, in my opinion, insufficient. Concretely, a channel is identified by a name and a PSK that reduce to a single-byte “hash”, so you can collide with yourself or with anyone else. When a collision occurs, the payload decryption will be wrong and the protobuf payload may be invalid; depending on your protobuf decoder implementation, this can lead to a crash.
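To give an order of magnitude for these collisions, here is a birthday-style estimate. It assumes channel hashes are uniformly distributed over the 256 possible values, which real channel names and keys will not satisfy exactly.

```python
def collision_probability(n_channels: int, buckets: int = 256) -> float:
    """Probability that at least two of n channels share the same hash byte.

    Assumption: each channel's hash is uniform and independent.
    """
    p_unique = 1.0
    for i in range(n_channels):
        p_unique *= max(0.0, (buckets - i) / buckets)
    return 1.0 - p_unique


def channels_for_probability(target: float, buckets: int = 256) -> int:
    """Smallest number of channels whose collision probability reaches target."""
    n = 1
    while collision_probability(n, buckets) < target:
        n += 1
    return n
```

Under this assumption, any two given channels collide with probability 1/256, and `channels_for_probability(0.5)` shows that only around twenty distinct channels in radio range are enough for a collision to be more likely than not.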

It seems to me that 32 bits for the packet ID is oversized. Ultimately, a lot of bits could have been allocated to signing mechanisms or simply saved. The packet ID combines a 10-bit counter, initialized at reboot and incremented on every message, with 22 bits randomly generated for every message. I assume this is to avoid two identical random numbers colliding within a short period. Source and packet ID may be used in the node algorithm to avoid retransmitting a message that has already been transmitted, which is why the pair needs to be unique. As I assume node memory is limited, an 8- to 12-bit incremental counter would have been enough. The packet ID is also used as a nonce for encryption; LoRaWAN and Sigfox use a counter of 12 to 16 bits for the same purpose.
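The packet ID construction described above could look like the following sketch. The class name is hypothetical, and placing the counter in the low 10 bits with the random part above it is my assumption; the text only gives the sizes of the two parts.

```python
import random
from typing import Optional


class PacketIdGenerator:
    """Hypothetical generator: 10-bit rolling counter plus 22 random bits."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()
        # Counter value chosen at boot, then incremented for every message.
        self._counter = self._rng.getrandbits(10)

    def next_id(self) -> int:
        self._counter = (self._counter + 1) & 0x3FF   # wrap at 10 bits
        random_part = self._rng.getrandbits(22)       # fresh for every message
        # Assumed placement: random bits high, counter in the low 10 bits.
        return (random_part << 10) | self._counter
```

With this layout, two IDs from the same node can only be equal if the counter has wrapped (1,024 messages later) and the 22 random bits happen to match.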

Header Flags

The flags in the header contain three important pieces of information:

  • The number of hops remaining, which determines whether a message should be retransmitted or dropped after a certain number of bounces, as well as the initially requested number of hops, which can be used to determine the distance to a node or to know how many hops are available to respond.
  • The second flag is an acknowledgment request, which has several implications I’ve already mentioned and, in my opinion, could have been designed differently.
  • Lastly, the third flag is interesting, as it indicates whether the packet originates from an MQTT source, meaning it comes from another mesh network bridged via MQTT. This field is mainly used to help the node decide whether or not to relay messages from distant mesh networks.
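Decoding these flags is a matter of masking one byte. The bit positions below are my assumption for illustration (3 bits of remaining hops, 1 bit for the acknowledgment request, 1 bit for MQTT, 3 bits for the initial hop count); only the meaning of the fields comes from the list above.

```python
from typing import NamedTuple

# Assumed bit layout of the flags byte (positions are illustrative).
HOP_LIMIT_MASK = 0x07    # bits 0-2: hops remaining
WANT_ACK_MASK = 0x08     # bit 3: acknowledgment requested
VIA_MQTT_MASK = 0x10     # bit 4: packet came through an MQTT bridge
HOP_START_MASK = 0xE0    # bits 5-7: hop count initially requested
HOP_START_SHIFT = 5


class HeaderFlags(NamedTuple):
    hop_limit: int
    want_ack: bool
    via_mqtt: bool
    hop_start: int


def decode_flags(flags: int) -> HeaderFlags:
    """Split the flags byte into its fields."""
    return HeaderFlags(
        hop_limit=flags & HOP_LIMIT_MASK,
        want_ack=bool(flags & WANT_ACK_MASK),
        via_mqtt=bool(flags & VIA_MQTT_MASK),
        hop_start=(flags & HOP_START_MASK) >> HOP_START_SHIFT,
    )


def hops_travelled(f: HeaderFlags) -> int:
    """Distance to the sender in hops: initial budget minus what is left."""
    return f.hop_start - f.hop_limit
```

For example, a packet sent with 3 hops that arrives with 2 remaining has travelled one hop, which also tells the receiver how many hops a reply will need.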

Radio layer

In Europe, Meshtastic uses the 869 MHz band, which offers a 10% duty cycle with an emission power of 27 dBm. This is a good choice for the intended use case, doomsday scenarios, where the goal is to cover as much distance as possible. In practice, few devices can emit at 27 dBm, and many will only use 20-22 dBm of power. It is also worth noting that devices like the T-Beam are quite far from having an antenna capable of efficiently radiating this power. In a way, this is good because the protocol is chatty, and widespread use could disrupt downlink connections on LoRaWAN and Sigfox. Indeed, the 869 MHz band is narrow and is usually used for downlink by these networks. With its 250 kHz bandwidth, Meshtastic uses the entirety of the 869 MHz band, and with the density of nodes in certain regions like the UK and its extremely chatty protocol, this could cause a problem for other networks.

In the USA, the device will use the 906 MHz frequency, which is one of the slots in the 902-928 MHz band. I have some doubts about whether it is permitted to emit without channel hopping under FCC regulations. In any case, this introduces a duty cycle, as the FCC defines two rules: a maximum of 400 ms of consecutive emission and a minimum of 20 s before reusing the same channel. Note: as the nodes use Channel Activity Detection (CAD), the rules may be different and I may be wrong, but in the end any transmission impacts the others and increases the risk of collision for yourself and everyone else.

Additionally, Meshtastic uses a bandwidth of 250 kHz, which is double what is typically used in LoRaWAN, with a coding rate of 4/5 (CR 4:5). The default presets for the USA and Europe are SF11, leading to a data rate of about 1 kbps and a link budget of 153 dB (for 22 dBm of emission). The chirps last 8.192 ms and carry 8.8 bits of useful data. If we look at the time split for a Meshtastic frame with these parameters:

  • Preamble + LoRa Header – 231.4ms
  • Meshtastic Headers + LoRa CRC – 134ms

This leaves only 35 ms of possible communication in this mode, which roughly represents 4 chirps and 37 useful bits, or the 4 bytes generally needed for the protobuf envelope. Thus, it doesn’t seem to me that this default profile is compliant with FCC rules, so be careful with your settings. Also, the device should not transmit on that channel during the next 20 seconds, which leads to a duty cycle of 2%.
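The figures in this section can be re-derived with a few lines of arithmetic. The formulas are the standard LoRa ones (symbol time = 2^SF / bandwidth, useful bits per symbol = SF × coding rate); the two airtime values are taken as given from the list above rather than recomputed.

```python
def symbol_time_ms(sf: int, bandwidth_hz: float) -> float:
    """Duration of one LoRa chirp (symbol) in milliseconds."""
    return (2 ** sf) / bandwidth_hz * 1000.0


def bits_per_symbol(sf: int, coding_rate: float) -> float:
    """Useful bits carried by one symbol after forward error correction."""
    return sf * coding_rate


SF, BANDWIDTH_HZ, CODING_RATE = 11, 250_000, 4 / 5

t_sym = symbol_time_ms(SF, BANDWIDTH_HZ)            # about 8.192 ms
bits_sym = bits_per_symbol(SF, CODING_RATE)         # about 8.8 bits
data_rate_bps = bits_sym / (t_sym / 1000.0)         # roughly 1 kbps

# FCC figures quoted in the text.
FCC_DWELL_MS = 400.0
FCC_CHANNEL_REUSE_MS = 20_000.0
PREAMBLE_AND_LORA_HEADER_MS = 231.4                 # from the list above
MESHTASTIC_HEADERS_AND_CRC_MS = 134.0               # from the list above

payload_budget_ms = (FCC_DWELL_MS - PREAMBLE_AND_LORA_HEADER_MS
                     - MESHTASTIC_HEADERS_AND_CRC_MS)   # about 35 ms
payload_bits = payload_budget_ms / t_sym * bits_sym     # about 37 bits
duty_cycle = FCC_DWELL_MS / FCC_CHANNEL_REUSE_MS        # 2 %

print(f"symbol {t_sym:.3f} ms, {bits_sym:.1f} bits/symbol, "
      f"{data_rate_bps:.0f} bps, payload budget {payload_budget_ms:.1f} ms "
      f"(~{payload_bits:.0f} bits), duty cycle {duty_cycle:.0%}")
```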

In Europe, if we consider the transmission of a 10-character string, it will require an emission time of 354 ms, which, with a 10% duty cycle, will block further transmissions for the next 3.54 seconds. Overall, this is acceptable in a low-density network, without hindering its operation.
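The 3.54 s figure is simply the airtime divided by the duty cycle. A small helper makes the relation explicit; I assume here a simple per-transmission reading of the duty-cycle limit, whereas regulators actually define it over a longer observation window.

```python
def min_period_s(airtime_s: float, duty_cycle: float) -> float:
    """Shortest interval between the starts of two transmissions."""
    return airtime_s / duty_cycle


def required_silence_s(airtime_s: float, duty_cycle: float) -> float:
    """Silent time after the end of a transmission before the next one."""
    return min_period_s(airtime_s, duty_cycle) - airtime_s


period = min_period_s(0.354, 0.10)          # about 3.54 s between starts
silence = required_silence_s(0.354, 0.10)   # about 3.19 s of radio silence
```

In other words, a node sending one such frame after another would be limited to roughly 17 short messages per minute, relays included.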

Routing

Before transmitting, a node will check that no transmission is in progress and then wait for a random time, which is related to the observed channel load. This method aims to reduce collisions and adapt to the load. It will work if a new CAD (Channel Activity Detection) is properly performed before actually transmitting, otherwise there is a risk of having difficulty transmitting as the load increases.

The CW (Contention Window) is defined by the worst-case transmission and processing time for the given spreading factor and bandwidth; by default it seems to be around 20 to 100 milliseconds. Depending on the channel load and device role, the node waits for a random number of CW slots between 0 and 141.

This is also adjusted based on the quality of the received signal (SNR), which is somewhat surprising because SNR is largely influenced by environmental noise, and the goal here seems to be to give an advantage to retransmission by distant nodes. Essentially, the idea is to lengthen the wait time for a node that is very close. These mechanisms are quite interesting: the random wait time can take up to 142 possible values, with the 14 lowest reserved for ROUTER/REPEATER and the others reserved for CLIENT and other profiles. The CAD (Channel Activity Detection) is performed before each transmission attempt, which drastically reduces the risk of collisions.

It is worth noting that ROUTER and REPEATER nodes will have priority in retransmitting messages due to their role, which is an interesting approach.
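The role-based and SNR-based delay can be sketched as below. The slot ranges (0 to 13 for ROUTER/REPEATER, 14 to 141 for other roles) follow the figures above; the SNR bounds, the minimum window exponent and the linear mapping are my assumptions for illustration, not values taken from the firmware.

```python
import random

CW_MIN = 2            # assumed smallest contention-window exponent
CW_MAX = 7            # 2 * 7 = 14 router slots, 2 ** 7 = 128 client slots
SNR_MIN_DB = -20.0    # assumed SNR mapped to CW_MIN (weak signal, far node)
SNR_MAX_DB = 15.0     # assumed SNR mapped to CW_MAX (strong signal, near node)


def cw_size_from_snr(snr_db: float) -> int:
    """Map the received SNR linearly onto a window exponent (assumed mapping)."""
    clamped = min(max(snr_db, SNR_MIN_DB), SNR_MAX_DB)
    frac = (clamped - SNR_MIN_DB) / (SNR_MAX_DB - SNR_MIN_DB)
    return round(CW_MIN + frac * (CW_MAX - CW_MIN))


def rebroadcast_slots(snr_db: float, is_router: bool, rng: random.Random) -> int:
    """Number of slots to wait before relaying a packet heard at snr_db."""
    cw = cw_size_from_snr(snr_db)
    if is_router:
        return rng.randrange(0, 2 * cw)               # 0..13 at most
    return 2 * CW_MAX + rng.randrange(0, 2 ** cw)     # 14..141 at most


def rebroadcast_delay_ms(snr_db: float, is_router: bool,
                         slot_time_ms: float, rng: random.Random) -> float:
    """Delay in milliseconds; slot_time_ms depends on the radio preset."""
    return rebroadcast_slots(snr_db, is_router, rng) * slot_time_ms
```

A node that heard the packet with a low SNR draws from a smaller window and therefore tends to relay earlier, while routers always draw from the slots below any client.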

When traffic on the network becomes heavy, a node may go through several waiting loops before transmitting, which can lead to significant delays and cause retransmissions with a long delay of a packet that has already been received.

The nodes are in continuous listening mode, and if they receive a message that they have pending for retransmission, they will not retransmit that message. This function is interesting because it prevents a message from being retransmitted locally many times if the density is somewhat high. By prioritizing retransmission by nodes with low SNR, we can hope that the first nodes to retransmit are farther away. But in theory, once again, RSSI would make more sense.
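The cancellation of a pending rebroadcast can be sketched with a small queue. The class and method names are hypothetical, and I again assume a message is identified by its (source, packet ID) pair.

```python
class RebroadcastQueue:
    """Hypothetical queue of packets waiting for their relay slot."""

    def __init__(self) -> None:
        # (source, packet_id) -> time at which the relay is due, in seconds
        self.pending = {}

    def schedule(self, source: int, packet_id: int, due_at: float) -> None:
        """Plan a rebroadcast after the random contention delay."""
        self.pending[(source, packet_id)] = due_at

    def on_overheard(self, source: int, packet_id: int) -> bool:
        """Another node relayed this packet first: cancel our own relay.

        Returns True if a pending rebroadcast was cancelled.
        """
        return self.pending.pop((source, packet_id), None) is not None

    def due(self, now: float) -> list:
        """Remove and return the packets whose delay has expired."""
        ready = sorted(k for k, t in self.pending.items() if t <= now)
        for key in ready:
            del self.pending[key]
        return ready
```

Combined with an SNR-weighted delay, the node with the shortest wait relays first and its neighbours, hearing that relay, silently drop their own copy.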

Moreover, this approach can create blind zones even though they are covered. Take, for example, a triangular arrangement (5 km per side) where all the nodes are in line of sight: one node will transmit, and only one of the other nodes will rebroadcast it, while the other covers a useful area in which the intended destination could be located. The network is thus easily attackable with a node that doesn’t play by the rules and performs an immediate retransmission while being positioned at a geographical extremity of the network, limiting the diffusion by other nodes in different directions.

Conclusion

I found the work done on Meshtastic remarkable, and it’s a fantastic subject of study and a great playground. However, it does not seem at all mature for large-scale use due to design problems, mainly at the protocol level. The protocol was not designed for LPWAN radios, which must always be very economical with airtime. It was not properly thought out in terms of security. Even though the usage services have been well developed, they are based on an unreliable structure. The routing protocol is fairly simple, which makes it interesting to study, but it also causes many issues at scale. The strength of Meshtastic lies in a very complete implementation across multiple targets and well-designed configuration tools that allow many hobbyists to ‘play’ with this technology. From an industrial perspective, I do not recommend using it.

4 thoughts on “Critical Analysis of the Meshtastic Protocol”

  1. While in general a good and complete article, there are some incorrect statements and misunderstandings.

    “In fact, the notion of destination is only useful for acknowledgment requests”
    Apart from acknowledgments, there are several packet types that ask for a response of a specific node. Besides, packets not broadcast and not destined to you are not forwarded to a connected client app. Also, in 2.5 (which is currently in a technical preview) a new public-key infrastructure is introduced.

    “Local acknowledgment within the mesh could simply be done by listening for its own message being rebroadcast, as the protocol does it broadcasting over the air. End to end acknowledgment can be managed by a broadcasted response message.”
    This is exactly how it is implemented. Broadcast messages only have this “implicit ACK”, while a direct message has an optional real acknowledgment which is relayed all the way back to the transmitter.

    “It seems to me that the use of 32 bits for packetID is overestimated.”
    The packetID is also used as a nonce for the encryption.

    “the random wait time between nodes is limited to 12 possible values.”
    No, for ROUTER and REPEATER that’s between 0 and 2*7 times the slottime, and for others it’s between 2*7 and 2*7 + 2^7 (=142).

    “Moreover, this approach can create blind zones even though they are covered.”
    Yes, but proper usage of ROUTER and REPEATER nodes circumvents this.

    • Thank you for helping improve the blog post. Sorry I was late to review the comments, but I fixed the main issues when you responded on Twitter (that’s why most of what you noticed is no longer relevant). I’m still convinced the packet ID is too large: the way it is used here is similar to other networks like LoRaWAN and Sigfox, and they both use less.

  2. Thank you for the excellent writeup!

    Just dropping in to say that Meshtastic is in very active development and version 3.0 – which will be an intentional breaking-changes version – is in the planning stages.

    Everyone involved knows the current stack is not up to supporting reliable large scale installations due to years old design choices, and right now is the perfect time to work on those upgrades.

    It would be great if you could join and share your advice on these possible improvements.

    The project is always looking for more development help! If you are interested in contributing to the community, Discord is the best place to get involved. Thanks!
