With the Solana migration, the Helium ETL has become obsolete: the Proof of Coverage (PoC) data is no longer on the chain, but it can be publicly accessed from an AWS bucket. We are going to see how to access this data and what I’m developing to manage it.
Discover my GitHub project to manage Helium Off-Chain PoC data
AWS bucket with detailed Helium Off-Chain PoC information
The AWS bucket from the Helium Foundation is a “requester pays” bucket. This means the Helium Foundation does not pay for your data requests; you do. While this is understandable, I would really have loved to see a crypto solution like Storj or StreamR used to provide this data … This is a point I’ll look at later. Anyway, to access this data, you need an AWS account.
- Go to your AWS account (create one if you don’t have one)
- Go to IAM, then select the Users menu
- Add a user, give it a name like “helium_reader”, then click Next
- Create a group with a name like “Helium_PoC_group”, then select the authorized actions
- Policy: AmazonS3ReadOnlyAccess
- Attach the created group to the user and finish the creation
- Now you can click on the user name in the user list and go to “Security credentials”, then find “Access keys” and create a new one
- Select “Application running outside AWS”, give it a name and validate
- Now you have a key pair, with an access key and a secret key.
- Securely store the secret key: it can’t be retrieved later (but you can create a new one)
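To check that the key pair works, here is a minimal Python sketch with boto3 (my choice, any S3 client works): it lists a few objects from the bucket described in the next section, with the requester-pays flag set.

```python
# Minimal access check with boto3 - the bucket name and region are the ones
# detailed in the next section, the key placeholders are yours to replace.
import boto3

s3 = boto3.client(
    "s3",
    region_name="us-west-2",
    aws_access_key_id="YOUR_ACCESS_KEY",      # access key of the helium_reader user
    aws_secret_access_key="YOUR_SECRET_KEY",  # the secret key you stored safely
)

resp = s3.list_objects_v2(
    Bucket="foundation-poc-data-requester-pays",
    MaxKeys=10,
    RequestPayer="requester",  # mandatory on this bucket: you pay for the request
)

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```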
AWS S3 access cost
To give you an idea of the cost of accessing the data: you pay for every LIST command (to get the file names), for every GET command to extract the data, and for the data traffic out of AWS. Prices vary, so the figures below are only indicative.
Considering that, per day, you have about:
| Type of Data | Files | Total Size |
|---|---|---|
| Beacon | 317 | 44MB |
| Witness | 1082 | 12GB |
| Reward | 1 | 30MB |
| Validated IoT PoC | 1500-2000 | 48GB |
So we can estimate an S3 access cost:
| Command | Per 1000 Commands or per GB | Per Day | Per Month |
|---|---|---|---|
| LIST | $0.005 | $0.0001 | $0.003 |
| GET | $0.0004 | $0.0008 | $0.025 |
| DATA TRANSFER | $0.01 | $0.48 | $15 |
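To make these figures concrete, here is the back-of-envelope calculation in Python. I’m assuming you only fetch the Validated IoT PoC files (about 2000 files and 48GB per day) and run around 20 LIST calls per day; take it as an order of magnitude, not an invoice.

```python
# Rough cost estimation using the unit prices from the table above
LIST_PER_1000 = 0.005    # $ per 1000 LIST requests
GET_PER_1000 = 0.0004    # $ per 1000 GET requests
TRANSFER_PER_GB = 0.01   # $ per GB moved out of AWS

files_per_day = 2000     # Validated IoT PoC files only (assumption)
list_calls_per_day = 20  # a few polling rounds per day (assumption)
gb_per_day = 48

list_cost = list_calls_per_day / 1000 * LIST_PER_1000  # ~ $0.0001 / day
get_cost = files_per_day / 1000 * GET_PER_1000         # ~ $0.0008 / day
transfer_cost = gb_per_day * TRANSFER_PER_GB           # ~ $0.48 / day

daily = list_cost + get_cost + transfer_cost
print(f"per day:   ${daily:.2f}")
print(f"per month: ${daily * 31:.2f}")  # around $15, dominated by data transfer
```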
Helium Foundation Buckets
The Helium Foundation S3 bucket containing the data is foundation-poc-data-requester-pays, located in the us-west-2 AWS region. It contains different types of files, as described in the Helium Oracle documentation page. When listing / processing these files, you should know the following:
- Files come by family: you get all the beacon files, then all the witness files, so if you want to get the updates you need to keep a pointer on the last processed file of each category and not jump to the next one, or you will miss the new files (see the sketch after this list)
- There are between 300 and 2000 files a day per category
- The witness files are about 20 times bigger than the beacon files
- Some of the files are corrupted, some have zero length… you need to be prepared to handle any exception
- The link between a beacon and its witnesses is basically made based on timeframe and data content. In my experience, most witnesses do not match any beacon.
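As an illustration of the “pointer per family” point above, here is a sketch of how the listing can be resumed with boto3. The iot_poc prefix and the key format in the example are assumptions on my side; check the Oracle documentation for the exact naming of each family.

```python
# Resume the listing of one file family after the last key already processed.
# Prefix and example key are illustrative; adapt them to the real file naming.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
BUCKET = "foundation-poc-data-requester-pays"

def new_files(prefix: str, last_processed_key: str):
    """Yield the keys of a family published after the last one we processed."""
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=BUCKET,
        Prefix=prefix,
        StartAfter=last_processed_key,  # one pointer kept per file family
        RequestPayer="requester",
    )
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Example: only the validated PoC files newer than the stored pointer
for key in new_files("iot_poc", last_processed_key="iot_poc.1680307200000.gz"):
    print("to process:", key)
```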
It’s really better to use the Validated IoT PoC files than the raw files: they contain beacons and witnesses reassembled together, which really simplifies the data processing. As these files come from different Oracles, you get plenty of files within the same timeframe.
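Here is a sketch of what reading one of these Validated IoT PoC files can look like. The framing (a gzip stream of length-prefixed protobuf records, 4-byte big-endian length followed by the message bytes) is my understanding of the Oracle file format, to be double checked against the documentation; decoding the records themselves requires the message definitions from helium-proto, which I leave out here.

```python
# Download one Validated IoT PoC file and split it into raw protobuf records.
# The key below is illustrative: pick a real one from your listing.
import gzip
import io
import struct
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

obj = s3.get_object(
    Bucket="foundation-poc-data-requester-pays",
    Key="iot_poc.1680307200000.gz",  # illustrative key
    RequestPayer="requester",
)

data = io.BytesIO(gzip.decompress(obj["Body"].read()))

records = []
while True:
    header = data.read(4)
    if len(header) < 4:
        break
    (size,) = struct.unpack(">I", header)      # assumed 4-byte big-endian length prefix
    records.append(data.read(size))            # protobuf bytes of one validated PoC

print(f"{len(records)} validated PoC records in this file")
```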
Rewards are published only once a day, at 1 AM UTC, and you only get a sum of rewards for the previous day for witnesses / beacons / data transfer, in $IoT.
Processing challenge
If you target processing one hour of data in about 15 minutes, to be able to resync your data after an ETL interruption, you need to process about:
- 100,000 selected witnesses / minute
- 10,000 beacons / minute
In terms of volume, it’s about 200GB of data per week to store.
New ETL to extract and load the Off-Chain PoC data
I made an open-source project to load this data into a scalable MongoDB database; you can find my Helium Off-Chain ETL on GitHub.
Installation is quite easy: everything is packaged in a docker compose file with a Makefile to build and run it.
As it uses a lot of caching and parallelism to improve performance, you need a server with 3 SSDs, one per shard; an NVMe system drive is recommended for the rest. A minimum of 64GB of memory and 24 CPUs is a good recommendation.
If you have benchmarked your own setup, let me know, I’ll be happy to share the results here:
| Setup | CPU | MEM | DB Storage | SYS Storage | Max PoC / min |
|---|---|---|---|---|---|
| Disk91 | E5-2690v1 x8/16 @3GHz (2012) | 64GB DDR3 | 3x SSD 4TB | NVMe 2TB | 5000 |
This kind of server, based on Amazon public prices, costs about $1400 / month. If you need to access this data or to run your own ETL … just contact me, I can do something for you for a third of this.
The solution comes with a Grafana dashboard to monitor the processing in real time; this is really useful to follow the progress.
Functional Behavior
The ETL loads all the witness, beacon and reward data into corresponding collections. In parallel, it updates a hotspot entity with all the consolidated information. This information can be retrieved through a REST API.
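Just to illustrate what this consolidation means, here is a purely hypothetical pymongo query; the actual database, collection and field names are those defined by the ETL project, and the REST API is the intended way to read the data.

```python
# Hypothetical direct read of a consolidated hotspot document in MongoDB;
# database, collection and field names are placeholders, not the project's.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router of the cluster
db = client["helium"]                              # placeholder database name

hotspot = db["hotspots"].find_one({"hotspotId": "<hotspot-key>"})  # placeholder field
print(hotspot)
```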