Helium Blockchain ETL is basically a copy of the Helium blockchain into a database. That way you can query the Helium blockchain history to build apps like monitoring, tax management…
The Helium blockchain API project connects to the Helium ETL to provide easy access to this data.
I’m currently building a tool to monitor and alarm a fleet of hotspots (stay tuned) and it currently uses the Helium API (api.helium.io). But the Helium APIs are a bit overloaded and not made to support high-rate querying from every application each member of the community is developing. So I decided to run my own instance of the ETL and API for my own purpose. This is the experience I’ll describe in this blog post:
Hardware requirements
The blockchain-etl stores the entire Helium history in a PostgreSQL database… and this has become a huge amount of data: over 1.5TB today, growing fast. During the installation process you need to load the data from a snapshot dump (or you may wait a couple of months to get in sync), so at least twice the size of the DB is needed.
You also need to store the blockchain in a node, and this is currently around 0.5TB. And it is growing day after day.
Currently you may consider needing about 12TB of storage to run the ETL, and you should think about a scalable solution. The I/O rate is really high, so an NVMe storage solution is recommended.
I have tested different setups and I’m still working on this:
| Server Type | Config | Storage type | Monthly Cost | Avg Block Process Duration |
|---|---|---|---|---|
| Kimsufi | 4/8 Xeon E3 @ 3.7GHz, 32GB | 3x 2TB SATA 7200 RPM, RAID 0 (too small now) | 20€ | 1.7 minutes |
| Hetzner AX101 | 16-core Ryzen, 128GB | 2x 3.8TB NVMe (too small now) | 115€ | 6 seconds |
| Bare metal | 4/8 Xeon E3 @ 3.7GHz, 64GB | 2x 512GB NVMe + 4x 4TB SSD | 180€ | 25 seconds |
The average time between 2 blocks generated on the chain is 1 minute. Any setup with an average block processing duration over a minute will not be able to sync. The maximum acceptable processing duration per block sounds like 30 seconds, to make sure the ETL will be able to catch up and stay in sync.
This documentation is based on the Kimsufi setup; the same can be applied to different hardware setups.
Software requirements
I’m running this software under Linux and, as Helium development usually targets Debian/Ubuntu-based distributions, I preferred to choose Ubuntu to reduce the risk of incompatibility. Erlang installation on recent Ubuntu distributions has been a problem, so my final choice was Ubuntu 18.04 LTS. This distribution has been deployed on a 256GB partition of my first hard drive.
Create a RAID0 striped storage
First we need to create a storage pool that benefits from multiple hard drives. My 3x 2TB setup allows a RAID0 striped across the 3 disks. That way all the hard drives can be used in parallel to improve performance. I’m reaching 3x 80MB/s write with 0.5ms I/O latency during database load. This is to be compared with the 80MB/s and 6ms I/O latency I experienced in a first configuration where the volume was linear (all reads/writes went to a single disk).
To get this configuration, here is the setup:
# create a 1.7TB partition on each of the 3 HDDs
# all are primary /dev/sdX3 just to simplify reading and have
# the same size
[~root] fdisk /dev/sda
n
p
3
+1.7T
w
# loop on sdb and sdc
# Create the Raid 0 logical volume
[~root] pvcreate /dev/sda3 /dev/sdb3 /dev/sdc3
[~root] vgcreate space_vg /dev/sda3 /dev/sdb3 /dev/sdc3
[~root] lvcreate --type raid0 -l 100%FREE --stripes 3 -n space_lv space_vg
[~root] mkfs.ext4 /dev/space_vg/space_lv
# Mount the new volume
[~root] blkid /dev/space_vg/space_lv
/dev/space_vg/space_lv: UUID="e99dceb4-96bd-4b96-86fc-983af2f1f19d" TYPE="ext4"
# Edit /etc/fstab and add a line
# UUID=e99dceb4-96bd-4b96-86fc-983af2f1f19d /space ext4 defaults 0 0
# Create directory and mount it
[~root] mkdir /space
[~root] mount /space
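To check that the stripes are really in place and get a rough idea of the sequential write throughput, here is a quick sanity check (the dd test file is just a temporary file, remove it afterwards):
# verify the logical volume is striped across the 3 PVs
[~root] lvs -a -o name,size,segtype,stripes,devices space_vg
# rough sequential write benchmark, bypassing the page cache
[~root] dd if=/dev/zero of=/space/ddtest bs=1M count=4096 oflag=direct status=progress
[~root] rm /space/ddtest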
In case your drive is over 2TB you need a GPT partition table; use parted instead of fdisk:
# create a ~3.7TB partition on each drive
# one primary GPT partition per drive, to keep the same size everywhere
[~root] parted /dev/sda
(parted) mklabel gpt
(parted) mkpart primary 0GB 4TB
(parted) print
...
Partition Table: gpt
...
Number Start End Size File system Name Flags
1 1049kB 4000GB 4000GB primary
(parted) quit
# repeat for the other drives, then create the LVM volume as above
Install PostgreSQL 12
[~root] wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
[~root] echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" |sudo tee /etc/apt/sources.list.d/pgdg.list
[~root] apt update
[~root] apt -y install postgresql-12 postgresql-client-12 postgis postgresql-12-postgis-3 postgresql-12-postgis-3-scripts
Make sure the postgresql service is down before modifying the configuration
[~root] systemctl stop postgresql
Then you need to modify DB access by editing the file /etc/postgresql/12/main/pg_hba.conf:
# change
local all all peer
# to
local all all md5
Then you need to edit the file /etc/postgresql/12/main/postgresql.conf and change the following lines:
data_directory = '/space/var/lib/postgresql/12/main'
external_pid_file = '/space/var/run/postgresql/12-main.pid'
unix_socket_directories = '/space/var/run/postgresql'
shared_buffers = 4GB
work_mem = 4GB
maintenance_work_mem = 4GB
fsync = off
checkpoint_timeout = 120min
max_wal_size = 2GB
max_connections = 500
stats_temp_directory = '/space/var/run/postgresql/12-main.pg_stat_tmp'
random_page_cost = 1.0
Now we need to move some of the PostgreSQL directories to match the configuration and use our RAID0 storage:
[~root] mkdir -p /space/var/lib
[~root] mv /var/lib/postgresql /space/var/lib/
[~root] mkdir -p /space/var/run
[~root] mv /var/run/postgresql /space/var/run/
[~root] ln -s /space/var/run/postgresql /var/run/postgresql
Now we can enable / start postgresql
[~root] systemctl enable postgresql
[~root] systemctl start postgresql
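You can quickly verify that PostgreSQL picked up the relocated data directory and the tuned memory settings (a minimal check; the expected value is the /space path configured above):
[~root] sudo -u postgres psql -c "SHOW data_directory;"
[~root] sudo -u postgres psql -c "SHOW shared_buffers;"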
Other Software settings and installation
We need to open port 44158 for peer-to-peer communications
[~root] ufw allow 44158/tcp
We need to extend the open file limit
[~root] ulimit -Sn 65535
To make it persistent, edit file /etc/security/limits.conf and add a line
* - nofile 65535
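The limits.conf value is applied by PAM at the next login; after logging out and back in you can check both the soft and the hard value:
[~root] ulimit -Sn
65535
[~root] ulimit -Hn
65535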
We need to install the Erlang environment (version 24.3 at the time of writing)
[~root] wget https://packages.erlang-solutions.com/erlang/debian/pool/esl-erlang_24.3.3-1~ubuntu~bionic_amd64.deb
[~root] dpkg -i esl-erlang_24.3.3-1~ubuntu~bionic_amd64.deb
[~root] apt-get update
[~root] apt install cmake libsodium-dev libssl-dev build-essential
We also need to install the Rust environment
[~root] curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
And some other useful software: unzip will be needed to extract the snapshot from DeWi, and atop is a useful tool to watch I/O and other stats to make sure everything is progressing.
[~root] apt-get install unzip atop
Install Database from DeWi snapshot
You can install from scratch… but it will take a really long time for you to get in sync. You can do this if you want to verify the whole chain on your own because you don’t trust DeWi… it can make sense, but the best way is to use a snapshot. Loading a snapshot is a long operation; on my system it has taken 16 hours. I’ve seen some 2x mirrored NVMe based systems doing this in 6 hours for the December 2021 snapshot.
So you first need to get the database snapshot from the DeWi repository. Make sure you select the latest dump. This is a zip file, currently about 530GB:
[~root] cd /space
[~root /space] wget https://storage.googleapis.com/dewi_etl_snapshots/database_snapshot_dfee04b-YYYY_MM_DD_0N_00.zip
[~root /space] unzip database_snapshot*
[~root /space] rm database_snapshot*
You can also use axel to download the dumps faster
[root ~] apt-get install axel
[root ~] axel -a -n 4 https://storage.googleapis.com/dewi ...
Create the ETL database
[~root] sudo su - postgres
[~postgres] psql
postgres=# CREATE DATABASE etl;
postgres=# CREATE USER etl WITH ENCRYPTED PASSWORD 'xxxxx';
postgres=# GRANT ALL PRIVILEGES ON DATABASE etl TO etl;
postgres=# quit
[~postgres] exit
[~root]
Load the database snapshot
[~root] su - postgres
[~postgres] cd /space/root/export/database
[~postgres database] pg_restore -v -d etl -W -j4 -U postgres -Fd .
Password:
Now you can take a pause, as this processing is really long (16h in my case).
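If you want to keep an eye on the progress from another shell, the growing size of the etl database is a rough but convenient indicator (the final size is well over the 530GB of the compressed dump):
[~root] sudo -u postgres psql -c "SELECT pg_size_pretty(pg_database_size('etl'));"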
Install the DeWi ETL snapshot
Now we need to install the blockchain-etl service that will listen to the blockchain and update the database block after block. For this we are going to clone the Helium repository. From my point of view it is better to select the latest release of the GitHub repository; this is done by the git checkout line:
[~root] mkdir /space/helium ; cd /space/helium
[~root helium] git clone https://github.com/helium/blockchain-etl.git
[~root helium] cd blockchain-etl
[~root blockchain-etl] git checkout X.Y.ZZZ
[~root blockchain-etl] cp .env.template .env
Edit the .env file to match your configuration
DATABASE_URL=postgresql://etl:xxxpasswordxxx@localhost/etl
GOOGLE_MAPS_API_KEY="xxxxxxxxxNSRqoxyeDshj1exxxxxx"
The Google API key is not mandatory; if you decide to get the geocoding in your database, you need to consider the related cost. Each API call costs $0.005 and you have a monthly free credit equivalent to $200, which covers about 40,000 calls. The API call rate is high, so you may have to pay when you use this option.
Now you can make the blockchain-etl:
[~root blockchain-etl] make release
Now we need to use the DeWi snapshot data. Let’s start by downloading and unzipping it. Make sure you use the same timestamp as for the database.
[~root space] wget https://storage.googleapis.com/dewi_etl_snapshots/etl_snapshot_dfee04b-YYYY_MM_DD_0N_00.zip
[~root space] unzip etl_snapshot*
[~root space] rm etl_snapshot*
Then let’s move the data folder from this snapshot to the blockchain-etl
[~root space] mv /space/root/blockchain-etl/_build/dev/rel/blockchain_etl/data /space/helium/blockchain-etl/_build/dev/rel/blockchain_etl/
We can clean up the space used by the DeWi import. Make sure the database import has finished before doing this, in case you execute these steps in parallel.
[~root space] rm -rf /space/root
Once the database has been imported (not before it has finished) and all these steps executed, we are ready to run the ETL and start syncing with the blockchain.
[~root space] cd helium/blockchain-etl
[~root blockchain-etl] make start
You can take a look at the syncing progress
[~root blockchain-etl] tail -f _build/dev/rel/blockchain_etl/log/console.log
To focus on block storage progress
[~root blockchain-etl] tail -f _build/dev/rel/blockchain_etl/log/console.log | grep "Stored block"
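To estimate how far behind the chain you are, you can compare your database height with the public API height (this sketch assumes jq is installed; the blocks table is the same one queried in the tips below):
# highest block stored in the local etl database
[~root] sudo -u postgres psql -d etl -t -c "SELECT max(height) FROM blocks;"
# current chain height according to the public API
[~root] curl -s https://api.helium.io/v1/blocks/height | jq .data.height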
In case you want to update the ETL software from the repository:
[~root blockchain-etl] git pull
[~root blockchain-etl] git checkout x.y.z #change tag
[~root blockchain-etl] make release
[~root blockchain-etl] make migrations
[~root blockchain-etl] make start
Install the API
On top of the database we are going to install the Helium API to be able to query it from our application. The Helium APIs are in the blockchain-http project (the blockchain-api project has been archived).
System prerequisites
Currently I’m doing the tests with a 2 vCPU / 4GB RAM virtual machine. I’ll later do some tests using the same machine as the ETL and let you know. The system is Ubuntu 18.04 LTS.
The GitHub project details the Erlang installation; the versions are a bit different compared to the ETL project, but I’m not sure it is a real problem.
[root ~] apt-get install git ca-certificates curl gnupg lsb-release
# install Docker
[root ~] curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
[root ~] echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
[root ~] apt-get update
[root ~] apt-get install docker-ce docker-ce-cli containerd.io
# install the blockchain-http repo
[root ~] git clone https://github.com/helium/blockchain-http.git
[root ~] cd blockchain-http
Optionally, as I do not want to open my PostgreSQL port publicly and play with iptables, I created a port-forwarding tunnel between the two servers. It runs in the background and requires the API server’s public key to be deployed in the ETL server’s authorized_keys file. The port is forwarded on IP 172.17.0.1: this is the docker0 IP, and it allows the containers to reach the remote PostgreSQL server.
[root ~] ssh -fNTML 172.17.0.1:5432:localhost:5432 root@my-etl-name.com
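A plain ssh tunnel dies silently if the connection drops; one option (my choice here, any supervisor would do) is to wrap it with autossh so it gets restarted automatically:
[root ~] apt-get install autossh
# -M 0 disables autossh's own monitoring port, the keepalive options do the job
[root ~] autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -fNT -L 172.17.0.1:5432:localhost:5432 root@my-etl-name.com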
If you run the API on the same server as your database, you need to allow access from the Docker network interface (172.17.0.1).
Change /etc/postgresql/12/main/postgresql.conf
listen_addresses = '*'
Change /etc/postgresql/12/main/pg_hba.conf
host all all 172.17.0.0/16 md5
In case you have ufw running, add the following rule:
[root ~] ufw allow from 172.17.0.0/16 to 172.17.0.1 port 5432
Run the API service
Now we can build the project within a Docker image, then run it
[root ...http] docker build -t helium/api .
[root ...http] mkdir ~/api_data
[root ...http] docker run -d --init \
--restart unless-stopped \
--publish 8080:8080/tcp \
--name api \
--add-host=dockerHost:172.17.0.1 \
--mount type=bind,source=$HOME/api_data,target=/var/data \
-e DATABASE_RO_URL=postgresql://etl:<Password>@dockerHost:5432/etl \
-e DATABASE_RW_URL=postgresql://etl:<Password>@dockerHost:5432/etl \
-e DATABASE_RO_POOL_SIZE=10 \
helium/api
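Before exposing anything, you can check that the container answers locally (the height endpoint is a convenient smoke test):
[root ...http] curl -s http://localhost:8080/v1/blocks/height
# if it does not answer, look at the container logs
[root ...http] docker logs -f api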
Expose the API
Now we need to expose the API with an nginx server and protect it:
[root ~] apt-get install nginx certbot python-certbot-nginx apache2-utils
Add the following site configuration in the /etc/nginx/sites-enabled directory:
upstream api {
server localhost:8080;
}
server {
gzip on;
gzip_types text/plain text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript application/json;
server_name helium-api.disk91.com;
root /var/www/html;
location / {
index index.html index.htm;
}
location ~ ^(/v1) {
proxy_pass http://api$request_uri;
proxy_redirect off;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
auth_basic "Restricted Content";
auth_basic_user_file /etc/nginx/.htpasswd;
}
listen 80;
}
Run the nginx server and create an ssl certificate:
[root ~] systemctl restart nginx
[root ~] systemctl enable nginx
[root ~] certbot --nginx
Now let’s create a user allowed to access the API with a password
[root ~] htpasswd -c /etc/nginx/.htpasswd apiUser
Password:
You can try the API with a browser now…
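Or with curl, replacing the hostname with yours (without credentials you should get a 401, with them the current height):
[root ~] curl -i https://helium-api.disk91.com/v1/blocks/height
[root ~] curl -s -u apiUser:yourPassword https://helium-api.disk91.com/v1/blocks/height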
Tips
In case you need to change the API rate limit, you can edit the file config/sys.config:
{throttle, #{
%% how much request time is allowed
request_time => 20000, %% change this
%% how many requests are allowed
request_count => 100, %% change this
%% over how long
request_interval => 60000,
...
}},
Test postgresql connection
[root ~] psql -U etl -h localhost -W etl
Extend the database storage
We can extend the database storage while the blockchain is growing; this is the advantage of a bare-metal solution.
Stop the blockchain service and the postgresql service. Start by adding the new SSD and configuring it to join the LVM group. As we have defined striping over 2 disks, we must add two disks at a time, or split a single new disk into 2 partitions (this will impact performance).
[root ~] fdisk /dev/sdc # create new partitions depending on the added hardware
[root ~] pvcreate /dev/sdc1 /dev/sdc2
[root ~] vgextend postgres_vg /dev/sdc1 /dev/sdc2
[root ~] vgdisplay # check what you have made
[root ~] umount /postgres/
[root ~] lvextend -l +100%FREE /dev/postgres_vg/postgres_lv /dev/sdc1 /dev/sdc2
[root ~] e2fsck -f /dev/postgres_vg/postgres_lv
[root ~] resize2fs /dev/postgres_vg/postgres_lv
[root ~] mount /postgres/
Troubleshooting
- I have experienced PostgreSQL not starting or updating during/after an apt-get upgrade. This was related to the symbolic link to the postgresql directory being removed and replaced by a normal directory; as a consequence, the start and upgrade scripts never end, waiting for the socket to pop up.
- If you need to move files from one volume to another to increase your storage, you can do some file-system syncs to prepare the switch and limit the downtime with an rsync command like this one:
[~root ] rsync -avh --sparse --hard-links /blockchain/ /blockchain2 --delete
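The idea is to run a first pass while everything is still up, then stop the services writing to the volume and run a final, much shorter pass before switching the mount points. A sketch with the same paths as above:
# first pass, services still running (long, but no downtime)
[~root ] rsync -avh --sparse --hard-links /blockchain/ /blockchain2 --delete
# stop the services writing to the volume, then the final pass only transfers the delta
[~root ] systemctl stop postgresql
[~root ] rsync -avh --sparse --hard-links /blockchain/ /blockchain2 --delete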
Some specific tricks with the ETL
Database migration around version 147
If you have to do that migration (you should not need to by now), you need to create an index first, or the migration will run forever:
CREATE INDEX transaction_actors_actor_block_role_idx
ON public.transaction_actors
USING btree (actor, block, actor_role);
It takes 300GB and quite a long time to build. Then you can execute the migration. Once the migration is done, the index can be removed. So basically it’s good to verify whether it is still there when loading a future dump:
SELECT tablename, indexname, indexdef FROM pg_indexes WHERE schemaname = 'public';
and drop it if needed:
DROP INDEX CONCURRENTLY transaction_actors_actor_block_role_idx;
In case you need to load a snapshot into your ETL
You need to have the ETL running and use the full path for the snapshot. Snapshots can be found on https://snapshots.helium.wtf/mainnet/ but you can’t list the available snapshots, so you need to ask for a valid one. The name format is snap-<block>.gz.
The snapshot you load always needs to be below the database height.
[root blockchain-etl] make start
[root blockchain-etl ] _build/dev/rel/blockchain_etl/bin/blockchain_etl snapshot load /xxx/blockchain-etl/snap-xxx.gz
Get the database height
psql# select max(height) from blocks;
In case you want to take a snapshot from your ETL
[root blockchain-etl] make start
[root blockchain-etl ] _build/dev/rel/blockchain_etl/bin/blockchain_etl snapshot take ~/snap-xxx.gz
Running from a Docker container
The Docker container option allows running the ETL on top of any Linux OS, as Erlang has some dependencies that are hard to fulfill. Here are some scripts to run your ETL with Docker.
Create the container
[~root blockchain-etl] docker build -t helium/etl .
Start the container
#!/bin/bash
# run_etl.sh
docker stop etl
docker rm etl
docker run -d --init \
--restart unless-stopped \
--publish 2154:2154/tcp \
--publish 44158:44158/tcp \
--name etl \
--add-host=dockerHost:172.17.0.1 \
--mount type=bind,source=/helium/blockchain-etl/_build/dev/rel/blockchain_etl/data,target=/var/data \
-e DATABASE_URL=postgresql://etl:<<db_password>>@dockerHost:5432/etl \
helium/etl
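Then make the script executable, run it, and follow the sync in the container logs:
[~root blockchain-etl] chmod +x run_etl.sh
[~root blockchain-etl] ./run_etl.sh
[~root blockchain-etl] docker logs -f etl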
Manage the migrations:
[root ~] docker run -e DATABASE_URL=postgresql://etl:<<db_password>>@dockerHost:5432/etl --add-host=dockerHost:172.17.0.1 helium/etl migrations list
[root ~] docker run -e DATABASE_URL=postgresql://etl:<<db_password>>@dockerHost:5432/etl --add-host=dockerHost:172.17.0.1 helium/etl migrations run