2013. október 26., szombat

Amateur (HAM) radio: analyzing historical data from reversebeacon.net

What is amateur radio?

Amateur radio (or HAM radio) is a great hobby, or sport - it depends on how seriously do you take it, and what activities you choose to engage in precisely.

Basically HAM radio is a pastime where people around the world buy and/or build radio transceivers, antennas and lots of other gear, and make contacts with each other using the frequencies and modes that complies with rather strict regulations.

Amateur radio operators also participate in emergency communication, since no infrastructure is required, and with the proper tools, they can talk over thousands of kilometres.

Despite the aforementioned strictness of radio regulations, there are multitude of ways you can "do HAM radio".

If you enjoy chatting, you can buy a cheap hand-held device and check into a "net" on a local repeater. If you like "hunting DX" - making long-distance contacts on different continents (or covering an unusually large distance at the given circumstances), than you might want to invest into a good short wave radio and a suitable antenna.

You can also build your own radio, antenna and necessary equipment AND hunt DX with that!

What is reverse beacon?

Reverse beacon (http://reversebeacon.net/)  is a service that gathers information from hundreds of radio receivers around the world. These radio receivers listen on various bands, constantly trying to decode the communication.

The network currently mostly listens to "CW" - continuous wave communication or other words Morse code.

Yepp, Morse code is still in use, actually this method is very popular in the amateur radio community. It is particularly easy to build Morse receivers and transmitters, so a lot of the DIY folks speak Morse.

Also, Morse code can be decoded automatically, so reverse beacon can intercept amateur radio Morse code communication.

The intercepted communication can be searched and displayed on the website, even in real time. Cool.

A very nice feature of the RBN is that it makes the historical data available for us. Free. Thank you folks!

RBN stores various interesting info about the intercepted messages. Who sent it, who heard it and when; how loud was the signal and how fast the pace of the dits and dahs were.

Today I've counted 221.582.054 rows in the database. That's a big pile of data to chew on. Let me tell you how I do it.

Choosing a database engine

Considering the size of the data set, you may call this "big data". You could shove it into mysql or postgresql, or something similar, but you'd need to build proper indexes, or just need to be very patient. Loading the data also might take long, hours maybe.

My choice is InfiniDB.

InfiniDB is a columnar database. There is a free, and an enterprise version. The free is more than fine for us.

It stores in a column continuous-, instead of row continuous  fashion. This (and many other design features) is to greatly reduce IO: reading 220 million rows from the disk is no joke in case you have to do a full table scan. InfiniDB never do that.

Instead it does a full column scan, so it won't read data you don't need: in mysql, a simple select <column> from <table> would read or skip through all data in the table, including columns you don't need (well, this is more complicated, but for now this is enough).

You also don't need to create indexes. Actually, the data is structured so that you could say that in InfiniDB data is the index. Whatever query you throw on InfiniDB, it will be kinda fast. (See the concept guide for details. Also, you might want to read a bit about how InfiniDB compares to other, row-based databases.)

Unless you want to do some crazy SQL magic, which just won't work, or will be very slow, since InfiniDB does not support SQL entirely, only a subset of it.

Installing InfiniDB

You can get the installer after registering and logging into infinidb.org.

The installation procedure is simple, read the getting started guide.

Basically what you have to do is to uncompress the downloaded archive under /usr/local. It will create a directory /usr/local/Calpont. After that, run /usr/local/Calpont/bin/install-infinidb.sh as root.

The InfiniDB isntaller is made for servers, so you have to adjust certain parameters to be able to run InfiniDB correctly.

Open up /usr/local/Calpont/etc/Calpont.xml, and look for a line like this:

<NumBlocksPct>66</NumBlocksPct>

The number between the XML tags is a percentage. InfiniDB will acquire this percentage of the totaly physical system RAM for it's own block cache. I've seen this value set by the installer as high as 80%, and it's almost always unacceptable on a desktop computer. Change this to whatever you think is OK.

Also, please try to set this as high as you can easily tolerate, since the block cache can (and will) make queries faster by reducing, or eliminating disk IO.

Fortunately I own a computer with 32 gigabytes of RAM, and the total InfiniDB data directory is about 27 gigabyte, so I can store most of the reverse beacon data in my RAM. After a few queries, things speed up quite a bit, and further queries don't even make my HDD LED blink.

Since we will use a single table for the RBN data (wich is kinda suboptimal, but hey), we don't really need to care about and optimize join behaviour, but InfiniDB tuning is quite an interesting topic, so if you too care, read the tuning guide.

 Getting the historical data

First of all, InfiniDB is "just" a mysql storage engine, so by installing inifidb, you install a mysql instance as well. Make sure to stop any running mysql databases before attempt to install infinidb. (Also, you can configure either InfiniDB or your existing mysql to run on a different than the default 3306 port, by editing the proper my.cnf file.)

You can download raw data from http://www.reversebeacon.net/raw_data/. It not so entertaining to download and process almost two thousand files by hand, so I've written a python script to do the job. Please be polite, and don't comment the out the line that waits between downloads - I don't want anyone to slam the server.
#!/usr/bin/env python
# -*- coding: utf8 -*-

import os
import time
import urllib2
import datetime

base_url = "http://www.reversebeacon.net/raw_data/dl.php?f="

today = datetime.date.today()
delta = datetime.timedelta(1)

i = datetime.date(2009, 2, 21)

while i <= today:
    datestr = "%d%02d%02d" % (i.year, i.month, i.day)
    fname = "RBNDATA/%s.zip" % datestr
    url = base_url + datestr

    if os.path.isfile(fname):
        print "%s exists." % fname
        i += delta
        continue

    print "Downloading %s..." % url

    try:
        content = urllib2.urlopen(url)
        f = open(fname, "w")
        f.write(content.read())
        f.close()
    except urllib2.HTTPError:
        print "Error downloading %s" % url

    i += delta
    time.sleep(5)

Put this script into an empty directory as "download_rbndata.py" and also create a directory named "RBNDATA" next to it. Cd into the script's directory, and run it with python or pypy (issue "python download_rbndata.py", or "pypy download_rbndata.py" command, or just give it an executable flag and fire ./download_rbndata.py).

The download will take about 3 hours. Be patient. Once you've downloaded the files, you can fire the command again, and it will not download the files you already have, so you can gradually gather your own pile of RBN data without re-downloading everything over and over.

If you set up InfiniDB properly, you need one more trick to get a DB shell.

Open up a bash shell and do

. /usr/local/Calpont/bin/calpontAlias

Notice the space between the dot and the rest of the line!

This script will set up command aliases, so - for example - you can get an infinidb shell, by issuing:

idbmysql

After this, a familiar mysql shell will greet you. If you don't want to type the calpontAlias command every time you open a shell, make sure you include it in your .bashrc file. (This step is optional)

Also, you might want to put /usr/local/Calpont/bin into your PATH as well. (This is not vital either.)

Converting and improving on quality

We will use InfiniDB's cpimport tool to import the data into the database.

But first we have to take care of a few problems:

The format of the files are slightly different in different time intervals. Some have headers, some don't, there are files with more columns than others, and the last row contains no data.

Also, cpimport can't read ZIPs, so we have to uncompress the whole thing and make it into a single csv file.

All these things are taken care of by the following python script. Save this as "build_import.py", and put next to the file and RBNDATA directory described above.
#!/usr/bin/env python
# -*- coding: utf8 -*-

import os
import glob
import zipfile

basedir = "RBNDATA"
importfile = "rbndata.csv"

zips = glob.glob(basedir + "/*.zip")
zips.sort()

impf = open(importfile, "w")

for z in zips:
    f = zipfile.ZipFile(z, "r")
    content = None
    csvname = f.namelist()[0]
    print csvname
    csv = f.open(csvname, "r")
    for line in csv:
        if line[:8] == 'callsign':
            print "(dropping header)"
            continue
        if line[:1] == '(':
            print "(dropping last line)"
            continue
        extra = 12 - line.count(',')
        impf.write(line.rstrip() + (',')*extra + "\n")
    csv.close()
    f.close()

impf.close()

Run this just like the other script. It will read the ZIP files and create a huge csv (more than 16 gigabytes). This can be imported into InfiniDB.

Importing the data

Make sure your InfiniDB instance is up and running (use /etc/init.d/infinidb start if it isn't).

Start an idbmysql shell, and run the following sql:

CREATE DATABASE rbndata;

use rbndata;

CREATE TABLE rbndata (
  callsign VARCHAR(16),
  de_pfx VARCHAR(6),
  de_cont VARCHAR(4),
  freq DECIMAL(12,4),
  band VARCHAR(6),
  dx VARCHAR(16),
  dx_pfx VARCHAR(6),
  dx_cont VARCHAR(4),
  mode VARCHAR(6),
  db INT,
  logged_at DATETIME,
  speed INT,
  tx_mode VARCHAR(6)
) ENGINE=InfiniDB DEFAULT CHARSET=latin1;


Now you can use cpimport to do the job:

$ sudo /usr/local/Calpont/bin/cpimport rbndata rbndata rbndata.csv -s ,

Of course, you have to stand in the directory where the huge rbndata.csv file is, which was created by out python scripts.

Also notice the comma after the -s switch. This is how we set the field separator, so it's absolute mandatory.

Also, as the saying goes, "this might take a few minutes". Indeed, it only took a few, my computer chugged this bottle of bytes down in an amazing 510.5 seconds, swallowing about 430 thousand rows per second. Wow.

You can verify the bulk load by issuing the following command at the idbmysql shell:
mysql> select count(*) from rbndata;
+-----------+
| count(*)  |
+-----------+
| 221582054 |
+-----------+
1 row in set (2.42 sec)

This might be a lot slower in your case, since at this time, your block cache is cold, and you have to read a lot from the disk at this time. My cache was hot, and InfiniDB only had to hit RAM to calculate the rows in the table.

Your results also may vary, since after the publication of this article many more contacts were logged by the RBN.

Some interesting queries

Now we have an InfiniDB up and running,  have our data inside it, so let's the fun begin!

We already see how we calculate the total number of rows, which is simple and boring.

Let's see the different callsigns!

select count(distinct dx) from rbndata;

There is an astonishing  690,843 different callsigns in the database. Of course a callsign might be there as a (let's call it) a "simple" one, like mine: HA5FTL, or the operator might have worked outside the shack, on a field, so HA5FTL/P -like callsigns are also there, so a single station or operator might even appear 3-4 different way.

Counting just the simple cases gives us:

select count(distinct dx) from rbndata where dx not like '%/%';

614,185 different callsigns.

Let see portables:

select count(distinct dx) from rbndata where dx like '%/P';
16,409

Mobiles:
select count(distinct dx) from rbndata where dx like '%/M';
4783

Maritime mobiles:
select count(distinct dx) from rbndata where dx like '%/MM';
 873

Air mobiles:
select count(distinct dx) from rbndata where dx like '%/AM';
34

Ok, so how much automatic receiver stations are (were) in the RBN?
select count(distinct callsign) from rbndata;
895 is the result. That's a lot of stations!

Since listening stations come and go, let's see how many were working in august:
select count(distinct callsign) from rbndata where logged_at between '2013-08-01' and '2013-09-01';
Result is 197. Notice the speed: this query run under one second on my machine.

Cool isn't it? And it gets way cooler than that.

See how much loggers on each continent:

select de_cont, count(distinct callsign) as loggers from rbndata group by de_cont order by loggers desc;

And the result is:
  • EU: 502
  • NA: 284
  • AS: 53
  • SA: 33
  • OC: 19
  • AF: 5
So Europe has the largest number of logger stations, and Africa has disappointingly few. Also, Asia and South America could use some volunteers, and the people of Oceania are also have to pull themselves together ;)

You can do the same with the DXes, or the call signs that were recoded:

select dx_cont, count(distinct dx) as dxes from rbndata group by dx_cont order by dxes desc;

  • EU: 376185
  • NA: 178105
  • AS: 83266
  • AF: 19333
  • SA: 17755
  • OC: 16859
No surprise, EU leads the list, North America follows. But what's interesting is Africa's position. This is probably because Africa is a popular target of DX expeditions: trips to places that rarely see HAM radio activity due to the lack local amateurs, or even human beings (like small bare-rock islands in the middle of nowhere). Africa also have a HAM radio life of it's own, in contrast to what you might think first.

(Hm, I miss Antarctica. Don't laugh, there IS some HAM radio activity there, I even made contact with RI1ANF, a station at Bellinghausen Base, King George Island, which technically belongs to Antarctica).

We can play with the bands too. Let's see what bands are the most popular (or at least the most popular in RBN's database):

select band, count(distinct dx) as c from rbndata group by band order by c;

  • 20m: 378063
  • 40m: 372545
  • 80m: 203150
  • 15m: 173935
  • 30m: 120276
  • 10m: 93178
  • 17m: 82932
  • 160m: 63508
  • 12m: 36151
  • 6m: 25389
  • 2m: 2595
  • 60m: 1200
  • 4m: 162
  • 472kHz: 112
  • 70cm: 66
  • 137kHz: 1
 Aaaand the great trio leads: indeed, 20, 40 and 80 meters are very popular, and you can always find something on 30 meters, and when the propagation is good, great DXes keep popping up on 15 meters.

If you lack antenna space, I'd definitely recommend you to build something (a wire antenna, or an aluminium stick) that does not too bad at 15 and 20 meters, and you just might be able to work with local stations on 40 meters up to maybe 1-2 thousand kilometres, if you're lucky (and that stations has a proper antenna).

An other good question is: how each of these band perform at great distances? unfortunately I don't have the exact location of these stations, however I could buy access to qrz.com's huge database, and grab location information there.

Let's define a "DX log entry" as an entry that have different data in the dx_cont and de_cont field, e.g. the logger station and the calling stations were on a different continent:

select band, count(*) as c from rbndata where de_cont <> dx_cont group by band order by c desc;

This will give us the following list:

  • 20m: 25547420
  • 40m: 15628909
  • 15m: 12397729
  • 10m: 4076702
  • 80m: 3660948
  • 17m: 2848269
  • 30m: 2701406
  • 12m: 1012281
  • 160m: 927274
  • 6m: 42104
  • 60m: 5734
  • 2m: 895
  • 4m: 119
  • 472kHz: 10
  • 70cm: 4
So if you want to work DX, you will probably have luck on 20 meters (also don't forget that the data is skewd, since the distribution of logger stations is not uniform).

We can also ask the question: which band is more like a "DX band", and wich is more of a "local band"? Let's see:

select band, sum(if(de_cont <> dx_cont,1,0))/count(*) as c from rbndata group by band order by c desc;
  • 12m: 0.6003
  • 10m: 0.5405
  • 15m: 0.5262
  • 17m: 0.4434
  • 20m: 0.3697
  • 40m: 0.2380
  • 30m: 0.2296
  • 60m: 0.1714
  • 80m: 0.1433
  • 160m: 0.1001
  • 4m: 0.0474
  • 6m: 0.0445
  • 2m: 0.0238
  • 70cm: 0.0071
  • 472kHz: 0.0036
  • 137kHz: 0.0000
So what we've got here? The ration of DX/same continent contacts is the higher on 12 meters. We can speculate that you'll do more DX on 12 meters than not. (Of course, never forget that the data is heavily biased. We should examine the distribution of listening stations through bands and continents and apply some clever de-biasing before ever stating something like that!)

Despite the bias, the figure above are somewhat reflect reality: the higher HF bands are more like DX bands, and the lowers are generally easier to use for local communications. This is because the middle (and sometimes the upper) HF frequencies are more likely to be reflected from the upper atmosphere. The low frequencies are absorbed, the high frequencies are leaving the Earth into space. (Yes I know, this is much more complicated, and the reflectivity of the ionosphere is a science of it's own. Also it's noteworthy that it is a very cool piece of science that potentially can greatly benefit from the data gathered by the RBN.)

I have a humble station with an Elecraft K2 as my main radio along with a few work-in-progress DIY rigs. My antenna is also just a 4 metres long stick of aluminium tube sticking out from my window.

How such amateur station can perform?

Let's see:

select dx, count(*) from rbndata where dx LIKE 'HA5FTL%' group by dx;

  • HA5FTL: 716
  • HA5FTL/P: 32
Well, that's not much. Partly because of my low-level signal, partly because I rarely "call CQ", so my callsign is mostly only heard when I answer to a general call. Weak CQ-s are rarely answered, so I listen and answer calls myself mostly.

But how the continents can hear me?

select dx, de_cont, count(*) as c from rbndata where dx like 'HA5FTL%' group by dx, de_cont order by dx, c desc;

+----------+---------+-----+
| dx       | de_cont | c   |
+----------+---------+-----+
| HA5FTL   | EU      | 654 |
| HA5FTL   | NA      |  46 |
| HA5FTL   | AS      |  14 |
| HA5FTL   | AF      |   1 |
| HA5FTL   | SA      |   1 |
| HA5FTL/P | EU      |  31 |
| HA5FTL/P | AS      |   1 |
+----------+---------+-----+
7 rows in set (6.99 sec)

So you only have a good chance to hear me if you're living in Europe. Also if you own a decent antenna, we might be able to make contact if you're living in Canada's or the USA's east coast. Asian and African HAMs will have a hard time hear me at all. (Also, this low number of logs can be attributed to the low number of loggers at AF and AS, but let's just be realistic: my shack is far from ideal).

Let's see my favourite bands (beside Infected Mushroom and Shpongle :P):

select dx, band, count(*) as c from rbndata where dx like 'HA5FTL%' group by dx, band order by dx, c desc;

+----------+------+-----+
| dx       | band | c   |
+----------+------+-----+
| HA5FTL   | 40m  | 288 |
| HA5FTL   | 20m  | 255 |
| HA5FTL   | 15m  | 100 |
| HA5FTL   | 17m  |  31 |
| HA5FTL   | 80m  |  23 |
| HA5FTL   | 10m  |  10 |
| HA5FTL   | 30m  |   9 |
| HA5FTL/P | 40m  |  26 |
| HA5FTL/P | 10m  |   3 |
| HA5FTL/P | 20m  |   3 |
+----------+------+-----+
10 rows in set (7.04 sec)

Indeed, I mostly work on 40 and 20, occasionally on 15 meters, rarely on others.

We also could see the "DX ratio" on each band:

select dx, band, sum(if(dx_cont <> de_cont,1,0))/count(*) dx_ratio, count(*) calls_heared from rbndata where dx like 'HA5FTL%' group by dx, band order by dx, dx_ratio desc;

I've included the calls_heared column so we can see how precise the dx_ratio might be:

+----------+------+----------+--------------+
| dx       | band | dx_ratio | calls_heared |
+----------+------+----------+--------------+
| HA5FTL   | 15m  |   0.2200 |          100 |
| HA5FTL   | 10m  |   0.2000 |           10 |
| HA5FTL   | 20m  |   0.1216 |          255 |
| HA5FTL   | 17m  |   0.0968 |           31 |
| HA5FTL   | 40m  |   0.0139 |          288 |
| HA5FTL   | 80m  |   0.0000 |           23 |
| HA5FTL   | 30m  |   0.0000 |            9 |
| HA5FTL/P | 20m  |   0.3333 |            3 |
| HA5FTL/P | 40m  |   0.0000 |           26 |
| HA5FTL/P | 10m  |   0.0000 |            3 |
+----------+------+----------+--------------+

It looks like it's worth getting on 15 meters if I want to work DX from home, and If the /P 20 meters dataset size would be bigger at the same ratio, I'd say that 20 meters with a random wire on a field is good for DX, unfortunatel 3 logged contacts is waaay too low to mean anything.

I must note this point that all of these queries run under half a minute, and mostly under 10 seconds. On 200 million rows with group by-s and stuff.

Well I think you all get the picture now. If you have some interesting queries, or even questions, just write a comment below.