Monday, July 11, 2016

Musings on the ADIF file format

The ADIF file format is a simple way to store and organize Amateur Radio (or HAM Radio) log data.

It is the way amateur radio operators most frequently store records of the contacts they make (probably second only to paper).

(For fellow HAMs, my callsign is HA5FTL, see my QRZ.com page for a description of my station and yours truly, and for contact information, if you wish to express your objections and/or approval of one or several of my points made in this article.)

I have to start with a disclaimer.

The Disclaimer

I have to premise that I respect the devotion and countless hours of hard and precise work of my fellow Amateur Radio Operators who proposed, devised and specified the ADIF ADI/ADX file formats, and kept them up to date throughout all these years.

I do not claim their work was in vain, or that its product is unsuitable or too inferior to be the practical and enabling tool it really is.

It's just that it could be much better.

So let me try to explain my expressed anger with this file format, and do not take it personally when I bash on it.

Because I will do that a lot.

Why do that? - An introduction for non-HAM programmers

Amateur radio operators are folks who are allowed to operate radio transceivers (devices that can receive AND transmit radio frequency energy) manufactured, modified (or hacked together) for their own use, for a great variety of purposes and in a great many ways. Not only that, they are allowed to operate transceivers that cover a HUGE portion of the whole radio frequency spectrum, from a few kilohertz (well into audio frequencies) up to "daylight": very, very high frequencies.

The exact frequency bands, the usable power, and the way radio waves are transmitted are all subject to strict regulations. (They cannot jam television or broadcast radio, military & defence frequencies, etc. This is basically common sense most of the time.) An Amateur Radio License is special, because it grants freedoms no commercial radio license ever can.

With great power comes big screw-up potential. HAMs have to pass not-so-easy exams to be given licenses, but after that - depending on the type of license - they can even build their own radio receiver and/or transmitter. (This can be surprisingly simple or mind-bogglingly difficult, so it's a great source of fun & experience.) The radio frequency spectrum is a very scarce resource, and a lot of players hope to grab a slice. Governments sell these slices at high prices, except to hams: they are given a lot of frequencies essentially for free. All they have to do is to follow the rules closely and cooperate with the authorities if requested.

For governments to be able to monitor their HAMs (the operators) closely, all of them are REQUIRED to keep a record of all contacts (sometimes even attempts) made. The details vary across countries and with the circumstances under which these contacts are made or attempted.

Operating an amateur radio station is a wonderfully diverse hobby, with a wonderfully and surprisingly diverse bunch of people participating in it. From bright-eyed young girls to old, fat granddads, from musicians to royalty, a lot of people can and do find pleasure in random or scheduled radio contacts, special events, local contests, or world championships.
They buy, modify or build their equipment; they talk, use Morse code or digital data transfer modes, operate a network parallel to the Internet, and make contacts to and through the International Space Station; they use repeaters, satellites, stratospheric balloons, the ionosphere or even the Moon to bounce their signals toward lands beyond the horizon.

To engage in the hobby is truly an amazing experience. It is like having several interesting hobbies all at once. One can always find new, interesting things to achieve or discover.

All these different people with different fields of interest are required to keep records of their contacts, and there would be several good reasons to do so even if we didn't care about the regulations. (We do, and very much so.)

Making a lot of contacts, or interesting ones, is a great achievement. Prestigious awards are given to those making contacts with the greatest number of operators in distant countries, on islands, or on summits, or contacting a station operating only for a short period of time.

One applies for such awards by handing in the records of the contacts (the logs), most often together with additional evidence (such as logs or written notes (QSL cards) from the contacted stations).

So what is this ADIF thing?

ADIF stands for Amateur Data Interchange Format, and it's the most popular way to send and often to store data related to HAM activities. In practice this almost always means information about contacts made.

The Good...

- ADIF is a standard.

When it comes to exchanging data, any kind of standard is better than chaos. (NASA was taught a very expensive lesson on this.)

- ADIF is plain text

You can open it in your favourite editor, and make changes. This IS a huge plus, as no program is perfect - you can resort to a generic editor should you ever need a transformation your logger application does not support. "Use text, because that is a universal interface", says the UNIX philosophy, for very good reasons.

ADIF is really just a collection of records that describe something, most often contacts between stations. A record has fields, each of which has a name and MIGHT contain data. A record is basically a set of key-value pairs. Pretty nice.

For me the list pretty much ends here.

...the Bad and the Ugly

I'm going to try to explain this in a way that I hope brings my seemingly subjective spleen-spitting closer to objective and constructive criticism, and hopefully points towards a good solution. I'm going to be passionate, because this is what really drives me at this hour.
To be able to explain why I am so overpoweringly repelled by the standard, I'll try to assemble a short list of facts and explanations.

- ADIF is a standard. And it is a BAD one.

This sounds like circular reasoning: it's bad, because it's bad AND popular. I'll nevertheless keep this point, because popularity makes it even worse from a perspective I'm going to take later. For now, let's state that if something is bad, and it's the only path ahead, you have to stumble your way along it.

The solution for this is either to vastly improve the current one, or to throw it out altogether and use a different format - probably an application of a general format - that is widely accepted. Yes, I'm thinking about XML. No, the ADX file format described in the ADIF specification is not XML. (More on that later.)

- ADIF steals programmer's time

This is the particular perspective I promised to take. From it you can see how an insufferable piece of software or standard can turn into something objectively ineffective.

ADIF is BAD, because working with the format is a huge cognitive load. It's hard. It has quite a few catches, AND lacks widespread, well-tested programming libraries that work out-of-the-box. If you're writing a program that eats or spits ADIF, you're likely to be forced to roll your own parser or serializer.

Yes, there are a few GitHub projects, and there are blogs on how to parse the darn thing in PHP and other languages, but it's not at all "import antigravity". Dealing with ADIF takes time, and it takes a lot of it.

If one uses their time to write ADIF juggling routines, they will have less time and energy for being creative, developing useful features or good-looking and convenient user interfaces.

It also prevents small "fire-and-forget" projects from popping up, because no one can write an ADIF library in 30 minutes, and call it anything near complete or usable (trust me, you can't). There are no "do one thing, do it well" (another piece of UNIX philosophy) tools and shims with narrow, well-defined and tested functionality.

The reason for this is the same as the reason there are no really good libraries. And this reason is the following:

- ADIF is Hard and Rigid

This is the reason why there's little to no available "canned" software for dealing with ADIF.

The ADIF specification is a big document. Fully understanding and implementing it takes a lot of time. It defines a plethora of data types. A lot of these data types are enumerations: lists of possible values (words) that can stand at particular places in the file or data stream.

One problem with this is that it rigidly fixes a number of things that should be flexible. A program or library closely adhering to the ADIF specification SHOULD NOT accept a piece of data if it doesn't fit into the given data type.

Let's take the MODE field for example.

Of course any general program or library MUST follow the standard, or it loses its reason for existence. This means properly validating fields, and so preventing any new modes, contests, countries or subdivisions, bands, propagation modes, QSL media, etc. from being quickly adopted by the community.

If you develop a new digital mode, for example, there will be significant resistance to its acceptance, because any existing logger software will decline log entries with your new digital mode. It is currently impossible to properly log contacts in the FSQ digital data transmission mode, because FSQ is not a member of the Mode Enumeration, and log entries containing "FSQ" in the MODE field are declined by logger programs and sites. This clearly holds back FSQ and similarly new digital modes from being accepted. The fewer stations you can work in a mode, the less likely you are to use it. The less likely a mode is to be chosen by one station, the less likely it is to be chosen by others. The highly non-linear dynamics of the spread of information transfer modes greatly magnify any such resistance.

One solution could be to modify the standard to be more easily extended and to state this possibility clearly in the specification. The effort required to write a log entry with a new digital mode is by no means prohibitive, but others HAVE TO accept it for even such a small effort as writing a log with a text editor to have any point at all. Unfortunately the standard does not make this possible, or at least doesn't allow it explicitly.

One could say that ADIF parser implementations are to blame, and we could try to convince the ARRL to make LoTW more flexible and accept records about contacts in FSQ (or other new) modes. But sadly the ARRL and LoTW (for example) NEED the precise mode information to be able to give credits for contacts, because the credits are given per mode: you can get a credit for a phone (SSB, FM, AM), CW (Morse code) or digital contact.
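The difference between a logger that declines such an entry and one that merely flags it can be sketched in a few lines. This is only an illustration under stated assumptions: `KNOWN_MODES` is a tiny, deliberately incomplete excerpt standing in for the Mode Enumeration, and both function names are made up for this example.

```python
# A small, deliberately incomplete excerpt; the real enumeration is much longer.
KNOWN_MODES = {"AM", "CW", "FM", "SSB", "RTTY", "PSK"}


def check_mode_strict(record):
    """Reject the record unless MODE is a member of the enumeration."""
    mode = record.get("MODE", "").upper()
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown MODE: {mode!r}")
    return record


def check_mode_lenient(record):
    """Accept the record, but report an unknown MODE as a warning."""
    mode = record.get("MODE", "").upper()
    warnings = []
    if mode not in KNOWN_MODES:
        warnings.append(f"MODE {mode!r} is not in the enumeration; kept as-is")
    return record, warnings
```

With the strict check, a record like `{"CALL": "HA5FTL", "MODE": "FSQ"}` never makes it into the log; with the lenient one it is kept and the user is told about it.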

How unfortunate that ADIF lacks tags that describe general facts about contacts, such as whether the contact was made using analogue or digital mode, whether it used frequency shift keying/modulation, amplitude/on-off keying/modulation, phase shift tinkering, uses single or multiple tones, whether the carrier is partially or entirely suppressed, etc.

I bet a sufficiently general and widely usable set of fields could be added that in practice (or even in theory) completely describes a transmission mode, and the current mode enumeration would be a mere convenience: a collection of abbreviations for sets of mode descriptions that tend to appear together in practice.

For DXCC entities, countries, subdivisions, SOTA summits, IOTA islands and similar lists of possible values there are authoritative information sources, such as the ARRL's DXCC entity list, the numeric, alpha-2 and alpha-3 ISO-3166 country codes, and the official sites of the SOTA and IOTA awards. Such lists take up most of the space in the ADIF specification. They shouldn’t. There are places where these pieces of data should and do live. Mirroring these and similar lists causes inconsistency between the specification, the authoritative sources and/or reality, and keeps new members of these lists from finding their way into the logs of fellow HAMs.

In my opinion, these lists should be replaced with references to the authoritative information sources, and the possibility to add new members should be left wide open. Where necessary, a set of fields describing general properties should be added, to allow passing information (not just data) on yet-unknown situations.

To cut it short: ADIF is hard, because including the enumerations is tiresome and error-prone, and it is rigid because these MUST be included by any serious implementation.

The lack of general, widely used libraries seems to be a piece of empirical evidence for this.
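To make the idea a bit more tangible, here is a minimal sketch. Everything in it is hypothetical: ADIF has no `MODE_DIGITAL` or `MODE_KEYING` fields, and the two properties are only a small subset of what a real description would need. It only shows how a mode abbreviation could become shorthand for a description, while an unknown mode can still describe itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeDescription:
    digital: bool   # digital (True) or analogue (False)
    keying: str     # e.g. "frequency-shift", "phase-shift", "none"


# Well-known abbreviations are mere shorthand for a description.
SHORTHAND = {
    "RTTY": ModeDescription(digital=True, keying="frequency-shift"),
    "SSB": ModeDescription(digital=False, keying="none"),
}


def describe(record):
    """Prefer explicit (hypothetical) descriptor fields, else the shorthand."""
    if "MODE_DIGITAL" in record and "MODE_KEYING" in record:
        return ModeDescription(record["MODE_DIGITAL"] == "Y",
                               record["MODE_KEYING"])
    return SHORTHAND.get(record.get("MODE", "").upper())


new_mode = {"MODE": "FSQ", "MODE_DIGITAL": "Y", "MODE_KEYING": "frequency-shift"}
```

A receiving program that has never heard of FSQ could still call `describe(new_mode)` and learn that the contact was digital, which is all a per-mode award needs to know.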

- ADIF doesn't natively support Unicode (or at least does it poorly)

The ADIF format is really two file formats. An older "ADI" file format and a newer "ADX". The former was probably modelled after HTML, the latter after XML (ADX IS NOT XML despite what the documentation says, more on the actual formats later).

The spirit of ADI is to use only ASCII characters. I don't know if there are any programs actually enforcing this constraint, but there can be, and if you want a truly compatible implementation, you had better not include characters with codes above 127. You have to stick with the English alphabet, numbers, punctuation and control characters.

Of course the specification can be interpreted so that any byte is allowed as data. (Restricting tag names to be ASCII strings is actually a good idea given ADI is a textual format. It is the most compatible encoding, usable with text presentation/editor software using ASCII, Latin-X, UTF-8 and several other encoding methods / code tables.) If any byte-string of proper length were EXPLICITLY allowed after tags in ADI files, this problem would quickly be resolved.

Adding an <ENCODING:5>UTF-8 or similar tag to the ADI header would make this misery disappear at once. Encoding is only important if data is displayed. For data transfer, it's enough to know how many bytes you have, and what those bytes are. An ADI tag contains every piece of information needed for ADIF to be a truly flexible, binary Amateur Data Interchange Format.

Because of this, the *_INTL fields for storing UTF-8 encoded versions of a few fields are unnecessary (and are strangely only part of the ADX file format, which really wants to be XML, and says it's UTF-8 encoded).

Actually there is a recent data transfer and storage format that bears an uncanny resemblance to ADI. It is BSON, the "Binary JSON", an efficient binary data format used by the popular document-oriented database server MongoDB. BSON's "records" are called "documents". Documents have fields, each with a name, a data type, and the actual data. The length of variable-length data (such as strings) is stored as well; since the format is entirely binary and has no delimiters between fields, field names are terminated by a zero byte, and no extra data is allowed between fields or documents. So this name-length-type-data based principle can work well in practice, "only" the details have to be gotten right.

Of course one can say that the standard is old; however, UTF-8 is a great way to store Unicode text data in a way that is backward compatible with ASCII. UTF-8 is ASCII, if you stick to code points below 128.
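A minimal sketch of how this could work, assuming the length in a tag counts bytes and that the header carries the proposed `<ENCODING>` tag (which is my suggestion here, not something taken from the specification). The helper name `read_field` is made up for this example, and it only handles tags that carry a length.

```python
def read_field(buf, start, encoding="ascii"):
    """Read one <NAME:LENGTH> field at buf[start]; return (name, text, next_pos)."""
    close = buf.index(b">", start)
    # Tag names and lengths are plain ASCII, whatever the data encoding is.
    name, length = buf[start + 1:close].decode("ascii").split(":")[:2]
    end = close + 1 + int(length)
    # Only the data bytes are decoded with the declared encoding.
    return name.upper(), buf[close + 1:end].decode(encoding), end


name_bytes = "László".encode("utf-8")   # 6 characters, but 8 bytes
raw = (b"<ENCODING:5>UTF-8"
       + f"<NAME:{len(name_bytes)}>".encode("ascii") + name_bytes)

_, declared, pos = read_field(raw, 0)          # the header tag itself is ASCII
_, operator_name, pos = read_field(raw, pos, declared)
```

The reader never has to understand the bytes in order to skip over them; the encoding only matters at the moment `operator_name` is shown to a human.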

- ADIF stores the type of the data explicitly

Now it's time to delve into the actual file format. A piece of data stored in ADIF looks like this:

<TAG:11:S>actual data blah blah lots of extra
characters that do not matter.

There is some metadata (data about data) and the actual data. Metadata is most often stored in tags, similarly to HTML. Except there are no closing tags. This is probably to save space, as most data is represented as short strings comparable in length to the text of the tag containing the metadata. Since the length of the data is known beforehand, it need not be interpreted in any way, and no escaping or similar trick is necessary. (As was said above, ADIF can store opaque binary data.)

Between the < and > characters there MUST be a tag name, like CALL if the data is a call sign, or RST_SENT if the data is the signal report sent to the other station. If the tag does not store any actual information, then the name alone suffices. (Such tags are the end of header and end of record tags: <EOH> and <EOR> respectively.)

If the tag does contain data, the length of the data MUST be given after the tag name, separated from it by a colon (:) character, and written in decimal ASCII digits. After the length there MAY be a character - again, separated by a colon from the length - that denotes the type of the data.

This feature makes it simple to quickly write a parser that just "explodes" the ADIF string into (<TAG>, data) pairs. Just look for the first <, then read the tag name, length, and possibly the data type, then expect a >, and read the right number of bytes. Repeat until the end of the file.

Unfortunately ADIF, as described here, cannot be cut up into tokens in the usual way (using regular expressions), as - strictly speaking - the language is not regular. This looks like a minor annoyance not really deserving its own point. A proof would probably be possible by showing that recognizing this language is "just as hard" as recognizing the language of words with the same number of a-s and b-s. The practical problem is that it is impossible to write a fast tokenizer with the popular tools (like lex or flex) that provides high-level tokens and therefore requires little logic above the token level.

Low-level tokens like the <, :, and > characters, the length, the data type and tag names can be recognized, but additional logic is needed to read the right number of bytes. This small design feature prevents the use of popular parser generators, and promotes DIY implementations that can be buggy and hard to maintain.
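Writing a field following these rules is the easy direction. A minimal sketch, assuming ASCII-only data (so that character count and byte count agree); the function names are made up for this example.

```python
def format_field(name, value, type_marker=None):
    """Serialize one field as <NAME:LENGTH[:TYPE]>data."""
    header = f"<{name}:{len(value)}"
    if type_marker:
        header += f":{type_marker}"   # the optional one-character data type
    return header + ">" + value


def format_record(fields):
    """Serialize a dict of fields, terminated by the end-of-record tag."""
    return "".join(format_field(k, v) for k, v in fields.items()) + "<EOR>"
```

For example, `format_field("QSO_DATE", "20160711", "D")` gives `<QSO_DATE:8:D>20160711`.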
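The recipe above can be sketched roughly like this. It is a bare-bones illustration, not a complete ADIF reader: it works on a Python string (so lengths count characters, which equals bytes only for ASCII data), it throws away the header and the type marker, and it does no validation at all.

```python
def parse_adi(text):
    """Split an ADI string into records, each a dict of field name -> data."""
    records, current, pos = [], {}, 0
    while True:
        start = text.find("<", pos)
        end = text.find(">", start) if start != -1 else -1
        if end == -1:
            return records
        parts = text[start + 1:end].split(":")
        name = parts[0].upper()
        pos = end + 1
        if name == "EOH":
            current = {}                  # forget whatever the header held
        elif name == "EOR":
            records.append(current)
            current = {}
        elif len(parts) >= 2 and parts[1].isdigit():
            length = int(parts[1])
            current[name] = text[pos:pos + length]
            pos += length                 # anything after the data is skipped
```

Fed the example from above (with an `<EOR>` appended), it returns a single record whose `TAG` field is `actual data`; the rest of the text falls between fields and is ignored.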
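A small experiment shows where a purely pattern-based tokenizer goes wrong. In this sketch (the sample data and names are made up), the naive pattern assumes data runs up to the next <, while the length-aware loop uses a regular expression only for the tag and then counts characters by hand.

```python
import re

# The data legally contains a "<", because its length is declared up front.
SAMPLE = "<COMMENT:9>5 < 7 ok!<CALL:6>HA5FTL<EOR>"

# Naive: one pattern for the whole field, data ends at the next "<".
NAIVE = re.compile(r"<(\w+):(\d+)>([^<]*)")

# Length-aware: the pattern only recognizes the tag itself.
TAG = re.compile(r"<(\w+)(?::(\d+))?(?::(\w))?>")


def tokenize(text):
    """Return (name, data) pairs; the data length comes from the tag."""
    tokens, pos = [], 0
    while True:
        match = TAG.search(text, pos)
        if match is None:
            return tokens
        length = int(match.group(2) or 0)
        data_start = match.end()
        tokens.append((match.group(1).upper(),
                       text[data_start:data_start + length]))
        pos = data_start + length
```

`NAIVE.findall(SAMPLE)` cuts the comment short at `5 `, while `tokenize(SAMPLE)` recovers the full `5 < 7 ok!`. The extra loop around the pattern is exactly the "additional logic" a lexer generator cannot express on its own.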

Even if you can write this software quickly, you still have to deal with data types.

This data type may be string, if the character is S, a date if it's D, time if it's T, and so on. The standard describes quite a few data types and their format.

This is a problem, because for many tags there is really only one sensible data type to use. Extra effort is needed to validate the type marker.

This sounds like nothing serious, but if you don't work for Microsoft and aim for full interoperability, you HAVE TO do this. In practice, no one really uses the type marker for frequently used, standardized tags. The data type should always be obvious from the field name, and this is how the type of the data is communicated and determined most of the time in practice.
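The extra work looks something like the following sketch. The table is a tiny, hand-picked excerpt, and falling back to "string" for unknown fields is my own assumption rather than anything the specification prescribes.

```python
# Field name -> the one sensible type marker for it (tiny excerpt).
FIELD_TYPES = {"CALL": "S", "QSO_DATE": "D", "TIME_ON": "T", "FREQ": "N"}


def effective_type(name, marker=None):
    """Work out the data type of a field, checking the optional marker."""
    expected = FIELD_TYPES.get(name.upper())
    if expected is None:
        # Unknown field: the marker is all we have; assume string otherwise.
        return marker.upper() if marker else "S"
    if marker and marker.upper() != expected:
        raise ValueError(f"{name}: marker {marker!r} contradicts type {expected!r}")
    return expected
```

Note that the marker never adds information for a known field: it is either redundant or wrong.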

- Awkward data types

There are data types that are non-scalar: lists, even lists of key-value pairs (such as CreditList). Handling these types requires manual labour. A file format describing structured data should provide a small set of simple solutions covering every possibility. Tags and the members of a CreditList instance both essentially describe key-value pairs. We have two solutions for a single problem. This leads to an unnecessarily large lines-of-code metric. Not good.
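For illustration, here is the kind of second, special-purpose mini-parser this forces on you. It assumes, as far as I understand the specification, that a CreditList is a comma-separated list whose items are either a credit name or a credit name followed by a colon and an ampersand-separated list of QSL media.

```python
def parse_credit_list(value):
    """Turn e.g. 'IOTA,WAS:LOTW&CARD' into {credit: [media, ...]}."""
    credits = {}
    for item in value.split(","):
        credit, _, media = item.strip().partition(":")
        credits[credit] = media.split("&") if media else []
    return credits
```

The result is, once again, a set of key-value pairs, just like a record, except that it needed its own delimiters and its own code.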

XML of course could solve this, as it provides the tools for describing nested data.

What can be done?

Every problem I can think of can be efficiently solved by using a format that is a proper, well-designed XML application.

There are good XML parser libraries for any programming language I can think of. These are ready to be used, and are essentially free of bugs (or their bugs and quirks are well-known).

Upon these libraries general-purpose ADIF/XML parsers could be built with relatively little effort. These libraries can then be re-used by application developers, liberating them from the burden of writing their own ADIF parser.

Such generic libraries would work correctly with application- and user-defined fields living in their own namespaces. File format validation could be programmed in a few lines of code given the XML schemas of the core document and the extensions.
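A rough sketch of what I mean, using nothing but Python's standard library. The element names and both namespace URIs are invented for this example; they are not taken from ADX or any existing schema.

```python
import xml.etree.ElementTree as ET

DOC = """\
<log xmlns="urn:example:adif-core" xmlns:my="urn:example:my-logger">
  <qso>
    <call>HA5FTL</call>
    <mode>FSQ</mode>
    <my:antenna>dipole</my:antenna>
  </qso>
</log>"""

CORE = "{urn:example:adif-core}"   # ElementTree writes namespaces as {uri}name


def read_qsos(xml_text):
    """Return (core_fields, extension_fields) for every qso element."""
    qsos = []
    for qso in ET.fromstring(xml_text).iter(CORE + "qso"):
        core, extra = {}, {}
        for child in qso:
            if child.tag.startswith(CORE):
                core[child.tag[len(CORE):]] = child.text
            else:
                extra[child.tag] = child.text   # kept, even if not understood
        qsos.append((core, extra))
    return qsos
```

A program that knows nothing about the `my-logger` namespace can still read the core fields and carry the extension data along untouched.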

Since the effort to actually implement such libraries would be reduced, providing reference implementations and test suites wouldn't be such an outlandish expectation.

ADX is not entirely an XML application

Okay, but ADX is XML. Or is it? Well, not so much.

For one, namespaces are ignored, killing extensibility. Namespaces are the perfect solution for incorporating application- or user-specific data. The whole <USERDEF ...> and <APP> thing is unnecessary.

A true, extensible, simple XML application would solve all of these problems. ADX is close, let's hope we keep heading in that direction.

Folks, let's use XML the way it's meant to be used. Please.