There are so many different programming languages – which one do you pick to address the 1.8 trillion lines of code (in 20 years) problem? There are so many ways to represent (store, transport, access) data – which one do you choose?

Software engineers have over-complicated things, but they won’t admit it, because they can’t see it

Let’s start by considering one way to represent data on the wire: JSON. And let’s consider one programming language to access that data: JavaScript (obviously – but not necessarily). If (when) the client (e.g. a browser) communicates with a Node.js server, communication is pretty straightforward. No need to go into much detail here
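
Still, a minimal sketch helps set the scene. This assumes a plain Node.js http server; the port, endpoint, and field names are illustrative, not from the post:

```js
// server.js -- a tiny Node.js endpoint that returns JSON (illustrative only)
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json; charset=utf-8' });
  res.end(JSON.stringify({ greeting: 'hello' })); // object -> JSON text
}).listen(3000);

// Client side -- browser JavaScript (e.g. inside an async function):
// const data = await fetch('http://localhost:3000').then(r => r.json());
// console.log(data.greeting); // "hello" -- JSON text -> object
```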

In JavaScript, strings are encoded in UTF-16 (in the client browser and in Node.js alike). On the wire, Node.js (the server) reads and writes text as UTF-8 by default. Fortunately UTF-16 to UTF-8 to UTF-16 conversion is loss-less (for well-formed text), so everything should be okay
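
You can see that round trip directly in Node.js; the sample string below is mine:

```js
// JavaScript strings are sequences of UTF-16 code units; Buffers hold raw bytes.
const original = 'Grüße 😀';                       // non-ASCII characters plus a surrogate pair

const utf8Bytes = Buffer.from(original, 'utf8');   // UTF-16 string -> UTF-8 bytes (what goes on the wire)
const roundTripped = utf8Bytes.toString('utf8');   // UTF-8 bytes -> UTF-16 string

console.log(utf8Bytes.length);                     // 12 -- more bytes than characters
console.log(roundTripped === original);            // true -- the conversion is loss-less
```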

In Java (server), plug in a JSON library (e.g. org.json.simple) and communication is almost as straightforward. In Java, text is (also) encoded in UTF-16 internally and typically in UTF-8 externally. The JSON specification (RFC 8259) mandates that JSON text exchanged between systems must be encoded in UTF-8 (1), but if (when) the platform default text encoding isn’t UTF-8 and text conversion isn’t explicitly configured with UTF-8, then it can (eventually it will) result in interoperability issues. And if you want (need) to support other types of encoding in your software application (UTF-32, ASCII, …), make sure you manage conversion to/from UTF-8 when you transfer data ‘over the wire’
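
To illustrate the kind of interoperability issue this warns about, here is a hedged Node.js sketch of what goes wrong when the receiver decodes UTF-8 bytes with a different assumed encoding. Latin-1 stands in for a non-UTF-8 platform default; the Java-side fix (always naming the charset explicitly) is analogous:

```js
// The JSON text {"name":"café"} as UTF-8 bytes on the wire ('é' becomes c3 a9)
const wire = Buffer.from('{"name":"café"}', 'utf8');

// Decoded with the encoding RFC 8259 mandates: fine
console.log(wire.toString('utf8'));   // {"name":"café"}

// Decoded with a platform default that isn't UTF-8 (latin1 here): mangled
console.log(wire.toString('latin1')); // {"name":"cafÃ©"}
```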

It’s helpful that UTF conversion is loss-less, but Bases bring a different problem: the data itself doesn’t tell you which Base it’s in. Imagine you have a load of 1s and 0s – you might reasonably expect you were looking at Base 2, unless you spotted a 2. But spotting a 2 doesn’t mean you’re looking at Base 3 – maybe you’re in a higher Base and there just aren’t any 3s to see. Likewise, seeing a 4 doesn’t mean you’re in Base 5. The same goes for every written sentence – for any/every ‘missing’ letter (every letter you know is in the alphabet but isn’t used in that sentence) you’re making an assumption about which alphabet is being used. This may sound stupid, but I’m making a point: how do you know what Base you’re in, if nobody tells you?
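
A tiny worked example of that ambiguity, using JavaScript’s parseInt with an assumed radix (the digit string is mine):

```js
// The same digit string means different things depending on the Base you assume
const digits = '101';

console.log(parseInt(digits, 2));  // 5   -- if you assume binary
console.log(parseInt(digits, 3));  // 10  -- if you assume Base 3
console.log(parseInt(digits, 10)); // 101 -- if you assume decimal
console.log(parseInt(digits, 16)); // 257 -- if you assume hexadecimal
```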

In Base 32, you’d be forgiven for thinking that you’d use 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and A-V – and why not? Or perhaps A-Z plus 0, 1, 2, 3, 4, 5. But one scheme (that someone much smarter than me came up with) determined it should be A-Z plus 2, 3, 4, 5, 6, 7; another uses 0-9 plus A-Z but drops the letters i, L, o and u (I’ve deliberately mixed lower and upper case so you can see they’re letters, not numbers) – all to reduce/remove transcription mistakes
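
Here’s a minimal sketch of the A-Z, 2-7 scheme (RFC 4648 Base 32) in JavaScript – a hand-rolled encoder rather than a library call, just to show the mechanics:

```js
// RFC 4648 Base 32: 5 input bytes (40 bits) become 8 output characters of 5 bits each
const B32_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

function base32Encode(bytes) {
  let bits = 0, value = 0, out = '';
  for (const byte of bytes) {
    value = (value << 8) | byte;              // shift 8 more bits into the accumulator
    bits += 8;
    while (bits >= 5) {                       // emit one character per 5 bits
      out += B32_ALPHABET[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
  }
  if (bits > 0) {                             // left-over bits, zero-padded on the right
    out += B32_ALPHABET[(value << (5 - bits)) & 31];
  }
  while (out.length % 8 !== 0) out += '=';    // pad to an 8-character boundary
  return out;
}

console.log(base32Encode(Buffer.from('hello', 'utf8'))); // "NBSWY3DP"
```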

In Base 64, you’re using A-Z and a-z (52 letters), 0-9, plus ‘+’ (plus) and ‘/’ (forward-slash) – and ‘=’ (equals) as padding, so that the encoded output always ends on a fixed four-character boundary when sending/receiving data over the network or storing/restoring data on disk
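
In Node.js that scheme is built into Buffer, which also shows the ‘=’ padding in action (the sample text is mine):

```js
// Base 64 encode: 3 input bytes become 4 output characters
const encoded = Buffer.from('Many hands make light work.', 'utf8').toString('base64');
console.log(encoded); // "TWFueSBoYW5kcyBtYWtlIGxpZ2h0IHdvcmsu"

// When the input isn't a multiple of 3 bytes, '=' pads the output to a 4-character boundary
console.log(Buffer.from('hi', 'utf8').toString('base64'));    // "aGk="

// Decoding reverses it
console.log(Buffer.from(encoded, 'base64').toString('utf8')); // "Many hands make light work."
```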

The point is this: if nobody ‘tells’ you what Encoding or Base you’re using, you can make a guess – you might even make the right guess – but you can’t be sure. If you encode a sentence in Base 64 and the output happens to contain only characters that also appear in a Base 32 alphabet, the person decoding it could think it’s Base 32 – but if they decode it as Base 32, it won’t make any sense
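
A concrete version of that trap, reusing the 'hello' value from the Base 32 sketch above: every character of "NBSWY3DP" is legal in both alphabets, but only one reading gives back the original text.

```js
// "hello" encoded with the RFC 4648 Base 32 alphabet is "NBSWY3DP".
// Those eight characters are all valid Base 64 too, so nothing stops a
// receiver from decoding them with the wrong rules:
const wrongGuess = Buffer.from('NBSWY3DP', 'base64');
console.log(wrongGuess);                  // <Buffer 34 14 96 63 70 cf> -- six meaningless bytes
console.log(wrongGuess.toString('utf8')); // garbage, not "hello"
```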

auxillery(TM) data is transmitted (into, out of, or between auxillery runtimes) unambiguously, i.e. its encoding and base are either transmitted ahead of each data element, or the entire payload is transmitted in the most ubiquitous encoding of all: Binary (Base 2)

(1) https://datatracker.ietf.org/doc/html/rfc8259

