How Computers Store Stuff…


Computers are useful tools. But like most tools, one sometimes needs to know a bit more about how they work. In particular, if you want to appreciate the limitations of seismic processing or to understand seismic data quality issues it’s helpful to know something of the ways in which computers store and manipulate numbers.

Understanding how computers store information, whether numbers or text, is also essential – obviously! – if you want to make sense of seismic data dumps. So, at the risk of insulting everyone’s intelligence, here’s a quick tour of how computers store information:

  1. Binary Numbers…
  2. …and Hexadecimal Representations
  3. Negative Numbers
  4. Storage Order, “Little Endian” and “Big Endian” Architectures
  5. Encoding Text
  6. Floating-point Numbers
  7. IEEE Floating Point Number Formats
  8. IBM Floating Point Number Format
  9. Binary Coded Decimal


Binary Numbers…

Everyone knows that computers store information as something called “binary numbers”. Not everyone though – including, sadly, many programmers and other computer professionals – fully appreciates quite what that means.

First, let’s emphasise that binary numbers are not somehow different from the decimal numbers we’re all familiar with. Other number systems like hexadecimal and binary are just different ways of expressing numbers. When we speak of binary or hexadecimal numbers we’re not referring to some other types of numbers, just alternative ways of writing them down. The numbers themselves remain the same. Ten is ten whether written as ‘10’ in decimal, as ‘A’ in hexadecimal or as ‘1010’ in binary.

So why not use decimal?

Well, binary is useful because it corresponds with what’s physically happening in the computer. Information – numbers really, because that’s all computers know how to deal with – is stored by setting transistors inside a memory chip into either an ‘on’ or ‘off’ state, ‘1’ or ‘0’. Binary, in other words. So if we write down a number in binary form as a string of 1’s & 0’s we’re indicating exactly how that number is stored in computer memory. And the other system we use a lot when talking about numbers stored on a computer, hexadecimal, is simply a more compact way of representing binary numbers.

So far, so good. So what is a binary number?

A normal decimal number – let’s say for sake of argument the number 203 – is written “203” meaning:

(2 × 100) + (0 × 10) + (3 × 1)

that is, each digit represents a power of ten, in this case (2 × 10²) + (0 × 10¹) + (3 × 10⁰). In exactly the same way, the digits in a binary number represent the powers of 2. So ‘11001011’ represents the number:

(1 × 2⁷) + (1 × 2⁶) + (0 × 2⁵) + (0 × 2⁴) + (1 × 2³) + (0 × 2²) + (1 × 2¹) + (1 × 2⁰)

that is, (1 × 128) + (1 × 64) + (0 × 32) + (0 × 16) + (1 × 8) + (0 × 4) + (1 × 2) + (1 × 1), which just happens to be 203. So ‘11001011’ is another way of writing 203.
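If you want to play with this, here’s a minimal C sketch – not part of the discussion above, purely an illustration – that prints the eight binary digits of a small number by testing each power of two in turn:

    #include <stdio.h>

    int main(void)
    {
        unsigned int value = 203;           /* the example number from the text */

        /* Walk from the highest bit of one byte (2 to the 7th) down to 2 to the 0th,
           printing '1' if that power of two is present in the value. */
        for (int bit = 7; bit >= 0; bit--)
            putchar((value >> bit) & 1 ? '1' : '0');
        putchar('\n');                      /* prints 11001011 */

        return 0;
    }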

The number used as the basis for a number representation system is called the base or radix. If there’s any doubt about the base it’s written (in decimal) as a subscript to the right of the last digit, thus “203₁₀” or “11001011₂”.

A single binary digit is called a bit. A bit is rather too small a unit for most purposes so computer memory is organised into 8-bit bytes. The largest number that can be stored in a single byte is 11111111, that is:

(1 × 2⁷) + (1 × 2⁶) + (1 × 2⁵) + (1 × 2⁴) +
    (1 × 2³) + (1 × 2²) + (1 × 2¹) + (1 × 2⁰)

or 255 (and not 256, a number often quoted!). It’s important to bear in mind, though, that the fundamental unit of storage – the atom of computer memory if you like – is a bit, not a byte. Bytes are useful. Bits are fundamental.

Computer memory is also often described in terms of “words”. A word is either 1, 2, 4 or 8 bytes. The word-size defines the type of computer and says much about the way it works. Computers with 1-byte words are now (thankfully!) consigned to history. Older PCs – with Intel 8086 and 80286 CPUs – had a 2-byte (16-bit) word. Most current computers, including Pentium PCs, use a 4-byte word. The latest workstations and the next generation of PCs have an 8-byte word. The distinction is important, though the reasons are beyond the scope of this discussion.



…and Hexadecimal Representations

Hexadecimal – or more commonly, “hex” – is the same idea, but based on powers of sixteen rather than ten or two. The reason for choosing sixteen is that a single hexadecimal digit corresponds exactly to four bits. Two hexadecimal digits can represent a single byte, compactly and precisely. Consequently a hexadecimal digit – half a byte – is sometimes referred to, rather wryly, as a nibble.

Writing a hexadecimal number requires six extra symbols to represent the (single digit in hexadecimal) numbers 10 to 15. Rather than invent new symbols we use the letters ‘A’ to ‘F’ (upper or lower case, it’s purely a matter of personal taste). The value 203 is thus represented in hexadecimal as “CB₁₆”, that is:

(C [= 12₁₀] × 16¹) + (B [= 11₁₀] × 16⁰).

Converting a regular decimal number to hexadecimal is simple (even if you don’t have a hex calculator) – just keep dividing by 16 and writing down the remainder. As an example, let’s find the hex equivalent of the decimal number ‘123456’:

123456 ÷ 16 = 7716, remainder 0
7716 ÷ 16 = 482, remainder 4
482 ÷ 16 = 30, remainder 2
30 ÷ 16 = 1, remainder 14
1 ÷ 16 = 0, remainder 1

Now we write down the remainders in reverse order, replacing values larger than 9 with the letters ‘A’ to ‘F’ (‘A’ for 10, ‘B’ for 11 and so on, up to ‘F’ for 15), and we get the answer – 123456₁₀ = 1E240₁₆. Try it with a few numbers picked at random and you’ll soon get the hang of it.
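For the curious, here’s roughly how that repeated-division method looks in C – a sketch only, with no error checking:

    #include <stdio.h>

    int main(void)
    {
        unsigned int value = 123456;        /* number to convert */
        char digits[] = "0123456789ABCDEF"; /* remainder -> hex character */
        char hex[16];
        int  n = 0;

        /* Divide by 16 repeatedly, recording each remainder. */
        do {
            hex[n++] = digits[value % 16];
            value /= 16;
        } while (value != 0);

        /* The remainders come out least significant first, so print them in reverse. */
        while (n > 0)
            putchar(hex[--n]);
        putchar('\n');                      /* prints 1E240 */

        return 0;
    }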

The reverse conversion – finding the decimal equivalent of a number expressed in hex – is only marginally more difficult. Starting with the first (or last, it doesn’t make any difference) digit, multiply each digit in turn by 16 once for each character to its right (remembering that ‘A’ stands for 10, ‘B’ for 11 and so on, up to ‘F’ for 15) and add the results together (a calculator is very handy for this). So, for example, 1E240₁₆ is:

(1 × 16 × 16 × 16 × 16) = 65536
 + (E [= 14] × 16 × 16 × 16) = 57344, giving a running total of 122880
  + (2 × 16 × 16) = 512, giving a running total of 123392
   + (4 × 16) = 64, giving a running total of 123456
    + (0 × 1) = 0, leaving the final answer as 123456₁₀
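And a matching sketch for the reverse direction. It builds the total by multiplying the running value by 16 for each new digit, which has the same effect as multiplying each digit by 16 once per character to its right (a real program would more likely just call the standard library’s strtol):

    #include <stdio.h>

    int main(void)
    {
        const char *hex = "1E240";          /* hex string to convert */
        unsigned long value = 0;

        /* For each digit, multiply the running total by 16 and add the digit's value. */
        for (const char *p = hex; *p != '\0'; p++) {
            int digit;
            if (*p >= '0' && *p <= '9')
                digit = *p - '0';                 /* '0'..'9' -> 0..9 */
            else
                digit = (*p & ~0x20) - 'A' + 10;  /* 'A'..'F' (or 'a'..'f') -> 10..15 */
            value = value * 16 + digit;
        }

        printf("%lu\n", value);             /* prints 123456 */
        return 0;
    }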

One small problem with all this is that while writing numbers in this way, with the base as a subscript, is fine for printed documents, it’s not possible to type a subscript on a standard keyboard. It’s important, though, to make the distinction. ‘11’ is a very different thing depending on whether it’s in binary, decimal or hexadecimal – 11₂ is only 3₁₀ but 11₁₆ is 17₁₀.

Consequently the C programming language (and all subsequent languages – C++, Java, C# etc.) uses the convention of prefixing a hexadecimal number with the string “0x” or “0X”, and there are so many C programmers in the world that this form has become commonplace. It is, though, nothing new – just an alternative to the more formal subscript-16 notation.
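For instance, in C the two notations refer to exactly the same value – this fragment is purely illustrative:

    #include <stdio.h>

    int main(void)
    {
        int a = 0x11;   /* hexadecimal 11, i.e. 17 in decimal */
        int b = 17;     /* the same number written in decimal */

        printf("%d %d\n", a, b);    /* prints 17 17 */
        printf("%x\n", b);          /* prints 11 -- %x displays the value in hex */
        return 0;
    }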



Negative Numbers

Storing ordinary positive integers as binary numbers (and representing them in hexadecimal) is straightforward. Negative numbers are stored slightly differently. First, the most significant bit is used to indicate the sign – it is set (1) for a negative value and clear (0) for a positive value.

However, for arcane reasons to do with the way central processor units actually work, the rest of the number isn’t simply the absolute value. Instead a negative number is stored in what is called “two’s complement” form. The two’s complement form is created by adding 1 to the one’s complement. The one’s complement of a number simply has each binary 1 replaced by 0 and vice versa. So, for example, the one’s complement of ‘11001011’ is ‘…111100110100’, where the ellipsis stands for the rest of the leading 1’s – as many as are needed to fill the word.

Add 1 and you get the two’s complement. Thus the two’s complement of ‘11001011’ is ‘…111100110101’.

Putting this in concrete terms, in the 32-bit integer format used currently by most ordinary computers the number ‘1’ is represented in hex as:

00000001

but ‘-1’ is not:

80000001

(that is, the same as ‘1’ but with the sign bit – the most significant bit – set). Instead it’s actually:

FFFFFFFF

formed by taking the one’s complement of 1 (‘…111110’, FFFFFFFE in hex) and adding one (‘FFFFFFFF’). You can now see why negative numbers are stored in two’s complement form – adding 1 to FFFFFFFF results in zero, as it should (the carry out of the top bit is simply discarded). The use of the two’s complement form for negative numbers means both addition and subtraction can be carried out by a single adder unit within the CPU, which thus doesn’t need any separate circuitry to perform subtraction, nor does it need to treat negative numbers differently from positive values.
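A short C illustration of the same point (just a sketch):

    #include <stdio.h>

    int main(void)
    {
        int n = 203;

        /* Two's complement negation: invert every bit, then add one. */
        int negated = ~n + 1;

        printf("%d\n", negated);            /* prints -203 */
        printf("%08X\n", (unsigned)1);      /* prints 00000001 */
        printf("%08X\n", (unsigned)-1);     /* prints FFFFFFFF */
        return 0;
    }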



Storage Order, “Little Endian” and “Big Endian” Architectures

One important issue when interpreting dumps is the data storage order. On most modern computers integers – that is, ordinary round numbers like “5” or “123456” – are stored in 4 bytes (32 bits) or 8 bytes (64 bits). So, for example, if the number 123456 (1E240₁₆) was stored in a 32-bit integer the bytes would be:

Byte 0   Byte 1   Byte 2   Byte 3
 0x00     0x01     0xE2     0x40

Note that the first byte is “Byte 0”, not “Byte 1”. This is a common convention when talking about computer memory addresses. The reason is that it’s an additive offset – that is, the address of byte 0 is the address of the whole word plus zero, the address of byte 1 is the address of the whole word plus one and so on.

However the individual bytes don’t have to be in this order, called “Big Endian” because the most significant (“biggest”) byte is first. In fact it actually makes more sense in some ways to store them in reverse order, thus:

Byte 0   Byte 1   Byte 2   Byte 3
 0x40     0xE2     0x01     0x00

This arrangement is called “Little Endian”. The reason for this choice, which seems somewhat counter-intuitive at first sight, is that this way knowing the value of an integer doesn’t depend on knowing its length. Take the number “3”. In little endian storage order the bytes, in order of increasing address, are 03 00 in a 2-byte word, 03 00 00 00 in a 4-byte word and 03 00 00 00 00 00 00 00 in an 8-byte word. So even if we read an 8-byte word as if it was a 2-byte or 4-byte word, we still get the right value (3).

In big endian representation, though, 3 will be stored as 00 03, 00 00 00 03 or 00 00 00 00 00 00 00 03 depending on whether it’s a 2-byte, 4-byte or 8-byte value. If it’s actually eight bytes long and we only read the first four bytes the value will appear to be zero.

The difference may seem trivial and forced, but it actually has a huge effect on the way a processor works, and is perhaps the most important single decision facing the team setting out to design a new CPU chip. Intel x86-class processors, including the Pentium (and, of course, all the CPUs that emulate Intel x86 chips, such as AMD Athlon and Duron chips) are little endian (think In-TEL, lit-TEL endian). Most workstations, including Sun SPARC and IBM RS-6000 (PowerPC) systems, are big endian.

The storage order for seismic data is always big endian, that is, most significant byte first. However when reading files or tapes written as straight binary data always bear in mind that the original storage order could have been different. If every line seems to have a line number somewhere in the region of 1677721600 (0x64000000 in hex) it may be that the original number was really 100 (0x64 in hex) but it was recorded in a 4-byte word on a little endian machine (least significant byte first) and you're now reading it as a 4-byte word on a big endian machine…
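When that happens the cure is a simple byte swap. A minimal C sketch (the function name is made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the byte order of a 32-bit value: works in either direction,
       big endian to little endian or vice versa. */
    static uint32_t swap32(uint32_t v)
    {
        return ((v & 0x000000FFu) << 24) |
               ((v & 0x0000FF00u) <<  8) |
               ((v & 0x00FF0000u) >>  8) |
               ((v & 0xFF000000u) >> 24);
    }

    int main(void)
    {
        uint32_t garbled = 0x64000000u;     /* the "line number" read the wrong way round */
        printf("%u\n", swap32(garbled));    /* prints 100 */
        return 0;
    }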



Encoding Text

There’s more to information, though, than integer numbers. Information can also take the form of text or non-integer numbers, and increasingly it takes on graphical forms too.

Graphical information can easily be reduced to numbers – lists of vertices approximating shapes as polyhedra for example, red-green-blue levels to specify a colour, and so on. Text and non-integer numbers, however, are fundamentally different from integers and need special treatment.

Non-integer numbers are described in the next section. Text is stored as numbers by the simple trick of assigning a number to each letter.

Everyone played games as a child, writing “code” messages by putting ‘1’ for A, ‘2’ for B and so on. Computers use exactly the same trick, although they assign codes to punctuation characters and digits as well, plus special control codes. By far and away the most important code today is ASCII (the American Standard Code for Information Interchange).

In ASCII, characters numbered zero to 31 (0 to 0x1F in hex) are special control codes, such as the character to indicate a Tab (9), a Backspace (8) or a Newline (10, 0xA). The character with a code of 32 (0x20) is a space and 33 (0x21) to 47 (0x2F) are punctuation characters. The ten decimal digits are represented by 48 (0x30) through 57 (0x39), the upper-case letters by 65 (0x41) through 90 (0x5A) and the lower-case letters by 97 (0x61) through 122 (0x7A). The remaining codes below 127 are assorted arithmetic and punctuation characters, 127 (0x7F) is the Rub-out character (DEL), and values larger than 127 are not assigned to any character. A full ASCII code chart can be found here.

IBM mainframe computers used a different, much older system called EBCDIC (“Extended Binary Coded Decimal Interchange Code”, for what it’s worth). This system is now obsolete, or would be if it weren’t still used for the text header record (the so-called EBCDIC reel header) in a SEG-Y tape or file. The EBCDIC system is rather more complex than ASCII. A full EBCDIC code chart can be found here.

Faced with a dump that appears to be a jumbled mess of strange characters with occasional characters that look like text, it’s possible you actually have EBCDIC rather than ASCII text. It’s usually rather easy to recognise EBCDIC text in a dump. The EBCDIC character for a space, 64 (0x40), is the ASCII code for the ‘@’ symbol. So if you have something that ought to be text, but in fact looks like gobbledy-gook liberally sprinkled with ‘@’ symbols, the chances are you’re looking at EBCDIC text being displayed as ASCII. A good file dump program such as sfd will be able to display text interpreted as either ASCII or EBCDIC so you can easily check which it is (in fact, left to its own devices sfd will make the choice for you).
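If you wanted to automate that check, one crude heuristic – purely a sketch, and not how sfd actually decides – is to count whether the EBCDIC space (0x40) or the ASCII space (0x20) is more common in the supposed text:

    #include <stddef.h>
    #include <stdio.h>

    /* Guess whether a block of supposed text is EBCDIC or ASCII by counting
       which space character (0x40 for EBCDIC, 0x20 for ASCII) is more common.
       Crude, but spaces are frequent enough in real text to make it workable. */
    static const char *guess_encoding(const unsigned char *buf, size_t len)
    {
        size_t ebcdic_spaces = 0, ascii_spaces = 0;

        for (size_t i = 0; i < len; i++) {
            if (buf[i] == 0x40) ebcdic_spaces++;
            if (buf[i] == 0x20) ascii_spaces++;
        }
        return ebcdic_spaces > ascii_spaces ? "EBCDIC" : "ASCII";
    }

    int main(void)
    {
        /* "A B C" encoded in EBCDIC: 0xC1 'A', 0x40 space, 0xC2 'B', 0x40 space, 0xC3 'C' */
        unsigned char header[] = { 0xC1, 0x40, 0xC2, 0x40, 0xC3 };
        printf("%s\n", guess_encoding(header, sizeof header));   /* prints EBCDIC */
        return 0;
    }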



Floating-point Numbers

Non-integer numbers – floating point numbers, or just “floats”, in the jargon – are stored in a way similar to scientific notation, as an integer together with an exponent. For example, the number 1.25 in decimal can be written as 125 × 10⁻², that is, 125 × 0.01.

In a computer the exponent is either a power of two or a power of sixteen rather than a power of ten, and the number part – called the mantissa – is usually a fraction rather than a whole number, but the principle is the same.

There are two schemes in use today for representing floats. Most computers – including every PC and all Unix and Linux workstations – store non-integer numbers using the scheme defined in IEEE standard 754. The only important exceptions are older IBM mainframe computers. Old IBM mainframes are no longer used in the seismic industry, but the way they store floating point numbers remains important since the original SEG-Y format specification required seismic trace data values to be recorded in IBM 32-bit floating point format. The sections below describe the technical details of both formats and point out a simple way of distinguishing between them in dumps.



IEEE Floating Point Number Formats

The IEEE 32-bit floating point format comprises a sign bit, an 8-bit exponent and a 23-bit mantissa, thus

Bit positions
  bit 31       bits 30 – 23     bits 22 – 0
    s              exp             frac

where
s is the sign bit, 1 for a negative value or zero for a positive value
exp is the 8-bit unsigned exponent (power of 2) biased by 127, and
frac is the 23-bit mantissa (fraction) field

The fraction is bit-normalized, that is, it is justified – shifted leftwards and the exponent correspondingly decremented – until the leading bit is 1 (set). Since the leading bit is always set it can be omitted (so the mantissa has 24 significant bits but only actually occupies 23 bits).

The number represented is thus

(−1)^s × 2^(exp−127) × (1.frac)

where the term “1.frac” indicates 1 + the binary fraction “frac”, with the assumed radix (binary) point after the 1.
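As a concrete illustration – just a sketch, assuming a 32-bit unsigned integer type and ignoring the special cases – here’s how the three fields can be pulled out of a raw bit pattern and the value rebuilt:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t bits = 0x40000000u;                 /* the IEEE pattern for 2.0 */

        uint32_t sign = (bits >> 31) & 0x1;          /* bit 31                    */
        uint32_t exp  = (bits >> 23) & 0xFF;         /* bits 30-23, biased by 127 */
        uint32_t frac = bits & 0x7FFFFF;             /* bits 22-0                 */

        /* (-1)^s * 2^(exp-127) * 1.frac, ignoring zero, denormals, Inf and NaN.
           8388608 is 2 to the 23rd, so frac / 8388608 is the binary fraction. */
        double value = (sign ? -1.0 : 1.0)
                     * ldexp(1.0 + frac / 8388608.0, (int)exp - 127);

        printf("%g\n", value);                       /* prints 2 */
        return 0;
    }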


 

The IEEE 64-bit floating point format, usually referred to as a “double” since it is double the length of a float, comprises two contiguous 32-bit words with one sign bit, an 11-bit exponent and a 52-bit mantissa, thus

Bit positions
  first word:   bit 63       bits 62 – 52     bits 51 – 32
                  s              exp          frac (first 20 bits)
  second word:  bits 31 – 0
                  frac (remaining 32 bits)

where s is the sign bit, exp is the 11-bit unsigned exponent of 2, biased by 1023 and frac is the 52-bit mantissa, which, as for the 32-bit format, is justified with the leading bit removed (so in effect the mantissa has 53 bits). On a SPARC, RS-6000 or other “big-endian” architecture the sign bit is the most significant bit of the word at the lower address. On a “little-endian” architecture such as Intel x86 the sign bit is the msb of the word at the higher address.

The number represented is

(−1)^s × 2^(exp−1023) × 1.frac

where, as for the 32-bit IEEE fp format, the term “1.frac” indicates 1 + the binary fraction “frac”, with the assumed radix point after the 1. The range of (positive) values that can be represented thus runs from about 2⁻¹⁰²² (roughly 2.2 × 10⁻³⁰⁸) up to just under 2¹⁰²⁴ (roughly 1.8 × 10³⁰⁸).

There are also rules for the special cases: an exponent field of all zeros represents zero if the fraction is also zero, or a “denormalized” number if it isn’t, while an exponent field of all ones represents infinity if the fraction is zero, or “NaN” (not a number) if it isn’t.



IBM Floating Point Number Format

The IBM 32-bit floating point format is rather different from the IEEE 32-bit format. It comprises a sign bit, a 7-bit exponent and a 24-bit mantissa, thus

Bit positions
  bit 31       bits 30 – 24     bits 23 – 0
    s              exp             frac

where s, exp and frac have the same meanings as before except that exp is the unsigned exponent of 16 (not 2, as for the IEEE formats) biased by 64. The difference in the exponent is crucial. Using an exponent that is a power of 16, not 2, means the mantissa is nibble-normalized not bit-normalized (that is, the leading nibble – 4 bits – is non-zero, but the leading bit may or may not be zero). The mantissa is justified to give the maximum precision. Consequently the leading nibble cannot be zero.

This provides a simple method for distinguishing IEEE from IBM floating point numbers in a hex dump, in that the first nibble of the mantissa in an IBM float cannot be zero (except for a true zero of course, when all the digits are zero). That is, a valid non-zero IBM float cannot take the form ‘xx0xxxxx’, where ‘x’ is any hexadecimal digit, 0 – 9 and A – F. In an IEEE 32-bit floating point number the entire mantissa can be zero, and often is (2.0, for example, is 0x40000000 in IEEE 32-bit floating point format).

The number represented is

(−1)^s × 16^(exp−64) × 0.frac

where the term “0.frac” indicates the binary fraction “frac” with an assumed radix (binary) point to the left.
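Putting that formula to work, here’s a rough C sketch of converting an IBM 32-bit float (already read, most significant byte first, into a 32-bit unsigned integer) into a native floating-point value – a real SEG-Y reader would need to deal with byte order and edge cases rather more carefully:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Convert an IBM 32-bit float, supplied as its raw bit pattern,
       into a native double using (-1)^s * 16^(exp-64) * 0.frac. */
    static double ibm_to_double(uint32_t bits)
    {
        int      sign = (bits >> 31) & 0x1;            /* bit 31                   */
        int      exp  = (bits >> 24) & 0x7F;           /* bits 30-24, biased by 64 */
        uint32_t frac = bits & 0x00FFFFFF;             /* bits 23-0                */

        if (frac == 0)
            return 0.0;                                /* true zero                */

        /* frac / 2^24 gives the fraction 0.frac; then scale by the power of 16. */
        double value = (frac / 16777216.0) * pow(16.0, exp - 64);
        return sign ? -value : value;
    }

    int main(void)
    {
        /* 0x42C80000: exp = 0x42 (66), frac = 0xC80000, i.e. 0.78125 * 16^2 = 200 */
        printf("%g\n", ibm_to_double(0x42C80000u));    /* prints 200 */
        return 0;
    }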



Binary Coded Decimal

Modern computers normally store numbers as binary or as text. However, for somewhat arcane reasons a third system, Binary Coded Decimal (BCD), was widely used in the early days. It remains relevant in the world of geophysical exploration because the SEG-D format uses BCD for some fields, including the all-important file number and format code at the start of General Header 1.

In BCD each decimal digit is stored as its binary value, one digit per nibble – that is, two decimal digits per byte (recall that a byte – 8 bits – is two 4-bit “nibbles”; a nibble can hold values in the range 0 – 15).

An example will make this clearer.

If the decimal integer 1234 were stored as an ordinary (two-byte) binary number, the actual contents of the memory, expressed in hexadecimal, would be “04 D2” (the binary value 0000 0100 1101 0010).

Were the same number to be expressed as ASCII text the memory contents would be ‘31 32 33 34’, that is, the ASCII codes for the characters ‘1’, ‘2’, ‘3’ and ‘4’.

Expressed as BCD, the memory contents would be:

12 34

in hex (0001 0010 0011 0100 in binary) – that is, the memory would contain the binary number ‘1’ in the first nibble (corresponding to the 1000’s), the binary number ‘2’ in the second nibble (corresponding to the 100’s), the binary number ‘3’ in the third nibble (corresponding to the 10’s) and the binary number ‘4’ in the fourth nibble (corresponding to the units).
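A small C sketch of unpacking packed BCD bytes – much as you might when pulling the file number out of a SEG-D general header (the byte values are just the example above):

    #include <stddef.h>
    #include <stdio.h>

    /* Decode packed BCD: each byte holds two decimal digits, one per nibble,
       most significant digit first. */
    static unsigned bcd_to_uint(const unsigned char *bytes, size_t nbytes)
    {
        unsigned value = 0;

        for (size_t i = 0; i < nbytes; i++) {
            value = value * 10 + (bytes[i] >> 4);     /* high nibble */
            value = value * 10 + (bytes[i] & 0x0F);   /* low nibble  */
        }
        return value;
    }

    int main(void)
    {
        unsigned char bcd[] = { 0x12, 0x34 };         /* BCD for 1234 */
        printf("%u\n", bcd_to_uint(bcd, sizeof bcd)); /* prints 1234 */
        return 0;
    }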
