Computer Science Essentials

When learning a computer language, and when using a program for scientific computing, it is helpful—sometimes essential—to have some basic knowledge about how computers work, and about how information is stored in files. The following is a brief introduction to some of the relevant topics. Very little detail is provided here, and very few links. For most of these topics one can easily find a Wikipedia or other web article of reasonable quality and considerable detail, but this is left as an exercise for the reader.

If most of this is new to you, don’t feel overwhelmed; you don’t need to learn all of it at once.

How is information represented?

All modern computers operate on binary numbers, in which each digit, called a bit, is a one or a zero. For example, 23 is represented in binary as 10111:

\[\begin{split}10111 &= 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 \\ &= 1 \times 16 + 0 \times 8 + 1 \times 4 + 1 \times 2 + 1 \times 1 \\ &= 23\end{split}\]

The minimal grouping of bits is by 8; 8 bits make one byte, which can represent \(2^8 = 256\) different numbers.

Bytes may occur singly, in groups of 2, 4, 8, or as a sequence of arbitrary length. Two bytes can represent \(2^{16} = 65536\) different values, and so on.
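
If you have a Python interpreter handy, you can check such conversions directly; the built-in bin function and the int constructor translate between binary strings and decimal integers:

In [1]: bin(23)
Out[1]: '0b10111'

In [2]: int('10111', 2)
Out[2]: 23

In [3]: 2 ** 16      # the number of values two bytes can represent
Out[3]: 65536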

Hexadecimal

We normally write numbers using the decimal system—base 10—but in computing applications it is common to see numbers in base 16, called hexadecimal, or “hex” for short.

The idea is simple: each digit multiplies a power of 16 rather than a power of 10. That means one needs 16 digit symbols, so the first six letters of the alphabet are used as digits following the 10 decimal symbols: A is 10, B is 11, etc.

Because \(16 = 2^4\), each hex digit represents 4 bits (sometimes called a “nybble”), so a byte can be represented by two hex digits. The decimal number 125, for example, is “7D” in hex:

\[125 = 7 \times 16^1 + 13 \times 16^0\]

The 6 letter-digits can be upper case or lower case. A prefix of ‘0x’ or ‘0X’ (the first character is zero) is often used to indicate a hex number, so the example above might be printed as ‘0X7D’.

HTML color codes are given as the pound sign followed by 6 hex digits that are interpreted from left to right as 3 single-byte numbers (pairs of hex digits) for the amount of red, green, and blue, on a 0-255 scale. See this handy color picker, for example: https://html-color-codes.info/.

Less commonly, one might see base 8, called octal, which uses only digits 0-7.
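
Python’s built-in functions provide a quick way to move among these bases; for example:

In [1]: hex(125)
Out[1]: '0x7d'

In [2]: int('0x7D', 16)
Out[2]: 125

In [3]: int('FF', 16)      # one full byte, as in an HTML color code
Out[3]: 255

In [4]: oct(125)
Out[4]: '0o175'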

Numbers

Numeric information is represented as integers, or as floating point numbers.

Integers can be signed or unsigned, and of length 1, 2, 4, or 8 bytes. A one-byte signed integer can represent the inclusive range -128 to 127; a one-byte unsigned integer can represent the range 0 to 255. An \(n\)-byte signed integer can span \((-2^{8n-1}) \mbox{ to } (2^{8n-1} - 1)\), and if unsigned, the range is \(0 \mbox{ to } (2^{8n} - 1)\).
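
If the numpy package is available (it is used in examples later in this section), its iinfo function reports these ranges for each integer type:

In [1]: import numpy as np

In [2]: np.iinfo(np.int8)        # 1-byte signed
Out[2]: iinfo(min=-128, max=127, dtype=int8)

In [3]: np.iinfo(np.uint8)       # 1-byte unsigned
Out[3]: iinfo(min=0, max=255, dtype=uint8)

In [4]: np.iinfo(np.int32)       # 4-byte signed
Out[4]: iinfo(min=-2147483648, max=2147483647, dtype=int32)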

Floating point numbers use a binary version of scientific notation (e.g., \(234.82 = 2.3482 \times 10^2\)), so that a given number of bytes can represent a vastly larger numeric range than can an integer of the same size, and so that very small as well as very large numbers can be expressed. To represent a floating point number, a sequence of (usually) 4 or 8 bytes is treated as a single string of bits. In the 4-byte case, for example, there is a sign bit, then 8 bits for the exponent, and then 23 bits for the significant digits.

Four-byte floating point numbers are termed “single precision”, and sometimes called “floats”. Eight-byte floating point numbers are termed “double precision”, and sometimes called “doubles”. As computing power and storage have increased, the use of double precision has become routine. There are still many cases, however, where single precision is adequate, and it is well worth using to save storage.
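
Similarly, numpy’s finfo function shows the difference in precision between the two sizes; “eps” is the smallest increment distinguishable from 1.0 (the exact form of the output can vary slightly with the numpy version):

In [1]: import numpy as np

In [2]: np.finfo(np.float32).eps      # single precision: about 7 decimal digits
Out[2]: 1.1920929e-07

In [3]: np.finfo(np.float64).eps      # double precision: about 16 decimal digits
Out[3]: 2.220446049250313e-16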

Operating systems come in “32-bit” and “64-bit” flavors, with the latter gradually becoming standard for personal and larger computers. This refers to the ability of the operating system to address memory, and therefore to the maximum amount of memory that can be used in a straightforward way. It does not imply any limitation on the types of numbers that can be used in a calculation; a 32-bit operating system supports double precision (64-bit numbers) just as well as a 64-bit operating system does.

Text

Text is also represented as a sequence of bytes. This used to be very simple; a standard called ASCII maps letters, numbers, punctuation marks, and “control characters” to bytes; see https://www.ascii-code.com/, for example. During the last decade, however, a new and much more complex mapping system called “unicode” has become common. Its big advantage is that it can represent a huge variety of characters, including foreign languages. It specifies several different ways of doing this–several different “encodings”. The one we deal with most directly–on the web, in email–is called “utf-8”. It uses the same single-byte mapping as ASCII for the ASCII characters, and uses variable-length sets of bytes for other characters. See https://en.wikipedia.org/wiki/UTF-8 for more explanation.
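
In Python 3, a text string is a sequence of unicode characters, and encoding it to bytes makes the difference visible: ASCII characters occupy a single byte each, while other characters need more:

In [1]: 'A'.encode('utf-8')
Out[1]: b'A'

In [2]: 'Å'.encode('utf-8')           # a non-ASCII character takes two bytes here
Out[2]: b'\xc3\x85'

In [3]: b'\xc3\x85'.decode('utf-8')
Out[3]: 'Å'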

How is information stored in files?

A file is just a stored sequence of bytes. A primary job of software is to interpret those bytes as information. How it does that depends on the format of the file. There is an unlimited variety of formats, some of which are well-known, well-documented standards, others of which are custom-designed for particular applications and perhaps only for temporary use. The filename extension is often used to indicate the format, but this is just a loose convention, and cannot be relied upon.

Text files

If the entire string of bytes is intended to be interpreted as ASCII or unicode (normally utf-8), then the file is a text file, regardless of whether it contains a letter to the editor, a table of numbers, a web page, or source code in some computer language. Evidently, then, there are subcategories of text file, including:

plain text

No additional major formal structure; just strings of words and punctuation, separated into lines by one or two control characters. Unfortunately, even this simplest text format comes in different flavors depending on the line separators. Linux and OS X use a single linefeed character (also called a “newline”); Windows uses a pair, a carriage return followed by a linefeed.

XML

Extensible Markup Language is a very general framework for organizing information in a nested way.

HTML

Hypertext Markup Language applies the XML framework to web pages.

reST

reStructuredText is the markup language used for the source of this web page. It is one of many of its type, designed to be reasonably easy to read and write with a plain text editor, but permitting considerable formatting when converted to html. To see an example—the source of this web page—click the “Show Source” link near the bottom of the sidebar.

LaTeX

This is a much more elaborate and flexible markup language—really a typesetting language, designed with particular attention to the needs of technical work including mathematics. I recommend that you use it to write your thesis, dissertation, or journal publications, especially if mathematics is included.

flat ascii

A flat ascii file is a simple representation of a table, in which each line is a record with the same number of fields. Fields are usually separated by one or more space and/or tab characters—that is, by “whitespace”.

CSV

Comma-separated value files are similar to flat ascii, but the field separator is a comma instead of whitespace. This makes them more compact but less human-readable. (A small reading example appears after this list.)

JSON

See http://www.json.org/.

source code

Text that constitutes a script, program or other code unit in a given programming language.

And there are many more…
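
As a small illustration of the CSV case (the little station table here is made up for the example), the standard-library csv module splits each record into its fields:

In [1]: import csv, io

In [2]: text = "station,lat,lon\nA1,21.3,-158.1\nA2,22.8,-158.0\n"

In [3]: list(csv.reader(io.StringIO(text)))
Out[3]: [['station', 'lat', 'lon'], ['A1', '21.3', '-158.1'], ['A2', '22.8', '-158.0']]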

Image files

There are two fundamentally different ways that plots or other graphical information can be stored: as a description of what is to be drawn (e.g., a line from one point to another), or as a literal image, with the color of each pixel in a rectangular array. The former method is called “vector graphics”, and the latter is “bit-mapped” or “rasterized”. Each has advantages and disadvantages for any given type of graph or image. In general, vector graphics are superior for scientific plotting until some threshold of complexity is reached; when the information to be displayed is essentially an image, such as a photo or a high-resolution false-color image from a satellite-based sensor, then a rasterized representation is more suitable—more compact and faster to render on a screen or on the printed page.

Within the rasterized category are the subcategories of uncompressed, compressed with no loss of information, and compressed with a loss of information. We are concerned here only with the latter two.

PNG

Pronounced “ping”, the png format is a good general-purpose rasterized format with lossless compression. It is particularly suitable for web pages; it is space-efficient, it is rendered quickly, and it produces no artifacts. It is good for plots; but for dense images such as photographs it does not provide sufficient compression.

JPEG

Jpeg (extension is usually “.jpg”) is a lossy rasterized format designed for photographs and similar images that contain too much information to be compressed well by png. It should not be used for line graphs, contour plots, etc.—use it in the scientific context only when a png would be too large, and the loss of information is acceptable. When used inappropriately it results in a larger file than a png, and it adds ugly visual artifacts.

Vector and combined vector/rasterized formats include the following:

PDF

Portable Document Format is an open specification developed by Adobe. It is widely used for entire documents, but it is equally valuable for single plot files. It can contain text in an unlimited variety of typefaces, vector graphics, and embedded rasterized graphics. When generating a plot file in pdf, the embedded rasterized capability should be used only for image-like content. Pdf is quite space-efficient because it is generated using a compression algorithm.

Postscript

Postscript (extension “.ps”) and Encapsulated Postscript (extension “.eps”) are largely superseded by pdf; pdf is roughly a compressed and enhanced version of postscript. Don’t use postscript except at the last stage, typically when a publisher requires it. Postscript files can be generated from pdf.

SVG

Scalable Vector Graphics is a relatively new web-centric format. It is based on XML. Support for it in browsers and other software is still not entirely solid, so use it with caution if at all.

All vector formats are scalable; that is, suitable software can zoom in on a plot with no loss of sharpness.
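
As a sketch of how the choice plays out in practice (assuming the matplotlib plotting package, commonly used with IPython, is installed), the same figure can be written in either kind of format:

In [1]: import numpy as np

In [2]: import matplotlib.pyplot as plt

In [3]: x = np.linspace(0, 10, 200)

In [4]: plt.plot(x, np.sin(x));          # a simple line plot

In [5]: plt.savefig('sine.pdf')          # vector: stays sharp at any zoom

In [6]: plt.savefig('sine.png', dpi=150) # rasterized: resolution fixed by dpi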

Data files

One category of data file has already been introduced: text files such as the flat ascii, CSV, and XML variants. Anything can be represented as text—but it can be extremely inefficient in space and in the computing resources required to handle it. Therefore, serious scientific data applications usually rely primarily on binary data formats, in which the arrangement of bytes in the file corresponds closely to the arrangement of bytes in memory when the file is read.

The two most common general-purpose binary formats used in oceanography are NetCDF and HDF. Both are “self-describing” formats in that they inherently include information that an application needs in order to make sense of the file contents.

NetCDF version 3

NetCDF3 is very widely used for oceanographic data and numerical model output. It is designed mainly for arrays of numbers. The arrays can have any number of dimensions; for example, the output of a numerical model typically would have 3 spatial dimensions and 1 time dimension. Information about the file as a whole, and about individual arrays within it, can be attached as named variables called “attributes”.

HDF (version 5)

Hierarchical Data Format is a flexible binary format designed to be efficient for very large files and complex assemblages of data. In addition to the capabilities and general characteristics of NetCDF3, it adds a nested structure (so you can organize data in the file to correspond to a directory structure). It includes options for breaking a very large array into smaller “chunks”, which can make access to cross sections of an array much faster than would otherwise be the case.

NetCDF version 4

NetCDF4 is actually a hybrid of HDF5 and NetCDF; it uses the underlying HDF5 library and storage format, but with an extended version of the NetCDF3 logical structure and conventions. It is quite new, so NetCDF4 files are not yet commonly found on the web. The NetCDF4 library and utilities are backwards-compatible, and can read and write version 3 files.

Hint: to see the structure–a table of contents–of any NetCDF file, use the ncdump command-line utility that comes with the NetCDF library. If the file is called “output_file.nc”, the command is this:

ncdump -h output_file.nc

The “-h” option is necessary–otherwise you will get a complete ascii dump of the file contents. To get a summary of ncdump options, execute it with no options and no arguments. To get a full explanation, execute:

man ncdump
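
The same table of contents can be obtained from within Python, assuming the netCDF4 package is installed and a file named “output_file.nc” is present:

In [1]: from netCDF4 import Dataset

In [2]: nc = Dataset('output_file.nc')

In [3]: print(nc)                    # dimensions, variables, and global attributes

In [4]: print(list(nc.variables))    # just the variable names

In [5]: nc.close()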

Often the files one needs to access—perhaps provided by a colleague or served on the web—are not in one of these standard formats, but are in the format produced by a given commercial software product. Instruments such as a CTD or an acoustic Doppler current profiler (ADCP) typically produce raw data in a format designed and used only by that manufacturer. To read it, one must either rely on the software provided by the manufacturer, or write or find routines for this purpose. Typically these formats are well-documented by the instrument manufacturers and are not difficult to handle.

Another category of manufacturer-specific file format is exemplified by the binary “matfiles” generated by Matlab:

matfile

Matlab’s binary files can be written in any of several formats, all with the default extension “.mat”. The most recent version actually uses HDF5 (in a way that makes it difficult, but not impossible, to decipher), but most of the matfiles one encounters are written with simpler structures. Matfiles are of course designed for efficiently moving numeric arrays and other Matlab variables between a Matlab session and a file on disk. Fortunately, they are often simple enough that reading and writing them with software other than Matlab is not an insurmountable task. The exceptions are files in which newer Matlab data structures (“table”, perhaps others) were saved; their format on disk is apparently undocumented, and these files can be read only with Matlab. This illustrates a danger of using proprietary formats for data storage.
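
For example, the scipy package can read most matfiles that were not written in the newest (HDF5-based) format; the filename here is only hypothetical:

In [1]: from scipy.io import loadmat

In [2]: contents = loadmat('example.mat')    # a dictionary of names and arrays

In [3]: print(sorted(contents))              # variable names, plus '__header__', etc.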

Older data files are sometimes in the raw format in which ancient Fortran wrote them. To deal with these you need to understand how the Fortran binary write operation works, and you need some independent information about exactly what is in the file. These files are not self-describing, so you can’t use a simple utility to find out what a file contains, the way you can with NetCDF, for example.

A similar situation can arise with custom non-self-describing files written by software in another language, usually C—you still need independent information about what is in the file, but at least the arrangement is likely to be more straightforward and easier to read than in the case of a file written by Fortran.

Archives and compression

It is bad enough that there are all these different ways data can be arranged in a file–and then on top of it all, a file might be compressed in any of several ways, and/or aggregated with other files into a single-file archive. Fortunately, this additional level of complexity is easy to understand and deal with, and requires only familiarity with a few tools and conventions.

tar

The “tar” command, standing for Tape Archive, is a very old but still very useful program for aggregating a set of files, or a whole directory tree, into a single file. In the Unix/Linux world, software source code is typically distributed in a “tarball”, which is a compressed tar aggregation of a directory tree. The typical extension is “.tar.gz” or “.tgz”. This is for the most common case, when “gzip” compression has been used. The tar command can be used to simultaneously aggregate and compress, or the reverse, with many options.

zip

Zipfiles are similar to tarballs, but usually originate in WindowsWorld. The usual extension is…you guessed it. There are many programs with different names for generating and unzipping zipfiles; on Linux and OS X, the standard ones are helpfully named zip and unzip.

gzip/gunzip

For compressing or uncompressing a file at a time, one usually uses this pair of commands. File compression is most effective with text files. Binary files are usually less compressible, and once any file has been compressed, trying to compress it again will make it slightly larger. There is no point in running gzip on a pdf file, for example, because it is already internally compressed.
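
All of these are ordinary command-line tools, but the same operations can also be done from within Python using standard-library modules; the filenames below are hypothetical:

In [1]: import gzip, tarfile

In [2]: with gzip.open('notes.txt.gz', 'rt') as f:     # read a gzipped text file
   ...:     text = f.read()
   ...:

In [3]: with tarfile.open('code-1.0.tar.gz') as tar:   # unpack a tarball
   ...:     tar.extractall('code-1.0')
   ...: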

There are many other compression algorithms and programs, but for now, familiarity with the above should be adequate.

Byte order

We left something out of the discussion of numeric types, above; depending on the kinds of files you need to deal with, the following might be important. We noted that most numeric types are represented using 2, 4, or 8 bytes—but in what order do the bytes appear in the machine’s memory, or on disk? There are two possibilities, and yes, both are in use:

Big-endian

The most significant byte comes first, just as we write numbers from left to right. Consider the number 260 as a 2-byte integer.

\[260 = 1 \times 256^1 + 4 \times 256^0\]

Therefore the value of the most significant byte is 1, and the value of the least significant byte is 4. On disk, in big-endian order, the 1 would be followed by the 4.

Little-endian

The reverse: the least significant byte comes first, so the 2-byte integer for 260 would be stored as a 4 followed by a 1.

Most computers in use today are little-endian, so that is the native order in which they read bytes from disk, and write them out to disk. Not so very long ago, however, big-endian computers were common (e.g. Sun workstations and PowerPC Macs), so there are still plenty of files around with big-endian order. The classic NetCDF format is built on XDR, a data representation standard developed by Sun along with much networking infrastructure, so big-endian order, the standard “network byte order”, is the order found in NetCDF files. Fortunately, one doesn’t normally read NetCDF files as raw byte streams; one uses a software library that automatically does the low-level reading and translates the byte order to that of the machine, if necessary. Nevertheless, if you need to work with a binary file that is not in a well-known format for which you have a software library, then you simply must know the byte order used in that file, and perform the byteswapping (reordering) if necessary.
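
With numpy one can see the two orders directly, and specify the byte order explicitly when reading a raw binary file; the filename and the assumption that it holds big-endian 4-byte floats are hypothetical:

In [1]: import numpy as np

In [2]: np.array([260], dtype='>i2').tobytes()    # big-endian 2-byte integer
Out[2]: b'\x01\x04'

In [3]: np.array([260], dtype='<i2').tobytes()    # little-endian: bytes reversed
Out[3]: b'\x04\x01'

In [4]: data = np.fromfile('old_file.bin', dtype='>f4')   # read big-endian floats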

Mathematics

When using a computer for arithmetic or mathematics, one must pay attention to the kind of numbers being used—the size, and whether the numbers are integers or floating point. In addition, different computer languages can give different results for what would seem to be the same mathematical operation on the same numbers. Here are a few things to watch out for, starting with integers, and then considering floating point.

Integer arithmetic

True integer division yields an integer result. If the numerator and denominator are both positive or both negative, the quotient will be truncated—that is, the remainder will be discarded. But if the numerator is negative and the denominator is positive, or the reverse, the quotient depends on the particular compiler or computer language being used! So, with one language you might find that \(-3/2 = -2\) and with another it would be \(-3/2 = -1\). If this matters for the calculations you are doing, then you need to do a simple test to find out what the behavior is, and program accordingly.

In some languages, the “/” operator yields a floating point result even when both arguments are integers. In that case there is normally another symbol for the true integer division operation.
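
Python 3 behaves this way: the “/” operator always returns a floating point number, and “//” is the integer division operator, which rounds toward minus infinity:

In [1]: -3 / 2
Out[1]: -1.5

In [2]: -3 // 2      # floor division: rounds toward minus infinity
Out[2]: -2

In [3]: 7 // 2       # both arguments positive: the remainder is simply discarded
Out[3]: 3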

Integer addition, subtraction, and multiplication present a different sort of pitfall: the true result of the operation might not match the integer size and type of the arguments. For example, if you are working with single-byte signed integers, and you add 10 to 125 expecting to see 135, you might see -121 instead. The largest signed single-byte integer value is 127, so the computer has silently wrapped the overflow around into the negative range.
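
Python’s own integers grow as needed and never overflow, but fixed-size integers, such as those in numpy arrays, behave exactly as described, and the wraparound is silent:

In [1]: import numpy as np

In [2]: a = np.array([125], dtype=np.int8)

In [3]: a + 10       # 135 does not fit in a signed byte, so it wraps around
Out[3]: array([-121], dtype=int8)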

Floating point arithmetic

Floating point numbers can represent a very large but finite range of values, and only with a finite degree of precision. Because of the finite precision and the binary representation, a number that has an exact decimal representation usually does not have an exact floating point representation, and floating point arithmetic is not exact. For example, we can use an IPython session to test whether an expression that is mathematically true turns out to be true in a floating point computation:

In [1]: (1.0 / 9999.11) * 9999.11 == 1
Out[1]: False

One consequence of this is that it is often a bad idea to test equality between floating point numbers. Instead, test that the absolute value of the difference between them is below some suitably tiny threshold.
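
Continuing the session above, one can compare against a threshold directly, or use a helper such as math.isclose (numpy provides a similar isclose function for arrays):

In [2]: x = (1.0 / 9999.11) * 9999.11

In [3]: abs(x - 1.0) < 1e-12
Out[3]: True

In [4]: import math

In [5]: math.isclose(x, 1.0)
Out[5]: True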

NaN

In the earlier introduction to floating point number representation, we left out something: there are a few bit patterns that are reserved to indicate results that are not numbers. These include positive infinity, negative infinity, and, most commonly, “not a number”, or NaN. NaN actually comes in more than one flavor; it is not just a single bit pattern for a given floating point type. There are special rules governing how NaN is handled in comparisons and arithmetic operations. They boil down to these: any arithmetic operation involving a NaN yields a NaN; and any ordinary comparison with a NaN argument yields False. So, it is counter-intuitive, but the question “does NaN equal NaN?” is answered with “no”! To test whether a variable contains a NaN, you must use a function supplied for exactly that purpose. Here is an illustration using the numpy module in IPython:

In [1]: import numpy as np

In [2]: 1 + np.nan
Out[2]: nan

In [3]: np.nan == np.nan
Out[3]: False

In [4]: np.isnan(np.nan)
Out[4]: True