Tutorial tabarray module

Introduction

Many applications send output of numbers to plain text (ASCII) files in a rectangular form. I.e. they store human readable numbers in one ore more columns in one or more rows. In one of our example figures to illustrate the use of graticules we plotted coastline data from a text file with coordinates in longitude and latitude in a wcs supported projection.

If you want to plot such data or you need data to do calculations, then you need a function or method to read the data into for example NumPy arrays. Package SciPy provides a function read_array() in module scipy.io.array_import. It has all necessary features to do the job but it is very slow when it needs to read big files with many numbers.

We wrote a fast version in module tabarray which is also part of the Kapteyn Package Its speed is comparable to a well known module that is no longer supported, called TableIO. The module interfaces with C and is written in Cython. Such interfaces improve the speed of reading the data with a factor of 5-10 compared to Python based solutions. Module tabarray has a simple interface and an object oriented interface. For the simple interface functions we used the same function names as TableIO. This will simplify the migration from TableIO to tabarray.

Simple interface functions

Function readColumns

A typical example of a text data file is given below. It has 3 columns and several rows of which a number of rows represent a comment. To experiment with tabarray functions and methods you can copy this data and store it as testdata.txt on disk.:

! ASCII file 'testdata.txt' 12-09-2008
!
! X    |   Y   |    err
23.4    -44.12   1.0e-3
19.32   0.211    0.332
# Next numbers are include as
-22.2   44.2     3.2
1.2e3   800      1

Assuming you have some knowledge of the contents and structure of the file, it is easy to read it into NumPy arrays. We use variables x, y and err to represent the columns. The comment characters are ‘#’ and ‘!’ and are included in the comment string which is the second parameter of the tabarray.Readcolumns() function. The file on disk is identified by its name. There is no need to open it first. Use the commands given below to read all the data from our test file.

>>> from kapteyn import tabarray
>>> x,y,err = tabarray.readColumns('testtable.txt','#!')
>>> print(x)
[   23.4 ,    19.32,   -22.2 ,  1200.  ]

All numbers are converted to floating point. Blank lines at the end of a file are ignored. Blank lines in the middle of a file are treated as comment lines. Suppose you want to read only the second and third column, then one needs to specify the columns. The first column has index 0.

>>> y,err = tabarray.readColumns('testtable.txt','#!', cols=(1,2))
>>> print(err)
[  1.00000000e-03   3.32000000e-01   3.20000000e+00   1.00000000e+00]

Note

Column and row numbers start with 0. The last row or last column is addressed with -1.

To make a selection of rows you can specify the rows parameter. Rows are given as a sequence and the first row in a file has index 0. Suppose you want to read the last two rows from the last two columns in the text file together with the first row, then we could write:

>>> x,y = tabarray.readColumns('testtable.txt','#!', cols=(1,2), rows=(2,3,0))
>>> print(x)
[  44.2   800.    -44.12]

To read only the last row in your data you should use rows=(-1,).

If you know beforehand which lines of the data files should be read, you can set the converter to read only the lines in parameter lines. For a big text file (called satview.txt) containing longitudes and latitudes of positions in two columns, we are only interested in the first 1000 lines containing relevant data. Then the lines parameter saves time. So we use the following command:

>>> lons, lats = tabarray.readColumns('satview.txt','s', lines=(0,1000))

Comment lines in this satview.txt file do not start with a common comment character, instead it starts with the word ‘segment’ so our comment character becomes ‘s’.

Function writeColumns

One dimensional array data can also be written back to a file on disk. The function for writing data is called tabarray.writeColumns(). Its first argument is the name of the file. The second is a sequence with columns. With the columns ‘x’ and ‘y’ from the testtable.txt file in the previous section, we want to write a new file where column ‘y’ is the first column and column ‘x’ is the second. Here is the code to do this:

>>> x,y,err = tabarray.readColumns('testtable.txt','#!')
>>> tabarray.writeColumns('testout.txt', (y,x))
# Contents on disk is:
     -44.12       23.4
      0.211      19.32
       44.2      -22.2
        800       1200

The columns are one dimensional NumPy arrays. This implies that we can do some array arithmetic on the columns. We could have changed our columns to:

>>> tabarray.writeColumns('testout.txt', (y*y,x*y,x*x))
# Contents on disk is:
    1946.57   -1032.41     547.56
   0.044521    4.07652    373.262
    1953.64    -981.24     492.84
     640000     960000   1.44e+06

which makes this function very powerful.

It is common practice to start text data file with some comments. The next code shows how to write a date and the name of the author in a new file with function tabarray.writeColumns(). The comments parameter is a list with strings. Each string is written on a new line at the start of the text file.

>>> when = datetime.datetime.now().strftime("Created at: %A (%a) %d/%m/%Y")
>>> author = 'Created by: Kapteyn'
>>> tabarray.writeColumns('testout.txt', (y*y,x*y,x*x), comment=[when, author])

The header of the file will look similar to this:

# Created at: Thursday (Thu) 18/09/2008
# Created by: Kapteyn

Tabarray objects and methods

Reading data and making selections

A tabarray object is created with method tabarray.tabarray(). Again we want to read the data from file ‘testtable.txt’.

>>> t = tabarray.tabarray('testtable.txt', '#!')
>>> print(t)
[[  2.34000000e+01  -4.41200000e+01   1.00000000e-03]
 [  1.93200000e+01   2.11000000e-01   3.32000000e-01]
 [ -2.22000000e+01   4.42000000e+01   3.20000000e+00]
 [  1.20000000e+03   8.00000000e+02   1.00000000e+00]]

Selections are made with methods tabarray.rows() and tabarray.columns().

Warning

The rows() method needs to be applied before the columns() method because for the latter, the array t is transposed and its row information is changed.

With this knowledge we can combine the methods in one statement to read a selection of lines and a selection of columns into NumPy arrays.

>>> x,y = t.rows((2,3)).columns((1,2))
>>> print(x)
[  44.2  800. ]
>>> print(y)
[ 3.2  1. ]

If you want to select rows in a NumPy vector that is already filled with data from disk after applying the lines and/or rows parameters you still can extract data using NumPy indexing:

>>> lines = [0,1,3]
>>> print(err[lines])
[ 0.001  0.332  1.   ]

Messy files

ASCII text readers should be flexible and robust. Examine the contents of the next ASCII data file (which we stored on disk as messyascii.txt):

! Very messy data file

23.343, 34.434, 1e-20
10, 20, xx


2 4      600
-23.23, -0.0002, -3x7
   # Some comment

40, 50.0, 70.2

It contains blank lines at the end and between the data and it has three different separators (spaces, comma’s and tabs). Also it contains data that cannot be converted to numbers. Instead of an exception we want the converter to substitute a user given value for a string that could not be converted to a number. Assume that a user wants -999 for those bad entries, then the numbers should be read by:

>>> t= tabarray.tabarray('messyascii.txt','#!', sepchar=' ,\t', bad=-999)
>>> print(t)
[[  2.33430000e+01   3.44340000e+01   1.00000000e-20]
 [  1.00000000e+01   2.00000000e+01  -9.99000000e+02]
 [  2.00000000e+00   4.00000000e+00   6.00000000e+02]
 [ -2.32300000e+01  -2.00000000e-04  -9.99000000e+02]
 [  4.00000000e+01   5.00000000e+01   7.02000000e+01]]
>>> x,y = t.rows(range(1,4)).columns((1,2))  # Extract some rows and columns
>>> print(x)
[  2.00000000e+01   4.00000000e+00  -2.00000000e-04]
>>>print(y)   # Contains the 'bad' numbers
[-999.  600. -999.]

Note that we could have used function tabarray.readColumns() also to get the same results:

>>> x,y = tabarray.readColumns('messyascii.txt','#!', sepchar=' ,/t', bad=-999, rows(range(1,4)), cols=(1,2))

Note

Probably more useful as a bad number indicator is the ‘Not a Number’ (NaN) from NumPy. Use it as in: bad=numpy.nan and test on these numbers with NumPy’s function: isnan().

Glossary

ASCII
American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet.