Tutorial tabarray module
==========================

.. highlight:: none
   :linenothreshold: 1000

Introduction
------------

Many applications write numbers to plain text (:term:`ASCII`) files in a
rectangular form. That is, they store human readable numbers in one or more
columns and one or more rows. In one of our example figures that illustrates
the use of graticules, we plotted coastline data from a text file with
coordinates in longitude and latitude in a :mod:`wcs` supported projection.
If you want to plot such data, or need the data for calculations, you need
a function or method that reads the data into, for example, NumPy arrays.

Package SciPy provides a function *read_array()* in module
:mod:`scipy.io.array_import`. It has all the necessary features to do the job,
but it is very slow when it needs to read big files with many numbers.
We wrote a fast version in module :mod:`tabarray`, which is also part of the
Kapteyn Package. Its speed is comparable to that of *TableIO*, a well known
module that is no longer supported. The :mod:`tabarray` module interfaces
with C and is written in *Cython*. Such interfaces speed up reading the data
by a factor of 5 to 10 compared to Python based solutions.

Module :mod:`tabarray` has a simple interface and an object oriented interface.
For the simple interface functions we used the same function names as
*TableIO*. This simplifies the migration from *TableIO* to :mod:`tabarray`.

Simple interface functions
--------------------------

Function readColumns
....................

A typical example of a text data file is given below. It has 3 columns and
several rows, a number of which are comment lines.
To experiment with tabarray functions and methods you can copy this data
and store it as *testtable.txt* on disk::

   ! ASCII file 'testtable.txt' 12-09-2008
   !
   ! X | Y | err
   23.4 -44.12 1.0e-3
   19.32 0.211 0.332
   # Next numbers are include as
   -22.2 44.2 3.2
   1.2e3 800 1

Assuming you have some knowledge of the contents and structure of the file,
it is easy to read it into NumPy arrays.
We use variables *x*, *y* and *err* to represent the columns.
The comment characters are '#' and '!'. They are passed together in the
comment string, which is the second parameter of the
:func:`tabarray.readColumns` function.
The file on disk is identified by its name. There is no need to open it first.
Use the commands given below to read all the data from our test file.

>>> from kapteyn import tabarray
>>> x,y,err = tabarray.readColumns('testtable.txt','#!')
>>> print(x)
[ 23.4 , 19.32, -22.2 , 1200. ]

All numbers are converted to floating point.
Blank lines at the end of a file are ignored.
Blank lines in the middle of a file are treated as comment lines.

Suppose you want to read only the second and third column. Then you need to
specify the columns. The first column has index 0.

>>> y,err = tabarray.readColumns('testtable.txt','#!', cols=(1,2))
>>> print(err)
[ 1.00000000e-03 3.32000000e-01 3.20000000e+00 1.00000000e+00]

.. note::

   Column and row numbers start with 0. The last row or last column
   is addressed with -1.

To make a selection of rows you can specify the *rows* parameter.
Rows are given as a sequence and the first row in a file has index 0.
Suppose you want to read the last two rows from the last two columns in the
text file, followed by the first row. Then we could write:

>>> x,y = tabarray.readColumns('testtable.txt','#!', cols=(1,2), rows=(2,3,0))
>>> print(x)
[ 44.2 800. -44.12]

To read only the last row in your data you should use ``rows=(-1,)``.

If you know beforehand which lines of the data file should be read, you can
restrict the converter to those lines with the parameter *lines*.
For a big text file (called *satview.txt*) containing longitudes and latitudes
of positions in two columns, we are only interested in the first 1000 lines
containing relevant data. Then the *lines* parameter saves time.
So we use the following command:

>>> lons, lats = tabarray.readColumns('satview.txt','s', lines=(0,1000))

Comment lines in this *satview.txt* file do not start with a common comment
character. Instead they start with the word 'segment', so our comment
character becomes 's'.

Function writeColumns
.....................

One dimensional array data can also be written back to a file on disk.
The function for writing data is called :func:`tabarray.writeColumns`.
Its first argument is the name of the file. The second is a sequence of
columns. With the columns 'x' and 'y' from the *testtable.txt* file in the
previous section, we want to write a new file in which column 'y' is the
first column and column 'x' is the second. Here is the code to do this:

>>> x,y,err = tabarray.readColumns('testtable.txt','#!')
>>> tabarray.writeColumns('testout.txt', (y,x))
# Contents on disk:
-44.12 23.4
0.211 19.32
44.2 -22.2
800 1200

The columns are one dimensional NumPy arrays. This implies that we can do
array arithmetic on the columns before writing them. We could have changed
our columns to:

>>> tabarray.writeColumns('testout.txt', (y*y,x*y,x*x))
# Contents on disk:
1946.57 -1032.41 547.56
0.044521 4.07652 373.262
1953.64 -981.24 492.84
640000 960000 1.44e+06

Being able to write computed columns directly makes this function very
flexible.

It is common practice to start a text data file with some comments.
The next code shows how to write a date and the name of the author in
a new file with function :func:`tabarray.writeColumns`.
The *comment* parameter is a list of strings. Each string is written
on a new line at the start of the text file.

>>> import datetime
>>> when = datetime.datetime.now().strftime("Created at: %A (%a) %d/%m/%Y")
>>> author = 'Created by: Kapteyn'
>>> tabarray.writeColumns('testout.txt', (y*y,x*y,x*x), comment=[when, author])

The header of the file will look similar to this::

   # Created at: Thursday (Thu) 18/09/2008
   # Created by: Kapteyn

Tabarray objects and methods
----------------------------

Reading data and making selections
..................................

A *tabarray* object is created with :meth:`tabarray.tabarray`.
Again we want to read the data from file 'testtable.txt'.

>>> t = tabarray.tabarray('testtable.txt', '#!')
>>> print(t)
[[ 2.34000000e+01 -4.41200000e+01 1.00000000e-03]
 [ 1.93200000e+01 2.11000000e-01 3.32000000e-01]
 [ -2.22000000e+01 4.42000000e+01 3.20000000e+00]
 [ 1.20000000e+03 8.00000000e+02 1.00000000e+00]]

Selections are made with methods :meth:`tabarray.rows` and
:meth:`tabarray.columns`.

.. warning::

   The *rows()* method needs to be applied before the *columns()* method
   because for the latter, the array *t* is transposed and its row
   information is changed.

With this knowledge we can combine the methods in one statement to read
a selection of lines and a selection of columns into NumPy arrays.

>>> x,y = t.rows((2,3)).columns((1,2))
>>> print(x)
[ 44.2 800. ]
>>> print(y)
[ 3.2 1. ]

If a NumPy vector has already been filled with data from disk (after
applying the *lines* and/or *rows* parameters), you can still select rows
from it using NumPy indexing:

>>> lines = [0,1,3]
>>> print(err[lines])
[ 0.001 0.332 1. ]

Messy files
...........

ASCII text readers should be flexible and robust. Examine the contents of the
next ASCII data file (which we stored on disk as *messyascii.txt*)::

   ! Very messy data file

   23.343, 34.434, 1e-20
   10, 20, xx

   2 4 600
   -23.23, -0.0002, -3x7
   # Some comment
   40, 50.0, 70.2

It contains blank lines at the end and between the data and it has three
different separators (spaces, commas and tabs).
It also contains data that cannot be converted to numbers. Instead of raising
an exception, we want the converter to substitute a user given value for
a string that could not be converted to a number. Assume that a user wants
-999 for those bad entries. Then the numbers should be read with:

>>> t = tabarray.tabarray('messyascii.txt','#!', sepchar=' ,\t', bad=-999)
>>> print(t)
[[ 2.33430000e+01 3.44340000e+01 1.00000000e-20]
 [ 1.00000000e+01 2.00000000e+01 -9.99000000e+02]
 [ 2.00000000e+00 4.00000000e+00 6.00000000e+02]
 [ -2.32300000e+01 -2.00000000e-04 -9.99000000e+02]
 [ 4.00000000e+01 5.00000000e+01 7.02000000e+01]]
>>> x,y = t.rows(range(1,4)).columns((1,2))   # Extract some rows and columns
>>> print(x)
[ 2.00000000e+01 4.00000000e+00 -2.00000000e-04]
>>> print(y)   # Contains the 'bad' numbers
[-999. 600. -999.]

Note that we could also have used function :func:`tabarray.readColumns`
to get the same results:

>>> x,y = tabarray.readColumns('messyascii.txt','#!', sepchar=' ,\t', bad=-999, rows=range(1,4), cols=(1,2))

.. note::

   Probably more useful as a bad number indicator is the 'Not a Number' (NaN)
   from NumPy. Use it as in ``bad=numpy.nan`` and test for these numbers
   with NumPy's function *isnan()*.

Glossary
--------

.. glossary::

   ASCII
      *American Standard Code for Information Interchange* is a
      character-encoding scheme based on the ordering of the English alphabet.
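
Illustration of the messy-file rules
------------------------------------

The rules described in the section on messy files (skip comment lines and
blank lines, accept several separator characters, substitute a value for
fields that are not numbers) can be sketched in plain Python. This is only an
illustration under stated assumptions and not the :mod:`tabarray`
implementation, which is written in Cython. The function name ``parse_table``
and the parameter name ``commentchars`` are hypothetical; ``sepchar`` and
``bad`` merely mirror the keyword names used in this tutorial. The sketch only
recognizes a comment character at the start of a line.

```python
import math


def parse_table(text, commentchars="#!", sepchar=" ,\t", bad=math.nan):
    """Parse rectangular text data into a list of rows of numbers.

    Hypothetical helper for illustration; not part of the Kapteyn Package.
    """
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and lines that start with a comment character are skipped.
        if not stripped or stripped[0] in commentchars:
            continue
        # Map every separator character to a space, then split on whitespace.
        for ch in sepchar:
            stripped = stripped.replace(ch, " ")
        row = []
        for field in stripped.split():
            try:
                row.append(float(field))
            except ValueError:
                # The field is not a number: use the user supplied value.
                row.append(bad)
        rows.append(row)
    return rows


# Same content as the messy example file; the third data row uses tabs.
messy = """! Very messy data file

23.343, 34.434, 1e-20
10, 20, xx

2\t4\t600
-23.23, -0.0002, -3x7
# Some comment
40, 50.0, 70.2

"""

table = parse_table(messy, bad=-999)
print(len(table))   # prints 5
print(table[1])     # prints [10.0, 20.0, -999]
```

With ``bad=math.nan`` instead of ``-999``, the substituted entries can be
located afterwards with ``math.isnan()``, which is the pure-Python counterpart
of the NumPy *isnan()* test mentioned in the note above.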