An Introduction to Pipelines and Code-building with Classes

Object-oriented programming (OOP), another way of saying class-based coding, gets thrown around a lot. It's a well known style in computer science, and there are languages (e.g., JAVA) that are always OOP.

For astronomers, though, they can be a confusing concept at first, both in execution ("how do I write my code that way?") and in reasoning ("why should I write my code that way?"

I'm going to try to address both of these questions in this walkthrough, but I won't pretend to be an expert on Python classes or their use --- my goal is to just share the barebones basics that I myself have picked up, as a potentially less intimidating starting point for others. I will say that I've found that using OOP for my coding has really helped me out with organization, debugging, documentation, and workflow. Maybe it'll help you too!

Advantages

So why should you think about using OOP? What's wrong with functional programming --- a .py script with numerous functions, which are wrangled together by a main() function at the end?

Well, nothing is wrong with functional programming, and there are certainly instances where it is the best way to tackle a problem. But I've found that a lot of my time in observational astronomy has been, loosely, centered around the construction of pipelines. By that I just mean a big 'ole machine that I feed raw telescope CCD images into, along with some anciliary data, and expect science to come out the other side: measurements, like emission line fluxes and equivalent widths, or photometry.

Python is about performing operations on things. Those operations, and those things, are all objects in python. float is a class in python; every float you define is an instance of that class, which have well defined attributes and methods. What happens when you use the + operator between two floats is defined by the class, and is different than what + does between two strings (also class objects) or two arrays.

If we think, in the abstract, of our data as being an object, then the purpose of our pipeline is to perform a series of operations on that data to change it from its raw, telescope form to something more useful.

Using Objects

I just want to stop for a second to remind you that you already use object-oriented programming all the time, because everything is an object in python. OOP is "activated" through dot notation, meaning something like the following:

In [3]:
import numpy as np 
import matplotlib.pyplot as plt 

a = 'this is a string'
b = np.arange(10)

#Let's access an attribute of our array: 
print(b.shape)

#Let's access a method of our string 
print a.upper()
(10,)
THIS IS A STRING

Right, so by taking our variable names (which in python are pointers to the objects in memory) and add a '.', we can then access attributes by typing their names, and we can access methods the same way, but including the parenthesis for the arguments. Some methods have none (like the above), but many do, e.g.,

In [5]:
b.reshape(5,2)
Out[5]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

Hopefully you are already sort of getting the feel for why having a bunch of helpful built-in attributes and methods for our variable names is convenient. The "functional" alternative would there being some numpy function called np.reshape() which you would use like this:

In [6]:
b = np.arange(10)
b = np.reshape(b,(5,2))
In [7]:
b
Out[7]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

Oh look! That exists! This is a good highlighting of the fact that many of the key modules in python, like numpy and matplotlib have both OOP and functional versions of things.

Defining a class

A class has a few key pieces. Let's define one below:

In [8]:
class Pipeline(object):
    def __init__(self):
        pass
    def method1(self,a,b):
        pass

What do we have? We have a statement initializing the class. I've fed it object as the superclass to pull it's own properties from, but that's unimportant (it would happen on its own if you didn't specify). I then define a very important function called __init__(), which is required for all classes. (Note: that's double-underscore init double-underscore).

When you instantiate an instance of a class (like I did by defining b above), the init method is what gets run up front, to define the basic attributes of the object. The other methods also get defined, so they can be run subsequently.

An important point about classes is that every method of a class has its first argument as self. The word self is not special, though many syntax highlighters will color it differently --- what matters is that it's the first argument and always needs to be the same word. That said, always use self, if you want your code to be readable.

In OOP, self is the placeholder variable that defines that abstract concept of the object itself. When we instantiate and instance of the class, we get to "rename" self to whatever we've named our variable. For example:

In [9]:
my_object = Pipeline()
my_object.method1(1,2)

Now, we can also call our own methods from within our classes, and we'll do it with the same dot-notation style, but because we are dealing with the abstract object and not an instance of it, we'll use the self placeholder. For example,

In [10]:
class Pipeline(object):
    def __init__(self):
        pass
    def method1(self,a,b):
        return a+b 
    def method2(self,x):
        a = self.method1(x,1)
        return a+1

So in this play example, my method2 function calls my method1 function. I could then run

In [11]:
my_object = Pipeline()
my_object.method2(50)
Out[11]:
52

Input and calculated attributes

So far our pipeline object is pretty useless, because it doesn't take any inputs. Presumably, you would only want to be instantiating objects of this class if there were something different about each one.

Using my personal experience, let's pretend we are going to want to be reducing observations of many different galaxies. We want to run raw images from these different systems through the same pipeline, but use the right calibrations, etc., along the way.

One possible organization of the data is into date directories, like 20180405/. Within that directory, we might have files named with some convention and numbering, like tspec0045.fits.

So, if we want to intitially setup our object, we might need to feed it a directory to load images from, a galaxy name, and a set of frame numbers which correspond to that galaxy. We want these all to be class attributes up front.

In [45]:
class Pipeline():
    
    def __init__(self,gal_name,gal_dir,frames,telescope='PalomarTspec'):
        '''
        Initializes a Pipeline object. 
        INPUT: 
            gal_name (str): Name of the galaxy being reduced
            gal_dir (str): Path (full or relative) to the directory containing the raw fits files
            frames (arr_like): list/array of frame numbers corresponding to desired galaxy
            telescope (str): Name of telescope for filename conventions. 
                            Options: 'Palomar Tspec','Keck NIRES','KECK MOSFIRE'
            
        RETURNS:
            None
        '''
        self.gal_name = gal_name
        self.telescope = telescope
        self._set_galdir(gal_dir)
        self.frames = frames
        self.convert_frames_to_files(frames)
    
    def _set_galdir(self,gal_dir):
        if gal_dir.endswith('/'):
            self.gal_dir = gal_dir
        else: 
            self.gal_dir = gal_dir + '/'
    
    def convert_frames_to_files(self,frames):
        '''
        Converts frame numbers (e.g., [10,11,13,16]) to filenames/paths, e.g., ['../20180403/tspec0010.fits',...]
        INPUT:
            frames (arr_like): list/array of frame numbers
        OUTPUT:
            filenames (arr_like): list of associated filenames
        '''
        if self.telescope == 'PalomarTspec':
            self.filenames = [self.gal_dir + 'tspec' + str(i).zfill(4) + '.fits' for i in frames]
        elif self.telescope == 'Keck NIRES':
            print('feature incoming')
            pass
        elif self.telescope =='Keck MOSFIRE':
            print('feature incoming')
            pass
            

Woah woah woah. Okay. What happend there? I did a few of the things we tend to do to our inputs up front in a class. First, there are a few inputs to our class that we can make attributes without doing anything to them (for example, gal_name). We trust the user has input this correctly, and simply set self.gal_name = gal_name in our init statement. This means that someone can access the name of a created object by doing pipeline_object.gal_name. I've decided the same is true for telescope, which is used to define how the fits files are going to be named.

For self.gal_dir, though, I decided to build in a bit of flexibility -- namely, the ability to include or not include the trailing / in the folder name. For path construction later (e.g., in convert_frames_to_files()), ensuring the trailing / is there is helpful. So I define a small helper function called _set_galdir() which takes the user input gal_dir and quickly checks if there's a /, and adds it if there isn't. You could easily imagine checking other inputs in this manner.

What about the weird underscore in the name of the function? This is just a coding convention to indicate that this is a function which shouldn't be changed/used by users, it's just there for setup.

Next, I define a method called convert_frames_to_filenames(). This method takes in the numerical frame numbers and uses the known telescope to construct the full filepaths to the data. str.zfill(4) is a handy function which will make a string of the value and leading zeros filling up to a certain number of digits:

In [15]:
a = '4'
a.zfill(4)
Out[15]:
'0004'

which is very handy because it's how files in telescopes are typically named.

Okay, so my __init__() function takes my inputs, and sets some of them as attributes immediately, uses helper functions to set some of them, and uses more involved functions to set one attribute that wasn't in some way an input at all. You the coder gets to decide what all gets initialized -- some things might not always be needed and can be set in methods that don't get run automatically.

I also illustrated a quick handy way to remind yourself of code bits you need to write without keeping the code from running -- the pass option. Great for keeping track of methods and features of methods you want to add later.

Finally, you'll notice I elected to keep frames as its own attribute, separate from filenames. It's possible that now that I have the filenames, I'll not need the frame numbers by themselves again. But who knows? Maybe they'll come in handy later, or for debugging. One advantage of classes is that I can make any variable a class attribute just by sticking self. in front of it, and it will still stay local to that object and not interfere with my global namespace, while being accesible within any method of my class. Great for quick debugging.

Here's where some of the OOP advantages will start becoming clear (I hope).

In a functional sense, next I would write a function that takes as an input the files I want to load, and load the images into an array. But with OOP, I can now make a method to do that, and it would look like this:

In [46]:
class Pipeline():
    
    def __init__(self,gal_name,gal_dir,frames,telescope='PalomarTspec'):
        '''
        Initializes a Pipeline object. 
        INPUT: 
            gal_name (str): Name of the galaxy being reduced
            gal_dir (str): Path (full or relative) to the directory containing the raw fits files
            frames (arr_like): list/array of frame numbers corresponding to desired galaxy
            telescope (str): Name of telescope for filename conventions. 
                            Options: 'Palomar Tspec','Keck NIRES','KECK MOSFIRE'
            
        RETURNS:
            None
        '''
        self.gal_name = gal_name
        self.telescope = telescope
        self._set_galdir(gal_dir)
        self.frames = frames
        self.convert_frames_to_files(self.frames)
    
    def _set_galdir(self,gal_dir):
        if gal_dir.endswith('/'):
            self.gal_dir = gal_dir
        else: 
            self.gal_dir = gal_dir + '/'

    
    def convert_frames_to_files(self,frames):
        '''
        Converts frame numbers (e.g., [10,11,13,16]) to filenames/paths, e.g., ['../20180403/tspec0010.fits',...]
        INPUT:
            frames (arr_like): list/array of frame numbers
        OUTPUT:
            filenames (arr_like): list of associated filenames
        '''
        if self.telescope == 'PalomarTspec':
            self.filenames = [self.gal_dir + 'tspec' + str(i).zfill(4) + '.fits' for i in frames]
        elif self.telescope == 'Keck NIRES':
            print('feature incoming')
            pass
        elif self.telescope =='Keck MOSFIRE':
            print('feature incoming')
            pass
        
    def load_images(self):
        '''
        Function to use astropy.io.fits to load images into a big array
        Uses: self.filenames
        Sets: self.image_array
        Returns: none
        '''
        images = []
        for i in self.filenames:
            with fits.open(i) as HDU:
                image = HDU[0].data
                images.append(image)
        self.image_array = np.array(images)

Notice my method didn't need any inputs, because self.filenames is a "global" variable to my class, as an attribute. It's still useful to note in the documentation which variables your method needs set, and which it sets itself. Now, self.filenames must be present, really, because it was set by the init function. But it's likely you'll also have some attributes that don't get set by init (because they're not always needed), but are needed when you run some other function. We can always double check using the following style:

In [18]:
def load_images(self):
        '''
        Function to use astropy.io.fits to load images into a big array
        Uses: self.filenames
        Sets: self.image_array
        Returns: none
        '''
        images = []
        if not hasattr(self,'filenames'):
            self.convert_frames_to_files(self.frames)
        for i in self.filenames:
            with fits.open(i) as HDU:
                image = HDU[0].data
                images.append(image)
        self.image_array = np.array(images)

So now when load_images() runs, it checks if self "has the attribute" filenames, and if it doesn't, runs the method that generates it (hey look, keeping frames around was helpful!). Again, we don't need that here because both are set in __init__().

At this point, our "main" function which would run our pipeline would look something like this:

In [47]:
galaxy = Pipeline('NGC_2048','../data/20180503/',[10,11,12,13,14,15])
In [48]:
galaxy.gal_name
Out[48]:
'NGC_2048'
In [49]:
galaxy.gal_dir
Out[49]:
'../data/20180503/'
In [50]:
galaxy.frames
Out[50]:
[10, 11, 12, 13, 14, 15]
In [51]:
galaxy.filenames
Out[51]:
['../data/20180503/tspec0010.fits',
 '../data/20180503/tspec0011.fits',
 '../data/20180503/tspec0012.fits',
 '../data/20180503/tspec0013.fits',
 '../data/20180503/tspec0014.fits',
 '../data/20180503/tspec0015.fits']

Deciding whether your class methods return a value or set a value is up to you. You can define methods called setters which is how I did things above -- the __init__() function runs a bunch of "setting" methods which set up some of the class attributes. I could also have had them return values, and set the attribute within init instead, like this :

In [52]:
class Pipeline():
    
    def __init__(self,gal_name,gal_dir,frames,telescope='PalomarTspec'):
        '''
        Initializes a Pipeline object. 
        INPUT: 
            gal_name (str): Name of the galaxy being reduced
            gal_dir (str): Path (full or relative) to the directory containing the raw fits files
            frames (arr_like): list/array of frame numbers corresponding to desired galaxy
            telescope (str): Name of telescope for filename conventions. 
                            Options: 'Palomar Tspec','Keck NIRES','KECK MOSFIRE'
            
        RETURNS:
            None
        '''
        self.gal_name = gal_name
        self.telescope = telescope
        self.gal_dir = self._set_galdir(gal_dir)
        self.frames = frames
        self.filenames = self.convert_frames_to_files(frames)
    
    def _set_galdir(self,gal_dir):
        if gal_dir.endswith('/'):
            return gal_dir
        else: 
            return gal_dir + '/'

    
    def convert_frames_to_files(self,frames):
        '''
        Converts frame numbers (e.g., [10,11,13,16]) to filenames/paths, e.g., ['../20180403/tspec0010.fits',...]
        INPUT:
            frames (arr_like): list/array of frame numbers
        OUTPUT:
            filenames (arr_like): list of associated filenames
        '''
        if self.telescope == 'PalomarTspec':
            return [self.gal_dir + 'tspec' + str(i).zfill(4) + '.fits' for i in frames]
        elif self.telescope == 'Keck NIRES':
            print('feature incoming')
            pass
        elif self.telescope =='Keck MOSFIRE':
            print('feature incoming')
            pass
        
    def load_images(self):
        '''
        Function to use astropy.io.fits to load images into a big array
        Uses: self.filenames
        Sets: self.image_array
        Returns: none
        '''
        images = []
        for i in self.filenames:
            with fits.open(i) as HDU:
                image = HDU[0].data
                images.append(image)
        self.image_array = np.array(images)
In [53]:
galaxy = Pipeline('NGC_2048','../data/20180503/',[10,11,12,13,14,15])
In [54]:
galaxy.gal_name
Out[54]:
'NGC_2048'
In [55]:
galaxy.filenames
Out[55]:
['../data/20180503/tspec0010.fits',
 '../data/20180503/tspec0011.fits',
 '../data/20180503/tspec0012.fits',
 '../data/20180503/tspec0013.fits',
 '../data/20180503/tspec0014.fits',
 '../data/20180503/tspec0015.fits']

So it works basically the same way. This second way can be a little more organized, because your functions return things and you can see up within one function (init) what most of your class attributes are called. Insofar as they are those key methods that get run automatically when the class is initialized it doesn't matter too much. But notice I left load_images as a setter, because it's not run in init. So the user should be able to run

In [ ]:
galaxy.load_images() 

and have that load up the images into memory as a class attribute. Since there are many ways to do this, one could have their "main" function store that image array and plug it back into other methods, but I prefer (so long as the memory footprint isn't massive) just to store the relative parts of the reduction within the class. If people are interacting with your code, this might be a place for a print statement like

In [56]:
    def load_images(self):
        '''
        Function to use astropy.io.fits to load images into a big array
        Uses: self.filenames
        Sets: self.image_array
        Returns: none
        '''
        images = []
        for i in self.filenames:
            with fits.open(i) as HDU:
                image = HDU[0].data
                images.append(image)
        self.image_array = np.array(images)
        print('Images loaded into object attribute image_array')

At this point, hopefully you can see what would come next --- perhaps a method which uses self.image_array to start doing image combinations or calibrations (calibrations, btw, can easily be loaded up with a method that gets run in init, or when needed).

Step by step, method by method, we can create our reduction pipeline much like we would in the functional way, but a bit more streamlined, without having to feed tons of variables in and out of each function. When running our code in the terminal, we can easily debug, because unlike functions which trap intermediate variables to local namespaces that are hard to access (requiring us to use a special debugging tool or print statements), here many important variables are attributes, and we can easily set a problem variable using self.blah and then interrogate it in-terminal.

It means we can reduce each galaxy into an object that knows lots of things about itself, and can calculate many things about itself. Each object, for example, might know how to make a nice plot of itself just by typing objname.plot(), which might also take some optional arguments (in contrast, if we had one global plotting function, we'd need to feed it, e.g., wavelengths and fluxes, which would have to come out from some function or just be stored in a global variable).

Finally, this sort of class-oriented code structure is the primary way of making production level code packages in python (a quick github browsing can confirm this). So learning it now is a great way to prep yourself for hosting code on github that other people might use.

Wrapping up

I hope this has been a helpful introduction to classes, why we might use them, and how to get started with them! It takes some getting used to, but over time I've found it more and more helpful. I myself have plenty to learn when it comes to this topic!

For a super quick and easy tutorial that involves using a very simple class, I'm going to be posting (soon) one where you'll get to set up a simulation of supernovae going off in the Milky Way to estimate the volume filling fraction of SNe bubbles in our galaxy!