SDSS: Data I/O and Binned Plotting

In this tutorial we will be loading up some SDSS data (mass, metallicity, and star formation rate) for a bunch of galaxies. We will then plot the Mass metallicity relation, and explore it in a bit more depth.

You can work on this tutorial in anyway you like- You can use this html to copy and paste into your own ipython notebook (or just a python script in a plaintext editor) and then make changes, you can follow along working from scratch, or you can download the .ipynb version of this document from the Tutorials page and work directly in it.

Let's go ahead and get started with our import statements:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pyfits as pf
from matplotlib.colors import LogNorm
%matplotlib inline

Great. Now we will want to load the data from the fits files we downloaded. let's define a function that will do this for any fits data table file, and then feed it the three file-names we have.

In [3]:
def load_fits(fname):
    hdu =[1] #loads the fits file into python
    data =      #accesses the data table
    return data

sfr_full = load_fits('gal_totsfr_dr7_v5_2.fits')
mass_full = load_fits('')
z_full = load_fits('gal_fiboh_dr7_v5_2.fits')

We now have three sets of data. Each of our datasets above has multiple columns being stored in what is known as a "record array" (or rec.array). This is a form of array that allows indexing by attribue, much like a dictionary. We can print out what the column names are by using dot notation on our data sets.

In [11]:
    name = 'AVG'; format = 'E'
    name = 'ENTROPY'; format = 'D'
    name = 'FLAG'; format = 'E'
    name = 'MEDIAN'; format = 'E'
    name = 'MODE'; format = 'E'
    name = 'P16'; format = 'E'
    name = 'P2P5'; format = 'E'
    name = 'P84'; format = 'E'
    name = 'P97P5'; format = 'E'

As we can see, we have columns like "Avg", "Entropy", "Median", "Flag", etc. For the purposes of this tutorial, we are interested in the "Avg column." Like most datasets of real data, not all of the galaxies or entries in these files are usable- generally the pipeline that creates the datasets will "flag" bad data in an easily programmatically removable way. Sometimes it is using a "flag" column, and other times is is by selecting an arbitrary and non-physical number to enter as the value (this is how the SDSS data is handled). The values of the flags can usually be found in a readme file.

So, we need to restrict our data to just those that don't have any warning flags. In the cell below, (or in your code), find the indices (locations) for which the following conditions are satisfied, and save them to a variable called "restrictions":

  1. In the SFR array, the value is > -99
  2. In the Mass array, the value is not equal to -1
  3. In the metallicity array, the value is > -99.9

Hint: Do not use a for loop to iterate over the arrays and check the conditions- there is a much faster and more efficient method.

In [12]:
restrictions = np.where((sfr_full['AVG'] > -99) & (mass_full['AVG'] != -1) & (z_full['AVG'] > -99.9))[0]

Great, so now we know which indices correspond to the data that's actually useable. Out of curiosity, calculate the number of elements this process actually removed below:

In [14]:
print len(sfr_full['Avg']) - len(restrictions)

Next, we need to actually grab that data (remember, we currently only have the indices). Fill in the rest below

In [15]:
sfr = np.array(sfr_full[restrictions])
mass =np.array(mass_full[restrictions])
z = np.array(z_full[restrictions])

Note: These arrays (because we are using our full arrays from above), are record arrays containing multiple values for each galaxy. We can print out sfr below to see what that means:

In [19]:
array([ (0.08139009773731232, -4.42832320307308, 0.0, 0.04046526923775673, 0.05000019073486328, -0.19248105585575104, -0.4438594877719879, 0.30087733268737793, 0.6232559680938721),
       (-0.14585991203784943, -4.361890854835793, 0.0, -0.18654750287532806, -0.25, -0.40367355942726135, -0.6441667675971985, 0.05915503576397896, 0.3976192772388458),
       (-0.39462989568710327, -4.454854048351025, 0.0, -0.4310218095779419, -0.5, -0.663306474685669, -0.9434208869934082, -0.1682741343975067, 0.16142888367176056),
       (0.22899013757705688, -2.662650692885225, 2.0, 0.19214117527008057, 0.20000028610229492, 0.11597011238336563, 0.027193086221814156, 0.2938598096370697, 0.34685245156288147),
       (-2.0122199058532715, -4.884956881635399, 2.0, -2.058490514755249, -2.0, -2.44917368888855, -2.7102410793304443, -1.600000023841858, -1.2664284706115723),
       (-0.8014197945594788, -2.2798539805751283, 2.0, -0.8301931619644165, -0.7999997138977051, -0.889761745929718, -1.001724123954773, -0.7295780777931213, -0.6834284663200378)], 
      dtype=(numpy.record, [('AVG', '>f4'), ('ENTROPY', '>f8'), ('FLAG', '>f4'), ('MEDIAN', '>f4'), ('MODE', '>f4'), ('P16', '>f4'), ('P2P5', '>f4'), ('P84', '>f4'), ('P97P5', '>f4')]))

Essentially, sfr is a an array of the same type as the original, only with the bad entries removed. Using the same indexing, create variables sfrs, masses, and metallicities by indexing your sfr, mass, and z arrays for the "AVG" column.

In [24]:
sfrs = sfr['AVG']
masses = mass['AVG']
metallicities = z['AVG']
In [18]:
print sfrs
array([ 0.0813901 , -0.14585991, -0.3946299 , ...,  0.22899014,
       -2.01221991, -0.80141979], dtype=float32)

Now we are getting somewhere. We have a single array for all the good star formation rates, an array of masses, and an array of metallicities. What is the Mass-Metallicity relation for galaxies? Let's find out. Define a function to plot mass vs metallicity as a 2D histogram with 300 bins (plt has a command for this). You'll want to utilize the "LogNorm" function I imported at the top to get a good color gradient- google "plt 2d histogram LogNorm" if you don't know how to do this already.

In [22]:
def plot_mass_vs_metal(masses,metallicities):
    #Plot mass against metalicity.
    plt.hist2d(masses,metallicities,bins=300, norm=LogNorm())
    plt.title('Mass/Metallicity relation for SDSS Galaxies')
    plt.xlabel(r'log Mass [$M_\odot$]')
    plt.ylabel(r'log Gas Phase Metallicities')
In [25]:
/Users/ipasha/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/matplotlib/ FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

When you run your function, you should get an image identical to the one I have above. As we can see, the gas phase metallicity of a galaxy (that is, it's composition in terms of elements) has a positive correlation with the log Mass. However, maybe this is not quite a direct relation as we might think. We also pulled in Star Formation Rates. Perhaps those are also correlated? Fill in the functions below to make the histograms for SFR v metallicity and SFR v mass.

In [26]:
def plot_sfr_metal(sfrs,metallicities):
    plt.title('SFR/Metallicity relation for SDSS Galaxies')
    plt.xlabel(r'log SFR')
    plt.ylabel(r'log Gas Phase Metallicities')
def plot_mass_sfr(masses=masses,sfrs=sfrs):
    plt.title('Mass/SFR relation for SDSS Galaxies')
    plt.xlabel(r'log Mass')
    plt.ylabel(r'log SFR')
In [27]:

Oh dear. It seems like SFR correlates positively with mass, and metallicity correlates positively with SFR. We need to tease out which of these things are actually correlated, and which only look correlated because they depend on something else which is correlated.

One way we can do this is to take slices of one variable. For example, if metallicity truly does depend on mass, then for all galaxies of a single mass, the correlation should dissapear. We can do such a check for all three of our variables- taking slices of single SFR, metallicity, and mass, and see for which the positive correlation dissapears.

In the space below, write 3 functions, which will bin your data by mass, metallicity, and sfr slices. In theory we would select multiple single slices (choosing a specific value), but in practice our bins will have to have a certain width. To make things easier, I have looked at the data and created a bins array for each function- see if you can figure out how it works (look at the bounds on the graphs above and the behavior of linspace).

In [34]:
def mass_bins(masses):
    bins = np.linspace(7,12,10)
    binned_masses = []
    binned_sfrs = []
    binned_z = []
    for i in range(len(bins)-1):
        mass_indices = np.where((masses>bins[i]) & (masses<bins[i+1]))[0] #check "where" masses are > left edge of bin and < right edge of bin
        mass_indices = np.array(mass_indices)
        masses_needed = masses[mass_indices] #index masses for the indices found above
        sfr_needed = sfrs[mass_indices]
        z_needed = metallicities[mass_indices]
    return binned_masses, binned_sfrs, binned_z

def sfr_bins(sfrs):
    bins = np.linspace(-2,2,10)
    binned_masses = []
    binned_sfrs = []
    binned_z = []
    for i in range(len(bins)-1):
        to_choose = np.where((sfrs>bins[i]) & (sfrs<bins[i+1]))[0]
        to_choose = np.array(to_choose)
        masses_needed = masses[to_choose]
        sfr_needed = sfrs[to_choose]
        z_needed = metallicities[to_choose]
    return binned_masses, binned_sfrs, binned_z
def z_bins(metallicities):
    bins = np.linspace(8,9.5,10)
    binned_masses = []
    binned_sfrs = []
    binned_z = []
    for i in range(len(bins)-1):
        to_choose = np.where((metallicities>bins[i]) & (metallicities<bins[i+1]))[0]
        to_choose = np.array(to_choose)
        masses_needed = masses[to_choose]
        sfr_needed = sfrs[to_choose]
        z_needed = metallicities[to_choose]
    return binned_masses, binned_sfrs, binned_z

Lastly, let's write some simple functions to plot up all the sliced data: (This time use 100 bins since there is less data per plot)

In [35]:
def plot_mbins(masses=masses):
    m,s,z = mass_bins(masses)
    for i in range(len(m)):
        plt.xlabel('Log SFR')
        plt.ylabel('Log Metallicity')
def plot_sfrbins(sfrs=sfrs):
    m,s,z = sfr_bins(sfrs)
    for i in range(len(m)):
        plt.xlabel('Log Mass')
        plt.ylabel('Log Metallicity')
def plot_zbins(metallicities=metallicities):
    m,s,z = z_bins(metallicities)
    for i in range(len(m)):
        plt.xlabel('Log Mass')
        plt.ylabel('Log SFR')

As a final step, lets run our functions below, and see what we get:

In [37]:
<matplotlib.figure.Figure at 0x10eaa9c50>
In [38]:
<matplotlib.figure.Figure at 0x11018c990>
In [39]:
<matplotlib.figure.Figure at 0x1050a2550>

As we can see, the relationship between SFR and metallicity disappears at a single mass slice (the first ten plots). Thus, SFR and metallicity are not truly correlated, but only appear so when all masses are included because each depends on mass (which can be seen in the second and third set of plots, where for single slice in metallicity, the SFR/mass relation still exists, and for single slices in SFR, the mass/metallicity relation still exists.

Congrats! You made it to the end of the tutorial. I hope you enjoyed it, practiced a little python, and learned something about galaxy properties. As always, feel free to contact me (post an issue on the github ) if anything was messed up, confusing, or poorly explained.

In [ ]: