NASCArrays CSV files
The CSV files from NASCArrays contain the normalised signal value and a range of annotations for each probe set which are useful when analysing the data (see fig 1).
Fig 1: Example of CSV file
The annotations are:
- Spot ID - The Affymetrix ID for the probe set. This can be used at the Affymetrix web site to find more information about the probe set (sequence of probes etc.).
- Probename - The Affymetrix name for the probe set
- Gene name - The name of the gene, usually the AGI code
- Gene symbol - The AGI code for the gene
- Chromosome - Which chromosome the gene is on
- Start - The start position of the gene on the chromosome
- End - The end position of the gene on the chromosome
- Strand - Which strand of DNA the gene is on
- Description - A description of the gene
- MRNA Length - The mRNA length
- CDSLength - The length of the coding sequence
- NOOFEXONS - Number of exons in the gene
- INTERPRO - The INTERPRO accession number for the protein. InterPro is a database of protein families, domains and functional sites, and can be searched at the InterPro website.
- GO - The gene ontology numbers for the gene, see the Gene Ontology website.
- Signal - Normalised signal value for the probe set.
- Detection - The detection call (1=present, -1=absent)
- P-value - A p-value for the detection call
- StatPairsUsed - The number of probe pairs used in the normalisation
Analysing the data
An example of the analysis that can be performed can be found in the associated file (rhd mutant data.xls).
Fig 2: Example of analysed data
Preparation of the data
To aid the analysis it is advisable to copy the gene name and all the signal value columns into a new worksheet. This reduces the number of columns in the sheet making it easier to work with the data.
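For readers working outside a spreadsheet, the same preparation step can be sketched in Python. This is only an illustration: the column headers used here (`Gene name`, and signal columns whose header starts with `Signal`) and the inline sample rows are assumptions, not the exact layout of a NASCArrays export.

```python
import csv
import io

# Hypothetical miniature CSV with made-up values; real files have many more columns.
SAMPLE_CSV = """Gene name,Description,Signal wt1,Detection wt1,Signal wt2,Detection wt2
AT1G01010,example gene A,120.5,1,118.0,1
AT1G01020,example gene B,15.2,-1,14.1,-1
"""

def keep_gene_and_signals(handle):
    """Return rows reduced to the gene name plus every signal column."""
    reader = csv.DictReader(handle)
    # Assumption: signal columns can be recognised by their header prefix.
    signal_cols = [c for c in reader.fieldnames if c.startswith("Signal")]
    rows = []
    for record in reader:
        row = {"Gene name": record["Gene name"]}
        for col in signal_cols:
            row[col] = float(record[col])
        rows.append(row)
    return rows

prepared = keep_gene_and_signals(io.StringIO(SAMPLE_CSV))
for row in prepared:
    print(row)
```

To use this with a real file, an open file handle would be passed in place of the `io.StringIO` object, and the header names adjusted to match the file.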
Analysis of replicate variation
It is important to analyse the variation in the data between the replicates as this gives an indication of the quality of the data, experimental design and how well the experiment was performed. If the variation between the replicates is large, then this will mask any changes in gene expression being investigated.
To measure the variation between replicates you must:
- Calculate the average (mean) signal value for the replicates
- Calculate the standard deviation (SD) for the replicates
- Calculate the percentage coefficient of variation (%CV = (SD/mean)*100). This shows how much each gene varies between the replicates. In a well-designed and well-executed experiment these figures will be low.
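The three steps above can be sketched in Python for a single gene. The replicate values are made up, and the sketch assumes the sample standard deviation (n - 1 denominator); the text does not say which form of SD is intended.

```python
import statistics

def percent_cv(values):
    """Return (mean, SD, %CV) for one gene's replicate signal values."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), an assumption
    return mean, sd, (sd / mean) * 100

# Made-up replicate signals for one gene
mean, sd, cv = percent_cv([100.0, 110.0, 90.0])
print(f"mean={mean:.1f} SD={sd:.1f} %CV={cv:.1f}")
```

At least two replicates are needed for the SD, and a mean of zero would need to be handled separately because the %CV is undefined in that case.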
Finding changed genes
The aim of many array experiments is to find genes whose expression levels have changed between treatments. Genes that have changed substantially (commonly taken as a greater than 2-fold change) can then be analysed further. Finding these genes can be done in a number of ways:
Ratios of expression
The easiest way to find genes that have changed is to calculate the ratio of the means. The log ratio can also be calculated, which scales the data and makes it easier to visualise.
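As a small illustration, the ratio and log ratio for one gene might be calculated as below. The signal values are made up, and base 2 is an assumption (the text only says "log ratio"); it is a convenient choice because a 2-fold increase becomes +1 and a 2-fold decrease becomes -1.

```python
import math
import statistics

def fold_change(treated, control):
    """Return (ratio, log2 ratio) of the mean treated to the mean control signal."""
    ratio = statistics.mean(treated) / statistics.mean(control)
    return ratio, math.log2(ratio)

# Made-up replicate signals for one gene in two treatments
ratio, log_ratio = fold_change([400.0, 420.0, 380.0], [100.0, 95.0, 105.0])
print(f"ratio={ratio:.2f} log2 ratio={log_ratio:.2f}")
```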
T-Test
A T-test can also be performed to find genes that have changed. It assesses whether the means of two groups are statistically different from each other.
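To show what the test measures, the sketch below computes a t statistic for one gene using only the Python standard library. It assumes Welch's unequal-variance form, which is one of several variants and not necessarily the one used in the example spreadsheet. The signal values are made up. Turning the statistic into a p-value additionally needs the t distribution, which a spreadsheet T-test function or a statistics package provides.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error contributed by each group
    va = statistics.variance(group_a) / na
    vb = statistics.variance(group_b) / nb
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up replicate signals for one gene in two treatments
t_stat, df = welch_t([400.0, 420.0, 380.0], [100.0, 95.0, 105.0])
print(f"t={t_stat:.2f} df={df:.2f}")
```

A large absolute t value relative to the degrees of freedom corresponds to a small p-value, that is, stronger evidence that the two means differ.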
Sorting the results
To find interesting genes, the data can be sorted based on any of the calculations performed. This is done as follows (see fig 3):
- Select all the columns
- Select data>sort
- Check My list has header row
- Choose attribute to sort by (e.g. T-test), then whether descending or ascending
Fig 3: Sorting data
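The same kind of sorting can be sketched in Python. The gene records, field names and values below are hypothetical and only stand in for the calculated columns of a worksheet.

```python
# Hypothetical per-gene results; all values are made up for illustration.
results = [
    {"gene": "gene_a", "ratio": 1.1, "t_test_p": 0.40},
    {"gene": "gene_b", "ratio": 4.0, "t_test_p": 0.002},
    {"gene": "gene_c", "ratio": 0.3, "t_test_p": 0.01},
]

# Ascending by T-test p-value: strongest evidence of change first
by_p = sorted(results, key=lambda r: r["t_test_p"])

# Descending by ratio: most up-regulated first
by_ratio = sorted(results, key=lambda r: r["ratio"], reverse=True)

print([r["gene"] for r in by_p])
print([r["gene"] for r in by_ratio])
```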
Selecting genes for further analysis
Genes should be selected for further analysis on more than one criterion. For example, genes that appear up- or down-regulated based on the ratio should only be analysed further if the %CV, the T-test score and the P-value for the detection call are all low.
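Combining the criteria might look like the sketch below. The 2-fold cut-off follows the text above; the other thresholds (%CV of 20, p-values of 0.05), the field names and the data are assumptions chosen only for illustration and should be set to suit the experiment.

```python
def select_genes(rows, min_fold=2.0, max_cv=20.0, max_p=0.05):
    """Return names of genes that pass the fold-change and all quality filters."""
    selected = []
    for row in rows:
        # Changed in either direction by at least min_fold
        changed = row["ratio"] >= min_fold or row["ratio"] <= 1 / min_fold
        reliable = (
            row["cv"] <= max_cv
            and row["t_test_p"] <= max_p
            and row["detection_p"] <= max_p
        )
        if changed and reliable:
            selected.append(row["gene"])
    return selected

# Hypothetical per-gene results; all values are made up.
rows = [
    {"gene": "gene_a", "ratio": 4.0, "cv": 8.0, "t_test_p": 0.002, "detection_p": 0.001},
    {"gene": "gene_b", "ratio": 3.5, "cv": 45.0, "t_test_p": 0.20, "detection_p": 0.001},
    {"gene": "gene_c", "ratio": 0.3, "cv": 12.0, "t_test_p": 0.01, "detection_p": 0.03},
    {"gene": "gene_d", "ratio": 1.1, "cv": 5.0, "t_test_p": 0.60, "detection_p": 0.001},
]
selected = select_genes(rows)
print(selected)
```

Here gene_b shows a large ratio but is rejected because its replicates vary too much, which is the situation the multiple criteria are meant to catch.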
Further Analysis
AtEnsembl
More information about individual genes can be found on the AtEnsembl web site by searching with the AGI code.
Fig 4: Arabidopsis Ensembl
AtEnsembl is an Arabidopsis genome browser that can provide information about the gene (sequence, transcript, structure etc.) and the protein (sequence, motifs etc.). It also has links to NASCArrays, where the gene expression across all the experiments in the database can be viewed using the spot history and gene swinger tools. Information from the NASC seed catalogue can also be accessed, allowing knockout lines to be ordered.
Fig 5: Example of an AtEnsembl page for an individual gene
GO Annotations
A gene list can be analysed based on the GO annotations for those genes. An easy way to perform this is using the TAIR website.
This is done as follows:
- Paste a list of selected genes (AGI codes).
- Click get all GO annotations to download all the annotations for all the genes.
- Click functional categorisation to sort the genes by function, e.g. cellular component and biological process. This can be shown as a pie chart (see fig 6).
Fig 6: Characterisation of gene list
Analysis of potential promoter motifs
The promoters of a list of candidate genes can be analysed to find short motifs (6 base pairs) that are over-represented in your list compared with the whole genome. This can be done at the TAIR motif search web site. As with the GO annotation page, a list of AGI codes is used to perform the search on either 500bp or 1000bp of upstream sequence. A list is returned (see figure 7) that shows the motif sequence, how many genes in your list and in the genome contain this motif, a P-value, and the genes in your list that contain this motif. The motifs in this list may be known or unknown binding sites for transcription factors.
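The counting step behind such a search can be illustrated with a short sketch that reports how many sequences in a list contain each 6-base motif. This is not the TAIR implementation: it looks at one strand only, does not compare against the genome or calculate a P-value, and the gene labels and very short sequences are made up.

```python
from collections import Counter

def genes_per_motif(upstream, k=6):
    """Count how many sequences contain each k-base motif at least once."""
    counts = Counter()
    for seq in upstream.values():
        seq = seq.upper()
        # A set ensures each motif is counted once per gene
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Hypothetical, very short "upstream" sequences; real ones would be 500 or 1000 bp.
upstream = {
    "gene_1": "AACACGTGTT",
    "gene_2": "TTCACGTGAA",
    "gene_3": "GGGGGGGGGG",
}
counts = genes_per_motif(upstream)
for motif, n in counts.most_common(3):
    print(motif, n)
```

Judging over-representation would then require the same counts for all genes in the genome and a suitable statistical test.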
Fig 7: Results of a motif search
For further help and advice on how to analyse your data, please contact affy@arabidopsis.info.