Some ideas for analysis of data
Introduction
Microarray analysis is a large subject, made larger because at the moment there is no one "correct" way of analysing data. Many pieces of software are available for analysing data, at various costs: a few are free, many are free if you are an academic user, and some are sold. The purpose of this web page is to demonstrate simple analyses you can do with our sample datasets, using software that is free (at least if you are an academic user).
We're assuming you have some spreadsheets from us already. You can get spreadsheets from NASCArrays. A description of how to do this is available on the download help page.
For a more detailed description, please have a look at this page: AffyAnalysisWithExcel.html. And for any help or advice on any of the above topics, please contact affy@arabidopsis.info.
About the spreadsheets
A description of what you get in the Affymetrix spreadsheets is on NASCArrays' data help page.
Normalisation performed on Affymetrix Results
All the spreadsheets presented in the database have been trivially normalised using the Affymetrix standard procedure. A so-called "Scaling Factor" was applied using the Affymetrix software. This is calculated by removing the top and bottom 2% of signal values, then calculating the value that adjusts the mean of the remaining 96% to 100. All the results are multiplied by this factor to give normalised results. This makes different experiments comparable. More detailed normalisation is available, but most literature on the subject tends to focus on two-colour arrays.
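The scaling factor calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Affymetrix software itself: the function names are made up for this example, and the way the 2% cut-off is rounded is an assumption.

```python
def scaling_factor(signals, target=100.0, trim_percent=2):
    """Return the factor that brings the trimmed mean of `signals` to `target`.

    Assumption: the number of values dropped at each end is rounded down.
    """
    ordered = sorted(signals)
    cut = len(ordered) * trim_percent // 100   # values dropped at each end
    kept = ordered[cut:len(ordered) - cut]     # the middle 96% by default
    trimmed_mean = sum(kept) / len(kept)
    return target / trimmed_mean


def normalise(signals, target=100.0, trim_percent=2):
    """Multiply every signal by the scaling factor."""
    factor = scaling_factor(signals, target, trim_percent)
    return [signal * factor for signal in signals]


# Made-up signal values: 1, 2, ..., 100
raw = [float(value) for value in range(1, 101)]
scaled = normalise(raw)
print(scaling_factor(raw))
```

After scaling, the trimmed mean of every slide is the same target value, which is what allows signals from different slides to be compared directly.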
Analysis with Excel
A lot of the obvious questions from microarrays can be answered using only Excel (or another spreadsheet program). This section steps through how to answer simple questions from the data spreadsheets downloadable from NASCArrays.
The first important features in Excel are the sort buttons.
Pressing these buttons sorts the entire spreadsheet according to the column that the currently selected cell is in. For instance, if you select a cell in the signal column and press "sort descending" (the button labelled Z-A), the entire spreadsheet will be sorted on the signal. Scrolling to the top will show you the gene that was expressed most on the chip, followed by the rest in descending order. Choosing "sort ascending" (A-Z) will put the gene that was expressed the least at the top. Any of the columns can be sorted. To get the spreadsheet back the way it started, sort ascending on the spotid.
Caution! Do not select an entire column and press one of the sort buttons. If you do, just that column will be sorted, leaving the rest of the spreadsheet as it was before. This scrambles the data on the spreadsheet, making it meaningless.
Other spreadsheet programs have this facility too, but they may work differently.
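For readers who prefer a script to a spreadsheet, the same sorting can be sketched in Python. The rows below are made up for illustration; only the "spotid" and "signal" column names come from the spreadsheets described above. The last line imitates the mistake in the caution above, where one column is sorted on its own.

```python
# Hypothetical rows; each dictionary is one spreadsheet row.
rows = [
    {"spotid": 1, "probeset": "probe_a", "signal": 250.0},
    {"spotid": 2, "probeset": "probe_b", "signal": 1200.0},
    {"spotid": 3, "probeset": "probe_c", "signal": 35.0},
]

# "Sort descending" on the signal column: whole rows move together.
by_signal = sorted(rows, key=lambda row: row["signal"], reverse=True)

# "Sort ascending" on spotid puts the rows back in their original order.
restored = sorted(by_signal, key=lambda row: row["spotid"])

# The mistake: sorting the signal column alone while the other columns stay
# put, so signals end up attached to the wrong probesets.
sorted_signals = sorted(row["signal"] for row in rows)
scrambled = [dict(row, signal=signal) for row, signal in zip(rows, sorted_signals)]

print(by_signal[0]["probeset"])
```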
Excel can be used to compare two sets of results. One easy way is to calculate a "log ratio". Download an experiment, or two slides from your selection, and find a pair of slides you want to compare. Find a free column and calculate the log of the ratio of the signal column of one slide to the signal column of the other. For example, type "=log(A1/B1)", where A and B are the signal columns, into the top empty cell of the free column, then copy and paste it down the rest of the column. This is the "log ratio". Note that Excel's LOG function uses base 10 unless you give a different base as a second argument, for example "=log(A1/B1, 2)". A log ratio above 2 or below -2 is traditionally considered a significant change, although this rule of thumb is being replaced by more statistical methods.
Once you've calculated the log ratio into a column, you can also find the most upregulated and downregulated gene by sorting on the log ratio.
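The log ratio and the sort on it can also be sketched in Python. The slide contents and probeset names here are invented for the example. The default base of 10 is chosen to match Excel's LOG function when no base is given; pass base=2 if you prefer base-2 log ratios.

```python
import math


def log_ratio(signal_a, signal_b, base=10):
    """Log of the ratio of two signals. Both signals must be greater than zero."""
    return math.log(signal_a / signal_b, base)


# Hypothetical signal values for the same three probesets on two slides.
slide_a = {"probe_a": 1000.0, "probe_b": 50.0, "probe_c": 300.0}
slide_b = {"probe_a": 10.0, "probe_b": 5000.0, "probe_c": 300.0}

ratios = {probe: log_ratio(slide_a[probe], slide_b[probe]) for probe in slide_a}

# Most upregulated first, most downregulated last.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for probe, ratio in ranked:
    print(probe, round(ratio, 3))
```

An unchanged gene has a ratio of 1 and therefore a log ratio of 0, so upregulated and downregulated genes sit symmetrically either side of zero.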
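If the experiment you are studying has replicates, you can perform Student's t-test. For this you need two groups of replicates to compare. Use Excel's "ttest" function, which takes the two ranges of replicate values, the number of tails and the type of test. For example, "=ttest(A2:C2, D2:F2, 2, 2)" gives the probability from a two-tailed test assuming equal variances, where columns A to C and D to F are assumed to hold the two groups of replicates.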
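To show what the t-test computes, here is a minimal sketch of the t statistic for two groups of replicates, assuming equal variances in both groups. It stops at the t statistic and the degrees of freedom. Excel's function goes one step further and returns a probability, which needs the t distribution; that is not in the Python standard library, but a statistics package such as SciPy (scipy.stats.ttest_ind) provides it. The replicate values are made up.

```python
import math
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Student's two-sample t statistic (equal variances assumed).

    Returns the t statistic and the degrees of freedom.
    Each group needs at least two replicates.
    """
    n_a, n_b = len(group_a), len(group_b)
    degrees_of_freedom = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / degrees_of_freedom
    standard_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, degrees_of_freedom


# Hypothetical signals for one gene: three control and three treated replicates.
control = [210.0, 195.0, 205.0]
treated = [410.0, 380.0, 425.0]

t_value, df = t_statistic(control, treated)
print(t_value, df)
```

The larger the absolute value of t, the less likely it is that the difference between the two group means is due to chance alone.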
You can download a sample spreadsheet (pictured) with log-ratio and t-test calculated as an example.
For a more detailed description, please have a look at this page: AffyAnalysisWithExcel.html.
Clustering
Clustering is a mathematical technique for grouping things. In gene expression work, clustering is used to group together genes that act together over a series of samples. The genes in a group do not have to be expressed at a similar strength - one can be highly expressed and the other hardly expressed at all, but they must go up and down together in the samples studied. Clustering allows you to find out which genes are co-regulated in the experimental conditions studied.
It is also possible to cluster slides. This allows you to find out which samples group together. For instance, all replicates should group together!
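The idea that genes can belong together despite very different expression strengths can be illustrated with a correlation coefficient. This sketch uses Pearson correlation as the measure of "going up and down together"; that choice is an assumption made for the example, since this page does not say which measure Expression Profiler uses. The profiles are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two expression profiles of equal length.

    Assumes neither profile is completely flat (that would divide by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical signals over four samples.
strong = [1000.0, 2000.0, 1500.0, 3000.0]   # highly expressed
weak = [10.0, 20.0, 15.0, 30.0]             # barely expressed, same pattern
opposite = [30.0, 15.0, 20.0, 10.0]         # moves the other way

print(pearson(strong, weak))      # close to 1: same pattern
print(pearson(strong, opposite))  # negative: opposite pattern
```

The first pair differ a hundredfold in strength but are perfectly correlated, so a correlation-based clustering would group them together.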
You can get data from the website using settings that are ideal for clustering. Download an experiment that you want to cluster, using the following settings: tab-delimited format, probeset name only annotation, uncompressed.
You also need the annotation spreadsheet, which is available here. This is just a tab-delimited file with annotation and no expression data.
For this clustering example we will use Expression Profiler from the EBI to perform our clustering. This is a free web based clustering tool. Its URL is http://ep.ebi.ac.uk. The text files supplied are suitable for loading into Expression Profiler.
There is a newer version of Expression Profiler being developed. This example still uses the old version, as I could not get the new version to work.
Starting at the Expression Profiler homepage, click on "Use Expression Profiler", then "EPCLUST". EPCLUST is the clustering portion of Expression Profiler. Make sure you have the clustering files supplied close to hand, and click "Upload your own data". This will take you to a form where you can browse for the supplied files. Browse and find matrix.txt in the data matrix box, and put annotations.txt into the annotations box. Press proceed to put the data into Expression Profiler.
The data is uploaded. The next two pages ask about filtering the uploaded data before analysis. Press "Select the Correct Experiments" on the next page to get to the main filtering page.
When clustering, you typically cluster over many genes, and the results can become bewildering. It is a good idea to reduce the number of genes if at all possible. This is particularly important with large experiments. Every gene you remove dramatically reduces the amount of computer time needed to cluster.
One way of reducing the number of genes is to remove those with very low expression values. These genes are probably not expressed at all, and any signal for them is probably just noise, so leaving them in means clustering noise. Filter out the low-expressed genes by choosing "Select rows where at least 2 genes are greater than 150 OR less than 0" and typing 150 into the box. This removes any gene whose signal is not above 150 in at least two samples. The figure of 150 was chosen because signals below it start to be marked "Absent" on the Affymetrix spreadsheets.
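The filter can be sketched as follows. The function name, the probeset names and the signal values are made up, and the reading of the option (count the values in a row that are above the upper limit or below the lower limit, and keep the row if there are at least two) is an assumption based on its wording.

```python
def keep_row(signals, upper=150.0, lower=0.0, min_count=2):
    """Keep a gene if at least `min_count` of its signals are above `upper` or below `lower`."""
    hits = sum(1 for signal in signals if signal > upper or signal < lower)
    return hits >= min_count


# Hypothetical data matrix: one row of signals per probeset, one value per sample.
matrix = {
    "probe_a": [20.0, 30.0, 25.0],     # never above 150: dropped
    "probe_b": [400.0, 90.0, 610.0],   # above 150 in two samples: kept
    "probe_c": [160.0, 40.0, 12.0],    # above 150 in only one sample: dropped
}

filtered = {probe: signals for probe, signals in matrix.items() if keep_row(signals)}
print(sorted(filtered))
```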
The clustering page proper then appears. This allows you to select what kind of clustering you wish to do.
Hierarchical clustering groups genes into a tree. From the top of the tree, the genes are split into two groups. Each of these groups is split into two subgroups, which in turn are split into two more subgroups, and so on until each group contains only one gene.
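A tree of this shape can be built with a short sketch. Note the assumptions: this version works from the bottom up (it repeatedly merges the two closest groups, which produces the same kind of two-way tree), it measures closeness as 1 minus the Pearson correlation, and it uses average linkage between groups. None of this is claimed to be what EPCLUST does by default; the gene names and signals are invented.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; assumes neither profile is completely flat."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def leaves(tree):
    """List the gene names under a node. A node is a name or a (left, right) pair."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def hierarchical_cluster(profiles):
    """Merge the two closest groups until one tree of nested pairs remains."""

    def average_distance(a, b):
        # Average of (1 - correlation) over every pair of genes across the two groups.
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(1.0 - pearson(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    clusters = list(profiles)  # start with every gene in its own group
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


# Hypothetical profiles over four samples: A and B rise and fall together, as do C and D.
profiles = {
    "geneA": [1000.0, 2000.0, 1500.0, 3000.0],
    "geneB": [10.0, 20.0, 15.0, 30.0],
    "geneC": [30.0, 15.0, 20.0, 10.0],
    "geneD": [3000.0, 1500.0, 2000.0, 1000.0],
}

tree = hierarchical_cluster(profiles)
print(tree)
```

The top of the resulting tree splits the genes into two groups, and each nested pair below it is one further split, down to the single genes.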
Once the data is uploaded, perform hierarchical clustering by pressing GO! next to the hierarchical clustering options. The default options are probably fine. The intricacies of hierarchical clustering are beyond the author at the moment!
After being given some time to think, the computer produces a tree and a picture representing the clustering. The genes have been rearranged into their cluster order, and are represented on the vertical axis (although not labelled). The experiments are represented on the horizontal axis. The brighter the red colour on the picture, the more highly that gene (vertical axis) was expressed in that experiment (horizontal axis).
Changing the colour scheme gives better visualisation of data. Click on the "Show Options" link at the top. Change the discretisation size to 30, and choose Exponential sizes. Press redraw, and the colours can be seen more vividly.
The tree can be explored by clicking on it. Clicking on one of the branches shows you the branch in more detail. It also lists the genes involved in that cluster underneath. The annotation file supplied is used to provide information about the genes, and a link to GenBank.
The other kind of clustering that can be done using Expression Profiler is called K-means clustering. This uses a different algorithm. Get back to the clustering page by clicking on the "Data: xxxxx" link at the top of the page, where xxxxx is a random series of characters.
K-means clustering requires you to tell it how many clusters you want. Type in a number and press GO!. It will then arrange the genes into that number of clusters.
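The K-means idea can be sketched as below. This is a minimal illustration rather than Expression Profiler's implementation: it assumes plain Euclidean distance, takes the first k genes as starting centres (real tools usually start from random positions) and runs a fixed number of rounds. The gene names and values are made up. Because Euclidean distance on raw signals groups genes by strength as well as pattern, profiles are often rescaled first.

```python
import math


def kmeans(profiles, k, rounds=20):
    """Assign each gene to one of k clusters. Returns {gene name: cluster number}."""
    names = list(profiles)
    # Assumption: the first k genes serve as the starting cluster centres.
    centres = [list(profiles[name]) for name in names[:k]]
    assignment = {}
    for _ in range(rounds):
        # Step 1: put every gene in the cluster with the nearest centre.
        for name in names:
            assignment[name] = min(
                range(k), key=lambda c: math.dist(profiles[name], centres[c])
            )
        # Step 2: move each centre to the average profile of its members.
        for c in range(k):
            members = [profiles[name] for name in names if assignment[name] == c]
            if members:
                centres[c] = [sum(column) / len(members) for column in zip(*members)]
    return assignment


# Hypothetical profiles over three samples: two rising genes and two falling genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [1.1, 2.1, 2.9],
    "geneD": [2.9, 2.0, 1.2],
}

clusters = kmeans(profiles, k=2)
print(clusters)
```

Unlike hierarchical clustering, the result is a flat list of groups rather than a tree, and the answer depends on the number of clusters you ask for.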