PROCESS-4/5-2005
Advanced Tools for Genetic Data Management and Analysis

This article presents a genetic analysis system that provides a powerful set of research tools for manipulating, organizing, and analyzing genetic information, including genotypes, phenotypes and pedigrees. The software module reduces the time required to manage, mine, query and format data using visualizations and graphic functions that allow interactive exploration and interpretation of data.

12 October 2005

One of the main goals of genetic analysis is to gain insight into the potential of a gene as a marker for disease diagnosis or a target for therapeutic development. In pursuit of these ends, researchers have generated a tremendous wealth of information about the structure and function of genomes. The next step is to make efficient use of the data in association studies, linkage studies and pharmacogenomic applications. This article presents a solution that simplifies the interpretation of genotype and phenotype information for subsequent association and linkage studies: the CEQ 8800 visualize software module, which provides a powerful set of tools for manipulating, organizing and analyzing genetic information, including genotypes, phenotypes and pedigrees.

Relational Database

The visualize data management module keeps track of all the genotypes and phenotypes (clinical information) of each subject in the database. Each subject ID is a unique identifier that becomes the single point of convergence for all the information from that subject. When new clinical information on subjects becomes available, it can be imported using the subject ID as a reference point or entered on a case-by-case basis.

Data Queries

Subjects may be queried using a sophisticated set of query tools that return data in tabular form or in various graphical formats, such as histograms, 2-D scatter plots, 3-D scatter plots, color maps and bi-variate 3-D histograms. However, queries are not restricted to subjects. They can also be configured to search data sources or study folders for kindreds, markers, maps, allele frequency sources or study variables. Within any of the subject-based queries, phenotypes and genotypes can be selectively displayed. The visualizations offer advanced, interactive ways to look at data, enabling the exploration of phenotypic and genetic patterns and enhanced interpretation of complex data sets.

Using “mapped” imports, a wide variety of data types (table 1) may be imported into the visualize data management module by simply defining the types of data found in the columns of a *.csv file. This includes genotypes that may have been derived from hardware platforms other than the CEQ. All genotypes and phenotypes reference back to specific subjects. Thus a phenotype table will normally have the subject ID in the first column, a study variable name in the second column, the phenotype in the third column, and so on. To place multiple study variables or markers in the same *.csv table, the subject IDs can be repeated, changing only the values in subsequent columns. Alternatively, independent *.csv files with the same subjects can be created and imported sequentially.
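
The table layout described above can be sketched in a few lines of Python. This is only an illustration of the "subject ID, study variable, phenotype" layout with repeated subject IDs. The column headers, subject IDs and values are invented for the example, and the code says nothing about how the visualize module itself reads *.csv files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical phenotype table: subject ID, study variable, phenotype.
# Subject IDs repeat so that several study variables can share one file.
PHENOTYPE_CSV = """subject_id,study_variable,phenotype
S001,affection_status,affected
S001,age_at_onset,42
S002,affection_status,unaffected
S003,affection_status,affected
S003,age_at_onset,37
"""


def load_phenotypes(text):
    """Group phenotype rows by subject ID, the key all data refers back to."""
    by_subject = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        by_subject[row["subject_id"]][row["study_variable"]] = row["phenotype"]
    return dict(by_subject)


phenotypes = load_phenotypes(PHENOTYPE_CSV)
for subject_id, values in sorted(phenotypes.items()):
    print(subject_id, values)
```

Values are kept as strings here. A real import would also have to decide how each study variable is typed (categorical, numeric, affection status).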
If not already present, the subjects, markers and study variables (the traits to which phenotypes are assigned) may also be imported. Each data type has a specific minimal set of required elements, the components of which are listed in table 1. Before the phenotypes can be imported, the study variable names must be present. Similarly, before the genotypes can be imported, the marker names must be present. Subjects must be present in the database before the subjects’ genotypes or phenotypes can be imported. A single *.csv file can, for example, be referenced to import both the list of subjects and the phenotypes for two study variables. When a separate import is created for each data type, the order of import is important. Within a single saved import configuration, however, it is not: if all the necessary precursors are present in the saved import file, the visualize data manager will import the data in the correct sequence. Note that multiple import lines can independently reference the same *.csv file.

Pheno- and Genotype Visualizations

The visualize data management module has a number of convenient graphical tools for visualizing the distribution of subjects with different phenotypes. Examples of each type of visualization are included below. The first ones are frequency-dependent visualizations:

- Histogram: A standard histogram plots the number of individuals that represent each characteristic as a bar chart.
- Color map: Color maps plot the number of individuals that represent each characteristic as a color-coded, 2-D bar chart, where the X- and Y-axes represent two variables or classes of subjects, and a color spectrum represents the number of subjects at the intersection of each variable (figure 1).
- Bi-variate histogram: A bi-variate histogram is a 3-D variation of the color map. It plots the number of individuals that represent each characteristic as a 3-D bar chart, where the X- and Y-axes represent two variables or classes of subjects. In addition to a color spectrum, the Z-axis height is proportional to the number of subjects in each class (figure 2).
- Genotype/phenotype histogram: This histogram allows users to plot the frequency distributions of a set of selected genotypes or phenotypes side by side in the same plot.

Population Summary Statistics

The CEQ 8800 supports both family- and population-based studies, as well as the calculation of allele and genotype frequencies. Additionally, it checks for Hardy-Weinberg equilibrium and performs chi-squared analysis. Inspection of the statistics allows the user to determine whether certain alleles are present in unusually low or high fractions in the population, and whether certain genotypes are under- or overrepresented.

- Allele frequency distributions: The relative frequencies of all alleles detected in the population.
- Genotype frequency distributions: The relative frequencies of all genotypes detected in the population.
- 2-D scatter plot: Select any combination of two variables or markers and plot these in two dimensions. A unique feature of the 2-D scatter plot is the ability to select clusters of subjects and place them in separate folders based on their positions in the plot.
- 3-D scatter plot: Select any combination of three variables or markers and plot these in three dimensions. A unique feature of the 3-D scatter plot is the ability to rotate the plot in three dimensions in order to best view the distribution of the subjects.
- Interactive pedigree viewer: The CEQ 8800 allows users to view, edit and print pedigrees. The customizable pedigree viewer shows individual data in a highly specific, detailed manner, with any number of genotype and phenotype features. Complex pedigrees, such as those with consanguineous relationships and twins, are appropriately represented. Once opened, pedigrees can be viewed at several different levels of magnification, panned, and labeled with color or textual information (“adornments”).
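
The allele frequency, Hardy-Weinberg and chi-squared checks mentioned under Population Summary Statistics can be illustrated with the textbook calculation for a single bi-allelic marker. The sketch below is not taken from the CEQ 8800 software. The function name, the allele labels A and B and the genotype counts are assumptions made for the example, and the p-value uses the closed form for a chi-squared distribution with one degree of freedom.

```python
import math


def hardy_weinberg_test(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions for one bi-allelic marker.

    Takes the observed genotype counts (AA, AB, BB) and returns the allele
    frequencies, the expected counts, the chi-squared statistic and a p-value.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped subjects")

    # Allele frequencies: every subject contributes two alleles.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    observed = (n_aa, n_ab, n_bb)
    # Expected genotype counts under Hardy-Weinberg: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Classes with an expected count of zero (monomorphic marker) are skipped;
    # the test is not informative in that case.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return {"p": p, "q": q, "expected": expected, "chi2": chi2, "p_value": p_value}


# Hypothetical genotype counts for 1000 subjects.
result = hardy_weinberg_test(298, 489, 213)
print(f"p = {result['p']:.4f}, q = {result['q']:.4f}")
print(f"chi2 = {result['chi2']:.3f}, p-value = {result['p_value']:.3f}")
```

A small p-value suggests that some genotypes are under- or overrepresented relative to Hardy-Weinberg expectations. Markers with more than two alleles need more genotype classes and a different number of degrees of freedom.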
To color code the subjects of a pedigree, affection status variables can be used. From the adornments tab on the left of the pedigree view, select an affection status, then select how many quadrants of the pedigree symbols to fill and, finally, select the colors to use for affected, unaffected and unknown subjects. The new code will be displayed in diamond-shaped symbols. Other study variables and genotypes can be displayed by selecting the variable or marker names from the same drop-down box. Note: Phenotypes that have been imported but not yet associated with study variables will not display properly on the pedigree.

Analysis Format Preparation

Data is conveniently formatted for export to standard genetic analysis packages such as Linkage, CRI-Map, SimWalk and GeneHunter. Data can also be exported from several interfaces in *.csv format. Beyond simply creating the files for analysis, the visualize data management module will execute the analyses and display the results. The process has four stages:

- Stage 1 – analysis setup: To view the analysis setup screen, create a new analysis in the study of interest. In the following screen, specify new names for the analysis and the run. After that, select the target format. There are several drop-down boxes for selecting parameters that will be used by the linkage analysis software packages. In most cases, parameters can be left at their default settings.
- Stage 2 – creating the formatted files: When all required data is selected in this fashion, save the configuration.
- Stage 3 – reviewing the formatted files: Check the input files and formatting summary. The latter will display warning and error messages after formatting the data for the third-party software programs.
- Stage 4 – running the analysis: If the analysis will take longer than a few minutes, it is useful to check the “Run in Background” checkbox so that other tasks may be executed in the interim. To check on the progress of the analysis, check the task manager for completion of the run.

Each analysis will generate one or more output files. If more than one file is generated, a drop-down box will be available in the results tab. The contents of each result file can be viewed below it. The files can be exported or printed as necessary.
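
To give an idea of what formatting for a third-party package involves, the sketch below writes subjects as lines in the commonly described pre-MAKEPED LINKAGE pedigree layout: pedigree ID, individual ID, father, mother, sex, affection status, then two alleles per marker. The article does not specify the files the visualize module produces, so the `Subject` class, the field names and the sample family are hypothetical, and the coding conventions (1/2 for sex, 0/1/2 for affection status, 0 for unknown parents or alleles) are stated here as assumptions.

```python
from dataclasses import dataclass, field

# Assumed LINKAGE-style codes; check them against the target package's manual.
SEX_CODES = {"male": 1, "female": 2}
AFFECTION_CODES = {"unknown": 0, "unaffected": 1, "affected": 2}


@dataclass
class Subject:
    """Hypothetical record combining pedigree, phenotype and genotype data."""
    kindred: str
    subject_id: str
    father: str          # "0" for founders
    mother: str          # "0" for founders
    sex: str             # "male" or "female"
    affection: str       # "affected", "unaffected" or "unknown"
    genotypes: list = field(default_factory=list)  # (allele1, allele2) per marker


def to_pedigree_line(subject):
    """Format one subject as a whitespace-separated pedigree file line."""
    fields = [
        subject.kindred,
        subject.subject_id,
        subject.father,
        subject.mother,
        str(SEX_CODES[subject.sex]),
        str(AFFECTION_CODES[subject.affection]),
    ]
    for allele1, allele2 in subject.genotypes:
        fields += [str(allele1), str(allele2)]
    return " ".join(fields)


# Hypothetical nuclear family typed at two markers (0 = allele not typed).
family = [
    Subject("1", "1", "0", "0", "male", "unaffected", [(1, 2), (3, 3)]),
    Subject("1", "2", "0", "0", "female", "affected", [(2, 2), (1, 3)]),
    Subject("1", "3", "1", "2", "male", "affected", [(1, 2), (0, 0)]),
]
for member in family:
    print(to_pedigree_line(member))
```

Locus descriptions, allele frequencies and map distances go in a separate parameter file, which is omitted here.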