Abstract: DNA microarray technology concurrently monitors the expression levels of thousands of genes during significant biological processes and across related samples. A better understanding of functional genomics is obtained by extracting the patterns hidden in gene expression data. This is handled by clustering, which reveals natural structures and identifies interesting patterns in the underlying data. In the proposed work, gene expression data are clustered using an Advanced Nelder Mead (ANM) algorithm. The Nelder Mead (NM) method is designed for optimization. In the Nelder Mead method, the vertices of a triangle are considered as candidate solutions, and several operations are performed on this triangle to obtain a better result. In the proposed work, the reflection and expansion operations are eliminated and a new operation called spread-out is introduced. The spread-out operation enlarges the global search area and thus provides a better optimization result. It yields three points, and the best of these three is used to replace the worst point. The experimental results are analyzed with optimization benchmark test functions and gene expression benchmark datasets. The results show that ANM outperforms NM on both benchmarks.
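For context, the classical reflection step that ANM removes can be sketched as follows. This is a minimal illustration of the standard Nelder Mead reflection on a triangle with a simplified acceptance rule, not the spread-out operation, which the abstract does not describe in enough detail to reproduce. The function names, the sphere test function and the starting triangle are assumptions made for illustration.

```python
def centroid(points):
    """Arithmetic mean of a list of equally sized coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def reflect_worst(simplex, f, alpha=1.0):
    """One simplified reflection step of the classical Nelder Mead method.

    The worst vertex is mirrored through the centroid of the remaining
    vertices and replaced if the mirrored point has a lower cost.
    (Full NM uses finer acceptance rules; this is an assumption-laden sketch.)
    """
    ordered = sorted(simplex, key=f)            # best first, worst last
    worst = ordered[-1]
    c = centroid(ordered[:-1])                  # centroid without the worst vertex
    reflected = tuple(ci + alpha * (ci - wi) for ci, wi in zip(c, worst))
    if f(reflected) < f(worst):
        ordered[-1] = reflected
    return sorted(ordered, key=f)


def sphere(p):
    """Hypothetical benchmark function: sum of squares, minimum at the origin."""
    return sum(x * x for x in p)


triangle = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
new_triangle = reflect_worst(triangle, sphere)
print(new_triangle)
```

In this toy run the worst vertex (1.0, 2.0) is mirrored to (2.0, 0.0), which has a lower cost and therefore enters the triangle.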
Abstract: Mammalian genomes contain a large number of
retroelements (SINEs, LINEs and LTRs) which could affect
expression of protein coding genes through associated transcription
factor binding sites (TFBS). Activity of the retroelement-associated
TFBS in many genes is confirmed experimentally but their global
functional impact remains unclear. Human SINEs (Alu repeats) and
mouse SINEs (B1 and B2 repeats) are known to be clustered in GC-rich,
gene-rich genome segments, consistent with the view that they
can contribute to regulation of gene expression. We have shown
earlier that Alu are involved in formation of cis-regulatory modules
(clusters of TFBS) in human promoters, and other authors reported
that Alu located near promoter CpG islands have an increased
frequency of CpG dinucleotides suggesting that these Alu are
undermethylated. Human Alu and mouse B1/B2 elements have an
internal bipartite promoter for RNA polymerase III containing a
conserved sequence motif called the B-box, which can bind the basal
transcription complex TFIIIC. It has recently been shown that TFIIIC
binding to the B-box leads to formation of a boundary which limits the
spread of repressive chromatin modifications in S. pombe. SINE-associated
B-boxes may have a similar function, but conservation of
TFIIIC binding sites in SINEs located near mammalian promoters
has not been studied previously. Here we analysed the abundance and
distribution of retroelements (SINEs, LINEs and LTRs) in annotated
sequences of the Database of mammalian transcription start sites
(DBTSS). The fractions of SINEs in human and mouse promoters are
slightly lower than in the whole genome, but >40% of human and mouse
promoters contain Alu or B1/B2 elements within the -1000 to +200 bp
interval relative to the transcription start site (TSS). Most of these SINEs
are associated with distal segments of promoters (-1000 to -200 bp
relative to the TSS), indicating that their insertion at distances >200 bp
upstream of the TSS is tolerated during evolution. The distribution of SINEs
in promoters correlates negatively with the distribution of CpG
sequences. Analysis of the abundance of 12-mer motifs from the
B1 and Alu consensus sequences in the genome and in DBTSS
confirmed that some subsegments of Alu and B1 elements are poorly
conserved, which depends in part on the presence of CpG
dinucleotides. One of these CpG-containing subsegments in B1
elements overlaps with the SINE-associated B-box, and it shows better
conservation in DBTSS than in genomic sequences. We also
studied the conservation, in DBTSS and in the genome, of the
B-box-containing segments of old (AluJ, AluS) and young (AluY) Alu
repeats and found that the CpG sequence of the B-box of old Alu is
better conserved in DBTSS than in the genome. This indicates that
B-box-associated CpGs in promoters are better protected from
methylation and mutation than B-box-associated CpGs in genomic
SINEs. These results are consistent with the view that potential
TFIIIC binding motifs in SINEs associated with human and mouse
promoters may be functionally important. These motifs may protect
promoters from repressive histone modifications which spread from
adjacent sequences. This can potentially explain the well-known
clustering of SINEs in GC-rich, gene-rich genome compartments and
the existence of unmethylated CpG islands.
Abstract: Inferring the network structure from time series data
is a hard problem, especially if the time series is short and noisy.
DNA microarrays are a technology for monitoring the mRNA
concentration of thousands of genes simultaneously, and they produce
data with these characteristics. In this study we investigate the
influence of the experimental design on the quality of the result.
More precisely, we investigate the influence of two different types of
random single gene perturbations on the inference of genetic networks
from time series data. To obtain an objective quality measure for
this influence we simulate gene expression values with a biologically
plausible model of a known network structure. Within this framework
we study the influence of single gene knock-outs, as opposed to
linearly controlled expression of single genes, on the quality of the
inferred network structure.
Abstract: Among all microRNAs (miRNAs) in the 12 plant species investigated in this study, only miR398 targeted the copper chaperone for superoxide dismutase (CCS). The nucleotide sequences of the miRNA binding sites were located in the mRNA protein-coding sequence (CDS) and were highly homologous. These binding sites in CCS mRNA encoded a conserved GDLGTL hexapeptide. The binding sites for miR398 in the CDS of superoxide dismutase 1 mRNA encoded the GDLGN pentapeptide. The conserved miR398 binding site located in the CDS of superoxide dismutase 2 mRNA encoded the GDLGNI hexapeptide. The miR398 binding site in the CDS of superoxide dismutase 3 mRNA encoded the GDLGNI or GDLGNV hexapeptide. Gene expression of the entire superoxide dismutase family in the studied plant species was regulated only by miR398. All members of the miR398 family, i.e. miR398a, b and c, bound to a single site in each CuZnSOD and chaperone mRNA.
Abstract: MiRNAs participate in the regulation of gene expression at the level of translation.
Some studies have investigated the interactions between genes and
intragenic miRNAs. It is important to study the miRNA binding sites
of genes involved in carcinogenesis. RNAHybrid 2.1 and ERNAhybrid
programmes were used to compute the hybridization free
energy of miRNA binding sites. In the 54 mRNAs studied, 22.6%, 37.7%,
and 39.7% of the miRNA binding sites were located in the 5'UTRs,
CDSs, and 3'UTRs, respectively. The density of binding sites for
miRNAs in the 5'UTR was 1.6 to 43.2 times greater than in the CDS and
1.8 to 8.0 times greater than in the 3'UTR. Three
types of miRNA interactions with mRNAs have been revealed: 5'-
dominant canonical, 3'-compensatory, and complementary binding
sites. MiRNAs regulate gene expression, and information on the
interactions between miRNAs and mRNAs could be useful in
molecular medicine. We recommend that newly described sites
undergo validation by experimental investigation.
Abstract: Analysis and visualization of microarray data is very helpful for biologists and clinicians in the diagnosis and treatment of patients. It allows clinicians to better understand the structure of microarray data and facilitates understanding of gene expression in cells. However, a microarray dataset is complex, with thousands of features and a very small number of observations. Such very high dimensional data often contain noise, non-useful information and only a small number of features relevant to a disease or genotype. This paper proposes a non-linear dimensionality reduction algorithm, Local Principal Component (LPC), which aims to map high dimensional data to a lower dimensional space. The reduced data represent the most important variables underlying the original data. Experimental results and comparisons are presented to show the quality of the proposed algorithm. Moreover, the experiments show how this algorithm reduces high dimensional data whilst preserving, in the low dimensional space, the neighbourhoods that the points have in the high dimensional space.
Abstract: DNA microarray technology is widely used by
geneticists to diagnose or treat diseases through gene expression.
This technology is based on the hybridization of a tissue's DNA
sequence onto a substrate and the further analysis of the image
formed by the thousands of genes in the DNA as green, red or yellow
spots. The process of DNA microarray image analysis involves
finding the location of the spots and quantifying their
expression levels. In this paper, a tool to perform DNA
microarray image analysis is presented, including a spot addressing
method based on the image projections, the spot segmentation
through contour-based segmentation, and the extraction of relevant
information about gene expression.
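The idea of spot addressing by image projections can be illustrated with a small sketch. It assumes that the image is a grid of intensities, that the background is exactly zero, and that spots sit on a regular grid; the abstract does not give the tool's actual procedure, so the function names, threshold and toy image below are hypothetical.

```python
def projections(image):
    """Sum intensities along rows and along columns of a 2-D intensity grid."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile


def bands(profile, threshold):
    """Return (start, end) index pairs, end inclusive, where profile > threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                       # a band of spots begins
        elif value <= threshold and start is not None:
            runs.append((start, i - 1))     # the band ended on the previous index
            start = None
    if start is not None:                   # band runs to the image border
        runs.append((start, len(profile) - 1))
    return runs


# Toy image: two rows of spots and two columns of spots on a zero background.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 8, 0],
    [0, 9, 9, 0, 0, 8, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 7, 7, 0, 0, 6, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
row_profile, col_profile = projections(image)
row_bands = bands(row_profile, threshold=0)
col_bands = bands(col_profile, threshold=0)
print(row_bands, col_bands)
```

Each pairing of a row band with a column band gives a candidate cell of the grid in which a spot can then be segmented. Real images would need a noise-aware threshold rather than zero.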
Abstract: Biclustering is a very useful data mining technique for
identifying patterns where different genes are co-related based on a
subset of conditions in gene expression analysis. Association rule
mining is an efficient approach to biclustering, as in the
BIMODULE algorithm, but it is sensitive to the values given to its
input parameters and to the discretization procedure used in the
preprocessing step. Moreover, when noise is present, classical association
rule miners discover multiple small fragments of the true bicluster
but miss the true bicluster itself. This paper formally presents a
generalized noise tolerant bicluster model, termed as μBicluster. An
iterative algorithm termed as BIDENS based on the proposed model
is introduced that can discover a set of k possibly overlapping
biclusters simultaneously. Our model uses a more flexible method to
partition the dimensions to preserve meaningful and significant
biclusters. The proposed algorithm can discover biclusters that
are hard for BIMODULE to discover. An experimental study on yeast,
human gene expression data and several artificial datasets shows that
our algorithm offers substantial improvements over several
previously proposed biclustering algorithms.
Abstract: Tumor cells have an invasive and metastatic phenotype
that is the main cause of death for cancer patients. Tumor
establishment and penetration consist of a series of complex
processes involving multiple changes in gene expression. In this study,
intraperitoneal administration of a high concentration of ascorbic acid
inhibited tumor establishment and decreased tumor mass in BALB/C
mice implanted with S-180 sarcoma cancer cells. To identify proteins
involved in the ascorbic acid-mediated inhibition of tumor
progression, changes in the tumor proteome associated with ascorbic
acid treatment of BALB/C mice implanted with S-180 were
investigated using two-dimensional gel electrophoresis and mass
spectrometry. Twenty protein spots were identified whose expression
was different between control and ascorbic acid treatment groups.
Abstract: Microarrays have become effective, broadly used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and organizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we describe and compare the performance of several clustering methods under different data preprocessing strategies, including normalization and noise removal. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. We find that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise in the datasets.
Abstract: Microarray techniques allow the simultaneous measurement of the expression levels of thousands of mRNAs. By mining these data one can identify the dynamics of gene expression time series. Using principal component analysis (PCA), we uncover the circadian rhythmic patterns underlying the gene expression profiles of the cyanobacterium Synechocystis. We applied PCA to reduce the dimensionality of the data set. Examination of the components also provides insight into the underlying factors measured in the experiments. Our results suggest that all rhythmic content of the data can be reduced to three main components.
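How a principal component summarizes rhythmic expression profiles can be illustrated on synthetic data. The sketch below is not the authors' analysis: the five "genes", their sine and cosine amplitudes and the eight time points are invented for illustration, and power iteration is used in place of a full eigendecomposition to keep the example dependency-free.

```python
import math


def covariance(rows):
    """Sample covariance between time points; each row is one gene's profile."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=100):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]    # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue, v


# Synthetic profiles: each gene mixes one sine and one cosine rhythm (assumed).
times = range(8)
sine = [math.sin(2 * math.pi * t / 8) for t in times]
cosine = [math.cos(2 * math.pi * t / 8) for t in times]
amp_sin = [3.0, -2.0, 1.0, -3.0, 2.0]
amp_cos = [0.5, 0.5, -0.5, -0.5, 0.0]
profiles = [[a * s + b * c for s, c in zip(sine, cosine)]
            for a, b in zip(amp_sin, amp_cos)]

cov = covariance(profiles)
eigenvalue, component = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print(f"first component explains {explained:.1%} of the variance")
```

Because the toy profiles are built from only two rhythms, the first component captures most of the variance and a second one would capture the rest. On real data the number of components worth keeping has to be judged from the eigenvalue spectrum.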
Abstract: Most biclustering/projected clustering algorithms are based either on the Euclidean distance or on the correlation coefficient, which capture only linear relationships. However, in many applications, such as gene expression data and word-document data, nonlinear relationships may exist between the objects. Mutual information between two variables provides a more general criterion for investigating dependencies amongst variables. In this paper, we improve upon our previous algorithm that uses mutual information for biclustering, both in computation time and in the types of clusters identified. The algorithm is able to find biclusters with mixed relationships and is faster than the previous one. To the best of our knowledge, none of the other existing algorithms for biclustering has used mutual information as a similarity measure. We present experimental results on synthetic data as well as on the yeast expression data. Biclusters on the yeast data were found to be biologically and statistically significant using GO Tool Box and FuncAssociate.
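The difference between correlation and mutual information can be seen on a small worked example. The sketch below is not the authors' biclustering algorithm; it only compares the two measures on already discretized toy values, with a quadratic dependence chosen as an assumption so that the linear correlation vanishes.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mutual_information(xs, ys):
    """Mutual information in bits, estimated from counts of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# y depends on x deterministically but not linearly (toy data).
levels = [-2, -1, 0, 1, 2] * 4
response = [x * x for x in levels]

r = pearson(levels, response)
mi = mutual_information(levels, response)
print(f"correlation = {r:.3f}, mutual information = {mi:.3f} bits")
```

Here the correlation is zero even though the second variable is fully determined by the first, while the mutual information is clearly positive. Continuous expression values would first need a discretization or a density estimate, which this sketch leaves out.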
Abstract: Tumor classification is a key area of research in the
field of bioinformatics. Microarray technology is commonly used in
the study of disease diagnosis using gene expression levels. The
main drawback of gene expression data is that they contain thousands
of genes but very few samples. Feature selection methods are used
to select the informative genes from the microarray. These methods
considerably improve the classification accuracy. In the proposed
method, Genetic Algorithm (GA) is used for effective feature
selection. Informative genes are identified based on the T-Statistics,
Signal-to-Noise Ratio (SNR) and F-Test values. The initial candidate
solutions of GA are obtained from top-m informative genes. The
classification accuracy of k-Nearest Neighbor (kNN) method is used
as the fitness function for GA. In this work, kNN and Support Vector
Machine (SVM) are used as the classifiers. The experimental results
show that the proposed work is suitable for effective feature
selection. With the help of the selected genes, the GA-kNN method
achieves 100% accuracy on 4 datasets and the GA-SVM method
achieves it on 5 out of 10 datasets. The GA combined with kNN and SVM
is demonstrated to be an accurate approach for microarray-based
tumor classification.
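Ranking genes before seeding the GA can be sketched for one of the named criteria. The abstract does not give its SNR formula, so the sketch assumes the common two-class definition (difference of class means divided by the sum of class standard deviations); the gene names and expression values are invented.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Assumed two-class SNR: (mean_a - mean_b) / (stdev_a + stdev_b)."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def top_m_genes(expression, m):
    """Rank genes by absolute SNR and return the m best plus all scores."""
    scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return ranked[:m], scores


# Toy data: per gene, expression values in class A and in class B.
expression = {
    "g0": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),        # strongly separated
    "g1": ([5.0, 5.5, 4.5], [5.0, 6.0, 4.0]),        # no separation
    "g2": ([10.0, 12.0, 14.0], [9.0, 11.0, 13.0]),   # weak separation
}
selected, scores = top_m_genes(expression, m=2)
print(selected, scores)
```

The top-m list produced this way would serve as the pool from which initial GA candidates are drawn; the T-statistic and F-test rankings mentioned in the abstract could be plugged in through the same interface.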
Abstract: Biochemical and molecular analysis of some
antioxidant enzyme genes revealed different levels of gene expression
in oilseed rape (Brassica napus). For molecular and biochemical
analysis, leaf tissues were harvested from plants at eight different
developmental stages, from young to senescent. The levels of total
protein and chlorophyll increased during the maturity stages of the
plant and decreased during the last stages of plant
growth. Structural analysis (nucleotide and deduced amino acid
sequences, and phylogenetic tree) of a complementary DNA revealed a
high level of similarity within a family of Catalase genes. The
expression of the genes encoding different Catalase isoforms was
assessed during different plant growth phases. No significant
difference between samples was observed when Catalase activity
was statistically analyzed at different developmental stages. EST
analysis showed different transcript levels for a number of other
relevant antioxidant genes (different isoforms of SOD and
glutathione). The high level of transcription of these genes at
senescence stages indicated that they are senescence-induced
genes.
Abstract: Since dealing with high dimensional data is
computationally complex and sometimes even intractable,
several feature reduction methods have recently been developed to reduce
the dimensionality of the data in order to simplify the
analysis in various applications such as text categorization, signal
processing, image retrieval and gene expression analysis. Among feature
reduction techniques, feature selection is one of the most popular
methods because it preserves the original features.
In this paper, we propose a new unsupervised feature selection
method which will remove redundant features from the original
feature space by the use of probability density functions of various
features. To show the effectiveness of the proposed method, popular
feature selection methods have been implemented and compared.
Experimental results on several datasets from the UCI
repository illustrate the effectiveness of the proposed
method in comparison with the other methods in terms of
both classification accuracy and the number of selected features.
Abstract: Serial Analysis of Gene Expression is a powerful
quantification technique for generating cell or tissue gene expression
data. The profile of the gene expression of cell or tissue in several
different states is difficult for biologists to analyze because of the large
number of genes typically involved. However, feature selection in
machine learning can mitigate this problem. It
reduces the features (genes) in specific SAGE data to
only the relevant genes. In this study, we used a genetic
algorithm to implement feature selection, and evaluated the
classification accuracy of the selected features with the K-nearest
neighbor method. In order to validate the proposed method, we used
two SAGE data sets for testing. The results of this study
show that the number of features of the original SAGE data set can be
significantly reduced while higher classification accuracy is
achieved.
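The fitness evaluation inside such a wrapper can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: it scores one candidate feature mask by leave-one-out K-nearest neighbor accuracy on toy data, and it omits the genetic operators (selection, crossover, mutation) entirely.

```python
from collections import Counter


def loo_knn_accuracy(samples, labels, mask, k=1):
    """Leave-one-out kNN accuracy using only the features switched on in mask."""
    kept = [i for i, keep in enumerate(mask) if keep]
    correct = 0
    for q, query in enumerate(samples):
        # Squared Euclidean distance to every other sample, on kept features only.
        neighbours = sorted(
            (sum((query[i] - other[i]) ** 2 for i in kept), labels[j])
            for j, other in enumerate(samples) if j != q
        )
        votes = Counter(label for _, label in neighbours[:k])
        if votes.most_common(1)[0][0] == labels[q]:
            correct += 1
    return correct / len(samples)


# Toy data: feature 0 separates the classes, feature 1 is misleading.
samples = [(0.0, 5.0), (0.1, 0.0), (1.0, 5.1), (1.1, 0.1)]
labels = ["a", "a", "b", "b"]

fitness_informative = loo_knn_accuracy(samples, labels, mask=(1, 0))
fitness_misleading = loo_knn_accuracy(samples, labels, mask=(0, 1))
fitness_all = loo_knn_accuracy(samples, labels, mask=(1, 1))
print(fitness_informative, fitness_misleading, fitness_all)
```

A genetic algorithm would evolve a population of such masks and keep those with the highest fitness. In the toy data, dropping the misleading feature raises the accuracy, which mirrors the claim that fewer features can classify better.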
Abstract: An evolutionary method whose selection and recombination
operations are based on generalization error-bounds of
support vector machine (SVM) can select a subset of potentially
informative genes for SVM classifier very efficiently [7]. In this
paper, we will use the derivative of error-bound (first-order criteria)
to select and recombine gene features in the evolutionary process,
and compare the performance of the derivative of error-bound with
the error-bound itself (zero-order) in the evolutionary process. We
also investigate several error-bounds and their derivatives to compare
the performance, and find the best criteria for gene selection
and classification. We use 7 cancer-related human gene expression
datasets to evaluate the performance of the zero-order and first-order
criteria of error-bounds. Although both criteria follow the same strategy
in theory, the experimental results identify the best criterion
for microarray gene expression data.
Abstract: The goal of gene expression analysis is to understand the processes that underlie the regulatory networks and pathways controlling inter-cellular and intra-cellular activities. Microarray datasets are now used extensively for this purpose. The scope of such analysis has recently broadened towards the reconstruction of gene networks and other holistic approaches of systems biology. Evolutionary methods are proving to be successful in such problems and a number of such methods have been proposed. However, all these methods are based on processing genotypic information. There is therefore a need to develop evolutionary methods that address phenotypic interactions together with genotypic interactions. We present a novel evolutionary approach, called the Phenomic algorithm, wherein the focus is on phenotypic interaction. We use the expression profiles of genes to model the interactions between them at the phenotypic level. We apply this algorithm to the yeast sporulation dataset and show that the algorithm can identify gene networks with relative ease.
Abstract: Yeast cells live in a constantly changing environment that requires the continuous adaptation of their genomic program in order to sustain their homeostasis, survive and proliferate. Due to the advancement of high throughput technologies, there is currently a large amount of data, such as gene expression, gene deletion and protein-protein interaction data, for S. cerevisiae under various environmental conditions. Mining these datasets requires efficient computational methods capable of integrating different types of data, identifying inter-relations between different components and inferring functional groups or 'modules' that shape intracellular processes. This study uses computational methods to delineate some of the mechanisms used by yeast cells to respond to environmental changes. The GRAM algorithm is first used to integrate gene expression data and ChIP-chip data in order to find modules of co-expressed and co-regulated genes as well as the transcription factors (TFs) that regulate these modules. Since transcription factors are themselves transcriptionally regulated, a three-layer regulatory cascade consisting of the TF-regulators, the TFs and the regulated modules is subsequently considered. This three-layer cascade is then modeled quantitatively using artificial neural networks (ANNs), where the input layer corresponds to the expression of the up-stream transcription factors (TF-regulators) and the output layer corresponds to the expression of genes within each module. This work (a) shows that the expression of at least 33 genes over time and for different stress conditions is well predicted by the expression of the top layer transcription factors, including cases in which the effect of up-stream regulators is shifted in time, and (b) identifies at least 6 novel regulatory interactions that were not previously associated with stress-induced changes in gene expression.
These findings suggest that the combination of gene expression and protein-DNA interaction data with artificial neural networks can successfully model biological pathways and capture quantitative dependencies between distant regulators and downstream genes.
Abstract: MicroRNAs are an important class of gene expression
regulators that are involved in many biological processes including
embryogenesis. miR-125b is a conserved microRNA that is enriched
in the nervous system. We have previously reported the function of
miR-125b in neuronal differentiation of human cell lines. We also
discovered the function of miR-125b in regulating p53 in human and
zebrafish. Here we further characterize the brain defects in zebrafish
embryos injected with morpholinos against miR-125b. Our data
confirm the essential role of miR-125b in brain morphogenesis
particularly in maintaining the balance between proliferation, cell
death and differentiation. We identified lunatic fringe (lfng) as an
additional target of miR-125b in human and zebrafish and suggest
that lfng may mediate the function of miR-125b in neurogenesis.
Together, this report reveals new insights into the function of
miR-125b during neural development of zebrafish.