Abstract: Analysis of the human microbiome using metagenomic
sequencing data has demonstrated strong discriminative power for
various human diseases. Raw metagenomic sequencing data require
multiple complex and computationally heavy bioinformatics steps
prior to analysis. Such data contain millions of short sequences
(reads) obtained from fragmented DNA and stored as FASTQ files.
Conventional processing pipelines consist of multiple steps, including
quality control, filtering, and alignment of sequences against genomic
catalogs (genes, species, taxonomic levels, functional pathways,
etc.). These pipelines are complex to use, time-consuming, and
rely on a large number of parameters that often introduce variability
and impact the estimation of the microbiome elements. Training
Deep Neural Networks directly from raw sequencing data is a
promising approach to bypass some of the challenges associated with
mainstream bioinformatics pipelines. Most of these methods use the
concept of word and sentence embeddings that create a meaningful
and numerical representation of DNA sequences, while extracting
features and reducing the dimensionality of the data. In this paper
we present an end-to-end approach that classifies patients into disease
groups directly from raw metagenomic reads: metagenome2vec. This
approach is composed of four steps (i) generating a vocabulary of
k-mers and learning their numerical embeddings; (ii) learning DNA
sequence (read) embeddings; (iii) identifying the genome from which
the sequence is most likely to come and (iv) training a multiple
instance learning classifier which predicts the phenotype based on
the vector representation of the raw data. An attention mechanism
in the network makes the model interpretable by assigning each
genome a weight reflecting its influence on the prediction.
Using two public real-life datasets as well as a simulated one, we
demonstrate that this original approach reaches high performance,
comparable with state-of-the-art methods applied to data processed
through mainstream bioinformatics workflows. These results are
encouraging for this proof-of-concept work. We believe that with
further development, DNN models have the potential to surpass
mainstream bioinformatics workflows in disease classification
tasks.
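Steps (i) and (ii) above can be sketched as follows. This is a minimal, hypothetical illustration: the embedding table is randomly initialized here, whereas metagenome2vec learns the k-mer vectors (e.g. with a word-embedding objective) before averaging them into read representations.

```python
# Minimal sketch of steps (i)-(ii): tokenize reads into k-mers and embed
# each read as the mean of its k-mer vectors. The embedding values below
# are random stand-ins for learned embeddings.
import random

def kmers(read, k=4):
    """Split a DNA read into overlapping k-mers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_vocab(reads, k=4):
    """Assign each observed k-mer an index in the vocabulary."""
    vocab = {}
    for read in reads:
        for km in kmers(read, k):
            vocab.setdefault(km, len(vocab))
    return vocab

def embed_read(read, vocab, table, k=4):
    """Represent a read as the mean of its k-mer embedding vectors."""
    vecs = [table[vocab[km]] for km in kmers(read, k)]
    dim = len(table[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

reads = ["ACGTACGT", "TTGCACGT"]
vocab = build_vocab(reads, k=4)
random.seed(0)
table = [[random.uniform(-1, 1) for _ in range(8)] for _ in vocab]
emb = embed_read(reads[0], vocab, table, k=4)
print(len(vocab), len(emb))
```

The resulting per-read vectors feed the downstream genome identification and multiple instance learning stages.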
Abstract: Health for all is considered a sign of well-being and inclusive growth. New healthcare technologies contribute to the quality of human lives by promoting health education and awareness, leading to the prevention, early diagnosis, and treatment of the symptoms of diseases. Healthcare technologies have now migrated from medical and institutionalized settings to the home and everyday life. This paper explores these new technologies and investigates how they contribute to health education and awareness, promoting the objective of a high-value health system for all. The methodology used for the research is a literature review. The paper also discusses the opportunities and challenges of futuristic healthcare technologies. The combined advances in genomic medicine, wearables, and the IoT, together with enhanced data collection in electronic health record (EHR) systems, environmental sensors, and mobile device applications, can contribute substantially to a high-value health system for all. These technologies promise a reduced total cost of healthcare, a reduced incidence of medical diagnosis errors, and reduced treatment variability. The major barriers to adoption include concerns about the security, privacy, and integrity of healthcare data; regulation and compliance issues; service reliability; interoperability and portability of data; and the user friendliness and convenience of these technologies.
Abstract: Organizations hold structured and unstructured information in different formats, sources, and systems. Part of this information comes from ERP systems under OLTP processing that support the information system; at the OLAP processing level, however, these organizations present deficiencies, partly because there is little interest in extracting knowledge from their data sources, and partly because they lack the operational capabilities to tackle such projects. Data warehouses and their applications are considered non-proprietary tools of great interest to business intelligence, since they are base repositories for creating models or patterns (behavior of customers, suppliers, products, social networks, and genomics) and facilitate corporate decision making and research. This paper presents a simple, structured methodology inspired by agile development models such as Scrum, XP, and AUP. It also draws on object-relational models, spatial data models, and a data-modeling baseline under UML and Big Data, seeking to deliver an agile methodology for developing data warehouses that is simple and easy to apply. The methodology naturally incorporates processes for information analysis, visualization, and data mining, particularly for generating patterns and models derived from the structured fact objects.
Abstract: A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. This challenge is addressed by clustering, which reveals the natural structures and identifies the interesting patterns in the underlying data. In this paper, gene-based clustering of gene expression data is proposed using Cuckoo Search with Differential Evolution (CS-DE). The experimental results are analyzed on gene expression benchmark datasets and show that CS-DE outperforms CS. The clustering results are validated with one internal and one external cluster validation index.
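The differential-evolution ingredient of a hybrid such as CS-DE can be illustrated with the classic DE/rand/1 trial-vector construction below. This is a generic sketch using the standard DE parameter names F (scale factor) and CR (crossover rate), not the paper's exact hybridization with cuckoo search.

```python
# Hedged sketch of a DE/rand/1 mutation with binomial crossover, the
# standard differential-evolution step. The population is imagined here
# as a set of cluster-centroid vectors being optimized.
import random

def de_trial(pop, i, F=0.5, CR=0.9, rng=random):
    """Build a DE trial vector for population member i."""
    # pick three distinct members other than i
    a, b, c = rng.sample([j for j in range(len(pop)) if j != i], 3)
    dim = len(pop[i])
    j_rand = rng.randrange(dim)  # ensure at least one mutated component
    trial = []
    for j in range(dim):
        if rng.random() < CR or j == j_rand:
            trial.append(pop[a][j] + F * (pop[b][j] - pop[c][j]))
        else:
            trial.append(pop[i][j])
    return trial

random.seed(1)
pop = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0]]
print(de_trial(pop, 0))
```

In a CS-DE-style hybrid, trial vectors like this would replace or perturb cuckoo solutions when they improve the clustering objective.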
Abstract: Reverse engineering of a genetic regulatory network involves modeling the given gene expression data into the form of a network. Computationally, it is possible to infer the relationships between genes, the so-called gene regulatory networks (GRNs), which can help develop genomics- and proteomics-based diagnostic approaches for disease. In this paper, a clustering-based method is used to reconstruct a genetic regulatory network from time-series gene expression data. A supercoiled dataset from Escherichia coli is used to demonstrate the proposed method.
Abstract: DNA microarray technology concurrently monitors the expression levels of thousands of genes during significant biological processes and across related samples. A better understanding of functional genomics is obtained by extracting the patterns hidden in gene expression data. This is handled by clustering, which reveals natural structures and identifies interesting patterns in the underlying data. In the proposed work, clustering of gene expression data is done through an Advanced Nelder-Mead (ANM) algorithm. The Nelder-Mead (NM) method is a direct-search optimization method in which the vertices of a triangle are treated as candidate solutions, and a sequence of operations on this triangle yields a better result. In the proposed work, the reflection and expansion operations are eliminated and a new operation called spread-out is introduced. The spread-out operation increases the global search area and thus provides a better result in optimization. It produces three points, the best of which replaces the worst vertex. The experimental results are analyzed on optimization benchmark test functions and gene expression benchmark datasets, and show that ANM outperforms NM on both benchmarks.
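For context, the baseline Nelder-Mead step that ANM modifies can be sketched as below. Only the standard reflection operation is shown; the abstract does not give the formula for the proposed spread-out operation, so it is not reproduced here.

```python
# Minimal sketch of one baseline Nelder-Mead step on a 2-D simplex
# (triangle): reflect the worst vertex through the centroid of the other
# two and keep it if it improves.
def nm_reflect_step(simplex, f, alpha=1.0):
    """One reflection step; simplex is three 2-D points, f the objective."""
    simplex = sorted(simplex, key=f)           # order best..worst
    best, mid, worst = simplex
    centroid = [(best[d] + mid[d]) / 2 for d in range(2)]
    reflected = [centroid[d] + alpha * (centroid[d] - worst[d])
                 for d in range(2)]
    if f(reflected) < f(worst):                # keep the better point
        return [best, mid, reflected]
    return simplex

sphere = lambda p: p[0] ** 2 + p[1] ** 2
tri = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(nm_reflect_step(tri, sphere))  # → [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
```

ANM replaces this reflection/expansion machinery with its spread-out operation to widen the global search.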
Abstract: This study describes a micro device integrated with
multi-chamber for polymerase chain reaction (PCR) with different
annealing temperatures. The device consists of the reaction
polydimethylsiloxane (PDMS) chip, a cover glass chip, and is
equipped with cartridge heaters, fans, and thermocouples for
temperature control. In this prototype, commercial software is used
to determine the geometric and operational parameters that are
responsible for creating the denaturation, annealing, and extension
temperatures within the chip. Two cartridge heaters are placed at two
sides of the chip and maintained at two different temperatures to
achieve a thermal gradient on the chip during the annealing step. The
temperatures on the chip surface are measured with an infrared
imager, and thermocouples inserted into the reaction chambers are
used to obtain the transient temperature profiles of the chambers
during several thermal cycles. The experimental temperatures show
a trend similar to the simulated results. This work should be of
interest to those involved in high-temperature reactions and
genomics or cell analysis.
Abstract: This study describes a capillary-based device
integrated with the heating and cooling modules for polymerase chain
reaction (PCR). The device consists of the reaction
polytetrafluoroethylene (PTFE) capillary, the aluminum blocks, and is
equipped with two cartridge heaters, a thermoelectric (TE) cooler, a
fan, and some thermocouples for temperature control. The cartridge
heaters are placed into the heating blocks and maintained at two
different temperatures to achieve the denaturation and the extension
step. Some thermocouples inserted into the capillary are used to obtain
the transient temperature profiles of the reaction sample during
thermal cycles. A 483-bp DNA template is amplified successfully in
the designed system and the traditional thermal cycler. This work
should be interesting to persons involved in the high-temperature
based reactions and genomics or cell analysis.
Abstract: In recent years, the genomes of more and more
species have been sequenced, providing data for phylogenetic
reconstruction based on genome rearrangement measures. A main task
in all phylogenetic reconstruction algorithms is to solve the median
of three problem. Although this problem is NP-hard even for the
simplest distance measures, there are exact algorithms for the
breakpoint median and the reversal median that are fast enough for
practical use. In this paper, this approach is extended to the
transposition median as well as to the weighted reversal and
transposition median. Although no exact polynomial algorithm is
known even for the pairwise distances, we show that in most cases
these problems can be solved exactly within reasonable time by
using a branch and bound algorithm.
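The pairwise breakpoint distance on which the breakpoint median builds can be sketched for unsigned, linear permutations as follows. Signed and circular genomes, and the branch-and-bound median search itself, require more machinery than this illustration shows.

```python
# Hedged sketch of the breakpoint distance: the number of adjacencies
# in one unsigned linear permutation that are not adjacencies in the
# other (gene order read left to right, orientation ignored).
def adjacencies(perm):
    """Unordered adjacent pairs of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}

def breakpoints(p, q):
    """Number of adjacencies of p that are broken in q."""
    return len(adjacencies(p) - adjacencies(q))

# Swapping genes 2 and 3 breaks two adjacencies of the identity order.
print(breakpoints([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # → 2
```

The breakpoint median of three genomes is then the permutation minimizing the sum of such distances to the three inputs, which the paper's branch-and-bound approach searches for exactly.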
Abstract: Eukaryotic protein-coding genes are interrupted by spliceosomal introns, which are removed from the RNA transcripts before translation into a protein. The exon-intron structures of different eukaryotic species are quite different from each other, and the evolution of such structures raises many questions. We try to address some of these questions using statistical analysis of whole genomes. We go through all the protein-coding genes in a genome and study correlations between the net length of all the exons in a gene, the number of exons, and the average length of an exon. We also take average values of these features for each chromosome and study correlations between those averages at the chromosomal level. Our data show universal features of exon-intron structures common to animals, plants, and protists (specifically, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Cryptococcus neoformans, Homo sapiens, Mus musculus, Oryza sativa, and Plasmodium falciparum). We have verified a linear correlation between the number of exons in a gene and the length of the protein it encodes: protein length increases in proportion to the number of exons. On the other hand, the average length of an exon always decreases as the number of exons grows. Finally, chromosome clustering based on average chromosome properties and on the parameters of a linear regression between the number of exons in a gene and the net length of those exons demonstrates that these average chromosome properties are genome-specific features.
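The per-gene correlations described above are of the kind sketched below: a Pearson correlation and a least-squares slope between exon count and net exon length. The numeric values are invented toy data, not measurements from the paper.

```python
# Sketch of the correlation/regression analysis between the number of
# exons in a gene and the net length of its exons. Toy data only.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def slope(xs, ys):
    """Least-squares regression slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

exon_counts = [1, 2, 4, 6, 9]                  # hypothetical genes
net_exon_len = [300, 650, 1200, 1900, 2800]    # toy base-pair totals
print(round(pearson(exon_counts, net_exon_len), 3))
```

The paper clusters chromosomes using, among other features, the parameters of exactly this kind of per-chromosome regression.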
Abstract: The National Agricultural Biotechnology Information
Center (NABIC) plays a leading role in the biotechnology information
database for agricultural plants in Korea. Since 2002, we have
concentrated on functional genomics of major crops, building an
integrated biotechnology database for agro-biotech information that
focuses on bioinformatics of major agricultural resources such as rice,
Chinese cabbage, and microorganisms. The NABIC integrated
biotechnology database provides useful information through a
user-friendly web interface that allows analysis of genome
infrastructure, multiple plants, microbial resources, and living
modified organisms.
Abstract: MicroRNAs (miRNAs) are small, non-coding,
regulatory RNAs about 20 to 24 nucleotides long. Their conservation
among various organisms makes them a good source for discovering
new miRNAs by a comparative genomics approach. The
study resulted in 21 miRNAs of 20 pre-miRNAs belonging to 16
families (miR156, 157, 158, 164, 165, 168, 169, 172, 319, 390, 393,
394, 395, 400, 472 and 861) in evergreen spruce tree (Picea). The
miRNA families miR157, 158, 164, 165, 168, 169, 319, 390, 393,
394, 400, 472, and 861 are reported for the first time in Picea. All
20 miRNA precursors form stable minimum-free-energy stem-loop
structures, as their orthologues do in Arabidopsis, and the mature
miRNAs reside in the stem portion of the stem-loop structure. Sixteen
(16) miRNAs are from Picea glauca and five (5) belong to Picea
sitchensis. Their targets include transcription factors and
growth-related, stress-related, and hypothetical proteins.