Efficient Pre-Processing of Single-Cell Assay for Transposase Accessible Chromatin with High-Throughput Sequencing Data

The primary tool currently used to pre-process 10X chromium single-cell ATAC-seq data is Cell Ranger, which can take very long to run on standard datasets. To facilitate rapid pre-processing that enables reproducible workflows, we present a suite of tools called scATAK for pre-processing single-cell ATAC-seq data that is 15 to 18 times faster than Cell Ranger on mouse and human samples. Our tool can also calculate chromatin interaction potential matrices and generate open chromatin signal and interaction traces for cell groups. We use scATAK tool to explore the chromatin regulatory landscape of a healthy adult human brain and unveil cell-type specific features, and show that it provides a convenient and computational efficient approach for pre-processing single-cell ATAC-seq data.

Towards End-To-End Disease Prediction from Raw Metagenomic Data

Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and stored as fastq files. Conventional processing pipelines consist in multiple steps including quality control, filtering, alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming and rely on a large number of parameters that often provide variability and impact the estimation of the microbiome elements. Training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of the prediction for each genome. Using two public real-life data-sets as well a simulated one, we demonstrated that this original approach reaches high performance, comparable with the state-of-the-art methods applied directly on processed data though mainstream bioinformatics workflows. These results are encouraging for this proof of concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.

Modified Genome-Scale Metabolic Model of Escherichia coli by Adding Hyaluronic Acid Biosynthesis-Related Enzymes (GLMU2 and HYAD) from Pasteurella multocida

Hyaluronic acid (HA) consists of linear heteropolysaccharides repeat of D-glucuronic acid and N-acetyl-D-glucosamine. HA has various useful properties to maintain skin elasticity and moisture, reduce inflammation, and lubricate the movement of various body parts without causing immunogenic allergy. HA can be found in several animal tissues as well as in the capsule component of some bacteria including Pasteurella multocida. This study aimed to modify a genome-scale metabolic model of Escherichia coli using computational simulation and flux analysis methods to predict HA productivity under different carbon sources and nitrogen supplement by the addition of two enzymes (GLMU2 and HYAD) from P. multocida to improve the HA production under the specified amount of carbon sources and nitrogen supplements. Result revealed that threonine and aspartate supplement raised the HA production by 12.186%. Our analyses proposed the genome-scale metabolic model is useful for improving the HA production and narrows the number of conditions to be tested further.

Transcriptomics Analysis on Comparing Non-Small Cell Lung Cancer versus Normal Lung, and Early Stage Compared versus Late-Stages of Non-Small Cell Lung Cancer

Lung cancer is one of the most common malignancies and primary cause of death due to cancer worldwide. Non-small cell lung cancer (NSCLC) is the main subtype in which majority of patients present with advanced-stage disease. Herein, we analyzed differentially expressed genes to find potential biomarkers for lung cancer diagnosis as well as prognostic markers. We used transcriptome data from our 2 NSCLC patients and public data (GSE81089) composing of 8 NSCLC and 10 normal lung tissues. Differentially expressed genes (DEGs) between NSCLC and normal tissue and between early-stage and late-stage NSCLC were analyzed by the DESeq2. Pairwise correlation was used to find the DEGs with false discovery rate (FDR) adjusted p-value £ 0.05 and |log2 fold change| ³ 4 for NSCLC versus normal and FDR adjusted p-value £ 0.05 with |log2 fold change| ³ 2 for early versus late-stage NSCLC. Bioinformatic tools were used for functional and pathway analysis. Moreover, the top ten genes in each comparison group were verified the expression and survival analysis via GEPIA. We found 150 up-regulated and 45 down-regulated genes in NSCLC compared to normal tissues. Many immnunoglobulin-related genes e.g., IGHV4-4, IGHV5-10-1, IGHV4-31, IGHV4-61, and IGHV1-69D were significantly up-regulated. 22 genes were up-regulated, and five genes were down-regulated in late-stage compared to early-stage NSCLC. The top five DEGs genes were KRT6B, SPRR1A, KRT13, KRT6A and KRT5. Keratin 6B (KRT6B) was the most significantly increased gene in the late-stage NSCLC. From GEPIA analysis, we concluded that IGHV4-31 and IGKV1-9 might be used as diagnostic biomarkers, while KRT6B and KRT6A might be used as prognostic biomarkers. However, further clinical validation is needed.

Hybrid Structure Learning Approach for Assessing the Phosphate Laundries Impact

Bayesian Network (BN) is one of the most efficient classification methods. It is widely used in several fields (i.e., medical diagnostics, risk analysis, bioinformatics research). The BN is defined as a probabilistic graphical model that represents a formalism for reasoning under uncertainty. This classification method has a high-performance rate in the extraction of new knowledge from data. The construction of this model consists of two phases for structure learning and parameter learning. For solving this problem, the K2 algorithm is one of the representative data-driven algorithms, which is based on score and search approach. In addition, the integration of the expert's knowledge in the structure learning process allows the obtainment of the highest accuracy. In this paper, we propose a hybrid approach combining the improvement of the K2 algorithm called K2 algorithm for Parents and Children search (K2PC) and the expert-driven method for learning the structure of BN. The evaluation of the experimental results, using the well-known benchmarks, proves that our K2PC algorithm has better performance in terms of correct structure detection. The real application of our model shows its efficiency in the analysis of the phosphate laundry effluents' impact on the watershed in the Gafsa area (southwestern Tunisia).

Meta-Learning for Hierarchical Classification and Applications in Bioinformatics

Hierarchical classification is a special type of classification task where the class labels are organised into a hierarchy, with more generic class labels being ancestors of more specific ones. Meta-learning for classification-algorithm recommendation consists of recommending to the user a classification algorithm, from a pool of candidate algorithms, for a dataset, based on the past performance of the candidate algorithms in other datasets. Meta-learning is normally used in conventional, non-hierarchical classification. By contrast, this paper proposes a meta-learning approach for more challenging task of hierarchical classification, and evaluates it in a large number of bioinformatics datasets. Hierarchical classification is especially relevant for bioinformatics problems, as protein and gene functions tend to be organised into a hierarchy of class labels. This work proposes meta-learning approach for recommending the best hierarchical classification algorithm to a hierarchical classification dataset. This work’s contributions are: 1) proposing an algorithm for splitting hierarchical datasets into new datasets to increase the number of meta-instances, 2) proposing meta-features for hierarchical classification, and 3) interpreting decision-tree meta-models for hierarchical classification algorithm recommendation.

Identification of Disease Causing DNA Motifs in Human DNA Using Clustering Approach

Studying DNA (deoxyribonucleic acid) sequence is useful in biological processes and it is applied in the fields such as diagnostic and forensic research. DNA is the hereditary information in human and almost all other organisms. It is passed to their generations. Earlier stage detection of defective DNA sequence may lead to many developments in the field of Bioinformatics. Nowadays various tedious techniques are used to identify defective DNA. The proposed work is to analyze and identify the cancer-causing DNA motif in a given sequence. Initially the human DNA sequence is separated as k-mers using k-mer separation rule. The separated k-mers are clustered using Self Organizing Map (SOM). Using Levenshtein distance measure, cancer associated DNA motif is identified from the k-mer clusters. Experimental results of this work indicate the presence or absence of cancer causing DNA motif. If the cancer associated DNA motif is found in DNA, it is declared as the cancer disease causing DNA sequence. Otherwise the input human DNA is declared as normal sequence. Finally, elapsed time is calculated for finding the presence of cancer causing DNA motif using clustering formation. It is compared with normal process of finding cancer causing DNA motif. Locating cancer associated motif is easier in cluster formation process than the other one. The proposed work will be an initiative aid for finding genetic disease related research.

Antibody Reactivity of Synthetic Peptides Belonging to Proteins Encoded by Genes Located in Mycobacterium tuberculosis-Specific Genomic Regions of Differences

The comparisons of mycobacterial genomes have identified several Mycobacterium tuberculosis-specific genomic regions that are absent in other mycobacteria and are known as regions of differences. Due to M. tuberculosis-specificity, the peptides encoded by these regions could be useful in the specific diagnosis of tuberculosis. To explore this possibility, overlapping synthetic peptides corresponding to 39 proteins predicted to be encoded by genes present in regions of differences were tested for antibody-reactivity with sera from tuberculosis patients and healthy subjects. The results identified four immunodominant peptides corresponding to four different proteins, with three of the peptides showing significantly stronger antibody reactivity and rate of positivity with sera from tuberculosis patients than healthy subjects. The fourth peptide was recognized equally well by the sera of tuberculosis patients as well as healthy subjects. Predication of antibody epitopes by bioinformatics analyses using ABCpred server predicted multiple linear epitopes in each peptide. Furthermore, peptide sequence analysis for sequence identity using BLAST suggested M. tuberculosis-specificity for the three peptides that had preferential reactivity with sera from tuberculosis patients, but the peptide with equal reactivity with sera of TB patients and healthy subjects showed significant identity with sequences present in nob-tuberculous mycobacteria. The three identified M. tuberculosis-specific immunodominant peptides may be useful in the serological diagnosis of tuberculosis.

Domain Knowledge Representation through Multiple Sub Ontologies: An Application Interoperability

The issues that limit application interoperability is lack of common vocabulary, common structure, application domain knowledge ontology based semantic technology provides solutions that resolves application interoperability issues. Ontology is broadly used in diverse applications such as artificial intelligence, bioinformatics, biomedical, information integration, etc. Ontology can be used to interpret the knowledge of various domains. To reuse, enrich the available ontologies and reduce the duplication of ontologies of the same domain, there is a strong need to integrate the ontologies of the particular domain. The integrated ontology gives complete knowledge about the domain by sharing this comprehensive domain ontology among the groups. As per the literature survey there is no well-defined methodology to represent knowledge of a whole domain. The current research addresses a systematic methodology for knowledge representation using multiple sub-ontologies at different levels that addresses application interoperability and enables semantic information retrieval. The current method represents complete knowledge of a domain by importing concepts from multiple sub ontologies of same and relative domains that reduces ontology duplication, rework, implementation cost through ontology reusability.

ELISA Based hTSH Assessment Using Two Sensitive and Specific Anti-hTSH Polyclonal Antibodies

Production of specific antibody responses against hTSH is a cumbersome process due to the high identity between the hTSH and the other members of the glycoprotein hormone family (FSH, LH and HCG) and the high identity between the human hTSH and host animals for antibody production. Therefore, two polyclonal antibodies were purified against two recombinant proteins. Four possible ELISA tests were designed based on these antibodies. These ELISA tests were checked against hTSH and other glycoprotein hormones, and their sensitivity and specificity were assessed. Bioinformatics tools were used to analyze the immunological properties. After the immunogen region selection from hTSH protein, c terminal of B hTSH was selected and applied. Two recombinant genes, with these cut pieces (first: two repeats of C terminal of B hTSH, second: tetanous toxin+B hTSH C terminal), were designed and sub-cloned into the pET32a expression vector. Standard methods were used for protein expression, purification, and verification. Thereafter, immunizations of the white New Zealand rabbits were performed and the serums of them were used for antibody titration, purification and characterization. Then, four ELISA tests based on two antibodies were employed to assess the hTSH and other glycoprotein hormones. The results of these assessments were compared with standard amounts. The obtained results indicated that the desired antigens were successfully designed, sub-cloned, expressed, confirmed and used for in vivo immunization. The raised antibodies were capable of specific and sensitive hTSH detection, while the cross reactivity with the other members of the glycoprotein hormone family was minimum. Among the four designed tests, the test in which the antibody against first protein was used as capture antibody, and the antibody against second protein was used as detector antibody did not show any hook effect up to 50 miu/l. Both proteins have the ability to induce highly sensitive and specific antibody responses against the hTSH. One of the antibody combinations of these antibodies has the highest sensitivity and specificity in hTSH detection.

Intellectual Property Protection of CRISPR Related Technologies

CRISPR research has the potential to completely transform life science, agriculture, live-stock and the health care industry. The Intellectual Property derived from its research has raised significant attention in the academic as well as the biopharmaceutical industry culminating an urgent need for strategic IP protection. We review the rudimentary concepts and key competitors of CRISPR technologies as well as the paramount strategies for intellectual property protection. Further, we elaborate on prosecution issues related to CRISPR patents as well as possible solutions to various patent laws, interferences and litigation. Finally, we address how the bioinformatics of the CRISPR technology begs an inquiry into issues of privacy and a host of ethical concerns.

Structuring and Visualizing Healthcare Claims Data Using Systems Architecture Methodology

Healthcare delivery systems around the world are in crisis. The need to improve health outcomes while decreasing healthcare costs have led to an imminent call to action to transform the healthcare delivery system. While Bioinformatics and Biomedical Engineering have primarily focused on biological level data and biomedical technology, there is clear evidence of the importance of the delivery of care on patient outcomes. Classic singular decomposition approaches from reductionist science are not capable of explaining complex systems. Approaches and methods from systems science and systems engineering are utilized to structure healthcare delivery system data. Specifically, systems architecture is used to develop a multi-scale and multi-dimensional characterization of the healthcare delivery system, defined here as the Healthcare Delivery System Knowledge Base. This paper is the first to contribute a new method of structuring and visualizing a multi-dimensional and multi-scale healthcare delivery system using systems architecture in order to better understand healthcare delivery.

21st Century Biotechnological Research and Development Advancements for Industrial Development in India

Biotechnology is a discipline which explains the use of living organisms and systems to construct a product, or we can define it as an application or technology developed to use biological systems and organisms processes for a specific use. Particularly, it includes cells and its components use for new technologies and inventions. The tools developed can be further used in diverse fields such as agriculture, industry, research and hospitals etc. The 21st century has seen a drastic development and advancement in biotechnology in India. Significant increase in Government of India’s outlays for biotechnology over the past decade has been observed. A sectoral break up of biotechnology-based companies in India shows that most of the companies are agriculture-based companies having interests ranging from tissue culture to biopesticides. Major attention has been given by the companies in health related activities and in environmental biotechnology. The biopharmaceutical, which comprises of vaccines, diagnostic, and recombinant products is the most reliable and largest segment of the Indian Biotech industry. India has developed its vaccine markets and supplies them to various countries. Then there are the bio-services, which mainly comprise of contract researches and manufacturing services. India has made noticeable developments in the field of bio industries including manufacturing of enzymes, biofuels and biopolymers. Biotechnology is also playing a crucial and significant role in the field of agriculture. Traditional methods have been replaced by new technologies that mainly focus on GM crops, marker assisted technologies and the use of biotechnological tools to improve the quality of fertilizers and soil. It may only be a small contributor but has shown to have huge potential for growth. Bioinformatics is a computational method which helps to store, manage, arrange and design tools to interpret the extensive data gathered through experimental trials, making it important in the design of drugs.

A Survey of Semantic Integration Approaches in Bioinformatics

Technological advances of computer science and data analysis are helping to provide continuously huge volumes of biological data, which are available on the web. Such advances involve and require powerful techniques for data integration to extract pertinent knowledge and information for a specific question. Biomedical exploration of these big data often requires the use of complex queries across multiple autonomous, heterogeneous and distributed data sources. Semantic integration is an active area of research in several disciplines, such as databases, information-integration, and ontology. We provide a survey of some approaches and techniques for integrating biological data, we focus on those developed in the ontology community.

Development of Fuzzy Logic Control Ontology for E-Learning

Nowadays, ontology is common in many areas like artificial intelligence, bioinformatics, e-commerce, education and many more. Ontology is one of the focus areas in the field of Information Retrieval. The purpose of an ontology is to describe a conceptual representation of concepts and their relationships within a particular domain. In other words, ontology provides a common vocabulary for anyone who needs to share information in the domain. There are several ontology domains in various fields including engineering and non-engineering knowledge. However, there are only a few available ontology for engineering knowledge. Fuzzy logic as engineering knowledge is still not available as ontology domain. In general, fuzzy logic requires step-by-step guidelines and instructions of lab experiments. In this study, we presented domain ontology for Fuzzy Logic Control (FLC) knowledge. We give Table of Content (ToC) with middle strategy based on the Uschold and King method to develop FLC ontology. The proposed framework is developed using Protégé as the ontology tool. The Protégé’s ontology reasoner, known as the Pellet reasoner is then used to validate the presented framework. The presented framework offers better performance based on consistency and classification parameter index. In general, this ontology can provide a platform to anyone who needs to understand FLC knowledge.

Questions Categorization in E-Learning Environment Using Data Mining Technique

Nowadays, education cannot be imagined without digital technologies. It broadens the horizons of teaching learning processes. Several universities are offering online courses. For evaluation purpose, e-examination systems are being widely adopted in academic environments. Multiple-choice tests are extremely popular. Moving away from traditional examinations to e-examination, Moodle as Learning Management Systems (LMS) is being used. Moodle logs every click that students make for attempting and navigational purposes in e-examination. Data mining has been applied in various domains including retail sales, bioinformatics. In recent years, there has been increasing interest in the use of data mining in e-learning environment. It has been applied to discover, extract, and evaluate parameters related to student’s learning performance. The combination of data mining and e-learning is still in its babyhood. Log data generated by the students during online examination can be used to discover knowledge with the help of data mining techniques. In web based applications, number of right and wrong answers of the test result is not sufficient to assess and evaluate the student’s performance. So, assessment techniques must be intelligent enough. If student cannot answer the question asked by the instructor then some easier question can be asked. Otherwise, more difficult question can be post on similar topic. To do so, it is necessary to identify difficulty level of the questions. Proposed work concentrate on the same issue. Data mining techniques in specific clustering is used in this work. This method decide difficulty levels of the question and categories them as tough, easy or moderate and later this will be served to the desire students based on their performance. Proposed experiment categories the question set and also group the students based on their performance in examination. This will help the instructor to guide the students more specifically. In short mined knowledge helps to support, guide, facilitate and enhance learning as a whole.

Bioinformatics and Molecular Biological Characterization of a Hypothetical Protein SAV1226 as a Potential Drug Target for Methicillin/Vancomycin- Staphylococcus aureus Infections

Methicillin/multiple-resistant Staphylococcus aureus (MRSA) are infectious bacteria that are resistant to common antibiotics. A previous in silico study in our group has identified a hypothetical protein SAV1226 as one of the potential drug targets. In this study, we reported the bioinformatics characterization, as well as cloning, expression, purification and kinetic assays of hypothetical protein SAV1226 from methicillin/vancomycin-resistant Staphylococcus aureus Mu50 strain. MALDI-TOF/MS analysis revealed a low degree of structural similarity with known proteins. Kinetic assays demonstrated that hypothetical protein SAV1226 is neither a domain of an ATP dependent dihydroxyacetone kinase nor of a phosphotransferase system (PTS) dihydroxyacetone kinase, suggesting that the function of hypothetical protein SAV1226 might be misannotated on public databases such as UniProt and InterProScan 5.

Application of KL Divergence for Estimation of Each Metabolic Pathway Genes

Development of a method to estimate gene functions is an important task in bioinformatics. One of the approaches for the annotation is the identification of the metabolic pathway that genes are involved in. Since gene expression data reflect various intracellular phenomena, those data are considered to be related with genes’ functions. However, it has been difficult to estimate the gene function with high accuracy. It is considered that the low accuracy of the estimation is caused by the difficulty of accurately measuring a gene expression. Even though they are measured under the same condition, the gene expressions will vary usually. In this study, we proposed a feature extraction method focusing on the variability of gene expressions to estimate the genes' metabolic pathway accurately. First, we estimated the distribution of each gene expression from replicate data. Next, we calculated the similarity between all gene pairs by KL divergence, which is a method for calculating the similarity between distributions. Finally, we utilized the similarity vectors as feature vectors and trained the multiclass SVM for identifying the genes' metabolic pathway. To evaluate our developed method, we applied the method to budding yeast and trained the multiclass SVM for identifying the seven metabolic pathways. As a result, the accuracy that calculated by our developed method was higher than the one that calculated from the raw gene expression data. Thus, our developed method combined with KL divergence is useful for identifying the genes' metabolic pathway.

Imputation Technique for Feature Selection in Microarray Data Set

Analyzing DNA microarray data sets is a great challenge, which faces the bioinformaticians due to the complication of using statistical and machine learning techniques. The challenge will be doubled if the microarray data sets contain missing data, which happens regularly because these techniques cannot deal with missing data. One of the most important data analysis process on the microarray data set is feature selection. This process finds the most important genes that affect certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.

Isolation and Identification of Diacylglycerol Acyltransferase Type- 2 (GAT2) Genes from Three Egyptian Olive Cultivars

Aim of this work was to study the genetic basis for oil accumulation in olive fruit via tracking DGAT2 (Diacylglycerol acyltransferase type-2) gene in three Egyptian Origen Olive cultivars namely Toffahi, Hamed and Maraki using molecular marker techniques and bioinformatics tools. Results illustrate that, firstly: specific genomic band of Maraki cultivars was identified as DGAT2 (Diacylglycerol acyltransferase type-2) and identical for this gene in Olea europaea with 100% of similarity. Secondly, differential genomic band of Maraki cultivars which produced from RAPD fingerprinting technique reflected predicted distinguished sequence which identified as DGAT2 (Diacylglycerol acyltransferase type-2) in Fragaria vesca subsp. Vesca with 76% of sequential similarity. Third and finally, specific genomic specific band of Hamed cultivars was identified as two fragments, 1- Olea europaea cultivar Koroneiki diacylglycerol acyltransferase type 2 mRNA, complete cds with two matches regions with 99% or 2- Predicted: Fragaria vesca subsp. vesca diacylglycerol O-acyltransferase 2-like (LOC101313050), mRNA with 86 % of similarity.