Internship Positions

We offer a variety of exciting projects for prospective interns. Interns can apply for SIGPA to receive monthly funding of 2000 SGD. The list of possible internship projects is below.

Project 1: AI methods for improving DNA sequencing

Genome sequencing and assembly is one of the most essential tasks in genomics. High-accuracy sequences help us find relevant information about the sequenced organism (e.g., a human or a virus), which is especially important for clinical use. This project will focus on one important sub-task of genome sequencing, called basecalling. In nanopore sequencing, as the DNA strand passes through the nanopore, the few nucleobases inside the pore create a characteristic disruption of the electrical current. The sequencing device continuously measures this current and stores it for sequence decoding. The process of decoding the sequence from the measured current is called basecalling.

The focus of this project will be on developing a high-accuracy basecaller using deep learning methods. The student will get to know practical approaches to data preparation and deep learning (such as attention networks and transformers). The solution will be implemented in Python using the PyTorch library. The student will be encouraged to publish their application on GitHub as an open-source project.
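As a rough illustration of the decoding step, many deep learning basecallers emit one label per signal frame and then collapse that output into a sequence. The sketch below assumes a CTC-style output with a blank symbol, which is an assumption for illustration and not necessarily the approach this project will take. The function name `ctc_greedy_decode`, the `BLANK` symbol, and the per-frame labels are all hypothetical.

```python
BLANK = "-"  # hypothetical blank symbol meaning "no new base in this frame"

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into a base sequence.

    Consecutive repeats of the same label are merged, and blanks are dropped.
    A blank between two identical labels keeps them as two separate bases.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)

# Hypothetical most-likely label for each signal frame, as a model might output.
frames = list("AA-CC--GGG-G-T")
sequence = ctc_greedy_decode(frames)
print(sequence)
```

In a real basecaller the per-frame labels would come from a neural network applied to the measured current; here they are written by hand to show only the collapsing step.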

Requirements:
- Motivation
- Willingness to learn by doing
- Basic programming skills
- Prior knowledge in biology is not required

Preferred skills (not required):
- Python
- Basic knowledge of probability and statistics, linear algebra, and information theory
- Knowledge of PyTorch or another DL framework
- Basic knowledge of machine learning and deep learning methods

Project 2: Using NLP and graph neural networks to determine the DNA sequence

One of the main challenges in genomics is determining the genome sequence from sequenced DNA fragments called reads. The standard procedure is to construct a graph from overlapping reads and find a path through it. The path represents the final sequence. Since no exact algorithm can solve this problem in a reasonable amount of time, heuristic approaches are necessary. The intern will try several deep learning methods and reinforcement learning algorithms for graph simplification.
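The graph construction described above can be sketched on a toy example. This is a minimal illustration, assuming error-free reads, exact suffix-prefix overlaps, and a simple greedy heuristic for choosing the path; real assemblers handle sequencing errors, repeats, and much larger graphs. All function names and the example reads are hypothetical.

```python
from itertools import permutations

def overlap_length(a, b, min_overlap=3):
    """Length of the longest proper suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(reads, min_overlap=3):
    """Directed graph: graph[a][b] = overlap length between reads a and b."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        k = overlap_length(a, b, min_overlap)
        if k:
            graph[a][b] = k
    return graph

def greedy_path(graph, start):
    """Heuristic walk: always follow the unvisited edge with the largest overlap."""
    path, seen, current = [start], {start}, start
    while True:
        candidates = {n: k for n, k in graph[current].items() if n not in seen}
        if not candidates:
            return path
        current = max(candidates, key=candidates.get)
        path.append(current)
        seen.add(current)

def spell(path, graph):
    """Merge the reads along the path, skipping the overlapping prefixes."""
    sequence = path[0]
    for prev, nxt in zip(path, path[1:]):
        sequence += nxt[graph[prev][nxt]:]
    return sequence

reads = ["ATGGCG", "GCGTAC", "TACGGA"]  # toy reads, not real data
graph = build_overlap_graph(reads)
path = greedy_path(graph, "ATGGCG")
assembled = spell(path, graph)
print(assembled)
```

The greedy walk is only one possible heuristic; it can fail on repeats, which is one reason graph simplification is a research problem in its own right.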

The goal of this project is to develop deep learning models for locating critical patterns in graphs. The idea is to learn in the space of algorithms. Initial models will be based on Message Passing Neural Networks. At later stages of the project, other approaches will be tried as well, such as reinforcement learning methods similar to those used in AlphaZero and NLP models (e.g., GPT-3 or BERT) to learn the genome sequence.

Requirements:
- Motivation
- Willingness to learn by doing
- Basic programming skills
- Prior knowledge in biology is not required

Preferred skills (not required):
- Python
- Basic knowledge of probability and statistics, linear algebra, and information theory
- Knowledge of PyTorch or another DL framework
- Basic knowledge of machine learning and deep learning methods

Project 3: Deep Learning methods for epigenomics

Modification of DNA nucleotides is an important way to control the function of the genome through the regulation of gene expression. DNA modifications contribute to diseases such as cancer, where they are used as biomarkers, and they have been found to influence aging. This demonstrates the value of epigenomic data for understanding the profile of each individual patient.

The goal of this project is to develop deep learning models for detecting these modifications from sequencing data. Initial models will be based on convolutional neural networks. At later stages of the project, other approaches will be tried as well, such as attention models, which recently made a breakthrough in the field of natural language processing and proved to be more successful in language tasks than recurrent neural networks.

Requirements:
- Motivation
- Willingness to learn by doing
- Basic programming skills
- Prior knowledge in biology is not required

Preferred skills (not required):
- Python
- Basic knowledge of probability and statistics, linear algebra, and information theory
- Knowledge of PyTorch or another DL framework
- Basic knowledge of machine learning and deep learning methods

Project 4: Algorithms for determination of the content of the maternally and paternally-derived chromosomes

Genome assembly would not be feasible without the algorithms and methods integrated into tools called de novo assemblers, which reconstruct genomes from short sequenced DNA fragments in a manner similar to puzzle solving. The workhorses of de novo assembly are algorithms on strings and graphs. The majority of de novo assemblers were designed for smaller genomes and also work well on larger eukaryotic organisms, but most of them create haploid representations of the genome regardless of the ploidy (they collapse the information inherited from the two parents). Separating the genetic material from each parent yields knowledge of the complete genotype, that is, all variant forms of all genes. Therefore, different methods should be applied in the assembly process to achieve better reconstructions.

The main goal of this project is to adapt Raven, a de novo assembler for long, error-prone sequencing reads, to diploid organisms. Raven was developed by our group and is now one of the most popular de novo assemblers.

Requirements:
- Motivation
- Willingness to learn by doing
- Intermediate C/C++ skills
- Knowledge of algorithms and data structures
- Prior knowledge in biology is not required

Preferred skills (not required):
- Software engineering skills
- Parallel programming

Project 5: Deep learning models for determination of the content of the maternally and paternally-derived chromosomes

Genome assembly is the procedure for reconstructing genomes from short sequenced DNA fragments in a manner similar to puzzle solving. The majority of de novo assemblers were designed for smaller genomes and also work well on larger eukaryotic organisms, but most of them create haploid representations of the genome regardless of the ploidy (they collapse the information inherited from the two parents). Separating the genetic material from each parent yields knowledge of the complete genotype, that is, all variant forms of all genes. Therefore, different methods should be applied in the assembly process to achieve better reconstructions.

The student will work on various deep learning models based on convolutional neural networks or attention mechanisms. The project will start with an approach similar to the one used in Google's DeepVariant model.

Requirements:
- Motivation
- Willingness to learn by doing
- Basic programming skills
- Prior knowledge in biology is not required

Preferred skills (not required):
- Python
- Basic knowledge of probability and statistics, linear algebra, and information theory
- Knowledge of PyTorch or another DL framework
- Basic knowledge of machine learning and deep learning methods

Project 6: AI method for vaccine design

The AI vaccine design process can be seen as selecting good fragments of the virus proteins and then combining them into a final vaccine. A fragment with multiple merits can be selected as a subunit of the final vaccine. Proteins may be viewed as sequences of amino acid residues. However, they can also be considered as graphs with residues as nodes, where two nodes share an edge if their residues are spatially close.
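The graph view of a protein can be sketched in a few lines. This is a minimal illustration, assuming one 3D coordinate per residue and a distance cutoff of 8 angstroms for "spatially close"; the cutoff, the function name `contact_graph`, and the coordinates are assumptions chosen for the example, not values from the project.

```python
import math

def contact_graph(coords, threshold=8.0):
    """Return the set of undirected edges (i, j), i < j, between residues
    whose coordinates lie within `threshold` of each other."""
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= threshold:
                edges.add((i, j))
    return edges

# Hypothetical residue positions (one point per residue), not real structure data.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (11.4, 0.0, 0.0),
    (11.4, 3.8, 0.0),
]
edges = contact_graph(coords)
print(sorted(edges))
```

The same protein therefore has two representations: the residue sequence, suited to sequence models, and the edge set, suited to graph neural networks.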

The student will work on various deep learning models based on sequences (e.g., attention mechanisms) and graphs (graph neural networks) to detect protein fragments suitable as subunits of a vaccine.

Requirements:
- Motivation
- Willingness to learn by doing
- Basic programming skills
- Prior knowledge in biology is not required

Preferred skills (not required):
- Python
- Basic knowledge of probability and statistics, linear algebra, and information theory
- Knowledge of PyTorch or another DL framework
- Basic knowledge of machine learning and deep learning methods

Project 7: Rapid diagnostics of infectious diseases using DNA sequencing and deep learning methods

Fast and accurate detection of the microbes that cause infectious diseases can reduce unnecessary use of antibiotics, speed up recovery, and sometimes even save lives. When the pathogen is unknown, current methods can take days. The goal of the project is to develop an AI method for detecting the microbes present in a sample using nanopore sequencing. Nanopore sequencers read DNA/RNA fragments while they pass through a tiny pore. This results in a signal that is converted to a sequence of nucleotides.

The project aims to recognize each microbe present in the sample using a pattern matching approach like those used for song recognition, such as Shazam. Instead of signal processing as in Shazam, we plan to use AI methods to find a good succinct representation of each known microbe genome. The student will work with researchers on building new deep learning methods based on self-supervised learning and attention-based architectures.

Requirements:
- Motivation
- Willingness to learn by doing
- Basic programming skills
- Prior knowledge in biology is not required

Preferred skills (not required):
- Python
- Basic knowledge of probability and statistics, linear algebra, and information theory
- Knowledge of PyTorch or another DL framework
- Basic knowledge of machine learning and deep learning methods