Nanopore Sequencing

Nanopore DNA Sequencing

Data Science Engineering for Single-Molecule Sensors

###### Jesse Clark

###### 2023

Notes:
              Hello. I'm a software engineer at Roche Diagnostics.
              I'd like to tell you about the work I've been doing on our nanopore sequencing instrument.
              I'm familiar with the data science operations for this project, and how they interact with the electrical and chemical systems that we use, and the markets that we operate in.

First, I will give some background about DNA sequencing technologies.

Then, I will explain the algorithms involved in primary analysis of nanopore data.

---

#### Abstract

DNA sequencing has a long history of tradeoffs between cost and accuracy, mediated by data science.

As improvements in biochemistry deliver raw data with better signal-to-noise ratio, the role of conventional bioinformatics in primary analysis is being supplanted by recurrent neural networks.

A case study of the record-breaking high-throughput process at Roche Sequencing Solutions.

Conventional Sequencing

[The Human Genome Project](https://www.genome.gov/human-genome-project)
              first sequenced the human genome  
              (3.1 billion base pairs) in 2003.

Since then, we have seen three distinct generations of sequencing technology, characterized by higher signal-to-noise ratio and lower costs.

Increasing cost-effectiveness opens up new markets:

- basic research in genetics
- non-invasive prenatal testing
- early cancer detection
- large-scale infectious disease research

Notes:
              In each generation, we see incremental improvements from massively parallel instruments and from streamlined processes such as "click chemistry".

Breakthroughs are due to new techniques.

$$\textrm{sequencing cost} \approx \textrm{num base reads} \times \frac{\textrm{samples per base}}{\textrm{samples per second}} $$
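As a worked example of the relation above, here is a minimal sketch that treats cost as proportional to instrument time. The genome size comes from the earlier slide; the coverage, samples per base, and aggregate sample rate are made-up illustrative numbers, not figures for any real instrument, and the function name is hypothetical.

```python
def sequencing_time_seconds(num_base_reads, samples_per_base, samples_per_second):
    """Instrument time implied by the cost relation (cost taken as proportional to time)."""
    return num_base_reads * samples_per_base / samples_per_second


genome_bases = 3.1e9          # human genome size, from the earlier slide
coverage = 30                 # hypothetical depth of coverage
num_base_reads = genome_bases * coverage

# Hypothetical instrument: 10 samples per base, 1e9 samples per second across all sensors.
seconds = sequencing_time_seconds(num_base_reads, samples_per_base=10, samples_per_second=1e9)
print(f"{seconds / 60:.1f} minutes of instrument time")
```

Halving the samples needed per base, or doubling the aggregate sample rate, halves the time in this model, which is why both better chemistry and more parallel sensors reduce cost.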

---

#### Chain-Termination Sequencing

- Cut DNA strands into fragments
- Attach base-dependent fluorescence to the end of each fragment  
  ("chain termination method")
- Separate fragments by size  
  ([Gel Electrophoresis](https://en.wikipedia.org/wiki/Gel_electrophoresis))

Signal gets weaker at longer read lengths. The sample can be amplified with [PCR](https://en.wikipedia.org/wiki/Polymerase_chain_reaction), but copying errors made in early cycles are themselves copied in every later cycle, so the amplification noise compounds.

High [depth of coverage](https://en.wikipedia.org/wiki/Coverage_(genetics)) is needed to differentiate between mutations in the sample and mutations in the process.

Notes:
              Electrophoresis sequencing (c. 1977)

PCR (c. 1985)

Data scientists can correlate non-synchronized data *in silico* during primary analysis.

Roche is doing interesting work on using neural networks for early interpretation of PCR curves. Faster turnaround time can open new markets.

---

#### Sequencing by Synthesis

Synchronized reaction and detection.

Biochemical engineers can modify DNA polymerase or modify available nucleotides to create a detectable signal when each base is processed.

For example, engineered nucleotides that fluoresce when added to a chain, combined with optical detection.

Nanopore Sequencing

#### Patch Clamp

<figure class="figure-right" style="width:33%; margin-top:-1.5em">
 <img src="Patch_clamp_schematic_1.png" style="width: 100%"/>
 
 <img src="Single_channel.png" style="width: 100%"/>
 <figcaption>Voltage across a single ion channel</figcaption>
 </figure>

A [patch clamp](https://en.wikipedia.org/wiki/Patch_clamp) can measure voltage or current flow through a single ion channel.

[Nanopore sequencing](https://en.wikipedia.org/wiki/Nanopore_sequencing) is based on threading a single strand of DNA through an ion channel and measuring the characteristic voltage signature of each nucleotide.

This technique is sensitive enough to measure single molecules of native DNA without introducing noise before or after the main reaction.

Notes:
              Patch clamp (early 1980s)

Nanopore: a pore about 1 nanometer in diameter.

---
  
#### Ion Channel Threading

<figure class="figure-right">
 <img src="oxford-nanopore-2009.jpg" style="height: 10em" />
 <figcaption>Oxford Nanopore (2009)</figcaption>
 </figure>

- Create a membrane with a single pore.  
  Pore-forming toxins such as α-hemolysin can be produced by bacteria.

- Attach a DNA-threading enzyme to the pore.

- Measure voltage across the pore as the DNA passes through it.

- Engineer nucleotides to make the voltage levels more distinct.  
  ("sequencing by synthesis")

---

#### Roche High-Throughput Platform

- Custom chip with 8,000,000 single-pore wells and a general-purpose rolling-shutter array of patch clamp sensors  
  (Genia Technologies)

- Engineered nucleotides with a ratcheting effect that regulates procession rate  
  (Stratos Genomics)

Notes:
              SBX (sequencing by expansion, the Stratos chemistry) makes it much easier to detect and prevent insertions and deletions, giving us very high accuracy with fewer samples per base dwell.

Genia (2009-2014)

Stratos (2007-2020)

Primary Data Analysis

Primary analysis of sequencing data produces:

- sequence of DNA base-calls
- quality-control metrics
- genome alignment (if applicable)
- feedback for experiment designers (during R&D)

These can then be passed into later analysis stages such as mutation calling.

**Algorithms need to be adapted to the specific chemistry they are working with.**

Notes:
              The cost of sequencing is usually dominated by hardware and reagents.

The analysis algorithms are straightforward, but they need to be adapted to the specific chemistry they are working with.

A well-run data science team should be able to make changes quickly and enable experimentalists to self-manage code versions and parameters.

---

#### R&D Pipeline

<object type="image/svg+xml" data="Primary Data Analysis.svg" style="width:90%" />

Notes:
              Here is how we handle primary data analysis at Roche.

- Experiment metadata (chemical reagents, electric waveforms, relevant genome, ...) is specified in advance. Voltage data is collected as "frames" which consist of millions of 8-bit voltage readings, plus some "frame header" metadata.

- Data is transposed from frame-major order into cell-major order, so that the complete record of a single cell is available in a single data chunk. Cells are subsampled randomly into chunks. These chunks are copied to cloud storage and tracked with an in-house database.

- Data in the cloud is analyzed on demand in a Docker container, with the image version, branch, and parameters chosen per run. Data chunks are analyzed in parallel and then the results are merged. Results include essential base-call information as well as extra information for algorithm development and tuning.

- Data in the cloud can be viewed with an in-house data visualization tool.

- Analysis results are used to train a neural network to perform base-calls in real time on the production instrument.
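The transposition and subsampling steps above can be sketched with NumPy. This is a minimal illustration, not the Roche pipeline: it assumes each frame arrives as one flat array holding a single 8-bit reading per cell, and the names `frames_to_cell_major` and `random_chunks` are hypothetical.

```python
import numpy as np


def frames_to_cell_major(frames):
    """Stack frames (one uint8 reading per cell) and transpose to cell-major order."""
    frame_major = np.stack(frames)                # shape: (n_frames, n_cells)
    # Copy so that each cell's full time series is contiguous in memory.
    return np.ascontiguousarray(frame_major.T)    # shape: (n_cells, n_frames)


def random_chunks(n_cells, n_chunks, seed=0):
    """Assign cell indices to chunks at random; every cell lands in exactly one chunk."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_cells), n_chunks)


# Toy data: 4 frames of 6 cells each (a real frame would hold millions of readings).
frames = [np.arange(6, dtype=np.uint8) + 10 * t for t in range(4)]
cell_major = frames_to_cell_major(frames)
chunks = random_chunks(n_cells=6, n_chunks=3)
first_chunk_data = cell_major[chunks[0]]          # complete records for the cells in chunk 0
```

Because each chunk holds complete per-cell records, chunks can be analyzed independently and the results merged afterwards.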

---

#### Algorithms

- Pore lifetime detection (including various dysfunctional states)

- Voltage normalization (open-channel voltage drifts over time)

- Molecular trace detection

- Level-calling

- Base-calling

- Genome alignment

- Quality metrics (read length statistics, confidence of each base-call, ...)

- Neural network training
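Two of these steps, voltage normalization and level-calling, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the production algorithm: it assumes the open-channel level is the highest level in the trace and estimates it with a rolling upper percentile, and it calls a new level wherever consecutive samples jump by more than a threshold. The function names and the parameters `window`, `q`, and `min_jump` are hypothetical.

```python
import numpy as np


def normalize_trace(voltage, window=101, q=95):
    """Divide a trace by a rolling upper-percentile estimate of the open-channel level."""
    v = np.asarray(voltage, dtype=float)
    pad = window // 2                              # window is assumed to be odd
    padded = np.pad(v, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    baseline = np.percentile(windows, q, axis=1)   # one baseline value per sample
    return v / baseline


def call_levels(normalized, min_jump=0.05):
    """Split a trace wherever neighbouring samples differ by more than min_jump.

    Returns a list of (start, end, mean_level) tuples with end exclusive.
    """
    x = np.asarray(normalized, dtype=float)
    breaks = np.flatnonzero(np.abs(np.diff(x)) > min_jump) + 1
    bounds = np.concatenate(([0], breaks, [len(x)]))
    return [(int(s), int(e), float(x[s:e].mean()))
            for s, e in zip(bounds[:-1], bounds[1:])]


# Toy trace: open channel, then two distinct blockade levels.
levels = call_levels([1.0] * 5 + [0.6] * 4 + [0.8] * 3)
```

A real level-caller would also need to handle noise within a level and very short dwells, which this threshold rule ignores.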

---

#### Neural Network Pipeline

In production, the instrument can stream voltage data directly into a GPU running the pre-trained neural network, and produce base-calls in real time.

What risks and biases does this introduce?

- Commercial lab conditions may not match training conditions  
  (temperature/humidity, freshness of reagents, *etc.*)

- Customer DNA distribution may not match training data  
  (genes that are more common in different populations)

- Accuracy may be hard to interpret  
  (higher coverage combines non-linearly)

How can we mitigate these risks?

Notes:
              A neural network is a statistical pattern-matching tool that can be trained on all the features of a data set we feed it, not just the ones we deliberately engineer.

We don't want the neural net to "hallucinate" a certain reading just because it is likely in that location. Even with higher coverage, that would interfere with early cancer detection.

For example, a population in one country may have a much higher rate of a genetic condition such as sickle cell anemia.

The basic tradeoff between cost and accuracy becomes much more non-linear when you use techniques that emphasize the most likely reading of the data, but you are looking for unlikely readings such as new mutations.
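To see why coverage combines non-linearly, here is a minimal sketch of consensus accuracy under a simple majority vote. It assumes every read is independently right or wrong with the same probability, which is an idealization: a base-caller that favors the most likely reading makes correlated errors, and extra coverage then helps less than this model suggests. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(per_read_accuracy, coverage):
    """Probability that strictly more than half of `coverage` independent reads are correct."""
    p = per_read_accuracy
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(coverage // 2 + 1, coverage + 1)
    )


# Odd coverage values avoid ties in the vote.
for depth in (1, 3, 5, 9):
    print(depth, majority_vote_accuracy(0.9, depth))
```

In this idealized model the error rate falls much faster than linearly with coverage, so a modest change in per-read accuracy or in error correlation can shift the required coverage, and therefore the cost, by a large factor.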

- JBIG2 hallucinations

- tank photo neural net urban legend  
  https://gwern.net/tank

- Include confidence intervals in output, recommend a reasonable coverage level to customers

- Train models in a realistic range of conditions, identify the boundary conditions between them

Would synthetic data be acceptable for training?