RNA Secondary Structure Prediction using Nussinov Algorithm

March 03, 2023

The Nussinov algorithm is an RNA secondary structure (folding) prediction method using a dynamic programming approach. Ruth Nussinov introduced this algorithm in the year 1978. It involves computing a two-dimensional (2D) diagonal matrix with the same sequence at both dimensions. The scores are given based on complementary (1) or non-complementary (0) matches of characters. Matrix solving consists of three stages (i) initialization, (ii) matrix-filling, and (iii) trace-back of arrows for structures.

\(\style{ color: blue; } {\begin{array} \\ \text{RNA sequence, } S=a_1a_2a_3....a_{l-1}a_l \\ \begin{align*} \\ \!\!\!\!\! \text{where,} \\ & a=\text{characters (A, U, C, G)} \\ & l=\text{length of the sequence} \\ \end{align*} \end{array}} \)

In this tutorial, I have taken a sample RNA sequence (S) as GGGAAAUCC for prediction.

Initialization

The initialization step is to preset the diagonal cells with zero (0) values to perform matrix filling.

\(\style{ color: blue; } {\begin{align} M_{i, i-1}=0 \text{ where, } i=2 \text{ to } l \\ M_{i,i}\quad =0 \text{ where, } i=1 \text{ to } l \\ \end{align}} \)

PS: Here, the diagonal row (M_i,i-1) in the matrix is considered for initialization purpose only.

Matrix Filling

The matrix-filling step should be performed using the formula below. The matrix cells are filled with maximum scores and pointers while solving the formula.

\(\style{ color: blue; } {\begin{array} \\ M_{i, j} = \max \begin{cases} M_{i+1, j} & i \text{ unpaired } (\downarrow) \\ M_{i ,j-1} & j \text{ unpaired } (\leftarrow) \\ M_{i+1, j-1} + s(a_i,a_j) & i, j \text{ paired } (\swarrow) \\ \underset{i \lt k \lt j} \max \lbrace M_{i, k} + M_{k+1, j} \rbrace & \text{bifurcation } (\downarrow,\leftarrow) \end{cases} \\ \begin{align*} \kern-1em \text{where,} \\ & s(a_i, a_j) = 0, \text{ if } a_i \text{ and } a_j \text{ are non-complement} \\ & s(a_i, a_j) = 1, \text{ if } a_i \text{ and } a_j \text{ are complement} \end{align*} \end{array}} \)

The maximum score determines the structure of the RNA sequence. There are four possible chances of structures based on the scores.

An example,

\(\style{ color: blue; } {\begin{array} \\ M_{4,7}=\max \begin{cases} M_{5,7} \\ M_{4,6} \\ M_{5,6}+s(a_4,a_7) \\ \underset{k \in (5,6)} \max \begin{cases} M_{4,5}+M_{6,7} \\ M_{4,6}+M_{7,7} \end{cases} \\ \end{cases} \end{array}} \)

\(\style{ color: blue; } {\begin{array} \\ M_{4,7} = \max \begin{cases} 1 \\ 0 \\ 0+s(\text{A,U}) \\ \max \begin{cases} 0+1 \\ 0+0 \end{cases} \\ \end{cases} & = \max \begin{cases} 1 \\ 0 \\ 0+1 \\ \max \begin{cases} 1 \\ 0 \end{cases} \\ \end{cases} & = \max \begin{cases} 1 \\ 0 \\ 1 \\ 1 \\ \end{cases} & = 1 \end{array}} \)

\(\style{ color: blue; } {\begin{array} \\ M_{4,7}=\text{pointers} \begin{cases} M_{5,7} \ (\downarrow )\\ M_{4,6} \\ M_{5,6} \ (\swarrow) \\ M_{4,5} \ (\leftarrow), M_{6,7} \ (\downarrow) \\ \end{cases} \end{array}} \)

The completed matrix-filling is below.

Trace Back

The secondary structure of the RNA sequence can be predicted by tracing back the arrows from the right-most cell in the first row towards to half-diagonal in the matrix. There can be many possible structures. The best optimal structure is predicted by tracing arrow having high scores or diagonal arrows (↙) in the cell.

The final completed trace-back matrix with pointers by preferring high scores is below.

The RNA secondary structure by tracing the pointers is below.

The secondary structure for the RNA sequence GGGAAAUCC in bracket-notation model is .(((..))).

Search This Blog

BioGem Blog