This is a blog post that illustrates the ideas behind a neural network through a simplified variant caller named ShallowVariant.
In the previous blog post we discussed how a neural network works. Here we will build a simpler variation of Google's DeepVariant using a neural network, to illustrate a practical example of machine learning in the area of genomics.
Downloading the Data
The first step is to download the necessary files: the hs37d5 reference FASTA (hs37d5.fa) and the NA12878 chromosome 20 low-coverage BAM (NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam), both of which are used in the commands below.
These two files are sufficient to get us started. Throughout this blog, other commonly used bioinformatics tools will be utilized to keep the example simple.
Performing the Pileup to Generate Candidate Variants with Truth Values
The next step is to extract candidate variants from the BAM file by performing a pileup. A pileup is an alignment (mapping) of reads to a reference. If there is a region of variation between the reads and the reference, then it becomes a candidate. In our case we will use freebayes, but other tools like bcftools mpileup will also work. The idea is to focus on bi-allelic SNPs (just to keep the problem simple), which we can use to build a neural network to predict the genotype. Since a pileup provides the position, we do not need the bases themselves, and can rely only on the reference and alternate allelic depths. These provide information about the number of reads that support the reference or alternate allele.
Thus our goal is to see if we can estimate the genotypes from just the reference and alternate allelic depths, and use the truth values of known genotypes (truth values denote the known, expected, empirically derived values) to train the weights of our neural network appropriately, with the hope of uncovering this correspondence between coverage (depths) and genotype. Our hypothesis is that there is a correlation between the coverage (depths) and the genotype. Thus, all we will need are the following elements:
Reference Counts
Alternate Counts
Genotype
Therefore, we will run the following two commands to generate the answers:
# This command will generate the SNP candidates (among other ones), along with
# their corresponding genotypes and allelic depths. These will be the truth set.
freebayes -f hs37d5.fa \
NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam > candidates-freebayes.vcf
# This command will extract the reference base, alternate base, genotype and
# allelic depths. These will be used in building the neural network.
bcftools query -f '%REF,%ALT,[%GT,%AD]\n' candidates-freebayes.vcf > ref_alt_gt_ad.txt
For the bi-allelic SNPs, each line of ref_alt_gt_ad.txt contains the reference base, the alternate base, the genotype, and the reference and alternate allelic depths, all comma-separated; a line would look something like A,G,0/1,12,9 (illustrative values).
I do realize there are additional filtering criteria, such as minimum coverage and base quality scores, but we want to keep this example simple to connect the big ideas together, and not become lost in the details.
The next step is to prepare these values for training a machine learning model.
Defining the Input (Tensor) and Output Labels
Now that we have bcftools-processed candidates from the VCF file, let’s try to generate the input to a neural network. In our example, the machine learning framework we will use is PyTorch. For a neural network model in PyTorch, the standard input is a tensor.
Since we’ll be taking a simpler approach, we will base our genotype predictions on the following ratios (the symbols $|\cdot|$ just refer to count or cardinality), which are the same fractions computed in the data-loading code further below:
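$$\frac{|\text{reference reads}|}{|\text{reference reads}| + |\text{alternate reads}|} \qquad \text{and} \qquad \frac{|\text{alternate reads}|}{|\text{reference reads}| + |\text{alternate reads}|}$$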
Regarding the output genotype labels, they need to have some form of numerical class label in order to be computable. Thus let's label $homozygous \; reference$, $heterozygous$, and $homozygous \; alternate$ as $1$ through $3$, with missing or uncallable (./.) labeled as $0$:
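./. (missing or uncallable): 0
0/0 (homozygous reference): 1
0/1 or 1/0 (heterozygous): 2
1/1 (homozygous alternate): 3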
Thus before going too deep, let’s plot these and visualize what the genotype classes look like:
Figure 1: This illustrates the plot of the fraction of reference read counts versus alternate read counts as compared to the total read depth.
Based on the figure above, the genotype classes do not seem to separate very well, which makes stratification by genotype class less clean. The fix is to transform the space the data lives in, and a simple way is to move into $log$-$space$ by taking the $log_2$ of the two fractions (for example, a site with 10 reference reads and 10 alternate reads has fractions of $0.5$ and $0.5$, which become $log_2(0.5) = -1$ each):
Figure 2: This illustrates the $log_2$ transformed plot of the fraction of reference read counts versus alternate read counts as compared to the total read depth.
Now there seems to be better separation among the classes, allowing the weights of the neural network model to classify among the genotypes.
Loading and Processing the Data
Now we will be getting into the details of how to use PyTorch. We have a file called ref_alt_gt_ad.txt that we want to load. Let’s first start with a few variable definitions and helper functions.
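The snippets in the rest of this post assume the following imports (the torch.utils.data classes are only needed for the dataset and dataloader sketch shown later):
import csv
import math
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader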
# The reference and alternate allele count fractions
number_of_inputs = 2
# This is the number of neurons in the hidden layer between the input and output
number_of_neurons_in_hidden_layer = 6
# This will represent four classes of predicted outputs: ./., 0/0, (0/1 or 1/0), 1/1
number_of_genotype_classes = 4
# This represents the number of training or testing rounds (epochs)
number_of_training_or_testing_rounds = 10
# The path of our truth set, containing reference count, allele count and genotypes
labeled_data_file = 'ref_alt_gt_ad.txt'
# Helper function for converting genotypes to numeric labels
def convert_genotype_to_number( genotype ):
if '0/0' in genotype:
return 1
if '0/1' in genotype:
return 2
if '1/0' in genotype:
return 2
if '1/1' in genotype:
return 3
else:
return 0
# A helper function for converting numeric labels to genotypes
def convert_numeric_class_to_genotype( numeric_genotype_tensor ):
genotype_numeric_list = numeric_genotype_tensor.tolist()
genotypes=[]
for i in range(0, len( genotype_numeric_list )):
if genotype_numeric_list[i] == 0:
genotypes.append( './.' )
continue
if genotype_numeric_list[i] == 1:
genotypes.append( '0/0' )
continue
if genotype_numeric_list[i] == 2:
genotypes.append( '0/1' )
continue
if genotype_numeric_list[i] == 3:
genotypes.append( '1/1' )
continue
return genotypes
# Helper function for determining the number of rows in a file
def get_total_rows( data_file ):
with open(data_file, newline='') as candidates_file:
data = csv.reader(candidates_file, delimiter=' ', quotechar='|')
row_counter = 0
for row in data:
row_counter = row_counter + 1
return row_counter
# Defining the number of rows in a file for array instantiation
number_of_rows = int( get_total_rows( labeled_data_file ) )
# Defining NumPy arrays that will contain the reference and allele fractions
X = np.zeros( (number_of_rows, number_of_inputs), dtype=float )
# Defining a NumPy array for output (genotype) labels
y = np.zeros( (number_of_rows), dtype=int )
The X and y vectors are instantiated as NumPy arrays, with all zero values, and have the following dimensions (shapes):
X has dimensions 74732 rows x 2 columns, where the two columns represent the reference and alternate allele fractions
y has 74732 elements to store the genotypes; you can think of it as having 74732 rows x 1 column, which is equivalent
The values in X look as follows:
print(X)
array([[0., 0.],
[0., 0.],
[0., 0.],
...,
[0., 0.],
[0., 0.],
[0., 0.]])
print( X.shape )
(74732, 2)
# This shows the type being a NumPy ndarray class
print( type(X) )
<class 'numpy.ndarray'>
The values in y look as follows:
print( y )
array([0, 0, 0, ..., 0, 0, 0])
print( y.shape )
(74732,)
# This shows the type being a NumPy ndarray class
print( type(y) )
<class 'numpy.ndarray'>
Next let’s define the function for loading the data:
def load_data( filename=labeled_data_file ):
with open(filename, newline='') as candidates_file:
data = csv.reader(candidates_file, delimiter=' ', quotechar='|')
row_counter = 0
for row in data:
row_element = row[0]
row_split = row_element.split(',')
if len(row_split) > 5: # skip multi-allelic sites
continue
reference = row_split[0]
alternate = row_split[1]
genotype = row_split[2]
reference_read_count = int( row_split[3] )
alternate_read_count = int( row_split[4] )
total_read_count = reference_read_count + alternate_read_count
# Skip low coverage sites
if total_read_count < 10:
continue
# In case of zero reference counts (for division by zero errors).
# An increment of 1 will not impact the overall classification results,
# because the minimum read counts would need to be at least 10.
if reference_read_count == 0:
reference_read_count = 1
total_read_count = total_read_count + 1
# In case of zero alternate counts (for division by zero errors).
# An increment of 1 will not impact the overall classification results,
# because the minimum read counts would need to be at least 10.
if alternate_read_count == 0:
alternate_read_count = 1
total_read_count = total_read_count + 1
# Log2 Transformed Read Ratios
# For reference-supporting read counts
X[ row_counter ][0] = math.log( float(reference_read_count) /
float(total_read_count), 2 )
# For alternate allele-supporting read counts
X[ row_counter ][1] = math.log( float(alternate_read_count) /
float(total_read_count), 2 )
# Convert the genotypes to numeric labels
y[ row_counter ] = convert_genotype_to_number( genotype )
row_counter = row_counter + 1
load_data( filename=labeled_data_file )
Next we will define the multi-class neural network model.
Defining the MultiClass Neural Network Model
Let’s define the general network model for classifying for multiple genotype labels (i.e. more than two, or binary):
Figure 3: This illustrates the multiclass neural network model we will be implementing.
This is the same 2-layer neural network model you have seen before, but with a few new additions. One of the most obvious is the increased number of nodes. We put more nodes at the beginning to capture more detail, and then summarize the pertinent information by selecting only four nodes in the second hidden layer. It is a very common approach in neural network design to start with more nodes and then narrow the results down to fewer and fewer nodes further along the network.
Another element in this network is a new activation function called $ReLU$, which stands for Rectified Linear Unit. It returns $0$ for any input below $0$, and returns the input unchanged for anything above $0$. This is defined as follows:
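$$ReLU(x) = \max(0, x)$$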
Here $x$ can represent complex interactions among multiple variables, and the result is transmitted to the next nodes of the model. The graph below shows the general behavior of the activation function:
Figure 4: This is the graphical representation of the ReLU activation function.
The other characteristic that this model contains is the $softmax$ function, which turns a vector of values into probabilities (i.e. the sum of the values will then be equal to 1). The $softmax$ function is the following:
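$$softmax(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
where $z$ is the vector of raw outputs from the last layer and $K$ is the number of classes (here, the four genotype classes). With these pieces in place, the model class can be defined as follows (the softmax itself is applied later, when converting the raw predictions into probabilities):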
class GenotypeClassifier(torch.nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.main = torch.nn.Sequential(
torch.nn.Linear(input_size, hidden_size),
torch.nn.ReLU(),
torch.nn.Linear(hidden_size, output_size)
)
def forward(self, x):
out = self.main(x)
return out
You will notice that we inherit from the class torch.nn.Module, which is the base class for neural networks in PyTorch. Then we utilize the torch.nn.Linear class, which performs the linear transformation $X \cdot W^T + bias$ (where $W^T$ is the transpose of the weights).
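A quick way to see this in action is to inspect a standalone Linear layer on its own (the printed values will simply be whatever the random initialization produces):
# A Linear layer with 2 inputs and 6 outputs
layer = torch.nn.Linear( 2, 6 )
print( layer.weight.shape )  # torch.Size([6, 2]) -- the weights W
print( layer.bias.shape )    # torch.Size([6])    -- the bias
x = torch.randn( 1, 2 )      # one example with two input features
print( layer( x ) )          # equivalent to x @ layer.weight.T + layer.bias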
One more important element when building a model is to randomize the data, and then use 80% for training the model and 20% for subsequently testing it. The testing step usually is not shuffled, but in our case we will shuffle both stages, since otherwise the PyTorch data-loading methods would keep providing the same input in every round, which would not give a proper validation.
Let’s now train and test the neural network.
Implementing the MultiClass Neural Network Model
The first step is to define the model and a few initial variables that we will use:
# We will need to split our data into a training and testing group.
# The training set will be composed of randomly selected values
# representing 80% of the data.
# The testing set will be composed of randomly selected values
# representing 20% of the data.
TEST_SIZE = 0.2
# The batch size represents how much of the data at a time will
# be used during rounds of training or testing (epochs). In this
# case an epoch will represent a round of training, with a total
# of 10 rounds. In each round, 6700 values of the input will
# be used.
BATCH_SIZE = 6700
# This value seeds the random number generator used when selecting
# values, so that the split is reproducible.
SEED = 42
# Here we will generate the training set, and the test set for both
# the input (X) and output (y). Here the input represents the ratios
# of reference and alternate allele counts compared to the total.
# The output represents the numeric genotype values.
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=TEST_SIZE,
random_state=SEED
)
# Here we store the size of the training and test sets for calculating the accuracy of
# our model's predictions.
train_size = len(X_train)
test_size = len(X_test)
If we print the values for train_size and test_size they will be 59785 and 14947, respectively.
Next we will generate the tensors. A tensor is the equivalent of a NumPy array, and is the standard data format in machine learning platforms, since it can carry additional attributes, such as whether it should reside on a GPU or a CPU. Usually the shape (dimensions) of a tensor is defined as $(batch, channels, rows, columns)$, where channels are different data values for the rows and columns, such as RGB (red, green, blue) values for an image. Batches are how many groups (sets) of the dimensions below $(channels, rows, columns)$ should be taken together in a calculation, such as for training or testing (validation).
Next we will create dataset tensors to provide on-the-fly (lazy) data set representations for both training and testing. We will use dataloaders to shuffle and iterate over them. This nice tutorial provides a more friendly definition with examples.
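A minimal sketch of that setup (assuming TensorDataset and DataLoader from torch.utils.data) is shown below; it produces the dataloader_train and dataloader_test objects that the training and testing loops iterate over:
# Convert the NumPy train/test splits into tensors (float inputs, integer class labels)
X_train_tensor = torch.tensor( X_train, dtype=torch.float32 )
y_train_tensor = torch.tensor( y_train, dtype=torch.long )
X_test_tensor = torch.tensor( X_test, dtype=torch.float32 )
y_test_tensor = torch.tensor( y_test, dtype=torch.long )
# Wrap the tensors into lazy dataset representations
dataset_train = TensorDataset( X_train_tensor, y_train_tensor )
dataset_test = TensorDataset( X_test_tensor, y_test_tensor )
# The dataloaders shuffle the data and serve it in batches of BATCH_SIZE
dataloader_train = DataLoader( dataset_train, batch_size=BATCH_SIZE, shuffle=True )
dataloader_test = DataLoader( dataset_test, batch_size=BATCH_SIZE, shuffle=True )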
# Here we initialize the GenotypeClassifier with:
# * The number of inputs
# * The number of nodes in the hidden layer
# * The number of outputs
# The instantiation below is equivalent to:
# shallow_variant_model = GenotypeClassifier(2, 6, 4)
shallow_variant_model = GenotypeClassifier( number_of_inputs,
number_of_neurons_in_hidden_layer,
number_of_genotype_classes )
We will be training the model in batches, since we do not have enough computing power to train on the whole dataset at once. Therefore, within each training round (epoch), we take a batch, run it through the model, and compare the predictions against the expected (true) y values. Basically we compare the predicted $y$ with the true $y$, and based on their difference we optimize the model. With a bigger difference we optimize more, and vice versa with a smaller one. The $criterion$ measures how wrong our model's predictions are, and is usually called the $loss \; (cost) \; function$. The loss will be used by the $optimizer \; function$ to adjust the model's weights for the next round. Since we will be classifying across multiple classes, we will utilize the CrossEntropyLoss function:
# The Loss function
criterion = torch.nn.CrossEntropyLoss()
The basic formula for cross entropy loss is the following:
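$$H(y, \hat{y}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} y_{n,i} \, \log(\hat{y}_{n,i})$$
where $N$ is the number of examples in a batch, $K$ is the number of genotype classes, $y$ holds the true (one-hot) labels, and $\hat{y}$ holds the predicted probabilities.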
Assuming the values of y are between 0 and 1, the above formula calculates the deviation of the predicted outputs $\hat{y}$ in a batch from the true (expected) values ($y$). What it really measures is the information content (entropy, $H$) across differing distributions. If the predictions are random, these values will be high, giving a big loss (wrong result) and requiring a larger shift in the weights of the neural network. Otherwise, they will stabilize and the loss will become minimal, which is ideal, making the model highly accurate in predicting the outcome.
Given the $loss \; function$ (or $criterion$), there needs to be another function that optimizes the model based on the loss. For that we will use the Adam optimizer:
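A minimal way to define it, using PyTorch's default Adam learning rate as a starting point, is the following:
# The optimizer function; lr=0.001 is PyTorch's default learning rate for Adam,
# used here as a reasonable starting point
optimizer = torch.optim.Adam( shallow_variant_model.parameters(), lr=0.001 )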
The Adam (Adaptive Moment Estimation) optimizer was first proposed in 2014 (with multiple updates following it) by Diederik P. Kingma and Jimmy Ba. The general intuition is that it has an advantage over other optimizers because it adaptively adjusts the learning rate of each weight based on previous gradients, so it converges much faster. The lr parameter is the initial learning rate to start with, meaning how much to adjust the weights the first time, when no gradients are available yet.
Now we are ready to begin to train the model.
Training the Model
Now we are ready to run through 10 epochs (rounds) to train the model, while checking on its loss and accuracy:
for epoch in range( number_of_training_or_testing_rounds ):
losses = 0
correct = 0
total = 0
for X_batch, y_batch in dataloader_train:
optimizer.zero_grad() # Reset the gradients for re-optimization
y_pred = shallow_variant_model(X_batch) # Run the model to get the predictions
loss = criterion(y_pred, y_batch) # Calculate the loss
loss.backward() # Determine the gradients of the weights and biases
        optimizer.step() # Update the weights and biases
losses = losses + loss.item() # Add the loss for this batch to the running total
correct = correct + correct_count( y_pred, y_batch )
total = total + 1
print( f"epoch: {epoch + 1} | loss: {losses / len(dataloader_train):.4f} |
accuracy: {100*(correct / train_size):.2f}" )
You might notice that we use a correct_count() function above. All that function does is determine how many of the predicted values in y_pred match the true values in y_batch. The contents of the function are the following:
def correct_count( y_pred, y_true ):
correct = 0
y_pred_argmax = torch.argmax(y_pred, dim=1)
for i in range(0, len(y_pred_argmax) ):
if (y_pred_argmax[i] - y_true[i]) == 0:
correct = correct + 1
return correct
While training the model, below is the output of each epoch:
| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.2682 | 97.07 %  |
| 2     | 0.0249 | 98.19 %  |
| 3     | 0.0203 | 99.26 %  |
| 4     | 0.0179 | 99.26 %  |
| 5     | 0.0141 | 99.30 %  |
| 6     | 0.0137 | 99.36 %  |
| 7     | 0.0124 | 99.39 %  |
| 8     | 0.0121 | 99.42 %  |
| 9     | 0.0119 | 99.41 %  |
| 10    | 0.0117 | 99.45 %  |
Below is a graph representing the above results:
Figure 5: This shows the relationship of the loss and accuracy of the model, while it is being trained throughout the epochs.
The general idea is that, as a model is trained, each round should yield a lower loss and a higher accuracy, which is what is being exhibited here.
Testing the Model
Now we are ready to test how well our model behaves on testing data, which it has not seen before. Again we will go through 10 epochs (rounds) of testing, while checking its loss and accuracy:
shallow_variant_model.eval()
with torch.inference_mode():
for epoch in range( number_of_training_or_testing_rounds ):
losses = 0
correct = 0
total = 0
for X_batch, y_batch in dataloader_test:
y_pred = shallow_variant_model(X_batch) # Run the model to get the predictions
loss = criterion(y_pred, y_batch) # Calculate the loss
losses = losses + loss.item() # Add the loss for this batch to the running total
correct = correct + correct_count( y_pred, y_batch )
total = total + 1
print( f"epoch: {epoch + 1} | loss: {losses / len(dataloader_test):.4f} |
accuracy: {100*(correct / test_size):.2f}")
While testing the model, below is the output of each epoch:
| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.0095 | 99.61 %  |
| 2     | 0.0105 | 99.61 %  |
| 3     | 0.0110 | 99.61 %  |
| 4     | 0.0093 | 99.61 %  |
| 5     | 0.0119 | 99.61 %  |
| 6     | 0.0106 | 99.61 %  |
| 7     | 0.0108 | 99.61 %  |
| 8     | 0.0096 | 99.61 %  |
| 9     | 0.0094 | 99.61 %  |
| 10    | 0.0106 | 99.61 %  |
Below is a graph representing the above results:
Figure 6: This shows the relationship of the loss and accuracy, while the model is being tested throughout the epochs.
Based on the low loss and high accuracy, the model performs fairly well.
Making Some Predictions
Now let us try to predict some genotypes. We will take the following sets of read counts for different positions:
| Position | Reference Reads | Alternate Reads |
|----------|-----------------|-----------------|
| 1        | 10              | 1               |
| 2        | 10              | 10              |
| 3        | 1               | 10              |
We will need to take their fractions based on the total number of reads at each position, and then take the $log_2$ of each of those fractions:
# The reads are of the form: [reference, alternate]
# We will divide them by their total and take the log2 of each,
# before we submit them to the model
prediction_reads_fractions = torch.log2(
torch.div(
torch.tensor([[10,1], [10,10], [1,10]], dtype=torch.float32 ),
torch.tensor([[10+1], [10+10], [1+10]], dtype=torch.float32 )
)
)
# The unprocessed predictions from the model
prediction = shallow_variant_model( prediction_reads_fractions ).detach()
print( "The unprocessed predictions from the model: \n" )
print( prediction )
# The softmax transformations of the predictions to probabilities
probability = nn.Softmax( dim=1 )( prediction )
print( "\nThe softmax transformations of the predictions to probabilities: \n" )
print( probability )
# Checking that the probabilities add up to 1
print( "\nChecking that the probabilities add up to 1: \n" )
print( probability.sum( dim=1 ) )
# The numeric genotype predictions
print( "\nThe probability after argmax: \n" )
classes = probability.argmax( dim=1 )
print( classes )
# The genotype of the numerically-transformed predictions
print( "\nThe genotype of the numerically-transformed predictions: \n" )
print( convert_numeric_class_to_genotype( classes ) )
The output result of the predictions is the following:
The unprocessed predictions from the model:
tensor([[-40.8362, 19.1928, 16.7496, -6.0768],
[-20.0807, 0.3643, 6.1820, -5.4118],
[-35.6244, -17.7323, 4.2061, 7.6862]])
The softmax transformations of the predictions to probabilities:
tensor([[7.8263e-27, 9.2006e-01, 7.9939e-02, 9.7584e-12],
[3.9171e-12, 2.9654e-03, 9.9703e-01, 9.1953e-06],
[1.5040e-19, 8.8650e-12, 2.9883e-02, 9.7012e-01]])
Checking that the probabilities add up to 1:
tensor([1.0000, 1.0000, 1.0000])
The numeric genotype predictions:
tensor([1, 2, 3])
The genotype of the numerically-transformed predictions:
['0/0', '0/1', '1/1']
The results seem promising, and would likely improve further with more data and more rounds of training.
One of the reasons I started this blog series is to demystify what these models are actually doing internally. In the next section we will dive a bit deeper into what the model actually learned from the read depths to predict the genotype.
What has the model learned? Diving Deeper Into the Genotype Model
PyTorch has made it easy to look inside the model to determine what the weights have learned from the data:
# The parameters of the trained model
print( "\nThe parameters of the trained model:\n" )
print( shallow_variant_model.state_dict() )
Let us now try to inspect the effect of the weights across the range of read fractions. If we sweep the read fractions across the range {$0.0001,…,0.9999$} and look at the outputs of Hidden Layer I, the node output distribution across that range is as follows:
Figure 7: This shows the change in the outputs of Hidden Layer I for the range of possible inputs of read fractions.
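A sweep like the one in Figure 7 can be produced with a short sketch along these lines (assuming matplotlib is available), by feeding the log2-transformed fraction pairs through the first Linear layer of the trained model:
import matplotlib.pyplot as plt
# Sweep the reference read fraction p across (0, 1), avoiding exact 0 and 1 for the log
p = torch.linspace( 0.0001, 0.9999, steps=1000 )
# Build the log2-transformed input pairs (reference fraction, alternate fraction)
inputs = torch.stack( [torch.log2( p ), torch.log2( 1 - p )], dim=1 )
with torch.no_grad():
    # Index 0 of the Sequential container is the first Linear layer (Hidden Layer I)
    hidden_outputs = shallow_variant_model.main[0]( inputs )
for node in range( number_of_neurons_in_hidden_layer ):
    plt.plot( p.numpy(), hidden_outputs[:, node].numpy(), label=f"node {node + 1}" )
plt.xlabel( "reference read fraction (p)" )
plt.ylabel( "Hidden Layer I output" )
plt.legend()
plt.show()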
Another way to view the analysis is from the perspective of the binary contribution of the weights. The inputs are $(p, 1-p)$, where $p \in$ {$0,…,1$}, as is $1-p$. Thus, given the log plot of the values, the highest contribution would be around $0.5$:
Figure 8: This shows the contributions of the weights among the two inputs based on the log-transformed values.
Thus the ReLU activation function will remove any negative contributions for the upcoming layers of the network. With that knowledge in mind, let's explore which nodes actually have the most effect on predicting each genotype class:
Figure 9: This shows the change in the outputs of the ReLU activation function, for the range of possible inputs of read fractions.
Given the above graph, it seems that only the weights of nodes 2, 3, and 4 of Hidden Layer I have a significant impact on the separation of genotypes. With that knowledge, let's explore the output of Hidden Layer II:
Figure 10: This shows the change in the outputs of Hidden Layer II for the range of possible inputs of read fractions.
Based on the separation of genotype classes, only classes $1$ $(0/0)$, $2$ $(0/1)$, and $3$ $(1/1)$ separate from one another. Thus we can use class $1$ $(0/0)$ as a baseline and plot the other two against it, in order to determine the separation of genotypes:
Figure 11: This shows the separation among genotype classes $2$ $(0/1)$ and $3$ $(1/1)$ with respect to class $1$ $(0/0)$.
There is a clear point of intersection at around the value of $-14$ for class $1$ $(0/0)$, where a switch takes place between classes $2$ $(0/1)$ and $3$ $(1/1)$, as emphasized by the weights.
There is more analysis in the works, with which this report will be updated, but I hope I gave you an overview of what neural network models learn and how genotypes can be predicted just from the read depths.