Learning some bioinformatics with Python

I’m making a mini-project about bioinformatics. I’m going to be following this well put together YouTube series by rebel coder.

According to Wikipedia, is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex.

Bioinformatics has become a more viable field as the increase of computing power and major improvements in genetic sequencing. Allows us to analyse biological data at a scale never seen before.

Examples can include:

Genomics: Studying a person genome to find causes of diseases and complex traits. Is now easier as technology can now go through millions of units of DNA in quicker time.

 

Epidemiology: Nextstrain is a website that tracks how pathogens evolve. By looking thorough genome of the pathogen and finding and changes.

Figure 1 History Tree of SARS-COV-2 https://nextstrain.org/ncov/global

Figure 1 History Tree of SARS-COV-2 https://nextstrain.org/ncov/global

Drug Discovery: Bioinformatics can used to help find drugs using DNA stored in a database to test with chemical compounds. One example: http://click2drug.org/

A very important distinction I found from the website https://www.ebi.ac.uk:

 

Bioinformatics is distinct from medical informatics – the interdisciplinary study of the design, development, adoption and application of IT-based innovations in healthcare services delivery, management and planning. Somewhere in between the two disciplines lies biomedical informatics – the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific enquiry, problem-solving and decision making, motivated by efforts to improve human health. - https://www.ebi.ac.uk/training/online/course/bioinformatics-terrified/what-bioinformatics-0

In the first video of the series, we started work on validating and counting nucleotides.

Nucleotides are units that help build up DNA and RNA structures. Nucleotides contain a phosphate group. A five-carbon sugar and They have 4 different nitrogenous bases adenine (A), thymine (T), guanine (G), and cytosine (C).

Figure 2 The Structure of DNA MITx Bio

Figure 2 The Structure of DNA MITx Bio

Sources

https://en.wikipedia.org/wiki/Nucleotide,

https://www.youtube.com/watch?v=o_-6JXLYS-k - The Structure of DNA MITx Bio

https://www.youtube.com/watch?v=hI4v7v8AdfI& - Introduction to nucleic acids and nucleotides | High school biology | Khan Academy

The first function I made by watching the video. Was a DNA validator. This function will check if a string has incorrect characters that are not nucleotides. adenine (A), thymine (T), guanine (G), and cytosine (C). Then return false if any incorrect characters are in the string.

image011.png

The function I made from the video is frequency count for the nucleotides. Counts the number of nucleotides in strings and counts them in a dictionary.

def countNucFrequency(seq):
    temFreqDict = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    for nuc in seq:
        temFreqDict[nuc] += 1
    return temFreqDict


nucleotides = ['A', 'C', 'G', 'T']


def validateSeq(dna_seq):
    temsep = dna_seq.Upper()
    for nuc in temsep:
        if nuc not in nucleotides:
            return False
     return temsep

nucleotides = ['A', 'C', 'G', 'T'] defines the correct characters that should be in the sequence.

def validateSeq(dna_seq):defines the function with argument called dna_seq.

temsep = dna_seq.Upper() sets a variable that changes the dna_seq to uppercase. So the input is correctly compared to the nucleotides list which contains uppercase letters.

for nuc in temsep: >>> for loop that checks each character in the temseq.

if nuc not in nucleotides:>>> An if statement that checks the character is not in the nucleotides list.

return False. >>> Function returns if statement from above is true

return temsep. >>> Function returns the sequence if all characters are in the nucleotides list.

Testing the function with a random string:

from dna_toolkit import validateSeq

rndDNASeq = 'ATTACG'

print(validateSeq(rndDNASeq))
image005.png

Testing with wrong character

rndDNASeq = 'ATTACGX'

print(validateSeq(rndDNASeq))
image007.png

Testing with mixed cases

rndDNASeq = 'ATTACGcAat'

print(validateSeq(rndDNASeq))
image009.png

Next in the video, he created a random DNA generator.

rndDNASeq = ''.join([random.choice(nucleotides) for nuc in range(20)])

‘’.join This joins the characters together with no spaces. Random.choice(nucleotides) Chooses a random character based on the list. for nuc in range(20) sets a for loop with the range 20. Setting the length of the DNA string.

image013.png

In the next YouTube video, we solve Rosalind problems. http://rosalind.info is a website that provides bioinformatics problems to solve.

First problem will be working is called Counting DNA Nucleotides. According to the video we will adapting the countNucFrequency we made earlier to solve the issue.

def countNucFrequency(seq):
    temFreqDict = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    for nuc in seq:
        temFreqDict[nuc] += 1
    return temFreqDict


Dna_string = 'CAGCGCTTTTCGACTCCAGT'
result = countNucFrequency(Dna_string)
print(' '.join([str(val) for key, val in result.items()]))

The most important bit of this code is the formatting. As Rosalind requires a certain output:

image015.png

In the dictionary output from earlier needs to change into format in the picture.

print(' '.join([str(val) for key, val in result.items()]))

‘ ’.join >> adds a space between the numbers.

([str(val) for key, val in result.items()] >> List comprehension of looping through the dictionary.

str(val) for key >> Turns val variable (count) in into string so it can be printed and formatted. Does this for each key in the dictionary.

, val in result.items() >> returns items in results directory

Now I download the dataset and use the program I created.

image017.png

Downloading the dataset gives us a text file of a long DNA string

image019.png

I changed the DNA string variable to the DNA string from the text file.

image021.png

The second video works on transcription and Reverse Complement. Transcription is the process of converting DNA to mRNA. Reverse Complement creates the second complementary strand of DNA. That is in reverse order. Each of the bases pair to specific bonds. Adenine with Thymine and Cytosine with Guanine. The structure of DNA is that there are two strands where they are facing opposite directions.  This is done as two-stranded DNA can be written as one stranded.

Source: DNA structure and the Reverse Complement operation


We replace thymine(T) with uracil(U) as this is the base used in RNA.

 Source:  https://youtu.be/JQByjprj_mA?t=155

def transcription(seq):
    return seq.replace('T', 'U')
image023.png

As can see the output shows uracil(U) instead of thymine(T).

DNA_reverse_complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'} This directory changes the base of the nucleotide so it matches with base-pairing rules. Adenine with Thymine, Cytosine with Guanine.

'A': 'T', 'T': 'A' >>> Adenine gets swapped with Thymine. ('A': 'T') and vice versa ('T': 'A')

'G': 'C', 'C': 'G' >>> Guanine gets swapped with Cytosine. ('G': 'C') and vice versa ('C': 'G')

def reverse_complement(seq):
    return ''.join([DNA_reverse_complement[nuc] for nuc in seq])[::-1]

The DNA_reverse_complement dictionary is used to replace characters when the DNA sequence is being reversed.

The reverse_complement function changes each nucleotide with the dictionary, then reserves the sequence with [::-1]

image025.png

To check if the reverse complement is correct I inputted the sequence into this website. That does reverse complements for you and copied their result. Then I used the find and search feature of my code editor to match the string.

image027.png
image029.png

The reverse complement matches the answer.  So, the function works correctly.

 

In the third video, we solve Rosalind problems again using the functions we created in DNA toolkit of earlier.

The Rosalind problem is asking us to transcribe DNA to RNA.

image031.png

In the program, I copied the transcription function. And passed the DNA string into the function. Which was saved into a variable called RNA_string.

def transcription(seq):
    return seq.replace('T', 'U')
DNA_string = 'GATGGAACTTGACTACGTAAATT'

RNA_string = transcription(DNA_string)

print(RNA_string)

Used the Find feature from my code editor to see if the output matches the sample output from the website.

image033.png

It does so the program successfully transcribes DNA to RNA.

So now we can use the real data from the website:

image035.png
image037.png

I submitted the RNA string into Rosalind. Then I got this result:

image039.png

I will post more about me learning about this topic.

Please check rebelcoder’s channel as im basically learning almost everything about this topic from his videos.

Tobi Olabode