Bioinformatics Part 2

Link to part 1

GitHub Link

The next problem to solve “Complementing a Strand of DNA” were develop a reverse complement of a DNA string.

Using the reverse_complement function we used before, we can develop the complement using the sample dataset.

DNA_reverse_complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}


def reverse_complement(seq):
    return ''.join([DNA_reverse_complement[nuc] for nuc in seq])[::-1]


reverse_complement_string = reverse_complement('AAAACCCGGT')

sample_output = 'ACCGGGTTTT'

if sample_output == reverse_complement_string:
    print('MATCH \n')
    print('reverse_complement_string: {}'.format(reverse_complement_string))
    print('sample_output: {}'.format(sample_output))
else:
    print('NO MATCH')

I added small if function to check if the result matches the sample dataset.

I got this result:

image001.png

So now we can get the real data:

image003.png
image005.png

The code successfully worked:

image007.png

In the next video, We calculate the GC content. {explain GC content TK}

def gc_content(seq):
    return round(seq.count('C') + seq.count('G') / len(seq) * 100)
image009.png
image011.png

I checked the GC content on an online tool. I got this result:

image013.png

Which means my program is undercounting the GC content. I quickly found the issue, which is about the function has missing parenthesis over the seq.count.

Before:

def gc_content(seq):
    return round(seq.count('C') + seq.count('G') / len(seq) * 100)

After:

def gc_content(seq):
    return round((seq.count('C') + seq.count('G')) / len(seq) * 100)

Next we make function that calculates the GC content in a subsection.

def gc_content_subsec(seq, k=20):
    res = []
    for i in range(0, len(seq) - k + 1, k):
        subseq = seq[i:i + k]
        res.append(gc_content(subseq))
    return res

Next video to with Rosalind problems again using the function we just made.

The Rosalind problems ask us to work out the GC content. With sting labels in the dataset. This is called the FASTA data format. A widely used format for bioinformatics.

image015.png

Following the video most of the work is formatting the data so it can be analysed for the GC content.

def read_file(file):
    with open(file, 'r') as f:
        return [l.strip() for l in f.readlines()]


def gc_content(seq):
    return round((seq.count('C') + seq.count('G')) / len(seq) * 100)

The first function reads the file. And strips the lines of whitespace. And separates them into lines.

FASTAFile = read_file("rosalind_problems\\test_data\gc_content_ex.txt")
# Dictionary for Labels + Data
FASTADict = {}
# String for holding the current label
FASTALabel = ""

for line in FASTAFile:
    if '>' in line:
        FASTALabel = line
        FASTADict[FASTALabel] = ""
    else:
        FASTADict[FASTALabel] += line

For loop adds data into the directory made before. It used an if statement to see the check the open sequence. The “>” symbol is used as the label (key) in the FASTADict. Then others lines are added as values to the key.

RESULTDict = {key: gc_content(value) for (key, value) in FASTADict.items()}
print(RESULTDict)

Output:

{'>Rosalind_6404': 54, '>Rosalind_5959': 54, '>Rosalind_0808': 61}

Creates a new dictionary containing the GC content of values in the FASTA dict.

MaxGCKey = max(RESULTDict, key=RESULTDict.get) This uses the python max function to find the largest item in the dictionary.

print(f'{MaxGCKey[1:]}\n{RESULTDict[MaxGCKey]}')

The line of code prints the first result of the MaxGCKey. The uses that result in finding the name of the value in the RESULTDict.

Now we can download the dataset form, Rosalind.

image017.png
image019.png

Rosalind said to was wrong:

image021.png

I think the error comes down to the program rounding down/up the result too much. As we can see from the sample output:

image023.png

The sample out has a lot more decimal points which allow for a finer margining of error. To fix this I simply adjusted the round function to allow 6 decimal points:

def gc_content(seq):
    return round((seq.count('C') + seq.count('G')) / len(seq) * 100, 6)
image025.png

This is the same data from above but allows decimal points to be used.

So I tried the program again with new data:

image027.png
image029.png

The program successfully worked:

image031.png

For the next video, we work on using codons and translation. Codons are a sequence of 3 nucleotides in RNA and DNA strings. Mainly used for protein synthesis. Translations covert the RNA and DNA to proteins.

Copying the codon table turned directory to the structures file:

image033.png

The translation function:

def translate_seq(seq, init_pos=0):
    return [DNA_Codons[seq[pos:pos + 3]] for pos in range(init_pos, len(seq) - 2, 3)]

[DNA_Codons[seq[pos:pos + 3]] >>> Goes through the DNA_Codons dictionary. The seq variable is the DNA string is inputted. The sequence is then sliced to check 3 nucleotides in the sequence. As the starting position is zero the first nucleotides may look like this:

seq = 'GCATCATTTTCCGACTTGCAGAAGACCTTCTTAGTGAAATTGGGATAACT'
seq[0:3]
'GCA'
>>> seq[3:6]
'TCA'

for pos in range(init_pos, len(seq) - 2, 3) >>> Starts a for loop with the range of the stating position of the sequence (init_pos variable). Stops the for loop and the end of sequence minus 2. len(seq) - 2. The len(seq) – 2 in the for loop stops the program from reading a sequence that only has 2 nucleotides making it unable to read a codon.

Like this error:

bioinfom error anti doxx.png

Next function we develop is a function that counts the usage of a codon in a sequence.

image037.png
def codon_usage(seq, aminoacid):
    """Provides the frequency of each codon encoding a given aminoacid in a DNA sequence"""
    tmpList = []
    for i in range(0, len(seq) - 2, 3):
        if DNA_Codons[seq[i:i + 3]] == aminoacid:
            tmpList.append(seq[i:i + 3])

    freqDict = dict(Counter(tmpList))
    totalWight = sum(freqDict.values())
    for seq in freqDict:
        freqDict[seq] = round(freqDict[seq] / totalWight, 2)
    return freqDict
tmpList = [] >>> creates temporary list to codons
for i in range(0, len(seq) - 2, 3): >>> Makes loop that starts at 0 and then with the length of the  sequence -2 with the increment of 3.
if DNA_Codons[seq[i:i + 3]] == aminoacid:
    tmpList.append(seq[i:i + 3])

If statement checks if codon matches with amino acid selected in the Function. If the codon matches the amino acid then the codon is then added to the temporary list.

freqDict = dict(Counter(tmpList)) >>> creates a dictionary with amount of times the codon occurred in the list.

totalWight = sum(freqDict.values()) >>> Adds up all the number together so it later be used to calculate the percent of amino acids

for seq in freqDict:
        freqDict[seq] = round(freqDict[seq] / totalWight, 2)

A small for loop that calculates the percentage of the codon. By dividing the frequency of codon with the total number codons with totalWight. The rounding by decimal points.

$\setCounter{0}$
Previous
Previous

How I created an API that can work out your shipping emissions

Next
Next

Learning some bioinformatics with Python