# A biology question that is actually a probability question

Now that the hurricane issues are slowly dissipating, I made it back to Brooklyn today, back to the place I spend most of my time… my school… I suppose you could call it home.

I’m doing some work here before turning to my apartment, and I ran into a science teacher who asked me a question:

Let’s say you have a sequence of 3 billion nucleotides. What is the probability that there is a sequence of 20 nucleotides that repeats somewhere in the sequence? You may assume that there are 4 nucleotides (A, C, T, G) and when coming up with the 3 billion nucleotide sequence, they are all equally likely to appear.

I liked the question, but I haaaave to work on my own work and not this problem at this moment. So I thought I’d throw it to you.

A. What’s the answer to this question?

B. How would you explain it to this biology teacher (who knows basic math stuffs)?

and for the bonus…

C. How would you design a lesson that would make a student understand the process and your answer? You can assume that the student understands combinations and permutations.

If I get some work done today, I may think through this problem as a treat. If none of you beat me to the punch. But I’d rather you beat me to the punch.

PS. I might as well throw in the additional question of: “how long does the length of the sequence have to be before you are guaranteed a repetition of a sequence of 20 nucleotides?”

UPDATE: My friend Jason Lang sent me his solution, which is amazingly written and cogent.

1. This is related to the birthday problem. If there are M people in a room, and there are N days in a year, then what is the probability that two or more people share a birthday? The answer is approximately 1 – e^(-M^2/(2N)). In the biology problem, M is 3*10^9 and N is 4^20, so M^2/(2N) is about 4.1 million and the probability of no repetition, e^(-M^2/(2N)), is vanishingly small: approximately 1 in 2 * 10^1777448. This calculation assumes that the subsequences are independent, which is not quite true, but nearly so.
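A number that small can’t be evaluated directly, because e^(-M^2/(2N)) underflows to zero in floating point. Here is a minimal Python sketch that works with the base-10 logarithm instead. The function name `log10_prob_no_repeat` is my own, and it takes M to be 3*10^9 even though the exact count of starting positions is 19 fewer.

```python
import math

def log10_prob_no_repeat(m: int, n: int) -> float:
    """Base-10 log of the birthday approximation exp(-m^2 / (2n))."""
    # ln(p) = -m^2 / (2n); dividing by ln(10) converts to log10.
    return -(m * m) / (2 * n) / math.log(10)

M = 3 * 10**9   # approximate number of 20-nucleotide windows (assumption: ignores the last 19)
N = 4**20       # number of distinct 20-nucleotide sequences

exponent = log10_prob_no_repeat(M, N)
print(f"P(no repetition) is roughly 10^({exponent:.1f})")
```

Working in logs keeps everything as ordinary floats; the probability itself is far below anything a double can represent.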

2. More intuitively, there are 4^20, or about 10^12, different sequences of 20. You have 3*10^9 spots to start your sequence of 20, so for any given sequence there’s about a 0.3% chance of finding it, which isn’t very high.

But then you have almost 3*10^9 sequences to try! (I’m ignoring slight differences because the sequences overlap each other.) That means on *average* you’d expect to find about 0.3% of that, which is about ten MILLION repeated sequences. So the probability that it doesn’t happen even once is vanishingly small.

(Maybe the 4^20 part requires explanation too, depending on your audience.)

The guaranteed repetition is going to be after essentially 4^20 … but with such a large space, by the time you’re guaranteed one you can expect on average for there to be billions of repeats, so unless someone is very carefully arranging the sequence, you’ll never have to wait that long.
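To pin down “essentially 4^20”: a sequence of length L has L - 19 windows of 20, so once L - 19 exceeds 4^20 the pigeonhole principle forces a repeat. That puts the guarantee at 4^20 + 20 nucleotides. The brute-force sketch below checks the same a^k + k pattern on tiny alphabets, where trying every string is feasible. The function names are made up for illustration.

```python
from itertools import product

def has_repeat(seq, k):
    """True if some length-k window occurs at two different positions."""
    seen = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            return True
        seen.add(window)
    return False

def guaranteed_length(alphabet, k):
    """Smallest length at which EVERY string over `alphabet` has a repeated k-window.

    Brute force, so only usable for very small alphabets and window sizes.
    """
    length = k + 1  # shortest length that has two windows
    while True:
        if all(has_repeat(s, k) for s in product(alphabet, repeat=length)):
            return length
        length += 1

print(guaranteed_length("AC", 2))    # compare with 2**2 + 2
print(guaranteed_length("AC", 3))    # compare with 2**3 + 3
print(4**20 + 20)                    # the same formula for the biology problem
```

Strings one shorter than the guarantee can still dodge a repeat, but only if they are arranged very carefully, which is the point of the caveat above.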

Also I’m sure the assumption of randomness is not correct, so the number of repeated sequences of 20 will be much larger than this.

Maybe a more interesting question is, on average, what would you expect the longest repeated sequence to be? That is, out of your 3*10^9, we know there’s almost a certainty of a length 20 sequence being repeated, and we should expect millions of repeats in there … how about a length 30? 40? The exponential growth there is probably still surprising to people.
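Here is a rough way to put numbers on that, assuming the same uniform random model and counting unordered pairs of matching windows, M^2/(2 * 4^k). The helper names are my own, and M is again taken as 3*10^9 for every window length.

```python
M = 3 * 10**9  # approximate number of windows (assumption: same for every window length)

def expected_repeated_pairs(m: int, k: int, alphabet_size: int = 4) -> float:
    """Expected number of unordered pairs of positions whose length-k windows match."""
    return m * m / (2 * alphabet_size**k)

def crossover_length(m: int, alphabet_size: int = 4) -> int:
    """Smallest window length whose expected number of repeated pairs drops below 1."""
    k = 1
    while expected_repeated_pairs(m, k, alphabet_size) >= 1:
        k += 1
    return k

for k in (20, 30, 40):
    print(k, expected_repeated_pairs(M, k))

print("expected repeats fall below 1 at length", crossover_length(M))
```

Under these assumptions the expected count drops from roughly four million at length 20 to about four at length 30 and about four in a million at length 40, so the longest repeat in a random sequence of this size should land somewhere around 30.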

3. I don’t wish to seem like I’m nitpicking, but I think before I can answer I’d have to know: would a sequence of 21 of the same nucleotide (that is, let’s say, CCC CCC CCC CCC CCC CCC CCC) be counted as containing a repetition of the 20-copies-of-C sequence or not?

1. It’s a good question. The way the question is worded, and the way I’m thinking about it, yes, that counts as a repetition.
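A small Python sketch of that counting convention, where overlapping windows are allowed to count as a repetition. The function name `repeated_windows` is just for illustration.

```python
from collections import Counter

def repeated_windows(seq: str, k: int) -> dict:
    """Map each length-k window that occurs more than once to its number of occurrences.

    Windows may overlap, so a run of k+1 identical letters counts as a repeat.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {window: c for window, c in counts.items() if c > 1}

# 21 copies of C contain the 20-C window at position 0 and again at position 1.
print(repeated_windows("C" * 21, 20))
```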

1. So, I posted how I’d explain it to your colleague or your colleague’s students on my blog. Thanks for passing that one along!

4. Kevin C. says:

By the final step in Jason’s answer he’s multiplying the probability for every length-20 substring together, and some of those strings overlap and are therefore not independent even if individual nucleotides are independent. It’s certainly true that the probability approaches 1 that there’s some repeat, but I’m not sure how much of an impact the (lack of) independence has on the final probability.
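One way to get a feel for this is to simulate a scaled-down version and compare it with the birthday formula. The sketch below is only that: the sizes (length 300, windows of 8, 2000 trials) and the function names are arbitrary choices of mine, picked so the probability is neither near 0 nor near 1. One reason the two numbers might differ is that matches cluster: if the windows starting at i and j match, the windows starting at i+1 and j+1 match with probability 1/4 rather than 1/4^8.

```python
import math
import random

def has_repeated_window(seq: str, k: int) -> bool:
    """True if some length-k window occurs at two different positions (overlaps allowed)."""
    seen = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            return True
        seen.add(window)
    return False

def simulate(length: int, k: int, trials: int, seed: int = 0) -> float:
    """Fraction of random ACGT strings of the given length that contain a repeated k-window."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choices("ACGT", k=length))
        hits += has_repeated_window(seq, k)
    return hits / trials

def birthday_estimate(length: int, k: int) -> float:
    """Birthday approximation that treats all windows as independent."""
    m = length - k + 1   # number of windows
    n = 4**k             # number of possible windows
    return 1 - math.exp(-m * (m - 1) / (2 * n))

LENGTH, K, TRIALS = 300, 8, 2000
empirical = simulate(LENGTH, K, TRIALS)
estimate = birthday_estimate(LENGTH, K)
print(f"simulated: {empirical:.3f}   birthday formula: {estimate:.3f}")
```

Whatever gap shows up at this scale, the exponent in the full problem is in the millions, so a modest correction to it would not change the conclusion that a repeat is all but certain.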