Recursive Formula for the Random String Word Detection Probability, Overlaps and Probability Extremes

This paper investigates, for the first time, the properties of the probability of detecting a given word in a random string. Previously known methods led only to numerical evaluation of these probabilities. The present work derives a simple algorithm for calculating the probability that a word occurs at least once in a random string. A recursive formula that accounts for the word's overlap capability is derived for this probability. The formula is then used to prove a proposition comparing the detection probabilities of words with different periods. The result makes it possible to determine the structure of the words that have maximal and minimal detection probabilities. In particular, words that contain an equal number of each alphabetic character are studied. It is established that, among such words, the detection probability is minimal for ideally symmetrical words that have an irreducible period and maximal for words that have no overlaps. These results will be useful in molecular genetics, as well as for students of discrete mathematics, probability theory and molecular biology.


1. Introduction
Calculating the probability of detecting a given word in a random string is of great interest, above all in connection with research in molecular genetics. A substantial breakthrough in solving this problem was made by Gentleman and Mullin (1989), who established the distribution of the occurrence frequencies of a subsequence within a nucleotide sequence, taking possible overlaps into account, within the equiprobable distribution model. This success was achieved by employing enumerative combinatorial analysis and generating functions (Gulden, 1983). Chufang (2005) proposed a different scheme for solving the problem, based on the finite Markov chain imbedding technique. A general approach to similar problems, based on the Markov chain model, was also developed in (Robin & Daudin, 1999), (Robin & Daudin, 2001), (Lotharie, 2004), (Rigner, 1995). The principal difficulty in calculating the probability that a word occurs in a random string is the word's overlap capability. Although the known methods give reliable numerical results, they do not allow one to investigate the extremal properties of the detection probability that are related to the symmetry of the word, that is, to the presence of overlaps. It should be emphasized that the symmetry factor plays an important part in the debate on the degree of order and the amount of information (Ilyevsky, 2014), (Ilyevsky, 2017). This paper offers an original method of deriving a recursive formula for the word detection probability, which yields recurrence relations in a previously unknown, elegant form. The proposed recursive formula is very simple to program. By means of the derived recurrence formula, a theorem on the extremal properties of the probabilities under study, associated with the presence of overlaps, is proved.
Section 2 of the present paper considers a random string of $n$ characters over an alphabet of $k \ge 2$ characters, within the equiprobable distribution model. The problem of calculating the probability $p_n$ that a specified word of $m$ characters occurs at least once in the random string is solved. The solution is obtained in the form of a recursive formula that connects $p_n$ with the probabilities $p_{n-1}$, $p_{n-s_i}$ and $p_{n-m}$, where $s_i + 1$ are the coordinates of the overlap positions of the word, numbered by the index $i$. The result allows one to calculate the exact value of $p_n$ for any $m$ and $n$. In Section 2.3 an explicit formula for $p_n$ is obtained for the case of zero overlaps. Section 3 offers a proposition comparing the occurrence probabilities of words with different periods. Section 4 shows that, for words with an equal number of each of the $k$ alphabetic characters, the value of $p_n$ is maximal when the word has no overlaps and minimal when the word is ideally symmetrical, that is, when it has an irreducible period.
2. Recursive Formula for the Probability That a Given Word Occurs at Least Once in a Random String

2.1 Basic Idea and Method
In contrast to the known methods, the approach offered below allows us to derive the necessary formulas without invoking enumerative combinatorics.
Let us assume that we are given an alphabet of $k \ge 2$ characters, a random sequence $R_n$ of length $n$ and a preset sequence $D$ of length $m$. For convenience, we shall hereinafter call $D$ a word and $R_n$ a string. We adopt the model in which all alphabetic characters are equiprobable at every position of the string $R_n$. Our objective is to find the probability that the word $D$ occurs at least once in the string $R_n$. Consider the set of all $k^n$ different sequences $R_n$. We denote this set by $\mathcal{R}_n$. All strings in $\mathcal{R}_n$ are equiprobable. Let $\mathcal{R}'_n$ denote the subset of $\mathcal{R}_n$ consisting of the strings in which $D$ does not occur even once; the strings that belong to $\mathcal{R}'_n$ are denoted by $R'_n$. The number of strings in $\mathcal{R}'_n$ is denoted by $Q_n$. We now construct a recursive formula for $Q_n$, using the following idea. The set $\mathcal{R}'_n$ can be obtained from the set $\mathcal{R}'_{n-1}$ by the procedure described below. Let $W$ denote the set of all characters of the alphabet.
Let us add every character $w \in W$ to the left of every string $R'_{n-1}$. We obtain a set of strings, hereinafter referred to as $W\mathcal{R}'_{n-1}$. The number of such strings is $kQ_{n-1}$. Now let us cross out from the set $W\mathcal{R}'_{n-1}$ all strings that contain the word $D$. We obtain the set $\mathcal{R}'_n$. Hereinafter we refer to this procedure as the transition from $\mathcal{R}'_{n-1}$ to $\mathcal{R}'_n$. (We shall also apply such transitions step by step to describe the transition from $\mathcal{R}'_{n-m}$ to $\mathcal{R}'_n$.) The number of strings crossed out in the transition $\mathcal{R}'_{n-1} \to \mathcal{R}'_n$ is $kQ_{n-1} - Q_n$. On the other hand, the number of strings crossed out in this transition can be expressed through $Q_{n-m}$. Indeed, since the length of the word $D$ equals $m$, in the crossed-out strings it occupies the positions from $n - m + 1$ to $n$. If we add all possible words of length $m$ to the left of each string of the set $\mathcal{R}'_{n-m}$, then $D$ will be one of these words. Therefore the number of strings of length $n$ with prefix $D$ that are deleted in the transition from $\mathcal{R}'_{n-1}$ to $\mathcal{R}'_n$ must equal $Q_{n-m}$ minus the number of strings crossed out at the overlap positions. The detailed analysis of how overlaps are accounted for is given below in Sections 2.2 and 2.5. Equating the numbers of crossed-out strings obtained in these two ways, we arrive at the required recurrence relation.
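The left-extension and crossing-out procedure described above can be tried out directly on a small alphabet. The following Python sketch is an illustration only: the function names `avoiding_strings` and `brute_force` are hypothetical and not part of the paper, and the approach enumerates strings explicitly, so it is practical only for short strings.

```python
from itertools import product


def avoiding_strings(word, alphabet, n):
    """Build R'_n, the set of length-n strings without `word`, by repeatedly
    adding every character to the left and crossing out strings with `word`."""
    current = {""}        # R'_0 contains only the empty string
    crossed_out = []      # number of strings crossed out at each step j = 1..n
    for _ in range(n):
        # W R'_{j-1}: every character added to the left of every kept string
        extended = {w + s for s in current for w in alphabet}
        kept = {s for s in extended if word not in s}
        crossed_out.append(len(extended) - len(kept))
        current = kept
    return current, crossed_out


def brute_force(word, alphabet, n):
    """Reference construction: filter all k^n strings of length n."""
    all_strings = ("".join(t) for t in product(alphabet, repeat=n))
    return {s for s in all_strings if word not in s}


if __name__ == "__main__":
    strings, crossed = avoiding_strings("aba", "ab", 5)
    print(len(strings), crossed)
    print(strings == brute_force("aba", "ab", 5))
```

Because the strings kept at the previous step do not contain the word, every string crossed out at a given step necessarily has the word as its prefix, which is the observation the argument above relies on.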

2.2 Overlap Description
As already noted, when constructing a recursive formula for $Q_n$ one must take into account all possible overlaps of the word $D$ in the string $R_n$. The overlap property of $D$ represents a certain type of shift symmetry. Let us write $D$ as $D = a_1 a_2 \dots a_m$, where the $a_j$ are characters of the given alphabet. Under the string $D$ we write an identical string shifted to the right by $s_i$ characters.
If all pairs of characters of the upper and lower strings that stand one above the other coincide, the word overlaps itself at this shift. Different authors describe this property by equivalent notions, such as the autocorrelation of the word $D$ (Guibus & Odlyzko, 1981), the overlap capability of $D$ (Gentleman, 1989), or periodicity (Lotharie, 2001). For our purposes it is more convenient to define overlaps as follows.

Definition. A word $D$ has an overlap position with the coordinate $s_i + 1$ if

$$a_j = a_{j+s_i} \quad \text{for all } 1 \le j \le m - s_i. \tag{1}$$

The index $i$ in (1) enumerates the possible overlaps from left to right. The word $a_1 a_2 \dots a_{s_i}$ is usually referred to as a period of $D$. The length of the period equals $s_i$. As an example, consider the word (2) over the alphabet 1, 2, 3, for which $s_1 = 7$, $s_2 = 14$, $s_3 = 17$, $s_4 = 18$ and, correspondingly, the coordinates of the overlap positions are 8, 15, 18, 19. In the discussion that follows we omit the trivial coincidence of $D$ with itself, which corresponds to $s_0 = 0$. Let us note that the values $s_i$ cannot be arbitrary but obey certain rules, derived in (Guibus & Odlyzko, 1981). To begin with, let us consider the case in which the word $D$ does not have any overlap positions.
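The definition of an overlap position translates directly into a comparison of a suffix of the word with a prefix of the same length. The sketch below is a minimal illustration; the function name `overlap_shifts` is an assumption introduced here, and the sample words are the three words used in the example near the end of the paper.

```python
def overlap_shifts(word):
    """Return the nontrivial shifts s_i (1 <= s_i <= m - 1) for which the word
    coincides with a copy of itself shifted to the right by s_i characters."""
    m = len(word)
    return [s for s in range(1, m) if word[s:] == word[:m - s]]


if __name__ == "__main__":
    for w in ("123123123123", "112233112233", "123123123132"):
        shifts = overlap_shifts(w)
        # The overlap coordinates in the sense of the definition are s_i + 1.
        print(w, shifts, [s + 1 for s in shifts])
```

A word for which the returned list is empty has no overlaps in the sense used in the next section.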

2.3 Recursive Formula for Zero Overlaps in D
Let a word $D$ of length $m$ have no overlaps, meaning that there is no $s_1$ in the range $1 \le s_1 \le m - 1$ for which (1) holds. We shall hereinafter refer to a word with this property as $D_0$. For the number $Q_n$ of strings in the set $\mathcal{R}_n$ that do not contain the word $D_0$ we have the following recurrence relation:

$$Q_n = k^n, \quad 0 \le n < m, \tag{3}$$

$$Q_n = kQ_{n-1} - Q_{n-m}, \quad n \ge m. \tag{4}$$

Proof. For $0 \le n < m$ proposition (3) is obvious, since a string $R_n$ that is shorter than $m$ cannot contain the word $D_0$. Let $n \ge m$. Consider the set $\mathcal{R}'_{n-1}$ of strings of length $n - 1$ in which $D_0$ does not occur even once. The set $\mathcal{R}'_{n-1}$ is obtained from $\mathcal{R}_{n-1}$ by eliminating all strings in which $D_0$ occurs. According to the notation introduced above, the number of sequences in $\mathcal{R}'_{n-1}$ is $Q_{n-1}$. To the left of each sequence in the set $\mathcal{R}'_{n-1}$ we add, in turn, every character of the alphabet. This procedure generates the set $W\mathcal{R}'_{n-1}$, which contains $kQ_{n-1}$ new strings. Among the strings of $W\mathcal{R}'_{n-1}$, exactly $Q_{n-m}$ contain the word $D_0$, namely the strings with the prefix $D_0$; crossing them out leaves the $Q_n$ strings of $\mathcal{R}'_n$. In this way we arrive at equation (4).
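Having divided equation (4) by $k^n$, we get the recursive relation for the probability $q_n(D_0)$ that the word $D_0$ does not occur in a string of the set $\mathcal{R}_n$:

$$q_n(D_0) = q_{n-1}(D_0) - k^{-m} q_{n-m}(D_0), \quad n \ge m. \tag{5}$$

For the probability that the word $D_0$ occurs in a string of the set $\mathcal{R}_n$ at least once we have:

$$p_n(D_0) = 1 - q_n(D_0). \tag{6}$$

The corresponding recurrence relation for the probability $p_n(D_0)$ is:

$$p_n(D_0) = p_{n-1}(D_0) + k^{-m}\bigl(1 - p_{n-m}(D_0)\bigr), \quad n \ge m. \tag{7}$$

2.4 Explicit Formula for the Probability $p_n(D_0)$

Based on the inclusion-exclusion principle, one may derive an explicit formula for the probability $p_n(D_0)$. For $n \ge m$ we get:

$$p_n(D_0) = \sum_{t=1}^{\lfloor n/m \rfloor} (-1)^{t+1} \binom{n - (m-1)t}{t} k^{-mt}. \tag{8}$$

Let us show that result (8) satisfies the recurrence relation (7). Substituting (8) into the right-hand side of (7), we obtain for $n \ge m$:

$$p_{n-1} + k^{-m}(1 - p_{n-m}) = k^{-m} + A - B, \tag{9}$$

$$A = \sum_{t=1}^{\lfloor (n-1)/m \rfloor} (-1)^{t+1} \binom{n - 1 - (m-1)t}{t} k^{-mt}, \qquad B = \sum_{t=1}^{\lfloor (n-m)/m \rfloor} (-1)^{t+1} \binom{n - m - (m-1)t}{t} k^{-m(t+1)}.$$

Performing the substitution $t + 1 = \tilde{t}$ in the sum $B$ and omitting the tilde sign over $t$, we get:

$$B = \sum_{t=2}^{\lfloor n/m \rfloor} (-1)^{t} \binom{n - 1 - (m-1)t}{t-1} k^{-mt}. \tag{10}$$

Let $n \ge m$. Let us write $n = um + v$, where $u \ge 1$ and $v$ are integers and $v$ satisfies the inequality $0 \le v < m$. Then:

$$\left\lfloor \frac{n-1}{m} \right\rfloor = \begin{cases} u, & v \ne 0, \\ u - 1, & v = 0, \end{cases} \tag{11}$$

$$\left\lfloor \frac{n}{m} \right\rfloor = u. \tag{12}$$

From (11) and (12) it follows that, provided $v \ne 0$, the upper limits of the sum $A$ and of the sum $B$ written in the form (10) both equal $u$. Using the well-known property of binomial coefficients, we combine the corresponding terms of $A$ and $B$ for the case $v \ne 0$ as follows:

$$\binom{n - 1 - (m-1)t}{t} + \binom{n - 1 - (m-1)t}{t-1} = \binom{n - (m-1)t}{t}. \tag{13}$$

For $t = 1$ the second term on the left-hand side of (13) equals 1 and corresponds to the summand $k^{-m}$ in (9). Then for the case $v \ne 0$ we get:

$$k^{-m} + A - B = \sum_{t=1}^{u} (-1)^{t+1} \binom{n - (m-1)t}{t} k^{-mt} = p_n(D_0). \tag{14}$$

When $v = 0$, the sum $A$ in (9) has one summand fewer than the sums (10) and (8). The first $u - 1$ corresponding terms of $A$ and $B$ (10) are combined by means of expression (13), and the last summand of $B$ (10), taken with the sign it has in (9), equals, as is easily seen, the last summand of the sum (8). In this way we have proven formula (8) on the basis of the recursive relation (7).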
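The agreement between the recursion and the explicit formula for a word without overlaps can also be checked with exact rational arithmetic. The sketch below assumes that the recursion has the form $p_n = p_{n-1} + k^{-m}(1 - p_{n-m})$ with $p_n = 0$ for $n < m$, and that the explicit formula is the alternating sum of $\binom{n-(m-1)t}{t}k^{-mt}$ over $t = 1, \dots, \lfloor n/m \rfloor$; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def p_recursive(n, m, k):
    """p_n for a length-m word without overlaps over a k-letter alphabet,
    computed from p_j = p_{j-1} + k^{-m} * (1 - p_{j-m}), p_j = 0 for j < m."""
    p = [Fraction(0)] * (n + 1)
    weight = Fraction(1, k ** m)
    for j in range(m, n + 1):
        p[j] = p[j - 1] + weight * (1 - p[j - m])
    return p[n]


def p_explicit(n, m, k):
    """Inclusion-exclusion sum over t = 1 .. floor(n / m)."""
    total = Fraction(0)
    for t in range(1, n // m + 1):
        sign = 1 if t % 2 == 1 else -1
        total += Fraction(sign * comb(n - (m - 1) * t, t), k ** (m * t))
    return total


if __name__ == "__main__":
    for n in (3, 10, 25):
        print(n, p_recursive(n, 3, 2), p_explicit(n, 3, 2))
```

Using `Fraction` avoids rounding, so equality of the two functions on a range of small `n`, `m` and `k` is an exact check rather than a numerical approximation.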

2.5 Recurrence Formula for the Case of Overlaps in D

2.5.1 Recurrence Formula for $Q_n$

Theorem 1. Let a word $D$ have $l \ge 1$ overlap positions. We denote the coordinates of these overlap positions by $s_1 + 1, s_2 + 1, \dots, s_l + 1$, where $1 \le s_1 < s_2 < \dots < s_l \le m - 1$. Then the following recursive formula is valid:

$$Q_n = k^n, \quad 0 \le n < m, \tag{15}$$

$$Q_n = kQ_{n-1} - Q_{n-m} + \sum_{i=1}^{l} \bigl(kQ_{n-s_i-1} - Q_{n-s_i}\bigr), \quad n \ge m. \tag{16}$$

Proof. Expression (15) is obvious. Let us prove equation (16). Similarly to equation (4) we write:

$$Q_n = kQ_{n-1} - U_{n-m}, \tag{17}$$

where $U_{n-m}$ is the number of strings containing $D$ that are crossed out during the transition from $\mathcal{R}'_{n-1}$ to $\mathcal{R}'_n$. Now, however, due to the presence of overlaps, $U_{n-m}$ may be smaller than $Q_{n-m}$. Let us express $U_{n-m}$ through $Q_{n-m}$ and $Q_{n-s_i}$. We represent the word $D$ as a concatenation of words $u_0, u_1, \dots, u_l$: $D = u_0 u_1 \dots u_l$, where $u_0$ is the beginning of $D$, of length $s_1$, $u_1$ is the next word, of length $s_2 - s_1$, and so on. The last word $u_l$ has length $m - s_l$. The first character of every word $u_i$, $1 \le i \le l$, coincides with the overlap position number $i$. Alternatively, due to the overlaps, for every $1 \le i \le l$ the word $D$ may be represented as follows:

$$D = u_i u_{i+1} \dots u_l f_i, \tag{18}$$

where $f_i$ is the corresponding suffix. The strings of the set $\mathcal{R}'_{n-m}$ do not contain the word $D$, but they may have the prefixes $f_i$. Let us consider the subset $F_i$ of the set $\mathcal{R}'_{n-m}$ whose strings have the prefix $f_i$ but do not have the prefixes $f_{i+1}, \dots, f_l$. Having extended the strings of the set $\mathcal{R}'_{n-m}$ to the left, step by step according to the procedure described earlier and as far as position $n$, we obtain the set $W\mathcal{R}'_{n-1}$. Let us single out the subsets of strings that have the prefix $D$: $D\mathcal{R}'_{n-m}$ and $DF_i$. During the transition from $\mathcal{R}'_{n-m}$ to $\mathcal{R}'_{n-1}$ all strings that have a prefix $D$ of the form (18) are crossed out consecutively, starting with $i = l$ and ending with $i = 1$. Therefore the set $W\mathcal{R}'_{n-1}$ contains no strings with an occurrence of $D$ of the form (18) that begins at position $n - s_i$, for all $1 \le i \le l$ ($Df_i \notin W\mathcal{R}'_{n-1}$). All such strings, however, are present in the set $D\mathcal{R}'_{n-m}$, because $DF_i \subseteq D\mathcal{R}'_{n-m}$. The subsets $DF_i$ do not intersect for different $i$. Therefore the number of strings with prefix $D$ that must be crossed out from the set $W\mathcal{R}'_{n-1}$ is:

$$U_{n-m} = Q_{n-m} - \sum_{i=1}^{l} J_{n-s_i}, \tag{19}$$

where $J_{n-s_i}$ is the number of strings crossed out at the overlap position $n - s_i$. This number may be written as:

$$J_{n-s_i} = kQ_{n-s_i-1} - Q_{n-s_i}, \tag{20}$$

which is a particular case of the general expression for the number of strings crossed out in the transition from $\mathcal{R}'_{n-1}$ to $\mathcal{R}'_n$:

$$J_n = kQ_{n-1} - Q_n. \tag{21}$$

From (19) and (20) we get:

$$U_{n-m} = Q_{n-m} - \sum_{i=1}^{l} \bigl(kQ_{n-s_i-1} - Q_{n-s_i}\bigr). \tag{22}$$

Substituting expression (22) into equation (17), we get the recurrence formula (16). Theorem 1 is thus proven.
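A recurrence of this kind is easy to validate against exhaustive enumeration for short strings. The sketch below assumes the recurrence in the form $Q_n = kQ_{n-1} - Q_{n-m} + \sum_{i=1}^{l}(kQ_{n-s_i-1} - Q_{n-s_i})$ for $n \ge m$, with $Q_n = k^n$ for $n < m$; this form is a reconstruction consistent with the proof above rather than a quotation, and the function names are hypothetical.

```python
from itertools import product


def overlap_shifts(word):
    """Nontrivial shifts s_i at which the word overlaps itself."""
    m = len(word)
    return [s for s in range(1, m) if word[s:] == word[:m - s]]


def count_avoiding(word, k, n):
    """Return [Q_0, ..., Q_n]: numbers of strings over a k-letter alphabet
    that do not contain `word`, computed by the recurrence with overlaps."""
    m = len(word)
    shifts = overlap_shifts(word)
    q = []
    for j in range(n + 1):
        if j < m:
            q.append(k ** j)
        else:
            # Strings removed at the overlap positions j - s_i
            correction = sum(k * q[j - s - 1] - q[j - s] for s in shifts)
            q.append(k * q[j - 1] - q[j - m] + correction)
    return q


def count_avoiding_brute(word, alphabet, n):
    """Count strings of length n without `word` by direct enumeration."""
    return sum(word not in "".join(t) for t in product(alphabet, repeat=n))


if __name__ == "__main__":
    print(count_avoiding("aba", 2, 6))
    print([count_avoiding_brute("aba", "ab", n) for n in range(7)])
```

Integer arithmetic keeps the comparison exact; the enumeration grows as $k^n$, so it serves only as a check for small $n$.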

2.5.2 Recurrence Formula for the Probability $p_n$
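The recurrence formula for the probability $p_n$ is a corollary of Theorem 1. Having divided equation (16) by $k^n$, we get the recurrence formula for the probability $q_n$ that the word $D$ does not occur in a string of the set $\mathcal{R}_n$:

$$q_n = 1, \quad 0 \le n < m, \tag{23}$$

$$q_n = \sum_{i=0}^{l} \bigl(k^{-s_i} q_{n-s_i-1} - k^{-s_{i+1}} q_{n-s_{i+1}}\bigr), \quad n \ge m. \tag{24}$$

In equation (24) and in the following text we denote $s_0 = 0$, $s_{l+1} = m$. From equation (24) we get a recursive formula for $p_n = 1 - q_n$:

$$p_n = 0, \quad 0 \le n < m, \tag{25}$$

$$p_n = k^{-m} + \sum_{i=0}^{l} \bigl(k^{-s_i} p_{n-s_i-1} - k^{-s_{i+1}} p_{n-s_{i+1}}\bigr), \quad n \ge m. \tag{26}$$

3. Comparison of the Occurrence Probabilities in a Random String for Two Words With Different Periods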

3.1 Lemma on the Number of Crossings Out
Lemma 1. The number of crossings out $J_j$ performed during the transition from the set $\mathcal{R}'_{j-1}$ to the set $\mathcal{R}'_j$ is a nondecreasing function of $j$, that is, for any $j \ge m$:

$$J_j \ge J_{j-1}. \tag{27}$$

Proof by induction. Since we are interested only in the cases of nontrivial overlaps in $D$, let $m \ge 2$. For $j = m$ conclusion (27) holds, because $J_{m-1} = 0$, $J_m = 1$. Let us assume that (27) holds for $m + 1 \le j \le n$ and show that it also holds for $j = n + 1$. Using formula (21) for $J_n$ and the recurrence relation (16), we obtain equation (28). In equation (28), by the inductive hypothesis, we have $J_{n-s_{i-1}} - J_{n+1-s_i} \ge 0$. Consequently, from (28) it follows that:

$$J_{n+1} \ge J_n. \tag{29}$$

Lemma 1 is thus proven.
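The monotonicity stated in Lemma 1 can be observed numerically for short words. The sketch below is an illustration under the assumption that $J_j = kQ_{j-1} - Q_j$, with $Q_j$ obtained by exhaustive enumeration; the function name `crossings_out` is hypothetical, and a check on a few words is of course not a proof.

```python
from itertools import product


def crossings_out(word, alphabet, n_max):
    """Return [J_1, ..., J_{n_max}], where J_j = k * Q_{j-1} - Q_j and Q_j is
    the number of strings of length j that do not contain `word`."""
    k = len(alphabet)
    q = [
        sum(word not in "".join(t) for t in product(alphabet, repeat=j))
        for j in range(n_max + 1)
    ]
    return [k * q[j - 1] - q[j] for j in range(1, n_max + 1)]


if __name__ == "__main__":
    for word in ("aa", "aba", "abab", "aab"):
        js = crossings_out(word, "ab", 9)
        nondecreasing = all(a <= b for a, b in zip(js, js[1:]))
        print(word, js, nondecreasing)
```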

3.2 Theorem on Comparison of Probabilities
Theorem 2. Let two words of equal length $m \ge 2$ be given, hereinafter referred to as $D$ and $E$. Let the word $D$ have $l$ overlaps ($l \ge 0$) described by periods $t_i$, whereas $E$ has $l + r$ overlaps described by periods $s_i$. Let the first $l$ periods of $E$ be smaller than the $l$ corresponding periods of $D$, i.e. $s_i < t_i$ for all $1 \le i \le l$. Then for $n > m + s_1$ the probability $p_n$ of detecting the word $E$ at least once in a random string $R_n$ is smaller than the corresponding probability $g_n$ for the word $D$:

$$p_n(E) < g_n(D). \tag{30}$$

The highlighted characters mark the overlap positions. Calculations for $k = 2$, $n = 5000$, $m = 14$ give $p_n(E) = 0.2340$, $g_n(D) = 0.2590$.

3.3 Proof of Theorem 2
Let us denote the numbers of strings in the sets $\mathcal{R}'_n(E)$ and $\mathcal{R}'_n(D)$ by $Q_n$ and $G_n$, respectively. We shall also denote $\delta_n = Q_n - G_n$. In this notation inequality (30) is equivalent to the inequality $\delta_n > 0$.
To begin with, let us consider the particular case of the theorem in which the number of overlaps in both $E$ and $D$ is the same and equals $l$. First we shall verify relation (31); then, by induction, we shall prove inequality (32). (In (32) the case $\delta_{n-1} = 0$ holds only at $n = m + s_1$.) According to the recurrence formula (15), (16) for $Q_n$ we have relations (33) and (34). The meaning of formula (34) is evident: in the transition from $\mathcal{R}_m$ to $\mathcal{R}'_m$ exactly one string is crossed out from $\mathcal{R}_m$ in each case, namely the string $D$ or the string $E$, respectively. Further, for $Q_n$ we have equation (16), and for $G_n$ a similar equation (35). Let $n < m + s_1$. Then, taking into account that $t_i > s_i \ge 1$ and $m \ge 2$, we get (38). Since at $n = m$, by (34), we have $\delta_n = 0$, conclusion (31) follows from (38). Let us calculate $\delta_{m+s_1}$. From (16) and (35)-(37) we get an expression that is evaluated separately for $s_1 = 1$ and for $s_1 > 1$. From equations (41)-(44), taking (31) into account, we arrive at result (46). In order to prove inequality (32), let us use result (46) as the base of induction on the variable $n$. Let us suppose that inequality (47) holds for all values of $j$ in the interval $m + s_1 \le j \le n - 1$. Let us prove that inequality (47) also holds for $j = n$. From (16) and (35) we have (48). Let us represent equation (48) in the form (49). In formula (49), $J^D_j$ is the number of strings containing $D$ that are crossed out in the transition from $\mathcal{R}'_{j-1}$ to $\mathcal{R}'_j$ (50). According to the inductive hypothesis (47), and taking into account that $s_{i+1} \ge s_i + 1$, in equation (49) we have:

example, character $\alpha$, occurs in the period $w_0$ at least twice. Consequently, in the word $w_1 \dots w_l$ the character $\alpha$ occurs at most $d - 2$ times. Under this condition the number of overlaps in the word $D$ cannot exceed $d - 2 < l_0$.
The statement of Theorem 3 for $d = 2$ is verified directly, so we shall concentrate on the case $d > 2$. As above, we denote the coordinates of the overlaps in $E$ by $s_i + 1$ and the coordinates of the overlaps in $D$ by $t_i + 1$. For the proof it suffices to show that the condition of Theorem 2 is fulfilled for $E$ and $D$, i.e. that for all $1 \le i \le l$:

$$s_i < t_i. \tag{60}$$

We shall write down the periods of the words $E$ and $D$, respectively, for $1 \le i \le l$ in the form (61), (62). From the results presented in (Guibus & Odlyzko 1981), (Lotharie, 2001) it follows that the sequence of the lengths of the words $w_i$ does not increase: If for all $i$ we have $|w_i| > k$, then from (61), (62) it follows that inequality (60) is satisfied. Let us assume now, that for Then, due to the fact that $l < l_0$ holds for $d > 2$, for $i_1 < i \le l$ we get: In this way condition (60) is satisfied for all $i$ and, consequently, the conditions of Theorem 2 are fulfilled for the words $E$ and $D$. Hence it follows that for the words $E$ and $D$ we have $p_n(E) < p_n(D)$ or $p_n(E) \le p_n(D)$.
As an example, let us present the following three words, which illustrate the conclusion of Theorem 3.

E: 123123123123
D_1: 112233112233
D_2: 123123123132

In this example the word $D_1$ has one overlap, whereas $D_2$ has none. Calculations for $n = 40000$ give $p_n(E) = 0.0699$, $p_n(D_1) = 0.0724$, $p_n(D_2) = 0.0725$. Let us emphasize a point of interest. The ideally symmetrical word $E$ and the word $D_2$ differ only in the order of the last two characters. Nevertheless, this rearrangement deprives the word $D_2$ of the shift symmetry, and therefore the probability of its occurrence in a random string is the same as for any other word of the same length that has no overlaps.
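The three probabilities quoted in this example can be recomputed with a few lines of code. The sketch below rests on two assumptions that are not stated explicitly in the text: the alphabet size is $k = 3$ (the characters 1, 2, 3), and the recurrence for $q_n = 1 - p_n$ has the form $q_n = \sum_{i=0}^{l}(k^{-s_i} q_{n-s_i-1} - k^{-s_{i+1}} q_{n-s_{i+1}})$ with $s_0 = 0$, $s_{l+1} = m$ and $q_n = 1$ for $n < m$. The function names are hypothetical, and floating-point arithmetic is used, so the results are approximate.

```python
def overlap_shifts(word):
    """Nontrivial shifts s_i at which the word overlaps itself."""
    m = len(word)
    return [s for s in range(1, m) if word[s:] == word[:m - s]]


def detection_probability(word, k, n):
    """Approximate probability that `word` occurs at least once in a random
    string of length n over an equiprobable k-letter alphabet."""
    m = len(word)
    s = [0] + overlap_shifts(word) + [m]      # s_0 = 0, ..., s_{l+1} = m
    weights = [float(k) ** (-x) for x in s]   # k^{-s_i}
    q = [1.0] * (n + 1)                       # q_j = 1 for j < m
    for j in range(m, n + 1):
        q[j] = sum(
            weights[i] * q[j - s[i] - 1] - weights[i + 1] * q[j - s[i + 1]]
            for i in range(len(s) - 1)
        )
    return 1.0 - q[n]


if __name__ == "__main__":
    for w in ("123123123123", "112233112233", "123123123132"):
        print(w, round(detection_probability(w, 3, 40000), 4))
```

The loop runs in time proportional to $n(l + 1)$, so strings of length 40000 are handled without difficulty.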

Conclusion
The main result of this work is the establishment of the extremal properties of the probability that a word occurs at least once in a random string. The method used to derive the necessary formula for this probability can be generalized to calculate the distribution of the number of occurrences of a given word in a random string. The corresponding recurrence relations and a further study of the properties of these probabilities have been obtained by the author and will be submitted for publication in the near future.
In conclusion, let us list a number of open questions stemming from the research undertaken above.
• It seems very likely that the conditions of Theorem 2 may be substantially weakened by replacing the condition $s_i < t_i$ for all $1 \le i \le l$ with the single requirement $s_1 < t_1$.
• Evidently, as the alphabet size $k$ increases, the difference between $p_n(E)$ and $p_n(D)$ shrinks. It would be interesting to establish the corresponding asymptotic dependence.
• It would be interesting to find the words that have extreme detection probabilities in a random string generated by a Markov process.
strings that represent the concatenation of the word $D_0$ with a string $R'_{n-m}$. These strings form a set of strings with the prefix $D_0$, which we shall refer to as $D_0\mathcal{R}'_{n-m}$. Let us represent the word $D_0$ in the form $D_0 = u_s v_{m-s}$, where $u_s$ and $v_{m-s}$ are parts of the word $D_0$ of lengths $s$ and $m - s$, respectively ($s \in [1, m - 1]$). Now let us analyze the set $v_{m-s}\mathcal{R}'_{n-m}$, in which every string $R'_{n-m}$ has the word $v_{m-s}$ attached from the left. Since the word $D_0$ does not have any overlaps, the word $v_{m-s}$ is not a prefix of $D_0$. That is the reason why none of the elements of the set $v_{m-s}\mathcal{R}'_{n-m}$ is crossed out during the transition

Let us give an example of two words $D$ and $E$ that satisfy the conditions of the theorem: