Outline and synopses of the 2002 Lecture Videos

Lecture 1
Introduction to Bioinformatics; the four modalities of biological research; many of the key ideas in bioinformatics predate the rise of bioinformatics (some going back to the 1940's), and yet are central today; genomic and other high-throughput omic data have made those ideas and methods, and hence bioinformatics, essential. This course focuses on the mathematical and algorithmic ideas in the methods, and much less on packaged tools. We emphasize the chain of reasoning that leads from raw biological data and questions, to simplifying biological rules and models (e.g., the central dogma of molecular biology), to mathematical models based on those rules, to mathematical questions, to algorithmic problems that derive from the mathematical and biological models, to programming issues, and finally to user-friendly tools. All along this chain of reasoning, one makes simplifications and compromises, and the utility of the final tool must be carefully evaluated. The tool is best understood by understanding the chain of reasoning and the compromises that led to it. This course focuses on the mathematical and algorithmic parts of the chain. Through Perl programming it touches on some programming aspects of the chain, and through the labs it introduces the student to some representative and illustrative tools, but there are other courses and resources that focus on tools more effectively than this course does.

**************
Lecture 2
Continuation of the discussion of the chain of reasoning that leads to bioinformatics tools. Introduction to Sequence Analysis. Why the use of sequence? First Fact of Biosequence Analysis: in bio-sequences, high similarity usually implies significant functional, structural or historical similarity or relationship. This is one of those rules or models that one assumes today, but it was not always known or believed. And of course, it frequently needs to be modified or modulated. Exploitation of the First Fact has made bioinformatics important, so much of the course deals with technical issues of how we define, compute and exploit sequence similarities. The First Fact pays the bills. What is the basis for the First Fact? Evolution, shared history, common descent: protein evolution involves duplication within a species (paralogy) with modification (R. Doolittle quote). This leads to similar protein sequences and similar chemical and/or biological properties. Another mechanism is speciation followed by differentiation of the proteins in the two species (orthology). In either case, what we learn from one variant of the protein may be valuable in understanding the other variant, and through sequence similarity we can find which proteins give insight into other proteins. That is technologically very powerful - when one has a new sequence, its biological properties may be very similar to those of similar sequences that have already been studied. That was a revolutionary insight in the 1970's and 1980's but is now absolutely accepted - and even taken to be obvious. Discussion of the words ``homology" and ``similarity". Similarity is used to infer homology. Another explanation for the First Fact: necessity and/or convergent evolution. History (chance) v. Necessity. The Converse of the First Fact is NOT true: it is not true that similar function or structure implies sequence similarity (Fred Cohen quote).
Introduction to the technical issues of sequence similarity; the basic algorithmic problem; a first definition of sequence similarity, and the algorithmic problem of computing a measure of similarity. Definition of alignment, and an example of an alignment.

********************
Lecture 3
Grain Crop Circle illustrating synteny of genes in the grains, and the partial interchangeability of genes between the grains. Wheat has a genome size of 16 billion bases, whereas rice has only a few hundred million bases. Definition of String Alignment and Similarity. The formal definition of similarity should reflect the biology: highly ``similar" sequences should reflect and expose related biology or chemistry. Mismatches in an alignment reflect single-residue point mutations, so the concept of a mismatch is acceptable in our definition of similarity because it reflects underlying biology. What does a space in an alignment reflect in the underlying biology? Insertion or deletion of DNA or amino acids. Examples of insertions or deletions in DNA. The formal definitions of alignment and similarity should realistically model actual biological processes that change sequences over time. Example of the simplest objective function. Does the simplest objective function correctly model biology? Matches and mismatches seem well motivated, but what about the way spaces (indels) are handled in the objective function? That will be a large issue examined in the course. Definition of string similarity. Notions of string similarity are at the heart of sequence-oriented bioinformatics. We will embellish the model and definition throughout the course. Algorithmic problem: given two strings, how can we efficiently compute their similarity? The mathematical definition of similarity gives the ``what", but now we want to address the ``how". First, let's start thinking naively and find a straightforward, brute-force approach: generate all the explicit alignments, compute the value of each alignment, and take the best (maximum value). How many alignments are there? Way too many, so brute force will not work. Remember that computers are small (memory), slow, and dumb (contrary to popular images of computers). Brute force would take many ages of the universe to align moderate-length sequences. Because of that we will spend a lot of time discussing alternative methods that are much faster. So we want to be clear why we are spending so much energy on the more clever methods - the brute-force methods won't work. Details of counting how many alignments there are. Two Perl programs that compute the numbers, countbadly.pl and countgoodly.pl. Get them from the class website.

**********************
Lecture 4
Review of the definition of the (basic) similarity of two sequences: the maximum, over all possible alignments of the two sequences, of the number of matches minus the number of mismatches minus the number of spaces in the alignment. Review of the fact that a straightforward enumeration of all alignments is impractical for all but very small examples. Start of a discussion of richer alignment models (this is slightly out of order, but needed now to prepare for the scheduled lab of that week). Addition of multipliers (costs or penalties) on mismatches and spaces: in the objective function we multiply the number of mismatches by a number X and the number of spaces by a number Y, in order to tune the objective function to better reflect the underlying biology.
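(A minimal sketch, not from the lectures and not one of the course programs: it scores one explicit alignment under the objective just described, with mismatch multiplier X and space multiplier Y; the function name and the toy alignment are my own.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code): value of one explicit alignment under the
    # objective (#matches) - X*(#mismatches) - Y*(#spaces).
    # The two rows must have equal length; '-' marks a space.
    sub score_alignment {
        my ($row1, $row2, $X, $Y) = @_;
        die "rows differ in length\n" unless length($row1) == length($row2);
        my ($matches, $mismatches, $spaces) = (0, 0, 0);
        for my $i (0 .. length($row1) - 1) {
            my $a = substr($row1, $i, 1);
            my $b = substr($row2, $i, 1);
            if    ($a eq '-' or $b eq '-') { $spaces++ }
            elsif ($a eq $b)               { $matches++ }
            else                           { $mismatches++ }
        }
        return $matches - $X * $mismatches - $Y * $spaces;
    }

    # Toy example with X = 1, Y = 2:
    # 3 matches, 0 mismatches, 2 spaces, so the value is 3 - 0 - 4 = -1.
    print score_alignment('ACG-T', 'A-GCT', 1, 2), "\n";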
But now we have the question of what values X and Y should take on. Introduction to the work of Walter Fitch in the 1980s. He had a pair of sequences for which he believed he knew (from biological or chemical considerations) the "correct" alignment. He then studied how the optimal alignment was influenced by different values of X and Y in the objective function. He observed that none of the obtained alignments agreed sufficiently with the "correct" alignment. That led him to introduce the concept of a "gap" in an alignment. A gap is a contiguous run of spaces. The observation (by Fitch and others) is that, particularly in proteins, the spaces in an alignment are typically organized into a few gaps, and alignments that have many gaps do not capture the biology adequately. The basic objective function for similarity does not have a term for the number of gaps, so Fitch enriched the definition of similarity to include a term for gaps. So now in the objective function we have matches, mismatches, spaces and gaps. Introduction to Xparal (the focus of the lab for this week). Mention of the Barton data and alignments also used in the lab. Introduction to substitution matrices and their uses in the lab problems. Substitution matrices penalize mismatches differently depending on which mismatch it is, and even score matches differently depending on which amino acid matches. For example, a mismatch between Y and G should be penalized more than a mismatch between I and L. Matches of very common amino acids should have a smaller (positive) score than matches of rare amino acids. Now the value of an alignment uses the total of all the scores coming from matches, and the total of all the penalties coming from mismatches, rather than just counting the number of matches and mismatches. The lab on Xparal is intended to get you to see how critical each piece of the alignment model is. Unfortunately, this version of Xparal uses the older PAM substitution matrices rather than BLOSUM matrices, but the main point is still clear - a good substitution matrix can lead to alignments that correctly reflect the biology. Definitions of gap initiation and gap extension. Gap initiation should be more highly penalized than gap extension. Details of Xparal and parametric alignment. Overhead projections that are hard to read. Parametric alignment decomposes the parameter space into convex polygons, where in any polygon, any alignment A that is optimal for some combination of parameters inside that polygon is optimal for all combinations of parameters inside that polygon. Conversely, alignment A is not optimal for any parameter combination that falls outside that polygon.

*******************
Lecture 5
How do we actually compute the similarity of two sequences efficiently? Introduction to notions of efficiency and complexity. The time required by a method is considered as a function of the length of the input, i.e., the lengths of the two strings to be aligned. Definition and use of the Alignment Graph to discuss and compute optimal alignments, and similarity. Each legal (directed) path from node (0,0) to node (n,m) specifies an alignment; conversely, each alignment is specified by such a path. Diagonal edges align two characters (the two characters for the row and column that the edge enters); horizontal and vertical edges align a character with a space (details in the lecture, or think about it). [Note: the video is too dark at this point in the lecture, so consult the lecture notes.]
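(Not from the lectures: a minimal sketch of the edge weights just described, under the basic scoring with mismatch penalty X and space penalty Y; the function name and calling convention are my own.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code): weight of an edge entering node (i,j) of the
    # alignment graph for strings $s and $t (positions are 1-based).
    # A diagonal edge aligns s[i] with t[j]; a horizontal or vertical edge
    # aligns a character with a space.
    sub edge_weight {
        my ($s, $t, $i, $j, $direction, $X, $Y) = @_;
        if ($direction eq 'diag') {
            return substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 1 : -$X;
        }
        return -$Y;    # 'up' or 'left' edge: a space
    }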
Moreover, the total cost (length) of a path (adding the costs on the edges in the path) is exactly the value of the associated alignment. Therefore, the problem of finding an optimal alignment is equivalent to finding a path from (0,0) to (n,m) which has the largest total cost. So far, we have only transformed the problem of computing an optimal alignment into the problem of finding a maximum-cost path from (0,0) to (n,m) in a directed graph; we have not really made any progress in seeing how to solve either of those problems. Now we tackle that issue. Introduction to the logic of Dynamic Programming (DP) to find a path from (0,0) to (n,m) of maximum total cost (length, distance).

*******************
Lecture 6
Continuation of the discussion of how to efficiently compute the similarity of two sequences, and the optimal alignment of two sequences, by DP. Review of the Alignment Graph. Development of a Dynamic Programming (DP) procedure to compute the length of the longest path from node (0,0) to each of the nodes in the Alignment Graph. The distances to the nodes in the Alignment Graph can be computed row by row, going left to right in each row, or column by column, going top to bottom in each column. We can also progress along the anti-diagonals. The key point is that a node (i,j) can be processed at any time after its three neighbors (i,j-1), (i-1,j) and (i-1,j-1) have been processed. Why do we do this kind of computation? For each node in the graph, the number of operations needed is a constant, independent of how long the sequences being aligned are. So for two sequences of lengths n and m, the total number of operations for this DP approach to compute the longest path length (or distance) from node (0,0) to node (n,m) is O(nm), i.e., bounded from above by a function proportional to the product of n and m. Compare that to the number of alignments to see why this is so attractive for practical computation, compared to explicitly generating all alignments. Introduction to the issue of traceback, to actually construct the optimal alignment from the computation done to find the longest distance from (0,0) to (n,m). This uses the same kind of logic that was used to compute the distances (numbers), but in reverse. The traceback is a general step in DP, not just for computing optimal alignments. It is (in my experience) the most subtle part of DP for students to understand. For example, it is not always obvious why we need such a step. Why can't we keep track of the longest paths during the step when the distances (numbers) are computed? Why do we have to do this traceback from node (n,m) to node (0,0)? Explanation via the running example. The bread-crumb or pointer approach to traceback. All this, using an explicit Alignment Graph, is conceptually nice, but not a natural approach to turn into a computer program. For that, we have to change our view of DP from the Alignment Graph to a purely notational approach. Question for next time: what changes if we use a substitution matrix to weight specific matches and mismatches differently?

********************
Lecture 7
Formal treatment of DP - moving from the alignment graph to recurrences and programmable notation. Define V(i,j) and develop general recurrences for it. This is essentially, but not exactly, the Needleman-Wunsch algorithm of bioinformatics, which computes an optimal *global* alignment between two sequences. The value of that optimal alignment is defined as the ``similarity" between the two sequences.
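(Not from the lectures, and not the course's Needleman-Wunsch program: a minimal sketch of filling the V(i,j) table with the basic scoring of +1 per match, -1 per mismatch and -1 per space; the function name is my own.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code) of the basic V(i,j) recurrence:
    # V(i,j) = max of V(i-1,j-1) + s(x_i, y_j), V(i-1,j) - 1, V(i,j-1) - 1,
    # where s() is +1 for a match and -1 for a mismatch.
    sub similarity {
        my ($x, $y) = @_;
        my ($n, $m) = (length $x, length $y);
        my @V;
        $V[$_][0] = -$_ for 0 .. $n;    # first column: all spaces
        $V[0][$_] = -$_ for 0 .. $m;    # first row: all spaces
        for my $i (1 .. $n) {
            for my $j (1 .. $m) {
                my $s = substr($x, $i-1, 1) eq substr($y, $j-1, 1) ? 1 : -1;
                my $best = $V[$i-1][$j-1] + $s;
                $best = $V[$i-1][$j] - 1 if $V[$i-1][$j] - 1 > $best;
                $best = $V[$i][$j-1] - 1 if $V[$i][$j-1] - 1 > $best;
                $V[$i][$j] = $best;
            }
        }
        return $V[$n][$m];              # the similarity of x and y
    }

    print similarity("ACGT", "AGT"), "\n";   # prints 2 (3 matches, 1 space)

A traceback over the same table, recording which of the three terms achieved the maximum at each cell, recovers an optimal alignment itself.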
The use of an ``alignment table", which clearly translates into an array in a computer program. A bit of the Perl needed to extend the basic Needleman-Wunsch program to handle mismatch and space penalties more general than -1.

*********************
Lecture 8
Development of DP notation and recurrences to allow the use of a general substitution matrix in the computation of similarity. Computation in a table instead of in a graph. Introduction to local alignment. Why local alignment of biological sequences? Several examples. Formalizing the local alignment objective in terms of a pair of substrings of the two input strings, and the global alignment between that pair of substrings. If we try to compute the optimal local alignment by directly following the definition of local alignment, how many operations will that require? Answer: the cube of the number needed for global alignment, so n raised to the sixth power if both input strings are of length n, compared to n**2 for the global alignment of two strings of length n. So the straightforward implementation of the definition of local alignment is impractical, and we need a better (faster) method if we are going to be able to compute local alignments. The Smith-Waterman DP method is the answer. It computes the optimal local alignment using a number of operations proportional to the number of operations used for global alignment, and the constant of proportionality is small.

********************
Lecture 9
Guest lecture on the Longest Common Subsequence by Professor Martel. Definition and DP solution. Expected length of the longest common subsequence.

**********************
Lecture 10
Detailed development of the DP solution to compute an optimal local alignment by the Smith-Waterman DP method. The worst-case number of operations for the method is O(nm), the same as for global alignment. The constant of proportionality is slightly larger for local than for global alignment, but the difference is very small.

********************
Lecture 11
End-gap-free alignment. Example of its use in whole-genome shotgun sequencing. Development of the DP to compute an optimal end-gap-free alignment. Introduction to the derivation of the expected length of the longest common substring.

********************
Lecture 12
Derivation of the expected length of the longest common substring between two random strings of the same length - actually, an upper bound on that length. Be aware that I made an error towards the end of the lecture (around minute 35), defining r as the minimum integer such that E(Cm) < 1 rather than as the maximum integer such that E(Cm) >= 1, and also forgetting that when the base of a logarithm is less than 1, the log is negative. That caused trouble. The notes for that lecture correct the error.

********************
Lecture 13
Continuation of the topic of string matching probability. Now we look at the probability that a query string S matches completely at least once in a much longer string D, which represents the concatenation of sequences in a database. We address the question: if a query string S and the long string D are both random, what is the probability that S is found to exactly match somewhere in D? In the last nine minutes of the lecture (starting around minute 44) I talk about Perl and hashes in Perl. Hashes are, for me, one of the key reasons for programming in Perl, and their use is central in the toy BLAST program that you will write in the next two weeks.
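(Not from the lectures, and not one of the lab's kmer programs: a minimal sketch of the kind of hash a toy BLAST is built around, mapping each k-mer of a database string to the positions where it starts; the variable names and toy strings are my own.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code): index every k-mer of a database string in a
    # hash, mapping the k-mer to the list of positions where it starts.
    my $k  = 4;
    my $db = "ACGTACGTTACG";
    my %index;                          # k-mer => reference to a list of positions
    for my $pos (0 .. length($db) - $k) {
        my $kmer = substr($db, $pos, $k);
        push @{ $index{$kmer} }, $pos;
    }

    my $query_kmer = "ACGT";
    if (exists $index{$query_kmer}) {
        print "$query_kmer found at positions: @{ $index{$query_kmer} }\n";   # 0 and 4
    }

Looking up each k-mer of a query in such a hash gives candidate seed positions directly, without running DP against every database sequence.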
*********************
Lecture 14
BLAST. Part of this lecture, on the ideas behind BLAST, follows the Lecture Notes on BLAST we wrote, which are linked from Lab 5. Alignment based on DP is better than pure enumeration of alignments, and it is fast enough for the optimal alignment (local or global) of two sequences, but even DP is not fast enough to do alignment between a query string and every sequence in a large database. The ideas of BLAST 1, ungapped BLAST. The use of k-mers, seeds, extensions, and hashing in BLAST. The exact matches are called HSPs, high-scoring pairs. The value of k was around 11 for DNA at the time of the lecture, and around 3 for amino acids; I don't know what it is today. In the case of protein, a substitution matrix is used. Also, neighborhoods are used, in order to incorporate small mismatches. The lecture partly makes reference to the Perl needed to implement toy BLAST. Question: why is this method so much faster than doing DP?

**********************
Lecture 15
Continuation of BLAST. Hashing k-mers, BLAST speed, HSPs, k-mer neighborhoods, E-values, introduction to Gapped BLAST (BLAST II).

*********************
Lecture 16
This whole lecture explicitly talks about the five kmer programs distributed in Lab 6 for the start of the toy BLAST project.

********************
Lecture 17
Continuation of the discussion of the kmer programs, and Perl. Two-dimensional hashes. The assignment of a list as the value of a hash. Dereferencing hashes. Final discussion of BLAST and statistics in BLAST: E-values reported by BLAST. BLAST results are based on the maximum score found in the matches of a query to sequences in the database. Therefore, BLAST statistics concern the maximum value of a set of random variables (match values). Maximum values are described by the extreme value distribution. In contrast, the Normal distribution concerns the sum of random variables. It is a mistake to think that BLAST statistics based on the extreme value distribution should look like or behave like the Normal distribution. In particular, the extreme value distribution has a very long right tail - that is, the probability of scores at the right end of the extreme value distribution remains high for much longer than for scores at the right end of the Normal distribution. Therefore, if you think the random scores are distributed as in the Normal distribution, you will often be overly impressed by large observed scores.

********************
Lecture 18
This lecture starts with a review of matching probabilities, for the case of a query string matching somewhere in a database. As n (the size of the database) grows relative to m (the length of the query), the probability that the query string matches somewhere in the database grows quickly. The take-home lesson is that when there are many trials (possible starting places for an exact match of the query string in the database), the probability of a success somewhere can be very different from the probability that two random strings of the same length exactly match. The formula for the extreme value distribution and its density is displayed without much discussion. A graph of the density is shown. The point again is that a large maximum over many events can happen with high probability, even when the probability of an exact match between two random strings is very low. This point is connected to the Granin and BRCA1 story and the PROSITE search.
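(Not from the lectures; the numbers are my own toy choices. A minimal sketch of the ``many trials" point: treating each possible starting position in the database as an independent trial - an approximation, since overlapping positions are not independent - the chance that a random DNA query of length m exactly matches somewhere in a random database of length n is roughly 1 - (1 - (1/4)^m)^(n-m+1), which approaches 1 as n grows.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code): probability that a random DNA query of
    # length m matches somewhere in a random database of length n, treating
    # the N = n - m + 1 starting positions as independent trials
    # (an approximation, since overlapping positions are not independent).
    sub match_somewhere {
        my ($m, $n) = @_;
        my $p = 0.25 ** $m;          # probability of a match at one fixed position
        my $N = $n - $m + 1;         # number of possible starting positions
        return 1 - (1 - $p) ** $N;
    }

    printf "m=8, n=1e3: %.6f\n", match_somewhere(8, 1_000);           # small
    printf "m=8, n=1e9: %.6f\n", match_somewhere(8, 1_000_000_000);   # essentially 1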
The probability of a random sequence matching the Granin regular expression is very low, but the probability that a random sequence will match some regular expression in PROSITE, with a probability as low as or lower than the number computed for Granin, is very high. The analogy is made to coin flipping: the probability of a specific sequence of heads/tails in 100 flips is very low, but the probability that some sequence of that low a probability will occur is 1. So when such a small-probability event is observed, we have no ``statistical support" to reject the null hypothesis that the heads/tails sequence was the result of random coin flips. PROSITE is a filter, not an oracle.

********************
Lecture 19
The start of the discussion of Multiple Sequence Alignment. Extension of pairwise sequence alignment to multiple sequences in order to identify weak patterns in the sequences - patterns that would not be pulled out by pairwise sequence alignment. Why do we do multiple alignment? "An alignment of two sequences whispers, but a multiple sequence alignment shouts". The Sum-of-Pairs objective function, and how to compute it exactly for three strings using DP. Development of the DP for three strings. For three strings of length m each, the number of operations for this exact DP is proportional to m^3 (m raised to the third power). In general, if there are k strings, each of length m, the number of operations for the exact DP that computes the optimal sum-of-pairs multiple alignment is proportional to m^k (m raised to the k'th power). As k grows, unless m is extremely small, this quickly becomes prohibitive, and for proteins, exact DP solutions are practical for about 8 or 9 sequences at most. There is a program called MSA (look for it on the web) that does this exact computation, but with many speedup tricks compared to the raw DP solution. Introduction to Progressive Multiple Sequence Alignment. The tree consistency theorem.

***********************
Lecture 20
The tree consistency theorem. Definitions: S is a set of k strings and T is a tree with k nodes, each node labeled with a distinct string in S. A multiple alignment M of S is said to be ``consistent" with T if the induced pairwise alignment of each pair of sequences Si and Sj that are neighbors in T is an optimal pairwise alignment of Si and Sj. An example of these definitions is shown. Tree Consistency Theorem: for any tree T labeled with the sequences in S, there is a multiple alignment M of S which is consistent with the labeled tree T. Definition of D(i,j) and DM(i,j): D(i,j) is the optimal pairwise alignment value of sequences Si and Sj; DM(i,j) is the value of the induced pairwise alignment of Si and Sj in the multiple alignment M. For multiple alignment we score a pairwise alignment as the number of mismatches plus the number of spaces. This is different from the maximization criterion we used earlier for pairwise alignment, but it has the same modeling validity - now a low D value for a pair of sequences indicates highly related sequences, instead of a high similarity value V for the pair. The multiple alignment sum-of-pairs problem is to find a multiple alignment of the sequences in S that minimizes SUM_{1 <= i < j <= k} DM(i,j). Clearly, for any i,j, DM(i,j) >= D(i,j), so SUM_{1 <= i < j <= k} DM(i,j) >= SUM_{1 <= i < j <= k} D(i,j). So we would like the multiple alignment M to make DM(i,j) = D(i,j) for as many pairs of sequences Si, Sj as possible.
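(Not from the lectures: a minimal sketch of computing the induced pairwise distances DM(i,j) and the sum-of-pairs distance of a given multiple alignment, under the mismatches-plus-spaces scoring just defined; the function names and the toy alignment are my own.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code): the induced pairwise distance of two rows of
    # a multiple alignment is the number of mismatches plus the number of
    # spaces, after dropping columns in which both rows have a space.
    sub induced_distance {
        my ($r1, $r2) = @_;
        my $d = 0;
        for my $c (0 .. length($r1) - 1) {
            my ($a, $b) = (substr($r1, $c, 1), substr($r2, $c, 1));
            next if $a eq '-' and $b eq '-';   # column absent from the induced alignment
            $d++ if $a eq '-' or $b eq '-' or $a ne $b;
        }
        return $d;
    }

    # Sum the induced distances over all pairs of rows.
    sub sum_of_pairs {
        my @rows  = @_;
        my $total = 0;
        for my $i (0 .. $#rows - 1) {
            for my $j ($i + 1 .. $#rows) {
                $total += induced_distance($rows[$i], $rows[$j]);
            }
        }
        return $total;
    }

    print sum_of_pairs('AC-GT', 'ACCGT', 'A--GT'), "\n";   # prints 4 for this toy alignment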
The Tree Consistency Theorem says that given any set of sequences S and a tree T whose nodes are labeled by the sequences in S, there is a multiple alignment M of S such that DM(i,j) = D(i,j) for each pair of sequences Si, Sj which label neighboring nodes in T. A constructive proof is given, showing how to actually build the multiple alignment M from T and S. This constructive proof also shows how to merge two sub-multiple alignments, given a pairwise alignment between a sequence Si already in one of the sub-multiple alignments and a sequence Sj in the other sub-multiple alignment. As a special case, this allows us to add a new sequence into an existing multiple alignment, given a pairwise alignment between a sequence that is already in the MSA and a sequence that is not. This merge achieves two goals: a) it preserves the induced pairwise alignment value of any pair of sequences that were already together in one of the sub-multiple alignments before the merge; and b) the induced pairwise alignment value of Si and Sj in the merged multiple alignment is exactly the same as the optimal pairwise alignment value of Si and Sj.

*******************
Lecture 21
Introduction to the alignment method encoded in star.pl. In that method, the ``guide tree" T is a star with one center and n-1 leaves, so the only issue is which sequence to put in the center. Multiple sequence alignment: progressive alignment using a guide tree - ClustalX and star.pl. Although tree consistency only assures that n-1 of the (n choose 2) induced pairwise distances are equal to (i.e., as small as) their optimal pairwise distances, if the objective function used for pairwise alignment satisfies the triangle inequality, then the total induced (sum-of-pairs) distance of the best center star is at most twice the total distance of the (n choose 2) optimal pairwise alignments. In the discussion of Clustal, the lecture says that the guide tree used in Clustal is essentially the Minimum Spanning Tree of the weighted complete graph (i.e., with all possible edges) in which each node represents one of the input sequences and each edge weight is the edit distance between the two sequences at the ends of the edge. The MST is also given by the Single-Link Clustering algorithm. But actually, the guide tree used in Clustal is produced by the Neighbor-Joining algorithm, which gives a tree that is not the same as the Minimum Spanning Tree. A few uses of multiple alignment - PSSMs, PSI-BLAST, HMMs and Pfam, PROSITE, BLOCKS (these go both ways: a multiple alignment is used to construct an HMM, and conversely an HMM is used to build a multiple alignment). Review of the First Fact of sequence analysis, and review of the Second Fact of sequence analysis.

**********************
Lecture 22
Continuation of the discussion of uses of multiple alignment to build a model of a set of sequences that are related in some biologically significant way, but where the pairwise sequence similarity of any pair of sequences is not sufficient to identify the important biological elements.
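(Not from the lectures: as a prelude to the PSSM discussion that follows, a minimal sketch of the per-column residue counts that such a model starts from; the toy alignment is my own.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch (not course code): per-column residue counts of a multiple
    # alignment, the raw material that a PSSM turns into scores.
    my @alignment = ('ACGT', 'ACGA', 'TCGT');   # toy aligned sequences, equal length
    my @counts;                                 # $counts[$col]{$residue} = count

    for my $row (@alignment) {
        for my $col (0 .. length($row) - 1) {
            $counts[$col]{ substr($row, $col, 1) }++;
        }
    }

    # Report the counts column by column.
    for my $col (0 .. $#counts) {
        my $h = $counts[$col];
        print "column $col: ",
              join(', ', map { "$_=$h->{$_}" } sort keys %$h), "\n";
    }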
In this lecture we focus on PSSMs (Position Specific Scoring Matrices) derived from a multiple alignment. Frequency counts in each column of a multiple alignment are turned into scores. How and why, and how are the scores used? A PSSM is used to evaluate how similar a new sequence - one not used in the multiple alignment that led to the PSSM - is to the original sequences. The goal is to accurately recognize new members of the family of sequences used in the multiple alignment, and of course, not to incorrectly declare a sequence to be in that family. We want no false positives and no false negatives. The score of a new sequence, using the character scores in the PSSM, is really the log of the ratio of the probability that the new sequence was generated by a ``generator of the family" (the same process that generated the family), divided by the probability that the new sequence was generated by a random sequence generator. Later we will be more precise about what a ``generator" or ``process" is. So the score given to a sequence is a ``log odds ratio", a kind of score that is common in statistics and genetics. Why is the log used?

*************************
Lecture 23
Finish of the discussion of profiles and PSSMs - a more formal treatment, and a more complete discussion of why the log of the probability ratio is used. A more complete discussion of the model that generates the sequences used to obtain the PSSM. This is really an introduction to very simple Markov models that generate sequences. Introduction to Markov chains and the probability of a given Markov chain. An explicit introduction to Markov models and Hidden Markov Models. The key property of a Markov model is that at any point, the future is influenced by the current state the chain is in, but not by the past before the current state - that is, the probability of future parts of the chain is not influenced by how the chain arrived at its current state, only by the current state of the chain.

************************
Lecture 24
HMMs for CpG islands. This lecture follows the discussion in the Durbin and Eddy book on CpG models.

************************
Lecture 25
Review of the CpG island model and the idea of HMMs modeling nature. The Viterbi Algorithm to find the chain of largest probability in an HMM that generates a given sequence S. This is one of the ways that one can decide (make an educated guess) which part of a sequence S is in the + model and which is in the - model. The actual discussion of the DP to find the chain of largest probability that generates S starts around minute 25 of the lecture. In the context of HMMs, the DP solution of this problem is called the Viterbi Algorithm.

*************************
Lecture 26
More detailed discussion of the Viterbi Algorithm: developing the recurrences; evaluating the recurrences in a table; traceback to find the optimal chain; derivation of the number of operations used in the Viterbi Algorithm. Introduction and discussion of the Forward Algorithm, which computes the probability that a given sequence is generated by the HMM. For that, one has to (implicitly) consider all the chains in the Markov model, but again this can be done efficiently using DP, with a number of operations proportional to nm^2, where n is the length of the sequence and m is the number of states in the HMM.

**********************
Lecture 27
Introduction to the Backward Algorithm for HMMs. The `what' and `why' of the backward algorithm, but not the `how'.
For information on the `how', see the Durbin book. Discussion of Profile HMMs - these are actually the simplest and most common HMMs and the basis for many of the HMMs in Pfam. Generally, a profile HMM is derived from a multiple alignment of the sequences in the family. *********************** Lectures 28, 29 Mystery lectures: We didn't get to these in the 2008 course, so I have not looked at them recently. When I watch them, I will write the synopses. Be the first on your block to do it before I do.