Lecture – 35 Data Mining and Knowledge Discovery Part II


Hello and welcome to this second session on data mining. In the previous session we saw what data mining is all about, and we covered some very fundamental concepts: item sets, association rules, and how to discover particular patterns in a data set. That is, how do you discover something you do not already know from a data set, using notions such as support and confidence? Essentially, you supply an interestingness criterion and then start distilling out patterns that satisfy it. In this session we move further and briefly look at some fundamental, fairly simple algorithms for different kinds of data mining activities, namely discovering classification trees, discovering clusters in data, mining sequence data, and mining streaming data. Let us first briefly summarize what data mining is all about.
what data mining was all about. Data mining essentially is the concept of
or is the idea of looking for hidden patterns and trends in data that’s not immediately
apparent by just summarizing the data. So when we say hidden patterns, its essentially means that
something that we don’t know about. There is nothing hidden if you already knew such
a pattern existed in the data base. So in a data mining setting there is no query but we use the concept
of an interestingness criteria. That is we use let us say frequency or consistency or
rarity or whatever be the interestingness criteria and certain parameters define each of these interestingness
criteria like frequencies is parameterized by support and confidence for association
rules and just support for item sets and so on. And again there are different kinds of data
we can think of tabular data, spatial data, temporal data, tree data, graph data and so
on and so forth. So today or in this session we shall look
at specifically at sequence data mining and streaming or mining streaming data and in
addition to other mining algorithms. And of course type of interestingness itself could be varied that
we could talk of frequency as frequent patterns as being interesting or rare patterns being
interesting and so on. Now let us move further from here and look
Now let us move further and look at the concepts of classification and clustering, that is, discovering a classification tree and discovering clusters within a given data set. What is the difference between classification and clustering? Intuitively they both seem to do the same thing, whether you classify a data set into different classes or cluster it into different clusters. But if you observe closely, classification maps data elements to one of a set of pre-determined classes based on the differences between data elements: elements a and b are placed in different classes if they are different enough. Clustering, on the other hand, groups data elements based on the similarity between elements within a single group. It is also usually the case that in classification we know the classes a priori, that is, we know all the classes into which data can be classified, whereas in clustering we often do not know how many clusters we will get before the clustering process begins.
Let us look at mining in relation to classification. We are not interested here in the act of classification itself but in discovering a classification, that is, discovering a decision tree which decides how to classify data into different classes. This is best illustrated with an example, so let us take a small one and see how we can discover a classification tree. Suppose we have data about cricket matches played over the last several years in a given city, a city notorious for its rains and unpredictable weather. In the past, play has sometimes been continued and sometimes abandoned.
continued or was abandoned and so on. Now we have data like this from different
data sets. When it was sunny and the temperature was 30 degrees, play was continued. When it
was overcast and the temperature was 15 degrees play wasn’t continued, when it was sunny and temperature
was 16 degrees play was still continued and so on. So in some times play was continued
and sometimes play was discontinued, its no. Now what is the classification problem is can
I classify weather conditions which is a combination of the outlook and the temperature into one
of two classification classes that is whether we are going to play or play is going to be discontinued.
That is what is the criteria, when play was discontinued and what was the weather criteria
when play was continued. So there is a well-known algorithm called
There is a well-known algorithm called Hunt's method for the identification of decision trees. As before, let us first look at an example of how a decision tree is identified before looking at the algorithm itself. The way of identifying a decision tree is quite simple. First, because the temperature field is numeric, it can take many different values, most of which are of no interest to us individually. So let us perform a hand classification of these numeric values: temperature is now grouped into three classes, warm, chilly and pleasant, by dividing the range of temperatures into intervals. Next, since there are two fields here, outlook and temperature, both of them will affect the decision on whether play continues. So how do we know how each parameter affects the decision whether to play or not? Let us look at one parameter at a time. First, consider sunny.
Whenever the outlook was sunny, the match was played; it was not abandoned. Sunny appears only twice here, and in both cases the match was played. Therefore we can directly conclude that if the weather is sunny, regardless of whether the temperature is warm or chilly or anything else, play will continue; play is not going to be stopped. On the other hand, consider cloudy. When it was cloudy, play was continued in two cases and discontinued in one case, so for cloudy we are still in what may be called a bivalent state: maybe yes, maybe no, we still do not know. Similarly, when the outlook was overcast, in some cases they played and in others they did not, so for overcast as well we are in a bivalent state: we cannot say whether play will continue or not.
What we can now do is remove the first rule from further consideration, since it is a rule we have already discovered: when it is sunny, they are going to play. That leaves the other two cases. Because for cloudy and overcast we are still in a bivalent state, we have to reach a state where this bivalence is removed, that is, where we can conclude yes or no conclusively. So we now introduce the second parameter, temperature, to see whether it removes the uncertainty; in the sunny case the uncertainty is already removed, so nothing more needs to be done there. For cloudy we introduce all three possible temperature classes, warm, chilly and pleasant, and similarly for overcast. Take cloudy and warm: there is only one such case, and play was continued, so the bivalence is removed and we can conclusively state that whenever it is cloudy but the temperature is warm, play will continue. Whenever it was cloudy and chilly, there is again only one case, and play was discontinued, so cloudy and chilly means no: play will be abandoned if the outlook is cloudy and the temperature is chilly. Similarly, cloudy and pleasant occurs only once and is a yes, so when the outlook is cloudy but the temperature is pleasant we can conclude that play will continue. For overcast and warm there is no entry at all, so we cannot decide anything and it remains as it is; overcast and chilly gives us no, that is, play will be abandoned; and overcast and pleasant gives us yes.
So effectively we have removed the bivalence that existed for cloudy and for overcast, and found out under what conditions play will be continued and under what conditions it will be discontinued in each case. What we have actually done, therefore, is discover a decision tree. Initially we are in a bivalent state: we do not know whether play will be continued or discontinued. In this state, if we are told that the outlook is sunny, we can immediately conclude yes, we are going to play today. On the other hand, if we are told that the outlook is cloudy, we are still in a bivalent state and still do not know whether play will continue, so we ask for more information; when we find out that the temperature is, say, pleasant, we conclude that play will continue, whereas if the temperature is chilly we have reason to believe, based on the data set, that play will be abandoned, and so on. What we have obtained is a tree data structure in which, starting from a bivalent state, we eventually reach a univalent state, a state where the uncertainty is removed, and we have classified the conditions into two classes, yes or no, that is, play will be continued or play will be abandoned.
Let us now look back at the algorithm itself. Suppose we are given n different element types; in the example we just saw, n was 2, namely outlook and temperature. Suppose we are also given m different decision classes; here m was 2, namely yes and no. In the main loop, for each element type i we progressively add element i to the (i-1)-element item sets from the previous iteration, and for each such item set we identify the set of decision classes it maps to. If an item set maps to only one decision class, the decision has been made, so we remove that item set from subsequent iterations; otherwise we keep going until we run out of element types. Of course, it could well be the case that even after using all n element types we cannot reach a conclusive decision; it might be, for example, that when it is overcast and chilly they sometimes played and sometimes did not. There are several methods to deal with such indecisiveness, for example using probabilities or some kind of fuzzy classification, where we say that if the outlook is overcast and the temperature is pleasant, then they will play with a probability of 90%, or something like that.
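To make this concrete, here is a minimal sketch in Python of the level-wise discovery loop just described. The record format, the attribute names and the small play/no-play table are illustrative assumptions, not the actual data from the slides.

# A sketch of the level-wise decision-tree discovery described above: refine
# attribute-value combinations one attribute at a time until each combination
# maps to a single decision class.
def discover_rules(records, attribute_order):
    pending = [{}]                      # partial rules that are still bivalent
    discovered = []                     # (rule, decision) pairs that became univalent
    for attr in attribute_order:        # introduce one attribute per iteration
        values = {r[attr] for r, _ in records}
        next_pending = []
        for rule in pending:
            for value in values:
                refined = dict(rule, **{attr: value})
                decisions = {d for r, d in records
                             if all(r[k] == v for k, v in refined.items())}
                if len(decisions) == 1:             # univalent: rule discovered
                    discovered.append((refined, decisions.pop()))
                else:                               # bivalent or unseen: refine further
                    next_pending.append(refined)
        pending = next_pending
    return discovered, pending          # leftover rules stay undecided (use probabilities)

matches = [({"outlook": "sunny", "temp": "warm"}, "yes"),
           ({"outlook": "sunny", "temp": "chilly"}, "yes"),
           ({"outlook": "cloudy", "temp": "warm"}, "yes"),
           ({"outlook": "cloudy", "temp": "chilly"}, "no"),
           ({"outlook": "overcast", "temp": "chilly"}, "no"),
           ({"outlook": "overcast", "temp": "pleasant"}, "yes")]

rules, undecided = discover_rules(matches, ["outlook", "temp"])
for rule, decision in rules:
    print(rule, "->", decision)         # e.g. {'outlook': 'sunny'} -> yes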
Let us look further into some clustering techniques. What is meant by clustering, and how does it differ from classification? We saw earlier that there is a philosophical difference between classification and clustering; there are differences in the end result as well, but the most marked difference is philosophical. Classification is based on amplifying the differences between elements so as to place them in different classes, whereas clustering is based on amplifying the similarities between elements so as to form them into clusters. Clustering essentially partitions the data set into one or more clusters, or equivalence classes. And what is the property of a cluster? Essentially that the similarity among members of a given cluster is much greater than the similarity between members of different clusters; members of the same cluster are much more similar to one another than they are to members of other clusters. There are several measures of similarity, most of which reduce to geometric similarity: the data is projected into an n-dimensional space and some distance measure, such as Euclidean distance or Manhattan distance, is used to compute similarity.
Let us look at the first kind of clustering algorithm, the nearest neighbour clustering algorithm. It is quite simple: the algorithm takes a parameter t, a threshold on the maximum distance allowed between members of a given cluster. Given n elements x1, x2, ..., xn and the threshold t, clusters are found by a very simple process. Initially the set of clusters is empty. Then, for each element xj in turn, find its nearest neighbour among the elements already assigned to clusters; if there is no such element, simply create a new cluster containing xj alone. Otherwise, suppose the nearest neighbour is in cluster m. If the distance to the nearest neighbour is greater than t, then no element is within distance t of xj, so xj should belong to a new cluster; create one and increment the number of clusters. Otherwise, assign xj to cluster m, where its nearest neighbour lies. As simple as that: given a threshold, you partition the set of elements into clusters based on each element's nearest neighbour; if the nearest neighbour is within the threshold distance the element joins that cluster, otherwise it starts a new one.
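As a concrete illustration, here is a minimal sketch of this nearest-neighbour loop in Python, assuming the elements are points in the plane and Euclidean distance is the similarity measure; the point data is made up.

import math

def nearest_neighbour_clustering(points, t):
    clusters = []                                   # each cluster is a list of points
    for p in points:
        best_cluster, best_dist = None, float("inf")
        for cluster in clusters:                    # nearest already-clustered neighbour
            for q in cluster:
                d = math.dist(p, q)
                if d < best_dist:
                    best_cluster, best_dist = cluster, d
        if best_cluster is None or best_dist > t:   # too far from everyone: new cluster
            clusters.append([p])
        else:                                       # join the nearest neighbour's cluster
            best_cluster.append(p)
    return clusters

print(nearest_neighbour_clustering([(0, 0), (1, 0), (10, 10), (11, 10)], t=5))
# two clusters: the two points near the origin, and the two near (10, 10)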
There is another kind of clustering technique, again quite popular, called iterative partitional clustering. It differs from the nearest neighbour technique in that here the number of clusters is fixed a priori. In nearest neighbour clustering the number of clusters is not fixed a priori, which means you do not know how many clusters you will get for a given threshold and data set; in that respect it is very much unlike classification, where we know the classes into which data can be classified. In iterative partitional clustering the number of clusters is known a priori, and what we are trying to discover is which elements belong to which cluster. So we are given n elements and k clusters, each with a centre. What do we mean by a centre? It is the centroid in the statistical sense: if the elements of a cluster have several features, the average of those features along each dimension forms the centroid of the cluster.
So let us say we have k clusters, each with a centre. For each element, find its closest cluster centre and assign it to that cluster. After all assignments have been made, recompute the centroid of each cluster, that is, compute the average of all the points assigned to it; this will possibly shift the centroid to a different location. Once the centroids have shifted, the nearest cluster centre may differ for some elements, so we keep repeating these two steps with the new centroids until the algorithm converges, that is, until the centroids stop shifting; at that point we know we have found the best centroids for each of the k clusters. Iterative partitional clustering is therefore something like saying: suppose I have a data set and I want to create 10 clusters out of it, where would those clusters lie? A nearest neighbour clustering technique, on the other hand, asks: given this data set and a threshold distance of, say, 5 between elements of a cluster, how many clusters will I find? In the iterative algorithm we are interested in where the clusters and their centroids are going to be.
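Here is a minimal sketch of this assign-and-recompute loop in Python, again assuming points in the plane, Euclidean distance, and k initial centres picked at random from the data; all of these choices are illustrative.

import math
import random

def iterative_partitional(points, k, iterations=100):
    centres = random.sample(points, k)                      # initial cluster centres
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # assign to the nearest centre
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        new_centres = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centres[c]
                       for c, cl in enumerate(clusters)]    # recompute each centroid
        if new_centres == centres:                          # centroids stopped shifting
            break
        centres = new_centres
    return centres, clusters

centres, clusters = iterative_partitional([(0, 0), (1, 1), (9, 9), (10, 10)], k=2)
print(centres)
print(clusters)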
Let us now move on and look at other kinds of data. Until now we have been looking at tabular data, as in apriori-style association rule mining, or at multi-dimensional data. Tabular data can be treated as multi-dimensional data as long as the attributes belong to suitable ordinal classes; how exactly to convert tabular data into multi-dimensional form is beyond the scope of this session, but as long as the data can be converted, we can use clustering techniques to group it into clusters. Similarly, tabular data can be used to infer classification trees. Let us now move on to a different kind of data, called sequence data. What do we understand by the term sequence? A sequence is essentially a collection of data elements, but not just a collection: it is an ordered collection, one in which the ordering matters. Each item in a sequence has an index associated with it, like a subscripted element: this is the first element, this is the second element, and so on. When we say we have a k-sequence, we mean a sequence of length k, that is, there are k elements in a particular order. There are different kinds of sequence data: transaction logs over a period of time, web browsing logs or HTTP logs, DNA sequences, or the medical history of a patient over time, that is, how the history changes and what kinds of events happened. All of these are sequence data.
Let us look at some definitions for mining sequence data, which help us formulate algorithms for finding patterns in sequences. First of all, a sequence is essentially a finite-length list of item sets; each element of a sequence need not be atomic, it can itself be a set of items. For example, consider the sequence whose first element is {pen, pencil, ink}, whose second element is {pencil, ink}, whose third element is {eraser, ink}, and whose fourth element is {ruler, pencil}. Such a sequence could, for example, denote the purchases of a single customer over time in some stationery store: the customer came in the first month and purchased the first three items, in the second month purchased the next two, and so on. The order of items within an item set does not matter, but the order of the item sets themselves does: this is the first month, this is the second month, this is the third month, so the position of each item set matters, while whether I read an item set as {pencil, ink} or {ink, pencil} does not. We define a subsequence of a sequence as any sequence obtained by deleting some item sets from it. Some more definitions. Suppose I take a sequence s' = <a1, a2, ..., am>; note that this is a sequence, not a set, so curly braces would be inappropriate here. We say that s' is contained in another sequence s if s contains a subsequence <b1, b2, ..., bm> of m item sets such that each corresponding element is a subset: a1 ⊆ b1, a2 ⊆ b2, and so on. Hence, for example, the sequence <{pen, pencil}, {ruler, pencil}> is contained in the sequence above: {pen, pencil} is a subset of the first item set, and {ruler, pencil} is a subset of the last.
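A minimal sketch of this containment test in Python follows; the helper name and the small purchase history are only illustrative.

# Each item set of the candidate must be a subset of some item set of the
# sequence, and the matches must respect the left-to-right order.
def is_contained(sub, seq):
    i = 0
    for itemset in sub:                       # match item sets greedily, left to right
        while i < len(seq) and not set(itemset) <= set(seq[i]):
            i += 1
        if i == len(seq):
            return False                      # ran out of item sets to match against
        i += 1
    return True

purchases = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
             {"eraser", "ink"}, {"ruler", "pencil"}]
print(is_contained([{"pen", "pencil"}, {"ruler", "pencil"}], purchases))   # True
print(is_contained([{"eraser"}, {"pencil", "ink"}], purchases))            # False: order matters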
So let us look at the apriori algorithm, or an apriori-style algorithm, applied to sequence data rather than to item sets and association rules. The apriori algorithm for sequences looks very similar to the apriori algorithm for item sets. How does it work? First we generate L1, the set of all interesting one-sequences, where a one-sequence is a sequence containing just one element. Then, starting from k = 1 and as long as Lk is not empty, we generate all candidate (k+1)-sequences, and out of these we keep only the interesting (k+1)-sequences, namely those which have at least the minimum support we specified. The main question lies in the candidate generation step: how do we generate all candidate (k+1)-sequences? Given the interesting sequences L1, L2, ..., Lk, the candidates for Lk+1 are generated simply by concatenating every sequence in Lk with each of the interesting one-sequences found in the previous iteration. What does this mean? Let us illustrate it with an example.
Let us say this is my data set: I have a website, and the data set records which pages were visited by users in different usage sessions. One user went from page a to b to c to d to e, another user started at b and went to d, a, e, and so on. So we have different sequences, and as you can see, an element can repeat within a sequence; one user has requested page a four times in a row, another requested a three times after b, and so on, for whatever reason. Now, in order to mine for all interesting subsequences, that is, what tends to be visited before what in this data set, let us start with the set of all interesting one-sequences. We have set minsup to 0.5, that is, at least 50% support. What does it mean for a one-sequence to be interesting? Essentially that the sequence of length one appears in at least half of the sessions. In the example, a appears in 8 different sequences, b appears in 9, and so on, so a, b, d and e are interesting one-sequences, while c appears just once and is therefore not interesting as a one-sequence. Next we generate all possible candidate two-sequences. This is not a combination but a concatenation in which order matters, that is, all possible concatenations of the interesting one-sequences with each other: ab is different from ba, ad is different from da, and so on. These form the set of candidate two-sequences, and we then check which of them have the minimum support.
minimum support. Now among these you see that only ab and bd
have a minimum support of 0.5. That is all others aa for example has the minimum support
of 1 2 3 that’s it, not 0.5. That is one is here rather 4, 1 2 3 and 4, ab also has minimum support
less than 5 and so on. So the only set of interesting two sequences
are ab and bd in this case. So we have got the set of all interesting two sequences.
Now how do we generate the set of all interesting three sequences that is candidate three sequences?
We concatenate ab and bd with all the interesting one sequences found in the previous iteration.
So the previous iteration here is still the one sequence here ab d and e. therefore we concatenate
both of this with a b d and e like this and then we see that there are no interesting
three sequences at all and then the process stops. Otherwise we would have filtered out few more
elements here and then out of these, again we would have concatenated with all possible
interesting one sequences that we found in the previous iteration. So here the interesting one sequences
that we have found in the second iterations are a b and d. So for level 4 there is no
need to concatenate it with let us say e, so it’s enough if we just concatenate with a b and
d. With sequence data there is an other kind
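The following is a minimal sketch of this apriori-style loop for sequences in Python. It takes the support of a candidate to be the fraction of data sequences containing it as an ordered (not necessarily contiguous) subsequence, and the tiny click-stream in it is an illustrative stand-in for the data set on the slides.

def is_subsequence(cand, seq):
    i = 0
    for item in cand:                         # match the candidate left to right
        try:
            i = seq.index(item, i) + 1
        except ValueError:
            return False
    return True

def support(cand, data):
    return sum(is_subsequence(cand, s) for s in data) / len(data)

def frequent_sequences(data, minsup):
    ones = sorted({x for s in data for x in s})
    level = [[x] for x in ones if support([x], data) >= minsup]   # interesting 1-sequences
    singles = [s[0] for s in level]
    result = []
    while level:                                                  # apriori-style level loop
        result.extend(level)
        candidates = [s + [x] for s in level for x in singles]    # concatenate
        level = [c for c in candidates if support(c, data) >= minsup]
    return result

clicks = [list("abcde"), list("abde"), list("bdae"), list("abd")]
print(frequent_sequences(clicks, minsup=0.5))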
With sequence data there is another kind of interesting mining problem, which occurs when we look at sequence data as a behavioural pattern. For example, the data may describe the way users behave on a website: a user comes to page a, then goes to page b, then to pages c, d and e, and so on. We are then faced with the question: can we model the behaviour of the user? What would a model look like that explains how users behave on my website? What this means is that, if we treat these sequences as strings generated by some hypothetical machine, we have to find a machine that can generate all of these strings, and possibly other strings that belong to the same class, in whatever sense. The simplest kind of machine we can use is a state machine, a deterministic finite automaton. That does not mean everything can be modelled by a finite state machine; it is purely for complexity and practical considerations that we assume the model representing user behaviour is a finite state machine. So, given a set of input sequences, we have to find a finite state machine that recognizes this class of input sequences. This is also called language inference: given strings of a language, you try to infer the grammar, or structure, of the language. Now, what is the trickiest problem that occurs in language inference?
Take a look at these strings. Let us say I have four strings: abc, aabc, aabbc and abbc. If I give you these four strings and ask you to create a state machine that recognizes them, it is quite obvious that one could come up with a machine that accepts exactly these four strings and nothing more. On the other hand, one can also write a machine comprising a single state that loops onto itself and accepts all strings over a, b and c. The latter is the most general state machine: it is also correct, in the sense that it accepts the four strings, but it accepts everything else made of a, b and c as well. The former is the most specific state machine: it accepts these four strings and nothing else. The trickiest problem in language inference is to find the right degree of generalization: the most specific machine is of little use, and the most general machine is useless as well. So when we try to discover a model of user behaviour, we should discover one that is neither too specific nor too general; it has to have the right kind of generalization. How do we do that? There are several algorithms that try to generalize a little, but not too much. We will look at one particular algorithm, which might be termed shortest-run generalization. As we did for the previous algorithms, let us first look at an example and then come back to the algorithm.
the previous algorithms, let us first look at the example and then come back to the algorithm. Now the way shortest run generalization works
is shown in this state machine here. Now let us say that we encountered different strings.
Now let us say this is the first string that we encounter aabcb. Now there is no other string
therefore we just build a state machine like this which accepts only aabcb and we haven’t
seen anything else, so we can’t generalize anything else. Now second we encounter the string aac.
So what this means is this state machine should accept not only aabcb but also accept aac.
What does this mean? This means that start from aa and after aa if I get a c I can go directly
to the end state, so it has to accept not just aabcb but also aac. Now let us say that I get the third string,
even here i won’t be able to generalize anything. This is the state machine that accepts aabcb
or aac, so we still haven’t generalize anything. Now let us say I encounter one more string
of the form aabc. Now what is this mean? This means that aabc that is this string, that
is this is a prefix of this thing. That is this is the substring of this thing, this is the prefix
of string of the first one. So aabc this state itself should be a end state. So basically
we come like this here and abc this becomes the end state. Now what we do is we merge both of these end
Now we merge both of these end states, so the final b loops back. In merging the end states, note that we have performed a particular generalization. What does this machine now recognize? It recognizes aabcb*, that is, any number of b's after aabc, and likewise any number of b's after aac. It has seen that after the substring aa and a c, a b may or may not appear, and it has generalized to the statement that any number of b's may appear, including zero. This may or may not be right; there might be some hidden constraint saying that at most 3 b's can appear, say 0, 1, 2 or 3 but not 4, but we do not have that information here. So the state machine generalizes to: after aabc or after aac, zero or more b's can appear and we remain in the end state. We then also look at the tails of all the edges coming into the end state: there is one tail labelled c here and another tail labelled c there. Whenever two or more incoming tails have the same suffix, the corresponding states are also merged. What we finally get is aab*cb*, which means the machine is saying that strings of this language must begin with two a's, can then have zero or more b's, must then have a c, and can then have zero or more b's. Because it has found 0 or 1 b's between aa and c, and 0 or 1 b's after the c, it has performed this generalization. So this is one way of trying to discover the behaviour exemplified by a set of sequences.
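A minimal sketch of the starting point of this process is given below: it builds the most specific acceptor, a prefix tree over the observed strings, and then applies the first generalization step from the example, merging all accepting states into one. The further suffix-based merging from the lecture is not shown, conflicting transitions are not handled, and the function names are only illustrative.

def build_prefix_acceptor(strings):
    trans = {0: {}}                           # state -> {symbol: next state}
    accepting = set()
    for s in strings:
        state = 0
        for ch in s:                          # extend the prefix tree symbol by symbol
            nxt = trans[state].get(ch)
            if nxt is None:
                nxt = len(trans)
                trans[state][ch] = nxt
                trans[nxt] = {}
            state = nxt
        accepting.add(state)                  # the string ends here, so mark it accepting
    return trans, accepting

def merge_accepting(trans, accepting):
    rep = min(accepting)                      # keep one representative accepting state
    merged = {}
    for state, edges in trans.items():
        src = rep if state in accepting else state
        merged.setdefault(src, {})
        for ch, nxt in edges.items():         # redirect edges into any accepting state
            merged[src][ch] = rep if nxt in accepting else nxt
    return merged, rep

trans, accepting = build_prefix_acceptor(["aabcb", "aac", "aabc"])
machine, final = merge_accepting(trans, accepting)
print(machine, final)   # the final state gains a self-loop on 'b': aabcb* and aacb*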
Let us look at the last kind of data for this session, namely streaming data. Streaming data has been of relatively recent interest in the data mining community, especially since mining streaming data has several interesting applications. What is the characteristic of streaming data? Think of streaming audio, streaming video, network traffic and several other such data sets: they are essentially very large, possibly infinite, data sequences. In practice they are finite, but they may as well be infinite, and there is little or no storage available; it is not practical to say that I will store the entire stream in a file and then start mining the file, because if the stream is extremely large, terabytes or more of data could eventually accumulate. Some examples are stock market quotes, streaming audio or video, network traffic and so on. In order to mine, or even simply query, streaming data, there is a notion of running queries, also called standing queries. In a traditional database the data is standing: the data is there and the query slides through the data set to return the answer. In a streaming setting it is the query that is standing, and the data streams through the query, which keeps returning answers as the data passes through it.
So how do we write standing queries, or find aggregate behaviours from them? Let us look at a simple standing query: computing the running mean of a data stream. Suppose I am receiving a stream of numbers and have to recompute their average each time a new number is read; that is a running mean. A simple way to calculate it is to maintain just two variables: n, the number of items read so far, and the running average calculated so far. Whenever I read the next number, all I need to do is compute n times the average (which is simply the sum of all numbers seen so far), add the new number to it, divide by n + 1, and then increment n. As simple as that: as soon as a new number arrives, reconstruct the running sum from n times the average, add the new number, divide by the new count, and bump the count.
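Here is a minimal sketch of this standing query in Python; the class name is an illustrative choice.

class RunningMean:
    def __init__(self):
        self.n = 0            # how many numbers have streamed past so far
        self.avg = 0.0        # their running average

    def update(self, number):
        # old sum (n * avg) plus the new number, divided by the new count
        self.avg = (self.n * self.avg + number) / (self.n + 1)
        self.n += 1
        return self.avg

rm = RunningMean()
for quote in [10, 12, 11, 15]:
    print(rm.update(quote))   # 10.0, 11.0, 11.0, 12.0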
Similarly, this slide shows how to write a running query that computes the running variance. Variance, as you know, is the square of the standard deviation of a data set: for every element x you compute (x − x̄)², the squared difference between the number and the average, and sum these over all elements, so it is essentially a mean of squared distances. To compute the running variance, look at the formula a little more carefully. The variance is the average of (x − mean)² over the numbers read so far, and when you expand the square you get x² − 2·x·mean + mean². So essentially we have to maintain a few running quantities: the sum of the squares of the numbers (every time you read a number, square it and add it to this sum), the number of numbers read so far, and the running sum of the numbers themselves, from which the running average is obtained as in the previous slide. Because the sum of the numbers equals n times the average, the cross term and the squared-average term combine, and the whole expression reduces to variance = (sum of squares)/n − mean². By maintaining these quantities we can easily calculate the running variance at any time by putting each of them in its place. Therefore, even for a long stream of stock quotes telling me how the price of a particular stock is changing, I can maintain the mean price recorded so far and the variance, and I can calculate the standard deviation at any point in time as the square root of the variance; so I know how much the stock has varied over time and what its mean behaviour has been over the entire stream read so far.
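Here is a minimal sketch of the running variance as a standing query, following the expansion above: it keeps the count, the running sum and the running sum of squares, and combines them whenever the variance is asked for. The class name is again an illustrative choice.

class RunningVariance:
    def __init__(self):
        self.n = 0
        self.total = 0.0       # running sum of the numbers
        self.total_sq = 0.0    # running sum of their squares

    def update(self, number):
        self.n += 1
        self.total += number
        self.total_sq += number * number

    def variance(self):
        mean = self.total / self.n
        return self.total_sq / self.n - mean * mean   # (1/n) * sum(x^2) - mean^2

rv = RunningVariance()
for quote in [10, 12, 11, 15]:
    rv.update(quote)
print(rv.variance())           # 3.5; the standard deviation is its square root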
We shall also look at one more algorithm for streaming data, based on what is called gamma consistency, that is, looking for events that are gamma-consistent. What is meant by gamma consistency? The idea is as follows. Suppose an event happens at some point in time. The interestingness of that event is high in its vicinity, right after it happens; if the stock market crashes, interest in that event will be high over the next few days, but over a period of time the interest starts going down, unless of course the stock market crashes again. That is the essential idea behind gamma consistency. First, consider the streaming data to arrive in frames, where each frame comprises one or more data elements. Then look for interesting events within a frame using, say, support-based interestingness: the number of occurrences of the event divided by the number of elements in the frame. We then see which of these events have sustained support over all the frames read so far, with a leakage of 1 minus gamma. That means that in every frame, say every day or every week, we look at events that are interesting with a certain support, and if an event keeps occurring with at least this much support, you can think of a beaker into which you keep pouring the event's support, while the beaker has a small hole underneath through which it leaks at a rate of 1 minus gamma. Over a period of time, the beaker will be full, or will reach a particular level, if and only if the event has sustained support over time; if the event does not sustain, the beaker will eventually empty itself.
the event does not sustain over time eventually, the beaker is going to empty itself. So the level in this beaker is an indication
of two things. One is how sustained is the support for this event and second could also
be how recent was this event. So the more recent the event is the higher the level is going to
be, similarly the more sustained the support for an event is again the higher the level
is going to be. So you can calculate the level like this and then you can again put a threshold for
this level and look at all events which have a particular level or so or level are higher
at any given point in time. So we now come to the end of this second session
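Below is a minimal sketch of one natural reading of this leaky-beaker idea in Python. It assumes the stream arrives in frames, takes an event's per-frame support to be its frequency within the frame, and models the leakage of 1 minus gamma by decaying the level by a factor gamma each frame; all of these are illustrative assumptions.

def gamma_consistency(frames, event, gamma=0.9):
    level = 0.0
    levels = []
    for frame in frames:
        support = frame.count(event) / len(frame)   # interestingness within this frame
        level = gamma * level + support             # leak a little, then pour in new support
        levels.append(level)
    return levels

frames = [["a", "a", "b"], ["a", "b", "b"], ["b", "b", "c"], ["b", "c", "c"]]
print(gamma_consistency(frames, "a"))   # the level rises while 'a' is frequent, then decays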
We now come to the end of this second session on data mining. We have just scratched the surface of the vast area of knowledge discovery from databases, and we have done so in a breadth-first fashion: we looked at representative algorithms for different kinds of data mining problems, whether apriori, classification, clustering, sequence data, language inference or streaming data. But this is still just the tip of the iceberg. That brings us to the end of this session.
