## Lecture – 35 Data Mining and Knowledge Discovery Part II

Hello and welcome to this second session on data mining.

In the previous session we saw what the concept of data mining is all about. We covered some very fundamental notions of item sets and association rules, and how to discover particular patterns in an item set, that is, how to discover something you don't already know from a data set, using the concepts of support and confidence. Essentially, you give a particular interestingness criterion and then start distilling out patterns from the data set. Let us move further in this session, where we will briefly look at some fundamental, fairly simple algorithms for different kinds of data mining activities, namely discovering classification trees, discovering clusters in data, mining sequence data, and mining streaming data.

Let us briefly summarize what data mining is all about. Data mining is essentially the idea of looking for hidden patterns and trends in data that are not immediately apparent from just summarizing the data. When we say hidden patterns, we mean something we don't already know about; there is nothing hidden if we already knew such a pattern existed in the database. So in a data mining setting there is no query; instead we use an interestingness criterion. We may use frequency, consistency, rarity, or some other criterion, and certain parameters define each of these: frequency, for example, is parameterized by support and confidence for association rules, and by support alone for item sets. Again, there are different kinds of data we can think of: tabular data, spatial data, temporal data, tree data, graph data, and so forth. In this session we shall look specifically at sequence data mining and mining streaming data, in addition to other mining algorithms. The type of interestingness itself can also vary: we could regard frequent patterns as interesting, or rare patterns as interesting, and so on.
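As a quick refresher on those parameters, here is a minimal sketch in Python of how the support and confidence of an association rule might be computed over a toy set of transactions (the data and function names are made up for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also
    contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical toy data: the rule {pen} -> {ink}.
transactions = [{"pen", "ink"}, {"pen", "pencil"}, {"pen", "ink", "ruler"}]
print(support({"pen", "ink"}, transactions))       # 2/3
print(confidence({"pen"}, {"ink"}, transactions))  # 2/3
```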

Now let us move further and look at the concepts of classification and clustering, that is, discovering a classification tree and discovering clusters within a given data set. What is the difference between classification and clustering? Intuitively they both seem to do the same thing, whether you classify a given data set into different classes or cluster it into different clusters. But if you observe closely, classification maps data elements to one of a set of pre-determined classes based on the differences between data elements: data elements a and b belong to different classes if they are different enough. Clustering, on the other hand, groups data elements into different groups based on the similarity between elements within a single group. It is also usually the case that in classification we know the classes a priori, that is, we know all the different classes into which data can be classified, whereas in clustering we often don't know how many clusters we are going to get before the clustering process begins.

Let us look at mining in relation to classification techniques. We are not interested here in the act of classification itself; we are interested in discovering the classification, that is, discovering a decision tree which decides how to classify data elements into different classes. This is best illustrated by an example, so let us take a small example and see how we can discover a classification tree. Suppose we have data about cricket matches that have been played over the last several years in a given city, and this city is notorious for its rains and unpredictable weather. In the past, play sometimes continued and sometimes had to be abandoned. Now we have data like this from past matches.

When it was sunny and the temperature was 30 degrees, play was continued. When it was overcast and the temperature was 15 degrees, play was not continued. When it was sunny and the temperature was 16 degrees, play was still continued, and so on. So sometimes play was continued and sometimes it was discontinued. The classification problem is this: can I classify weather conditions, each a combination of outlook and temperature, into one of two classes, namely whether play is going to continue or be discontinued? That is, what were the weather conditions when play was discontinued, and what were they when play was continued?

There is a well-known algorithm called Hunt's method for the identification of decision trees, and like before, let us first look at an example of how we identify a decision tree before looking at the algorithm itself. The way of identifying the decision tree is quite simple. First of all, because the temperature field here is numeric, it can take many different values, which may be of no interest to us, so let us perform a hand classification of these numerical values into classes. What we have done here is classify temperature into three classes, warm, chilly and pleasant, based on dividing the range of temperatures. Now, because there are two fields here, outlook and temperature, both of them will affect the decision on whether we are going to play.

So how does each parameter affect the decision whether to play or not? Let us look at one parameter at a time, starting with sunny. If you look here, whenever the outlook was sunny, the cricket match was played; it was not abandoned. Sunny occurs only twice here, and in both cases the match was played. Therefore we can directly conclude that if the weather is sunny, regardless of whether the temperature is warm, chilly or pleasant, play will continue; play is not going to be stopped. On the other hand, consider cloudy. When it was cloudy, play was continued in two cases and discontinued in one case. So from cloudy we are still in what may be called a bivalent state: maybe yes, maybe no, we still don't know. Similarly, when the outlook was overcast, in one case they didn't play, in another case they actually played, and in one more they didn't play. So from overcast too we can still only say yes or no; we don't know whether play will continue.

So what we can do now is safely remove the first rule from consideration, since it is a rule we have already discovered: when it is sunny, they are going to play. That leaves the other two cases. Because from cloudy and overcast we are still in a bivalent state, we have to reach a state where this bivalence is removed, that is, where we can conclude either yes or no conclusively. So we now introduce the second parameter, temperature, to see whether we can remove this uncertainty. In the first case the uncertainty is already removed, so there is nothing more we need to do. For cloudy we introduce all three possible cases, warm, chilly and pleasant, and similarly for overcast. Take cloudy and warm: there is only one such case, and play was continued, so we have removed the bivalency and can conclusively state that whenever it is cloudy but the temperature is warm, play is going to continue; we are not going to abandon play. Whenever it was cloudy and chilly, there is only one case, and play was discontinued; again the bivalency is removed, and cloudy and chilly means no: we can conclusively state that play is going to be abandoned if the outlook is cloudy and the temperature is chilly. Similarly, for cloudy and pleasant there is only one case, a yes, so when the outlook is cloudy but the temperature is pleasant we can still conclude that play continues. For overcast and warm there is no entry at all, so we can't decide anything and it remains as it is. Overcast and chilly gives us no, that is, play is going to be abandoned, and overcast and pleasant gives us yes.

So effectively we have removed the bivalency that existed when it was cloudy or overcast, and we have come to know under what conditions play is going to be continued and under what conditions it is going to be discontinued in each case. What we have actually done, therefore, is discover this decision tree. Initially we are in a bivalent state: we don't know whether play is going to be continued or discontinued. In this bivalent state, if we are told that the outlook is sunny, we can immediately conclude yes, we are going to play today. On the other hand, if we are told that the outlook is cloudy, we are still in a bivalent state; we still don't know whether play will continue. So we ask for more information, and when we find out that the temperature is, say, pleasant, we say that yes, play is going to continue. If instead the temperature is chilly, we have reason to believe that play will not continue; the data set tells us that play is going to be abandoned, and so on. So what we have got here is a tree data structure where, from a bivalent state, we eventually go into a univalent state, a state where the uncertainty is removed, and we have classified play into two classes, yes or no: play is going to be continued, or play is going to be abandoned. So let us look at the algorithm a little bit and see how to go about this.

Suppose we are given n different element types and m different decision classes. In the example we just saw, n was equal to 2 (outlook and temperature) and m was also 2 (yes and no). What we do in the main loop is, for each element type i, progressively add element i to the (i-1)-element item sets from the previous iteration. Then, for each such item set, we identify the set of all decision classes it can lead to. If an item set has only one decision class, that means we have already decided: that item set is done and is removed from subsequent iterations. Otherwise we keep continuing until we finish all the element types.
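As a rough sketch of this loop (not the exact pseudocode from the slide), here is one way it might look in Python, using a hypothetical encoding of the cricket table from the example:

```python
# Each row is ((outlook, temperature), decision).
data = [
    (("sunny", "warm"), "yes"), (("sunny", "pleasant"), "yes"),
    (("cloudy", "warm"), "yes"), (("cloudy", "chilly"), "no"),
    (("cloudy", "pleasant"), "yes"), (("overcast", "chilly"), "no"),
    (("overcast", "pleasant"), "yes"),
]

def discover_rules(data, n_fields):
    undecided = [()]                   # item sets still in a bivalent state
    rules = {}
    for i in range(n_fields):          # extend item sets with field i
        values = {row[i] for row, _ in data}
        candidates = [c + (v,) for c in undecided for v in values]
        undecided = []
        for c in candidates:
            # decision classes reachable from this item set
            classes = {d for row, d in data if row[:len(c)] == c}
            if len(classes) == 1:
                rules[c] = classes.pop()   # univalent: a rule is found
            elif len(classes) > 1:
                undecided.append(c)        # still bivalent: refine further
    return rules, undecided

rules, open_cases = discover_rules(data, 2)
# rules[("sunny",)] == "yes"; rules[("cloudy", "chilly")] == "no"
```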

Of course, it could well be the case that even after exhausting all n element types, I may not be able to reach a conclusive decision. It might well be that when it is overcast and chilly, sometimes they actually played and sometimes they didn't. There are several methods to deal with such indecisiveness, for example using probabilities or some kind of fuzzy classification, where we say that if the outlook is overcast and the temperature is pleasant, then they are going to play with a probability of 90%, or something like that.

So let us look further into some clustering techniques. What is meant by clustering, and how does it differ from classification? We saw earlier that there is a philosophical difference between classification and clustering, perhaps not so much in the end result, though even there differences exist, but the most marked difference is philosophical. Classification is based on amplifying the differences between elements so as to make them belong to different classes. Clustering, on the other hand, is based on amplifying the similarities between elements so as to form them into clusters. Clustering essentially partitions the data set into one or more clusters, or equivalence classes. And what is the property of a cluster or an equivalence class? Essentially, the similarity among members within a cluster is much greater than the similarity among members across clusters: members belonging to the same cluster are much more similar to one another than they are to members of other clusters. There are several measures of similarity, most of which reduce to geometric similarity by projecting the data set into an n-dimensional space and then using some distance measure, such as the Euclidean distance or the Manhattan distance, to compute the similarity.

Let us look at the first kind of clustering algorithm, called the nearest neighbour clustering algorithm. It is quite simple: the algorithm takes a parameter t, a threshold on the maximum distance between members of a given cluster. So, given n elements x1, x2, ..., xn and a threshold t, we can find clusters by a very simple process. Initially the set of clusters is empty. Then for each element xj, j = 1 to n, find the nearest neighbour of xj among the elements already seen. Suppose that nearest neighbour is in some cluster m. If the distance to the nearest neighbour is greater than t, the threshold, then we know there is no element within distance t of xj, so xj should belong to a new cluster: create one and increment the number of clusters. Otherwise, assign xj to cluster m, where its nearest neighbour lives. As simple as that: given a threshold, you partition your set of elements into clusters based on each element's nearest neighbour. If the nearest neighbour is within the threshold distance, I join its cluster; otherwise I start a new cluster of my own.
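A minimal sketch of this procedure in Python, assuming one-dimensional numeric points and absolute difference as the distance measure (the data is a made-up toy set):

```python
def nearest_neighbour_clustering(points, t):
    """Assign each point to the cluster of its nearest already-seen
    neighbour, unless that neighbour is farther away than threshold t."""
    clusters = []                        # each cluster is a list of points
    for x in points:
        best, best_dist = None, float("inf")
        for c in clusters:               # nearest neighbour over seen points
            for y in c:
                if abs(x - y) < best_dist:
                    best, best_dist = c, abs(x - y)
        if best is None or best_dist > t:
            clusters.append([x])         # too far away: start a new cluster
        else:
            best.append(x)               # close enough: join that cluster
    return clusters

print(nearest_neighbour_clustering([1, 2, 8, 9, 25], t=3))
# [[1, 2], [8, 9], [25]]
```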

Another kind of clustering technique, which is again quite popular, is called iterative partitional clustering. It differs from the nearest neighbour technique in that here the number of clusters is fixed a priori. In nearest neighbour clustering the number of clusters is not fixed a priori: you don't know how many clusters you are going to get, given a particular threshold and a data set. In iterative partitional clustering the number of clusters is already known a priori, and we are trying to arrange the elements into those clusters; what we don't know is which elements belong to which cluster.

So we are given n elements and k clusters, each with a centre. What do we mean by a centre here? It is the centroid in the statistical sense: if a cluster has several features, the average of these features along all the different dimensions forms the centroid of the cluster. So we have k clusters, each with a centre. Now, for each element, assign it to the closest cluster centre. After all assignments have been made, compute the cluster centroid for each cluster, that is, the average of all the points that make up the cluster; this will possibly shift the centroid to a different location. Once a centroid shifts, the nearest cluster centre may differ for some elements. Therefore we keep repeating these two steps with the new centroids until the algorithm converges, that is, until it stabilizes and the centroids stop shifting; then we know we have found the best centroids for each of the k clusters. Iterative partitional clustering is essentially like saying: suppose I have a data set and I want to create 10 clusters out of it; where would these clusters lie? A nearest neighbour technique, on the other hand, would say: suppose I have this data set and a maximum threshold distance of 5 between elements of a cluster; how many clusters will I find? In the iterative algorithm we are instead interested in where the clusters, or rather the cluster centroids, of these 10 clusters are going to be.
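This procedure is essentially the familiar k-means iteration. Here is a compact sketch in Python, again on one-dimensional toy data with arbitrary initial centres (names are my own):

```python
def iterative_partitional_clustering(points, centres, max_iters=100):
    """Repeat: assign each point to its nearest centre, then recompute
    each centre as the centroid of its points, until centres stop moving."""
    for _ in range(max_iters):
        groups = [[] for _ in centres]
        for x in points:                        # assignment step
            i = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            groups[i].append(x)
        new = [sum(g) / len(g) if g else c      # centroid update step
               for g, c in zip(groups, centres)]
        if new == centres:                      # converged: centroids stable
            break
        centres = new
    return centres, groups

centres, groups = iterative_partitional_clustering([1, 2, 8, 9, 25], [0.0, 10.0])
# centres converge to [1.5, 14.0] on this toy data, with k = 2
```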

Let us now move on and look at other kinds of data sets. Until now we have been looking at tabular data, as in apriori or association rule mining, or some kind of multi-dimensional data. Tabular data can be treated as multi-dimensional data as long as the fields belong to certain ordinal classes; exactly how to convert tabular data into multi-dimensional form is beyond the scope of this session. But as long as the data can be converted to multi-dimensional form, we can use clustering techniques to cluster it, and similarly tabular data can be used to infer classification trees.

Let us now move on to a different kind of data, called sequence data. What do we understand by the term sequence? A sequence is essentially a collection of data elements, and not just a collection but an ordered collection, one in which the ordering matters. Each item in a sequence has an index associated with it, like a subscript: this is the first element, this is the second element, and so on. When we say we have a k-sequence, we mean a sequence of length k, that is, one with k elements in a particular order. There are different kinds of sequence data, for example any transaction log over a period of time, web browsing logs (HTTP logs), DNA sequences, or the medical history of a patient over time, that is, how the history changed and what kinds of events happened. All of these are sequence data.

So let us look at some definitions in mining sequence data, which help us formulate algorithms for finding patterns in sequence data. First of all, a sequence is essentially a list of item sets of finite length: each element in a sequence need not be atomic, it can itself be a set of items. For example, consider the sequence whose first element is {pen, pencil, ink}, whose second element is {pencil, ink}, whose third element is {eraser, ink}, and whose fourth element is {ruler, pencil}. Such a sequence could, for example, denote the purchases of a single customer over time in a particular stationery store: the customer came in the first month and purchased three things, in the second month purchased two, in the third month purchased two more, and so on. Now, the order of items within an item set does not matter, but the order of the item sets themselves matters: this is the first month, this is the second month, this is the third month. So the position of each item set matters, but within an item set it doesn't matter whether I read {pencil, ink} or {ink, pencil}. We define the term subsequence as any sequence with some item sets deleted from it.

Some more definitions. Suppose I take a sequence s' = a1 a2 ... am (note that this is a sequence, not a set, so curly braces would not be appropriate here). We say that s' is contained in another sequence s if s contains a subsequence b1 b2 ... bm of m elements such that each corresponding element of s' is a subset: a1 is a subset of b1, a2 is a subset of b2, and so on. Hence, for example, the sequence {pen}, {pencil}, {ruler, pencil} is contained in the sequence above: {pen} is a subset of the first item set, {pencil} is a subset of the second, and, skipping the third item set to form a subsequence, {ruler, pencil} is a subset of the fourth.
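A small sketch of this containment test in Python, representing a sequence as a list of item sets (the function name and data are my own):

```python
def contained_in(s_prime, s):
    """True if sequence s_prime is contained in s: some subsequence
    b1 b2 ... bm of s satisfies a_i subset-of b_i for every item set a_i."""
    i = 0                                # index into s_prime
    for b in s:                          # greedy left-to-right scan of s
        if i < len(s_prime) and s_prime[i] <= b:
            i += 1                       # a_i fits inside this b
    return i == len(s_prime)

s = [{"pen", "pencil", "ink"}, {"pencil", "ink"},
     {"eraser", "ink"}, {"ruler", "pencil"}]
print(contained_in([{"pen"}, {"pencil"}, {"ruler", "pencil"}], s))  # True
print(contained_in([{"eraser"}, {"pen"}], s))                       # False
```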

So let us look at the apriori algorithm, called AprioriAll or the apriori-gen algorithm, as applied to sequence data rather than to item sets and association rules. The apriori algorithm for sequences looks very similar to the apriori algorithm for item sets. How does it work? First we generate L1, the set of all interesting 1-sequences, where a 1-sequence is a sequence containing just one element. Then, starting with k = 1, while Lk is not empty we generate all candidate (k+1)-sequences, and out of these we keep only the interesting (k+1)-sequences, that is, simply those (k+1)-sequences which have at least the minimum support we have specified. The main question lies in the candidate generation step (statement 3.1): how do we generate all candidate (k+1)-sequences?

Given the interesting sequences L1, L2, ..., Lk, candidate sequences for Lk+1 are generated simply by concatenating all sequences in Lk with the new one-sequences found while generating the previous level. What does this mean? Let us illustrate it with an example. Let us say I have a website, and this data set denotes the different pages that have been visited by users in different usage sessions. One user went from page a to b to c to d to e; another user came to b and went to d, a and e; and so on. As you can see, an element can repeat within a sequence: one user has requested page a four times in a row, and another requested a three times after b, for whatever reason. Now, in order to mine this data set for all interesting subsequences, that is, what tends to be visited before what, let us start with the set of all interesting one-sequences. We have set minsup to 0.5, that is, at least 50% support. What does an interesting one-sequence mean? Essentially, which sequences of length one have appeared in at least half the sessions, here at least 5 times. Counting, a has appeared in 8 different sequences and b in 9, and similarly a, b, d and e all turn out to be interesting one-sequences; c, for example, has appeared just once, so it is not interesting at all as a one-sequence.

Now we generate all possible candidate two-sequences. This is not a combination but rather a concatenation, where order matters: we form all possible concatenations of the interesting one-sequences, so ab is different from ba, ad is different from da, and so on. These are the candidate two-sequences, and we then check which of them have minimum support. Among these, only ab and bd reach the minimum support of 0.5; aa, for example, appears in fewer than five sequences, and similarly for the other candidates. So the only interesting two-sequences are ab and bd. Now how do we generate the candidate three-sequences? We concatenate ab and bd with all the interesting one-sequences found in the previous iteration, which here are still a, b, d and e. Doing so, we find that there are no interesting three-sequences at all, and the process stops. Otherwise, we would have filtered out a few more candidates and again concatenated the survivors with the one-sequences found in the previous iteration. Note that the one-sequences appearing in the second-level results are a, b and d, so at level 4 there would be no need to concatenate with, say, e; it would be enough to concatenate with a, b and d.
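A rough sketch of this level-wise search in Python, on flat sequences of single page visits as in the website example. One detail is a judgment call: following the level-4 remark above, candidates for the next level are formed by concatenating the current level's sequences with the items that still appear at that level:

```python
def occurs_in(cand, seq):
    """True if `cand` occurs in `seq` as a (not necessarily contiguous)
    subsequence."""
    i = 0
    for x in seq:
        if i < len(cand) and x == cand[i]:
            i += 1
    return i == len(cand)

def apriori_sequences(data, minsup):
    support = lambda c: sum(occurs_in(c, s) for s in data) / len(data)
    items = sorted({x for s in data for x in s})
    level = [(x,) for x in items if support((x,)) >= minsup]   # L1
    result = list(level)
    while level:
        alive = sorted({x for c in level for x in c})  # surviving items
        candidates = [c + (x,) for c in level for x in alive]
        level = [c for c in candidates if support(c) >= minsup]
        result += level
    return result

sessions = [("a", "b", "c", "d", "e"), ("b", "d", "a", "e"),
            ("a", "b", "d"), ("b", "d")]               # toy click data
print(apriori_sequences(sessions, minsup=0.5))
# includes ('b', 'd') and ('a', 'b', 'd') on this toy data
```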

With sequence data there is another kind of interesting mining problem, which occurs when we look at sequence data as a behavioural pattern. For example, this is the way users behave on a website: a user comes to page a, then goes to page b, then to pages c, d, e, and so on. We are now confronted with the question: can we model the behaviour of the user? What would be a model that explains how users behave on my website? What this means is that, treating these different sequences as strings generated by some hypothetical machine, we have to find a machine which can generate all of these strings, and possibly other strings that belong to the same class, in whatever sense that is. The simplest kind of machine we can use is the state machine, the deterministic finite automaton. That doesn't mean everything can be modelled by a finite state machine; it is purely for complexity and practical considerations that we assume the model representing user behaviour is given by a finite state machine. So, given a set of input sequences, we have to find the finite state machine that recognizes this class of input sequences. This is also called language inference: given strings of a language, you are trying to infer the grammar, or the structure, of the language. Now where does the trickiest problem in language inference lie? Take a look at these strings.

Let us say I have these four strings: abc, aabc, aabbc and abbc. If I give you these four strings and ask you to create a state machine that recognizes them, it is quite obvious that one could come up with a state machine that accepts these four strings and exactly these four strings. On the other hand, one can also write a machine comprising a single state that loops onto itself and accepts all strings over a, b and c. This is the most general state machine: it is also correct in the sense that it accepts the four strings, but it accepts anything else made of a, b and c in addition. The first machine is the most specific state machine, one that accepts these four strings and these four strings only, nothing else. Now, the challenge, the trickiest problem in language inference, is to find the right kind of generalization. If we build the most specific state machine, it will be of no use, while if we build the most general state machine, it will be useless as well. So when we try to discover a model of user behaviour, we should discover a model which is neither too specific nor too general; it has to have the right kind of generalization. How do we do that? There are several different algorithms that try to generalize a little bit, but not too much and not too little.

We will look at one specific algorithm, which might be termed shortest-run generalization: it generalizes behaviours using what is called a shortest-run technique. As with the previous algorithms, let us first look at an example and then come back to the algorithm. The way shortest-run generalization works is shown in this state machine. Suppose we encounter different strings one by one. The first string we encounter is aabcb; there is no other string yet, so we just build a state machine that accepts only aabcb. We haven't seen anything else, so we can't generalize anything. Second, we encounter the string aac. This means the state machine should accept not only aabcb but also aac: starting from aa, if I get a c I can go directly to the end state. Even here we are not able to generalize anything; this is simply the machine that accepts aabcb or aac. Now suppose I encounter one more string, aabc. What does this mean? aabc is a prefix of the first string, so the state reached after reading aabc should itself be an end state: we trace a, a, b, c, and that state becomes an end state.

Now what we do is merge both of these end states, so the trailing b edge comes back as a loop. Note that in merging these end states we have performed a particular generalization. What does this machine now recognize? It recognizes aabcb*, that is, any number of b's after aabc, and likewise any number of b's after aac. It has seen that a b may or may not appear after aac, and it has generalized to the conclusion that any number of b's may appear, including zero. This may or may not be right: there might be some hidden constraint saying that at most 3 b's can appear, say 0, 1, 2 or 3 b's but not 4, but we don't have that information here. So the state machine has generalized to the fact that after aabc or aac, zero or more b's can appear and we remain in the end state. Next, looking at the end state, we examine the tails of all the edges coming into it. There is one incoming edge labelled c here and another incoming edge labelled c there, and whenever the end state has two or more incoming tails with the same suffix, the corresponding source states are also merged. What we finally get is aa b* c b*: the machine is effectively saying that a string in this language has to begin with two a's, can then have 0 or more b's, must then have a c, and can then have 0 or more b's. Because it found 0 or 1 b's between aa and c, and 0 or 1 b's after the c, it performed this generalization. So this is one way of trying to discover the behaviour that is exemplified by a set of sequences.
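The full shortest-run generalization algorithm is more involved, but here is a toy sketch in Python of the first two steps described above: building a prefix-tree acceptor from the strings seen so far, and then merging the accepting states, which is exactly what makes loops such as b* appear (state layout and names are my own):

```python
def build_prefix_acceptor(strings):
    """Prefix-tree automaton: trans[s] maps a symbol to the next state;
    state 0 is the start state."""
    trans, accept = [{}], set()
    for s in strings:
        state = 0
        for ch in s:
            if ch not in trans[state]:
                trans.append({})
                trans[state][ch] = len(trans) - 1
            state = trans[state][ch]
        accept.add(state)              # the string ends here
    return trans, accept

def merge_accept_states(trans, accept):
    """One generalization step: collapse all accepting states into one,
    so a trailing b edge becomes a b-loop on the merged end state."""
    target = min(accept)
    for row in trans:                  # redirect edges into the end state
        for ch in row:
            if row[ch] in accept:
                row[ch] = target
    merged = {}
    for a in accept:                   # union of the outgoing edges
        merged.update(trans[a])
    trans[target] = merged
    return trans, {target}

trans, accept = build_prefix_acceptor(["aabcb", "aac", "aabc"])
trans, accept = merge_accept_states(trans, accept)
# The merged end state now carries a b-loop: the machine accepts
# aabc b* and aac b*, as in the example (the tail-merging step that
# yields aa b* c b* is not shown here).
```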

Let us look at the last kind of data set for this session, namely streaming data. Streaming data has been of relatively recent interest in the data mining community, especially since mining streaming data has several interesting applications. What is the characteristic of streaming data, and what do we understand by it? Think of streaming audio, streaming video, network traffic and several other such data sets: they are essentially large, possibly infinite data sequences. In practice they are of course finite, but they are potentially unbounded, and there is no, or very little, storage. It is not practical to store the entire stream into a file and then start mining the file: if the stream is infinite or extremely large, terabytes or more of data could eventually accumulate. Some examples are stock market quotes, streaming audio or video, network traffic, and so on. In order to mine, or even just to query, streaming data, there is the notion of running queries, also called standing queries. In a traditional database the data is standing: the data is there, and the query slides through the data set in order to return you the answer. In a streaming data set it is the query that is standing: the data streams through the query, and the query keeps returning you answers as and when the data streams through it.

So how do we write standing queries, or find some aggregate behaviours using standing queries? Let us look at a simple standing query: computing the running mean of a data stream. Suppose I am receiving a stream of numbers and have to maintain the average of the numbers read so far, updating it with each new number: a running mean. A simple way to calculate it is to maintain just two variables: n, the count of numbers read so far, and the running average calculated so far. Whenever I read the next number, all I need to do is compute n times the average, which is the sum of all numbers that have come so far, add the new number to it, divide by n + 1, and then increment the count of numbers read, n = n + 1. As simple as that.
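A minimal sketch of this standing query in Python (class and variable names are my own):

```python
class RunningMean:
    """Standing query over a stream: maintain the mean using only a
    count and the current average."""
    def __init__(self):
        self.n = 0
        self.avg = 0.0

    def update(self, number):
        # n * avg is the sum of everything seen so far; fold in the
        # new number and divide by the new count.
        self.avg = (self.n * self.avg + number) / (self.n + 1)
        self.n += 1
        return self.avg

rm = RunningMean()
for x in [10, 20, 60]:           # a toy stream
    print(rm.update(x))          # 10.0, 15.0, 30.0
```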

Similarly, this slide shows how to write a running query that computes the running variance. Variance, as you know, is the square of the standard deviation of a data set. How do you compute it? For every element, compute the number minus the average, square it, and take the mean of all these squared differences over the data set: the variance is the sum over i = 1 to n of (number_i - average)^2, divided by n. When you expand the square, (number - average)^2 becomes number^2 - 2 * number * average + average^2. So essentially we have to maintain certain quantities as the stream goes by. One is the sum of squares: every time you read a number, square it and add it to the sum of squares maintained so far. You also have to maintain the count of numbers read so far and the plain sum of the numbers, from which the running average is computed as in the previous slide. For the middle term, you can take the average out of the summation, so 2 * average * (sum of numbers) is just the maintained sum multiplied by twice the current average. For the last term there is no summation necessary at all, because the average is a single number: it is just n times the square of the average. Putting it together, if A is the sum of squares, B = 2 * average * (sum of numbers), and C = n * average^2, then the running variance is (A - B + C) / n.

By maintaining all this, we can easily calculate the running variance: just compute each term, put it in its corresponding place, and combine. Therefore, even for a long stream of stock quotes telling me how the price of a particular stock is changing, I can maintain the mean price recorded so far and the variance, and at any point in time I can compute the standard deviation as the square root of the variance. So I know how much the stock has varied over time and what its mean behaviour has been over the entire period read so far. In summary, whenever you read the next number, you update the count, the sum and the sum of squares, recompute the average, and combine these to obtain the variance.
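A minimal sketch of this standing query in Python, maintaining the count, the sum and the sum of squares (names are my own):

```python
class RunningVariance:
    """Maintain count n, sum S and sum of squares A; the variance is
    (A - 2*avg*S + n*avg**2) / n, which simplifies to A/n - avg**2."""
    def __init__(self):
        self.n, self.S, self.A = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.S += x              # running sum
        self.A += x * x          # running sum of squares
        avg = self.S / self.n
        B = 2 * avg * self.S
        C = self.n * avg * avg
        return (self.A - B + C) / self.n

rv = RunningVariance()
for x in [2, 4, 6]:              # a toy stream
    print(rv.update(x))          # 0.0, 1.0, 2.666...
```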

We shall also look at one more algorithm for streaming data, based on what is called gamma consistency, that is, looking for events that are gamma-consistent. What is meant by gamma consistency? Essentially, the idea is as follows. Suppose an event happens at some point in time, say the stock market crashes. The interestingness of that event is high in the vicinity of the event, right after it happens; the interest will be high for the next few days, but over a period of time the interest in that event starts going down, unless of course the stock market crashes again. That is the essential idea behind gamma consistency. First, consider the streaming data to be in the form of frames, where each frame comprises one or more data elements. Then we look for interesting events within a frame using, say, support-based interestingness: the support of an event k in a frame is the number of occurrences of k divided by the number of elements in the frame. Then we see which of these events have sustained support over all the frames read so far, with a leakage of 1 - gamma. You can picture this as a beaker into which you pour the support the event receives in every frame (every day, or every week, or whatever), and the beaker has a small hole underneath through which it leaks at a rate of 1 - gamma. Over a period of time, the beaker stays full, or maintains a certain level, if and only if the event has sustained support over time; if the support does not sustain, the beaker eventually empties itself. So the level in this beaker indicates two things: one, how sustained the support for this event is, and two, how recent the event was. The more recent the event, the higher the level is going to be, and likewise the more sustained the support for an event, the higher the level is going to be. You can calculate the level like this, put a threshold on it, and look at all events whose level is at or above the threshold at any given point in time.
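A small sketch of this leaky-beaker idea in Python (the frame representation and names are my own; each frame is a list of event occurrences):

```python
from collections import defaultdict

def gamma_consistent(frames, gamma, threshold):
    """Pour each event's per-frame support into its beaker, let every
    beaker leak by a factor of gamma per frame (a leakage of 1 - gamma),
    and report events whose final level reaches the threshold."""
    level = defaultdict(float)
    for frame in frames:
        for e in list(level):            # leakage step
            level[e] *= gamma
        for e in set(frame):             # support of e within this frame
            level[e] += frame.count(e) / len(frame)
    return {e for e, v in level.items() if v >= threshold}

frames = [["a", "a", "b"], ["a", "c"], ["a", "b", "a"]]    # toy stream
print(gamma_consistent(frames, gamma=0.5, threshold=0.5))  # {'a'}
```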

So we now come to the end of this second session on data mining. We have just scratched the surface of what is a vast area of knowledge discovery from databases, and we have scratched it in a breadth-first fashion: we looked at several representative algorithms for different kinds of data mining problems, whether apriori, classification, clustering, sequence data, language inference or streaming data. But this is still just the tip of the iceberg. So anyway, that brings us to the end of this session.
