Lecture – 34 Data Mining and Knowledge Discovery

Hello and welcome. In this session we are going to look at a very interesting application of
database technologies, namely the field of data mining and knowledge discovery. In recent years
data mining has become a field that elicits an extremely large amount of interest, not just from
researchers but also from the commercial domain. The commercial utility of data mining is probably
of at least as much interest as the research interest that lies in it. In addition to commercial
interest, data mining has also started a number of public debates, on topics such as legality and
ethics, the right to certain information, the right to non-disclosure of information, the right
to privacy and so on. So data mining has in some sense opened a Pandora’s box, and only time will
tell whether the technology has, in an overall sense, been beneficial or destructive. But then
there is nothing beneficial or destructive about technology per se; it is how we use the
technology that matters.
So in this session we shall concentrate mostly on the technical aspects of data mining. We shall
look at the basic algorithms and concepts that make up data mining, what exactly is meant by data
mining, and how it differs from the traditional way in which databases are used. The overview of
this set of two sessions is as follows. Let us first motivate the need for data mining, that is,
why data mining, and look at some of the basic underlying concepts, the building blocks of data
mining. Then we look at data mining algorithms and several classes of these algorithms. We will
start with tabular mining, as in mining relational tables, look at classification and clustering
approaches, and also look at mining of other kinds of data, such as sequence data or streaming
data. Data warehousing concepts will be covered in a different session altogether. First of all,
why data mining from a managerial perspective? Let us look at what data mining has to offer the
commercial world before we go into its technical aspects. If you were to do an internet search,
or ask a manager why he or she would invest in data mining, you would encounter a variety of
answers. One would say strategic decision making: I mine for certain patterns, certain nuggets of
knowledge, that help in strategic decision making. Somebody would say it is very useful for
something called wealth generation, although there is no precise definition of that term; the
idea is that data mining helps in making the right decisions that can increase my financial
portfolio. Somebody would say, I would use data mining for analyzing trends, analyzing how my
customers behave or how a particular market is behaving, and so on. And more recently data mining
has been used extensively for security purposes, especially mining network logs or network
streaming data to look for abnormal behavioral patterns, or patterns that might be linked to
abnormal activity in the network or in the system. So security is a relatively recent and very
important application area of data mining. So what is this data mining all about, and
why is this so controversial and why is it so interesting from a technical perspective
at the same time? Data mining is the generic term for looking for hidden patterns and trends in
data that are not immediately apparent from just summarizing the data. Suppose I have the set of
all students and their grades, and I want to look for patterns in how the students are performing
over time, or whether there is some relation between subject A and subject B, say, that a student
who does well in subject A does badly in subject B. Such things cannot be discovered by just
aggregating the data, by just computing an average or a sum. Besides, such things cannot be,
within quotes, discovered if we have to write the queries that find them. If we already knew what
we were looking for, then it is not a hidden pattern any more. If we know that students
performing well in subject A will not perform well in subject B, we know that such a correlation
exists and there is nothing hidden in the pattern anyway. So data mining essentially has no
query; when we perform data mining on a database, we do not talk of any data mining query. It is
the mining algorithm that should give us something which we don’t know. Now, something which we
don’t know is putting it very broadly and vaguely. So data mining is actually controlled by what
are called interestingness criteria: we specify to the database what we understand by an
interesting pattern, let us say correlation between performances in subject A and subject B, or
some kind of trend over a period of time. This is what is interesting for us; now find me
everything which I don’t know about and which is interesting according to these criteria. So when
we talk about data mining, we have a set of data to begin with, that is, a database; we give one
or more interestingness criteria; and the output is one or more hidden patterns which we didn’t
know existed in the first place. Now given this model, when we say patterns, the obvious question
to ask is what type of patterns: when do we say that something is a pattern and something else is
not? To answer that we have to ask two further questions: what is the type of data set that we
are looking at, and what is the type of interestingness criteria that we are using? Is it
correlation between things, or something else? So let us look at the different types of data
that we encounter in different situations. The most common kind of data is tabular data, that is,
a relational database in the form of a set of tables, or the slightly different multi-dimensional
form of database. Any kind of transaction data, let us say the data coming out of an ATM, or out
of the transactional database at a railway reservation counter or at a bank, is tabular in
nature. So it is the most common form of data and a rich source of data to be mined. In addition
to tabular data there are spatial data, where data is represented in the form of points or
regions that have been encoded with certain coordinates, say X, Y, Z coordinates. So each point,
in addition to having certain attributes, also has certain coordinates, and mining in this
context requires us to know the significance of the coordinate system. Then there are temporal
data, where each data element has a time tag associated with it. Temporal data could be streaming
data, for example network traffic, where the set of all packets flowing through a network forms a
fast-flowing stream in which each packet can be given a timestamp; or it could be something like
an activity log, for example your database activity log. There could also be spatio-temporal
data, that is, data tagged by both time and coordinates. Other kinds of data include tree data,
for example XML databases, and graph data, for example biomolecular data, or the World Wide Web,
which is one big graph. Then there are sequence data, like data about genes and DNA; a sequence
is a kind of temporal data where the timestamp need not be explicit. Then there is text data,
that is, arbitrary text, multimedia and so on. So there are several different kinds of data that
can be the source from which we mine for unknown nuggets of knowledge. Similarly, when we talk
about interestingness
criteria, several things could be interesting. If a certain pattern of events or of data keeps
occurring frequently, it might be of interest to us. So frequency by itself is an interestingness
criterion. Similarly rarity: a rare pattern is very interesting to search for when we are looking
at, say, abnormal behavior of a system or of network traffic. So something that happens rarely,
that is away from the norm, is again an interesting pattern. Correlation between two or more
elements, if the correlation is above a threshold, is again interesting, as is length of
occurrence in the case of sequence or temporal data. Then there is consistent occurrence.
Consistency is different from frequency in the sense that, over the entire database, a given
pattern may not be frequent enough. For example, let us say one particular customer comes to a
bank every month on the tenth of the month. If we are looking for frequent banking customers,
this customer would not figure in the result, because this customer comes only once a month
whereas other customers could be coming many times a month. However, if we are looking for
consistency in behavior, then this customer’s behavior is far more consistent than that of
someone who comes, let us say, 10 times the first month, once the second month, 50 times the
third month and so on. So in terms of consistency of behavior across different months, this
pattern is interesting even though it is not frequent. Then there is repetition or periodicity,
which is similar to consistency, except that consistency is measured across the entire set,
across every month if we have divided our database into months, whereas in periodicity the time
interval could vary. If a customer comes to the bank, let us say, 5 times every 6 months, we may
not catch it as part of a consistent pattern analysis, but an algorithm that detects periodicity
in the occurrence of events will be able to detect it. Similarly there are several other criteria
of interestingness that one could think of. Now, there is a common contention that data mining is
the same as statistical inference. In many respects that is true: several concepts from
statistics have been incorporated into data mining, and data mining software uses statistical
concepts and many kinds of statistical algorithms extensively. However, there is a fundamental
difference between statistical inference and data mining, which is perhaps the reason for the
renewed interest in data mining algorithms. Here is the general idea behind data mining versus
statistical inference. What do we do when we talk about statistical
inference? Statistical inference techniques essentially have the following three steps, as shown
in this slide. We start out with a conceptual model, or what is called the null hypothesis. That
is, we first of all form a hypothesis about the system of concern. For example, we hypothesize
that if exams are held in the month of March then the turnout would be higher than if they are
held in the month of June, or something like that. Based on this hypothesis, we perform what is
called sampling of the data set or of the system. Sampling is a very important step in the
statistical inference process; there is a huge amount of literature on what is meant by correct
sampling, or what is called a representative sample, and so on. Based on the sample of data from
the system, we either prove or refute our hypothesis. That is, we say this hypothesis is true
because statistical sampling of the system has shown that it is true, or otherwise that it is
false. Now when we sample, for example if we are performing a statistical inference about user
preferences or some kind of market analysis, we present a questionnaire to different users based
on our null hypothesis, that is, our conceptual model. This questionnaire has been created from
our conceptual model, so it already knows what to look for, and the answers will either prove or
refute the hypothesis. Data mining, on the other hand, is rather the opposite process. In data
mining we just have a huge data set and we don’t know what it is that we are looking for. We
don’t have any null hypothesis to begin with; we just have some notions of interestingness. We
use these interestingness criteria to mine the data set, and usually no sampling is performed:
the entire data set is scanned at least once by the data mining algorithm in order to look for
patterns. So we just have a vague notion of interestingness, based on which we run a data mining
algorithm over the data set. Out of this come certain interesting patterns, which form the basis
for forming a hypothesis. So it is sometimes also called hypothesis discovery. Of course we
cannot discover a complete hypothesis using just data mining, but we do discover patterns using
which we can formulate a hypothesis. So in a sense it is the opposite process of statistical
inference.
Let us look at some data mining concepts. Two fundamental concepts are of interest in the core
algorithms of data mining, especially the apriori-based algorithms. These are what are called
associations and item sets. An association is a rule of the form if X then Y, as shown in this
slide, and it is denoted X → Y. For example, if India wins in cricket then sales of sweets go up.
Here X is the predicate that India wins in cricket and Y is the predicate that sales of sweets go
up. We say that we have discovered such a rule if, based on analyzing the data, we are able to
say that whenever India wins in cricket, the sales of sweets go up. On the other hand, suppose
that for a rule of the form if X then Y we can equally say if Y then X, that is, the direction of
the rule is not important: if India wins in cricket then sales of sweets go up, and if sales of
sweets go up then India has won in cricket, which may or may not be true. If that is the case,
then what we have is called an interesting item set, that is, just a set of items. For example,
people buying school uniforms in June also buy school bags, or equally, people buying school bags
in June also buy school uniforms. So school uniforms and school bags form a set of items which is
interesting by itself. Once we have defined the notions of an association rule and an item set,
we come to the concepts of support and confidence, that is, how do we decide that a rule is
interesting? We say that a rule is interesting, in the sense of frequent occurrence, if the
support for that rule is high enough.
The support for a given rule R is the ratio of the number of occurrences of R to the number of
all occurrences, that is, the fraction of all records in which R holds. We will illustrate the
notion of support in the next slide with an example, where it will become clearer. As for the
confidence of a rule, suppose I have a rule if X then Y. The confidence of the rule is the ratio
of the number of occurrences in which X and Y are both true to the number of all occurrences in
which X is true. So if I know that X is true, with what percentage of confidence can I say that Y
is also going to be true? Let us look at some examples here. Let us say these are data that have
been distilled from purchases of different consumers in a given month. The first consumer has
bought a bag, a uniform and a set of crayons, the second consumer has bought books, a bag and a
uniform, the third has bought a bag, a uniform and a pencil, and so on. Now suppose I take the
item set (Bag, Uniform). What is the support for this item set? Look at all the transactions, or
rows, in which bag and uniform both occur: 1, 2, 3, 4 and 5. Out of a total of 10 rows, 5 have
bag and uniform occurring together. Therefore the support for bag and uniform is 5 divided by 10,
which is 0.5; that is, this data set supports the assertion that bag and uniform will be bought
together with 50% support. What is the confidence for the rule if bag then uniform? That is, with
what confidence can we say that whenever somebody buys a bag, they also buy a uniform? For this
we have to look at all the transactions, or rows, in which bag occurs, not just those with
uniform. Bag occurs in 1, 2, 3, 4, 5, 6, 7, 8 different rows, out of which bag and uniform occur
together in 5 rows. Therefore the confidence for this association rule is 5 divided by 8, which
is 62.5%. That means if some consumer has bought a bag then with 62.5% confidence we can say that
the consumer will also buy a school uniform along with it.
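The two ratios worked out above can be checked with a short Python sketch. The slide's table is not reproduced in this transcript, so the ten rows below are a hypothetical data set, chosen only so that its counts agree with the figures quoted in the lecture (bag in 8 rows, bag and uniform together in 5 rows). The names `TRANSACTIONS`, `support` and `confidence` are illustrative, not taken from the lecture.

```python
# Hypothetical purchase records, one set of items per bill. The slide's table
# is not available in the transcript; these rows are an assumption, chosen so
# that the counts match the figures quoted in the lecture. Unlike the slide,
# the last two rows hold only two items each.
TRANSACTIONS = [
    {"bag", "uniform", "crayons"},
    {"books", "bag", "uniform"},
    {"bag", "uniform", "pencil"},
    {"bag", "uniform", "crayons"},
    {"bag", "uniform", "crayons"},
    {"bag", "pencil", "books"},
    {"bag", "pencil", "crayons"},
    {"bag", "books", "pencil"},
    {"uniform", "books"},
    {"uniform", "pencil"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for row in transactions if itemset <= row)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_rows = [row for row in transactions if lhs <= row]
    if not lhs_rows:
        raise ValueError("left-hand side never occurs in the data")
    both = sum(1 for row in lhs_rows if rhs <= row)
    return both / len(lhs_rows)


if __name__ == "__main__":
    # Bag and uniform occur together in 5 of the 10 rows.
    print(support({"bag", "uniform"}, TRANSACTIONS))
    # Of the 8 rows with a bag, 5 also have a uniform.
    print(confidence({"bag"}, {"uniform"}, TRANSACTIONS))
```

Note that confidence is not symmetric: on the same assumed rows, if uniform then bag has a different confidence from if bag then uniform, because the denominator changes from the bag rows to the uniform rows.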
So the question now is how do we mine, or find out, the set of all interesting item sets and the
set of all interesting association rules. Have a look at the previous slide once again. When we
talked about item sets, we saw a single item set having two elements, but that need not be the
case. Bag by itself could be a single-element item set, uniform by itself could be a
single-element item set, crayons could be a single-element item set, or bag, uniform and crayons
could be a three-element item set, and so on. So item sets could be of any size: size 1, size 2,
size 3, size n. Now we have to find the set of all frequent item sets, that is, all sets of items
that have frequently been bought together as part of this transaction log. How do we do that?
There is a very famous algorithm called the apriori algorithm, which performs this discovery of
all frequent item sets in a very efficient manner. The idea behind the apriori algorithm is shown
in this slide; however, let us not go through the slide in a lot of detail, since it will be
easier to explain apriori through an example. The essential idea is this. Suppose I have any
n-element item set, let us say a 5-element item set, that is frequent. If this 5-element item set
is frequent then all subsets of this item set should also be frequent. This seems obvious, but it
is a very important observation in the apriori algorithm. It means that once I have discovered
the set of all frequent 1-item sets, that is, all item sets of size 1 which are frequent, there
is no need for me to look at any other items when I am looking for frequent 2-item sets: the
frequent item sets of size 2 will be made up of combinations of the frequent item sets of size 1.
So let us illustrate the process of apriori
with an example. Let us take our previous consumer database again, where we have consumers buying
several school utilities like school bags, school uniforms, crayons, pencils, books and so on.
When we ask the apriori miner to mine for all interesting item sets, the interestingness
criterion is frequency, that is, frequent occurrence. Interestingness here is parameterized by a
threshold called the minimum support, or minsup. Let us say the minimum support is 0.3, that is,
we term an item set interesting if its support is at least 0.3. Given this, what are all the
interesting one-element item sets? That is, which one-element item sets occur in at least 30% of
the rows? This data set has a total of 10 rows, therefore we have to look for all one-element
item sets which occur 3 or more times. We see that all of these are interesting, that is, bag,
uniform, crayons, pencil and books. Bag occurs much more than three times, uniform also occurs
more than three times, crayons also occur more than three times, and so on. So all of these
one-element item sets have a support of 30% or more. Now from this, suppose we have to find the
set of all interesting two-element item sets. How do we build it? We just look at all possible
combinations of the frequent one-element item sets: bag uniform, bag crayons, bag pencil, bag
books, uniform crayons, uniform pencil, uniform books and so on. For each such two-element item
set, we have to see how many times it occurs in the data set. We see that only this set of
combinations has a support of 0.3 or more. For example, bag uniform, bag crayons, bag pencil and
bag books, all of the combinations with bag, are interesting. However, uniform and books is not
interesting, that is, it does not occur at least thrice. Let us see how many times uniform and
books occur together: once here and a second time here, so they occur only twice, but we need a
minimum support of three occurrences, so that is not interesting.
Similarly, uniform and pencil is again not interesting. So we have thrown away certain item sets
from our exploration and identified only a smaller subset of the set of all possible combinations
of one-element item sets. Now from this, if we have to look for all three-element item sets, we
have to generate the set of all candidate three-element item sets. What are the candidates?
Perform a union across all possible combinations of the interesting two-element item sets to
create all possible distinct three-element item sets, and then look for those which occur at
least three times in this database. We see that there is only one three-element item set, namely
bag, uniform and crayons, that is interesting, that is, that has a support of at least 30% in
this data set. So you can visualize the apriori algorithm in the form of, let us say, an iceberg.
Such queries are also called iceberg queries when given to databases. At the base there is a
large number of one-element item sets, but once we start combining them we get smaller and
smaller numbers of combinations, and we peak out at a very small number of large item sets which
are frequent. The beauty of the apriori algorithm is that to construct the candidates for the
present iteration it does not need to examine every possible combination of items; it only needs
to consult the results of the previous iteration, that is, the frequent item sets with one
element fewer. The data set is then consulted only to count how often those candidates occur.
Given the explanation with an example, let us go back and look at the apriori algorithm itself,
which will now be a little easier to understand.
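The level-by-level search walked through in the example can be sketched in Python as follows. This is a minimal illustration, not the algorithm as it appears on the slide: the rows are the same kind of hypothetical data (the slide's table is not in the transcript), and the names `count_support` and `apriori` are assumptions made for this sketch.

```python
from itertools import combinations

# Hypothetical purchase records; assumed rows, chosen so that the counts agree
# with the figures quoted in the lecture. Not the slide's actual table.
TRANSACTIONS = [
    {"bag", "uniform", "crayons"},
    {"books", "bag", "uniform"},
    {"bag", "uniform", "pencil"},
    {"bag", "uniform", "crayons"},
    {"bag", "uniform", "crayons"},
    {"bag", "pencil", "books"},
    {"bag", "pencil", "crayons"},
    {"bag", "books", "pencil"},
    {"uniform", "books"},
    {"uniform", "pencil"},
]


def count_support(candidates, transactions):
    """Scan the data once and return {item set: support} for the candidates."""
    n = len(transactions)
    return {c: sum(1 for row in transactions if c <= row) / n for c in candidates}


def apriori(transactions, min_sup):
    """Return {frozenset: support} for every item set with support >= min_sup."""
    items = {item for row in transactions for item in row}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while candidates:
        counted = count_support(candidates, transactions)
        level = {c: s for c, s in counted.items() if s >= min_sup}
        frequent.update(level)
        size += 1
        # Join: unions of two frequent sets from this level with exactly `size` items.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune: keep a candidate only if all its (size - 1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        }
    return frequent


if __name__ == "__main__":
    result = apriori(TRANSACTIONS, min_sup=0.3)
    for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)
```

On these assumed rows with a minimum support of 0.3, the sketch keeps all five single items and five pairs, and the only three-element item set that survives is bag, uniform and crayons, in line with the example. The prune step is where the subset observation pays off: a candidate such as bag, uniform and pencil is dropped without counting, because uniform and pencil was already found to be infrequent.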
Initially we start with a given minimum required support s as the interestingness criterion.
First we search for all individual elements, that is, one-element item sets, that have a minimum
support of s. Then we go into a loop where we look for item sets of size greater than 1. From the
results of the previous search for i-element item sets, we search for all (i+1)-element item sets
that have a minimum support of s. This in turn is done by first generating a candidate set of
(i+1)-element item sets and then choosing only those among them which have a minimum support of
s. This becomes the set of all frequent (i+1)-element item sets. The loop is repeated until the
item set size reaches the maximum, that is, until there are no more candidates to be generated
for the next size or there are no frequent item sets in the current iteration. Now that was about
item sets. A property of an item set is that you consider it as one entity; there is no ordering
among the items in it. It does not matter if somebody buys a bag first or a uniform first or
crayons first. The only thing that we infer is that the items in the set, bags, uniforms and
crayons, are quite likely to be bought together. Therefore, if I am running a supermarket, it
would make sense for me to place bags, school uniforms and crayons next to each other, because
there is a higher probability that all three of them are bought together. But when we are looking
for association rules we are also concerned about the direction of association: if A then B is
different from if B then A. So association rule mining requires two different thresholds: the
minimum support, as for item sets, and the minimum confidence with which we can say that a given
association rule is interesting. So how do we mine association rules using
apriori? Again we shall first illustrate with an example and then come back to the general
procedure. The main idea is the following. Use the apriori algorithm to generate the set of all
frequent item sets. Let us say we have generated a frequent item set of size 3, namely bag,
uniform and crayons, with a minsup of 0.3, that is, a minimum support threshold of 30%. This item
set can be divided into the following rules: if bag then uniform and crayons, or if bag and
uniform then crayons, or if bag and crayons then uniform, and so on. What do these mean? The
first rule means that when a customer buys a bag, the customer also buys a uniform and crayons;
the second means that if a customer has bought a bag and a school uniform, then the customer will
also buy a set of crayons; the third means that if a customer has bought a bag and a set of
crayons, then the customer will also buy a school uniform; and so on. Each of these association
rules has a certain confidence based on this data set. What is the confidence for the rule if bag
then uniform and crayons, that is, if a customer buys a school bag then he or she will also buy a
school uniform and a set of crayons? To calculate it, we first have to look at all the rows where
the customer has bought a bag. There are 1, 2, 3, 4, 5, 6, 7, 8 entries where the customer has
bought a school bag. Among these 8 entries, in how many did the customer also buy a uniform and
crayons? 1, 2 and 3, so there are 3 instances out of 8 where this rule holds. Therefore, whenever
a customer buys a bag, one can say with 3 by 8, or 37.5%, confidence that the customer is also
going to buy a school uniform and crayons. Similarly we can calculate the confidence for each of
the other association rules, such as 0.6, 0.75, 0.428 and so on. Now, given a minimum confidence
as a second
threshold, suppose we say that the minimum confidence is 0.7. Then, out of the rules that we have
generated, we keep every rule that has a confidence of at least 70%. That means we have
discovered the following three rules: if bag and crayons then uniform, if uniform and crayons
then bag, and if crayons then bag and uniform. What does that mean in plain English? It means
that people who buy a school bag and a set of crayons are likely to buy a school uniform as well,
that is, bag and crayons implies uniform. Similarly, people who buy a school uniform and a set of
crayons are also likely to buy a school bag. And if somebody buys a set of crayons, then with 75%
confidence one can say that they also buy a school bag and a school uniform. So again it is a
question of direct marketing or whatever: if somebody is interested in crayons, then you might be
reasonably sure that they are also interested in a bag and a school uniform. Now let us look back
at the algorithm for mining association rules. A simple mechanism is, first of all, to use
apriori to generate frequent item sets of different sizes. At each iteration, we divide each item
set into two parts, an LHS part and an RHS part, that is, the left-hand side, or antecedent, and
the right-hand side, or consequent. This represents a rule of the form LHS implies RHS. The
confidence of such a rule is the support of the entire item set, that is, of LHS and RHS
together, divided by the support of the LHS. Then we discard all rules whose confidence is less
than minconf. So now let us look into the question of how
we prepare tabular data for association rule mining or, let us
say, item set mining. Because we are using a relational
database, you might have had a little doubt when we were
considering a data set like this. There is something peculiar about this data set.
What is peculiar about it? The peculiarity is that it looks like every
customer coming to this store buys exactly three items, which is very unlikely. In practice, such a
data set contains records of variable length. That is, one customer may have bought
just two different items, whereas some other customer may have bought 10 different items,
a third customer may have bought only 5 different items, a fourth customer may
have bought only one item, and so on. So it is not possible to represent this data
set as a well-formed table like this, because it is basically a collection of
item sets of different lengths. In fact the best way to represent this would be in a normalized
form in a database where, for example, the bill number 15563 appears twice, and both
rows refer to the same customer. That is, it is the same customer who has bought books
and crayons. This is not completely normalized, because the date is not really necessary here,
but nevertheless all of these records are of uniform length. If you order them based
on bill number, then we get the set of all different transactions. Now, depending on what we are looking for,
this ordering might make a difference. How does this ordering make a difference
when we are looking at a data set like this? Given a data set like this, performing
group-bys on different fields will yield different kinds of behavioral data sets. So what does that mean? Suppose we
perform a group by on the bill
number on this table. Then each group will represent the behavior of one particular customer;
that is, one bill number represents one particular customer
or one particular transaction. So if we group by bill number and then
run apriori across these groups, we would be getting frequent patterns
across different customers. On the other hand, suppose we group by
date rather than bill number. Then all transactions happening on a given date will come into
one group and all transactions happening on another date will come into another group.
A given date may have transactions from several different customers, but all of them
are now placed in one single group. If we run apriori over
these groups, then we would actually be looking for frequent patterns across different
dates. So we have to interpret what we mean by something
that is frequent based on how we have grouped the data. If we have grouped the data by
customer, then it would show aggregate behavior over the set of all customers with
whom you are interacting. On the other hand, if you have
performed the group by over dates, then it would show you aggregate behavior
over a given time period rather than over the set of all customers. Well, it also includes
the set of all customers, but what is more important here is how the behavior
has changed over time. So if something is frequent over time, it
means that it is in some sense consistent over this entire period of time. So let us summarize what we have learnt in
this session. We started with the notion of data mining and, like I said in the beginning,
data mining is a very interesting subfield of databases which has elicited a lot of
interest not just from researchers and from the technology perspective but
from several other perspectives like the defense or security perspective, the commerce
or business perspective and so on. And there are several debates that have raged on
whether it is right to use data mining to look for certain behavior patterns. For example, would it be right if a government
used data mining over, let us say, the set of all different activities of people to find
out the behavior pattern of any particular individual? There are pros and cons
on both sides of the debate: one would say that for security reasons it is right to look for
behavior patterns, and another would say that for privacy reasons it is not right to look for
behavior patterns, and so on. So it is a topic which is very pertinent
and has spawned a huge amount of interest from several different areas. And data mining is in some sense, I called
it a subfield of databases, but that is not entirely true, in the sense that data mining
and knowledge discovery, many would claim, is a field in itself. That is, it relies on database
concepts as well as several other concepts, like learning theory and statistical inference,
in order to perform data mining. So don’t be surprised if
someone says that data mining is a complete field in itself and is only associated with
databases, not really a subfield of databases. But anyway, data mining, as we said, is the process
of discovering previously unknown patterns, in the sense that we are not really sure
what it is that the database is going to give us, or what new pattern or new nugget
of knowledge, so to say, we are going to learn as part of the data mining process.
As a result there is no query as part of a data mining process; that is, a data mining
algorithm is based around one or more interestingness criteria rather than a given query. And we saw that conceptually it is in
some way the opposite of statistical inference, where we start with a null hypothesis and
either refute or support our hypothesis by statistical sampling of the population.
Here we don’t start with a hypothesis; instead the end result of the data mining process
is a set of patterns which can help us in formulating a hypothesis. We also saw the
notions of item sets and association rules, the concepts of support and confidence,
and two different algorithms: the apriori algorithm for mining frequent item sets, and building on it
the algorithm for mining association rules. In the next session on
data mining, we are going to look at several other algorithms like, say, classification or
discovery. So that brings us to the end of this session. Thank you.
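
As a recap of the mechanics covered in this session, here is a minimal Python sketch. It is an illustration only, not the lecture's own implementation. It assumes purchases are stored as normalized rows of (bill number, date, item). Only bill number 15563 with books and crayons comes from the example discussed in the session; all other rows, the dates, and the function names `group_items`, `count`, `support` and `rules_from_itemset` are invented for illustration. The sketch does not run apriori itself: it takes one candidate item set as given, splits it into every LHS/RHS pair, and keeps the rules whose confidence reaches minconf.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical normalized purchase rows: (bill_number, date, item).
# Only bill 15563 with books and crayons echoes the lecture; the rest is invented.
ROWS = [
    (15563, "2024-06-01", "books"),
    (15563, "2024-06-01", "crayons"),
    (15564, "2024-06-01", "bag"),
    (15564, "2024-06-01", "crayons"),
    (15564, "2024-06-01", "uniform"),
    (15565, "2024-06-02", "bag"),
    (15565, "2024-06-02", "uniform"),
    (15566, "2024-06-02", "crayons"),
    (15566, "2024-06-02", "bag"),
    (15566, "2024-06-02", "uniform"),
    (15567, "2024-06-03", "bag"),
]


def group_items(rows, key_index):
    """Collect the items of all rows sharing the value in column key_index
    (0 = bill number, 1 = date). Each group becomes one transaction."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return list(groups.values())


def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    needed = set(itemset)
    return sum(1 for t in transactions if needed <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return count(itemset, transactions) / len(transactions)


def rules_from_itemset(itemset, transactions, minconf):
    """Split one item set into every non-empty LHS/RHS pair and keep the
    rules 'LHS implies RHS' whose confidence is at least minconf."""
    items = sorted(itemset)
    whole = count(items, transactions)
    kept = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            lhs_count = count(lhs, transactions)
            if lhs_count == 0:
                continue  # LHS never occurs, so confidence is undefined
            # Ratio of counts equals support(whole set) / support(LHS).
            confidence = whole / lhs_count
            if confidence >= minconf:
                rhs = tuple(i for i in items if i not in lhs)
                kept.append((lhs, rhs, confidence))
    return kept


# Group by bill number: one transaction per customer visit.
by_bill = group_items(ROWS, 0)
# Group by date: one transaction per day, mixing several customers.
by_date = group_items(ROWS, 1)

rules = rules_from_itemset({"bag", "crayons", "uniform"}, by_bill, minconf=0.7)
for lhs, rhs, confidence in rules:
    print(lhs, "implies", rhs, "with confidence", round(confidence, 3))

# The same item set has different support under the two groupings.
print(support({"bag", "crayons"}, by_bill), support({"bag", "crayons"}, by_date))
```

Grouping by bill number treats each bill as one transaction, while grouping by date merges all of a day's bills into one transaction, so the same item set can be more or less frequent depending on which grouping is fed to the mining step.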
