UTS Distinguished Professor Jie Lu – Data – Learning – Decision: Innovation and Impact
Text on screen: Distinguished Lecture Series
Professor Jie Lu, 25 March 2019, Aerial Function Centre, Level 7, Building 10, University of Technology Sydney

Good evening everyone, and welcome to today's lecture. My talk today is about data, learning and decision. When we talk about data, we consider it in a big data environment. When we talk about learning, it is about machine learning, an important part of artificial intelligence. When we talk about decision, it is about how machine learning can learn from data to support decision-making. In a big data situation, data comes in different forms: we have high-speed data, we have data streams, and we have multiple data sources. Machine learning is algorithms or models. How can machine learning learn from data to generate knowledge? That knowledge will then be used to support prediction and decision-making.

So, my talk today has three sections. Section 1 is about data: we will talk about the different forms of data, and we will identify the challenges in data. Section 2 is about machine learning.
We know machine learning has many different types, such as transfer learning, reinforcement learning, deep learning, and many other types and sub-types. Today, however, we will mainly talk about three types of machine learning, chosen to deal with the three challenges we identify in data. In Section 3, we will talk about decision-making. In principle, we have two types of decision-making: one is to support managers, the other is to support customers. So my talk today will cover three parts, but they are linked together: Section 1 is about data, and data is fundamental to machine learning.

Some of us learned machine learning 30 years ago. I learned machine learning at that time, and the principles we learned 30 years ago and the principles now are actually very similar. So why has machine learning become so hot now? Why have so many new types of machine learning algorithms and methods been developed recently? The reason is clear: we have data available. We have large data sets available. We have big data available, which was not available 30 years ago.

So, I first talk about data.
For data, we have structured data and unstructured data; we have collected data and data streams. I will discuss the details one by one.

Databases and data warehouses: a database deals with structured data. All the data we collect is put into a database and organised with ER models, primary keys, foreign keys and so on; it is structured. In many industry applications there is a data warehouse, which is semi-structured. In a big data situation we often have data streams, and we may have lots of data graphs; in that case we have unstructured data, and we cannot use the same methodology to deal with structured data and unstructured data.

From a machine learning point of view, we have static data and streaming data. Sometimes we run a survey or questionnaire, collect the data and put it into datasets. That is static data.
In a big data situation, data often comes as a stream, and in such a case we face new challenges, because we need to consider whether the data distribution stays the same or keeps changing. If changes in the data distribution occur, we need to deal with them.

Labelled data and unlabelled data: this distinction is important for machine learning. We may have many pictures but not know whether each one shows a cat or a tiger; those are unlabelled data. We may have some pictures for which we know which one is a cat; those are labelled data. How to deal with unlabelled data is also a challenge in machine learning.

Based on the above analysis of the data types, we identified three data challenges from a machine learning point of view:
1) in-domain data insufficiency;
2) evolving data distribution;
3) data uncertainty.
Now I will discuss the three challenges from a machine learning point of view in detail.

Challenge 1: Sometimes we have a new market, or a new product, and we want to predict usage, user behaviours and so on, but we do not have enough labelled data to build a prediction model. However, we may realise that in a similar domain a lot of labelled data is available, and prediction models can be built there through machine learning. The two domains are similar, so the idea is: can we transfer the knowledge learned from the domain with plenty of labelled data to support prediction in the new domain, which does not have enough labelled data, for new products, for example? That is the data insufficiency problem.
We will discuss solutions later on.

Challenge 2 is about changes in the data distribution. We have used mobile phones for many years. Six or seven years ago, you only used your mobile phone to make phone calls. Three years ago, the usage was different: you used it for phone calls, to take photos, to access the Internet, and for Google Maps. More recently, you used it for WeChat. What is the usage distribution now? Perhaps 30% of your mobile phone usage is Internet search, another 30% is taking photos, 20% is WeChat, 10% is Google Maps, and only 5% is making phone calls. That is, the usage distribution changes.
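As a toy illustration (with made-up numbers), the shift between the old and new usage patterns can be quantified as a distance between two distributions; when this distance is large, a model trained on the old data is unlikely to carry over:

```python
# Hypothetical usage distributions of a mobile phone, then and now.
old_usage = {"calls": 0.90, "photos": 0.05, "internet": 0.05,
             "chat": 0.00, "maps": 0.00, "other": 0.00}
new_usage = {"calls": 0.05, "photos": 0.30, "internet": 0.30,
             "chat": 0.20, "maps": 0.10, "other": 0.05}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0..1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

drift = total_variation(old_usage, new_usage)
print(f"usage drift: {drift:.2f}")  # 0.85 on a 0-1 scale: a very large shift
```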
In such a case, if you still use the data collected some years ago to build a usage model to predict customer behaviour in the current situation, of course it does not work: you cannot get a high prediction accuracy. So we need to build a new model when we have a new situation; in other words, when the data distribution keeps changing and a significant change occurs, we need to train a new model. That is Challenge 2.

Challenge 3 is data uncertainty.
Data uncertainty includes data value uncertainty: a lot of the data we collect is in linguistic terms, such as "very high", "very fast" or "above a thousand dollars". Sometimes we have missing data and need to decide how to deal with it. And sometimes we have to consider data matching. All of these raise uncertainty problems.

So now we have identified the three challenges in data, and we go to Section 2 to talk about machine learning. We have talked about the forms of data and the three challenges in data, and now we can look at how machine learning can deal with those challenges. In what follows I will give you some ideas of how transfer learning can handle Challenge 1, data insufficiency; how concept drift learning can handle Challenge 2, data distribution drift; and how fuzzy machine learning can deal with data uncertainty, Challenge 3.

What is machine learning?
Basically, machine learning discovers patterns from data, usually pre-processed data, and generates knowledge. The knowledge generated from historical data is then used to support decision-making or prediction.

Now we go through three typical machine learning approaches, one for each of the three data challenges we identified. The first one is transfer learning.
The idea of transfer learning is to borrow knowledge. Suppose we would like to predict the behaviour of zebras, but we do not have enough labelled data: data is insufficient. However, a very similar domain, horses, has plenty of labelled data. Can we use the knowledge learned about horses to support our prediction of zebra behaviour? That is the idea of transfer learning, which aims to handle Challenge 1: in-domain data insufficiency.

Now let me give you the basic concept of transfer learning. In transfer learning you have a source domain, where you have enough labelled data, so you can train a model and generate predictions. And there is a target domain, where you would like to do classification or regression and build prediction models, but you do not have enough labelled data, so it is impossible to train a model. The idea is: how can we use the knowledge learned from the source domain to support model establishment, or output generation, for the target domain?

What is the issue? Sometimes the target domain's feature space is the same as the source domain's (but with different distributions), and sometimes the source domain's feature space is quite different from the target domain's. Sometimes you have some common features, sometimes you do not. That is the challenge given to transfer learning. Here are two models, each with two algorithms, that we developed for transfer learning.

Domain adaptation: here we consider the situation where the source domain and target domain have the same feature space. What is the first idea?
The idea of the first algorithm is to use the source domain to generate initial labels for the target domain, then use the limited number of labelled target data together with the source domain's data to refine those labels. In this case the source domain data is used from beginning to end, and in the target domain we only do label refinement.

The second algorithm has a different idea. From the source domain we build knowledge, which can be described by models or rules, and after the knowledge is established we do not use the source domain's data anymore. We do knowledge refinement and finally generate knowledge for the target domain.
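A minimal sketch of the first idea, pseudo-labelling followed by refinement, using a 1-nearest-neighbour model on made-up 2-D data (an illustration only, not the published algorithm):

```python
def nn_label(point, labelled):
    """1-nearest-neighbour label from a list of ((x, y), label) pairs."""
    def sqdist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(labelled, key=lambda pl: sqdist(point, pl[0]))[1]

# Source domain: plenty of labelled data (hypothetical points).
source = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"),
          ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
# Target domain: mostly unlabelled, with only one true label available.
target_unlabelled = [(0.3, 0.2), (1.2, 1.0), (0.4, 0.3)]
target_labelled = [((0.35, 0.25), "A")]

# Step 1: pseudo-label the target using source knowledge alone.
pseudo = [(p, nn_label(p, source)) for p in target_unlabelled]
# Step 2: refine the labels using source data plus the scarce target labels.
refined = [(p, nn_label(p, source + target_labelled)) for (p, _) in pseudo]
print(refined)
```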
These results have been published in IEEE Transactions on Fuzzy Systems.

Cross-domain transfer learning: here the source domain and target domain have different feature spaces. Sometimes they have some common features, and sometimes there are no common features at all. We developed two strategies to deal with this.
In the first strategy, we bring the common features together and build correlations between the common features and the other features; finally we build correlations between the two domains.
The second strategy is quite different: we build a latent feature space and two mapping functions. There is a source domain and a target domain that share no common features, but we build a latent feature space and find one mapping function from the source domain into the latent space, and another from the target domain into the latent space. In the latent space the two domains have the same feature space, so we can do knowledge transfer.

Currently, fully supported by an ARC Discovery Project, we are working on the case of multiple source domains and one target domain: how to transfer knowledge from a number of source domains to support prediction in a target domain, where the target domain sometimes needs multiple outputs. So how do we transfer from multiple source domains, with multiple outputs, to support the target domain?

So far we have mentioned a number of solutions to Challenge 1. Now we go to the second part, concept drift learning, which deals with Challenge 2, changing data distributions. We know that traditional machine learning assumes that past data and current new data have the same distribution.
What is the main issue? It is determining the best time to update the training model, or update a learning model. That is the key, and we developed a number of technologies to deal with it.

Before going deeper into concept drift learning, I need to mention that there are four types of concept drift: sudden drift, gradual drift, incremental drift, and reoccurring drift. Whichever type of concept drift we face, we can use one framework to describe and deal with it: how to identify the drift, and how to react. We have a data stream, and we need to test it.
Is there any drift, yes or no? If there is no drift, that is fine: you just keep using the model established from the historical data. If yes, there is a drift, then you need to do drift understanding. What does that mean? It means working out where the drift is, when it occurs, and how big it is. After drift understanding, we need to do drift adaptation. So the framework for learning under concept drift has three important stages: drift detection, drift understanding, and drift adaptation.

There are two broad strategies for handling concept drift. One is the "lazy" strategy. What does "lazy" mean? It means we always detect first, and then react. The lazy strategy has three sub-strategies: detect and retrain the model; detect and adjust the model; and detect and adjust not the model but the prediction, the output. The second is the "active" strategy: incremental learning with no detection, learning directly; ensemble learning; and active learning.

Think of a cleaner coming to this room. The cleaner has two strategies. In the first, the cleaner goes around the tables and checks everywhere; everything looks clean except one spot that is a little dirty, so only that spot gets cleaned. That is the lazy strategy: detection and reaction. In the second, the cleaner comes to the room at eight o'clock every morning and, dirty or not, cleans everything. That is the active strategy.

My talk today focuses on the lazy strategy.
We do detection. Under the lazy strategy there are two methodologies. One is tracking the learning model's performance: for example, when you see the accuracy going down, that drop in accuracy means a drift has happened. The other uses a two-sample test: you look at the distribution of sample 1 and the distribution of sample 2, and you can see the changes in the distribution. My talk today focuses on the second methodology, the two-sample test.

To do the two-sample test, I need to mention a competence model developed by my team. Given sample 1 at time t and sample 2 at time t+1, we want to say whether any drift has happened between the two samples.
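For one-dimensional data, a classical two-sample check such as the Kolmogorov-Smirnov statistic is easy to sketch in plain Python; the competence model described next tackles the harder case where samples are high-dimensional and such direct comparison breaks down:

```python
def ks_statistic(sample1, sample2):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    values = sorted(set(sample1) | set(sample2))
    def cdf(sample, v):
        return sum(x <= v for x in sample) / len(sample)
    return max(abs(cdf(sample1, v) - cdf(sample2, v)) for v in values)

window_t  = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5]  # sample 1, taken at time t
window_t1 = [1.1, 1.2, 1.3, 1.3, 1.4, 1.5]  # sample 2, taken at time t+1
print(ks_statistic(window_t, window_t1))  # 1.0: the two windows do not overlap
```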
What is the main idea, and how do we calculate it? Comparing the distributions directly is impossible, so our idea is to transform each sample into a vector; then we can calculate the distance between the two vectors and so identify the change in data distribution between the two samples. To do this, we combine the two samples, mix them together, and establish related sets and related closure sets; we then use the related sets and related closure sets to obtain the two vectors and calculate the distance. That is the competence model, published in the Artificial Intelligence journal. Using this model, you only need to watch the changes in the competence model to identify the changes in data distribution. That is drift detection.

For drift understanding we can consider where, how, and when, but today we only talk about where. We know a concept has drifted, but where is the drift region? We developed a method, published in Pattern Recognition, that can identify the drift region very accurately.

Now we have covered detection and understanding; the last stage is drift adaptation.
As an example, we use case-based reasoning together with the competence model. When new cases arrive from the data stream, they go into case-based reasoning, and you then need to maintain or edit your case base. Here you meet a challenge: redundancy. Too many new cases come in. Before our study, redundancy removal typically kept only the cases near the decision boundary, because those are enough to do classification; but we think some knowledge is lost that way. So we developed new ideas and algorithms for concept drift adaptation that remove redundancy differently: we remove the redundant data but keep the overall shape of the case base, so that you still get the classification decision boundary but, more importantly, you also keep all the knowledge. This is also published in the Artificial Intelligence journal.

The third machine learning part uses fuzzy systems to handle uncertainty and deal with Challenge 3.
We have different types of uncertainty: data values, data measures, data relations, data processing, missing data, and output values may all be uncertain. There are two ways to handle data uncertainty: one is simply to ignore it; the other, which we take, is to model the uncertainty and deal with it, in order to achieve better prediction results. So a number of fuzzy models have been developed, for example fuzzy measures, fuzzy rules, fuzzy relations, fuzzy classification, fuzzy clustering and so on. I will only mention some developments by my team.
For fuzzy data inputs, suppose you only have linguistic terms. Say "very high": what is "very high"? You say someone is "very young": what exactly does that mean? We use membership functions to describe such fuzzy input values.

Outputs can be fuzzy too. Some years ago we received data from Chicago banking systems, covering about 2,000 banks over twenty years, including bank failures and bank survivals. For prediction, you normally cannot say flatly that a bank will survive or will fail; you need to say how strongly it belongs to the survive side and how strongly to the failure side. So we can also use fuzzy sets for the output of a prediction.
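A membership function can be sketched very simply; here is a triangular-style function for the term "young", where the breakpoints 25 and 45 are made-up values purely for illustration:

```python
def membership_young(age):
    """Degree (0..1) to which `age` belongs to the fuzzy set 'young'."""
    if age <= 25:   # fully young
        return 1.0
    if age >= 45:   # not young at all
        return 0.0
    return (45 - age) / 20  # linear decline between the two breakpoints

print(membership_young(20), membership_young(35), membership_young(50))
# -> 1.0 0.5 0.0
```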
Fuzzy measures are used in fuzzy transfer learning. For one instance, we need to find 10 similar instances and 10 dissimilar instances. But how do we find a similar one; what does "similar" mean? We introduced fuzzy sets into the distance measure, so that we can find 10 "fuzzily" similar instances and 10 "fuzzily" dissimilar ones.

Fuzzy clustering: we build knowledge, and knowledge can be described by rules. How can we generate the rules? Rules come from clusters. But within a cluster, some instances are very close to the centre and some are far from it, and some sit between two clusters. So we introduced fuzzy clustering, using membership functions to express how strongly or weakly each instance belongs to a cluster. This is also published in IEEE Transactions on Fuzzy Systems.

From the fuzzy clusters we generate fuzzy rules, and fuzzy rules can be used for knowledge transfer between two domains. In the source domain, we build fuzzy rules through fuzzy clustering, giving a list of fuzzy rules. Then we define a Phi function and optimise it, and finally we obtain fuzzy rules for the target domain and use them to generate predictions in the target domain. This is also published in IEEE Transactions on Fuzzy Systems.
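The fuzzy clustering step can be sketched with a minimal fuzzy c-means loop on one-dimensional toy data (an illustration under simplifying assumptions, not the published method): every point receives a degree of membership in every cluster rather than a hard assignment.

```python
def fuzzy_c_means(data, centres, m=2.0, iterations=20):
    """Minimal fuzzy c-means for 1-D data; returns final centres and memberships."""
    for _ in range(iterations):
        u = []  # u[k][i]: membership of point k in cluster i
        for x in data:
            d = [abs(x - c) + 1e-9 for c in centres]  # tiny offset avoids /0
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                for j in range(len(centres)))
                      for i in range(len(centres))])
        # Move each centre to the membership-weighted mean of all points.
        centres = [sum((u[k][i] ** m) * data[k] for k in range(len(data)))
                   / sum(u[k][i] ** m for k in range(len(data)))
                   for i in range(len(centres))]
    return centres, u

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.8]
centres, memberships = fuzzy_c_means(data, centres=[0.0, 10.0])
print([round(c, 1) for c in centres])  # roughly [1.0, 8.0]
```

Each row of `memberships` sums to one, so a point halfway between the two centres would belong about 50% to each cluster, which is exactly the "how strongly it belongs" idea above.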
Information granularity: this is so important. A very good example: everybody uses a mobile phone at the moment, and your mobile phone now probably has 128 GB of memory. Suppose we want to transfer knowledge from data collected a few years ago. At that time you did not have such a big memory in your mobile phone; a big one then might have had 1 GB. If you transfer one-to-one, 1 GB to 1 GB, of course that is not right. What we need to do is recognise that 1 GB was "large memory" at that time, convert 128 GB to "large memory" now, and transfer "large memory" to "large memory". That is information granularity, and it is a very significant use of fuzzy systems.

Now we have finished Section 1, data, and Section 2, machine learning. We go to the last section: decision-making. In the data part we identified the challenges, and in the machine learning part we developed a number of new machine learning algorithms to deal with those challenges. Now we need to think about how machine learning algorithms can support prediction and decision-making. In this part I will cover decision support systems, which support managers, and recommender systems, which support customers.

For decision support systems, we can mention three main types: model-based; data-driven, which we also call learning-based or machine learning-based; and knowledge-based. We have developed both recommender systems and decision support systems. What is the model of a decision support system? We first talk about decision support systems.
Model-based: this is mathematical programming. You build a program and you get solutions; nothing about machine learning. The next type is data-driven, or machine learning-based, and it comes in two forms. In the first, we still have a program, but its parameters are generated by machine learning, and the method then gives us solutions. In the second, we do not even use a program: we generate solutions directly from data, for example by using machine learning to generate rules and then finding a solution from the rules. So in the first form machine learning generates the parameters, and in the second machine learning directly generates the output. We also have knowledge-based decision support.

Here is an example of model-based decision-making. Assume we have a number of farms as milk producers and a milk transport company with many trucks. The trucks need to go to each farm to pick up milk, but there are many constraints: 1) milk must be picked up within 48 hours; 2) milk can only be picked up once it has been cooled to 4 degrees or less; 3) if you pick up milk from a particular farm, you must pick up all of it and then clean the container. The company would like to minimise the cost of transportation while making sure all the milk is collected. This is model-based decision-making.
For data-driven decision-making, we have two types. In the first, we still build optimisation models that look similar to the model-based ones, but all the parameters come from machine learning, whereas in the model-based type they come from experts. In the second, we do not use any optimisation model at all: we just use data to generate rules, and use the rules to generate decision support options. In the second type, what we need to do is collect raw data, pre-process it, run machine learning, and generate results.

We can list four main ways machine learning affects decision-making. First, data analysis: you can use concept drift detection, for example, to identify customer churn or to generate early warning systems. Second, you can use machine learning to support continuous business evaluation, because the evaluation model needs to be kept up to date. Third, real-time support for decision-making, for example through reinforcement learning. Fourth, when you have a new product or a new market, you need transfer learning to deal with data insufficiency problems.

Knowledge-based systems: here we can use case-based reasoning to build knowledge bases, and at the moment some decision support systems combine knowledge bases with machine learning, so that learned rules are built into the knowledge base and compared with business rules to support decision-making.

That is decision-making as it mainly supports business and managers.
In the final part I would like to mention recommender systems. They are an important part of decision-making, but they aim to support customers and users. For example, we developed a recommender system called "Smart BizSeeker" to help small and medium businesses find partners.

There are three main types of recommender systems: collaborative filtering, content-based, and knowledge-based. Today I will just quickly talk about collaborative filtering, the most popular one. The main idea is a user-item matrix: every user, after shopping online, gives a score to an item, and we can use the scores to identify similar users. What does that mean? For example, you rate item 1 very high and I also rate it very high; you rate item 2 very low and I also rate it very low. That means we have the same preferences: we are similar users.
Suppose user 1 and user 2 are similar users. When user 2 gives a high score to a new item, we assume user 1 will love it as well.
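This neighbour idea can be sketched on a tiny made-up user-item matrix; the similarity measure below is a deliberately simple toy, not the measure used in any of our systems:

```python
# Ratings on a 1-5 scale; user1 has not rated item3 yet.
ratings = {
    "user1": {"item1": 5, "item2": 1},
    "user2": {"item1": 5, "item2": 1, "item3": 5},
    "user3": {"item1": 1, "item2": 5, "item3": 2},
}

def similarity(a, b):
    """Agreement on co-rated items, scaled to 0..1 (toy measure)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    return 1 - sum(abs(a[i] - b[i]) for i in common) / (4 * len(common))

def predict(user, item):
    """Similarity-weighted average of the other users' scores for `item`."""
    votes = [(similarity(ratings[user], ratings[u]), r[item])
             for u, r in ratings.items() if u != user and item in r]
    return sum(s * v for s, v in votes) / sum(s for s, _ in votes)

print(predict("user1", "item3"))  # 5.0
```

User 1 agrees perfectly with user 2 and disagrees with user 3, so user 2's high score for item 3 dominates the prediction, exactly as in the story above.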
That is the main idea of recommender systems, and the outcome is simply a recommendation of, say, the top three items. One system we developed was for a telecom company, where the users are not individuals but small and medium businesses, so every recommendation is a package.

Recommender systems still have challenges. One is sparsity: the user-item matrix is mostly empty. Two is cold start, or new users: I do not have your previous data. Three is accuracy: we still want to increase the accuracy of prediction. And four is uncertainty. Our lab has developed more than ten algorithms for these problems, coming from five or six PhD theses, and today I will quickly go through five of them.
The first addresses cold-start problems and new users: you do not have my data, so how can you give me personalised recommendations? We use a social network to identify more relationships between users.

The second is tree similarity-based recommender systems. This issue comes from the real world: in our contract research with the telecom industry, we realised that an item is not one book, one movie or one piece of clothing; it is a tree, a package. And the user is a small business, a hotel, a real estate company, an agent, whose behaviours and preferences also have a tree structure. So we need to think about how to measure similarity between items, or between users, when each one is a tree, and we developed theories and algorithms to compare the similarity of two trees.

Number three is group recommender systems. Sometimes we generate recommendations not for one person but for a group, and the group members may know each other or may just be an online group. So we developed strategies for generating a recommendation for the group at large. There are two ways: 1) aggregate the members' preferences and generate a group recommendation, or 2) generate personalised individual recommendations and then combine them.
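The two ways above can be sketched on made-up scores (the names and numbers are purely illustrative):

```python
group = {
    "ann": {"item1": 4, "item2": 2, "item3": 5},
    "bob": {"item1": 3, "item2": 5, "item3": 4},
    "cai": {"item1": 4, "item2": 1, "item3": 5},
}
items = ["item1", "item2", "item3"]

# Way 1: aggregate the members' scores first, then recommend the best average.
average = {i: sum(scores[i] for scores in group.values()) / len(group)
           for i in items}
by_average = max(items, key=lambda i: average[i])

# Way 2: take each member's personal favourite, then combine by majority vote.
favourites = [max(items, key=scores.get) for scores in group.values()]
by_vote = max(set(favourites), key=favourites.count)

print(by_average, by_vote)  # both ways pick item3 here
```

On other score matrices the two ways can disagree, which is exactly why the choice of aggregation strategy matters for group recommendation.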
Number four is cross-domain recommender systems, which you can consider a transfer learning application in recommender systems. We may have a domain without enough data to generate recommendations, while a similar domain has a lot of data available. How can we transfer knowledge from that domain to help us predict user behaviours or preferences, and finally generate recommendations, in the first domain? The challenge is that the two domains may have different feature spaces. We developed algorithms to handle this, published in IEEE Transactions on Neural Networks and Learning Systems; the main idea is to transfer the knowledge learned in the source domain to help the prediction model in the target domain.

The final one is user interest-drift recommender systems. Any user, whether a single person or a business, may change their preferences. For example, suppose I am going to buy a property: two weeks ago I mainly looked at townhouses and duplexes, then I realised I did not have enough money, so over the last couple of weeks I have been searching for apartments. If you want to recommend a property to a particular online user, you need to quickly identify such changes in the user's preferences.
Then you can generate accurate personalised recommendations by taking interest drift into account.

Decision support systems and recommender systems rely heavily on machine learning algorithms to deal with these different types of issues, and they can be applied in many areas: pricing, prediction, determining user behaviours, customer relationship management, and also e-Business, e-Learning, e-Commerce, and e-Government. Those are the applications we have developed.

So what have we talked about today? We talked about data, in a big data situation; then about the challenges in data and how to use machine learning to deal with those challenges; and then about how to use machine learning to help decision-making and recommender systems. The main idea is how machine learning can use data to generate models and knowledge to support decision-making.
My talk today has covered quite a number of topics. I would like to take this opportunity to thank my colleagues, my students and postdocs, and all the team members in my lab, and also to promote our Centre for Artificial Intelligence. Thank you.