Machine Learning at Scale

Welcome to Machine
Learning at Scale. It’s great to have you here. Before getting into the
details of the class, I’d like to tell you
more about myself. My name is Jimi Shanahan. And I’ve been working in the
field of machine learning and data science
for over 25 years. I straddle both
academia and industry, meaning that by day, I work in
Silicon Valley at a startup. And on a part-time basis,
I work at UC Berkeley and at UC Santa Cruz where I
teach a combination of classes, ranging from Introduction to
Machine Learning, Advanced Machine Learning, Distributed
Systems, and Optimization Theory. I’ve been doing this for the
past seven or eight years. But it wasn’t always like this. I spent the first
20 years of my life as a dairy farmer in Ireland. And then I pivoted
into machine learning and artificial intelligence
in the mid-1980s when I did my first degree in
computer science and business at the University of
Limerick in Ireland. Subsequently, I
went to Japan, and I worked on various artificial
intelligence problems at Mitsubishi. And then I came
back to do my PhD. And I was accepted to do
my PhD here at UC Berkeley, as it turns out. But I decided to go to
Europe and do my PhD at the University
of Bristol, where I studied probabilistic
machine learning systems. After that, I went to
work at Xerox Research and also went on
from there to work at Clairvoyance Corporation
in Pittsburgh, Pennsylvania. I started a company
while working at Xerox called Document Souls. It was a fantastic combination
of business and machine learning and a whole variety
of engineering challenges. And I recommend anybody
try a startup. After that, I moved
to Silicon Valley where I became the
founding chief scientist at a company called Turn. Turn is a demand-side platform. It’s an ad-tech company. And by the end of
this class, you’ll understand all of this
technology and jargon hopefully. Subsequently, I joined
another startup, which I founded myself. It’s a consultancy
company where we focused on helping Fortune 500
companies and a whole variety of startups around
data problems. We built a whole
variety of solutions. And ultimately, one of my
clients hired me, NativeX. And currently I work
at NativeX where I am the senior vice
president of data science and chief scientist. And as an aside,
I also kite board. And I had the great privilege
of representing Ireland last year in the World
Kite Boarding Championships in Turkey. OK. Enough about me. Let’s talk about this class. It’s a very timely
class in that there is no shortage of problems
in both the private sector and in the public
sector where these tools and techniques that
we’ll study in this class can be leveraged to great avail. Take, for example, today we
have a social graph where we have around 1.5 billion people. And recently, we’ve
seen the evolution of the open graph, where people
want to connect to things. So for example, I am listening
to Will Smith on Spotify currently. This is creating
a whole new graph between people and things. The Internet of Things
takes this one step further. And by the year
2020, we’re going to have about 30
billion items connected. The idea here is to put sensors
on all types of objects, both animate and
inanimate objects. As a result, we’re going to
have lots more data flowing. Now, I’m happy to report
that machine learning has been evolving quite a bit
over the past few years. We’ve gone through three major
generations of machine learning in the last, say, 10 years. The first generation
focused primarily on single-node
computing, where we tried to load data into
memory and process it there. And so it was pretty limiting. Subsequently, we tried to
make use of general-purpose distributed systems like Hadoop
to perform machine learning. But we found a lot
of limitations there. And then the third generation,
which we are currently using, is based upon
memory-based systems. And the idea is to have a more
fully functioning programming language whereby we can
code up algorithms easily. Now, the whole idea behind
this third generation is to use the MapReduce
flavor of parallelization. And it turns out that
the MapReduce framework is very good at dealing
with problems that are embarrassingly parallel. It turns out that machine
learning algorithms that are of use
in the real world are all very much
embarrassingly parallel. If you don’t understand what
“embarrassingly parallel” means at this point, don’t worry. You won’t be embarrassed
when you learn about it. In this class, we’re going to
follow a seven-step approach to modeling. And any modern-day data pipeline
will follow similar steps. And we’ll start off by
understanding the domain. We’ll collect and
instrument various data. We’ll warehouse the data. We’ll do exploratory
data analysis. We’ll do feature
engineering, then do modeling, lab
testing, and finally, A/B testing in the wild. And we’ll do this at scale. So think of this course as being
organized as a spreadsheet where the rows correspond to
genres of machine learning algorithms. So think of supervised
learning, unsupervised learning, semi-supervised learning,
graph-based algorithms, or hybrids of those. Think of each column
as being related to different types of
algorithms in those categories. Then also think of
columns for case studies. Think of columns for theory. So for example, take
the hybrid genre of supervised
graph-based algorithms. We’re going to focus on
a random walk combined with supervised
machine learning. And we’re going to come up with
a supervised random walk that enables us to predict
the people you’re going to link to in future on
a social network like Facebook. And in fact, we’ll
talk about a study that was conducted over 1.5
billion people, where we had a trillion edges. So imagine doing this at scale. At the end of this class,
you’ll be able to do the same. Now, the emphasis
in this class will be on intuition and practical
examples rather than theory. Now, we will delve into
theory from time to time. So at this point, I would like
to welcome you to the class. I hope you will enjoy it. And I look forward
to seeing you online. Good luck.
