Principles of Scientific Knowledge Engineering


Good day, everyone. This is Jack Van Horn from the University
of Southern California and the Big Data to Knowledge Training Coordinating Center. And I’d like to welcome everyone to the Big
Data to Knowledge, BD2K Guide to the Fundamentals of Data Science
webinar. Today, we’re very fortunate to have Dr. Gully
Burns from the Information Sciences Institute here at the University
of Southern California where he’ll be talking about the principles of scientific knowledge
engineering. Gully Burns,
as he is known to all of his friends, studied physics as an undergraduate at Imperial College
in London, England until one day, he had an epiphany that he
wanted to study how the brain works. And after completing his
doctorate at Oxford in 1997, he began work here at USC in the neuroanatomy laboratory of Professor Larry Swanson and began building software solutions such as NeuroScholar and NeuARt, a neuroanatomical viewer. In 2006, he moved down the street to the Information Sciences Institute, which is located in Marina Del Rey here on the west side of Los Angeles, to begin working on data science solutions and systems for biomedical knowledge engineering. One of his main career goals is to transform
the way in which scientific knowledge is utilized so that scientific discovery
becomes commonplace, powerful, and easy. And as we are going
to begin hearing from Dr. Burns here in a second, I want to ask everybody to, just a
little reminder, if you have any questions for Dr. Burns throughout his presentation,
to use the little question submissions system here in the Go
To Webinar interface there probably on the right side of your screen. Send those in and during the last 10 minutes
or so of the hour, we’ll– I’ll read those off to him and you’ll get a chance to get
some of your questions answered. And I’m sure that he’d be delighted to share
his contact information with you. So should you have any questions,
you can reach out to him directly. So with that, and without further ado, Gully,
thank you so much for taking the time to present to us. We’re really looking forward to it. Thanks again. Thank you, Jack. Can everybody hear me? Well, is my microphone working? It seems as though it’s OK. It is. It is. Thanks. OK, so the organization of the talk today
is broken down into just basically five parts. I may have too
much material so I’ll probably start skipping stuff if we– if it looks as though we’re
running short of time. But
essentially, I’m going to try and just give a high level overview as to what scientific
knowledge engineering is really all about. I’m going to describe how this kind of works
currently in the community, how people build knowledge systems, and how scientific databases leverage the kind of technology that scientific knowledge engineering is all about. But I also want to highlight some failings and questions about the general approach that people have towards scientific knowledge engineering, in that most people, when they think of this discipline, think of knowledge schemas and database schemas. And that's not the whole story. Then, at the end of the talk, if
we’ve got time, I’m going to talk a little bit about my own ideas and about
how I see the world and strategies that I think can help people who are trying to develop
their own systems, you know, how you could go about starting the
process of doing that and what are some important things to think
about when you’re actually kind of getting started with building a system. And I wanted to start– end with a kind of
high level idea of just presenting the kind of notion of artificial scientific
intelligence. What would it look like to have machines be
able to do science? OK, so first question is, what is scientific
knowledge engineering? And I wanted to kind of draw on a book by
Dr. Paul Rosenbloom, who wrote a book on the great
scientific domains. And he actually was talking about computing
as a great scientific domain in the same way that physical sciences, life sciences, and
social sciences are. He kind
of draws a nice idea that says, well, you know, you need– when we think of science,
we think of social science, physical science, and life science. But computer science is a valid kind of companion
of these three, but it also beautifully kind of complements them and works
together with them. And the definition that he has for computing
is one I really liked, which simply says that computing involves
transformation of information. And when we’re thinking about the evolution
of science, we go invariably to the great Jim Gray from Microsoft who– and if
anybody who’s listening kind of wants a good time, they should go to
this website at the bottom to look at some of Jim’s talks. He was a luminary. He’s sadly no longer with us, but
really fantastic work. And his vision was that whereas thousands
of years ago, people would perform empirical science. So you
basically look at the stars and you try to explain them. And then as we progressed, we would develop
analytical approaches to kind of explaining things. You build analytical models to describe the
physics of the motion of planets and such like. And then more recently, we’ve gotten into
computational science where all of a sudden, we have large scale data
sets and large scale analytical models. And we’re able to generate computation on
these things. And then the
final, the kind of fourth paradigm he talks about, is this notion of e-science, where you have a huge amount of data, big data, captured by instruments or generated by simulators, processed by high-powered tools such as workflows, and then placed into a database, a file, a set of files, or a knowledge base. And then the scientist kind of interacts with
the database on those files in order to make predictions and do their
scientific work. And although, you know, I think this is a
beautiful representation of how things have evolved, we’ve
certainly not gotten rid of empirical science or analytic science. So it’s worth thinking that even though we
have evolved to include e-science in the whole
process, we still need to think about empirical science as a process that
is still valid and still important for certain things. And so, you know, the point that I'm making here is really about computing, in other words, the notion that the transformation of information has always been fundamental to scientific work. Whenever we've done science in the past, or will do it in the future, we are always going to need structures that allow us to deal with information, transform it, and make sense of it. So before computers, we had scientific notebooks. This is
a page from Darwin’s field notebook. We have libraries, literature. We have mathematics, scientific theory, classical
statistics, all of these structures that help us manipulate
information not on a computer, but, nonetheless, we’re still manipulating information. And now of course, we have
large scale data analysis methods and simulation tools and databases to help us with this process. But we
shouldn't lose track of the idea that, essentially, computing is a fundamental
aspect of scientific work and has been from the very
start. Now just to kind of dive into a little bit
of philosophy, if you’ll forgive me, I wanted to introduce the term reification. And reification is an interesting kind of
idea that basically, it’s when we try to think of something abstract like a
concept or an idea as a material or concrete thing. And so what’s interesting about this is that
we can think of apparatus– so this is a quote from a book
called Laboratory Life, which is, again, a great read. It just says, it’s
like, we can think of, “apparatus as ‘reified’ theory. So when another member uses the NMR spectrometer,”
for example, “to check the purity of his compound,
he is using spin theory and the outcome of 20 years of basic
physics research.” So in other words, the equipment and the tools
and the systems that we build somehow kind of concretize
abstract scientific theory into a thing, into objects. And those objects can be virtual in the sense
they can exist as algorithms or computer programs, live in a
computer. Or they can be actual physical apparatus like
an NMR spectrometer. And so that’s really what this is about. Scientific knowledge engineering is really
about, how do you turn scientific knowledge into a material or concrete thing? And the process of basically developing information-based
systems to capture, store, and use scientific knowledge
is central to this whole idea. And so, you know, if you were to boil down the essence of what scientific knowledge engineering is really about, as a fundamental challenge, essentially we are asking the question: how do you build a representation of scientific knowledge? In other words, how do you write it down? How do you write scientific knowledge down into a structure that you can then process and use with computers?
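As a toy illustration of what "writing knowledge down in a structure a computer can process" might look like, here is a minimal Python sketch; the fields, names, and the citation placeholder are all invented for illustration, not drawn from any real standard.

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    """A single scientific claim written down as structured, machine-readable data."""
    subject: str      # e.g. an anatomical region or gene
    predicate: str    # the relationship being claimed
    object: str       # the other entity in the claim
    evidence: str     # where the claim comes from (a citation, an experiment ID)

# One neuroanatomy-style statement written down as data rather than prose.
# All names and identifiers here are illustrative placeholders.
claim = Assertion(
    subject="hippocampal field CA1",
    predicate="projects_to",
    object="lateral septal nucleus",
    evidence="PMID:0000000 (placeholder citation)",
)
print(claim)
```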
That, at least in my mind, is the high-level introduction to scientific knowledge engineering. And so let's go into a little bit of how this
has shown up in computational systems and things that we have
access to in the world. So this is taken from a paper which is a nice
recent review of scientific knowledge engineering I’d recommend. And all the way through the talk, I’ll be
citing information from other sources. And hopefully the review– and I’ll
include references so that anyone watching or reading this can see where the information
comes from. By all
means, please follow up on these links and see if you can find things. OK, so Dos Santos and Travassos in their review
describe the kind of basic fundamental idea of what scientific
knowledge engineering is: basically, we're trying to build computational infrastructure. This is the little block
that you see on the right hand side. And we have scientists engaging with the machine
in two ways. First of all, as a knowledge engineer– in
other words, you’re basically trying to build the underlying schema and
infrastructure of the system to make it work. And then the scientist himself interacts with
the system to make inquiries, to ask questions, and to get answers
back. And essentially, that’s a good way of thinking
about the core role and approach that scientific knowledge engineering has. And so at this point, it's kind of fun to
think of the very first biomedical database that was ever created by humans
in a real, concrete way. And this is, of course, the Protein Data Bank, or PDB. And this originated at a Cold Spring
Harbor Symposium in 1972. And it really kind of came about because technology
for getting hold of molecular structural data was becoming more and more
common. And not only that, but you were able to actually
display and showcase your computer programs being created at
that time, very expensive and impressive computer programs, that were able to render the images
of these 3D models and kind of display them for a user
to be able to see. That was revolutionary at the time. And basically, what was interesting about this was that every single molecule representation consisted of a set of 3-D positions of atoms in 3-D space. And up until this point, no one had any real, concrete way of sharing that information within the community. It was a small group of people, all of whom were experts, but they needed to be able to share data. And so they organized a transatlantic collaboration between America and Great Britain and were able to build the resource, which, since 1971, has clearly had an impact in the world, still holds very significant information, and has really motivated research at the highest possible level. And so this is an example of the kind of data, the kind of systems, the kind of things that we're talking about.
the kind of things that we’re talking about. Now, if we fast forward to 2017, there is
a data– there is a publication called Nucleic Acids Research who every
year publish a single article– oh, sorry, a single issue that is devoted to all of the
molecular biology databases that you can possibly think of in the world. And every year, they basically receive papers
that describe these databases. So each paper is a full write-up of each database
in detail. And every year, as you can see, they add about
200 or 300 individual databases to their repository. And since
they’ve started– this is now in its 24th year– since they’ve started, the collection
of databases that they have amassed in terms of the descriptions that
they have consists of 15 categories, 41 subcategories, and there’s
almost 2,000– 1,893 separate databases listed in that collection. And it’s worth noting that 110 of these systems
that they describe are really of the highest possible quality. These
are the so-called golden set of databases that consist of the kind of systems that pretty
much all biologists in the world use, so things like UniProt and IEBB,
these very, very high quality systems that are relied upon within the
community. Now it’s worth noting as well that these are
all molecular scientific databases. So it is a subset of the
systems– of the types of systems that are available. And not only that, but they are– the data
that goes into these databases tends to be more quantitative and more
tractable. So even though it’s complicated, even though
there’s a lot of, huge number of different dimensions of
this information that we need to track and keep and moderate, it’s actually, I would
say, at some level conceptually– excuse me– conceptually simple
and relatively easy to kind of capture. Although, of course, it’s still
very difficult. But it is possible. So the state of the art for scientific databases
currently is that the kind of technology that goes into generating
these systems, all 1,893 of them, is simply normal data-driven web applications. These are databases that serve
scientific communities, but the chances are the technology that underlies them is kind
of similar to the types of websites that you might see online from any
kind of industry or any kind of company. If anything, they probably lag
somewhat behind the curve in terms of the sophistication of the portal interfaces and
such like because we are on scientific budgets and we have to get money
from the government to do this rather than being able to hire, being
able to command the kind of salaries that Google and Facebook and these other places
are able to do. But nonetheless, the technology that underlies
it is pretty standard. Usually, it's something like a website with a relational database back-end, implemented in a system like Oracle, MySQL, or PostgreSQL, or anything like that. And so within this context, what scientific knowledge engineering is really all about here is how you design a database and how you design a web application. And of course, the interesting challenge is that rather than dealing with the kind of information that you would find at Facebook or a business or something like that, you actually have to deal with complex scientific knowledge. And that is, itself, of course, a challenge, because scientific knowledge is hard to understand. It's complicated. It's multifaceted. It has a whole bunch of different things going for it. So, essentially, when we're talking about all these hundreds and thousands of databases that exist online, the question we have to ask from a scientific knowledge engineering standpoint is, how do you structure your data? How do you understand how to put this together? So the first question you have to ask is: how should your database schema describe the data you want to share?
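As a toy illustration of what answering that schema question can look like in practice, here is a hypothetical two-table design sketched with Python's built-in sqlite3 module; the table layout, accession, and values are invented, not the schema of any real resource.

```python
import sqlite3

# A toy, hypothetical schema for a small scientific data-sharing site:
# one table for the entities being described, one for measurements on them,
# plus provenance (which paper each measurement came from).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    protein_id   INTEGER PRIMARY KEY,
    accession    TEXT NOT NULL UNIQUE,   -- e.g. a UniProt-style accession
    organism     TEXT NOT NULL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    protein_id     INTEGER NOT NULL REFERENCES protein(protein_id),
    assay          TEXT NOT NULL,        -- what was measured
    value          REAL NOT NULL,
    units          TEXT NOT NULL,
    source_pmid    TEXT                  -- provenance: the paper it came from
);
""")

conn.execute("INSERT INTO protein VALUES (1, 'P00000', 'Mus musculus')")
conn.execute(
    "INSERT INTO measurement VALUES (1, 1, 'expression level', 2.7, 'fold change', '0000000')"
)

# The web application would serve queries like this one to end users.
for row in conn.execute("""
    SELECT p.accession, m.assay, m.value, m.units
    FROM measurement m JOIN protein p ON p.protein_id = m.protein_id
"""):
    print(row)
```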
The next question you have to ask is: where does the data that you want to put in your database come from? Now, this can come from the literature. And if you do derive it from the scientific literature, maybe you need to hire people to read the literature. In other words, these are curators, or curation staff, who are typically expert scientists themselves and,
therefore, are quite expensive to pay and to maintain. But nonetheless,
in order to be able to actually preserve a half-decent database, you need to have high
quality, high quality curated information. Or maybe you’re– maybe the database that
you're working with wants to store laboratory-based information that isn't published, that is coming off of a machine somewhere. And if that's the case, then
you have to consider how data is formatted. And also, what about metadata? What about the context of the– of
the data itself? So let's say that you have an MRI database in which you want to store magnetic resonance images. You can't just take all of the files that
you have, put them online, and say, hey, everybody go at it. Take a look at this. You have to think about how the data itself
is structured: what describes each individual person that the scans are taken from, what the context is, and how it all fits together. That is itself a very big challenge too. In fact, that's probably a bigger challenge
than storing the original data files. And of course, it’s like the other question
is, how should your website serve the data to your end users? So
you have a bunch of people who need access to this information. What’s the context of the access that they’re
looking for? How do you– what kind of functionality do
you provide? All of these questions are actually scientific
knowledge engineering questions and require you to think about and
construct the system pretty much in the same way that you would do for any other web application
or any other web system. But because it’s scientific data, you have
to think about these things in a more deep way. And then, finally, a question that comes up
that’s important for people in the community is, how are you handling
standardization of your information? So, as you can see, if we have 1,893 databases,
they can’t all use different ways of describing what a protein is. They need to standardize the approach that
they’re using. And there are– as
we’ll get into it, this is actually probably one of the key questions of scientific knowledge
engineering that we’re going to cover today. But, in fact, it’s certainly not an easy–
not an easy thing to kind of address. So the question then comes into, at what stage
in the scientific process do you actually build your system? This
diagram very quickly kind of shows the various different types of options that you have. And there are people who
build systems straight from the laboratory. My colleague Carl Kesselman at ISI has built a system called Deriva that does just this. And then perhaps you– or
you might want to build a system that describes the primary literature, the literature that
kind of captures the experimental data at source. Or maybe you want to build a database that
captures and summarizes the review literature. Each one of these has a different set of parameters and checks and balances that need to
happen. But it’s interesting that the further away
you get from the laboratory, the more kind of, well, the less
efficient the system is and the more kind of pre-chewed that the data can be. It’s actually– and, of course, this efficiency
question is pretty important. It seems silly for us to build, for us to
spend huge amounts of money on funding to scientists to do their work, publish papers,
and then have to spend huge amounts of money on funding to develop
ways of extracting that information from the primary literature in
order to put it to databases, which is actually kind of what we’re doing at the moment. So– OK, so, I’m going back to standardization,
this is a comic from xkcd, which I think captures very well how this
process, sadly, does actually work. If you have a situation where there are many
competing standards, as is the case with biomedical databases and all of
these various different approaches, if you come into the situation say,
oh my god, there’s 14 competing standards, we need to develop one universal standard
that covers everybody’s use cases, and everybody agrees with you and
this is great, then typically, what happens is your standard
becomes yet another example of a competing standard that has to compete with everything
else. Now, this is a
pitfall that a lot of people are aware of and are trying desperately to kind of work
around, but, nonetheless, as you’ll probably see quite soon, this is still
a real problem that causes headaches for scientists all the time. Now, so I just wanted to talk very briefly
about– just to give everybody on the call an idea of the complexity of what a typical, good quality database schema would look like. So there's a system called Chado– I'm not sure if that's the correct pronunciation– which is part of the Generic Model Organism Database (GMOD) project. This is a database that's implemented in PostgreSQL. And essentially, it's designed to provide a standardized schema for genomic data associated with genetic model organisms. Now genetic model organisms are things like mice, safflower, fruit flies, cats, rats, anything, any kind
of standard organism that people use to perform experiments on. And one of the kind of goals of the GMOD
project is to standardize representational processing of data– of different types from
these various different model organisms in a standard way. The database itself is actually quite big. It consists of 18 so-called modules, which
subdivide the task into 133 tables. But it is used by a relatively large number
of databases for different model organisms. And so it’s a– you can consider this to be
a success in the community. And I just wanted to kind of show you that
the basic data structure of these things can actually get quite
complicated. So one of the tables captures the notion of a feature in the database. This is a general-purpose representation of some phenotypic aspect of a model organism, a process or, you know, something happening in the model organism that you want to keep track of. And if you draw out all of the data tables associated with this– including links to the organism, links to publications in the pub table, or locations of the features– this is actually a fairly complicated and in-depth data structure. Understanding this is no joke.
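To give a feel for how one record ends up spread across several linked tables, here is a drastically simplified Python sketch that loosely echoes the feature, organism, pub, and featureloc table names; the fields and example values are invented, and the real Chado schema is far richer than this.

```python
from dataclasses import dataclass

# A drastically simplified echo of the Chado 'feature' neighborhood.
# These class and field names only suggest how one record is spread
# across several linked tables; they are not the real schema.

@dataclass
class Organism:
    organism_id: int
    genus: str
    species: str

@dataclass
class Pub:
    pub_id: int
    citation: str          # the publication the feature was reported in

@dataclass
class Feature:
    feature_id: int
    name: str
    type_cvterm: str       # controlled-vocabulary term naming the feature type
    organism_id: int       # link into the organism table

@dataclass
class FeatureLoc:
    feature_id: int        # link back to the feature
    srcfeature_id: int     # the feature (e.g. a chromosome) it is located on
    fmin: int
    fmax: int

mouse = Organism(1, "Mus", "musculus")
gene = Feature(10, "ExampleGene1", "gene", organism_id=mouse.organism_id)
loc = FeatureLoc(feature_id=10, srcfeature_id=99, fmin=1_000_000, fmax=1_050_000)
paper = Pub(5, "Hypothetical et al. (2010), placeholder citation")
print(gene, loc, sep="\n")
```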
And of course, if you look at the structure of the database as a whole, scaled up to the full schema, it is vastly more complicated, and it is actually quite difficult to understand. In addition, what's important is that the semantics
and the structure of the database is often kind of embedded in the names of the
actual terms that occur. So the CV term is a table for a controlled
vocabulary term. And the nature of the semantics for the database
are going to depend very heavily on the types of terms that go
into that table. So from a scientific knowledge engineering
point of view, being able to understand and build this
kind of system, you have the underlying infrastructure of the database tables that you have to understand
if you’re working with GMOD databases. But you also have to understand the terminology
and you have to understand how the various different terms link together
and link to existing databases and to all these things. So this is just to kind of give you a glimpse
of the level of complexity that’s required in these various different
approaches and systems. OK, so in order to kind of deal with this
notion of standardization across the various different biomedical databases that we will
encounter, people in the community have developed this– developed
this notion, or rather used the notion, of ontologies. Ontologies are a construct from AI, and the best definition I've come across is the one by Gruber in 1993. He says simply that an ontology is a specification of a conceptualization. In other words, it's a computational representation of how concepts come about. And my colleague Ed Hovy in 2005 wrote a beautiful
little chapter that discusses five methods of ontology
construction. And he breaks it down from a high– from a
very high level, looking at the various different communities of people who build ontologies within the computer science community. And he describes them as five different types. So there's the philosophers, the cognitive
scientists, the linguists, computational reasoners, and the domain
specialists. And what’s interesting about this is that
typically– and we’ll go into a little bit of how biomedical
ontologies are largely driven by a philosopher's perspective currently– but there are other approaches, which we won't touch on in detail, derived from linguistics, using just the meanings of words, or from domain specialists, such as biomedical scientists, who just need to represent the information in their community as accurately and straightforwardly as possible. They typically have a
different approach than philosophers. But it’s worth kind of trying to consider
all of them. And I think the thing that I wanted to leave
everybody on the call with, if– just as a kind of a real, concrete piece of
advice– is that if anyone tells you that there’s one, only one correct way of developing
ontology, there’s a standard methodology that everybody uses,
that everyone agrees on, they’re wrong. This thing doesn’t– there
isn’t such a thing. There are schools of thought and there are
people who address the question in a given way,
but there's no one correct way. I think it would do a disservice to anyone to suggest that only one of them is the best way. And of
course, different types of methodologies trying to address this question run into different
pitfalls and problems. So
if we’re going to solve the problem appropriately and powerfully, we need to use all tools at
our disposal to kind of try and make sense of things by leveraging
any and all approaches that work and provide us with actual working
tools. OK, so I wanted to talk about the very important
set of work from the scientific community called the Open
Biomedical Ontologies, actually. And they call themselves the OBO Foundry. And this is an attempt to standardize
ontologies across all of biology, to provide a kind of common framework for building these
knowledge representations that everybody can use in
the same way. This approach encompasses many important standard ontologies, such as the Gene Ontology, the Human Disease Ontology, and one called the Ontology for Biomedical Investigations, OBI, which is significant. And so there are a bunch
of these various different representations of knowledge that we– that you should be
aware of. And the approach broadly uses the Basic Formal Ontology as an upper ontology. And we'll describe very briefly what the Basic Formal Ontology is all about, just to give you an idea of how it all is
structured. And so OBO Foundry lists 156 separate ontologies. And this is very much driven by a community
of philosophers who have a kind of a clear idea
as to how they recommend seeing the world. And part of the
representation, as you’ll see, it tries to provide a kind of global catch-all type of
view of the world, of the things that exist in the world so that we can all kind
of fit our various different– if you want to build a representation of things
from domain A, then that should be able to fit into the schema provided by the– the
Basic Formal Ontology. If you
want to look at other things, they should all fit together. This is the kind of vision of the OBO community,
how they would like to see it. And this is a complicated diagram. Don’t panic. I’m going to kind of walk through it. But this is the kind of high level
representation of how OBO Foundry ontologies are constructed. The top-level object is just a thing, an entity. And then they basically split the universe into two different types, into continuants and occurrents. And so the idea is just like when you talk about a thing– and I'm not going to go into too much detail about how all these various different elements work, and I hope I get this correct– but occurrents are things that do not physically exist in the world as objects, whereas continuants exist in the world as physical objects. So over here,
you see on the left hand side there’s an entity called the material entity. And that could be a supernova or a
neutron star or a cup or a chair or a pen or a neuron, right? Those are all physical objects. And, essentially, the
way in which– the way in which the BFO represents this is they kind of– they try and construct
things that exist in the world. And they distinguish between the different
things in order to kind of construct a kind of global theory of how things
are organized. So, for example, a spatial region like, I don't know, the country of the United States of America is a thing, but it's defined by a spatial region. It's not really the landmass upon which it's built; it's the region that's contained within that. Anyway, I'm getting off topic. So you have continuants and you have occurrents. And occurrents are things like processes. These can be thought of as things, because we can describe them with words, they are concepts. But they are not physical entities in the same way that continuants are. I probably just mangled that, and any ontologists on the call are probably tearing their hair out.
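As a toy way of making the continuant/occurrent split concrete, here is a small Python class hierarchy; the real Basic Formal Ontology is far more elaborate and is normally expressed in OWL rather than code, so treat this only as a mnemonic.

```python
# A toy sketch of the BFO-style top-level split, just to make the
# continuant/occurrent distinction concrete. The real ontology is far
# richer and is not expressed as Python classes.

class Entity:
    """The top-level 'thing'."""

class Continuant(Entity):
    """Something that persists through time (the speaker's gloss: exists as an object)."""

class MaterialEntity(Continuant):
    """A physical object: a neuron, a cup, a neutron star."""

class SpatialRegion(Continuant):
    """A region of space, like the region a country occupies."""

class Occurrent(Entity):
    """Something that happens or unfolds in time: a process, an event."""

class Process(Occurrent):
    """For example, an experiment being run or a cell dividing."""

neuron = MaterialEntity()
cell_division = Process()
print(isinstance(neuron, Continuant), isinstance(cell_division, Occurrent))  # True True
```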
But I just wanted to include this as a significant part of the process of working in the biomedical area, because a lot of people use the Basic Formal Ontology. And you should go away and look at it, look
at this website, and become familiar with it if
you want to do work on this area. Another significant tool, another significant
approach is the BioPortal, which is put together by Mark Musen’s
group in Stanford. And it’s a large-scale knowledge— sorry,
ontology catalog. So it’s a list of various different
schema describing various different domains of knowledge. And it contains 658 ontologies, sizably more than the OBO Foundry collection. But they're pretty non-standardized. Anyone who wants to submit something
to BioPortal is allowed to do so. And so these are usually– they’re not
kind of built with the same unifying principles and constraints of having to conform to a
standard complicated, difficult ontological design. Because you can– as you might be aware, if
you try to use the Basic Formal Ontology as the kind of core representation of your
work, you can fit it into the schema of everything else, but you also
increase your overhead in terms of being able to develop things in the short term. So BioPortal provides a nice catalog of a large number of different representations that you can get into very quickly, without the overhead of having to
fit it into this complex high level schema. And then, finally, I wanted to talk about
the FAIRsharing.org community. FAIR is an acronym that is being adopted widely throughout the community to guide how we develop our systems, our schemas, our knowledge bases, and all of these various different things. It stands for findable, accessible, interoperable, and reusable. And there is a whole set of principles
that are published that one can find online– I don’t think I have a link
here, I’m afraid, I apologize for that– that allows you to, that gives you a
checklist of things you need to do in order to make your tools and systems compatible
with these principles. And
Jack, actually, in a moment of brilliance, kind of suggested that we add the letter E
on FAIR for education. You
know, because I think that’s important because the– a lot of these systems are quite challenging
to understand. And we’re so busy a lot of the time kind of
dealing with computational interfaces systems and such like that we
forget that we need to actually make humans be able to use these things. Another thing about the FAIR sharing system
is that they talk about, they talk about standards, but they also talk
about kind of standards within the community’s policy. So as a repository describing a large number
of these various different artifacts and things that
I was looking at, this is a good place to look. And this is a good place to
find interesting material. Now, no talk about knowledge engineering and
the web and everything would be complete without talking at least
a little bit about linked open data and the linked data cloud. Now, this is– this is a diagram taken from
the lod-cloud.net website. And it's this gargantuan image of all of the
various different parts of the semantic web community that deals with various different
types of knowledge. And the important thing about this is that,
so if you just take a look at this, the diagram as a
whole, you’ll notice that the various different parts of the diagram can be
split up into different sections. And, in fact, the life sciences forms probably
the largest individual subcommunity of the linked data cloud. And essentially, every dot in this diagram,
every circle describes a large-scale knowledge resource, a database if
you like. And the promise of what linked data can do
is that it allows you to actually connect entities from one
ontology, from one knowledge base into other common knowledge bases using ontologies. And so this is kind of
inspiring. If you just sit back and look at this– I mean, we're not going to go into the details of any individual node– this kind of gives you an idea of what could be possible. This is some of the human knowledge that's available in this format currently. And if you think about how every individual database
links together and how it could be used to inform things, that’s actually
pretty cool. Of course, the devil is really in the details with this stuff. And you have to be able to understand the schemas of these various ontologies to be able to query them effectively.
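As a sketch of what querying one node of the cloud looks like, here is a minimal example using the SPARQLWrapper package; the endpoint URL, prefixes, and property names are placeholders, because a real query has to match the actual schema of the resource you are targeting.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# A sketch of querying one node of the linked data cloud with SPARQL.
# The endpoint URL and the property names in the query are placeholders;
# to query a real resource you need to know its actual schema/ontology.
endpoint = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
endpoint.setQuery("""
    PREFIX ex: <http://example.org/ontology/>
    SELECT ?protein ?label WHERE {
        ?protein a ex:Protein ;
                 ex:label ?label .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["protein"]["value"], binding["label"]["value"])
```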
So it's not necessarily an easy task to pull information out of this cloud whenever you want to. But this is an expanding research area, and I think it's pretty cool. So other important stuff that I don't really
have time to talk about from the community are things like, well,
research objects. This is work by a group in Manchester, where they define, well,
these encapsulated packages of scientific knowledge. So in the same way that scientific publication
kind of draws together different elements into a single
place, a research object provides a semantic web-based representation
of everything that you would need to know about a data set, including the data set itself,
provenance about where the data set came from, maybe even a workflow that generated it, maybe even the publication that it was included in. So I'd recommend looking at that as an important aspect and an important
development in the field. And then, of course, workflows are a longstanding piece of research in the community where, essentially, you're putting computational analyses together in order to be able to run programs reproducibly and reliably. But we won't talk more about that now. So let's move on. Now, in this next part– so far
we talked a lot about how knowledge schemas and databases and the
underlying architecture of the systems work. But it’s important to leave with the notion
that schemas are not the whole story, especially when you’re dealing
with the kind of representation of what knowledge is. If you look at
this– so this figure shows you a kind of high-level view of 80,000 NIH grants. Each dot represents an NIH grant that was funded, I think, in 2010– I could be wrong about the exact year. So there are 80,000 dots, one for each grant that was funded by NIH, all colored by the institute that funded the work. And it's important to note that most of the
databases that we’re talking about here probably occur up here, right? There’s all of this kind of like knowledge
and stuff that’s collected into databases in areas where molecular
databases are well-founded. But I don't know of many hematological databases that are constructed with the same level of detail. I'm not aware of any epidemiology databases that are constructed in the same way. There are
individual examples, but there’s nothing like the kind of coverage that you see in molecular
biology. So, you know, just having schema, so the whole
point is that the variety of information you need to put into the
systems is very large. And when you’re kind of dealing with knowledge,
you can’t just talk about the structure of the knowledge itself, which is– in this diagram,
this is a kind of three-level representation of an approach called CommonKADS for modeling knowledge in a knowledge engineering context. And what's interesting about this approach
is they have a whole methodology for trying to understand the context
of the information. Like, what’s the organization? How does the lab work that your information
is contained within? What is the task that the information is trying
to describe? What are the people? What are the agents involved in doing the
work? And then, you know, the actual
representation of the knowledge itself occurs here. But it also requires communication models, to figure out, OK, how is that knowledge used and communicated by the agents trying to solve the problem that the knowledge is useful for? And all of these various different elements combine together in the overall representation of the design model to
give you the overall knowledge representation. And, of course, you know, if you’re focused
only on building ontologies, then you’re only talking about
one small piece of the puzzle, only one part of the overall structure of how you have to think about the modeling approach. And then another thing that's overlooked a
lot of the time is, how do you evaluate knowledge systems? And
according to this book by Adelman and Riedel, they describe this kind of multi-layered representation
of the various different types of evaluation you
should think about. So– and I’ll just, I’ll just put them all
out. I hope you
can see them. Now– and so this is broken down into four
different parts. First of all, when you evaluate a knowledge system, you want to evaluate the requirements and to validate the knowledge in the system: is the way in which the system deals with its knowledge correct? Does it behave appropriately? Very importantly, you know, when you build
a website or you build a system that people are going to use, can you
learn it easily? Does it work well? You have to evaluate it according to that
kind of approach. And then, finally, you
know, performance evaluation, you need to be able to say whether or not the system does
what it’s supposed to do well. And so these are all kind of evaluation metrics
that are incredibly important for people doing the work that
I’m talking about. But if we’re focused only on schema and ontologies
and building these artifacts, we will miss this. And so, you know, at the end of the day, it’s
important to kind of– it’s probably more impactful to build
something that's less well-formed ontologically and more useful for scientists. And that's borne out by the example of the Gene Ontology, which is the most widely used ontology in the world today. It's incredibly successful, and it's incredibly useful and important. And yet, from a kind of classically, philosophically-driven perspective, it takes lots of shortcuts and solves the problem in an ad hoc way. But it does so very well, and part of the reason why it's useful is because people can understand it. OK, so I wanted to kind of jump
to– so that kind of concludes the kind of advice giving part of the talk. We’ve got about five minutes left as far as
I can see. So I’ll skim through the rest of it. And here I wanted to talk
about some high level ideas and kind of future directions of what we should be kind of thinking
about. And the work of Thomas Kuhn is an interesting
guide in this process. Because Thomas Kuhn kind of described the
notion of scientific paradigms. And his notion is that a paradigm is basically a set of accepted examples of scientific practice– I think of this as almost like a community of various different elements, which includes laws, theories, applications, instrumentation, social models of how scientists work together within a specific community, and a coherent tradition of scientific research. A paradigm can be thought of as everything that contributes to how that science works. And Kuhn, in his seminal work, The Structure of Scientific Revolutions, talks about how normal science works– there is this kind
of big cycle that works where you’re going along in normal science and then you find
something that’s wrong. You
find an anomaly. And then the anomaly is so great that perhaps
you enter into crisis. And your faith in your
paradigm is shaken. And you stop believing that, you know, classical
physics works. And you have to invent an
entire new approach. You create quantum mechanics. You go through the [INAUDIBLE]. And then you have to
kind of figure out how this new paradigm works within the context of normal science. And it's important to note that this very
rarely happens. In the kind of normal scientific work that we do day to day, the kinds of earth-shattering anomalies that form the basis of this notion of scientific revolution very rarely happen. Instead, we find ourselves very
much in the normal science process where it’s puzzle
solving. Now, so within– and it’s important to note
that there are many different paradigms all kind of overlapping and working alongside each other that do not necessarily talk to each other. There's a kind of silo effect. So if you work in molecular biology and you
work on, let’s say that you work on the molecular biology of
schizophrenia, for example, you’ll look at pathways, you’ll look at proteins, you’ll
look at molecules, but you won’t necessarily look at patients. You’ll read the literature about patients,
but you don’t understand how it works. So, in
a way, one can think of these– there are two paradigms kind of working together in
a coordinated way, but they don’t necessarily communicate. And part of the work that I’m trying to do
is to figure out how we– how more easily can we get these paradigms to
kind of work together. So I’m kind of getting off the subject a little
bit, but instead of talking about this large-scale paradigm-shift pattern, a better model is treating scientific knowledge evolution as a kind of abductive argumentation process, after [INAUDIBLE]. And an even better-known model, again, by
the luminary Carole Goble is this notion of knowledge turns. And a
knowledge turn is described in this approach by the notion of the cycle of scientific investigation: the idea that scientists start off with scientific knowledge, they then ask scientific questions, and, given a question, they design experiments, they then execute experiments to get data, and from the data they get new knowledge. And essentially, what we're trying to do is
to provide a framework to speed up the process of going
around the cycle. I am actually running out of time, so I’m
going to just quickly skim over this work. This is actually my contribution. This is something called knowledge engineering
from experimental design. The idea is that one way of dealing with knowledge in a principled way is to draw out the scientific protocol of whatever experiment you're talking about. You can then trace the provenance back through the protocol, indicated by this red line, from the kind of starting point in the experiment. And if you look at these various different elements– the idea is that each square block is an entity, each circle is a process, and they're variously parameterized by these kind of blocks, such as animal number and such like. If you trace back through the protocol in this way, you can then build a knowledge representation that's very generic and works very well across different methods. I won't go into it. If you're interested, please look at this
paper from Russ et al., 2011. That should tell you more. And the take-home of this is that if you approach the methodology of modeling experiments using this kind of technique, you should be able to go beyond the molecular biology experiment and represent data from all sorts of different things.
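Here is a toy sketch of that idea of tracing back through a protocol of entities and processes; it is only an illustration of the concept, not the actual KEfED implementation described in the paper, and the step names and parameters are invented.

```python
# A toy sketch of modeling an experiment as a protocol graph: entities
# (square blocks) flow through processes (circles), each process carrying
# its own parameters. All step names and parameter values are made up.

protocol = [
    # (process, inputs, outputs, parameters)
    ("inject_tracer", ["animal"], ["injected_animal"], {"animal_number": 12}),
    ("perfuse_and_section", ["injected_animal"], ["tissue_sections"], {"section_thickness_um": 50}),
    ("image_sections", ["tissue_sections"], ["labelling_measurement"], {"microscope": "unspecified"}),
]

def trace_back(target: str):
    """Walk the protocol backwards from a measurement to the processes and
    parameters it depends on (the 'red line' through the design)."""
    needed, lineage = {target}, []
    for process, inputs, outputs, params in reversed(protocol):
        if needed & set(outputs):
            lineage.append((process, params))
            needed |= set(inputs)
    return list(reversed(lineage))

for step in trace_back("labelling_measurement"):
    print(step)
```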
OK, and then finally, I'll finish on a kind of high note, talking about artificial scientific intelligence, because we talk about AI all the time. And I wanted to highlight the work of Ross
King who is one of my heroes. He’s awesome. He works in Manchester University. And he’s built, he’s spent his career building
robots that kind of go around the cycle of scientific investigation
automatically without humans. This is a robot that can reason scientifically,
construct its own theories, it comes up with its own hypotheses, and it tests those
hypotheses through experiment. It’s an absolute, you know, it’s really,
really interesting. Now, the thing about this is, though, it takes
a lot of– it’s a kind of a preprogrammed environment. The robot can
only ask one kind of question that it’s programmed to look at. It can’t think independently. And so one kind of big
high-level idea I have is, could we use educational methods that we normally score scientific
expertise in the high-level idea I have is, could we use educational
methods that we normally score scientific expertise in the
subdiscipline, could we apply that to computers and to our computers computer systems? And a colleague of ours, Rochelle Trachtenberg,
has put together a model of, in this case, statistical literacy that
rates people on levels of beginner, functional, skilled, independent, or master or expert. And I thought it would be
interesting, you know, why don’t we apply this to our computational systems? If we were to apply this to our
computational systems, the very best systems that we currently have would be at this level,
beginners, where they basically remember facts. You type in a query, and it tells you what it remembers. What would it take for us to develop computer
systems that are capable of higher levels of reasoning and
intelligence to do with science? And that’s what I want to leave you with as
a final note. So thanks to– this is just a
list of people who have inspired me and helped me and kind of given terrific leadership and
feedback in my career. And, yeah, thank you for your attention. Thank you so much, Gully, that was a really
great overview of scientific knowledge engineering and a number of
different key aspects that you need to think about as you undertake this maybe yourselves
or you’re interacting with systems and kind of giving some insight
into how they're built. One thing I would like to remind everybody of is that if you do have questions, please send them in via the question submission system so that we'll be able to get them in front of Dr. Burns. Gully, as people begin to consider ontologies and how to think about the knowledge or the knowledge domain in which they want to begin modeling, how do they begin and
how do they constrain it? I mean, it seems like you’re sort of throwing
a dart in a dark room, throwing a dart at a dart board hoping you hit
some place, and then that's the place that you begin, and then things start to spider outwards in terms of what you're attempting to model as knowledge. And then how do you stop it so that it doesn't
grow unconstrained and it becomes this intractable
problem? How do you kind of keep it real, if that makes
any sense? Yeah, OK, great, great question, Jack. So I think a good starting point is to look
at, actually, go to the big data U resource and look for courses on data modeling
generally. Now, and essentially there are tons of courses
that you can see about this kind of thing that describe,
you know, how do you develop models in the UML language, for
example, the Unified Modeling Language which is used by software engineers to develop things
like Java libraries and databases and such like. That gives you a kind of high-level way of building schemas, an approach that is widely used within industry. So, you know, master's-level computer scientists
can easily grasp that kind of thing. And then it’s a short jump from really taking
the notion of a class-based or object-oriented schema and converting
that into an [INAUDIBLE] ontology consisting of classes and attributes and elements, right? So and I think that,
you know, a colleague of mine once said, no representation without implementation as a
kind of general rule, you know, alluding to, of course, no taxation
without representation. But I think that that’s a really good idea. So don’t try and represent something that
doesn’t go in your system at some point, you know? Don’t try and
represent something that you don't have an example for. Whenever you represent data or knowledge, ask: what does it look like if you take a real, worked example and flesh it out? And so my advice is that you basically get a pen and paper and draw out a full-scale example of the knowledge that you're trying to work from. Take a look at that and make sure that it's queryable and operable, so that you can actually do stuff with it.
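A minimal version of that pen-and-paper exercise, written as Python so that the "is it queryable?" check is explicit; the facts, relations, and citation placeholders below are entirely illustrative.

```python
# Write one real worked example of the knowledge down as data, then check
# that you can actually query it. Fields and values here are invented.
worked_example = [
    {"subject": "gene X", "relation": "expressed_in", "object": "hippocampus",
     "evidence": "placeholder citation"},
    {"subject": "gene X", "relation": "associated_with", "object": "schizophrenia",
     "evidence": "placeholder citation"},
]

def query(relation: str):
    """The kind of question your eventual system must be able to answer."""
    return [fact for fact in worked_example if fact["relation"] == relation]

print(query("expressed_in"))
```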
How does knowledge engineering take into account counter-examples, if you will? And I'm thinking of, you know, the IS-A and HAS-A relationships, the classic tree diagrams
that ontologies have. And,
you know, if you’re modeling something that’s, you know, a human is– has arms, you know,
and, but there are humans that don’t have arms. And so how can one kind of keep track of alternatives
within an ontological system that still kind of, you know, they’re still
human, but this person was born without arms or had them, you know, they
were lost in an accident or something. How does it handle that sort of stuff? Right. Well, I think that it’s like any piece of
computational architecture. The more complex and rigid you make it, the less likely it is to be useful. And yet the more quick and dirty you build it, the more easily you'll be able to get it up and running, but it won't
be able to solve these individual cases. I think that, again, I go back to this idea
of, what do you want this representation of a person with arms for? Is it a
patient intake form? Is it a representation, you know, what’s the
actual use case that the knowledge is going to be
used for? And then you can do an error analysis and
say, OK, look of all the people we’re going to see, this is not
an important set of information. And, of course, that isn't a kind of universal solution that solves the problem from an AI-complete point of view. But still, that's pretty much my approach. I haven't talked about data modeling in a–
using statistical models and neural networks in this talk. Because
we’re– That was going to be my next question is,
can some– Yeah. —relationships be probabilistic, for example? Or–
Yeah. –yeah, I don’t know. I was going to ask you about that. Yeah, that’s a whole. That’s another talk I think that
there’s– you know, when we’re talking about biomedical databases, as you can see, right,
the databases that people use in the scientific community are
still mostly working with knowledge at the level of tables and attributes
and schema. Right, right. But there is a big push to try and build these
systems that are capable of reasoning over the complex data spaces
that you see in the real world like the space of all microscope images taken from brain
tissue. If you were to sit down and look at those brain images and try to build a taxonomy, an IS-A taxonomy, of the different types of brain images, you would not succeed, right? But I've seen work where people are able to classify tumors or various different kinds of growth in the brain and such like, and are also able to do other
interesting things with the data. So neural networks is very much a, very much
a kind of an important and exciting framework for things to kind of
move forward with. Just in general, I think that’s an observation
that you see in literature. This is definitely kind of
the trend of [INAUDIBLE], probably the most exciting trend that I’ve seen in terms of
where this is going to go. It’s
not obvious exactly how it fits in with the existing symbolic representation of things
as a general open-ended question, but, you know, that’s– I think
that's a very exciting way in which this field has developed. Well, that may very well be the future. Well, we've reached the top of the hour, everyone. Thank you so much for taking part in the Big Data to Knowledge Guide to the Fundamentals of Data Science. Thank you to Gully Burns
for sharing his thoughts on scientific knowledge engineering with us. We really appreciate it, Gully. And keep your eyes peeled in your email for
further announcements about future Guide to the Fundamentals of
Data Science webinars. And until then, everyone, have a great weekend. And we look forward to seeing you again
soon. Thank you so much. And thank you, Gully. Thank you.
