Organizing the World’s Scientific Knowledge to make it Universally Accessible and Powerful


GULLY BURNS: The underlying
model that I’m trying to kind of promote is this notion
that scientific work is a cyclic process. You start off with scientific
knowledge that is contained within a scientist’s head,
most of the time. They then analyze and think
about the work that they’re trying to do. They come up with new ideas
and new theories. And they come up with good
scientific questions that are relevant and important and
actually make a difference in the subject they’re asking about. In order for this to be
scientific work, those questions need to be
addressable in an experimental way. So you have to actually have
experimental designs and protocols that test
and execute the processes for that. And then once you actually have
your scientific protocol, you execute an experimental
design. You acquire data and you
gather the data. You get hold of it. And then you interpret
it to add it to your scientific knowledge. So the whole purpose of what
we’re trying to do is to figure out how we can go around
the cycle more quickly. How can we build tools and
systems to accelerate the process around this so-called
knowledge turn. And the kind of key question is
how do you make scientific discoveries? What is the crucial element of
making scientific discoveries? Well, in some ways we want to
be able to analyze existing knowledge to come up with important questions that are then testable scientifically. If you think about it, this is
the absolute crux of what science is really revolving
around. And it comes down to how do you
find the right questions? How do you actually do that? Now, if you are someone who can
weigh the universe on the back of an envelope, this
is not really a problem. Someone like Einstein, you can
figure this stuff out for yourself without too
much difficulty. You could also be someone like
Rosalind Franklin, who is the kind of poster child of
someone who did really serious, diligent work, really
solved a problem that was important, and unfortunately
wasn’t actually recognized for it. She didn’t get the Nobel Prize
for her work because they don’t offer those things
posthumously. But instead, these guys, Watson
and Crick, were able to come up with exactly the right
question at exactly the right time and solve the problem in
a really kind of innovative and creative and interesting
way. And in a way, again this is
what computers can do. Perhaps computers and
computational technology and knowledge engineering technology
can help us find those questions and answer them
in an appropriate way, which is kind of cool. So a notion that underlies all
of this work that is actually interestingly missing from most
people’s accounts within bioinformatics of how we should address these kinds of questions is this notion of
scientific paradigms. Now, I don’t know if you guys
are familiar with the work of Thomas Kuhn. But in 1962, he wrote the
seminal work called “The Structure of Scientific
Revolutions,” in which he talks about paradigms as being
some accepted examples of scientific practice. Examples which include laws,
applications, and instrumentation together, to
provide models from which spring coherent traditions
of scientific research. And so we have this kind of
underlying idea of what a paradigm is. But what’s really interesting
is how do we actually transform the knowledge within
a paradigm shift, within the framework of a paradigm shift? So let’s just go through kind
of simple example to illustrate what we mean by that,
a paradigm shift from elements to the elements, to
the chemical elements. And in fact, this is a kind
of illustrative example. So typically what happens when
people come across a new phenomenon such as fire, they
try and categorize the different types of fire
that they have. They do stamp collecting. They kind of figure out OK,
this type of fire is very green, this kind of fire is blue, this kind is very bright, that kind of thing. And of course, a lot of science
actually falls into this category at the moment. People come across phenomena
that they don’t understand very well and they try and kind
of explore it and figure out how the different
phenomena work. And of course when they do that,
scientists typically will make up a theory and
they’ll postulate some mechanism that may
not be accurate. So in the case of fire, of
course, the original idea was that fire is the stuff called
phlogiston that has some kind of properties like a fluid
and such like. But of course, it was
the wrong theory. It didn’t work. So what happened there is that
paradigm kind of got stuck. It got in a loop. The predictions that were being
made weren’t accurate. The predictions that were being
made weren’t consistent with the data that was
being gathered. Until of course Lavoisier
came along with the brilliant idea that in fact the correct
mechanism was that the process of combustion involves oxygen
and carbon, and produces water, and releases
energy, which is what fire is all about. Now of course, what’s
interesting about that is that this discovery then formed the
basis of a whole theory, a theory of modern chemistry. And once they were able to do
that, then they were able to export the theory as a kind
of structured, reasoning framework, something that
scientists in other fields could use for their
own purposes. And of course, the periodic
table is a great example of that kind of thing. And that’s actually the kind of
thing that we’re trying to address within the knowledge
engineering work that I’m going to be talking
about today. And of course, once you’ve
actually got a theoretical framework which you can operate
within and work within, then you can develop
new technology. You can do anything you like. You can actually build
combustion engines. You can cure disease. You can do all sorts of stuff. So one of the important
distinctions that I’d like to make about this cycle, this
cyclic process, is this distinction between
interpretations on the lefthand side, and observations
on the right. Now, what’s interesting about
this is that the people who are kind of great at dealing
with these different processes are different people. On the left, you have Professor
Honeydew, who is an expert in the field. He’s steeped in the knowledge
acquired over many generations. He knows the details of exactly
how to interpret everything and he’s the guy
who comes up with new experiments. Whereas on the other hand, you
have Beaker, on the other side, who is capable
of reproducing the experimental data. He is able to execute a well-defined scientific protocol. But he doesn’t necessarily
understand why. And that’s OK. That actually in our case
is a good thing. Now, the distinction of interpretations versus observations within this
context is that the interpretations are semantically
complicated, whereas the observations involve
only really data semantics, which is actually
quite tractable. And on the lefthand side, you
need a human in the loop. On the righthand side,
you don’t. You can automate stuff. On the lefthand side, you have
knowledge that has to be specific to the individual
paradigm. Like if you’re an expert in
neuroanatomy and you know how the brain is structured and functions, when you’re trying to design and develop experiments within that field, you’re going to be pulling on that kind of theory, whereas a neurophysiologist doesn’t understand the details that you will be using. Whereas, if you’re just trying
to process neuroanatomical data, you don’t need
to know that stuff. It’s just data. It’s nice and coherent. And so the idea here is that
the way in which we look at these two different situations
has to be a little different. In order to make interpretations
tractable, you want a scoped representation
of the paradigm that allows you to actually kind of reason
within the structure of that paradigm effectively. Whereas if you’re working with
observations, as long as you keep it restricted to data and
measurements, then it’s tractable and you can calculate
all sorts of statistics and statistical
effects quite happily. So that’s an important
distinction to make. And because we are trying to
keep it simple and because we’re trying to build this
within a computational framework, we’re going to tackle
first the observations and we’re going to address
that kind of question. So the work that we do is based
around this kind of notion called knowledge
engineering from experimental design. I’ll get to it in a second. The idea is that this is a slide
that you would probably have seen in scientific talks about
biology all over the place. At the top is the scientific
statement that would be an interpretative statement, saying something like rats eat cheese when they’re
hungry or something fairly straightforward. And it’s usually based
upon data. And the data is going to be
based upon measurements, the dependent variable. And the settings that you set
within your experiment before the experiment starts to test
your effects are the independent variables across the board. So here the kind of important
criterion that you’re looking for is that there is a difference between conditions D and E. And it’s illustrated
by the measurements of the dependent variable. So that’s actually how
scientists think about their data most of the time. This is the underlying
conceptual model that they use to design and build experiments
and to actually do their technical work. And what we’ve done is we’ve
tried to extrapolate away from that to provide a simplified
framework for scientists to be able to describe what they’re
up to using these representational elements,
using processes, material entities, parameters, constants,
measurements, branch points, and
thought points. And given this kind of
vocabulary of conceptual structures, it’s actually
not too difficult to get scientists to describe their
protocol and to draw out their protocol in the way that I’ve just shown you. And actually, I gave a talk at a neuroscience undergraduate lecture. And I showed them this slide
and I asked them to put their hands up. And I said, how many of you
feel comfortable that you could do that? And most of them did. Which I was like great,
that’s awesome. So now what’s interesting about
this is that if we are trying to extract the relationships
between dependent variables and independent variables,
measurements and parameters, we get that for free from
this kind of diagram. Because in fact, we’re able
to trace back through the structure of the measurements to
see which of the variables actually parameterize the
process upon which the measurements were taken. So in fact, you can see that
measurement 1 is dependent upon parameter 1 and parameter
3, but not dependent upon parameter 2, because
it lies on a different arc of the protocol. Now, that gives us our
relationships between independent and dependent
variables, which can form the basis of a data representation, which is kind of cool.
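As a rough sketch of that trace-back idea (this is not the actual KEfED or BioScholar code, and the node names are invented), the protocol can be treated as a directed graph that you walk upstream from a measurement to collect the parameters indexing it:

```python
# A rough sketch only (invented node names, not the actual KEfED/BioScholar code):
# treat the protocol as a directed graph and walk upstream from a measurement
# to collect the parameters that index it.
from collections import defaultdict

# edges point downstream, from an upstream node to the node it feeds
edges = [
    ("parameter_1", "process_A"),
    ("parameter_3", "process_A"),
    ("process_A", "measurement_1"),
    ("parameter_2", "process_B"),
    ("process_B", "measurement_2"),
]

upstream = defaultdict(list)
for src, dst in edges:
    upstream[dst].append(src)

def independent_variables(measurement):
    """Collect every parameter reachable by walking upstream from a measurement."""
    seen, stack, params = set(), [measurement], set()
    while stack:
        node = stack.pop()
        for parent in upstream[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
                if parent.startswith("parameter"):
                    params.add(parent)
    return params

print(independent_variables("measurement_1"))  # {'parameter_1', 'parameter_3'}
```

Under those assumptions, measurement_1 comes back as depending on parameter_1 and parameter_3 but not parameter_2, mirroring the example just described.
We actually have a system that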
we’ve built called BioScholar that in fact provides a
drawing palette that scientists can use to draw out
the protocol that they’re actually using. So here, this is a very
simple experiment. You take a mouse, you measure
its blood pressure. Nothing complicated
about that. And the system allows you to
describe the various different variables at different stages
with different data types and such like. Very happily, we were able to
publish this as a paper. Can I get this? For some reason the– ah, here we go. So you see on the righthand side
a reference to a paper in “BMC Bioinformatics.” And I
just want to point out the first author of this is
sitting right here. This is Tom Russ, thank you,
who was working with me on this work previously. And this is a highly-cited paper
within the community. So we actually got
some impact. People were reading it and
it was quite good to see. So having kind of set up the
structure of this knowledge representation where we’re
trying to capture the primary measurements by using these relationships through the protocol, we then
confront the unfortunate fact that data processing
transforms data. Well obviously, right. That’s not surprising. But if we were able to actually
capture that as well by seeing– if you look at this
representation on the graph, the idea is that this is an
experimental protocol that consists of taking the MMSE
questionnaire, which is the mini-mental state exam, which is
a test of cognitive ability that people will take if
they’re suffering from Alzheimer’s. And you can actually see that
the data structure is relatively simple. You have your score, followed
by your ID number for the subject, and then the group
that you’re in. And if you want to calculate
the mean values of this specific measurement, which is
an important step to take, the process of calculating the mean
will average out the IDs. The individual subjects will no longer be represented there.
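A tiny sketch of that point, with made-up column names and numbers rather than real MMSE data: the group mean is a data-processing step whose output schema no longer carries the subject ID.

```python
# A tiny sketch with made-up column names and numbers (not real MMSE data):
# the group mean is a processing step whose output schema has lost the subject ID.
import pandas as pd

primary = pd.DataFrame({
    "mmse_score": [28, 26, 19, 17],
    "subject_id": ["s01", "s02", "s03", "s04"],
    "group":      ["control", "control", "patient", "patient"],
})

# input variables: mmse_score, subject_id, group
group_means = primary.groupby("group", as_index=False)["mmse_score"].mean()

# output variables: group, mmse_score (mean) -- subject_id has been averaged out
print(list(primary.columns))      # ['mmse_score', 'subject_id', 'group']
print(list(group_means.columns))  # ['group', 'mmse_score']
```

So by mapping the inputs and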
outputs of each individual data processing step, we can
construct a representation that will support a knowledge
representation for the following elements, for primary
measurements; for mean values; for statistical effects
between groups, which we’re showing at the bottom here
for the MMSE z-score; and of course correlations, which
are not shown here. So the idea is that all of
the kind of statistical relationships you might see in
a paper, if we adhere to this kind of modeling approach, we
should be able to capture and represent quite cleanly and
easily within this approach. Now, how expressive is this? How good is this as
a methodology? Now, this is a figure, a data
figure from a paper concerned with looking at vaccine
responses to an HIV vaccine. And what you’re looking at here
is a slide that shows kind of a horrible data slide. But this is not completely
atypical of the kind of things you see in scientific papers. The thing that you’re looking at
is that the black dots are supposed to be higher than the white dots. The black dots show an immune
response under conditions of the vaccine and the white
are a control. Now if you draw out the entire
protocol using the kind of conceptual framework that we
described previously, you get this complex representation. And in fact the measurement that
we’re looking at is shown by the number 5, the circled
measurement in the diagram. And if you trace back through
the protocol, through every single individual parameter that
indexes that measurement, they correspond exactly to the
axes of this diagram shown here, which is very
encouraging. So we were actually able to go
through this paper and curate all of the data points, every
individual data point from this paper, into a database
representation. Which is actually really, really
cool because that’s actually very, very difficult
to do using other database methodologies and data
modeling approaches. So that’s some of our
core technology. That’s the kind of approach
that we’re trying to take. Now, we’re also trying to think
about, OK, how do we apply this approach? How do we apply this approach
across a whole field? Now, given the complexity of
the different types of research that people do, we
don’t know the number of paradigms that people
are working within. There are all sorts of different
ways in which people are doing the data analysis. And three examples of the kind
of data formats that people use are shown here: human neuroimaging, protein-protein interactions, and gene expression. All of them involve different
types of data. But they also involve different
kind of hidden, intrinsic methodologies and
assumptions that people make when they’re driving the
kind of research that they’re doing. So we want to know how
do these different– well, let’s ask the question. How do we figure out how many
paradigms there are? Well, you guys are all very
familiar with topic modeling as a methodology. Maybe we could take all of the
text within a specific corpus of scientific papers and try and
automatically discover the kinds of paradigms that lie within. This is an explanatory slide
that I think, after talking to Brian, is probably completely
unnecessary. The idea is obviously, topic
modeling involves taking documents like this. This is just an abstract from
the Society for Neuroscience meeting a few years ago. And what topic modeling
allows you to do is to identify papers– I’m sorry, words that tend to
reoccur together in the same document again and
again and again. So if you’re working on Parkinson’s
disease for example, you’ll see these kinds
of words like nigral, dopaminergic, substantia nigra,
all co-occurring within the same document. And so by doing topic modeling,
you can then model the structure of the document
itself as a mixture of different topics distributed
over the entire corpus. And that gives you a framework
within which you can then perform statistical analyses
on the papers that much more easily. And what we did was we wanted to
try to provide a graphical representation of the layout of
a subject that scientists could eyeball and examine. And so what we did was we took
a literature corpus. We generate a topic
model from it. We then calculate document-document similarity scores based upon comparing the topic distributions for each document. And then we actually embedded this in a Google Maps application to provide a kind of layout that you can zoom in to, look at, and examine.
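A minimal sketch of that pipeline, written with scikit-learn purely for illustration rather than the tooling actually used at the time, and with invented abstracts:

```python
# A minimal sketch of the topic-model-plus-similarity pipeline, using scikit-learn
# purely for illustration; the abstracts below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "dopaminergic neurons in the substantia nigra degenerate in parkinson disease",
    "nigral dopamine loss produces motor deficits in a rodent model",
    "hippocampal place cells encode spatial memory during navigation",
]

counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
doc_topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# document-document similarity over topic mixtures; a 2-D layout or map-style
# viewer would consume a matrix like this
similarity = cosine_similarity(doc_topics)
print(similarity.round(2))
```

The similarity matrix is the kind of thing a 2-D layout or map-style viewer would then consume.
And this is actually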
quite old work. We did this back in 2007. And this is a cluster map
of the 2006 Society for Neuroscience meeting, which is
actually kind of interesting semantically as well. This is a conference that
about 30,000 different scientists come to every year. It is the main conference for
all of the neuroscientists across the world. And in a way, it’s kind of like
a snapshot of the state of the art at that point, which
is kind of interesting. And all we’re showing here is
this layout, that you can kind of zoom into. The nodes are colored by the
different themes that they were manually classified into
within the conference description. And so you can see it forms
these clusters. And an interesting idea is to think: can these clusters of different nodes be used as
the basis of defining or investigating or looking
for paradigms? And this is work that’s
still ongoing. We haven’t really developed
this a great deal. But this is an example of the
kind of work that Google could in fact do, obviously
at scale. Rather than dealing with the
11,000 or 12,000 documents that I’m looking at here, Google
could certainly examine for example the 20 million
documents that occurred within PubMed. These are things that I’m trying
to kind of throw into your space as possible avenues
and ideas for research that you might be interested
in taking up. Another example of this is the
NIHMaps.org Project, which was work that was derived from
our original idea. We worked with Ned Talley and
with a company ChalkLabs to develop this application. And actually I definitely want
to make it very clear that we had no real part in the
development work here. This is all ChalkLabs
development work, but it is nonetheless kind of inspired
by our original idea. And what you’re looking at here
is a representation of the 80,000 or so grants that
were actually funded last year by the whole US government. And what we’re able to do is
query and search for grants that contain 30% or more of
their words classified as Parkinson’s disease
relevant words.
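A small sketch of that kind of query, with made-up topic names and weights: keep the grants whose combined weight on Parkinson’s-related topics reaches the 30% threshold.

```python
# A small sketch with made-up topic names and weights: keep grants whose combined
# weight on Parkinson's-related topics reaches the 30% threshold.
doc_topic_weights = {
    "grant_001": {"parkinsons": 0.42, "imaging": 0.30, "genetics": 0.28},
    "grant_002": {"parkinsons": 0.05, "cancer": 0.60, "immunology": 0.35},
    "grant_003": {"parkinsons": 0.12, "deep_brain_stimulation": 0.25, "motor": 0.63},
}

PARKINSONS_TOPICS = {"parkinsons", "deep_brain_stimulation"}
THRESHOLD = 0.30

hits = [
    doc for doc, weights in doc_topic_weights.items()
    if sum(w for topic, w in weights.items() if topic in PARKINSONS_TOPICS) >= THRESHOLD
]
print(hits)  # ['grant_001', 'grant_003']
```

And you can see that you zoom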
in and you actually get a small cluster of papers that are
concerned with deep brain stimulation, which is a
specific methodology. So in theory, everything within
this kind of cluster may be relevant to the same
kind of thing, that could define a paradigm. Kind of interesting. And it’s also relevant because
these guys are all funded. So it’s not just a simple
theoretical question about who’s doing this work. These guys are all actively
processing and developing the same kind of ideas. This is kind of a little
hard to see. I apologize. But this is a breakdown of the
different categories of topics that occur in this corpus. And they cover the gamut. And of course, this is quite
interesting because then you can actually see what the
literature is telling you are the most prominent words that
are being used by scientists. Rather than ontologies that are
defined by informaticists, these are the actual words that
are in the documents. So that’s another interesting
avenue of research that we could all kind of think
about and work on. So now I want to talk about
something evil, dark data, dun, dun, dun. And of course if you’ve not
heard the term before, dark data is an allusion to the
concept of dark matter, which is data that exists– I’m sorry, matter that exists,
but matter that we can’t see. It’s kind of out there
somewhere. We know it’s out there. We have no idea what it is,
which is kind of cool. But it’s present. Now, when you’re thinking about
data and dark data, this is data that exists. We know that it’s in
people’s labs. It’s in people’s notebooks. In people’s filing cabinets and
proprietary file formats and all these different
places. But we can’t get hold of
it and can’t see it. And the fact is that so much
data is kind of hidden away. And the infrastructure that we
currently work with doesn’t actually promote the sharing
of data very effectively. It does so a little bit, but
not very effectively. And that has a big impact. Last month, I went to a meeting
in Oxford where people were talking about scientific
integrity and scientific rigor and the whole process
of data sharing. And this guy Chas Bountra, kind
of stood up and talked for about 15 minutes without any slides present, without any slides at all. And he gave this wonderful, impassioned speech about how destructive the process of
concealing data from each other is within the
pharmaceutical industry. And so he heads the Oxford
division of the Structural Genomics Consortium, which
tries to provide a precompetitive space
for drug discovery. And he’s actually getting a lot
of really supportive money from pharmaceutical companies,
who are willing to chip in to try and basically support data
sharing at the stage before which drugs become
competitive. So there’s all sorts of
preliminary trials that people want to do to make sure that
molecules are viable targets for actually developing
drugs for. And he’s trying to kind of
develop the idea of being able to save some money by doing this. And so obviously the quote is,
“it’s harder to get a working treatment for Alzheimer’s
Disease than putting a man on Mars.” The reason he says
this is because according to his estimates, as far as I
remember, I could be wrong about this, he said that it
costs $30 billion or 30 billion pounds, I can’t remember
which one it was, for all the different research
that’s gone into trying to understand Alzheimer’s disease
and they still don’t have a working drug. Like all of the stage 2
pharmaceutical trials that have taken place, have all
failed, which if you think about it, is horrifying. We should be able to do
better than that. AUDIENCE: There were two recent
failures reported just a couple of days ago. GULLY BURNS: Really? AUDIENCE: Yes. GULLY BURNS: But specifically
for Alzheimer’s disease? AUDIENCE: Yes. GULLY BURNS: OK. OK. So apparently there are more. There are recent failures in
the last couple of days for Alzheimer’s disease and
drug development. So this is a very topical
question as well, a very serious one. And so the people in the
knowledge engineering community and the semantic web
community have put together the Linked Open Data Cloud. It’s a little blurry,
I’m afraid. I’m sorry you can’t
see it very well. And within that space, there
are a number of different projects in the life sciences. And essentially what this is all
about is these guys take these large online open
databases and they dump them into RDF and they look for ways
in which you can make connections between them, which
I think is a very, very valuable starting point to
demonstrate how data can be linked and data can be
made interoperable in a very open way. However having said that, the
representation that they use doesn’t make any distinctions
of the kind that I just described at the beginning
of the talk. Interpretations and observational data tend to be mashed together. And you need to know the
underlying schema of the database to be able to
make sense of them. And of course, the different
data sets that are contained within the Linked Open Data
Cloud are all basically representing the data that’s contained within the database as it stands. So you have all of the problems
of linking together the terminology and trying to
understand how the different data schemas are representing
the data and such like. So we haven’t solved some of the
major conceptual problems about how this kind of
representation should be constructed and should be framed
in a paradigm-specific way that would actually allow
us to reason over and build representations for specific
scientific problems. But I think that this is a
good start and this is an interesting way forward. This is the way things should
be in the longer term. This is a framework within which
we could publish our data and actually this could
make a difference in the longer term. At the moment, this hasn’t
necessarily yet proven to be a game-changing technology. But nonetheless, the promise
is there at some point in the future. And now, I’d like to also
mention an approach that’s being put forward by Barend Mons
and an extended community of people, which is called
Nanopublications. So the idea is that you can define a representation in RDF that contains a named graph
that stores a scientific assertion or a scientific
data set, some form of representation that could be
fairly general purpose and lots of different types
of format that you could include there. But you have a standard
representation for the provenance and for the
supporting data that allows you to link together different
data types. And I think this is a really
valuable and interesting idea and is actually a way in which
we can try and think about transforming the way in which
scientific data is published and shared as part of the
whole publishing cycle. And some examples of this is
you could have data points linked to a paper citation,
you could have a representation of a statistical
effect, or you could have interpretation
based upon KEfED data.
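As a minimal sketch of that named-graph idea, here is an rdflib example; the URIs and predicates are invented for illustration and are not the official nanopublication vocabulary.

```python
# A minimal rdflib sketch of the named-graph idea; the URIs and predicates are
# invented for illustration and are not the official nanopublication vocabulary.
from rdflib import Dataset, Literal, Namespace

EX = Namespace("http://example.org/np/")
ds = Dataset()
ds.bind("ex", EX)

# assertion graph: a single hypothetical statistical result
assertion = ds.graph(EX["pub1/assertion"])
assertion.add((EX.patientGroup, EX.meanMMSEScore, Literal(19.3)))

# provenance graph: where that assertion came from
provenance = ds.graph(EX["pub1/provenance"])
provenance.add((EX["pub1/assertion"], EX.derivedFrom, EX.somePublishedStudy))

print(ds.serialize(format="trig"))
```

And these are all obviously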
conceptual ideas. We haven’t got worked
examples in nanopublications to show you yet. But watch this space. This is something that
we want to work on. So having talked about this and
having talked about the underlying KEfED representation
and all of these various different other
components, I wanted to kind of bring it home by
talking about a real scientific problem. And actually, I’m talking about
how one might be able to encapsulate and really
represent a specific scientific paradigm
in a coherent way. So I’m talking about
biomarkers. Who here has heard of
biomarkers as a representation? One guy, one person. Great, thank you. And actually biomarkers is
a very well understood methodology within biology. The idea is that given an
individual in some kind of diseased state over time– is this OK? OK, good. So in some kind of disease state
over time, a person’s disease state will change from
normal, pre-clinical, mild clinical, to severe clinical,
and you can actually track it over time. Now biomarkers, the idea is
that you have various different types of measurement
that you make on that individual over time that
actually allows you to track various different aspects
of the disease. And so here’s an example of
a numeric biomarker– of course, this is all
just cartoon stuff– or a binary biomarker that shows
a state of either being off or on, and imaging
biomarkers that actually indicate– you know, the MRI images that
reflect a person’s structure of the brain. And the idea is that the way in
which the biologists think about this is they’re looking
for an indicator test. They want to find something that indicates the transition from a normal to a pre-clinical state. And they want to find a test
that you can deliver as a blood test or a questionnaire or
something that’s indicative that this person may have
a disease of some kind. And so what’s interesting about
that is that if you think about this, you have an
individual whose state changes in discrete stages over time and
you have a whole bunch of features that indicate
measurements of some aspect of that individual’s measurable
quantities over time as well. Now, doesn’t that sound like a
hidden Markov model to you? Isn’t that something that
computer science can actually address in a very tractable
and straightforward way? Or not necessarily easy and
straightforward, because the features you’re looking at
are very complicated. But nonetheless, it actually
falls within the domain of computer science. You can address that
kind of question.
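As an illustration of that framing, here is a sketch using hmmlearn with synthetic numbers and hypothetical features: the disease stage is the hidden state, and the longitudinal biomarker measurements are the observations.

```python
# An illustrative sketch only (synthetic numbers, hypothetical features): the
# disease stage is the hidden state, the longitudinal measurements the observations.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

# one subject's visits: columns = [csf_biomarker, cognitive_score]
visits = np.vstack([
    rng.normal([1.0, 29.0], 0.3, size=(5, 2)),   # normal
    rng.normal([1.6, 26.0], 0.3, size=(5, 2)),   # pre-clinical
    rng.normal([2.4, 21.0], 0.3, size=(5, 2)),   # mild clinical
    rng.normal([3.1, 15.0], 0.3, size=(5, 2)),   # severe clinical
])

model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100, random_state=0)
model.fit(visits)

# decode a most-likely sequence of (unlabeled) disease stages, one per visit
print(model.predict(visits))
```

So the goal in a way of what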
we’re trying to do by shaping and representing the data that
we get access to is to make it available and accessible to
people like you, to computer scientists and to experts who
are really, really, really good at data analysis at scale
and can actually address these kind of questions. So let’s then see where
this is applicable to. So there are these two
projects called the Alzheimer’s Disease Neuroimaging
Initiative or ADNI and a more recent attempt
to analyze and look at Parkinson’s disease data in the
same way, called the PPMI, the Parkinson’s Progression Markers Initiative. Now, these two projects are
kind of interesting. They’re large scale, multi-site
projects that involve clinical sites across
the world, in lots and lots of different places, where they
have to coordinate very carefully the data acquisition
and the representation of different data from
different places. They’re actually really,
really good at that. And the idea is that they simply
capture and track 200 patients in a disease group and
a nondisease group over time, over a period
of five years. And they capture imaging data. They capture blood. They actually do spinal taps and
take cerebrospinal fluid and freeze it so that people can
do subsidiary, secondary back-up studies to
look for specific things within a tissue. They do questionnaires. They do clinical assays. They look for any kind of clues
they can find that may suggest a transition in the
progression of the disease. And importantly, they actually
provide access to all the data and they provide access
to the samples. You have to go through a
rigorous application procedure to get access to this
stuff, but it’s available for anybody. So these guys are really setting
the stage and pushing things forward and really kind
of having a big impact. And of course, surprisingly
enough, or rather not surprisingly enough, these
projects are really successful. There’s a paper out by the
guys who run ADNI. ADNI is now in its second phase. It’s a more international
consortium now. It’s globalizing. It’s scaling itself
up quite well. And they published a paper that
shows something like 200 papers were published around
this individual study originally. And of course, the impact that
they have on secondary studies is even greater. So I’m– AUDIENCE: Because the
data is available. GULLY BURNS: Exactly. AUDIENCE: It’s not like hey,
I’m not going to show it to everybody because I want
to publish it first. GULLY BURNS: Yeah, exactly. They’re saying look, we
want you to publish. We want you to leverage
this stuff. We want you to succeed. So by taking this kind of
process, the people who developed ADNI really pioneered
the whole kind of data sharing framework. And they did so in a way
that’s actually kind of interesting because 200 patients
per group is not massively scalable. It’s not at a level of 10
million, the population of China, or something like that. But nonetheless, it’s actually
able to solve these problems and actually really make some
progress about trying to detect Alzheimer’s disease
in people. The PPMI project is
a little younger. It’s only been around for
a short period of time. Again, it’s a large-scale
international consortium of 24 separate clinical groups, all of
whom are working together. It’s sponsored by the Michael
J. Fox Foundation. And this is the main
presentation from the website. So they have a whole complex
governance structure where they have different working
groups developing and building these things. And of course, what’s
interesting about this too is that all of the researchers
who are working on this problem are working on their
own problems in their labs. So you have these top
researchers from these different clinical labs,
all contributing to the centralized project. But that’s not all
they’re doing. And so the community built by
this project itself is able to kind of share data in a knock-on
effect as well. So I think that’s something that
we kind of underestimate the impact of. And it’s actually one of the
reasons why I’m very excited to try and work with
these guys. Now, what’s interesting about
this as well is they’re sharing the data, but the data is all
shared as .csv files, text readable .csv files. There’s a data dictionary that
you can look stuff up in. And all of the data descriptions
are written down in PDF files, so that you can
actually learn and understand the data if you’re an expert
and if you take the time. But for a machine readable
approach, using semantic web and knowledge engineering,
it’s actually not enough. We need to innovate. We need to bring our technology
into this domain. And we need to be able to kind
of actually model the process using the kind of tools
that we’ve done. So we started this work. This is a KEfED, knowledge
engineering from experimental design model of PPMI. This is completely unreadable. So let’s make it a
little bigger. The idea is that you have an
individual participant and then you have a loop structure
here that actually provides a presentation of an
individual visit. And the visit is basically where all the work happens. And then you have a whole lot of
assays leading off into the distance over there, with the
various different types of measurements that people
will make. We’ve only just started
work on this. I basically put this model
together last Saturday. But we were able to actually
build a complete representation of all the
different types of variables that people record in the study. We haven’t done any of the
knock-on studies that people do on top of things like blood
samples, urine samples, and CSF samples. But nonetheless, we were able
to capture for example, the structure of the data that’s
being measured and represented within the original data set. So I’m very encouraged
about this. And I think that this actually
provides us with a framework that then we could publish
nanopublications, based upon PPMI data, assuming of
course that we get permission to do so. We have to apply through their
internal processes to make sure that’s OK. But then, exciting enough,
what I’d like to do is to present all of this data in a
machine readable form that people like you can
then process. And so we could actually present
this as a framed data set that anyone could have a
crack at with respect to using hidden Markov models or CRFs or
any kind of data modeling kind of approach that we have. So in a way, again this
illustrates the role that I think the work that we’re doing
actually provides by being a framework for capturing
the semantics of the biomedical processing in enough
detail so that we’re able to kind of actually capture
the underlying data and representation in order to
solve problems, but not so much detail that we start
trying to invoke all the knowledge that these guys carry
around in their heads. Because there’s no way
we can do that. But then we want to package it
and present it to you guys. So in a way that then exposes
all the data for all the sorts of kind of large scale, complex,
and interesting [INAUDIBLE] kinds of analysis that
are in your toolbox. So that’s kind of part of the
mission of what we’re trying to do and how we can actively
pursue this and actually make some progress and maybe even
cure some of these diseases. So going a little bit
further, that’s the main body of the talk. But looking further off into the
future, a couple of ideas for you to consider. So first of all, this is
a famous slide from a neuroscience textbook that
illustrates the different levels of organization of
the nervous system. You have genetics. You have various different
proteins and channels in cell surfaces. You have synapses,
microcircuits, neurons, local circuits, whole brain
structures, then behavior. And of course, a disease like
Parkinson’s disease actually affects all levels of
this framework. You have to be able to look
at the research at all the different levels. And in fact, if we’re interested
in studying not just the correlative effects of
these different symptoms, which is what biomarkers is
really doing, if we want to start digging into the causes of
how the disease works, then we have to start looking
at these various different elements. So LRRK2 is a gene. And it is in fact, I think, the gene that Sergey Brin has a mutation in, which is part of
the reason why he’s actually interested in studying
Parkinson’s disease and trying to find a cure himself. Part of the reason why
Parkinson’s disease causes damage is based upon
aggregations of these various different malformed proteins
that grow and cause problems and basically kill cells. We don’t know how. We don’t know why. But the processes that underlie
that are somehow related to the way in which
the proteins fold. Again, we don’t understand
exactly how that works. That itself is a paradigm. People work within that field
their entire lives, trying to understand that whole process. Then the process of– let me get this right,
autophagy, I can never pronounce that, autophagy is the
process by which your body actually consumes and eats up
the proteins that get made. And that’s broken in some
way with respect to Parkinson’s disease. There is evidence that suggests
that mitochondria might be implicated in
this kind of process. There’s evidence that
inflammation, the process of inflammation within the brain,
actually has a role to play within the causes of Parkinson’s
disease. But we don’t exactly know how. And of course we do know that
at the level of the brain regions and cell populations,
there’s a specific set of neurons that are damaged within
Parkinson’s patients, that causes movement disorders
and other sorts of long-term, devastating, horrifying effects,
that we want to try and understand and we want
to try and stop. So in order for us to go
further, we have to start building representations and
knowledge engineering approaches of these various
different aspects of these different research fields. And we have to find connections
between them. And we have to try and
synthesize the knowledge so that we can actually understand and cure this disease. That’s kind of what we’re up
to in the longer term. So coming here, I was wondering
how would Google kind of tackle this? My question was like, if we
were going to go Google on this problem, what would
that entail, what might that look like? And so, forgive me for being a
bit fanciful, but I thought well, wouldn’t it be cool if
we have a supermassive repository of all the scientific
data in the world. We found a place where we
could shove everything. And we’d put all the different
scientific observations from every single scientific
experiment that’s ever been done, we’d put it in there. First of all, the question
would be how would we populate it? Well, since this is Google, you
guys would develop really beautiful data management and
analysis tools that would deliver scientific expertise to
the scientists working in the lab in a way that would
transform the subject. So I was thinking about this. It’s like I remember the days
when I was driving around LA with my “Thomas Guide,” on my
lap, and not that I would ever have of course, but the
experience of doing that as opposed to the experience of
using Google Maps to kind of examine how one gets from A
to B, was transformative. Perhaps, maybe there’s something
that we can do within a scientist’s experience
of doing work within a lab, that would
be analogous. That would be cool. Of course, there’s a lot of
data in databases already. Perhaps you guys can figure
out ways of doing deep information integration, sucking
information from these open access databases. Anyone can do anything with the
data from these things. Maybe there’s some interesting
and really fantastic ways in which we can pull
data from them. And of course the scientific
literature, which is effectively a 16th
century artifact. These things are PDF files. They’re accessible
electronically, but they’re still just text on a page. Now, maybe there’s ways in which
we can transform the way in which these things
get written. Maybe there’s ways in which we
can pull the information out more interestingly,
more powerfully. There’s all sorts of things
that we could do there. And then of course once we’ve
gotten this system, what would then be possible? Well, you could do terminology
analysis. You could actually develop
ontologies that reflect scientists’ use. So scientists would actually
be willing to use the terminologies and use these
standardized terms in an easy and straightforward,
very natural way. Perhaps there’s ways in which we
can do meta-analyses across studies, which still remains
a very, very difficult and unsolved problem in general
for scientists to do. Trying to actually get the
data from one paper and compare it to the data from
another paper and being able to make the connections between them, not an easy problem. With this kind of repository,
that would be easy. That would be straightforward. Then imagine for example, oh
yeah, we could scale up our representation of
data massively. So if you wanted to do
large-scale systems biology analyses of all the proteins
involved in a specific disease, bang, you would
be able to do it in no time at all. And then I’m kind of excited
about this idea. Judea Pearl is a giant
in the field. He’s someone who developed
probabilistic graphical models and really made a
huge difference. Now, some of his work is based
upon the idea of representing causality in a computer, being
able to actually look for causal relationships within the structures that he’s developed. Now, maybe we could use that and
we could leverage that in studying how things
really work. Maybe there’s actually some
underlying long-term kind of interesting research that we
could really leverage there. And then of course, we could
build a breakthrough machine. I don’t really know what
that looks like. But that’s kind of
what I’m up to. That’s what we’re
trying to do. Maybe there’s a way in which we
can automate the process of doing scientific breakthroughs
and we could actually turn that into an engineering
problem. So in that vein, I’m one of
the organizers of this workshop, the Discovery
Informatics Workshop. Well, there’s a workshop at the
AAAI seminar, at the AAAI Fall Symposium series coming
up in November. Deadlines for this
are in one month. We’re very, very excited to try
and get people involved and interested. And we’re also looking for
people who might want to act as keynote speakers as well. So if there are people who could
be potential speakers for this, we would be very
interested in talking to you. And we’re kind of excited
about this whole thing. So just to summarize, I’ve gone
through a lot of stuff and I’ve talked quite quickly. But I just wanted to kind of
summarize it, wrap it all up in a bow for you guys. So first of all, I’ve talked
about conceptual developments in terms of this cycle of
thinking about the scientific process as a cycle. And we want to try and represent
and think about paradigms explicitly. And this is I think an important
breakthrough that no one really is doing in
an explicit way. We have our own methodology,
knowledge engineering from experimental design, which is
the whole way in which we can accurately capture data from
scientific papers and scientific studies in the way
that they were originally intended to be. We have talked a little bit
about topic models of the scientific literature as a
methodology for identifying and examining paradigms. There’s also obviously other
types of information extraction work and all sorts of
natural language processing that can be done in this realm,
that I think are very interesting avenues
of research. We want to try to advocate
open standards to help alleviate the problem of dark data. And then finally, I’m proposing
this specific project as a target, kind of
goal, for actually trying to address and support the
Parkinson’s research community in their work. This is thanks to
a lot of people. We obviously have a lot
of people to thank. This is just a small
subsample. And thank you for
your attention. Here’s my email address, my
website, and my blog. And feel free to get in contact
with me if you have any questions or if you
have any ideas. Thank you. MALE SPEAKER: We have time
for a few questions, if people have them. GULLY BURNS: I can repeat the
question too, if you like. AUDIENCE: Is there any more
low hanging fruit with citations and papers? Usually there’s a granularity
of one paper citing another. GULLY BURNS: Yes. AUDIENCE: So typically within a
given paragraph, you’ll have several citations. So like in a nanopublication, if
you did nothing but dissect all papers into paragraphs,
that’s a kind of distance metric physically between
citations. Is any work being
done on that? GULLY BURNS: Yeah. Absolutely. So Marti Hearst, I think at
Stanford, she came up with the notion of a citance, a citation
sentence, as being a kind of unit of representation
that you can do this kind of work with. I’m not sure if she’s
specifically looking at the kind of questions that you’ve
just described. But I think it’s kind of
in her wheelhouse. There’s also an ontology of
citations called CiTO being developed by David Shotton
at Oxford. And he’s trying to provide a
language for describing the different roles of a
citation when one is supporting another. But what you’re saying is
specifically just calculating the distance metrics within the
text as a way of looking at these things. I haven’t heard that before. I think that’s an interesting
idea. AUDIENCE: One analogy is just
with say Google search, where if you do nothing more than
just consider a new metric like distance, physical distance
between words and something might appear. Or the basic tension there bag
of words concept versus a knowledge graph. And it seems as though the
citation graph seems to be traditionally associated with
some kind of knowledge graph, when actually maybe more power
is simply a physical proximity of citations. And I’m just wondering if that
was being– and if you want to scale, that’s an issue where
scaling is much easier. GULLY BURNS: Yeah. I haven’t heard about the
physical location. But I think that’s an
interesting idea. And I think personally the way
in which citations are used is really undistinguished. I mean obviously you just point
to a paper and you have some underlying theory
of why it is you’re pointing to there. But there’s no way of really
telling why or what the purpose of that is. And also the other aspect of
this is that doing text mining of the full text is actually
difficult because of licensing concerns. So we have to be able
to get the text. We have to be able to extract
the text from the documents that is contained within. We have to be able to recognize
citation sentences. And then we can actually start
addressing the kind of questions, yeah. However, I don’t feel as though
I’ve really effectively answered your question. AUDIENCE: That’s because
it’s not fully formed. GULLY BURNS: OK, sorry. AUDIENCE: But I will consider
it more and maybe contact you later. GULLY BURNS: Sure. Please do. Yeah.
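As a very rough illustration of the proximity idea raised in that exchange (the bracket-number citation style and the example sentence are invented), one could locate citation markers in a paragraph and measure the character distance between each pair:

```python
# A rough illustration of the proximity idea: find citation markers in a paragraph
# and compute the character distance between each pair. The bracket-number style
# and the sentence are invented.
import re
from itertools import combinations

paragraph = (
    "Dopaminergic cell loss has been quantified post mortem [12] and by "
    "imaging [13], and both measures correlate with symptom severity [14]."
)

positions = {m.group(): m.start() for m in re.finditer(r"\[\d+\]", paragraph)}

for (a, pos_a), (b, pos_b) in combinations(positions.items(), 2):
    print(a, b, abs(pos_a - pos_b))  # character distance between the two markers
```

AUDIENCE: So our understanding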
of the diagrams that you show, the process model, can
also serve as the description for data. GULLY BURNS: Yes. AUDIENCE: So there is the
protocol and there is inscription. So basically, how do you
think [INAUDIBLE] descriptions? So the data which your proposal
connecting the database, actually it depends
on the instrument or method that [INAUDIBLE] and so on and so and so on. So increasingly, people are
taking data which make actually no sense with the
computer approach. So many different complex
instruments that you cannot look up them or something. You need to transform them. So how do you think it would
capture the description of methods and instruments and
everything to use that [INAUDIBLE] in 5 years, in 10
years, when the instruments are more powerful? GULLY BURNS: Right. That’s a good question. So I think that one size
does not fit all. And that you have to tailor the
representation of the data that you’re capturing in a way
that’s appropriate for the task that you need the
data at that time. Because frankly one of the
problems that we have is that the process of knowledge
acquisition from these things is a bit of a rabbit hole. You can go down as deep
as you like into the details of things. And unless it’s going to be
useful for you in the immediate future, you’re not
really going to be able to get people to do that. So my answer is that I don’t
know what level you have to take it at. But I think you want to aim for
a kind of process of just getting just enough as the
way of doing this. Because if you don’t do that, if
you dive down into as deep as you possibly can get, then
the process of curating the information becomes pejoratively
difficult and no one will do it. So you have to kind of
tailor your approach. And of course, what you need
as well is a kind of multi-level representation
of the work flow. So in fact, for example,
take an example. Let’s say that we’re doing a
genetic sequencing process. You could put that in a single
node in the process graph, as we’ve described. Or you could explode that into
all the different substeps that you need to take
within that process. So from our perspective, the
representation should be able to say OK, this single process
is actually made up of all these subprocesses and have
it link together. So then you could in
theory curate both. When you first make the
analysis, you want to curate to the maximum of your
ability at that time. But then let’s say that you
wanted to drill down into the data later on. You would want to explode out
that process into the different subsections
and do that. So I don’t really have a
good answer to that. But in theory, if you’re
automating this process and automating the process of data
acquisition, [INAUDIBLE], you want to capture everything. I mean that would be
the ideal process. But then you would need a
representation that allows you to make sense of that as you
say, 10 years down the line. AUDIENCE: Thank you. GULLY BURNS: You’re welcome. AUDIENCE: Actually there are
problem as well which generate more data than we can capture,
like for example our large– the collider, which generates
like 60 gigabytes per second of raw data. So they kind of process it and
discard it [INAUDIBLE]. GULLY BURNS: Right. But there are some experiments
that you can’t repeat. So in those experiments, you
have to try and capture as much data as you possibly can
because it’s incredibly valuable to be able to do so. So that’s a very important
question. And something that I think
scientists in the future are going to have to deal with. Yeah? AUDIENCE: A quick question
from VC? Can you hear me? GULLY BURNS: Yes. Please. AUDIENCE: So you describe
representing the data acquisition process. It sounded like essentially a
manual effort to create these flow diagrams. And you mentioned it could
be multiscale and so on. So once you’ve created this
by hand, then what? Do you have algorithms that
process this graph representation that
you’ve created? Or in what way does it add
value to the raw data? GULLY BURNS: OK. So all of this is development
work. So at the moment, what we’ve
done is we are able to actually capture the
organization of the protocol. And we’re trying to develop the
underlying infrastructure for being able to represent
data at each stage. It’s still very nascent. We’re still trying to look for
funding to be able to develop this further and make this
work more coherently. The way I think about this in
the future, so this is again a future direction kind of
argument, is that imagine for example that the processes that
we’re dealing with are kind of like a grammar, a
compositional grammar, to build experimental designs for
a whole class of different experiments. And so the kind of underlying
theoretical model that I have in my mind about how we can then
process this data is that you can look at the different
kind of fragments of an experiment that are dealing
with measuring specific things, and see how
they combine. This is the reason why this is
a complicated problem in that normal data modeling methods
don’t work because each experimental design when taken
as a whole, involves composing different pieces of the same
underlying design. And so if we are able to
construct a grammar-based approach to represent this type
of data, then perhaps we can use that to analyze the
underlying data and the underlying structures. And then maybe have a bunch of
graphs of work that we’ve done, for example in automating
the comparison of different studies in studies
of HIV vaccines. And so for example, instead
of a layout, the kind of graph-based layout of papers
based upon the similarities in topics, we could for example
drive analysis and a visualization of the different
types of experiments because of similarities in their
underlying design. So that’s one aspect of how I
think this kind of work could be used to help evaluate
and understand the data within the field. But it is still very
early days. And we’re still looking
to try and develop the ideas basically. AUDIENCE: OK. It makes sense. Thanks. GULLY BURNS: Thank you. Any other questions? Go ahead. AUDIENCE: [INAUDIBLE] you used some information, like
the axes on graphs and things like that to develop
[INAUDIBLE] on your models. GULLY BURNS: Yes. AUDIENCE: How much of that kind
of stuff do you think can be pulled out of papers? Like if you put all papers in a
particular topic cluster and you put all the graphs and
tables and that you just do some analysis on that. It could much be a model if you
could potentially reverse engineer it. GULLY BURNS: I think
that that’s a really interesting question. We did some work on looking
at a specific class of experiments, called tract
tracing experiments, which study connections
in the brain. And we found good results when
we knew what the independent and dependent variables were. So if you look at the results
sections of papers and you have a predefined idea of what
it is the measurements that are needed to be used in this
type of experiment, so for example labeling in brain
tissue or the location of an injection, or a specific type of
chemical that’s being used within the protocol, that
boils down to a simple information extraction
problem. So you get the same kind of
performance that you have when you’re trying to pull
out names of companies for example. It’s the same kind of process. So we think that we can do
that quite efficiently. But we need to have the
underlying models defined for a specific experimental type. And of course, that’s an
interesting question because some experimental types
are much more complicated than others. For example, I’ve started
looking at cell biology studies of cellular processes to
try and get at ways in which we can look at, for example,
how LRRK2 is influencing different aspects of cell
biology as part of trying to study the causes of Parkinson’s
disease. Now, their representation of
experiments in cell biology is completely different from the
kind of thing that you see in vaccine studies. Vaccine studies, you have a
very complex design for a single process. Whereas in cell biology, you
have 25 separate experiments, all very simple ones, where you
basically get some cells, you incubate them,
you run a gel. You see what it looks like. You then do the same thing again
by tweaking some of the parameters. So it’s interesting and it’s
encouraging because we think the KEfED modeling approach
could easily handle that without too much difficulty. So you have a lightweight
framework for capturing those kinds of experiments. But again, if you’re trying
to automate the process of pulling that information from
the literature, it’s a whole different kind of ball of wax. You have to identify experiment boundaries within the paper. You have to look for
different things. But that would also come down
to how you prepare the background knowledge of what
experiment types you’re looking for and how you would
use that to seed the process of developing an information
extraction engine across the paper. But I think that what’s
interesting about that is that looking for the variables and
the measurements is a really powerful method.

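As a toy sketch of the pattern-based extraction described in that last answer, with an invented vocabulary and sentence rather than the actual tract-tracing models, one could seed a simple matcher with the expected independent and dependent variable terms for one experiment type:

```python
# A toy sketch with an invented vocabulary and sentence (not the actual tract-tracing
# models): seed a matcher with the expected independent and dependent variable terms
# for one experiment type and pull candidate mentions out of a results sentence.
import re

TRACT_TRACING_TERMS = {
    "independent": ["injection site", "tracer", "biotinylated dextran amine"],
    "dependent": ["labeled cells", "labeling density", "retrograde labeling"],
}

sentence = ("Following a biotinylated dextran amine injection site in the thalamus, "
            "retrograde labeling was observed in cortical layer V.")

for role, terms in TRACT_TRACING_TERMS.items():
    hits = [t for t in terms if re.search(re.escape(t), sentence, re.IGNORECASE)]
    print(role, hits)
```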