Google I/O 2013 – From Structured Data to the Knowledge Graph


JASON DOUGLAS: So I’m
Jason Douglas. I’m the lead product manager
for Knowledge Graph and the Knowledge Graph platform. And I’m joined today by Dan
Brickley, who’s a developer advocate for Knowledge, which I
think is a very deep title if you think about it. But more specifically, that is
about improving the state of structured data on the web
through things like schema.org and other standards. So I want to accomplish a couple
things here today. One is to give you an overview
of how search is changing and how Knowledge Graph is behind
a lot of those changes, and how that’s really increasing the
importance of structured data, and just understanding
on the web, and what that means for everyone here. And then Dan’s going to get into
a little bit more details about what that means
specifically, technically, in terms of schema.org and markup
standards and the evolving tool set to support that, and
then getting into a little bit– we’re still in the
early stages here. And there’s a lot of exciting
stuff that’s just starting to happen and I think will
continue to evolve. Also, part of the goal is to
give you an overview, like I said, provide some
context here. Because there are several
sessions later in the day that will actually dive down
deeper on some of the topics we’ll touch on. And we’ll try and highlight that
as we go when there is another session that will go into even
more detail on something. So I’m assuming most
of the people caught the keynote yesterday. I know it was pretty crowded– getting a general sense. But Amit was saying that
Search is changing dramatically right before our
eyes, which, if you think about it, is a pretty
big deal. This is our flagship product
used by billions of people, and yet we’re saying it’s
changing really fundamentally. So why would it be changing
really fundamentally? Well, because the nature of the
internet and information in our lives is changing
very fundamentally. In the last few years, we’ve all gotten smartphones,
internet-connected, more and more devices are connected. And information is not just this
research task that’s done at a desktop typing
in queries. It’s part of your life, right? It’s on the go. It’s information as you need it,
when you need it, and just what you need. And so Search needs to evolve to
meet that world and to meet those needs. And we are. And so some of the key things
that Search is doing now to fit in this world, to be that
seamless assistant– I think he used the analogy of
the Star Trek computer– as something that is available
when you need it to answer questions. So it does simple things like,
what’s playing at the Metreon across the street? Or when you’re sitting down for
lunch how many calories are in the cheeseburger that
you’re about to eat? Or just things that
you need to know. But also being able
to then follow up. And whenever you learn
something, it brings new questions, like, well,
how does that compare to a burrito? Or what time are those
movies showing? These kinds of things that
actually require some context of what’s going on. Or even, is my flight late
for getting home? And knowing that, as an
assistant would, what flight you were on, what you
had tickets for. And then even anticipate– so if that flight is going to
be late, letting you know, because of course you would
want to know that, or that it’s time to actually catch the
cab if you’re going to get there in time. So in order to enable all these
things, the thing that a personal assistant would have
is really an understanding of the real world, a memory of what are the
important things to know, or what is the context
that we all share. And again, that context really
is the real world. So we’re sitting here in Moscone
today, which is in San Francisco, and there’s a
bunch of things nearby. I mentioned the Metreon
for movies. And there’s the museum and
tourist attractions– all these things. These are things that we
generally know, or are very knowable, things about
the real world. And the more of these things we
know, the easier it is to answer questions, have
conversations, and like I said, even anticipate needs. So the fundamental thing that’s
changing in Search to bring this about is, well, how
do you have an understanding of the world? And one of the first things is
you have to actually be able to think about things
as things. And traditionally, Search has
been very much about strings. So if you think about the
content on the web, the text in documents, those are words. But there’s some funny tricks
that words can play. And so I think when you see this
picture, I assume most people in the room immediately
recognize this gentleman. But he’s gone by, obviously,
very different names or many different names over the years,
almost amusingly so. And yet I think most people
don’t get too confused about who that is. But if all you’re thinking about
is strings, is words, then it actually can
be quite confusing. And so if you even just think
about your own systems, how do you keep track of
these things? Well, generally, if you want to
be able to deal with name changes in a database, you give
it an ID, and you make the name just an attribute
of that thing. And I think if you think about
the backing database of most websites, a lot of people
do that for the important things in their systems. But usually at some level as you
go down, you think about the attributes of one of these
entities that at some point becomes a text string. Like the location of
where somebody is– now it’s a city name. It doesn’t have an ID. Or whatever it may be. At some point, it becomes
just strings. And so that’s kind of the crazy
thing of being Google that we’ve taken on, which is to
actually keep going up, and to give everything in the world
an ID, and trying to model all the connections and
relationships between all the things in the world. And this is what we call
Knowledge Graph. So I’m going to talk about how
Knowledge Graph gets exposed as features, as a platform,
et cetera. But at its core, really what
we’re doing is creating some model of the world so we
have the context that we all have about the real world, and
can do things like answer questions, and converse,
and just have a model of the world. Another thing, in addition to
thinking about real-world things as abstract
entities, with the graph data model, we also can more easily
take on new facts about the world and create new connections
between things. And I think this is a lot
like the way people think about the world. So as we said, when Snoop comes
up with a new name or something else changes about
the world, it’s actually pretty easy to adjust
our understanding– much easier than it might
be otherwise. So what are some of these
things useful for, as I was saying? One of the things that Knowledge
Graph can help us do is answer very factual
questions like, how long is the 101? I don’t know if people have had
the pleasure of driving on 101, but even if you’re only
doing a section of it, it feels that long. But even more importantly,
there’s some subtle things about having this catalog
of real world things. Knowing what things exist
actually helps us better focus in on what somebody might be
looking for, or help a user. So if you type something
ambiguous like “giants,” there’s a lot of things in the
world that are called giants. There’s multiple sports teams. There’s the mythological
creature of giant. In this case, in San Francisco,
I’m interested in seeing if there actually
is a Giants’ game. But it becomes very clear,
rather than having to hunt through results quite as much,
I can see that, no, San Francisco Giants is what I’m
really interested in, on the right there. And if I click on that, it
automatically refines the query to San Francisco Giants. And you start to get very
relevant information and results about the San
Francisco Giants. So one is that they actually
are out of town right now. But you can see when
the next game is. And a bunch of information
that we have in Knowledge Graph about the Giants– who
the manager is, the roster, and even where they play,
which is fairly nearby– AT&T Park. And so you can even click
through and refine to that and see information about AT&T
Park– where it is. Click through the Map to
actually get directions there. But we can go a step further. So we’ve sort of summarized
these things, exposed some of the connections between
the real world things. But also it can help discover
things, that when people are searching for AT&T Park, they
may be interested in the other sports venues around town. Or they might just be a tourist
and wanting to see other tourist attractions. So often when people are
searching for AT&T Park, they’re also looking
at things. If you look– it’s hard to
see at the bottom there– things like the Ferry Building,
or Fisherman’s Wharf, or Union Square, all
the well-known tourist attractions. So one thing from a developer
perspective before we move on to structured
markup and the web as a whole– but I did want to take
an aside for a minute and make it clear that actually the
Knowledge Graph itself is available to developers. So any information that we learn
about the world that comes to us from openly licensed
sources, so that we actually can redistribute it or
share it with the world, is available with Freebase
and the Freebase APIs– so developers.google.com/freebase. That’s a lot of stuff. So 40 million entities and 1
billion facts covers a lot of knowledge of the world,
which is very useful. So if you don’t have the time to
give IDs to all the things of a certain kind, you can use
Freebase as a catalog for those things. And there’s lots of APIs and
services for being able to match your data, for being
able to do entity autocomplete for lookup, get
facts about an entity, all kinds of stuff. So if you’re interested in that,
there’s two sessions. There’s one session later today
on search and suggest with the Freebase API. And there’s also a workshop, a
hands on thing, that I think sounds like it’s going to be
pretty cool tomorrow that also involves YouTube. So I don’t remember the
exact times for either of these things. But we all have schedules. OK, so so far I’ve been talking
entirely about the Knowledge Graph and how that’s
changing Search. But this is all based on just
this abstract model of the real world. What about the web and how
content and information from the web connects
to all of this? So I wanted to actually start
to show some stuff here. So how many people here are
familiar with rich snippets? Is this something that’s
fairly OK? About a third. All right, so maybe I’ll start
even a little earlier. So there has been a feature– so I was talking about the Star
Trek computer earlier. And there happens to be
a new Star Trek movie. So there has been the ability
for a while in Search, that given some structured
understanding of a page, that we could enhance the snippet. So normally a snippet in search
results is some section of the page that Google
believes is relevant to the query. But sometimes you can have an
enhanced presentation of that, like if we really know that
there’s ratings or reviews, for example, or events, or
information about a movie, those sorts of things, it can
give a more structured result that makes it a little bit more
obvious about what’s on the page, what it’s about,
to help the user choose the right result. But with Knowledge Graph, we can
go a step further, which is that it’s not just that this
is structured information from that page. It’s actually clear that that’s
about the movie, “Star Trek into Darkness.” And so
whenever we’re displaying information about that movie,
in whatever context the user might be, we actually know that
it is associated with it and can highlight it. So when I see the “Star Trek
into Darkness” knowledge panel, I can see the review
scores, I’m like, oh, that’s pretty darn good. I tend to focus on the outliers,
the negative, when I’m trying to figure
these things out. So then I can very quickly from
the movie get right to the most negative things said
about it, and make my decision based on there. Some people might like to do
it the other way around. So that’s one example of
associating the existing structured data on the web to entities in that
same context. But that’s not all
that exciting. That was all on the page. But we can go a step further. So say a friend of mine mentions
that they’re going to a Mumford and Sons concert. And I’m like, I don’t
even remember. Like, I didn’t know
there was one. Where is it? When is it? That sort of thing. If I even just search for the
band, I’m able not just to have information about these facts
about who’s in the band, and what are their songs, and that
sort of thing, but from the web, being able to learn that,
basically, what are their upcoming events, what
are those concerts. I can see, well, in fact, it’s
next week in Berkeley. But I think the key thing here
is I didn’t know when and where it was. But now it was very
easy to find. And if I click on it, I get a
very focused experience about that specific thing. And so I immediately see,
OK, it’s at the Greek. And if I want to buy tickets,
it’s right there. I can find out more
about the venue. I can click through to actually
buy tickets, and got to that result very quickly,
even though I only just knew it was Mumford and Sons. I didn’t have to figure out
how to find that specific event or where it was. So these are some very early
features about how structured information and content from
the web can connect to Knowledge Graph. Like I said, I think we’re at
the very, very beginning. This was leveraging the
stuff that already existed from rich snippets. There’s so much more that
I think can happen. And we’ll get into some of the
new features towards the end of the talk. But I think the key thing is
that if Search is really starting to think about things
in terms of the real world and these entities and how they all
relate, it’s going to be important to actually understand
websites and their content in that same way in
order to be able to surface it in the seamless way that users
are now expecting. When you’re talking about
answering questions over Voice on Mobile, or via Glass, or
whatever, that it needs to be as seamless as possible. And it takes that level
of understanding to be able to surface it for users
at the right time. So with that, I was actually
going to hand it over to Dan to actually get into
the meat of this. This is all the vision stuff
and the context for this. But we want to get into specific
[INAUDIBLE], like, well, how does this happen
and where is it going? DAN BRICKLEY: So I think
that’s a great slide. Because it shows, on the right
hand side, the world of schema.org, and on the
left, the world of the Knowledge Graph. And I love Freebase. I love the Knowledge Graph. I carry around a lucky copy of
Freebase on my data stick. The web is a lot, lot bigger
than Freebase. It’s a lot messier. So I’m going to be talking
mostly today about structured data in the context
of schema.org. schema.org is for the rest
of the web, for that big, sprawling chaos. And I’ll start to develop
project, about its origins and partners, and the community
we’re building out at W3C. I’ll also walk you through a
bit of markup, just to show you how simple it can be, and
a bit of the data model that sits behind it. And you’ve already seen the
data model already. It’s a graph. It’s attribute
value pairs. And finally I’ll say a few
more words about how schema.org fits with the
Knowledge Graph, and give a quick look at some of the tools
we’re building up to make your structured
data lives easier. So that’s the website. It’s not the most exciting
site in the world. It’s patterned after
sitemaps.org which was another collaboration with partner
search engines. So where did this come from? Why schema.org? Well, to describe real-world
things in a structured way that you can understand, that
others can understand, you need some vocabulary. And then the question is,
which vocabulary? A couple of years
ago, [INAUDIBLE] from Google had the insight
that search engines really should be cooperating on these
vocabularies rather than competing, and that by doing
so, webmasters’ lives, developers’ lives, publishers’
lives would be a lot easier, and that this would broaden the
adoption and utility of structured data on the web. So [INAUDIBLE] got talking with teams over at
Bing and Microsoft, at Yahoo, and they were thinking
similarly. So from those conversations
emerged schema.org– ta-da. So since that launch,
we’ve seen quite impressive adoption. And we’ve been joined,
fairly early on in the project, by Yandex. And we’re talking now with Baidu
on how schema.org can work better in China. So maybe you weren’t expecting
to see these company logos in a Google Talk. But the thing is, we’re proud to
be collaborating with these teams from Microsoft, from
Yahoo, from Yandex. schema.org, with its global
ambitions to improve the whole web, just would not be
possible without such partnerships. We think the collaboration’s
going well. We’re happy, not sad. It’s going well for the
project and for the web as a whole. But it’s not just these
four organizations. So a number of individuals,
groups, organizations, companies have stepped forward
to propose extensions to schema.org since we launched. I’ll run through a few of these
here that we’ve already adopted and talk about some of
the things in the pipeline. Pretty early on, we had the
rNews group from IPTC propose a vocabulary for describing
news articles. That went in late 2011. We added late last year a
language for e-commerce for describing offerings
on the web. We very recently added
a language for describing data sets. I think you saw yesterday the
statistics showing up in front page of Search. So that’s, I think, a taste of
what can be coming as it becomes easier and easier to get
hold of data sets, and to understand just enough about the
data set to integrate it, cross reference it, and
to provide some user interface around it. So that’s really
exciting work. And that’s picking up on work
that was done at W3C in the linked data working group
for government. Another really exciting
development, and, again, recent, was the LRMI, Learning
Resource Metadata Initiative. So this is describing open educational learning resources. That was co-led by the
Association of Educational Publishers and the Creative
Commons. There’s a whole pile
of other groups. And then this is
a tricky problem. We’ve got one schema. We’re trying to make it as simple
as possible for all topics, for all domains, for anything
that could be on a web page anywhere. No one really knows how to
standardize these things. I used to work for W3C. I’ve seen web standards
made for technologies. And the key there is to be super
focused on one thing. schema.org is super focused
on the entire web. So we’re making up the process
as we go along. We’ve created a community
at W3C. We’ve made it as open
as possible. The four search engines
step back and take an oversight role. They’re a handful of search
engineers from four competing companies– they’re not the right people
to come up with descriptive vocabularies. But they do share the role of
making sure these things can hang together. I think going forward, W3C, the
community around W3C, the open mailing list that we
have is the place to be. It’s all linked from the
schema.org site. You’re very welcome on
the mailing list. Yeah, come along. So schemas in practice. Before we jump into the markup,
I wanted to say a few words on the basic thinking,
the conceptual model behind this. And it’s not complex. The point of schema.org is to
serve, essentially, as a dictionary, listing kinds of
things, kinds of properties, for each type of thing that you
might have a description of in a web page. So you see a fragment
of the schemas here. Markup describing a
creative work, it might mention an author. It might mention the intended
audience, the date it was created, and so on. The hierarchy is very flat. We don’t try and model the
world super formally. It’s just a way to organize
our properties. And so, for example,
a video object– the description might use a
video frame size property, alongside a bit rate property
by virtue of being a media object, or genre, because all
these things really are creative works of
different kinds. schema.org is just this huge
collection of properties organized by their association
with these types. So for each property
we say which types they make sense with. There will be a list of
properties that work with local business, with
[INAUDIBLE], with blog, with exercise plan. Developers should recognize the
data model immediately. Let’s move on to markup. It’s attribute value pairs. It’s graphs. It’s things related
to other things. So this basic data model
is decades old. But when we launched, we had
to become more specific, markup exact computer structures
to put into HTML. We launched using HTML5
microdata, which gives you a handful of attributes that
mix into your markup. W3C’s RDFa Lite is another
very fine alternative. As far as Google is concerned,
they both work. In either case, the conceptual
model is the same– things, properties, values, attribute
value pairs. So how does this look
in practice? So let’s describe a couple of
things, keeping it simple– a movie and a person,
if the person was an actor in a movie. So this basic example gives us
enough to understand how schema.org works, and three of
the essential patterns that go into schema.org markup. The examples are given in
microdata syntax, hence the words here: itemscope, itemtype,
itemprop. RDFa would look very similar. And these are not just
two examples. If you go to Rotten Tomatoes,
Netflix, IMDB sites today, things you saw earlier, you’ll
see this marked up, live in the wild web. So here we see the simplest
microdata example. We’re effectively saying,
there’s a thing. It’s a movie. And here’s its name and here’s
its description. You can’t get more unexciting
than that. It’s great as far as it goes. But how can we do more? How do we integrate the
relationships between the things that we’re describing
on the web? What if we wanted to talk
about, say, the actor Johnny Depp? What have we got here? Why would we want that
relationship? I think structured data, it’s so
much more useful when it’s connected into larger
structures. Every new fact we learn about
the actor indirectly teaches us just a little about the
movie, and vice versa. So let’s add an actor. This second paragraph here– you’ll see we’re writing a
paragraph that says, in effect, and that movie has an
actor relationship to a person with the name Johnny Depp. schema.org’s job is
simply to be the collection of those terms– actor, person, name– and to grow and grow as people
ask for new types of thing, new types of relationship. We can go a little
bit further. Third scenario– here we see the actor link
points off to a whole page about the person. It’s great describing Johnny
Depp inline, but we can’t say everything we want to say
about Johnny in the page about the movie. So we start to break up our
descriptions and spread them across the web– a little bit of
the movie page, just enough to identify him, and then
hyperlink across to maybe a landing page on IMDB
or somewhere. So a href equals a pointer to
the page about Johnny, and the relationship becomes
a typed link. It says a bit more about the
association between the movie and the person. All right, so you’ve seen now
the three core patterns at the heart of schema.org markup. There are attribute value pairs,
where the value is A, something simple like the string
“Johnny Depp.” Simple– this is the oldest data
model in the book. B, a thing, but you describe
it inline with markup in the same page– more attribute value pairs. They’re recursive. And C, the interesting one, is
where you start to spread this across the website– so
attribute value pairs, where the value itself is a thing, and
the thing is described in more detail on a linked page
somewhere, hopefully a page that has further schema.org
markup. So schema.org makes it possible
to do useful things with the simplest of markup, but
also to provide mechanisms that allow descriptions from
multiple pages to be composed into larger and larger linked
graphs of structured data. sameAs– so we want to grow out
this graph of data. And we want to ensure that
Google understands what entities your content and
services are pertaining to. And then to help Google and
the other search engines understand this, we recently
proposed a new schema.org property called sameAs. And sameAs means pretty much
what you might expect– that it’s the same entity. So you can easily point to
well-known pages for that same entity elsewhere on the web. So examples are Wikipedia or any
Wikidata project, Freebase that we heard about earlier,
IMDB, these other well-known sites. Any of those pages can serve as
a proxy identifier for the thing they describe. With the sameAs markup, you
can make your web pages unambiguous about whether you’re talking
about Paris Hilton or Paris the city. If you want to use
a bit of hidden markup, that’s just fine. Slightly different bit
of markup here. That’s pretty much it for
the markup examples. So I just wanted to try and
put the modern graph data models in context a little. So the web of linked data idea
is not a new idea. schema.org, alongside the Knowledge Graph,
has essentially the same graph-shaped data model as
W3C RDF, for example. Wikipedia’s Wikidata project
also recently adopted this property-centric data model. The typed links idea is
older than the web. But since the web itself
recently celebrated 20 years as an open platform, I thought
let’s revisit the words from a page from 1992 on link types. This is one of the very first
web pages, recently restored to the web by CERN. I think [INAUDIBLE] probably
wrote this. He said, “Some link types for
example express relationships between the things described by
two nodes.” And that’s it. That was there, probably
the fifth or the sixth page on the web. This idea has been there on
the web from the start. It’s not a surprise. It’s been there in logic
for centuries. schema.org’s role is to be a
provider of this vocabulary, particularly for relationship
types. Many kinds of things it
describes in the web– we name our types as a way
of organizing these relationship types. So place or organization might
be associated with containedIn, founder, name. Some of these properties
take strings. But a lot of them take things. And in a web context, things
can be represented by relationships and by links. So at the heart of schema.org
and each schema.org vocabulary is this very old and
foundational idea, that it’s useful to give names to
properties that express relationships between the things
described by two nodes in the web. So if that was there right from
the start, right from day one of the web, why is
it taking us so long? In less than two years of
schema.org rollout, we’ve seen tons of this stuff out there. I think what’s become clear to
us is that when publishers have a clear and growing incentive
to add markup, up it goes. I don’t have statistics handy,
but there’s a lot of it out there, spread well
across the web. And the alliance of search
engines supporting schema.org have given a huge boost
to this way of thinking about data. It’s actually useful now. Secondly, schema.org did make
some subtle design tweaks to make life easier for publishers
and to aid mainstream adoption. So I’d like to expand on that
last point briefly. We at schema.org try and
provide a unified documentation hub. The point of this is that we
don’t want busy developers, busy publishers, having to
rummage around the web piecing together bits of vocabulary– something here for geography,
something here for people, something here for places, each
wedded to a different data model or different
syntax. And so we’ve taken on
more of that work. We’ve worked hard to unify
these different ways of describing things so that you
just go to schema.org, find a useful piece of vocabulary, and
get on with your lives. We’re trying to find a
balance by drawing together independent designs, but
providing a consistent environment and a consistent
data model for widespread adoption. I just wanted to say that our
approach owes different things to different communities. From the linked data, or from the
semantic web community, there comes a concern for the
graph data model, and a concern for supporting
decentralized descriptions. You shouldn’t need everybody in
the same room when you’re creating a schema. You should have conventions that
allow different groups to go away, come up with some
list of types, list of properties, and a model
that allows you to plug them all together. From the microformats community
and the HTML5 community, we’ve taken a concern
for simple, clear, and developer-friendly markup. And as a semantic web person,
I think that’s been huge. A lot of the semantic web work
is amazing, but is super hard for developers to
get a handle on. So we’re really trying to be
simple, easy, and clear with the schema.org site. So these conversations between
different groups in the web community, they’re
rarely easy. But I think they have led to
stronger web standards. So W3C’s RDFa Lite, that learned
a lot from microdata. Microdata learned a lot, obviously, from RDFa 1,
and microdata in turn also drew upon microformats. So as far as Google’s concerned,
it’s the graph that matters, the graph that
we care about. We can get the graph
from RDFa, from microdata, from JSON-LD. Schema.org’s approach lets us
make parallel progress on syntaxes and on schemas. So every new and improved schema
can be used immediately in any syntactic format
you like. Some of the smart mail scenarios
you’ve heard about this week are using
JSON, for example. It doesn’t matter to us. It’s all a big graph. Maybe you prefer Drupal,
XHTML, RDFa. Again, that’s great. That’s your choice. That’s your website. Go crazy. The approach we’ve taken allows
different sites to share common data vocabularies
even if they disagree on syntax. All right, so I’m just going to
run quickly through some of the webmaster tools. There’s a session following this
one that goes into a lot more detail on these tools. Firstly, schema.org itself– all right, don’t get
too excited. I’m going to show
you a website. If you visit the schema.org
site, you’ll find a very regular website– very structured. Essentially there’s a
page for every type. There’s a hierarchy. So if we go to the documentation
section– the full type hierarchy– so basic types. Thing– so in the schema.org universe,
everything is a thing. And the interesting work is
saying, well, what are the subsets of thing? So the creative work itself
has subsets article, news articles, scholarly article,
medical scholarly article. We try not to go too deep. But these types give us
attachment points for properties, and also, in a
social sense, attachment points for collaboration. So perhaps the medical
collaboration– focus on the medical scholarly
article type. Codes, that came from the team
at Microsoft, data catalog, from the recent data
sets addition. It goes on. It’s smaller than Freebase. It’s smaller than Wikipedia. But it’s hard to memorize. And it’s got to the stage– let me give you an example– place of worship. So I’m really happy that we’re
collaborating with the Wikidata project at Wikipedia. So if you look here at this
example– place of worship is a civic structure, which
is a place. We listed at launch
I think six types of places of worship. There are many, many
you could list. And this is one of the
places where we say, stop, we can’t do this. This is where Wikipedia
comes in. And the collaboration with the
Wikidata guys at Wikipedia is giving us a story for how
schema.org gets big, but it doesn’t get too big. And then Wikipedia comes along
and the masses of thousands of Wikipedia community members can
figure out how many kinds of place of worship are there. We don’t want to
be doing that. We’re a handful of search
engineers collaborating while working for competing
companies. We’re not the right people
to do this list. So that’s been very important
for us, figuring out where to stop, and where to let
others plug in. All right, that’s probably
enough schema.org. So the Markup Helper
is a new tool. It helps you write markup. You put HTML in. You work with the interface,
and it helps you add this stuff without knowing the
standards inside out. I don’t know this
tool super well. So I’m going to shut up now,
point to these guys who will be speaking later. The Testing Tool I
use every day. This used to be called the Rich
Snippet Testing Tool– so now the Structured
Data Testing Tool. It shows you what Google thinks
your page is saying. It also shows you what Google’s
custom search engine will make of your page,
which is great. Because that’s a tool that can make
quite sophisticated use of structured data now. Thirdly, instead of thinking of
your structured data page by page by page, we’ve
started to see tools. So in the Webmaster Tools that Google
offers publishers, we’re starting to see tools
that show a view of your entire site– all the schema.org types
that you publish. I think as we start to talk
more about graphs, that’s going to become increasingly
important, that each of your pages tell a little
bit of a story. Your entire website can
start to be thought of as a data set. Finally, Data Highlighter– so this is a tool. I think of it as a way of
getting started quickly. You can go in here, annotate
sections of a page, and teach Google about your website. And Google will figure out, with
your help, how to turn highlighted sections of the page
into the graph data model of Freebase and schema.org. I think at this point I’m going
to pass back to Jason. JASON DOUGLAS: Those last
couple of things, the developer tools– the Markup Helper and
Data Highlighter– went very quickly
through those. I think they’re actually
incredibly cool, and make both learning how to do markup,
and just getting started experimenting with structured
data incredibly easy. So I would really encourage
folks who have any interest in this, since it’s part of the
future of the web, [INAUDIBLE]. You guys are next, right? OK, so there will be full
demos, deep dive. I’m sure you probably have no
idea what that was from just the screen shots. But as I said, we’re really
trying to make this easier. We’ve already seen a lot of
adoption of structured data given how difficult it’s
been historically. And I think as it gets easier,
we’ll see even more. And as I was alluding to before,
I think it’s really important because, increasingly,
this is becoming a core part of how we index the
web, how we understand the world, and how everything
relates to it. And so some of the stuff that
we’ve been seeing so far really is the beginning. It’s the early days of Knowledge
Graph itself. We’ve been talking about how
much it’s been changing Search, but that’s been over
just the last year. So coincidentally, today is the
exact one year anniversary of the launch of Knowledge
Graph. Like I said, it’s been
exactly one year. And so that’s what’s
happened so far. But I think there’s
a lot of new and interesting things coming. So one, I don’t know how many
people caught this yesterday. There was the smart
mail presentation. And if you think about it, we
get a lot of notifications or a very interesting emails that
have important information about our world over email. And those tend to come from
websites, from services. And those emails are
often in HTML. So if you think about it, well,
everything we’ve been describing with markup and
schema.org and all this stuff is a way of layering
this structured data on top of HTML– works just as well
on email as well. And so they’ve started
to do that. So this was what was announced
yesterday. And so some of it can be like
what we saw with rich snippets and Search, where instead of
just a text snippet, we actually saw a structured result
with stars, or the time and place of an event, or
that sort of thing. You can do the same thing with
something like a flight reservation, but even
go a step further. So again, by understanding the
actual real world context of this, we can actually join it
with flight status, which wasn’t in the email. I just know I have
a reservation. But now I can actually even know
whether the flight is on time or not when I am looking
back at what gate is it at– where’s my flight? I’m getting all this information
that I would need in real time as well. And this is being done with
schema.org vocabulary, and the same structure market techniques
that we’ve been talking about for the web. So that’s cool. And as I was saying, one of the
things about email is it’s your stuff. It’s about your world. And if we want to be this Star
Trek computer or that personal assistant, then you have as
many questions about your world, about your flights, and
your reservations, and all this stuff, as you do about that
general world knowledge that we’re talking about, like
what’s in San Francisco? So that’s things like, when
is my next meeting? When is my flight leaving? Where am I going? Where’s that birthday party? These kinds of questions that,
with this level of structured understanding of content, become
very answerable, so that you can actually find
this stuff, that you can search for it as if it were a
question, but even do things like trigger Now cards when
we’re talking about anticipating when this information
might be useful. OK, and then finally, and this
is one that I actually personally find especially
exciting, but it’s also very new– it’s at the public proposal
phase at schema.org, but we’re starting to do some things
with it already– which is, I’ve been talking
a lot about things, and entities, and all that
real-world stuff. And those are the nouns
in the world, right? Well, what about the verbs? Verbs are action. They’re where it’s happening. And we haven’t been providing
vocabulary for that to date. But now we are. So that can enable some things
like making information more actionable. So when we look at that flight
reservation and the flight status, actually being able to
check in with a single click, or RSVP to an invitation. And we can actually start
to create a vocabulary. And it’s basically a simple,
declarative API– just like HTML is simple
and declarative– for these kinds of things. We already had markup
for an event. So saying that there is an
action that you can do to an event like RSVP actually
becomes a fairly simple extension or layer
on top of that. One other nice thing about the
actions vocabulary is not only can it be used in this way
to provide action buttons, but we can actually use it to
describe the state of an activity in progress, or even
in the past, like somebody took an action, like
you did RSVP. And those things can be
described as well. And so I think it’s this stuff
where you actually start to see that this is more than just
connecting content to the things, but actually starts to
become more of a platform– like I said, a declarative
API– that you’re putting into your
website, but in a simple and open way, just like the web. So that then becomes available
to any search engine, to any client. So hopefully I gave
you some sense of where things are going. Like I said, the follow-up talks
are the markup tools and that help, which, if you have any
interest in implementing this stuff, are definitely
encouraged. There’s the Freebase
talks I mentioned. And then there’s also a deep
dive on the email and the searchable email stuff
later today as well. So with that I’ll
take questions. AUDIENCE: Hi. I’m currently publishing a lot
of schema.org marked up pages– about 400,000 entities
that I’d like to contribute to Freebase. The problem is that it’s legacy
data that’s cruddy. So I’m wondering, I guess, two
things– one, the process for actually getting our data into
the Knowledge Graph such that we can then start pointing to it
and other people can start pointing to it, I guess
via Freebase. And two, what, I guess,
requirements there are for the quality of the data that’s
coming in, and whether that can be accepted provisionally
and then cleaned up after the fact. JASON DOUGLAS: Right. So in terms of the technical
specifics, definitely encourage– there’s sandbox
hours and office hours and we can get into more detail. But I think, on the quality
question, that is a very big one. If you think about it, if you
are trying to give unique IDs to things in the world and
building up your understanding of the world by attaching
more things to that, errors are a big deal. They propagate very quickly. They mess up your understanding
of the world. So generally, the Freebase
community– and this is true with all the
work we do with Knowledge Graph and Google, too– think of at least 99% quality–
so a very high bar. And that’s not just
fact accuracy. In fact, the most important
part is actually the identity precision. Meaning that we don’t have
dupes, and that we don’t have over-clustering either. Meaning that we’re attaching
attributes of different entities to the same
entity when they’re actually different. So it is actually a
fairly high bar. But we can talk about it. We also have tools and things
that we’re doing to try to clean up data. Because there is a lot of
messy data in the world. AUDIENCE: Thanks. AUDIENCE: Hi. I’m wondering, is there any
collaboration with the library science community? They’ve been studying this
problem for a while. And then on the other hand,
Facebook’s Open Graph, and there’s obviously
overlap there. I was just wondering. DAN BRICKLEY: So let me talk
about libraries rather than library science as such. There was a teleconference an
hour ago, which is a W3C community group called Bib
Extend, and their mission is to extend schema.org’s coverage
of books and bibliographic content. And what they’re trying to do
is reflect into schema.org some of the conversations that
are happening in the library world, about how library
catalogs evolved from card catalogs to thinking about
entities and relationships. So that’s really our main
connection there. AUDIENCE: [INAUDIBLE]. DAN BRICKLEY: Ah,
we should talk. Facebook. JASON DOUGLAS: There’s a
couple things there. One is we try to support
whatever we can that’s out there. So when we talked about rich
snippets, the existing microformats standards, a lot
of existing standards out there are supported as
part of these things. But when it came to schema.org,
it was really important to have an open
process that was extensible, so that groups that have real
industry expertise or area of knowledge– [INAUDIBLE] talk about library sciences,
or government data sets, or these sorts of things– could actually go somewhere to
approach getting vocabulary added and extended. And the schema.org model
has fit that very well. That’s why all the search
engines are participating. And Facebook’s welcome
to as well. DAN BRICKLEY: Just one more
point on that is the underlying syntax– so when they launched the Open
Graph, I guess it was three or four years ago, they had
to pick a syntax. And the best thing that was
around at the time was RDFa 1, which they stretched a bit. And RDFa 1.1 was a response both
to that stretching and to our choice to go
with microdata. So the latest work at W3C, RDFa
Lite, is very close to what we’ve been doing with
microdata and what Facebook has been doing with
their RDFa. So I think at the syntactic
level, there is a convergence path. That should make things lighter
on your web page, even if there are two vocabularies
there. JASON DOUGLAS: Like I said,
we make our best effort to understand everything
that we can. But it’s easier when
people share. AUDIENCE: OK, my question is
what are the databases you use for keeping your Knowledge
Graph? Maybe it is [INAUDIBLE],
maybe it is MySQL. JASON DOUGLAS: So yeah, what
we use internally for representing the Knowledge Graph
is proprietary, like most of Google’s
infrastructure– or not proprietary, but was
developed to deal with the scale that we’re dealing with. But yeah, there are an
increasing number of tools out there for graph data as I think
people are starting to realize that it has some
benefits as a data model. And I certainly hope to see more
of that, that the tool set for people managing
graphs on their own continues to develop. AUDIENCE: In my opinion
Knowledge Graph, it’s very interesting kind of graph. Maybe you do have
a [INAUDIBLE] paper about some characteristics
of this. JASON DOUGLAS: Yeah, I think
there’s several things. The idea of semantic
graphs has been out there for a while. There’s a lot of academic
papers on that. There also was a SIGMOD paper
a few years ago on– so Knowledge Graph started
with Freebase at Metaweb, which was a startup acquired
by Google a few years ago. That’s how I came to Google. And the technology that we used
at Freebase, there’s a technical paper or a SIGMOD
paper on that. It’s called graphd. It’s not
what we’re using today. But it’s interesting in terms
of thinking about, what does it mean from a database
perspective to have a fast queryable system? AUDIENCE: You mentioned one of
the points of schema.org is to have something that’s
easily extensible. But what isn’t very clear to me
on the site is how you go about getting new data types
put in, how you go about suggesting them. I work with a lot of data. I work with a lot
of legal data. And most of the data does
not fit in any type that you have available. I’m also very annoyed that you
called schema.org/attorney a business instead of a person. Because an attorney
is a person. A law firm is a business. An attorney is a person. But for things like case law,
code, stuff like that, they just don’t fit. The closest they fit into is
scholarly article, but it’s not scholarly article. So I really want to try to mark
it up and submit to you guys, here’s all the pieces
of information. Let’s get it out there. DAN BRICKLEY: So to be honest,
initially we hid it away a little bit. Because we weren’t sure if we
were going to be drowning in thousands of proposals from
SEO advocates and so on. And we’ve gradually made it
easier to find on the schema.org site. There’s a community at W3C
called the Web Schemas Group. And as I said earlier, we don’t
want to make a formal standards committee. But we wanted to make a place
where people could come and collaborate. And there’s a mailing list,
[email protected]. And if you can’t find that
link, I’ll make it bigger later today. Let’s just chat afterwards. JASON DOUGLAS:
[email protected] is the key mailing list. And generally what happens is
somebody makes a proposal of, we think the schema
falls short, exactly like you’re saying. So for law, it’s not
up to snuff. And so you can see in some areas
like in medicine, there was a major proposal made. And so now it’s very rich around
health and medicine. We’re going to need that for
lots of verticals, and that’s why we want to do it in this
way is because we don’t understand these verticals as
well as the people who work with that data every day. And the more that these
vocabularies can actually reflect the data– well, you
see, we were talking about actions as one of
the new things. DAN BRICKLEY: All kinds
of crazy discussions. We were arguing about writing
“labor union” or “trade union–” eventually settled on
“workers’ union,” because “labor” is spelled two ways
in different countries. Those are the kind of
conversations we have. We’re working to get more of the
corporate conversation in public, too. And that’s going well. You can see some of the bibliographic discussions there. I just wanted to show you
quickly before we’re kicked out of the room– the Web Schemas Group,
there’s a wiki. This is all on the W3C. And there’s a table of
extension proposals. So this top table are
things in progress– citations, organizational
departments, which should be due this week, same as email
messages, TV and radio improvements, comics. And then down at the bottom,
all the things we’ve added. So quick answer– sling something there. Start a thread on the
mailing list. The more consensus you can build
the more likely it is it will go in. JASON DOUGLAS: All right,
one last question. AUDIENCE: All right, so this is
perhaps a related question, and it may be more on the Knowledge
Graph side of things– are there any plans to index
things that are a part of our real world but of a less physical nature,
ideas, like more of the Wolfram Alpha, what’s
gravity, things like that, maybe philosophical ideas. I know there’s overlap
with Wikipedia there. JASON DOUGLAS: There’s more of
that stuff than you might expect already in
Knowledge Graph. I think we try. Like everything at Google, if
people are searching for it or trying to get answers around it,
then it’s going to be an area of focus. DAN BRICKLEY: OK, Google,
what is love? JASON DOUGLAS: That
are answerable.
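
Two ideas from the talk lend themselves to a short sketch: the type hierarchy Dan walks through (everything is a Thing, with subtypes such as place of worship under civic structure under place), and Jason's point that an action such as an RSVP is a thin layer on top of existing event markup. The Python below is only an illustration under stated assumptions. The type names use schema.org's spellings, the `potentialAction` and `RsvpAction` terms come from the present-day schema.org vocabulary rather than the proposal stage described in the talk, and the helper names, event details, and example.com URL are hypothetical.

```python
import json

# Parent links for the types named in the talk (schema.org spellings).
# This is a hand-written subset, not the full schema.org hierarchy.
PARENT = {
    "CreativeWork": "Thing",
    "Article": "CreativeWork",
    "NewsArticle": "Article",
    "ScholarlyArticle": "Article",
    "MedicalScholarlyArticle": "ScholarlyArticle",
    "Place": "Thing",
    "CivicStructure": "Place",
    "PlaceOfWorship": "CivicStructure",
    "Event": "Thing",
}


def ancestors(type_name):
    """Return the chain from type_name up to Thing, most specific first."""
    chain = [type_name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def event_with_rsvp(name, start_date, rsvp_url):
    """Build a JSON-LD style description of an event with an RSVP action.

    Hypothetical helper: the action is layered onto ordinary event data.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Event",
        "name": name,
        "startDate": start_date,
        "potentialAction": {
            "@type": "RsvpAction",
            "target": rsvp_url,  # placeholder URL, not a real endpoint
        },
    }


if __name__ == "__main__":
    # Walk from the root down to the specific type.
    print(" > ".join(reversed(ancestors("PlaceOfWorship"))))
    party = event_with_rsvp(
        "Birthday party", "2013-06-01T18:00", "https://example.com/rsvp"
    )
    print(json.dumps(party, indent=2))
```

Because the action is just another property on the event, a consumer that only understands events can ignore it, while one that understands actions could offer a button for it.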
