Knowledge mining using the knowledge store feature of Azure Search

>>You’re not going to
want to miss this episode of the AI show where we take Azure Search to the next level with something called knowledge store.
Make sure you tune in. [MUSIC]>>Hello and welcome to
this episode of the AI show. I’ve got a special guest. We are talking about
search but a special kind. Tell us who you are and
what you do, my friend?>>Hi. I’m Vinod Kurpad. I’m a Principal Program Manager
on the Applied AI team, and I work on Azure Search.>>Fantastic. So tell
us about Azure Search real quickly for
those who don’t know. Then let’s talk about how
AI is making it better.>>Great. So Azure Search
is a search as a service. So essentially, when
you think about search, there’s a number of different ways
that you could use search, SharePoint Search for example, but Azure Search gives you all the rich controls that you’d need to build
a custom search experience. So things from custom analyzers to custom ranking,
custom scoring profiles. So you can build
a search experience that matches exactly what you want to
provide for your end customers. That’s the overall idea.>>Got it. So I’ve heard of something called cognitive search. Is that the same thing? Is that a different thing? Or is it all the same thing, just seen and understood a little better?>>Yeah. Absolutely. Cognitive
a little better?>>Yeah. Absolutely. Cognitive
search is essentially a feature within Azure Search
that allows you to enrich your data using
AI skills as you’re moving the data from your
data source into your index. So think about one of the examples that I’m going to demo today: sentiment, sentiment on reviews. So you can now ingest the reviews data but also enrich it with sentiment analysis, to discover which reviews are negative and which are positive, and you can store that information in the index as well. That gives you a richer search experience, because not only do you have the review text, you also have the sentiment score associated with each of those reviews in the index now.>>So basically now you’re
not just indexing over what’s in the actual document, but you can enrich it with
any cognitive service?>>Absolutely. Right. So that’s the whole objective: you should not only be able to enrich it with any cognitive service, but you should be able to build a custom, domain-specific skill
as well that you might have. So most organizations have unique IP within
their organization that deals with the data that
they constantly see, and you should be
able to take that IP, package it as a skill, and make it available to your enrichment pipeline as well. That can then enrich your data with your skills in addition
to cognitive skills.>>So you’re talking
about skills and I feel like you have a process diagram. Can you explain how
this process works?>>Yeah. So if you think about it, you have data sitting
in a data source. That data might live in Blob storage or SQL tables. There’s a whole host of options
that you might store your data in. You should be able to
take that data out of your data source with
what we call an indexer. Think of an indexer as nothing more than a crawler that will reach into your data source, determine what’s changed, and only index the information that’s new or updated. So the indexer takes that data and brings it into the enrichment pipeline. We have a notion of a skill set, which is essentially
a collection of skills. Those collections of skills could be either cognitive skills or skills that you might
build for yourself. Those skills are
applied on the data as it’s moving through
this enrichment pipeline. Then at the end of
the enrichment pipeline, you have what looks like an enriched document, or an enrichment tree. You can then shape
that enrichment tree to match the schema of your index, as well as now, we have a new feature
called the knowledge store, which allows you to then project this information into the
knowledge store as well.>>I see. Because I remember running this on images, which blew my mind. Right? Because it actually put a label on there, this is a giraffe, and if you were able to search for giraffe, it brought up pictures. But now you’re saying there’s even a new feature that people can use to project that data into a special place as well?>>Exactly. Right. So
if you think about it, you’re going to be spending a lot of money enriching a lot of data if you’re a large customer running this over a real-scale environment. You may not necessarily have search as the only way you want
to use this data, right?>>I see.>>So you might have other options for how you want to use this data, whether that be an in-app experience, whether that be for
retraining your models, whether that be for training a new model.>>For anything.>>For anything, right? So there’s
a number of scenarios where customers have come to us and said, “It’s great that we can do this. But how do I use the same data
to retrain my models?”>>That’s really cool. So the data is basically forking: you can use it for indexing and searching, and you can use it for whatever other things you want. That’s what the enrichment pipeline has done.>>Absolutely. The other
thing to consider is the enrichment pipeline is
a per document enrichment. So you are in a situation where each document that flows through the enrichment pipeline gets enriched using the skills
that you have. But what if you want to do something that’s corpus-wide? TF-IDF might be a great example of one scenario like that. You might want to build a custom vocabulary. You might want to build custom word embeddings. So in those scenarios, you need skills that are able to act on the entire corpus of data and not necessarily just on a per-document basis. The knowledge store allows you to essentially unlock those sets
of capabilities as well.>>I feel like we’ve
talked a lot about it. Maybe you can show us how this works.>>Absolutely. So let me
switch to a quick demo. What we’ll do is we will start
looking at the Azure portal. What I’m going to do is I’m going
to walk you through the process of taking some of that reviews data set that
I just talked to you about. Then running that through
the cognitive search pipeline, being able to use the knowledge store to be
able to export that data.>>Which is a new feature.>>Which is a new feature.
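As an aside for readers following along outside the portal: the data source, skillset, and indexer he is about to create can also be defined against the Azure Cognitive Search REST API. The sketch below only builds a skillset payload as a plain dictionary, including the new knowledge store section; the skillset name, column names, and connection string are hypothetical placeholders, and the exact JSON shape should be checked against the current API version.

```python
import json

def build_skillset(name: str, storage_conn: str) -> dict:
    """Sketch of a skillset payload: sentiment and key-phrase skills over a
    reviews column, plus a knowledge store section that projects the
    enrichment tree into Azure Storage. All names are illustrative."""
    return {
        "name": name,
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
                "inputs": [{"name": "text", "source": "/document/reviews_text"}],
                "outputs": [{"name": "score", "targetName": "sentiment_score"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "inputs": [{"name": "text", "source": "/document/reviews_text"}],
                "outputs": [{"name": "keyPhrases", "targetName": "key_phrases"}],
            },
        ],
        # The knowledge store section is what this episode introduces: it
        # projects the enriched documents into tables and/or blob objects.
        "knowledgeStore": {
            "storageConnectionString": storage_conn,
            "projections": [
                {
                    "tables": [
                        {"tableName": "Reviews", "source": "/document/review_shape"}
                    ],
                    "objects": [],
                }
            ],
        },
    }

payload = build_skillset("ai-show-skillset", "<storage-connection-string>")
print(json.dumps(payload, indent=2)[:120])
```

A payload like this would then be sent to the service’s skillsets endpoint with the usual `api-key` header.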
Then what I’ll switch to when I’m done is how easy it is to layer a Power BI interface on top of that, to just
be able to visualize that data and be able
to explore it and just be able to catch
a few critical insight.>>That’s Power BI over the knowledge store that
we’re now moving data to.>>Absolutely.>>Got it. Let’s take a look.>>Right. All right. So I’ve got data sitting in
Blob storage right now. So I’m going to click
my Import data workflow. Within my Import data workflow, I am going to create
a new data source which is going to be
Blob storage again.>>This is Azure Search, and this is pointing your index at Blob storage.>>Exactly. I’m going to be
creating three assets here. So I’m going to be
creating an indexer. I’m going to be
creating a data source, and I’m going to be
creating a skill set.>>Got it.>>With those three assets, I’m going to configure
them to work with each other to be able to extract the
data out of your data source, enrich it, and layer
it into the index.>>Got it. So when you’re importing data, you’re basically saying, this is where the stuff is. The indexer is the worker, and the indexer will push it up into the knowledge store as well.>>And the index.>>Got it. Perfect.>>So I’m going to call this
the AI show data source. You can tell your indexers or your data sources what information you want out of them. So in this case, I still want the content and the metadata. The metadata is things like which file it’s in, so you get all that information, the semantic information about where this data is stored.>>Better to have more than less.>>Exactly. So you could
use it if you wanted to. If you didn’t, you
can always shape it.>>Absolutely.>>Right? In this case, the parsing mode allows me to determine how I’m treating this file.>>Absolutely.>>So the parsing mode,
so for example, if I was dealing with PDFs,
I would leave it at default. Then it will extract all the text, all the images, and then it
would allow you to act on those. In this case, I know I’m dealing with delimited text files. So I’m just going to choose that.>>Does it figure out how
to parse it correctly?>>Yes. So the default parsing mode is, well, default. So essentially, I could still use default as the parsing mode for this particular file, but that would result in a cardinality challenge.>>Got it.>>So if you think about it, a CSV file with many rows is a single file in your data source, but it’s going to result in as many documents in your search index as it has rows when I use the delimited-text parsing mode. If I use the default parsing mode, it’s going to be a cardinality of one-to-one.>>Got it.>>For each document in my data source, I’m going to get one document in my index.>>Cool.>>All right. So in this case, I know my first line has headers. I know my delimiter is a comma. I’m going to choose
a connection string to connect it to a storage account. In this case, I am going to pick my knowledge store
demo storage account. I know I’ve put my file in
this container called the AI Show.>>We basically named it the AI Show, so it would be at the top.>>Exactly.>>Of all of the demos, that’s why.>>That’s perfect. I never thought
of that but that’s a good idea. So at this point, it’s doing a little bit
of inferencing. So it actually opened up my file, looked at the schema and
validated that it can actually work with it because
that’s what we do. Because in order for us to make some recommendations
on the index schema, we need to be able to look at your data to see what data we’re dealing with. So the next step is essentially
setting up my enrichment pipeline. So in this case, I’m going to attach
a Cognitive Services account. The reason I need to do this
is we support a free tier which allows you to enrich
up to 20 documents a day. But if you’re trying to do
anything more than 20 documents, you need a Cognitive Services key. The Cognitive Services key
is only used in cases where we are actually performing an enrichment for you
using Cognitive Services. But you still need to
make it available for us to use if you have a data source.>>This is what would
be called the skill.>>This is exactly what we’d
be using for a skill. The thing about it is, we do a lot of work around how we perform these enrichments for you in the most efficient manner possible, and we only charge you for enrichments that actually succeed. So you can’t really call Cognitive Services yourself as a skill and be more efficient
than we are for you.>>Right.>>So this is where I am actually going through
the process of adding some enrichments to
my enrichment pipeline. So I’m going to call
this my AI Show skillset. Now, if you notice here, it’s actually looked at my file
and it’s determined what are the different columns that
are available for me to use. So I’m going to be selecting “Reviews Text,” since that’s the actual column that I want to enrich. Now, at this point, I could use my reviews text as-is, but Cognitive Services also has limitations in terms of the payload size that it can accept. So if you’re dealing
with a really large file like thousands and
thousands of lines, you’re going to have
to chunk this up. So we provide you the ability
to split and chunk things up. So in this case, since it’s a review, I know it’s going to be less than 5,000 characters, so I’m just going to choose that chunk size. Once I do that, I can now select which skills I
actually want to run. So in this case, I’m going to do
things like detecting sentiment, I’m going to detect language, and I’m also going to
detect key phrases for now. I can also choose to do other things, but I think those are the ones that I really care about for
this particular demo. Then the last thing
I need to do now is determine how I want to project this information out into
the knowledge store.>>This is the new part.>>This is exactly the new part. So what the knowledge store comprises is two sets
of capabilities. You can either project
it as tables or objects. Objects are nothing more
than structured JSON.>>Got it.>>But they’re stored in Blob. Tables are stored in Azure Tables. Both these are scenarios
that you could use. You could use either one or both. It really depends on how
you plan to use the data.>>That’s cool.>>So now, I’m going to choose an existing storage
connection string. I will probably choose a new dataset, enriched data.>>Again, as we’re looking at it, this is the destination you’re going to be sending the enriched data to.>>Right, exactly. So I’ve selected
a Blob storage account. Now, the portal experience for Tables is that we start to look at your data and we determine what might be the different options that you might want to store. But this is essentially controlled through a skill called
the Shaper skill. This is one way that you
can shape your data but there’s other ways
that you might shape your data based on how
you plan to use it. So all you’d have to do is write a different Shaper skill that creates the data in
a different shape. So the enriched information is being projected into a specific shape before it’s stored in the knowledge store.>>Got it.>>You control what that shape
needs to look like.>>It’s basically a shaping skill that forms part of
that whole pipeline.>>Right. So you can take
different nodes of the tree, parent them under different objects, and then be able to create
a new shape that says, “Okay, I want my data
to look in this shape.”>>That’s cool.>>So in this case, I’m just going
to select all of these options, and then the last thing I
do is I customize my index. Again, this is basically
looking at the schema of the file that I’m processing and then making some recommendations
in terms of like, these are the fields that we
could possibly use in the index. I’m going to accept
most of the defaults here because, as you look at it.>>Yeah. I mean, it’s
pretty standard stuff.>>That’s pretty standard stuff.
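The “standard stuff” being accepted here is a set of per-field attributes, which he explains next. As a reference, they map onto boolean flags on each field in the index definition. A minimal sketch; the field names and types are assumptions based on the reviews demo, not the exact schema shown on screen:

```python
# Each index field carries boolean attributes that control its behavior:
#   retrievable -> the field comes back in search results,
#   filterable  -> the field can be used in a where-clause-style filter,
#   facetable   -> the field produces facet counts for the left nav.
fields = [
    {"name": "reviews_text", "type": "Edm.String",
     "searchable": True, "retrievable": True, "filterable": False, "facetable": False},
    {"name": "sentiment_score", "type": "Edm.Double",
     "retrievable": True, "filterable": True, "facetable": False},
    {"name": "key_phrases", "type": "Collection(Edm.String)",
     "retrievable": True, "filterable": True, "facetable": True},
]

# Which fields would drive the left-nav facets in a search UI?
facetable_fields = [f["name"] for f in fields if f.get("facetable")]
print(facetable_fields)  # → ['key_phrases']
```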
So “Retrievable” for example, implies that when you
perform a search, this particular row comes back or this particular column comes back. Similarly, “Filterable”
implies that you should be able to write a where clause filter. “Facetable” implies that you
get the facets on the left nav, and it gives you counts of how many of each of these there are.>>Listen kids, I called
it a face table at one time and then one of
these smart people came and said, “Did you mean facetable?”
I said, “yes, I did.”>>Not face table.>>So with the naming, when we think about it, we just imply it.>>It’s like a category. It’s a way to do
tagging more or less.>>Exactly.>>Got it. Not a face table, kids.>>So at this point, I’m just going to hit
“Next,” create “Indexer.” So the indexer again is
something that you can run either on a schedule or just one time, and it also has some interesting features in terms of how you manage errors. So for example, you could set things like max failed items to say, if it failed on at least five documents, you want to stop. Things like that. So at this point, I hit “Submit,” my indexer runs, and now it goes through the process of
actually looking at my data source, opening up that file, cracking it open, extracting
the information out of it, running it through all the skills
and then performing the projections both for the index as well as the knowledge store.>>That’s really cool. Because, I mean, this is obviously one little file,
Because, I mean this is obviously a one little file
but you may have thousands of log files that you can add skills to and
they do some cool stuff. So you mentioned that there
was a Power BI that you use to build on top of that.
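Before the dashboard, it helps to see why Power BI can join the projected tables at all: the knowledge store writes a shared key onto parent and child rows, as he describes below. A toy illustration with invented sample rows (the real table names, key names, and values are generated by the service):

```python
# Invented stand-ins for two projected tables: a parent Reviews table and a
# child KeyPhrases table. The shared DocumentKey is what lets Power BI (or
# any client) relate each child row back to its source document.
reviews = [
    {"DocumentKey": "d1", "hotel": "Airport Hotel", "sentiment": 0.15},
    {"DocumentKey": "d2", "hotel": "City Inn", "sentiment": 0.92},
]
key_phrases = [
    {"DocumentKey": "d1", "phrase": "airport"},
    {"DocumentKey": "d1", "phrase": "breakfast"},
    {"DocumentKey": "d2", "phrase": "great view"},
]

def reviews_with_phrase(phrase: str, max_sentiment: float = 1.0) -> list:
    """Join the two tables on DocumentKey, filtering by key phrase and an
    upper bound on sentiment (e.g. 0.2 for strongly negative reviews)."""
    keys = {kp["DocumentKey"] for kp in key_phrases if kp["phrase"] == phrase}
    return [r for r in reviews
            if r["DocumentKey"] in keys and r["sentiment"] <= max_sentiment]

print(reviews_with_phrase("airport", 0.2))  # only the Airport Hotel review
```

This is essentially the "filter for airport with low sentiment" query he runs in the dashboard later, expressed as a hand-rolled join.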
What does that look like?>>Yeah. So like
any good cooking show, I actually ran this prior to coming in here, and I built this Power BI dashboard based on the data that I ran through it. You can see that, for example, I ran it through
about 35,000 records.>>Which is the rows in that table.>>Which is the rows in that file. You can see that there’s
simple things that I can get out of this that are just
easy for me to comprehend. The other thing is, building
this dashboard is really easy, because the other thing the knowledge store does is maintain the relationships across
the different tables that you project information into. So for example, you have
a document that has key phrases, and you project the key phrases out to a second table. Now, we are able to relate each of these key phrases back to the specific document
that it came out of. So Power BI can just suck up that whole thing and then it just knows how to deal with
these relationships, and then we’re able to basically
do all these filtering and like selecting of information
to be able to create the views based on all this data, and it works out of the box.>>Now, here’s the thing, I’ve
always found that whenever I run Cognitive Services over things, I find discrepancies
between what people think is in the data and what
the AI thinks is in the data. Could we find anything
like that here?>>Yeah, absolutely. So like
the first thing I’ll show you is, one of the things you see here
is that the count of countries is one. Whereas if you look at the map,
you can tell that there’s data that’s spread geographically
across the different countries. The reason for that is
the reverse geocoding of the dataset is incorrect. The reverse geocoding always used a US city and state, and so that’s why, when
I look at the country code, the country code is always US. So one of the things that we’re doing now is in the next version
of this demo, we’re actually creating
a custom skill that does the real reverse geocoding based on the latitude and longitude to put in the actual values. The second thing you’ll notice
is there’s something weird going on now with
the language code and the reviews. So certain language codes are scored on a review scale of 10, versus certain language codes are scored on a review scale of five.>>That’s a problem.>>That’s a problem. So now we’ve
got to normalize this data, and that’s another thing that we’re doing as part of that custom skill.>>That’s cool because
with minimal work, you’ve put something in and you’ve learned a lot about your data within an hour, depending on how much stuff you’ve got in there, which is pretty cool.>>Exactly, and not only your data. So we can do more interesting
things with this. So for example, I’ve binned this information into sentiment buckets based on the sentiment score. So now I can do certain things
like for example, if I filter for airport, and if I look for all reviews that have a key phrase of airport associated
with it and I say, “Well, I want to really see specific hotels around the airport and I want to find all the ones that have sentiment score of
less than 20 percent.” If I do this, now you’re going to see and
this is probably hard to read in the resolution that
we’re in right now. But if you look through
a lot of these reviews, you’re going to find
that two things will stick out to you: the two real indicators of negative sentiment in the airport hotels are either the breakfast
or the bathrooms. Those are the only two things that really drive negative sentiment.>>It’s true. I’ve
been in airport hotels, and both the breakfasts and the bathrooms were subpar. But this is cool because
that sentiment score is not something that was in
the original document, it’s something that was added through cognitive skill sets
inside of search.>>Exactly. Both the sentiment score
and the key phrases were things that were enriched
as part of Cognitive Services.>>Fantastic. So where can people go to find out more, to close up?>>The documentation page is
definitely where I’d recommend people look in terms of how to start using the knowledge store. There’s a couple of
things to look at there. There’s the new Shaper skill that helps you use the knowledge store, as well as a getting-started guide for using the knowledge store. They’re both great
places to start.>>All right. Thanks so much for
all that awesome information. Thanks so much for watching. If you want to find out more,
you’ll see all of the blogs and documentation, everything that he talked about. Thanks so much for watching, we’ll
see you next time. Take care. [MUSIC]
