
Discussing LDA and SEO

Danny Dover

The author's views are entirely their own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.


In this week's Whiteboard Friday, Rand Fishkin and Ben Hendrickson discuss LDA (Latent Dirichlet Allocation) and SEO (Search Engine Optimization). There has been a lot of discussion lately about the relationship between these two topics, and this video answers many of the questions people in the community have been asking. It is comprehensive (25 minutes) and uses many easy-to-understand diagrams and examples to discuss what impact LDA may have on the SEO industry. We look forward to reading your comments below.

Video Transcription


Rand: Howdy, SEOmoz fans. Welcome to another edition of Whiteboard Friday.
Today, I am joined by Ben Hendrickson. Ben?

Ben: Hello. We've met before.

Rand: Have we really?

Ben: I think so.

Rand: So, Ben is our senior scientist here at SEOmoz. He does a lot of our
research work and has been working on some interesting projects.
Lately, we posted about one of those projects and asked for some
feedback and got some great responses. A lot of people are very
passionate, very excited. And some people are a little confused. So,
we wanted to dive deeper with this LDA stuff.

What's LDA? Latent Dirichlet Allocation. We wanted to talk about topic
modeling in general. There was some feedback, right, and I am sure
you saw some of it too, that was like, "I'm not quite sure. You're
saying on-page maybe is more important because of this LDA stuff,
and I always thought on-page just meant keyword density or stuffing
your keywords."

Ben: Yeah. Clearly the words used matter. For any given SERP, a huge number of pages aren't going to rank for it because they have nothing to do with it; they never use the word at all. Right? I mean, Google.com ranks for very few things and it has a ton of links. So, of course, the words on the page matter.

Rand: But we've always, as an SEO, even when you've done your previous
research, it was sort of like, boy, it sure does look like links are
a whole lot more important than . . .

Ben: Using the keyword in the title tag. Right. Yeah. So this was something that actually was very surprising for us, which is why we shared it. It seems like using other words related to the query in a very specific way seemed to help a lot. Right?

Rand: And we were kind of weirded out by that.

Ben: Yeah.

Rand: Or we were at least surprised by that. So, that is why we are sharing
it. So, let's go back in time a little bit and talk about this whole
. . . for people who are kind of going, "I don't understand what you
mean when you say it's more sophisticated than keyword density, or
it's more sophisticated than a normal keyword metric or keyword
usage." Keyword density is just like the percent of times that the
word is used out off all the words in a document.

Ben: Yeah.

Rand: Super simple to game. Kind of useless for IR is my understanding.

Ben: Well, I mean, it gets you a lot of the way. At least you have that word in the documents you return to people. But, like your blog post earlier in the week showed, there are a lot of basic situations where you can't tell which is the better content just by doing this.
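
To make the distinction concrete, keyword density is simple enough to compute in a few lines, which is part of why it is so easy to game. A minimal sketch in Python; the sample document is hypothetical:

```python
# Keyword density: occurrences of the keyword divided by total words.
def keyword_density(document: str, keyword: str) -> float:
    words = document.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

doc = "seomoz builds seo tools and the seomoz blog covers seo"
print(keyword_density(doc, "seo"))  # 2 of 10 words -> 0.2
```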

Rand: Right. And so, IR folks in the '60s came up with this TF-IDF thing, which is essentially looking at how frequent the terms being used are in the corpus as a whole. So, if you are a library, you look at all the books in the library. Or if you are a card catalogue, you look at all of that. And now that there are search engines, they look at all of the documents on the Web.

Ben: Yeah, right. So, the big intuition here is that when people search for multiple words, the word that is rarely ever used is the one that actually matters the most. So, if you are searching for the SEOmoz building, a document that includes "building" and "SEOmoz" is probably very relevant. A document that contains just "the" and "building" is a lot less relevant. So, the basic story there is that you are biased against caring about words that are very common.

Rand: Right. So I like your Lady Gaga example where you're like, well, documents that have "Gaga" on them are probably way more relevant than those that just have "lady" on them, even though lady and Gaga are both four-letter words in the phrase.

Ben: Yeah, exactly.
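
To see the TF-IDF intuition in code, here is a minimal sketch using scikit-learn; the toy corpus is hypothetical:

```python
# TF-IDF up-weights rare, discriminating terms ("gaga") and
# down-weights terms common across the corpus ("the").
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the lady walked into the building",
    "the lady gaga album",
    "the seomoz building in seattle",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus).toarray()

doc = 1  # "the lady gaga album"
for term in ("gaga", "lady", "the"):
    print(term, round(tfidf[doc, vectorizer.vocabulary_[term]], 3))
# "gaga" appears in one document and scores highest; "the" appears
# in all three and scores lowest, even though each occurs once here.
```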

Rand: All right, cool. So we evolved to this TF-IDF stuff. And then there
is this like co-occurrence thing, which we talked about on the
SEOmoz blog a long time ago. Co-occurrence is kind of interesting
where we look at, and let me make sure I am getting this right. It
is essentially that, oh well, oftentimes when I see, for example,
Distilled Consulting and building and SEOmoz and building, I find
those frequently together because it turns out that we share offices
with Distilled and we do lots of work together and those kinds of
things. So, maybe a document that has both Distilled and building and SEOmoz might be more relevant than one that just says SEOmoz.

Ben: Exactly. Right. So, if you are trying to figure out whether it's just an offhand reference or something that is actually about the topic, the fact that it is using a whole lot of other words that also occur with the keyword would be a good indication of that.
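
A rough sketch of the co-occurrence idea in Python; the tiny corpus is hypothetical:

```python
# Count how often other words appear in documents that also contain
# the keyword; frequent companions hint at topical relevance.
from collections import Counter

corpus = [
    "seomoz shares a building with distilled consulting",
    "distilled and seomoz run seo training together",
    "the empire state building is in new york",
]

def cooccurring_terms(docs, keyword):
    counts = Counter()
    for doc in docs:
        words = set(doc.split())
        if keyword in words:
            counts.update(words - {keyword})
    return counts

# "distilled" co-occurs with "seomoz" in two documents, so a page
# using both may be judged more relevant than one using either alone.
print(cooccurring_terms(corpus, "seomoz").most_common(3))
```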

Rand: But then topic modeling, I think that even I get a little bit
confused when I think about topic modeling versus co-occurrence,
because it seems like topic modeling is maybe very similar to this.

Ben: Well, this is great because you drew a Venn diagram that shows the
difference really well.

Rand: Right. Super smart of me.

Ben: It's like you kind of knew. So you can imagine that you could have a
whole bunch of words that would have a very high co-occurrence with
Star Trek. Right? You could have documents that talk about gravity,
space, planet, and tachyon. But it still might not be about Star
Trek, even though you've got four words that co-occur a lot with
Star Trek. It could be about astronomy. Those are all real things that
exist in the real world, or at least people think they might exist
in the real world in the context of tachyons. But if you have
something that is talking about tachyons and gravity and William
Shatner, that's probably Star Trek. Right?


And so, it's not just the number of words you have that co-occur.
You are actually trying to figure out are these words being used in
the context where they are talking about Star Trek, or are these
words being used in the context of talking about astronomy. The way we can do this is that, in general, fewer topics is better. So,
it's possible that we have something that is talking about astronomy
and TV and it happened to use gravity and tachyon and William
Shatner in the context of something else he did. But it's more
likely to just have . . .

Rand: So normally, we might say like, "Okay, I can imagine Google using
this to try and do a couple of things." Right?

Ben: Right.

Rand: For weird queries, where maybe the word Star Trek wasn't used but
they think it might be about that and they think that's what the
person wanted, maybe they would do it. But for ordinary rankings, it
seems like using these words when I'm talking about astronomy or
using these words when I'm talking about Star Trek isn't going to
help me any more than not using them. But then we did this topic
modeling work and we tried to analyze that. Right? So we used a
process called LDA, which maybe we can talk about in a sec. But we
used this process to basically build a model that has all these
different topics.

Ben: Right.

Rand: And essentially, the topics, as I understand them, aren't actually
keywords. They're just like a mathematical representation of a
subject matter. Like you were saying there's probably a cartoon
topic, but it's not like the word occurred necessarily.

Ben: Yeah, right. So, it has actual words in it. Right?

Rand: Yeah.

Ben: You can look at a given topic and you can see all of the words in it and see how much each word is in it. But no human went through and said we should make a topic about this, or chose which words to put together. So, if you look at papers, people pretty much refer to a topic by whatever the most common word in it is, which in the case of cartoons might be cartoon.
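
To illustrate what a learned topic looks like, here is a minimal LDA sketch using scikit-learn. The four-document corpus and two-topic setting are hypothetical toys; real models are trained on far larger corpora:

```python
# LDA learns topics as distributions over words; no human labels them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "tachyon fields and william shatner on the star trek bridge",
    "star trek warp drive and shatner",
    "gravity of a planet measured by astronomy telescopes",
    "astronomy maps gravity across deep space",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Refer to each topic by its most probable words, as papers do.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:4]]
    print(f"topic {i}:", ", ".join(top))
```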

Rand: Like I remember one of the early ones we were looking at was
Transformers.

Ben: Yeah, right.

Rand: It was like, oh, well, Optimus Prime and Megatron and Sydney, the woman who's in all of the movies now. She came up a lot. Megan Fox was in there.

Ben: Is she related to Vanessa Fox?

Rand: I don't think so.

Ben: Okay.

Rand: In fact, I strongly suspect no.

Ben: Okay.

Rand: I'd guess it's a screen name. But so, in any case, you get these
topics. You have these words in them. And then when we say, "Well,
how much does this matter? Like how much does it matter if I am
writing a page about Star Trek and I have lots of links pointing to
me, but I'm not ranking as well as I think I should. Could it be
that maybe I have not included keywords that would tell Google that
I am actually about the topic Star Trek or about related topics?"
Yes. And so, we don't know how important that is. And that's why we did the correlation work, to try and figure this out.

Ben: Yeah, right. Because, obviously, we don't work at Google.

Rand: We just have to look at the outcome.

Ben: We have to look at the search results and then decide if this seems
like what they are doing. Yeah. So we try to see.

Rand: All right. So, let's talk about that correlation process. So Ben, we're talking about this correlation thing, and a part of me, as a classic SEO, not a statistics or math major, kind of goes, "Isn't the best way to test whether this works to have two random documents on the Web, where I try putting your LDA stuff to work on one and see if it raises that one up and not the other?" I can do tests that way. Like, what's this correlation? Why do I need that? Is that a better way to do it?

Ben: I mean, they are just different. We've tried doing controlled tests where we put the keyword in the title tag on one page and not the other and see which one ranks. But it's very hard to do enough of those to reach statistical significance. It's pretty easy to set up ten websites where some are doing stuff one way and the others are doing it the other way. But you end up with results like four one way and six the other, or three one way and seven the other.


Frequently, a lot of these effects aren't that big. Google says hundreds of things influence SERPs. So even if you try to control for as many variables as you can, to make everything the same between the two, there is just a lot of noise in terms of what actually ranks higher. So it takes a very large amount of work to generate enough samples to say something with statistical confidence.

Rand: And you never know when you might have some weird factor that is
influencing all of them in some weird way.

Ben: Yeah. There is another problem: you are probably looking at really tiny pages and little tiny domains, because you are not setting up a huge number of large-scale domains to try this out. Right?

Rand: Right.

Ben: So you are going to get an answer. The question is: Is this answer going to scale up from my small pages that have ten links to real pages people care about? So, it is a very interesting process, and I actually would be very fascinated if people got good results from it. But we have tried it and the results have all kind of been . . .

Rand: Middling at best.

Ben: Middling, yeah.

Rand: There were no good conclusions from any of it. So instead, we use this correlation process. Right?

Ben: Right.

Rand: If I understand your process right, you basically run across not a
dozen or a hundred, but hundreds or thousands, in some cases, of
different search results looking for elements that will predict that
something ranks higher or lower.

Ben: Yeah.

Rand: And so I saw that Danny Sullivan left some great comments in our blog
post about LDA. He said, for example, "Well, you guys said that
correlation with keywords in the title is very low. I don't believe
that at all because, when I look at search results, all the search
results I see almost always have the keyword in the title tag. So,
what are you measuring here that I'm not seeing?"

Ben: Right. The difference is measuring whether a keyword is in the search results versus measuring whether it is correlated with appearing higher in the search results.

Rand: So if all of these included the keyword Star Trek in the title
element, then what's the ranking correlation of the title element
with the keyword?

Ben: It would be zero. Right?

Rand: Because they are all the same. That's more like asking: What's the likelihood that a blue link appearing on Google has this feature?

Ben: That's an interesting thing. We computed some data a while ago using
the correlations where we were comparing Bing and Google. It
actually was interesting to see Google tends to have a lot of stuff
with this element. Bing had fewer things with that element. It actually tells you how the search engines are different. It's
interesting just looking at raw prominence when you are trying to
compare two search engines. But it's not very interesting when you
are trying to compare two features because . . .

Rand: Or when you're trying to figure out what will help you rank well.

Ben: Exactly.

Rand: Okay. So, got you. So what Danny Sullivan is talking about with this
"I see the keyword in the title tag like 70 percent of the time or
more," that's this raw prominence thing.

Ben: Right.

Rand: That's like how many times does it appear in there? But correlation
of a specific feature with ranking higher is essentially looking at
all of these and then saying like, hmm, you know, on an aggregated
basis across hundreds or thousands of search results . . . I think
the study you did for the Google/Bing thing was like 11,000
different search results. Right?

Ben: It took a long time making searches and writing it all down on paper.

Rand: Yeah. I bet it did. You're totally incredible for having done it
manually. So, you look at all of those and then you would say, "Oh, well this particular element on average, like having the keyword exactly match the top-level domain name like it does here, boy, that sure looks like it is correlated with ranking much higher." I think having the keyword in the domain name was one of the highest correlated single features that we saw.

Ben: Yeah, right.
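
For the statistically curious, here is a minimal sketch of the kind of per-SERP rank correlation being described, using SciPy's Spearman coefficient; the feature values are made up:

```python
# Within each SERP, correlate a feature with ranking position, then
# average across SERPs (a mean Spearman rank correlation).
import numpy as np
from scipy.stats import spearmanr

# Each list: a feature value for the results ranked 1..5 in one SERP.
serps = [
    [0.9, 0.7, 0.8, 0.4, 0.2],
    [0.6, 0.9, 0.5, 0.3, 0.1],
    [0.8, 0.8, 0.6, 0.5, 0.4],
]

correlations = []
for features in serps:
    ranks = list(range(1, len(features) + 1))  # 1 = top result
    rho, _ = spearmanr(ranks, features)
    correlations.append(-rho)  # flip sign so "better rank" is positive

# Note: a feature every result shares (keyword in every title) has no
# variance within a SERP and carries no ranking signal.
print("mean Spearman correlation:", round(float(np.mean(correlations)), 3))
```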

Rand: And the same thing goes for number of linking root domains, the diversity of different link sources that you've got. If tons and tons of different websites have a link to Amazon, that seems to predict, or correlates well with, it doing pretty darn well.

Ben: Right.

Rand: And if I recall, I think correlations for title tags and keyword-based stuff, with the exception of the domain name, were in the 0 to 0.1 range. Maybe 0.15, something like that.

Ben: Yeah. In fact, some of them were actually a little bit negative.

Rand: Why would it be negative?

Ben: Because it is quite plausible that if it's in the title, someone put
it there because they would like to rank higher than they actually
do and (_________) a lot of other things and it's just not a very
good page.

Rand: So you're saying, because of keyword stuffing SEOs, there could be a negative correlation or other confounding effects.

Ben: Yeah. Exactly.

Rand: So this on-page stuff has a pretty small correlation. Right? So then, we looked at things like links. A lot of those were in the 0.2 to 0.3 range, with 1 being a perfect correlation. So there was linking root domains. That was pretty decent, like 0.24 or 0.23 or something like that. Things like Page Authority, which is a metric we calculate, were really quite nicely high. It was like 0.35 almost, 0.34, something like that.

Ben: I can't confirm or deny these numbers. I don't remember them off the
top of my head.

Rand: All right. But there are different ranges. Right?

Ben: Yeah.

Rand: So, when we looked at linking stuff, it was almost always better than
on-page stuff.

Ben: Yeah, right. If you had to develop a search algorithm to sort things and you could only pick one signal, just looking at links seemed to get you most of the way compared to anything else we did.

Rand: So then when we saw this LDA thing at 0.32 something, that seemed wacky. That seems crazy high for an on-page factor, because we had never looked at anything about the words or how you use them, with the exception maybe of the keyword in the domain name, that was this high in correlation. So that sort of struck us as being very odd, and this is one of the reasons we wrote about it and were excited about it. But let me just throw this out there. Correlation is not causation. Right? It could be that maybe domain name is really the thing that is being ranked. Or maybe it's other features. Right? Correlation doesn't necessarily mean that that is what is causing it.

Ben: Right. And almost certainly our LDA model is not causing it, because Google doesn't use our LDA model. They're not asking us for numbers. Right? And almost certainly Google is not doing LDA exactly like we have done it. They have not used our corpus. We have a model that is correlated with Google's results, and it is certainly not causing Google's results. But the thing is that it is a very high correlation. So, they are doing something that is somehow producing results that are correlated with an LDA model. It is hard to imagine what that would be, unless it was some sort of topic modeling or something like it looking at the words used on the page.

Rand: So, there are two things that come out of this. One is that, to my mind, when I see something that high, and assuming all the numbers look right... I think some people gave your numbers a hard time, but it looks like at least the criticism we have received so far has not made us think we have done something wrong.

Ben: Yeah. I spend most of the day running code, so it is quite plausible that I did something wrong. I'm sure I have somewhere. But the specific complaints people have come up with so far aren't very credible. You know, it will certainly happen someday.

Rand: I'm sure we are all excited for that day, Ben. Assuming that these numbers hold up, doesn't it sort of say that maybe we've been wrong about this on-page stuff not mattering all that much? Maybe we should do more on that front, like more investigation, test out the results, try putting our keywords on the pages in certain ways.

Ben: Well, Google always says to spend time writing good content. Right? And that's a little bit hard to apply, but you can interpret it as: write content that makes it clear what your topic is by using words that are going to eliminate any topic from being (________) except for the one that you are trying to rank for. So, I don't know if it's that revolutionary. It seems like people have worried a lot about their content in the past, and a lot of people say to do so.

Rand: But people in the past talked about things like, oh, we should use the Google Wonder Wheel. And we should use related searches and put those words on our pages. We should use things like synonyms that we get from the service. Well, how is the LDA stuff different? Or is it? Like if I just do these things, am I going to do great over here?

Ben: Well, I mean, they are not going to be bad. But you can imagine that putting in a whole bunch of synonyms for tachyon is not going to actually help clarify whether you're about astronomy or Star Trek. Right? Or say you're trying to discuss bark collars and you want to clarify that you are talking about dogs as opposed to the stuff that wraps trees. You are not going to want to put in a whole bunch of synonyms for collars or barking; that would be sort of weird and unnatural. You much more want to put in other related words to make it clear that you are talking about some sort of bark prevention system.

Rand: So, let's talk really briefly about the tool today. It doesn't do exactly this. Right? Instead, it gives us a score.

Ben: Yeah.

Rand: All right. Let's look at that.

Ben: Okay.

Rand: Now, this LDA score... "tool" might be an overstatement. It's a Labs project. You can look and see it. It works. You can put stuff in. But we have a lot of really beautiful tools here at SEOmoz, and this is not one of them. So, it's not the prettiest thing in the world. But it does leverage the topic modeling work, and it uses this specific process, LDA, which we think is sort of better than some other ones, though not as good as the sophisticated stuff Google does.

Ben: Almost certainly.

Rand: I enter a query up here, something I want to rank for. I put in some words here, and it will give me a percent telling me how topically relevant it thinks this content is to the word up here. And it will do the same thing if I enter a URL down here; it will populate this box with the content from that page.

Ben: Right.

Rand: So this gives me sort of a rough sense. I can play around and see whether SEOmoz's LDA tool works, whether LDA scores seem to predict anything so that I can rank better. So, I could look at the top ten results and be like, "Wow, I'm winning on links. I think I'm doing a good job of keyword usage. But boy, all these other people have much higher LDA scores than I do. Maybe I should try increasing that." Is that sort of the suggested application here?

Ben: That would seem very reasonable to me. It is kind of new. No one has a huge amount of experience with it. So far, it seems like people have said that getting a higher score has helped them rank, but that's very anecdotal. There's a very plausible reason why you would think that would work. But we're kind of on the bleeding edge here.

Rand: We're not trying to say that you can just enter something in here, use this, and boost up the rankings of all of your pages, and that it will work perfectly or anything like that.

Ben: Yeah, exactly. But it seems very plausible that getting a higher score helps you rank higher. And the tool lets you see clearly what this kind of topic modeling is able to figure out. It sort of shows you the kind of connections that Google certainly will be able to make, like figuring out that pizza is related to food but donkey is not. So you can sort of explore and see how this stuff works.
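
One plausible way a single relevance score like the tool's could be computed (an assumption for illustration, not SEOmoz's actual implementation) is to infer topic distributions for the query and the page and compare them, for example with cosine similarity. This sketch reuses the `lda` model and `vectorizer` fitted in the earlier topic sketch:

```python
import numpy as np

def topic_similarity(lda, vectorizer, query: str, page_text: str) -> float:
    # Infer a topic distribution for each text with the fitted model,
    # then compare the distributions with cosine similarity.
    q = lda.transform(vectorizer.transform([query]))[0]
    p = lda.transform(vectorizer.transform([page_text]))[0]
    return float(np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p)))

# A page using Star Trek-flavored vocabulary should score higher for
# "star trek" than one using astronomy vocabulary.
print(topic_similarity(lda, vectorizer, "star trek",
                       "shatner and tachyon fields on the bridge"))
```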

Rand: Cool. One weird thing that people have noted, and the last point, is that this fluctuates a lot. Oftentimes, when I run it, it will fluctuate by one to five percent. Like I'll hit go on the same URL, the same content, the same keyword, and it will change one percent to five percent. Sometimes it seems like it can go to maybe seven, eight, or nine percent. A couple of people have reported -- we haven't been able to see them -- rare instances where it is more than a ten percent fluctuation. So, explain to me what is going on there. What is the sampling that the tool does?

Ben: Right. So there's a very large number of possible ways that you could explain the document with topics. It could be about Star Trek. Or it could be about astronomy and TV shows. There are lots of different ways that you could explain the different word usages in there. So we can't actually just try all of them and weight them by their probability, because that would take years to answer anybody. So instead, we sample them based upon their likelihood and then we average that. It's like, if you wanted to figure out whether most people are going to vote Democrat or Republican this year, you might sample 100 people and conclude that 40 percent are going to vote Democratic this year.

Rand: But then if you sample a different 100 people . . .

Ben: It will be a little bit different. Occasionally, you could come back and find 70 percent saying they are going to vote Democratic this year. It's in theory possible, but it doesn't happen that frequently.

Rand: Got you. So you can essentially use this number. If I wanted to be more precise, I could run it a bunch of times, getting a bunch of different samples, and average those out.

Ben: Yeah. In the back end, we're doing it a bunch of times for you and
averaging them. So averaging it yourself on the front end as you go
isn't terrible.
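
The polling analogy is easy to simulate. Here is a minimal sketch showing why single samples wobble while averages of many samples settle down; the 40 percent figure is just a toy number:

```python
# Single polls of 100 people wobble around the true share; averaging
# many polls tightens the estimate, which is what the back end does
# with repeated LDA samples.
import random

random.seed(42)
TRUE_SHARE = 0.40  # toy "true" fraction of Democratic voters

def poll(n=100):
    return sum(random.random() < TRUE_SHARE for _ in range(n)) / n

print("single polls:  ", [round(poll(), 2) for _ in range(5)])
print("averaged polls:", [round(sum(poll() for _ in range(20)) / 20, 2)
                          for _ in range(5)])
```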

Rand: It's just a big use of our bandwidth.

Ben: Oh, yeah. It really helps our numbers of hits to our website.

Rand: Oh, yeah. I'm sure that's all correlated with rankings too.

Ben: I know, like unique visitors. What's that?

Rand: All right. Well, Ben, we're excited about this tool. We really appreciate you doing this research work. It's exciting and interesting. I think we'll know more in the future, in the months to come, whether this is really great and applicable for SEO, or whether it turns out that some other thing is causing this weird correlation.

Ben: Absolutely.

Rand: Well, thanks very much for obviously building this and joining us.
And thanks to all of you for watching Whiteboard Friday. We'll see
you again next week.

Ben: This was a long one.

Rand: Very impressed that you watched it. We do appreciate it.

Video transcription by SpeechPad.com

[UPDATE by Ben (Sept 10th, 12:50pm PST): In the video I stated that "specific complaints people have come up with so far aren't very credible." This was directed at the claims, not the people who raised them, and I wish I had used the word "accurate" instead of "credible." My apologies to anyone who was offended. Credible people can say things I disagree with. Indeed, the back and forth over their concerns about the unweighted mean Spearman's rank correlation coefficient has been a useful context to explain exactly why we consider it a better statistic to use than commonly suggested alternatives.

Also, I noticed that Russ Jones did work to reproduce some of our findings. He used a different dataset and different methodology, emphasized good qualifications to keep in mind, and broke out competitive vs. non-competitive queries, which we didn't do.]

[ERRATA by Ben (Sept 16th, 2:00pm PST): The blog post above reports the correlation measurement as 0.32. It should have been 0.17.]

Danny Dover

Danny Dover is a passionate online marketer, influential writer and obsessed bucket list completer. He is the author of the bestselling book Search Engine Optimization Secrets and the founder of Intriguing Ideas LLC. Before starting his own company, Danny was the Senior SEO Manager at AT&T and the Lead SEO at SEOmoz.org.


