Overview
As SEOs, we are interested in the most influential factors in commercial web search results. Accordingly, we have been conducting ongoing studies looking at the relationship between web search results and links, social media signals, and on-page factors.
This document explains our methods, including the construction of the data set and statistical analysis. The following section includes some details on the dataset itself, from the choice of the keyword list to data sources and feature extraction. The last section describes our statistical analysis methods.
Before diving into the details, we want to note the scope of the analysis. We considered only English-language U.S. search results from Google, although we plan to extend the analysis in the future. All data collection was conducted during early May 2015.
Dataset construction
Keyword list
The first step in building the dataset is selecting a list of keyword queries. Since this list determines the composition of the dataset, it is important that it covers a wide variety of subjects and query types. To this end, we used the suggested queries from all 22 top-level categories in the Google AdWords tool. The tool provides 800 queries for each category, for a total of 17,600 queries. After removing duplicates (some queries appear in more than one category), we obtained the final list of 16,521 queries.
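As an illustration of the de-duplication step, here is a minimal sketch. The function name `merge_category_queries`, the sample categories, and the normalization rule (lowercasing and collapsing whitespace before comparing) are assumptions made for the example; the rule actually used to identify duplicates is not specified here.

```python
def merge_category_queries(queries_by_category):
    """Merge per-category query lists into one list without duplicates."""
    seen = set()
    merged = []
    # Iterate categories in sorted order so the output is deterministic.
    for category in sorted(queries_by_category):
        for query in queries_by_category[category]:
            # Assumed normalization: lowercase and collapse whitespace.
            key = " ".join(query.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical sample: one query appears in two categories.
sample = {
    "Apparel": ["running shoes", "Winter Coats"],
    "Sports & Fitness": ["running shoes", "yoga mat"],
}
keywords = merge_category_queries(sample)
print(keywords)
```

With the real input, 22 categories of 800 queries each would go in, and the merged list would be shorter than 17,600 by however many queries are shared between categories.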
The final list of keywords contains samples of head, middle, and tail queries, as measured by search volume. Table 1 shows the number of queries in each local search volume bucket. All search volumes are well represented, from infrequent (fewer than 1,000 searches per month) to frequent (more than 20,000 searches per month), and the data contains some keywords with more than one million searches per month.
Table 1: Distribution of keyword search volume in the final data set
Local monthly searches | Number of queries |
---|---|
< 10,000 | 4,357 |
10,000 - 20,000 | 2,713 |
20,000 - 50,000 | 4,860 |
> 50,000 | 4,591 |
SERPs
We pulled the top 50 search results for each query on the list from Google's U.S. search engine, in a location- and personalization-agnostic manner. We removed all non-web results (images, video, news, etc.) from the response. Finally, we excluded all queries that returned fewer than 25 results to ensure that each SERP had sufficient data points for analysis.
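The filtering described above can be sketched as follows. The record layout (a `type` field on each result) and the function names are hypothetical, and the sketch assumes the 25-result threshold is applied after non-web results have been removed, which is one reading of the procedure.

```python
MIN_RESULTS = 25  # queries with fewer web results than this are dropped


def keep_web_results(results):
    """Drop images, video, news, and any other non-web result types."""
    return [r for r in results if r["type"] == "web"]


def build_serps(raw_serps):
    """Return {query: web results} for queries with enough results."""
    serps = {}
    for query, results in raw_serps.items():
        web_results = keep_web_results(results)
        if len(web_results) >= MIN_RESULTS:
            serps[query] = web_results
    return serps


# Hypothetical input: query "a" has 30 web results; query "b" has
# 20 web results and 10 image results, so it falls below the threshold.
raw = {
    "a": [{"url": f"https://example.com/a/{i}", "type": "web"} for i in range(30)],
    "b": [{"url": f"https://example.com/b/{i}", "type": "web"} for i in range(20)]
    + [{"url": f"https://example.com/img/{i}", "type": "image"} for i in range(10)],
}
serps = build_serps(raw)
print(sorted(serps))
```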
Factors
With the SERPs from the second step, the final step in building the dataset is to calculate ranking factors. We collected factors from a variety of sources, as follows:
- Mozscape URL metrics. All of the link-related factors were sourced from Mozscape, using the URL-metrics API call.
- Mozscape anchor text. For each URL, we pulled the top 1,000 anchor text terms and phrases using the Mozscape anchor text API call. Then, for each query/URL combination, we determined whether there was a partial and/or exact match to the query. Here, "exact match" means that the entire query exactly matches the anchor text, while "partial match" means that at least one word from the query matches the anchor text.
- Social media signals. For each URL, we obtained a variety of social media signals from Facebook, Twitter, Google+, LinkedIn and Pinterest.
- On-page factors. We retrieved the original HTML/XML content for each URL, and, for each query/URL combination, computed various factors of interest such as the TF-IDF score of the page, the length of the document, etc.
- Domain/URL factors. We also extracted a variety of factors related to the URL and domain such as whether the query matched the domain name, whether the domain contained any hyphens, etc.
- SimilarWeb factors. We partnered with SimilarWeb to provide factors related to each domain, including rank, estimated traffic, etc.
- Ahrefs factors. We partnered with Ahrefs to provide link metrics for each URL and domain.
- DomainTools factors. We partnered with DomainTools to provide domain registration/expiration dates as well as a flag for whether the domain was registered using a privacy service.
The complete list of factors and a description of each can be found in the accompanying dataset with all results.
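To make the anchor text definitions above concrete, here is a minimal sketch of the exact and partial match flags for one query/URL combination. The function names and the tokenization (lowercasing and splitting on whitespace) are assumptions for illustration; the list of anchor texts stands in for the terms and phrases returned by the anchor text API call.

```python
def tokenize(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def anchor_text_matches(query, anchor_texts):
    """Flag exact and partial matches between a query and a URL's anchor texts.

    exact_match:   the entire query equals some anchor text.
    partial_match: at least one query word appears in some anchor text.
    """
    query_words = tokenize(query)
    query_norm = " ".join(query_words)
    query_set = set(query_words)
    exact = False
    partial = False
    for anchor in anchor_texts:
        anchor_words = tokenize(anchor)
        if " ".join(anchor_words) == query_norm:
            exact = True
        if query_set & set(anchor_words):
            partial = True
    return {"exact_match": exact, "partial_match": partial}


# Hypothetical anchor texts for three URLs.
both = anchor_text_matches("running shoes", ["Running Shoes", "click here"])
partial_only = anchor_text_matches("running shoes", ["cheap shoes online"])
neither = anchor_text_matches("yoga mat", ["click here"])
print(both, partial_only, neither)
```

Under these definitions an exact match always implies a partial match, so the two flags are not independent of each other.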
Statistical analysis
The main goal of our analysis is to order the factors from most influential to least influential, while providing some estimate of the relative influence across factor types (link vs. on-page vs. social, etc.). To this end, we computed a number of different evaluation metrics between search position and the ranking factors, as well as a measure of the overall prominence of each factor in the SERPs.
Mean Spearman correlations
This is our preferred metric, and the one illustrated elsewhere in this report. Since we have a wide variety of factors and factor distributions (many of which are not Gaussian), we prefer the Spearman correlation to the more familiar Pearson correlation: Pearson measures only linear association and is sensitive to outliers and heavily skewed distributions, whereas Spearman depends only on the rank order of the values. In our analysis, we treated each query as independent, computed the Spearman correlation between search position and the factor for each query, then averaged over all queries and reported the result.
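The per-query averaging can be sketched as follows, using only the Python standard library. Spearman correlation is computed as the Pearson correlation of the ranks, with tied values given their average rank. Two details are assumptions of this sketch rather than part of the method described here: queries where the factor is constant (so the correlation is undefined) are skipped, and raw positions are used as-is, so with position 1 at the top a negative value means larger factor values tend to sit nearer the top. Flipping the sign for reporting is a separate choice.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation, or None if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(serps):
    """serps maps each query to a list of (position, factor_value) pairs."""
    correlations = []
    for rows in serps.values():
        positions = [p for p, _ in rows]
        values = [v for _, v in rows]
        rho = spearman(positions, values)
        if rho is not None:  # assumption: skip undefined correlations
            correlations.append(rho)
    return mean(correlations) if correlations else None


# Hypothetical factor values for three tiny SERPs.
example = {
    "q1": [(1, 90), (2, 70), (3, 50), (4, 10)],  # larger values rank higher
    "q2": [(1, 10), (2, 30), (3, 20), (4, 40)],  # mostly the opposite
    "q3": [(1, 0), (2, 0), (3, 0), (4, 0)],      # constant factor, skipped
}
result = mean_spearman(example)
print(result)
```

Averaging per-query correlations, rather than pooling all results into one correlation, keeps each query's ranking on its own scale: a factor value that is high for one query may be low for another.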
Overall prominence
In addition to Spearman correlations, we also calculated the overall prominence of each factor in the results. We measure prominence as the percentage of results in the entire dataset that contain the factor, where "contain" generally means the factor is not zero (the exact definition depends on the factor). For example, the prominence of keyword matches in the title is the percentage of results that include at least one keyword match in the title; the prominence of Facebook shares is the percentage of results that have at least one Facebook share; and the prominence of number of linking pages is the percentage of results with at least one link.
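A minimal sketch of the prominence calculation for a count-style factor follows. The function name is hypothetical, and treating missing values (`None`) as "does not contain the factor" is an assumption of this sketch; as noted above, the exact definition of "contain" depends on the factor.

```python
def prominence(values):
    """Percentage of results whose factor value is present and non-zero."""
    if not values:
        return None
    present = sum(1 for v in values if v is not None and v != 0)
    return 100.0 * present / len(values)


# Hypothetical Facebook share counts for eight results across the dataset.
facebook_shares = [0, 12, 3, 0, 0, 1, 0, None]
share_prominence = prominence(facebook_shares)
print(share_prominence)
```

Unlike the correlation, prominence is computed over all results in the dataset at once, not per query.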