Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can't see backlinks in Search Console
Hello, We have 7 sites and have noticed that for one site, we don't see any backlink info in Search Console even if it's been linked for over 8 months. Other tools show thousands of backlinks, but Search Console is still pending. I also see very little info in regards to Performance even if we are getting thousands of hits a day. Could this be a sign of a bigger problem? In summary, the site is up and running, getting hits, getting backlinks, but Search Console still looks like it's a new site with no activity several months after being installed.
Reporting & Analytics | | CJolicoeur0 -
Organic search traffic down 60% since 8/1/18\. What now?
I have a small health & fitness blog, and my Google search traffic suddenly dropped 60% around August 1 (I've attempted to link an image). My rank has dropped for 86 keywords. I have no manual penalty, so I'm guessing I was affected by the algorithm change. My technical skills are VERY limited. I've tried to find answers on my own, but every time I try to "fix" something, I only seem to make it worse. I do seem to have some structural/performance issues with my site (e.g., lots of 404 errors from uninstalled plugins and unwanted permalinks). I asked my server for assistance (I used managed Wordpress hosting), and they said they couldn't help. As you can imagine, this is quite devastating, and I have no clue where to go from here. I don't know if I'm allowed to link to my site here, but it's mommyrunsit dot com. Any assistance is greatly appreciated. Thanks. Sharon 0lPu4wY
Reporting & Analytics | | RoniFaida1 -
Is there a way to filter all computers on a specific IPv6 network in Google Analytics?
Is there a quick way of filtering the IP addresses for all the computers on a network that's using IPv6? I want to filter out visits to our websites from the devices on our office network, but each computer (and phone and tablet) seems to have a different address. It _looks _like they all start the same way, though. One computer is xxxx:xxxx:xxxx:xxxx:aaaa:aaaa:aaaa:aaaa, another is xxxx:xxxx:xxxx:xxxx:bbbb:bbbb:bbbb:bbbb, my phone is xxxx:xxxx:xxxx:xxxx:cccc:cccc:cccc:cccc, etc. Does this mean that xxxx:xxxx:xxxx:xxxx is the address for our network as a whole, and I can just set up a Google Analytics filter for "IP addresses starting with..."? Or would doing that also filter out hits from, like, every visitor within a 20 mile radius of our office? If I need to simply put in the individual addresses for each and every device, I will. I'm just hoping it doesn't come to that. Thanks!
Reporting & Analytics | | BrianAlpert780 -
Are these Search Console crawl errors a major concern to new client site?
We recently (4/1) went live with a new site for a client of ours. The client site was originally Point2 before they made the switch to a template site with Real Estate Webmasters. Now when I look into the Search Console I am getting the following Crawl Errors: 111 Server Errors (photos) 104 Soft 404s (blogs, archives, tags) 6,229 Not Found (listings) I have a few questions. The server errors I know not a lot about so I generally ignore. My main concerns are the 404s and not found. The 404s are mostly tags and blog archives which I wonder if I should leave alone or do 301s for each to /blog. For not found, these are all the previous listings from the IDX. My assumption is these will naturally fall away after some time, as the new ones have already indexed. But I wonder what I should be doing here and which will be affecting me. When we launched the new site there was a large spike in clicks ( 250% increase) which has now tapered off to an average of ~85 clicks versus ~160 at time of launch. Not sure if the Crawl Errors have any effect, I'm guessing not so much right now. I'd appreciate your insights Mozzers!
Reporting & Analytics | | localwork0 -
Google Search Console (new GWT) - Does a language specific sub folder need its own GSC profile
HI I've got a clients site set which targets 3 language/countries: English via the main site on the domain.com Turkish via a Turkish language site on a subfolder domain.com/tr/ And German via domain.de The devs have set up .com and .de in GSC and is reporting data in both However there's no data in the domain/com/tr GSC profile ! Is that because its on a subfolder so data pertaining to it is being reported in the main domain.com GSC account ? Or does something more need to be done to set up the Turkish subfolder in GSC ? If so what ? All Best Dan
Reporting & Analytics | | Dan-Lawrence0 -
What does WMT measure against for search query fluctuation?
Hi guys, I always thought that when checking out search queries on Webmaster Tools that it measured change per week. However, I noticed that my impressions and clicks were actually up but WMT was reporting that they were down over a week so it can't be that. Does anyone know? Attached screenshot for clarity. Thanks in advance! XkWANTo_
Reporting & Analytics | | Whittie0 -
Google Making all searches secure - "Not provided" data to increase in Analytics
A lot of you might already be aware of the recent Google change at encrypting all search activity except for clicks on ads. Rand did a whiteboard session on this recently. How is everyone planning to adjust their research data to accommodate for this change?
Reporting & Analytics | | SEO5Team0 -
Does Google Analytics parse visits from search apps?
Does anyone know if Google Analytics reports visits to your website differently from individual search apps like Google and Bing? Or do they just treat them the same as any other keyword visit from Google or Bing search engine? I suppose the end result is probably the same as in the Google app you're using Google so it would just be a different access point versus a new tool. I'm just curious if there is a way to see how many Mobile visits are coming from the apps vs the browser. For me personally I have the Google and Bing iOS apps installed but rarely use them, opting for the Safari search bar 99% of the time.
Reporting & Analytics | | nsauser0