Last week we got an SEO analysis of one of our portals. The analysis included thorough statistics about the website's SEO measures, like missing and duplicate <title>, <h1> and meta tags, broken and invalid links, duplicate content percentage, etc. It appears that the SEO agency that prepared the analysis used some sort of crawler to extract this information.
I liked that crawler idea and wanted to implement it in PHP. After some reading about web scraping and Goutte I was able to write a similar web spider that extracts the needed information, and I wanted to share it in this post.
About web scraping and Goutte
Web scraping is a technique to extract information from websites. It is very close to web indexing, because the bot or web crawler that search engines use performs some sort of scraping of web documents: following links, analyzing keywords, meta tags and URLs, and ranking pages according to relevancy, popularity, engagement, etc.
Goutte is a screen scraping and web crawling library for PHP; it provides an API to crawl websites and extract data from the HTML/XML responses. Goutte is a wrapper around Guzzle and several Symfony components like BrowserKit, DomCrawler and CssSelector.
Here is a small description of some of the libraries that Goutte wraps:
- Guzzle: an HTTP client and framework for building RESTful web service clients; it provides a simple interface to perform cURL requests, along with other important features like persistent connections and streaming request and response bodies.
- BrowserKit: simulates the behaviour of a web browser, providing an abstract HTTP layer (request, response, cookie, etc.).
- DomCrawler: provides easy methods for DOM navigation and manipulation.
- CssSelector: provides an API to select elements using the same selectors used in CSS (it becomes extremely easy to select elements when combined with DomCrawler).
* These are the main components I am interested in for this post; however, other components like Finder and Process are also used in Goutte.
Basic usage
Once you download Goutte (from here), you should define a Client object; the client is used to send requests to a website and returns a crawler object, as in the snippet below.
Here I declared a client object and called request() to simulate a browser requesting the URL “http://zrashwani.com” using the GET HTTP method.
The request() method returns an object of type Symfony\Component\DomCrawler\Crawler, which can be used to select elements from the fetched HTML response.
But before processing the document, let's ensure that this URL is a valid link, meaning that it returned a 200 response code.
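A minimal version of that snippet might look like the following (assuming Goutte and its dependencies are loaded through the Composer autoloader):

```php
<?php
require_once 'vendor/autoload.php';

use Goutte\Client;

// create the client and request the page
$client = new Client();
$crawler = $client->request('GET', 'http://zrashwani.com');
```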
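A sketch of that check (depending on the BrowserKit version, the status is read with getStatus() or getStatusCode()):

```php
// make sure the page returned a successful (200) response
$status_code = $client->getResponse()->getStatus(); // getStatusCode() in newer versions
if ($status_code == 200) {
    // the link is valid, process the document
}
```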
The $client->getResponse() method returns a BrowserKit Response object that contains information about the response the client got, like headers (including the status code I used here), response content, etc.
In order to extract the document title, you can filter either by XPath or by CSS selector to get your target HTML DOM element value.
Similarly, you can get the number of <h1> tags in the page and their contents.
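For example, either of the following reads the <title> text (the second form relies on the CssSelector component):

```php
// extract the <title> contents, either by XPath or by CSS selector
$page_title = $crawler->filterXPath('html/head/title')->text();
$page_title = $crawler->filter('title')->text();
```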
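A sketch of that extraction (the each() call is explained right after the snippet):

```php
// count the <h1> tags and collect their contents
$h1_count = $crawler->filter('h1')->count();

$h1_contents = array();
$crawler->filter('h1')->each(function ($node) use (&$h1_contents) {
    $h1_contents[] = trim($node->text());
});
```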
For SEO purposes, there should be only one <h1> tag in a page, and its content should contain the main keywords of the page. Here the each() function is quite useful: it can be used to loop over all matching elements. The each() function takes a closure as a parameter to perform a callback operation on each node.
PHP closures are anonymous functions introduced in PHP 5.3; they are very useful for callback functionality. You can refer to the PHP manual if you are new to closures.
Application goals
After this brief introduction, I can begin explaining the spider functionality. This crawler will detect broken/invalid links in the website, along with extracting <h1> and <title> tag values that are important for the SEO issues I have.
My simple crawler implements a depth-limited search, in order to avoid crawling large amounts of data, and works as follows:
- Read the initial URL to crawl along with the depth of links to be visited.
- Crawl the URL and check the response code to make sure the link is not broken, then add it to an array containing the site links.
- Extract the <title> and <h1> tag contents in order to use their values later for reporting.
- Loop over all <a> tags inside the fetched document to extract their href attribute along with other data.
- Check that the depth limit is not reached, that the current href has not been visited before, and that the link URL does not belong to an external site.
- Crawl the child link by repeating steps 2-5.
- Stop when the links depth is reached.
These steps are implemented in the SimpleCrawler class that I wrote (it is still a basic version and should be optimized more).
You can try this class functionality as follows:
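A minimal sketch of such a class, following the steps above (the exact structure, property names and URL normalization are illustrative):

```php
<?php
require_once 'vendor/autoload.php';

use Goutte\Client;

class SimpleCrawler
{
    protected $base_url;
    protected $max_depth;
    protected $client;
    protected $site_links = array(); // info about each visited page, keyed by URL

    public function __construct($base_url, $max_depth = 3)
    {
        $this->base_url  = rtrim($base_url, '/');
        $this->max_depth = $max_depth;
        $this->client    = new Client();
    }

    public function traverse($url = null, $depth = 1)
    {
        if ($url === null) {
            $url = $this->base_url;
        }
        // stop when the depth limit is reached or the page was already visited
        if ($depth > $this->max_depth || isset($this->site_links[$url])) {
            return;
        }

        $this->site_links[$url] = array(
            'status_code' => null,
            'title'       => '',
            'h1_count'    => 0,
            'h1_contents' => array(),
            'depth'       => $depth,
        );

        try {
            $crawler = $this->client->request('GET', $url);
            // getStatusCode() in newer BrowserKit versions
            $status_code = $this->client->getResponse()->getStatus();
        } catch (\Exception $e) {
            $this->site_links[$url]['status_code'] = 'error';
            return;
        }

        $this->site_links[$url]['status_code'] = $status_code;
        if ($status_code !== 200) {
            return; // broken or invalid link, nothing more to extract
        }

        // extract <title> and <h1> contents for SEO reporting
        $title_node = $crawler->filterXPath('html/head/title');
        if ($title_node->count() > 0) {
            $this->site_links[$url]['title'] = trim($title_node->text());
        }
        $h1_contents = array();
        $crawler->filter('h1')->each(function ($node) use (&$h1_contents) {
            $h1_contents[] = trim($node->text());
        });
        $this->site_links[$url]['h1_count']    = count($h1_contents);
        $this->site_links[$url]['h1_contents'] = $h1_contents;

        // collect the child links found in the page
        $base_url    = $this->base_url;
        $child_links = array();
        $crawler->filter('a')->each(function ($node) use (&$child_links, $base_url) {
            $href = $node->attr('href');
            if ($href === null || $href === '' || preg_match('/^(#|mailto:|javascript:)/i', $href)) {
                return; // skip empty hrefs, in-page anchors, mail and javascript links
            }
            if (strpos($href, 'http') === 0 && strpos($href, $base_url) !== 0) {
                return; // skip links that belong to external sites
            }
            if (strpos($href, 'http') !== 0) {
                $href = $base_url . '/' . ltrim($href, '/'); // normalize relative links
            }
            $child_links[] = $href;
        });

        // crawl the child links, one level deeper
        foreach (array_unique($child_links) as $link) {
            $this->traverse($link, $depth + 1);
        }
    }

    public function getLinksInfo()
    {
        return $this->site_links;
    }
}
```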
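Assuming the sketch above, basic usage could look like this:

```php
$simpleCrawler = new SimpleCrawler('http://example.com', 3); // initial URL and links depth
$simpleCrawler->traverse();
$links_data = $simpleCrawler->getLinksInfo();
```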
The getLinksInfo() method returns an associative array containing information about each page crawled, such as the URL of the page, <title> and <h1> tag contents, status_code, etc. You can store these results any way you like; I prefer MySQL, for simplicity, so that I can get the desired results using queries. So I created a pages_crawled table as follows:
And here I store the traversed links into the MySQL table:
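A possible schema, created here through PDO so the same connection can be reused for the inserts below (the database name, credentials and column names are illustrative):

```php
// connect to MySQL and create the pages_crawled table (schema is a suggestion)
$pdo = new PDO('mysql:host=localhost;dbname=crawler_db;charset=utf8', 'db_user', 'db_pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS pages_crawled (
        id INT AUTO_INCREMENT PRIMARY KEY,
        url VARCHAR(255) NOT NULL,
        status_code VARCHAR(10),
        page_title VARCHAR(255),
        h1_count INT,
        h1_contents TEXT,
        depth INT
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
");
```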
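A sketch of that storage step, reusing the $pdo connection and the $links_data array from above:

```php
// insert one row per crawled page
$stmt = $pdo->prepare(
    'INSERT INTO pages_crawled (url, status_code, page_title, h1_count, h1_contents, depth)
     VALUES (:url, :status_code, :page_title, :h1_count, :h1_contents, :depth)'
);

foreach ($links_data as $url => $info) {
    $stmt->execute(array(
        ':url'         => $url,
        ':status_code' => $info['status_code'],
        ':page_title'  => $info['title'],
        ':h1_count'    => $info['h1_count'],
        ':h1_contents' => implode("\n", $info['h1_contents']),
        ':depth'       => $info['depth'],
    ));
}
```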
Running the spider
Now let me try out the spider on my blog URL, with the depth of links to be visited set to 2:
Now I can get the important information that I need using simple SQL queries on the pages_crawled table, as follows:
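Something along these lines, using the sketch above:

```php
set_time_limit(0); // crawling a whole site can take a while

$simpleCrawler = new SimpleCrawler('http://zrashwani.com', 2); // depth of links to visit: 2
$simpleCrawler->traverse();
$links_data = $simpleCrawler->getLinksInfo();
// ...then store $links_data into pages_crawled as shown above
```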
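For example, two queries along these lines, run here through the same PDO connection (they assume the pages_crawled columns sketched earlier):

```php
// first query: number of pages containing more than one <h1> tag
$pages_with_duplicate_h1 = $pdo->query(
    'SELECT COUNT(*) FROM pages_crawled WHERE h1_count > 1'
)->fetchColumn();

// second query: page titles shared by more than one page
$duplicate_titles = $pdo->query(
    'SELECT page_title, COUNT(*) AS pages_count
     FROM pages_crawled
     GROUP BY page_title
     HAVING pages_count > 1'
)->fetchAll(PDO::FETCH_ASSOC);
```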
In the first query, I returned the number of pages with duplicate h1 tags (I found a lot; I will consider changing the HTML structure of my blog a little bit), and in the second one, I returned the duplicated page titles.
Now we can get many other statistics on the pages traversed using the information we collected.
Conclusion
In this post I explained how to use Goutte for web scraping, using a real-world example that I encountered in my job. Goutte can easily be used to extract a great amount of information about any webpage through its easy API for requesting pages, analyzing the response and extracting specific data from the DOM document.
I used Goutte to extract some information that can be used as SEO measures for the specified website, and stored it in a MySQL table so that any report or statistic can be derived from it with a query.
Update
Thanks to Josh Lockhart, this code has been modified for Composer and Packagist and is now available on GitHub: https://github.com/codeguy/arachnid