Screenscraping with Go and Colly

Golang is a ridiculously fast and fun language to develop in - and it helped us migrate content from an old legacy website

Recently we were working with a client who has a pretty well-established website - but it runs on a very old content management system, and they had accumulated just over 1,000 pages of content. As we worked together to bring them towards a more modern tech stack, we came to the task of moving their old content to our new system.

Faced with the task of manually migrating 1,000+ pages - we decided:

“We’re a team of developers! Let’s automate this!”

So we drew up a rough plan:

  • Build a small script that would visit their existing site - and click on the links we needed it to
  • Once on the appropriate pages - find the content we want (mainly Title, Body content, and some category meta-data)
  • Take the content and put it in our new database

Attempt 1: PHP

For this particular client the backend was written in PHP, not Go. Yes - we write fully functional backends in Go all the time here at systemseven (which is how we know the language well) - but the requirements for this project had us sticking with a PHP build.

Writing this in PHP isn’t hard - just a simple script that fetches each page with file_get_contents('http://clienturl.com') and then uses something like PHP Simple HTML DOM Parser to move through the HTML on the pages and grab the data we need. The script wasn’t bad - we wrote it in about an hour.

It was when we ran it that it became a problem!

The first problem - each page was taking around 3-5 seconds to completely parse and save its content. Some quick “back of the napkin” math told us it would take roughly an hour to an hour and a half to import all that content.

That’s not fast enough!

The second problem - The script we wrote hit their server hard enough to do 2 things:

  • Degrade the performance of the existing website
  • Trigger some ‘malicious activities’ warnings from their existing web host

That’s not good!

Now a few quick caveats here - we could have made this faster than 3-5 seconds per page, threaded it out so it didn’t slam their server, and avoided the warnings flagging us as a bad actor - all with PHP. It’s perfectly possible.

But we thought - let’s spend 1 hour trying this with Go. If it works, great; if it doesn’t, we’ll go the PHP route.

Attempt 2: Go

Enter Go - “an open source programming language that makes it easy to build simple, reliable, and efficient software.” Yes - we ripped that description straight from the official Go website.

If you’ve not used Go before I’d encourage you to try it out - it’s one of those things that makes programming fun again. It has a super expressive syntax, a massive set of built-in libraries, and an extremely active developer community.

Now usually when we’re writing Go we find we don’t need to pull in many packages, because Go has so much built in already. But for this case we did pull in a package called Colly. Colly is designed to parse webpages (“scrape” them, if you will) and do it extremely fast - in fact it says it can handle up to 1,000 requests a second.

Installation was a breeze using Go’s built-in go get command:

go get -u github.com/gocolly/colly
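
Once it’s installed, a quick sanity check looks something like this - a minimal sketch, with clienturl.com standing in for the real site and the log line just there so you can see it working:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Print each URL as Colly requests it - handy for confirming the install works
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://www.clienturl.com/")
}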

You need to wrap your head around how Colly works: it wants to spider out across a site’s links the same way a crawler like Google would. So we did a couple of things up front to keep it from trying to parse any external links:

c := colly.NewCollector(
    colly.AllowedDomains("www.clienturl.com"),
    colly.Async(true),
)

Here we’re doing 2 things: limiting Colly to links on the client’s domain, and turning on async processing of requests (which is where we get a HUGE speed increase, as we’ll talk about in a bit).

Remember the ‘malicious activities’ warnings we got because we hit the server too fast and too frequently? Colly can help us with that too:

c.Limit(&colly.LimitRule{
    DomainGlob:  "*clienturl.*",
    RandomDelay: 5 * time.Second,
})

Translated: when the domain matches clienturl, we institute a random delay (of up to 5 seconds) before firing off another request. This lets us throttle and control the rate of our requests, especially since they’re async and would otherwise all fire at almost exactly the same time.
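
If the random delay alone isn’t enough, the same LimitRule can also cap how many requests are in flight at once. A sketch with a Parallelism cap added (the value of 2 is just an example, not what we shipped):

c.Limit(&colly.LimitRule{
    DomainGlob:  "*clienturl.*",
    Parallelism: 2,               // at most 2 requests against this domain at a time
    RandomDelay: 5 * time.Second, // plus a random wait of up to 5 seconds between them
})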

Great - now we’re making requests and crawling the site - but we need to DO stuff. Colly works with lifecycle callbacks on each request, similar to things like componentDidMount or componentWillReceiveProps in React. The docs have a full listing of the callbacks (there aren’t that many), but for our script we ended up using OnHTML almost exclusively.
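
A couple of the other callbacks are handy for logging while you dial in your selectors - something along these lines (the log output is purely illustrative):

// Log anything that comes back with an error
c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request to", r.Request.URL, "failed:", err)
})

// Fires once all OnHTML callbacks for a page have finished
c.OnScraped(func(r *colly.Response) {
    fmt.Println("Finished", r.Request.URL)
})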

Here’s how we parsed a ‘company listing’ page on this client’s site - a paginated listing with links like ‘0-9, A, B, C…’ for paging through results:

// Find and visit all alpha paging links
c.OnHTML("#mn-alphanumeric a[href]", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})

// 'Click' on the title of each listing
c.OnHTML(".mn-title a[href]:first-child", func(e *colly.HTMLElement) {
    e.Request.Visit(e.Attr("href"))
})

OnHTML fires whenever its selector matches in a returned page - i.e. every time a page we visit comes back. So what we do is pick through that HTML using jQuery-esque selectors - in our case, the links to the pages with content we want - and we essentially “click” on them by visiting their href value.

Now that we’re on a detail page with the content we want, it’s just as simple as grabbing that data, populating a struct, and writing that struct to the DB (we used Gorm here for simplicity).
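
For context, the Business struct and the Gorm connection used below look roughly like this - the field list mirrors what we scraped, but the connection string and the choice of MySQL are placeholders for whatever your stack actually is (this sketch assumes Gorm v1):

// Business mirrors the fields we pull off each detail page
type Business struct {
    gorm.Model
    Name     string
    Category string
    URL      string
    Address  string
    Phone    string
    Desc     string
}

// In main() - the connection string is a placeholder, and remember to import
// the matching Gorm dialect, e.g. _ "github.com/jinzhu/gorm/dialects/mysql"
db, err := gorm.Open("mysql", "user:pass@/newsite?charset=utf8&parseTime=True")
if err != nil {
    log.Fatal(err)
}
defer db.Close()
db.AutoMigrate(&Business{})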

c.OnHTML("#mn-resultscontainer", func(e *colly.HTMLElement) {
    temp := Business{}
    temp.Name = e.ChildText("h1")
    temp.Category = e.ChildText(".mn-member-cats li:first-child")
    temp.URL = e.ChildAttr("#mn-memberinfo-block-website a", "href")
    temp.Address = e.ChildText(".mn-address1")
    temp.Phone = e.ChildText(".mn-member-phone1")
    temp.Desc = e.ChildText(".mn-section-content p")

    db.Create(&temp)
})

So there we go - a few dozen lines of code and we have an extremely fast parser, especially with the async functionality turned on.
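
One gotcha worth calling out: with colly.Async(true) the initial Visit returns immediately, so you have to wait for the crawl to finish before the program exits. The glue at the bottom of main() looks roughly like this (the start URL is a placeholder for the real listing page):

// Kick off the crawl at the top-level company listing page
c.Visit("http://www.clienturl.com/list")

// Block until every queued request (and its callbacks) has completed
c.Wait()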

We ended up using this solution, and it ran - in totality - in about 10 minutes (vs the hour-plus for the PHP script), but we did have a few lessons learned…

Results and Lessons Learned

So why was it so much faster? It mainly boils down to how Go works as a language - it gives us an extremely high level of concurrency, it makes good use of multiple CPU cores, and since it compiles down to a native binary it runs that much closer to the ‘hardware’, as it were - as opposed to something like PHP, which runs through an interpreter at a much higher level.

We did notice when running this - both in PHP and Go - that our Dev DB was a bottleneck (our plan was to import into our Dev DB and then populate the other environments with dump files). If we were doing this against a production environment we could probably do some tweaking there and get the time a little under 10 minutes - but for our use case it really wasn’t that big of an issue.

We ALSO found that we needed to bump up our random delay just a little, because we flooded our Dev DB with connections and it started to reject them (another example of how fast concurrent Go can be).
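
Bumping the random delay was the quick fix; capping the importer’s connection pool would be another. With Gorm (v1) the underlying *sql.DB is exposed, so something like this should do it - treat the numbers as arbitrary examples:

// Limit how many connections the importer can hold open against the Dev DB
db.DB().SetMaxOpenConns(20)
db.DB().SetMaxIdleConns(10)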

But all in all, being able to automate a task and then take that automation from 50 minutes down to 10 minutes is a huge win for us - so much so that we were able to run this script a few times before launch, without fear, to grab the latest content.

Further

If you’re interested in some of the tech we used here, check out some of the following resources: