Algolia Search Party: Crawling the Web

Search Party #18 — Crawling Edition

We were happy to host our regular Search Party on Wednesday, June 12th, 2019. This time, the topic was crawling web content.

People tend to think crawling means stealing other people’s data. Although some crawlers do that, crawling itself is simply the act of extracting content from websites, and the motive is more often legitimate than not. During this event, we had three great talks that presented different ways to crawl web content and discussed challenges that are easy to overlook when developing a crawler.

The challenges of crawling the web — and how to overcome them

Samuel Bodin, Algolia

In the first presentation, Samuel Bodin gave us a glimpse into how Algolia indexes complex documents such as PDFs, Word documents, and spreadsheets, and how it renders JavaScript-heavy websites at enormous scale.

He also covered a common trap on websites: the “rabbit hole”, a place where your crawler gets stuck forever following endlessly generated links.
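The usual defenses against a rabbit hole are a visited set and a depth cap. Here is a rough Python sketch of the idea; the `fetch_links` helper and its link graph are made up for illustration, with a `/calendar?page=N` endpoint simulating an infinite chain of generated pages:

```python
from collections import deque

# Hypothetical in-memory link graph standing in for a real website.
# "/calendar?page=N" simulates a rabbit hole: each page links to the next, forever.
def fetch_links(url):
    if url.startswith("/calendar?page="):
        page = int(url.rsplit("=", 1)[1])
        return [f"/calendar?page={page + 1}"]
    site = {
        "/": ["/about", "/calendar?page=1"],
        "/about": ["/"],
    }
    return site.get(url, [])

def crawl(start, max_depth=5):
    """Breadth-first crawl with a visited set and a depth cap,
    so endlessly generated URLs cannot trap the crawler."""
    seen = {start}
    queue = deque([(start, 0)])
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # depth cap: stop descending into potential rabbit holes
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled

pages = crawl("/")
```

With `max_depth=5`, the crawler visits the calendar only five pages deep instead of looping forever.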

Last but not least, he gave a quick overview of how Algolia handles the security side of crawling, especially when executing JavaScript written by customers on Algolia’s servers without exposing any sensitive data.


Writing a distributed crawler architecture

Nenad Tičarić, TNT Studio

In the second presentation, Nenad Tičarić talked about the architecture of a web crawler and how to build one quickly with the PHP framework Laravel.

He broke his presentation down into two parts. He started with a good overview of crawlers and introduced a few terms that you’ll likely want to know before digging into the subject. He also described how to design and architect an automatic web crawler at scale.

The second part focused on how to achieve this very simply with PHP, more specifically Laravel, and a few basic tools such as Guzzle and Artisan.
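The talk implemented this with Laravel’s queue workers; as a language-neutral illustration, here is a rough Python sketch of the same job-queue pattern. The in-process `queue.Queue` stands in for a queue backend such as Redis or a database, each thread stands in for a queued worker, and the link graph is made up:

```python
import queue
import threading

# Hypothetical link graph standing in for the pages a real fetcher would download.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
    "/c": [],
}

frontier = queue.Queue()   # shared frontier of URLs to crawl
seen = set()               # URLs already enqueued, so each is crawled once
seen_lock = threading.Lock()
results = []               # URLs actually processed

def worker():
    while True:
        url = frontier.get()
        if url is None:            # poison pill: shut this worker down
            frontier.task_done()
            return
        results.append(url)        # stand-in for fetching and indexing the page
        for link in LINKS.get(url, []):
            with seen_lock:
                if link in seen:
                    continue
                seen.add(link)
            frontier.put(link)     # discovered URLs go back on the shared queue
        frontier.task_done()

seen.add("/")
frontier.put("/")
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
frontier.join()                    # wait until every queued URL is processed
for _ in workers:
    frontier.put(None)
for w in workers:
    w.join()
```

The same shape scales out naturally: swap the in-process queue for a shared backend and run the workers on separate machines.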


Automatic extraction of structured data from the Web

Karl Leicht, Fabriks

For the last talk of the day, Karl Leicht spoke about how to achieve automatic and smart attribute extraction with a crawler.

How do you crawl millions of different websites? That’s the interesting question Karl asked us. He described how to scale your code without reinventing the wheel for every website.

We saw how to programmatically differentiate a listing page from a product page, why microdata matters, and where to find the most valuable information on a page.
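One way this distinction can work (not necessarily Karl’s exact approach) is to look at the schema.org microdata types a page declares: a product page typically carries a single `Product` item, while a listing page carries an `ItemList` or many `Product` items. A rough Python sketch, with made-up HTML snippets:

```python
from html.parser import HTMLParser

# Made-up HTML snippets using schema.org microdata markup.
PRODUCT_PAGE = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Blue Sneakers</span>
  <span itemprop="price">49.90</span>
</div>
"""

LISTING_PAGE = """
<div itemscope itemtype="https://schema.org/ItemList">
  <div itemscope itemtype="https://schema.org/Product"></div>
  <div itemscope itemtype="https://schema.org/Product"></div>
</div>
"""

class MicrodataTypes(HTMLParser):
    """Collects the schema.org type names declared via itemtype attributes."""
    def __init__(self):
        super().__init__()
        self.types = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "itemtype" and value:
                self.types.append(value.rsplit("/", 1)[-1])

def classify(html):
    parser = MicrodataTypes()
    parser.feed(html)
    if "ItemList" in parser.types or parser.types.count("Product") > 1:
        return "listing"
    if "Product" in parser.types:
        return "product"
    return "unknown"
```

Real-world pages are messier (JSON-LD, RDFa, or no structured data at all), which is exactly why microdata, when present, is such valuable signal.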

The second part focused on the challenges of maintaining this code in the long run, with a close look at testing and monitoring.

The Next Event

We host our Search Parties at our Paris office. They’re open to everyone and… they’re free! Please join us next time.

Follow us on Eventbrite to be notified about the next event.