Two NGI grants and the use of a Fed4Fire+ testbed are helping Julien Nioche turn a long-simmering idea into an enterprise-grade solution: URL Frontier, an open source crawl frontier API and implementation.
Where does your passion/motivation for this subject originate – what drove you to work on this?
I’m originally from France but I’ve been living in the UK since 2005. I have been involved in web crawling for many years and have worked with numerous organisations. Most of what I do is open source, and I have contributed in various forms to quite a few open source projects related to web crawling, such as Apache Nutch and StormCrawler, the distributed web crawler I created.
How did you come up with the idea for your project?
The project I worked on in the context of Fed4Fire+ is called URL Frontier and it is related to web crawling. The original idea for it has been slowly maturing in my mind and it is the result of many conversations with various people, in particular with Andy Jackson, who is the Web Archiving Technical Lead at the British Library.
What problem does your project solve?
Web crawlers need to store information about the URLs they process; this collection is called a crawl frontier. Typically, each web crawler has its own way of implementing it. What my project, URL Frontier, does is (1) define an API for the operations that web crawlers perform when communicating with a crawl frontier and (2) provide an implementation of a crawl frontier.
By externalising this functionality from web crawlers, we can reuse the same code and make it better instead of constantly reinventing the wheel. Because the API and implementations are based on gRPC, URL Frontier can be used by crawlers regardless of the programming language they are written in.
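The semantics a crawl frontier provides can be illustrated with a minimal in-memory sketch. The class and method names below are illustrative only, not the actual URL Frontier gRPC API: the idea is simply that discovered URLs are added to per-host queues, duplicates are filtered out, and batches of URLs are handed back to crawlers for fetching.

```python
from collections import defaultdict, deque

class InMemoryFrontier:
    """Illustrative sketch of crawl frontier semantics: URLs are grouped
    into queues (typically one per hostname) and handed out for fetching."""

    def __init__(self):
        self.queues = defaultdict(deque)  # queue key -> URLs waiting to be fetched
        self.seen = set()                 # URLs already known, to avoid re-adding duplicates

    def put_urls(self, key, urls):
        """Add newly discovered URLs to a queue, skipping already-known ones."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queues[key].append(url)

    def get_urls(self, key, max_urls):
        """Hand out up to max_urls from a queue for a crawler to fetch."""
        batch = []
        while self.queues[key] and len(batch) < max_urls:
            batch.append(self.queues[key].popleft())
        return batch

frontier = InMemoryFrontier()
frontier.put_urls("example.com", ["https://example.com/", "https://example.com/about"])
frontier.put_urls("example.com", ["https://example.com/"])  # duplicate, ignored
print(frontier.get_urls("example.com", 10))
# → ['https://example.com/', 'https://example.com/about']
```

In the real project, operations like these are defined as a gRPC service, which is what lets any crawler, whatever language it is written in, talk to the same frontier.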
How did NGI project support your idea and what’s new about it?
URL Frontier was developed thanks to a grant from the NGI Zero Discovery framework. The initial version of the project was completed at the end of 2021.
Although the concepts of crawl frontiers are well known, there is no open source project equivalent to URL Frontier.
Getting the funding from NGI was instrumental in achieving URL Frontier. Without it, I would not have been able to dedicate the time needed to work on it.
Did you have a concept of your final project idea, or did it evolve during the process?
The project and its objectives were pretty clear from the start, but an important part of the work was designing the API. This was done in collaboration with a community of developers involved in web crawling, and it was nice to see a consensus emerge after a couple of iterations. Similarly, there were several unknowns as far as the implementation was concerned. Would we be able to deliver something that could work well enough at a large scale? We seem to have succeeded in doing this.
How did you test your experiment on Fed4Fire+?
We used IMEC’s Virtual Wall testbed and named our experiment “Running a large scale distributed web crawl with URL Frontier and StormCrawler”.
The aim of the experiment was to run a large scale crawl with StormCrawler using URL Frontier. We wanted to check that the implementation of the service could cope with a large crawl in a timely fashion, as well as fix any bugs and improve the performance of the service.
The output of the crawl was a collection of WARC files, the standard format used in web archiving. These files were donated to the Common Crawl Foundation, which will host and distribute them for free.
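For readers unfamiliar with WARC, a file in that format is a sequence of records, each consisting of a version line, named headers, a blank line, and a payload. The sketch below builds a single, simplified record to show the shape of the format; production crawlers use dedicated WARC libraries rather than hand-rolled code like this.

```python
import uuid
from datetime import datetime, timezone

def minimal_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Build a single, simplified WARC 1.0 'response' record as bytes:
    a version line, named headers, a blank line, then the payload."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(payload)}",
    ]
    # A record ends with two CRLFs after the payload block.
    return ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8") + payload + b"\r\n\r\n"

record = minimal_warc_record("https://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello")
print(record.decode("utf-8")[:60])
```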
The web crawl ran for several weeks and by the end had fetched 354M URLs and discovered an additional 1.2B URLs. All of this was stored in a single URL Frontier instance, which confirmed that it could be used reliably at a large scale. A total of 36.8 TB of WARC files was stored on AWS S3.
How did testing the project carry it forward?
Running the experiment allowed us to check that the code worked well even at a scale close to a real-world deployment. We also fixed quite a few bugs and improved several things in the process.
Will you be taking the idea further now that the support from NGI and F4F+ is over?
By running the experiment on a large scale, we were able to better describe possible improvements to URL Frontier. This allowed us to apply for another round of funding with NGI Zero Discovery, and we now have a second version of the project going! It will add enterprise features, such as the ability to retrieve metrics from the frontier and monitor its performance in real time, as well as the ability to work as a cluster and cope with an even larger amount of data, helping URL Frontier go from a reliable piece of software to a more complete one.
Furthermore, we are now seeing growing adoption of URL Frontier, with various open source web crawlers and businesses testing it as a solution for their distributed web crawls.
This is very exciting and would not have been possible without NGI and Fed4Fire+.
More info about:
Julien Nioche, Director at DigitalPebble Ltd: https://digitalpebble.com
URL Frontier: https://nlnet.nl/project/URL-Frontier-2/
Common Crawl Foundation: https://commoncrawl.org/