Boost your web crawler's efficiency!
When I am crawling websites, the most annoying situation is having my crawler blocked. To become really good at web crawling, you not only need to write XPath or CSS selectors quickly; how you design your crawler also matters a lot, especially in the long run.
During the first year of my web crawling journey, I always focused on how to scrape a website. Being able to scrape the data, then clean and organise it, was already enough to make my day. After crawling more and more websites, I found four elements that are the most vital to building a great web crawler.
How do you judge whether a web crawler is great? You might want to consider the following points:
Speed of the crawler
Are you able to scrape the data within the time you have?
Completeness of the data scraped
Do you manage to scrape all the data you are interested in?
Accuracy of the data scraped
How can you ensure the data you have scraped is accurate?
Scalability of the web crawler
Can you scale the web crawler as the number of websites increases?
To answer all the questions above, I will share some tips that can help you build a great web crawler.
Tip #1 Decrease the number of times you need to request a web page.
Take Selenium as an example of a web scraping framework, and suppose we want to scrape this website: https://www.yellowpages.com.sg/category/cleaning-services
Let's say we want the address and the description of each company. With Selenium, we might call driver.find_element twice to retrieve the address and the description separately. A better way is to download the page source through the driver once and use BeautifulSoup to extract the data you need. In short, hit the website once instead of twice to be less detectable!
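Here is a minimal sketch of that idea. The .listing-address and .listing-description selectors are placeholders I made up; inspect the actual page to find the right ones:

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.yellowpages.com.sg/category/cleaning-services")

# Grab the rendered HTML once, then parse it locally with BeautifulSoup,
# instead of issuing a separate find_element call for each field.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# ".listing-address" and ".listing-description" are placeholder selectors.
addresses = [tag.get_text(strip=True) for tag in soup.select(".listing-address")]
descriptions = [tag.get_text(strip=True) for tag in soup.select(".listing-description")]
```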
Another situation: when using WebDriverWait(driver, timeout, poll_frequency=0.5, ignored_exceptions=None) to wait for a page to load fully, remember to set poll_frequency (the sleep interval between calls) to a higher value to minimize how often the wait condition is checked against the page. For more details, read through the official Selenium documentation!
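For example, here is a sketch that waits up to 30 seconds for a listings container, polling every 2 seconds instead of the default 0.5 (the element ID is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.yellowpages.com.sg/category/cleaning-services")

# Check the wait condition every 2 seconds rather than every 0.5 seconds.
wait = WebDriverWait(driver, timeout=30, poll_frequency=2)
listings = wait.until(EC.presence_of_element_located((By.ID, "listings")))  # placeholder ID
```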
Tip #2 Write the data to the CSV as soon as each record is scraped
Previously, when I was scraping websites, I would output the records only once, after all of them had been scraped. However, this is not the smartest way to complete the task.
Instead, write each record to the file right after you scrape it. That way, when a problem occurs (for example, your computer stops running, or your program dies because of an error), you can restart your crawler/scraper from the website where the problem occurred :)
I normally use Python's csv writerow function to append each record to the output file as it is scraped, so that if my scraper stops, I can resume without re-scraping all the previously scraped data.
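A minimal sketch of this pattern, where scrape_records is a stand-in for your own scraping logic:

```python
import csv

def scrape_records():
    # Stand-in for your own scraping logic: yield one record at a time.
    yield ("ACME Cleaning", "123 Example Road", "Office cleaning services")

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "address", "description"])  # header row
    for record in scrape_records():
        # Write and flush each record immediately, so a crash loses
        # at most the record currently being scraped.
        writer.writerow(record)
        f.flush()
```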
Tip #3 Reduce the Number of Times You Get Blocked by the Website
There are quite a lot of measures you can implement so that crawlers or scrapers won't get blocked, but a good crawling framework will reduce the effort needed to implement them.
The main libraries I use are Requests, Scrapy, and Selenium (you can find plenty of comparisons between Scrapy and Selenium online). I prefer Scrapy because it already implements some of the ways to reduce how often you get blocked by a website:
1. Obey robots.txt; always check the file before scraping a website.
2. Use the download delay or the AutoThrottle mechanism built into the Scrapy framework to make your scraper/crawler slower (see the settings sketch after this list).
3. The best way is to rotate a few IPs and user agents to disguise your requests.
4. If you are using scrapy-splash or Selenium, random clicking and scrolling help to mimic human behavior.
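Points 1 and 2 translate directly into Scrapy settings. A minimal settings.py sketch, with delay values that are only illustrative:

```python
# settings.py (excerpt) of a Scrapy project

# Point 1: check and obey robots.txt before requesting any page.
ROBOTSTXT_OBEY = True

# Point 2: slow the crawler down, either with a fixed delay between
# requests or by letting AutoThrottle adapt the delay to server latency.
DOWNLOAD_DELAY = 2  # seconds; illustrative value
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```

IP and user-agent rotation (point 3) is usually handled by a downloader middleware or a proxy service rather than a single setting.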
Tip #4 Retrieve data through an API
Take Twitter as an example. You can scrape the website, or you can use its API. Instead of going through the scraping hurdles, accessing the data through an API will definitely make your life much easier!
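The general pattern looks like the sketch below. The endpoint, token, and parameters are placeholders, not Twitter's real API; check the provider's documentation for the actual paths and authentication scheme:

```python
import requests

API_URL = "https://api.example.com/v1/posts"  # placeholder endpoint
TOKEN = "YOUR_API_TOKEN"                      # placeholder credential

# One authenticated request returns clean, structured JSON:
# no HTML parsing and no selectors to maintain.
response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"query": "web crawling", "limit": 10},
    timeout=10,
)
response.raise_for_status()
for item in response.json().get("data", []):
    print(item)
```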
Tip #5 Crawl Google's cached version of the website instead of the original website
Given https://www.investing.com/ as an example, you can access the cached version by prefixing the URL with Google's cache address: http://webcache.googleusercontent.com/search?q=cache:https://www.investing.com/
You will not get the latest data by scraping the cached page instead of the live website. However, if the site's data only changes every few weeks or even months, scraping the Google cache is a much wiser option.
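Since the cached copy lives at a predictable URL, fetching it is straightforward; a small sketch with the requests library:

```python
import requests

original_url = "https://www.investing.com/"
cache_url = "http://webcache.googleusercontent.com/search?q=cache:" + original_url

# Fetch Google's cached snapshot instead of hitting the original site.
response = requests.get(cache_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
print(response.status_code)
print(response.text[:500])  # first part of the cached HTML
```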
Original blog post: https://towardsdatascience.com/https-towardsdatascience-com-5-tips-to-create-a-more-reliable-web-crawler-3efb6878f8db?source=friends_link&sk=947d1ddc86ac015bb55c007d41dd9798
About the Author
Low Wei Hong is a Data Scientist at Shopee. His experience centers on crawling websites, creating data pipelines, and implementing machine learning models to solve business problems.
He provides crawling services that deliver the accurate, cleaned data you need. You can visit his website to view his portfolio and to contact him about crawling services.