- knightmaster
Web Socket: The Fastest Way To Scrape Websites
Scraping In Another Dimension
When you scrape a website, you normally go through the following steps:
1. Check whether the website provides a RESTful API. If so, just use it; if not, continue to the next step.
2. Inspect the HTML elements that you want to scrape.
3. Try a simple request to get the elements.
4. Success, hell yeah!
5. If not, try another CSS selector, XPath and so on.
6. If that still fails, the site may be rendered with JavaScript, and you need other scraping tools such as Selenium or Scrapy-Splash.
7. Keep repeating steps 1–5 until you succeed.
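Steps 3 and 5 above can be sketched as a small fallback loop that tries one XPath after another until something matches. This is only an illustration: the page fragment, the class names and the `first_match` helper are made up, and the standard-library `xml.etree.ElementTree` parser used here only handles well-formed markup, so a real page would usually need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real page would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <div id="product">
    <h1>Blue Widget</h1>
    <span class="price">$19.90</span>
  </div>
</body></html>
"""


def first_match(markup, xpaths):
    """Return (xpath, text) for the first expression that matches, else None."""
    root = ET.fromstring(markup.strip())
    for xpath in xpaths:
        node = root.find(xpath)
        if node is not None and node.text:
            return xpath, node.text.strip()
    return None


# Candidate expressions, tried in order (all hypothetical).
candidates = [
    ".//p[@class='price']",      # first guess, does not match
    ".//span[@class='amount']",  # second guess, does not match
    ".//span[@class='price']",   # matches the fragment above
]

result = first_match(SAMPLE_HTML, candidates)
print(result)
```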
Sharing of my experience
Yeah, these steps sound very repetitive: it is the same routine every time you retrieve data. Sometimes something different happens, for example you get blocked by a website.
Most of the time, you will say the website is smart and can tell a bot from a human. However, from what I have observed, some people say they can scrape almost every element on a website except certain parts, such as the price of a product. When I check their code, the cause is very likely an incorrect XPath or incorrect preprocessing of the scraped text.
When the website really is blocking your bot, you will normally do the following to disguise it:
1. Rotate multiple user-agents.
2. Rotate multiple IPs.
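A minimal sketch of both kinds of rotation, assuming you already have a pool of user-agent strings and proxy addresses. The values below are placeholders (the proxy addresses are from a documentation-only range), and `rotating_settings` is a hypothetical helper name.

```python
from itertools import cycle

# Placeholder pools for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]


def rotating_settings(user_agents, proxies):
    """Yield (headers, proxies) pairs, cycling through both pools forever."""
    ua_cycle, proxy_cycle = cycle(user_agents), cycle(proxies)
    while True:
        proxy = next(proxy_cycle)
        yield {"User-Agent": next(ua_cycle)}, {"http": proxy, "https": proxy}


settings = rotating_settings(USER_AGENTS, PROXIES)
for _ in range(3):
    headers, proxies = next(settings)
    print(headers["User-Agent"], "via", proxies["http"])
    # With the requests library, each pair would be passed as:
    # requests.get(url, headers=headers, proxies=proxies, timeout=10)
```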
Web Socket to the rescue
But what if I told you there might be a way to retrieve data without repeating those boring steps, although it only applies to certain websites?
Have you ever wondered how a news or chat website updates at lightning speed? The website may be using WebSocket, which I am going to introduce today.
The WebSocket protocol is an open standard for developing real-time applications. It provides a persistent connection between a client and a server that both parties can use to send data at any time, with or without a login. Once you can connect to the tunnel, you don’t need to inspect any HTML element. You only need to inspect what to send through the tunnel, and you will be able to retrieve your data continuously.
Python Code Part
Here I am just gonna demonstrate a simple piece of code that connects through WebSocket using Python.
I will take https://www.kolumbus.no/ruter/kart/sanntidskart-internt/?c=58.974238,5.691347,14&lf=all&vt=bus,ferry as an example. This website provides a live view of buses in Stavanger, Norway.
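A minimal sketch of such a connection, assuming the third-party `websocket-client` package. The endpoint URL and the subscribe message are placeholders, not the site's real values; the real ones have to be read from the WebSocket entries in the browser's network inspector. The `stream` and `run_live` helpers are hypothetical names.

```python
import json

# Placeholder values (assumptions): copy the real endpoint and the first
# message the page sends from the browser's network inspector.
WS_URL = "wss://example.com/realtime"
SUBSCRIBE = {"action": "subscribe", "channel": "vehicles"}


def stream(connect, url, subscribe, limit):
    """Open a connection, send one subscribe message, return `limit` raw messages."""
    ws = connect(url)
    try:
        ws.send(json.dumps(subscribe))
        return [ws.recv() for _ in range(limit)]
    finally:
        ws.close()  # always release the connection, even on errors


def run_live():
    """Connect for real. Requires: pip install websocket-client"""
    from websocket import create_connection

    for message in stream(create_connection, WS_URL, SUBSCRIBE, limit=5):
        print(message)
```

Calling `run_live()` after filling in a real endpoint and message would print the first five messages the server pushes. Passing the connection factory in as `connect` keeps the loop easy to try out with a fake socket.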
After you run this snippet of code in the terminal, you will see the messages from the server printed as they arrive.
From my experience retrieving data through a WebSocket, you will be shocked by how fast you can retrieve the data compared to normal scraping through inspecting HTML elements.
Benefits of using Web Socket
1. Keep the connection alive, as most proxy servers disconnect after 65 seconds.
2. Retrieve data in a nice JSON format.
3. No need to do the repetitive work of inspecting elements.
4. Easier to maintain, as XPath or CSS paths change constantly.
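To illustrate the JSON point: once a message arrives, reading a value is a dictionary lookup rather than an XPath query. The message and its field names below are invented for the example and are not the schema of any particular site.

```python
import json

# Hypothetical message; the field names are assumptions for illustration.
raw = '{"vehicle_id": "bus-42", "lat": 58.9742, "lon": 5.6913, "delay_s": 30}'


def parse_position(message):
    """Return (vehicle_id, (lat, lon)) from one JSON-encoded message."""
    data = json.loads(message)
    return data["vehicle_id"], (data["lat"], data["lon"])


print(parse_position(raw))
```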
Final Thought
I am currently working as a Data Scientist, and what I can tell you is that crawling is still very important. So if you are able to connect through a WebSocket to retrieve data, please do use it, as it will benefit you, especially in the long term.
Thank you for reading this post. Feel free to leave comments below on topics that you may be interested to know. I will be publishing more posts in the future about my experiences and projects.
This post originated from this link: https://towardsdatascience.com/scraping-in-another-dimension-7c6890a156da
About the Author
Low Wei Hong is a Data Scientist at Shopee. His experience mostly involves crawling websites, creating data pipelines and implementing machine learning models to solve business problems.
He provides crawling services that can give you the accurate, cleaned data you need. You can visit this website to view his portfolio and to contact him for crawling services.