How Web Crawling Benefits Data Science
Explained in 3 real-world scenarios
You might be wondering: is web crawling even needed in data science?
If I am a data scientist, the data labeling part will not be my job, so why should I bother learning web crawling to retrieve data?
Having this question in mind is quite normal, as web crawling is not a must for those who want to become data scientists. However, having this skill is definitely a plus. Back when I was studying at university, I did not think this skill would be beneficial. I pictured web scraping as just an automation tool.
However, after working for the past few years, I have begun to notice an increasing demand for this skill, as we are in an era of information explosion. Much of the information you can see on a website can be scraped.
Without further ado, I would like to share some real-world use cases I have observed in the job market. You will see how web crawling directly benefits data science.
Let’s imagine that you are shopping online. Your goal is to buy a new iPhone to reward yourself for finishing a big project. So, you go to taobao.com and search for the iPhone.
The listings are likely to be all over the place. You scroll to find the model you would like to purchase. However, all you see are the latest models, the iPhone 11 Pro and the iPhone 11, and you would like to see whether other models are sold on the platform.
Then, you decide to search for a more generic term, iPhone, as shown in the picture. You notice suggested iPhone models below. You can click on them to see all the listings for a given model, and they also show you some of the models offered on Taobao.
How does Taobao actually do it?
There are many ways to do it, but I will only share how to achieve it using web crawling. One way is to crawl websites that list all the model names for a particular product. Then, train a model to classify each listing under one of the scraped model names.
Boom! This is how web scraping can help you obtain the data you need for free. After some simple cleaning of the scraped data, you may have a clean data set with which to start training machine learning models.
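To make the idea concrete, here is a minimal sketch of the two steps: pulling model names out of a crawled page, then attaching one of those names to each listing title. Everything in it is an assumption for illustration: the HTML snippet, the `model` CSS class, and the helper names `extract_model_names` and `label_listing` are made up, and this is not how Taobao does it. Simple substring matching stands in for the trained classifier; in practice, labels produced this way could serve as training data for a real model.

```python
from html.parser import HTMLParser


class ModelNameParser(HTMLParser):
    """Collect the text of every <li class="model"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_model = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "li" and ("class", "model") in attrs:
            self._in_model = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_model = False

    def handle_data(self, data):
        if self._in_model and data.strip():
            self.names.append(data.strip())


def extract_model_names(html):
    """Return the model names found in the crawled HTML."""
    parser = ModelNameParser()
    parser.feed(html)
    return parser.names


def label_listing(title, model_names):
    """Return the longest model name contained in the title, or None."""
    title_lower = title.lower()
    matches = [m for m in model_names if m.lower() in title_lower]
    # Prefer the longest match so "iPhone 11 Pro" wins over "iPhone 11".
    return max(matches, key=len) if matches else None


# Stand-in for a page you crawled; a real crawler would download this.
SAMPLE_HTML = """
<ul>
  <li class="model">iPhone 11</li>
  <li class="model">iPhone 11 Pro</li>
  <li class="model">iPhone XR</li>
</ul>
"""

model_names = extract_model_names(SAMPLE_HTML)
listings = [
    "Brand new iPhone 11 Pro 256GB midnight green",
    "Used iphone xr, good condition",
    "Phone case, fits most models",
]
labeled = [(title, label_listing(title, model_names)) for title in listings]
```

Listings that come back with `None` are the ones that need either more cleaning or a smarter classifier.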
On the other hand, let’s say you are going to a data science job interview at an AI company. The company you are interviewing with provides multiple real-time API services, but here let’s just imagine there is only one service.
This particular service works as follows: given a web URL, the API should return whether the website is legitimate. In other words, it checks whether the company behind the website is actually carrying out legal business operations.
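A service like this needs crawled page content turned into features before any model can score it. Below is a rough sketch of that step only. The signals (HTTPS, a page title, contact and about links) are my own guesses at what might be useful, not the company's actual feature set, and the names `PageSignalParser` and `extract_site_features` are hypothetical. Downloading the page and training the classifier are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageSignalParser(HTMLParser):
    """Record whether the page has a <title> and collect its link targets."""

    def __init__(self):
        super().__init__()
        self.link_targets = []
        self.has_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.has_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.link_targets.append(href.lower())


def extract_site_features(url, html):
    """Turn a URL and its crawled HTML into a small feature dictionary."""
    parser = PageSignalParser()
    parser.feed(html)
    links = parser.link_targets
    return {
        "uses_https": urlparse(url).scheme == "https",
        "has_title": parser.has_title,
        "num_links": len(links),
        "has_contact_link": any("contact" in href for href in links),
        "has_about_link": any("about" in href for href in links),
    }


# Stand-in for a crawled page; example.com is a placeholder domain.
SAMPLE_URL = "https://shop.example.com"
SAMPLE_PAGE = """
<html>
  <head><title>Example Shop</title></head>
  <body>
    <a href="/about-us">About us</a>
    <a href="/contact">Contact</a>
    <a href="/products">Products</a>
  </body>
</html>
"""

features = extract_site_features(SAMPLE_URL, SAMPLE_PAGE)
```

A dictionary like `features` would be one row of the training set; the label (legitimate or not) still has to come from somewhere else.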
There are two possible scenarios. The company will either subcontract the web scraping part to another company or hire people to do the job. Let’s say the company subcontracts it. At the same time, you know how to scrape websites. Do you think there is a higher chance that the company will hire you? The answer is obviously yes, although it still depends on how you perform in the interview.
So don’t think that web scraping is useless if you want to stand a chance of entering the data science field.
Take every chance you get in life. Because some things only happen once. — Karen Gibbs
You have just graduated from university and received an offer for a data science job at a fintech company. The main project you are working on is credit scoring. Let’s assume that the company has only some basic data, for example date of birth and gender. However, it does not have more important data, such as how much salary a person earns or whether the person carries any loans.
Maybe your company is large enough that you can mine important features from its own websites or apps. But what if the company you are working for is small?
You will struggle to improve the performance of the machine learning model. You have already included all the data the company has, but you still cannot boost the performance.
What could you do?
If you know how to crawl Facebook pages, you will be able to get more features. What if one of those features turns out to be the gem you are looking for?
With more data, you have a better chance of finding the breakthrough you need. More often than not, it is easier to find a breakthrough by getting the right data for the model.
This is not an article to brainwash you into learning web scraping. Rather, I would like to point out what I have seen and experienced. The three scenarios are real-world examples of how web crawling makes an impact on the data science field. Believe it or not, this skill is worth learning and it will benefit your career.
The capacity to learn is a gift; the ability to learn is a skill; the willingness to learn is a choice. — Brian Herbert
I really appreciate that you read until the end. I hope you like this article and find it beneficial. Comment below if there is any topic you would like me to discuss!
Original blog post: https://towardsdatascience.com/how-web-crawling-benefit-data-science-a6ff0bd4cd1
About the Author
Low Wei Hong is a Data Scientist at Shopee. His experience centers on crawling websites, creating data pipelines, and implementing machine learning models to solve business problems.
He provides crawling services that deliver the accurate, cleaned data you need. You can visit his website to view his portfolio and to contact him for crawling services.