How to get the latest COVID-19 News using Google News Feed
Retrieve live data through RSS feed
Due to the current stay-at-home policy, I am working from home. While browsing the internet for COVID-19 related news articles, I became curious about how some websites are able to fetch the latest news and articles from so many different sources.
My first thought was: fantastic! Do they build a separate crawler for each website to get the latest news? That would take a lot of work.
Or maybe there is an API that retrieves the news for free? However, I found that these API providers only track certain websites.
So I wondered whether there is an easier way to get the latest Google News. Since Google's bots crawl most websites, it seemed like a good place to start. On the other hand, Google is known for being efficient at restricting bots.
After hours of research, I figured out that there is a way to retrieve the latest Google News: scraping the Google News RSS feeds.
Are you interested in building a website that displays the latest news? Or maybe you are trying to integrate live news into your website? Or perhaps you are just curious about how to retrieve the news?
This article is for you.
What is RSS?
RSS stands for “Rich Site Summary” or “Really Simple Syndication”. It is a format for delivering regularly changing web content, and it allows a user to keep track of many different websites in a single news aggregator. Hence, many news-related sites, weblogs, and other online publishers syndicate their content as an RSS feed to whoever wants it. Now that you have a better understanding of what RSS is, let us start with the scraping!
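Before moving on, it may help to see what a feed looks like in practice. The sketch below is only an illustration: the feed content is made up (it is not real Google News output), and the helper name list_items is hypothetical. It shows that an RSS 2.0 feed is an XML document with a channel element containing item elements, each carrying fields such as title, link, pubDate and description.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document used purely for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <description>A hypothetical feed</description>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 06 Apr 2020 08:00:00 GMT</pubDate>
      <description>Summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <pubDate>Mon, 06 Apr 2020 09:30:00 GMT</pubDate>
      <description>Summary of the second article.</description>
    </item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return one dict per <item> element found in the feed."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


for entry in list_items(SAMPLE_RSS):
    print(entry["published"], "|", entry["title"])
```

A news aggregator does essentially this for many feeds at once, which is why a single RSS endpoint can stand in for a collection of site-specific crawlers.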
What is the exact URL we need to scrape?
BASE URL: http://news.google.com/news?
Here is the list of parameters that I think are useful for retrieving English-language COVID-19 news. If you are interested in finding out more, see Google's official XML API documentation.
q: The query term to search for. In this case, it is COVID-19.
hl: The host language of the user interface. I prefer to use en-US.
sort: This parameter is optional. I want to sort the news by date, so the value is date.
gl: Boosts search results whose country of origin matches the parameter value. I use US, as this is the default value for my web browser.
num: The number of news items you would like to get. I will go for the maximum, which is 100.
output: The format of the desired output. I will go for RSS.
Here is the final string which I will be sending requests to.
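As a rough sketch, the request URL can be assembled from the base URL and the parameters listed above. The function name build_feed_url and its keyword names are hypothetical, and the lowercase spelling of the output value (rss) is an assumption; the values themselves come from the parameter list. The sketch only builds the string and does not check whether the endpoint honours each parameter.

```python
from urllib.parse import urlencode

# Base URL as given in this article.
BASE_URL = "http://news.google.com/news?"


def build_feed_url(query, language="en-US", sort="date",
                   country="US", num=100, output="rss"):
    """Join the base URL with the URL-encoded query parameters."""
    params = {
        "q": query,        # query term
        "hl": language,    # host language of the user interface
        "sort": sort,      # optional: sort the results by date
        "gl": country,     # boost results from this country
        "num": num,        # number of news items requested
        "output": output,  # desired output format (assumed spelling: "rss")
    }
    return BASE_URL + urlencode(params)


print(build_feed_url("COVID-19"))
# http://news.google.com/news?q=COVID-19&hl=en-US&sort=date&gl=US&num=100&output=rss
```

Using urlencode rather than manual string concatenation means a query containing spaces or special characters is escaped correctly.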
Congratulations, you are now left with the code part.
You will notice that I create a class called ParseFeed with two methods, clean and parse. The clean method extracts all the text from the HTML document and replaces the character \xa0 with a space character.
The parse method parses the HTML and prints the fields that I think are important: the title, description, published date, and URL of each news item.
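The class described above could look roughly like the sketch below. This is not the original code: it keeps the ParseFeed name and the clean and parse methods, but the choice of standard-library modules (urllib.request, xml.etree.ElementTree, html.parser), the fetch helper, the optional xml_data argument and the returned list are all assumptions made so that the example is self-contained.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class ParseFeed:
    def __init__(self, url):
        self.feed_url = url

    def clean(self, html):
        # Strip the HTML tags, then replace non-breaking spaces with plain spaces.
        extractor = _TextExtractor()
        extractor.feed(html)
        extractor.close()
        return "".join(extractor.chunks).replace("\xa0", " ")

    def fetch(self):
        # Download the raw feed bytes (assumed helper, not named in the article).
        with urllib.request.urlopen(self.feed_url, timeout=10) as response:
            return response.read()

    def parse(self, xml_data=None):
        # Use the supplied XML if given, otherwise download it from the feed URL.
        if xml_data is None:
            xml_data = self.fetch()
        root = ET.fromstring(xml_data)
        entries = []
        for item in root.iter("item"):
            entry = {
                "title": item.findtext("title", default=""),
                "description": self.clean(item.findtext("description", default="")),
                "published": item.findtext("pubDate", default=""),
                "url": item.findtext("link", default=""),
            }
            print("Title:", entry["title"])
            print("Description:", entry["description"])
            print("Published date:", entry["published"])
            print("URL:", entry["url"])
            print("-" * 60)
            entries.append(entry)
        return entries
```

Calling ParseFeed(url).parse() with no argument downloads the feed from url, while passing a string or bytes of RSS to parse() lets you try the class offline. Returning the entries in addition to printing them is a small design choice that makes the result easier to reuse, for example when rendering the news on a web page.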
Here is a snapshot of the output after running the code above on the Jupyter Notebook.
As you can see, retrieving the news through RSS is much more convenient. If you are interested in tracking other kinds of news, you can just tweak the parameters to get what you want. However, do note that some of the parameters have limits.
Thank you so much for reading until the end. If there are any topics you would like me to discuss, feel free to comment below.
Stay home, stay safe, everyone!
About the Author
Low Wei Hong is a Data Scientist at Shopee. His experience is mostly in crawling websites, creating data pipelines, and implementing machine learning models to solve business problems.
He provides crawling services that deliver the accurate, cleaned data you need. You can visit this website to view his portfolio and to contact him for crawling services.