
Why Pyspider May Be One of the Best Scraping Dashboards for Beginners

Pyspider — A Practical Usage on Competitor Monitoring Metrics

If you know the enemy and know yourself, you need not fear the result of a hundred battles. — Sun Tzu

Recently, I built multiple crawlers for companies, and I started to find it hard to keep an eye on the crawlers' performance. So I searched online to see whether there is a Python package that not only makes it simpler to build a crawler but also has a built-in dashboard to keep track of crawler execution.


That is how I found this Python package — Pyspider, which is really useful, especially on the points below:


1. Easier to debug — the UI shows you which part went wrong.

2. Built-in dashboard — for monitoring purposes.

3. JavaScript support — with Scrapy you need to install scrapy-splash to render JavaScript websites, but Pyspider supports Puppeteer, a famous and powerful JavaScript library developed by Google for controlling a headless browser.

4. Database support — MySQL, MongoDB, and PostgreSQL.

5. Scalability — distributed architecture.


Monitoring Competitor Metrics

Now, I would like to show you a use case for Pyspider and walk through its cool features. If you would like a full introduction to Pyspider, please leave a comment below and I will write a post covering the framework and its key concepts.


Let’s get started!


Similarweb is an online competitive intelligence tool that provides traffic and marketing insights for any website. I will show you how to scrape it by leveraging the power of Puppeteer.

Install pyspider

pip install pyspider


Launch pyspider

As we will be scraping a JavaScript-rendered web page, we will use pyspider all instead of pyspider so that every component (including the fetcher that renders JavaScript) is started. Run it in your command prompt.

pyspider all

If you see the above message, it means that pyspider has launched successfully on your localhost at port 5000.


Browse pyspider

Open your browser and go to localhost:5000 (i.e. type http://localhost:5000/ in your URL bar), and you will see something similar to the picture below.

Then click the Create button (in the bottom left corner) to create our scraper.

Next, fill in the Project Name as Similarweb and click the Create button again.

After that, you will see the screen shown in the screenshot above, and we can get started.


Scripting

This is a simple script that I have built. I will split it into three parts to explain how it works.


Scripting — Part 1

Initialise the headers for the request, as shown in lines 8–15 of the gist above.
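The gist itself is not reproduced here, but the initialisation boils down to a plain dictionary assigned on the handler. A minimal sketch of what it might look like — the exact header values below are illustrative assumptions, not the original gist's:

```python
# A typical request-headers dictionary for a crawler handler.
# The exact values are assumptions for illustration, not the original gist's.
headers = {
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/80.0.3987.132 Safari/537.36"),
}
```

In the actual script, this dictionary would be stored as self.headers on the handler class so every crawl call can reuse it.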

@every(minutes=24 * 60) 
def on_start(self):
    self.crawl('https://www.similarweb.com/', 
                fetch_type = 'chrome',
                validate_cert = False,
                headers = self.headers,
                callback=self.index_page)

By default, pyspider executes the on_start function first. There are five arguments passed to the crawl function. You will also notice the every decorator: it means this function will be executed every 24 * 60 minutes, which is one day.


1. https://www.similarweb.com/ : the URL you want to scrape; in this example, we first crawl the main page.

2. fetch_type: set to chrome to tell Pyspider to use Puppeteer to render the JavaScript website.

3. validate_cert: set to False so that Pyspider skips validation of the server's certificate.

4. headers: use the headers we defined previously when requesting web pages.

5. callback: registers index_page as the next function to parse the response.


Scripting — Part 2

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    print(response.cookies)
    self.crawl('https://www.similarweb.com/website/google.com', 
                fetch_type = 'chrome',
                validate_cert = False,
                headers = self.headers,
                cookies = response.cookies,
                callback=self.index_page_1)

In our situation, the competitor we are interested in is google.com. That’s why the url is https://www.similarweb.com/website/google.com.


You will notice another decorator called config: it means pages crawled by this function are treated as valid for 10 days, so they will not be re-crawled within that period. The arguments to the crawl function are similar to the previous one, with only one difference:


1. cookies: we pass the cookies from the previous response as input to this request.


Scripting — Part 3

@config(age=10 * 24 * 60 * 60)
def index_page_1(self, response):
    param = response.doc('span.engagementInfo-param.engagementInfo-param--large.u-text-ellipsis').text()
    value = response.doc('span.engagementInfo-valueNumber.js-countValue').text()
    return {param: value.split()[0]}

This function just returns Total Visits as a dictionary. Pyspider uses PyQuery CSS selectors as its main path selector.
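To make the split()[0] step concrete, here is what the function's return value looks like given sample text. The two strings are hypothetical stand-ins for what the PyQuery selectors would extract from the page:

```python
# Hypothetical stand-ins for the text PyQuery would pull out of the page.
param_text = "Total Visits"   # from span.engagementInfo-param...
value_text = "60.49B "        # from span.engagementInfo-valueNumber...

# .split()[0] drops any whitespace-separated trailing text, keeping just the number.
result = {param_text: value_text.split()[0]}
print(result)  # → {'Total Visits': '60.49B'}
```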


Result

So copy the gist code and paste it into the right panel as shown. Then click the Save button in the top right corner (highlighted in the purple box) to save the script. After that, click the Run button (highlighted in the blue box) to run the code.

Click the follow button to follow the flow of the scraping process.

Then click the arrow button (highlighted in the purple box) to continue to the next step of the scraping journey.

Click the arrow button again.

Looking at the purple box area, this is the output of the scraping. There are 60.49 billion total visits to google.com.


Dashboard

This is an overview dashboard for your scraper. You can click the purple box (run button) to execute the crawler. Besides that, you can also save your result into a CSV or JSON file by clicking the red box (Results button).

In the purple box in the top right corner, you will find the download-format options I mentioned previously. Click the button and you will get the result in the format you need (i.e. JSON/CSV).


Final Thought

Pyspider is a really useful tool and it can scrape really fast, but if you are dealing with websites that implement anti-crawling mechanisms, I would suggest using Scrapy instead.


Thank you for reading to the end of the post; I really appreciate it.


I will be publishing content weekly, so feel free to leave comments below on topics you are interested in, and I will work hard to create that content for you.


About the Author

Low Wei Hong is a Data Scientist at Shopee. His experience involves crawling websites, creating data pipelines, and implementing machine learning models to solve business problems.


He provides crawling services that deliver the accurate, cleaned data you need. You can visit this website to view his portfolio and to contact him for crawling services.


You can connect with him on LinkedIn and Medium.



The Data Knight is a Data-as-a-Service (DaaS) provider that can crawl publicly available data accurately.


    © 2020 by The Data Knight