Crawlab — The Ultimate Live Dashboard For Web Crawler

Monitor all your crawlers!




Recently I discovered a very interesting and powerful project. Although it launched only six months ago, in March 2019, it already has around 2.4k likes, and it looks promising for the following reasons.


1. It can monitor web crawlers written in different languages, for example Python, NodeJS, Go, Java, and PHP, as well as various web crawler frameworks including Scrapy, Puppeteer, and Selenium.


2. It includes a great-looking real-time dashboard.


3. It can visualize the crawled data, which can be downloaded with a single click.


4. You can create a crawler just by entering an XPath and a URL, the so-called “configurable crawler”. (Unfortunately, the latest version, v0.3.0, has temporarily disabled this feature; a reference can be found here, in Chinese.)


Pyspider vs Crawlab


Previously I wrote about Pyspider as one of the greatest tools for monitoring scrapers. If you haven’t read that post, you can read it by clicking the link here.


The two do share some similarities: both are great dashboards for crawlers, both support scheduling, and both visualize results impressively. If you want to know the significant differences, here you go:


1. Pyspider is better in terms of visualizing the journey of crawling websites.

2. Crawlab is better if you want to integrate different languages or web crawler frameworks.

3. Crawlab is written in Golang, which is generally more efficient and faster than Python, the language Pyspider is written in.


Case study — Integrate Scrapy spider to Crawlab


Part 1 — Install Crawlab


Prerequisite — Install Docker on your laptop.

version: '3.3'
services:
  master: 
    image: tikazyq/crawlab:latest
    container_name: master
    environment:
      CRAWLAB_API_ADDRESS: "localhost:8000"
      CRAWLAB_SERVER_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    ports:    
      - "8080:8080" # frontend
      - "8000:8000" # backend
    depends_on:
      - mongo
      - redis
  mongo:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
  redis:
    image: redis:latest
    restart: always
    ports:
      - "6379:6379"

Copy the above code and save it as docker-compose.yml. Then, in the same directory, run the command docker-compose up in your terminal. The Docker images will be downloaded to your local machine.
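Once the containers are up, it can help to confirm that the published ports actually accept connections before opening the browser. The following is a minimal sketch using only the Python standard library; the helper names is_port_open and check_crawlab_ports are my own and not part of Crawlab, and the port numbers are simply the ones published in the compose file above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def check_crawlab_ports(host: str = "localhost") -> dict:
    """Check each port published in docker-compose.yml (hypothetical helper)."""
    ports = {"frontend": 8080, "backend": 8000, "mongo": 27017, "redis": 6379}
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling check_crawlab_ports() returns a dictionary mapping each service name to True or False. A False for the frontend usually just means the containers are still starting.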


Part 2 — Launch Crawlab and Log In


Navigate to localhost:8080 in your browser and you will see the login page as shown below.



Default username: admin

Default password: admin


Part 3 — Upload Scrapy Project


Go to this URL and then click the Add Spider button as shown in the screenshot below.


I am using my crawler for the Gadgets Now website.



The snapshot above shows my Scrapy spider directory. Go one level down to the directory that contains scrapy.cfg (highlighted in a red box), then zip the 3 items. Lastly, upload the zip file.


Part 4 — Obtain the IP address of the MongoDB


Retrieve the ID of the Docker container running this image: mongo:latest. You can view the container ID by using the command below.

docker ps

Then pass the container ID to the command below:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <input your docker id here>

Next, you will get the IP address of your MongoDB in the docker container. In my case, the IP address is 172.18.0.2.
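If you prefer to do the same lookup from Python, the JSON that docker inspect prints without the -f template can be parsed directly. This is only a sketch: the function name container_ips is my own, and the sample string is a trimmed, made-up document in the same shape as real docker inspect output (the network name crawlab_default is an assumption).

```python
import json


def container_ips(inspect_output: str) -> list:
    """Collect every network IP address from `docker inspect` JSON output."""
    ips = []
    # `docker inspect` prints a JSON array with one object per container
    for container in json.loads(inspect_output):
        networks = container.get("NetworkSettings", {}).get("Networks") or {}
        for network in networks.values():
            ip = network.get("IPAddress")
            if ip:
                ips.append(ip)
    return ips


# Trimmed, made-up sample in the shape that `docker inspect` prints
sample = """
[{"Id": "abc123",
  "NetworkSettings": {"Networks": {"crawlab_default": {"IPAddress": "172.18.0.2"}}}}]
"""
print(container_ips(sample))
```

To feed it real output, you could capture the command with subprocess.run(["docker", "inspect", container_id], capture_output=True, text=True, check=True).stdout and pass that string in.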


Part 5 — Input the IP address and modify pipelines.py


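import os
from pymongo import MongoClient

MONGO_HOST = '172.18.0.2'  # IP address obtained in Part 4
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

class GadgetsnowPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)
    db = mongo[MONGO_DB]
    # Collection name is read from the CRAWLAB_COLLECTION environment variable
    col_name = os.environ.get('CRAWLAB_COLLECTION')
    if not col_name:
        col_name = 'test'
    col = db[col_name]

    def process_item(self, item, spider):
        # Tag each item with the ID of the task that produced it
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')
        # Collection.save() was removed in PyMongo 4; there, use insert_one(dict(item))
        self.col.save(item)
        return item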

This is the modified Python script for pipelines.py. Here are some points that I would like to highlight:

1. Input the MongoDB IP address obtained previously: MONGO_HOST = '172.18.0.2'.

2. Copy the process_item function above and use it to replace the one in your original pipelines.py file.

3. The value of MONGO_DB can be any database name you like in MongoDB; in my case I set it to crawlab_test.
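To see what the two environment variables do without running MongoDB or Scrapy, the pipeline's logic can be reduced to two small functions. This is an illustrative sketch, not Crawlab code: resolve_collection and tag_item are hypothetical helper names, the values in fake_env are invented, and unlike the pipeline above the sketch copies the item instead of modifying it in place.

```python
import os


def resolve_collection(environ=os.environ, default="test"):
    """Return the collection name from CRAWLAB_COLLECTION, or a fallback."""
    return environ.get("CRAWLAB_COLLECTION") or default


def tag_item(item, environ=os.environ):
    """Return a copy of the item with the Crawlab task ID attached."""
    tagged = dict(item)
    tagged["task_id"] = environ.get("CRAWLAB_TASK_ID")
    return tagged


# Invented values standing in for what Crawlab would set for a task
fake_env = {"CRAWLAB_COLLECTION": "results_gdgnow", "CRAWLAB_TASK_ID": "task-001"}
print(resolve_collection(fake_env))
print(tag_item({"title": "Some phone"}, fake_env))
print(resolve_collection({}))  # falls back to 'test' when the variable is unset
```

When the spider is run outside Crawlab, neither variable is set, so items land in the test collection with task_id set to None.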


Part 6 — Add two new fields in items.py


task_id = scrapy.Field()
_id = scrapy.Field()

These two fields must be added to items.py.

Firstly, task_id is the identifier of each task you have executed. You can view it in spider -> spider_name -> any of the tasks -> overview tab.

Secondly, _id is the unique identifier for each object in your MongoDB.


Part 7 — Run your spider



Click into your newly uploaded spider, then input the execute command. Since the name of my Scrapy crawler is gdgnow, my command would be:

scrapy crawl gdgnow

Then, click the Save button first, followed by the Run button, to start scraping.


Part 8 — Visualize the result



Using my crawler as an example, the snapshot above shows its output. Most importantly, you can download the data in CSV format just by clicking the Download CSV button.


The item_desc field shows undefined because my item_desc is in JSON format, and Crawlab does not support displaying JSON fields yet. If you would like to have the JSON field in your output data, so far the only option is to log in to the Docker container running the MongoDB instance that Crawlab streams its data to and extract the data from there.
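One way to do that extraction is a short PyMongo script run from the host, since the compose file publishes MongoDB's port 27017. The sketch below is an assumption-laden example rather than a Crawlab feature: docs_to_json and export_task are hypothetical helper names, and the default host, database, and collection are just the values used earlier in this post.

```python
import json


def docs_to_json(docs):
    """Serialize MongoDB documents to a JSON string.

    default=str turns values json cannot encode (such as ObjectId) into strings.
    """
    return json.dumps(list(docs), default=str, indent=2, ensure_ascii=False)


def export_task(task_id, host="172.18.0.2", port=27017,
                db="crawlab_test", collection="test"):
    """Fetch every item saved for one task and return them as JSON text."""
    # Imported lazily so docs_to_json can be used without PyMongo installed
    from pymongo import MongoClient

    client = MongoClient(host=host, port=port)
    try:
        cursor = client[db][collection].find({"task_id": task_id})
        return docs_to_json(cursor)
    finally:
        client.close()
```

Nested fields such as item_desc come through intact because the documents are serialized as they are stored, without being flattened into table columns first.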


The display is pretty amazing, kudos to all the developers in Crawlab!


Final Thoughts



Thank you so much for your patience in reading until the end. Crawlab is still at an early stage, but it is a very promising framework for crawlers, especially for monitoring multiple web crawlers.


Since this is just a brief introduction to Crawlab, I haven’t covered all of its functions, for instance cron jobs or how to integrate other web crawler frameworks. If you want me to share more about Crawlab, comment below and I will create another post for that!


This post originally appeared at this link: https://towardsdatascience.com/crawlab-the-ultimate-live-dashboard-for-web-crawler-6c2d55c18509


About the Author

Low Wei Hong is a Data Scientist at Shopee. His experience is mainly in crawling websites, creating data pipelines, and implementing machine learning models to solve business problems.

He provides crawling services that deliver the accurate, cleaned data you need. You can visit this website to view his portfolio and to contact him for crawling services.

You can connect with him on LinkedIn and Medium.



The Data Knight is a Data-as-a-Service (DaaS) provider that can crawl publicly available data accurately.


    © 2020 by The Data Knight