Web scraping is the process of extracting data from a website. Users can scrape manually to identify crucial information, but a better and faster way is to use bots to extract valuable content such as company contacts, stock prices, and product details. Web scraping can also capture the underlying HTML and database content, allowing the information to be replicated wherever the user needs it. Various tools make this process less time-consuming.
Websites contain data in many forms, so it is important to find the right web scraping tool for the job. These tools offer a variety of features and functionalities for practical data extraction, and users can export the extracted data as spreadsheets or through APIs. Web scrapers are useful in many fields: they help refine machine learning models, track news, aggregate financial data, and support various other applications.
Different web scrapers offer different features and suit different situations, so it is useful to distinguish four types of web scrapers when tackling a complex problem. Following is a brief description of these types:
Users with advanced programming skills can build self-made web scrapers with whatever features they need. Alternatively, various pre-built web scrapers are available for obtaining relevant data; these can offer scrape scheduling, export to Google Sheets or JSON, and many other features.
Web scrapers also differ in their user interfaces. Some run with only a minimal interface and a command line, while others provide a full user interface with various features from which users can choose. The latter type is beneficial for users with minimal technical knowledge.
Usually, web scrapers come in two forms: browser extensions and standalone software. Users can add browser extensions to their browsers just like ad-blocker or theme extensions. These extensions are easy to run and integrate, but being tied to the browser is also their drawback: features that must operate outside the browser, such as IP rotation, are impossible to implement.
The alternative is standalone web scraping software, which users can download and install on their personal computers. These scrapers provide advanced features beyond the limitations of browser extensions.
Local web scrapers run on the user's computer and consume its internet connection, CPU, RAM, and other resources, which can slow the machine down and halt other activities while data is being scraped. Moreover, scrapers working on long tasks or large lists of URLs can eat into ISP data caps.
Cloud-based web scrapers, on the other hand, do not engage the computer's resources. They run on an off-site server, perform their tasks without interrupting other user activities, and notify the user only when the data is ready for export. They can also provide advanced features such as IP rotation, which helps avoid being blocked by websites.
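To illustrate the idea behind IP rotation, here is a minimal Node.js sketch that routes each request through a different proxy using axios's proxy option. The proxy hosts and ports below are placeholders, not real servers; a cloud scraping service would manage a much larger, constantly refreshed pool.

const axios = require("axios");

// Placeholder proxy pool; a real service rotates through many IPs.
const proxies = [
  { host: "proxy1.example.com", port: 8080 },
  { host: "proxy2.example.com", port: 8080 },
];

async function fetchWithRotation(url) {
  // Pick a proxy at random so consecutive requests originate from different IPs.
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  const { data } = await axios.get(url, {
    proxy: { protocol: "http", host: proxy.host, port: proxy.port },
  });
  return data;
}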
Cheerio implements a subset of core jQuery, which makes it quick, flexible, and well suited to web scraping. Its syntax is familiar to anyone who knows jQuery, and it parses, manipulates, and renders markup efficiently. Moreover, Cheerio uses the parse5 parser and can also use FB55's htmlparser2, making it capable of parsing almost any HTML or XML document. These qualities make it one of the best tools for scraping websites and exporting valuable information.
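As a quick taste of that familiar syntax, the following snippet loads a small HTML string and queries it with jQuery-style selectors:

const cheerio = require("cheerio");

// Load an HTML string and query it with CSS selectors, jQuery-style.
const $ = cheerio.load('<ul id="fruits"><li class="apple">Apple</li><li class="pear">Pear</li></ul>');

console.log($(".apple").text()); // "Apple"
console.log($("li").length);     // 2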
Users can perform web scraping with Cheerio.js using any text editor, Node.js, and an understanding of the Document Object Model (DOM). The following walkthrough scrapes a permissible website, step by step: the ISO 3166-1 alpha-3 codes for countries and territories.
The first step is to create a working directory for the Cheerio.js project by running the following command in the terminal:
mkdir simple-cheerio
The above command creates a folder “simple-cheerio” for the project.
Next, navigate to the project directory, open it in a text editor, and initialize the project by running the following command:
npm init -y
This command creates a package.json file at the root of the project directory.
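The generated file should look roughly like this (the exact contents vary slightly between npm versions):

{
  "name": "simple-cheerio",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}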
The next step is to install dependencies by running the following command:
npm i axios cheerio
The above command adds two packages, "axios" and "cheerio", under the dependencies field. Axios fetches the markup from the website for Cheerio to parse. A third module, "fs", is used later to write files, but it ships with Node.js and does not need to be installed.
The next step is to investigate the HTML structure of the target page. This example uses the current codes section of the ISO 3166-1 alpha-3 page. In Chrome, press Ctrl+Shift+I (or F12) to open DevTools and inspect the page.
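The entries of interest sit inside lists with the class "plainlist", with each list item holding the three-letter code in a span and the country name in a link. Simplified, the markup has roughly this shape (the live page is more complex and may change):

<div class="plainlist">
  <ul>
    <li><span>ABW</span> <a href="/wiki/Aruba">Aruba</a></li>
    <li><span>AFG</span> <a href="/wiki/Afghanistan">Afghanistan</a></li>
    <!-- ...more entries... -->
  </ul>
</div>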
Next, create a file sample.js at the root of the directory by running the following command:
touch sample.js
The next step is to declare the variables at the top of sample.js:
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
Following is the complete code to scrape the data for the example mentioned earlier:
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

const url = "https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3";

async function scrapeData() {
  try {
    // Fetch the page markup
    const { data } = await axios.get(url);
    // Load the HTML into Cheerio
    const $ = cheerio.load(data);
    // Select every list item inside the "plainlist" lists
    const listItems = $(".plainlist ul li");
    const countries = [];
    listItems.each((idx, el) => {
      const country = { name: "", iso3: "" };
      country.name = $(el).children("a").text();
      country.iso3 = $(el).children("span").text();
      countries.push(country);
    });
    console.dir(countries);
    // Write the scraped data to a JSON file
    fs.writeFile("countries.json", JSON.stringify(countries, null, 2), (err) => {
      if (err) {
        console.error(err);
        return;
      }
      console.log("Successfully written data to file");
    });
  } catch (err) {
    console.error(err);
  }
}

scrapeData();
The code above loads the dependencies, sets the target URL, and defines the scrapeData() function, in which axios fetches the markup and Cheerio loads the resulting HTML. The selector ".plainlist ul li" picks every "li" element inside the lists with class "plainlist", and the each() method iterates over them, storing each country's name and code in an array. Finally, the program writes the scraped data to the file "countries.json". Users can run the script with the following command:
node sample.js
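If the selectors match the live page, countries.json should contain an array of entries along these lines (ABW is Aruba's alpha-3 code and AFG is Afghanistan's):

[
  {
    "name": "Aruba",
    "iso3": "ABW"
  },
  {
    "name": "Afghanistan",
    "iso3": "AFG"
  }
]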
Moreover, Cheerio supports various other operations, depending on the scraping task, as the sketch below shows.
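For example, beyond reading text, Cheerio can read attributes, test classes, and even modify and re-render markup:

const cheerio = require("cheerio");
const $ = cheerio.load('<div><a href="/home" class="nav">Home</a></div>');

console.log($("a").attr("href"));    // read an attribute: "/home"
console.log($("a").hasClass("nav")); // check for a class: true

$("a").text("Start");                // change the element's text
console.log($.html());               // render the updated markup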