Web Scraping using Pure JavaScript - Two different methods


Web scraping (also known as screen scraping) is the process of extracting data, such as HTML markup, from websites or web pages. The extracted data is often used for analysis, or simply displayed on other web pages. In this article I'll show you two different web scraping methods using pure JavaScript.

➡️ Before scraping a website, please make sure that the site allows web scraping. Don't do it otherwise. Do it ethically.


1) First Method using XMLHttpRequest Object

In this method, I am using the XMLHttpRequest object, one of the key features of Ajax.

<body>
  <div id='container'></div>
</body>

<script>
  const url = 'https://www.encodedna.com/excel/formula-to-get-the-sum-of-unique-values.htm';

  const web_scrape = () => {
    const oXHR = new XMLHttpRequest();

    // Parse the fetched HTML and show the element with the class name Quote.
    const searchResult = (txt) => {
      const blogPage = new DOMParser().parseFromString(txt, 'text/html');

      // extract contents from the element with class name Quote.
      const el = blogPage.getElementsByClassName('Quote');
      if (el.length > 0) {
        document.getElementById('container').appendChild(el[0]); // show the content.
      }
    };

    function reportStatus() {
      // readyState 4 = request finished; status 200 = request succeeded.
      if (oXHR.readyState === 4 && oXHR.status === 200) {
        searchResult(oXHR.responseText);
      }
    }

    oXHR.onreadystatechange = reportStatus;
    oXHR.open("GET", url, true);  // true = asynchronous request, false = synchronous request.
    oXHR.send();
  }

  web_scrape();
</script>

First, I have defined the URL (the address) of the web page that the script will scrape.

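Note: You can use a different URL. All you have to do is open the web page and inspect the element whose content you want to extract. Keep in mind that the browser lets the script read the response only when the page is on the same origin as your script, or when the server permits cross-origin requests (CORS).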

The script sends a "GET" request to the remote web page (oXHR.open("GET", url, true);). If the request is successful, it calls searchResult() (a user-defined function) and passes it the response text, that is, the entire web page with its HTML elements and their contents.
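The same request flow can also be wrapped in a Promise, which makes the success and failure paths explicit. This is only a minimal sketch: getText is a hypothetical helper that does not appear in the script above, and its XHR parameter is an assumption added so the sketch can be tried outside a browser with a stand-in class.

```javascript
// Hypothetical helper: wraps a GET request in a Promise.
// XHR defaults to the browser's XMLHttpRequest; the parameter exists only so
// the sketch can be exercised outside a browser with a stand-in class.
const getText = (url, XHR = globalThis.XMLHttpRequest) =>
  new Promise((resolve, reject) => {
    const request = new XHR();
    request.open('GET', url, true);

    // load fires when the response has arrived, whatever its status code.
    request.onload = () => {
      if (request.status >= 200 && request.status < 300) {
        resolve(request.responseText);
      } else {
        reject(new Error('Request failed with status ' + request.status));
      }
    };

    // error fires on network-level failures.
    request.onerror = () => reject(new Error('Network error'));

    request.send();
  });
```

In a browser, getText(url).then((txt) => { /* parse txt here */ }) would hand the response text to whatever parsing code you use.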

➡️ Learn more about the XMLHttpRequest object and its properties.

Inside the searchResult() function, we'll extract the contents of the <span> element with the class name "Quote". There's only one element with this class name in the given web page.

The parseFromString() method of the DOMParser interface parses an XML or HTML source string into a document. In the above script, I am parsing the HTML source of the web page. Once the parsing is complete, the script extracts the <span> element and appends it to the container.
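To try the parsing step on its own, without any network request, something like the following might help. It is a rough sketch: firstTextByClass is a hypothetical helper, and its Parser parameter is an assumption added so the function can be exercised outside a browser; in a browser it simply defaults to DOMParser.

```javascript
// Hypothetical helper (not part of the article's script).
// Parser defaults to the browser's DOMParser; the parameter exists only so the
// sketch can be exercised outside a browser with a stand-in class.
const firstTextByClass = (html, className, Parser = globalThis.DOMParser) => {
  // Turn the HTML string into a document that can be queried like a page.
  const doc = new Parser().parseFromString(html, 'text/html');

  // getElementsByClassName returns an empty collection when nothing matches.
  const matches = doc.getElementsByClassName(className);
  return matches.length > 0 ? matches[0].textContent.trim() : null;
};
```

In a browser console, firstTextByClass('<span class="Quote"> Hello </span>', 'Quote') should return 'Hello', and a class name that is not present should give null.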

Web Scraping images using JavaScript

Let's take this to the next level. Web scraping in JavaScript is not limited to text content. You can scrape images too.

I'll use a method similar to the one shown above.

The web page used in the above example also has an image. So, let's see how we can scrape the <span> element and the first available image.

<body>
  <div id='container'></div>
</body>

<script>
  const url = 'https://www.encodedna.com/excel/formula-to-get-the-sum-of-unique-values.htm';

  const web_scrape = () => {
    const oXHR = new XMLHttpRequest();

    const searchResult = (txt) => {
      const blogPage = new DOMParser().parseFromString(txt, 'text/html');
      const container = document.getElementById('container');

      // element with class Quote for text content.
      const el = blogPage.getElementsByClassName('Quote');
      if (el.length > 0) {
        container.appendChild(el[0]);
      }

      // element with class "imagecontainer" for image.
      const article_image = blogPage.getElementsByClassName('imagecontainer');
      if (article_image.length > 0) {
        // Look for the first <img> inside the element; childNodes[0] may be a text node.
        const source_img = article_image[0].querySelector('img');
        if (source_img && source_img.src) {
          const img = new Image();
          img.src = source_img.src;

          container.appendChild(img);  // show the image.
        }
      }
    };

    function reportStatus() {
      // readyState 4 = request finished; status 200 = request succeeded.
      if (oXHR.readyState === 4 && oXHR.status === 200) {
        searchResult(oXHR.responseText);
      }
    }

    oXHR.onreadystatechange = reportStatus;
    oXHR.open("GET", url, true);  // true = asynchronous request, false = synchronous request.
    oXHR.send();
  }

  web_scrape();
</script>

In the web page used in the above example, images are placed inside an HTML element with the class name imagecontainer.

Since we have parsed the entire web page, all of its HTML DOM elements are available to us, and <img> is one of them. The script does not loop through every image. It takes the first element with the class name imagecontainer and shows a single image from it.
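One detail worth checking when copying an image address out of a parsed page: if the src attribute is a relative path, the browser may resolve it against the page running the script instead of the scraped page. A small sketch of resolving it explicitly with the standard URL constructor follows. resolveImageUrl is a hypothetical helper, and the image paths are made up for illustration.

```javascript
// Hypothetical helper: turn the raw src attribute of a scraped <img> into an
// absolute URL, using the scraped page's address as the base.
const resolveImageUrl = (rawSrc, pageUrl) => new URL(rawSrc, pageUrl).href;

// Assumed example values; the image paths are made up for illustration.
const pageUrl = 'https://www.encodedna.com/excel/formula-to-get-the-sum-of-unique-values.htm';

console.log(resolveImageUrl('/images/sample.png', pageUrl));
// → https://www.encodedna.com/images/sample.png

console.log(resolveImageUrl('sample.png', pageUrl));
// → https://www.encodedna.com/excel/sample.png
```

In the scraping script, the raw value would come from calling getAttribute('src') on the scraped <img> element, and the result could then be assigned to the src of the new image.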

2) Second Method using async/await function

In the second method, I am using fetch() with async and await for web scraping. This method is cleaner and more concise than the one used in the above examples.

<body>
  <div id='container'></div>
</body>

<script>
  const url = 'https://www.encodedna.com/excel/formula-to-get-the-sum-of-unique-values.htm';
  
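  const web_scrape = async () => {
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error('Request failed with status ' + response.status);
    }
    const ar = await response.text();

    const blogPage = new DOMParser().parseFromString(ar, 'text/html');
    const container = document.getElementById('container');

    // element with class Quote for text content.
    const el = blogPage.getElementsByClassName('Quote');
    if (el.length > 0) {
      container.appendChild(el[0]);
    }

    // element with class "imagecontainer" for image.
    const article_image = blogPage.getElementsByClassName('imagecontainer');
    if (article_image.length > 0) {
      // Look for the first <img> inside the element; childNodes[0] may be a text node.
      const source_img = article_image[0].querySelector('img');
      if (source_img && source_img.src) {
        const img = new Image();
        img.src = source_img.src;

        container.appendChild(img);
      }
    }
  };

  // Log a failed request or parsing error instead of leaving the rejection unhandled.
  web_scrape().catch((err) => console.error(err));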
</script>

The result is the same as with the first method.

Conclusion

We explored two different ways to scrape a web page: using XMLHttpRequest and using async/await.

Both XMLHttpRequest and fetch() (used here with async/await) make asynchronous requests.

XMLHttpRequest is an older way to make HTTP requests to a server. The second method uses fetch(), a more modern way of making asynchronous HTTP requests, together with async/await, which is built on top of Promises.
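To illustrate how async/await relates to Promises, here is a small sketch that writes the same two steps in both styles. fakeFetch is a hypothetical stand-in for fetch() so that the sketch runs without a network; loadWithThen and loadWithAwait are also made-up names.

```javascript
// Hypothetical stand-in for fetch(), so the sketch runs without a network.
const fakeFetch = (url) =>
  Promise.resolve({
    ok: true,
    text: () => Promise.resolve('<p>' + url + '</p>'),
  });

// Promise chaining style: each step is passed to then().
const loadWithThen = (url) =>
  fakeFetch(url).then((response) => response.text());

// async/await style: the same two steps, written top to bottom.
const loadWithAwait = async (url) => {
  const response = await fakeFetch(url);
  return response.text();
};
```

Both functions return a Promise that resolves to the same text, so callers can use either one with then() or with await.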

➡️ Learn more about async, await and Promises.
