In 2018, Google released version 1.0.0 of Puppeteer, which has since enabled developers to do all sorts of things: testing their user interfaces end to end without ever opening a browser, taking screenshots of websites on a schedule, and more. Almost anything you can think of that can be done with a browser can be automated using Puppeteer.

Puppeteer provides a great set of functionality for performing a large number of tasks. However, in this article we're going to talk to Chrome directly using the exposed WebSocket, which lets us communicate with Chrome via the DevTools Protocol.

The full list of API methods available for Puppeteer is listed here.

To not leave them unmentioned, there are many tools available for performing tasks in an automated way, for example Selenium, Cypress, PhantomJS, and many others. However, Puppeteer was developed specifically for Node.js to control Chrome/Chromium over their DevTools Protocol, allowing for both headless and full UI mode. This is ideal since most people are already familiar with both Chrome and its DevTools.

One form of testing that is being seen more and more is utilising Google's Lighthouse tool to ensure the quality of a web page is up to scratch. It's a great tool, yet it would be even greater if we could automate this testing as well.

Luckily Google also provides a Node.js CLI tool to do this testing for us, but let's take it one step further. We can utilise this tool as a Node.js module, and by invoking it with the WebSocket endpoint from Puppeteer we can programmatically run Lighthouse on whatever web page we like by just providing the URL of the page.

To do this we're going to rely on two npm modules: puppeteer and lighthouse. Additionally, we're going to create a little service that can be used to trigger the jobs, so we're going to use express as well, and we will need a way to store our report results, for which we will use levelup and leveldown. We will also install esm so we can use ES module syntax within our Node.js code.

For this article I am assuming you have both Node.js and the npm CLI installed, and are familiar with the command line.

I am running these examples in the directory ~/src/lighthouse-playground on macOS 10.14 with Node.js 11.13.0 and npm CLI version 6.7.0.

To install these modules run:

npm init # Create your package.json file
npm install --save puppeteer lighthouse express levelup leveldown
npm install --save-dev esm

This will also download Chromium ready for use by puppeteer.

We're going to need a few files as well: one for our service, another for our Lighthouse code, another for scheduling and running reports, and one to be served to the browser. We're also going to make a playground.js file that we'll use to walk through the functionality.

touch lighthouse-util.js
touch service.js
touch reports.js
touch index.html
touch playground.js

In our lighthouse-util.js we're going to create a function that returns a new browser instance. We want a fresh browser instance each time so we know we aren't messing up separate Lighthouse reports.

import puppeteer from "puppeteer";

export function createBrowser() {
  return puppeteer.launch({
    args: ["--show-paint-rects"] // Required by lighthouse
  });
}

After our createBrowser function we want to create a function that creates our lighthouse report, given a browser instance, url, and options:

import lighthouse from "lighthouse"; // This should be at the top of the file

export function createReportWithBrowser(browser, url, options = { output: "html" }) {
  const endpoint = browser.wsEndpoint(); // Allows us to talk via DevTools protocol
  const endpointURL = new URL(endpoint); // Lighthouse only cares about the port, so we have to parse the URL so we can grab the port to talk to Chrome on
  return lighthouse(
    url,
    Object.assign({}, {
      port: endpointURL.port
    }, options) // Allow options to override anything here
  );
}
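Since Lighthouse only needs the port of the DevTools WebSocket endpoint, we can sanity-check that parsing step on its own. The endpoint below is a made-up example of the shape wsEndpoint() returns; the URL class (global in browsers and in recent Node.js) does the parsing:

```javascript
// A made-up example of what browser.wsEndpoint() returns
const endpoint = "ws://127.0.0.1:53147/devtools/browser/4a1b2c3d";

// `new URL(...)` parses the endpoint; note that `.port` is always a string
const { port } = new URL(endpoint);
console.log(port); // "53147"
```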

Now with these two functions we're ready to start creating reports. To test this, let's run a bit of code in our playground.js file:

import { createBrowser, createReportWithBrowser } from "./lighthouse-util.js";
import fs from "fs";
import Assert from "assert";

// IIFE (https://developer.mozilla.org/en-US/docs/Glossary/IIFE) so that we can use async in the top level
(async () => {
  
  const browser = await createBrowser();
  
  const result = await createReportWithBrowser(
    browser,
    "https://example.com",
    {
        output: "html"  
    }
  );

  Assert(result.report, "No report returned");

  fs.writeFileSync("report.html", result.report, "utf-8");
  
  await browser.close();
})()
  // Catch anything that went wrong!
  .catch(console.error)
  .then(() => {
    console.log("Finished!");
  });
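Note the ordering of .catch and .then at the end: because the catch comes first, Finished! is logged even when something went wrong. A minimal sketch of that behaviour, collecting the messages into an array so we can see the order:

```javascript
// The same catch-then ordering as the playground above: the catch runs first,
// so "Finished!" is still logged after a failure
const logs = [];
(async () => {
  throw new Error("Something broke");
})()
  .catch(error => logs.push(`Caught: ${error.message}`))
  .then(() => {
    logs.push("Finished!");
    console.log(logs); // ["Caught: Something broke", "Finished!"]
  });
```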

To run this we're going to use:

node -r esm playground.js

After running, we should see Finished! in the console, and a new report.html file saved in our project's directory. If we open this up in our browser we will be able to see the generated Lighthouse report for https://example.com!

Now that we know that's all working, let's build a way to create jobs that run Lighthouse reports one at a time. On a small scale this is an okay solution that allows each site to get the best report it can; on a larger scale, however, it would be wise to look into running parallel reports across multiple service instances (ideally on different machines/nodes).

Next we're going to add a function that grabs a task off our "queue" and creates the resulting report, which will then be stored ready for the client to retrieve.
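Each item in our queue will look something like the following (the id value here is hypothetical; the real identifier is generated with uuid later in this article):

```javascript
// Hypothetical queue item; the real id is a uuid prefixed with "report:"
const item = {
  id: "report:2ee86b91-2ea3-4228-a912-3371f27315ba",
  url: "https://example.com",
  options: { output: "html" }
};
console.log(item.id.startsWith("report:")); // true
```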


In our reports.js file we're going to create a createReportStore function, which will create an instance of our store:

import levelup from "levelup";
import leveldown from "leveldown";

export function createReportStore() {
  const database = leveldown("./store");
  return levelup(database);
}

We're also going to utilise an additional npm module, level-jobs, to schedule tasks to complete.

npm install --save level-jobs

level-jobs uses a worker function that is invoked when there is new work to complete. When the function is invoked it will be passed three arguments: an id, the payload, and a callback to report that the work is complete. We're going to make a function that binds the worker to the store, so we can still access the store while running the worker. We're also going to ignore the id parameter, since we have our own identifier that is included in the payload.

We're also going to use an async function, meaning we will need to wait for the returned promise to be fulfilled or rejected before calling the callback.

Inside our doReportWork function we want to create the report associated with the payload:

// These should be at the top of the file
import assert from "assert";
import { createBrowser, createReportWithBrowser } from "./lighthouse-util.js";

async function doReportWork(store, payload) {
  assert(payload.id, "Expected payload to have an id");
  assert(payload.url, "Expected payload to have a url");

  const browser = await createBrowser();

  const result = await createReportWithBrowser(
    browser,
    payload.url,
    payload.options || { output: "html" }
  );

  await browser.close();

  // Save our result ready to be retrieved by the client
  console.log(`Saving report for ${payload.id}`);

  const document = Object.assign({}, payload, {
    result
  });

  await store.put(payload.id, JSON.stringify(document));
}

function createReportWorker(store) {
  return (unused, payload, callback) => {
    doReportWork(store, payload)
      .then(
        () => callback(),
        error => callback(error)
      );
  };
}

Next we need a function to create the queue with a reference to our store and worker:

import Jobs from "level-jobs"; // This should be at the top of the file

export function createReportQueue(store) {
  const options = {
    maxConcurrency: 1
  };
  return Jobs(store, createReportWorker(store), options);
}

Next we're going to need a function that saves our requested reports and returns an identifier that can be used to reference the report. We're going to use the npm module uuid to generate our report's identifier, and prefix the UUID value with report: so we can namespace the document within our database, in case we wanted to use the database for anything else:

npm install --save uuid

import uuid from "uuid"; // This should be at the top of the file

export async function requestGenerateReport(store, queue, url, options = { output: "html" }) {
  const id = `report:${uuid.v4()}`;
  // Notice the use of JSON.stringify, levelup will accept Buffers or strings, so we want
  // to use JSON for our value
  const document = {
    id,
    url,
    options
  };
  await store.put(id, JSON.stringify(document));
  await new Promise(
    (resolve, reject) => queue.push(document, error => error ? reject(error) : resolve())
  );
  return id;
}
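level-jobs' push() takes a Node-style callback, which is why we wrap it in a promise above. The pattern can be sketched in isolation with a stand-in queue (fakeQueue and schedule are illustrative names here, not part of level-jobs):

```javascript
// A stand-in queue whose push() takes a Node-style callback, like level-jobs
const fakeQueue = {
  push(document, callback) {
    setImmediate(() => callback(null)); // pretend the job was accepted
  }
};

// The same callback-to-promise bridge used in requestGenerateReport
async function schedule(queue, document) {
  await new Promise(
    (resolve, reject) => queue.push(document, error => error ? reject(error) : resolve())
  );
  return document.id;
}

schedule(fakeQueue, { id: "report:example" })
  .then(id => console.log(id)); // logs "report:example"
```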

Now whenever we invoke requestGenerateReport we will schedule the report to be generated.

All we need now is a way to actually schedule our reports to be generated. For this we're going to create a nice little server that serves the index.html file and accepts a JSON array POST'd to /report.

In service.js:

import express from "express";
import { createReportQueue, createReportStore, requestGenerateReport } from "./reports.js";
import assert from "assert";

const app = express();

// Allows us to grab these within our request handlers
app.locals.store = createReportStore();
app.locals.queue = createReportQueue(app.locals.store);

app.post("/report", express.json(), (request, response, next) => {
  if (!Array.isArray(request.body)) {
    return response.sendStatus(400); // Bad request, we expected an array
  }
  Promise.all(
    request.body.map(async ({ url, options }) => {
      assert(typeof url === "string", "Expected url to be provided");
      // We want to return an array for each item so we can match up the url
      return [
        url,
        await requestGenerateReport(
          request.app.locals.store,
          request.app.locals.queue,
          url,
          options
        )
      ];
    })
  )
    .then(identifiers => response.send(identifiers))
    // Catch any errors and allow express to handle it
    .catch(next);
});

We will also want a way to check up on scheduled reports, so we will need a GET /report/:id route as well. This will accept an identifier, which will be used to look up the report in our store:

app.get("/report/:id", (request, response, next) => {
  request.app.locals.store
    .get(request.params.id)
    .then(report => response.send(report))
    .catch(error => {
      // levelup rejects with a NotFoundError when the key is missing
      if (error.notFound) {
        return response.sendStatus(404); // We couldn't find it
      }
      // Allow express to handle any other errors
      next(error);
    });
});

We also want a way to clear our store, so we will allow users to invoke DELETE to remove old reports:

app.delete("/report/:id", (request, response, next) => {
  request.app.locals.store
    .get(request.params.id)
    .then(() => request.app.locals.store
      .del(request.params.id)
      .then(() => response.sendStatus(204)) // All deleted
    )
    .catch(error => {
      // levelup rejects with a NotFoundError when the key is missing
      if (error.notFound) {
        return response.sendStatus(404); // We couldn't find it, it may have already been deleted
      }
      // Allow express to handle any other errors
      next(error);
    });
});

Next we want to serve up our index.html file:

app.get("/", (request, response) => response.sendFile(`${process.cwd()}/index.html`));

We will also want to start listening for requests. We will listen on port 8080 by default, or process.env.PORT if it is a valid integer:

// IIFE so we don't need to define `port` as `let` ¯\_(ツ)_/¯
const port = (() => {
  if (/^\d+$/.test(process.env.PORT)) {
    return +process.env.PORT;
  }
  // Maybe you have other defaults you want to check here to decide on the port
  return 8080;
})();
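The same check can be expressed as a plain function, which also makes it easy to see how the regex handles a missing variable (resolvePort is an illustrative helper, not part of the service above):

```javascript
// /^\d+$/.test coerces its argument to a string, so `undefined`
// becomes "undefined" and fails the digits-only test
function resolvePort(env) {
  if (/^\d+$/.test(env)) {
    return +env;
  }
  return 8080;
}

console.log(resolvePort("3000")); // 3000
console.log(resolvePort(undefined)); // 8080
console.log(resolvePort("not-a-port")); // 8080
```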

Now that we have our port, we can start listening and log which port we're listening on:

app.listen(port, () => console.log(`Listening on port ${port}`));

Now we can work on our client. We want users to be able to submit a list of comma- or line-separated urls; we will send these to our service, and then display the results.

First we need our text box that accepts urls, a button to schedule the reports, and a list that we can append results to.

In index.html:

<html>
<body>
<textarea id="urls" rows="10"></textarea>
<button id="schedule" type="button">Schedule Reports</button>
<ul id="results"></ul>
<script type="application/javascript">
/* Our code will go here */
</script>
</body>
</html>

Now, inside our script tag, we're going to add an event listener for when schedule is pressed. This will grab the content of #urls, ensure it's somewhat valid, and then clear the textarea and send the urls to the service.

We're first going to need a reference to all of our DOM elements:

const urls = document.querySelector("#urls"),
      schedule = document.querySelector("#schedule"),
      results = document.querySelector("#results");

Then we will want to have a function that accepts a string, and returns either a message, or a list of valid URLs:

function processUrls(value) {
  if (!value) {
    return { message: "No urls provided" };
  }

  const validatedUrls = value
    .split("\n") // Split by line
    .reduce((urls, split) => urls.concat(split.split(",")), []) // Split by comma
    .map(url => url.trim()) // Trim each value, so we don't have extra white space
    .map(url => {
      // Creating a URL will do the "validation" for us, it will throw if it is invalid
      try {
        const instance = new URL(url);
        if (!instance.origin || !instance.protocol) {
          // We want to have an origin or protocol so that
          // we can reference it correctly in our service
          return false;
        }
        return instance.toString();
      } catch (e) {
        // Return false if there were any errors, so we can alert the user of an issue
        return false;
      }
    });

  // So we can visualise what's happening in our dev tools
  console.log({ validatedUrls });

  if (validatedUrls.includes(false)) {
    return { message: "One or more of the urls provided is not valid" };
  }

  return { validUrls: validatedUrls };
}
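To see what the split-then-flatten steps above produce, we can run just that part of the pipeline on a sample value mixing commas and newlines:

```javascript
// Sample input mixing comma-separated and line-separated urls
const value = "https://example.com, https://example.org\nhttps://example.net";

const urls = value
  .split("\n") // Split by line
  .reduce((urls, split) => urls.concat(split.split(",")), []) // Split by comma
  .map(url => url.trim()); // Trim stray white space

console.log(urls);
// ["https://example.com", "https://example.org", "https://example.net"]
```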

After we have our results, we will want a way to delete the report on our server, so let's create a function for that:

function deleteExternalReport(identifier) {
  fetch(`/report/${identifier}`, {
        method: "DELETE"
  })
      .catch(console.warn)
}

We will also need a way to display the results. For this we need a function that checks whether an identifier has a result yet, and updates the DOM element with a link to the results once it does:

function checkResults(element, identifier, originalUrl) {
  const intervalHandle = setInterval(doCheck, 2500);
  function doCheck() {
    fetch(`/report/${identifier}`)
      .then(response => response.json())
      .then(report => {
        if (!(report.result && report.result.report)) {
          return;
        }
        // We don't need to check again
        clearInterval(intervalHandle);
        // Create an object url that can be opened directly
        const link = document.createElement("a");
        link.innerText = `Complete: ${originalUrl}`;
        // Open up in a new tab
        link.target = "_blank";
        link.href = URL.createObjectURL(new Blob([report.result.report], { type: "text/html" }));
        // Empty out the element
        element.innerHTML = ``;
        element.appendChild(link);
        deleteExternalReport(identifier);
      })
      .catch(console.warn);
  }
}

Now when we receive our identifiers, we can display them and then have them updated once there is a result:

function displayResults(identifiers) {
  identifiers.forEach(
    // Each item will be an array with the original url and the new identifier
    ([originalUrl, identifier]) => {
      const element = document.createElement("li");
      element.innerText = `Processing: ${originalUrl}`;
      // Append to our results list
      results.appendChild(element);
      checkResults(element, identifier, originalUrl);
    }
  );
}

We will also need a function that takes in our urls, and sends them to our service, then adds the values to our UI:

function send(urls) {
  // We want to receive an html output for each url
  const options = {
    output: "html"
  };
  fetch("/report", {
    method: "POST",
    headers: {
      "Content-Type": "application/json"
    },
    body: JSON.stringify(urls.map(url => ({ url, options })))
  })
    .then(response => response.json())
    .then(displayResults)
    .catch(() => alert("Something went wrong while sending our request"));
}

Now we can bring it all together by adding an event listener that waits for clicks on our schedule button. This will invoke our send function, which will trigger the report generation:

schedule.addEventListener("click", () => {
  const value = urls.value;
  
  const { validUrls, message } = processUrls(value);
  
  if (message) {
      return alert(message); 
  }
  
  // Clear out the textarea so we can schedule more urls
  urls.value = "";
  
  // Trigger the process
  send(validUrls);
});

We now have a full application that allows multiple sites to be scheduled for reporting. Once you input a list of urls (or a single url), you see the message Processing: https://example.com, and once it's done, it will be updated to Complete: https://example.com. The url will then be clickable, which will open up your Lighthouse report in a new tab!

Hopefully this article inspires a few of you to look into Lighthouse more and push accessibility and speed within your organisations. If you inspect the responses from the service in your development tools, you will be able to see all the data that Lighthouse collects, meaning all the data you can report on!

Next in series: Moving from single fire to scheduled tasks with Puppeteer and Lighthouse Part 1