Can't use Puppeteer for NodeJS?

richardreeze · February 19, 2023, 11:32pm

I’m still on a mission this year to switch to Replit as my full-time IDE.

Unfortunately, I’m noticing some limitations. For example, I can’t use Puppeteer on NodeJS, which is a shame because I’m taking a scraping course on Udemy and can’t follow along without it.

Scraping works fine with something like axios. But since Puppeteer relies on opening a browser window, it gets blocked.

Or is there a way around this? Am I missing something? I know some people here use Replit as their full-time IDE, so I wanted to ask them.

sonicx180 · February 19, 2023, 11:42pm

I tried, but I couldn’t. Really sorry. Also, you realize how useful Replit is when you have a Chromebook.

richardreeze · February 20, 2023, 12:03am

I do Replit.
With a few key tweaks, I’ll never have to look at another IDE again.

GrimSteel · February 20, 2023, 12:12am

I’m not sure about Puppeteer, but I was able to get Selenium to work. I think Puppeteer could work too with a properly configured replit.nix

replit.nix: (this file is hidden by default - click Show hidden Files to show it)

{ pkgs }: {
  deps = [
    pkgs.nodejs-18_x
    pkgs.chromedriver
    pkgs.chromium
    pkgs.glib
    pkgs.nss
    pkgs.fontconfig
  ];
  nativeBuildInputs = [
    pkgs.chromedriver
  ];
}

index.js:

import { Builder, Browser, until } from 'selenium-webdriver';
import { Options } from "selenium-webdriver/chrome.js";

const chromeOptions = new Options()
  .addArguments("disable-dev-shm-usage", "no-sandbox");

const driver = await new Builder()
  .forBrowser(Browser.CHROME)
  .setChromeOptions(chromeOptions).build();

try {
  await driver.get("https://www.google.com/search?q=webdriver")
  await driver.wait(until.titleIs('webdriver - Google Search'), 5000);
} finally {
  await driver.quit();
}

When run, Chrome shows up in the Repl “output” window

If you do end up scraping on Replit, make sure it’s legal/ethical and follows the TOS

richardreeze · February 20, 2023, 12:34am

This blew my mind, thanks!

However I’m trying to follow your code and got lost. This is my code

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.mostrecommendedbooks.com/');

  // Get the website's title
  const title = await page.title();
  console.log(`Website title: ${title}`);

  await browser.close();
})();

And this is my replit.nix (I also installed Chromium through the Packages tool)

{ pkgs }: {
  deps = [
    pkgs.nodejs-18_x
    pkgs.chromedriver
    pkgs.chromium
    pkgs.nodePackages.typescript-language-server
    pkgs.yarn
    pkgs.replitPackages.jest
  ];
}

And yes haha it’s legal/ethical. I’m taking a few coding courses so I can learn again, and scraping is one of the topics.

richardreeze · February 20, 2023, 12:36am

Oh sorry I forgot to share my error message

/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:300
            reject(new Error([
                   ^

Error: Failed to launch the browser process!
/home/runner/.cache/puppeteer/chrome/linux-1095492/chrome-linux/chrome: error while loading shared libraries: libgobject-2.0.so.0: cannot open shared object file: No such file or directory


TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:300:20)
    at Interface.<anonymous> (/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:288:24)
    at Interface.emit (node:events:525:35)
    at Interface.emit (node:domain:489:12)
    at Interface.close (node:internal/readline/interface:536:10)
    at Socket.onend (node:internal/readline/interface:262:10)
    at Socket.emit (node:events:525:35)
    at Socket.emit (node:domain:489:12)
    at endReadableNT (node:internal/streams/readable:1359:12)

Node.js v18.12.1

GrimSteel · February 20, 2023, 12:38am

Hmm. It looks like this is happening because it can’t find “libgobject”. AFAIK, this is part of “glib”

Try adding glib to your replit.nix
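
For anyone hitting a similar launch failure, the missing library name can be pulled straight out of the error text. The following is only a rough sketch: `missingSharedLibrary`, `suggestNixPackage`, and the lookup table are hypothetical helpers, and the only mapping taken from this thread is libgobject to glib.

```javascript
// Map from shared-library name prefix to the Nix package that provides it.
// Only the libgobject -> glib entry comes from this thread; extend as needed.
const LIB_TO_NIX_PACKAGE = {
  'libgobject-2.0': 'pkgs.glib',
};

// Pull the missing .so file name out of a Chrome launch error, or return null.
function missingSharedLibrary(message) {
  const match = /error while loading shared libraries: (\S+?):/.exec(message);
  return match ? match[1] : null;
}

// Suggest a replit.nix package for a missing library, if one is known.
function suggestNixPackage(libraryFile) {
  const prefix = Object.keys(LIB_TO_NIX_PACKAGE)
    .find((name) => libraryFile.startsWith(name));
  return prefix ? LIB_TO_NIX_PACKAGE[prefix] : null;
}

// The error line quoted earlier in this thread.
const sample =
  '/home/runner/.cache/puppeteer/chrome/linux-1095492/chrome-linux/chrome: ' +
  'error while loading shared libraries: libgobject-2.0.so.0: ' +
  'cannot open shared object file: No such file or directory';

const lib = missingSharedLibrary(sample);
console.log(lib, suggestNixPackage(lib));
```

Chrome reports only the first library it fails to load, so it may take a few rounds of adding a package and re-running before the browser starts.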

GrimSteel · February 20, 2023, 12:41am

Also, if you aren’t using TypeScript, Yarn, or Jest, you can remove their packages from your replit.nix to save a little storage space on your Repl.

richardreeze · February 20, 2023, 1:37am

Hmm, still nothing. Now I’m getting this error:

/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ProductLauncher.js:127
                    throw new Error(`Could not find Chromium (rev. ${this.puppeteer.browserRevision}). This can occur if either\n` +
                          ^

Error: Could not find Chromium (rev. 1095492). This can occur if either
 1. you did not perform an installation before running the script (e.g. `npm install`) or
 2. your cache path is incorrectly configured (which is: /home/runner/.cache/puppeteer).
For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
    at ChromeLauncher.resolveExecutablePath (/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ProductLauncher.js:127:27)
    at ChromeLauncher.executablePath (/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ChromeLauncher.js:206:25)
    at ChromeLauncher.launch (/home/runner/random/node_modules/puppeteer-core/lib/cjs/puppeteer/node/ChromeLauncher.js:93:37)

Node.js v18.12.1

Any other solutions? Funny part is, when I click on that URL in the error, it takes me to a 404 page. So no luck.

Thanks for letting me know about those packages though. Now I’ll know to remove them each time.

dragonhunter1 · February 20, 2023, 2:27am

richardreeze:

 1. you did not perform an installation before running the script (e.g. `npm install`) or
 2. your cache path is incorrectly configured (which is: /home/runner/.cache/puppeteer).

Did you follow these steps? Also, because .cache is outside your project folder, it will get deleted every time your Repl restarts (I think)

GrimSteel · February 20, 2023, 2:52am

I figured out how to launch the browser at least (nothing else works but this is a start!)

replit.nix:

{ pkgs }: {
  deps = [
    pkgs.nodejs-18_x
    pkgs.chromium
    pkgs.glib
    pkgs.nss
    pkgs.fontconfig
  ];
}

index.js:

import puppeteer from 'puppeteer-core';
import { exec } from "child_process";

function findChromium() {
  return new Promise((res, rej) => {
    exec("nix eval nixpkgs.chromium.outPath --raw", (error, stdout, stderr) => {
      if (error) rej(error.message);
      else if (stderr) rej(stderr);
      else res(`${stdout}/bin/chromium`);
    });
  });
}
const chromiumPath = await findChromium();

const browser = await puppeteer.launch({ 
  headless: false, 
  executablePath: chromiumPath,
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});
await browser.close();

Once I start this, it opens the browser and just stops there at about:blank. I added in some console.logs and it doesn’t look like it’s getting past the puppeteer.launch call

Notice that I’m using puppeteer-core instead of puppeteer as the chrome executable is downloaded with Nix

WARNING: For simplicity, I disabled the sandbox. As the docs say, only do this if you absolutely trust the websites you are scraping!

richardreeze · February 20, 2023, 11:11am

Regarding #1, I did install chromium using the Replit “Packages” tool (I only use that now instead of npm or yarn… that’s ok right?)
Regarding #2, how do I reconfigure my cache path? I can’t find it.
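
On the cache path question: as far as I know, Puppeteer 19 and later read an optional `.puppeteerrc.cjs` file from the project root, and its `cacheDirectory` setting controls where the downloaded browser is stored. A minimal sketch, assuming the Repl’s project folder survives restarts while `/home/runner/.cache` may not:

```javascript
// .puppeteerrc.cjs (place it next to package.json)
const { join } = require('path');

/** @type {import("puppeteer").Configuration} */
module.exports = {
  // Keep the downloaded browser inside the project folder instead of ~/.cache.
  cacheDirectory: join(__dirname, '.cache', 'puppeteer'),
};
```

After adding the file, the puppeteer package has to be reinstalled so the browser is downloaded into the new location. This applies only to the full `puppeteer` package; `puppeteer-core` does not download a browser at all.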

richardreeze · February 20, 2023, 11:26am

Thanks for this! I edited your code a bit and got an interesting result

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({ 
    headless: false, 
    executablePath: '/nix/store/x205pbkd5xh5g4iv0g58xjla55has3cx-chromium-108.0.5359.94/bin/chromium-browser',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  
  // Navigate to a URL and log the page title to the console
  const page = await browser.newPage();
  await page.goto('https://www.mostrecommendedbooks.com');
  const pageTitle = await page.title();
  console.log(pageTitle);

  await browser.close();
})();

It outputs the page’s title (which is what I wanted!) even though it doesn’t launch a browser.

This is my replit.nix

{ pkgs }: {
  deps = [
    pkgs.nodejs-18_x
    pkgs.chromium
    pkgs.glib
    pkgs.nss
    pkgs.fontconfig
  ];
}

Leaving it here in case anyone else needs help with this
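
A small note on the code above: `page.title()` returns the document’s `<title>`, which may differ from the visible heading. If the `h1` text is what’s needed, `page.$eval` can read it. A minimal sketch, where `titleAndHeading` is a hypothetical helper name and the page is assumed to contain at least one `h1` (`$eval` throws if nothing matches the selector):

```javascript
// Read both the document <title> and the first <h1> from a Puppeteer page.
async function titleAndHeading(page) {
  const title = await page.title();
  // $eval runs the callback inside the page on the first element matching 'h1'.
  const heading = await page.$eval('h1', (el) => el.textContent.trim());
  return { title, heading };
}
```

It would be called right after `page.goto(...)`, for example `console.log(await titleAndHeading(page));`.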

richardreeze · February 20, 2023, 11:39am

Update: Eventually the browser did get launched. I’m not sure why, but… I won’t complain.

GrimSteel · February 20, 2023, 5:46pm

Great! I might be wrong, but it looks like you solved it by replacing chromium with chromium-browser

I have very limited knowledge with Nix, and I may very well be wrong, but I think the executable path changes every time your Repl boots and installs the nix packages. I know there’s a way to get Nix to store this path in an environment variable, but I wasn’t able to find it… (That’s why I did the thing where it runs the nix command to find it)

Other than that, thanks for sharing your solution!

EDIT: Huh, I tried your code out and it seems like it only works when I hardcode the executable path! I think it has something to do with how nix eval nixpkgs.chromium.outPath outputted the path to Chrome 92, while you have Chrome 108 there. It also seems like it’s perfectly fine to hardcode the path. Thanks again!
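
One possible way to avoid both the hardcoded store path and the `nix eval` call is to look the binary up on `PATH` at startup. This is only a sketch: it assumes that packages listed under `deps` in replit.nix end up on the Repl’s `PATH`, which nobody in this thread has confirmed, and `findOnPath` and `resolveChromium` are hypothetical helper names.

```javascript
// Binary names mentioned in this thread.
const CANDIDATES = ['chromium', 'chromium-browser'];

// Return the first "<dir>/<name>" on a PATH-style string for which
// exists(...) is true, or null when nothing matches.
function findOnPath(pathVar, exists, names = CANDIDATES) {
  for (const dir of pathVar.split(':')) {
    if (!dir) continue; // skip empty PATH entries
    for (const name of names) {
      const candidate = `${dir}/${name}`;
      if (exists(candidate)) return candidate;
    }
  }
  return null;
}

// Resolve the Chromium binary against the real file system.
async function resolveChromium() {
  const { existsSync } = await import('node:fs');
  return findOnPath(process.env.PATH ?? '', existsSync);
}
```

The result of `await resolveChromium()` can be passed as `executablePath` to `puppeteer.launch`. If it comes back as `null`, hardcoding the path as in the post above still works.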

PanditSiddharth · April 4, 2023, 8:55am

Did you find any solution for this? Also, why is it opening the browser in a new tab?

PanditSiddharth · April 4, 2023, 9:00am

I set headless: true and now it’s not opening!

Thanks, everyone, for this conversation about the problem.

dragonhunter1 · April 4, 2023, 1:10pm

Headless mode is the definition of not opening. The browser window is the “head”, so if you want the browser window, set headless back to false. Otherwise, headless mode is probably faster and less egress intensive.
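
To make the switch easy to flip, the launch options can be built in one place. A minimal sketch based on the options already used in this thread; `launchOptions` and `printTitle` are hypothetical helper names, and `puppeteer-core` is assumed to be installed:

```javascript
// Build Puppeteer launch options; showWindow decides whether a window appears.
function launchOptions(executablePath, showWindow = false) {
  return {
    headless: !showWindow,
    executablePath,
    // Sandbox disabled as in the posts above; only do this for trusted sites.
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  };
}

// Open a page, print its title, and always close the browser afterwards.
async function printTitle(url, executablePath, showWindow = false) {
  const { default: puppeteer } = await import('puppeteer-core');
  const browser = await puppeteer.launch(launchOptions(executablePath, showWindow));
  try {
    const page = await browser.newPage();
    await page.goto(url);
    console.log(await page.title());
  } finally {
    await browser.close();
  }
}
```

Calling `printTitle(url, path, true)` shows the window in the Repl output pane, while leaving the last argument out keeps the run headless.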