Electron Js Web Scraping

Web scraping is a technique for extracting content from websites in order to archive data in a structured way. Be careful, however, to respect the terms of use of the website concerned. Electron is a framework for creating native Windows/Mac/Linux applications with web technologies (JavaScript, HTML, CSS).

Warning

Electron's webview tag is based on Chromium's webview, which is undergoing dramatic architectural changes. This impacts the stability of webviews, including rendering, navigation, and event routing. We currently recommend not using the webview tag and considering alternatives such as iframe, Electron's BrowserView, or an architecture that avoids embedded content altogether.

Enabling

By default the webview tag is disabled in Electron >= 5. You need to enable the tag by setting the webviewTag webPreferences option when constructing your BrowserWindow. For more information see the BrowserWindow constructor docs.
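As a minimal sketch of the option described above, the helper below builds constructor options with webviewTag switched on. The helper name buildWindowOptions, the 800x600 size, and the index.html file are illustrative assumptions, not part of Electron.

```javascript
// Sketch: BrowserWindow options that enable the <webview> tag.
// `buildWindowOptions` and the window size are hypothetical, chosen for illustration.
function buildWindowOptions() {
  return {
    width: 800,
    height: 600,
    webPreferences: {
      webviewTag: true // allow <webview> elements in pages loaded by this window
    }
  };
}

// In the main process (needs the electron package, so it is left commented out):
//   const { app, BrowserWindow } = require('electron');
//   app.whenReady().then(() => {
//     const win = new BrowserWindow(buildWindowOptions());
//     win.loadFile('index.html'); // hypothetical embedder page
//   });
```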

Overview

Display external web content in an isolated frame and process.

Process: Renderer

Use the webview tag to embed 'guest' content (such as web pages) in your Electron app. The guest content is contained within the webview container. An embedded page within your app controls how the guest content is laid out and rendered.

Unlike an iframe, the webview runs in a separate process from your app. It doesn't have the same permissions as your web page, and all interactions between your app and embedded content will be asynchronous. This keeps your app safe from the embedded content. Note: Most methods called on the webview from the host page require a synchronous call to the main process.

Example

To embed a web page in your app, add the webview tag to your app's embedder page (this is the app page that will display the guest content). In its simplest form, the webview tag includes the src of the web page and CSS styles that control the appearance of the webview container:
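The simplest form described above can also be produced from script. This is a sketch, not the documented markup example: the helper name createWebview, the 640x480 size, and the URL in the usage comment are assumptions, and the document object is passed in so the helper can be exercised outside a browser.

```javascript
// Sketch: build a <webview> element with a src and container styles.
// `doc` is the embedder page's `document`; `createWebview` is a hypothetical helper.
function createWebview(doc, src) {
  const webview = doc.createElement('webview');
  webview.setAttribute('src', src);
  // inline-flex keeps the internal display:flex behavior while allowing inline layout
  webview.setAttribute('style', 'display:inline-flex; width:640px; height:480px');
  return webview;
}

// In the embedder page:
//   document.body.appendChild(createWebview(document, 'https://www.github.com/'));
```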

If you want to control the guest content in any way, you can write JavaScript that listens for webview events and responds to those events using the webview methods. Here's sample code with two event listeners: one that listens for the web page to start loading, the other for the web page to stop loading, and displays a 'loading...' message during the load time:
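A sketch of the two listeners described above, assuming the embedder page contains a webview element and some element (here selected with the hypothetical class .indicator) that shows the status text. The function name attachLoadingIndicator is illustrative.

```javascript
// Sketch: show 'loading...' while the guest page loads, clear it when loading stops.
function attachLoadingIndicator(webview, indicator) {
  webview.addEventListener('did-start-loading', () => {
    indicator.innerText = 'loading...';
  });
  webview.addEventListener('did-stop-loading', () => {
    indicator.innerText = '';
  });
}

// In the embedder page, once the DOM is available:
//   attachLoadingIndicator(
//     document.querySelector('webview'),
//     document.querySelector('.indicator') // hypothetical status element
//   );
```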

Internal implementation

Under the hood, webview is implemented with Out-of-Process iframes (OOPIFs). The webview tag is essentially a custom element using shadow DOM to wrap an iframe element inside it.

So the behavior of webview is very similar to a cross-domain iframe. For example:

  • When clicking into a webview, the page focus will move from the embedder frame to webview.
  • You cannot add keyboard, mouse, and scroll event listeners to webview.
  • All interactions between the embedder frame and webview are asynchronous.

CSS Styling Notes

Please note that the webview tag's style uses display:flex; internally to ensure the child iframe element fills the full height and width of its webview container when used with traditional and flexbox layouts. Please do not overwrite the default display:flex; CSS property, unless specifying display:inline-flex; for inline layout.

Tag Attributes

The webview tag has the following attributes:

src

A String representing the visible URL. Writing to this attribute initiates top-level navigation.

Assigning src its own value will reload the current page.

The src attribute can also accept data URLs, such as data:text/plain,Hello, world!.

nodeintegration

A Boolean. When this attribute is present the guest page in webview will have node integration and can use Node APIs like require and process to access low-level system resources. Node integration is disabled by default in the guest page.

nodeintegrationinsubframes

A Boolean for the experimental option for enabling Node.js support in sub-frames such as iframes inside the webview. All your preloads will load for every iframe; you can use process.isMainFrame to determine whether you are in the main frame or not. This option is disabled by default in the guest page.

enableremotemodule

A Boolean. When this attribute is false the guest page in webview will not have access to the remote module. The remote module is unavailable by default.

plugins

A Boolean. When this attribute is present the guest page in webview will be able to use browser plugins. Plugins are disabled by default.

preload

A String that specifies a script that will be loaded before other scripts run in the guest page. The protocol of the script's URL must be either file: or asar:, because it will be loaded by require in the guest page under the hood.

When the guest page doesn't have node integration this script will still have access to all Node APIs, but global objects injected by Node will be deleted after this script has finished executing.

Note: This option will appear as preloadURL (not preload) in the webPreferences specified to the will-attach-webview event.

httpreferrer

A String that sets the referrer URL for the guest page.

useragent

A String that sets the user agent for the guest page before the page is navigated to. Once the page is loaded, use the setUserAgent method to change the user agent.

disablewebsecurity

A Boolean. When this attribute is present the guest page will have web security disabled. Web security is enabled by default.

partition

A String that sets the session used by the page. If partition starts with persist:, the page will use a persistent session available to all pages in the app with the same partition. If there is no persist: prefix, the page will use an in-memory session. By assigning the same partition, multiple pages can share the same session. If the partition is unset then the default session of the app will be used.

This value can only be modified before the first navigation, since the session of an active renderer process cannot change. Subsequent attempts to modify the value will fail with a DOM exception.
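The partition rules above can be summarized in a few lines. This sketch only classifies a partition string according to the prose; describePartition and its return labels are hypothetical and not an Electron API.

```javascript
// Sketch: which kind of session a partition value selects, per the rules above.
// Returns one of 'default', 'persistent', 'in-memory' (labels are illustrative).
function describePartition(partition) {
  if (!partition) {
    return 'default'; // attribute unset: the app's default session is used
  }
  return partition.startsWith('persist:') ? 'persistent' : 'in-memory';
}
```

For instance, describePartition('persist:github') gives 'persistent', while a bare name such as 'scratch' gives 'in-memory'.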

allowpopups

A Boolean. When this attribute is present the guest page will be allowed to open new windows. Popups are disabled by default.

webpreferences

A String which is a comma-separated list of strings which specifies the web preferences to be set on the webview. The full list of supported preference strings can be found in BrowserWindow.

The string follows the same format as the features string in window.open. A name by itself is given a true boolean value. A preference can be set to another value by including an =, followed by the value. Special values yes and 1 are interpreted as true, while no and 0 are interpreted as false.
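To make the string format concrete, here is a small parser that follows the rules just listed. It is an illustration of the format only, not Electron's own parsing code, and the function name parseWebPreferences is an assumption. Values other than yes, 1, no, and 0 are kept as plain strings.

```javascript
// Sketch: interpret a webpreferences-style string according to the rules above.
function parseWebPreferences(str) {
  const prefs = {};
  for (const part of str.split(',')) {
    const item = part.trim();
    if (!item) continue; // skip empty entries such as a trailing comma
    const eq = item.indexOf('=');
    if (eq === -1) {
      prefs[item] = true; // a name by itself means true
      continue;
    }
    const name = item.slice(0, eq).trim();
    const raw = item.slice(eq + 1).trim();
    if (raw === 'yes' || raw === '1') prefs[name] = true;
    else if (raw === 'no' || raw === '0') prefs[name] = false;
    else prefs[name] = raw; // any other value is kept as written
  }
  return prefs;
}
```

Under these rules, the string 'allowRunningInsecureContent, javascript=no' yields allowRunningInsecureContent set to true and javascript set to false.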

enableblinkfeatures

A String which is a comma-separated list of the Blink features to be enabled. The full list of supported feature strings can be found in the RuntimeEnabledFeatures.json5 file.

disableblinkfeatures

A String which is a comma-separated list of the Blink features to be disabled. The full list of supported feature strings can be found in the RuntimeEnabledFeatures.json5 file.

Methods

The webview tag has the following methods:

Note: The webview element must be loaded before using the methods.

Example
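A sketch of waiting for the guest to load before calling a method, assuming the dom-ready event (listed under DOM Events) as the readiness signal. The helper name whenGuestReady is illustrative.

```javascript
// Sketch: defer webview method calls until the guest document is ready.
function whenGuestReady(webview, callback) {
  webview.addEventListener('dom-ready', () => {
    callback(webview);
  });
}

// In the embedder page:
//   whenGuestReady(document.querySelector('webview'), (wv) => wv.openDevTools());
```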

<webview>.loadURL(url[, options])

  • url URL
  • options Object (optional)
    • httpReferrer (String | Referrer) (optional) - An HTTP Referrer URL.
    • userAgent String (optional) - A user agent originating the request.
    • extraHeaders String (optional) - Extra headers separated by '\n'.
    • postData (UploadRawData[] | UploadFile[]) (optional)
    • baseURLForDataURL String (optional) - Base URL (with trailing path separator) for files to be loaded by the data URL. This is needed only if the specified url is a data URL and needs to load other files.

Returns Promise<void> - The promise will resolve when the page has finished loading (see did-finish-load), and rejects if the page fails to load (see did-fail-load).

Loads the url in the webview. The url must contain the protocol prefix, e.g. http:// or file://.

<webview>.downloadURL(url)

  • url String

Initiates a download of the resource at url without navigating.

<webview>.getURL()

Returns String - The URL of guest page.

<webview>.getTitle()

Returns String - The title of guest page.

<webview>.isLoading()

Returns Boolean - Whether guest page is still loading resources.

<webview>.isLoadingMainFrame()

Returns Boolean - Whether the main frame (and not just iframes or frames within it) is still loading.

<webview>.isWaitingForResponse()

Returns Boolean - Whether the guest page is waiting for a first-response for the main resource of the page.

<webview>.stop()

Stops any pending navigation.

<webview>.reload()

Reloads the guest page.

<webview>.reloadIgnoringCache()

Reloads the guest page and ignores cache.

<webview>.canGoBack()

Returns Boolean - Whether the guest page can go back.

<webview>.canGoForward()

Returns Boolean - Whether the guest page can go forward.

<webview>.canGoToOffset(offset)

  • offset Integer

Returns Boolean - Whether the guest page can go to offset.

<webview>.clearHistory()

Clears the navigation history.

<webview>.goBack()

Makes the guest page go back.

<webview>.goForward()

Makes the guest page go forward.

<webview>.goToIndex(index)

  • index Integer

Navigates to the specified absolute index.

<webview>.goToOffset(offset)

  • offset Integer

Navigates to the specified offset from the 'current entry'.

<webview>.isCrashed()

Returns Boolean - Whether the renderer process has crashed.

<webview>.setUserAgent(userAgent)

  • userAgent String

Overrides the user agent for the guest page.

<webview>.getUserAgent()

Returns String - The user agent for guest page.

<webview>.insertCSS(css)

  • css String

Returns Promise<String> - A promise that resolves with a key for the inserted CSS that can later be used to remove the CSS via <webview>.removeInsertedCSS(key).

Injects CSS into the current web page and returns a unique key for the inserted stylesheet.

<webview>.removeInsertedCSS(key)

  • key String

Returns Promise<void> - Resolves if the removal was successful.

Removes the inserted CSS from the current web page. The stylesheet is identified by its key, which is returned from <webview>.insertCSS(css).

<webview>.executeJavaScript(code[, userGesture])

  • code String
  • userGesture Boolean (optional) - Default false.

Returns Promise<any> - A promise that resolves with the result of the executed code or is rejected if the result of the code is a rejected promise.

Evaluates code in page. If userGesture is set, it will create the user gesture context in the page. HTML APIs like requestFullScreen, which require user action, can take advantage of this option for automation.
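Since executeJavaScript takes source text and resolves with the result, it can serve as a simple way to read data out of a guest page. The sketch below is one possible approach under stated assumptions: collectLinks and scrapeLinks are hypothetical helpers, and the result is kept to plain objects so it can cross the process boundary.

```javascript
// Runs inside the guest page. It must be self-contained, because only its
// source text is sent to the guest.
function collectLinks() {
  return Array.from(document.querySelectorAll('a')).map((a) => ({
    text: a.textContent.trim(),
    href: a.href
  }));
}

// Runs in the embedder page. Wraps the function source in an immediately
// invoked expression so the guest evaluates it and returns the array.
function scrapeLinks(webview) {
  return webview.executeJavaScript(`(${collectLinks.toString()})()`);
}

// Usage, after the guest has fired dom-ready:
//   scrapeLinks(webview).then((links) => console.log(links));
```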

<webview>.openDevTools()

Opens a DevTools window for guest page.

<webview>.closeDevTools()

Closes the DevTools window of guest page.

<webview>.isDevToolsOpened()

Returns Boolean - Whether guest page has a DevTools window attached.

<webview>.isDevToolsFocused()

Returns Boolean - Whether DevTools window of guest page is focused.

<webview>.inspectElement(x, y)

  • x Integer
  • y Integer

Starts inspecting element at position (x, y) of guest page.

<webview>.inspectSharedWorker()

Opens the DevTools for the shared worker context present in the guest page.

<webview>.inspectServiceWorker()

Opens the DevTools for the service worker context present in the guest page.

<webview>.setAudioMuted(muted)

  • muted Boolean

Set guest page muted.

<webview>.isAudioMuted()

Returns Boolean - Whether guest page has been muted.

<webview>.isCurrentlyAudible()

Returns Boolean - Whether audio is currently playing.

<webview>.undo()

Executes editing command undo in page.

<webview>.redo()

Executes editing command redo in page.

<webview>.cut()

Executes editing command cut in page.

<webview>.copy()

Executes editing command copy in page.

<webview>.paste()

Executes editing command paste in page.

<webview>.pasteAndMatchStyle()

Executes editing command pasteAndMatchStyle in page.

<webview>.delete()

Executes editing command delete in page.

<webview>.selectAll()

Executes editing command selectAll in page.

<webview>.unselect()

Executes editing command unselect in page.

<webview>.replace(text)

  • text String

Executes editing command replace in page.

<webview>.replaceMisspelling(text)

  • text String

Executes editing command replaceMisspelling in page.

<webview>.insertText(text)

  • text String

Returns Promise<void>

Inserts text to the focused element.

<webview>.findInPage(text[, options])

  • text String - Content to be searched, must not be empty.
  • options Object (optional)
    • forward Boolean (optional) - Whether to search forward or backward, defaults to true.
    • findNext Boolean (optional) - Whether the operation is first request or a follow up, defaults to false.
    • matchCase Boolean (optional) - Whether search should be case-sensitive, defaults to false.

Returns Integer - The request id used for the request.

Starts a request to find all matches for the text in the web page. The result of the request can be obtained by subscribing to found-in-page event.

<webview>.stopFindInPage(action)

  • action String - Specifies the action to take place when ending <webview>.findInPage request.
    • clearSelection - Clear the selection.
    • keepSelection - Translate the selection into a normal selection.
    • activateSelection - Focus and click the selection node.

Stops any findInPage request for the webview with the provided action.

<webview>.print([options])

  • options Object (optional)
    • silent Boolean (optional) - Don't ask user for print settings. Default is false.
    • printBackground Boolean (optional) - Prints the background color and image of the web page. Default is false.
    • deviceName String (optional) - Set the printer device name to use. Must be the system-defined name and not the 'friendly' name, e.g 'Brother_QL_820NWB' and not 'Brother QL-820NWB'.
    • color Boolean (optional) - Set whether the printed web page will be in color or grayscale. Default is true.
    • margins Object (optional)
      • marginType String (optional) - Can be default, none, printableArea, or custom. If custom is chosen, you will also need to specify top, bottom, left, and right.
      • top Number (optional) - The top margin of the printed web page, in pixels.
      • bottom Number (optional) - The bottom margin of the printed web page, in pixels.
      • left Number (optional) - The left margin of the printed web page, in pixels.
      • right Number (optional) - The right margin of the printed web page, in pixels.
    • landscape Boolean (optional) - Whether the web page should be printed in landscape mode. Default is false.
    • scaleFactor Number (optional) - The scale factor of the web page.
    • pagesPerSheet Number (optional) - The number of pages to print per page sheet.
    • collate Boolean (optional) - Whether the web page should be collated.
    • copies Number (optional) - The number of copies of the web page to print.
    • pageRanges Object - The page range to print.
      • from Number - Index of the first page to print (0-based).
      • to Number - Index of the last page to print (inclusive) (0-based).
    • duplexMode String (optional) - Set the duplex mode of the printed web page. Can be simplex, shortEdge, or longEdge.
    • dpi Record<string, number> (optional)
      • horizontal Number (optional) - The horizontal dpi.
      • vertical Number (optional) - The vertical dpi.
    • header String (optional) - String to be printed as page header.
    • footer String (optional) - String to be printed as page footer.
    • pageSize String | Size (optional) - Specify page size of the printed document. Can be A3, A4, A5, Legal, Letter, Tabloid or an Object containing height and width.

Returns Promise<void>

Prints webview's web page. Same as webContents.print([options]).

<webview>.printToPDF(options)

  • options Object
    • headerFooter Record<string, string> (optional) - the header and footer for the PDF.
      • title String - The title for the PDF header.
      • url String - the url for the PDF footer.
    • landscape Boolean (optional) - true for landscape, false for portrait.
    • marginsType Integer (optional) - Specifies the type of margins to use. Uses 0 for default margin, 1 for no margin, and 2 for minimum margin.
    • scaleFactor Number (optional) - The scale factor of the web page. Can range from 0 to 100.
    • pageRanges Record<string, number> (optional) - The page range to print. On macOS, only the first range is honored.
      • from Number - Index of the first page to print (0-based).
      • to Number - Index of the last page to print (inclusive) (0-based).
    • pageSize String | Size (optional) - Specify page size of the generated PDF. Can be A3, A4, A5, Legal, Letter, Tabloid or an Object containing height and width in microns.
    • printBackground Boolean (optional) - Whether to print CSS backgrounds.
    • printSelectionOnly Boolean (optional) - Whether to print selection only.

Returns Promise<Uint8Array> - Resolves with the generated PDF data.

Prints webview's web page as PDF. Same as webContents.printToPDF(options).

<webview>.capturePage([rect])

  • rect Rectangle (optional) - The area of the page to be captured.

Returns Promise<NativeImage> - Resolves with a NativeImage

Captures a snapshot of the page within rect. Omitting rect will capture the whole visible page.

<webview>.send(channel, ...args)

  • channel String
  • ...args any[]

Returns Promise<void>

Send an asynchronous message to renderer process via channel; you can also send arbitrary arguments. The renderer process can handle the message by listening to the channel event with the ipcRenderer module.

See webContents.send for examples.

<webview>.sendInputEvent(event)

  • event MouseInputEvent | MouseWheelInputEvent | KeyboardInputEvent

Returns Promise<void>

Sends an input event to the page.

See webContents.sendInputEvent for detailed description of event object.

<webview>.setZoomFactor(factor)

  • factor Number - Zoom factor.

Changes the zoom factor to the specified factor. Zoom factor is zoom percent divided by 100, so 300% = 3.0.

<webview>.setZoomLevel(level)

  • level Number - Zoom level.

Changes the zoom level to the specified level. The original size is 0 and each increment above or below represents zooming 20% larger or smaller to default limits of 300% and 50% of original size, respectively. The formula for this is scale := 1.2 ^ level.

NOTE: The zoom policy at the Chromium level is same-origin, meaning that the zoom level for a specific domain propagates across all instances of windows with the same domain. Differentiating the window URLs will make zoom work per-window.
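The relationship between zoom level and zoom factor follows directly from the formula scale := 1.2 ^ level. As a small worked sketch (the helper names are illustrative, and the default limits of 50% and 300% are not enforced here):

```javascript
// Sketch: convert between zoom level and zoom factor using scale = 1.2 ^ level.
const ZOOM_STEP = 1.2;

function zoomLevelToFactor(level) {
  return Math.pow(ZOOM_STEP, level);
}

function zoomFactorToLevel(factor) {
  // Inverse of the formula above: level = log(factor) / log(1.2)
  return Math.log(factor) / Math.log(ZOOM_STEP);
}
```

Level 0 maps to a factor of 1 (original size), level 2 to about 1.44, and level -1 to about 0.83.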

<webview>.getZoomFactor()

Returns Number - the current zoom factor.

<webview>.getZoomLevel()

Returns Number - the current zoom level.

<webview>.setVisualZoomLevelLimits(minimumLevel, maximumLevel)

  • minimumLevel Number
  • maximumLevel Number

Returns Promise<void>

Sets the maximum and minimum pinch-to-zoom level.

<webview>.showDefinitionForSelection() macOS

Shows pop-up dictionary that searches the selected word on the page.

<webview>.getWebContentsId()

Returns Number - The WebContents ID of this webview.

DOM Events

The following DOM events are available to the webview tag:

Event: 'load-commit'

Returns:

  • url String
  • isMainFrame Boolean

Fired when a load has committed. This includes navigation within the current document as well as subframe document-level loads, but does not include asynchronous resource loads.

Event: 'did-finish-load'

Fired when the navigation is done, i.e. the spinner of the tab will stop spinning, and the onload event is dispatched.

Event: 'did-fail-load'

Returns:

  • errorCode Integer
  • errorDescription String
  • validatedURL String
  • isMainFrame Boolean

This event is like did-finish-load, but fired when the load failed or was cancelled, e.g. window.stop() is invoked.

Event: 'did-frame-finish-load'

Returns:

  • isMainFrame Boolean

Fired when a frame has done navigation.

Event: 'did-start-loading'

Corresponds to the points in time when the spinner of the tab starts spinning.

Event: 'did-stop-loading'

Corresponds to the points in time when the spinner of the tab stops spinning.

Event: 'dom-ready'

Fired when document in the given frame is loaded.

Event: 'page-title-updated'

Returns:

  • title String
  • explicitSet Boolean

Fired when page title is set during navigation. explicitSet is false when title is synthesized from file url.

Event: 'page-favicon-updated'

Returns:

  • favicons String[] - Array of URLs.

Fired when page receives favicon urls.

Event: 'enter-html-full-screen'

Fired when page enters fullscreen triggered by HTML API.

Event: 'leave-html-full-screen'

Fired when page leaves fullscreen triggered by HTML API.

Event: 'console-message'

Returns:

  • level Integer - The log level, from 0 to 3. In order it matches verbose, info, warning and error.
  • message String - The actual console message
  • line Integer - The line number of the source that triggered this console message
  • sourceId String

Fired when the guest window logs a console message.

The following example code forwards all log messages to the embedder's console without regard for log level or other properties.
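A sketch of such forwarding, assuming a webview element from the embedder page. The helper name forwardGuestConsole and the injectable log parameter are illustrative additions.

```javascript
// Sketch: forward every guest console message to the embedder's console.
// `log` defaults to console.log and can be swapped out, e.g. for testing.
function forwardGuestConsole(webview, log = console.log) {
  webview.addEventListener('console-message', (e) => {
    log('Guest page logged a message:', e.message);
  });
}

// In the embedder page:
//   forwardGuestConsole(document.querySelector('webview'));
```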

Event: 'found-in-page'

Returns:

  • result Object
    • requestId Integer
    • activeMatchOrdinal Integer - Position of the active match.
    • matches Integer - Number of Matches.
    • selectionArea Rectangle - Coordinates of first match region.
    • finalUpdate Boolean

Fired when a result is available for webview.findInPage request.

Event: 'new-window'

Returns:

  • url String
  • frameName String
  • disposition String - Can be default, foreground-tab, background-tab, new-window, save-to-disk and other.
  • options BrowserWindowConstructorOptions - The options which should be used for creating the new BrowserWindow.

Fired when the guest page attempts to open a new browser window.

The following example code opens the new url in system's default browser.
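A sketch of that behavior, assuming an Electron version that emits new-window on the webview. The opener is passed in as a parameter (in an Electron page it would typically be shell.openExternal); the helper name and the choice to allow only http: and https: URLs are assumptions made for safety.

```javascript
// Sketch: open guest-requested windows in the system browser instead.
// `openExternal` is injected; in Electron it would be require('electron').shell.openExternal.
function openNewWindowsExternally(webview, openExternal) {
  webview.addEventListener('new-window', (e) => {
    let protocol;
    try {
      protocol = new URL(e.url).protocol;
    } catch (err) {
      return; // not a parseable absolute URL, so ignore it
    }
    // Only hand web URLs to the operating system.
    if (protocol === 'http:' || protocol === 'https:') {
      openExternal(e.url);
    }
  });
}

// In the embedder page:
//   const { shell } = require('electron');
//   openNewWindowsExternally(document.querySelector('webview'), (url) => shell.openExternal(url));
```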

Event: 'will-navigate'

Returns:

  • url String

Emitted when a user or the page wants to start navigation. It can happen when the window.location object is changed or a user clicks a link in the page.

This event will not emit when the navigation is started programmatically with APIs like <webview>.loadURL and <webview>.goBack.

It is also not emitted during in-page navigation, such as clicking anchor links or updating the window.location.hash. Use did-navigate-in-page event for this purpose.

Calling event.preventDefault() does NOT have any effect.

Event: 'did-navigate'

Returns:

  • url String

Emitted when a navigation is done.

This event is not emitted for in-page navigations, such as clicking anchor links or updating the window.location.hash. Use did-navigate-in-page event for this purpose.

Event: 'did-navigate-in-page'

Returns:

  • isMainFrame Boolean
  • url String

Emitted when an in-page navigation happened.

When in-page navigation happens, the page URL changes but does not cause navigation outside of the page. Examples of this occurring are when anchor links are clicked or when the DOM hashchange event is triggered.

Event: 'close'

Fired when the guest page attempts to close itself.

The following example code navigates the webview to about:blank when the guest attempts to close itself.
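A sketch of that handler, relying on the fact that writing to src initiates a top-level navigation. The helper name blankOnClose is illustrative.

```javascript
// Sketch: when the guest tries to close itself, replace it with a blank page.
function blankOnClose(webview) {
  webview.addEventListener('close', () => {
    webview.src = 'about:blank';
  });
}

// In the embedder page:
//   blankOnClose(document.querySelector('webview'));
```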

Event: 'ipc-message'

Returns:

  • channel String
  • args any[]

Fired when the guest page has sent an asynchronous message to embedder page.

With sendToHost method and ipc-message event you can communicate between guest page and embedder page:
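A sketch of a ping/pong round trip between the two pages. The function names and the 'ping' and 'pong' channel names are illustrative; ipcRenderer is passed in rather than required so the guest-side helper can live in a preload script and be exercised on its own.

```javascript
// Embedder page side: listen for replies, then send 'ping' to the guest.
function pingGuest(webview, onReply) {
  webview.addEventListener('ipc-message', (event) => {
    onReply(event.channel, event.args);
  });
  webview.send('ping');
}

// Guest page side (for example in a preload script): answer each 'ping'
// by sending 'pong' back to the embedder with sendToHost.
function answerPings(ipcRenderer) {
  ipcRenderer.on('ping', () => {
    ipcRenderer.sendToHost('pong');
  });
}

// Embedder: pingGuest(document.querySelector('webview'), (channel) => console.log(channel));
// Guest:    answerPings(require('electron').ipcRenderer);
```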

Event: 'crashed'

Fired when the renderer process is crashed.

Event: 'plugin-crashed'

Returns:

  • name String
  • version String

Fired when a plugin process is crashed.

Event: 'destroyed'

Fired when the WebContents is destroyed.

Event: 'media-started-playing'

Emitted when media starts playing.

Event: 'media-paused'

Emitted when media is paused or done playing.

Event: 'did-change-theme-color'

Returns:

  • themeColor String

Emitted when a page's theme color changes. This is usually due to encountering a meta tag such as <meta name='theme-color' content='#ff0000'>.

Event: 'update-target-url'

Returns:

  • url String

Emitted when mouse moves over a link or the keyboard moves the focus to a link.

Event: 'devtools-opened'

Emitted when DevTools is opened.

Event: 'devtools-closed'

Emitted when DevTools is closed.

Event: 'devtools-focused'

Emitted when DevTools is focused / opened.

If you’ve ever visited a website and thought the information was useful but the data wasn’t available through an API, I have some good news for you. You can scrape that website data using Node.js!

Web scraping refers to the collection of data from a website without relying on its API or any other service. If you can visit a website in your browser, then you can visit that website through code.

All websites are built from HTML, CSS, and Javascript. If you open up the developer tools on a website, then you’ll see the HTML code of that website.

Scraping a website therefore comes down to two steps: fetching the site’s HTML, then extracting the content you want from that HTML.

This article will guide you through an introduction to web scraping using Javascript and Node.js. You’ll create a Node.js script that visits HackerNews and saves the post titles and links to a CSV file.

Important concepts

Before we get started, you should know about a few concepts relevant to web scraping. These concepts are DOM elements, query selectors, and the developer tools/inspector.

DOM Elements

DOM elements are the building blocks of HTML code. They make up the content of an HTML page and include headings, paragraphs, images, and many others. When scraping websites, you find web content by searching for the DOM elements it is defined within.

Query Selectors

Query selectors are methods available in the browser and in JavaScript that allow you to select DOM elements. After you select them, you can read their data or manipulate them, for example by changing the text or CSS properties. When scraping the web, you use query selectors to select the DOM elements you want to read from.

Developer Tools

Chrome, Firefox, and other browsers ship with built-in tools that make it easier for developers to work with websites. You can use the developer tools to find the DOM elements that hold the content you want, and then select those elements with code.

Different tools/libraries you can use for web scraping

There are many different tools and libraries you can use to scrape the web using Javascript and Node.js. Here is a list of some of the most popular tools.

Cheerio

Cheerio is a library that allows you to use jQuery-like syntax on the server. Cheerio is often paired with a library like request or request-promise to read HTML code using jQuery on the server.

Nightmare.js

Nightmare.js is a high-level browser automation library that can be used to interact with a website before you scrape the data. For example, you may want to fill in a form and submit it before scraping the resulting page. Nightmare.js allows you to do this with an easy-to-use API.

Puppeteer

Puppeteer is a Node.js library that can run headless Chrome to do automation tasks. Puppeteer can do things such as:

  • Generate screenshots and PDFs of pages.
  • Automate form submission, UI testing, keyboard input, etc.
  • Test Chrome Extensions.
  • And more.

Axios

Axios is a popular library for making requests over the web. It is robust and has many features, and fetching a website’s HTML content takes only a single call. It’s often used in combination with a library like Cheerio for scraping the web.

Tutorial

In this tutorial, we’ll be scraping the front-page of HackerNews to get the post titles and links and save them to a CSV file.

Prerequisites

  • Node.js installed on your computer.
  • Basic understanding of Javascript and Node.js.

1. Project setup

To start, we’ll need to set up a Node.js project. In your terminal, change directories into an empty directory and type:

yarn init -y

Or

npm init -y

Either command initializes a new Node.js project. The -y flag accepts the defaults and skips the questions that the initializer would otherwise ask.

We’ll need to install two dependencies for this project: Cheerio and Axios.

In your terminal, type:

yarn add cheerio axios

That will install the packages in your project.

Now let’s get something printing on the screen.

Create a new file called scraper.js in your project directory and add the following code to the file:

-- CODE language-js --
console.log('Hello world!');

Next, in your terminal run the command:

node scraper

And you should see the text Hello world! in your terminal.

2. See what DOM elements we need using the developer tools

Now that our project is set up, we can visit HackerNews and inspect the code to see which DOM elements we need to target.

Visit HackerNews, right-click on the page, and press “Inspect” to open the developer tools.

Since we want the title and URL, we can search for their DOM elements by pressing Control + Shift + C to select an element. When you hover over an element on the website after pressing Control + Shift + C then the element will be highlighted and you can see information about it.

If you click the highlighted element then it will open up in the developer tools.

This anchor tag has all the data we need. It contains the title and the href of the link. It also has a class of storylink so what we need to do is select all the elements with a class of storylink in our code and then extract the data we want.

3. Use Cheerio and Axios to get HTML data from HackerNews

Now it’s time to start using Cheerio and Axios to scrape HackerNews.

Delete the hello world console log and add the packages to your script at the top of your file.

-- CODE language-js --
const cheerio = require('cheerio');
const axios = require('axios');

Next, we want to call the Axios get method to request the HackerNews website and fetch its HTML data.

That code looks like this:

-- CODE language-js --
// Request the front page and print the raw HTML string
axios.get('https://news.ycombinator.com/').then((response) => {
  console.log(response.data);
});

If you run your script now, then you should see a large string of HTML code.

Here is where Cheerio comes into play.

We want to load this HTML code into a Cheerio variable. With that variable, we’ll be able to run jQuery-like methods on the HTML code.

That code looks like:

-- CODE language-js --
axios.get('https://news.ycombinator.com/').then((response) => {
  // Parse the HTML string so it can be queried with selectors
  let $ = cheerio.load(response.data);
});

The $ is the variable that contains the parsed HTML code ready for use.

Since we know that the .storylink class is where our data lies, we can find all of the elements that have a .storylink class using the $ variable. That looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  console.log($('.storylink'));
});

If you run your code now, you’ll see a large object that is a Cheerio object. Next, we will run methods on this Cheerio object to get the data we want.

4. Get the title and link using Cheerio

Since there are many DOM elements containing the class storylink, we want to loop over them and work with each individual one.

Cheerio makes this simple with an each method. This looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    console.log(i); // position of the element in the selection
    console.log(e); // the raw element object
  });
});

i is the index of the current element within the selection, and e is the element itself.

What this does is loop over all the elements containing the storylink class and within the loop, we can work with each individual element.

Since we want the title and URL, we can access them using text and attr methods provided by Cheerio. That looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    let title = $(e).text(); // visible text of the anchor
    let link = $(e).attr('href'); // value of its href attribute
    console.log(title);
    console.log(link);
  });
});

If you run your code now, you should see a large list of post titles and their URLs!
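Some of the href values collected this way may be relative paths rather than full URLs (on HackerNews, links to on-site discussion pages are an example). The following is a minimal sketch of normalizing them with Node's built-in URL class. The helper name toAbsoluteUrl and the BASE_URL constant are illustrative and not part of the original script.

```javascript
// Base address of the scraped page (assumed to match the axios.get call).
const BASE_URL = 'https://news.ycombinator.com/';

// Hypothetical helper: resolve an href against the page it was found on.
// Absolute URLs pass through unchanged; relative ones are joined to the base.
function toAbsoluteUrl(href, baseUrl = BASE_URL) {
  return new URL(href, baseUrl).href;
}

console.log(toAbsoluteUrl('item?id=1'));
console.log(toAbsoluteUrl('https://example.com/post'));
```

Inside the each loop, this could be applied as toAbsoluteUrl($(e).attr('href')).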


Next, we’ll save this data in a CSV file.

5. Save the title and link into a CSV file

Creating CSV files in Node.js is easy. We just need to import a module called fs into our code and run some methods. fs is available with Node so we don’t have to install any new packages.

At the top of your code add the fs module and create a write stream.

-- CODE language-js --const fs = require('fs');
const writeStream = fs.createWriteStream('hackernews.csv');

What this does is it creates a file called hackernews.csv and prepares your code to write to it.

Next, we want to create some headers for the CSV file. This looks like:

-- CODE language-js --writeStream.write(`Title,Link\n`);

What we’re doing here is writing a single line with the string Title,Link followed by \n, the newline character.

This prepares the CSV with headings.
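For a small result set, an alternative to a write stream is to collect the lines in an array and write the file in one call. The sketch below assumes that approach. The writeCsv helper, the placeholder row, and the temporary file path are illustrative names, not part of the tutorial's script.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write a header line plus pre-formatted rows in one call.
function writeCsv(filePath, header, rows) {
  const lines = [header, ...rows];
  // Each line, including the last, ends with a newline character.
  fs.writeFileSync(filePath, lines.join('\n') + '\n', 'utf8');
}

// Example with a placeholder row, written to the system temp directory.
const demoPath = path.join(os.tmpdir(), 'hackernews-demo.csv');
writeCsv(demoPath, 'Title,Link', ['Example post,https://example.com/']);
console.log(fs.readFileSync(demoPath, 'utf8'));
```

The stream-based version used in this tutorial is the better fit when rows arrive over time or the output is large.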

What’s left is to write a line to the CSV file for every title and link. That looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    let title = $(e).text();
    let link = $(e).attr('href');
    // One row per post; \n ends the row.
    writeStream.write(`${title},${link}\n`);
  });
});

What we’re doing is writing one row per post, with the title and link in their respective columns, and ending each row with a newline so the next post starts on its own line.

The string syntax used here is called a template literal; it’s a convenient way to embed variables in a string.

If you run your code now, you should see a CSV file created in your directory with the title and link of all the posts from HackerNews.
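One caveat: a title that itself contains a comma would be split across two columns when the file is opened. A common remedy is to wrap such fields in double quotes and double any quotes inside them. The sketch below shows one way to do that; csvField and toCsvRow are hypothetical helper names, and the sample title is made up.

```javascript
// Hypothetical helper: quote a single CSV field when it contains a comma,
// a double quote, or a line break; embedded quotes are doubled.
function csvField(value) {
  const text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Hypothetical helper: join escaped fields into one newline-terminated row.
function toCsvRow(fields) {
  return fields.map(csvField).join(',') + '\n';
}

console.log(toCsvRow(['Show HN: Fast, small parser', 'https://example.com/']));
```

In the loop, the write call could then become writeStream.write(toCsvRow([title, link])).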

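Your final code is the snippets above combined in one file: the require lines for cheerio, axios, and fs, the write stream and its header line, and the axios.get call containing the each loop.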

Searching DuckDuckGo with Nightmare.js

In this tutorial, we'll be going over how to search DuckDuckGo with Nightmare.js and get the URLs of the first five results.

Nightmare.js is a browser automation library that uses Electron to mimic browser-like behavior. Using Nightmare, you can automate actions such as clicking, filling in forms, navigating to another page, and anything else you can do manually in a browser.

To do this, you use methods provided by Nightmare such as `goto`, `type`, `click`, `wait`, and many others that represent actions you would do with a mouse and keyboard.

Let's get started.

Prerequisites

- Node.js installed on your computer.
- Basic understanding of JavaScript and Node.js.
- Basic understanding of the DOM.

1. Project setup

If you've initialized a Node project as outlined in the previous tutorial, you can simply create a new file in the same directory called `nightmare.js`.

If you haven't created a new Node project, follow Step 1 in the previous tutorial to see how to create a new Node.js project.

Next, we'll add the nightmare.js package. In your terminal, type:

yarn add nightmare

Next, add a console.log message in `nightmare.js` to get started.

Your `nightmare.js` file should look like:

-- CODE language-js --console.log('Hello from nightmare!');

If you run `node nightmare` in your terminal, you should see:

Hello from nightmare!

2. See what DOM elements we need using the developer tools

Next, let's visit [DuckDuckGo.com](https://duckduckgo.com/) and inspect the website to see which DOM elements we need to target.

Visit DuckDuckGo and open up the developer tools by right-clicking on the form and selecting `Inspect`.

And from the developer tools, we can see that the ID of the input form is `search_form_input_homepage`. Now we know to target this ID in our code.

Next, we need to click the search button to complete the action of entering a search term and then searching for it.

Right-click the search icon on the right side of the search input and click `Inspect`.

From the developer tools, we can see that the ID of the search button is `search_button_homepage`. This is the next element we need to target in our Nightmare script.

3. Search for a term in DuckDuckGo using Nightmare.js

Now we have our elements and we can start our Nightmare script.

In your nightmare.js file, delete the console.log message and add the following code:

-- CODE language-js --const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });
nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .then();

What we're doing here is first importing the Nightmare module, and then creating the nightmare object to work with.

The nightmare object takes in some options that you can see more of [here](https://github.com/segmentio/nightmare#nightmareoptions). The option we care about is `show: true` because this shows the Electron instance and the actions being taken. You can hide the Electron window by setting `show` to `false`.

Next, we're telling the nightmare instance to take some actions. The actions are described using the methods `goto`, `type`, `click`, and `then`. They describe what we want nightmare to do.

First, we want it to go to the DuckDuckGo URL. Then, we want it to select the search form element and type 'web scraping'. Then, we want it to click the search button element. Finally, we call `then` because this is what makes the instance run.

If you run this script, you should see Nightmare create an Electron instance, go to duckduckgo.com, and then search for web scraping.

4. Get the URLs of the search results

The next step in this action is to get the URLs of the search results.

As you saw in the last step, Nightmare allows us to go to another page after taking an action like searching in a form, and then we can scrape the next page.

If you go to the browser and right-click a link in the search results page of DuckDuckGo, you'll see the element we need to target.

The result link we want carries two classes: `result__url` and `js-result-extras-url`.

To get DOM element data in Nightmare, we want to write our code in their `evaluate` method and return the data we want.

Update your script to look like this:

-- CODE language-js --
nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    return results;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });

What we added here is a `wait`, an `evaluate`, an `end`, a `catch`, and a `console.log` passed to `then`.

The `wait` pauses for a few seconds after searching so that we don't scrape a page that hasn't loaded yet. A fixed delay is simple but can be too short on a slow connection; Nightmare's `wait` also accepts a selector, in which case it waits until a matching element is present.

`evaluate` is where we write our scraping code. Here, we're getting all the elements with the classes `result__url js-result-extras-url` and returning them so they can be used in the `then` call.

`end` closes the Electron instance.

`then` receives whatever was returned from `evaluate`, and from there we can work with it like any other JavaScript value.

`catch` is where we catch errors and log them.

If you run this code, you should see an object logged.

-- CODE language-js --{
'0': { jQuery1102006895228087119576: 151 },
'1': { jQuery1102006895228087119576: 163 },
'2': { jQuery1102006895228087119576: 202 },
'3': { jQuery1102006895228087119576: 207 },
'4': { jQuery1102006895228087119576: 212 },
'5': { jQuery1102006895228087119576: 217 },
'6': { jQuery1102006895228087119576: 222 },
'7': { jQuery1102006895228087119576: 227 },
'8': { jQuery1102006895228087119576: 232 },
'9': { jQuery1102006895228087119576: 237 },
'10': { jQuery1102006895228087119576: 242 },
'11': { jQuery1102006895228087119576: 247 },
'12': { jQuery1102006895228087119576: 188 }
}

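This is the object returned from the evaluate method. It has one entry for each element selected by `document.getElementsByClassName('result__url js-result-extras-url');`, but the DOM elements themselves do not survive the trip from the page to Node.js, so only a stray key remains on each entry.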
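One way to picture why only those odd keys show up: the value returned from `evaluate` has to be copied from the page into the Node.js process, and a copy of that kind generally keeps plain own properties but not properties that come from an object's prototype, such as an anchor's `href`. The sketch below imitates this with a `JSON` round trip and a stand-in class. `FakeAnchor` is purely illustrative, and the exact copying mechanism Nightmare uses internally is an assumption here.

```javascript
// Stand-in for a DOM anchor: `href` is an accessor on the prototype,
// while `expando` is a plain own property (like the jQuery key in the log).
class FakeAnchor {
  constructor(url) {
    this.expando = 151;
    // Non-enumerable storage, so it is skipped when the object is copied.
    Object.defineProperty(this, 'url', { value: url, enumerable: false });
  }
  get href() {
    return this.url;
  }
}

const element = new FakeAnchor('https://example.com/');

// Copy the whole object the way a JSON round trip would.
const copied = JSON.parse(JSON.stringify(element));
console.log(copied); // only the own enumerable property remains

// Reading the value first yields a plain string, which copies without loss.
const copiedHref = JSON.parse(JSON.stringify(element.href));
console.log(copiedHref);
```

The practical consequence is that plain values such as strings should be extracted inside `evaluate` and returned, rather than the elements themselves.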

We don't want this object; we want the URLs of the first five results.

To get the URL of one of these elements, we index into the collection with `[]` and read its `href` property. This has to happen inside `evaluate`, while the elements are still live DOM nodes.


Update your code to look like this:

-- CODE language-js --nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    const urls = [];
    urls.push(results[2].href);
    urls.push(results[3].href);
    urls.push(results[4].href);
    urls.push(results[5].href);
    urls.push(results[6].href);
    return urls;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });

Since the first two elements are URLs of ads, we can skip them and go to elements 2-6.

What we're doing here is creating an array called `urls` and pushing five hrefs to it. We select an element in the collection using `[]` and read its `href` property. Then we return the URLs to be used in the `then` method.
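Indexing fixed positions throws a `TypeError` if the page returns fewer results than expected. A slightly more defensive sketch converts the collection to an array and slices it. `pickHrefs` and the stand-in `fakeResults` object are illustrative names, and the skip count of two simply mirrors the assumption above that the first two entries are ads.

```javascript
// Hypothetical helper: take `count` hrefs after skipping the first `skip` items.
// Works on any array-like value, such as an HTMLCollection.
function pickHrefs(elements, skip, count) {
  return Array.from(elements)
    .slice(skip, skip + count)
    .map((element) => element.href);
}

// Stand-in for a collection of three anchors.
const fakeResults = {
  length: 3,
  0: { href: 'https://ads.example/1' },
  1: { href: 'https://ads.example/2' },
  2: { href: 'https://example.com/result' },
};

console.log(pickHrefs(fakeResults, 2, 5));
```

Because the callback passed to `evaluate` runs inside the page rather than in Node.js, the body of `pickHrefs` would need to be written inline in that callback rather than referenced from the outer script.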

If you run your code now, you should see this log:

-- CODE language-js --[
'https://en.wikipedia.org/wiki/Web_scraping',
'https://www.guru99.com/web-scraping-tools.html',
'https://www.edureka.co/blog/web-scraping-with-python/',
'https://www.webharvy.com/articles/what-is-web-scraping.html',
'https://realpython.com/tutorials/web-scraping/',
];

And this is how you get the first five URLs of a search in DuckDuckGo using Nightmare.js.

Your final code is the `require` and `Nightmare({ show: true })` lines from step 3 followed by the chain above.

# What we covered

- Introduction to web scraping with Node.js

- Important concepts for web scraping.

- Popular web scraping libraries in Node.js

- A tutorial about how to scrape the HackerNews frontpage and save data to a CSV file.

- A tutorial about how to get the search results on DuckDuckGo using Nightmare.js.
