Web-Scraper

What the app does

This app scrapes the Jerusalem Post website (www.jpost.com/breaking-news) for breaking news headlines.

JPOST Breaking News Header

JPOST Breaking News List

The headline, link, reporter, and date of the report are captured, stored, and rendered to the app's home page. Here is how a headline is displayed in the Web-Scraper app.

Unsaved Article

Articles can be marked as 'saved' by clicking on the SAVE ARTICLE button.

Clicking on the headline itself opens the linked article in a new browser tab, as displayed below.

Linked Article

The Home page navbar has links to the Home page and Saved articles.

Unsaved Article Navbar

Click on the Saved Articles link to view the list of saved articles. Each saved article has two buttons: one to remove it from the list (DELETE FROM SAVED) and one to add notes to it (ARTICLE NOTES).

Saved Article

Here is the Notes modal, which is built with bootbox. Notes can be saved to the list or removed from it.

Note

The Saved Articles navbar has a link to return to the Home Page, as well as a CLEAR ARTICLES button. In this version of the app, this button removes the list of headlines from the webpage without deleting them from the database.

Technology

The dependencies for this Node.js app are:

  • axios
  • bootbox
  • cheerio
  • express
  • express-handlebars
  • mongoose
  • morgan
  • request

The database used by the app is MongoDB. The database name is mongoHeadLines. It stores two collections, Headlines and Notes, which are defined in two model files. To relate notes to a particular headline, the Notes model stores a reference to the Headline model in its _headlineId field.
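The relationship between the two collections can be illustrated with plain objects. This is only a sketch of the record shapes, not the app's Mongoose models: the `_headlineId` field and the three Headline fields come from this README, while the `_id` values, the `noteText` field, and the `notesForHeadline` helper are hypothetical names chosen for illustration.

```javascript
// Hypothetical sample records. Only _headlineId, headline, link, and
// reporterDate are named in the README; everything else is assumed.
const headlines = [
  {
    _id: "h1",
    headline: "Example headline",
    link: "/breaking-news/example",
    reporterDate: "By STAFF 2019-04-15 10:32:05",
  },
];

const notes = [
  { _id: "n1", _headlineId: "h1", noteText: "Follow up on this story" },
  { _id: "n2", _headlineId: "h2", noteText: "Belongs to another headline" },
];

// Return every note whose _headlineId points at the given headline.
function notesForHeadline(allNotes, headlineId) {
  return allNotes.filter((note) => note._headlineId === headlineId);
}

console.log(notesForHeadline(notes, headlines[0]._id));
```

In the real app the lookup would be a database query on the Notes collection filtered by `_headlineId`; the array filter above only mirrors that idea in memory.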

Web data is requested with Axios. Specific data elements are extracted from the returned HTML with Cheerio and stored in the MongoDB database.

Each record in the Headline collection has three fields:

  1. headline
  2. link
  3. reporterDate

The reporterDate field is created by slicing the <li> text returned after finding its parent <ul> tag.

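      // Headline text and link come from the anchor's title and href attributes.
      var headline = $(this).find("a").attr("title");
      var link = $(this).find("a").attr("href");
      // Combined reporter and date text from the <li> children of the <ul>.
      var rd = $(this).find('ul').children('li').text();
      var len = rd.length;
      // The last 19 characters are the date; everything before them is the reporter.
      var date = rd.slice(-19);
      var reporter = rd.slice(0, (len - 19));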
      

The date and reporter slices are concatenated into the reporterDate field:

      if (headline && link) {
          articles.push({
              headline: headline,
              link: link,
              reporterDate: reporter + " " + date
          });
      }
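The slicing and concatenation steps can also be sketched as two small functions that run without a live page. This is a minimal sketch, not the app's code: it assumes the scraped text always ends with a 19-character date stamp (as the `slice(-19)` call implies), and the helper names `splitReporterDate` and `buildArticle` and the sample string are made up for illustration.

```javascript
// Assumption: the text ends with a fixed-width, 19-character date stamp.
const DATE_LENGTH = 19;

// Split the combined text into its reporter and date parts.
function splitReporterDate(rd) {
  const text = (rd || "").trim();
  if (text.length <= DATE_LENGTH) {
    // Too short to hold a reporter name; treat the whole text as the date.
    return { reporter: "", date: text };
  }
  return {
    reporter: text.slice(0, text.length - DATE_LENGTH).trim(),
    date: text.slice(-DATE_LENGTH),
  };
}

// Build the record to store, or null when headline or link is missing.
function buildArticle(headline, link, rd) {
  if (!headline || !link) return null;
  const { reporter, date } = splitReporterDate(rd);
  return { headline, link, reporterDate: `${reporter} ${date}`.trim() };
}

// Hypothetical sample: reporter text immediately followed by the date stamp.
console.log(buildArticle("Example headline", "/example", "By REUTERS2019-04-15 10:32:05"));
```

Trimming the two parts guards against stray whitespace in the scraped text, and returning null for incomplete items keeps the caller's check to a single test before pushing to the articles array.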
      

The page rendering engine is Express Handlebars. The {{{body}}} placeholder in the main.handlebars layout is filled by the home and saved Handlebars view files.

Author

Alan Leverenz (awleverenz@gmail.com)
