Skip to content

narendranathe/Web-Scraping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 

Repository files navigation

Web-Scraping

This project demostrates web scraping all the reviews from "trip advisor" and "Yelp"


Yelp Web & Trip Advisor Scraping Project

Overview

This project is designed to scrape data from Yelp and Trip Advisor to analyze local businesses, reviews, and user interactions. By collecting data from Yelp, we aim to provide insights into customer preferences, popular services, and regional business trends.

Libraries Used

  • requests: For making HTTP requests to Yelp's website.
  • BeautifulSoup from bs4: For parsing HTML and extracting the necessary data.
  • pandas: For organizing the extracted data into structured formats like DataFrames, making it easier to analyze and export.
  • selenium: For automating web browser interaction, necessary for pages with dynamic content loaded by JavaScript.

Code Structure

The project is structured into several key components:

  1. URL Generation: Functions to construct search URLs based on different parameters (location, search query, etc.).
  2. Data Extraction: Code to request pages, parse the received HTML, and extract relevant data fields (business name, ratings, number of reviews, etc.).
  3. Data Processing: Scripts to clean and structure the scraped data into a usable format.
  4. Data Storage: Methods to save the processed data into files (CSV, Excel) or databases for further analysis.

Example

import requests
from bs4 import BeautifulSoup

# Example function to scrape business names and ratings
def scrape_yelp(search_query, location):
    url = f"https://www.yelp.com/search?find_desc={search_query}&find_loc={location}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    businesses = soup.find_all('h4', class_='businessName__09f24__3Wql2')
    for business in businesses:
        name = business.find('a').text
        rating = business.find('span', class_='display--inline__09f24__c6N_k').text
        print(f'Business Name: {name}, Rating: {rating}')

# Example usage
scrape_yelp('coffee shops', 'San Francisco, CA')

Challenges Faced

  • Dynamic Content Loading: Pages where content is loaded dynamically with JavaScript required using Selenium to simulate browser interactions.
  • IP Blocking: Frequent requests to Yelp's servers sometimes led to temporary IP blocking, necessitating the use of proxies or pause intervals between requests.
  • Data Structure Changes: Yelp's website structure changes occasionally, breaking the scraper. Regular maintenance and updates were required.

Outcomes

The project successfully retrieved and analyzed data for over 1,000 local businesses across several categories. Insights derived from this data were used to identify key factors contributing to business popularity and customer satisfaction.

Future Work

Future work I am performing sentiment analysis on the collected data, I will add the repo link soon, this repo implements more sophisticated data analysis techniques, expanding the scope of data collection to include user reviews, and automating the process of updating the scraper to adapt to website changes.


About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors