Web-Scraping

This project demostrates web scraping all the reviews from "trip advisor" and "Yelp"

Yelp Web & Trip Advisor Scraping Project

Overview

This project is designed to scrape data from Yelp and Trip Advisor to analyze local businesses, reviews, and user interactions. By collecting data from Yelp, we aim to provide insights into customer preferences, popular services, and regional business trends.

Libraries Used

requests: For making HTTP requests to Yelp's website.
BeautifulSoup from bs4: For parsing HTML and extracting the necessary data.
pandas: For organizing the extracted data into structured formats like DataFrames, making it easier to analyze and export.
selenium: For automating web browser interaction, necessary for pages with dynamic content loaded by JavaScript.

Code Structure

The project is structured into several key components:

URL Generation: Functions to construct search URLs based on different parameters (location, search query, etc.).
Data Extraction: Code to request pages, parse the received HTML, and extract relevant data fields (business name, ratings, number of reviews, etc.).
Data Processing: Scripts to clean and structure the scraped data into a usable format.
Data Storage: Methods to save the processed data into files (CSV, Excel) or databases for further analysis.

Example

import requests
from bs4 import BeautifulSoup

# Example function to scrape business names and ratings
def scrape_yelp(search_query, location):
    url = f"https://www.yelp.com/search?find_desc={search_query}&find_loc={location}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    businesses = soup.find_all('h4', class_='businessName__09f24__3Wql2')
    for business in businesses:
        name = business.find('a').text
        rating = business.find('span', class_='display--inline__09f24__c6N_k').text
        print(f'Business Name: {name}, Rating: {rating}')

# Example usage
scrape_yelp('coffee shops', 'San Francisco, CA')

Challenges Faced

Dynamic Content Loading: Pages where content is loaded dynamically with JavaScript required using Selenium to simulate browser interactions.
IP Blocking: Frequent requests to Yelp's servers sometimes led to temporary IP blocking, necessitating the use of proxies or pause intervals between requests.
Data Structure Changes: Yelp's website structure changes occasionally, breaking the scraper. Regular maintenance and updates were required.

Outcomes

The project successfully retrieved and analyzed data for over 1,000 local businesses across several categories. Insights derived from this data were used to identify key factors contributing to business popularity and customer satisfaction.

Future Work

Future work I am performing sentiment analysis on the collected data, I will add the repo link soon, this repo implements more sophisticated data analysis techniques, expanding the scope of data collection to include user reviews, and automating the process of updating the scraper to adapt to website changes.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
Nixon_scrape_Trip_Advisor.ipynb		Nixon_scrape_Trip_Advisor.ipynb
Nixon_scrape_Yelp.ipynb		Nixon_scrape_Yelp.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Web-Scraping

Yelp Web & Trip Advisor Scraping Project

Overview

Libraries Used

Code Structure

Example

Challenges Faced

Outcomes

Future Work

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Web-Scraping

Yelp Web & Trip Advisor Scraping Project

Overview

Libraries Used

Code Structure

Example

Challenges Faced

Outcomes

Future Work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages