vedanta28/social-media-scraper

Social Media Scraper

Arnav Kumar Behera, Vedanta Mohapatra

October, 2022

This project implements a social media scraper for a small set of supported websites and pre-processes the extracted content. It is built on snscrape; for installation help, refer to the Download section of the README.md file on the linked GitHub page.
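
As a rough illustration of how snscrape can be driven, the sketch below maps scraped tweets to the Twitter fields listed under Features. The helper names `tweet_to_row` and `scrape_search` are hypothetical and are not taken from this repository; the sketch assumes snscrape's `TwitterSearchScraper` class and its `get_items()` generator, and that tweet objects expose `id`, `date`, `user.username` and either `rawContent` or `content`, depending on the installed release.

```python
from itertools import islice


def tweet_to_row(tweet):
    """Map one snscrape-style tweet object to a plain dict (hypothetical helper)."""
    # Assumption: newer snscrape releases name the text field rawContent,
    # older ones name it content, so try both.
    text = getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")
    return {
        "Unique ID": tweet.id,
        "Date": tweet.date,
        "User": tweet.user.username,
        "Tweet": text,
    }


def scrape_search(query, limit=10):
    """Collect up to `limit` tweets matching `query` (needs snscrape and network access)."""
    # Imported lazily so tweet_to_row can be used without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    items = sntwitter.TwitterSearchScraper(query).get_items()
    return [tweet_to_row(tweet) for tweet in islice(items, limit)]
```

Calling `scrape_search("python", limit=5)` would return a list of at most five dicts. Whether the call succeeds depends on the installed snscrape version and on the site still serving results to it.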

Features

  • Supported Websites: Twitter, Reddit
  • Supported Features for Twitter:
    • Scrape tweets from a particular user or from any search query.
    • For searches, tweets can be extracted in either latest or top order.
    • The following fields are extracted: ['Unique ID', 'Date', 'User', 'Tweet', 'Preproccesed Tweet']
  • Supported Features for Reddit:
    • Scrape comments or posts from a particular user, a subreddit, or any search query.
    • For comments and posts, the following fields are extracted: ['Unique ID', 'Date', 'Sub-reddit', 'Author', 'Title/Comment', 'Preprocessed Title/Comment']
    • For posts the title is extracted; for comments the comment body is extracted.
  • Preprocessing the data: The available pre-processing steps are removing URLs, expanding contractions, lower-casing the text, removing punctuation, removing numbers, removing extra white space, removing stop words, replacing emojis with words (implemented using the emoji package), and lemmatizing the words. Any combination of these steps can be applied, depending on the use case.
  • Storing the extracted data in .csv files.
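
The pre-processing and storage steps above can be sketched as a chain of small text functions followed by a CSV write. This is a minimal, standard-library-only illustration and not the repository's implementation: every function name here is hypothetical, the contraction and stop-word tables are tiny placeholders, and the emoji replacement and lemmatization steps are left out because they depend on third-party packages.

```python
import csv
import io
import re
import string

# Tiny illustrative tables (assumptions); a real pipeline would use fuller resources.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "in", "at"}


def remove_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", " ", text)


def expand_contractions(text):
    # Normalize curly apostrophes, then look up each whitespace-separated token.
    words = text.replace("\u2019", "'").split()
    return " ".join(CONTRACTIONS.get(word.lower(), word) for word in words)


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text):
    return re.sub(r"\d+", " ", text)


def remove_stop_words(text):
    return " ".join(word for word in text.split() if word.lower() not in STOP_WORDS)


def collapse_whitespace(text):
    return " ".join(text.split())


DEFAULT_STEPS = (
    remove_urls, expand_contractions, lowercase, remove_punctuation,
    remove_numbers, remove_stop_words, collapse_whitespace,
)


def preprocess(text, steps=DEFAULT_STEPS):
    """Apply the chosen steps in order; pass a shorter tuple to use a subset."""
    for step in steps:
        text = step(text)
    return text


def write_rows(rows, fieldnames, handle):
    """Write dict rows to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    columns = ["Unique ID", "Date", "Sub-reddit", "Author",
               "Title/Comment", "Preprocessed Title/Comment"]
    title = "I can't wait, see https://example.com at 10 AM!!"
    row = dict(zip(columns, ["abc1", "2022-10-01", "python", "alice",
                             title, preprocess(title)]))
    buffer = io.StringIO()
    write_rows([row], columns, buffer)
    print(buffer.getvalue())
```

Because each step is an ordinary function, a caller can choose any combination, for example `preprocess(text, steps=(lowercase, collapse_whitespace))`. When writing to a real file rather than a `StringIO`, open it with `newline=""` as the `csv` module documentation recommends.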

Software used

Contact Us
