This repository has been archived by the owner on Apr 20, 2023. It is now read-only.

Releases: chris-greening/instascrape

v2.1.2

05 Feb 05:48

Fix

Fixed wrong profile_pic_url and profile_pic_url_hd scrapes when passing a sessionid to Profile.scrape

v2.1.0

20 Jan 01:28
fe7b6ce

New feature

instascrape.scrape_tools.scrape_posts

Takes a list of unscraped instascrape.Post objects and scrapes them with a variety of different configurations and options for usage. Returns successfully scraped posts as well as the posts that were not successfully scraped.

Sample Usage

from instascrape import Post, scrape_posts

# Some code creating a list of posts and valid header info, etc...

# Scrape the first 100 posts 
scraped_posts, unscraped_posts = scrape_posts(posts_list, headers=headers, limit=100)

# Scrape all posts since January 1st, 2020
import datetime 
scraped_posts, unscraped_posts = scrape_posts(posts_list, headers=headers, limit=datetime.date(2020, 1, 1))

etc.

Available arguments

  • posts : List[instascrape.Post]
    Required, list of unscraped Post objects
  • session : requests.Session
    Optional, custom requests.Session object
  • webdriver : selenium.webdriver.chrome.webdriver.WebDriver
    Optional, custom Selenium webdriver (overrides session if passed)
  • limit : Union[int, datetime.datetime]
    Optional, integer or date value to stop scraping at. Defaults to all posts
  • headers : dict
    Optional, dictionary of request headers
  • pause : int
    Optional, pause between scrapes
  • on_exception : str
    Optional, available options when an exception occurs are "raise", "pass", "return". Defaults to "raise".
  • silent : bool
    Optional, suppresses printed output while scraping. Defaults to True (no output)
  • inplace : bool
    Optional, directly modifies the post objects that are passed. Otherwise, creates a copy and returns lists of copies
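
The return contract described above (one list of scraped posts, one of posts left unscraped) can be illustrated with a small stand-in. This is only a sketch, not the library's implementation: scrape_many and fake_scrape are hypothetical names, the limit is assumed to count successful scrapes, and "pass" is assumed to mean "skip the failing post and continue" while "return" is assumed to mean "stop and return what has been collected so far". Date limits, pauses, sessions and webdrivers are left out.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def scrape_many(
    items: List[T],
    scrape_one: Callable[[T], None],
    limit: Optional[int] = None,
    on_exception: str = "raise",
) -> Tuple[List[T], List[T]]:
    """Hypothetical stand-in: returns (scraped, unscraped) lists."""
    scraped: List[T] = []
    unscraped: List[T] = list(items)
    for item in items:
        # Assumption: an integer limit counts successfully scraped items.
        if limit is not None and len(scraped) >= limit:
            break
        try:
            scrape_one(item)
        except Exception:
            if on_exception == "raise":
                raise
            if on_exception == "return":
                break  # assumed meaning: stop and hand back what we have
            continue  # assumed meaning of "pass": skip this item, keep going
        scraped.append(item)
        unscraped.remove(item)
    return scraped, unscraped


def fake_scrape(shortcode: str) -> None:
    # Simulates a scrape that fails for one particular post.
    if shortcode == "bad":
        raise ValueError("simulated failure")


done, left = scrape_many(["a", "bad", "b", "c"], fake_scrape, limit=2, on_exception="pass")
```

With these inputs, the failing item and the item beyond the limit both end up in the second list, which is why the second return value is worth inspecting after a run.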

v2.0.2

17 Jan 17:34

Fixes

  • Fixed default None argument for instascrape.scrapers.Profile.get_posts. Passing a specific amount worked, but passing nothing resulted in a comparison between NoneType and int

v2.0.0

17 Jan 04:34
78255b5

New features

Below is a list of new features

scrape tools

  • json_from_soup

Returns JSON Instagram data from BeautifulSoup

  • flatten_dict

Returns a flattened dictionary of all leaf nodes in a tree of JSON data

  • New flatten argument for json_from_* functions, returns a flattened dictionary
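
To make "a flattened dictionary of all leaf nodes" concrete, here is a minimal sketch of the general technique. It is not the library's flatten_dict: the function name flatten_leaves, the underscore-joined key paths, and the choice to treat lists as leaves are all assumptions made for illustration, and the library's key naming may differ.

```python
from typing import Any, Dict


def flatten_leaves(tree: Dict[str, Any], sep: str = "_", prefix: str = "") -> Dict[str, Any]:
    """Collect every leaf value of a nested dict into a single-level dict."""
    flat: Dict[str, Any] = {}
    for key, value in tree.items():
        # Join parent and child keys so leaves from different branches stay distinct.
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            flat.update(flatten_leaves(value, sep=sep, prefix=path))
        else:
            # Anything that is not a dict (including lists) is treated as a leaf.
            flat[path] = value
    return flat


nested = {
    "user": {"username": "google", "counts": {"followers": 10, "posts": 3}},
    "is_private": False,
}
flat = flatten_leaves(nested)
```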

scrapers

  • New inplace argument for the scrape method

Similar to the pandas inplace parameter except the default is True as opposed to pandas's False. By default, scrape will modify an instance inplace, setting attributes equal to the scraped data. If False, the current instance will remain untouched and scrape will instead return another instance with the scraped data. Useful for chaining methods

  • New session parameter for the scrape method

Allows passing of a custom session object

  • New webdriver parameter for the scrape method

Uses a webdriver for scraping the data instead of a session
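
The inplace behaviour described above can be illustrated with a stand-in class that makes no network requests. FakeScraper and its followers attribute are hypothetical, and returning None when inplace is True is an assumption borrowed from the pandas convention rather than something these notes state.

```python
import copy
from typing import Optional


class FakeScraper:
    """Hypothetical stand-in for a scraper object; performs no requests."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.followers: Optional[int] = None

    def scrape(self, inplace: bool = True) -> Optional["FakeScraper"]:
        # Work on the instance itself, or on a copy when inplace is False.
        target = self if inplace else copy.copy(self)
        target.followers = 123  # placeholder for data that would come from the page
        # Assumption: nothing is returned when the instance is modified in place.
        return None if inplace else target


profile = FakeScraper("google")
scraped_copy = profile.scrape(inplace=False)  # profile itself is untouched
followers_before = profile.followers
in_place_result = profile.scrape()  # default inplace=True fills in profile
```

Because the inplace=False form returns the scraped instance, it can sit in the middle of a method chain, which is the use case the note above mentions.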

Fixes

  • Fixed Post scraper KeyError that was occurring on all scrapes

Breaking changes

Below is a list of breaking changes to the library

  • Renamed instascrape.scrapers.json_tools to instascrape.scrapers.scrape_tools
  • Renamed parse_json_from_mapping function to parse_data_from_json
  • Removed FlatJSONDict, replaced with the flatten_dict function in scrape_tools that will flatten any dictionary
  • json_from_* functions now return a list of all JSON dictionaries from the page as opposed to just the first dictionary.
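
Code written against the old single-dictionary return value may need a small adjustment for the last change. One possible shim is sketched below; first_json_dict is a hypothetical helper, not part of instascrape, and it assumes the first element of the new list corresponds to the dictionary that used to be returned.

```python
from typing import Any, Dict, List, Union

JsonDict = Dict[str, Any]


def first_json_dict(result: Union[JsonDict, List[JsonDict]]) -> JsonDict:
    """Accept either the old (dict) or new (list of dicts) return shape."""
    if isinstance(result, list):
        if not result:
            raise ValueError("no JSON dictionaries were found on the page")
        # Assumption: the first entry matches what older versions returned.
        return result[0]
    return result


old_style = {"id": "1"}
new_style = [{"id": "1"}, {"id": "2"}]
same_data = first_json_dict(old_style) == first_json_dict(new_style)
```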

Non-breaking changes behind the scenes

Below is a list of everything that changed behind the scenes that has no bearing on the API

  • refactored out a lot of complexity from instascrape.core._static_scraper._StaticHtmlScraper's implementation, greatly improving code readability
  • Changed imports to reflect file moves
  • Reimplemented to rely more on reusable functions as opposed to static methods unnecessarily bound to classes
  • Changed how data is loaded into namespace when using the scrape method to make room for the inplace argument. inplace defaults to True, so this doesn't break any existing code but instead provides a new alternative.
  • updated documentation with docstrings

v1.7.1

26 Dec 16:37

Deprecated data point

Removed business_email as an available data point from the instascrape.scrapers.Profile scraper. Instagram seems to have removed the ability to view business emails from the web version of the platform, and all values were being returned as nan. This will be explored further in the future, but for now it is being removed.

v1.7.0

22 Dec 19:17

Deprecations

Officially removed deprecated methods from all scrapers as listed below

All scrapers

  • load instance method

instascrape.scrapers.Hashtag

  • from_profile class method

instascrape.scrapers.Post

  • from_shortcode class method

instascrape.scrapers.Profile

  • from_username class method

The functionality of all these methods is covered by the scrape instance method, so they are redundant and less powerful.

Documentation

  • Removed misleading documentation for outdated scrapers. Improved existing scrapers
  • Added and improved type hints

v1.6.1

14 Dec 06:02

Docs

Added type hints for better documentation

v1.6.0

14 Dec 03:16
9c8d610

New feature

Added instascrape.scrapers.IGTV for scraping IGTV posts. instascrape.scrapers.IGTV is a subclass of instascrape.scrapers.Post and thus inherits all of its methods and behaviors.

Sample usage:

from instascrape import IGTV 
google_igtv = IGTV('https://www.instagram.com/tv/CIrIIMYl8VQ/')
google_igtv.scrape()

v1.5.0

14 Dec 00:03
f4466ba

New feature

Introduced the Reel scraper for scraping Instagram reels. Reel is a subclass of Post so pretty much everything you expect from Post is available in Reel as well.

Sample usage:

from instascrape import Reel
sample_reel = Reel("https://www.instagram.com/reel/CIrJSrFFHM_/")
sample_reel.scrape()

Bug fixes

json_from_url

Added optional/default request headers argument to instascrape.scrapers.json_from_url

unit tests

Fixed some of the broken unit tests. The library was fine, but some of the tests were outdated and now need what appear to be required browser headers to run properly.

v1.4.0

10 Dec 22:50

New features

Location scraper

Ability to scrape Instagram Location pages.

Sample usage

from instascrape import Location 
url = "https://www.instagram.com/explore/locations/212988663/new-york-new-york/"
new_york = Location(url)
new_york.scrape()
print(f"{new_york.amount_of_posts:,} people have been to New York")
>>> 61,202,403 people have been to New York

Optional header for requests

Now supports passing an optional browser header to the scrape method of all scraper objects. Syntax is exactly the same as a header dict you would pass to requests.get.

The default header is

headers={"User-Agent": "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57"}

Sample usage is

from instascrape import Profile 
headers={"User-Agent": "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57"}
google = Profile("google")
google.scrape(headers=headers)

Fixes

It appears Instagram tightened restrictions overnight; all GET requests from the library were being returned 429 HTTP response status codes (Too Many Requests). Until this release, instascrape did not send browser headers or support passing them. The new default header and the option to pass custom headers seem to have restored library functionality for now. Keep an eye out for more robust session handling and better cookie support in later updates.