Extend Twitter Crawler to store downloaded tweets as JSON and to explain Data Format #96

carlosparadis · 2018-02-15T03:46:44Z

This issue concerns the following notebook: https://github.com/sailuh/perceive/blob/master/Crawlers/Twitter/cve_twitter_extraction.ipynb

Saving as JSON

Currently, neither the Stream code block nor the Search code block stores any of the data objects on disk; the tweets are only held in memory, in the variable searched_tweets_srch. We need to address this so that the tweets are stored as JSON files.

Add a new code block just below the following one:

APP_KEY = myvars["APP_KEY"].rstrip()
APP_SECRET = myvars["APP_SECRET"].rstrip()
OAUTH_TOKEN = myvars["OAUTH_TOKEN"].rstrip()
OAUTH_TOKEN_SECRET = myvars["OAUTH_TOKEN_SECRET"].rstrip()

In it, define a save_path variable holding the path to the folder where the tweets will be saved.

Pass that variable to a new function named save_tweets_json(save_path), which stores each tweet as its own .json file. For example, if you download 200 tweets, the folder should end up with 200 files.
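A minimal sketch of what such a function could look like, not a definitive implementation. It assumes the tweets are available as a list of plain dicts, so it takes a second `tweets` argument in addition to `save_path` (that extra argument is an assumption, not part of the signature proposed above). The file name is a placeholder based on `id_str`, since the naming convention is still to be decided.

```python
import json
import os


def save_tweets_json(tweets, save_path):
    """Write each tweet dict to its own .json file inside save_path."""
    # Create the target folder if it does not exist yet.
    os.makedirs(save_path, exist_ok=True)
    written = []
    for tweet in tweets:
        # Placeholder file name: the tweet's string ID (an assumption).
        file_path = os.path.join(save_path, tweet["id_str"] + ".json")
        with open(file_path, "w", encoding="utf-8") as handle:
            # ensure_ascii=False keeps emoji and accented text readable on disk.
            json.dump(tweet, handle, ensure_ascii=False, indent=2)
        written.append(file_path)
    return written
```

If searched_tweets_srch holds a list of tweet dicts, the call would be `save_tweets_json(searched_tweets_srch, save_path)`. Because the file name depends only on the tweet, saving the same tweet twice overwrites the file rather than creating a duplicate.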

To save the files, we need a file name convention. The name is best based on an ID that we can easily trace back to the original tweet, simply by pasting it into a URL and viewing the tweet in the browser. Which brings us to the next task:

Data Model

In order to find the desired file name, we need to understand the data model. Currently, the notebook does not explain the data model, and contains only passing references to it (CTRL+F URL 1 and URL 2). You should add the necessary references to the notebook and explain, overall, what the API response looks like with respect to our interests. For example, from a quick look at the data you provided me, this appears to be the format of each tweet:

    'created_at': 'Sun Feb 11 07:29:53 +0000 2018',
    'id': 962589493164433415,
    'id_str': '962589493164433415',
    'text': '7th Grade Division - Pool A - Utah Force def. CVE Elite 60-55 #youngbloodelite @exposurebball',
    'truncated': False,
    'entities': {
        'hashtags': [{
            'text': 'youngbloodelite',
            'indices': [62, 78]
        }],
        'symbols': [],
        'user_mentions': [{
            'screen_name': 'exposurebball',
            'name': 'Exposure Basketball',
            'id': 917226631,
            'id_str': '917226631',
            'indices': [79, 93]
        }],
        'urls': []
    },
    'metadata': {
        'iso_language_code': 'en',
        'result_type': 'recent'
    },
    'source': '<a href="https://exposureevents.com" rel="nofollow">Exposure Events</a>',
    'in_reply_to_status_id': None,
    'in_reply_to_status_id_str': None,
    'in_reply_to_user_id': None,
    'in_reply_to_user_id_str': None,
    'in_reply_to_screen_name': None,
    'user': {
        'id': 774048329969860608,
        'id_str': '774048329969860608',
        'name': 'Youngblood League',
        'screen_name': 'youngbloodelite',
        'location': 'South Jordan, UT',
        'description': 'Most competitive league in the State of Utah 5th-8th grade. Invite Only league with Weekly stats, All-league awards, giveaways, prizes, gear, & Mini clinics',
        'url': None,
        'entities': {
            'description': {
                'urls': []
            }
        },
        'protected': False,
        'followers_count': 255,
        'friends_count': 285,
        'listed_count': 0,
        'created_at': 'Fri Sep 09 00:54:37 +0000 2016',
        'favourites_count': 9,
        'utc_offset': None,
        'time_zone': None,
        'geo_enabled': False,
        'verified': False,
        'statuses_count': 498,
        'lang': 'en',
        'contributors_enabled': False,
        'is_translator': False,
        'is_translation_enabled': False,
        'profile_background_color': 'F5F8FA',
        'profile_background_image_url': None,
        'profile_background_image_url_https': None,
        'profile_background_tile': False,
        'profile_image_url': 'http://pbs.twimg.com/profile_images/774245070824275968/optcBQeV_normal.jpg',
        'profile_image_url_https': 'https://pbs.twimg.com/profile_images/774245070824275968/optcBQeV_normal.jpg',
        'profile_link_color': '1DA1F2',
        'profile_sidebar_border_color': 'C0DEED',
        'profile_sidebar_fill_color': 'DDEEF6',
        'profile_text_color': '333333',
        'profile_use_background_image': True,
        'has_extended_profile': True,
        'default_profile': True,
        'default_profile_image': False,
        'following': False,
        'follow_request_sent': False,
        'notifications': False,
        'translator_type': 'none'
    },
    'geo': None,
    'coordinates': None,
    'place': None,
    'contributors': None,
    'is_quote_status': False,
    'retweet_count': 0,
    'favorite_count': 0,
    'favorited': False,
    'retweeted': False,
    'lang': 'en'

Consider the existing tweet we discussed previously:

https://twitter.com/patrickwardle/status/912254053849079808

It appears the URL is of the form /user/status/some_id

From a quick inspection of the example I provided above, it seems the fields (the first one nested under 'user'):

'screen_name': 'youngbloodelite'
'id': 962589493164433415

could be "plugged into" user and some_id. Indeed, if you do so, you will be able to open that tweet in your browser:

https://twitter.com/youngbloodelite/status/962589493164433415

However, I am not yet sure whether this "reconstruction process" works for all the tweets we download. You should assess whether it does, perhaps by finding out what the word status means in the URL and checking whether all our tweets are "statuses".
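The reconstruction described above can be sketched as a small helper. `tweet_url` is a hypothetical name, and the sketch assumes every tweet dict carries `id_str` and a nested `user` with `screen_name`, as in the example data. Whether that holds for every downloaded tweet is exactly the open question.

```python
def tweet_url(tweet):
    """Rebuild the browser URL of a tweet from its author and its ID."""
    # The author's handle lives under 'user', not under 'entities'.
    screen_name = tweet["user"]["screen_name"]
    # id_str is the string form of 'id', so no integer formatting is needed.
    return "https://twitter.com/{}/status/{}".format(screen_name, tweet["id_str"])


example = {
    "id_str": "962589493164433415",
    "user": {"screen_name": "youngbloodelite"},
}
print(tweet_url(example))
# prints https://twitter.com/youngbloodelite/status/962589493164433415
```

Running this over all downloaded tweets and opening a sample of the resulting URLs would be one way to check the assumption.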

Depending on your findings, it may suffice to name each file in the form <screen_name>_<id> to disambiguate them and avoid storing duplicate tweets.
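As an illustration of that naming convention, here is a hedged sketch. `tweet_filename` and `is_already_saved` are hypothetical helpers, and the `.json` extension and the use of `id_str` rather than `id` are assumptions.

```python
import os


def tweet_filename(tweet):
    """Build a <screen_name>_<id>.json file name for a tweet dict."""
    return "{}_{}.json".format(tweet["user"]["screen_name"], tweet["id_str"])


def is_already_saved(tweet, save_path):
    """Return True if a file for this tweet already exists in save_path."""
    return os.path.exists(os.path.join(save_path, tweet_filename(tweet)))


example = {
    "id_str": "962589493164433415",
    "user": {"screen_name": "youngbloodelite"},
}
print(tweet_filename(example))
# prints youngbloodelite_962589493164433415.json
```

One caveat to weigh: a screen name can be changed by its owner while the tweet ID stays fixed, so the ID is the part of the name that actually identifies a duplicate.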

Misc

Please follow up on this issue here instead of on Slack, now that it has been formalized. Make sure you read through and fully understand how to submit Pull Requests in the format we use in this repo: https://github.com/sailuh/perceive/blob/master/CONTRIBUTING.md

carlosparadis added stat:next type: new feature prio:normal labels Feb 15, 2018

carlosparadis assigned riyachanduka Feb 15, 2018

riyachanduka added a commit to riyachanduka/perceive that referenced this issue Mar 22, 2018

sailuh#96 extending twitter crawler to store tweets in json format

8669c88
