Extend Twitter Crawler to store downloaded tweets as JSON and to explain Data Format #96

Open
carlosparadis opened this issue Feb 15, 2018 · 0 comments

carlosparadis commented Feb 15, 2018

This issue concerns the following notebook: https://github.com/sailuh/perceive/blob/master/Crawlers/Twitter/cve_twitter_extraction.ipynb

Saving as JSON

Currently, neither the Stream code block nor the Search code block stores any of the data objects on disk; the results are only held in the variable searched_tweets_srch. We need to address this so that the tweets are stored as JSON files.

Add a new code block just below:

APP_KEY = myvars["APP_KEY"].rstrip()
APP_SECRET = myvars["APP_SECRET"].rstrip()
OAUTH_TOKEN = myvars["OAUTH_TOKEN"].rstrip()
OAUTH_TOKEN_SECRET = myvars["OAUTH_TOKEN_SECRET"].rstrip()

And add a save_path variable with a path to a folder to save the contents.

Use that variable as input to a new function you will create, named save_tweets_json(save_path), which stores each tweet as its own .json file. For example, if you download 200 tweets, the folder should end up with 200 files.
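To make the expected behavior concrete, here is a minimal sketch of what save_tweets_json could look like. It assumes the downloaded tweets are available as a list of Python dicts (for example the contents of searched_tweets_srch), so the `tweets` parameter is my addition and not part of the signature above. It also uses the tweet's id_str as a stand-in file name until the naming convention is settled.

```python
import json
import os


def save_tweets_json(tweets, save_path):
    """Write each tweet dict to its own .json file inside save_path.

    `tweets` is an assumed extra parameter: an iterable of tweet dicts.
    Returns the list of file paths that were written.
    """
    # Create the target folder if it does not exist yet.
    os.makedirs(save_path, exist_ok=True)
    written = []
    for tweet in tweets:
        # Placeholder naming: one file per tweet, named after its id_str.
        file_path = os.path.join(save_path, tweet["id_str"] + ".json")
        with open(file_path, "w", encoding="utf-8") as handle:
            json.dump(tweet, handle, ensure_ascii=False, indent=2)
        written.append(file_path)
    return written
```

Calling save_tweets_json(searched_tweets_srch, save_path) would then be one way to wire it into the notebook, assuming searched_tweets_srch is a plain list of dicts at that point.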

In order to save the files, we need to adopt a file name convention. Ideally this is an ID that we can trace back to the original tweet simply by pasting it into a URL and viewing the tweet in the browser. Which brings us to the next task:

Data Model

In order to find the desired file name, we need to understand the data model. Currently, the notebook does not explain the data model and contains only passing references to it (CTRL+F URL 1 and URL 2). You should add the necessary references to the notebook and explain, at a high level, what the API response looks like with respect to our interests. For example, upon a quick look at the data you provided me, this appears to be the format of each tweet:

    'created_at': 'Sun Feb 11 07:29:53 +0000 2018',
    'id': 962589493164433415,
    'id_str': '962589493164433415',
    'text': '7th Grade Division - Pool A - Utah Force def. CVE Elite 60-55 #youngbloodelite @exposurebball',
    'truncated': False,
    'entities': {
        'hashtags': [{
            'text': 'youngbloodelite',
            'indices': [62, 78]
        }],
        'symbols': [],
        'user_mentions': [{
            'screen_name': 'exposurebball',
            'name': 'Exposure Basketball',
            'id': 917226631,
            'id_str': '917226631',
            'indices': [79, 93]
        }],
        'urls': []
    },
    'metadata': {
        'iso_language_code': 'en',
        'result_type': 'recent'
    },
    'source': '<a href="https://exposureevents.com" rel="nofollow">Exposure Events</a>',
    'in_reply_to_status_id': None,
    'in_reply_to_status_id_str': None,
    'in_reply_to_user_id': None,
    'in_reply_to_user_id_str': None,
    'in_reply_to_screen_name': None,
    'user': {
        'id': 774048329969860608,
        'id_str': '774048329969860608',
        'name': 'Youngblood League',
        'screen_name': 'youngbloodelite',
        'location': 'South Jordan, UT',
        'description': 'Most competitive league in the State of Utah 5th-8th grade. Invite Only league with Weekly stats, All-league awards, giveaways, prizes, gear, & Mini clinics',
        'url': None,
        'entities': {
            'description': {
                'urls': []
            }
        },
        'protected': False,
        'followers_count': 255,
        'friends_count': 285,
        'listed_count': 0,
        'created_at': 'Fri Sep 09 00:54:37 +0000 2016',
        'favourites_count': 9,
        'utc_offset': None,
        'time_zone': None,
        'geo_enabled': False,
        'verified': False,
        'statuses_count': 498,
        'lang': 'en',
        'contributors_enabled': False,
        'is_translator': False,
        'is_translation_enabled': False,
        'profile_background_color': 'F5F8FA',
        'profile_background_image_url': None,
        'profile_background_image_url_https': None,
        'profile_background_tile': False,
        'profile_image_url': 'http://pbs.twimg.com/profile_images/774245070824275968/optcBQeV_normal.jpg',
        'profile_image_url_https': 'https://pbs.twimg.com/profile_images/774245070824275968/optcBQeV_normal.jpg',
        'profile_link_color': '1DA1F2',
        'profile_sidebar_border_color': 'C0DEED',
        'profile_sidebar_fill_color': 'DDEEF6',
        'profile_text_color': '333333',
        'profile_use_background_image': True,
        'has_extended_profile': True,
        'default_profile': True,
        'default_profile_image': False,
        'following': False,
        'follow_request_sent': False,
        'notifications': False,
        'translator_type': 'none'
    },
    'geo': None,
    'coordinates': None,
    'place': None,
    'contributors': None,
    'is_quote_status': False,
    'retweet_count': 0,
    'favorite_count': 0,
    'favorited': False,
    'retweeted': False,
    'lang': 'en'

Consider the existing tweet we discussed previously:

https://twitter.com/patrickwardle/status/912254053849079808

It appears the URL format is of the form /user/status/some_id

Upon a quick inspection of the example I provided above, it seems the fields:

  • 'screen_name': 'youngbloodelite' (nested under 'user', not the 'exposurebball' entry under 'user_mentions')
  • 'id': 962589493164433415

could be "plugged into" user and some_id. Indeed, if you do so, you will be able to open the tweet in your browser:

https://twitter.com/youngbloodelite/status/962589493164433415
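As a rough sketch of this reconstruction, assuming each tweet is a dict shaped like the example above, the URL can be built from the author's screen_name and the tweet id. The function name tweet_url is only illustrative.

```python
def tweet_url(tweet):
    """Build the browser URL for a tweet dict shaped like the example above."""
    # The author's handle lives under 'user'; the screen_name entries under
    # 'entities' -> 'user_mentions' are other accounts mentioned in the text.
    screen_name = tweet["user"]["screen_name"]
    # id_str carries the same value as id, already as a string.
    return "https://twitter.com/{}/status/{}".format(screen_name, tweet["id_str"])


example = {
    "id_str": "962589493164433415",
    "user": {"screen_name": "youngbloodelite"},
}
print(tweet_url(example))
# prints https://twitter.com/youngbloodelite/status/962589493164433415
```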

However, I am not yet sure whether this "reconstruction process" would work for every tweet we download. You should assess whether it does, perhaps by finding out what the word status means in the URL and checking whether all of our tweets are "statuses".

Depending on your findings, it may suffice to name each file in the form <screen_name>_<id> to disambiguate them and avoid storing duplicate tweets.
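If that convention holds up, the naming and duplicate check could look roughly like the sketch below. The helper names tweet_file_name and is_already_saved are hypothetical, and it again assumes each tweet is a dict with 'user' -> 'screen_name' and 'id_str' as in the example above.

```python
import os


def tweet_file_name(tweet):
    """Return '<screen_name>_<id>.json' for a tweet dict."""
    return "{}_{}.json".format(tweet["user"]["screen_name"], tweet["id_str"])


def is_already_saved(tweet, save_path):
    """Return True if a file for this tweet already exists in save_path."""
    return os.path.exists(os.path.join(save_path, tweet_file_name(tweet)))
```

One thing to keep in mind when assessing this: a user can change their screen name, so the same tweet downloaded at two different times could end up under two different file names, while the id part stays the same.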

Misc

Please follow up on this issue here instead of on Slack now that it has been formalized. Make sure you read through and fully understand how to submit Pull Requests in the format we use in this repo: https://github.com/sailuh/perceive/blob/master/CONTRIBUTING.md


2 participants