2 July 2020
@dog_rates Twitter Data Wrangling and Visualization
The objective of this project is to perform data wrangling on twitter data of @dog_rates twitter account. This report consists a detailed explanation of how the data was gathered, assessed and cleaned. The @dog_rates account rates peoples’ dogs in a humorous way. The contents of this report are divided in three parts as mentioned previously.
The data was gathered from following sources:
- Twitter Archive File: ‘twitter-archive-enhanced.csv’
- Server: ‘https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv’
- Twitter API
This file (‘twitter-archive-enhanced.csv’) was provided by the instructor, it is the twitter ar-chive of the account @dog_rates. The file contains tweet-data (tweet_id, timestamp, tweet_url, text, rating_numerator, rating_denominator, etc.) as it stands on August 2017. The contents of the file were loaded into a pandas.DataFrame.
Requests to the server were made using the url – ‘https://d17h27t6h515a5.cloudfront.net/to-pher/2017/August/599fd2ad_image-predictions/image-predictions.tsv’. A get request was made to the server using python’s requests module. The contents of the response were written to a file image-prediction.txt. The file data was then read into a pandas.DataFrame. The data-set contains data like image_url, three dog_breed predictions, confidence-level of the predic-tions, etc.
A twitter developer account was made, Twitter API can be accessed only if one has a developer account. All the tweet_ids from source 1 were used to make get_status requests through a li-brary called tweepy. The json response for each tweet was written to a file named tweet_json.txt using python’s json library. Its contents were then loaded into a pandas.DataFrame. It contains tweet_id, retweet_count, favorite_count for each tweet the account has made until August 2017.
Visual analysis was performed and then summary of the data-set was printed. The summary depicted that the data-set contains 2356 rows and 17 columns. Certain columns like retweeted_status_id, retweeted_status_user_id, re-tweeted_status_timestamp were deemed as extraneous data. Some of the tweet_ids were actually retweets such rows were also deemed peripheral. Upon inspecting the data-types it was ascertained that the data-type of rating_numerator is integer which should have been float. Upon further assessment some values of rating_denominator were found to be more than ten, text columns of such rows were examined, it was evident that these ratings were actually used to rate more than one dog. While examining the text column the inaccuracy of some ratings due presence of dates like 9/11 and store names 7/11 in the text column was pointed out. After assessing the name column, invalid data like use of articles like ‘a’, ‘an’, ‘the’ instead of name and use of None instead of null values was found. It was also noted that some dogs were assigned multiple slang cate-gories.
After performing visual assessment and creating a sum-mary it was noted that this data-set contains 2074 rows and 12 columns. Some column names like p1, p1_dog, p1_conf were non-descriptive and confusing. The jpg_url col-umn had a couple of png image links along with 66 duplicate rows.
Data-set summary was inspected. The data-set had 2356 rows and 3 columns. Out of these only 2331 rows were non-null, the rest of the rows were classified as missing data. Upon inspecting the data-types it was found that the tweet_ids were floating point numbers instead of integers.
A copy of the data-set was created, and cleaning operations were performed on the copy. Extraneous columns were dropped, rows that contained retweet data were also dropped, rating_numerator and rating_denominator columns were cleaned by extracting correct data from the text column, the rows with rating_de-nominator greater than 10 were cleaned by dividing the rating_numerator and rat-ing_denominator with appropriate numbers. Then both the columns were merged into a single column rating. Another column, rating_num was created that stores the floating point rating for the purpose of analyses. The column name was purged for wrong names and “None” values were replaced with numpy.NaN. The dogs that were categorised into more than one slang category were assigned null value, also the multiple category col-umns were merged into a single column called slang_category.
A copy of the data-set was created, on which the cleaning operations were performed. The column names were confusing so they were change to prediction1, prediction1_confidence, prediction1_iscorrect and so on. Duplicated rows from the url column were dropped. All the prediction related columns were merged into two columns predicted_breed and prediction_confidence. The breed names were cleaned by replacing underscores with whitespace.
The column name id changed to tweet_id, then changed the data type of the same column to integer. Finally all three data-sets were merged on their tweet_id, this merged data-set is saved in a file named twitter_archive_master.csv
This report is a part of the Udacity Data Analyst Nanodegree’s Wrangle and Analyse Project. It includes a few insights along with visualizations about the twitter account @dog_rates. The twitter account rates peoples’ dogs in a humorous way and the ratings are mostly of the format 13/10, 14/10, etc. After wrangling the tweet data, it was analysed thoroughly and observations were recorded.
The account contains 5000 + tweets. This report however provides an analysis of around 1500 tweets. After grouping retweet count on the basis of dog breed it was observed that tweets about Golden Re-trievers gained the maximum amount of retweets.
The number of retweets and favourites were grouped by the month in which they oc-curred starting from November 2015 to July 2017. It can be observed that the number of favourites has always been more than the number of retweets for the account, dur-ing the mentioned period the account received most number of retweets in December 2015 and most number of favourites in July 2017. April 2016 was not a good month since the lowest retweets and favourites were from that period
It can be observed that tweet_count has decreased drastically since December 2012. The maxi-mum number of tweets were made during December 2012.
Median Rating was calculated for each dog breed, it was observed that the highest median rating was received by three dog breeds namely – Afghan Hound, Saluki and Bouvier Des Flandres which was 13/10.
- twitter-archive-enhanced.csv
- twitter-archive-master.csv
- tweet-json.txt
- wrangle_act.ipynb