Describes collection and access info for dataset housed at Michigan re: tweets posted by/about politicians

casmlab/politicians-tweets

Twitter U.S. and India Politicians dataset

What's in the Data

We collect tweets posted by politicians in the U.S. and India and save the JSON provided by the Twitter API. Lists of politicians are generated by NivaDuck, software developed at Microsoft Research - India for automatically identifying accounts that belong to politicians.

As of April 21, 2021, the data includes:

Data Collection Process

Two scripts ran daily, one each for India and the U.S., to pull the new tweets posted every day by each politician in the respective list. For India, the list of accounts also includes journalists, media outlets, celebrities, and influencers.

You can view the scripts for collection in the scripts folder.
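
The scripts in the scripts folder are the authoritative implementation. As a rough illustration of the daily incremental pull described above, here is a minimal Python sketch. It assumes a hypothetical `fetch` callable (which in practice could wrap a Twitter API client such as tweepy), a hypothetical `since_ids` dictionary that remembers the newest tweet ID seen per account, and JSON Lines as the output format. None of these names come from the project's own code.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional, TextIO

# Hypothetical signature: fetch(screen_name, since_id) returns the tweet
# objects (as dicts) posted by screen_name that are newer than since_id.
Fetch = Callable[[str, Optional[int]], List[dict]]


def collect_new_tweets(
    accounts: Iterable[str],
    fetch: Fetch,
    since_ids: Dict[str, int],
    out: TextIO,
) -> int:
    """Append each account's new tweets to `out`, one JSON object per line.

    `since_ids` is updated in place so the next run only asks for tweets
    newer than the ones already saved. Returns the number of tweets written.
    """
    written = 0
    for screen_name in accounts:
        tweets = fetch(screen_name, since_ids.get(screen_name))
        for tweet in tweets:
            out.write(json.dumps(tweet) + "\n")
            written += 1
        if tweets:
            # Remember the newest tweet ID seen for this account.
            since_ids[screen_name] = max(t["id"] for t in tweets)
    return written
```

Passing the fetch function in as a parameter keeps the loop independent of any particular API client, so it can be exercised with a stub before being pointed at a real one.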

Data Access

The data is archived at the Social Media Archive (SOMAR) at ICPSR. Visit SOMAR to apply for access to the data.

Twitter User Metadata

We are manually checking all accounts NivaDuck identified and will provide periodic metadata updates.

Metadata Fields

See the codebook for a list of metadata fields, descriptions, variable types, valid values, etc.

Here's an example of the minimum metadata:

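| | id | id_str | screen_name | confirmed_account_type | state | twitter_name | real_name | bioguide | office_holder | party | district | level | woman | birthday | last_updated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 986781648 | 986781648 | jeffsessions | 1 | Alabama | Jeff Sessions | | | | | | | | | 4/20/21 |
| 29 | 1155335864 | 1155335864 | repdonaldpayne | 1 | New Jersey | Rep. Donald Payne Jr | Donald Payne | P000604 | 1 | 1 | 10 | 3 | FALSE | 12/17/58 | 4/20/21 |
| 74 | 2970462034 | 2970462034 | repkathleenrice | 1 | New York | Kathleen Rice | Kathleen Rice | R000602 | 1 | 1 | 4 | 3 | TRUE | 2/15/65 | 4/20/21 |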

Archived metadata files are available in the metadata folder as well.
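
For readers who want to work with the metadata programmatically, the following Python sketch loads rows shaped like the example above. It assumes the metadata is distributed as CSV with the header shown in the example, and that the single name in the `jeffsessions` row belongs under `twitter_name`. Both are assumptions, so check the codebook and the files in the metadata folder for the actual layout. `SAMPLE`, `parse_row`, and `load_metadata` are illustrative names only.

```python
import csv
import io

# Sample rows copied from the example above; the CSV layout is an assumption.
SAMPLE = """id,id_str,screen_name,confirmed_account_type,state,twitter_name,real_name,bioguide,office_holder,party,district,level,woman,birthday,last_updated
986781648,986781648,jeffsessions,1,Alabama,Jeff Sessions,,,,,,,,,4/20/21
1155335864,1155335864,repdonaldpayne,1,New Jersey,Rep. Donald Payne Jr,Donald Payne,P000604,1,1,10,3,FALSE,12/17/58,4/20/21
2970462034,2970462034,repkathleenrice,1,New York,Kathleen Rice,Kathleen Rice,R000602,1,1,4,3,TRUE,2/15/65,4/20/21
"""


def parse_row(row):
    """Turn blank cells into None and the TRUE/FALSE flag into a bool."""
    parsed = {key: (value if value != "" else None) for key, value in row.items()}
    if parsed["woman"] is not None:
        parsed["woman"] = parsed["woman"].upper() == "TRUE"
    return parsed


def load_metadata(handle):
    """Return a dict of metadata rows keyed by id_str (kept as a string)."""
    return {row["id_str"]: parse_row(row) for row in csv.DictReader(handle)}


metadata = load_metadata(io.StringIO(SAMPLE))
print(metadata["2970462034"]["real_name"], metadata["2970462034"]["woman"])
```

Keying on `id_str` rather than the numeric `id` avoids precision problems in tools that cannot represent 64-bit integers exactly, and it matches the `id_str` field that the Twitter API includes in tweet and user JSON.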

Contributors

Anmol Panda and Armand Burks wrote the scripts to collect and archive Tweets using the Twitter Public API (via tweepy). Joyojeet Pal conceived the project at MSR India with Anmol Panda, and his team regularly contributes new accounts for the India dataset. Libby Hemphill generated this documentation and manages the team who collect and update data and metadata. Evan Parres handled metadata updates, and Najmin Ahmed manually verified many state labels for 2020 election candidates.

This project was a continuation of work initiated by Joyojeet Pal and Anmol Panda at Microsoft Research India.

Funding for the staff and infrastructure was provided by

We are grateful to Ballot Ready for providing data on political candidates in the U.S.

How to Cite/Acknowledge the Data

Cite the Data

@techreport{panda2023,
author = {Panda, Anmol and Hemphill, Libby and Pal, Joyojeet},
year = {2023},
title = {Politweets: Tweets of politicians, celebrities, news media, and influencers from India and the United States},
institution = {Inter-university Consortium for Political and Social Research},
number = {SOMAR44-v1},
address = {Ann Arbor, MI},
note = {DOI:10.3886/xm68-rw44},
}

Cite the Papers

NivaDuck Paper

BibTeX

@inproceedings{10.1145/3400806.3400830,
author = {Panda, Anmol and Gonawela, A’ndre and Acharyya, Sreangsu and Mishra, Dibyendu and Mohapatra, Mugdha and Chandrasekaran, Ramgopal and Pal, Joyojeet},
title = {NivaDuck - A Scalable Pipeline to Build a Database of Political Twitter Handles for India and the United States},
year = {2020},
isbn = {9781450376884},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3400806.3400830},
doi = {10.1145/3400806.3400830},
abstract = {We present a scalable methodology to identify Twitter handles of politicians in a given region and test our framework in the context of Indian and US politics. The main contribution of our work is the list of the curated Twitter handles of 18500 Indian and 8000 US politicians. Our work leveraged machine learning-based classification and human verification to build a data set of Indian politicians on Twitter. We built NivaDuck, a highly precise, two-staged classification pipeline that leverages Twitter description text and tweet content to identify politicians. For India, we tested NivaDuck’s recall using Twitter handles of the members of the Indian parliament while for the US we used state and local level politicians in California state and San Diego county respectively. We found that while NivaDuck has lower recall scores, it produces large, diverse sets of politicians with precision exceeding 90 percent for the US dataset. We discuss the need for an ML-based, scalable method to compile such a dataset and its myriad use cases for the research community and its wide-ranging utilities for research in political communication on social media. },
booktitle = {International Conference on Social Media and Society},
pages = {200--209},
numpages = {10},
keywords = {united states, india, archive, twitter, politics},
location = {Toronto, ON, Canada},
series = {SMSociety'20}
}

APA 7th

Panda, A., Gonawela, A., Acharyya, S., Mishra, D., Mohapatra, M., Chandrasekaran, R., & Pal, J. (2020). NivaDuck - A Scalable Pipeline to Build a Database of Political Twitter Handles for India and the United States. International Conference on Social Media and Society, 200–209. https://doi.org/10.1145/3400806.3400830

DISMISS Paper

BibTeX

@article{Arya_De_Mishra_Shekhawat_Sharma_Panda_Lalani_Singh_Mothilal_Grover_Nishal_Dash_Shora_Akbar_Pal_2022, 
title={DISMISS: Database of Indian Social Media Influencers on Twitter}, 
volume={16}, 
url={https://ojs.aaai.org/index.php/ICWSM/article/view/19370}, 
DOI={10.1609/icwsm.v16i1.19370}, 
number={1}, 
journal={Proceedings of the International AAAI Conference on Web and Social Media}, 
author={Arya, Arshia and De, Soham and Mishra, Dibyendu and Shekhawat, Gazal and Sharma, Ankur and Panda, Anmol and Lalani, Faisal and Singh, Parantak and Mothilal, Ramaravind Kommiya and Grover, Rynaa and Nishal, Sachita and Dash, Saloni and Shora, Shehla and Akbar, Syeda Zainab and Pal, Joyojeet}, 
year={2022}, 
month={May}, 
pages={1201--1207}
}

APA 7th

Arya, A., De, S., Mishra, D., Shekhawat, G., Sharma, A., Panda, A., Lalani, F., Singh, P., Mothilal, R. K., Grover, R., Nishal, S., Dash, S., Shora, S., Akbar, S. Z., & Pal, J. (2022). DISMISS: Database of Indian Social Media Influencers on Twitter. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 1201-1207. https://doi.org/10.1609/icwsm.v16i1.19370

Reporting Issues and Getting Help

Use issues to report bugs and request changes to the collection process or metadata. We will not be providing hands-on help with the data, but we will try to answer questions if they come up.
