This repository contains the data and code of Deep-CENIC, a deep learning classifier for the ideational impact of research papers. Our Decision Support Systems paper, "Classifying the ideational impact of Information Systems review articles: A content-enriched deep learning approach", evaluates Deep-CENIC in the context of information systems (IS) review articles (RAs) and provides further details on the data and model.
The repository was developed with reproducibility in mind, offering a starting point to reuse the following components:
- Definitions and implementation of NLP features (see table in the following Data section)
- Coding of ideational impact as a gold-standard for training and evaluating ML algorithms
- Blueprint for a citation analysis repository structure (based on Cookiecutter Data Science)
The dataset contains IS RAs published in major IS journals (Lowry et al. 2013) between 2000 and 2014. A forward search was conducted on a sample of the original dataset, retaining only citing publications that cite an RA from the IS business value domain and appeared in a journal included in the Senior Scholars' basket of journals. The citation data was extracted from Google Scholar and Web of Science. A large-scale coding of the ideational impact of IS RAs was conducted, which serves as the input for the Deep-CENIC model. The data is structured into three stages:
- raw
- interim
- processed
Due to copyright restrictions some data has been removed from the repository or is presented in truncated form:
- Raw TEI files: TEI files contain the full text of published papers and have therefore been removed. Refer to GROBID for details on how to generate TEI files from PDF papers.
- raw/LR.csv, interim/LR.csv and interim/CP.csv contain paper abstracts which have been truncated.
- interim/CITATION.csv contains citation sentences which have been truncated, and only a random sample has been retained in the repository.
The final dataset (FEATURE_FRAME.csv) is fully available:
Feature | Description | Data Type | Source |
---|---|---|---|
citation_key_lr | Bibtex citation key | String | raw |
citation_key_cp | Bibtex citation key | String | raw |
focal_citations | Number of citations toward the focal RA | Integer | summarize_citation_df |
textual | Number of textual citations | Integer | is_textual_citation |
separate | Number of standalone citations | Integer | is_separate |
comp_sup | Number of comparative and superlative clauses | Integer | has_comp_sup |
prp | Number of personal pronouns | Integer | has_1st_3rd_prp |
pos_0 | Appearances of POS pattern 0 | Integer | find_pos_patterns |
pos_1 | Appearances of POS pattern 1 | Integer | find_pos_patterns |
pos_2 | Appearances of POS pattern 2 | Integer | find_pos_patterns |
pos_3 | Appearances of POS pattern 3 | Integer | find_pos_patterns |
pos_4 | Appearances of POS pattern 4 | Integer | find_pos_patterns |
pos_5 | Appearances of POS pattern 5 | Integer | find_pos_patterns |
sentence_popularity amin | Number of different citations in the citing sentence (min aggregate) | Integer | get_popularity |
sentence_popularity amax | Number of different citations in the citing sentence (max aggregate) | Integer | get_popularity |
sentence_popularity mean | Number of different citations in the citing sentence (mean aggregate) | Double | get_popularity |
context_popularity amin | Number of different citations in the citing context (min aggregate) | Integer | get_popularity |
context_popularity amax | Number of different citations in the citing context (max aggregate) | Integer | get_popularity |
context_popularity mean | Number of different citations in the citing context (mean aggregate) | Double | get_popularity |
sentence_density amin | Focal citations divided by total citations in the citing sentence (min aggregate) | Double | get_density |
sentence_density amax | Focal citations divided by total citations in the citing sentence (max aggregate) | Double | get_density |
sentence_density mean | Focal citations divided by total citations in the citing sentence (mean aggregate) | Double | get_density |
context_density amin | Focal citations divided by total citations in the citing context (min aggregate) | Double | get_density |
context_density amax | Focal citations divided by total citations in the citing context (max aggregate) | Double | get_density |
context_density mean | Focal citations divided by total citations in the citing context (mean aggregate) | Double | get_density |
position_in_sentence amin | Position of the reference in citing sentence (min aggregate) | Double | get_position_in_sentence |
position_in_sentence amax | Position of the reference in citing sentence (max aggregate) | Double | get_position_in_sentence |
position_in_sentence mean | Position of the reference in citing sentence (mean aggregate) | Double | get_position_in_sentence |
sentence_neg amin | Negative sentiment of the citing sentence with regards to the RA (min aggregate) | Double | get_sentiment |
sentence_neg amax | Negative sentiment of the citing sentence with regards to the RA (max aggregate) | Double | get_sentiment |
sentence_neg mean | Negative sentiment of the citing sentence with regards to the RA (mean aggregate) | Double | get_sentiment |
sentence_neu amin | Neutral sentiment of the citing sentence with regards to the RA (min aggregate) | Double | get_sentiment |
sentence_neu amax | Neutral sentiment of the citing sentence with regards to the RA (max aggregate) | Double | get_sentiment |
sentence_neu mean | Neutral sentiment of the citing sentence with regards to the RA (mean aggregate) | Double | get_sentiment |
sentence_pos amin | Positive sentiment of the citing sentence with regards to the RA (min aggregate) | Double | get_sentiment |
sentence_pos amax | Positive sentiment of the citing sentence with regards to the RA (max aggregate) | Double | get_sentiment |
sentence_pos mean | Positive sentiment of the citing sentence with regards to the RA (mean aggregate) | Double | get_sentiment |
sentence_compound amin | Compound sentiment of the citing sentence with regards to the RA (min aggregate) | Double | get_sentiment |
sentence_compound amax | Compound sentiment of the citing sentence with regards to the RA (max aggregate) | Double | get_sentiment |
sentence_compound mean | Compound sentiment of the citing sentence with regards to the RA (mean aggregate) | Double | get_sentiment |
context_neg amin | Negative sentiment of the citing context with regards to the RA (min aggregate) | Double | get_sentiment |
context_neg amax | Negative sentiment of the citing context with regards to the RA (max aggregate) | Double | get_sentiment |
context_neg mean | Negative sentiment of the citing context with regards to the RA (mean aggregate) | Double | get_sentiment |
context_neu amin | Neutral sentiment of the citing context with regards to the RA (min aggregate) | Double | get_sentiment |
context_neu amax | Neutral sentiment of the citing context with regards to the RA (max aggregate) | Double | get_sentiment |
context_neu mean | Neutral sentiment of the citing context with regards to the RA (mean aggregate) | Double | get_sentiment |
context_pos amin | Positive sentiment of the citing context with regards to the RA (min aggregate) | Double | get_sentiment |
context_pos amax | Positive sentiment of the citing context with regards to the RA (max aggregate) | Double | get_sentiment |
context_pos mean | Positive sentiment of the citing context with regards to the RA (mean aggregate) | Double | get_sentiment |
context_compound amin | Compound sentiment of the citing context with regards to the RA (min aggregate) | Double | get_sentiment |
context_compound amax | Compound sentiment of the citing context with regards to the RA (max aggregate) | Double | get_sentiment |
context_compound mean | Compound sentiment of the citing context with regards to the RA (mean aggregate) | Double | get_sentiment |
self_citation | The RA and the CA share at least one author | Boolean | is_self_citation |
title_similarity | Semantic similarity of the RA and CA titles | Double | get_title_similarity |
abstract_similarity | Semantic similarity of the RA and CA abstracts | Double | get_abstract_similarity |
SYN | Knowledge developed in the cited RA includes synthesis | Boolean | raw |
TT | Knowledge developed in the cited RA includes theory testing | Boolean | raw |
TB | Knowledge developed in the cited RA includes theory building | Boolean | raw |
RG | Knowledge developed in the cited RA includes identification of research gaps | Boolean | raw |
CRI | Knowledge developed in the cited RA includes critical assessment | Boolean | raw |
RA | Knowledge developed in the cited RA includes development of a research agenda | Boolean | raw |
total_references | Total number of references in the CA | Integer | extract_total_references |
total_citations | Total number of citations in the CA | Integer | extract_total_citations |
weighted_citation_count | RA citations divided by total citations | Double | __main__ |
mention_positions_10 | Number of citations in the first 10% of the paper | Integer | summarize_citation_df |
mention_positions_20 | Number of citations in the second 10% of the paper | Integer | summarize_citation_df |
mention_positions_30 | Number of citations in the third 10% of the paper | Integer | summarize_citation_df |
mention_positions_40 | Number of citations in the fourth 10% of the paper | Integer | summarize_citation_df |
mention_positions_50 | Number of citations in the fifth 10% of the paper | Integer | summarize_citation_df |
mention_positions_60 | Number of citations in the sixth 10% of the paper | Integer | summarize_citation_df |
mention_positions_70 | Number of citations in the seventh 10% of the paper | Integer | summarize_citation_df |
mention_positions_80 | Number of citations in the eighth 10% of the paper | Integer | summarize_citation_df |
mention_positions_90 | Number of citations in the ninth 10% of the paper | Integer | summarize_citation_df |
mention_positions_100 | Number of citations in the tenth 10% of the paper | Integer | summarize_citation_df |
heading_category_NA | Number of citations in the rest of the paper | Integer | summarize_citation_df |
heading_category_intro | Number of citations in the introduction section | Integer | summarize_citation_df |
heading_category_background | Number of citations in the background section | Integer | summarize_citation_df |
heading_category_theory | Number of citations in the theory section | Integer | summarize_citation_df |
heading_category_methods | Number of citations in the methods section | Integer | summarize_citation_df |
heading_category_results | Number of citations in the results section | Integer | summarize_citation_df |
heading_category_implications | Number of citations in the implications section | Integer | summarize_citation_df |
heading_category_appendix | Number of citations in the appendix | Integer | summarize_citation_df |
ref_in_title | Citation in the title of the paper | Boolean | check_ref_in_title |
ref_in_heading | Number of citations in section headings | Integer | ref_in_heading |
ref_in_figure_description | Number of citations in figure captions | Integer | ref_in_figDesc |
ref_in_table_description | Number of citations in table captions | Integer | ref_in_tableDesc |
USE | Ideational impact target variable | Boolean | raw |
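
Several features in the table are per-citation values that are collapsed into `amin`, `amax` and `mean` columns, and the `mention_positions_*` columns count citations per tenth of the paper. The following is a minimal sketch of that idea, not the repository's implementation: the helper names (`aggregate`, `density`, `mention_position_counts`) are hypothetical, the input numbers are made up, and the half-open binning of relative positions is an assumption.

```python
from statistics import mean


def aggregate(values):
    """Collapse per-citation values into an amin/amax/mean triple."""
    return {"amin": min(values), "amax": max(values), "mean": mean(values)}


def density(focal_citations, total_citations):
    """Focal citations divided by total citations in a sentence or context."""
    return focal_citations / total_citations


def mention_position_counts(relative_positions):
    """Count citations per tenth of the paper.

    Each position is assumed to be a relative offset in [0, 1]. Bins are
    half-open, [0.0, 0.1) -> mention_positions_10, and 1.0 is placed in
    the last bin (assumption, not taken from the repository).
    """
    counts = {f"mention_positions_{d}": 0 for d in range(10, 101, 10)}
    for pos in relative_positions:
        tenth = min(int(pos * 10), 9)
        counts[f"mention_positions_{(tenth + 1) * 10}"] += 1
    return counts


# Made-up example: one citing paper mentions the focal RA in three sentences.
sentence_densities = [density(1, 1), density(1, 2), density(1, 4)]
features = {
    f"sentence_density {name}": value
    for name, value in aggregate(sentence_densities).items()
}
counts = mention_position_counts([0.05, 0.07, 0.95])
```

Here `features` holds the three `sentence_density` columns for the example pair, and `counts` places two mentions in the first tenth and one in the last tenth of the paper.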
- Download and install Docker
- Clone this repository: `git clone https://github.com/julianprester/deep-cenic.git`
- Build the Docker container: `make dockerize`
- Run the code: `make run`
The focus of this repository is on developing and providing an ideational impact dataset. Thus, it does not include the machine learning and deep learning models trained on the data. For details regarding the standard implementations using the Keras and TensorFlow libraries, refer to the paper and the figure below.
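
Since the models themselves are not part of the repository, a first step toward training your own classifier is to separate the feature columns from the identifier columns and the `USE` target. The sketch below shows one way this could be done with the standard library only. It is an illustration under assumptions: the helper names are hypothetical, the sample rows are made up, and the encoding of Boolean columns as `True`/`False` strings is a guess that should be checked against the actual FEATURE_FRAME.csv.

```python
import csv
import io

ID_COLUMNS = ("citation_key_lr", "citation_key_cp")
TARGET = "USE"


def to_number(value):
    """Convert a CSV cell to float; 'True'/'False' map to 1.0/0.0 (assumed encoding)."""
    lowered = value.strip().lower()
    if lowered in ("true", "false"):
        return 1.0 if lowered == "true" else 0.0
    return float(lowered)


def split_features_target(csv_file):
    """Return (feature_names, keys, X, y) from a FEATURE_FRAME-style CSV file object."""
    reader = csv.DictReader(csv_file)
    excluded = ID_COLUMNS + (TARGET,)
    feature_names = [name for name in reader.fieldnames if name not in excluded]
    keys, X, y = [], [], []
    for row in reader:
        keys.append(tuple(row[name] for name in ID_COLUMNS))
        X.append([to_number(row[name]) for name in feature_names])
        y.append(to_number(row[TARGET]))
    return feature_names, keys, X, y


# Made-up rows with only two of the feature columns, for illustration.
sample = io.StringIO(
    "citation_key_lr,citation_key_cp,focal_citations,self_citation,USE\n"
    "ReviewA2005,PaperX2010,3,False,True\n"
    "ReviewA2005,PaperY2012,1,True,False\n"
)
feature_names, keys, X, y = split_features_target(sample)
```

With the real file, the same function could be called on an opened file object for FEATURE_FRAME.csv, and `X` and `y` could then be passed to a classifier of your choice. Empty cells would raise a `ValueError` in `to_number`, so missing values need to be handled before conversion.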
This repository is part of a broader research program, comprising the following work:
- Schryen, G., Wagner, G., & Benlian, A. (2015). Theory of Knowledge for Literature Reviews: An Epistemological Model, Taxonomy and Empirical Analysis of IS Literature. In: Proceedings of the 36th International Conference on Information Systems, Fort Worth, Texas. link.
- Wagner, G., Prester, J., Roche, M. P., Benlian, A., & Schryen, G. (2016). Factors Affecting the Scientific Impact of Literature Reviews: A Scientometric Study. In: Proceedings of the 37th International Conference on Information Systems, Dublin, Ireland. link.
- Prester, J., Wagner, G., & Schryen, G. (2018). Classifying the Ideational Impact of IS Review Articles: A Natural Language Processing Based Approach. In: Proceedings of the 39th International Conference on Information Systems, San Francisco, California. link.
- Schryen, G., Wagner, G., Benlian, A., & Paré, G. (2020). A Knowledge Development Perspective on Literature Reviews: Validation of a new Typology in the IS Field. Communications of the Association for Information Systems, 46. link.
- Schryen, G., Wagner, G., & Benlian, A. (2020). Distinguishing Knowledge Impact from Citation Impact: A Methodology for Analysing Knowledge Impact for the Literature Review Genre. Available at SSRN: link.
- Hassan, N. R., Prester, J., & Wagner, G. (2020). Seeking Out Clear And Unique Information Systems Concepts: A Natural Language Processing Approach. In: Proceedings of the 28th European Conference on Information Systems, Marrakech, Morocco. link.
- Prester, J., Wagner, G., Schryen, G., & Hassan, N. R. (2020). Classifying the Ideational Impact of Information Systems Review Articles: A Content-enriched Deep Learning Approach. Decision Support Systems, forthcoming. link.