Adding notices about one sentence per line #2030

Open
wants to merge 4 commits into
base: main
from

Conversation

markcmiller86
Member

@markcmiller86 markcmiller86 commented Mar 6, 2024

I've added a notice to issue templates and checklist to ensure people submit content with one sentence per line.

@bernhold
Member

bernhold commented Mar 6, 2024

Would it be more reasonable to make this a request rather than a requirement? I personally find it very hard to conform to the one-sentence-per-line approach. I think it is a muscle-memory thing. In the absence of some tool to automate this for me, my submissions are unlikely to conform, and I'm definitely not going to manually reformat blog or other posts that I receive from others. Sorry, but that's how I see it.
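For illustration only, a tool of the kind mentioned above could be sketched roughly as follows. This is a minimal sketch, not something proposed in this pull request: the function name `one_sentence_per_line` and the boundary rule are assumptions, and the rule is deliberately naive (it will split wrongly after abbreviations such as "Dr. Smith").

```python
import re

# Hypothetical boundary rule: sentence-ending punctuation, then whitespace,
# then an uppercase letter, digit, quote, or opening parenthesis.
# This is a heuristic and mis-splits after abbreviations like "Dr. Smith".
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'(])')


def one_sentence_per_line(text: str) -> str:
    """Reflow each blank-line-separated paragraph to one sentence per line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    reflowed = []
    for paragraph in paragraphs:
        # Collapse existing hard wraps and runs of whitespace first.
        flat = " ".join(paragraph.split())
        reflowed.append("\n".join(_BOUNDARY.split(flat)))
    # Paragraph breaks are preserved as single blank lines.
    return "\n\n".join(reflowed)


if __name__ == "__main__":
    sample = "I find it hard. I think it is\nmuscle memory. Sorry!"
    print(one_sentence_per_line(sample))
```

A script like this would not handle Markdown constructs such as code fences, lists, or tables, so it is only a starting point for prose paragraphs.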

@markcmiller86
Member Author

@bernhold that may be, but the fact is that a very large part of our tooling (diff, merge, the GitHub suggestion feature; see more comments below) operates on quanta of lines. Not requiring our content to follow that basic standard cuts us off from using these tools effectively or, in some cases, at all. We don't write code this way. Or, if we do, nobody will attempt to enhance or debug it.

I feel like people should compose however they like but when content is submitted it really should (must) adhere to this format requirement.

As I see it, we are either enabling or inhibiting our ability to use a large ecosystem of tools, all designed to operate on ASCII text files structured as one sentence per line...

Tools that process ASCII text files, and that particularly benefit from a one-sentence-per-line format, span a wide array of applications across different domains, including natural language processing (NLP), data analysis, and software development. Here are several categories and examples of tools and applications that assume or benefit from this format:

  1. Natural Language Processing (NLP) Tools:

    • Tokenizers and Sentence Splitters: Tools like NLTK (Natural Language Toolkit), spaCy, and Stanford NLP provide tokenization functionalities that can directly benefit from or can be used to enforce a one-sentence-per-line format for more efficient processing.
    • Text Annotation Tools: Software like BRAT and WebAnno allow for the annotation of text. They can work more effectively with text that is structured one sentence per line, simplifying the annotation process.
  2. Machine Learning (ML) and AI Training Data Preparation:

    • Corpus Preparation Tools: For machine learning models, especially those related to text and language processing (e.g., text classification, sentiment analysis), having text files with one sentence per line simplifies the creation of training and testing datasets.
    • Data Preprocessing Scripts: Custom scripts (often written in Python, R, or another scripting language) that preprocess text for ML models typically benefit from a one-sentence-per-line structure to ensure consistency and ease of parsing.
  3. Text Processing and Manipulation Tools:

    • AWK, SED, GREP: Unix/Linux command-line tools for text processing can be more effectively used when each record (in this case, a sentence) occupies a single line, enabling simpler pattern matching and text manipulation.
    • Diff Tools: Tools that compare text files line by line (like diff in Unix/Linux) can more easily identify changes when text is structured as one sentence per line.
  4. Version Control for Text:

    • Git, SVN, and other version control systems: These systems track changes line by line. When dealing with documentation or any text content, having one sentence per line can make it easier to identify the specific changes made to the text over time.
  5. Documentation and Writing Tools:

    • Markdown Processors and Documentation Generators: Tools like Pandoc or Jekyll can work with text files for generating documentation or blog posts. Structuring content with one sentence per line can simplify version control and editing.
  6. Programming and Scripting Languages:

    • Text Parsing Libraries: Libraries in Python (like re for regular expressions), Perl, and other languages that are used for parsing and manipulating text can benefit from a one-sentence-per-line structure, making it easier to apply regular expressions or other parsing techniques.

This list is not exhaustive, as the specific utility of a one-sentence-per-line format can vary based on the task at hand and the preferences of the developer or analyst. Additionally, the choice to use this format may depend on the nature of the text being processed and the objectives of the text processing task.
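The point about line-based diff tools can be illustrated with a small sketch using Python's standard `difflib` and `textwrap` modules. The helper name `changed_lines`, the sample sentences, and the 40-column wrap width are all assumptions chosen for the example; the sketch counts how many lines a line-based comparison reports as removed and added after the same small edit, once with one sentence per line and once with hard-wrapped text.

```python
import difflib
import textwrap


def changed_lines(before: str, after: str) -> tuple[int, int]:
    """Return (removed, added) line counts from a line-based comparison."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    removed = added = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed += i2 - i1
            added += j2 - j1
    return removed, added


# Illustrative sentences; only the first one is edited.
s1 = "Diff tools compare files line by line."
s1_edited = "Diff and merge tools compare files line by line."
s2 = "Version control tracks changes the same way."
s3 = "Short lines keep each change local."

# One sentence per line: the edit touches exactly one line.
per_line = changed_lines("\n".join([s1, s2, s3]),
                         "\n".join([s1_edited, s2, s3]))

# Hard-wrapped paragraph: the same edit can shift later line breaks too.
wrapped = changed_lines(textwrap.fill(" ".join([s1, s2, s3]), width=40),
                        textwrap.fill(" ".join([s1_edited, s2, s3]), width=40))

if __name__ == "__main__":
    print("one sentence per line:", per_line)  # (1, 1)
    print("hard-wrapped at 40 columns:", wrapped)
```

With one sentence per line, the edit shows up as a single removed and a single added line, while the hard-wrapped version reports more lines as changed because the re-wrap moves the line breaks that follow the edit.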

@bernhold
Member

bernhold commented Mar 6, 2024

Hi Mark, I understand all of the reasons. There's no denying that it is desirable. But I think it is unenforceable in practice. So rather than reject submissions because they're not in this format (or lead people to think we might), I'd rather be realistic and express it as a preference.


2 participants