This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

Add additional info in cases where Wiktionary does not have a page with useful results #107

Open
wants to merge 5 commits into
base: master
108 changes: 108 additions & 0 deletions readme_modifications.md
@@ -0,0 +1,108 @@
### Some notes on the modifications made by bjkeefe to `core.py`

#### Motivation

I had been happily using WiktionaryParser for several months. One day, I was developing
an application where I wanted to be able to distinguish between two cases: (1) where
Wiktionary does not have any English definitions for a given word, and (2) where
Wiktionary does not have any entry at all.

I made a few modifications to `core.py` to support this distinction. The returned value
remains a `list` containing a `dict` in all cases. If the `word` and `language` passed to
`.fetch()` yield a Wiktionary entry, the results are the same as before.

#### So, what's new?

If Wiktionary has a page for `word` but no entry in the requested `language`, the returned
value is no longer an empty `list`. It is now a `list` containing a single `dict` whose only
key is `"additional_info"` and whose value is the `str` `"no <language> entry for <word>"`
(with the language name in lowercase).

If Wiktionary has no entry for `word` at all, the result has the same shape, except that the
value in the `dict` becomes `"Wiktionary does not yet have an entry for <word>"`.
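
A minimal sketch of how calling code might tell the cases apart, based only on the message
formats described above. The helper name `classify_fetch_result` and its return labels are
hypothetical and are not part of WiktionaryParser:

```python
def classify_fetch_result(result):
    """Classify the list returned by .fetch() (hypothetical helper)."""
    if not result:
        # Behaviour of the unmodified core.py: an empty list means "nothing found".
        return "not_found"
    info = result[0].get("additional_info")
    if info is None:
        # A regular entry with etymology, pronunciations and definitions.
        return "found"
    if info.startswith("Wiktionary does not yet have an entry for "):
        return "no_entry"
    if info.startswith("no "):
        # Message of the form "no <language> entry for <word>".
        return "no_language_entry"
    return "unknown"


print(classify_fetch_result([{"additional_info": "no english entry for aimai"}]))
```

Matching on message prefixes is brittle; it is used here only because the message text is the
sole signal the modified return value carries.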


#### Source code differences

Here is the diff output (ignoring whitespace) between the new version and the original:

```
$ diff -w core.py core.py.abo
119c119
< return [{"additional_info": f"no {language} entry for {self.current_word}"}]
---
> return []
126c126
< return [{"additional_info": f"no {language} entry for {self.current_word}"}]
---
> return []
285,288d284
< search_string = "Wiktionary does not yet have an entry for " + word
< result = self.soup.find_all(string=re.compile(search_string))
< if result:
< return [{"additional_info": search_string}]
```
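
One thing to note about the last hunk: `word` is concatenated into a pattern passed to
`re.compile()` without escaping, so a word containing regex metacharacters (for example `c++`
or `a.b`) may fail to compile or may match unintended text. The sketch below shows the same
check with `re.escape()`. It operates on a plain string rather than on the BeautifulSoup tree
the patch uses, and the function name `missing_entry_message` is hypothetical:

```python
import re


def missing_entry_message(page_text, word):
    """Return the additional_info result if the page says the entry is missing.

    Hypothetical stand-in for the check added to .fetch(); page_text is assumed
    to be the visible text of the Wiktionary page.
    """
    search_string = "Wiktionary does not yet have an entry for " + word
    # re.escape() makes every character of the message match literally.
    if re.search(re.escape(search_string), page_text):
        return [{"additional_info": search_string}]
    return None


print(missing_entry_message("Wiktionary does not yet have an entry for c++.", "c++"))
```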

#### Testing

The new version of `core.py` passes all tests in `tests/test_core.py`.

Because I didn't have time to modify the existing tests, I wrote some quick tests that
explicitly test the modifications I made. These are in `tests/test_core_new.py`. This
file expects to be run with `pytest`, because I am less familiar with `unittest`. All of
the tests pass when run against the new version of `core.py`.

NB: the new tests will NOT all pass if run against the old version of `core.py`.

Also, I wrote a little script called `driver.py`. This is intended for interactive testing.

```
$ py driver.py -h
usage: driver.py [-h] [-m] word

Check <word> against Wiktionary using WiktionaryParser

positional arguments:
  word                  the word to look up

options:
  -h, --help            show this help message and exit
  -m, --multiple-languages
                        if present, look up <word> for several languages; otherwise, just English
```
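
The source of `driver.py` is not reproduced in this README. As a rough sketch, an `argparse`
setup along the following lines would accept the same arguments as the help text above; the
function name `build_parser` is an assumption, and the real script may be organized differently:

```python
import argparse


def build_parser():
    """Build a parser matching the usage shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="driver.py",
        description="Check <word> against Wiktionary using WiktionaryParser",
    )
    parser.add_argument("word", help="the word to look up")
    parser.add_argument(
        "-m",
        "--multiple-languages",
        action="store_true",
        help="if present, look up <word> for several languages; otherwise, just English",
    )
    return parser


args = build_parser().parse_args(["-m", "receive"])
print(args.word, args.multiple_languages)
```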

#### Organization

All of the above -- the modifications and new files -- are in a new `git` branch named
`additional_info`.

#### Minor problem with backwards compatibility

If someone has written code that checks the result returned by `.fetch()` like this ...

```
result = parser.fetch(word)
if not result:  # --or-- if len(result) == 0:
    do_something()
```

... the check will no longer trigger, because the returned list is never empty. It could be
changed to, for example:

```
result = parser.fetch(word)
if "definitions" not in result[0]:
    do_something()
```

[added 2023-06-22 08:53] It occurs to me that there might be a way around this problem:
change the call signature of `.fetch()` by adding the keyword arg `allow_messages=False`.
Calls to `.fetch()` in existing code would, of course, not pass this arg, and since the
default would be not to allow the return of messages, a not-found condition would
return an empty list, as before. However, if the call in new code were `.fetch("word",
allow_messages=True)`, then a not-found condition would return what I was after:
additional info about the not-found result.
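
A sketch of that idea, written as a wrapper around any fetch-like callable so that it runs
without network access. The name `fetch_compat` is hypothetical, and this is not the
implementation proposed for `core.py`; it only illustrates the intended behaviour of
`allow_messages`:

```python
def fetch_compat(fetch, word, allow_messages=False):
    """Call fetch(word) and hide additional_info results unless requested.

    fetch is assumed to behave like the modified WiktionaryParser.fetch().
    """
    result = fetch(word)
    # A not-found result is a one-element list whose dict has only this key.
    is_message_only = len(result) == 1 and set(result[0]) == {"additional_info"}
    if is_message_only and not allow_messages:
        return []  # old behaviour: empty list on not-found
    return result


def fake_fetch(word):
    # Stand-in for parser.fetch(), returning a not-found message for any word.
    return [{"additional_info": f"Wiktionary does not yet have an entry for {word}"}]


print(fetch_compat(fake_fetch, "aimiable"))
print(fetch_compat(fake_fetch, "aimiable", allow_messages=True))
```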

Let me know if you want me to implement that.

#### Questions, comments, criticisms

Please feel free to email me: [email protected]. Thanks for reading!
103 changes: 103 additions & 0 deletions tests/test_core_new.py
@@ -0,0 +1,103 @@
"""A few quick tests of the modifications made by bjkeefe to core.py.
These tests will NOT all succeed if run against the master branch of WiktionaryParser,
at least as of 2023-06-17.
"""
try:
    import pytest
except ModuleNotFoundError:
    print("test_core_new.py: these tests require pytest to be importable, so this won't work:")
    print(" $ py test_core_new.py")
    print()
    print("pytest is not part of the standard library. Once it is installed, for example")
    print("with `pip install pytest`, this should work:")
    print(" $ pytest test_core_new.py")
    raise SystemExit()

from wiktionaryparser import WiktionaryParser


def test_core_new_default_language():
    parser = WiktionaryParser()

    # A word that has several English definitions
    result = parser.fetch("receive")
    assert type(result) == list
    assert len(result) == 1
    assert type(result[0]) == dict
    assert "etymology" in result[0]
    assert "pronunciations" in result[0]
    assert "definitions" in result[0]
    assert len(result[0]["definitions"]) > 0
    assert "additional_info" not in result[0]

    # A word that has a Wiktionary entry, because it is a common misspelling
    result = parser.fetch("recieve")
    assert type(result) == list
    assert len(result) == 1
    assert type(result[0]) == dict
    assert "etymology" in result[0]
    assert "pronunciations" in result[0]
    assert "definitions" in result[0]
    assert len(result[0]["definitions"]) > 0
    assert "additional_info" not in result[0]

    # Two words that have a Wiktionary entry, but no English definitions
    for word in ["abilitanti", "aimai"]:
        result = parser.fetch(word)
        assert type(result) == list
        assert len(result) == 1
        assert type(result[0]) == dict
        assert "etymology" not in result[0]
        assert "pronunciations" not in result[0]
        assert "definitions" not in result[0]
        assert "additional_info" in result[0]
        assert result[0]["additional_info"] == f"no english entry for {word}"

    # A "word" that has no Wiktionary entry
    result = parser.fetch("aimiable")
    assert type(result) == list
    assert len(result) == 1
    assert type(result[0]) == dict
    assert "etymology" not in result[0]
    assert "pronunciations" not in result[0]
    assert "definitions" not in result[0]
    assert "additional_info" in result[0]
    assert result[0]["additional_info"] == f"Wiktionary does not yet have an entry for aimiable"


def test_core_new_non_english_languages():
    words = ["receive", "recieve", "abilitanti", "aimai", "aimiable"]
    languages = ["italian", "french", "japanese"]

    parser = WiktionaryParser()
    for word in words:
        for language in languages:
            parser.set_default_language(language)
            result = parser.fetch(word)
            if language == "italian":
                if word == "abilitanti":
                    assert "definitions" in result[0]
                    assert "additional_info" not in result[0]
                else:
                    assert "definitions" not in result[0]
                    assert "additional_info" in result[0]
                    if word != "aimiable":
                        assert result[0]["additional_info"] == f"no {language} entry for {word}"
                    else:
                        expected = f"Wiktionary does not yet have an entry for {word}"
                        assert result[0]["additional_info"] == expected

            elif language == "french" or language == "japanese":
                if word == "aimai":
                    assert "definitions" in result[0]
                    assert "additional_info" not in result[0]
                else:
                    assert "definitions" not in result[0]
                    assert "additional_info" in result[0]
                    if word != "aimiable":
                        assert result[0]["additional_info"] == f"no {language} entry for {word}"
                    else:
                        expected = f"Wiktionary does not yet have an entry for {word}"
                        assert result[0]["additional_info"] == expected


8 changes: 6 additions & 2 deletions wiktionaryparser/core.py
@@ -116,14 +116,14 @@ def get_word_data(self, language):
                 start_index = content.find_previous().text + '.'
         if not start_index:
             if contents:
-                return []
+                return [{"additional_info": f"no {language} entry for {self.current_word}"}]
             language_heading = self.soup.find_all(
                 "span",
                 {"class": "mw-headline"},
                 string=lambda s: s.lower() == language
             )
             if not language_heading:
-                return []
+                return [{"additional_info": f"no {language} entry for {self.current_word}"}]
         for content in contents:
             index = content.find_previous().text
             content_text = self.remove_digits(content.text.lower())
@@ -282,4 +282,8 @@ def fetch(self, word, language=None, old_id=None):
         self.soup = BeautifulSoup(response.text.replace('>\n<', '><'), 'html.parser')
         self.current_word = word
         self.clean_html()
+        search_string = "Wiktionary does not yet have an entry for " + word
+        result = self.soup.find_all(string=re.compile(search_string))
+        if result:
+            return [{"additional_info": search_string}]
         return self.get_word_data(language.lower())