0.1.7 (2016-01-30)
Closed issues:
- ImportError: cannot import name 'Image' #183
- Won't let me import #182
- Install on Mac - El Capitan Failed - "Operation not permitted" #181
- Downgrades to old versions of required packages upon installation #174
- Handling 404, 500, and other non-200 http response codes to prevent scraping error pages #142
- Library downgrading in installation #138
Merged pull requests:
- Don't scrape error pages #190 (yprez)
- Added Hebrew stop words for language support #188 (alon7)
- Fix installation and build #187 (yprez)
- Fix installation docs #184 (yprez)
- Travis CI integration #180 (yprez)
- requirements.txt - Use minimal instead of exact versions #179 (yprez)
- Handle lxml raising ValueError on node.itertext() - Python 3 #178 (yprez)
- Handle lxml raising ValueError on node.itertext() #144 (yprez)
- Parse byline fix #132 (davecrumbacher)
0.1.6 (2016-01-10)
Closed issues:
- Critical leak in newspaper.mthreading.Worker #177
- HTMLParseError #165
- Take local paths to .html files #153
- Wall Street Journal Full Text is not Correctly Scraped #150
- Article HTML Returning Null #131
- No articles #130
- Loading Pages that use heavy javascript #127
- Login handling for premium websites #126
- Installation of nltk is failing #121
Merged pull requests:
- Support urls with dots #176 (alexanderlukanin13)
- upgrade beautifulsoup4 to 4.4.1 for python 3.5 #171 (AlJohri)
- Updated requests version #170 (adrienthiery)
- Turkish Language added #169 (muratcorlu)
- Add macedonian stopwords #166 (dimitrovskif)
- Issue#95 added graceful string concatenation #157 (surajssd)
- fix for "jpeg error with PIL, Can't convert 'NoneType' object to str implicitly" #154 (hnykda)
- bugfix in article.py, is_valid_body #149 (ms8r)
- Fixed typo #139 (Eleonore9)
- Correct link for the Python 3 branch #136 (jtpio)
- Add python3-pip install step for Ubuntu #135 (irnc)
0.1.5 (2015-03-04)
Closed issues:
- is there any kind of documentation on centos 7? #114
- Add extraction publishing date from article. #3
Merged pull requests:
0.1.4 (2015-02-04)
Closed issues:
- Getting rate limiting issue? #116
- newspaper.build( ) error #111
- Allow lists in Parser.clean_article_html() #108
Merged pull requests:
- Fix incorrect log call while generating articles #115 (curita)
- Allow lists in clean_article_html() - fixes #108 #112 (ecesena)
- Fixed nodeToString() to return valid HTML #110 (ecesena)
- Fixed empty return in top_meta_image #109 (ecesena)
0.1.3 (2015-01-15)
Implemented enhancements:
- Fulltext extraction improvement #1 #105
Closed issues:
- Tags h1 in article_html - intended behavior? #107
Merged pull requests:
0.1.2 (2015-01-01)
Closed issues:
- Metatags on Vice.com #103
- Can't extract images from german newspapers #96
- article_html misses many of the images #89
Merged pull requests:
- Integrate UnicodeDammit, deprecate parser_class, deprecate encodeValue, refactor, scaffolding for more unit tests #104 (codelucas)
0.1.1 (2014-12-27)
Closed issues:
- UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc #99
- TypeError: Can't convert 'bytes' object to str implicitly #98
- [Parse lxml ERR] Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. #78
- UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128) #77
- article.text and keywords error #47
Merged pull requests:
- Huge bugfix to aid lxml DOM parsing + remove unhelpful and excess exception messages and added tracebacks to exception logging #102 (codelucas)
- Decode bytestring returned from lxml's toString early on before sending it out to outer code #101 (codelucas)
- Fixed #78: Remove encoding tag because lxml won't accept it for unicode #97 (mhall1)
0.1.0 (2014-12-17)
0.0.9 (2014-12-17)
Closed issues:
- object has no attribute clean Error when using parse method #90
- Questions #85
- [nltk_data] Error loading brown: <urlopen error [Errno -2] Name or [nltk_data] service not known> #84
- newspaper unable to find embedded youtube video #82
- Bound for memory usage #81
- Hosted demo #80
- Having issues installing due to lxml #79
- Add a BeautifulSoup4 parser. #44
- python 3 support request #36
Merged pull requests:
- update jieba to 0.35 #94 (WingGao)
- Parse was breaking in the method clean_article_html when keep_article_ht... #88 (phoenixwizard)
- split title with _ #87 (deweydu)
- Update to support python3 #86 (log0ymxm)
- Added link to basic demo #83 (iwasrobbed)
- Add splitting of slash-separated titles #75 (igor-shevchenko)
0.0.8 (2014-10-13)
Closed issues:
- Parsing Raw HTML #74
- Can't install newspaper #72
- Refactor codebase so newspaper is actually pythonic #70
- Article.top_node == Article.clean_top_node #65
- article.movies missing 'http:' #64
- KeyError when calling newspaper.languages() #62
- Memoize Articles - Not Printing #61
- Add URL headers while building a "paper" #60
- AttributeError: 'module' object has no attribute 'build' #59
- Typo in newspaper.build argument "memoize_articles" #58
- issue with stopwords-tr.txt #51
- Other language support. #34
- Character encoding detection #2
Merged pull requests:
- Huge refactor: entire codebase in PEP8, imports alphabetized, bugfixes, core changes #71 (codelucas)
- Meta tag extraction fixes #69 (karls)
- Test suite improvements #68 (karls)
- Test suite fixes #67 (karls)
- Revert "Added published date to the extractor+article" #66 (codelucas)
- Added published date to the extractor+article #63 (parhammmm)
0.0.7 (2014-06-17)
Closed issues:
- no document on how to add language #57
- Retain <a> tags in top article node? #56
- DocumentCleaner is missing clean_body_classes #55
- You must download and parse an article before parsing it #52
- Not extracting UL LI text #50
- article does not release_resources() #42
- Doesn't work on http://www.le360.ma/fr #40
- How to assign html content without downloading it? #37
- Python venv only? #32
- .nlp() could not work #27
- Doesn't work with Arabic news sites #23
- SyntaxError: invalid syntax #19
- Retain HTML markup for extracted article #18
- Portuguese is misspelled #14
- Multi-threading article downloads not working #12
- Timegm error? #10
- Problem in Brazilian sites #9
- Brazilian portuguese support #6
Merged pull requests:
- Fix typo in code and documentation #54 (jacquerie)
- removed quotes of 'filename' in utils\__init__.py #53 (jay8688)
- Fixed long-form article issue w/ calculate_best_node #49 (jeffnappi)
- Use first image from article top_node #35 (otemnov)
- Add a section with links to related projects #33 (cantino)
- Original #30 (otemnov)
- Fix reddit top image #29 (otemnov)
- Extract Meta Tags in structured way #28 (voidfiles)
- Replace instances of 'Portugease' with 'Portuguese' #26 (WheresWardy)
- It's The Changelog not The ChangeLog :) #24 (adamstac)
- syntax errors #22 (arjun024)
- Support for more HTML tags in parsers.py #21 (WheresWardy)
- Fixed syntax error #20 (damilare)
- Minor Performance tweaks #17 (techaddict)
- Update README.rst #15 (girasquid)
- Minor Typo candiate_words -> candidate_words #13 (techaddict)
0.0.6 (2014-01-18)
Closed issues:
- Port to Ruby #8
- Huge internationalization / API revamp underway! #7
- Multithread & gevent framework built into newspaper #4
Merged pull requests:
0.0.5 (2014-01-09)
0.0.4 (2013-12-31)
Closed issues:
- Calling nlp() on an article causes 'tokenizers/punkt/english.pickle' Not Found Error #1
Merged pull requests:
- Fix for keyword arg usage in print() on Python 2.7 #5 (michaelhood)
0.0.3 (2013-12-22)
0.0.2 (2013-12-21)
0.0.1 (2013-12-21)
* This Change Log was automatically generated by github_changelog_generator