19 changes: 14 additions & 5 deletions Readability.js
@@ -677,6 +677,15 @@ Readability.prototype = {
}

this._replaceNodeTags(this._getAllNodesWithTag(doc, ["font"]), "SPAN");
+
+// Fix for issue #986: Remove Wikipedia edit section links that cause headings
+// to be improperly removed. These spans contain "edit" links and add negative
+// weight to otherwise good headings.
+this._forEachNode(this._getAllNodesWithTag(doc, ["span"]), function (span) {
+  if (span.className && span.className.includes("mw-editsection")) {
+    span.remove();
+  }
+});
Comment on lines +681 to +688

OK, at this point I have 2 concerns:

  1. This is going to be really slow and accomplish almost nothing for most documents, as they will have lots of spans that don't fit this criterion anyway. From a performance PoV it'd be better to filter for header elements first and then check whether they include these spans. For that, it seems like it'd be cleaner to just update _cleanHeaders to account for this? That already looks at h1/h2 headers, and is presumably where the heading is being removed. Is updating only that method insufficient?
  2. This seems hyper-focused on Wikipedia (or MediaWiki) itself. Is there any way to make this more generic? Perhaps the technique in "WIP permalink anchor stripping" (#982) could be generalized?
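
A minimal sketch of the first concern, for illustration only: visit headings first and look for edit-section spans inside them, rather than scanning every span in the document. The function name `removeEditSectionSpans` and the heading selector are hypothetical and not part of Readability.js. The sketch also assumes a DOM that supports `querySelectorAll` and `remove()`, which may not hold for every parser Readability runs against.

```javascript
// Hypothetical helper, not part of Readability.js.
// Assumption: `doc` and its elements support querySelectorAll() and remove().
function removeEditSectionSpans(doc) {
  // Only headings are visited, so documents with many unrelated spans
  // cost one pass over their (usually few) heading elements.
  var headings = doc.querySelectorAll("h1, h2, h3, h4, h5, h6");
  for (var i = 0; i < headings.length; i++) {
    // Substring match on the class attribute, mirroring the
    // className.includes("mw-editsection") check in the diff.
    // Copy the result to an array before removing any nodes.
    var spans = Array.prototype.slice.call(
      headings[i].querySelectorAll('span[class*="mw-editsection"]')
    );
    for (var j = 0; j < spans.length; j++) {
      spans[j].remove();
    }
  }
}
```

Whether this belongs in `_cleanHeaders` or in document preparation is the open question from the comment above; the sketch only shows the narrower traversal, not where it should live.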

},

/**
@@ -2540,18 +2549,18 @@ Readability.prototype = {
"iframe",
]);

-for (var i = 0; i < embeds.length; i++) {
+for (var k = 0; k < embeds.length; k++) {

This change seems unrelated, can you revert it?

// If this embed has attribute that matches video regex, don't delete it.
-for (var j = 0; j < embeds[i].attributes.length; j++) {
-  if (this._allowedVideoRegex.test(embeds[i].attributes[j].value)) {
+for (var j = 0; j < embeds[k].attributes.length; j++) {
+  if (this._allowedVideoRegex.test(embeds[k].attributes[j].value)) {
return false;
}
}

// For embed with <object> tag, check inner HTML as well.
if (
-embeds[i].tagName === "object" &&
-this._allowedVideoRegex.test(embeds[i].innerHTML)
+embeds[k].tagName === "object" &&
+this._allowedVideoRegex.test(embeds[k].innerHTML)
) {
return false;
}
3 changes: 1 addition & 2 deletions test/test-pages/nytimes-1/expected.html
@@ -36,8 +36,7 @@
<p data-para-count="208" data-total-count="4431">Obama administration officials said that they had briefed President-elect Donald J. Trump’s transition team, but that they did not know if Mr. Trump would stick with a policy of warmer relations with Sudan.</p>
<p data-para-count="143" data-total-count="4574">They said that Sudan had a long way to go in terms of respecting human rights, but that better relations could help increase American leverage.</p>
<p data-para-count="149" data-total-count="4723" data-node-uid="1">Mr. Reeves said he thought that the American government was being manipulated and that the Obama administration had made a “deal with the devil.”</p>
-<p><a href="#whats-next">Continue reading the main story</a>
-</p>
+<p><a href="#whats-next">Continue reading the main story</a></p>
</div>
</article>
</main>
6 changes: 2 additions & 4 deletions test/test-pages/nytimes-2/expected.html
@@ -19,8 +19,7 @@
<p data-para-count="177" data-total-count="325">First, let’s say what the Yahoo sale is not. It is not a sale of the publicly traded company. Instead, it is a sale of the Yahoo subsidiary and some related assets to Verizon.</p>
<p data-para-count="529" data-total-count="854">The sale is being done in two steps. The <a href="https://www.sec.gov/Archives/edgar/data/1011006/000119312516656036/d178500dex22.htm">first step</a> will be the transfer of any assets related to Yahoo business to a singular subsidiary. This includes the stock in the business subsidiaries that make up Yahoo that are not already in the single subsidiary, as well as the odd assets like benefit plan rights. This is what is being sold to Verizon. A license of Yahoo’s oldest patents is being held back in the so-called Excalibur portfolio. This will stay with Yahoo, as will Yahoo’s stakes in Alibaba Group and Yahoo Japan.</p>
<p data-para-count="479" data-total-count="1333">It is hard to overestimate how complex an asset sale like this is. Some of the assets are self-contained, but they must be gathered up and transferred. Employees need to be shuffled around and compensation arrangements redone. Many contracts, like the now-infamous one struck with the search engine Mozilla, which <a href="http://www.recode.net/2016/7/7/12116296/marissa-mayer-deal-mozilla-yahoo-payment">may result in a payment of up to a $1 billion</a>, will contain change-of-control provisions that will be set off and have to be addressed. Tax issues always loom large.</p>
-<p><a href="#story-continues-1">Continue reading the main story</a>
-</p>
+<p><a href="#story-continues-1">Continue reading the main story</a></p>
</div>
<div id="story-continues-1">
<p><a href="#story-continues-2">Continue reading the main story</a></p>
@@ -42,8 +41,7 @@
<p data-para-count="583" data-total-count="5954">Whether this is the most tax-efficient way is unclear to me as a nontax lawyer (email me if you know). Yahoo is likely to have a tax bill on the sale, possibly a substantial one. And I presume there were legal reasons for not using a <a href="http://dealbook.nytimes.com/2014/04/29/alliant-techsystems-break-up-and-the-return-of-the-morris-trust/">Morris Trust structure</a>, in which Yahoo would have been spun off and immediately sold to Verizon so that only Yahoo’s shareholders paid tax on the deal. In truth, the Yahoo assets being sold are only about 10 percent of the value of the company, so the time and logistics for such a sale when Yahoo is a melting ice cube may not have been worth it.</p>
<p data-para-count="450" data-total-count="6404">Finally, if another bidder still wants to acquire Yahoo, it has time. The agreement with Verizon allows Yahoo to terminate the deal and accept a superior offer by paying a $144 million breakup fee to Verizon. And if Yahoo shareholders change their minds and want to stick with Yahoo’s chief executive, <a href="http://topics.nytimes.com/top/reference/timestopics/people/m/marissa_mayer/index.html?inline=nyt-per" title="More articles about Marissa Mayer.">Marissa Mayer</a>, and vote down the deal, there is a so-called naked no-vote termination fee of $15 million payable to Verizon to reimburse expenses.</p>
<p data-para-count="426" data-total-count="6830">All in all, this was as hairy a deal as they come. There was the procedural and logistical complications of selling a company when the chief executive wanted to stay. Then there was the fact that this was an asset sale, including all of the challenges that go with it. Throw in all of the tax issues and the fact that this is a public company, and it is likely that the lawyers involved will have nightmares for years to come.</p>
-<p><a href="#whats-next">Continue reading the main story</a>
-</p>
+<p><a href="#whats-next">Continue reading the main story</a></p>
</div>
</article>
</main>