storyseedling.com no content because adding protection #1560

Open
rizkiv1 opened this issue Oct 28, 2024 · 8 comments
Comments

@rizkiv1

rizkiv1 commented Oct 28, 2024

Bug:
Storyseedling seems to have updated the site. Chapters now take a few seconds to load before the English content appears, and before the English text shows up, the page seems to display Chinese(?) content.

To Reproduce
Steps to reproduce the behavior:
sample TOC: https://storyseedling.com/series/177491/
sample chapter: https://storyseedling.com/series/177491/v1/1/
Once the run finishes, the chapter content is missing.

Expected behavior
The output contains the chapter content.

Desktop:

  • OS: Windows 10
  • Browser Firefox 131.0.3
  • Version 1.0.0.0

Smartphone:

  • Device: Realme C51s
  • OS: Android 13
  • Browser Kiwi Browser 124.0.6327.4
  • Version 1.0.0.0
@rizkiv1
Author

rizkiv1 commented Oct 28, 2024

It seems they added encryption using a custom font. I don't know how they do it, but copying the text directly gives Chinese text mixed with alphanumeric words.

@dnshipit

dnshipit commented Nov 1, 2024

They seem to be doing character replacement on the content in the back end, and then using a custom font, generated directly from code, to render everything in English.

Every character of the content is simply shifted from an alphanumeric character to some seemingly random Unicode character.

I think there are only a few options left for scraping this site:

  • decrypt the content using character frequency analysis (as long as they only use direct character replacement)
  • render the whole page, take a screenshot of the content, and then use OCR software to convert it back to text
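
A rough sketch of the frequency analysis option, in Python for illustration only. It assumes a plain one-to-one substitution and uses a commonly cited approximate ordering of English letters; `rank_symbols` and `guess_mapping` are hypothetical helper names, not part of any existing code.

```python
from collections import Counter
import string

# Approximate English letter order, most to least frequent (commonly cited;
# the exact order varies by corpus).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def rank_symbols(text):
    """Return the symbols in text, most frequent first.

    Whitespace and ASCII punctuation are ignored, on the assumption that
    the site leaves them unencoded.
    """
    counts = Counter(
        ch for ch in text
        if not ch.isspace() and ch not in string.punctuation
    )
    return [symbol for symbol, _ in counts.most_common()]


def guess_mapping(text):
    """First guess: n-th most frequent symbol -> n-th most frequent letter."""
    return dict(zip(rank_symbols(text), ENGLISH_ORDER))
```

This only gives a starting guess that would need refining against a dictionary. It also ignores that upper and lower case letters are encoded as different symbols, so in practice the counts for each pair would have to be merged first.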

@Elthara

Elthara commented Nov 7, 2024

It seems like it's just a character swap, similar to second life translations, just not using English characters.

a:⽜
A:⽂
b:⽝
B:⽃
c:⽞
C:⽄
d:⽟
D:⽅
e:⽠
E:⽆
f:⽡
F:⽇
g:⽢
G:⽈
h:⽣
H:⽉
i:⽤
I:⽊
j:⽥
J:⽋
k:⽦
K:⽌
l:⽧
L:⽍
m:⽨
M:⽎
n:⽩
N:⽏
o:⽪
O:⽐
p:⽫
P:⽑
q:⽬
Q:⽒
r:⽭
R:⽓
s:⽮
S:⽔
t:⽯
T:⽕
u:⽰
U:⽖
v:⽱
V:⽗
w:⽲
W:⽘
x:⽳
X:⽙
y:⽴
Y:⽚
z:⽵
Z:⽛
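
A minimal Python sketch of undoing the table above, assuming the mapping is complete and does not change. In the table, A-Z land on the 26 consecutive code points starting at U+2F42 and a-z on the 26 starting at U+2F5C, so the lookup can be built with a loop. The `decode` name is just for illustration.

```python
# Assumption: the table in this comment is the whole mapping and is fixed.
# A-Z -> U+2F42..U+2F5B, a-z -> U+2F5C..U+2F75 (as read off the table).
UPPER_START = 0x2F42
LOWER_START = 0x2F5C

# str.translate() takes a dict keyed by code point.
DECODE_TABLE = {}
for i in range(26):
    DECODE_TABLE[UPPER_START + i] = chr(ord("A") + i)
    DECODE_TABLE[LOWER_START + i] = chr(ord("a") + i)


def decode(text: str) -> str:
    """Replace encoded symbols with letters; leave everything else as-is."""
    return text.translate(DECODE_TABLE)
```

For example, `decode("⽉⽠⽧⽧⽪")` gives `Hello`. Digits and punctuation are not in the table, so this sketch passes them through unchanged.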

@dnshipit

dnshipit commented Nov 8, 2024

Yeah, it's a very simple character swap scheme. However, they can update the mapping at any time, and could even change it dynamically for every document. That's why, in the long run, a cryptographic decoding or OCR approach would be safer.

One thing that can help with decoding is the HTML meta tag. I noticed that every chapter has at least a few sentences of the content available in plain English for SEO purposes. Those same sentences appear encoded in the full content. For a simple character swap cipher like this, that would pin down quite a few characters in advance.
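
A sketch of that known-plaintext idea in Python, under two assumptions: the meta tag sentence appears verbatim somewhere in the encoded content, and the encoding is a one-to-one swap per character. `find_mapping` is a hypothetical helper; it slides the known sentence along the encoded text and returns the first alignment where the pairing is consistent.

```python
from typing import Dict, Optional


def find_mapping(known_plain: str, cipher: str) -> Optional[Dict[str, str]]:
    """Return {encoded char: plain char} from the first consistent alignment.

    Returns None if the known sentence cannot be aligned anywhere.
    """
    n = len(known_plain)
    for start in range(len(cipher) - n + 1):
        window = cipher[start:start + n]
        forward = {}  # encoded char -> plain char
        reverse = {}  # plain char -> encoded char
        consistent = True
        for enc_ch, plain_ch in zip(window, known_plain):
            # Both directions must agree for a one-to-one swap.
            if forward.setdefault(enc_ch, plain_ch) != plain_ch:
                consistent = False
                break
            if reverse.setdefault(plain_ch, enc_ch) != enc_ch:
                consistent = False
                break
        if consistent:
            return forward
    return None
```

A short known sentence can align by accident at the wrong offset, so longer snippets with repeated letters are more reliable. The result only covers characters that occur in the snippet; the rest would still need frequency analysis or guessing.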

@dteviot
Owner

dteviot commented Nov 9, 2024

They're not just doing character replacement. The content isn't on the initial page that is downloaded. Instead a second call is made to get the content.

e.g. For first chapter with URL of https://storyseedling.com/series/177491/v1/1/, content is obtained with a POST to https://storyseedling.com/series/177491/v1/1/content

Unfortunately, I'm having some problems reproducing the call.

Time taken: 42 minutes
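
For illustration, a small Python sketch of how the content URL is derived from the chapter URL. It only builds the request and does not send it. The empty body is a placeholder, since the payload and headers the site expects are not given in this thread, and `build_content_request` is a hypothetical name.

```python
import urllib.request


def build_content_request(chapter_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST for a chapter's content endpoint."""
    content_url = chapter_url.rstrip("/") + "/content"
    # Placeholder body: the real payload and headers are unknown here.
    return urllib.request.Request(content_url, data=b"", method="POST")
```

Calling it with `https://storyseedling.com/series/177491/v1/1/` yields a POST request for `https://storyseedling.com/series/177491/v1/1/content`. Whether the server accepts such a bare request is a separate question.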

@yuyu-cloud

yuyu-cloud commented Nov 10, 2024

I apologise if I'm oversimplifying this issue. Given the blank output WebToEpub now generates (due to the new protection scheme), would it be possible to still extract the encrypted source content (which presently has the Chinese characters and other HTML miscellany in the HTML data), and then use a built-in script in ePubEditor to undo the character swap scheme with the cryptographic decoding method a user mentioned above? That way any new changes to the site could still possibly be undone (for example, by adding an additional button for Story Seedling, the same as for Chrysanthemum Garden). The extension wouldn't be attempting to directly bypass these protections, and therefore wouldn't violate Chrome policy.

Just putting my two cents in hope there's some good workaround to still extract the source content from this site, since it has such a wide selection of novels... 😭🥺😢

@dnshipit

e.g. For first chapter with URL of storyseedling.com/series/177491/v1/1, content is obtained with a POST to storyseedling.com/series/177491/v1/1/content

According to my analysis of their website script, they are using Cloudflare Turnstile as the captcha for the content call. Instead of doing a fetch directly, this might require a slow full page load, extracting the content after the captcha has been cleared.

@bonnetchuu

Also came for a resolution to this site no longer working (but didn't want to open a duplicate issue)... TT_TT
