feat: lesson about using a framework #1303

Merged: 16 commits from honzajavorek/py-framework into master on Jan 21, 2025
Conversation

@honzajavorek (Collaborator) commented Nov 25, 2024

This PR introduces a new lesson to the Python course for beginners in scraping. The lesson is about working with a framework. Decisions I made:

  • I opted not to use type hints, to keep the examples less cluttered and to avoid having to explain type hints to people who have never used them
  • The logging section serves two purposes: first, it adds logging :), and second, it conveniently provides the code for the whole program at the end of the lesson
  • I had a hard time coming up with exercises, because most of the simple ideas I came up with were too simple and would result in shorter and simpler code without the framework 😅
    • I decided to have one classic scenario (listing & detail), just to let the student write their first Crawlee program (a minimal sketch of the pattern follows after this list). It's a bit challenging regarding traversal over the HTML to get the data, but it shouldn't be challenging regarding Crawlee.
    • I introduced one scenario where the scraper needs to jump through several pages (even domains) to get the result. Such a program would be hard, or at least very annoying, to write without a framework.
  • As always, I focused on basing the examples on real-world sites that are somewhat known and popular globally, but that don't feature extensive anti-scraping protections.
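For reference, here's a minimal sketch of the listing & detail pattern mentioned above. The import path assumes a recent version of Crawlee for Python, and the selectors and URL are illustrative placeholders, not the ones used in the lesson:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        # Enqueue links to detail pages found on the listing page.
        await context.enqueue_links(selector=".product a", label="DETAIL")

    @crawler.router.handler("DETAIL")
    async def handle_detail(context: BeautifulSoupCrawlingContext):
        # Scrape a single item from its detail page and store it.
        await context.push_data({
            "url": context.request.url,
            "title": context.soup.select_one("h1").text.strip(),
        })

    await crawler.run(["https://example.com/products"])


if __name__ == "__main__":
    asyncio.run(main())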

Crawlee feedback

Regarding Crawlee, I didn't have much trouble writing this lesson, apart from the part where I wanted to provide hints on how to do this:

from crawlee import Request  # assuming Request is exported from the top-level crawlee package

requests = []
for ... in context.soup.select(...):
    ...
    # Build a request for each matched element, then enqueue them all at once.
    requests.append(Request.from_url(imdb_search_url, label="..."))
await context.add_requests(requests)

I couldn't find a good example in the docs, and I was afraid that even if I provided pointers to all the individual pieces, the student wouldn't be able to figure it out.

Also, I wanted to link to the docs when pointing out that enqueue_links() has a limit argument, but I couldn't find enqueue_links() in the docs. I found this, which is weird. It's not clear what object is documented or what it even is; it feels like some internals, not like regular docs for a method. I can probably guess why it ended up this way, but I don't think it's useful like this, and I decided I don't want to send people from the course to that page.
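For the record, the usage I wanted to point to is just a keyword argument; a hedged sketch (the label and the value of 10 are illustrative):

# Enqueue at most 10 of the links matched on this page.
await context.enqueue_links(label="DETAIL", limit=10)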

One more thing: I do think that Crawlee should log some "progress" information about requests made or, especially, items scraped. It's odd to run the program and then just stare at it as if it had hung, waiting to see whether anything happens. Scrapy, for example, logs how many items per minute it has scraped, which I personally find super useful.
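A workaround I can imagine, as a hypothetical sketch rather than an existing Crawlee feature, is counting items in the handler and logging every so often (reusing the crawler and context type from the sketch above):

import itertools
import logging

logger = logging.getLogger(__name__)
item_counter = itertools.count(1)

@crawler.router.handler("DETAIL")
async def handle_detail(context: BeautifulSoupCrawlingContext):
    await context.push_data({"url": context.request.url})
    count = next(item_counter)
    if count % 10 == 0:
        # Poor man's progress report, emitted every 10 items.
        logger.info("Scraped %d items so far", count)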

@honzajavorek honzajavorek added the t-academy Issues related to Web Scraping and Apify academies. label Nov 25, 2024
@honzajavorek honzajavorek force-pushed the honzajavorek/py-framework branch from cb0f718 to e18ea31 on November 27, 2024, 16:19
@honzajavorek honzajavorek marked this pull request as ready for review November 28, 2024 16:45
@vdusek (Contributor) left a comment

I get your point about avoiding type hints. However, in the case of the handler:

    @crawler.router.default_handler
    async def handle_listing(context):
        ...

It leaves the reader without any code completion or static analysis when working with the context object.

In my opinion, type hints should be included here. We have been using them across all docs & examples.

Just a suggestion for you to reconsider, not a request.

Other than that, good job 🙂, and the code seems to be working.

@honzajavorek (Collaborator, Author)

Thanks for the review! I see your point and I will indeed reconsider adding the type hint, at least for the context. It would be an easier decision if the type name weren't 28 characters long, but you're right about the benefits for people with editors like VS Code, where we can assume some level of automatic code completion.
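For the record, the typed variant of the snippet above would look roughly like this, assuming the context type in question is BeautifulSoupCrawlingContext (the 28 characters) and a recent Crawlee for Python import path:

from crawlee.crawlers import BeautifulSoupCrawlingContext

@crawler.router.default_handler
async def handle_listing(context: BeautifulSoupCrawlingContext):
    ...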

honzajavorek added a commit that referenced this pull request Jan 20, 2025
@honzajavorek honzajavorek force-pushed the honzajavorek/py-framework branch from c55010a to 23fdbdb on January 20, 2025, 13:14
@honzajavorek (Collaborator, Author)

@vdusek I added reconsidering the type hints to #1319, thanks!

@TC-MO I think we could merge this now, but I'd appreciate it if you could take a look at what I did, at least in f0c6041. I randomly added words to Vale's spelling dictionary, but even then, I had to turn Vale off for one code block. According to my testing, it seems that Vale isn't able to identify and skip the block as a code block if it has title="newmain.py" (which is a non-standard Markdown extension of the syntax, I suppose). Do we have a solution for that, or do I have to comment it out, like I did? Thanks!

I also asked about the dictionary here: https://github.com/apify/apify-docs/pull/1345/files#r1922397733

@TC-MO (Contributor) commented Jan 20, 2025

Not sure about titling the code block and the Vale issue; I've never encountered it before. Let me investigate, but as a quick fix, I guess turning Vale off before the code block and back on after it will do. If we find a proper solution, it will be an easy change.
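A sketch of that quick fix, assuming Vale's standard comment markers for Markdown and the code block title mentioned above:

<!-- vale off -->

```py title="newmain.py"
...
```

<!-- vale on -->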

@honzajavorek (Collaborator, Author)

Thanks! I'm surprised, because I don't think this is the first time it's used in the docs, but it's also possible that Vale works incrementally, so it only errors on what I'm adding and ignores whatever is already there. Not sure.

@TC-MO (Contributor) commented Jan 20, 2025

The way Vale is set up, it doesn't check the whole documentation each time it runs; it only checks the changed files. That's why some other files may have a title in a code block, but since they haven't been changed in this PR, it isn't caught. Also, this might be a result of Vale Spelling being added.

@honzajavorek (Collaborator, Author)

That's what I think as well.

@honzajavorek honzajavorek merged commit 89a564d into master Jan 21, 2025
7 checks passed
@honzajavorek honzajavorek deleted the honzajavorek/py-framework branch January 21, 2025 09:10