Introduction

Welcome to the course! Glad you're here :)

Supporting The Project

  • Star the repo 😎
    • Maybe share it with some people new to web-scraping?
  • Consider sponsoring me on GitHub
  • Send me an email or a LinkedIn message telling me what you enjoy in the course (and maybe what else you want to see in the future)
  • Submit PRs for suggestions/issues :)

Table Of Contents

  1. Welcome!
    1. What I'm Known For
    2. Learning Objectives
    3. How You Will Learn
    4. How To Learn Effectively
    5. Course Topics
  2. Getting Started
    1. Prerequisites
    2. Tools Required

Video For The Lesson

Consider checking out the video for this introduction here. This video just provides the slides with commentary; later lessons are higher quality.

Video Corrections

None so far

Welcome

I'm David Teather, a software engineer specializing in data extraction.

If you'd like a more visual experience, check out the introduction video on YouTube or pull up the introduction slides.

What I'm Known For

  • My research on YikYak (a social media app) that was featured in Vice and The Verge
  • Creating various data extraction tools
    • My most popular is TikTokApi
      • 600K+ Downloads
      • 2.3K+ Stars

Course Introduction

Learning Objectives

  • Learners will understand the many different ways websites prevent web scraping
  • Learners will be able to reverse engineer a real-world website for data extraction

How You Will Learn

  • Real website examples
    • These websites might change over time, which could break the lesson
  • Websites I've created for this course
    • These will not change, so the lessons built on them won't break
  • Each lesson will have a hands-on activity
    • In addition, most modules will have a submission.py file where you can write functions related to the lesson concept and run them against a test suite
    • These will primarily focus on extracting data from the websites created for this course
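
To make the submission.py idea concrete, here is a minimal sketch of what such a file might contain. The function name `extract_titles` and the `<h2 class="title">` markup are hypothetical and are not taken from the course's actual test suite; each lesson defines its own expected functions.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Only accumulate text while inside a matching <h2>
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html: str) -> list[str]:
    """Hypothetical lesson function: return the title strings found in the page."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]
```

A test suite would then call a function like this with a page's HTML and compare the returned list against the expected data.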

How To Learn Effectively

  • Everybody learns differently, so these are only guidelines
  • Take notes from the slides presented in the videos
    • These will revolve around general concepts
    • Will be accompanied by programs to write
  • Try the activities before watching the solution in the video
    • Treat the website folder as a black box, like you would a real website; you can figure out everything through the website itself

Course Topics

  • Forging API requests
  • Proxies
  • Captchas
  • Storing data at scale
  • Emulating human behavior
  • And more
    • Feel free to tweet at me or file an issue with the lesson-request label describing what you'd like to see

Getting Started

Learn how to get started with this course!

Prerequisites

  • A basic understanding of programming
  • Recommended
    • Some Python experience
      • We probably won't do much complex Python

Tools Required

  • Docker
    • And docker-compose (should be bundled)
  • Python
    • I'll be using 3.10
  • A web browser
    • I'll be using Brave (Chromium-based)
    • Doesn't really matter which as long as you can view network traffic
  • And the files in this git repo, so be sure to download it! (and maybe give it a star 😉)
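
If you want a quick way to confirm the command-line tools above are installed, a small script along these lines could help. This is only a sketch: the `check_tools` helper is hypothetical, 3.10 is the version used in the course rather than a stated minimum, and the check looks only for a standalone `docker-compose` binary, so it will report "missing" if your Docker install provides Compose some other way.

```python
import shutil
import sys


def check_tools(min_python=(3, 10)):
    """Return a dict mapping each tool name to True/False for availability."""
    return {
        # Compare (major, minor) of the running interpreter to the assumed minimum
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH
        "docker": shutil.which("docker") is not None,
        "docker-compose": shutil.which("docker-compose") is not None,
    }


if __name__ == "__main__":
    for tool, ok in check_tools().items():
        print(f"{tool}: {'found' if ok else 'missing'}")
```

The browser is not checked here, since any browser that lets you view network traffic will do.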

Hope you'll enjoy the content in this course! You can either get started with lesson 1 or check out the course catalogue.