34 crawls scheduler #66

Closed
wants to merge 66 commits into from
Changes from all commits
66 commits
0a004ab
feat(user): add (fake) login()
AndriPL Apr 27, 2020
36e6122
build: add passlib dependency
AndriPL May 4, 2020
8d370f9
feature(auth): add real login
bartsvoboda May 4, 2020
7fa19d7
Merge branch '30-user-authorization' of https://github.com/nokia-wroc…
bartsvoboda May 4, 2020
4fc80ca
feature(auth): add cookies for login route
bartsvoboda May 7, 2020
a77ea6d
Merge branch 'master' into 30-user-authorization
AndriPL May 16, 2020
23306c8
feat: add email notifications
Jun 1, 2020
755cfd5
refactor: delete scrapy project files
Jun 1, 2020
3418609
docker: add auth secrets
AndriPL Jun 2, 2020
554366c
feat(auth): add auth scheme
AndriPL Jun 2, 2020
06cd0c2
feat(auth): add auth
AndriPL Jun 2, 2020
cce57ef
feat: add save_to_txt helper function
AndriPL Jun 2, 2020
3f76da6
feat(auth): add auth to crawling routes
AndriPL Jun 2, 2020
7684b9d
feat(auth): add auth to statics
AndriPL Jun 2, 2020
026597b
feat: add secrets
Jun 2, 2020
4d37c4b
refactor: add email notifications to helpers
Jun 2, 2020
8844757
feat: add HTTPException
Jun 2, 2020
3ede3f0
refactor: main_notifications cleanup
Jun 2, 2020
0776840
feat: add secrets
Jun 2, 2020
9a682c5
fix(auth): throw 401 in bearer_cookie_auth
AndriPL Jun 2, 2020
26dc876
refactor(auth): combine verify_token with oauth2_scheme
AndriPL Jun 2, 2020
4914d59
refactor: remove scrappy junk
AndriPL Jun 2, 2020
4f90277
feat(auth): add logout
AndriPL Jun 3, 2020
4356af1
refactor: remove logging
AndriPL Jun 3, 2020
b2163dd
refactor: add newline in debug.py
AndriPL Jun 3, 2020
ac9afcd
feat: add crawl to database
Jun 4, 2020
5058874
feat: get crawls from database
Jun 4, 2020
1e898f3
Merge branch '30-user-authorization' into 35-adding-crawls-to-DB
Jun 4, 2020
ac974f5
feat: add read_crawls
Jun 4, 2020
d8e4346
feat: add verify_token dependencies
Jun 4, 2020
016e9a9
feat: add update_crawl
Jun 4, 2020
86a8828
refactor: cleanup
Jun 4, 2020
5bc837a
refactor: cleanup
Jun 4, 2020
7761507
Merge branch 'master' into 35-adding-crawls-to-DB
AndriPL Jun 10, 2020
88366cd
fix(backend): remove excessive auth router in main
AndriPL Jun 10, 2020
aca3251
refactor: styling changes in auth_contorller
AndriPL Jun 10, 2020
d705058
fix(crawling): fix adding new crawl
AndriPL Jun 10, 2020
85fd375
fix(db): fix relations names to work with pydantic
AndriPL Jun 10, 2020
37b0ce2
fix(crawls): fix getting crawls list
AndriPL Jun 11, 2020
fd91473
fix(crawlingCRUD): fix updating a crawl
AndriPL Jun 11, 2020
05f91d3
refactor(cawrCRUD): remove logger from add_crawl
AndriPL Jun 11, 2020
28c6aee
refactor(crawling): add api tags to crawling_controller
AndriPL Jun 11, 2020
89b825d
fix(crawl CRUD): fix adding crawls
AndriPL Jun 12, 2020
e148a61
fix(crawl CRUD): add crawl to user, not email from crawl
AndriPL Jun 12, 2020
10094de
feat: change email style
Jun 13, 2020
576dcd7
Merge branch 'master' into 36-email-notifications
AndriPL Jun 14, 2020
8b3dc0e
fix(crawling): fix timeout when adding crawl
AndriPL Jun 12, 2020
fb29b16
feat(crawl CRUD): add deleting crawls
AndriPL Jun 12, 2020
7d403af
feat(scheduler): add scheduler init
AndriPL Jun 14, 2020
a97ed81
feat(scheduler): add creating crawls queue
AndriPL Jun 14, 2020
8e10575
Merge branch 'master' into 34-crawls-scheduler
AndriPL Jun 14, 2020
225a2b7
feat(scheduler): scheduler runs but not in new loop
AndriPL Jun 15, 2020
f5248dd
fix(scheduler): taking screenshots works and crawls_queue is being up…
AndriPL Jun 15, 2020
8157e70
feat(scheduler): update element value in db
AndriPL Jun 15, 2020
e4fdc36
Merge branch '36-email-notifications' into 34-crawls-scheduler
AndriPL Jun 15, 2020
39a27ea
feat(scheduler): send screenshots
AndriPL Jun 15, 2020
90169e7
feat: add new email style
Jun 15, 2020
e102cec
fix(scheduler): fix scheduler sleeping too long and add new email tem…
AndriPL Jun 15, 2020
47a1961
feat(scheduler): remove screenshots after sending email
AndriPL Jun 15, 2020
6be45e4
Merge branch 'master' into 34-crawls-scheduler
AndriPL Jun 16, 2020
5ec53f0
fix(scheduler): fix loosing crawls outside the queue
AndriPL Jun 16, 2020
75b9b52
refactor(scheduler): remove class __Scheduler
AndriPL Jun 16, 2020
23deb97
fix(scheduler): delete and update crawls without realoading scheduler
AndriPL Jun 16, 2020
9b2fa77
fix(scheduler): fix adding crawl
AndriPL Jun 16, 2020
244f46b
Merge branch 'master' into 34-crawls-scheduler
AndriPL Jun 16, 2020
74e5a9b
fix(scheduler): fix lost connection bug
AndriPL Jun 16, 2020
2 changes: 1 addition & 1 deletion backend/src/crawling/crawling_service.py
@@ -8,7 +8,7 @@

from src.crawling.communicators import Communicators
from src.database.database_schemas import Links, Notifications, Scripts
from src.crawling.scheduler import scheduler
from src.crawling.scheduler_api import scheduler
from src.user import user_service

from .models.crawl_data_model import CrawlData, CrawlDataCreate
85 changes: 46 additions & 39 deletions backend/src/crawling/scheduler.py
@@ -20,21 +20,19 @@


class Scheduler:
__kill_switch = False
__waiting_crawls: asyncio.PriorityQueue
__crawls_not_in_queue_num: int
__running_crawls_futures = []
__crawls_to_update_or_delete = {}

async def __create(self):
async def create(self):
self.__waiting_crawls = asyncio.PriorityQueue(
maxsize=MAX_WAITING_CRAWLS_QUEUE_SIZE
)
self.__crawls_not_in_queue_num = 0
await self.reload_crawls()

async def run_scheduler(self):
await self.__create()
await self.__run()

async def reload_crawls(self):
self.__waiting_crawls = asyncio.PriorityQueue(
@@ -58,20 +56,23 @@ async def reload_crawls(self):
)
)


async def add_crawl(self, crawl: CrawlData):
await self.__waiting_crawls.put((time.time() + crawl.period, crawl))


def update_crawl(self, crawl: CrawlData):
self.__crawls_to_update_or_delete.update([(crawl.crawl_id, crawl)])


def delete_crawl(self, crawl_id):
self.__crawls_to_update_or_delete.update([(crawl_id, None)])


async def __run(self):
async def run(self):
loop = asyncio.get_event_loop()
asyncio.create_task(self.__remove_done_futures())
while True:
while not self.__kill_switch:
logger.log(level=logging.DEBUG, msg=f"Number of waiting crawls = {self.__waiting_crawls.qsize()}.")

moment_of_exec, crawl = await self.__waiting_crawls.get()
@@ -87,7 +88,7 @@ async def __run(self):

if time_to_wait > 0:
await self.__waiting_crawls.put((moment_of_exec, crawl))
self.__crawls_not_in_queue_num -=1
self.__crawls_not_in_queue_num -= 1
await asyncio.sleep(1)
else:
logger.log(level=logging.DEBUG, msg=f"Running check for change for crawl {crawl.name}. Number of running crawls = {len(self.__running_crawls_futures)}.")
@@ -97,40 +98,45 @@ async def __run_check_for_change(self, crawl, loop):
async def __run_check_for_change(self, crawl, loop):
while len(self.__running_crawls_futures) >= MAX_RUNNING_CRAWLS:
await asyncio.sleep(1)
self.__running_crawls_futures.append(asyncio.run_coroutine_threadsafe(self.__check_for_change(crawl), loop))
crawl_future = asyncio.run_coroutine_threadsafe(self.__check_for_change(crawl), loop)
self.__running_crawls_futures.append(crawl_future)


async def __check_for_change(self, crawl):
logger.log(level=logging.DEBUG, msg=f"Checking crawl {crawl.name}")
current_value = await data_selector(url=crawl.url, xpath=crawl.xpath)
if current_value != crawl.element_value:
msg = (
" Name: "
+ crawl.name
+ "\n"
+ " Old value: "
+ crawl.element_value
+ "\n"
+ " New value: "
+ str(current_value)
)
logger.log(level=logging.DEBUG, msg="Value changed! \n" + msg)

update_element_value_in_db(crawl.crawl_id, current_value)
crawl.element_value = current_value

screenshot_name = crawl.name.replace(" ", "_")
screenshot_path = await take_screenshot(crawl.url, filename=screenshot_name)
send_email(
crawl.email,
screenshot_path,
f"Your crawl \"{crawl.name}\" has new value!"
)
os.remove(screenshot_path)
try:
logger.log(level=logging.DEBUG, msg=f"Checking crawl {crawl.name}")
current_value = await data_selector(url=crawl.url, xpath=crawl.xpath)

if current_value != crawl.element_value:
msg = (
" Name: "
+ crawl.name
+ "\n"
+ " Old value: "
+ crawl.element_value
+ "\n"
+ " New value: "
+ str(current_value)
)
logger.log(level=logging.DEBUG, msg="Value changed! \n" + msg)

await self.__waiting_crawls.put((time.time() + crawl.period, crawl))
self.__crawls_not_in_queue_num -=1
update_element_value_in_db(crawl.crawl_id, current_value)
crawl.element_value = current_value

screenshot_name = crawl.name.replace(" ", "_")
screenshot_path = await take_screenshot(crawl.url, filename=screenshot_name)
send_email(
crawl.email,
screenshot_path,
f"Your crawl \"{crawl.name}\" has new value!"
)
os.remove(screenshot_path)

finally:
await self.__waiting_crawls.put((time.time() + crawl.period, crawl))
self.__crawls_not_in_queue_num -= 1


async def __remove_done_futures(self):
while True:
for i, future in enumerate(self.__running_crawls_futures):
@@ -139,6 +145,10 @@ async def __remove_done_futures(self):
await asyncio.sleep(1)


def kill(self):
self.__kill_switch = True


def update_element_value_in_db(crawl_id, element_value):
db = next(get_db())
db.query(Scripts) \
@@ -153,6 +163,3 @@ def update_element_value_in_db(crawl_id, element_value):
def get_all_links():
db = next(get_db())
return db.query(Links).all()


scheduler = Scheduler()
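
The rewritten scheduler.py replaces the bare "while True" run loop with a kill-switch check and moves the re-queue of a crawl into a finally block, so a failed check can no longer drop the crawl from the schedule. A minimal sketch of that pattern, using assumed names rather than the PR's own classes:

import asyncio
import time


class MiniScheduler:
    def __init__(self) -> None:
        self._kill_switch = False
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()

    def kill(self) -> None:
        # Flip the flag checked by run(); the loop exits after the current pass.
        self._kill_switch = True

    async def _check(self, crawl) -> None:
        try:
            ...  # fetch the page, compare values, notify on change
        finally:
            # Always put the crawl back, even if the check raised.
            await self._queue.put((time.time() + crawl.period, crawl))

    async def run(self) -> None:
        while not self._kill_switch:
            moment_of_exec, crawl = await self._queue.get()
            if moment_of_exec > time.time():
                # Not due yet: put it back and wait a little.
                await self._queue.put((moment_of_exec, crawl))
                await asyncio.sleep(1)
            else:
                asyncio.create_task(self._check(crawl))
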
31 changes: 31 additions & 0 deletions backend/src/crawling/scheduler_api.py
@@ -0,0 +1,31 @@
import logging

from .models.crawl_data_model import CrawlData
from .scheduler import Scheduler


logger = logging.getLogger("Noticrawl")

scheduler = Scheduler()


async def run_scheduler():
while True:
try:
await scheduler.create()
await scheduler.run()
except:
logger.log(level=logging.DEBUG, msg="Exception in scheduler!")
scheduler.reload_crawls()


async def add_crawl(crawl: CrawlData):
await scheduler.add_crawl(crawl)


def update_crawl(crawl: CrawlData):
scheduler.update_crawl(crawl)


def delete_crawl(crawl_id):
scheduler.delete_crawl(crawl_id)
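
The new scheduler_api.py owns the single Scheduler instance and keeps it alive by restarting create() and run() whenever they raise. Two details worth flagging: reload_crawls() is declared async def in scheduler.py, so the call in the except branch above appears to need an await to take effect, and a bare except also swallows task cancellation. A hedged sketch of the same supervisor loop with those points adjusted (assumed function name):

import asyncio
import logging

logger = logging.getLogger("Noticrawl")


async def supervise(scheduler) -> None:
    # Keep the scheduler alive: create() rebuilds the queue, run() processes it,
    # and any unexpected error is logged before the next attempt.
    while True:
        try:
            await scheduler.create()
            await scheduler.run()
        except Exception:
            logger.exception("Exception in scheduler, restarting")
            await asyncio.sleep(1)  # brief back-off; create() reloads the crawls
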
63 changes: 34 additions & 29 deletions backend/src/helpers/crawling.py
@@ -7,35 +7,40 @@
logger = logging.getLogger("Helpers")

async def data_selector(url, xpath):
logging.getLogger("websockets").setLevel("WARN")
browser = await pyppeteer.launch(
headless=True, args=["--no-sandbox"], logLevel="WARN"
)
page = await browser.newPage()
await page.goto(url, waitUntil="networkidle2", timeout=600000)
await page.waitForXPath(xpath)
xpath_content = await page.xpath(xpath)
text_content = await page.evaluate(
"(xpath_content) => xpath_content.textContent", xpath_content[0]
)
await page.close()
await browser.close()
return text_content
try:
logging.getLogger("websockets").setLevel("WARN")
browser = await pyppeteer.launch(
headless=True, args=["--no-sandbox"], logLevel="WARN"
)
page = await browser.newPage()
await page.goto(url, waitUntil="networkidle2", timeout=600000)
await page.waitForXPath(xpath)
xpath_content = await page.xpath(xpath)
text_content = await page.evaluate(
"(xpath_content) => xpath_content.textContent", xpath_content[0]
)
return text_content
finally:
await page.close()
await browser.close()


async def take_screenshot(url, filename="screenshot", directory="/app/logs/screenshots"):
filename = filename + datetime.now().strftime("_%d-%m-%Y_%H-%M-%S-%f") + ".png"
path = directory + "/" + filename
os.makedirs(os.path.dirname(path), exist_ok=True)

logging.getLogger("websockets").setLevel("WARN")
browser = await pyppeteer.launch(
headless=True, args=["--no-sandbox"], logLevel="WARN"
)
page = await browser.newPage()
await page.goto(url, waitUntil="networkidle2", timeout=600000)
await page.screenshot(path=path, fullPage=True)
await page.close()
await browser.close()

return path
try:
filename = filename + datetime.now().strftime("_%d-%m-%Y_%H-%M-%S-%f") + ".png"
path = directory + "/" + filename
os.makedirs(os.path.dirname(path), exist_ok=True)

logging.getLogger("websockets").setLevel("WARN")
browser = await pyppeteer.launch(
headless=True, args=["--no-sandbox"], logLevel="WARN"
)
page = await browser.newPage()
await page.goto(url, waitUntil="networkidle2", timeout=600000)
await page.screenshot(path=path, fullPage=True)
return path

finally:
await page.close()
await browser.close()
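
The helpers now close the pyppeteer page and browser in finally blocks so a failed navigation or XPath wait cannot leak a headless Chromium process. One caveat in the version above: if pyppeteer.launch() or newPage() itself raises, browser and page are still unbound when the finally runs and the cleanup itself raises NameError. A sketch (not the PR's code) that only closes what was actually opened:

import pyppeteer


async def fetch_text(url: str, xpath: str) -> str:
    # Launch first; if this raises there is nothing to clean up yet.
    browser = await pyppeteer.launch(headless=True, args=["--no-sandbox"])
    try:
        page = await browser.newPage()
        try:
            await page.goto(url, waitUntil="networkidle2", timeout=600000)
            await page.waitForXPath(xpath)
            elements = await page.xpath(xpath)
            return await page.evaluate("(el) => el.textContent", elements[0])
        finally:
            await page.close()
    finally:
        await browser.close()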

4 changes: 2 additions & 2 deletions backend/src/main.py
@@ -8,7 +8,7 @@
from .auth.auth_controller import auth_router
from .auth.auth_service import verify_token
from .crawling.crawling_controller import crawling_router
from .crawling.scheduler import scheduler
from .crawling.scheduler_api import run_scheduler
from .database import database_schemas
from .helpers.status_code_model import StatusCodeBase
from .user.user_controller import user_router
@@ -22,7 +22,7 @@

database_schemas.create()

asyncio.create_task(scheduler.run_scheduler())
asyncio.create_task(run_scheduler())

app.mount("/static", StaticFiles(directory="../frontend/build/static"), name="static")
templates = Jinja2Templates(directory="../frontend/build")
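
main.py now starts the scheduler by scheduling run_scheduler() as a background task at module import. As an illustration only (whether it is needed depends on how the app is launched), the same wiring can be deferred to FastAPI's startup hook, where a running event loop is guaranteed when asyncio.create_task() is called:

import asyncio

from fastapi import FastAPI

from .crawling.scheduler_api import run_scheduler

app = FastAPI()


@app.on_event("startup")
async def start_scheduler() -> None:
    # Runs inside the application's event loop, so create_task() is safe here.
    asyncio.create_task(run_scheduler())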