
NikOcaml/Azure-CI-Functions-Scraper


About

A web scraper for freelance.de, written in Python using the Selenium and SendGrid packages. It is intended to run on a serverless Azure architecture consisting of an Azure Container Instance for computation (extraction, transformation, and notification) and an Azure Function in PowerShell for orchestration and Infrastructure-as-Code. The scraper simply opens freelance.de, logs in, executes a search for the predefined search terms, parses and appends the new results (which can span multiple pages), and finally sends an email with the new relevant jobs if appropriate.
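In outline, the Selenium pass could look like the sketch below. This is a minimal illustration, not the code in scraper.py: the login path, element selectors, URL scheme, and pagination link text are all assumptions.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Stand-ins for values that would come from config.json.
USER, PASSWORD = "me@example.com", "secret"
SEARCH_TERMS = ["python", "azure"]
results = []

# Headless Chrome, since the container has no display.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Log in (form field names and the login path are assumed).
driver.get("https://www.freelance.de/login.php")
driver.find_element(By.NAME, "username").send_keys(USER)
driver.find_element(By.NAME, "password").send_keys(PASSWORD)
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Search each term and walk the result pages until no "next" link remains.
for term in SEARCH_TERMS:
    driver.get(f"https://www.freelance.de/search?q={term}")  # assumed URL scheme
    while True:
        for row in driver.find_elements(By.CSS_SELECTOR, ".job-listing"):  # assumed selector
            link = row.find_element(By.TAG_NAME, "a")
            results.append({"title": link.text, "url": link.get_attribute("href")})
        next_links = driver.find_elements(By.LINK_TEXT, "Weiter")
        if not next_links:
            break
        next_links[0].click()

driver.quit()
```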

Running it 8 times a day, Monday through Friday, costs less than €5/month, which is only possible thanks to the serverless setup. Around 70% of that is storage costs.


Architecture

  1. A cron-triggered Azure Function orchestrates the whole process, acting as Infrastructure-as-Code by starting the Container Instance via PowerShell.
  2. The container is created with a mounted File Share to persist data.
  3. The Container Instance scrapes freelance.de in the manner described above, appending the new jobs to results_{Month}.csv and updating the watermark (see the sketch after this list).
  4. If applicable, an email with the set of newly found relevant jobs is sent to the recipient via the SendGrid API.
  5. The Function stops the Container Instance.
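Steps 3 and 4 might look roughly like the following sketch. The mount path, the watermark file format (an ISO timestamp), the CSV columns, and the job dictionary keys are all assumptions for illustration, not the actual layout used by scraper.py.

```python
import csv
from datetime import datetime, timezone

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

SHARE = "/mnt/fileshare"  # assumed mount point of the Azure File Share

def persist_and_notify(scraped_jobs, is_relevant, sendgrid_key, recipient):
    """Append jobs newer than the watermark to results_{Month}.csv, advance
    the watermark, and email the relevant subset. scraped_jobs is assumed to
    be a list of dicts with 'posted' (datetime), 'title', and 'url' keys."""
    with open(f"{SHARE}/watermark.txt") as f:
        watermark = datetime.fromisoformat(f.read().strip())

    new_jobs = [j for j in scraped_jobs if j["posted"] > watermark]
    if not new_jobs:
        return

    # Step 3: persist the new jobs and advance the watermark.
    month = datetime.now(timezone.utc).strftime("%B")
    with open(f"{SHARE}/results_{month}.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for job in new_jobs:
            writer.writerow([job["posted"].isoformat(), job["title"], job["url"]])
    with open(f"{SHARE}/watermark.txt", "w") as f:
        f.write(max(j["posted"] for j in new_jobs).isoformat())

    # Step 4: email only the relevant subset (see Relevancy Evaluation below).
    relevant = [j for j in new_jobs if is_relevant(j)]
    if relevant:
        body = "\n".join(f"{j['title']}: {j['url']}" for j in relevant)
        message = Mail(from_email="scraper@example.com", to_emails=recipient,
                       subject=f"{len(relevant)} new freelance.de jobs",
                       plain_text_content=body)
        SendGridAPIClient(sendgrid_key).send(message)
```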

How to deploy

Prerequisites

  • PowerShell with the Azure module installed (needed to perform the RBAC assignment)
  • An Azure File Share
  • An Azure Function App with a PowerShell Function
  • A freelance.de Premium account

Function App Deployment

  1. Configure assign-rbac.ps1, profile.ps1, and run.ps1 (i.e., enter your personal data)
  2. Run assign-rbac.ps1
  3. Replace the contents of requirements.psd1, profile.ps1, and run.ps1 of your Azure Function with the files from this repository's function folder
  4. Done!

Script & Storage Configuration

  1. Configure config.json (i.e., enter your personal data)
  2. Upload scraper.py and config.json to the File Share
  3. Done!

That's it! The Container Instance uses a Docker Hub image built from the Dockerfile in this directory, and it automatically picks up scraper.py and config.json from your File Share. The Function can use any trigger type, not only a timer trigger.
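The exact keys in config.json are whatever scraper.py expects; the layout below is only an assumed example of how the container might read it from the mounted share.

```python
import json

# Illustrative layout only -- the real field names are defined by scraper.py:
# {
#   "username": "me@example.com",
#   "password": "secret",
#   "search_terms": ["python", "azure"],
#   "sendgrid_api_key": "SG.xxxx",
#   "recipient": "me@example.com"
# }
with open("/mnt/fileshare/config.json") as f:  # assumed mount path
    config = json.load(f)
```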


Relevancy Evaluation

Parsing includes a configurable relevancy evaluation, so that only a subset of highly relevant new results (job postings) is subsequently sent via email to the recipient. Whether a new job is emailed depends on the relevancy condition. Right now it is simply a function of search-word occurrence counts, but that works satisfactorily: Word1 and Word2 must occur in the job description, and the occurrences of Word3 and Word4 must add up to at least 3.
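As a concrete illustration, the condition could be expressed like this. The function and parameter names are hypothetical, not the ones used in scraper.py.

```python
def is_relevant(description: str, must_have: list[str], counted: list[str],
                threshold: int = 3) -> bool:
    """Word1 and Word2 (must_have) both have to appear in the description;
    the occurrences of Word3 and Word4 (counted) must sum to the threshold."""
    text = description.lower()
    if not all(w.lower() in text for w in must_have):
        return False
    return sum(text.count(w.lower()) for w in counted) >= threshold

# "python" and "azure" are required; "selenium" and "docker" together
# must occur at least 3 times -- here 1 + 2, so the job is relevant.
print(is_relevant("Python/Azure role: Selenium tests, Docker builds, Docker Compose",
                  must_have=["python", "azure"], counted=["selenium", "docker"]))
```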

To Do:

  • Email a specific error message if the container fails at runtime
  • 1-click Setup (e.g. Terraform)
