Deploy web scraper
1/7/2024

This post is outdated now that AWS Lambda allows users to create and distribute layers with all sorts of plugins and packages, including Selenium and chromedriver. Here's a post on how to make such a layer. And here's a list of useful pre-packaged layers. This post should be used as a historical reference only.

TL;DR: This post details how to get a web scraper running on AWS Lambda using Selenium and a headless Chrome browser, while using Docker to test locally. It's based on this guide, but that guide didn't work for me because the versions of Selenium, headless Chrome and chromedriver were incompatible. What did work was the following:

EDIT: The versions above are no longer supported. According to this GitHub issue, these versions work well together:

I recently spent several frustrating weeks trying to deploy a Selenium web scraper that runs every night on its own and saves the results to a database on Amazon S3. With this post, I hope to spare you from wanting to smash all computers with a sledgehammer.

I wanted to scrape a government website that is updated every night, detect new additions, alert me by email when something new is found, and save the results. I could have run the script on my own computer with a cron job on Mac or a scheduled task on Windows. But desktop computers are unreliable: they can get unplugged accidentally, or restart because of an update.
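To make the Selenium-plus-headless-Chrome setup concrete, here is a minimal sketch of how the browser is typically configured inside Lambda. The binary paths (`/opt/headless-chromium`, `/opt/chromedriver`) and the exact flag set are assumptions for illustration, not the post's exact configuration; the `executable_path` argument matches the older Selenium versions this post is about.

```python
# Sketch of launching headless Chrome under Selenium in AWS Lambda.
# Paths and flags below are assumptions, not the post's exact config.

def chrome_flags():
    """Flags commonly needed to run Chrome in Lambda's restricted sandbox."""
    return [
        "--headless",               # no display server in Lambda
        "--no-sandbox",             # Lambda's container lacks sandbox support
        "--disable-gpu",
        "--single-process",
        "--disable-dev-shm-usage",  # /dev/shm is very small in Lambda
        "--window-size=1280,1024",
    ]

def make_driver():
    # selenium is imported lazily so the flag logic above can be
    # exercised without a browser installed.
    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    opts.binary_location = "/opt/headless-chromium"  # assumed bundle path
    for flag in chrome_flags():
        opts.add_argument(flag)
    # executable_path is the Selenium 3.x style used in this era's guides
    return webdriver.Chrome(executable_path="/opt/chromedriver", options=opts)
```

Running the same container image locally with Docker lets you verify these flags and binary paths before deploying to Lambda.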
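The nightly pipeline described above (scrape, detect new additions, save to S3, email an alert) can be sketched roughly as follows. The bucket name, object key, and email addresses are placeholders, and SES is one plausible choice for the email step; none of these specifics come from the post.

```python
# Sketch of the nightly job: diff tonight's scrape against the last saved
# run on S3, email when something new appears, then store the new results.
# Bucket, key, and addresses are placeholders, not from the post.
import json

def new_items(previous, current):
    """Return items present in tonight's scrape but absent from the last run."""
    return sorted(set(current) - set(previous))

def run_nightly(scraped_items, bucket="my-scraper-bucket", key="results.json"):
    import boto3  # lazy import: not needed to test the diff logic alone

    s3 = boto3.client("s3")
    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        previous = json.loads(body)
    except s3.exceptions.NoSuchKey:
        previous = []  # first run: nothing saved yet

    added = new_items(previous, scraped_items)
    if added:
        ses = boto3.client("ses")
        ses.send_email(
            Source="alerts@example.com",            # placeholder sender
            Destination={"ToAddresses": ["me@example.com"]},
            Message={
                "Subject": {"Data": f"{len(added)} new item(s) found"},
                "Body": {"Text": {"Data": "\n".join(added)}},
            },
        )

    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(scraped_items))
    return added
```

Keeping the comparison in a pure function like `new_items` makes the detection logic easy to test without touching AWS at all.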