How Daily.Dev Built a Low-Budget Serverless Scraping Pipeline for Online Articles


Cloud Functions to the Rescue

I’m a big fan of the Google Cloud Platform so I started looking at the available managed solutions GCP offers.

Why managed? Because we are a small team and can’t afford to manage infrastructure, even if it means vendor lock-in. Considering all the available products, I found Cloud Functions (similar to Amazon Web Services’ Lambda service) to be the best fit for our architecture. It scales down to zero, which is important for cost, and it also supports massive scale, way beyond our requirements.

Every step in our pipeline can be deployed as a separate function, which gives us the flexibility to choose the right programming language and runtime environment for each one. For example, I’m more familiar with Cloudinary’s JavaScript SDK, so it makes sense to use JavaScript for the image processing function. Python is a great choice for NLP, which is also a step in our pipeline.

Another important cost factor is that we can set the hardware requirements per function. Cloud Functions supports HTTP and Pub/Sub triggers, and it has a very generous free tier. But it does come with some compromises that we have to consider. Cloud Functions is a proprietary Google Cloud solution. The tools for local development are very basic, so you have to hack your way around both development and testing. And compared to Docker-based solutions, Cloud Functions has limited runtime support.
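To make the Pub/Sub trigger more concrete, here is a minimal sketch of what one step could look like in Python. It assumes the background-function shape, where the function receives an event whose "data" field holds the base64-encoded message body. The function name, the enrichment logic, and the make_event helper are hypothetical and only for illustration; they are not our actual code.

```python
import base64
import json


def enrich_post(event, context=None):
    """Entry point shaped like a Pub/Sub-triggered background function.

    Assumption: event["data"] carries the base64-encoded JSON message body.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # Hypothetical enrichment: add one field and pass everything else through.
    payload["title_length"] = len(payload.get("title", ""))
    return payload


def make_event(message):
    """Build a fake event for local testing that mimics the Pub/Sub envelope."""
    data = base64.b64encode(json.dumps(message).encode("utf-8")).decode("ascii")
    return {"data": data}
```

A helper like make_event is one way to work around the basic local tooling: the handler can be called as a plain function in a unit test, without deploying anything.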

Given the simple nature of our steps and the fact that they are so independent of each other, I think that the pros outweigh the cons, so Cloud Functions it is. For subscribing to the RSS feeds specifically, we use Superfeedr, a managed service that triggers a webhook when a feed changes. It is pretty expensive in my opinion ($1 per month for every 10 feeds), and it could definitely be a better product, but it does its job.
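As a rough illustration of the webhook step, the sketch below parses a feed notification and hands one message per new item to a publish callback. The payload shape (an "items" list with "permalinkUrl" and "title" fields), the "post-fetched" topic name, and the publish callback are all assumptions made for this example, so check them against the real notification format before relying on them.

```python
import json


def handle_feed_webhook(body, publish):
    """Parse a feed notification and publish one message per new item.

    Assumptions: body is a JSON string with an "items" list, and
    publish(topic, message) stands in for the real queue client.
    """
    notification = json.loads(body)
    published = 0
    for item in notification.get("items", []):
        url = item.get("permalinkUrl")
        if not url:
            continue  # nothing to scrape without a link
        publish("post-fetched", {"url": url, "title": item.get("title", "")})
        published += 1
    return published
```

Passing publish in as a parameter keeps the handler independent of the queue client, so it can be tested with a plain list.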

It was the right solution for getting started quickly because it reduced both the amount of development required and the ongoing operations. A few years later, we’re now considering building our own solution for subscribing to RSS feeds, but that’s a story for a different time.

Orchestration vs. Choreography

When dealing with distributed workflows like ours, there is always the question of service orchestration versus service choreography. Orchestration means that a dedicated service supervises the whole process from A to Z. The supervisor calls each service in the right order with the right arguments. It also has to deal with errors and unexpected events.

Service choreography means that every service invokes the next service in the process, either synchronously (over HTTP, for example) or asynchronously (through a message queue of some sort). When following the choreography pattern, some of the workflow logic has to be implemented as part of each service, and each service has to be aware of the next service in line.
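The difference between the two patterns can be sketched in a few lines of Python. This is a toy model under stated assumptions: the three steps and the topic names are hypothetical, and the Bus class is an in-memory stand-in for a real message queue.

```python
from collections import defaultdict, deque


# Three hypothetical steps; each one enriches the data of its predecessor.
def scrape(post):
    return {**post, "html": "<html>...</html>"}


def extract_keywords(post):
    return {**post, "keywords": ["serverless"]}


def process_image(post):
    return {**post, "image": "optimized.jpg"}


def orchestrate(post):
    """Orchestration: a single supervisor calls every step in order."""
    for step in (scrape, extract_keywords, process_image):
        post = step(post)
    return post


class Bus:
    """Tiny in-memory stand-in for a message queue."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, message):
        self.queue.append((topic, message))

    def run(self):
        # Deliver queued messages until no handler publishes anything new.
        while self.queue:
            topic, message = self.queue.popleft()
            for handler in self.handlers[topic]:
                handler(message)


def choreograph(post):
    """Choreography: each step only knows the topic of the next step."""
    bus = Bus()
    done = []
    bus.subscribe("new-post", lambda m: bus.publish("scraped", scrape(m)))
    bus.subscribe("scraped", lambda m: bus.publish("keywords-ready", extract_keywords(m)))
    bus.subscribe("keywords-ready", lambda m: bus.publish("post-processed", process_image(m)))
    bus.subscribe("post-processed", done.append)
    bus.publish("new-post", post)
    bus.run()
    return done[0]
```

Both functions produce the same enriched post. The difference is where the ordering lives: in one loop inside orchestrate, or spread across the subscriptions in choreograph.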

* Credit to StackOverflow for the images

Our workflow is very straightforward, with no conditions and no complex execution graph. Each service enriches the data of its predecessor. I didn’t want to manage state for the execution of every post or introduce a single point of failure, so I decided to follow the choreography pattern.

The services use Google Pub/Sub to asynchronously communicate. I would like to highlight that with the latest release of Google Workflows, a managed supervisor for service orchestration, I might rethink my decision. With Workflows, I can get all the benefits of service orchestration without the need to develop or maintain the orchestrator itself.

Above you can see the existing architecture of the pipeline. Every box is a cloud function except for the API, which is our server that subscribes to the post-processed event. Upon this event, the API adds a new entity to the database, making it available to all users. Superfeedr triggers the webhook with an HTTP request, and all the rest communicate with messages through Google Pub/Sub.

Monitoring and Error Handling

We can’t introduce a new architecture without considering monitoring and error handling. I won’t go into details but just cover the important takeaways.

First, we need to monitor our message queues. My alarm fires when at least one unacknowledged (unacked) message stays in the queue for five minutes. The latency of the articles pipeline is usually very low, so a message doesn’t stay in the queue for long. If one does, we need to know, and we need to know fast.
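The alert condition itself is simple enough to express as a small function. The sketch below is not how the monitoring service evaluates alerts internally; it only illustrates the rule, assuming we have time-ordered samples of the backlog size. The function name and the sample format are hypothetical.

```python
def should_alert(samples, threshold=1, window_seconds=300):
    """Return True if the backlog stayed at or above threshold for a full window.

    samples is a time-ordered list of (timestamp_seconds, unacked_count) pairs.
    """
    breach_started = None
    for timestamp, unacked in samples:
        if unacked >= threshold:
            if breach_started is None:
                breach_started = timestamp  # first sample of this breach
            if timestamp - breach_started >= window_seconds:
                return True
        else:
            breach_started = None  # backlog drained, reset the clock
    return False
```

A backlog that drains within the window never fires the alert, which matches the expectation that messages normally leave the queue quickly.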

The second aspect is the application errors that could occur during runtime. I use Google Error Reporting which notifies me in real-time of any new unexpected error in the service. Of course, there are many alternatives to Error Reporting, but I find it easier as it integrates perfectly with the rest of the cloud services.

Lastly, we need to think about the retry strategy in case of an error. Luckily, the message queue offers some strategies out of the box, including a dead-letter queue, so we can later inspect the messages that the system couldn’t process.
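To show what the queue is doing on our behalf, here is a minimal in-memory sketch of retries with a dead-letter queue. It is a simplification, not the real mechanism: the deliver function, the default of five attempts, and the list used as a dead-letter queue are assumptions for illustration, and a real queue would also wait between attempts.

```python
def deliver(message, handler, dead_letters, max_attempts=5):
    """Call handler until it succeeds or the attempts run out.

    Returns the attempt number that succeeded, or None if the message
    was moved to dead_letters after max_attempts failures.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt  # success counts as an acknowledgement
        except Exception as error:
            last_error = error  # a real queue would back off before redelivering
    # Keep the message and the last error so it can be inspected later.
    dead_letters.append({"message": message, "error": repr(last_error)})
    return None
```

A temporary failure is absorbed by the retries, while a message that can never be processed ends up in the dead-letter list instead of blocking the queue.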

Cost Analysis

The cost analysis for this architecture is very simple. Cloud Functions includes 2 million free invocations per month, and the Google Pub/Sub free tier covers 10GB per month. This means that at 50,000 articles per month, everything falls under the GCP free tier, which is incredible!
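The free tier claim can be checked with some back-of-the-envelope arithmetic. The article volume and the free tier limits come from the text above. The number of function invocations per article, the average message size, and counting each message twice (once published, once delivered) are assumptions I picked for the example, so plug in your own numbers.

```python
ARTICLES_PER_MONTH = 50_000        # from the text
FREE_INVOCATIONS = 2_000_000       # Cloud Functions free tier, from the text
FREE_PUBSUB_BYTES = 10 * 1024**3   # Pub/Sub free tier (10GB), from the text

STEPS_PER_ARTICLE = 5              # assumption: function invocations per article
MESSAGE_BYTES = 10 * 1024          # assumption: average message size of 10 KB


def monthly_usage(articles, steps, message_bytes):
    """Estimate monthly invocations and Pub/Sub throughput in bytes."""
    invocations = articles * steps
    # Assumption: each message is counted once when published and once when delivered.
    pubsub_bytes = invocations * message_bytes * 2
    return invocations, pubsub_bytes


invocations, pubsub_bytes = monthly_usage(
    ARTICLES_PER_MONTH, STEPS_PER_ARTICLE, MESSAGE_BYTES
)
print(f"invocations: {invocations:,} of {FREE_INVOCATIONS:,} free")
print(f"Pub/Sub bytes: {pubsub_bytes:,} of {FREE_PUBSUB_BYTES:,} free")
```

Under these assumptions the pipeline uses 250,000 invocations and about 5.1 billion bytes of Pub/Sub throughput per month, both well inside the limits.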

For this implementation and scale, we pay $0 for infrastructure, which includes Cloud Functions, Pub/Sub, monitoring, and error reporting. Superfeedr is the only cost for this architecture, at $1 per 10 feeds per month. In total, we pay $50 per month, which at that rate works out to about 500 feeds. Cloudinary is excluded because I count it as part of our API architecture. In any case, the real cost of Cloudinary is bandwidth rather than storage, and bandwidth is not relevant to this pipeline.

Conclusion

In this post, we introduced a new serverless pipeline for scraping articles. We’ve covered the pros and cons of Cloud Functions and why it’s a cost-effective solution. We then compared the different techniques to orchestrate our workflow, followed by the necessary aspects of monitoring. Lastly, we did a bit of cost analysis to understand how much it costs.

InApps Technology is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.

Source: InApps.net

Phu Nguyen

As a Senior Tech Enthusiast, I bring a decade of experience to the realm of tech writing, blending deep industry knowledge with a passion for storytelling. With expertise ranging from software development to emerging tech trends like AI and IoT, my articles not only inform but also inspire. My journey in tech writing has been marked by a commitment to accuracy, clarity, and engaging storytelling, making me a trusted voice in the tech community.