This guest blog was written by the NewsCatcher team.
Imagine having to aggregate 40,000 news sources and over 1.2 million articles a day... every day! Quite a task.
NewsCatcher manages this task for customers around the world. Working with the Oracle for Startups program, the NewsCatcher team increased their ability to scale to meet ever-increasing customer demands, while reducing their costs to capture, extract, clean, and index the data that’s the lifeblood of their customers’ businesses.
NewsCatcher is a data-as-a-service company, and its main product is news data. NewsCatcher helps companies gather industry-specific news article datasets for analysis. With an aggregated dataset, customers can better understand market trends, identify risks, and check and validate entities for any compliance and money-laundering history.
NewsCatcher extracts, cleans, and indexes content and puts it all in one place. Data is aggregated from thousands of news sources. With a monthly subscription, users can search by specific keywords and filter out news sources by any language, country, or news source web domain. This quick demo also shows other configurations available. Customers choose NewsCatcher because it saves them time and the added effort and expense of extracting and aggregating data on their own.
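As a rough illustration of a keyword search with filters, the sketch below builds a request body in the Elasticsearch query DSL, the database the NewsCatcher team names as their main store. The function name and the field names (`content`, `language`, `country`, `domain`) are assumptions made for this example, not NewsCatcher’s actual API or schema.

```python
def build_news_query(keywords, language=None, country=None, domain=None):
    """Build an Elasticsearch-style bool query for a filtered keyword search."""
    # Hypothetical index fields: "content", "language", "country", "domain".
    filters = []
    for field, value in (("language", language), ("country", country), ("domain", domain)):
        if value is not None:
            # "term" clauses match exact values and do not change the relevance score.
            filters.append({"term": {field: value}})
    return {
        "query": {
            "bool": {
                # "match" runs a full-text search and contributes to the score.
                "must": [{"match": {"content": keywords}}],
                "filter": filters,
            }
        }
    }


query = build_news_query("electric vehicles", language="en", country="US")
```

Keeping the keyword in `must` and the language, country, and domain in `filter` means only the keyword affects ranking, while the filters simply narrow the set of articles.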
“NewsCatcher has a pipeline process of discovering news URLs, deduplicating them, extracting relevant news data and storing raw data,” says Maksym Sugonyaka, cofounder and CTO of NewsCatcher.
NewsCatcher has built the solution using Elasticsearch as the main database. With Elasticsearch, NewsCatcher can provide a quick response time and a relevant search score. But before data lands on the Elasticsearch cluster, NewsCatcher has a pipeline process of discovering news URLs, deduplicating them, extracting relevant news data and storing raw data. With over 40,000 news sources and more than 1.2 million articles a day to extract in near real time, the NewsCatcher solution has a daunting task.
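The stages named above (discover, deduplicate, extract, store) can be pictured with a small sketch. This is not NewsCatcher’s code: the function names, the URL normalization rule, and the in-memory list used as a raw store are all assumptions made for illustration, and the extraction step is a placeholder that never fetches a page.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host, drop query string and fragment, trim trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def deduplicate(urls):
    """Yield each URL whose normalized form has not been seen before."""
    seen = set()
    for url in urls:
        key = hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield url


def extract(url):
    """Placeholder extraction step: a real one would fetch and parse the page."""
    return {"url": url, "domain": urlsplit(url).netloc.lower()}


def run_pipeline(discovered_urls, raw_store):
    """Deduplicate discovered URLs, extract each one, and append it to the raw store."""
    for url in deduplicate(discovered_urls):
        raw_store.append(extract(url))
    return raw_store


discovered = [
    "https://Example.com/news/story-1/",
    "https://example.com/news/story-1?utm_source=feed",
    "https://example.org/world/story-2",
]
stored = run_pipeline(discovered, [])
```

Dropping the whole query string is a simplification: it removes tracking parameters, but it would also merge articles on sites that identify stories by query parameter, so a production rule would need to be more selective.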
The NewsCatcher MVP was built using the Oracle Cloud Infrastructure (OCI) Functions service or parallel Docker containers running on virtual servers. A challenge arose when the number of articles to extract grew. NewsCatcher realized that the current model wasn’t scalable as cloud computing costs began to increase exponentially, leading the NewsCatcher CTO to look at the options available.
The solution they found was Kubernetes, and they were happy to learn that cloud solution providers have already made commitments to help build and grow partner solutions with sets of managed services. Using Kubernetes pods to build and manage the NewsCatcher pipeline, the NewsCatcher team can scale up the cluster as the number of articles grows and scale back down when demand drops. The dynamic capabilities of Kubernetes allowed the NewsCatcher team to grow the business while keeping their costs under control, providing a win for them and for their customers.
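The article doesn’t say which scaling mechanism NewsCatcher uses. As one illustration of scaling up and back with load, the sketch below applies the replica rule documented for the Kubernetes Horizontal Pod Autoscaler: the current replica count multiplied by the ratio of the observed metric to its target, rounded up. The metric (articles waiting per pod) and the bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50):
    """Return ceil(current_replicas * current_metric / target_metric), clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    wanted = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical metric: articles waiting per pod, with a target of 300.
busy = desired_replicas(4, current_metric=900, target_metric=300)
quiet = desired_replicas(12, current_metric=50, target_metric=300)
```

With these made-up numbers, a backlog three times the target triples the pod count, and a backlog at one-sixth of the target shrinks it again. The real autoscaler also applies a tolerance band and stabilization windows, which this sketch leaves out.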
Another advantage of Kubernetes is portability: Migrating between different cloud solution providers is easy. Although it might seem counterintuitive, this portability is an advantage for independent software vendors (ISVs) and end customers. You don’t need to change the code. You only need to adjust several lines in the YAML deployment file.
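To make the portability point concrete, here is a minimal sketch that builds a Kubernetes Deployment manifest as a plain Python dictionary, mirroring the structure of the YAML file. The workload name, registry hosts, and image tag are hypothetical; the assumption is that only provider-specific values such as the image registry path change between clouds.

```python
def build_deployment(image, node_selector=None, replicas=2):
    """Return a Kubernetes Deployment manifest as a plain dict."""
    labels = {"app": "article-extractor"}  # hypothetical workload name
    pod_spec = {"containers": [{"name": "extractor", "image": image}]}
    if node_selector:
        # Node labels are often provider-specific, so they are passed in.
        pod_spec["nodeSelector"] = node_selector
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "article-extractor", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


# Hypothetical registries standing in for two different cloud providers.
provider_a = build_deployment("registry-a.example.com/news/extractor:1.0")
provider_b = build_deployment("registry-b.example.com/news/extractor:1.0")

# Top-level keys that differ between the two manifests.
changed = [key for key in provider_a if provider_a[key] != provider_b[key]]
```

In this sketch the two manifests differ only inside `spec`, at the container image, which corresponds to the few lines in the deployment file that would need adjusting.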
Historically, NewsCatcher was architected to run on the Google Cloud Platform (GCP). As the team built the initial version of NewsCatcher, they realized that, as they grew, their cloud bill was growing faster than they expected.
The realization that cloud computing costs were growing faster than expected and the need for a partner that could help them scale and grow the business led the NewsCatcher team to OCI and the Oracle for Startups program.
NewsCatcher is planning to introduce natural language processing (NLP) analysis to enrich data and add more value to the NewsCatcher pipeline. When the NewsCatcher team heard that the Oracle Startup Program makes this process easy and offers a 70% discount on cloud spend for two years, the decision to join the program and start using OCI was easy.
Because OCI already has support for Kubernetes with the Oracle Container Engine for Kubernetes (OKE), the decision was even easier.
One of the advantages of working with the Oracle team was support from the Oracle Developer Lighthouse Program team. The NewsCatcher team was pleasantly surprised that the commitment to help build their solution on OKE was serious and backed by a team that did everything from scoping the solution to migrating from the old architecture to providing guidance along the way.
“The Oracle Developer Lighthouse Program team helped me at every step, from launching my first virtual cloud network to debugging some of my code and suggesting improvements in the architecture,” said Sugonyaka.
The NewsCatcher team had the following discoveries and challenges while they migrated their solution:
Virtual cloud network (VCN): The NewsCatcher team hadn’t done much work with private cloud networks. Oracle for Startups shared some best practices, and now the NewsCatcher solution has been architected, deployed, and secured on a private network using Oracle VCNs at the core of the solution.
Kubernetes Cluster service: The ability to support autoscaling of Kubernetes clusters is critical to the NewsCatcher business because customers need more resources deployed in crunch times and removed when not in use. OKE allows for autoscaling of Kubernetes clusters.
Monitoring and reporting with partners: The NewsCatcher team needs the ability to capture metrics directly from the UI of the Kubernetes service, such as a list of all deployments, pods, and cronjobs, CPU and memory usage per pod or deployment, and logs per pod or deployment. Oracle works with partners to make monitoring OKE and Kubernetes deployments easier than ever. NewsCatcher connected all the Kubernetes clusters to Datadog and was immediately able to monitor key metrics by cluster, pod, or deployment and observe the management logs. For more information, see Datadog and OKE and Monitoring Modern Container Infrastructure.
Continuous integration and deployment (CI/CD) to automate the NewsCatcher deployment: Every deployment requires streamlining and automation, including the ability to autoscale with no need for human intervention. The NewsCatcher team took advantage of OCI’s CI/CD capabilities to speed development and deployment to production servers.
“We had a smooth migration to OCI primarily due to the ease of communicating directly with the Oracle for Startups and the Oracle Developer Lighthouse Program teams. They showed how easy it is to get started and to be productive on the OCI platform, which is something that I struggled with on other cloud providers,” said Sugonyaka.
As NewsCatcher continues to grow the business, the team is discovering more OCI services and being introduced to some of the new technologies being added to the OCI platform. For example, NewsCatcher has been participating in the beta of OpenSearch from OCI, and they plan to use OCI Language to enrich their planned NLP analysis capabilities.
The ability to aggregate thousands of news sources and millions of articles quickly and cost effectively drives NewsCatcher’s business and is what customers expect. As demand continues to grow, NewsCatcher can scale to meet the performance expectations of customers, while maintaining and managing the spend for their cloud computing costs.
NewsCatcher is a member of Oracle for Startups, the launchpad to integrate and scale with Oracle technology, expertise, and global reach. The program offers 70% off cloud, technical assistance, and mentoring to startups and scaleups.
To explore the NewsCatcher and Oracle partnership activities further, see the following links: