By leveraging the scalability and unique database characteristics of Apache Cassandra, MERA has successfully produced a ‘big data’ system: a high-capacity aggregator of records from social media and other sources on the internet.
Besides simply storing these messages in a database, MERA also provides enhanced analytics on a message-by-message basis to enable additional features, such as full-text search within the stored information.
System cost is always a major factor in system design, so we looked at publicly available open source solutions as our first option. After an in-depth search of available systems, MERA chose the Cassandra database as a core component of our solution. While ensuring optimal ‘access speed’, Cassandra was also found to scale at runtime and to provide ‘replicated storage’.
Scaling at runtime is important for a growing web service because it allows a system to expand as demand increases, spreading capital expenditure over time.
At the same time, we found that Cassandra did not support ‘full text search’, so an additional component had to be added to the final solution: the Apache Solr indexing engine. By combining Cassandra and Solr, MERA has been able to provide competitive features along with deep capacity.
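The division of labour between the two components can be illustrated with a minimal sketch. It is not MERA's implementation and does not use the real Cassandra or Solr client libraries: `MessageStore` and `TextIndex` are hypothetical in-memory stand-ins that only play the roles of the record store and the full-text index, and `ingest` and `find` are illustrative helper names.

```python
from collections import defaultdict


class MessageStore:
    """Hypothetical in-memory stand-in for the record store (Cassandra's role)."""

    def __init__(self):
        self._rows = {}

    def put(self, message_id, text):
        self._rows[message_id] = text

    def get(self, message_id):
        return self._rows[message_id]


class TextIndex:
    """Hypothetical in-memory stand-in for the full-text index (Solr's role)."""

    def __init__(self):
        # term -> ids of the messages that contain it
        self._postings = defaultdict(set)

    def add(self, message_id, text):
        for term in text.lower().split():
            self._postings[term].add(message_id)

    def search(self, query):
        # Return the ids of messages that contain every term in the query.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def ingest(store, index, message_id, text):
    """Save the message, then index it so it becomes searchable."""
    store.put(message_id, text)
    index.add(message_id, text)


def find(store, index, query):
    """Resolve a text query to ids via the index, then fetch full messages."""
    return [store.get(mid) for mid in sorted(index.search(query))]


store, index = MessageStore(), TextIndex()
ingest(store, index, 1, "Cassandra scales at runtime")
ingest(store, index, 2, "Solr adds full text search")
ingest(store, index, 3, "Cassandra and Solr work together")
print(find(store, index, "cassandra solr"))
```

The point of the sketch is the split of responsibilities: the store answers lookups by key, while the index answers "which messages contain these words" and hands back keys for the store to resolve.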
Today, our system is a prototype solution running in our MERA lab. Using only six servers, we can provide 10 TB of data storage at a rate of 40,000 messages/second (limited purely by hard drive throughput).
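To put these figures in perspective, the arithmetic below is a rough sketch rather than a measurement. The server count and message rate come from the text. The storage figure is read as 10 decimal terabytes, and the average message size is purely an assumption, since the article does not state it. The even split across servers is also an idealisation.

```python
SERVERS = 6                      # from the text
TOTAL_RATE = 40_000              # messages/second, from the text
CAPACITY_BYTES = 10 * 10**12     # storage figure read as 10 decimal terabytes
AVG_MESSAGE_BYTES = 1_000        # ASSUMPTION: average message size is not given

# Idealised even split of the ingest rate across the servers.
per_server_rate = TOTAL_RATE / SERVERS

# Raw ingest bandwidth implied by the assumed message size.
bytes_per_second = TOTAL_RATE * AVG_MESSAGE_BYTES

# Time to fill the stated capacity at the sustained rate,
# ignoring any indexing or replication overhead.
seconds_to_fill = CAPACITY_BYTES / bytes_per_second
days_to_fill = seconds_to_fill / 86_400

print(f"per server: {per_server_rate:.0f} messages/second")
print(f"ingest bandwidth: {bytes_per_second / 10**6:.0f} MB/s")
print(f"time to fill: {days_to_fill:.2f} days")
```

Under these assumptions each server handles roughly 6,700 messages/second, and the whole capacity would fill in about three days of sustained peak load. A different average message size changes the last two figures proportionally.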
All saved messages are automatically indexed and instantly available for search. Interestingly, disk usage for replicated data is relatively modest thanks to Cassandra’s built-in data compression. Overall, the indexed and replicated message data is only 6% larger than the original data content.
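The 6% figure translates into capacity planning as follows. This is a small worked sketch, assuming the overhead applies uniformly to all data. The function names are illustrative, and the 10 TB disk budget is used only as an example input.

```python
OVERHEAD = 0.06  # from the text: stored data is 6% larger than the original


def on_disk_size(original_bytes, overhead=OVERHEAD):
    """Disk space needed for a given amount of original message data."""
    return original_bytes * (1 + overhead)


def original_capacity(disk_bytes, overhead=OVERHEAD):
    """Original message data that fits into a given amount of disk space."""
    return disk_bytes / (1 + overhead)


TB = 10**12  # decimal terabyte

print(f"1 TB of messages needs {on_disk_size(1 * TB) / TB:.2f} TB on disk")
print(f"10 TB of disk holds {original_capacity(10 * TB) / TB:.2f} TB of messages")
```

In other words, indexing and replication together cost about 0.06 TB of extra disk per terabyte of original data, so a 10 TB disk budget holds roughly 9.43 TB of original messages.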