By Jeff Carpenter

You might have heard of Apache Cassandra, the open-source NoSQL database. And you might know that some big, very successful companies rely on it, including LinkedIn, Netflix, The Home Depot, and Apple.

But did you know that Cassandra is used by a huge range of companies — including small, cloud-native application builders, financial firms, and broadcasters?

Here, I’ll give you an overview of Cassandra, along with a few reasons why this database might just be the right way to persist data at your organization and ensure your data and the apps that your developers build on it are infinitely scalable, secure, and fast.

A (very abridged) look at the database landscape

Many people in technology first became familiar with relational databases like Oracle DB or MySQL. They’re very powerful because they ensure data consistency and availability at the same time, and they’re effective and relatively easy to use — as long as your databases are running on the same machine.

But if you need to run more transactions or need more space to store your data, you’ll run into upper limits pretty quickly, as relational databases can’t scale efficiently.

The solution? Split the data among multiple machines and create a distributed system. NoSQL (“Not only SQL”) databases were invented to cope with these new requirements of volume (capacity), velocity (throughput), and variety (format) of big data.

NoSQL was born out of necessity. The rise of Big Tech over the past decade has driven the global datasphere to grow 15-fold, and relational databases simply can’t cope with the new data volume or performance requirements. Huge global operations like Google, Facebook, and LinkedIn created NoSQL databases so they could scale efficiently, go global, and achieve zero downtime.

Cassandra’s early days

In the mid-2000s, engineers at young, fast-growing Facebook had a problem: how could they store and access the mushrooming data created by Messenger, the platform that enabled users of the social networking site to communicate with one another? Nothing on the market could handle the hundreds of millions of users that were on the platform at peak times, served by tens of thousands of servers in data centers around the world.

So, Facebook’s team built their own database to enable users to search their Messenger inboxes. It replicated data across geographies to keep latencies down, handled billions of writes per day, and could scale as the number of users grew. (You can geek out on the original Facebook Cassandra paper, authored by its creators, here).

As it became clear that this technology was suitable for other purposes, the company gave Cassandra to the Apache Software Foundation (ASF), where it became an open-source project (it was promoted to a top-level project in 2010).

Cassandra’s scalability is impressive, but its reliability also sets it apart among databases. Because its data is replicated across multiple, geographically distributed data centers, Cassandra’s uptime and disaster recovery capabilities are unparalleled. This quickly caught the eye of other rising web stars, like Netflix. The company launched its streaming service in 2007 using an Oracle database housed in a single data center. Its rapid growth quickly highlighted the danger of keeping data behind a single point of failure. By 2013, most of Netflix’s data was housed in Cassandra.

Cassandra has become the de facto standard database for high-growth applications that need reliability, high performance, and scalability: it’s used by approximately 90% of the Fortune 100, and a bunch of relatively recent developments are making it even more accessible to a wider range of organizations.

Why Cassandra?

Let’s quickly recap some of the unique capabilities of Cassandra:

Scalability – There are essentially no limitations on volume and velocity. Because it’s partitioned over a distributed architecture, Cassandra is capable of handling various data types at petabyte scale.

Speed – Read-write performance is unmatched, thanks in part to Cassandra’s distributed nature: it can operate across multiple instances called “nodes.” A single node is very performant, but a cluster with multiple nodes and data centers brings throughput to the next level. Decentralization means that every node can deal with any request, read, or write.

Availability – Theoretically, organizations can achieve 100% uptime thanks to data replication, decentralization, and a topology-aware placement strategy that replicates to multiple data centers, eliminating the waste associated with the traditional practice of maintaining duplicative infrastructure for disaster recovery.

Geographically distributed – Multi-data center deployments provide exceptional disaster tolerance while keeping data close to clients around the globe, reducing latency (learn more about global data distribution here).

Platform and vendor agnostic – Cassandra isn’t bound to any platform or service provider, which enables organizations to build hybrid- and multi-cloud solutions. It also doesn’t belong to any commercial vendor; the fact that it’s offered by the open-source, non-profit ASF means it’s openly available and continuously improving.
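To make the partitioning and replication ideas above a little more concrete, here is a minimal Python sketch of a token ring. It is an illustration only, not Cassandra’s implementation: the `Ring` class, the node names, and the use of MD5 as the hash function are assumptions made for this example (Cassandra’s default partitioner is based on Murmur3, and each node normally owns several tokens rather than one).

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns the range that ends at its token."""

    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str):
        """Return the nodes that hold a copy of this partition."""
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self._tokens, token(partition_key))
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


# Hypothetical five-node cluster keeping three copies of every partition.
ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("user:42")
print(owners)
```

Because every node can compute the same answer from the key alone, any node can route a read or write to the right replicas, which is the decentralization described above.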

For more details, see this excellent Cassandra overview provided by the ASF.

Why Cassandra for your organization?

Online banking services, airline booking systems, and popular retail apps. These modern applications and workloads — many of which operate at huge, distributed scale — should never go down. Cassandra’s seamless and consistent ability to scale to hundreds of terabytes, along with its exceptional performance under heavy loads, has made it a key part of the data infrastructures of companies that operate these kinds of applications.

For instance, Best Buy, the world’s biggest multichannel consumer electronics retailer, describes Cassandra as “flawless” in how it handles huge spikes in holiday shopping traffic.

But Cassandra isn’t just for big, established sector leaders like Best Buy or Bloomberg. It’s a powerful data store for developers and architects who build high-growth applications at organizations of all sizes. Consider Praveen Viswanath, a cofounder of Alpha Ori Technologies, which offers an IoT platform that acquires data from ships and provides processing and analytics for their operators.

Having experienced the power of the NoSQL database in earlier roles, Viswanath again turned to Cassandra, delivered via DataStax’s Astra DB cloud service, for its distributed reliability and high throughput. Alpha Ori’s platform required the constant gathering of thousands of data points from the 40 or so major systems aboard the more than 260 ships that it served.

Because of his team’s need to focus on development rather than database operation, Viswanath chose the Astra DB managed service, a serverless solution that scales up and down when needed.

A flourishing ecosystem

The availability of Cassandra as a managed service is one way that this powerful database is reaching more organizations. But there’s also an ecosystem of complementary open-source technologies that have sprung up around Cassandra to make it simpler for developers to build apps with it.

Stargate is an open-source data gateway that provides a pluggable API layer that greatly simplifies developer interaction with any Cassandra database. REST, GraphQL, Document, and gRPC APIs make it easy to just start coding with Cassandra without having to learn the complexities of CQL and Cassandra data modeling.

K8ssandra is another open-source project that demonstrates this approachability, making it possible to deploy Cassandra on any Kubernetes engine, from the public cloud providers to VMware and OpenStack. K8ssandra extends the Kubernetes promise of application portability to the data tier, making it easier to avoid vendor lock-in.

A vibrant future

As a highly active open source project, Cassandra is always being updated and extended by a vibrant community of very smart people at companies like Apple, Netflix, and my employer, DataStax. Indeed, the Apache Software Foundation today announced the general availability of Cassandra 4.1. Through exciting innovations like ACID transaction support (long a holy grail of distributed NoSQL databases) and improved indexing, we are working to make Cassandra more powerful, easy to use, and ready for the future.

Want to learn more about Apache Cassandra? Register now for the Cassandra Summit, which takes place in San Jose, Calif., March 13-14, 2023.

About Jeff Carpenter:

DataStax

Jeff has worked as a software engineer and architect in multiple industries and as a developer advocate helping engineers succeed with Apache Cassandra. He’s involved in multiple open source projects in the Cassandra and Kubernetes ecosystems including Stargate and K8ssandra. Jeff is coauthor of the O’Reilly books Cassandra: The Definitive Guide and Managing Cloud Native Data on Kubernetes.

By Aaron Ploetz, Developer Advocate

There are many statistics that link business success to application speed and responsiveness. Google tells us that a one-second delay in mobile load times can impact mobile conversions by up to 20%. And a 0.1 second improvement in load times improved retail customer engagement by 5.2%, according to a study by Deloitte.

It’s not only the whims and expectations of consumers that drive the need for real-time or near real-time responsiveness. Think of a bank’s requirement to detect and flag suspicious activity in the fleeting moments before real financial damage can happen. Or an e-tailer providing locally relevant product promotions to drive sales in a store. Real-time data is what makes all of this possible.

Let’s face it – latency is a buzz kill. The time that it takes for a database to receive a request, process the transaction, and return a response to an app can be a real detriment to an application’s success. Keeping it at acceptable levels requires an underlying data architecture that can handle the demands of globally deployed real-time applications. The open source NoSQL database Apache Cassandra® has two defining characteristics that make it perfectly suited to meet these needs: it’s geographically distributed, and it can respond to spikes in traffic without adverse effects on its unmatched throughput and low latency.

Let’s explore what both of these mean to real-time applications and the businesses that build them.

Real-time data around the world

Even as the world has gotten smaller, exactly where your data lives still makes a difference in terms of speed and latency. When users reside in disparate geographies, supporting responsive, fast applications for all of them can be a challenge.

Say your data center is in Ireland, and you have data workloads and end users in India. Your data might pass through several routers to get to the database, and this can introduce significant latency into the time between when an application or user makes a request and the time it takes for the response to be sent back.

To reduce latency and deliver the best user experience, the data needs to be as close to the end user as possible. If your users are global, this means replicating data in the geographies where they reside.

Cassandra, built by Facebook in 2007, is designed as a distributed system for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically tailored for deployment across multiple data centers. These features are robust and flexible enough that you can configure clusters (collections of Cassandra nodes, which are visualized as a ring) for optimal geographical distribution, for redundancy, for failover and disaster recovery, or even for creating a dedicated analytics center that’s replicated from your main data storage centers.
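As a rough illustration of what configuring a cluster for geographical distribution can look like, the sketch below builds a CQL `CREATE KEYSPACE` statement that uses `NetworkTopologyStrategy` with a replica count per data center. The helper function, the keyspace name `shop`, and the data center names `eu_west` and `ap_south` are hypothetical; in a real cluster the data center names must match the ones the cluster itself reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with per-data-center replication."""
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc_name, replication_factor in replicas_per_dc.items():
        # Each entry says how many copies of the data live in that data center.
        options.append(f"'{dc_name}': {int(replication_factor)}")
    option_map = "{" + ", ".join(options) + "}"
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {option_map};"
    )


# Hypothetical layout: three replicas in Europe and three in India.
statement = create_keyspace_cql("shop", {"eu_west": 3, "ap_south": 3})
print(statement)
```

With a layout like this, users in India can be served from the replicas in their own region instead of reaching across the world to Ireland.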

But even if your data is geographically distributed, you still need a database that’s designed for speed at scale.

The power of a fast, transactional database

NoSQL databases primarily evolved over the last decade as an alternative to single-instance relational database management systems (RDBMS), which had trouble keeping up with the throughput demands and sheer volume of web-scale internet traffic.

They solve scalability problems through a process known as horizontal scaling, where multiple server instances of the database are linked to each other to form a cluster.
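One reason horizontal scaling works well on a hash ring is that adding a server only takes over part of the key space; the rest of the data stays where it is. The Python sketch below demonstrates that property under simplifying assumptions: one token per node, MD5 as a stand-in hash, and made-up node and key names. Real Cassandra clusters typically give each node many tokens, which spreads the load more evenly than this toy version.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes):
    """Return the node tokens and names, sorted by position on the ring."""
    ring = sorted((token(name), name) for name in nodes)
    return [t for t, _ in ring], [name for _, name in ring]


def owner(ring, key: str) -> str:
    """Return the first node at or after the key's token, wrapping around."""
    tokens, names = ring
    index = bisect.bisect_left(tokens, token(key)) % len(names)
    return names[index]


keys = [f"order:{i}" for i in range(10_000)]
three_nodes = build_ring(["node-a", "node-b", "node-c"])
four_nodes = build_ring(["node-a", "node-b", "node-c", "node-d"])

before = {key: owner(three_nodes, key) for key in keys}
after = {key: owner(four_nodes, key) for key in keys}

# Keys whose owner changed when the fourth node joined the cluster.
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that changes owner moves to the new node, so existing nodes never have to shuffle data among themselves when the cluster grows.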

Some NoSQL database products were also engineered with data center awareness, meaning the database is configured to logically group together certain instances to optimize the distribution of user data and workloads. Cassandra is both horizontally scalable and data-center aware. 
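Data center awareness also shapes how requests are acknowledged. Cassandra offers a `LOCAL_QUORUM` consistency level, which waits for a majority of the replicas in the client’s own data center rather than for replicas on another continent. The arithmetic can be sketched in a few lines of Python; the function names and the example replica counts are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def local_quorum_available(replicas_up_in_local_dc: int, local_rf: int) -> bool:
    """Can a LOCAL_QUORUM request succeed with this many local replicas up?"""
    return replicas_up_in_local_dc >= quorum(local_rf)


# With three replicas per data center, two acknowledgements are enough,
# so the local data center can lose one replica and keep serving requests.
print(quorum(3))                     # 2
print(local_quorum_available(2, 3))  # True
print(local_quorum_available(1, 3))  # False
```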

Cassandra’s seamless and consistent ability to scale to hundreds of terabytes, along with its exceptional performance under heavy loads, has made it a key part of the data infrastructures of companies that operate real-time applications – the kind that are expected to be extremely responsive, regardless of the scale at which they’re operating. Think of the modern applications and workloads that have to be reliable, like online banking services, or those that operate at huge, distributed scale, such as airline booking systems or popular retail apps.

Logate, an enterprise software solution provider, chose Cassandra as the data store for the applications it builds for clients, including user authentication, authorization, and accounting platforms for the telecom industry.

“From a performance point of view, with Cassandra we can now achieve tens of thousands of transactions per second with a geo-redundant set-up, which was just not possible with our previous application technology stack,” said Logate CEO and CTO Predrag Biskupovic.

Or what about Netflix? When it launched its streaming service in 2007, it used an Oracle database in a single data center. As the number of users and devices (and data) grew rapidly, the limitations on scalability and the potential for failures became a serious threat to Netflix’s success. Cassandra, with its distributed architecture, was a natural choice, and by 2013, most of Netflix’s data was housed there. Netflix still uses Cassandra today, but not only for its scalability and rock-solid reliability. Its performance is key to the streaming media company: Cassandra runs 30 million operations per second on its most active single cluster, and 98% of the company’s streaming data is stored on Cassandra.

Cassandra has been shown to perform exceptionally well under heavy load. It can sustain very high write throughput even on a basic commodity workstation, and it keeps all of its desirable properties as more servers are added, without sacrificing performance.

Business decisions that need to be made in real time require high-performing data storage, wherever the principal users may be. Cassandra enables enterprises to ingest and act on that data in real time, at scale, around the world. If your organization needs to act quickly on business data, Cassandra can help you get there.

Learn more about DataStax here.

About Aaron Ploetz:

DataStax

Aaron has been a professional software developer since 1997 and has several years of experience working on and leading DevOps teams for startups and Fortune 50 enterprises.
