"Cassandra is an excellent fit for time series data, and it’s widely used for storing many types of data that follow the time series pattern: performance metrics, fleet tracking, sensor data, logs, financial data (pricing and ratings histories), user activity, and so on.
A great introduction to this topic is Kelley Reynolds’ Basic Time Series with Cassandra. If you haven’t read that yet, I highly recommend starting with it. This post builds on that material, covering a few more details, corner cases, and advanced techniques.
Indexes vs Materialized Views
When working with time series data, one of two strategies is typically employed: either the column values contain row keys pointing to a separate column family which contains the actual data for events, or the complete set of data for each event is stored in the timeline itself. The latter strategy can be implemented by serializing the entire event into a single column value or by using composite column names of the form <timestamp>:<event_field>.
With the first strategy, which is similar to building an index, you first fetch a set of row keys from a timeline and then multiget the matching data rows from a separate column family. This approach is appealing to many at first because it is more normalized; it allows for easy updates of events, doesn’t require you to repeat the same data in multiple timelines, and lets you easily add built-in secondary indexes to your main data column family. However, the second step of the data fetching process, the multiget, is fairly expensive and slow. It requires querying many nodes where each node will need to perform many disk seeks to fetch the rows if they aren’t well cached. This approach will not scale well with large data sets.
The second strategy, which resembles maintaining a materialized view, provides much more efficient reads. Fetching a time slice of events only requires reading a contiguous portion of a row on one set of replicas. If the same event is tracked in multiple timelines, it’s okay to denormalize and store all of the event data in each of those timelines. One of the main principles that Cassandra was built on is that disk space is very cheap resource; minimizing disk seeks at the cost of higher space consumption is a good tradeoff. Unless the data for each event is very large, I always prefer this strategy over the index strategy..."