Challenge your understanding of essential data modeling best practices in Cassandra, focusing on design strategies, query patterns, denormalization, partitioning, and schema choices to optimize performance and scalability.
Which of the following is a recommended approach when choosing a primary key for a Cassandra table?
Explanation: Choosing fields that ensure uniqueness and match query requirements is essential for effective primary key selection in Cassandra. Simply copying primary key structures from relational databases may overlook Cassandra's different query model. Randomly picking fields may not reflect query access patterns. Auto-incremented IDs can cause hotspots and are not recommended in distributed environments.
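As a minimal sketch, a hypothetical `users_by_email` table whose primary key is chosen to be unique and to match the query it serves:

```cql
-- Hypothetical table: the primary key is both unique per user
-- and matches the query "find a user by email address".
CREATE TABLE users_by_email (
    email     text,
    user_id   uuid,
    full_name text,
    PRIMARY KEY ((email))
);

-- The access pattern the key was designed for:
SELECT user_id, full_name FROM users_by_email
WHERE email = 'ada@example.com';
```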
Why is denormalization considered a best practice when modeling data in Cassandra?
Explanation: Denormalizing data enables queries to access all necessary information in a single table, avoiding the need for joins. Reducing storage is not the main goal, as denormalization actually increases storage use. Cassandra's consistency model does not focus on strong transactional consistency. Denormalization is not required for every query, but it benefits performance for many common access patterns.
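A sketch of what this looks like in practice, using hypothetical table names: the same user data is duplicated into one table per query path, so each read touches exactly one table and no join is needed.

```cql
-- One table per access pattern; the application writes to both.
CREATE TABLE users_by_id (
    user_id uuid PRIMARY KEY,
    email   text,
    name    text
);

CREATE TABLE users_by_email (
    email   text PRIMARY KEY,
    user_id uuid,
    name    text
);
```

Storage use goes up, but each query is a single-table, single-partition read.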
In Cassandra, why should you design tables around your queries rather than the data structure?
Explanation: Designing tables based on expected queries ensures high performance for both reads and writes. Relational normalization is less applicable in Cassandra's architecture. While partition count matters, blindly minimizing partitions can harm performance. Cassandra does not automatically adjust tables for different queries; explicit design is needed.
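For example, a hypothetical `videos_by_user` table is shaped around the query "latest videos uploaded by a user", not around a normalized entity model:

```cql
-- The table answers one query: recent uploads for a given user.
CREATE TABLE videos_by_user (
    user_id  uuid,
    added_at timestamp,
    video_id uuid,
    title    text,
    PRIMARY KEY ((user_id), added_at)
) WITH CLUSTERING ORDER BY (added_at DESC);

-- The query the table was designed for:
SELECT title FROM videos_by_user
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
LIMIT 10;
```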
When choosing a partition key, what is the most important factor to consider for data distribution?
Explanation: Even data distribution across nodes prevents hotspots and is essential for scalability in Cassandra. Short strings do not guarantee even distribution and may limit uniqueness. Using a single static value funnels all data into one partition, causing severe performance issues. With the default Murmur3 partitioner keys are hashed, so lexical similarity alone does not skew placement; with an order-preserving partitioner, however, keys that cluster lexically (for example, all starting with the same letter) can produce uneven distribution.
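The contrast can be sketched with two hypothetical tables: a low-cardinality partition key concentrates data, while a high-cardinality key spreads it evenly via the partitioner's hash.

```cql
-- Poor distribution: 'status' has only a handful of values,
-- so nearly all rows land on a few partitions.
CREATE TABLE orders_by_status (
    status   text,
    order_id uuid,
    total    decimal,
    PRIMARY KEY ((status), order_id)
);

-- Better: a high-cardinality key (customer_id) hashes to
-- partitions spread evenly across the cluster.
CREATE TABLE orders_by_customer (
    customer_id uuid,
    order_id    timeuuid,
    total       decimal,
    PRIMARY KEY ((customer_id), order_id)
);
```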
Which scenario exemplifies a write pattern that should be avoided in Cassandra?
Explanation: Writing repeatedly to a single partition creates a hotspot and degrades performance in Cassandra's distributed architecture. Writes spread evenly across many partitions scale well because the load is shared by all nodes, although grouping writes to unrelated partitions into large logged batches adds coordinator overhead and should be done sparingly. Reasonable volumes of time-series data per partition are generally manageable.
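A hypothetical time-series schema illustrates the hotspot and one way to avoid it:

```cql
-- Hotspot: every write in a given hour targets one partition,
-- concentrating load on one replica set.
CREATE TABLE events_bad (
    hour     text,      -- e.g. '2024-06-01T13'
    event_id timeuuid,
    payload  text,
    PRIMARY KEY ((hour), event_id)
);

-- Spread: a composite partition key fans writes out across as
-- many partitions as there are event sources.
CREATE TABLE events_good (
    source_id uuid,
    hour      text,
    event_id  timeuuid,
    payload   text,
    PRIMARY KEY ((source_id, hour), event_id)
);
```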
What is a potential risk of designing wide rows (rows with many columns) in Cassandra?
Explanation: Extremely wide rows can accumulate so much data that they become inefficient to read, compact, and repair, hurting both read and write performance. There is no strict limit of ten columns. Out-of-memory errors are possible with extremely wide rows, but they are not guaranteed for every wide row. Older data is not deleted automatically unless a mechanism such as TTL is explicitly configured.
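A common mitigation, sketched with a hypothetical sensor table, is to add a time bucket to the partition key so no single partition can grow without bound:

```cql
-- Bucketing by day caps partition width: each (sensor_id, day)
-- pair holds at most one day of readings.
CREATE TABLE readings_by_sensor_day (
    sensor_id uuid,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
);
```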
When is it generally safe to use secondary indexes in Cassandra for query support?
Explanation: Secondary indexes perform best with small tables and low-cardinality values, where writes are not frequent. Using them for any non-key column or large scans is inefficient. Denormalizing data helps avoid the need for secondary indexes, rather than the other way around.
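The narrow case where a secondary index is reasonable can be sketched with a hypothetical reference table: small, rarely written, and indexed on a low-cardinality column.

```cql
-- A small, rarely updated lookup table.
CREATE TABLE offices (
    office_id uuid PRIMARY KEY,
    city      text,
    country   text   -- low cardinality: a few dozen values
);

-- Acceptable here; would be a poor fit on a large,
-- write-heavy table.
CREATE INDEX offices_country_idx ON offices (country);

SELECT city FROM offices WHERE country = 'DE';
```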
What is a best practice regarding batch operations in Cassandra?
Explanation: Batching within a single partition is efficient because it minimizes coordination overhead. Large multi-partition batches can create performance issues and are not recommended. Batching unrelated partitions does not provide the intended benefits and may slow down operations. Batches are supported but should be used judiciously.
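A sketch of the recommended pattern, using a hypothetical `user_actions` table: both statements in the batch share the same partition key, so the batch is applied on one replica set with minimal coordination.

```cql
CREATE TABLE user_actions (
    user_id uuid,
    ts      timeuuid,
    action  text,
    PRIMARY KEY ((user_id), ts)
);

-- Both inserts target the same partition (same user_id), so the
-- batch stays on one replica set and avoids multi-partition
-- coordination overhead.
BEGIN BATCH
    INSERT INTO user_actions (user_id, ts, action)
        VALUES (123e4567-e89b-12d3-a456-426614174000, now(), 'login');
    INSERT INTO user_actions (user_id, ts, action)
        VALUES (123e4567-e89b-12d3-a456-426614174000, now(), 'view_dashboard');
APPLY BATCH;
```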
What is a common use case for the Time-to-Live (TTL) feature in Cassandra tables?
Explanation: TTL automatically removes data after a specified lifetime, making it ideal for expiring caches or session records. It does not retain data permanently; it does the opposite, deleting data once the TTL elapses. TTL does not change which rows a query returns by default, and it does not increase partition size.
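A minimal sketch of session expiration with TTL, using a hypothetical `sessions` table:

```cql
CREATE TABLE sessions (
    session_id uuid PRIMARY KEY,
    user_id    uuid,
    data       text
);

-- The row expires automatically 30 minutes after the write;
-- no cleanup job is needed.
INSERT INTO sessions (session_id, user_id, data)
VALUES (uuid(), uuid(), '{"cart": []}')
USING TTL 1800;   -- seconds
```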
Which of the following is an anti-pattern in Cassandra data modeling?
Explanation: Frequent schema changes on large tables can disrupt system operation, cause downtime, and impact performance, making it an anti-pattern. Modeling tables around queries is encouraged. Clustering columns are appropriate for organizing and sorting data within a partition. Denormalization is often a recommended practice in Cassandra, not an anti-pattern.
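As an example of the appropriate use of clustering columns mentioned above, a hypothetical `comments_by_post` table returns a post's comments newest-first with no client-side sorting:

```cql
-- Clustering columns order rows within a partition; here the
-- comments for a post come back in reverse chronological order.
CREATE TABLE comments_by_post (
    post_id    uuid,
    created_at timeuuid,
    author     text,
    body       text,
    PRIMARY KEY ((post_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
```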