Over the past few years large tech companies such as Netflix, Hulu, Uber and Facebook and have brought notoriety to NoSQL or non-relational databases. In this blog post I will give a brief introduction to the types of NoSQL databases and how they compare to traditional SQL databases.
Relational vs Non-relational
The key difference between a NoSQL and SQL is that a SQL database is considered a relational database. A relational database stores data in tables, which are organized into columns. Each column stores one datatype (integer, real number, string, date etc.) and each row represents an instance of the table. Non-relational databases do not store data in tables- instead there are multiple ways to store data in NoSQL databases (Key-value, Document-based, Column-based). I will briefly summarize these 3 below.
Key-value Store (Oracle NoSQL, Redis, Amazon Dynamo)
Key-Value databases are the most simple of all the NoSQL databases. The basic data structure is a dictionary or map. A value can be stored as an integer, string, JSON, or an array- with a key used to reference that value. For example you could have a key as a customer id, which referred to a value containing a string of the customer’s name. Using a JSON structure would add complexity to the value as it could contain a dictionary of information about the the customer with the corresponding key. Benefits include rapid storage of data (due to the high simplicity), has integrated caching feature allowing users to store/retrieve data as quickly as possible and they are highly suited for unrelated data. For example a user’s web app activity will be different and unrelated to another user’s activity. However, key-value databases make it difficult to perform advanced queries aside from basic CRUD (create, read, update, delete) operations. Additionally, as volume of data increases maintaining unique keys becomes more difficult
Document-Based Store (MongoDB and Couchbase)
The next type of NoSQL database is the Document-oriented database. Data is structured in the form of documents and collections. A document can be a PDF, Microsoft word doc, XML or JSON file. As opposed to columns and datatypes a document contains key value pairs. Each document does not have to be in the same structure as other documents. Due to this, to add additional data one can simply add more documents without having to change the structure of the entire database. Documents are grouped into collections, which serve a similar purpose to a relational table. Document databases provide a querying function to search collections of documents with particular attributes. Benefits include flexible data modeling (eliminates the need to force fit relational data models, as it can handle structured, unstructured and semi-structured data) and fast-write performance over strict consistency (great for Agile and quick iteration). Document based databases also allow for separation of collections by entity (orders and customer profiles). However, they are also limited with advanced queries and does not allow for joins.
Column-based Store (Google’s Bigtable, Cassandra, HBase)
Column-Oriented databases store data in grouped columns rather than in rows of data. They use a concept called keyspace, which is similar to the schema in a relational model. This is shown to the left. The keyspace contains multiple column families. Column families are similar to tables in a relational model. However, instead of containing just rows, a column family contains rows of columns. Each row in a column family has a unique key and each column within the row contains a name, value and time stamp (shown below).
The benefits of a column based database include data compression, high performance with aggregate functions (SUM, COUNT, etc) and scalability across a ton of machines. However, writing new data to columnar databases could take more time. You could write new data to a row based database in one operation, but for a Columnar database you would need to write each column one by one. Therefore, Columnar databases are better for processing data with a small number of columns and larger number of rows.
Pros and Cons of NoSQL
- Lack of standardization, Support and Maturity: as the title indicates these non-relational databases do not use the Structured Query Language and does not have it’s own standardized language. This makes it difficult to migrate processes over to NoSQL. To left is a snippet of the difference between SQL and NoSQL CRUD operations. As you can see as the operations get a little more complex the nested JSON format can get confusing. This also goes hand in hand with the lack of maturity of NoSQL, as there are much less NoSQL experts than SQL at the moment. Due to this same lack of maturity there is not a ton of support for NoSQL- most is done on a community support basis.
- Analytics and BI: SQL has a wide array of tools for Business intelligence (some found here). However, there are not many tools for NoSQL analytics at the moment.
- Flexibility: Easier to manage and more adept at dealing with newer data models. Therefore, NoSQL can fit very specific needs of a company. This article explains how Netflix adopted SimpleDB, Cassandra and HBase for specific business needs way back in 2011.
- Highly Scalable at Low Cost: Many NoSQL options are open source making them a more affordable option for smaller organizations. In addition to this the top NoSQL options (MongoDB, Amazon Dynamo) allow for big data processing at a relatively affordable price. SQL relies on proprietary servers and storage systems which can end up being more expensive than the per GB or transaction cost of NoSQL. Additionally, NoSQL databases can scale out as opposed to scaling up.In the past database administrators had relied on scaling up by buying bigger servers to handle more data loads. NoSQL can scale out- meaning the database will be distributed across multiple servers as load increases. Therefore, NoSQL databases are usually designed with low-cost commodity hardware in mind.
- Although there are many benefits to NoSQL databases, SQL databases are still more widely used at this point. The image to the left shows the current database rankings per DB-Engines.
- SQL databases emphasize ACID (Atomicity, Consistency, Isolation, Durability) where NoSQL focuses on BASE (Basically Available, Soft-State, Eventual Consistency). These are both based on Brewer’s CAP thereoem (visualized below). A common definition for CAP is “In the face of network partitions, you can’t always have both perfect consistency and 100% availability. Plan accordingly.” For more information on BASE, ACID and CAP please see the links below.