Book-Summary - NoSQL Distilled - A Brief Guide to the Emerging World of Polyglot Persistence

This book gives a bird's-eye view of NoSQL. It is a good book for those who want to start using NoSQL databases.

  • The data in an application can be described by a data model and a storage model. The data model is how data is represented and used within your application, whereas the storage model is about how the database stores it. Usually the storage model has nothing to do with your application's data model. For example, when you use a SQL database, data is stored in tables backed by structures like B-trees, mainly to optimise how it is stored and retrieved. Ideally the storage model exists completely independently of the data model, but in reality we adjust our data model to match the storage model to gain performance in our application; for example, our data structures end up shaped like tables.
  • In NoSQL, attributes that together describe one thing are grouped by use case: all the information a particular workflow needs is bundled into a single object/model, called an aggregate. The same information may be duplicated across different aggregates. There is no foreign key to define a relationship between two tables; instead we copy the data and put it in both places.
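A minimal sketch of what such an aggregate might look like, using hypothetical order and customer data: the customer's name is copied into the order aggregate rather than referenced via a foreign key.

```python
# A hypothetical "order" aggregate: everything one workflow needs in one object.
# The customer's name is duplicated here rather than referenced by a foreign
# key; the same information also lives in a separate "customer" aggregate.
order = {
    "order_id": 101,
    "customer": {"id": 7, "name": "Alice"},   # copied from the customer aggregate
    "items": [
        {"sku": "book-1", "qty": 2, "price": 15.0},
        {"sku": "pen-3", "qty": 1, "price": 2.5},
    ],
    "total": 32.5,
}

customer = {"id": 7, "name": "Alice", "email": "alice@example.com"}
```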
  • SQL adheres to the ACID properties: Atomicity - a set of table operations happens as a single operation; Consistency - the result of any operation is visible, and the same, to everyone; Isolation - a transaction in progress is not affected by another incoming transaction; Durability - no data loss, i.e. every committed transaction survives. NoSQL databases generally lack multi-record transactions, so they do not support full ACID, but they do offer some degree of it: an update to a single model or aggregate is an atomic operation. If you try to modify multiple models or aggregates, NoSQL cannot do it atomically; the application has to take care of it.
  • A key-value NoSQL database has only keys and values, with no structure imposed on the values. A value can be another set of key-values, a list, or any blob of data. It is like a Python dictionary with no fixed datatype for the values. In Java you might create a Map<String, List<String>>, where every entry has a string key and a list of strings as its value; a key-value store is instead like a dictionary with a fixed key type whose values are generic objects. So if I want to look for some attribute inside the values, I cannot write a query and run it, because the structure of one value can differ from the structure of another: the attribute may or may not exist in each value. A document database, on the other hand, imposes some structure on the data: every document follows that structure, which allows us to write smart queries to obtain the data. Key-value stores are therefore very bad at searching. We can in fact simulate a key-value store using a document store, and an application can also enforce structure on the values of a key-value store, but even then the database itself gains no performance benefit in searches.
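The contrast can be sketched with plain Python data structures (the store contents and the `find_by_city` helper are illustrative, not any real database's API):

```python
# Key-value store: keys map to opaque values with no shared structure,
# so the store itself cannot search inside the values.
kv_store = {
    "user:1": {"name": "Alice", "tags": ["admin"]},
    "user:2": b"\x00\x01 some binary blob",   # a completely different shape
}

# Document store: every document follows enough structure to be queryable.
doc_store = [
    {"_id": 1, "name": "Alice", "city": "Pune"},
    {"_id": 2, "name": "Bob", "city": "Delhi"},
]

def find_by_city(docs, city):
    """A naive stand-in for a document-store query like {"city": city}."""
    return [d for d in docs if d.get("city") == city]
```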
  • In a column-family store, a key is linked to a set of columns, and these columns are grouped into families. When we access the data, we access the whole set of columns belonging to a family. A key can be linked to multiple column families.
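A rough sketch of that shape, with made-up row keys and family names:

```python
# Column-family sketch: each row key maps to named column families,
# and a read fetches the whole set of columns in one family at once.
store = {
    "user:42": {
        "profile": {"name": "Alice", "email": "alice@example.com"},  # one family
        "activity": {"last_login": "2024-01-01", "visits": 17},      # another family
    }
}

def read_family(store, row_key, family):
    # Accessing the data returns every column belonging to that family.
    return store[row_key][family]
```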
  • If the data has many relationships, storing it in SQL may be easy but retrieving it is very expensive because of all the join operations needed. A graph database invests more time during insertion and in return answers relationship queries faster. So if you have a small amount of data but many relationships within it, a graph database is apt.
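A tiny illustration of why: once relationships are stored directly as edges, a relationship query is a simple traversal rather than a join (the people and edges below are invented for the example).

```python
# Edges stored up front, as a graph database would at insert time.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def friends_of_friends(graph, person):
    """Answer a relationship query by walking edges; no join needed."""
    result = set()
    for friend in graph.get(person, []):
        result.update(graph.get(friend, []))
    return result
```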
  • In SQL we must adhere to a schema, and all our data models have to fit it, so any future change such as adding new information, deleting an attribute, or updating a relationship requires significant effort. With NoSQL there is no database-enforced schema, so changes are easy: the schema/model lives in the application, and the NoSQL store is just data storage, which almost cleanly separates the data model from the storage model. There is a problem here: if the data store is accessed by multiple applications, a schema change by one will affect the others. So the best practice is to keep the data store encapsulated behind a web service and handled directly by a single application.
  • Sharding stores different parts of the data on different nodes/machines. The application's data may be huge, but with sharding we divide it across multiple nodes, so each machine serves only the customers whose data it holds. The important question is how we divide the data across nodes; we can use domain-specific knowledge or a general method like hashing. Sharding reduces not only the number of reads on each node but also the number of writes, increasing performance. But if we lose a node, we lose that chunk of data, and though we did not lose the whole dataset, broken data is useless. So sharding should always be used along with replication: if one node goes down, we have another to recover from.
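The hashing approach can be sketched in a few lines (the node names are purely illustrative; real systems typically use consistent hashing so that adding a node does not remap every key):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def shard_for(key: str, nodes=NODES) -> str:
    # Stable hash of the key, so the same key always lands on the same node.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]
```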
  • One way to replicate is primary-secondary (master-slave) replication: the primary replicates its data to the secondary nodes, so we can direct all reads to the secondaries. This reduces the load on the primary, letting it spend more resources on writes and replication. If the primary dies, the secondaries can still process reads, though writes are affected until a new primary is elected or the old primary comes back.
  • With primary-secondary replication, the primary is the bottleneck: if we lose it, we lose the ability to write. To overcome this, peer-to-peer replication has no single primary node; each node is the primary for a certain set of data and a secondary for another set.
  • With replication comes inconsistency: we gain resilience and availability, but we lose consistency.
  • Different types of conflict -
    • Write-write conflict: users x and y both want to update data d at the same time; x tries to set it to m while y tries to set it to n. This leads to update inconsistencies.
    • Read-write conflict: we have data x and related data y, so updating x also causes y to be updated after some time. If m updates x while n is simultaneously reading, n sees the updated value of x but the old value of y; only later does y get updated. This read-write conflict, caused by the delay, leads to read inconsistencies.
  • We can solve the conflicts using either a pessimistic or an optimistic approach:
    • Pessimistic approach: do not allow the conflict to occur. Use locks: any client who wants to update data needs to acquire the lock first; if you get the lock you can update, otherwise not. With this approach we take a hit on performance.
    • Optimistic approach: allow the conflict to occur but handle it. For example, a conditional update: any update first verifies that the data it wants to change is the same as the data it read earlier. Only if so does it continue with the update; otherwise it falls back or fails.
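The optimistic conditional update is often implemented with a version number, a form of compare-and-set. A minimal sketch, assuming an in-memory store where each record carries a version:

```python
# Each record carries a version; an update succeeds only if the version
# is still the one the client read earlier.
store = {"balance": {"value": 100, "version": 1}}

def conditional_update(store, key, expected_version, new_value):
    record = store[key]
    if record["version"] != expected_version:
        return False              # someone else updated first; caller must retry
    record["value"] = new_value
    record["version"] += 1
    return True
```

Two clients that both read version 1 cannot both succeed: the second update sees a stale version and fails, which the application then handles by retrying or giving up.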
  • When replication is involved, we will have replication inconsistencies. The aim in NoSQL is to reduce the inconsistency time window, thus achieving eventual consistency.
  • CAP theorem - Consistency, Availability, and Partition tolerance: a distributed system can fully provide only two of the three. Since network partitions cannot be avoided in practice, the real choice when a partition occurs is between consistency and availability.
  • To avoid write conflicts in a replicated system (replicated to N nodes), a write is declared successful only after it propagates to more than half of the replicas, i.e. W > N/2. This is known as a write quorum. More generally, a read that contacts R nodes is guaranteed to see the latest write when R + W > N; in the extreme, writing to all the nodes (W = N) makes any single-node read consistent, but this slows your writes.
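The quorum arithmetic is simple enough to state as code; the conditions below are the standard W > N/2 write-quorum and R + W > N read-overlap rules:

```python
def is_write_quorum(w: int, n: int) -> bool:
    # Writes cannot conflict if every successful write reaches a majority.
    return w > n / 2

def read_sees_latest_write(r: int, w: int, n: int) -> bool:
    # A read of R nodes must overlap the W nodes of the latest write.
    return r + w > n
```

For N = 3 replicas, W = 2 forms a write quorum and R = 2 guarantees consistent reads; pushing W up to 3 lets reads drop to R = 1 at the cost of slower writes.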
  • Map-reduce is a pattern for distributed computation. In the map step, each node extracts the needed attributes from the documents stored on that node and may aggregate them locally. In the reduce step, the partial results from all nodes are combined into a single value.
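The two steps can be sketched with a word count over two hypothetical nodes (the per-node document lists are invented for the example):

```python
from collections import Counter

# Documents held on two hypothetical nodes.
node1_docs = ["nosql", "sql", "nosql"]
node2_docs = ["sql", "graph"]

def map_and_combine(docs):
    # Map step: each node emits (word, count) pairs, aggregated locally.
    return Counter(docs)

def reduce_counts(partials):
    # Reduce step: combine the per-node partial results into one value.
    total = Counter()
    for partial in partials:
        total += partial
    return total
```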