Some key characteristics of NoSQL databases are:
- They are distributed, can scale horizontally, and can handle data volumes on the order of several terabytes or petabytes with low latency.
- They have less rigid schemas than a traditional RDBMS.
- They have weaker transactional guarantees.
- As the name suggests, these databases generally do not use SQL as their query interface.
- Many NoSQL databases model data as rows with column families, key value pairs, or documents.
In theory, relational databases comply with Codd's 12 rules of the relational model. More simply, in an RDBMS a table is a relation and a database is a set of such relations. A table has rows and columns. Each table has constraints, and the database enforces those constraints to ensure the integrity of the data. Each row in a table is identified by a primary key, and tables are related using foreign keys. You eliminate duplicate data through normalization, by moving columns into separate tables while keeping the relationship using foreign keys. Getting data out of multiple tables requires joining the tables on the foreign keys. This relational model has been useful in modeling most real world problems and has been in widespread use for the last 20 years.
In addition, RDBMS vendors have gone to great lengths to ensure that RDBMSs do a great job of maintaining ACID (atomicity, consistency, isolation, durability) transactional properties for the data stored. Recovery from unexpected failures is supported. This has led to relational databases becoming the de facto standard for storing enterprise data.
If RDBMSs are so good, why does anyone need NoSQL databases? Even the largest enterprises have users only in the thousands and data requirements on the order of a few terabytes. But when your application is on the internet, where you are dealing with millions of users and data on the order of petabytes, things start to slow down with an RDBMS. The basic operations on any database are reads and writes. Reads can be scaled by replicating data to multiple machines and load balancing the read requests. However, this does not work for writes, because data consistency needs to be maintained. Writes can be scaled only by partitioning the data. But partitioning affects reads, because distributed joins can be slow and hard to implement. Additionally, to maintain ACID properties, databases need to lock data, at the cost of performance.
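The relational ideas above (primary keys, foreign keys, joins, and atomic transactions) can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the users and posts tables and their columns are made up for this example, and SQLite stands in for any RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Normalized schema: post rows point at their author through a foreign key.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL REFERENCES users(id),"
    " title TEXT NOT NULL)"
)

# "with conn" commits on success and rolls back if an exception escapes.
with conn:
    conn.execute("INSERT INTO users (id, name) VALUES (123, 'john')")
    conn.execute(
        "INSERT INTO posts (id, user_id, title) VALUES (1, 123, 'This is a post')"
    )

# Reading across tables requires a join on the foreign key.
rows = conn.execute(
    "SELECT users.name, posts.title"
    " FROM posts JOIN users ON users.id = posts.user_id"
).fetchall()

# Atomicity: the second insert violates the foreign key constraint,
# so the whole transaction, including the valid first insert, is rolled back.
try:
    with conn:
        conn.execute(
            "INSERT INTO posts (id, user_id, title) VALUES (2, 123, 'Another post')"
        )
        conn.execute(
            "INSERT INTO posts (id, user_id, title) VALUES (3, 999, 'Orphan post')"
        )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

After the failed transaction, post_count still reflects only the first committed post, which is the kind of guarantee that becomes expensive to keep once data is spread over many machines.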
The Googles, Facebooks and Twitters have found that relaxing the constraints of RDBMSs and distributing data gives them better performance for use cases where:
- Datasets are large, on the order of petabytes. Typically this data needs to be stored across multiple machines.
- The application does a lot of writes.
- Reads require low latency.
- Data is semi-structured.
- You need to be able to scale without hitting a bottleneck.
- The application knows what it is looking for. Ad hoc queries are not required.
NoSQL databases come in a few different types.
1. Key Value Stores
They allow clients to read and write values using a key. Amazon's Dynamo is an example of a key value store.
get(key) returns an object or list of objects
put(key, object) stores the object as a blob
Dynamo uses hashing to partition data across the hosts that store the data. To ensure high availability, each write is replicated across several hosts. Hosts are equal and there is no master. The advantage of Dynamo is that the key value model is simple and it is highly available for writes.
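The get/put interface, hash partitioning, and replication described above can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Dynamo's actual design: the Host and KeyValueStore classes are hypothetical, and placement uses a simple hash modulo the number of hosts rather than the consistent hashing, versioning, and quorum logic a real system would need.

```python
import hashlib


class Host:
    """One storage node; every host plays the same role, there is no master."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class KeyValueStore:
    def __init__(self, host_names, replicas=3):
        self.hosts = [Host(name) for name in host_names]
        self.replicas = min(replicas, len(self.hosts))

    def _hosts_for(self, key):
        # Hash the key to pick a starting host, then take the next hosts
        # in order as additional replicas (simplified placement).
        digest = hashlib.md5(key.encode("utf-8")).digest()
        start = int.from_bytes(digest, "big") % len(self.hosts)
        return [
            self.hosts[(start + i) % len(self.hosts)]
            for i in range(self.replicas)
        ]

    def put(self, key, value):
        # Each write goes to several hosts so the value survives a host failure.
        for host in self._hosts_for(key):
            host.data[key] = value

    def get(self, key):
        # Return the value from the first replica that has it.
        for host in self._hosts_for(key):
            if key in host.data:
                return host.data[key]
        return None


store = KeyValueStore(["host-a", "host-b", "host-c", "host-d"], replicas=3)
store.put("user:123", b"opaque blob")
value = store.get("user:123")
```

The store never looks inside the value, which is what keeps the model simple: all routing decisions depend only on the key.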
2. Document stores
The key value pairs that make up the data are encapsulated as a document. Apache CouchDB is an example of a document store. In CouchDB , documents have fields. Each field has a key and value. A document could be
--------------------------------------------------------
"firstname" : "John",
"lastname" : "Doe",
"street" : "1 main st",
"city" : "New York"
----------------------------------------------------------
In CouchDB, distribution and replication are peer to peer. The client interface is RESTful HTTP, which integrates well with existing HTTP load balancing solutions.
3. Column based stores
Reads and writes are done using columns rather than rows. The best known examples are Google's BigTable and the likes of HBase and Cassandra, which were inspired by BigTable. The BigTable paper says that BigTable is a sparse, distributed, persistent, multidimensional sorted map. While that sentence seems complicated, reading each word individually gives clarity.
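Because the interface is plain HTTP, storing a document amounts to sending its JSON body to a URL. The sketch below only builds such a request with Python's standard library and does not send it. The server address, the database name people, and the document id john-doe are assumptions made up for this example.

```python
import json
import urllib.request


def build_put_request(base_url, database, doc_id, document):
    """Build (but do not send) an HTTP PUT that would store one document."""
    url = f"{base_url}/{database}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


document = {
    "firstname": "John",
    "lastname": "Doe",
    "street": "1 main st",
    "city": "New York",
}

# Hypothetical local server and names; adjust for a real installation.
request = build_put_request("http://localhost:5984", "people", "john-doe", document)

# With a server running, urllib.request.urlopen(request) would send it.
```

Since the request is ordinary HTTP, any load balancer or proxy that understands HTTP can sit in front of the database without special support.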
- sparse - some cells can be empty
- distributed - data is partitioned across many hosts
- persistent - stored to disk
- multidimensional - more than 1 dimension
- Map - key and value
- sorted - maps are generally not sorted but this one is
-----------------------------------------------------------------------------------------------------
{
  row1: {
    user: {
      name: john
      id: 123
    },
    post: {
      title: This is a post
      text: xyxyxyxx
    }
  },
  row2: {
    user: {
      name: joe
      id: 124
    },
    post: {
      title: This is a post
      text: xyxyxyxx
    }
  },
  row3: {
    user: {
      name: jill
      id: 125
    },
    post: {
      title: This is a post
      text: xyxyxyxx
    }
  }
}
-----------------------------------------------------------------------------------------------
The outermost keys row1, row2 and row3 are analogous to rows. user and post are what are called column families. The column family user has the columns name and id. The column family post has the columns title and text.
You refer to a column as columnfamily:column, for example user:id or post:text. In HBase, the column families need to be specified when you create the table, but columns can be added on the fly. HBase provides high availability and scalability using a master slave architecture.
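The data model described above can be sketched as a nested map in Python. This is a toy, single-process illustration, not the HBase or BigTable API: the ColumnFamilyTable class and its methods are hypothetical, and rows are sorted at scan time, whereas a real store keeps them sorted on disk.

```python
class ColumnFamilyTable:
    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value):
        # A column is addressed as "family:column", for example "user:id".
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a known family can be added on the fly.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, column):
        # Sparse: a cell that was never written simply returns None.
        family, _, qualifier = column.partition(":")
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start_row, stop_row):
        # Sorted: rows come back in key order for start_row <= key < stop_row.
        for row_key in sorted(self.rows):
            if start_row <= row_key < stop_row:
                yield row_key, self.rows[row_key]


table = ColumnFamilyTable(["user", "post"])
table.put("row3", "user:name", "jill")
table.put("row1", "user:name", "john")
table.put("row1", "user:id", 123)
table.put("row2", "post:title", "This is a post")
table.put("row1", "user:email", "john@example.com")  # new column, no schema change

scanned_keys = [row_key for row_key, _ in table.scan("row1", "row3")]
```

Keeping rows ordered by key is what makes range scans like the one above cheap in a real column store, which is why row key design matters so much in these systems.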
Do I need a NoSQL store?
You do not need a NoSQL store if:
- All your data fits on one machine and does not need to be partitioned.
- You are doing OLTP, which requires the ACID transaction properties and data consistency that RDBMSs are good at.
- You need ad hoc querying using a language like SQL.
- You have complicated relationships between the entities in your applications.
- Decoupling data from application is important to you.
You may need a NoSQL store if:
- Your data has grown so large that it can no longer be handled without partitioning.
- Your RDBMS can no longer handle the load.
- You need very high write performance and low latency reads.
- Your data is not very structured.
- You cannot afford a single point of failure.
- You can tolerate some data inconsistency.