Musings on Technology: Challenge the Conventional Wisdom : MongoSF 2011

||| MongoSF , May 25, 2011 |||

* Shamelessly borrowed from Kyle's great slides ... I consider this the best example to drive home the simple yet strong message of the MongoSF seminar !

Professor Calculus trying to convince Captain Haddock that he should embark upon the Shuttle to Planet MONGO !

Capt. H : My data is not super-transactional, but it is time-critical and moves very fast .. So are you saying I don't need Normalization ?
Prof. C : Exactly ! Design your database as per your use case. In the absence of a schema, there is no need to build/maintain the lifecycle of a model (conceptual [ERD] -> logical [UML] -> physical [DB] -> sql -> xsd ..)
Captain! just spell out your thoughts as .. {vendor_id : "apple", products : ["iphone","ipod", "ipad"] }
An Embedded Array is a powerful concept: it gets you the list of associated entities for a specific entity in one go !
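
A minimal sketch of the 'one go' idea, in plain JavaScript with an in-memory array standing in for a collection (no server involved). The `vendors` array, the `findVendor` helper and the second vendor ("acme") are assumptions made up for illustration; only the "apple" document comes from the example above.

```javascript
// In-memory stand-in for a collection (hypothetical data, no MongoDB server needed).
const vendors = [
  { vendor_id: "apple", products: ["iphone", "ipod", "ipad"] },
  { vendor_id: "acme", products: ["anvil"] } // made-up second vendor
];

// One lookup returns the vendor together with its embedded product list;
// there is no second query or join against a separate products table.
function findVendor(vendorId) {
  return vendors.find(v => v.vendor_id === vendorId) || null;
}

const apple = findVendor("apple");
console.log(apple.products); // the whole list arrives with the parent document
```

Against a real server the same lookup would presumably be a single findOne on the vendor_id field, assuming the documents live in a collection of their own.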

With full respect to the RDBMS world (which will always be relevant in operational domains like, say, Manufacturing ...), a few important points to remember :
>> Traditional DB - the schema has no bias for a specific use case. It encompasses all possible relationships between the entities (stakeholders) of a domain !
>> So it's great for ad-hoc queries and data rigidity .... but slow for data manipulation and for hopping around between entities, because every hop is a join !! This trade-off can be equated with the famous trade-offs between 'bloated WSDL and simplistic REST' or 'schematic XML and schema-less JSON' !

Capt. H : Traditionally, I first normalize the data for OLTP and then de-normalize it for BI .. ?
Prof. C : I do understand your pain points ! Welcome to MongoDB ! Just fire a few aggregation queries on the same documents that store your business model and you extract the intelligence without maintaining a second, de-normalized copy !

Capt. H : So far so good ! Do I still need a Cache or ORM layer ?
Prof. C : Well .. Captain .. Caching was a work-around for slow lookups from an over-burdened database !
Now the db itself is an extremely lightweight engine - fast lock-free lookup, asynchronous parallel writes, load distributed across shards etc. etc.
- An ORM is another work-around to abstract rigid SQL and convert the POJOs of a specific framework to db entities ! .. Well .. in most cases .. ideally no need for any sort of mapping ! MongoDB mandates that Code and Data should reside in separate layers and each understands one standard i.e. JSON !
It flows so smoothly from the front-end (PHP/Perl/Python/GWT/JS .. all understand JSON) to the JSON-friendly REST layer and finally lands in the lap of MongoDB as BSON !
Remember ... we now have a database which itself follows the object structure that your app layer is based on ... so no need for any ORM layer !
Now for the sake of backward compatibility and smooth integration with existing POJO-based apps (which were talking to an RDBMS), Spring Data provides nice wrappers for NoSQL stores !

Capt. H : Well, I love breaking down a large Entity like Product into multiple components like - { line_item, payment, order, etc.. } .. but I pay the heavy price of slow joins ?
Prof. C : Hmm .. how about creating a Rich Document - Product - and then embedding sub-documents {line_items, payment, order ..} inside the parent ..
Then index on the sub-documents and perform super-fast asynchronous atomic updates on a single document !
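
To make the 'Rich Document' concrete, here is a hedged sketch in plain JavaScript. The field names inside `product` and the helpers `resolve` and `applyUpdate` are assumptions for illustration; `applyUpdate` only imitates what the $set and $push update operators do to one document, it is not MongoDB's implementation and it says nothing about atomicity on a real server.

```javascript
// Hypothetical rich document: the sub-documents live inside the parent.
const product = {
  _id: "p1",
  order: { status: "open" },
  payment: { method: "card", paid: false },
  line_items: [{ sku: "a-1", qty: 2 }]
};

// Walk a dotted path such as "payment.paid"; return the parent object and the last key.
function resolve(doc, path) {
  const keys = path.split(".");
  const last = keys.pop();
  let node = doc;
  for (const k of keys) node = node[k];
  return { node, last };
}

// Minimal imitation of the $set and $push update operators on ONE document.
function applyUpdate(doc, update) {
  for (const [path, value] of Object.entries(update.$set || {})) {
    const { node, last } = resolve(doc, path);
    node[last] = value;
  }
  for (const [path, value] of Object.entries(update.$push || {})) {
    const { node, last } = resolve(doc, path);
    node[last].push(value);
  }
  return doc;
}

// Payment, order status and a new line item change together, in one update.
applyUpdate(product, {
  $set: { "payment.paid": true, "order.status": "closed" },
  $push: { line_items: { sku: "b-7", qty: 1 } }
});
console.log(product);
```

The point of the design: everything that must change together sits in one document, so one update touches it all and no join is needed to read it back.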

Capt. H : Great ! You know .. I love storing the parent-child relationship .. i.e. the Inheritance or .. a Tree-Hierarchy in a single table .... but as a result in an RDBMS .. you know .. I have redundant columns !!
Prof. C : No worries ! In MongoDB, the data itself defines its schema ! (metadata-free modelling)
So you can specify ...
db.shapes.find()
{_id:"1",type:"circle",area: 3.14, radius: 1 }
{_id:"2",type:"square",area: 4, d: 2 }
{_id:"3",type:"rect",area: 10, length: 5 , width: 2}
-- And wait .. you can apply the beautiful 'Sparse Index' to a diverse data-set stored inside the same collection !
Did you catch the sweet spot 'radius' ? Only circles carry it, so a sparse index on 'radius' holds entries for the circles alone - just by its sheer presence as a unique attribute, an 'inherent restriction' is applied on the collection.
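
A small sketch of what 'sparse' means, again in plain JavaScript over the three shapes above. `buildSparseIndex` is a hypothetical helper, not a MongoDB API; it only shows that documents lacking the indexed field get no index entry at all.

```javascript
const shapes = [
  { _id: "1", type: "circle", area: 3.14, radius: 1 },
  { _id: "2", type: "square", area: 4, d: 2 },
  { _id: "3", type: "rect", area: 10, length: 5, width: 2 }
];

// Build an index on one field, skipping every document that lacks the field.
function buildSparseIndex(docs, field) {
  const index = new Map(); // field value -> list of _ids
  for (const doc of docs) {
    if (!(field in doc)) continue; // sparse: no entry for a missing field
    const ids = index.get(doc[field]) || [];
    ids.push(doc._id);
    index.set(doc[field], ids);
  }
  return index;
}

const radiusIndex = buildSparseIndex(shapes, "radius");
console.log(radiusIndex.size);   // 1: only the circle is indexed
console.log(radiusIndex.get(1)); // [ '1' ]
```

In the shell of that era the real thing was, as far as I recall, an ensureIndex call on the field with the option {sparse : true}; treat that as something to check against the documentation.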

Capt H: Professor .. You simply Rock ! .. I kinda learnt how to partition my data in rdbms ! But so difficult to maintain when data outgrows the chunks ...
Prof C. .. Sorry to hear that captain ! ... data-partitioning is a no brainer !
db.runCommand( { addshard : "localhost:10000" } )
{ "shardadded" : "shard0000", "ok" : 1 }
db.runCommand( { addshard : "localhost:10001" } )
{ "shardadded" : "shard0001", "ok" : 1 }

All the goodness of sophisticated master-slave node allocation, replication, internal node state management and auto-recovery is provided out of the box !
In ReplicaSet mode, all the hosts are checked every 5 sec.
Replica1 : {chunk1(user1 - user100) => shard1, chunk2(user101 - user 200) => shard1}
Replica2 : {chunk1(user1 - user100) => shard1, chunk2(user101 - user 200) => shard1}

The MongoS process is the router and load-balancer. Upon startup it finds out which servers are up and, based on the latency, picks the shard. It routes requests to the shards and moves chunks between them (each shard can be a replica set).
If a new hot chunk is created, it can be dynamically moved to an available shard - shard2.
And when a chunk outgrows its size limit, it is split into two chunks at its median key (it is the chunk that gets divided, not the shard) !
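
The range-based routing and the median split mentioned above can be sketched like this. Everything here is an assumption for illustration: the `chunks` table, the numeric user keys and the helpers `route` and `splitAtMedian` are made up, and the real balancer is far more involved.

```javascript
// Hypothetical chunk table: each chunk owns the key range [min, max) and lives on one shard.
const chunks = [
  { min: 1, max: 101, shard: "shard0000" },   // user1 - user100
  { min: 101, max: 201, shard: "shard0001" }  // user101 - user200
];

// Route a shard-key value to the shard that owns its chunk.
function route(key) {
  const chunk = chunks.find(c => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null;
}

// Split one chunk in two at the median of the keys it currently holds.
function splitAtMedian(chunk, keys) {
  const sorted = [...keys].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  return [
    { min: chunk.min, max: median, shard: chunk.shard },
    { min: median, max: chunk.max, shard: chunk.shard }
  ];
}

console.log(route(42));  // shard0000
console.log(route(150)); // shard0001
console.log(splitAtMedian(chunks[0], [5, 10, 60, 90, 95])); // ranges [1, 60) and [60, 101)
```

Note that both halves stay on the original shard right after the split; moving one of them elsewhere is a separate balancing step.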

Capt H: Very Interesting ! So I can choose whether my query will work on the primary or a secondary.
Prof C: You got it right ! For immediate consistency, query from / write into the primary (master). For eventual consistency, query the slaves ! Reads may be out-of-date on the slaves. It's just the tip of the iceberg ! So refer to the documentation for more details !

Capt H: Great ! Now that my app is working fine ... I need to do some analysis .. So coming to data-crunching in batch ! I am a startup guy and have no time to set up hadoop-hive-mahout on a separate cluster, nor do I have the luxury to build the infrastructure of Amazon Elastic Map-Reduce .... so .. I guess .. MongoDB ..

Prof C:  Oh ! Yeah ! Absolutely ! Right here ... Why, as a developer in a fast-paced env, should you worry about the gory details of data crunching !! Just call the map-reduce API ! Stay tuned ! An out-of-the-box Aggregation API is coming soon ! Remember the key point ... you can now run your analytics right where the data lives, without shipping it to a separate cluster first !
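
For readers who have not met map-reduce before, here is a hedged in-memory sketch of the pattern in plain JavaScript. The `orders` data and the `mapReduce` helper are assumptions for illustration. One deliberate difference from the shell: there, emit is a global available inside the map function, while here it is passed in as an argument so the sketch stays self-contained.

```javascript
// Hypothetical order documents.
const orders = [
  { vendor_id: "apple", amount: 500 },
  { vendor_id: "acme", amount: 20 },
  { vendor_id: "apple", amount: 300 }
];

// Tiny in-memory map-reduce: map emits (key, value) pairs, reduce folds the values per key.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) map.call(doc, emit); // 'this' is the current document
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

const totals = mapReduce(
  orders,
  function (emit) { emit(this.vendor_id, this.amount); },
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);
console.log(totals); // { apple: 800, acme: 20 }
```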

Capt H: I am ecstatic to hear this ! On a different note, I love index-based searching using lucene-solr .. but just checking in if ..
Prof C:  Sure ! MongoDB hears you ... full-fledged text-searching .. will be released by end of 2011 !

Capt H: I love Scala and Node.js .. a big fan of executing requests in parallel ..  but it gets very tricky to solve table-level and row-level locking issues in the Database ....
Prof C: Absolutely ! Your writes per document are atomic and reads are parallel in nature (cpu cores are scaling out, not speeding up ... so mongodb native threads just love asynchronous parallel processing) ! But if writes fail .. there is no auto-rollback mechanism .. your app should be responsible for data consistency ! No support for distributed transactions ! So use a language of your choice, and no need to shard in the app layer !

Capt H: A compelling reason for adopting AWS was SimpleDB .. I mean instead of moving my DB to Cloud I tried to access a DB on Cloud .. but sacrificed many features of RDBMS .... So .. is it a Big Ask from Mongo ... ?
Prof C:  Not at all ! MongoDB is Cloud-ready by virtue of its horizontal scalability, multi-tenancy, fail-safe nature, statelessness and atomic single-document operations. Scale linearly, increase capacity with no down-time.

So Captain Welcome Aboard ! Enjoy your journey into Mongo !

With freedom comes responsibility :

1. MongoDB does not encourage storing reverse links in the referred documents for implementing a many-to-many relationship.
Captain, I know you love your language .. you won't mind a couple of extra lines of query in the app layer :-)
//All categories for a given product
product = db.products.findOne({_id : prod_id});
db.categories.find({_id : {$in : product.category_ids} })
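
The two-step lookup above, plus the reverse direction that works without reverse links, can be tried out in plain JavaScript. The sample documents and the helpers `categoriesFor` and `productsIn` are assumptions for illustration; the arrays merely stand in for the two collections.

```javascript
// Hypothetical data: the link is stored on the product side only.
const products = [
  { _id: "p1", name: "ipad", category_ids: ["c1", "c3"] },
  { _id: "p2", name: "iphone", category_ids: ["c2", "c3"] }
];
const categories = [
  { _id: "c1", name: "tablets" },
  { _id: "c2", name: "phones" },
  { _id: "c3", name: "gadgets" }
];

// All categories for a given product: fetch the product, then match ids ($in-style).
function categoriesFor(prodId) {
  const product = products.find(p => p._id === prodId);
  if (!product) return [];
  const wanted = new Set(product.category_ids);
  return categories.filter(c => wanted.has(c._id));
}

// All products for a given category: query the array field, no reverse link required.
function productsIn(categoryId) {
  return products.filter(p => p.category_ids.includes(categoryId));
}

console.log(categoriesFor("p1").map(c => c.name)); // [ 'tablets', 'gadgets' ]
console.log(productsIn("c3").map(p => p.name));    // [ 'ipad', 'iphone' ]
```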

2. It's not a full-fledged caching solution ... so there are pending issues with data eviction ! Expired objects reportedly stay in memory ! MongoDB believes in reusing memory rather than suffering from the reclaim-reallocate overhead every now and then ! But the garbage collection algorithm will definitely get better and better with community support !

3. The thrust is on 'eventual consistency' for writes. There are no automatic 'retries' for writes ! So we need to judiciously handle write exceptions in the app layer. We need to remember - only if we explicitly mark a 'transaction request' as 'done' will the MongoDB driver clean up the connection from the thread-local !

4. While designing the data model, we need to understand where to 'embed' and where to 'link'.
Definitely 'Linking' has a higher 'cost of relocation' than 'Embedding'. That implies that in a Tree we had better store the list of descendants directly.

5. The working set should be kept as small as possible and needs to be distributed evenly.
Pre-splitting for initial bulk loading is always advised.

6. Queries on the shard key are fastest.

7. For pagination we should depend upon a stateful cursor_index instead of depending on the stateless shard key.

Tips, tricks, references :

Thought-provoking : http://www.slideshare.net/tackers/moving-from-relational-to-document-store

Great summary on app-design principles :
http://speakerdeck.com/u/kbanker/p/the-mongodb-gamut-four-app-designs

50 Tips and Tricks for MongoDB Developers : http://my.safaribooksonline.com/book/databases/9781449306779/firstchapter

Schema Design Basics : http://speakerdeck.com/u/kbanker/p/mongodb-schema-design-mongosf-2011
http://www.slideshare.net/mongodb/mongodb-schema-design-richard-kreuters-mongo-berlin-preso

New Aggregation Features : http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011

Performance Considerations : http://speakerdeck.com/u/kbanker/p/10-key-mongodb-performance-indicators

http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/

Great Teachings from MongoDB Mentor : http://www.scribd.com/alvin_richards