RDBMS vs Cassandra data models

The following table lists down the points that differentiate the data model of Cassandra from that of an RDBMS.

RDBMS Cassandra
RDBMS deals with structured data. Cassandra deals with unstructured data.
It has a fixed schema. Cassandra has a flexible schema.
In RDBMS, a table is an array of arrays. (ROW x COLUMN) In Cassandra, a table is a list of “nested key-value pairs”. (ROW x COLUMN key x COLUMN value)
Database is the outermost container that contains data corresponding to an application. Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. Tables or column families are the entity of a keyspace.
Row is an individual record in RDBMS. Row is a unit of replication in Cassandra.
Column represents the attributes of a relation. Column is a unit of storage in Cassandra.
RDBMS supports the concepts of foreign keys, joins. Relationships are represented using collections.

What are ACID properties of Database

When talking databases that handle mission-critical business transactions and information you are talking ACID features. There are a set of properties that guarantee that database transactions are processed reliably, referred to as ACID (Atomicity, Consistency, Isolation, Durability)

acid_db

Atomicity requires that each transaction be “all or nothing”: if one part of the transaction fails, then the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible (“atomic”), and an aborted transaction does not happen.

Consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors cannot result in the violation of any defined rules.

Isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e., one after the other. Providing isolation is the main goal of concurrency control. Depending on the concurrency control method (i.e., if it uses strict – as opposed to relaxed – serializability), the effects of an incomplete transaction might not even be visible to another transaction.

Durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.

MongoDB : How to Scale MongoDB Datasets

As time goes and data grows it is required to scale any database to maintain the quick response time and performance. Initially we run query explain plan and make changes in the document structure to make them efficient and add/modify required indexes. However even after performing clean up and maintenance tasks, the time comes to scale and ensure availability, durability and fault tolerance.

scaleMongo

First, we scale vertically

MongoDB, like most databases, craves RAM and IO capacity. It sometimes likes CPU. The simplest conceptual way of scaling performance for a MongoDB dataset is to give it more system resources without worrying about spreading the load across servers. Normally, this is painful for cost or operational reasons. Doubling the capacity of a production MongoDB replica set means swapping larger servers in, figuring out what to do with the old ones, and hoping the new ones are just the right size to keep things healthy for a long period of time.

Then, we scale out

At this point, scaling out MongoDB is easy. A well-built, sharded MongoDB dataset is easy to reason about and will scale linearly across other servers. The sharding needs the key to divide the data across several small / commodity servers grouped in a cluster.

shardingExample

Flexibility is a requirement of an evolving data set

MongoDB offers numerous features that make developers lives easier. It also offers features for scale. Using the scaling features at the wrong time means compromising on developer-friendly features (unique constraints, oplog usefulness, capped collections). There is a great deal of pressure on developers to use the MongoDB sharding features even when they’re not necessary, which makes their lives worse in aggregate. The most healthy MongoDB setups started with developers using features that helped them move faster, and evolved as understanding of the problem scope and appropriate scale increased.

For developers that use MongoDB, they should make smart decisions and don’t force themselves down a path before they even have a map. We say inspect and adopt in Agile 😉

 

What is Json schema

Json is very flexible, especially when you expect a json payload you have to ensure that data is passed in correct data type. For instance

Json Payload 1 – amount as number

{
item : “premaseem”,
sold : true
amount : 786
}

Json Payload 2 – id as string

{
item : “premaseem”,
sold: “True”
amount : “786”
}

Both payloads are good however if you have to do some calculation on amount and if it is passed as string, math operation would fail. So you need to publish a schema by which user or consumer can understand what data type is expected to create a valid and useful payload which can be consumed without causing issues.

This website can be used to generate the Json schema

http://jsonschema.net

This website can be used to validate the json object against Json schema

http://json-schema-validator.herokuapp.com

What is JSON Schema ?

JSON Schema specifies a JSON-based format to define the structure of JSON data for validation, documentation, and interaction control. A JSON Schema provides a contract for the JSON data required by a given application, and how that data can be modified.

JSON Schema is based on the concepts from XML Schema (XSD), but is JSON-based. The JSON data schema can be used to validate JSON data. As in XSD, the same serialization/deserialization tools can be used both for the schema and data. The schema is self-describing.

Example JSON Schema (draft 4):

{
  "$schema": "http://json-schema.org/schema#",
  "title": "Product",
  "type": "object",
  "required": ["id", "name", "price"],
  "properties": {
    "id": {
      "type": "number",
      "description": "Product identifier"
    },
    "name": {
      "type": "string",
      "description": "Name of the product"
    },
    "price": {
      "type": "number",
      "minimum": 0
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "stock": {
      "type": "object",
      "properties": {
        "warehouse": {
          "type": "number"
        },
        "retail": {
          "type": "number"
        }
      }
    }
  }
}

The JSON Schema above can be used to test the validity of the JSON code below:

{
  "id": 1,
  "name": "Foo",
  "price": 123,
  "tags": [
    "Bar",
    "Eek"
  ],
  "stock": {
    "warehouse": 300,
    "retail": 20
  }
}

 

MongoDB : What is Time to Live (TTL) index

TTL or time to live indexes are special single-field indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time. Data expiration is useful for certain types of information like machine generated event data, logs, and session information that only need to persist in a database for a finite amount of time.

To create a TTL index, use the db.collection.createIndex() method with theexpireAfterSeconds option on a field whose value is either a date or an array that contains date values.

For example, to create a TTL index on the lastModifiedDate field of the eventlog collection, use the following operation in the mongo shell:

db.eventlog.createIndex( { "lastModifiedDate": 1 }, { expireAfterSeconds: 3600 } )


Warning: The TTL index does not guarantee that expired data will be deleted immediately. 
There may be a delay between the time a document expires and the time that MongoDB removes the document from the database.

 

Python : SQLAlchemy ORM Library for python

SQLAlchemy  is a Python Library created by Mike Bayer to provide a high-level, Pythonic (idiomatically Python) interface to relational databases such as Oracle, DB2, MySQL, PostgreSQL, and SQLite. SQLAlchemy attempts to be unobtrusive to your Python code, allowing you to map plain old Python objects (POPOs) to database tables without substantially changing your existing Python code. SQLAlchemy includes a database server-independent SQL expression language and an object-relational mapper (ORM) that lets you use SQL to persist your application objects automatically.

Doc link : http://docs.sqlalchemy.org/en/latest/core/connections.html#basic-usage

NOTE : It is similar to what hibernate framework does in java.

SQLAlchemy  provides the following functionality:

  1. It maps relational databases into objects
  2. It manages an applications database connections
  3. It can create/alter a database layout if it is allowed to

The most powerful feature of SQLAlchemy is the first point: given a table description, it maps the data in tables into classes of objects, where each instance of an object is a row in the table, and can be worked with like a class or structure in code.

Example: Given a table called “Users”, with a FirstName and LastName column. Once the columns are described in the Python code, to add a row to the users table might look like this:

joebruin = User()
joebruin.FirstName = “Joe”
joebruin.LastName = “Bruin”
joebruin.save()

MongoDB : How to make query result look nice in mongo shell

Whenever we do find() query in mongoDB collection the shell is filled up with too much of data which is difficult to understand and ugly to look.

 

> db.devices.find()
{ "_id" : ObjectId("55ff79deb86a4b0eb1110ba4"), "time_allocated" : null,  "is_allocated" : false, "aggr_zone" : "xxx", "dc" : "DFW1", "time_suspended" : null, "device_type" : "server", "core_template_id" : "12345", "device_swapped_to" : null, "is_suspended" : false, "time_created" : "Sun Sep 20 22:30:38 2015", "device_id" : 48080, "is_decommed" : false }
{ "_id" : ObjectId("55ff79deb86a4b0eb1110ba5"), "time_allocated" : null,  "is_allocated" : false, "aggr_zone" : "xxx", "dc" : "ORD1", "time_suspended" : null, "device_type" : "server", "core_template_id" : "54321", "device_swapped_to" : null, "is_suspended" : false, "time_created" : "Sun Sep 20 22:30:38 2015", "device_id" : 45244, "is_decommed" : false }


 

To make the query result data look nicer there are 2 ways.
1. Use Pretty()

> db.devices.find().pretty()
{
	"_id" : ObjectId("55ff79deb86a4b0eb1110ba4"),
	"time_allocated" : null,
	"is_allocated" : false,
	"aggr_zone" : "xxx",
	"dc" : "DFW1",
	"time_suspended" : null,
	"device_type" : "server",
	"core_template_id" : "12345",
	"device_swapped_to" : null,
	"is_suspended" : false,
	"time_created" : "Sun Sep 20 22:30:38 2015",
	"device_id" : 48080,
	"is_decommed" : false
}
{
	"_id" : ObjectId("55ff79deb86a4b0eb1110ba5"),
	"time_allocated" : null,
	"is_allocated" : false,
	"aggr_zone" : "xxx",
	"dc" : "ORD1",
	"time_suspended" : null,
	"device_type" : "server",
	"core_template_id" : "54321",
	"device_swapped_to" : null,
	"is_suspended" : false,
	"time_created" : "Sun Sep 20 22:30:38 2015",
	"device_id" : 45244,
	"is_decommed" : false
}

Note : This would give your iterator, which mean you need to type "it" for next 20 records.
2. Use to Array

> db.devices.find().limit(2).toArray()
[
{
“_id” : ObjectId(“55ff79deb86a4b0eb1110ba4”),
“time_allocated” : null,
“is_allocated” : false,
“aggr_zone” : “xxx”,
“dc” : “DFW1”,
“time_suspended” : null,
“device_type” : “server”,
“core_template_id” : “12345”,
“device_swapped_to” : null,
“is_suspended” : false,
“time_created” : “Sun Sep 20 22:30:38 2015”,
“device_id” : 48080,
“is_decommed” : false
},
{
“_id” : ObjectId(“55ff79deb86a4b0eb1110ba5”),
“time_allocated” : null,
“is_allocated” : false,
“aggr_zone” : “xxx”,
“dc” : “ORD1”,
“time_suspended” : null,
“device_type” : “server”,
“core_template_id” : “54321”,
“device_swapped_to” : null,
“is_suspended” : false,
“time_created” : “Sun Sep 20 22:30:38 2015”,
“device_id” : 45244,
“is_decommed” : false
}
]
>

Note : This will display all records in shell, so use limit. Iterator is not given, so you might get any records mentioned unless you give sorting option.


			

Mac : Stop and Start Cassandra DB on Mac OSX

For Mac use below commands to stop and start Cassandra :

launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.cassandra.plist
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.cassandra.plist

Once cassandra is running you can validate that by opening query prompt using

 

  aseem’s Mac ✗$ cqlsh                                                                

Connected to Test Cluster at 127.0.0.1:9042.

[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]

Use HELP for help.

cqlsh>

What is Memcached ?

As name suggested Memcached is cache which could be though of as big key value pair bucket residing on RAM which can deliver frequently used data instantly by avoiding datasource access.

Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source (such as a database or API) must be read.

Memcached’s APIs provide a very large hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged in least recently used (LRU) order.[3][4] Applications using Memcached typically layer requests and additions into RAM before falling back on a slower backing store, such as a database.

Converting database or object creation queries to use Memcached is simple. Typically, when using straight database queries, example code would be as follows:

 function get_foo(int userid) {
    data = db_select("SELECT * FROM users WHERE userid = ?", userid);
    return data;
 }

After conversion to Memcached, the same call might look like the following

 function get_foo(int userid) {
    /* first try the cache */
    data = memcached_fetch("userrow:" + userid);
    if (!data) {
       /* not found : request database */
       data = db_select("SELECT * FROM users WHERE userid = ?", userid);
       /* then store in cache until next get */
       memcached_add("userrow:" + userid, data);
    }
    return data;
 }

The client would first check whether a Memcached value with the unique key “userrow:userid” exists, where userid is some number. If the result does not exist, it would select from the database as usual, and set the unique key using the Memcached API add function call.

However, if only this API call were modified, the server would end up fetching incorrect data following any database update actions: the Memcached “view” of the data would become out of date. Therefore, in addition to creating an “add” call, an update call would also be needed using the Memcached set function.

 function update_foo(int userid, string dbUpdateString) {
   /* first update database */
    result = db_execute(dbUpdateString);
    if (result) {
       /* database update successful : fetch data to be stored in cache */
       data = db_select("SELECT * FROM users WHERE userid = ?", userid);
       /* the previous line could also look like data = createDataFromDBString(dbUpdateString); */
       /* then store in cache until next get */
       memcached_set("userrow:" + userid, data);
    }
 }

This call would update the currently cached data to match the new data in the database, assuming the database query succeeds. An alternative approach would be to invalidate the cache with the Memcached delete function, so that subsequent fetches result in a cache miss. Similar action would need to be taken when database records were deleted, to maintain either a correct or incomplete cache.

What is ISO date format

As world is becoming a big family with globalization, we need the standard to followed to avoid confusion especially related to date and time. If every country or person would write dates in different format, it would be confusing to figure out exact date while doing communication across countries, hence ISO has come up with ISO date format to be followed by every one. 

International Standard ISO 8601 specifies numeric representations of date and time. This standard notation helps to avoid confusion in international communication caused by the many different national notations and increases the portability of computer user interfaces.

YYYY-MM-DDThh:mm:ss.sTZD

where:

     YYYY = four-digit year

     MM   = two-digit month (01=January, etc.)

     DD   = two-digit day of month (01 through 31)

     hh   = two digits of hour (00 through 23) (am/pm NOT allowed)

     mm   = two digits of minute (00 through 59)

     ss   = two digits of second (00 through 59)

     s    = one or more digits representing a decimal fraction of a second

     TZD  = time zone designator (Z or +hh:mm or -hh:mm)

example: 

 1997-07-16T19:20:30.45Z (Z at end indicates time is taken with Zero correction i.e. UTC time)

 1997-07-16T19:20:30.45+01:00 (+1:00 at end indicates time is taken at time zone which is 1:00 ahead of UTC time)

Advantages of the ISO 8601 standard date notation compared to other commonly used variants:

  • easily readable and writeable by software (no ‘JAN’, ‘FEB’, … table necessary)
  • easily comparable and sortable with a trivial string comparison
  • language independent
  • can not be confused with other popular date notations
  • consistency with the common 24h time notation system, where the larger units (hours) are also written in front of the smaller ones (minutes and seconds)
  • strings containing a date followed by a time are also easily comparable and sortable (e.g. write “1995-02-04 22:45:00”)
  • the notation is short and has constant length, which makes both keyboard data entry and table layout easier
  • identical to the Chinese date notation, so the largest cultural group (>25%) on this planet is already familiar with it 🙂
  • date notations with the order “year, month, day” are in addition already widely used e.g. in Japan, Korea, Hungary, Sweden, Finland, Denmark, and a few other countries and people in the U.S. are already used to at least the “month, day” order
  • a 4-digit year representation avoids overflow problems after 2099-12-31