Big Data

Traditional systems do not scale to big data volumes. New solutions are needed, such as NoSQL ("not only SQL") systems.

Relational Database

Pros:

  • Known technology
  • Good structure
  • Good interface
  • Supports concurrent access for multiple users with different permissions
  • Multi-user synchronization
  • Query processing engine

Cons:

  • Scales only to medium size
  • Cannot handle scaling beyond that
  • Problems with semi-structured data

Definition of Big Data V's

  • Volume: Large amount of data
  • Velocity: High-speed access to data
  • Variety: Different data sources and structures
  • Veracity: Data has different qualities
  • Value: Some data is more valuable than other data

Volume (Data at Rest)

  • Scale and dimension of the data are huge
  • Large data volumes must still be processable and editable

Velocity (Data in Motion)

  • Speed and dynamics of incoming data are high
  • Data streaming
  • Handling of speed, streams, and the creation of structured records
  • Access & delivery of data in real-time

Variety (Data in many Forms)

  • Different data types without a fixed structure (images, videos, databases, blogs, etc.)
  • Lack of uniformity
  • Data can be highly semi-/unstructured

Veracity (Data in Doubt)

  • Trustworthiness of data
  • Different qualities of data
  • Data consistency: trustworthiness, accuracy, etc.
  • Results can only be considered probable

Value (additional)

  • Only significant if added value can be generated from Big Data
  • New business models, products, services, etc.

Scalability

  • Ability of a system to handle a growing amount of work by adding resources to the system

Vertical Scaling (Up-Scaling)

Note

Increasing the performance and size of a single instance

  • Bigger and better machine
  • Vendor lock-in
  • Not many providers
  • Higher costs
  • Works well until data gets too big

Horizontal Scaling (Out-Scaling)

Note

Multiple instances working together

  • Cheaper
  • Bypasses the limits of vertical scaling
  • Latency between instances
  • Network can be insecure
  • Bandwidth is limited
  • Network is not reliable

Cluster

Note

Collection of connected nodes, based on the shared-nothing architecture

  • Network of Nodes
    • Sharing nothing (no shared memory, no shared storage)
    • Communication via messages
  • Distribution of different tasks
    • Data
    • Queries
    • Calculations
    • Requests
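
The shared-nothing idea above can be sketched in a few lines. This is only an illustrative toy, not a real cluster framework: the `Node` class, its `inbox` queue, and the `("put" | "get", key, value)` message format are all assumptions made for this example. Each node keeps private storage, and other nodes can reach it only by sending messages.

```python
from collections import deque


class Node:
    """Toy cluster node: private state, reachable only through messages."""

    def __init__(self, name):
        self.name = name
        self.store = {}       # private storage, never touched by other nodes
        self.inbox = deque()  # incoming messages (hypothetical format)

    def send(self, other, message):
        # Communication happens only by placing a message in the peer's inbox.
        other.inbox.append((self.name, message))

    def process_inbox(self):
        """Handle all queued messages and return (recipient, reply) pairs."""
        replies = []
        while self.inbox:
            sender, (op, key, value) = self.inbox.popleft()
            if op == "put":
                self.store[key] = value
                replies.append((sender, "ok"))
            elif op == "get":
                replies.append((sender, self.store.get(key)))
        return replies


a, b = Node("a"), Node("b")
a.send(b, ("put", "user:1", "Ada"))   # distribute data to node b
a.send(b, ("get", "user:1", None))    # distribute a query to node b
replies = b.process_inbox()
```

Node `a` never reads `b.store` directly; it only learns the value through the reply, which is the property that lets nodes live on separate machines.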

Sharding

Note

Partitioning of data across multiple machines

  • Vertical Shards
  • Horizontal Shards
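
To make the two shard types concrete, here is a minimal sketch using a small in-memory table. The `users` data and the helper names `horizontal_shards` and `vertical_shards` are made up for illustration. It assumes the usual meaning of the terms: horizontal shards split the records (rows), vertical shards split the attributes (columns).

```python
users = [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Linus", "country": "FI"},
    {"id": 3, "name": "Grace", "country": "US"},
    {"id": 4, "name": "Edsger", "country": "NL"},
]


def horizontal_shards(rows, n):
    """Split by rows: each shard has all columns but only some records."""
    shards = [[] for _ in range(n)]
    for row in rows:
        shards[row["id"] % n].append(row)
    return shards


def vertical_shards(rows, column_groups, key="id"):
    """Split by columns: each shard has all records but only some columns.

    The key column is repeated in every shard so records can be rejoined.
    """
    return [
        [{key: row[key], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]


h = horizontal_shards(users, 2)
v = vertical_shards(users, [["name"], ["country"]])
```

Here `h[0]` holds the records with even ids and `h[1]` those with odd ids, while `v[0]` holds id and name for every user and `v[1]` holds id and country.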

Pros:

  • Balanced workloads
  • Respects physical locations
  • Uniform data distribution

Cons:

  • Data changes over time, so an initially balanced distribution can become skewed
  • Data conflicts (inconsistency)

Strategies

  • Mapping Structures
    • round-robin
    • range
    • hash
  • Composite Partitioning
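
The mapping strategies above could look roughly like the following sketch. The function names, the use of exclusive upper bounds for ranges, and the choice of SHA-256 as a stable hash are assumptions for this example, not something the notes prescribe. The composite variant shown is just one possible combination (range first, then hash inside the range).

```python
import bisect
import hashlib


def round_robin_shard(position, n_shards):
    """Assign the i-th inserted record to shard i mod n."""
    return position % n_shards


def range_shard(key, upper_bounds):
    """Assign a key to the first range whose exclusive upper bound exceeds it.

    Keys at or above the last bound fall into one extra overflow shard.
    """
    return bisect.bisect_right(upper_bounds, key)


def hash_shard(key, n_shards):
    """Assign a key by a stable hash of its string form.

    hashlib is used because Python's built-in hash() is salted per process
    for strings, so it would not give the same shard across restarts.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def composite_shard(key, upper_bounds, n_sub):
    """Composite partitioning: pick a range, then hash within that range."""
    return (range_shard(key, upper_bounds), hash_shard(key, n_sub))


bounds = [100, 200]  # shard 0: key < 100, shard 1: 100 <= key < 200, shard 2: rest
round_robin = [round_robin_shard(i, 3) for i in range(5)]
by_range = [range_shard(k, bounds) for k in (50, 100, 250)]
```

Round-robin gives the most uniform distribution but cannot locate a record by key. Range partitioning keeps neighbouring keys together, and hashing spreads keys evenly while still allowing lookups by key.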

Replication

Note

Copying data across multiple machines to provide redundancy

Master-Slave Architecture

  • One node is primary (master)
    • Only the master accepts writes
  • All other nodes are secondary (slaves)
    • Slaves serve reads only and receive changes replicated from the master
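
A minimal sketch of this read/write split follows. The class names `Master`, `Replica`, and `ReadOnlyError` are hypothetical, and the synchronous push to every replica on each write is a simplifying assumption; real systems often replicate asynchronously.

```python
class ReadOnlyError(Exception):
    """Raised when a client tries to write to a secondary node."""


class Replica:
    """Secondary node: serves reads, rejects client writes."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key)

    def write(self, key, value):
        raise ReadOnlyError("secondary nodes only serve reads")

    def _apply(self, key, value):
        # Called by the master only, to replicate an accepted write.
        self._data[key] = value


class Master:
    """Primary node: the only node that accepts writes."""

    def __init__(self, replicas):
        self._data = {}
        self._replicas = list(replicas)

    def write(self, key, value):
        self._data[key] = value
        for replica in self._replicas:  # synchronous replication (assumption)
            replica._apply(key, value)

    def read(self, key):
        return self._data.get(key)


replicas = [Replica(), Replica()]
master = Master(replicas)
master.write("k", 1)
values = [r.read("k") for r in replicas]
```

After the write, every replica returns the same value, so reads can be spread across the secondaries while all writes are funnelled through the one primary.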

Peer-to-Peer Architecture

  • All nodes are equal
    • No bottleneck or single point of failure for reads and writes
    • Scalable
    • Synchronization is needed
  • All nodes hold the same data
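
The need for synchronization can be illustrated with a small sketch in which every peer accepts writes and forwards them to the others. The `Peer` class and its conflict rule are assumptions chosen for brevity: each record carries a (version, writer) pair and the higher pair wins, a simple last-writer-wins scheme. The notes do not say which synchronization method is meant.

```python
class Peer:
    """Toy peer: accepts reads and writes, forwards writes to all other peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.data = {}  # key -> (version, writer, value)

    def connect(self, nodes):
        self.peers = [p for p in nodes if p is not self]

    def write(self, key, value):
        # Bump the locally known version and broadcast the new record.
        version = self.data.get(key, (0, "", None))[0] + 1
        record = (version, self.name, value)
        self.data[key] = record
        for peer in self.peers:
            peer.receive(key, record)

    def receive(self, key, record):
        current = self.data.get(key)
        # Keep whichever record has the higher (version, writer) pair.
        if current is None or record[:2] > current[:2]:
            self.data[key] = record

    def read(self, key):
        record = self.data.get(key)
        return None if record is None else record[2]


nodes = [Peer("a"), Peer("b"), Peer("c")]
for node in nodes:
    node.connect(nodes)

nodes[0].write("x", 1)  # any peer may accept a write
nodes[2].write("x", 2)  # a later write through a different peer
reads = [node.read("x") for node in nodes]
```

Because every peer applies the same rule to the same records, all of them end up with the same value, at the cost of extra messages on every write.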