Big Data and Kinesis

 EMR

  • Removes the challenges in setting up and maintaining Hadoop cluster.
  • Elastic Mapreduce - Managed Hadoop and Spark Service
  • Storage Options
    • HDFS - Default block size of 128MB.
    • EBS - For temporary data storage
    • EMRFS - Read/writes to S3 based on HDFS.
  • Instance Types
    • General Purpose - M4
    • Machine learning - C4
    • Deep learning - P3
    • Large HDFS - D2
    • Large scale interactive analysis - X1
  • Node types
    • Master - Manages the cluster - Runs Yarn to manage resources. Runs Ganglia, Zepplin.
      • Can have 1 or 3 master nodes in EMR cluster
    • Core nodes - 1 to many - runs HDFS - execute tasks from master nodes.
    • Task nodes - does computation and they dont run HDFS - can be 0 to as many as needed. To accelerate data processing more can be added.
  • EMR Cluster types
    • Transient - Terminates automatically after workload completion.
      • Say running 1 hr job 10 times a day
      • Takes 15-30 mins for initialization
    • Long running  - Need to be terminated manually. 
      • Say running 2 hr job 12 times a day
  • Lifecycles of EMR
    • Starting
    • Bootstrapping
    • Running
    • Waiting
    • Shutting_Down
    • Completed
    • Failed
    • Terminated
  • Billing types
    • On-demand - Pay for what you use - Highest cost - Good for unpredictable workloads
    • Reserved  - 1 or 3 year commitment - Discounts
    • Spot - Upto 90% discounts on On-demand.
    • Use on-demand for Master and Core nodes. Use Spot for task nodes.
    • For long running - use reserved instances and Spot for task nodes.
  • Amazon EBS works differently within Amazon EMR than it does with regular Amazon EC2 instances. Amazon EBS volumes attached to EMR clusters are ephemeral: the volumes are deleted upon cluster and instance termination (for example, when shrinking instance groups), so it's important that you not expect data to persist
EMR
    1. Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data using EC2 instances. When using Amazon EMR, you don’t need to worry about installing, upgrading, and maintaining Spark software (or any other tool from the Hadoop framework). You also don’t need to worry about installing and maintaining underlying hardware or operating systems.
    2. The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role .
      1. Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it’s possible to create a single-node cluster with only the master node.
      2. Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
      3. Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
      4. Dont have master and core on Spot instances. Use for Task nodes.
  1. Amazon Redshift only supports Single-AZ deployments.
    1. You can run data warehouse clusters in multiple AZ's by loading data into two Amazon Redshift data warehouse clusters in separate AZs from the same set of Amazon S3 input files.
    2. With Redshift Spectrum, you can spin up multiple clusters across AZs and access data in Amazon S3 without having to load it into your cluster.
    3. Cluster data can be backed up to via snapshots to S3. These snapshots can be restored in any AZ in that region or transferred automatically to other regions for disaster recovery.
    4. With the introduction of Elastic Load Balancing (ELB) access logs, administrators have a tremendous amount of data describing all traffic through their ELB. Redshift can be used for analysis via SQL queries.  
    5. Snapshots are point-in-time backups of a cluster. There are two types of snapshots: automated and manual. Amazon Redshift stores these snapshots internally in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. Amazon Redshift automatically takes incremental snapshots that track changes to the cluster since the previous automated snapshot. 
    6. Automated snapshots retain all of the data required to restore a cluster from a snapshot. You can take a manual snapshot any time. 
    7. When automated snapshots are enabled for a cluster, Amazon Redshift periodically takes snapshots of that cluster, usually every eight hours or following every 5 GB per node of data changes, or whichever comes first.
    8.  Automated snapshots are enabled by default when you create a cluster. Even though the automatic snapshots feature is enabled by default, cross-region snapshot copy is not.
    9. These snapshots are deleted at the end of a retention period. The default retention period is one day, but you can modify it.
    10. When you restore from a snapshot, Amazon Redshift creates a new cluster and makes the new cluster available before all of the data is loaded, so you can begin querying the new cluster immediately. The cluster streams data on demand from the snapshot in response to active queries, then loads the remaining data in the background.
    11. Amazon Redshift only has cross-region backup feature (using snapshots), not Cross-Region Replication. it can’t replicate directly to another cluster in another region.
    12. When you launch an Amazon Redshift cluster, you can choose to encrypt it with a master key from the AWS Key Management Service (AWS KMS). AWS KMS keys are specific to a region. If you want to enable cross-region snapshot copy for an AWS KMS-encrypted cluster, you must configure a snapshot copy grant for a master key in the destination region so that Amazon Redshift can perform encryption operations in the destination region.
    13. Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck behind long-running queries. Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues. 
    14. Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Thus, Redshift Spectrum queries use less of your cluster's processing capacity than other queries.
  2. Kinesis Data Firehose is the easiest way to reliably load streaming data into data stores and analytics tools. 
    1. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. 
    2. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. 
    3. It can also batch, compress, transform, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.
  3. Kinesis Data Streams
    1. A producer puts data records into Amazon Kinesis data streams. For example, a web server sending log data to a Kinesis data stream is a producer.
    2. A consumer processes the data records from a stream.
    3. To put data into the stream, you must specify the name of the stream, a partition key, and the data blob to be added to the stream. 
    4. The partition key is used to determine which shard in the stream the data record is added to.
    5. Kinesis shard has 1MB/sec limit, so if we connect multiple devices whose output to shard exceeds the limit we will get error.
    6. Consumers use enhanced fan-out by retrieving data with the SubscribeToShard API (API that pushes data from shards to consumers over a persistent connection without a request cycle from the client.). You can have multiple consumers using enhanced fan-out and others not using enhanced fan-out at the same time.
    7. Read - Getrecords API and Write - Put RecordsAPI
    8. Capacity mode = On-demand mode is best suited for workloads with unpredictable and highly variable traffic patterns.  Provisioned mode is best suited for predictable traffic, where capacity requirements are easy to forecast.
  4. Kinesis Video Streams
    1. One video stream per device (producer)
    2. Underlying data stored in S3
    3. Cannot output the stream directly to S3, it has to be handled by a custom solution .
    4. Integrates with Rekognition for facial identification from the videos. 
  5. Rekognition
    1. Facial analysis/search
    2. Text, Face detection, Celebrity recognition, matching against known faces stored in db.

Comments

Popular posts from this blog

Key Concepts

Linear Algebra Concepts

Cryptography