Haddad Said

Cloud Native Big Data Solutions Architect

AWS S3 Storage Gateway

AWS Storage Gateway connects an on-premises software appliance with cloud-based storage, providing seamless and secure integration between the on-premises IT environment and the AWS storage infrastructure. The service can be used for backup and archiving, disaster recovery, cloud bursting, storage tiering, and migration. On-premises applications connect to the service through a gateway appliance using standard storage protocols, such as NFS and iSCSI, and the gateway in turn connects to AWS storage services, such as Amazon S3, Amazon Glacier, and Amazon EBS, providing file-based, volume-based, and tape-based storage solutions.

Protecting AWS S3 Buckets

Managing Access to S3 Buckets: By default, all Amazon S3 resources (buckets, objects, and related subresources) are private; only the resource owner (the AWS account that created the resource) can access it. To grant access to others, the resource owner has to write an access policy. As explained in the AWS Identity and Access Management article, these policies are either identity-based (user policies) or resource-based. Bucket policies (resource-based) and user policies (identity-based) are two of the access policy options available for granting permissions to S3 resources; both use a JSON-based access policy language.
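
To make the JSON-based policy language concrete, here is a minimal sketch of a resource-based (bucket) policy built in Python. The bucket name `example-bucket`, the account ID `111122223333`, the statement ID, and the helper `make_read_only_bucket_policy` are all hypothetical and chosen only for illustration; the sketch prints the policy document and does not call AWS.

```python
import json


def make_read_only_bucket_policy(bucket_name: str, account_id: str) -> dict:
    """Build a bucket policy that lets another AWS account read objects.

    Both arguments are placeholders supplied by the caller; nothing here
    talks to AWS, it only assembles the policy document.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountRead",  # hypothetical statement ID
                "Effect": "Allow",
                # Who is granted access: the root principal of the other account.
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                # What they may do: read objects only.
                "Action": ["s3:GetObject"],
                # Which resources: every object in the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


policy = make_read_only_bucket_policy("example-bucket", "111122223333")
print(json.dumps(policy, indent=2))
```

An identity-based (user) policy has the same overall shape but omits `Principal`, because the policy is attached to the user, group, or role it applies to.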

AWS CloudFront

Content Delivery Network: A Content Delivery Network (CDN) is a geographically distributed network of servers that work together to provide highly available, fast delivery of static internet content, based on the geographic location of the user, the origin of the web content, and a content delivery server. This is achieved by directing each user's request to a server located close to that user and caching the content that is requested. Subsequent requests are served from the local server's cache, which improves performance.
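
The caching behaviour described above can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions, not how CloudFront is implemented: `EdgeCache` and `origin` are hypothetical names, the cache never expires entries, and there is a single edge server.

```python
from typing import Callable, Dict


class EdgeCache:
    """Toy edge server: fetch from the origin once, then serve from cache."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        if path in self._store:
            # Subsequent request: served locally, no trip to the origin.
            self.hits += 1
            return self._store[path]
        # First request: go to the origin and remember the response.
        self.misses += 1
        content = self._fetch(path)
        self._store[path] = content
        return content


def origin(path: str) -> bytes:
    """Stand-in for the origin web server (hypothetical)."""
    return f"content of {path}".encode()


edge = EdgeCache(origin)
edge.get("/logo.png")  # cache miss: fetched from the origin
edge.get("/logo.png")  # cache hit: served from the edge
print(edge.hits, edge.misses)
```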

AWS S3 (Cross Region) Replication

AWS S3 Cross-Region Replication is a bucket-level configuration that enables automatic, asynchronous copying of objects across buckets in different AWS Regions; these buckets are referred to as the source bucket and the destination bucket. When replication is set up, by default: replicas have the same key names and the same metadata (for example, creation time, user-defined metadata, and version ID); Amazon S3 stores object replicas using the same storage class as the source object, unless you explicitly specify a different storage class in the replication configuration; and, assuming that the object replica continues to be owned by the source object owner, when Amazon S3 initially replicates objects it also replicates the corresponding object access control list (ACL). Use cases: Cross-Region Replication is useful for various reasons, including:

AWS S3 Buckets

Buckets: To upload data (e.g. files such as videos, photos, and spreadsheets) to Amazon S3, a bucket needs to be created first. A bucket is an organizational structure similar to a drive or a mount point on an operating system, in the sense that files and directories/folders can be stored in it. It is important to note that S3 bucket names must be globally unique across all regions and all AWS accounts, and that the name chosen for a bucket must be DNS compliant.
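
As a rough illustration of what "DNS compliant" means in practice, the sketch below checks a simplified subset of the S3 naming rules: 3 to 63 characters, only lowercase letters, digits, hyphens, and dots, starting and ending with a letter or digit, no consecutive dots, and not formatted like an IP address. The function name `is_valid_bucket_name` is hypothetical, and the check cannot tell whether a name is already taken globally; only AWS can answer that.

```python
import ipaddress
import re

# Lowercase letters, digits, dots, and hyphens; must start and end
# with a letter or digit.
_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]*[a-z0-9]$")


def is_valid_bucket_name(name: str) -> bool:
    """Check a simplified subset of the S3 bucket naming rules."""
    if not 3 <= len(name) <= 63:
        return False
    if not _NAME_PATTERN.match(name):
        return False
    if ".." in name:  # empty DNS labels are not allowed
        return False
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True  # not an IP address, so the name passes
    return False  # looks like an IP address, e.g. 192.168.5.4


for candidate in ["my-example-bucket", "My_Bucket", "192.168.5.4"]:
    print(candidate, is_valid_bucket_name(candidate))
```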

AWS Simple Storage Service (S3)

AWS Simple Storage Service (S3) is a secure, durable, highly scalable object storage service. It offers software developers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. It provides a simple web service interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. Data is stored in Amazon S3 buckets, which are the fundamental containers for storage.

AWS Identity and Access Management

AWS Identity and Access Management (IAM) is a web service that helps to securely control access to AWS resources. IAM is used to control who is authenticated (signed in) and authorized (has permissions) to use resources. When an AWS account is first created, a single sign-in identity (called the AWS account root user) that has complete access to all AWS services and resources in the account is also created. This identity is accessed by signing in with the email address and password used to create the account.

Linux Shell on Windows

For many years I have been yearning for a single operating system that I can run for all my work and leisure needs, but I have always had to run Linux for my software development tasks and Microsoft Windows for general usage. Linux works well for me when programming because of the powerful bash shell, the Linux command line utilities (like git, grep, cat, etc.) and how well they integrate with the shell, and a host of development tools that just work so naturally in Linux, like python and node.

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. As I wrote in my article on Hadoop, big data sets required a better way of processing because traditional RDBMS simply can't cope, and Hadoop has revolutionized the industry by making it possible to process these data sets using horizontally scalable clusters of commodity hardware. However, Hadoop's own compute engine, MapReduce, is limited: it offers a single programming model based on Mappers and Reducers, and it is tied to reading and writing data to the filesystem during processing, which slows the process down.
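
To show what the Mapper/Reducer programming model looks like, here is a plain-Python word count sketch. It is not Hadoop's or Spark's API: the names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical, and everything runs in one process in memory, whereas a real MapReduce job runs each phase across many machines and writes intermediate results to the filesystem.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse the values for one key into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data needs big clusters", "data beats opinion"]))
```

Any computation has to be squeezed into this map, shuffle, reduce shape, which is the single-model limitation the paragraph above refers to.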

Hadoop

Traditionally, data analysis has been done on Relational Database Management Systems (RDBMS), which work on data with a clearly defined structure, since they require a schema definition before the data can be loaded. RDBMS also scale better vertically than horizontally, meaning that scaling is done by using higher-capacity machines rather than by spreading the load across many machines, because replicating RDBMS data across machines tends to be problematic.