My Journal

Journey of a Software Architect into the world of Data Science

Python Spark Platform on AWS

Following on from my post on setting up a platform to get started with data science tools, I have since set up a Jupyter-based platform for programming Python on Spark. On top of using Python libraries (like pandas, NumPy, Scikit-Learn, etc.) that make data analysis easier, on this platform I can also use Spark to code applications that run on distributed clusters. This setup has the following benefits: It is web based, so I can work on my projects from anywhere as long as I have a web browser with an internet connection. It is set up using lightweight EC2 instance types (t2.

Python Spark Challenge

I was recently asked to solve a data science challenge for a job application. The challenge was to write a Spark application that determines the top 10 most rated genres among TV series with more than 10 episodes. It required the solution to be written in Scala using the SBT tool. I later rewrote the solution in Python, which I am more comfortable with; here are my notes on the Python solution.
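To make the task concrete, here is a minimal sketch of the same logic in plain Python rather than Spark. It is not the solution from the post: the sample rows, the field layout and the function name are all made up for illustration, and "most rated" is assumed to mean the largest total number of ratings.

```python
from collections import Counter

# Made-up rows: (title, number of episodes, genres, number of ratings).
# The real data set and its schema are not shown in this excerpt.
SERIES = [
    ("Show A", 24, ["Drama", "Crime"], 1200),
    ("Show B", 8, ["Comedy"], 5000),
    ("Show C", 12, ["Comedy", "Drama"], 300),
    ("Show D", 50, ["Animation"], 700),
]


def top_rated_genres(series, min_episodes=10, limit=10):
    """Return up to `limit` (genre, total ratings) pairs, largest first."""
    ratings_per_genre = Counter()
    for _title, episodes, genres, ratings in series:
        # Keep only series with more than `min_episodes` episodes.
        if episodes > min_episodes:
            # A series counts once towards each of its genres.
            for genre in genres:
                ratings_per_genre[genre] += ratings
    return ratings_per_genre.most_common(limit)


print(top_rated_genres(SERIES))
# [('Drama', 1500), ('Crime', 1200), ('Animation', 700), ('Comedy', 300)]
```

In a Spark version the same steps would typically become a filter on the episode count, a flat-map or explode over the genres, an aggregation per genre, and a descending sort limited to 10 rows.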

Dockerizing a NodeJS App

In this post I am documenting the steps I took to convert a traditional NodeJS app, launched from the command line using node app.js, into a fully dockerized container solution. The app uses a MySQL database, for which it has a static configuration. I am not going into too much detail about the app’s code or architecture, but it is worth noting that it has this piece of configuration for connecting to the database:

Learn Faster

Carrying on with Jeff Patton’s “User Story Mapping” for my project: after slicing releases, it is time to put a framework in place to learn faster. Once you have your product idea and have asked yourself who your customers are, how they will use it and why they need it, you need to *validate the problem your product will solve really exists*. Find a handful of people from your target market and try to engage them.

Build Less

Carrying on with Jeff Patton’s “User Story Mapping” for my project, following on from Framing the Big Picture, the next step is to plan on building less, because there’s always more to build than you have people, time and money for. Story mapping helps big groups build shared understanding. If the product has stories that cross multiple teams’ domains, get all the teams together so that you can map a product release across all of them; this will help visualize the dependencies between the teams.

The Big Picture

While reading Jeff Patton’s “User Story Mapping” I have been applying the ideas to a small project I am working on, an online grocery shopping service, gengeni.com. In this post I am documenting how I focus on the big picture. Jeff insists on creating documents that promote a shared understanding through user stories (rather than traditional requirements, which are prone to misinterpretation). He insists that we build software not for its own sake but to make things better and solve real-world problems; therefore we should focus on maximizing the outcome (how we make things better) while minimizing the output (software components).

Python Libraries - NumPy

NumPy is the core library for scientific computing in Python. It adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. Installation: the easiest way to get NumPy installed is to install one of the Python distributions, like Anaconda, which include all of the key packages for the scientific computing stack. Usage: to start using NumPy you need to import it.
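A minimal sketch of the import and a first multi-dimensional array, assuming NumPy is already installed. The array values are made up for illustration.

```python
import numpy as np  # "np" is the conventional alias for NumPy

# A 2 x 3 array (two rows, three columns) built from nested lists.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.shape)        # (2, 3)
print(matrix.sum(axis=0))  # column sums: [5 7 9]
print(matrix.mean())       # mean of all six elements: 3.5
print(np.sqrt(matrix[0]))  # element-wise square root of the first row
```

The mathematical functions operate on whole arrays at once, so no explicit Python loop is needed.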

Data Science Getting Started Platform

Data Science Getting Started Platform To get started quickly with data science, I started looking at python and its powerful set of libraries (like pandas, NumPy, Scikit-Learn, etc) that makes data analysis easier. I wanted to have a platform that is accessible over the internet so I can get to it from any laptop/PC that has internet access. I decided to get a minimal Virtual Private Server (VPS) that supports containers so I can set up a Docker container with all the languages and frameworks/libraries/tools and mount a path on the VPS that contains all the projects I am working on, which will be checked in to git.

Installing Docker on Ubuntu

Installing Docker on Ubuntu This post is essentially my notes on getting started quickily with Docker. I set this up in my lab machines running Ubuntu 16.04.1 LTS, the steps are based on the excellent instructions written on the Docker getting started guide Add the Docker project repository to APT sources sudo apt-get install apt-transport-https ca-certificates sudo apt-key adv \ --keyserver hkp://ha.pool.sks-keyservers.net:80 \ --recv-keys 58118E89F3A912897C070ADBF76221572C52609D echo "deb https://apt.

Decoupling API Versions From Codebase Versions

When developing a package (any piece of reusable code, like a class library to be loaded or a web service that’s accessible through HTTP) that has a published API, it is necessary to keep a clear separation between the API version and the codebase version of the package. The API is what the package exposes for its users to consume. It should be documented clearly, and the module should include thorough tests that exercise the entire published API to assert that it conforms to the documentation.
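One way to picture the separation is a module that carries two independent version strings plus a check that every documented name is actually exposed. This is only a sketch under stated assumptions: the names CODEBASE_VERSION, API_VERSION, PUBLISHED_API, missing_from_api and is_compatible are hypothetical, and the "same major, equal or newer minor" compatibility rule is an assumed convention, not something the post prescribes.

```python
# Codebase version: changes with every release, including bug fixes and
# internal refactoring that leave the published API untouched.
CODEBASE_VERSION = "2.7.3"

# API version: changes only when the published interface itself changes.
API_VERSION = "1.2"

# Hypothetical published API: each documented name mapped to the API
# version in which it was introduced.
PUBLISHED_API = {"add": "1.0", "subtract": "1.2"}


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


def missing_from_api(namespace):
    """Return documented names that are absent or not callable in `namespace`."""
    return [name for name in PUBLISHED_API if not callable(namespace.get(name))]


def is_compatible(required_api, provided_api=API_VERSION):
    """Assumed rule: same major version, provided minor >= required minor."""
    req_major, req_minor = (int(part) for part in required_api.split("."))
    got_major, got_minor = (int(part) for part in provided_api.split("."))
    return req_major == got_major and req_minor <= got_minor


# A test of the published API would assert that nothing documented is missing.
MISSING = missing_from_api(globals())
print(MISSING)               # []
print(is_compatible("1.0"))  # True: a client written for API 1.0 still works
print(is_compatible("2.0"))  # False: a different major API version
```

With this layout a bug-fix release would bump only CODEBASE_VERSION, while clients keep checking API_VERSION to decide whether they can use the package.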