Playing with Hadoop (Part 1)

The Hadoop Distributed File System (HDFS) is a redundant file system that stores your data in chunks across multiple nodes. This allows for fault-tolerant data storage (a node can die without the loss of data) as well as parallel data processing. If you want to store and analyze large amounts of data, Hadoop is a great option.

I recently read a great book called Data Analytics with Hadoop, and this post is based on what I learned there. In this tutorial, I walk you through setting up Hadoop locally for experimentation. I also show you how to create a simple job that processes data in Hadoop.

Create the Virtual Machine

We’re going to start by creating a virtual machine. First, download and install VirtualBox. You’ll also want to download the latest LTS Ubuntu Server ISO. Once VirtualBox is installed and your ISO is downloaded, go to VirtualBox and create a new virtual machine with the following parameters:

  • Type: Linux (Ubuntu 64 bit)
  • Memory: I recommend 2 GB
  • Disk: 10GB should be enough, but I’d recommend 50GB
  • Networking: Make sure your networking is set to Bridged if you want to SSH into the machine

When you start up your VM for the first time, VirtualBox will ask you to select installation media to install an OS. Use the Ubuntu server ISO you downloaded and install Ubuntu with all the default settings.

I won’t cover how to do this in detail, but I recommend setting up SSH (sudo apt-get install ssh) so you can remotely log into the virtual machine. This will allow you to work from your computer’s shell, copy-paste from your browser and switch between windows easily. You can add your machine’s public key to the virtual machine’s authorized keys so that you don’t have to type a password every time you log in.

Disable IPv6

I’m not sure if this is still true, but the book states that Hadoop doesn’t play well with IPv6. To disable it, edit the config by typing (sudo nano /etc/sysctl.conf) and at the end of the file add the commands listed here:

The settings don’t take effect until you reboot (sudo shutdown -r now). If you did this correctly, typing (cat /proc/sys/net/ipv6/conf/all/disable_ipv6) should print out the number 1 on your screen.

Installing Hadoop

Now comes the fun part: getting Hadoop all set up! Start by logging in with your username, then switch to root (sudo su) and follow the commands and instructions here:

Setting Up Hadoop

For this section, you’re going to want to log in as the hadoop user with (sudo su hadoop) and add the lines listed in this gist to both of these files:

  • /home/hadoop/.bashrc
  • /srv/hadoop/etc/hadoop/

You’ll then want to create a script to start up Hadoop by typing (nano ~/ and adding the content from this gist to it. In the directory /srv/hadoop/etc/hadoop, create or update the following files with the corresponding contents:

Finally, we set up an authorized key and format the name node by executing the following code:

Now let’s start up Hadoop! If you type (jps) now, it should only list Jps as a running process. To start up the Hadoop process just type (~/ The first time you run this command it’ll ask you if you trust these servers, to which you should answer “yes”. Now if you type (jps) you should see several processes running, such as SecondaryNameNode, NameNode, NodeManager, DataNode, and ResourceManager. From now on, you’ll only need to type (~/ to start up Hadoop on your virtual machine, and you’ll only need to do this if you restart your machine.

Create and Run Map-Reduce Scripts

A Map-Reduce job consists of two stages: mapping and reducing. In the mapping stage, you go over the raw input data and extract the information you’re looking for. The reduce stage is where the results of mapping are brought together for aggregation. It’s important to remember that these processes are distributed. In a real Hadoop cluster, mapping happens on different machines in parallel and you need to keep this in mind when writing your code. For the purpose of this tutorial we can visualize the process as follows:

[Figure: map-reduce]

Different nodes in the cluster process different chunks of data locally by running them through the mapper. Then, all the outputs from the different mappers are combined and sorted to be processed by the reducer. More complex arrangements exist, with multiple intermediate reducers for example, but that is beyond the scope of this tutorial.
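To make the flow concrete, here is a minimal sketch in plain Python that imitates the three steps (map, sort, reduce) in a single process. The records, the CHUNKS list and the function names are assumptions made up for illustration; they are not the tutorial's sample logs or scripts, and nothing here talks to Hadoop.

```python
from collections import defaultdict

# Hypothetical records of the form (date, HTTP status).
# Each inner list stands in for the chunk of data held by one node.
CHUNKS = [
    [("2016-01-01", "200"), ("2016-01-01", "404")],  # chunk on node 1
    [("2016-01-02", "200"), ("2016-01-02", "200")],  # chunk on node 2
]


def mapper(record):
    """Extract what we care about from one record: the status and a count of 1."""
    _date, status = record
    return (status, 1)


def reducer(key, values):
    """Aggregate all the values that share the same key."""
    return (key, sum(values))


def run_job(chunks):
    # Map: every node runs the mapper over its own chunk.
    mapped = [mapper(record) for chunk in chunks for record in chunk]
    # Combine and sort: pairs with the same key end up next to each other.
    grouped = defaultdict(list)
    for key, value in sorted(mapped):
        grouped[key].append(value)
    # Reduce: one call per distinct key.
    return [reducer(key, values) for key, values in grouped.items()]


if __name__ == "__main__":
    print(run_job(CHUNKS))
```

In a real cluster the map step runs on different machines at the same time, which is why the mapper only ever looks at one record and never relies on state shared with other chunks.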

Getting the Scripts

Now that we have Hadoop up and running on our sandbox, let’s analyze some logs! You’ll want to be logged in as the Hadoop user (sudo su hadoop). Go to the home directory (cd ~) and check out the sample code by running (git checkout Then change to the directory of this tutorial by typing (cd hadoop-tutorial/part-1).

In this folder you’ll find sample logs to work with and four pairs of mapper and reducer scripts, which do the following:

  • count_status: Counts occurrences of the status field in all the logs
  • status per day: Same as the above, but provides the stats per day
  • logs_1day: Fetches all the logs of a specific day
  • sample: Extracts a 1% random sample of the logs

Running the Scripts Locally

The scripts provided can either be run locally or in the Hadoop cluster. To run them locally, execute the following from inside the part-1 folder:

  • cat sample_log.txt | ./ | sort | ./

To run any of the other jobs, just substitute the mapper/reducer scripts as needed.
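If you're curious what a mapper/reducer pair for this kind of pipeline can look like, here is a rough sketch of a status counter. It is not the code from the repository. It assumes (and this is purely my assumption) that each log line ends with a status field separated by whitespace, and the names map_lines, reduce_lines and run_local are invented for this example. Streaming scripts normally read sys.stdin and print to standard output; the sketch takes any iterable of lines so it is easy to try out.

```python
from itertools import groupby

# Hypothetical log lines; the real sample logs may use a different format.
SAMPLE_LINES = [
    "10.0.0.1 /index.html 200\n",
    "10.0.0.2 /missing 404\n",
    "10.0.0.3 /index.html 200\n",
]


def map_lines(lines):
    """Mapper: emit 'status<TAB>1' for every non-empty log line.

    Assumption: the status is the last whitespace-separated field.
    """
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    This only works because the input is sorted by key, which is what
    the `sort` step in the shell pipeline provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """In-process stand-in for: cat log | mapper | sort | reducer."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for result in run_local(SAMPLE_LINES):
        print(result)
```

Both functions handle one line at a time, so memory use stays flat no matter how large the input is. In an actual streaming script you would pass sys.stdin in place of SAMPLE_LINES.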

Uploading the Logs to Hadoop

Before running a job in Hadoop, we’ll need some data to work with. Let’s upload our sample logs with the following commands:

  • hadoop fs -mkdir -p /home/hadoop
  • hadoop fs -copyFromLocal sample_logs.txt /home/hadoop/sample_logs.txt

Running a Job in Hadoop

Finally, type the following to run a job in Hadoop:

  • hadoop jar $HADOOP_JAR_LIB/hadoop-streaming* -mapper /home/hadoop/hadoop-tutorial/part-1/ -reducer /home/hadoop/hadoop-tutorial/part-1/ -input /home/hadoop/sample_logs.txt -output /home/hadoop/job_1_samples

If that runs successfully, you’ll be able to view the job results by typing (hadoop fs -ls /home/hadoop/job_1_samples) and (hadoop fs -cat /home/hadoop/job_1_samples/part-00000).

Another interesting thing to look at is the Hadoop dashboard, which can be found at http://[VM’s IP Address]:8088. This will provide you with some information on the jobs that have been running on your virtual cluster.


At this point, you might be thinking: “I just ran a couple of Python scripts locally, and then submitted them to Hadoop to get the same answer. What’s the big deal?” I’m glad you asked (and noticed)! It is true that Hadoop gives you nothing interesting when you’re only working on a few megabytes of data. But imagine that you had a few terabytes of data instead. At that scale:

  • It would be very hard to store that information on one machine
  • It would take a very long time to run your python script on one giant file
  • You might run out of memory before getting through all the data
  • If that one machine crashes, you could lose all or part of your data

That’s where the Hadoop environment is useful. Your terabytes of data are spread across several nodes, and each node works on a chunk of data locally. Then, each node provides its partial data to produce the final result. Moreover, the beauty of using python the way we just did is that you can first test your script on a local small sample to make sure it works. After you debug it and make sure it works as expected, you can then submit the same code to your Hadoop cluster to work on larger volumes of data.

I hope you enjoyed this tutorial. In part 2 I plan to tackle the topic of: “How do I get my data into Hadoop?”. Specifically, we’ll look into setting up Kafka to receive log messages and store them in HDFS.

WordPress Plugins: How to Develop in Git and Publish to Subversion

About a month ago, I got my Sift Science plugin added to the online store. To publish your plugin to the store, you’re required to use the SVN repository that they provide. Once you get that done correctly, users of WordPress can find and install your plugin through the built-in store, and they will also receive notifications whenever you publish a new version.

In this post, I describe how I manage releases of my Sift Science plugin. The information here should be useful for many different types of plugins, but it specifically addresses two key issues I faced when getting started with this:

  • I develop my plugin primarily in Git
  • My plugin has a React component that must be packaged with the plugin

So let’s get started!

Getting your Plugin Published

You’ll need to request hosting for your plugin. You’ll have to make sure your plugin conforms to the requirements listed there and follow the instructions to provide the required information.

Checkout Git and SVN Trunk Together

I keep my Git repository synced with the trunk directory in SVN. On my computer I have checked out both Git and SVN trunk into the same directory, which makes it easy to keep them in sync. It can be a little tricky to set up initially, but the basic commands to do this are:

git clone [git repo address]
svn checkout [repo]/trunk
svn revert -R .

I do all of my development in Git, with detailed commit messages and so on. Then, when I’m ready to sync SVN with Git, I’ll pull the latest changes from Git (git pull) and check those changes into SVN trunk (svn ci ...).

Git vs SVN Ignored Files

For most plugins, SVN and Git will probably have identical ignores. However, since I have a React app that needs to be web-packed for deployment, my SVN and Git ignores are slightly different. In my case, I drop the web-packed React app into the /dist folder of my project. Naturally, since this is “compiled from code”, the folder is ignored in Git. However, I don’t ignore this in SVN, since I want to ship that file to my users.

Update The Plugin’s Documentation

When you have a new version that you want to push to your users, you’ll want to update the documentation as follows:

  • Update “stable tag” entry and the change log in readme.txt
  • Update the version number in your plugin’s main php file (`Version: x.y.z`)

Be sure to sync these changes into your SVN trunk folder.

Tag the new Version

The final step to getting your update shipped is to create a new entry in the tags folder. You do this by executing svn copy trunk tags/x.y.z and checking that change in. The version number in the tag must match the version number in your main php file for this to work.
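Since a mismatch between these version numbers is easy to miss, a small check before tagging can help. The following is only a sketch of one way to do it in Python, not part of my actual release process. It assumes the readme contains a "Stable tag:" line and the main php file contains a "Version:" header line; the sample texts and function names are made up for illustration.

```python
import re

# Hypothetical file contents; in practice you would read readme.txt
# and the plugin's main php file from the trunk directory.
README = "=== My Plugin ===\nStable tag: 1.2.3\n"
MAIN_PHP = "<?php\n/*\n * Plugin Name: My Plugin\n * Version: 1.2.3\n */\n"


def stable_tag(readme_text):
    """Return the value of the 'Stable tag' line, or None if it is missing."""
    match = re.search(r"^Stable tag:\s*(\S+)", readme_text, re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None


def plugin_version(php_text):
    """Return the value of the 'Version' header line, or None if it is missing."""
    match = re.search(r"^[ \t/*#@]*Version:\s*(\S+)", php_text, re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None


def versions_match(readme_text, php_text, tag):
    """True only if the readme, the php header and the tag name all agree."""
    return stable_tag(readme_text) == plugin_version(php_text) == tag


if __name__ == "__main__":
    print(versions_match(README, MAIN_PHP, "1.2.3"))
```

Running a check like this right before svn copy is a cheap way to avoid publishing a tag that the store will not pick up.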

The End

And there you have it! A general outline of how to work in Git while shipping to an SVN repository.

Only Use Mocks When Necessary

A few weeks ago I learned an important lesson that I’d like to share: When building unit tests, never use a mock/stub unless it’s absolutely necessary. Let me explain with a little example.

Let’s say you’re working on a large project and you start developing Module A, which uses a MySQL database. To write a unit test for this module, you’re going to need to mock the MySQL queries. I have used sinon for this purpose and it’s worked great by the way. You typically don’t want your unit tests accessing a real-life database server, so using a mock here makes a lot of sense.

Now let’s say you’re adding a new Module B to your project, which uses Module A. As you’re writing the unit tests for this new module, I highly recommend NOT mocking Module A. You should instead continue to only mock the MySQL module as you did when testing Module A.

The reason I recommend this is that every time you mock an interface, you’re usually replacing the real code with a very simple imitation of it. You might have the mocked Module A always return success or failure, depending on what you’re trying to test in Module B. However, you’re very likely to miss out on a lot of detail. It’s highly unlikely that the mock will validate input formats and data structures for example.

By over-using mocks, you end up accumulating technical debt in the form of bugs and implementation errors that will only show up in E2E/integration testing. Debugging in unit tests is typically faster and easier than in an integration environment, so we definitely want to catch as many bugs in the unit test as possible. Unit tests will never catch all the bugs, but minimizing the use of mocks will help you catch more bugs at that stage.
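Here is a small sketch of the idea. I used sinon in JavaScript, but to keep things short this example uses Python's unittest.mock instead. ModuleA, ModuleB, the query text and the make_module_b helper are all hypothetical stand-ins, not code from a real project.

```python
from unittest.mock import Mock


class ModuleA:
    """Hypothetical data-access module that validates input before querying."""

    def __init__(self, db):
        self.db = db

    def get_user(self, user_id):
        if not isinstance(user_id, int) or user_id <= 0:
            raise ValueError("user_id must be a positive integer")
        rows = self.db.query("SELECT name FROM users WHERE id = %s", (user_id,))
        return rows[0]["name"] if rows else None


class ModuleB:
    """Hypothetical module built on top of ModuleA."""

    def __init__(self, module_a):
        self.module_a = module_a

    def greeting(self, user_id):
        name = self.module_a.get_user(user_id)
        return f"Hello, {name}!" if name else "Hello, stranger!"


def make_module_b(rows):
    """Build ModuleB on the real ModuleA; only the database is mocked."""
    db = Mock()
    db.query.return_value = rows
    return ModuleB(ModuleA(db)), db


if __name__ == "__main__":
    module_b, db = make_module_b([{"name": "Ada"}])
    print(module_b.greeting(1))
    db.query.assert_called_once()
```

Because the real Module A runs in the test, passing a bad argument such as greeting("1") raises the ValueError right away in the unit test. A mocked Module A that always returns success would have let that slip through to integration testing.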

Using ReactJS in WordPress Plugins

When I first started developing the Sift Science for WooCommerce plugin last year, I needed interactive controls on the Orders page that displayed fraud scores and allowed the admin to flag fraudulent users. I didn’t want to reload the page every time the user took an action, so the obvious solution was to implement some client-side scripting and background Ajax calls to the server.

The goal was to add a new column to the Orders page that contained small icons that displayed the score and a couple of other icons for the user to click like so:


My initial implementation was all in jQuery, and it was ugly. In PHP, I rendered all possible icons with the CSS property display: none. I then used jQuery to turn those icons on and off as needed, and to talk to the server via an endpoint in the plugin.

Having UI logic in both PHP and jQuery meant that the two had to stay closely coupled. I suppose I could have just rendered an empty div in PHP and made jQuery add all the HTML, but using jQuery to render all that display was not appealing.

Enter ReactJS

Late last year I joined Automattic and early this year we started working on Connect for WooCommerce. For this project, we are heavily using React (and wp-calypso as a framework). As I got more comfortable with React, I started thinking about the possibility of using it in the Sift Science plugin.

I started out by implementing a new feature in React to see if it was feasible. I implemented some basic admin functionality in the plugin settings page. When that went smoothly, I decided to take the leap and port all the jQuery code to React.

All in all, I’m happy with the switch to React. That said, I’m not convinced that this is the best approach for everyone. Here are some pros and cons:

Pros:

  • Once you get used to the new paradigm, React code is more readable than jQuery
  • All the UI is in React and the PHP endpoint can act as a data-only REST API
  • You can leverage NPM modules to perform complex tasks

Cons:

  • You have to learn React, which can be intimidating
  • You have to “compile” (webpack) your code before releasing it
  • Initial setup overhead (npm, modules, build tools)

Sift Science for WooCommerce is now in Beta!

I am pleased to announce that the Sift Science plugin for WooCommerce is finally ready for beta testing! Lukas and I started this plugin over a year ago and have been working on it off and on as time permitted. But I’m happy to say that it’s now good enough for Beta testing.

Note: I work for Automattic on WooCommerce. This is a personal side project.

What is Sift Science for WooCommerce?

Sift Science for WooCommerce is a plugin that integrates Sift Science fraud detection into your WooCommerce online store. To use this plugin you need to have a WooCommerce online store and an account with Sift Science.

How does the plugin work?

The plugin sends order information and user behavior details to your Sift Science account. Sift Science then uses that data to give each order a score that predicts the likelihood of fraud.

What information do you send to Sift Science?

I will eventually put together more formal information about this, but for now you’ll have to look at the code to know what data gets sent. Sorry.

What does Beta mean?

It means that the software’s basic functionality is working. However, it hasn’t been extensively tested and there are still a few features that aren’t finished.

What does Open Source/GPL2 mean?

It means that you’re free to copy, use and modify this software. Read here to find out more.

Wait a minute. Don’t you work for WooCommerce?

Yes, I work for an amazing company called Automattic and I work on WooCommerce. However, this is a project that began before I joined the company and is not officially linked to WooCommerce. It is a side project.

How can I help?

This is an open source project that I’ve been working on in my spare time. There are two main ways you can contribute:

  • Testers are highly welcome! If you can install this, try it out and provide feedback or bug reports, that would be great!
  • If you’re a developer and would like to contribute, feel free to either contact me to collaborate or submit Pull Requests that fix specific bugs.

There will be a credits/thank you section for contributors 🙂