Instrumenting PHP: A Minimalist Approach

I needed a quick way to measure performance and log errors in my Sift Science for WooCommerce plugin. I didn’t want to go back through all my code and embed logging and timing measurement statements, so I considered a more generic and lazy approach.

I decided to create a class that wraps the class I want to measure/monitor. Its constructor takes a class instance and saves it. Then, for every method call to the wrapper class, the corresponding method in the underlying class is called and information is logged as needed. Here’s the class from the project:

https://github.com/Fermiac/woocommerce-siftscience/blob/master/includes/class-wc-siftscience-instrumentation.php

How it Works

The most interesting piece of code in this class is here:

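public function __call( $name, $args ) {
  // One timer for the success path, one for the failure path.
  $metric = "{$this->prefix}_{$name}";
  $timer = $this->stats->create_timer( $metric );
  $error_timer = $this->stats->create_timer( "error_$metric" );
  try {
    // Forward the call to the wrapped object.
    $result = call_user_func_array( array( $this->subject, $name ), $args );
    $this->stats->save_timer( $timer );
    return $result;
  } catch ( Exception $exception ) {
    // Record the failure, then rethrow so callers still see the exception.
    $this->stats->save_timer( $error_timer );
    $this->logger->log_exception( $exception );
    throw $exception;
  }
}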

It’s pretty straightforward: I use the PHP magic method __call to intercept all method calls to the class. I initialize timers and then call the method of the class I’m wrapping. Whether it succeeds or fails, I log the information I want and then pass on the result or exception.
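For readers more at home in Python, here is a rough analogue of the same wrapper idea. It is only a sketch for comparison, not a port of the plugin's class: `Instrumented`, `Greeter`, and the two plain lists standing in for the stats and logger objects are all made up for illustration.

```python
import time


class Instrumented:
    """Wraps any object and times every method call made through the wrapper."""

    def __init__(self, subject, prefix, timings, errors):
        self._subject = subject
        self._prefix = prefix
        self._timings = timings  # hypothetical stats sink: list of (metric, seconds)
        self._errors = errors    # hypothetical logger: list of caught exceptions

    def __getattr__(self, name):
        # Only invoked for attributes missing on the wrapper itself,
        # which makes it a rough counterpart of PHP's __call.
        target = getattr(self._subject, name)
        if not callable(target):
            return target

        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = target(*args, **kwargs)
            except Exception as exc:
                self._errors.append(exc)  # log, then let the caller see it
                raise
            elapsed = time.perf_counter() - start
            self._timings.append((f"{self._prefix}_{name}", elapsed))
            return result

        return timed


class Greeter:
    """Toy class to wrap."""

    def hello(self, who):
        return f"hello {who}"

    def fail(self):
        raise ValueError("boom")
```

Calling `Instrumented(Greeter(), "greeter", timings, errors).hello("world")` returns the normal result and appends a `greeter_hello` entry to `timings`.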

Drawbacks

The main drawback to this approach is that this wrapper class can’t be passed around the same way as the original class. That is, if you’re using PHP type hints, the wrapper class will not implement the same interface as the class it wraps. This also means that IDEs will have trouble auto-completing your code.

To compensate for this, I generally pass the original class to the constructors that need it. I then use the instrumentation class instance to plug into WordPress hooks. You lose a bit of detail in that you only get measurements on the outer surface of your plugin, but overall I’ve found this approach to be satisfactory. You can still calculate the total run time of your plugin, and it’s nearly impossible for an exception to be thrown without it getting caught by the wrapper.

Conclusion

I found this approach easy and simple for collecting metrics and useful logs throughout my plugin. I may carve it out and share it as reusable code one day, but for now the class has dependencies on other classes specific to my project. However, it wouldn’t be hard to copy the class and modify it to suit your needs.

Host your own podcast with PHPodcast

I’ve been using a little PHP script for the past few months to host my own private podcast. So I decided to clean it up a little and share it on GitHub.

https://github.com/nabsul/phpodcast

A little background: I had some audio files that I wanted to listen through with the ability to increase the speed and pause at any time to continue later. This is everything that most podcast apps do. So I decided to host my own private podcast channel containing the audio files.

PHPodcast is a simple script that creates a podcast RSS feed based on a directory of audio files. You’ll need to modify a config file to your specific setup, but everything you need to edit is clearly marked.

It basically saves you the time of learning how to correctly build and format a podcast RSS feed. Nothing super hard, but doing it from scratch can take a few hours of research, trial and error.
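To give a feel for what such a feed involves, here is a minimal sketch in Python. It is not how PHPodcast itself is written; `build_feed`, its parameters, and the choice to include only `.mp3` files are assumptions made for illustration, and a real feed would usually carry more metadata.

```python
import os
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_feed(audio_dir, base_url, title):
    """Return an RSS 2.0 document (as a string) with one item per .mp3 file."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = title

    for name in sorted(os.listdir(audio_dir)):
        if not name.lower().endswith(".mp3"):
            continue  # skip anything that is not an audio file
        path = os.path.join(audio_dir, name)
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = os.path.splitext(name)[0]
        ET.SubElement(item, "guid").text = base_url + name
        # RSS expects RFC 822 style dates; use the file's modification time.
        ET.SubElement(item, "pubDate").text = formatdate(os.path.getmtime(path))
        # The enclosure element is what podcast clients actually download.
        ET.SubElement(
            item,
            "enclosure",
            url=base_url + name,
            length=str(os.path.getsize(path)),
            type="audio/mpeg",
        )

    return ET.tostring(rss, encoding="unicode")
```

Serving the returned string at a stable URL, next to the audio files, is enough for many podcast apps to subscribe.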

Here’s a live podcast served up using PHPodcast:

https://phpodcast.nabeel.us/

Enjoy!

Analyzing WordPress Hook Usage with Azure Data Lake

WordPress provides a large number of hooks that allow plugins to extend and modify its behavior. A few months ago, I was curious about which of these hooks are popular, and which of them are hardly ever used. I was also looking for an excuse to give Microsoft’s Data Lake Analytics a spin. U-SQL looked especially attractive as it brought back fond memories of petabyte-scale data crunching at Bing.

With that in mind, I set out to build some tools that would calculate the usage of WordPress’s hooks. Breaking that up into smaller steps, I came up with:

  • Crawl all published plugins on WordPress.org
  • Extract which hooks are used by each plugin
  • Extract a list of WordPress hooks
  • For each WordPress hook, calculate its usage

On the technical side, I set the following goals for this project:

  • The code should be developed in C# and U-SQL
  • The project should use .NET Core so that it’s cross-platform (Windows, Linux, Mac)
  • The project should be usable in Visual Studio, VS Code or from the command line

In this article I talk about the approach and algorithms in general. For the nitty-gritty details, you can check out the source code here: https://github.com/nabsul/WordPressPluginAnalytics. See the README.md file for instructions on building and running the code.

Crawling for Plugins

I decided to crawl the WordPress.org plugins directory to extract a list of all the plugins. All of the plugins can also be accessed from a common SVN repository, but with different branches and tag folders, I felt that would be slightly more tedious than crawling the HTML pages to extract the official link to each zip file. The HtmlAgilityPack library makes parsing HTML and extracting information very easy. I use it to parse each page of plugins for the links to each individual plugin page, and then I parse each plugin page for the zip file URL.

Once I have the zip file URL, I upload it to Azure Blob Storage. I considered skipping this and working directly with the data from WordPress.org, but I felt this approach allowed me to have a stable snapshot of the original data to experiment on without repeatedly hitting wordpress.org for the same data.

Running the process sequentially takes nearly 5 hours from a Digital Ocean droplet, but about 90% of that time is just waiting on I/O. Therefore, adding some parallelism to this process made a lot of sense. This was done very simply by fetching all 12 plugins per page in parallel. This brought the run time down to just over an hour.
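The project does this in C#, but the idea is easy to sketch in Python. In the snippet below, `fetch_page_plugins` and the injected `fetch` callable are hypothetical names. The point is only that I/O-bound downloads overlap nicely when each page's plugins are handed to a small thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page_plugins(plugin_urls, fetch, max_workers=12):
    """Fetch every plugin listed on one page concurrently.

    `fetch` is any callable that takes a URL and returns the downloaded data.
    It is injected so the sketch does not depend on a particular HTTP client.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though calls overlap.
        return list(pool.map(fetch, plugin_urls))
```

Because the work is mostly waiting on the network, threads are sufficient here; nothing CPU-heavy runs in parallel.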

Extracting Data

Now that I have my raw data, the next step is to extract useful information out of it. I used System.IO.Compression.ZipArchive to iterate over each PHP file in the zip file. I then considered writing my own code to parse each PHP file, but quickly gave up on the idea when I realized how complicated that would get. So I looked around and found Devsense.Php.Parser. Using this library, I was able to work directly on tokenized data and avoided all the hassle of parsing text myself.

With that library, I extracted each hook usage and creation in the PHP files. I only count instances where the hook name is a constant string, since it would be impossible to predict the hook name for code like add_action( "updated_$myvar", ...).
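The rule about constant hook names can be illustrated with a deliberately simplified Python sketch. It uses a regular expression rather than a real PHP tokenizer like the one the project relies on, so it will miss edge cases (comments, heredocs, unusual formatting). `HOOK_CALL` and `extract_hooks` are names invented for this example.

```python
import re

# Matches add_action/add_filter (hook usage) and do_action/apply_filters
# (hook creation) when the first argument is a single plain quoted string.
HOOK_CALL = re.compile(
    r"""\b(add_action|add_filter|do_action|apply_filters)\s*\(\s*
        (?:'([^']*)'|"([^"$]*)")\s*[,)]""",
    re.VERBOSE,
)


def extract_hooks(php_source):
    """Return (function_name, hook_name) pairs for constant hook names only."""
    hooks = []
    for match in HOOK_CALL.finditer(php_source):
        func = match.group(1)
        # Group 2 is the single-quoted form, group 3 the double-quoted one.
        name = match.group(2) if match.group(2) is not None else match.group(3)
        hooks.append((func, name))
    return hooks
```

Double-quoted names containing `$` and names built by concatenation are skipped, which mirrors the decision to ignore hook names that cannot be known statically.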

The final result needed to be in a format that can be easily analyzed with U-SQL and Azure Data Lake Analytics. U-SQL comes with built-in TSV extractors, so if you upload your raw data in that format, you don’t need custom C# code to process it. Data Lake Analytics can automatically uncompress gzipped files, which is great since my TSV files compress to about 10% of their uncompressed size.
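Producing a gzipped TSV file takes only a few lines in most languages. Here is a small Python sketch; `write_tsv_gz` is a made-up helper, and the column layout is whatever rows you pass in, not necessarily the one used by the project.

```python
import csv
import gzip


def write_tsv_gz(path, rows):
    """Write rows (iterables of values) as a gzip-compressed TSV file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

For example, `write_tsv_gz("plugins.tsv.gz", [("my-plugin", "action", "init")])` writes one tab-separated line into a compressed file.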

Extracting the plugins takes less than 1 hour, so I didn’t bother to run parts of that code in parallel.

Running the Analysis

The final step of the process is running a U-SQL script to analyze the data and generate the final report. You can upload the data manually or using the command line tool included in the project. You should have two extraction files: one for the WordPress source code and one for all the plugins. Once the data is uploaded, run the U-SQL script. Again, you can edit and submit the script manually, or if you followed the naming conventions used in the program you can submit the job using the command line tool.

U-SQL is a SQL-like language. If you’re familiar with SQL, the code in the script should all make sense. The raw data is read from the uploaded files. The WordPress data is filtered by hooks created and the plugins are filtered by hooks used. Hook usage is counted using a GROUP BY statement. The hooks from WordPress and the plugins are then cross-referenced using a JOIN. The graph of the job looks like this:

[Figure: graph of the U-SQL job]
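As an aside, the GROUP BY and JOIN logic described above can be mimicked in a few lines of Python, which is handy for sanity-checking results on a small sample. This is a sketch only. `hook_usage_report` is a hypothetical function, and keeping WordPress hooks with zero usages (a left join) is an assumption, since the real script may join differently.

```python
from collections import defaultdict


def hook_usage_report(core_hooks, plugin_usages):
    """Cross-reference hooks created by WordPress with hooks used by plugins.

    core_hooks: iterable of hook names created by WordPress.
    plugin_usages: iterable of (plugin, hook_name) pairs, one per usage.
    Returns rows of (hook_name, num_plugins, num_usages), most used first.
    """
    plugins = defaultdict(set)
    usages = defaultdict(int)
    for plugin, hook in plugin_usages:  # the GROUP BY step
        plugins[hook].add(plugin)
        usages[hook] += 1

    # The JOIN step: keep only WordPress hooks, including unused ones.
    rows = [
        (hook, len(plugins[hook]), usages[hook])
        for hook in sorted(set(core_hooks))
    ]
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

Hooks that plugins use but WordPress does not create (custom hooks) drop out of the report, just as they would in a join against the core hook list.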

The Cost of Data Lake Analytics

The job should take a couple of minutes to run and costs around $0.03 (US). However, I learned a few important lessons on the pricing of Data Lake jobs. First, when running on a few GB of data make sure you run with a parallelism of 1. Increasing the parallelism on a small data set is just a waste of money. For example, my 3-cent job cost 12 cents when I ran it with a parallelism of 5. I also suspect that compressing my data files helped reduce the cost of jobs. Compressed data should mean less data travelling over the network, which can often result in significantly faster (and cheaper) jobs.

The second and more important point is about using custom code and libraries in your scripts: It is possible to upload and use custom .NET DLLs in your U-SQL scripts, but I highly recommend avoiding that unless it’s absolutely necessary. I experimented with uploading the individual plugin zip files to Data Lake storage and using a custom extractor library that directly processed the zip files and tokenized the PHP. The cost of running such a job was around $5. This is far more than the cost of working on TSV files, but it does make sense: doing the zip extraction and PHP parsing on Microsoft’s Azure infrastructure consumes far more CPU cycles than doing most of the pre-processing separately.

As you can see, unlike simpler services like storage, the cost of using this type of service can vary widely depending on how you design your data pipelines. It is therefore important to spend some time researching and carefully considering these decisions before settling on an approach.

Viewing the Results

The final result of running the script is a small TSV-formatted report with the following pieces of information:

  • Hook Name: The name of the hook (prefixed with action_ and filter_ to differentiate those two types of hooks)
  • Num Plugins: Number of plugins using the hook
  • Num Usages: Number of times the hook is used

The data can be imported into a spreadsheet for further analysis and charting:

https://1drv.ms/x/s!AoNGbuElNYPMjMUVzq5931eX9YzSuA

Conclusions

Overall, I felt like there was definitely a learning curve to Azure Data Lake services, but it wasn’t all too bad. I’m definitely curious how all of this could be done in the Hadoop ecosystem, which I’m much less familiar with. If anyone would like to try replicating these results in Hadoop, I would greatly appreciate a tutorial and/or shared source code.

This code could easily be expanded to perform other types of analysis. For example, it might be interesting to see the usage of various WordPress functions and classes. It also might be interesting to reduce the list of plugins to the most popular ones to get more realistic usage information for the hooks.

Migrating Data Between Azure Subscriptions

It’s been two years since I left Microsoft, and they finally decided to cancel my free employee Azure account. It was fun while it lasted, but now I have to move my data to a regular paid account. I looked through the FAQs and contacted support for the easiest way to do this, and unfortunately there is no officially recommended solution for moving storage accounts between different subscriptions.

I found some Windows-only tools that may have done the job, but I wanted a solution that would run on any platform. So I decided to write and share some quick Node.js scripts: https://github.com/nabsul/migrate-azure-storage

To use:

  • Get the code and install the dependencies
  • Edit the configuration file with the source and destination storage accounts
  • Run “npm run copy” to copy all blob and table data
  • Run “npm run compare” to compare the data between the accounts

Notes:

  • For better performance, run this on a Digital Ocean droplet instead of your home machine
  • Preexisting blobs and tables are silently overwritten, so be careful!

Playing with Hadoop (Part 1)

The Hadoop Distributed File System (HDFS) is a distributed and redundant file system that stores your data in chunks across multiple nodes. This allows for fault-tolerant data storage (a node can die without the loss of data) as well as parallel data processing. If you want to store and analyze large amounts of data, Hadoop is a great option.

I recently read a great book called Data Analytics with Hadoop, and this post is based on what I learned there. In this tutorial, I walk you through setting up Hadoop locally for experimentation. I also show you how to create a simple job that processes data in Hadoop.

Create the Virtual Machine

We’re going to start by creating a virtual machine. First, download and install VirtualBox. You’ll also want to download the latest LTS Ubuntu Server ISO. Once VirtualBox is installed and your ISO is downloaded, go to VirtualBox and create a new virtual machine with the following parameters:

  • Type: Linux (Ubuntu 64 bit)
  • Memory: I recommend 2 GB
  • Disk: 10GB should be enough, but I’d recommend 50GB
  • Networking: Make sure your networking is set to Bridged if you want to SSH into the machine

When you start up your VM for the first time, VirtualBox will ask you to select installation media to install an OS. Use the Ubuntu server ISO you downloaded and install Ubuntu with all the default settings.

I won’t cover how to do this in detail, but I recommend setting up SSH (sudo apt-get install ssh) so you can remotely log into the virtual machine. This will allow you to work from your computer’s shell, copy-paste from your browser and switch between windows easily. You can add your machine’s public key to the virtual machine’s authorized_keys file so that you don’t have to type a password every time you log in.

Disable IPv6

I’m not sure if this is still true, but the book states that Hadoop doesn’t play well with IPv6. To disable it, edit the config by typing (sudo nano /etc/sysctl.conf) and at the end of the file add the commands listed here:

The settings take effect after you reboot (sudo shutdown -r now) or reload the file with (sudo sysctl -p). If you did this correctly, typing (cat /proc/sys/net/ipv6/conf/all/disable_ipv6) should print out the number 1 on your screen.

Installing Hadoop

Now comes the fun part: Getting Hadoop all set up! Start by logging in with your username, then logging in as root (sudo su) and following the commands and instructions here:

Setting Up Hadoop

For this section, you’re going to want to log in as the hadoop user with (sudo su hadoop) and add the lines listed in this gist to both of these files:

  • /home/hadoop/.bashrc
  • /srv/hadoop/etc/hadoop/hadoop-env.sh

You’ll then want to create a script to start up Hadoop by typing (nano ~/hadoop_start.sh) and adding the content from this gist to it. In the directory /srv/hadoop/etc/hadoop, create or update the following files with the corresponding contents:

Finally, we set up an authorized key and format the name node by executing the following code:

Now let’s start up Hadoop! If you type (jps) now, it should only list Jps as a running process. To start up the Hadoop processes, just type (~/hadoop_start.sh). The first time you run this command it’ll ask you if you trust these servers, to which you should answer “yes”. Now if you type (jps) you should see several processes running, such as SecondaryNameNode, NameNode, NodeManager, DataNode, and ResourceManager. From now on, you’ll only need to type (~/hadoop_start.sh) to start up Hadoop on your virtual machine, and you’ll only need to do this if you restart your machine.

Create and Run Map-Reduce Scripts

A Map-Reduce job consists of two stages: mapping and reducing. In the mapping stage, you go over the raw input data and extract the information you’re looking for. The reduce stage is where the results of mapping are brought together for aggregation. It’s important to remember that these processes are distributed. In a real Hadoop cluster, mapping happens on different machines in parallel and you need to keep this in mind when writing your code. For the purpose of this tutorial we can visualize the process as follows:

[Figure: map-reduce diagram, https://i1.wp.com/nabeel.us/wp-content/uploads/2016/10/map-reduce.png?ssl=1]

Different nodes in the cluster process different chunks of data locally by running them through the mapper. Then, all the outputs from the different mappers are combined and sorted to be processed by the reducer. More complex arrangements exist, with multiple intermediate reducers for example, but that is beyond the scope of this tutorial.
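To make the flow concrete, here is a tiny in-memory simulation in Python. It only imitates the shape of a Map-Reduce job (one mapper call per chunk, a combined sort, then a reducer). It does not use Hadoop at all, and the word-count task and function names are chosen purely for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(chunk):
    """Emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word, 1) for line in chunk for word in line.split()]


def reducer(sorted_pairs):
    """Sum the counts for each key; the input must already be sorted by key."""
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


def run_job(chunks):
    """Run the mapper on each chunk, then combine, sort, and reduce."""
    mapped = [pair for chunk in chunks for pair in mapper(chunk)]
    return reducer(sorted(mapped))
```

Each element of `chunks` stands in for the data held by one node; in a real cluster the mapper calls would run on different machines.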

Getting the Scripts

Now that we have Hadoop up and running on our sandbox, let’s analyze some logs! You’ll want to be logged in as the hadoop user (sudo su hadoop). Go to the home directory (cd ~) and check out the sample code by running (git clone https://github.com/nabsul/hadoop-tutorial.git). Then change to the directory of this tutorial by typing (cd hadoop-tutorial/part-1).

In this folder you’ll find sample logs to work with and four pairs of _mapper.py and _reducer.py scripts, which do the following:

  • count_status: Count occurrences of the status field in all the logs
  • status per day: Same as the above, but provides the stats per day
  • logs_1day: Fetches all the logs of a specific day
  • sample: Extract a 1% random sample of the logs
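To give a flavor of how small these scripts can be, here is a sketch of the idea behind the sampling job. It is not the code from the repository; `sample_lines` and its parameters are hypothetical, and the real mapper may work differently.

```python
import random


def sample_lines(lines, rate=0.01, rng=None):
    """Yield each input line with probability `rate` (1% by default)."""
    rng = rng or random.Random()
    for line in lines:
        if rng.random() < rate:
            yield line
```

In a streaming job the mapper would apply this to standard input and print the surviving lines, and the reducer could simply pass them through unchanged.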

Running the Scripts Locally

The scripts provided can either be run locally or in the Hadoop cluster. To run them locally, execute the following from inside the part-1 folder:

  • cat sample_logs.txt | ./count_status_mapper.py | sort | ./count_status_reducer.py

To run any of the other jobs, just substitute the mapper/reducer scripts as needed.
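The pipeline works because the mapper and reducer communicate through plain lines of text, with `sort` guaranteeing that identical keys arrive at the reducer together. Below is a sketch of that contract written as Python generators. The log layout (`date<TAB>status<TAB>message`) is an assumption made for this example, and the function names are hypothetical, so the scripts in the repository may look different.

```python
def map_lines(lines):
    """Mapper: emit 'status<TAB>1' for every log line.

    ASSUMPTION: each line looks like 'date<TAB>status<TAB>message'.
    The real sample logs may use a different layout.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[1]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key, relying on key-sorted input."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"  # key changed: flush the previous one
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"  # flush the final key
```

Calling `reduce_lines(sorted(map_lines(log_lines)))` imitates the `mapper | sort | reducer` pipe; a real streaming script would read from `sys.stdin` and print each result instead.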

Uploading the Logs to Hadoop

Before running a job in Hadoop, we’ll need some data to work with. Let’s upload our sample logs with the following commands:

  • hadoop fs -mkdir -p /home/hadoop
  • hadoop fs -copyFromLocal sample_logs.txt /home/hadoop/sample_logs.txt

Running a Job in Hadoop

Finally, type the following to run a job in Hadoop:

  • hadoop jar $HADOOP_JAR_LIB/hadoop-streaming* -mapper /home/hadoop/hadoop-tutorial/part-1/sample_mapper.py -reducer /home/hadoop/hadoop-tutorial/part-1/sample_reducer.py -input /home/hadoop/sample_logs.txt -output /home/hadoop/job_1_samples

If that runs successfully, you’ll be able to view the job results by typing (hadoop fs -ls /home/hadoop/job_1_samples) and (hadoop fs -cat /home/hadoop/job_1_samples/part-00000).

Another interesting thing to look at is the Hadoop dashboard, which can be found at http://[VM’s IP Address]:8088. This will provide you with some information on the jobs that have been running on your virtual cluster.

Conclusion

At this point, you might be thinking: “I just ran a couple of Python scripts locally, and then submitted them to Hadoop to get the same answer. What’s the big deal?” I’m glad you asked (and noticed)! It is true that Hadoop gives you nothing interesting when you’re only working on a few megabytes of data. But imagine that you had a few terabytes of data instead. At that scale:

  • It would be very hard to store that information on one machine
  • It would take a very long time to run your Python script on one giant file
  • You might run out of memory before getting through all the data
  • If that one machine crashes, you could lose all or part of your data

That’s where the Hadoop environment is useful. Your terabytes of data are spread across several nodes, and each node works on a chunk of data locally. Then, each node provides its partial data to produce the final result. Moreover, the beauty of using Python the way we just did is that you can first test your script on a small local sample to make sure it works. After you debug it and make sure it works as expected, you can then submit the same code to your Hadoop cluster to work on larger volumes of data.

I hope you enjoyed this tutorial. In part 2 I plan to tackle the topic of: “How do I get my data into Hadoop?”. Specifically, we’ll look into setting up Kafka to receive log messages and store them in HDFS.