Analyzing WordPress Hook Usage with Azure Data Lake

WordPress provides a large number of hooks that allow plugins to extend and modify its behavior. A few months ago, I was curious about which of these hooks are popular, and which of them are hardly ever used. I was also looking for an excuse to give Microsoft’s Data Lake Analytics a spin. U-SQL looked especially attractive as it brought back fond memories of petabyte-scale data crunching at Bing.

With that in mind, I set out to build some tools that would calculate the usage of WordPress’s hooks. Breaking that up into smaller steps, I came up with:

  • Crawl all published plugins on WordPress.org
  • Extract which hooks are used by each plugin
  • Extract a list of WordPress hooks
  • For each WordPress hook, calculate its usage

On the technical side, I set the following goals for this project:

  • The code should be developed in C# and U-SQL
  • The project should use .NET Core so that it’s cross-platform (Windows, Linux, Mac)
  • The project should be usable in Visual Studio, VS Code or from the command line

In this article I talk about the approach and algorithms in general. For the nitty-gritty details, you can check out the source code here: https://github.com/nabsul/WordPressPluginAnalytics. See the README.md file for instructions on building and running the code.

Crawling for Plugins

I decided to crawl the WordPress.org plugin directory to build a list of all the plugins. The plugins can also be accessed from a common SVN repository, but with its mix of branch and tag folders I felt that would be more tedious than crawling the HTML pages to extract the official link to each plugin's zip file. The HtmlAgilityPack library makes parsing HTML and extracting information very easy. I used it to parse each directory page for links to the individual plugin pages, and then parsed each plugin page for its zip file URL.
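As a rough sketch of that crawling step (the directory URL format and XPath selectors here are assumptions, not the exact ones used in the project), the HtmlAgilityPack portion looks something like this:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PluginCrawler
{
    // Load one page of the plugin directory and return links to the individual plugin pages.
    public static IEnumerable<string> GetPluginPageLinks(int pageNumber)
    {
        var web = new HtmlWeb();
        var doc = web.Load($"https://wordpress.org/plugins/browse/new/page/{pageNumber}/");

        var links = doc.DocumentNode.SelectNodes("//h3[contains(@class, 'entry-title')]/a");
        return links == null
            ? Enumerable.Empty<string>()
            : links.Select(a => a.GetAttributeValue("href", "")).Where(h => h.Length > 0);
    }

    // Load an individual plugin page and return the download (zip) URL, if one is found.
    public static string GetZipUrl(string pluginPageUrl)
    {
        var doc = new HtmlWeb().Load(pluginPageUrl);
        var button = doc.DocumentNode.SelectSingleNode("//a[contains(@href, '.zip')]");
        return button == null ? null : button.GetAttributeValue("href", "");
    }
}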

Once I have the zip file URL, I download the archive and upload it to Azure Blob Storage. I considered skipping this step and working directly with the data from WordPress.org, but storing a copy gives me a stable snapshot of the original data to experiment on without repeatedly hitting WordPress.org for the same files.

Running the process sequentially takes nearly 5 hours from a Digital Ocean droplet, but about 90% of that time is spent waiting on I/O. Adding some parallelism therefore made a lot of sense. I did this very simply by fetching all 12 plugins on a given directory page in parallel, which brought the run time down to just over an hour.
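A minimal sketch of that parallel fetch (the upload delegate stands in for the actual Blob Storage call, which isn't shown here):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class ParallelDownloader
{
    private static readonly HttpClient Http = new HttpClient();

    // Download the ~12 zip files linked from one directory page in parallel,
    // then hand each one off to an upload routine.
    public static async Task ProcessPageAsync(IEnumerable<string> zipUrls, Func<string, byte[], Task> uploadAsync)
    {
        var tasks = zipUrls.Select(async url =>
        {
            var bytes = await Http.GetByteArrayAsync(url);
            await uploadAsync(url, bytes);
        });

        await Task.WhenAll(tasks);
    }
}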

Extracting Data

Now that I have my raw data, the next step is to extract useful information out of it. I used System.IO.Compression.ZipArchive to iterate over each PHP file in the zip file. I then considered writing my own code to parse each PHP file, but quickly gave up on the idea when I realized how complicated that would get. So I looked around and found Devsense.Php.Parser. Using this library, I was able to work directly on tokenized data and avoided all the hassle of parsing text myself.
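A sketch of the zip-handling part (not the project's exact code, but the same ZipArchive pattern):

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class PluginReader
{
    // Yield the name and contents of every PHP file inside a plugin's zip archive.
    public static IEnumerable<(string Name, string Content)> ReadPhpFiles(Stream zipStream)
    {
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
        {
            foreach (var entry in archive.Entries)
            {
                if (!entry.FullName.EndsWith(".php", StringComparison.OrdinalIgnoreCase))
                    continue;

                using (var reader = new StreamReader(entry.Open()))
                {
                    yield return (entry.FullName, reader.ReadToEnd());
                }
            }
        }
    }
}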

With that library, I extracted each hook usage and creation in the PHP files. I only count instances where the hook name is a constant string, since it would be impossible to predict the hook name for code like add_action( "updated_$myvar", ...).

The final result needed to be in a format that can be easily analyzed with U-SQL and Azure Data Lake Analytics. U-SQL comes with built-in TSV extractors, so if you upload your raw data in that format you don't need custom C# code to process it. Data Lake Analytics can also automatically decompress gzipped files, which is great since my TSV files compress to about 10% of their uncompressed size.
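Writing the extracted rows out as gzipped TSV only takes a GZipStream wrapped around a StreamWriter; a sketch, with an assumed three-column layout:

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class TsvWriter
{
    // Write one (plugin, hookType, hookName) row per line, tab-separated and gzip-compressed.
    public static void WriteRows(string path, IEnumerable<(string Plugin, string HookType, string HookName)> rows)
    {
        using (var file = File.Create(path))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        using (var writer = new StreamWriter(gzip))
        {
            foreach (var row in rows)
            {
                writer.WriteLine($"{row.Plugin}\t{row.HookType}\t{row.HookName}");
            }
        }
    }
}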

Extracting the data from all the plugins takes less than an hour, so I didn't bother parallelizing that part of the code.

Running the Analysis

The final step of the process is running a U-SQL script to analyze the data and generate the final report. You can upload the data manually or with the command-line tool included in the project. You should end up with two extraction files: one for the WordPress source code and one for all the plugins. Then run the U-SQL script; again, you can edit and submit it manually, or, if you followed the naming conventions used in the program, submit the job using the command-line tool.

U-SQL is a SQL-like language, so if you're familiar with SQL the code in the script should all make sense. The raw data is read from the uploaded files, the WordPress data is filtered down to hooks created, and the plugin data is filtered down to hooks used. Hook usage is counted with a GROUP BY, and the hooks from WordPress and the plugins are then cross-referenced with a JOIN.
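A simplified sketch of such a script, with assumed file paths and column names rather than the exact ones from the repo:

@wordpress =
    EXTRACT SourceFile string, EntryType string, HookType string, HookName string
    FROM "/wordpress-hooks.tsv.gz"
    USING Extractors.Tsv();

@plugins =
    EXTRACT Plugin string, EntryType string, HookType string, HookName string
    FROM "/plugin-hooks.tsv.gz"
    USING Extractors.Tsv();

// Hooks defined by WordPress core
@created =
    SELECT DISTINCT HookType, HookName
    FROM @wordpress
    WHERE EntryType == "create";

// How often each hook is used across all plugins
@usage =
    SELECT HookType, HookName,
           COUNT(DISTINCT Plugin) AS NumPlugins,
           COUNT(*) AS NumUsages
    FROM @plugins
    WHERE EntryType == "use"
    GROUP BY HookType, HookName;

@report =
    SELECT c.HookType + "_" + c.HookName AS HookName,
           u.NumPlugins,
           u.NumUsages
    FROM @created AS c
    INNER JOIN @usage AS u
         ON c.HookType == u.HookType AND c.HookName == u.HookName;

OUTPUT @report TO "/hook-usage-report.tsv" USING Outputters.Tsv();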

The Cost of Data Lake Analytics

The job should take a couple of minutes to run and cost around $0.03 (US). However, I learned a few important lessons about the pricing of Data Lake jobs. First, when running on just a few GB of data, make sure you run with a parallelism of 1; increasing the parallelism on a small data set is just a waste of money. For example, my 3-cent job cost 12 cents when I ran it with a parallelism of 5. I also suspect that compressing my data files helped reduce the cost: compressed data means less data travelling over the network, which often results in significantly faster (and cheaper) jobs.

The second and more important point is about using custom code and libraries in your scripts: it is possible to upload and use custom .NET DLLs in your U-SQL scripts, but I highly recommend avoiding that unless it's absolutely necessary. I experimented with uploading the individual plugin zip files to Data Lake storage and using a custom extractor library that processed the zip files and tokenized the PHP directly. The cost of running such a job was around $5. That is far more than the cost of working on TSV files, but it does make sense: doing the zip extraction and PHP parsing on Azure's Data Lake infrastructure consumes far more compute than doing most of the pre-processing separately.

As you can see, unlike simpler services like storage, the cost of using this type of service can vary widely depending on how you design your data pipelines. It is therefore important to spend some time researching and carefully considering these decisions before settling on an approach.

Viewing the Results

The final result of running the script is a small TSV-formatted report with the following pieces of information:

  • Hook Name: The name of the hook, prefixed with action_ or filter_ to differentiate the two types of hooks
  • Num Plugins: Number of plugins using the hook
  • Num Usages: Number of times the hook is used

The data can be imported into a spreadsheet for further analysis and charting.

Conclusions

Overall, I felt like there was definitely a learning curve to Azure Data Lake services, but it wasn’t all too bad. I’m definitely curious how all of this could be done in the Hadoop ecosystem, which I’m much less familiar with. If anyone would like to try replicating these results in Hadoop, I would greatly appreciate a tutorial and/or shared source code.

This code could easily be expanded to perform other types of analysis. For example, it might be interesting to see the usage of various WordPress functions and classes. It also might be interesting to reduce the list of plugins to the most popular ones to get more realistic usage information for the hooks.

Migrating Data Between Azure Subscriptions

It's been two years since I left Microsoft, and they finally decided to cancel my free employee Azure account. It was fun while it lasted, but now I have to move my data to a regular paid account. I looked through the FAQs and contacted support to find the easiest way to do this, and unfortunately there is no officially recommended solution for moving storage accounts between different subscriptions.

I found some Windows-only tools that may have done the job, but I wanted a solution that would run on any platform. So I decided to write and share some quick NodeJS scripts: https://github.com/nabsul/migrate-azure-storage

To use:

  • Get the code and install the dependencies
  • Edit the configuration file with the source and destination storage accounts
  • Run "npm run copy" to copy all blob and table data
  • Run "npm run compare" to compare the data between the accounts

Notes:

  • For better performance, run this on a Digital Ocean droplet instead of your home machine
  • Preexisting blobs and tables are silently overwritten, so be careful!

Playing with Hadoop (Part 1)

The Hadoop File System (HDFS) is a distributed and redundant file system that stores your data in chunks across multiple nodes. This allows for fault tolerant data storage (a node can die without the loss of data) as well as parallel data processing. If you want to store and analyze large amounts of data, Hadoop is a great option.

I recently read a great book called Data Analytics with Hadoop, and this post is based on what I learned there. In this tutorial, I walk you through setting up Hadoop locally for experimentation. I also show you how to create a simple job that processes data in Hadoop.

Create the Virtual Machine

We're going to start by creating a virtual machine. Download and install VirtualBox, and also download the latest LTS Ubuntu Server ISO. Once VirtualBox is installed and your ISO is downloaded, open VirtualBox and create a new virtual machine with the following parameters:

  • Type: Linux (Ubuntu 64 bit)
  • Memory: I recommend 2 GB
  • Disk: 10GB should be enough, but I’d recommend 50GB
  • Networking: Make sure your networking is set to Bridged if you want to SSH into the machine

When you start up your VM for the first time, VirtualBox will ask you to select installation media to install an OS. Use the Ubuntu server ISO you downloaded and install Ubuntu with all the default settings.

I won't cover how to do this in detail, but I recommend setting up SSH (sudo apt-get install ssh) so you can log into the virtual machine remotely. This lets you work from your computer's shell, copy-paste from your browser and switch between windows easily. You can also add your machine's public key to the VM's authorized_keys file so that you don't have to type a password every time you log in.

Disable IPv6

I'm not sure whether this is still true, but the book states that Hadoop doesn't play well with IPv6. To disable it, open the config file (sudo nano /etc/sysctl.conf) and add the settings from the gist at the end of the file.
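The standard settings for disabling IPv6 on Ubuntu, which is most likely what the gist contains, are:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1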

The settings don’t take effect until you reboot (sudo shutdown -r now). If you did this correctly, typing (cat /proc/sys/net/ipv6/conf/all/disable_ipv6) should print out the number 1 on your screen.

Installing Hadoop

Now comes the fun part: getting Hadoop set up! Start by logging in with your username, switch to root (sudo su), and then follow the commands and instructions in the gist.
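The gist itself isn't reproduced here, but a typical installation along those lines looks roughly like the following. The Hadoop version, download URL and Java package are assumptions; the /srv/hadoop path matches the directories used later in this post:

apt-get update && apt-get install -y openjdk-8-jdk
useradd -m -d /home/hadoop -s /bin/bash hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz -C /srv
ln -s /srv/hadoop-2.7.3 /srv/hadoop
chown -R hadoop:hadoop /srv/hadoop-2.7.3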

Setting Up Hadoop

For this section, you're going to want to log in as the hadoop user (sudo su hadoop) and add the lines listed in this gist to both of these files (a sketch of what they contain follows the list):

  • /home/hadoop/.bashrc
  • /srv/hadoop/etc/hadoop/hadoop-env.sh
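The lines generally amount to pointing the environment at Java and Hadoop; a sketch (the exact paths are assumptions, and JAVA_HOME in particular depends on which Java package you installed):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/srv/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_JAR_LIB=$HADOOP_HOME/share/hadoop/tools/lib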

You'll then want to create a script to start up Hadoop by typing (nano ~/hadoop_start.sh) and adding the content from the corresponding gist to it. In the directory /srv/hadoop/etc/hadoop, create or update the configuration files with the contents from the gists; a sketch of both follows.
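For a typical single-node setup, the start script and configuration files boil down to something like the following. The values follow the standard Apache pseudo-distributed example and are assumptions about what the gists contain:

~/hadoop_start.sh (remember to make it executable with chmod +x ~/hadoop_start.sh):

#!/bin/bash
start-dfs.sh
start-yarn.sh

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>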

Finally, we set up an authorized key and format the name node by executing the commands from the corresponding gist.
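Those commands are typically the following (again, an assumption about the gist's contents; this is the standard single-node recipe):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
hdfs namenode -format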

Now let's start up Hadoop! If you type (jps) now, it should only list Jps as a running process. To start the Hadoop processes, type (~/hadoop_start.sh). The first time you run this command it will ask whether you trust these hosts, to which you should answer "yes". Now if you type (jps) you should see several processes running, such as NameNode, SecondaryNameNode, DataNode, NodeManager and ResourceManager. From now on, you'll only need to type (~/hadoop_start.sh) to start up Hadoop on your virtual machine, and you'll only need to do so after restarting the machine.

Create and Run Map-Reduce Scripts

A Map-Reduce job consists of two stages: mapping and reducing. In the mapping stage, you go over the raw input data and extract the information you're looking for. The reduce stage is where the results of the mapping are brought together and aggregated. It's important to remember that these processes are distributed: in a real Hadoop cluster, mapping happens on different machines in parallel, and you need to keep this in mind when writing your code. For the purposes of this tutorial, the process works as follows.


Different nodes in the cluster process different chunks of data locally by running them through the mapper. Then all the outputs from the different mappers are combined and sorted to be processed by the reducer. More complex arrangements exist, with multiple intermediate reducers for example, but those are beyond the scope of this tutorial.
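To make the two stages concrete, here is a minimal Hadoop-streaming-style mapper and reducer in Python that count occurrences of a status field. This is an illustration of the pattern (and it assumes tab-separated log lines), not the exact scripts from the tutorial repository:

# mapper.py: read log lines from stdin and emit "<status>\t1" for each one
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')  # assumes the status sits in a fixed column
    if len(fields) > 2:
        print('%s\t%d' % (fields[2], 1))

# reducer.py: input arrives sorted by key, so counts can be accumulated per key
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, count))
        current_key, count = key, 0
    count += int(value)

if current_key is not None:
    print('%s\t%d' % (current_key, count))

Hadoop streaming simply pipes data through these scripts over stdin/stdout, which is why the same pair also works locally with cat and sort, as we'll see shortly.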

Getting the Scripts

Now that we have Hadoop up and running on our sandbox, let's analyze some logs! You'll want to be logged in as the hadoop user (sudo su hadoop). Go to the home directory (cd ~) and check out the sample code by running (git clone https://github.com/nabsul/hadoop-tutorial.git). Then change to the directory for this tutorial by typing (cd hadoop-tutorial/part-1).

In this folder you’ll find sample logs to work with and four pairs of _mapper.py and _reducer.py scripts, which do the following:

  • count_status: Count occurrences of the status field in all the logs
  • status_per_day: Same as the above, but provides the stats per day
  • logs_1day: Fetches all the logs of a specific day
  • sample: Extract a 1% random sample of the logs

Running the Scripts Locally

The scripts provided can either be run locally or in the Hadoop cluster. To run them locally, execute the following from inside the part-1 folder:

  • cat sample_logs.txt | ./count_status_mapper.py | sort | ./count_status_reducer.py

To run any of the other jobs, just substitute the mapper/reducer scripts as needed.

Uploading the Logs to Hadoop

Before running a job in Hadoop, we’ll need some data to work with. Let’s upload our sample logs with the following commands:

  • hadoop fs -mkdir -p /home/hadoop
  • hadoop fs -copyFromLocal sample_logs.txt /home/hadoop/sample_logs.txt

Running a Job in Hadoop

Finally, type the following to run a job in Hadoop:

  • hadoop jar $HADOOP_JAR_LIB/hadoop-streaming* -mapper /home/hadoop/hadoop-tutorial/part-1/sample_mapper.py -reducer /home/hadoop/hadoop-tutorial/part-1/sample_reducer.py -input /home/hadoop/sample_logs.txt -output /home/hadoop/job_1_samples

If that runs successfully, you'll be able to view the job results by typing (hadoop fs -ls /home/hadoop/job_1_samples) and (hadoop fs -cat /home/hadoop/job_1_samples/part-00000).

Another interesting thing to look at is the Hadoop dashboard, which can be found at http://[VM's IP Address]:8088. It provides information about the jobs that have run on your virtual cluster.

Conclusion

At this point, you might be thinking: "I just ran a couple of Python scripts locally, and then submitted them to Hadoop to get the same answer. What's the big deal?" I'm glad you asked (and noticed)! It's true that Hadoop gives you nothing interesting when you're only working on a few megabytes of data. But imagine instead that you had a few terabytes of data. At that scale:

  • It would be very hard to store that information on one machine
  • It would take a very long time to run your python script on one giant file
  • You might run out of memory before getting through all the data
  • If that one machine crashes, you could lose all or part of your data

That's where the Hadoop environment is useful. Your terabytes of data are spread across several nodes, and each node works on a chunk of the data locally. Then each node contributes its partial results, which are combined to produce the final output. Moreover, the beauty of using Python the way we just did is that you can first test your scripts on a small local sample to make sure they work. Once you've debugged them and they behave as expected, you can submit the same code to your Hadoop cluster to work on much larger volumes of data.

I hope you enjoyed this tutorial. In part 2 I plan to tackle the topic of: “How do I get my data into Hadoop?”. Specifically, we’ll look into setting up Kafka to receive log messages and store them in HDFS.

WordPress Plugins: How to Develop in Git and Publish to Subversion

About a month ago, I got my Sift Science plugin added to the WordPress.org plugin directory. To publish your plugin there, you're required to use the SVN repository that they provide. Once you get that set up correctly, WordPress users can find and install your plugin through the built-in store, and they will also receive notifications whenever you publish a new version.

In this post, I describe how I manage releases of my Sift Science plugin. The information here should be useful for many different types of plugins, but it specifically addresses two key issues I faced when getting started with this:

  • I develop my plugin primarily in Git
  • My plugin has a React component that must be packaged with the plugin

So let’s get started!

Getting your WordPress.org Plugin Published

You’ll need to request WordPress.org hosting for your plugin. You’ll have to make sure your plugin conforms to the requirements listed there and follow the instructions to provide the required information.

Checkout Git and SVN Trunk Together

I keep my Git repository synced with the trunk directory in SVN. On my computer I have both Git and SVN trunk checked out into the same directory, which makes it easy to keep them in sync. It can be a little tricky to set up initially, but the basic commands to do this are:

git clone [git repo address] my-plugin
cd my-plugin
svn checkout --force [svn repo]/trunk .
svn revert -R .

I do all of my development in Git, with detailed commit messages and so on. Then, when I’m ready to sync SVN with Git, I’ll pull the latest changes from Git (git pull) and check those changes into SVN trunk (svn ci ...).
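In practice, that sync step looks something like this (the flags and commit message are just an illustration; files deleted in Git also need an explicit svn rm):

git pull
svn status                  # review what changed from SVN's point of view
svn add --force .           # pick up files that are new to SVN (ignored paths are skipped)
svn ci -m "Sync latest changes from Git"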

Git vs SVN Ignored Files

For most plugins, SVN and Git will probably have identical ignore lists. However, since I have a React app that needs to be web-packed for deployment, my SVN and Git ignores are slightly different. In my case, I drop the web-packed React app into the /dist folder of my project. Naturally, since this is "compiled" from code, the folder is ignored in Git. However, I don't ignore it in SVN, since I want to ship those files to my users.
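As a sketch of the difference (the node_modules entry is an assumption about a typical React setup):

# .gitignore: the built bundle is generated, so Git ignores it
node_modules/
dist/

# SVN: only node_modules is ignored, so /dist ships with the plugin
svn propset svn:ignore "node_modules" .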

Update The Plugin’s Documentation

When you have a new version that you want to push to your users, you’ll want to update the documentation as follows:

  • Update “stable tag” entry and the change log in readme.txt
  • Update the version number in your plugin's main PHP file (`Version: x.y.z`)

Be sure to sync these changes into your SVN trunk folder.

Tag the new Version

The final step to getting your update shipped is to create a new entry in the tags folder. You do this by executing svn copy trunk tags/x.y.z and checking that change in. The version number in the tag must match the version number in your main php file for this to work.
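For example, shipping a hypothetical version 1.2.3 would look like:

svn copy trunk tags/1.2.3
svn ci -m "Tagging version 1.2.3"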

The End

And there you have it! A general outline of how to work in Git while shipping to a WordPress.org SVN repository.

Only Use Mocks When Necessary

A few weeks ago I learned an important lesson that I’d like to share: When building unit tests, never use a mock/stub unless it’s absolutely necessary. Let me explain with a little example.

Let's say you're working on a large project and you start developing Module A, which uses a MySQL database. To write a unit test for this module, you're going to need to mock the MySQL queries (I've used sinon for this purpose and it has worked great, by the way). You typically don't want your unit tests accessing a real database server, so using a mock here makes a lot of sense.

Now let’s say you’re adding a new Module B to your project, which uses Module A. As you’re writing the unit tests for this new module, I highly recommend NOT mocking Module A. You should instead continue to only mock the MySQL module as you did when testing Module A.

The reason I recommend this is that every time you mock an interface, you're usually replacing the real code with a very simple imitation of it. You might have the mocked Module A always return success or failure, depending on what you're trying to test in Module B. However, you're very likely to miss out on a lot of detail: it's highly unlikely that the mock will validate input formats and data structures, for example.
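As a rough sketch of the idea (the module names, functions and data here are hypothetical, the stubbing uses sinon, and a mocha-style test runner is assumed):

const sinon = require('sinon');
const assert = require('assert');
const db = require('./mysql-client');   // hypothetical thin wrapper around the MySQL driver
const moduleB = require('./module-b');  // calls the real Module A internally

it('creates an order for an existing user', async () => {
  // Stub only the database boundary; Module A's real validation logic still runs.
  const query = sinon.stub(db, 'query');
  query.withArgs(sinon.match(/SELECT/)).resolves([{ id: 42, name: 'test-user' }]);
  query.withArgs(sinon.match(/INSERT/)).resolves({ insertId: 7 });

  const result = await moduleB.createOrder(42, ['sku-1']);
  assert.strictEqual(result.orderId, 7);

  sinon.restore();
});

Because only the MySQL layer is stubbed, Module A still checks its inputs and shapes its outputs exactly as it does in production, so mistakes in how Module B calls it surface in the unit test rather than in integration testing.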

By over-using mocks, you end up accumulating technical debt in the form of bugs and implementation errors that will only show up in E2E/integration testing. Debugging in unit tests is typically faster and easier than in an integration environment, so we definitely want to catch as many bugs in the unit test as possible. Unit tests will never catch all the bugs, but minimizing the use of mocks will help you catch more bugs at that stage.