Instrumenting PHP: A Minimalist Approach

I needed a quick way to measure performance and log errors in my Sift Science for WooCommerce plugin. I didn’t want to go back through all my code embedding logging and timing statements, so I considered a more generic and lazy approach.

I decided to create a class that wraps the class I want to measure and monitor. Its constructor takes an instance of the target class and saves it. Then, every method call on the wrapper is forwarded to the underlying instance, with information logged as needed. Here’s the class from the project:

https://github.com/Fermiac/woocommerce-siftscience/blob/master/includes/class-wc-siftscience-instrumentation.php

How it Works

The most interesting piece of code in this class is here:

public function __call( $name, $args ) {
  $metric = "{$this->prefix}_{$name}";
  $timer = $this->stats->create_timer( $metric );
  $error_timer = $this->stats->create_timer( "error_$metric" );
  try {
    // Forward the call to the wrapped object and time the successful path.
    $result = call_user_func_array( array( $this->subject, $name ), $args );
    $this->stats->save_timer( $timer );
    return $result;
  } catch ( Exception $exception ) {
    // On failure, save the error timer and log, then rethrow so the
    // caller still sees the original exception.
    $this->stats->save_timer( $error_timer );
    $this->logger->log_exception( $exception );
    throw $exception;
  }
}

It’s pretty straightforward: I use PHP’s magic method __call to intercept all method calls on the wrapper. I initialize timers, then call the corresponding method on the class I’m wrapping. Whether it succeeds or fails, I log the information I want and then pass along the result or exception.
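To make the dispatch concrete, here’s a minimal sketch of the wrapper in use. The service class, constructor signature, and prefix below are hypothetical, purely for illustration:

// A hypothetical class we want to instrument.
class Order_Service {
  public function process( $order_id ) {
    return "processed order {$order_id}";
  }
}

// Assuming a constructor of the form ( subject, stats, logger, prefix );
// $stats and $logger stand in for the plugin's stats and logging helpers.
$wrapped = new WC_SiftScience_Instrumentation( new Order_Service(), $stats, $logger, 'orders' );

// The wrapper has no process() method of its own, so PHP routes the call
// through __call, which times it and forwards it to Order_Service::process().
echo $wrapped->process( 42 ); // prints "processed order 42"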

Drawbacks

The main drawback of this approach is that the wrapper class can’t be passed around the same way as the original class. That is, if you’re using PHP type hints, the wrapper class doesn’t implement the same interface as the class it wraps. This also means that IDEs will have trouble auto-completing your code.

To compensate for this, I generally pass the original class instance to the constructors that need it, and use the instrumentation instance only to plug into WordPress hooks. You lose a bit of detail, in that you only get measurements on the outer surface of your plugin, but overall I’ve found this approach satisfactory. You can still calculate the total run time of your plugin, and it’s nearly impossible for an exception to be thrown without getting caught and logged by the wrapper.
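In practice, the wiring looks roughly like this (the names are illustrative, not the plugin’s actual ones):

// Type-hinted collaborators receive the real instance, so hints and
// IDE auto-completion keep working.
function register_admin_page( Order_Service $service ) { /* renders an admin page */ }

$service      = new Order_Service();
$instrumented = new WC_SiftScience_Instrumentation( $service, $stats, $logger, 'orders' );

register_admin_page( $service ); // fine
// register_admin_page( $instrumented ); // TypeError: the wrapper is not an Order_Service

// WordPress hooks call into the wrapper, so every entry point into the
// plugin is timed and any escaping exception gets logged.
add_action( 'woocommerce_new_order', array( $instrumented, 'process' ) );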

Conclusion

I found this approach simple and effective for collecting metrics and useful logs throughout my plugin. I may carve it out and share it as reusable code one day, but for now the class has dependencies on other classes specific to my project. However, it wouldn’t be hard to copy the class and modify it to your specific needs.

Host your own podcast with PHPodcast

I’ve been using a little PHP script for the past few months to host my own private podcast. So I decided to clean it up a little and share it on GitHub.

https://github.com/nabsul/phpodcast

A little background: I had some audio files that I wanted to listen through, with the ability to increase the playback speed and pause at any time to continue later. That’s exactly what most podcast apps do, so I decided to host my own private podcast channel containing the audio files.

PHPodcast is a simple script that creates a podcast RSS feed from a directory of audio files. You’ll need to modify a config file to match your specific setup, but everything you need to edit is clearly marked.
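I won’t reproduce the actual config file here; the keys below are just a hypothetical sketch of the kind of settings such a feed needs:

// Hypothetical settings, for illustration only. The real config file
// in the repo marks exactly which values you need to change.
$config = array(
  'title'       => 'My Private Podcast',
  'base_url'    => 'https://podcast.example.com/',
  'audio_dir'   => __DIR__ . '/audio',
  'description' => 'Personal audio files served as a podcast',
);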

It basically saves you the time of learning how to correctly build and format a podcast RSS feed. Nothing super hard, but doing it from scratch can take a few hours of research and trial and error.
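For the curious, the essential shape of a podcast feed is an RSS 2.0 document where each audio file becomes an item with an enclosure element. This isn’t PHPodcast’s actual code, just a minimal hand-rolled sketch of the format, assuming a hypothetical host and audio directory:

<?php
// Minimal podcast feed sketch: one <item> per mp3, each with an
// <enclosure> pointing at the downloadable audio file.
$base  = 'https://podcast.example.com'; // hypothetical host
$items = '';
foreach ( glob( __DIR__ . '/audio/*.mp3' ) as $file ) {
  $name   = basename( $file );
  $url    = $base . '/audio/' . rawurlencode( $name );
  $size   = filesize( $file );
  $items .= "<item><title>{$name}</title><guid>{$url}</guid>" .
            "<enclosure url=\"{$url}\" length=\"{$size}\" type=\"audio/mpeg\"/></item>";
}

header( 'Content-Type: application/rss+xml' );
echo '<?xml version="1.0" encoding="UTF-8"?>' .
  '<rss version="2.0"><channel>' .
  '<title>My Private Podcast</title>' .
  "<link>{$base}</link>" .
  '<description>Audio files served as a podcast</description>' .
  $items .
  '</channel></rss>';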

Here’s a live podcast served up using PHPodcast:

https://phpodcast.nabeel.us/

Enjoy!

I have a fantastic startup idea!

“I have a fantastic startup idea.”

You’d be surprised how often I hear this. When I do, it’s usually followed by “All I need is a developer. Let’s partner up!”

Now don’t get me wrong, I think entrepreneurship is a great thing, and I often hear ideas that are genuinely interesting. However, a great idea isn’t enough to justify taking on such an endeavor.

Over the years, through trial and many, many errors, I’ve developed a few rules to help filter these pitches. Here are some tips:

Are you the right person for the job?

I’ve had someone pitch me their startup idea and tell me “Most of my friends are married and have kids. You’re still single, so I figure you have plenty of free time and could even quit your job to work on this.” Yeah, thanks for choosing me over all the qualified candidates.

Seriously though, even when it’s not that obvious, you should ask yourself if you’re really the right person for the job. Do you have the experience needed? Some things you can learn on the job, but if you don’t have what it takes to deliver, things will not be fun.

Are they willing to put in the time and effort?

This is my favorite because it weeds out 80% of the pitches. A lot of people think they have great ideas, but are not really willing to put in the time and effort needed to succeed. Before committing to any project, try to determine how serious the potential partner is.

Have they done any serious research about their business idea? Ask them to write up a short summary of the idea or put together an Excel sheet with some concrete numbers. Thinking up a storm of ideas and dreaming of wild success is fun. You’ll be surprised how many people give up when even the smallest amount of real work is required of them.

Will they be honest and fair partners?

This one’s important, because failing to catch it early can result in heartache after lots of wasted time and effort. If everything else looks good and you’re considering working with this person, you’re eventually going to discuss partnership details. Often this boils down to who owns what percentage of a theoretical, non-existent venture.

If they try to low-ball or otherwise offer you much less than you deserve, just walk away and never look back. Here’s why: If they’re already trying to haggle with you over something that doesn’t yet exist, just imagine what will happen if any level of success is achieved.

Seriously, avoid this type of partner at all cost.

Conclusion

I hope these tips were helpful. This wasn’t meant to be a discouraging post. Entrepreneurship (or simply building things outside of your 9-5 job) is really fun and a great experience, even if you make a bunch of mistakes and fail along the way.

Analyzing WordPress Hook Usage with Azure Data Lake

WordPress provides a large number of hooks that allow plugins to extend and modify its behavior. A few months ago, I was curious about which of these hooks are popular, and which of them are hardly ever used. I was also looking for an excuse to give Microsoft’s Data Lake Analytics a spin. U-SQL looked especially attractive as it brought back fond memories of petabyte-scale data crunching at Bing.

With that in mind, I set out to build some tools that would calculate the usage of WordPress’s hooks. Breaking that up into smaller steps, I came up with:

  • Crawl all published plugins on WordPress.org
  • Extract which hooks are used by each plugin
  • Extract a list of WordPress hooks
  • For each WordPress hook, calculate its usage

On the technical side, I set the following goals for this project:

  • The code should be developed in C# and U-SQL
  • The project should use .NET Core so that it’s cross-platform (Windows, Linux, Mac)
  • The project should be usable from Visual Studio, VS Code, or the command line

In this article I talk about the approach and algorithms in general. For the nitty-gritty details, you can check out the source code here: https://github.com/nabsul/WordPressPluginAnalytics. See the README.md file for instructions on building and running the code.

Crawling for Plugins

I decided to crawl the WordPress.org plugin directory to extract a list of all the plugins. All of the plugins can also be accessed from a common SVN repository, but with its different branches and tag folders, I felt that would be more tedious than crawling the HTML pages to extract the official link to each zip file. The HtmlAgilityPack library makes parsing HTML and extracting information very easy: I used it to parse each page of the plugin directory for links to the individual plugin pages, and then parsed each plugin page for the zip file URL.

Once I have a plugin’s zip file URL, I download the archive and upload it to Azure Blob Storage. I considered skipping this step and working directly with the data from WordPress.org, but keeping a copy gives me a stable snapshot of the original data to experiment on without repeatedly hitting WordPress.org for the same files.

Running the process sequentially takes nearly 5 hours from a Digital Ocean droplet, but about 90% of that time is spent waiting on I/O, so adding some parallelism made a lot of sense. I did this very simply, by fetching the 12 plugins listed on each page in parallel, which brought the run time down to just over an hour.

Extracting Data

Now that I had my raw data, the next step was to extract useful information from it. I used System.IO.Compression.ZipArchive to iterate over each PHP file in a plugin’s zip archive. I considered writing my own code to parse the PHP files, but quickly gave up on that idea when I realized how complicated it would get. Instead, I found Devsense.Php.Parser, which let me work directly on tokenized data and avoid all the hassle of parsing text myself.

With that library, I extracted every hook usage and creation from the PHP files. I only count instances where the hook name is a constant string, since it’s impossible to predict the hook name in code like add_action( "updated_$myvar", ... ).
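In other words, of the following two calls, only the first gets counted, because only there can the hook name be read straight out of the token stream:

// Counted: the hook name is the constant string 'save_post'.
add_action( 'save_post', 'my_save_handler' );

// Skipped: the actual hook name depends on $myvar at runtime.
add_action( "updated_$myvar", 'my_update_handler' );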

The final result needed to be in a format that could be easily analyzed with U-SQL and Azure Data Lake Analytics. U-SQL comes with built-in TSV extractors, so if you upload your raw data in that format, you don’t need custom C# code to process it. Data Lake Analytics can also automatically decompress gzipped files, which is great since my TSV files compress to about 10% of their uncompressed size.

Extracting data from the plugins takes less than an hour, so I didn’t bother running any of that code in parallel.

Running the Analysis

The final step of the process is running a U-SQL script to analyze the data and generate the final report. You can upload the data manually or with the command-line tool included in the project. You should end up with two extraction files: one for the WordPress source code and one for all the plugins. Then run the U-SQL script; again, you can edit and submit it manually, or, if you followed the naming conventions used in the program, submit the job with the command-line tool.

U-SQL is a SQL-like language, so if you’re familiar with SQL, the code in the script should all make sense. The raw data is read from the uploaded files; the WordPress data is filtered down to hooks created, and the plugin data is filtered down to hooks used. Hook usage is counted using a GROUP BY statement, and the hooks from WordPress and the plugins are then cross-referenced using a JOIN.

The Cost of Data Lake Analytics

The job should take a couple of minutes to run and costs around $0.03 (US). However, I learned a few important lessons about the pricing of Data Lake jobs. First, when running on a few GB of data, make sure you run with a parallelism of 1; increasing the parallelism on a small data set is just a waste of money. For example, my 3-cent job cost 12 cents when I ran it with a parallelism of 5. I also suspect that compressing my data files helped reduce the cost of jobs: compressed data means less data travelling over the network, which often results in significantly faster (and cheaper) jobs.

The second and more important point is about using custom code and libraries in your scripts. It is possible to upload and use custom .NET DLLs in your U-SQL scripts, but I highly recommend avoiding that unless it’s absolutely necessary. I experimented with uploading the individual plugin zip files to Data Lake storage and using a custom extractor library that processed the zip files and tokenized the PHP directly. The cost of running such a job was around $5. That’s far more than the cost of working on TSV files, but it does make sense: doing the zip extraction and PHP parsing on Microsoft’s Azure infrastructure consumes far more CPU cycles than doing most of the pre-processing separately.

As you can see, unlike simpler services like storage, the cost of using this type of service can vary widely depending on how you design your data pipelines. It is therefore important to spend some time researching and carefully considering these decisions before settling on an approach.

Viewing the Results

The final result of running the script is a small TSV-formatted report with the following pieces of information:

  • Hook Name: The name of the hook, prefixed with action_ or filter_ to differentiate the two types of hooks
  • Num Plugins: Number of plugins using the hook
  • Num Usages: Number of times the hook is used

The data can be imported into a spreadsheet for further analysis and charting:

https://1drv.ms/x/s!AoNGbuElNYPMjMUVzq5931eX9YzSuA

Conclusions

Overall, there was definitely a learning curve to the Azure Data Lake services, but it wasn’t too bad. I’m curious how all of this could be done in the Hadoop ecosystem, which I’m much less familiar with. If anyone would like to try replicating these results in Hadoop, I would greatly appreciate a tutorial and/or shared source code.

This code could easily be expanded to perform other types of analysis. For example, it might be interesting to see the usage of various WordPress functions and classes. It also might be interesting to reduce the list of plugins to the most popular ones to get more realistic usage information for the hooks.

Migrating Data Between Azure Subscriptions

It’s been two years since I left Microsoft, and they finally decided to cancel my free employee Azure account. It was fun while it lasted, but now I have to move my data to a regular paid account. I looked through the FAQs and contacted support for the easiest way to do this, and unfortunately there is no officially recommended solution for moving storage accounts between subscriptions.

I found some Windows-only tools that may have done the job, but I wanted a solution that would run on any platform. So I decided to write and share some quick NodeJS scripts: https://github.com/nabsul/migrate-azure-storage

To use:

  • Get the code and install the dependencies
  • Edit the configuration file with the source and destination storage accounts
  • Run “npm run copy” to copy all blob and table data
  • Run “npm run compare” to compare the data between the accounts

Notes:

  • For better performance, run this on a Digital Ocean droplet instead of your home machine
  • Preexisting blobs and tables are silently overwritten, so be careful!