Vivek Narayanan's blog

Quick and Dirty Parallelism in Python With tasks.py

| Comments

Python doesn’t have a great reputation for executing concurrent or parallel code because of the global interpreter lock (GIL): only one thread in a process can execute Python code at a time. But thanks to some excellent libraries like multiprocessing and eventlet, you can parallelize your sequential functions by adding just a couple of lines of code.

tasks.py is a simple and fast task queue for executing multiple tasks in parallel. All you need to do is specify the task as a simple function that takes an argument and you get instant parallelism.

It is ideal for executing multiple network-bound tasks in parallel from a single node, like fetching a list of URLs, crawling a site or making a lot of third-party API calls, without going through the pain of setting up a MapReduce cluster.

Installation

Since it uses Redis as a backend, install Redis and start the server. If you are already using Redis, you can pass a custom connection object via the tasks.set_redis call.

Install tasks_py

$ sudo pip install tasks_py

Usage

A task is a function that takes a single string argument. Define such a function and register it using tasks.set_func. If the function raises an exception, the task is considered to have failed.

Call tasks.main() to get the interactive command line options.

    import eventlet
    eventlet.monkey_patch()
    import tasks

    from urllib2 import urlopen

    def fetch(url):
        # Give each download its own file so parallel tasks
        # don't clobber one another.
        path = '/tmp/download-%s' % url.replace('/', '_')
        body = urlopen(url).read()
        with open(path, 'w') as f:
            f.write(body)

    tasks.set_func(fetch)
    tasks.main()

Note that it is important to import eventlet and call monkey_patch() to replace the blocking network calls with asynchronous I/O from eventlet.

Now to add jobs, create a file with one argument per line and use this command.

$ python yourfile.py add <list_of_jobs.txt>

To start (or restart) the job processing (do this in a screen session or close the input stream):

$ python yourfile.py run

tasks has resume support, so it will start where you left off the last time.

To view the current status while it is running:

$ python yourfile.py status

Once you are done, you can clear the logs and the completed tasks by calling reset.

$ python yourfile.py reset

How it works

It uses a multiprocessing pool to distribute tasks across multiple processes, which bypasses the GIL. But even with multiple processes there is a problem: most of the time is spent blocked on network requests. This is where eventlet comes into the picture. Each process runs an I/O loop and a number of coroutines, or “green threads”. A green thread is similar to an OS thread but much more lightweight, with very little overhead. Only one green thread runs at a time, but whenever a green thread is waiting on I/O, the loop switches it out and a different green thread is executed. This non-blocking approach scales very well.
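To make the idea concrete, here is a minimal sketch of the process pool half of the design using only the standard library. The green-thread layer is omitted, and `work` is a stand-in for your task function:

```python
from multiprocessing import Pool

def work(arg):
    # Stand-in for a real task; in tasks.py a task fails by raising.
    return arg.upper()

jobs = ['one', 'two', 'three']
# On platforms that spawn rather than fork (e.g. Windows), pool
# creation needs to live under an `if __name__ == '__main__':` guard.
pool = Pool(2)
results = pool.map(work, jobs)   # blocks until every job is done
pool.close()
pool.join()
print(results)  # ['ONE', 'TWO', 'THREE']
```

Each worker process runs jobs from the shared queue independently, so CPU-bound and blocking work proceed in parallel despite the GIL.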

The code is available on Github. Feel free to fork and modify this.

Yosemite Battery Issues


After upgrading OS X to Yosemite, I noticed a sharp rise in battery usage during sleep. I usually close the lid instead of shutting down, and used to get really good battery life on my Air. But after the upgrade, I would lose about 30-40% of the charge overnight.

When you close the lid and your Mac enters sleep mode, it is still running; only the display has been turned off. After some time, memory is flushed to disk and the machine actually enters standby. It turns out Apple has increased the delay before this happens in the default configuration. To view your power management settings, use the command line tool pmset.

$ pmset -g
Active Profiles:
Battery Power     -1*
AC Power      -1
Currently in use:
 standbydelay         10800
 standby              1
 halfdim              1
 hibernatefile        /var/vm/sleepimage
 darkwakes            0
 disksleep            10
 sleep                1
 autopoweroffdelay    14400
 hibernatemode        3
 autopoweroff         1
 ttyskeepawake        1
 displaysleep         2
 acwake               0
 lidwake              1

The problem here is that standbydelay is 10800 seconds, or 3 hours, so your laptop is essentially wasting power for three hours every time it sleeps. Skipping standby means it can wake instantly by just turning on the display, but that convenience isn’t worth the drain. To reduce standbydelay to 20 minutes, use this command.

$ sudo pmset -a standbydelay 1200

An Algorithm for Trending Topics


In this post I will describe a really simple algorithm for identifying trending items or posts in your application. TF-IDF (term frequency - inverse document frequency) is a technique that was used to rank search results in early search engines.

Assume that you have a large corpus of text documents and want to find a document containing a certain phrase or set of keywords. With TF-IDF, you calculate two quantities for each keyword:

  1. term frequency - the number of times the keyword appears in a particular document.
  2. inverse document frequency - the inverse of the number of documents containing the keyword.

For each document, sum the TF-IDF values of the keywords in the query and rank the documents by that score. This works because the IDF part filters out common words that are present in tons of documents, so a document containing a rare term is given more weight in the search results. Why term frequency is needed is more obvious: a document containing more occurrences of a keyword is more likely to be the one you are looking for.

Now, we can extend this algorithm to identify trending items in a dataset. First, partition the data into two sets: the target set, containing posts/items in the timeframe or geography for which you want to find trending items, and the rest of the data. Some preprocessing, like removing stop words and stripping punctuation, is useful. Then, for each term in the target set, find its TF-IDF score and rank the terms. In this case the term frequency is the number of occurrences in the target set, and the IDF is the inverse of the number of documents in the other set in which the term appears. Instead of doing this exercise for each and every term, you can also do it on tags/hashtags to save memory.

Here is an example script on a dataset containing tweets:
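As an illustrative sketch of the approach (the tweets are made up for the example, and the scoring uses a simplified IDF; a real implementation would use log(N / df)):

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def trending(target_docs, background_docs):
    """Score terms in the target set by TF-IDF against the background."""
    tf = Counter(t for doc in target_docs for t in tokenize(doc))
    # Document frequency over the background set.
    df = Counter(t for doc in background_docs for t in set(tokenize(doc)))
    # Simplified IDF: 1 / (1 + df); common background terms score low.
    scores = {t: count / float(1 + df[t]) for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical tweets: "eclipse" is bursty in the target window,
# "coffee" is common everywhere.
target = ["watching the eclipse", "eclipse pics", "coffee then eclipse"]
background = ["morning coffee", "coffee break", "need more coffee"]
print(trending(target, background)[0])  # eclipse
```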

Writing a Web Server From Scratch - 2


Over the past few days I spent some time writing a web server; it has been very gratifying and I learnt quite a few things. Let me start with the design. The thing about web servers is that the core functionality is quite simple to implement; the bulk of the work is the plumbing around parsing the different rules of the HTTP protocol, which I’ve kept to a bare minimum as modern browsers have sane defaults.

Design

So the basic outline is like this:

  1. Open a socket, bind it to a port and start accepting connections.
  2. Once you receive a request from a client, pass the socket information to a handler function and continue servicing other requests.
  3. In the handler function, parse the request headers and generate your response headers and body accordingly. For serving static files, simply do a buffered write on the socket after writing the headers.
  4. Close the socket and any other file descriptors, free resources allocated during the request.
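Step 3 can be sketched as a pure function from request text to response text (in Python for brevity; this is an illustration of the logic, not the actual fsrv code):

```python
def handle_request(raw_request, body):
    """Parse the request line and build a minimal HTTP response."""
    request_line = raw_request.split('\r\n')[0]
    method, path, version = request_line.split(' ')
    if method != 'GET':
        return 'HTTP/1.0 501 Not Implemented\r\n\r\n'
    headers = [
        'HTTP/1.0 200 OK',
        'Content-Type: application/octet-stream',
        'Content-Length: %d' % len(body),
        '', '',                      # blank line terminates the headers
    ]
    return '\r\n'.join(headers) + body

resp = handle_request('GET /fsrv HTTP/1.1\r\nHost: localhost\r\n\r\n', 'data')
print(resp.split('\r\n')[0])  # HTTP/1.0 200 OK
```

Keeping the handler a pure function of the request makes it easy to test without opening any sockets.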

There are a number of ways to implement the listen loop with concurrency (steps 1 and 2). A rather naive approach is to fork a new process on every request and let the child handle the request after closing its copy of the server socket. This might appear to work, but there is a problem: once you hit around 700 requests (or whatever your system’s per-user process limit is), your server stops working and a lot of other processes running in the background begin to act strangely. The reason is that even if you exit the child process after handling the request, it lingers as a zombie. Zombies store some exit-state information and take up only a very small chunk of memory, but they cause the kernel to freak out when the user limit is reached, and they are reaped only when the parent collects them or exits. You can check the limit with the shell command ulimit -a. One solution is to service the request in the parent process itself and let the child accept the next request. That way the parent process is usually gone before the child and no zombies are created; even if they are, they are cleared as soon as the parent completes handling its request. But this approach still has performance problems, as we will see later.

A second, more common approach is to create a process pool and is used by servers like Apache. After creating the listening socket, you fork the parent process a number of times to create a process pool. Each of the child processes have a separate file descriptor for the listening socket, but the kernel manages things in a way that they all point to the same socket. So depending on which process is active at that point of time, it gets the client request. Other approaches which I have not explored yet, but will do in the coming few days are multithreading and asynchronous I/O.

Parsing requests and generating responses is fairly straightforward, which is why I only implemented a small subset of the HTTP protocol features.

Benchmarks

It’s time for some benchmarks. I’ll be comparing this file server with Apache httpd and Python’s SimpleHTTPServer module, which I often use for transferring files over wifi. I’m not hoping to compete with Apache on speed here, but it should definitely beat Python. I was thinking that I might have to write another program for stress testing the file servers, but fortunately ab, or ApacheBench, comes to the rescue: you can specify the number of requests, the number of concurrent connections and so on. For the tests, each of these servers will be serving the same file (the compiled binary of my server), and both client and server will be on the same node.

This is the command I used for testing.

ab -n 8000 -c 10 -r localhost/fsrv

SimpleHTTPServer failed with 10 concurrent connections, so I reduced it to 4. But it was still failing, so I reduced the number of requests to 300 and this was the result:

Python SimpleHTTPServer

Requests per second:    148.06 [#/sec] (mean)
Time per request:       27.015 [ms] (mean)
Time per request:       6.754 [ms] (mean, across all concurrent requests)

Then I tested Apache with the original command, and Apache turned out to be much more robust with over 4100 requests per second.

Apache httpd

Requests per second:    4111.07 [#/sec] (mean)
Time per request:       2.432 [ms] (mean)
Time per request:       0.243 [ms] (mean, across all concurrent requests)

So, time to test my server. First, the fork-on-every-request model:

fsrv fork every request

Requests per second:    355.25 [#/sec] (mean)
Time per request:       28.149 [ms] (mean)
Time per request:       2.815 [ms] (mean, across all concurrent requests)

Well, it’s faster than Python but doesn’t seem all that encouraging. Let me try the process pool model with 8 processes.

fsrv process pool (8 procs)

Requests per second:    759.35 [#/sec] (mean)
Time per request:       13.169 [ms] (mean)
Time per request:       1.317 [ms] (mean, across all concurrent requests)

Better, but it pales in comparison to Apache. Time to run a profiler. On a side note, Apple’s Instruments is a much better profiler than gprof for C/C++ code.

After running the profiler, I found that the majority of the time was spent determining the mime type of the file. I had used the unix file command, which does some magic behind the scenes to determine a file’s mime type, by opening a pipe with popen and feeding a filename to it on each request.

#include <stdio.h>
#include <string.h>

void get_mime_type(const char *filename, char *mime_type) {
    static FILE* pipe = NULL;
    if (pipe == NULL) {
        // Run file(1) through xargs in interactive mode.
        // Note: the bidirectional "r+" popen mode is a BSD/macOS
        // extension; POSIX popen only supports "r" or "w".
        pipe = popen("xargs -n 1 file --mime-type -b", "r+");
    }
    fprintf(pipe, "%s\n", filename);
    fflush(pipe); // make sure xargs actually sees the filename
    int read = fscanf(pipe, "%s", mime_type);
    if (!read)
        strcpy(mime_type, "application/octet-stream"); // Default mime type
}

One way to fix this would be to cache the value for each file, or to write a basic pure-C mime type generator based on a few rules, without any external calls. For the time being I decided to send a single default mime type header. Time for the benchmarks again.
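Such a rule-based generator could be little more than a lookup from file extension to mime type. Sketched here in Python for brevity (the table is an illustrative subset, not a complete mapping):

```python
import os

# A few illustrative extension rules; a real table would be longer.
MIME_TYPES = {
    '.html': 'text/html',
    '.css':  'text/css',
    '.js':   'application/javascript',
    '.png':  'image/png',
    '.txt':  'text/plain',
}

def get_mime_type(filename):
    _, ext = os.path.splitext(filename)
    # Fall back to the same generic default used above.
    return MIME_TYPES.get(ext.lower(), 'application/octet-stream')

print(get_mime_type('index.html'))  # text/html
```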

fsrv fork every request

Requests per second:    3119.81 [#/sec] (mean)
Time per request:       3.205 [ms] (mean)
Time per request:       0.321 [ms] (mean, across all concurrent requests)

I was quite surprised by the results, as I had thought forks would be quite expensive. After a little bit of searching, I learned that you can fork around 3-4k times a second.

fsrv process pool

Requests per second:    7256.96 [#/sec] (mean)
Time per request:       1.378 [ms] (mean)
Time per request:       0.138 [ms] (mean, across all concurrent requests)

Incredible! This is about 1.76x faster than Apache! So there you have it: you can write a simple file server from scratch using only the Unix APIs that is quite a lot faster than Apache. Though I must admit that Apache has tons of features that would slow it down. Feel free to check out the code at http://github.com/vivekn/fsrv

Writing a Web Server From Scratch - 1


I have been reading more about system calls in the Unix kernel recently and am thinking of writing a small application to apply what I have learnt and gain some experience with systems programming. So I’ll be building a minimalist, high performance web server in C over the next few days.

“Why reinvent the wheel?”, you may ask, but this is just to expand my personal knowledge and may never be used by anyone else. The first step will be to get a simple static server running that serves all the files present in a directory provided as an option. Architecturally, this isn’t radically different from a web server with dynamic content, and it is the part I want to focus on for now. To handle multiple requests concurrently there are a number of possible designs, like a pre-forking model, multithreading, or non-blocking I/O; I will be exploring each of these. There will also be some necessary plumbing tasks like parsing and creating HTTP headers. Creating custom handlers for routes would be an interesting problem; perhaps I could extend the project into a micro web framework in C.

I will be blogging about my experiences building this in subsequent posts.

Efficient Spelling Correction Using BK Trees


I came across an interesting data structure called the BK Tree today. It is useful in coming up with suggestions for spelling correction from a dictionary.

A simple solution for coming up with spelling corrections would be to iterate through all the words in the dictionary, compute the Levenshtein distance to the search term, and keep the words within some bound. But this has a complexity of O(n * m²), where n is the number of words and m the average word length. Words are not randomly distributed across the space of strings, so a large number of these computations are wasted. The idea behind the BK tree is to construct a k-ary tree such that only nodes within a particular edit distance are searched. This exploits the fact that Levenshtein distance forms a metric space. A really nice explanation of constructing and searching through a BK tree is given here. The metric used does not have to be Levenshtein distance; you can use any other metric as you see fit.

It has been observed that for edit distance 2, using a BK tree you need to search only 5-10% of the nodes of the tree. I hacked up a BK Tree implementation in Scala, and was able to load the /usr/share/dict/words file which contains about 250k words in 5 seconds on my MacBook Air. A search query with edit distance 2 takes about 0.06 seconds. Still, I feel there is room for improvement on the performance front. Suggestions welcome.
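The Scala implementation isn’t shown here; as an illustration, the insert and query logic looks like this in Python:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree(object):
    def __init__(self, word):
        self.word = word
        self.children = {}   # edit distance -> subtree

    def insert(self, word):
        d = edit_distance(word, self.word)
        if d in self.children:
            self.children[d].insert(word)
        else:
            self.children[d] = BKTree(word)

    def query(self, word, radius):
        d = edit_distance(word, self.word)
        results = [self.word] if d <= radius else []
        # Triangle inequality: only subtrees keyed in
        # [d - radius, d + radius] can contain matches.
        for k in range(d - radius, d + radius + 1):
            if k in self.children:
                results += self.children[k].query(word, radius)
        return results

tree = BKTree('book')
for w in ['books', 'cake', 'boo', 'cape', 'cart']:
    tree.insert(w)
print(sorted(tree.query('bool', 1)))  # ['boo', 'book']
```

Note how the query only descends into a narrow band of children at each node; that band is why only a small fraction of the tree is ever visited.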

The Trouble With SVMs


People tend to use complicated machine learning models like support vector machines just because they are believed to be the most accurate. The fact that SVMs are available as black box implementations exacerbates this problem. I would suggest focusing on your data instead of using slow techniques that you only partially understand. I achieved better results with a simple Naive Bayes classifier by choosing the right features. The best part is that training only takes about as much time as it takes to read the documents, unlike an SVM.

I had a task of classifying text documents or reviews by sentiment polarity for a college project, with a dataset of over 50,000 movie reviews: 25,000 positive and 25,000 negative. After dividing my data into training and test sets, the first thing I came up with was a Naive Bayes classifier based on the word frequencies of each term in the text, using add-one (Laplace) smoothing to handle words not in the training set. With this, the accuracy was around 69%. That is bad, and only marginally better than random guessing. I added a mechanism to transfer words to the opposite class when they are negated: if a word like “not” is followed by some other word, the other word is counted towards the opposite class. This, along with multinomial Naive Bayes, i.e. counting a word only once in a document, helped increase the accuracy to 82.4%.
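The negation mechanism can be sketched as a preprocessing step over the token stream (a simplified illustration; the negation list here is made up and a real pipeline would be more careful):

```python
NEGATIONS = {'not', 'no', 'never'}   # illustrative, not exhaustive

def mark_negations(tokens):
    """Prefix the token after a negation word so it lands in the
    'opposite' feature bucket, e.g. ['not', 'good'] -> ['not_good']."""
    out = []
    skip = False
    for i, tok in enumerate(tokens):
        if skip:
            skip = False
            continue
        if tok in NEGATIONS and i + 1 < len(tokens):
            out.append('not_' + tokens[i + 1])
            skip = True
        else:
            out.append(tok)
    return out

print(mark_negations(['this', 'is', 'not', 'good']))
```

The classifier then treats `not_good` as a feature of the negative class, entirely separate from `good`.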

Then, against my better judgement, I decided to try an SVM without completely understanding how it works. I used the TinySVM library; training took over 45 minutes (vs. 1 minute for NB) and resulted in a measly improvement of 0.3%. Not convinced by the black box approach, I decided to learn more about SVMs. The first place I went was Andrew Ng’s lectures on YouTube. They helped me gain a general intuition about how an SVM works, but the treatment of the math was hand wavy and there was nothing about the actual algorithm to implement. The papers on SVMs and SMO (Sequential Minimal Optimization) were too dense, and the math was way above my level as I knew nothing about convex optimization. I spent part of the next month watching Stephen Boyd’s convex optimization lectures. Now I understood Lagrange multipliers and KKT conditions, but the task of implementing an SVM was still too daunting; I realized it might take a few months to write one from scratch. What a waste of time!

Next, I thought of removing the features that were only contributing noise to the classifier. Stop word removal came to mind, but there are research papers suggesting that it actually removes information relevant to sentiment. Mutual information can be used quite effectively for feature selection: it measures how much knowing a feature tells you about the class. Since it is based on probabilities, it suited my Naive Bayes model very well. So I reduced the number of features from over 300,000 words to 6,000, choosing that number by plotting a graph of accuracy versus the number of features. The accuracy shot up to 85.2% and the training time was still under a minute.
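As a sketch, the mutual information between a binary term feature and the class can be computed from four document counts (the formula is the standard four-cell MI; the counts below are made up for illustration):

```python
from math import log

def mutual_information(n11, n10, n01, n00):
    """MI between a term (present/absent) and a class (pos/neg).
    n11: term present & class pos, n10: present & neg,
    n01: absent & pos, n00: absent & neg."""
    n = float(n11 + n10 + n01 + n00)
    mi = 0.0
    # Each tuple: (cell count, term marginal, class marginal).
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc:
            mi += (n_tc / n) * log((n * n_tc) / (n_t * n_c), 2)
    return mi

# A term that appears mostly in positive documents carries more
# information about the class than an evenly spread one.
print(mutual_information(49, 1, 1, 49) > mutual_information(25, 25, 25, 25))
```

Rank every term by this score and keep the top k; terms independent of the class score near zero and get pruned.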

The fact that I could throw away irrelevant features meant I could start with a very large feature set and prune the irrelevant ones. So I added bigrams and trigrams as additional features, which grew the feature space to 11 million. By selecting the top 32,000 features by mutual information, I was able to get an accuracy of 88.89%. There is a paper by Andrew Maas of Stanford achieving an accuracy of 88.89% using complicated vector space and probability models on the same dataset; Naive Bayes comes within 0.1% of that.

The lesson from this is that you can achieve a lot using a very simple algorithm, if you focus on selecting the right features for your data. No need to resort to complex black box models.

Here is a link to the sentiment analyzer to play around with.

The Travelling Santa Problem


I spent the past couple of days on a problem called the “Travelling Santa Problem”, a variant of the NP-complete “Travelling Salesman Problem”, in an ongoing competition on Kaggle, a site which hosts various machine learning and optimization contests. In the original TSP you need to find a tour of a graph, visiting every vertex exactly once, minimizing the total cost. Since the number of such paths (Hamiltonian cycles) can be exponential in the number of vertices, no polynomial time exact algorithm is known for this problem. Applications of TSP are in the areas of scheduling, logistics and the layout of electronic circuitry. Kaggle’s version of the problem gives you a set of 150,000 points on a two dimensional plane, and you need to find two paths that visit all the points and share no edges, i.e., disjoint paths. The aim is to minimize total distance, and your score is the length of the longer path.

If you visualize the set of points, you get a picture of Santa.


They provided a benchmark of random paths to get started with: two completely random sequences of points, with a score of about 1.3 billion, while at the time the leader had a score of 6.5 million. I decided to start with a very simple heuristic: sort the points by their x coordinate and connect the path from left to right. With this I could find a single path, but what about the second one? The disjoint path condition makes this a little trickier. But given the first path, it is easy to check whether adding an edge to the second path is valid: since every node has at most two connections, we just need to store the ids of the previous and next nodes for each node in the first sequence. Using this, I sorted the points by y coordinate and built the second path from top to bottom; whenever an edge would repeat one from the first path, I simply connected to a random vertex instead. Now that I had two disjoint paths, I made a submission, and the score turned out to be 400 million. Much better than random, but way behind the leaders and quite inefficient. If you think about it, this approach creates a very zig-zag path on random data, and the disjointness condition forces you to add random edges to the second path, which drive the distance up even further. Time for a different strategy.

The next heuristic I tried was nearest neighbour: if you connect each node to its nearest neighbour, you should get a much shorter path, and this is quite intuitive. The problem is that the naive version has a complexity of O(n²), and there are 150,000 points, go figure. A k-d tree can answer nearest neighbour queries quickly after O(n lg n) preprocessing, but its implementation is non trivial and it would be pretty much useless for generating a second disjoint path; we would still have to rely on random links. Time for an approximation: it is highly likely that the nearest neighbour lies among the points closest by x or y coordinate. These can be found by simply sorting the points and considering the k nearest such candidates (we can also check the disjointness condition while doing so). Doing this with the nearest 100 points for each node got me a solution with a score of 25 million. I increased the number of candidates, and at around 4,000 points I could get a solution of around 8.5 million, with the code running in a minute. Within 30% of the leaders, not bad, huh? So now, how can this be improved further?

I opened CLRS to look for ideas. There is an approximation algorithm, the 2-approximation, based on visiting vertices along a walk of a minimum spanning tree. CLRS has an elegant proof that the resulting tour is less than twice the minimum possible length, and in practice it does much better than that. But the problem is in creating the minimum spanning tree: using Prim’s algorithm on the complete graph would have a complexity of O(n² lg n), which is not feasible. It turns out that an MST of points in the Euclidean plane can be built in O(n lg n) time using a Voronoi diagram or a Delaunay triangulation, but as I had no background in computational geometry, I decided not to go down that path. Creating an approximate MST from the 40 nearest points by x or y coordinate yielded a poor result.

An optimization technique mentioned a lot in the forums was 2-opt. It is a local optimization technique that takes two edges and replaces them with a different combination if that reduces the distance: if you have a sequence of points ABCD, you replace it with ACBD when that has a lower cost. I tried this technique by randomly picking consecutive edges and swapping vertices when that lowered the cost. Even after 1 billion iterations and 45 minutes, it yielded an improvement of only 0.1%; it was getting stuck in a local minimum. Then I ran the optimization over all vertices of the path in sequential order instead of choosing randomly; it converged in 5-6 passes and yielded a 2-3% improvement.
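The ABCD to ACBD move can be sketched like this (a simplified sequential pass over consecutive vertex windows, not the competition code):

```python
from math import hypot

def dist(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def tour_length(pts):
    return sum(dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))

def two_opt_pass(pts):
    """One sequential pass: in every window A,B,C,D swap B and C
    whenever the swap shortens the path (the B-C edge length is
    unchanged by the swap, so only the outer edges matter)."""
    for i in range(len(pts) - 3):
        a, b, c, d = pts[i:i + 4]
        if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d):
            pts[i + 1], pts[i + 2] = c, b
    return pts

path = [(0, 0), (1, 1), (1, 0), (0, 1)]   # a deliberately crossed path
before = tour_length(path)
after = tour_length(two_opt_pass(path))
print(after < before)  # True
```

Repeating the pass until no swap fires gives the convergence behaviour described above.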

There is a randomized algorithm known as simulated annealing, based on the metallurgical process of annealing: a metal is heated to a high temperature and allowed to cool slowly to remove crystal imperfections. Heating allows a different configuration to be reached, and slow cooling locks in the improvement. Here, perturbations of the sequence are generated using 2-opt moves, and during the high temperature phase changes are accepted even if they increase the total cost. As the temperature drops, the acceptance rule is tightened until only moves that reduce the cost are accepted. This helps escape local minima and get closer to the global optimum. Unfortunately, this algorithm didn’t improve my solution by much even after running for an hour.
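The acceptance rule is typically the Metropolis criterion. Here is a minimal sketch of an annealing loop on a toy minimization problem (not the competition code; the cooling schedule and constants are illustrative):

```python
import math
import random

def accept(delta, temperature):
    """Metropolis criterion: always take improvements; take a
    worsening move with probability exp(-delta / T)."""
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)

def anneal(state, neighbour, cost, t_start=10.0, t_end=0.01, alpha=0.99):
    t = t_start
    best = state
    while t > t_end:
        cand = neighbour(state)
        if accept(cost(cand) - cost(state), t):
            state = cand
            if cost(state) < cost(best):
                best = state
        t *= alpha   # geometric cooling schedule
    return best

# Toy problem: minimise x^2 over integers by random +/-1 steps.
random.seed(0)
result = anneal(40, lambda x: x + random.choice([-1, 1]), lambda x: x * x)
print(result * result < 40 * 40)
```

For the TSP, `state` would be the tour, `neighbour` a random 2-opt move, and `cost` the tour length.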

I could get one path down to around 7.1 million and the second to 8.25 million under 5 minutes of running time using sequential 2-opt, so the difference between the two paths was the main culprit. The Lin-Kernighan heuristic, a generalization of 2-opt, is considered one of the best heuristics for TSP. But my motivation to read more about it diminished after seeing posts in the competition forum saying that some of the leaders ran their programs for days, or worse, even weeks, to get their results. In my opinion, running a program for days to achieve a very marginal improvement is totally not justified. This was really off-putting and I decided to quit the competition.

What I learned from this competition is that NP-complete problems can be approximated reasonably well within a short amount of time, while the last mile takes a humongous effort, and whether it’s worth it is questionable. I also got to know about some randomized algorithms and Monte Carlo simulations. You can find my C++ code over here.

Roll Your Own Autocomplete Solution Using Tries


You might have come across many websites with autocomplete suggestions, most notably Google.

Google Autocomplete

Adding such a feature to your site or application might seem daunting, but there is a very simple recursive data structure that solves the problem. There is a ton of literature on the net on how to do this using black box approaches like Lucene, Solr, Sphinx, Redis etc., but all these packages require a lot of configuration and you also lose flexibility. A trie can be implemented in a few lines of code in any language of your choice.

Trie

A trie is basically a tree with each node representing a letter, as illustrated in the figure above. Words are paths along this tree, and the root node has no character associated with it.

  1. The value of each node is the path, i.e. the character sequence leading up to it.
  2. The children of each node are ideally stored in a hash table mapping the next character to the child node.
  3. It’s also useful to set a flag at every node to indicate whether a word/phrase ends there.

Now we can define methods to insert a string and to search for one.
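A sketch of what insert and search can look like (illustrative, not the post’s original code):

```python
class TrieNode(object):
    def __init__(self, value=''):
        self.value = value       # character sequence leading up to this node
        self.children = {}       # next character -> child node
        self.is_word = False     # does a word/phrase end here?

def insert(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode(node.value + ch)
        node = node.children[ch]
    node.is_word = True

def search(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word

root = TrieNode()
for w in ['car', 'cart', 'dog']:
    insert(root, w)
print(search(root, 'cart'), search(root, 'ca'))  # True False
```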

The insertion cost is linear in the string length. Now let’s define a method for listing all the strings that start with a certain prefix. The idea is to traverse down to the node representing the prefix and from there visit all of its descendants, checking the flag we defined earlier to see whether a node is at a word boundary and, if so, appending the node’s value to the results. A caveat of a recursive traversal is that you might run into stack depth problems when the maximum length of a string reaches tens of thousands of characters.
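The traversal can be sketched with a plain dict-of-dicts trie (a compact alternative representation; the '$' end-of-word marker is an assumption of this sketch):

```python
def build(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True   # word-boundary flag
    return root

def autocomplete(root, prefix):
    # Walk down to the prefix node, then collect all descendants
    # iteratively (an explicit stack avoids recursion depth limits).
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results, stack = [], [(prefix, node)]
    while stack:
        value, node = stack.pop()
        if '$' in node:
            results.append(value)
        for ch, child in node.items():
            if ch != '$':
                stack.append((value + ch, child))
    return results

trie = build(['car', 'cart', 'carpet', 'dog'])
print(sorted(autocomplete(trie, 'car')))  # ['car', 'carpet', 'cart']
```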

To delete an item, first traverse down to the node at the end of the keyword, clear its word flag, and then work backwards, removing nodes that have no children and are not on the path of another word.

And there you have it: an efficient way of retrieving strings with a common prefix, in less than 40 lines of Python code. I also wrote a C++ version using the STL data structures in about 90 lines; the time complexity of that version, however, picks up a log factor, as the STL uses a red-black tree implementation for its associative array. Colin Dean has written a Ruby version and Marcus McCurdy a Java one. You can find them all over here - https://github.com/vivekn/autocomplete. Read more about tries at Wikipedia.

Update: Thanks to bmahler on reddit for pointing out that the wordlist approach was unnecessary and space inefficient. It has been corrected.
