I wanted to try my hand at a MongoDB and AngularJS app. First things first, I needed a dataset that would do MongoDB justice. The best thing I came up with was Google’s wiki-links dataset: big enough not to sit in RAM, but not too overwhelming, and with a little bit of structure to it.
I made a simple parsing and importing script. I set up my MongoDB instances on my Eucalyptus cluster (which has very slow IO), using a three-node replica set. The dataset is split up into 10 files, each about 550 MB. Inserts were taking about 120 s per 1000 records (safe inserts, with a majority write concern). After a few demotions, we selected as master the node on the Eucalyptus cluster with the least contention. That got us down to about 95 s per 1000 records. At that rate it would take several days to import just one file of the dataset. I looked around and noticed that we weren’t really bottlenecked on CPU or IO, so I figured we could make the inserts in parallel and go a bit faster.
I’ve done the standard forking method in the past, so I wanted to try something new. I figured I would see what I could do with a message queue or something similar. I started with nanomsg, but the Perl bindings weren’t up to date with the latest beta (the beta had just come out the month before). I fussed with 0MQ for a bit, but when I started losing messages, I decided I could get lost down that rabbit hole pretty quickly. Looking at a few benchmarks, I knew I was looking at either RabbitMQ or Qpid. After looking at some of the code samples, I picked Qpid. It was in the distro repos and very easy to set up. It had examples in multiple languages, including Perl. It had some easy-to-use command-line tools for setting up queues, and even a script to do a quick performance test. On my workstation that test showed something like 200,000 messages a second with transient storage and 32,000 messages a second with persistent message storage. Easily enough that queuing and dequeuing wasn’t going to be my bottleneck.
I ripped apart my parser and inserter, merged them with some of the example scripts, and now I have the parser queueing up records to be inserted and an importer dequeuing them and inserting into MongoDB. The parsing and queueing took about 2 minutes, and I have 5 copies of the importer script running, each taking about 106 s per 1000 records, with a bit more room to push it further.
As a side note, I should mention that I had to reboot the nodes in my Eucalyptus cluster, and I lost my setup and data because I’m not using a block storage volume to store things on. That prompted me to do a little more Ansible work, and now setup is a breeze. See my Ansible MongoDB playbook here. It needs a few tweaks to be better, but it works for me at the moment.
Now time to finish up that AngularJS tutorial while these imports run.