Testing testing 1-2-3-…

I moved my much-neglected blog to WordPress recently. I’m messing around with themes, layout, etc.

Once that’s done I hope to merge several interests into one interesting blog.

In the meantime, if you see different looks and different content, it’s just me messing around.

I hope you’ll check back later to find more cool stuff….

And blah blah blah…

Hadoop HDFS: Deceived by Xciever

It’s undocumented. It’s misspelled. And without it your (insert euphemism for “really big” here) Hadoop grid won’t be able to crunch those multi-hundred gigabyte input files. It’s the “xciever” setting for HDFS DataNode daemons, and the absence of it in your configuration is waiting to bite you in the butt.

The software components that make up HDFS include the NameNode, the SecondaryNameNode, and DataNode daemons. Typical installations place the NameNode and SecondaryNameNode on a single “master” node. Installations following best practices go the extra mile and place the NameNode and SecondaryNameNode daemons on separate nodes. DataNode daemons are placed on participating slave nodes. At the highest level, the NameNode/SecondaryNameNode daemons manage HDFS metadata (the data used to map file names to server/block lists, etc.), while the DataNode daemons handle the mechanics of reading from and writing to disk, and of serving up blocks of data to processes on the node or to requesting processes on other nodes.

There are many settings that can be used to configure and tune HDFS (as well as the MapReduce engine and other Hadoop components). The HDFS documentation lists 45 of them (along with their default values), last time I checked. These configuration settings are a somewhat disorganized mix of elements used to configure the NameNode and DataNodes.

The DataNode daemons have an upper limit on the number of threads they can support, which is set to the absurdly small (IMHO) value of 256. If you have a large job (or even a moderately sized job) that has many files open, you can exhaust this limit. When you do exceed this limit, a variety of failure modes can arise. Some of the exception traces point to this value being exhausted, while others do not. In either case, understanding what has gone wrong is difficult. For example, one failure mode we observed was a java.io.EOFException, with a traceback flowing down into DFSClient.

The solution to all this is to configure the xcievers setting to raise the limit on the number of threads the DataNode daemons are willing to manage. For modern Hadoop releases this should be done in conf/hdfs-site.xml. Here is an example:

<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>


Interestingly, HBase users seem to have more conversations about this setting and related issues than predominantly-MapReduce users do.

The central reason for this posting is to point out this setting, as it might be helpful to others. We encountered this problem while porting from an older version of Hadoop to 0.20 (via the Cloudera distribution, which I highly recommend). The failure mode was not immediately evident, and we wasted a lot of time debugging and chasing assorted theories before we recognized what was happening and changed the configuration. Ironically, the breadcrumbs that led us to change this setting came from a JIRA I opened regarding a scalability issue in 2008. As I recall, the setting was not supported by the version of Hadoop we were using at that time.

A secondary reason for this posting is to point out the crazy name associated with this setting. The term “xceiver” (“i” before “e” except after “c” blah blah blah…) is shorthand for transceiver, a device that transmits and receives. But does the term “transceiver” really describe “maximum threads a DataNode can support”? At the very least, the spelling of this setting should be corrected and the underlying code should recognize both spellings. What would be even better would be to add a setting called “transceivers” or, even more explicitly, “concurrent_threads”, like this:
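
To be concrete, such a renamed setting might look like the following (to be clear, this property name is purely hypothetical – it does not exist in any Hadoop release):

<property>
    <name>dfs.datanode.max.transceivers</name>
    <value>4096</value>
</property>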


Finally, why is the default value for this setting so low? Why not have it default to a more reasonable value like 2048 or 4096 out of the box? Memory and CPU are cheap; chasing infrastructure issues is expensive.

So, summing it all up: the lack of documentation and the poor spelling of this setting cost us hours of productivity. If you’re deploying a big grid, be sure to configure this properly. If you’re seeing strange I/O-related failures within your Hadoop infrastructure, be sure to check this value and potentially increase it.

Down Time == Productive Time

Back in my college days – long enough ago that mainframe computing was still the rage – I discovered the pleasure of The Holiday Break.  Classes were over, everybody went home, the university was largely empty, and as a result you could grab all the computing time you wanted, and you could work for hours undisturbed.   The Christmas holiday was always the best. 

Fast forward to present day, and I continue to find the time from just before Christmas until just after the New Year a sort of magic time to get things done.  Many offices are shut down for the holiday, so my unpredictable 45-minutes-or-maybe-3-hours commute becomes predictable.   The shock jocks and on-air “personalities” on drive time FM radio are all on vacation, and so the radio actually plays music.  Things in the office are quiet – folks don’t normally schedule releases, death marches, etc., around the holidays.  The time of year puts people in a mellow mood.  All-in-all, it’s a great time of year to grab some quiet time.

Quiet time becomes productive time in unexpected ways.  Since I’m not in the throes of a major release crunch I actually have time to catch up on some reading.  Today, for example, I read about  20-30 pages from some of the books shown on my current reading list at the top of the blog.  I also read a few great blog posts about a number of things technical.  So great, I’m reading and surfing…but is it productive?

Oh yeah, it sure is.   I concocted a new way to visualize how well or poorly a very complicated n-tier application is performing, in part based on some blog reading.  Very cool stuff, for which I cannot take all the credit, but about which I will blog more in the near future.

I also got a couple hours to write up some much needed documentation, and spent time catching up with colleagues who actually have time to talk about what they are working on and what sorts of challenges they are facing.  This naturally leads to more ideas, and the ball starts rolling.

People often talk about the December holidays as down-time for business, and I’m sure that it is.  But the flip-side of the coin is that it can be a great time to open up the mind, solve some problems, and come up with good stuff to tackle in the upcoming year.   Down time offers the chance to productively daydream without (usually) getting off schedule.  So for engineers, down time can be super productive time.  And maybe that’s why when the weather gets cold, and the Christmas bell-ringers appear on every street corner, my mind turns to thoughts of  software design, architecture, and coding.  

Visible Measures wins 2008 MITX Technology Award!

Visible Measures (my current gig) won an award tonight at the Massachusetts Innovation and Technology Exchange (MITX) 2008 What’s Next Forum and Technology Awards. We were recognized in the Analytics and Business Intelligence category – the same category that Compete (my old company) was entered in last year.

A lot of great companies were finalists in our category, including Progress Software, salary.com, SiteSpect, and Lexalytics. This was tough competition, which made winning this award all the more sweet. A big shout out to Version 2 Communications as well – we were their guests at the awards.

Visible Measures is an awesome company, with an extremely hard-working, highly motivated team. I am extremely proud and humbled to be part of this company.

The MITX event was very nice. There were plenty of opportunities to network with a lot of interesting people doing a lot of cool stuff. It was great to listen to Larry Weber (Chairman of the Board for MITX and founder of W2 Group) host the awards and dispense free advice (“…with 37 offices worldwide – that’s too much overhead…”). MITX honored Amar Bose, who gave a very interesting talk. Bose is legendary – at least in the New England high-tech community, and particularly within MIT – so hearing him speak live is a privilege.

The only downside to the evening was the fire alarm going off mid-way through the ceremony. This led to a rather awkward pause in the action while the fire department made sure nothing was wrong.

Infrant/NetGear ReadyNAS NV+

Just picked up one of these last week. The plan is to use the box as a shared storage resource to back up family data (pictures, etc.), and to back up other systems, and the grid machines in the rack.

I was originally going to build a box to handle the task, but a friend of mine recommended the ReadyNAS server as a cost effective (and less labor intensive) alternative. This box is basically plug-and-play…the operating system is delivered in firmware, and you configure and operate the box via a web interface and with a program called RAIDar. The box speaks a variety of protocols and can talk to Windows, Linux, Macs, and streaming media players so it should get along well with all the servers, workstations, etc.

I bought a diskless version and populated it with 2x500G Western Digital drives. Initially nothing worked, and for a brief time I thought the server was DOA. After a bunch of trial and error I concluded that one of the WD drives was DOA. I brought the box up on one drive, configured things, and it just worked. NewEgg RMA’d the bad drive (and even gave me a freebie shipping label to send the bad device back…good stuff).

I’ve got 2 more 500G drives arriving tomorrow – the box is hot-pluggable so in theory installation is simple. It should be interesting to get the box up to 2T with X-RAID and do some performance testing.

Product reviews of the ReadyNAS have been widely varied, but so far all things look positive. I’ll post more about the box once I get my bad drive issues sorted out…

United Economy Section not so good for laptop users…

Thankfully I don’t have to travel too much on business – there are plenty of things to do right here in the office most of the time. However, I do get a chance to escape the office now and again, and the Hadoop Summit in California a couple weeks ago was one such opportunity.

The summit was awesome, you can see some of the notes I took in earlier postings here. I talked to a lot of people, heard a lot of good presentations, and got tons of good information about the Hadoop roadmap, future directions, etc. All very good stuff.

The point of this post, however, is not to heap praises on the Yahoo! folks for such a great meeting, it’s to chat a bit about Economy Class on United Airlines. When I did the on-line check-in thing for the flight from BOS to SFO, United offered me the chance to upgrade to “Economy Plus” for an extra $60 to get more legroom. “No thanks”, I clicked, and thought nothing further about it. Then I got on the airplane (wasn’t it George Carlin who said “Let Evel Knievel get ON the plane, I am getting IN the plane…”? But I digress…). As we reached the altitude “at which it is safe for portable electronic devices to be used”, I reached for my laptop…just as the guy ahead of me reclined for what turned out to be a 6-hour snooze across the country. Hmmm, uhhhhh…not enough room between seats to uhhh open the lid on my Lenovo T60. Ugh. So I broke out the pad of paper and pen to jot down some notes and uhhhhhhh not enough room to even write comfortably.

Fortunately, the guy sitting next to me turned out to be an interesting fellow starting up a hedge fund, and so I spent the rest of the trip happily chatting away about everything from technology to politics.

On the return flight I’m nobody’s fool. I really really want to get a good 5 hours of work done on the plane (or should that be “in the plane”…) and so I pony up the extra $60 for the Economy Plus seat. It was a night-and-day difference. I flipped open the laptop, stretched out, and used up both batteries writing code.

Now I’m not gonna gripe too much about the Welcome-To-Economy-Sorry-About-The-Laptop seat on the flight out…the ticket was cheap enough, particularly for a non-stop flight ($339, BOS->SFO), but I would like to:

A. Recommend the hell out of Economy Plus if you have to fly United and you want to work
B. Encourage the airlines to think about the impact such narrow seating has on business travelers
C. Remind myself to get to the gym a little more often…maybe if I skinny up enough the Economy section won’t be so bad…

Given the State of the Airline Industry I am pessimistic about anything happening with respect to (B) above, but I sure do feel better getting this rant off my chest….

Compete acquired by Taylor Nelson Sofres

Compete, a company I worked for as Chief Software Architect, was recently acquired by Taylor Nelson Sofres. This is a remarkable achievement for Compete and a great deal for both Compete and TNS. If you’re interested, you can read the official postings about the acquisition. Congratulations to everyone at Compete and TNS!

Learning about this acquisition made me feel very proud and quite nostalgic. It was back in April 2001 that I got involved with Compete. I had wrapped up a long contract at Oracle, spent a month in India with my wife visiting her extended family, and was back in Nashua, NH contemplating the irrational exuberance of what we would all soon call “dot bomb.” I had been contracting for a while, and while that was good fun I felt it was time to do something a bit more daring/challenging.

One thing led to another and I found myself in Boston, on oh-so-chic Newbury Street of all places, sitting in a literally transparent office on the upper floors of a converted church building, discussing a company called Compete with its CTO (and now my good friend) David Cancel. I wound up joining Compete a few weeks later as its Chief Software Architect, a position that I held for almost 5 years. I think I was the 11th or 12th person Compete hired. Compete rode the waves of dot com and dot bomb, and eventually found its way to become the strong player it is today, thanks to its excellent management team and a lot of hard work by everybody.

Fast forward 7 years and Compete is now a recognized industry force – an established player with a work force of close to 100 extremely talented and dedicated individuals. Expect more great things from this company in the future!