I moved my much neglected blog to WordPress recently. I’m messing around with themes, layout, etc.
Once that’s done I hope to merge several interests into one interesting blog.
In the meantime, if you see different looks and different content, it’s just me messing around.
I hope you’ll check back later to find more cool stuff….
And blah blah blah…
It’s undocumented. It’s misspelled. And without it your (insert euphemism for “really big” here) Hadoop grid won’t be able to crunch those multi-hundred-gigabyte input files. It’s the “xciever” setting for HDFS DataNode daemons, and the absence of it in your configuration is waiting to bite you in the butt.
The software components that make up HDFS include the NameNode, the SecondaryNameNode, and DataNode daemons. Typical installations place the NameNode and SecondaryNameNode on a single “master” node. Installations following best practices go the extra mile and place the NameNode and SecondaryNameNode daemons on separate nodes. DataNode daemons are placed on participating slave nodes. At the highest level the NameNode/SecondaryNameNode daemons manage HDFS metadata (the data used to map file names to server/block lists, etc.), while the DataNode daemons handle the mechanics of reading from and writing to disk, and serve up blocks of data to processes on the node or to requesting processes on other nodes.
There are many settings that can be used to configure and tune HDFS (as well as the MapReduce engine and other Hadoop components). The HDFS documentation listed 45 of them (along with their default values) last time I checked. These configuration settings are a somewhat disorganized mix of elements used to configure the NameNode and DataNodes.
The DataNode daemons have an upper limit on the number of threads they can support, which is set to the absurdly small (IMHO) value of 256. If you have a large job (or even a moderately-sized job) that has many files open, you can exhaust this limit. When you do exceed it, a variety of failure modes can arise. Some of the exception traces point to this value being exhausted, while others do not. In either case, understanding what has gone wrong is difficult. For example, one failure mode we observed raised java.io.EOFException, with a traceback flowing down into DFSClient.
The solution to all this is to configure the Xcievers setting to raise the maximum limit on the number of threads the DataNode daemons are willing to manage. For modern Hadoop releases this should be done in conf/hdfs-site.xml. Here is an example:
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
Interestingly, HBase users seem to have more conversations about this setting and related issues than predominantly-MapReduce users do.
The central reason for this posting is to point out this setting as it might be helpful to others. We encountered this problem while porting from an older version of Hadoop to 0.20 (via the Cloudera distribution, which I highly recommend). The failure mode was not immediately evident and we wasted a lot of time debugging and chasing assorted theories before we recognized what was happening and changed the configuration. Ironically the bread crumbs which led us to change this setting came from a JIRA I opened regarding a scalability issue in 2008. As I recall the setting was not supported by the version of Hadoop we were using at that time.
A secondary reason for this posting is to point out the crazy name associated with this setting. The term “xceiver” (“i” before “e” except after “c” blah blah blah…) is shorthand for transceiver, a device which transmits and receives. But does “transceiver” really describe “maximum threads a DataNode can support”? At the very least the spelling of this setting should be corrected, and the underlying code should recognize both spellings. Even better would be to add a setting called “transceivers” or, more explicitly, “concurrent_threads”, like this:
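Something along these lines (to be clear, this property name is hypothetical — it does not exist in any Hadoop release; the sketch just shows what a saner name would look like):

```xml
<!-- Hypothetical property name; not a real Hadoop setting -->
<property>
  <name>dfs.datanode.max.concurrent_threads</name>
  <value>4096</value>
</property>
```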
Finally, why is the default value for this setting so low? Why not have it default to a more reasonable value like 2048 or 4096 out of the box? Memory and CPU are cheap; chasing infrastructure issues is expensive.
So summing it all up, lack of documentation and poor spelling of this setting caused us to lose hours of productivity. If you’re deploying a big grid, be sure to configure this properly. If you’re seeing strange failures related to I/O within your Hadoop infrastructure be sure to check for this value, and potentially increase it.
Back in my college days – long enough ago that mainframe computing was still the rage – I discovered the pleasure of The Holiday Break. Classes were over, everybody went home, the university was largely empty, and as a result you could grab all the computing time you wanted, and you could work for hours undisturbed. The Christmas holiday was always the best.
Fast forward to present day, and I continue to find the time from just before Christmas until just after the New Year a sort of magic time to get things done. Many offices are shut down for the holiday, so my unpredictable 45-minutes-or-maybe-3-hours commute becomes predictable. The shock jocks and on-air “personalities” on drive time FM radio are all on vacation, and so the radio actually plays music. Things in the office are quiet – folks don’t normally schedule releases, death marches, etc., around the holidays. The time of year puts people in a mellow mood. All-in-all, it’s a great time of year to grab some quiet time.
Quiet time becomes productive time in unexpected ways. Since I’m not in the throes of a major release crunch I actually have time to catch up on some reading. Today, for example, I read about 20-30 pages from some of the books shown on my current reading list at the top of the blog. I also read a few great blog posts about a number of things technical. So great, I’m reading and surfing…but is it productive?
Oh yeah, it sure is. I concocted a new way to visualize how well or poorly a very complicated n-tier application is performing, in part based on some blog reading. Very cool stuff, for which I cannot take all the credit, but about which I will blog more in the near future.
I also got a couple hours to write up some much needed documentation, and spent time catching up with colleagues who actually have time to talk about what they are working on and what sorts of challenges they are facing. This naturally leads to more ideas, and the ball starts rolling.
People often talk about the December holidays as down-time for business, and I’m sure that it is. But the flip-side of the coin is that it can be a great time to open up the mind, solve some problems, and come up with good stuff to tackle in the upcoming year. Down time offers the chance to productively daydream without (usually) getting off schedule. So for engineers, down time can be super productive time. And maybe that’s why when the weather gets cold, and the Christmas bell-ringers appear on every street corner, my mind turns to thoughts of software design, architecture, and coding.
Visible Measures (my current gig) won an award tonight at the Massachusetts Innovation and Technology Exchange (MITX) 2008 What’s Next Forum and Technology Awards. We were recognized in the Analytics and Business Intelligence category – the same category that Compete (my old company) was entered in last year.
A lot of great companies were finalists in our category, including Progress Software, salary.com, SiteSpect, and Lexalytics. This was tough competition, which made winning this award all the more sweet. A big shout out to Version 2 Communications as well – we were their guests at the awards.
Visible Measures is an awesome company, with an extremely hard-working, highly motivated team. I am extremely proud and humbled to be part of this company.
The MITX event was very nice. There were plenty of opportunities to network with a lot of interesting people doing a lot of cool stuff. It was great to listen to Larry Weber (Chairman of the Board for MITX and founder of W2 Group) host the awards and dispense free advice (“…with 37 offices worldwide – that’s too much overhead…”). MITX honored Amar Bose, who gave a very interesting talk. Bose is legendary – at least in the New England high tech community and particularly within MIT – so hearing him speak live was a privilege.
The only downside to the evening was the fire alarm going off mid-way through the ceremony. This led to a rather awkward pause in the action while the fire department made sure nothing was wrong.
A few weeks ago I went to California for the Hadoop Summit. I posted a bunch of notes in real-time during the summit until the network connection became too flakey to continue.
The Yahoo folks have come to the rescue, however. The slides from the presentations, which are tons better than my notes, are freely available on-line here. There are also slides from the Data Intensive Computing Symposium, which was held the next day.
I wish I had known about the Data Intensive symposium, as it looks really, really interesting (not to mention an excuse to stay in California one more day…).
Just picked up one of these last week. The plan is to use the box as a shared storage resource to back up family data (pictures, etc.), and to back up other systems, and the grid machines in the rack.
I was originally going to build a box to handle the task, but a friend of mine recommended the ReadyNAS server as a cost effective (and less labor intensive) alternative. This box is basically plug-and-play…the operating system is delivered in firmware, and you configure and operate the box via a web interface and with a program called RAIDar. The box speaks a variety of protocols and can talk to Windows, Linux, Macs, and streaming media players so it should get along well with all the servers, workstations, etc.
I bought a diskless version, and populated it with 2x500G Western Digital drives. Initially nothing worked and for a brief time I thought the server was DOA. After a bunch of trial and error I concluded that one of the WD drives was DOA. I brought the box up on 1 drive, configured things, and it just worked. NewEgg RMA’d the bad drive (and even gave me freebie shipping label to send the bad device back…good stuff).
I’ve got 2 more 500G drives arriving tomorrow – the box is hot-pluggable so in theory installation is simple. It should be interesting to get the box up to 2T with X-RAID and do some performance testing.
Product reviews of the ReadyNAS have been widely varied, but so far all things look positive. I’ll post more about the box once I get my bad drive issues sorted out…
Thankfully I don’t have to travel too much on business – there are plenty of things to do right here in the office most of the time. However, I do get a chance to escape the office now and again, and the Hadoop Summit in California a couple weeks ago was one such opportunity.
The summit was awesome, you can see some of the notes I took in earlier postings here. I talked to a lot of people, heard a lot of good presentations, and got tons of good information about the Hadoop roadmap, future directions, etc. All very good stuff.
The point of this post, however, is not to heap praises on the Yahoo! folks for such a great meeting, it’s to chat a bit about Economy Class on United Airlines. When I did the on-line check-in thing for the flight from BOS to SFO, United offered me the chance to upgrade to “Economy Plus” for an extra $60 to get more legroom. “No thanks”, I clicked, and thought nothing further about it. Then I got on the airplane (wasn’t it George Carlin who said “Let Evel Knievel get ON the plane, I am getting IN the plane…”, but I digress…). As we reached the altitude “at which it is safe for portable electronic devices to be used”, I reached for my laptop…just as the guy ahead of me reclined for what turned out to be a 6 hour snooze across the country. Hmmm, uhhhhh……not enough room between seats to uhhh open the lid on my Lenovo T60. Ugh. So I broke out the pad of paper and pen to jot down some notes and uhhhhhhh not enough room to even write comfortably.
Fortunately, the guy sitting next to me turned out to be an interesting fellow starting up a hedge fund, and so I spent the rest of the trip happily chatting away about everything from technology to politics.
On the return flight I’m nobody’s fool. I really really want to get a good 5 hours of work done on the plane (or should that be “in the plane”…) and so I pony up the extra $60 for the Economy Plus seat. It was a night-and-day difference. I flipped open the laptop, stretched out, and used up both batteries writing code.
Now I’m not gonna gripe too much about the Welcome-To-Economy-Sorry-About-The-Laptop seat on the flight out…the ticket was cheap enough, particularly for a non-stop flight ($339, BOS->SFO), but I would like to:
A. Recommend the hell out of Economy Plus if you have to fly United and you want to work
B. Encourage the airlines to think about the impact such narrow seating has on business travelers
C. Remind myself to get to the gym a little more often…maybe if I skinny up enough the Economy section won’t be so bad…
Given the State of the Airline Industry I am pessimistic about anything happening with respect to (B) above, but I sure do feel better getting this rant off my chest….
The conference is rolling along – a lot of great information and good presentations all around. Unfortunately, there has been some network flakiness particularly during the afternoon…so I’ve stopped trying to blog each talk.
I’ll try to summarize some of the more interesting points later…
Rapleaf is a people-search, profile-aggregation, and data-API company
Application: custom Ruby web crawler – indexes structured data from profiles. Currently, a page is indexed once and then gone forever:
Internet -> Web Crawler -> Index Processing -> MySQL
Using HBase to store pages, HBase via REST servlet. Allows: reindexing at a later date, better factored workflow, easier debugging:
Internet -> Web Crawler -> HBase -> Index Processing -> MySQL Database
Data similar to webtable: keyed on person, not URL; webtable is less structured; the Rapleaf table is not focused on links.
Schema: 2 column families: content (stores search and profile pages), meta (stores info about the retrieval [when, how long, who it’s about]). Keys are tuples of (person ID, site ID)
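One way to picture the (person ID, site ID) row key: fixed-width, zero-padded fields make lexicographic order (which is how HBase sorts rows) match numeric order, so all rows for one person cluster together. This encoding is purely my own sketch – the talk did not say how Rapleaf actually serializes the tuple:

```python
def row_key(person_id: int, site_id: int) -> str:
    # Hypothetical encoding of the (person ID, site ID) tuple as a row key.
    # Zero-padding keeps lexicographic order consistent with numeric order,
    # so all rows for a given person sort adjacently in the table.
    return f"{person_id:010d}/{site_id:05d}"
```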
Cluster specs: HDFS/HBase cluster of 16 machines, 2TB disk, 64 cores, 64G of memory
Load pattern: Approx 3.6T per month (830G compressed), average row size 64K, 14K gzipped
Performance: average write time 31 seconds (yikes! accounts for ruby time, base64, sending to servlet, writing to HBase, etc.)…median write time is 0.201 seconds. Max write time 359 seconds. Reads not used much at the moment. Note some of these perf. issues are due to Ruby on GreenThreads instead of native threads, etc….haven’t profiled the REST servlet either.
General Observations: Record compression hurt performance on testing; compressing in client gives a big boost; possible addition to standard HBase client. HBase write logs stored in HDFS which don’t exist until closed, which means HBase has durability issues. (this will be resolved by Hadoop-1700).
HBase: Distributed database modelled on BigTable (Google)
Runs on top of Hadoop core
Column-store: Wide tables cost only the data stored, NULLs in rows are “free”, columns compress well
Not a SQL database: no joins, no transactions, no column typing, no sophisticated query engine, no SQL, ODBC, etc.
Use HBase: Scale is large; access can be basic and mostly table scans.
The canonical use case is web-table: table of web crawls keyed by URL, columns of webpage, parse, attributes, etc.
Data Model: Table of rows X columns; timestamp is 3rd dimension
Cells are uninterpreted byte arrays
Rows can be any Text value, e.g. a URL: rows are ordered lexicographically, row writes are atomic
Columns grouped into column families: have a prefix and a qualifier (attribute:mimetype, attribute:language)
Members of a column family group have similar character/access patterns
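The data model above – rows mapping to family:qualifier columns, with timestamps as a third dimension – can be sketched as nested maps. This is a toy illustration of the model only, not the HBase API:

```python
# Toy model of an HBase table: row -> "family:qualifier" -> timestamp -> cell.
# Cells are uninterpreted bytes; rows are kept in lexicographic key order.
class ToyTable:
    def __init__(self):
        self.rows = {}  # row key (str) -> column (str) -> {timestamp: bytes}

    def put(self, row, column, timestamp, value):
        cols = self.rows.setdefault(row, {})
        cols.setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        # Return the most recent version of the cell.
        versions = self.rows[row][column]
        return versions[max(versions)]

    def scan(self):
        # Yield rows in lexicographic key order, as an HBase scan does.
        for key in sorted(self.rows):
            yield key, self.rows[key]

t = ToyTable()
t.put("com.example/index", "attribute:mimetype", 1, b"text/html")
t.put("com.example/index", "attribute:mimetype", 2, b"application/xhtml+xml")
# The newest timestamp wins: returns b"application/xhtml+xml"
latest = t.get("com.example/index", "attribute:mimetype")
```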
Implementation: Clients, a Master, and one or more region servers (analogous to slaves in Hadoop).
Cluster carries 0->N Tables
Master assigns table regions to region servers
Region server carries 0->N regions: the region server keeps a write-ahead log of every update, used for recovering lost region servers; edits first go to the WAL, then to the Region
Each region is made of MemCache and Stores
Row CRUD: client initially goes to the Master for the row location; the client caches it and goes direct to the region server thereafter; on a fault it returns to the Master to freshen its cache
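That lookup flow – ask the Master once, cache, and invalidate on fault – might look like the following sketch. The names are hypothetical; this is not the real HBase client API:

```python
# Toy sketch of the client-side region-location cache described above.
class ToyClient:
    def __init__(self, master_lookup):
        # master_lookup: callable mapping a row key to its region server.
        self.master_lookup = master_lookup
        self.cache = {}  # row key -> cached region server location

    def locate(self, row):
        # First request for a row goes to the Master; later requests
        # are served from the client-side cache.
        if row not in self.cache:
            self.cache[row] = self.master_lookup(row)
        return self.cache[row]

    def on_fault(self, row):
        # A failed read/write means the cached location is stale
        # (e.g. the region moved); drop it so the next locate()
        # freshens from the Master.
        self.cache.pop(row, None)
```

The design keeps the Master off the hot path: steady-state reads and writes never touch it, and only a fault (or a cold cache) costs an extra round trip.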
When regions get too big they are split: the region server manages the split; the Master is informed; the parent is off-lined; new daughters are deployed.
All data persisted to HDFS
Connecting: Java is the first-class client; non-Java clients: Thrift server hosting an HBase client instance (Ruby, C++, Java). Also a REST server hosts an HBase client (Ruby gem, ActiveRecord via REST); SQL-like shell (HQL); TableInput/OutputFormat for map/reduce
History: 2006 Powerset interest in Bigtable; 02/2007 Mike Cafarella provides initial code drop – cleaned up by Powerset and added as Hadoop contrib; first usable HBase in Hadoop 0.15.0; 01/2008 HBase becomes a subproject of Hadoop; Hadoop 0.16.0 incorporates it into the code base
HBase 0.1.0 release candidate is effectively Hadoop 0.16.0 with lots of fixes…HBase now stands outside of Hadoop contrib.
Focus on developing use/developer base: 3 committers, tech support (other people’s clusters via VPN), 2 User Group meetings – more to follow; working on documentation and ease-of-use.
Performance: Performance good and improving
Known users: Powerset and Rapleaf, WorldLingo, Wikia
Near future: release 0.2.9 to follow the release of Hadoop 0.17.0; theme: robustness and scalability – rebalancing of regions across the cluster, replace HQL with jirb, Jython, or BeanShell, repair and debugging tools.
HBase and HDFS: No appends in HDFS: data loss…WAL is useless without it, Hadoop-1700 making progress
HBase usage pattern is not the same as Map/Reduce – random reading and keeping files open raises “too many open files” issues
HBase or HDFS errors: Hard to figure without debug logging enabled
Visit hbase.org: mailing lists, source, etc.