Hadoop Summit: Ben Reed and Zookeeper

Zookeeper: General, robust coordination service…

Observations: Distributed systems always need some form of coordination. Programmers cannot use locks correctly (note that Google uses the “Chubby” lock service): you can learn locks in 5 minutes, but spend a lifetime learning how to use them properly. Message-based coordination can be hard to use in some apps.

Wants: Simple, robust, good performance; tuned for read-dominant workloads; familiar models and interfaces; wait-free: a failed client will not interfere with the requests of a fast client; need to be able to wait efficiently.

Design starting point: start with a file API and strip out what we don’t need: partial writes/reads, rename. Add what is needed: ordered updates and strong persistence guarantees, conditional updates, watches for data changes, ephemeral nodes (a client creates a file; if the client goes away, the file goes away), generated file names (i.e. mktemp()).

Data Model: Hierarchical namespace, each znode has data and children, data is read and written in its entirety

Zookeeper API: create(), delete(), setData(), getData(), exists(), getChildren(), sync()…also some security APIs (“but nobody cares about security”…). getChildren() is like “ls”.

Create flags: ephemeral – the znode gets deleted when the session that created it times out; sequence – the path name has a monotonically increasing counter (relative to the parent) appended.
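For flavor (not from the talk), here is a minimal sketch of how ephemeral + sequence znodes combine for simple group membership, written against the Apache ZooKeeper Java client as it exists today; the /workers path, the timeout value, and the class name are my inventions, and details may differ from the 2008-era client.

    // Hedged sketch: group membership via ephemeral, sequential znodes.
    // Assumes the parent path /workers already exists.
    import org.apache.zookeeper.*;
    import java.util.List;

    public class WorkerRegistration implements Watcher {
        private final ZooKeeper zk;

        public WorkerRegistration(String hosts) throws Exception {
            // Session timeout in ms; this object receives watch notifications.
            zk = new ZooKeeper(hosts, 30000, this);
        }

        public String register() throws KeeperException, InterruptedException {
            // EPHEMERAL_SEQUENTIAL: the znode disappears when our session dies,
            // and the server appends a monotonically increasing counter to the name.
            return zk.create("/workers/worker-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        }

        public List<String> liveWorkers() throws KeeperException, InterruptedException {
            // true = leave a watch; we get a callback when the children change.
            return zk.getChildren("/workers", true);
        }

        public void process(WatchedEvent event) {
            // Called when something we watched changes (e.g. a worker's session expired).
            System.out.println("ZooKeeper event: " + event);
        }
    }

If a worker crashes, its ephemeral znode goes away and every client watching /workers gets notified – which is the wait-free, watch-based style the talk described.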

How Zookeeper works: The service is made up of a set of machines. Each server stores a full copy of the data in memory…this gives very low latency and high throughput. A leader is elected at startup; the rest of the servers become followers. Followers service clients; all updates go through the leader. Update responses are sent when a majority of servers agree.

Example of Zookeeper service: Hadoop on Demand. I got distracted by something during this part of the presentation and didn’t get all the basic pieces here, but this uses Zookeeper to track interaction with Torque, and handles graceful shutdown, etc.

Status: Project started in 2006, prototyped in fall 2006, initial implementation March 2007, code moved to zookeeper.sourceforge.net under the Apache License in November 2007. Everything is pure Java: quorum version and standalone version. Java and C clients are available.




Hadoop Summit: Andy Konwinski & X-trace

Monitoring Hadoop using X-trace: Andy Konwinski is from UC Berkeley RAD Lab

Motivation: Hadoop style processing masks failures and performance problems

Objectives: Help Hadoop developers debug and profile Hadoop; help operators monitor and optimize Map/Reduce jobs

[CCG: This is cool, we REALLY need this stuff NOW]

RAD Lab: Reliable Adaptive Distributed Systems

Approach: Instrument Hadoop using the X-Trace framework. Trace Analysis: visualization via a web-based UI; statistical analysis and anomaly detection

So what’s X-Trace: A path-based tracing framework; it generates an event graph to capture causality of events across the network (RPC, HTTP, etc.). Messages are annotated with trace metadata carried along the execution path (instrument protocol APIs and RPC libraries). Within Hadoop, the RPC library has been instrumented.

DFS Write Message Flow Example: client -> DataNode1 -> DataNode2 -> DataNode3
Report Graph Node: Report Label, Trace ID#, Report ID#, Hostname, timestamp

Each report becomes a node in a graph which can then be walked.

[CCG: again you need to see the picture, but you get the idea right?]
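To make the report-graph idea concrete, here is a purely hypothetical sketch of what one report node might carry; this is my illustration, not the actual X-Trace API, and every class and field name below is invented.

    // Hypothetical illustration of a path-trace report node; not the real X-Trace API.
    import java.util.ArrayList;
    import java.util.List;

    public class TraceReport {
        final String label;       // e.g. "DataNode2: receive block"
        final long traceId;       // shared by every report in one traced request
        final long reportId;      // unique within the trace
        final List<Long> parents = new ArrayList<Long>(); // causal edges to earlier reports
        final String hostname;
        final long timestampMs;

        TraceReport(String label, long traceId, long reportId,
                    String hostname, long timestampMs) {
            this.label = label;
            this.traceId = traceId;
            this.reportId = reportId;
            this.hostname = hostname;
            this.timestampMs = timestampMs;
        }

        // Walking the graph: group reports by traceId, then follow parent edges
        // backwards to recover the path a request took through the cluster.
    }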

Andy showed some cool graphs representing map/reduce operations in flight…

Properties of Path Tracing: Deterministic causality and concurrence; control over which events get traced; cross-layer; low overhead; modest modification of the app source code (…)

Architecture: X-Trace front ends on each Hadoop node communicate with the X-Trace backend via TCP/IP; the backend stores data using BDB; the trace analysis web UI communicates with the X-Trace backend. Also, cool fault detection programs can be run – they interact with the backend via HTTP.

Trace Analysis UI: “we have pretty pictures which can tell you a lot”: perf stats, graphs of utilization, critical path analysis…

Applications of Trace Analysis: Examined performance of the Apache Nutch web indexing engine on a Wikipedia crawl. Time to create an inverted link index of a 50GB crawl – with the default configuration, it ran in 2 hours.

Then, by using trace analysis, they were able to make changes that run the same workload in 7 minutes. Trace analysis showed that a single reduce task was actually failing several times at the beginning: three ten-minute timeouts. Bumping max reducers to 2 per node dropped execution time to 7 minutes.

Behold the power of pretty pictures!

Off-line machine learning: faulty machine detection, buggy software detection. Current work on graph processing and analysis.

Future work: Tracing more production map/reduce applications. More advanced trace processing tools, migrating code into Hadoop codebase.

Hadoop Summit: Michael Isard and DryadLINQ

Michael Isard is from Microsoft Research:

“Dryad: Beyond Map-Reduce”

What is Map-Reduce: An implementation, a computational model, a programming model.

Implementation Performance: Literally map, reduce, and that’s it…reducers write to replicated storage. Complex jobs require pipelining multiple stages…no fault tolerance between stages. Output of Reduce: 2 network copies, 3 disks.

Computational Model: Join combines inputs of different types, “split” produces outputs of different types. This can be done with map-reduce, but it leads to ugly programs. It is hard to avoid the performance penalty described above. Some merge-joins are very expensive. Finally, baking in more operators adds complexity.

Dryad Middleware Layer: Address flexibility and performance issues, more generalized than map-reduce, interface is more complex.

Computational Model: Job is a DAG, each node takes any number of inputs and produces any number of outputs (you need to see the picture).

DAG Abstraction Layer: The scheduler handles arbitrary graphs independent of vertex semantics; a simple uniform state machine handles scheduling and fault tolerance. Higher levels build the plan from application code: layers are isolated, many optimizations are simple graph manipulations, and the graph can be modified at runtime.
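A toy sketch of what “scheduler handles arbitrary graphs independent of vertex semantics” can mean in practice: the scheduler sees only vertices and edges, and runs any vertex whose inputs have completed. This is my own illustration (plain Java, no fault tolerance), not Dryad’s actual interface; all names are invented.

    // Hypothetical vertex-agnostic DAG scheduler (Kahn-style topological execution).
    import java.util.*;

    interface Vertex {
        void run();   // the scheduler never looks inside this
    }

    class DagScheduler {
        private final Map<Vertex, Set<Vertex>> pendingInputs = new HashMap<>();
        private final Map<Vertex, List<Vertex>> downstream = new HashMap<>();

        void addEdge(Vertex from, Vertex to) {
            pendingInputs.computeIfAbsent(to, v -> new HashSet<>()).add(from);
            pendingInputs.computeIfAbsent(from, v -> new HashSet<>());
            downstream.computeIfAbsent(from, v -> new ArrayList<>()).add(to);
            downstream.computeIfAbsent(to, v -> new ArrayList<>());
        }

        void runAll() {
            Deque<Vertex> ready = new ArrayDeque<>();
            for (Map.Entry<Vertex, Set<Vertex>> e : pendingInputs.entrySet())
                if (e.getValue().isEmpty()) ready.add(e.getKey());   // sources
            while (!ready.isEmpty()) {
                Vertex v = ready.poll();
                v.run();                                  // vertex semantics live here
                for (Vertex next : downstream.get(v)) {   // release now-satisfied vertices
                    Set<Vertex> waits = pendingInputs.get(next);
                    waits.remove(v);
                    if (waits.isEmpty()) ready.add(next);
                }
            }
        }
    }

Because the scheduler only manipulates the graph, higher layers can rewrite it (merge vertices, add stages) without touching scheduling or fault-tolerance logic – which is the isolation the talk emphasized.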

MapReduce Programming Model: opaque key/value pairs are flexible. Front ends like Sawzall and Pig help, but domain-specific simplifications limit some applications.

Moving beyond simple data-mining to machine learning, etc.

LINQ: Extensions to .NET, shipping with Visual Studio 2008…general-purpose data-parallel programming constructs. Data elements are arbitrary .NET types, combined in a generalized framework.

DryadLINQ: Automagically distribute a LINQ program; some Dryad-specific extensions: same source program runs on single-core to multi-core to cluster. Execution model depends on data source.

LINQ is designed to be extensible; LINQ+C# provides parsing and type checking. LINQ builds an expression tree. The root provider class is called on evaluation (it has access to the entire tree; reflection allows very powerful operations). Custom operators can be added.

PLINQ – Running Queries on Multi-Core Processors – parallel implementation of LINQ.

SQL Server to LINQ: cluster computers run SQL Server. Partitioned tables live in local SQL Server DBs. A DryadLINQ process can use the “SQL to LINQ” provider – “best of both worlds”.

Continuing research: Application-level research (what can we do), system-level research (how can we improve performance), LINQDB?

Hadoop Summit: Kevin Beyer and JAQL…

Kevin Beyer is from IBM Almaden Research Center

JAQL (pronounced “jackal”) – Query Language for JSON

JSON: JavaScript Object Notation: simple, self-describing and designed for data
Why: want the complete entity in one place, support schemas that vary or evolve over time; standard: used in Web 2.0 applications, bindings available for many languages, etc.; not XML – XML was designed for document markup, not for data (hooray, thanks for saying THAT).

JAQL processes any data that can be interpreted as JSON (JSON text files, binary, CSV, etc.). Internally, JAQL processes binary data structures.

JAQL is similar to Pig Latin; the goal is to get JAQL accepted anywhere JSON might be used (document analytics, document management [CouchDB], ETL, mashups, …).

Immediate Goal: Hide the grungy details of writing map-reduce jobs for ETL and analysis workloads; compile JAQL queries into map/reduce jobs.

JAQL Goals: designed for JSON data; a functional query language (few side effects, not a scripting language – set-oriented, highly transformed); composable expressions; draws on other languages; operator plan -> query (rewrites are transforms within the language, any plan is representable in the language itself).

Core operations: iterations, grouping, joining, combining, sorting, projection, constructors for arrays, records, atomic values, “unnesting”, function definition/evaluation.

Some good examples were presented that are too long to type in…wait for the presentations to appear on-line I guess…sorry. Good stuff though, I am liking the language presented more than Pig Latin.

Implementation: JAQL input/output is designed for extensibility…basically reads/writes JSON values. Examples: Hadoop InputFormat and OutputFormat.

Roadmap: Another release is imminent, next release this summer (open the source, indexing support, native callout, server implementation with a REST interface).

Hadoop Summit – Christopher Olston and PIG

I saw Chris talk about Pig at MIT a few weeks ago…this looks like the same presentation..

Example: Tracking users who visit good pages (a PageRank-type thing). Typical operations: loading data, canonicalizing, a database JOIN-type operation, a database GROUP BY operation, leading to something which computes the page rank.

Note: the drawings within the presentation make the above very clear. Striping visits and pages across multiple servers, highly parallel processing…fairly straightforward approach.

But using just map/reduce: Write the join yourself (ccg – been there, done that; thanks Joydeep Sen Sarma for getting me started). Hadoop users tend to share code among themselves for doing JOINs, how best to do the join operation, etc. In short, things get ugly in a hurry – gluing map/reduce jobs together, doing a lot of low-level operations by hand, etc. The code is potentially hard to understand and maintain.
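To see why “write the join yourself” gets ugly, here is a heavily condensed sketch of the usual hand-rolled reduce-side join against the old org.apache.hadoop.mapred API: tag each record with its source, group on the join key, and cross the two sides in the reducer. The class names, input paths, and field layout are invented for illustration.

    // Hand-rolled reduce-side join sketch (not from the talk).
    // Assumes two tab-separated inputs whose paths contain "visits" and "pages",
    // each with the join key in the first column.
    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class HandRolledJoin {

      public static class TagMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private String tag;

        public void configure(JobConf job) {
          // Tag each record with its source file so the reducer can tell the sides apart.
          tag = job.get("map.input.file", "").contains("visits") ? "V" : "P";
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter rep) throws IOException {
          String[] fields = line.toString().split("\t", 2);
          if (fields.length < 2) return;            // skip malformed rows
          out.collect(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
      }

      public static class JoinReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter rep) throws IOException {
          List<String> visits = new ArrayList<String>();
          List<String> pages = new ArrayList<String>();
          while (values.hasNext()) {                // buffer both sides in memory -- already fragile
            String v = values.next().toString();
            (v.startsWith("V") ? visits : pages).add(v.substring(2));
          }
          for (String v : visits)                   // emit the cross product for this key
            for (String p : pages)
              out.collect(key, new Text(v + "\t" + p));
        }
      }
    }

And that is before the JobConf driver, output types, and the glue needed to chain several such jobs together – exactly the kind of boilerplate a built-in JOIN operator hides.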

So: A data flow language could easily synthesize map/reduce sequences

PIG accomplishes this using a dataflow language called Pig Latin. Essentially a terse language where each step is loading data, doing some sort of filtering/canonicalizing operation, or doing custom work (via map/reduce). Operators include: FILTER, FOREACH, GENERATE, GROUP, JOIN, COGROUP, UNION. There is also support for sorting, splitting, etc. The goal is a very simple language to do powerful things.

Related languages:
SQL – declarative language (i.e. what, not how).
Pig Latin: Sequence of simple steps – close to imperative/procedural programming, semantic order of operations is obvious, incremental construction, debug by viewing intermediate results, etc.

Map/Reduce: welds together primitives (process records -> create groups -> process groups)
Pig Latin: Map/Reduce is basically a special case of Pig, Pig adds built-in primitives for most-used transformations.

So: Is Pig+Hadoop a database system ? Not really…..

Workload: A DBMS does tons of stuff; P+H does bulk reads/writes only…just sequential scans
Data Representation: A DBMS controls the format…must predeclare schema, etc.; Pigs eat anything :-)
Programming Style: DBMS – system of constraints; P+H: sequence of steps
Custom Processing: DBMS: functions are second class to logic expressions; P+H: easy to extend

Coming Soon to Pig: Streaming (external executables), static type checking, error handling (partial evaluation), development environment (Eclipse).




Hadoop Summit: Doug Cutting and Eric Baldeschwieler

Hadoop Overview – Doug Cutting and Eric Baldeschwieler

Doug Cutting – pretty much the father of Hadoop – gave an overview of Hadoop history. An interesting comment was that Hadoop achieved web scale in early 2008…

Eric14 (Eric B…): Grid computing at Yahoo. 500M unique users per month, billions of interesting events per day. “Data analysis is the inner loop” at Yahoo.

Y’s vision and focus: On-demand shared access to a vast pool of resources, support for massively parallel execution, Data Intensive Super Computer (DISC), centrally provisioned and managed, service oriented…Y’s focus is not grid computing in terms of Globus, etc., and not focused on external usage à la Amazon EC2/S3. The biggest grid is about 2,000 nodes.

Open Source Stack: Commitment to open source development; Yahoo is an Apache Platinum Sponsor.

Tools used to implement Yahoo’s grid vision: Hadoop, Pig, Zookeeper (high-availability directory and config services), Simon (cluster and app monitoring).

Simon: Very early days, internal to Yahoo right now…similar to Ganglia “but more configurable”. Highly configurable aggregation system – gathering data from various nodes to produce (customizable?) reports.

HOD – Hadoop On Demand. The current Hadoop scheduler is FIFO – jobs will run in parallel only to the extent that the previous job doesn’t saturate the cluster. HOD is built on Torque (http://www.clusterresources.com/) to build virtual clusters, separate file systems, etc. Yahoo has taken HOD development about as far as they want…cleaning up code, etc. The future direction for Yahoo is to invest more heavily in the Hadoop scheduler. Does HOD disrupt data locality? Yup, it does. Good news: future Hadoop work will improve rack-locality handling significantly.

Hadoop, HOD, and Pig are all part of Apache today.

Multiple grids inside Yahoo: tens of thousands of nodes, hundreds of thousands of cores, TBs of memory, PBs of disk…ingests TBs of data daily.

M45 Project: Open academic cluster in collaboration with CMU: 500 nodes, 3TB RAM, 1.5PB disk, high bandwidth, located conveniently in a semi-truck trailer.

Open source project and Apache: Goal is for Hadoop to remain a viable open source project. Yahoo has invested heavily…very excited to see additional contributors and committers. “Yahoo is very proud of what we’ve done with Hadoop over the past few years.” Interesting metric: Megawatts of Hadoop

Hadoop Interest Growing: 400 people expressed interest in today’s conference, 28 organizations registered their Hadoop usage/cluster, it’s in use at universities on multiple continents, and Yahoo has now started hiring employees with Hadoop experience.

GET INVOLVED: Fix a bug, submit a test case, write some docs, help out!

Random notes: More than 50% of conference attendees are running Hadoop, many with grids of more than 20 nodes, and several with grids > 100 nodes.

Yahoo just announced collaboration with Computational Research Labs (CRL) in India to “jointly support cloud computing research”…CRL runs EKA – the 4th fastest supercomputer on the planet.


Hadoop Summit – 25-March-2008 8:45AM

Huge crowd, various Yahoo celebrities like JeremyZ and EricB, Doug Cutting floating around…it’s fun to put faces to names with the various mailing list participants. Looks like good network access and there are even plug strips lying around everywhere to plug in laptops. Looking forward to hearing some excellent presentations…

Hadoop Summit 25-March-2008

I’m packing my bags for the 6-1/2 hour flight from right coast to left for the Hadoop Summit next week. This promises to be a great event with a lot of good material on the agenda, plus the opportunity to chat with Hadoop contributors and application developers as well. I just wish it was longer than 1 day…

There are 215 registered attendees – which seems to be a larger number than Yahoo anticipated when the event was initially announced – so there is evidently significant interest in Hadoop despite map/reduce technology being declared “a major step backwards” by Mike Stonebraker and Dave DeWitt. Can’t we all just get along?

I anticipate having network access at the summit, so I’ll try to blog about the presentations and discussions throughout the day.

Compete acquired by Taylor Nelson Sofres

Compete, a company I worked for as Chief Software Architect, was just recently acquired by Taylor Nelson Sofres. This is a remarkable achievement for Compete and a great deal for both Compete and TNS. If you’re interested, you can read the official postings about the acquisition. Congratulations to everyone at Compete and TNS!

Learning about this acquisition made me feel very proud and quite nostalgic. It was back in April 2001 that I got involved with Compete. I had wrapped up a long contract at Oracle, spent a month in India with my wife visiting her extended family, and was back in Nashua, NH contemplating the irrational exuberance of what we would all soon call “dot bomb.” I had been contracting for a while, and while that was good fun I felt it was time to do something a bit more daring/challenging.

One thing led to another and I found myself in Boston, on oh-so-chic Newbury Street of all places, sitting in a literally transparent office on the upper floors of a converted church building, discussing a company called Compete with its CTO (and now my good friend) David Cancel. I wound up joining Compete a few weeks later as its Chief Software Architect, a position that I held for almost 5 years. I think I was the 11th or 12th person Compete hired. Compete rode the waves of dot com and dot bomb, and eventually found its way to become the strong player it is today, thanks to its excellent management team and a lot of hard work by everybody.

Fast forward 7 years and Compete is now a recognized industry force – an established player with a work force of close to 100 extremely talented and dedicated individuals. Expect more great things from this company in the future!

The Grid in My Basement, part 3: That Sinking Feeling

Size matters. At least when you are building rackmount machines. Of course were I not suffering from sleep deprivation when I made my hardware purchasing decisions I would have realized that you can’t put a MASSIVE heat sink into a tiny space, but such is life.

Anyway, the very spiffy Blue Orb II CPU cooler is never ever gonna fit in the 2U case I bought. That was evident by inspection before I even unpacked the coolers. Had I done my homework on the motherboard and case dimensions I would have realized that a package with a combined fan + heatsink height of 90.3mm would never fit. Not only that, but the heatsink has length and width dimensions of 140×140mm, which means it might not fit the motherboard at all. There’s a huge row of capacitors next to the retention module base, and the DIMM sockets are proximate on the other side. This is all badness from the perspective of installing a massive heatsink.

So with a heavy sigh I file for my first RMA from Newegg and package the Orbs for shipment back. Bummer drag, they looked so cool too. So I start looking for an appropriate K8 heatsink for my new nodes, and the fun really begins.

First, you may be wondering why I didn’t use the cooler that came with the CPU when I bought it. Well, in order to save money I bought the OEM version of everything I could. That eliminates a lot of unnecessary packaging, instruction manuals, and in some cases features – like the CPU cooler on my AMD X2s. So I need to buy a cooler on the aftermarket.

The assumption that manufacturers make is that you WILL overclock if you are buying an aftermarket cooler. Therefore, the heatsinks reflect this assumption and most are massive. Looking at heatsinks in close up is sorta like looking at big scary machinery. Pipes and tubes run in all directions, massive banks of fins jut out at weird angles and rise up toward the sky, towering over the motherboard. None of these devices are particularly well suited for the tight space of a 2U (or heaven help you a 1U) case.

I start shopping around for low-profile CPU coolers for 2U cases and run into several problems. First, there aren’t too many cooler vendors out there that make this stuff. Second, the ones that do aren’t terribly interested in Socket 939 applications. Third, the low-profile stuff tends to be crazy expensive – $95 for a low-profile heat sink and fan? No thanks…

So I pick up a ruler, open up the case, and start measuring. And measuring. After a good deal of plotting, I calculate that my heat sink can be no more than 70 x 70 x 65mm. And then I start shopping. And shopping.

Finally, after literally 2 evenings wasted googling around, I hit on a cooler/heatsink sold by ASUS – the same manufacturer that makes the motherboard I am using. I look at the height dimension and am psyched – the combined total of both devices is only 55mm tall! The bad news is that the heatsink runs 77 x 68 x 40mm – meaning that it’s too big potentially. I look on the ASUS website (sidebar your honor: remind me to rant and rant later about web sites that provide everything BUT the information you need) and find nothing helpful regarding compatibility with their own motherboards.

So I reason as follows: The height dimension will fit just fine; the heatsink will probably fit an ASUS motherboard since ASUS makes both; the absence of a compatibility list means it’s compatible with all their offerings or somebody is just lazy. So I bite the bullet and order up the ASUS Crux K8 MH7S 70mm Hydraulic CPU Cooling Fan with Heatsink and hope for the best.

2 days later I get the parts, and a couple days after that I open up the build-in-progress machine and install the heatsink. Have I mentioned how stressful putting a heatsink in can be? I mean there you are with all this expensive hardware that looks pretty darn fragile, and you are pushing down on it with no small amount of force trying to more-or-less permanently mate the CPU to the heatsink. Every time I do this I expect the motherboard to crack or something equally as awful.

Good news! The new cooler fits perfectly. It clears the lid of the case beautifully, and the dimensions of both heatsink and cooler are within the perimeter of the retention module.