Hadoop Summit Slides

A few weeks ago I went to California for the Hadoop Summit. I posted a bunch of notes in real-time during the summit until the network connection became too flakey to continue.

The Yahoo folks have come to the rescue however. The slides from the presentations, which are tons better than my notes, are freely available on-line here. There are also slides from the Data Intensive Computing Symposium which was held the next day.

I wish I had know about the Data Intensive symposium as it looks really really interesting (not to mention an excuse to stay in Califorinia one more day…).

Hadoop Summit: Internet connectivity..

The conference is rolling along – a lot of great information and good presentations all around. Unfortunately, there has been some network flakiness particularly during the afternoon…so I’ve stopped trying to blog each talk.

I’ll try to summarize some of the more interesting points later…

Hadoop Summit: Bryan Duxberry and HBase

Rapleaf is a people search, profile aggregation, and Data API

Application: custom Ruby web crawler – indexes structured data from profiles. Currently, index page once and then gone forever:

Internet -> Web Crawler ->Index Processing -> MySQL

Using HBase to store pages, HBase via REST servlet. Allows: reindexing at a later date, better factored workflow, easier debugging:

Internet -> Web Crawler -> HBase -> Index Processing -> MySQL Database

Data similar to webtable: keyed on person no URL, webtable less structured, rapleaf table not focused on links.

Schema: 2 column families: content (stores search and profile pages), meta (stores info about the retrieval [when, how long, who it’s about]). Keys are tuples of (person ID, site ID)

Cluster specs: HDFS/HBase cluster of 16 machines, 2TB disk, 64 cores, 64G of memory

Load pattern: Approx 3.6T per month (830G compressed), average row size 64K, 14K gzipped
Performance: average write time 31 seconds (yikes! accounts for ruby time, base64, sending to servlet, writing to HBase, etc.)…median write time is 0.201 seconds. Max write time 359 seconds. Reads not used much at the moment. Note some of these perf. issues are due to Ruby on GreenThreads instead of native threads, etc….haven’t profiled the REST servlet either.

General Observations: Record compression hurt performance on testing; compressing in client gives a big boost; possible addition to standard HBase client. HBase write logs stored in HDFS which don’t exist until closed, which means HBase has durability issues. (this will be resolved by Hadoop-1700).

Hadoop Summit: Michael Stack and HBase

HBase: Distributed database modelled on B igTable (Google)

Runs on top of Hadoop core

Column-store: Wide tables cost only the data sotred, NULLs in rows are “free”, columns compress well

Not a SQL database: no joins, no transactions, no column typing, no sophisticated query engine, no SQL, ODBC, etc.

Use HBase: Scale is large; access can be basic and mostly table scans.

The canonical use case is web-table: table of web crawls keyed by URL, columns of webpage, parse, attributes, etc.

Data Model: Table of rows X columns; timestamp is 3rd dimension

Cel uninterpretted byte arrays

Rows can be any Text value, e.g. a URL: rows are ordered lexicographically, row writes are atomic

Columns grouped into column families: have prefix and a qualifier (attribute: mimetype, attribute: language)

Members of column family group have similar charcter/access

Implementation: Client, a Master, and one or more region servers (analagous to slaves in Hadoop).
Cluster carries 0->N Tables
Master assigns table regions to region servers
region server vcarries 0->N regions: region server keeps write-ahead log of every update; used recovering lost regionservers; edits first go to WAL, then to Region
Each region is made of MemCache and Stores
Row CRUD: client initially goes to Master for row loaction; client caches and goes direct to region server therafter; on fault returns to master to freshen cache
When regions get too big they are split: regionserver manages split, master is informed parent is off-lined, new daughters deployed.
All data persisted to HDFS

Connecting: java is first -class client; Non-Java clients: thrift server hosting hbase client instance (ruby, c++, java). Also REST server hosts hbase client (ruby gem, Active Record via RESt).; SQL-like shell (HQL); TableInput/OutputFormat for map/reduce

History: 2006 Powerset interest in Bigtable; 02/2007 Mike Cafarella provides initial code drop – cleaned up by powerset and added as Hadoop contrib; First usable HBase in Hadoop 0.15.0; 01/2008 HBase subprojects of of Hadoop; Hadoop 0.16.0 incorporates into code base

HBase 0.1.0 release candidate is effectively Hadoop 0.16.0 with logs of fixes…HBase now stands outside of Hadoop contrib.

Focus on developing use/developer base: 3 committers, tech support (other people’s clusters via VPN), 2 User Group meetings – more to follow; working on documentation and ease-of-use.

Performance: Performance good and improving

Known users: powerset and rapleaf, worldlingo, wikia

Near Future: release 0.2.9 to follow release of Hadoop 017.0, theme robustness and scalability – rebalancing of regions of cluster, replace HQL with jirb, jython, or beanshell, repair and debugging tools.

HBase and HDFS: No appends in HDFS: data loss…WAL is useless without it, Hadoop-1700 making progress

HBase usage pattern is not same as Map/Reduce – random reading and keep files open raises “too many open files” issues

HBase or HDFS errors: Hard to figure without debug logging enabled

Visit hbase.org: mailing lists, source, etc.

Hadoop Summit: Ben Reed and Zookeeper

Zookeeper: General, robust coordination service…

Observations: Distributed systems always need some form of coordination. Programmers cannot use locks correctly (note that google uses “chubby” lock service): you can learn locks in 5 minutes, but spend a lifetime learning how to use them properly; Message based coodination can be hard to use in some apps.

Wants: Simple, robust, good performane; tuned for read-dominant workloads; familiar models and interfaces, wait-free: failed client wil not interfere with requests of a fast client; need to be able to wait efficiently.

Design starting point: start with File API and strip out what we don’t need: partial writes/reads, name. Add what is needed: ordered upates and strong persistance guarantees, conditional updates, watched for data changes, ephemeral nodes (client create files, if client goes away the file goes away), generated file names (i.e. mktemp()).

Data Model: Hierarchical namespace, each znode has data and children, data is read and written in its entirety

Zookeeper API: create(), delete), setData(), getData(), exists(), getChildren(), sync()…also some security APIs (“but nobody cares about security”…). getChildren() is like “ls”.

Create flags: epheeral – the znode gets deleted with the session that created it times out; sequence the path name will have a monotonically increasing counter relative to the parent appended.

How Zookeeper works: Service made up of a set of machines. Each server stores a full copy of data in-memory….gives very low latency and high throughput. A leader is elected at startup, the rest of the servers become followers. Followers service clients, all updates go through leader. Update responses are sent when a majority of servers agree.

Example of Zookeeper service: Hadoop on Demand. I got distracted by something during this part of the presentation and didn’t get all the basic pieces here, but this uses Zookeeper to track interaction with Torque, and handles graceful shutdown, etc.

Status: Project started in 2006, prototyped in fall 2006, initial implementation March 2007, code moved to zookeeper.sourceforge.net and Apache License November 2007. Everything is pure Java: quorum version and standalone version. Clients are Java and C clients.

Hadoop Summit: Andy Konwinski & X-trace

Monitoring Hadoop using X-trace: Andy Konwinski is from UC Berkeley RAD Lab

Motivation: Hadoop style processing masks failures and performance problems

Objectives: Help Hadoop developers debug and profile Hadoop; help operators monitor and optimize Map/Reduce jobs

[CCG: This is cool, we REALLY need this stuf NOW]

RAD Lab: Reliable Adaptive Distributed Systems

Approach: Instrument Hadoop using X-Trace framework. Trace Analysis: virtualizatioin via web-based UI; statistical analysis and anomaly detection

So what’s X-Trace: Path-based tracing framework; generate event graph to capture causality of events across network (RPC, HTTP, etc.). Annotate message with trace metadata carried along execution path (instrument protocol APIs and RPC libraries). Within Hadop the RCPC library has been instrumented.

DFS Write Message Flow Example: client -> DataNode1 -> DataNode2 -> DataNode3
Report Graph Node: Report Label, Trace ID#, Report ID3, Hostname, timestamp

builds a graph which can then be walked.

[CCG: again you need to see the picture, but you get the idea right?]

Andy showed some cool graphs representing map/reduce operations in flight…

Properties of Path Tracing: Deterministic causality and concurrence; control over which events get traced; cross-layer; low-overhead; modest modification of the app source code ( <>

Architecture: Xtrace front ends on each Hadoop node, communicates with Xtrace backend via TCP/IP, backend stores data using BDB, trace analysis web UI communicats with Xtrace backend. Also cool fault detection programs can be run – interact with backend via HTTP.

Trace Analysis UI: “we have pretty pictures which can tell you a lot”: perf stats, graphcs of utilization, critical path analysis…

Applications of Trace Analysis: Examined perf of Apache nutch web indexing engine oin a Wikipedia cral. Time to create an inverted link index of a 50G crawl – with default configuration, ran in 2 hours.

Then by using trace analysis, was able to make changes to run same workload in 7 minutes. Used workload analysis to determine that one single reduce task which actually fails several times at the beginning: 3 ten minute timeouts. Bumped max reducers to 2 per node, and dropped execution time to 7 minutes.

Behold the power of pretty pictures!

Off-line machine learning: faulty machine detection, buggy software detection. Current work oin graph processing and analysis.

Future work: Tracing more production map/reduce applications. More advanced trace processing tools, migrating code into Hadoop codebase.

Hadoop Summit: Michael Isard and DryadLINQ

Michael Isard is from Microsoft Research:

“Dryad: Beyond Map-Reduce”

What is Map-Reduce: An implementation, a computational model, a programming model.

Implementation Performance: Literally map, reduce, and that’s it…reducers write to replicated storage. Complex jobs require pipeline multipl stages….no fault tolerance between stages. Output of Reduce: 2 network copies, 3 disks

Computational Model: Join combines inputs of diff types, “split” produced outputs of different types. This can be done with map-reduce, but leads to ugly programs. Hard to avoid performance penalty described above. Some merge-joins are very expensive. Finally, baking in more operators adds complexity.

Dryad Middleware Layer: Address flexibility and performance issues, more generalized than map-reduce, interface is more complex.

Computational Model: Job is a DAG, each node takes any number of inputs and produces any number of outputs (you need to see the picture).

DAG Abstraction Layer: Scheduler handles arbitrary graphs independent of vertext semantics, simple uniform state machine for scheduling and fault-tolerance. Higher levels build plan from application code: Layers isolated, many optimizations are simple graph manipulations, graph can be modified at runtime.

MapReduce Programming Model: opaque pairs flexible. Front-ends like sawzall and pig help, but domain specific simplifications limit some applications.

Moving beyond simple data-mining to machine learning, etc.

LINQ: Extensions to .Net in Visual Studio in 2008…general purpose data-parallel programming constructs. Data elements are arbitrary .NET types, combined in generalized framework

DryadLINQ: Automagically distribute a LINQ program; some Dryad-specific extensions: same source program runs on single-core to multi-core to cluster. Execution model depends on data source.

LINQ designed to be extensible, LINQ+C# provides parsing, thype checking. LINQ builds expressioin tree. Root provider class called on evaluation (has access to entire tree, reflection allows very powerful operations). Can add custom operators.

PLINQ – Running Queries on Multi-Core Processors – parallel implementation of LINQ.

SQL server to LINQ: cluster computers run run SQL Server. Partitioned tables in local SQLServer DBs. DryadLINQ process can use “SQL to LINQ” provider – “best of both worlds”.

Continuing research:L Applicatioinlevel research (what can we do), system level research (how can we improve performance), LINQDB?

Hadoop Summit: Kevin Beyer and JAQL…

Kevin Beyer is from IBM Almaden Research Center

JAQL (pronounced “jackel”) – Query Language for JSON

JSON: JavaScript Object Notation: simple, self-describing and designed for data
Why: want complete entity in one place, support schema that vary or evolve over time; standard: used in web 2.0 applications, bindings available for many languages, etc.; not XML – XML designed for doc markup, not for data (hooray, thanks for saying THAT).

JAQL processes any data that can be interpreted as JSON (JSON text files, binary, CSV, etc.). Internally, JAQL processees binary data-structures.

JAQL similar to Pig Latin, goal is to get JAQL accepted anywhere JSON might be used (document analytics, doc management [couchDB], ETL, Mashups,….

Immediate Goal: Hide grungy details of writin map-reduce jobs for ETL and analysis workloads; compile JAQL queries into map/reduce jobs.

JAQL Goals: designed for JSON data, functional query language (few side effects, not a scripting language – set-oriented, highly transformed) , composable expressions, draws on other languages, operator plan -> query (rewrites are transforms within the language, any plan is representabke in the language itself).

Core operations: iterations, grouping, joining, combining, sorting, projection, constructors for arrays, records, atomic values, “unnesting”, function definition/evaluation.

Some good examples were presented that are too long to type in…wait for the presentations to appear on-line I guess…sorry. Good stuff though, I am liking the language presented more than Pig Latin.

ImplementationL JAQL input/output designed for extensibility…basically reads/writes JSON values. Examples: Hadoop InputFormat and OutputFormat.

Roadmap: Another release is imminent, next release this summer (open the source, indexing support, native callout, server implementation with a REST interface).

Hadoop Summit – Christopher Olsten and PIG

I saw Chris talk about Pig at MIT a few weeks ago…this looks like the same presentation..

Example: Tracking users who visit good pages (pagerank type thing). Typical operations: loading data, canonicalizing, database JOIN-type operation, database GROUP BY operation, leading to a something which computes the pagerank.

note: the drawings within the presentation make the above very clear. Striping visits and pages across multipe servers, highly parallel processing..fairly straightforward approach.

But using just map/reduce: Write the join yourself (ccg – been there done that, thanks Joydeep Sen Sharma for getting me started). Hadoop users tend to share code among themselves for doing JOINs, and how best to do the join operation, etc. In short, things get ugly in a hurry – gluing map/reduce jobs together, etc. You have to do a lot of low-level operations by hand, etc. It’s potentially hard to understand and maintain code.

So: A data flow language could easily synthesize map/reduce sequences

PIG accomplishes this, using a dataflow language called Pig Latin. Essentially a terse language where each step is loading data, doing some sort of filtering/canonical operation, or doing custom work (via map/reduce). Operators includ: FILTER, FOREACH, GENERATE, GROUP, JOIN, COGROUP, UNION. Also support for sorting, splitting, etc. Goal is a very simple language to do powerful things.

Related languages:
SQL – declarative language (i.e. what, not how).
Pig Latin: Sequence of simple steps – close to imperative/procedural programming, semantic order of operations is obvious, incremental construction, debug by viewing intermediate results, etc.

Map/Reduce: welds together primitives (process records – > create groups -> process groups)
Pig Latin: Map/Reduce is basically a special case of Pig, Pig adds built-in primitives for most-used transformations.

So: Is Pig+Hadoop a database system ? Not really…..

Workload: DBMS does tons of stuff, P+H does bulk reads/writes only…just sequential scans
Data Representation: DBMS controls format…must predeclare schema, etc, Pigs eat anything :-)
Programming Style: DMBS – system of constraints, P+H: sequence of steps
Custom Processing: DBMS: functions second class to logic expressions, P+H: Easy to extend

Coming Soon to Pig: Streaming (external executables), static type checking, error handling (partial evaluation), development environment (Eclipse).

Hadoop Summit: Doug Cutting and Eric Baldeschweiler

Hadoop Overview – Doug Cutting and Eric Baldeschwieler

Doug Cutting – pretty much the father of Hadoop gave an overview of Hadoop history. Interesting comment was that Hadoop has achieved web-scale in early 2008…

Eric14 (Eric B…): Grid computing at Yahoo. 500M unique users per month, billions of interesting events per day. ‘Data analysis is the inner loop” at Yahoo.

Y’s vision and focus: On-demand shared access to vast pool of resources, support for massively parallel execution. , Data Intensive Super Computer (DISC), centrally provisioned and managed, service oriented…Y’s focus is not grid computing in in terms of Globus, etc., not focused on external usage ala Amazon EC2/S3. Biggest grid is about 2,000 nodes.

Open Source Stack: Commitment to open source developent, Yahoo is an Apache Platinum Sponsor

Tools used to implement Yahoo’s grid vision: Hadoop, Pig, Zookeeper (high avail directory and config sevices), Simon (cluster and app monitoring).

Simon: Very early days, internal to Yahoo right now…similar to Ganglia “but more configurable”. Highly configurable aggregation system – gathering data from various nodes to produce (customizable?) reports.

HOD – Hadoop On Demand. Current Hadoop scheduler currently FIFO – jobs will run in parallel to the extent that the previous job doesn’t saturate the node. HOG is built on Torque (http://www.clusterresources.com/) to build virtual clusters, separate file systems, etc. Yahoo has taken development about as far as they want…cleaning up code, etc. Future direction for Yahoo is to invest more heavily in the Hadoop scheduler. Does HOG disrupt data locality – yup, it does. Good news: Future Hadoop work will improve rack locality handling significantly.

Hadoop, HOD, Pig all part of Apache today,

Multiple grids inside Yahoo: tens of thousands of nodes, hundreds of thousands of cores, TBs of memory, PBs of disk…ingests TBs of data daily.

M45 Project: Open Academic Cluster in collaboration with CMU: 500 nodes, 3TB RAM, 1.5P disk, high bandwith located conveniently in a semi-truck trailer

Open source project and Apache: Goal is for Hadoop to remain a viable open source project. Yahoo has invested heavily…very excited to see additional contributors and commiters. “Yahoo is very proud of what we’ve done with Hadoop over the past few years.” Interesting metric: Megawatts of Hadoop

Hadoop Interest Growing: 400 people expressed interest in today’s conference, 28 organizations registered their Hadoop usage/cluster, in use in universities on multple continents, Y is now started hiring employees with Hadoop experience.

GET INVOLVED: Fix a bug, submit a test case, write some docs, help out!

Random notes: More than 50% conf attendees running Hadoop, many with grids more than 20 nodes, and several with grids > 100 nodes.

Yahoo just announced collaboration with Computational Research Labs (CRL) in India to “jointly support cloud computing research”…CRL runs EKA – the 4th fastest supercomputer on the planet.