Charles Engelke's Blog

February 1, 2011

Strata Data Bootcamp

Filed under: Uncategorized — Charles Engelke @ 4:18 pm

It’s day one at O’Reilly’s Strata conference, and it’s off to a bit of a rocky start.  I’m attending the all-day Data Bootcamp tutorial, which was supposed to start at 9:00.  So I grabbed a muffin from the hotel and a cup of tea from the conference, and went to get settled in at 8:30.  I figured I’d catch up on email while waiting.  Nope.  They told the local staff not to open the rooms until 8:45.  But the door doesn’t open at 8:45.  It actually opens a bit after 9:00, at which point there is a huge line and crush, with everybody trying to get a spot near a power strip, since this is supposed to be an all-day hands-on tutorial and it turns out most rows of seats aren’t near power.  The session doesn’t start until about 9:15, because the sound equipment (and possibly video, too) isn’t working right.

Eventually we get started.  And the first thing they do is put up a slide telling us to download the conference materials from a git repository via git clone.  The network is already hosed; the download comes in at 22KB/s.  [Later: the download took almost an hour.]

Within the repository, the initial slides are at slides/intro/viz_intro.pdf.  It’s a nice way to lay the foundation, and I recommend you take a look.  Drew Conway is speaking, and I liked his comment on the Afghanistan slide: “This is taking a complex thing – which is a war – and representing it as a complex thing,” which doesn’t aid understanding.  The philosophy should be:

  1. Make complex ideas simple
  2. Extract small info from big data
  3. Present truth, don’t deceive

We will do hands on work with R and Python.

Okay, we just had the first hands-on tutorial.  It did not go great.  Flashing code for people to run across four or five slides, advancing very quickly, is no way for anyone to keep up.  I’m a fast typist, but I could not keep up.  I eventually got most of it.  Of course, all of this is somewhere in the downloaded material from the Git repository, which finally finished downloading after nearly an hour.  Now that I have the slides, I could follow the tutorial examples much better next time.  I’ll be reviewing them after the conference, because the material seems very good, just not set up to follow in real time.

On to the session on Image Data, given by Jake Hoffman.  The slides are at slides/image_data/image_data.pdf and sample code and data at /code/image_data.  The speaker’s first question to us: how many people regularly work with image data?  With text data?  No surprise, the text data users are a much larger group.  But image data isn’t that hard to work with, and it’s valuable, so we will learn about it.  Text data comes this afternoon.

It’s impossible to keep up with the code examples.  This tutorial is not structured clearly and simply enough for that.  So I’m just going to run the sample code from the repository.  This is a failure for immediately learning how to do things, but I think it’s a success in learning what I want to go learn on my own.  There are interesting concepts that I’ll find useful, but I’m going to have to research and learn them on my own, not here today.  It’s not all new to me, though.  A speaker question: “How many here have worked with k-nearest neighbors?”  Very few hands go up.  I realize that I should raise mine, because I used it for my CS master’s thesis – more than 25 years ago.  I had forgotten.  I am ancient.
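For anyone who hasn’t seen it before, here’s a minimal sketch of what k-nearest neighbors does – this is my own toy example, not the tutorial’s sample code: classify a new point by majority vote among its k closest training points.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (squared Euclidean distance).
    `train` is a list of ((x, y), label) pairs."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two small clusters of labeled points.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (1, 1)))   # near the "a" cluster
print(knn_classify(train, (5, 5)))   # near the "b" cluster
```

Real uses (like image classification) swap in better distance functions and data structures, but the voting idea is the whole trick.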

Now for the lunch break.  And as we break, they point us to a download of just the slides and code, but I can’t see it on the screen so I don’t know what it is.

For the afternoon, we start on working with text data. Hilary Mason is speaking, and the slides are at slides/text_data/strata_bootcamp.pptx.

Our first example uses curl from the command line to start getting data from a web server.  I wrote a curl cheat-sheet post a while ago, and really like using it.  If you want to talk via HTTP and explore as you go, curl is the way to go.  The speaker also shows using Beautiful Soup and lynx to grab data.
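The same exploration works from Python’s standard library, which is where the Beautiful Soup examples end up anyway.  Here’s a self-contained sketch (my own, not from the workshop): it stands up a throwaway local server so there’s something to talk to, then fetches a page with urllib – roughly what `curl <url>` does at the command line.

```python
import http.server
import threading
import urllib.request

# A throwaway local server so the fetch below has something to talk to;
# against a real site you'd just pass its URL straight to urlopen (or curl).
class Hello(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>hello, strata</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
with urllib.request.urlopen(url) as resp:
    page = resp.read().decode()
print(page)
server.shutdown()
```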

Now to e-mail.  Exchange servers are really hard to work with, but nobody in the room will admit to using one.  Most people seem to use GMail, either directly or hosted for their own domain.  Others use the POP and IMAP protocols, which are old but widely available.  “IMAP sucks the least.”  And GMail supports it, too.  Hilary thanks Google for making GMail accessible with IMAP, an open, though perhaps old-fashioned, protocol.  Example code is in code/text_data/email_analysis, and the programs have a dummy account and password baked into them.  That account works today, but will probably be disabled after the workshop.  I didn’t want to risk my own account on an open network, but looking at the source I see that it is using SSL.
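The parsing half of this kind of analysis is all standard library.  Here’s a sketch of my own (not the workshop’s scripts): tally who sent each message using the email module, with the IMAP fetch shown only as a comment, since the server name and credentials there are placeholders.

```python
import imaplib  # used only in the commented fetch sketch below
from collections import Counter
from email import message_from_string
from email.utils import parseaddr

def count_senders(raw_messages):
    """Tally sender addresses, given raw RFC 822 message text."""
    senders = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        addr = parseaddr(msg.get("From", ""))[1]
        if addr:
            senders[addr] += 1
    return senders

# Fetching over IMAP looks roughly like this (placeholder account;
# IMAP4_SSL keeps the credentials off the wire, which is why I was
# relieved to see SSL in the workshop's source):
#   conn = imaplib.IMAP4_SSL("imap.gmail.com")
#   conn.login("dummy@example.com", "password")
#   conn.select("INBOX")

raw = ["From: Alice <alice@example.com>\r\nSubject: hi\r\n\r\nbody",
       "From: bob@example.com\r\nSubject: re: hi\r\n\r\nbody",
       "From: Alice <alice@example.com>\r\nSubject: again\r\n\r\nbody"]
print(count_senders(raw))
```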

Hilary gives a nice example of Bayes Law.  Take a look at it in the slides.
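Her example is in the slides; the shape of the calculation, with numbers I’ve made up for illustration, looks like this: the chance a message is spam given that it contains some word, from the word’s rates in spam and non-spam mail.

```python
# Bayes' Law with made-up numbers: how likely is a message to be spam,
# given that it contains the word "free"?
p_spam = 0.2                 # prior: 20% of mail is spam
p_word_given_spam = 0.5      # "free" appears in half of spam
p_word_given_ham = 0.05      # ...and in 5% of legitimate mail

# Total probability of seeing the word at all:
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' Law: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))   # → 0.714
```

A word five times rarer in real mail than in spam pushes a 20% prior up past 70% – which is why simple word counts get you surprisingly far in classification.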

What about classifying email (or web pages)?  She gives an example of a cuil search for herself that’s a total disaster.  (cuil is long gone; I wrote a post about it and the poor job it did searching for me.)

Clean data > More data > Fancier math

We close this sub-session with running the various sample programs with various test data.  Hilary shows how easy it is to create your own “Priority Inbox” feature if you first star some important messages.  These general techniques work well here.  And a final challenge to us: write a script to figure out who you’re waiting for replies from, and remind them after a certain amount of time.
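A start on that challenge might look like this – my own sketch, with synthetic data: compare the addresses you’ve written to against the addresses that have since replied, and flag anyone overdue.

```python
from datetime import datetime, timedelta

def awaiting_reply(sent, replies_from, now, wait=timedelta(days=3)):
    """Given (recipient, timestamp) pairs for sent mail and the set of
    addresses that have replied, return who is overdue for a nudge."""
    return [addr for addr, when in sent
            if addr not in replies_from and now - when >= wait]

now = datetime(2011, 2, 1)
sent = [("alice@example.com", datetime(2011, 1, 20)),
        ("bob@example.com", datetime(2011, 1, 31)),
        ("carol@example.com", datetime(2011, 1, 15))]
replies = {"carol@example.com"}   # carol got back to us

print(awaiting_reply(sent, replies, now))   # → ['alice@example.com']
```

Wiring this to a real mailbox means pulling the Sent folder and inbox over IMAP and matching on addresses (or better, message threads), but the core logic is just this set difference plus a time threshold.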

Back from the afternoon break, a new topic: Big Data by Joseph Adler.  His slides are at slides/big_data/big data.pptx (there’s an embedded space there).

The first point: don’t jump to using big data techniques.  Small data techniques are easier, so use them unless you can’t.  And when you can’t, try to do something that lets you use small data techniques: shrink your data by using fewer variables or fewer observations, or get a bigger computer.  If nothing works, then move to big data methods.

There’s a lot of discussion on statistically valid sampling techniques, so you can run your analyses on a very small subset of your total data, yet still get good answers.
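The idea fits in a few lines – this is my own illustration, not the session’s code: a statistic computed from a small simple random sample lands close to the same statistic over the full data.

```python
import random

random.seed(7)  # fixed seed so the demo is repeatable

# 100,000 synthetic observations, roughly normal around 100.
population = [random.gauss(100, 15) for _ in range(100_000)]

# A 1% simple random sample.
sample = random.sample(population, 1_000)

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(round(full_mean, 1), round(sample_mean, 1))
```

With 1,000 observations the standard error of the mean here is about 15/√1000 ≈ 0.5, so the sample mean sits within a point or so of the full mean while touching 1% of the data.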

Everything discussed in the Big Data session seems useful, but not particularly new or interesting to me.  Solid material, but it didn’t trigger a lot of new connections in my mind.

And now we will close with a mash-up example they put together, plus questions and answers.  Most of the panel is participating.

All in all, a worthwhile survey of the material.  Not really a bootcamp, and not really hands-on, though.

