Charles Engelke's Blog

February 9, 2011

Last Day at StrataConf

Filed under: Uncategorized — Charles Engelke @ 11:28 am

It’s been almost a week since StrataConf ended, but I’ve been busy recovering from the travel and catching up.  Before I forget too much about the last day, though, I want to get my notes down here.

The day opened with a bunch of short “keynotes” again, just like Wednesday, and they were of highly variable value (also just like Wednesday).  Ed Boyajian of EnterpriseDB presented nothing but straight marketing material, a commercial that I think influenced no one.  But DJ Patil of LinkedIn gave a very interesting talk focused on hiring extremely talented people and helping them do their best work, and Carol McCall of Tenzing Healthcare gave a not only interesting, but inspiring talk about how to start fixing the mess our country has made of healthcare (video here).

The day was shorter than Wednesday, but still pretty long, ending at about 6:00PM.  I felt the sessions were, overall, weaker this day than on Wednesday, but they closed extremely strong.  The panel on Predicting the Future, chaired by Drew Conway and with short talks from Christopher Ahlberg, Robert McGrew, and Rion Snow, followed by discussion, was fantastic.  The format of short talks to set the stage for the panel worked great.

All in all, StrataConf was eye opening to me.  I had very little background in using data these ways, and now I feel ready to explore much more deeply on my own.  Many of the presentations and some videos are available online, and they’re worth a look.  And if you ever get a chance to attend a talk by Drew Conway, Joseph Turian, or Hilary Mason, I recommend you take it.  They each have a lot of interesting things to say, and they’re very good at saying them.


February 3, 2011

Day One at StrataConf

Filed under: Uncategorized — Charles Engelke @ 3:24 pm

The schedule for day one was so packed, and continued until late at night, that I had no time to write anything as it happened.  And today, it’ll be just a quick recap.

The day started with keynotes.  These were the now standard O’Reilly conference “keynotes” consisting of 10-15 minute presentations, some intrinsically interesting, some little more than infomercials the speakers’ companies paid O’Reilly for.  I dislike the format (can you tell?), and would flat out boycott them but there are always a few moments of value in there.  And the first day keynotes were no exception:

  • Hilary Mason (who participated in the prior day’s Data Bootcamp) of bit.ly opened with a breezy, interesting talk.  It’s available online now, too.  Nothing very deep in only ten minutes, of course.
  • James Powell of Thomson Reuters gave a talk that wasn’t very interesting, but it was certainly okay.
  • Mark Madsen’s talk was fine and light.  I liked Hilary Mason’s a lot better.
  • Werner Vogels of Amazon gave a short informative talk about their web services.  It wasn’t a sales pitch; it was actually interesting for its content.
  • Zane Adams of Microsoft presented the most blatant commercial, including a video from Microsoft’s marketing group that was simply embarrassing.
  • That was followed by a panel discussion on “Delivering Big Data”.  There’s a video available.  I didn’t think much of the session; you can’t have a worthwhile panel in ten minutes.
  • The closing talk wasn’t announced ahead of time.  Anthony Goldbloom of Kaggle talked about the $3 million prize for producing a good model for predicting who will need to go to the hospital in the coming year.  They acted like this was an announcement of the prize, but it was publicized at least a few days before.

Overall, the keynotes simply weren’t worth the time they took.  The sessions later in the day were better.  I’m not going to talk about them all, just a few highlights (to my mind):

That’s a bit more than half the sessions I attended.  The others weren’t bad, but just not as useful or interesting to me as the ones above.  I’ll update this later today or tomorrow with links to the material as I get a chance.

February 1, 2011

Strata Data Bootcamp

Filed under: Uncategorized — Charles Engelke @ 4:18 pm

It’s day one at O’Reilly’s Strata conference, and it’s off to a bit of a rocky start.  I’m attending the all day Data Bootcamp tutorial, which was supposed to start at 9:00.  So I grabbed a muffin from the hotel, and a cup of tea from the conference, and went to get settled in at 8:30.  I figured I’d catch up on email while waiting.  Nope.  They told the local staff to not open the rooms until 8:45.  But the door doesn’t open at 8:45.  It actually opens a bit after 9:00, at which point there is a huge line and crush, with everybody trying to get a spot near a power strip since this is supposed to be an all-day hands-on tutorial and it turns out most rows of seats aren’t near power.  The session doesn’t start until about 9:15, because the sound equipment (and possibly video, too) isn’t working right.

Eventually we get started.  And the first thing they do is put up a slide telling us to download conference materials from a git repository (via the command git clone https://github.com/drewconway/strata_bootcamp.git).  The network is already hosed; the download comes at 22KB/s.  [Later: the download took almost an hour.]

Within the repository, the initial slides are at slides/intro/viz_intro.pdf.  It’s a nice way to lay the foundation and I recommend you take a look.  Drew Conway is speaking, and I liked his comment on the Afghanistan slide: “This is taking a complex thing – which is a war – and representing it as a complex thing.”  Which doesn’t aid understanding.  The philosophy should be:

  1. Make complex ideas simple
  2. Extract small info from big data
  3. Present truth, don’t deceive

We will do hands on work with R and Python.

Okay, we just had the first hands-on tutorial.  It did not go great.  Flashing code for people to run over four or five slides, advancing very quickly, is not the way for anyone to keep up.  I’m a fast typist, but could not keep up.  I eventually got most of it.  Of course, this is somewhere in the downloaded material from the Git repository, which finally finished downloading after nearly an hour.  Now that I have the slides, I could follow the tutorial examples much better next time.  I’ll be reviewing them after the conference, because the material seems very good, just not set up to follow in real-time.

On to the session on Image Data, given by Jake Hoffman.  The slides are at slides/image_data/image_data.pdf and sample code and data at /code/image_data.  The first question to us from the speaker is how many people regularly work with image data?  With text data?  No surprise, the text data users are a much larger group.  But image data isn’t that hard to work with, and is valuable, so we will learn about it.  Text data this afternoon.

It’s impossible to keep up with the code examples.  This tutorial is not structured clearly and simply enough to do so.  So I’m just going to run the sample code from the repository.  This is a failure for immediately learning how to do things, but I think it’s a success in learning what I want to go learn on my own.  There are interesting concepts that I’ll find useful, but I’m going to have to research and learn them on my own, not here today.  It’s not all new to me, though.  A speaker question: “how many here have worked with k-nearest neighbors?”  Very few hands up.  I realize that I should raise mine, because I used it for my CS master’s thesis – more than 25 years ago.  I had forgotten.  I am ancient.

Now for the lunch break.  And as we break, they point us to a download of just the slides and code, but I can’t see it on the screen so I don’t know what it is.

For the afternoon, we start on working with text data. Hilary Mason is speaking, and the slides are at slides/text_data/strata_bootcamp.pptx.

Our first example uses curl from the command line to start getting data from a web server.  I wrote a curl cheat-sheet post a while ago, and really like using it.  If you want to talk via HTTP and explore as you go, curl is the way to go.  The speaker also shows using Beautiful Soup and lynx to grab data.

Now to e-mail.  Exchange servers are really hard to work with, but nobody in the room will admit to using one.  Most people seem to use GMail or GMail for their own domain.  Others use POP and IMAP protocols, which are old, but widely available.  “IMAP sucks the least.”  And GMail supports it, too.  Hilary thanks Google for making GMail accessible with IMAP, an open, though perhaps old-fashioned, protocol.  Example code is in code/text_data/email_analysis, and the programs have a dummy account and password baked into them.  That account works today, but probably will be disabled after the workshop.  I didn’t want to risk my own account on an open network with it, but looking at the source I see that it is using SSL.
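
For anyone who wants to try the IMAP route without the workshop code, here is a minimal Python sketch using only the standard library’s imaplib and email modules.  It is not the code from code/text_data/email_analysis.  The host name imap.gmail.com, the function names, and whatever credentials you pass in are assumptions for illustration, and a provider may require extra setup (such as enabling IMAP access) before a login succeeds.

```python
import email
import imaplib
from email.header import decode_header, make_header


def summarize_message(raw_bytes):
    """Return (sender, subject) parsed from the raw bytes of one message."""
    msg = email.message_from_bytes(raw_bytes)
    # Subjects may be MIME-encoded; decode them into readable text.
    subject = str(make_header(decode_header(msg.get("Subject", ""))))
    return msg.get("From", ""), subject


def fetch_recent_headers(user, password, host="imap.gmail.com", limit=5):
    """Connect over SSL and return (sender, subject) for the newest messages.

    The default host is an assumption; use your own provider's IMAP server.
    """
    results = []
    with imaplib.IMAP4_SSL(host) as conn:      # encrypted connection
        conn.login(user, password)
        conn.select("INBOX", readonly=True)    # read-only: nothing is modified
        _status, data = conn.search(None, "ALL")
        for msg_id in data[0].split()[-limit:]:
            # BODY.PEEK fetches headers without marking the message as read.
            _status, parts = conn.fetch(msg_id, "(BODY.PEEK[HEADER])")
            results.append(summarize_message(parts[0][1]))
    return results


# Offline demonstration of the parsing step with a made-up message.
sample = b"From: alice@example.com\r\nSubject: Lunch?\r\n\r\nNoon works."
sender, subject = summarize_message(sample)
```

Only summarize_message runs in the demonstration; fetch_recent_headers needs a real account and a network connection, so it is defined but never called here.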

Hilary gives a nice example of Bayes’ Law.  Take a look at it in the slides.
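
If you don’t have the slides handy, here is a small worked example of the same rule in Python.  This is not Hilary’s example; the scenario (a word showing up in spam versus normal mail) and every number in it are made up for illustration.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' law: P(H | E) = P(E | H) * P(H) / P(E).

    P(E) is expanded over the two cases, H true and H false.
    """
    p_evidence = (p_evidence_if_true * prior
                  + p_evidence_if_false * (1.0 - prior))
    return p_evidence_if_true * prior / p_evidence


# Hypothetical numbers: 20% of mail is spam, a given word appears in
# 50% of spam messages and in 2% of legitimate ones.
p_spam_given_word = posterior(prior=0.2,
                              p_evidence_if_true=0.5,
                              p_evidence_if_false=0.02)
print(round(p_spam_given_word, 3))  # 0.862
```

With those invented inputs, seeing the word moves the estimate that a message is spam from 20% to about 86%.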

What about classifying email (or web pages)?  She gives an example of a cuil search for herself that’s a total disaster.  (cuil is long gone; I wrote a post about it and the poor job it did searching for me.)

Clean data > More data > Fancier math

We close this sub-session with running the various sample programs with various test data.  Hilary shows how easy it is to create your own “Priority Inbox” feature if you first star some important messages.  These general techniques work well here.  And a final challenge to us: write a script to figure out who you’re waiting for replies from, and remind them after a certain amount of time.
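
One possible starting point for that challenge, sketched in Python.  It assumes the messages have already been pulled out of the mailbox into simple records; the Message class, its field names, the awaiting_replies function, and the three-day threshold are all hypothetical choices for this sketch, not anything shown in the session.  It also only recognizes a reply if it carries an In-Reply-To pointing at the original message, and it stops at listing who is overdue rather than sending reminders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Message:
    message_id: str
    sender: str
    recipients: List[str]
    subject: str
    sent_at: datetime
    in_reply_to: Optional[str] = None  # Message-ID this one answers, if any


def awaiting_replies(messages, me, now, max_wait=timedelta(days=3)):
    """List (recipient, subject, days waiting) for my unanswered messages."""
    # Every (original message id, person who replied to it) pair we have seen.
    answered = {(m.in_reply_to, m.sender) for m in messages if m.in_reply_to}
    overdue = []
    for m in messages:
        # Skip mail I did not send, and mail too recent to chase.
        if m.sender != me or now - m.sent_at < max_wait:
            continue
        for recipient in m.recipients:
            if (m.message_id, recipient) not in answered:
                overdue.append((recipient, m.subject, (now - m.sent_at).days))
    return overdue


# Made-up mailbox for illustration.
me = "me@example.com"
mailbox = [
    Message("m1", me, ["alice@example.com", "bob@example.com"],
            "Budget", datetime(2011, 2, 1)),
    Message("m2", "alice@example.com", [me],
            "Re: Budget", datetime(2011, 2, 2), in_reply_to="m1"),
    Message("m3", me, ["carol@example.com"],
            "Slides", datetime(2011, 2, 8)),
]
overdue = awaiting_replies(mailbox, me, now=datetime(2011, 2, 9))
print(overdue)  # [('bob@example.com', 'Budget', 8)]
```

Alice answered and the note to Carol is only a day old, so Bob is the only one on the reminder list.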

Back from the afternoon break, a new topic: Big Data by Joseph Adler.  His slides are at slides/big_data/big data.pptx (there’s an embedded space there).

The first point: don’t jump to using big data techniques.  Small data techniques are easier, so use them unless you can’t. And when you can’t, try to do something to let you use small data techniques.  Shrink your data by using fewer variables or fewer observations.  Get a bigger computer.  If nothing works, then move to big data methods.

There’s a lot of discussion on statistically valid sampling techniques, so you can run your analyses on a very small subset of your total data, yet still get good answers.
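
As one illustration of the idea, here is a Python sketch of reservoir sampling, which draws a uniform random sample of fixed size from a stream too large to hold in memory.  I’m not claiming this is one of the techniques covered in the session; it is simply a well-known way to get a small, unbiased subset, and the function name and the sizes used are assumptions for the sketch.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable.

    Makes a single pass and keeps only k items in memory.
    If the stream has fewer than k items, all of them are returned.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:                   # happens with probability k / (i + 1)
                reservoir[j] = item
    return reservoir


# Estimate the mean of 100,000 values from a sample of 1,000.
rng = random.Random(0)                  # seeded so runs are repeatable
sample = reservoir_sample(range(100_000), 1_000, rng)
estimate = sum(sample) / len(sample)    # true mean is 49999.5
```

The estimate will not match the true mean exactly, but with a sample of this size it should land close to it, which is the point: the analysis runs on one percent of the data.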

Everything discussed in the Big Data session seems useful, but not particularly new or interesting to me.  Solid material, but it didn’t trigger a lot of new connections in my mind.

And now we will close with a mash-up example they put together, plus questions and answers.  Most of the panel is participating.

All in all, a worthwhile survey of the information.  Not really a bootcamp, and not really hands-on, though.
