Charles Engelke’s Blog

February 19, 2011

Mediterranean Vacation Pics – Italy

Filed under: Uncategorized — Charles Engelke @ 9:01 pm

More catching up on organizing older vacation photos.  Today I started on the pictures from the Mediterranean cruise we took in late 2009.  The cruise went from Rome to Athens, visiting many Greek and Turkish ports, Cyprus, and a full day in Egypt to see the pyramids.  We spent an extra day in Rome before the cruise, and several days in Athens afterwards.

Digital cameras with large memory cards sure have changed vacation snapshots.  Laurie and I apparently took almost 4000 photos over three weeks, so it’s going to take some time to select the reasonably decent ones.  Today I sifted through the shots for Rome, Pompeii, and cruising by Stromboli.

In Rome for a day, we walked through the city by the Trevi Fountain and into the Pantheon, perhaps my favorite building:

[Photo: Interior of the Pantheon]

Most of the day we spent walking through the ancient Forum:

[Photo: The ancient Roman Forum]

The cruise’s first port was Sorrento, and we went on a tour to Pompeii:

[Photo: Pompeii]

We had a day at sea on our way to Greece and Turkey, and cruised past Stromboli, one of Italy’s few active volcanoes, just north of Sicily:

[Photo: Stromboli volcano]

February 16, 2011

IE9 and Web Apps

Filed under: Uncategorized — Charles Engelke @ 12:15 pm

Yesterday, Paul Rouget, a Mozilla tech evangelist, wrote a blog entry stating that IE9 is not a “modern browser”.  Not long after that, Ed Bott tweeted that the post was “surprisingly shrill”.  Several folks (including me) responded that the post made important points, and Bott asked for specific examples of real web sites that used the HTML5 features that IE9 is missing.  (I’m using “HTML5” to refer not only to the language itself, but also to the new APIs related to it.)

That’s hard to do, especially in a tweet.  If the most widely used web browser doesn’t support these features, even in its upcoming release, how many mainstream sites can use them?  They’ve been added to the HTML5 spec because there are strong use cases for them, and once users have browsers that support them, sites can start taking advantage of them.  Of course, there are some sites that use these features, but Bott specifically said he didn’t want to hear about pilots or demos, which excludes a lot of them.

There’s a chicken-and-egg problem here.  We can’t make heavy use of HTML5 features in web sites unless web browsers support them, and Ed Bott seems to be saying that the upcoming IE9 doesn’t need to support them because they aren’t yet widely used.  That kind of problem is part of what stalled HTML and browser advances ten years ago.  The WHATWG didn’t accept that, and pushed for what became HTML5.  I think that Google was a major help because it had the resources to improve browsers (first with the non-standard Gears plug-in, later with their standards-based Chrome web browser) in order to be able to develop more sophisticated web applications.  Their experimental Chrome OS devices like the Cr-48 show that Google is still very interested in the idea that the browser can be an application platform, not just a viewer of web sites.

For me, IE9 is most disappointing because it fails to implement many key HTML5 features that are essential to building good web apps.  (I use “web apps” to mean platform independent applications that live and run inside a modern browser, including many mobile browsers.)  Yes, IE9 makes a lot of advances and I appreciate them all, but some of what it leaves out is essential and does not seem nearly as hard to implement as some of what they included.  Consider some use cases that I actually encounter.

In a traditional web browser no data persists in the browser between separate visits to a web page.  If I want to start working on something in my web browser and then finish it later, the browser has to send it to a server to remember it, and when I revisit the page in the future it has to fetch that information back from the server.  But what if I don’t want to disclose that information to the server yet?  Maybe I’m preparing a tax form, and I don’t want to give a third party a history of every change I make as I fill it out; I just want to submit the final, filled-out form.  In a traditional web browser I can only do that if I perform all the work during a single page visit.

If only the browser could store the data I enter within the browser, so I could come back and work on the form over multiple visits without ever disclosing my work in progress.  Actually, HTML5 (and related technologies) lets you do that.  Web storage (including local storage and session storage), indexed database, and the file system API can each meet that need.  (So can web SQL databases, but that approach will likely not be in any final standard.)  Of these solutions, only web storage is widely available today.  It’s on all major current browsers, including IE8 and IE9.  Good for IE.

Now, suppose I want to work on my tax form and I don’t have an internet connection.  The data I need is in my browser, so shouldn’t I be able to do this?  If my web browser supports the application cache, I can.  Every other major web browser supports it, and most have for several versions now.  Not only does IE8 fail to support it, so does IE9.  If I try to work on my tax form in IE9 I’ll just get an error message that the page is unavailable.  Even though all the functionality of my tax form program lives inside the web browser, I can’t get to it unless the server is reachable.  That’s a problem for an app.  This is my biggest disappointment with IE9, especially since the application cache seems like a pretty easy extension of the caching all web browsers, including IE, already do.

But you might ask, so what?  This is a web app, and it’s not that big a problem if it only works when the server can be reached.  After all, it’s going to have to talk to that server sooner or later in order to submit the tax form.  But let’s switch to a different use case.  Suppose I want to do some photo editing.  The HTML5 canvas API gives me a lot of ways to do that.  I gave some talks last summer on HTML5 techniques and built an application that could resize photos and convert color photos to black and white or sepia tone.  The whole example took less than an hour to do.  This is an application that doesn’t ever need to talk to a server except for the initial installation.  It’s something that I could use on my machine with any modern web browser, so I can write it once and use it everywhere.  There are two big challenges for this application, though: getting photos into the code in my browser, and then getting the edited photos back out.

There’s no way to do that in an old-fashioned web browser.  If I’ve got a binary file on my PC and want to get it to the code in the browser, I have to use a form to upload that file to a server.  My browser code can then fetch it back from the server.  It goes through the browser to get to the server, but is inaccessible to code running inside the browser.  With the HTML5 File API, I no longer have that restriction.  I can select a file with a form and the code in the browser can directly read that file instead of sending it to the server.  That’s how I get a photo into my application.  Every current major browser supports the File API except for IE and Opera.  And Opera might add it in their next version (they haven’t said), but IE9 won’t have it.

Once I’ve edited the photo I need to get it back out.  What I need is either an img element (so the user can right-click and choose to save the image) or a simple link that the user can click to download the image.  The problem here is that for either of these methods to work, the photo has to be in a resource with a URL.  How do I get it there?  In an old-fashioned web browser, the code in the browser would send it to a server, which would save it and make it accessible at some specific URL.  Once again, my browser ends up having to send something to a server so that the browser code and browser user can share something.  With a Data URL, I can create a resource with a URL inside the browser so that no server is needed.  Data URLs are a lot older than HTML5 and have long been supported in all major browsers.  However, until recently IE limited their size so much as to make them not very useful.  IE9 does allow large Data URLs, though.  Again, good for IE9.

So, for these use cases we need four key technologies: persistent storage in the browser, offline access, reading files, and creating resources and URLs for them in the browser.  Every modern web browser supports all of them (assuming the next version of Opera adds the File API).  IE9 supports only half of them, and can’t serve either use case.

That’s one reason we should not consider IE9 to be a “modern browser”.

February 13, 2011

Alaska Cruise Pictures

Filed under: Uncategorized — Charles Engelke @ 7:05 pm

Last weekend I did our taxes.  This weekend I organized photos from the Alaska cruise we took in July and August 2009 and posted selected ones on my Picasa web albums page.

[Photo: Morning in Skagway]

They’re organized by port; we visited Ketchikan, Skagway, Valdez, Seward, Kodiak, Hoonah, and Juneau.  There are also photos from our day in Glacier Bay, and from Princess Cruises’ Chef’s Table dinner during a day at sea.

[Photo: Overlooking Glacier Bay]

One truly bizarre thing about this cruise was that Laurie and I were among the most active folks on the ship.  We went ziplining in Ketchikan and Juneau:

[Photo: Zip lining near Ketchikan]

Rock climbing near Skagway:

[Photo: Rock climbing near Skagway]

And hiked on a glacier near Valdez:

[Photo: Glacier hike near Valdez]

There was one other passenger along with us for one ziplining outing, one couple for the other zipline, and that passenger and couple for the rock climbing.  The glacier hike was better attended, though.

February 9, 2011

Source Control Basics, by Example

Filed under: Uncategorized — Charles Engelke @ 3:52 pm

Many non-developers understand the value of source code and realize that a source control system such as Subversion is extremely important, but don’t really understand how it should be used.  To a lot of people, it’s just a safe used to lock up this important asset.  But really, it’s a much more valuable tool than just a safe.  I’m going to try to describe, by example, how it can be used to aid release management, support, and maintenance of products.  These examples use Subversion, but the general principles apply to most source control systems.

Core principles

Subversion doesn’t manage each file independently; it works on an entire directory tree of files at a time.  That’s a good match for source code.  If you start with an empty Subversion repository, you can check it out to a working copy on your own computer, and then start adding your source files and directories to that working copy.

  • repository: the area on a Subversion server where every version of your source code directory tree is stored.
  • working copy: a local folder on your computer where the version of the source code you are working on is kept.

Whenever you want, you can commit your working copy to the repository.  In effect, Subversion stores a snapshot of your source code forever.  You can get a log showing every version that was ever committed, and you can check out a working copy of any version you want, at any time.

  • commit: make the Subversion server keep a snapshot of the source code that matches your current working copy.
  • check out: create a new working copy from any desired snapshot that Subversion has available.  Usually this is based on the latest snapshot, but doesn’t have to be.

Subversion simply numbers each version, or revision, sequentially, so you’ll see versions 1, 2, 3, and so on.  I recently noticed that one of our six-year-old projects is up to revision twelve thousand and something.  Since six years is roughly 12,000 business hours, that means that on average a new snapshot was saved about once each business hour over the life of the project.

Before I move on, there are two more points to mention.  First, you don’t have to check out and commit the whole repository at a time.  You can work with any subdirectory you want.  That’s good for dividing up different kinds of work in a project that have little interaction, and it enables the management techniques I’ll be talking about in a minute.  Second, you can’t really commit “whenever you want”.  You can only commit if nobody else has changed the same files you changed since your last checkout.  Otherwise, you need to update your working copy with their changes first, and possibly manually resolve any conflicts between your changes and the other folks’ changes.  That sounds like a potential problem to a lot of people (including me), but in practice it works great.
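
To make that core cycle concrete before moving on, here is a rough sketch of what it looks like with Subversion’s command-line client.  The server URL and file name are made up for illustration:

    # Get a working copy, make some changes, and store a new snapshot
    svn checkout https://svn.example.com/repos/myproject my-working-copy
    cd my-working-copy
    # ...edit existing files, or create new ones...
    svn add src/new-module.c          # tell Subversion about a brand-new file
    svn commit -m "Describe what changed and why"
    svn log                           # list every snapshot committed so far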

Handling a Release

When you’re ready for a release, all you need to do is note the revision number you’re building and packaging from.  That way, if you need to get that exact code back for support or maintenance, it’s extremely easy.  But it could be even easier.  Since you can work on subdirectories of your repository instead of the entire thing, just structure it a bit differently.  Don’t put your source code at the repository root, but in a subdirectory.  That subdirectory is conventionally called the trunk.  To do this, when you first create the repository, immediately create a subdirectory called trunk.  Then instead of ever checking out the whole repository, just check out the trunk subdirectory.

The advantage of this is that you can now create a directory sibling to trunk, which will contain copies of all your releases.  By convention, this directory is called tags.  When you are ready to release your code, you copy the entire trunk directory tree to a new child of the tags directory.  Let’s say this release is going to be 2.1beta2.  Then your repository will look something like:

Repository
   |
   +--trunk
   |    |
   |    +--your latest source tree
   |
   +--tags
        |
        +--2.1beta2
              |
              +--snapshot of trunk contents at time of release
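
Creating that tag doesn’t even require a working copy; it is a single, cheap server-side copy.  Roughly, with a made-up repository URL:

    svn copy https://svn.example.com/repos/myproject/trunk \
             https://svn.example.com/repos/myproject/tags/2.1beta2 \
             -m "Tag the 2.1beta2 release"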

Don’t worry about the storage needed to keep this new copy.  Remember, Subversion already needs to keep track of every version of your source tree, and it’s smart enough to store this new “copy” of a snapshot using almost no actual storage.  But even if it needed to use up enough space for a whole new copy, it would be worth it.  Storage is plentiful, and anything that helps you manage the history of your product’s source is priceless.

  • trunk: the subdirectory of your repository containing the current version of your source code (and every prior version, too).
  • tags: the subdirectory that contains other subdirectories, each of which is a copy of a particular version of the trunk.  Each subdirectory should have a meaningful name, and should never be updated (Subversion allows you to check out and update tags, but you should not do it).

Software Maintenance

Everything up to now is useful, important, well-known and widely followed.  But the next step, using source control for more effective software maintenance, seems to be less used, even among seasoned developers I’ve observed.  That’s a shame, because it’s easy to do and a big win.

Suppose you released your software a few weeks ago, and now a user reports a bug.  How are you going to fix it?

You could use your current working copy of the trunk, find the problem, fix it, and then do a build and package from that working copy.  Wait!  You’re using tags now, so you create a new tag that’s a copy of the trunk, and then build and release from that tag.

What’s wrong with that?  Well, your new release doesn’t contain the fixed version of the old release, it contains a fixed version of your trunk.  And that trunk probably has had all sorts of changes made to it in the weeks following the release that contained the bug.  It probably has some new errors in it.  It may even have partially finished new functions and other changes in it.  Even if you work hard to make every build green (passing all tests), you are risking pushing out new errors as you fix the old one.

What you should do instead is make the fix to the exact code you released (which is available in the tag).  Then you’ll know that the only changes between the prior release and your new corrected release were those needed to repair the reported problems.  New functions, restructured code, and other changes that you need to be making in the trunk, won’t affect the bug fix release.

We want to keep each tag frozen, representing exactly what we released.  Sure, we could update it and remember to go back to the proper version when we need to, but it’s a lot easier to avoid problems if tags aren’t changed.  So we deal with maintenance using branches.  A branch is pretty much like a tag, except that it is generally a copy of a tag, not the trunk, and it is intended to change.

  • branches: the repository subdirectory that contains other subdirectories (the branches themselves), each of which starts as a copy of a tag.  Each branch will be updated as needed to make fixes in the release represented by its tag.

Specifically, you will create a subdirectory of the repository called branches, then copy the 2.1beta2 tag to a subdirectory of branches.  Say you call it 2.1beta2-maintenance.  Next, you will check out a working copy from that branch and do your programming work on it to fix the bug.  As you work on it you commit your changes, and when everything is ready, copy the latest version of the branch to a new tag, perhaps 2.1beta3 (or even 2.1beta2-patch1).  Build the new release from that tag and send it to your users.  You’ve fixed their bug with the least possible chance of creating new problems that didn’t already exist in their release.
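
In command-line terms the whole maintenance flow is just a few copies and a checkout.  A sketch, again with a made-up repository URL:

    # Open a maintenance branch as a copy of the released tag
    svn copy https://svn.example.com/repos/myproject/tags/2.1beta2 \
             https://svn.example.com/repos/myproject/branches/2.1beta2-maintenance \
             -m "Open maintenance branch for 2.1beta2"

    # Fix the bug in a working copy of the branch
    svn checkout https://svn.example.com/repos/myproject/branches/2.1beta2-maintenance fix-2.1beta2
    cd fix-2.1beta2
    # ...edit, test, and svn commit as usual...

    # When the fix is ready, tag the branch as the new release
    svn copy https://svn.example.com/repos/myproject/branches/2.1beta2-maintenance \
             https://svn.example.com/repos/myproject/tags/2.1beta3 \
             -m "Tag bug fix release 2.1beta3"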

Merging Fixes

There’s just one big problem.  The next time you do a new feature release, from a tag copied from the trunk, your fix won’t be in it.  You did all the work on a branch, instead.

Subversion (and other, similar tools) makes it easy to solve this problem, too.  You can get a report showing every single change you made on the branch, and then use that report to make the same changes to the trunk.  In fact, Subversion can even make the same changes for you.  This isn’t just copying the changed files from the branch to the trunk, because each of them may have been changed in other ways while you were working on the branch.  This is just looking at what was changed in the branch (delete these lines, add these others) and making the same changes to the trunk.  With luck, the trunk hasn’t diverged so much that the same changes won’t fix the problem there, too.  But if it has, so what?  You’re a developer, and using your head to figure out how to make the same effective changes without messing other things up is one of the things you’re being paid for.
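
With Subversion 1.5 or later, which keeps track of what has already been merged, bringing the branch fixes back looks roughly like this (older versions need an explicit revision range instead):

    # From an up-to-date working copy of the trunk
    cd trunk-working-copy
    svn merge https://svn.example.com/repos/myproject/branches/2.1beta2-maintenance
    # Review the changes, run the tests, then commit the merged result
    svn commit -m "Merge the 2.1beta2 maintenance fixes back into the trunk"

(If you only want one specific fix, svn merge -c REVISION with the branch URL cherry-picks just that revision.)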

Some people really worry a lot about the potential for duplication of effort in making a fix on a branch and then having to recreate the same fix on the trunk.  But in reality, this rarely requires any thought at all; the automated tools handle it perfectly.  And when they don’t, it’s still just not very hard to do it in both places.  This approach to branching and merging works much better than making the whole team roll back their work in progress, or freezing their changes, while you make a fix.  And it’s one of the biggest wins in using source control.

Summary

Source control tools like Subversion help you keep on top of exactly what source code went into each and every release.  Used properly, they also give you a way to do maintenance fixes with the least possible risk of new problems or errors creeping in.  They cost little or nothing to buy, and require very little effort to run, support, and use.  There are a lot of other ways they help developers, too (comments on the reason for each revision, seeing what was changed at the same time, and knowing who did what if you have a question).  For a manager who wants to know how the team can deal with fixes for multiple releases in an efficient and safe way, understanding tagging, branching, and merging as described here is essential.

Last Day at StrataConf

Filed under: Uncategorized — Charles Engelke @ 11:28 am

It’s been almost a week since StrataConf ended, but I’ve been busy recovering from the travel and catching up.  Before I forget too much about the last day, though, I want to get my notes down here.

The day opened with a bunch of short “keynotes” again, just like Wednesday, and they were of highly variable value (also just like Wednesday).  Ed Boyajian of EnterpriseDB presented nothing but straight marketing material, a commercial that I think influenced no one.  But DJ Patil of LinkedIn gave a very interesting talk focused on hiring extremely talented people and helping them do their best work, and Carol McCall of Tenzing Healthcare gave a talk that was not only interesting but inspiring, about how to start fixing the mess our country has made of healthcare (video here).

The day was shorter than Wednesday, but still pretty long, ending at about 6:00 PM.  I felt the sessions were, overall, weaker this day than on Wednesday, but they closed extremely strong.  The panel on Predicting the Future, chaired by Drew Conway, with short talks from Christopher Ahlberg, Robert McGrew, and Rion Snow followed by discussion, was fantastic.  The format of short talks to set the stage for the panel worked great.

All in all, StrataConf was eye opening to me.  I had very little background in using data these ways, and now I feel ready to explore much more deeply on my own.  Many of the presentations and some videos are available online, and they’re worth a look.  And if you ever get a chance to attend a talk by Drew Conway, Joseph Turian, or Hilary Mason, I recommend you take it.  They each have a lot of interesting things to say, and they’re very good at saying them.

February 3, 2011

Day One at StrataConf

Filed under: Uncategorized — Charles Engelke @ 3:24 pm

The schedule for day one was so packed, and ran so late into the night, that I had no time to write anything as it happened.  And today, it’ll be just a quick recap.

The day started with keynotes.  These were the now-standard O’Reilly conference “keynotes” consisting of 10-15 minute presentations, some intrinsically interesting, some little more than infomercials the speakers’ companies paid O’Reilly for.  I dislike the format (can you tell?), and would flat-out boycott them, but there are always a few moments of value in there.  And the first day keynotes were no exception:

  • Hilary Mason (who participated in the prior day’s Data Bootcamp) of bit.ly opened with a breezy, interesting talk.  It’s available online now, too.  Nothing very deep in only ten minutes, of course.
  • James Powell of Thomson Reuters gave a talk that wasn’t very interesting, but it was certainly okay.
  • Mark Madsen’s talk was fine and light.  I liked Hilary Mason’s a lot better.
  • Werner Vogels of Amazon gave a short informative talk about their web services.  It wasn’t a sales pitch; it was actually interesting for its content.
  • Zane Adams of Microsoft presented the most blatant commercial, including a video from Microsoft’s marketing group that was simply embarrassing.
  • That was followed by a panel discussion on “Delivering Big Data”.  There’s a video available.  I didn’t think much of the session; you can’t have a worthwhile panel in ten minutes.
  • The closing talk wasn’t announced ahead of time.  Anthony Goldbloom of Kaggle talked about the $3 million prize for producing a good model for predicting who will need to go to the hospital in the coming year.  They acted like this was an announcement of the prize, but it was publicized at least a few days before.

Overall, the keynotes simply weren’t worth the time they took.  The sessions later in the day were better.  I’m not going to talk about them all, just a few highlights (to my mind):

That’s a bit more than half the sessions I attended.  The others weren’t bad, but just not as useful or interesting to me as the ones above.  I’ll update this later today or tomorrow with links to the material as I get a chance.

February 1, 2011

Strata Data Bootcamp

Filed under: Uncategorized — Charles Engelke @ 4:18 pm

It’s day one at O’Reilly’s Strata conference, and it’s off to a bit of a rocky start.  I’m attending the all-day Data Bootcamp tutorial, which was supposed to start at 9:00.  So I grabbed a muffin from the hotel, and a cup of tea from the conference, and went to get settled in at 8:30.  I figured I’d catch up on email while waiting.  Nope.  They told the local staff not to open the rooms until 8:45.  But the door doesn’t open at 8:45.  It actually opens a bit after 9:00, at which point there is a huge line and crush, with everybody trying to grab a spot near a power strip, since this is supposed to be an all-day hands-on tutorial and it turns out most rows of seats aren’t near power.  The session doesn’t start until about 9:15, because the sound equipment (and possibly video, too) isn’t working right.

Eventually we get started.  And the first thing they do is put up a slide telling us to download conference materials from a git repository (via the command git clone https://github.com/drewconway/strata_bootcamp.git).  The network is already hosed; the download comes in at 22KB/s.  [Later: the download took almost an hour.]

Within the repository, the initial slides are at slides/intro/viz_intro.pdf.  It’s a nice way to lay the foundation and I recommend you take a look.  Drew Conway is speaking, and I liked his comment on the Afghanistan slide: “This is taking a complex thing – which is a war – and representing it as a complex thing.”  Which doesn’t aid understanding.  The philosophy should be:

  1. Make complex ideas simple
  2. Extract small info from big data
  3. Present truth, don’t deceive

We will do hands-on work with R and Python.

Okay, we just had the first hands-on tutorial.  It did not go great.  Flashing code for people to run over four or five slides, advancing very quickly, is not the way for anyone to keep up.  I’m a fast typist, but could not keep up.  I eventually got most of it.  Of course, this is somewhere in the downloaded material from the Git repository, which finally finished downloading after nearly an hour.  Now that I have the slides, I could follow the tutorial examples much better next time.  I’ll be reviewing them after the conference, because the material seems very good, just not set up to follow in real-time.

On to the session on Image Data, given by Jake Hoffman.  The slides are at slides/image_data/image_data.pdf and sample code and data at /code/image_data.  The speaker’s first question to us: how many people regularly work with image data?  With text data?  No surprise, the text data users are a much larger group.  But image data isn’t that hard to work with, and it’s valuable, so we will learn about it.  Text data comes this afternoon.

It’s impossible to keep up with the code examples.  This tutorial is not structured clearly and simply enough to do so.  So I’m just going to run the sample code from the repository.  This is a failure for immediately learning how to do things, but I think it’s a success in learning what I want to go learn on my own.  There are interesting concepts that I’ll find useful, but I’m going to have to research and learn them on my own, not here today.  It’s not all new to me, though.  A speaker question: “How many here have worked with k-nearest neighbors?”  Very few hands go up.  I realize that I should raise mine, because I used it for my CS master’s thesis – more than 25 years ago.  I had forgotten.  I am ancient.

Now for the lunch break.  And as we break, they point us to a download of just the slides and code, but I can’t see it on the screen so I don’t know what it is.

For the afternoon, we start on working with text data. Hilary Mason is speaking, and the slides are at slides/text_data/strata_bootcamp.pptx.

Our first example uses curl from the command line to start getting data from a web server.  I wrote a curl cheat-sheet post a while ago, and really like using it.  If you want to talk HTTP and explore as you go, curl is the tool to use.  The speaker also shows using Beautiful Soup and lynx to grab data.
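
These aren’t the tutorial’s exact commands, just the flavor of poking at HTTP by hand with curl; example.com and the shortened link are placeholders:

    curl -s http://example.com/            # fetch a page and print the raw body
    curl -si http://example.com/           # same, but include the response headers
    curl -sI http://example.com/           # headers only (a HEAD request)
    curl -sL http://bit.ly/SOMELINK        # follow redirects to the final page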

Now to e-mail.  Exchange servers are really hard to work with, but nobody in the room will admit to using one.  Most people seem to use Gmail or Gmail for their own domain.  Others use the POP and IMAP protocols, which are old but widely available.  “IMAP sucks the least.”  And Gmail supports it, too.  Hilary thanks Google for making Gmail accessible with IMAP, an open, though perhaps old-fashioned, protocol.  Example code is in code/text_data/email_analysis, and the programs have a dummy account and password baked into them.  That account works today, but will probably be disabled after the workshop.  I didn’t want to risk my own account on an open network with it, but looking at the source I see that it is using SSL.
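
The workshop’s sample code is in Python, but in the same explore-by-hand spirit as curl you can also talk IMAP to a server directly over SSL.  A rough sketch, not from the workshop; the credentials are made up, and imap.gmail.com:993 is Gmail’s IMAP endpoint:

    openssl s_client -quiet -crlf -connect imap.gmail.com:993
    # then type IMAP commands interactively, each prefixed with a tag:
    #   a1 LOGIN someone@example.com your-password
    #   a2 SELECT INBOX
    #   a3 FETCH 1 (BODY[HEADER])
    #   a4 LOGOUT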

Hilary gives a nice example of Bayes’ Law.  Take a look at it in the slides.
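
I won’t try to reproduce her slide here, but the law itself is compact.  Applied to deciding whether a message is important given that it contains some particular word, it reads roughly:

    P(important | word) = P(word | important) × P(important) / P(word)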

What about classifying email (or web pages)?  She gives an example of a cuil search for herself that’s a total disaster.  (cuil is long gone; I wrote a post about it and the poor job it did searching for me.)

Clean data > More data > Fancier math

We close this sub-session with running the various sample programs with various test data.  Hilary shows how easy it is to create your own “Priority Inbox” feature if you first star some important messages.  These general techniques work well here.  And a final challenge to us: write a script to figure out who you’re waiting for replies from, and remind them after a certain amount of time.

Back from the afternoon break, a new topic: Big Data by Joseph Adler.  His slides are at slides/big_data/big data.pptx (there’s an embedded space there).

The first point: don’t jump to using big data techniques.  Small data techniques are easier, so use them unless you can’t.  And when you can’t, try to do something to let you use small data techniques.  Shrink your data by using fewer variables or fewer observations.  Get a bigger computer.  If nothing works, then move to big data methods.

There’s a lot of discussion on statistically valid sampling techniques, so you can run your analyses on a very small subset of your total data, yet still get good answers.
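
A proper sampling design takes more care than this, but even a quick uniform sample goes a long way when you are just exploring.  A sketch with made-up file names, using GNU coreutils shuf and keeping the header row out of the shuffle:

    head -n 1 big_data.csv > sample.csv                      # keep the header row
    tail -n +2 big_data.csv | shuf -n 10000 >> sample.csv    # plus 10,000 random records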

Everything discussed in the Big Data session seems useful, but not particularly new or interesting to me.  Solid material, but it didn’t trigger a lot of new connections to my mind.

And now we will close with a mash-up example they put together, plus questions and answers.  Most of the panel is participating.

All in all, a worthwhile survey of the information.  Not really a bootcamp, and not really hands-on, though.
