User:Abhishekbh/FSOSS 11

Revision as of 15:51, 20 January 2012 by Abhishekbh (talk | contribs) (David Eaves' “Saving Open Source Communities With Data”)

OSCON

Introduction

OSCON, or the O'Reilly Open Source Convention, is an annual event organized by the well-known O'Reilly Media. The company runs a few other conferences as well, but this is the only one that focuses on open source software and open data. It brings together experts, community leaders, and hackers to discuss community issues and new ideas.

I ended up watching six videos from the OSCON 2011 YouTube playlist, these being:

My overall impression of the conference, from these videos and from some press releases and blog posts, was quite positive. Most of the speakers held interesting positions and brought good topics to the table. There seemed to be a fair bit of excitement around the conference itself, given its reputation. It was founded in 1999 as a successor to 'The Perl Conference' of 1997, and hence has a well-established stature.

Of the videos listed above, I chose to analyze the talks by Steve Yegge, David Eaves, and Patrick Curran. While the others were interesting, they turned out to be less about open source and more about specific software or projects (with the exception of Paul Fenwick's talk). One in particular - Steven Harris' - turned out to be exceptionally uninteresting. As an Oracle employee, he understandably felt a little out of place at an open source conference, and so he spent his talk essentially explaining Oracle's position on open source rather than discussing anything new or interesting.

Steve Yegge in a talk entitled “What would you do with your own Google?”

Steve Yegge is a Google engineer (at least at the start of the talk) with a specialty in “static program analysis”. He is well known in the industry as a blogger and commentator, and previously worked at Amazon (whose CEO he famously criticized). He has had unkind words for Google as well, but his prominence in the industry has kept his position at that company safe. Yegge has delivered a few significant talks and blog posts, and is often quoted in the industry.

In this particular talk, he stresses the importance of doing something “important” in your career. He puts this succinctly on a slide with “Always work on stuff you love!”. He laments that far too many talented programmers are busy working on software that solves unimportant problems, such as building platforms to view “cat pictures”. Meanwhile, he suggests, important problems like those tackled by the Human Genome Project are going unaddressed. He sees folly in programmers who spend their careers simply doing what they already know, instead of learning and solving new problems. He calls this “mercenary programming”. He says that instead of focusing on Farmville, cat picture platforms, and general corporate programming, engineers should look to “social-minded and innovative problem-solving.”

His talk reminded me of a quote by Richard Hamming, the author of The Art of Doing Science and Engineering:

"In science, if you know what you are doing, you should not be doing it. In engineering, if you do not know what you are doing, you should not be doing it. Of course, you seldom, if ever, see either pure state."

Essentially Yegge is asking engineers to be scientists as well. As software finds higher integration in our daily lives, universities and colleges are turning out more engineers than scientists, primarily to serve an industrial demand. This scenario creates the problem that Yegge is describing: that a plenitude of programmers is busy solving trivial problems, and hence the potential of computing is not being met.

He admits that he does not fully practice what he preaches, but says he is taking concrete steps to change that. In fact, to show he is serious, he actually quits his job on stage, which is the first his boss hears of it. He says that he had just recently signed up for a “cat picture project”, by which he likely means Google+, but soon found himself disillusioned enough with it to want to quit. He then announces that now, once a week, he sits down with his wife to “study” - to learn something new, or to read books he hasn't read before. He urges his audience to do the same: to dedicate at least a few hours every week to reeducating themselves and taking an interest in important problems. One of his slides talks about “starting a culture change”, which refers to self-education in “infrastructure and scaling” and “math, data mining and bioinformatics” in order to solve these important problems.

He notes that while most organizations today are interested in building easy-to-sell software with little depth, there are some (such as O'Reilly Books) that do put stress on what he calls important problems. He suggests that since the current conference is about open source software (rather than something like an iPad conference), he believes the people in attendance have an interest in working on the big problems, but that they need to break out of the “code mercenary” lifestyle to embrace this fully.

It is in this final point that Yegge relates his talk to the world of open source. He usually has a hint of the open source philosophy in his writing and has also created and led some well-known projects (Rhino on Rails being one). In the current talk, he implies that the “important problems” usually require large data sets (since they are often human problems) and hence an open data and open source model. Some examples of these types of problems that he gives are:

  • Gene sequencing
  • Voice Recognition and Natural Language Processing
  • Understanding the spread of viruses and diseases

According to him, if we had the source code for the human genome, then we could easily discern cures for medical conditions like cancer. So the “open source” he talks about is a metaphorical one, in which information is available freely and without restrictions. He says that if you look at all the open source code in the world as data, it would come to perhaps a couple of terabytes, which is apparently smaller than some of the larger data sets at Google. But the number of problems that require open solutions is very large.

Though this talk was fairly high level, it helps me affirm two important points about my own view of the open source world:

  • Most open source projects do provide a developer with an opportunity to work on “important” problems. The majority of corporate programming happens to be on software with a short life span which, other than making the rounds for a couple of years, won't contribute much to the world.
  • 'Free as in freedom' is a fundamental concept of the Free and Open Source movement. As Yegge mentions, solving hard problems requires some cross-training in expertise, and hence software needs to come without a prescribed usage.

David Eaves' “Saving Open Source Communities With Data”

According to his website, David Eaves is a “public policy entrepreneur, open government activist and negotiation expert.” He apparently works with several government agencies in an advisory role on open data and open source. At OSCON, Eaves did an interview with the conference's co-chair, Edd Dumbill, about using open source community statistics to diagnose and help that community.

To start the interview, Eaves suggests that current trends in open source show that too many developers are dropping out of their projects. In general, he notes the following four challenges that the open source world is facing:

  • the challenge of keeping up with proprietary software developers
  • the challenge of taking on new projects fast enough
  • bringing in new developers, and making sure they stay
  • making sure there is a low transaction cost in getting them started and contributing

He does admit, however, that GitHub has been a saviour of sorts for the open source world. He says that 'forking' used to be a bad word before GitHub because forked projects would often lead to split loyalties and dwindling interest. But with GitHub, since forking is very simple and requires a negligible investment, that risk goes away and experimentation becomes possible instead; a majority of forks can die, but progress will still be made. GitHub also lowers the transaction cost of getting a new developer's code into a repository, be it the central repository or a fork. Further, since forking also brings ownership with it, developers no longer need permission before they can begin to experiment and contribute test code.

On the point of using community data as a diagnostic tool, he says that by monitoring individual statistics - such as number of commits, number of commits merged, last commit time, etc. - a community can gauge its own health at any given time. If too many of its developers have not made any commits in a long period of time, that might be a symptom of an inefficiency in the project's standards. Eaves is a Mozilla developer, and is involved with a project that adds a developer dashboard to Bugzilla, which would carry these statistics.
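The kind of per-developer statistics described above could be computed along the following lines. This is only a minimal sketch, not the Bugzilla dashboard's actual logic: the `Commit` record, its field names, and the 90-day inactivity threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Commit:
    # Hypothetical record; a real project would pull this from its VCS or bug tracker.
    author: str
    day: date
    merged: bool


def developer_stats(commits):
    """Return {author: {"commits": n, "merged": m, "last_commit": date}}."""
    stats = {}
    for c in commits:
        s = stats.setdefault(
            c.author, {"commits": 0, "merged": 0, "last_commit": c.day}
        )
        s["commits"] += 1
        s["merged"] += int(c.merged)
        s["last_commit"] = max(s["last_commit"], c.day)
    return stats


def inactive_developers(stats, today, max_idle_days=90):
    """List authors whose last commit is older than max_idle_days (assumed threshold)."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(a for a, s in stats.items() if s["last_commit"] < cutoff)
```

A community could run something like `inactive_developers(developer_stats(commits), date.today())` periodically and treat a growing list as a prompt to look for friction in the contribution process.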

I tried searching for a live version of this dashboard hosted online, but was only able to find a screenshot of it, viewable here. Some aspects of it seem similar to the statistics that GitHub offers per repository, though those can be hard to understand for larger projects.

He also suggests that such data would allow a community to better understand itself from a developer's perspective. An example he gives concerns the unknown period of time a developer has to wait after submitting a patch before it gets reviewed. Since this process is not standardized, it often leads to frustration, which can in turn lead developers to quit. According to Eaves, with this data average wait times would become apparent, and if they begin to slide in individual cases, then moderators could be chastised for their delays. In this way, community data can be used for introspection to find efficiencies and standards.
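The patch-review example can be made concrete with a small sketch. It assumes each patch is a hypothetical `(patch_id, submitted, reviewed)` tuple, where `reviewed` is `None` while the patch is still waiting, and it flags a patch as overdue once it has waited more than twice the historical average; that factor is an arbitrary choice for illustration, not something Eaves proposes.

```python
from datetime import date
from statistics import mean


def average_wait_days(patches):
    """Mean days between submission and review, over reviewed patches only."""
    waits = [
        (reviewed - submitted).days
        for _, submitted, reviewed in patches
        if reviewed is not None
    ]
    return mean(waits) if waits else None


def overdue_patches(patches, today, factor=2.0):
    """IDs of unreviewed patches waiting longer than factor * average wait."""
    avg = average_wait_days(patches)
    if avg is None:
        return []  # no review history yet, so nothing to compare against
    return [
        patch_id
        for patch_id, submitted, reviewed in patches
        if reviewed is None and (today - submitted).days > factor * avg
    ]
```

Publishing the average alongside the overdue list would give contributors a realistic expectation of how long a review takes, which is the standardization the interview argues for.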

Finally, Eaves gives an example of how open data has helped a client of his in the open government model. The city of Los Angeles recently made restaurant inspection data open, and required it to be posted on every establishment's door (much like Toronto). According to Eaves, this led to restaurants with poor records receiving fewer customers and those with better records receiving more. In other words, good restaurants were rewarded and poor restaurants punished. He also noted that there has been a decline in the number of patients visiting the emergency room with food poisoning, a fact that he says is likely grounded in this freeing of information.

Apart from his role in open government, I found Eaves' analysis of the challenges facing open source communities particularly interesting. Lately I have been really buying into the idea that an open source project needs not only the relevant licenses to be efficient, but also a community. For example, Google's Chromium browser is an open source project released largely under a BSD license. However, it is still largely developed behind closed doors, with the code dumped externally afterwards under an open source license. This is in contrast to a project like Mozilla Firefox, which not only has an open source license, but also a large and thriving community working on it. The former has a relatively small community of developers (mostly Google employees), and is hence largely unknown to the public as an open source project. Eaves' analysis speaks to the mechanics of that community building and can perhaps be used to see why certain projects are more popular than others.

Patrick Curran's “Who Needs Standards?”

Patrick Curran is the chair of the JCP (Java Community Process). Hence his job has everything to do with “standards”, and he probably comes face to face with the legal issues behind open source on a daily basis. His talk was very pedantic, in the sense that it stuck absolutely to its title - he essentially spends 15 minutes giving example after example of standards used in human history, especially in the process of industrialization. These examples are very interesting and convincing, to be sure; however, he never really touches upon the subject of open source. The case he makes for open standards stands impeccably, though.

The first standard that Curran mentions is human words, without which, he points out, communication would be very difficult and terribly inefficient. He then gives the example of early currency, from the time when coins were actually worth their own value in metal. He follows with an anecdote about Isaac Newton as the “master” of the Mint in his country in the 17th century. Apparently “clipping” was a common problem then: individuals would cut tiny bits of metal from the edges of coins and melt them down to be used as gold or silver. This would create coins of varying sizes and hence debase the currency. For this crime, Newton and his government levied the punishment of death on the perpetrators. Curran jokes at this point that “some standards are so important, that if you violate them, death is the penalty.”

Though his anecdote was only meant to be humorous, it did speak to the seriousness and importance of Curran's talk. On a personal level, I am often left wondering about the lack of standards in our own world of software when I find myself on the losing end of it. For example, I don't often have access to Microsoft Office, the native software for the de facto document standard in North America. But most of my school submissions and document communiques have to occur in one of its formats. This often leaves me in an unfavourable position, wondering why open document standards such as ODF are not widely used. In some readings, I discovered that Microsoft's Open XML format, which is an open format, has apparently not yet been fully implemented in Microsoft Office, which instead carries a slight modification of Microsoft's own standard.

For reasons like these and those made by Curran, I see the value in having open standards, and realize that they are necessary for the success of open software. For any wide adoption of these to occur, and for them to be able to displace popular proprietary software, open standards should be enforced and supported. The World Wide Web is a great example of how open standards can lead to an environment in which open software can thrive. Due to the standards that were created for it right at its inception, the web has probably become the largest platform for open source software.

Conclusion

As I have been learning more about open source over the past year, I have found myself confused and conflicted by a few discoveries. One of these came a few months ago when I discovered a website selling Open Office for $20 per digital download. The website had a table comparing the feature lists of Microsoft Office and Open Office (fairly similar), and then compared their costs: some $270 for the former, and their price of $20 for the latter. My first instinct upon visiting this site was that it was malicious in its intent, and I thought of reporting it to the Free Software Foundation. After a little more research, however, I learned that the website was not violating any rules of Open Office's LGPL and Apache licenses - they were distributing the software with the source code and original licenses included. After learning that it was not illegal to sell open source software, I realized that I hardly understood what it meant or stood for.

This was when I began to realize that the 'free' in Free and Open Source Software could mean either 'free as in freedom' or 'free as in gratis'. None of the videos I watched from OSCON 2011 spoke about this subject explicitly; however, I did see that difference in the meaning of the word 'free' in context here. It was a relief to know that I have a better understanding of this subject now.

Further, I came across three uses of the word 'open' in this conference: open source, open data and open standards. The term 'open' is often used in popular Internet culture without being quite pinned to any of those three, and this can make things quite confusing. However, the speakers in these talks helped me reinforce that separation of meaning through the contexts in which they used the term.