thefoundationhttp://www.thefoundation.de2011-04-14T12:31:25Z(c) 2012 Michael Kurze, Aachen, GermanyClickable stack traces for Node.JS2011-04-14T12:31:25ZMichael Kurzehttp://www.thefoundation.de/about/michaelclickable-stack-traces-nodejs<p>How to add URLs to Node.JS stderr stack traces. Helps jumping to frame sources in Textmate or Macvim (and possibly other editors).</p><p>While they are often incomplete due to evented execution, Node.JS stacktraces are an invaluable tool for debugging your app. However, thanks to the plethora of Node.JS libraries, they often point to files outside of your project, making inspection a bit tedious. Also, would it not be nice to just jump to the correct source location in your editor by clicking on a stack frame?</p> <p>If you use TextMate, you might profit from <a href="https://gist.github.com/919180" title="node-trace-txmt.sh">this one-liner</a>: Simply save the gist as an executable on your path (or add the sed call as an alias) and pipe your Node/Expresso invocations through it like this:</p> <p> <a href="http://www.flickr.com/photos/brighteye/94218497/" title="define: Cat Stack">cat stack</a> through sed: <code> &gt; expresso my/test.js 2&gt;&amp;1 | node-trace-txmt.sh </code> </p> <h3>Almost clickable</h3> <p>On Mac OS X, you can now open a frame source in Textmate by cmd-shift-doubleclick or using the context menu (alternative terminals such as iTerm / iTerm2 supposedly make this even easier).</p> <h3>Mac Vim &amp; al</h3> <p>Obviously similar expressions will do the same thing for your Django/Rails/COBOL app. Different text editors are also easy to support — if they provide an URL handler. Here is the <a href="https://gist.github.com/919186" title="node-trace-macvim.sh">version for Mac Vim</a>. 
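</p> <p>As a rough illustration of what such an expression does, here is a minimal Python sketch (it is not the sed one-liner from the gists above). It assumes frames of the usual V8 form <code>at fn (/abs/path.js:line:column)</code> and TextMate's <code>txmt://open</code> URL scheme; the names <code>FRAME</code> and <code>link_frames</code> are made up for this example.</p>

```python
import re
from urllib.parse import quote

# Location part of a stack frame: an absolute path, a line and a column,
# e.g. "    at Object.run (/home/me/app/lib/util.js:12:34)".
FRAME = re.compile(r"(/[^\s():]+):(\d+):(\d+)")


def link_frames(line):
    """Replace each path:line:column in a stack trace line with a txmt:// URL."""
    def to_url(match):
        path, row, col = match.groups()
        # Assumed URL layout: txmt://open?url=file://PATH&line=N&column=M
        return "txmt://open?url=file://{}&line={}&column={}".format(quote(path), row, col)
    return FRAME.sub(to_url, line)
```

<p>Applied to each line of stderr, this turns every frame location into a URL and leaves all other output untouched.</p> <p>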
If you do not use an editor that supports a URL scheme, you can still use generic <tt>file://</tt> URLs (<a href="https://gist.github.com/919193" title="">gist</a>), which works on non-mac systems as well.Scalable Text Clustering 2011-03-01T17:25:05ZMichael Kurzehttp://www.thefoundation.de/about/michaelscalable-text-clustering<p>The updated version of the Grouperfish clustering plans, now published on the Mozilla <a href="http://blog.mozilla.com/data/2011/03/08/scalable-text-clustering-for-the-web/" title="Blog of Data - Scalable Text Clustering for the Web">Blog of Data</a>.</p><h3>The Background</h3> <p>During my work as <a href="http://blog.mozilla.com/metrics/">metrics</a> liaison with the <a title="Firefox Input" href="http://input.mozilla.com/">Firefox Input</a> team, an exciting <a title="Bug 629019: Cluster themes and sites as publicly available and output to JSON" href="https://bugzilla.mozilla.org/show_bug.cgi?id=629019">requirement</a> has come up: scalable online clustering of the millions of feedback items that the users of Firefox share with us.</p> <p>When designing a service at the metrics team, besides functional requirements (<em>accept text messages, produce clusters</em>) we consider scalability and durability. 
In fact, scalability concerns play a major role in wanting to replace the <a title="Dave Dash’s textcluster" href="https://github.com/davedash/textcluster">current solution</a> (which has done a fine job so far) and not picking another powerful <a title="Carrot2 clustering framework" href="http://project.carrot2.org/">existing tool</a>: We expect the influx of messages (already heading towards 2 million) to increase up to 50x once Firefox 4 is released.</p> <h3>On to Architecture</h3> <p>There is a <a title="Grouperfish architecture" href="https://github.com/michaelku/grouperfish/blob/master/doc/medium_sized_picture.pdf?raw=true">slide</a> outlining what the system (called <a title="Grouperfish on github" href="https://github.com/michaelku/grouperfish">Grouperfish</a>) is planned to look like. As this service is to be developed quickly and in iterations, even major parts of the system might be replaced in the future though. <em>This</em> is the rationale for our first version, to be released sometime around the Firefox 4 release:</p> <h4>Concurrency</h4> <p><em>We want to be able to handle tens of thousands of GET’s and thousands of POST’s per second, provided we have enough commodity hardware at our disposal.</em></p> <p>To accept incoming documents and queue them for clustering, <a title="Node.JS Website" href="http://nodejs.org/">Node.JS</a> fits the bill. Its event-based concurrency model dominates thread- and process-based designs in IO-bound tasks such as this. Also, depending on the storage you pick, requests might pause to wait on garbage collection or to rewrite store files. Node can handle a lot of waiting requests because it does not use system level threads (or even processes) for concurrency.</p> <h4>Storage</h4> <p><em>Grouperfish must store millions of documents in hundreds of thousands of collections. The generated clusters may reference thousands of documents each, each ranging from a few bytes to about a megabyte. 
Also, we want to store processing data for clustering.</em></p> <p>When planning for more data than fits into your collective RAM, you usually have two options (SQL not being one of them since RAM has become pretty big):</p> <p><a title="Wikipedia: Amazon Dynamo" href="http://en.wikipedia.org/wiki/Dynamo_%28storage_system%29">Dynamo</a>-style key/value stores like <a title="Basho Riak" href="http://www.basho.com/riak">Riak</a> and <a title="Apache Cassandra" href="http://cassandra.apache.org/">Cassandra</a> allow to store replicated values with high write rates, and also to quickly retrieve individual items from disk. You do not need to worry about one machine getting too much attention (e.g. when one of your services gets slashdotted), thanks to consistent hashing. Riak even has a notion of <em>buckets, keys</em> and <em>values</em>: We would intuitively use buckets for collections of documents (and of clusters), and values for individual documents (and clusters). No wonder we looked at this more closely. Unfortunately though, Riak’s buckets are more of a namespacing device than anything else. It is expensive to get all elements of a bucket, since they are neither indexed by a common key nor stored together on disk. The Riak design can be a bit <a title="Getting all the keys" href="http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-January/002947.html">misleading</a> in this regard, as buckets are in fact <a title="Layout of Riak buckets" href="http://en.wikipedia.org/wiki/There%27s_a_Hole_in_My_Bucket">spread</a> throughout the key space. To retrieve all keys in a bucket, Riak will check every single key — possibly scanning gigabytes of main memory (for the very recent <a title="Riak search for Riak values" href="http://wiki.basho.com/Riak-Search---Indexing-and-Querying-Riak-KV-Data.html">Riak search</a> to help, you’d need to blow up your values quite a bit). And you still only have the keys. 
To get possibly millions of associated values, you need to move your little disk heads a lot. This is not always as bad as it sounds because Riak gives you streaming access to the data as it comes in. But in general, the smaller your buckets in relation to the entire key space, the higher the cost of retrieving many of them.</p> <p>The other major contenders are column-oriented data stores of the <a title="The Google BigTable paper" href="http://labs.google.com/papers/bigtable.html" target="_blank">BigTable</a> family, the most prominent of which is <a title="Apache HBase site" href="http://hbase.apache.org/">Apache HBase</a> (the aforementioned Cassandra is actually somewhat in between, having properties from both worlds). As far as we are concerned, there are two main differences between HBase and Dynamo-style stores: <em>1. Data is stored per column family:</em> to retrieve the vector representations of a million documents, we do not have to scan through a million document texts. <em>2. Records are sorted</em> <em>by key</em>, much like in a traditional database (but optimized for fast inserts, using <a title="Log Structured Merge Trees" href="http://nosqlsummer.org/paper/lsm-tree">LSM trees</a>). This is a blessing and a curse. A blessing, because we can scan over contiguous collections of documents. A curse, because we are vulnerable to <em>hotspotting</em> on popular collections. To counter this, we need to make sure that there are random parts in our row keys, e.g. using UUIDs. Because HBase divides tables into regions as they grow and hands them off to other nodes, this method avoids hotspots. And we do not lose the streaming advantage as long as we use common prefixes per collection.</p> <p>Given our access patterns (insert documents, update clusters, re-process entire collections, fetch lists of clusters), efficient sequential access to selected parts of the data is very important. Sorted, column-oriented storage seems to be the way to go. 
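</p> <p>The row key idea described above (a common prefix per collection, followed by a random part) could look roughly like the following Python sketch. The <code>/</code> separator and the function names are assumptions for illustration, not the actual Grouperfish key layout, and collection names are assumed not to contain the separator.</p>

```python
import uuid


def row_key(collection, doc_id=None):
    """Build a row key: collection prefix, separator, then a random (UUID) part."""
    suffix = doc_id if doc_id is not None else uuid.uuid4().hex
    return "{}/{}".format(collection, suffix)


def scan_range(collection):
    """Return (start, stop) row keys that enclose every key of one collection."""
    # "0" is the character directly after "/" in ASCII, so the stop key is the
    # smallest string that sorts after all keys starting with "collection/".
    return collection + "/", collection + "0"
```

<p>In a store that keeps rows sorted by key, every document of a collection then falls between <code>start</code> (inclusive) and <code>stop</code> (exclusive), so a single range scan can stream the whole collection.</p> <p>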
There are other pros and cons (single point of failure, write throughput, hardware requirements), but if we don’t cater to our use case, those won’t ever matter.</p> <h4>Clustering</h4> <p><em> Grouperfish must be able to handle small numbers of large corpora (millions of documents), as well as large numbers of small corpora (millions of collections). The generated clusters may contain thousands of messages each.</em></p> <p>This is practically a no-brainer: Apache Mahout supports in-memory operation (for smaller clusters) as well as distributed clustering (using Apache Hadoop, for larger clusters). Mahout can update existing clusters with new documents and generate labels for our clusters. Of course, Mahout is a Java library, so we need to run it within a JVM. To simplify management and introspection, we will run our clustering workers in Jetty web containers.</p> <h4>Scheduling</h4> <p><em> We need to be able to add workers to increase clustering frequency. When there are more new messages than can be clustered right away, we want them to be queued. Also, we have Node.JS and we have Java/Mahout. We want our queue to bridge the gap.</em></p> <p>Messaging has become a big topic as systems have become larger and more distributed. We want to use messages to decouple write requests from processing them. There is a very elegant solution to maintain queues, offered by the in-memory data store <a title="Redis" href="http://www.redis.io/">Redis</a>. Redis is somewhat like a developer’s dream of shared memory. No encoding and decoding of lists, maps and values as they enter and leave the stores: you just operate on your data structures within shared memory. Unfortunately, Redis queues are really just a linked list with a blocking POP operation. 
While that is very nice, we want to track and resubmit failed tasks when a worker node falls victim to <a title="Not exactly rodents" href="http://theoatmeal.com/blog/fail_whale">rampaging rodents</a>.</p> <p>The considerations of choosing <a title="RabbitMQ" href="http://www.rabbitmq.com/">RabbitMQ</a> to realize a task queue are worth an article of their own. Suffice it to say, it has Node and Java bindings, and it supports message acknowledgement from workers. We still want to use Redis to keep track of collection size, to cache the actual incoming data (no need to ask HBase if we use it right away), and for locking of collections, so that every collection is only modified by one worker at a time. We also <em>might </em>use it to cache frequently requested clusters.</p> <h3>More Thoughts</h3> <p>Selecting these components, I learned that it is important to choose technologies in an unbiased fashion, and to reconsider decisions when a technology has no answer for your requirement. For example, I originally wanted to use just Riak for storage (I like its simplicity and style, and the bucket metaphor), but the enumeration of large buckets would be too slow for an online system. It might be fine for a batch-only system, or a system that just does not operate on collections of varying size as much.</p> <p>For a message queue, <a title="zeromq" href="http://www.zeromq.org/">ØMQ</a> sounded awesome, offering low latency and powerful constructs, but I quickly realized that it is not really what I understand a <a title="Wikipedia on Message Queues" href="http://en.wikipedia.org/wiki/Message_queue">message queue</a> to be, but rather a very smart abstraction over traditional sockets. 
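</p> <p>To make the requirement from the Scheduling section concrete (tasks must stay tracked until a worker acknowledges them, and be resubmitted if the worker dies), here is a small in-memory Python sketch. It does not use RabbitMQ, Redis or ØMQ; the class <code>TaskQueue</code> and its method names are hypothetical and only illustrate the bookkeeping.</p>

```python
from collections import deque


class TaskQueue:
    """In-memory sketch of a queue whose tasks stay tracked until acknowledged."""

    def __init__(self):
        self.pending = deque()  # tasks waiting for a worker
        self.in_flight = {}     # worker name -> task handed out, not yet acknowledged

    def submit(self, task):
        self.pending.append(task)

    def take(self, worker):
        """Hand the oldest pending task to a worker and remember who has it."""
        task = self.pending.popleft()
        self.in_flight[worker] = task
        return task

    def ack(self, worker):
        """The worker finished its task; stop tracking it."""
        del self.in_flight[worker]

    def worker_lost(self, worker):
        """The worker died before acknowledging; put its task back at the front."""
        task = self.in_flight.pop(worker, None)
        if task is not None:
            self.pending.appendleft(task)
```

<p>A plain list with a blocking pop only covers <code>submit</code> and <code>take</code>; the <code>in_flight</code> table is the part that makes resubmission possible.</p> <p>Seen as a socket abstraction, ØMQ would leave this kind of bookkeeping to the application. 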
Probably someone will eventually build a distributed task queue on top of it though.</p> Diaspora — Can the Social Graph be Our Web of Trust?2010-08-23T01:30:22ZMichael Kurzehttp://www.thefoundation.de/about/michaeldiaspora-and-the-web-of-trust<p>On Friday we had Max, Ilya and Raphael from <a href="http://www.joindiaspora.com" title="Diaspora Project Site">Diaspora</a> over at Mozilla. They <a href="http://tieguy.org/blog/2010/08/20/notes-on-diaspora-talk/" title="Luis Villa’s Notes on the Diaspora Talk">talked</a> about their effort in creating a distributed social network. Where I think they are on the right track, and where they should think even bigger.</p><h3>Why we need Diaspora</h3> <p> Personally, I see three major challenges that everyone passionate about the <a href="http://www.mozilla.org/about/manifesto.en.html" title="Principles of the Open Web, as outlined by the Mozilla Manifesto">open internet</a> needs to make up their mind about: </p> <ul style="margin-bottom:0.5em; margin-top:0.3em; padding-top: 0;"> <li><em>The <a href="http://googlepublicpolicy.blogspot.com/2010/08/joint-policy-proposal-for-open-internet.html" title="Google Public Policy on the Verizon deal">erosion</a> of <a href="http://dig.csail.mit.edu/2006/06/neutralnet.html" title="Daniel Weitzner: The neutral internet">Net Neutrality</a></em></li> <li><em>Participants <a href="http://futureoftheinternet.org/" title="The Future of the Internet and How to Stop it by Jonathan Zittrain">switching to closed</a> environments of apps and appliances, becoming mere consumers (*)</em> </li> <li><em>People entrusting their personal data and social activity to Facebook, forced to <a href="http://www.geekymomblog.com/2010/05/18/the-facebook-dilemma/" title="Geeky Mom on the Facebook dilemma">choose</a> between control and connectedness</em></li> </ul> <p>In the context of the Diaspora talk, I’ll focus on the third issue.</p> <p>We need Diaspora because people need to be in control over with whom they 
share personal information. Every time Facebook <a href="http://www.aclunc.org/issues/technology/blog/facebook_places_check_this_out_before_you_check_in.shtml" title="http://arstechnica.com/web/news/2010/08/privacy-groups-facebook-already-facing-off-over-places.ars">sneaks in</a> a new default that breaks privacy, we grudgingly change the settings again — and stay, not wanting to lose our friends. Or we just don’t know about it and leave it as it is. Combined with the social monopoly that Facebook has established, this makes privacy and security optional features, subject to change like any other.</p> <h3>How Diaspora can help already</h3> <p> The main distinguishing factor of Diaspora compared to Facebook et al. is in that it decouples your social graph from the network provider, bringing back real competition to the social space. Like with E-Mail, there can be lots of network providers, loosely connected over push-interfaces. Whenever a pod (the equivalent to an e-mail-provider in Diaspora) should violate your trust, you can just switch to another one, or set up your own pod. </p> <h3>What could be done better</h3> <p> On the downside, this means that you have to trust your pod as well as all your friend’s pods. <em>No big deal?</em> Well, where the same server software is used on a distributed network, it is very prone to exploit of <a href="http://en.wikipedia.org/wiki/Sendmail#History_of_vulnerabilities" title="History of Vulnerabilities in the popular mail server sendmail">vulnerabilities</a> due to patch delay and misconfiguration (correctly setting up <abbr title="Transport Layer Security">TLS</abbr> is still a big challenge, <a href="http://www.theinquirer.net/inquirer/news/1727426/us-government-fails-secure-websites" title="The Inquirer: DHS fails to secure its website">not only</a> for regular people). 
</p> <p> <a href="http://en.wikipedia.org/wiki/HTTP_Secure" title="Wikipedia on HTTPS">Secure HTTP</a> is great when a large, anonymous group of people needs to trust a central service. It allows us to do online banking and purchases, free from eavesdropping and man-in-the-middle attacks. However, it is not peer-to-peer: When you fetch your mail over a secure IMAP connection, you might be sure that your password is protected, but you do not know who actually sent you that e-mail (think about it: that is the reason why phishing works). When you get it from Google Mail, you might be using TLS, but Google is still able to read your every conversation. </p> <h3>How PGP can solve this</h3> <p> I propose that Diaspora pods should be dumb post boxes that <em>are not able</em> to actually look into status updates, private messages, friend lists and so on. <em>How?</em> The technology for that has been available for quite some time and is called <a href="http://www.pgpi.org/doc/pgpintro/" title="Introduction to PGP">PGP</a>. </p> <p> Basically, PGP allows you to send and receive messages that cannot be decrypted by the servers that route them. So, if you were to encrypt your message inside your browser, you would establish secure end-to-end communication. No need to trust the shady pods that some of your friends decided to use, not knowing any better. <em>But encryption in a web client? That sounds awfully slow!</em> Well, <a href="https://addons.mozilla.org/z/en-US/firefox/addon/10868/privacy/" title="Firefox Sync (aka Weave)">Firefox Sync</a> does it already with your entire browsing history (the pass phrase to your key is never sent to the server), and I would imagine that JavaScript interpreters have become fast enough to emulate the cryptographic capabilities of a PC from 1991. </p> <p>I do have ideas on how to approach search and incremental profile updates with this, and on the new security considerations that apply here (Can you always trust your browser? 
Could a pod not make you use an insecure web client that transmits your passphrase?). However, that is rather technical, possibly material for a follow up post. </p> <h3>The social network is a key signing party</h3> <p> The problem with PGP has always been that people have been unable to exchange public keys in a manner that is both trustworthy and extensive. Because a <a href="http://en.wikipedia.org/wiki/Web_of_trust" title="Wikipedia on the Web of Trust">web of trust</a> can often not be established, people refrain from using encrypted e-mail. Turns out that social networks come with a mechanism that is just made for this: <em>Friending</em>. In the secure social network, accepting a friend request would be equivalent to exchanging keys. Usually you are referred to friends from people you already know, so there already is a basic level of trust. </p> <p> This means that online social networks can be transformed from a jeopardy to our security to a vehicle of the same. This idea is of course also <a href="http://serendipity.ruwenzori.net/index.php/2009/03/18/pgp-web-of-trust-meets-modern-social-networking" title="PGP web of trust meets modern social networking by Jean-Marc Liotier">not entirely new</a>. What might be new is the idea of building the social web entirely on top of PGP rather than just integrating that as an optional feature. </p> <h3>Any Comments?</h3> <p>I have not gotten around to add Commenting or Pingback to this blog, but I would love to incorporate any (links to) comments in a follow up post, please write to michael at this domain.</p> <h3>Update:</h3> <p> If I understand correctly, the diaspora guys are already planning to use GPG for cryptography <a href="http://www.joindiaspora.com/2010/04/21/a-little-more-about-the-project.html" title="Diaspora Blog: A little more about the project">somewhere</a>. This is a pretty good start. 
If they really already plan on generating keys for everyone, then they would only need to pull the actual encryption into the web client. </p> <p style="font-size: 85%;"><em>(*) Like any intern at Mozilla I had the opportunity to talk to John Lilly, and I got the impression that Mozilla takes this development very seriously.</em></p>Sites for Mozilla Input2010-08-10T18:46:36ZMichael Kurzehttp://www.thefoundation.de/about/michaelsites-mozilla-input<p>As a side project during my internship at Mozilla, I <a href="http://aakash.doesthings.com/2010/08/10/firefox-input-1-6-2-is-released-more-malory/">worked with Aakash</a> from Mozilla QA to bring <a href="http://input.mozilla.com/sites" title="Input Dashboard: Sites">a new feature</a> to the Mozilla Input website.</p><p> Oftentimes when users have trouble with a Firefox beta, there is not actually a bug in the beta, but a problem with a specific website (such as broken <a href="http://www.anybrowser.org/campaign/" title="Good old anybrowser website, unfortunately still an issue">user agent detection</a>). Even when a problem is related to Firefox, it can be very helpful for QA to see what sites trouble our users the most, and what issues the users face there. </p> <h3>Enter clustering…</h3> <p> To group sentiment by topic, my fellow metrics intern Andres and I made use of Dave Dash’s <a href="http://github.com/davedash/textcluster" title="Textcluster on github">clustering algorithm</a>, which uses techniques from the search engine world to group related input. That helps to get a quick impression of what’s going on when a site is causing trouble for many users. We also get a lot of positive feedback on sites where the user experience has improved for beta users compared to the release version. </p> <h3>…and Django of course!</h3> <p> It was very cool to do something with Django again. 
The webdev team is very knowledgeable in this area so I learned a lot working with <a href="http://fredericiana.com/" title="Fred Wenzel’s blog">Fred</a> and <a href="http://davedash.com/" title="Dave Dash’s site">Dave</a>. There are some limitations (you <em>still</em> <a href="http://blog.affien.com/archives/2009/05/30/django-annoyances-no-reverse-select_related/" title="Django annoyances — no reverse select related">cannot prefetch related objects</a> along the inverse edge of a one-to-many relationship, as you can with any sensible ORM), but other than that Django has become a pretty solid toolkit. Also I finally got started with Git, which is as of now my version control system of choice. </p> <p> Hopefully my main project will allow me time to improve Input and the dashboard in the future; there’s a lot of cool stuff planned with it. </p>40 Days to go2010-08-09T06:56:31ZMichael Kurzehttp://www.thefoundation.de/about/michael40-days-go<p>I can't believe in only 40 days my internship will be over. What I've done in the last month in California besides working for Mozilla.</p><h3>The Mozilla Summit</h3> <p> Hundreds of people from all over the world met at the Mozilla Summit, sharing ideas, talks and (in the evening) drinks. Whistler is a beautiful place and although most of the action took place inside of the Chateau Fremont, I still had the opportunity for one or two walks in nature, and even some clubbing (and still made it to the next talk before nine in the morning). On the last night all of us went up to Whistler Peak where we had a crazy party going on. I did not take any photos, but <a href="http://www.flickr.com/search/?q=mozilla+summit" title="Mozilla Summit Photos on Flickr">others did</a>. The only downside to the whole event was that I had to witness the German Football team losing to Spain in the World Cup semifinals. 
</p> <photo slug="summit-polaroid" size="display">All of us interns at Whistler Peak, taken using Josh’s 1970s Polaroid camera.</photo> <h3>Living the Valley Life</h3> <p> Back from the Summit, I did a lot of American Culture stuff. With Mozilla we watched a Baseball Game in San Francisco (the Giants just barely won over the Florida Marlins, with the score tied in the last inning), and we went to the theatres to watch Inception (all in all a pretty good movie). A week ago we took the Caltrain to San Francisco again for a <a href="http://picasaweb.google.com/johnwaynehill/SanFranciscoBarCrawlPolkStreet" title="John Wayne Hill’s Photos of the Bar Crawl">Bar Crawl</a> at Polk Street in Nob Hill. Because the bars close down pretty early in the U.S., the crawl starts at 4pm and everyone is pretty much wasted at 8 already. That was an interesting experience… </p> <photo slug="polk-street-bar-crawl" size="display">Crawling Polk Street, at the Lush Lounge</photo> <h3>Canoeing in Healdsburg</h3> <p> Yesterday we did a canoeing trip in Healdsburg, and I got so sunburnt. It was a lot of fun though. Most of us brought waterguns, the current was pretty easy (we capsized nonetheless, big fun) and the water had just the right temperature for a swim in between. </p> <photo slug="canoeing-trip" size="display">A friendly fellow attached a rope to this tree</photo> <p> A lot of Californians have a summer- or weekend-house along the river, which means that they have a beach pretty much for themselves less than an hour on the 101 from cloudy San Francisco. That is, if you do not count city traffic, where we got stuck for an eternity, with wheelchairs and elderly couples overtaking us on our way to the Freeway. Good opportunity to tune in to some radio. There is exactly one among 150 satellite stations playing alternative and indie music, quite a culture shock! 
<a href="www.ilovemetric.com" title="Metric Website">Metric</a> seems to be pretty cool, and I should also check out Interpol again, as they do remind me of the Editors a lot. </p> <photo slug="interns-golden-gate" size="display">Some of us interns at the Golden Gate Bridge (thanks for the photo, Chris)</photo> <p>The lengthy trip was totally worth it, as I was able to pay my second visit to the Golden Gate bridge since 2007. Now it's time for some Phantom Planet, as we are planning a trip to LA for the next weeks. </p>En Route To Whistler2010-07-06T13:48:05ZMichael Kurzehttp://www.thefoundation.de/about/michaelen-route-whistler<p>The two first weeks as an intern at Mozilla were like a blast. Now we are heading for the worldwide Summit 2010 in Whistler, Mountain View.</p><p><iframe width="400" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com/maps?f=q&amp;source=s_q&amp;hl=en&amp;geocode=&amp;q=Whistler,+British+Columbia,+Canada&amp;sll=37.0625,-95.677068&amp;sspn=44.388698,76.289063&amp;ie=UTF8&amp;hq=&amp;hnear=Whistler,+Squamish-Lillooet+Regional+District,+British+Columbia,+Canada&amp;ll=50.120578,-122.958984&amp;spn=9.033263,19.072266&amp;t=h&amp;z=6&amp;output=embed"></iframe></p> <p> Independence day with fireworks in SF is just over, and we are already heading back to the city, this time for the airport. In the past two weeks there was not much time for blogging: Mozilla interns BBQ, awesome work at the Metrics team, and Germany advancing past the round of eight in the soccer world cup. 
</p> <p> Before I board the plane, I want to share a picture of the custom keyboard I am using at Mozilla: </p> <photo slug="my-keyboard-mozilla" size="display">Mission accomplished!</photo> <h4>Update:</h4> <p>Thanks to <a href="http://blog.finette.co.uk/" title="Pascal Finette">Pascal</a> at Mozilla who gave me a keyboard with German layout.</p>Stirring Kettle2010-07-06T11:45:59ZMichael Kurzehttp://www.thefoundation.de/about/michaelstirring-kettle<p>To work on my intern project at Mozilla, I have learned my way around <a href="http://www.pentaho.com/" title="Pentaho Web Site">Pentaho Data Integration</a> (aka Kettle).</p><h3>What is this data integration anyways?</h3> <p> At any organization as large as Mozilla, enormous amounts of data are generated that are potentially useful when making development decisions: We count downloads, up-to-date checks for the <a href="http://blog.mozilla.com/webdev/tag/blocklist/" title="On the Mozilla Malware Blocklist">malware blocklist</a> and installed addons, and there are webserver logs for the various Mozilla Web sites, the responses for Test Pilot case studies, and so on. </p> <p> Such data pieces are generated in large quantities every day. So before the data can be analyzed by the metrics team, it needs to be grouped by common criteria (e.g. by date, language, region or hour of the day). This way, the data sets are summarized and the amount of data becomes digestible for statistical tools such as <a href="http://www.r-project.org/" title="R Statistics Software Project Site">R</a>, Microsoft Excel or OpenOffice Calc. </p> <p> A data collection that is accessible for analysis in this way is often called a <em>data warehouse</em>, as it persists data and makes it accessible in consumable quantities. <em>Data integration</em> is the task of transforming data from various sources into the data warehouse representation. 
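</p> <p>The grouping step mentioned above can be pictured with a few lines of Python. This is only a sketch of the idea, not how Kettle does it (Kettle uses visual transformation steps); the function <code>summarize</code>, the field names and the sample records are invented for illustration.</p>

```python
from collections import Counter


def summarize(records, keys):
    """Count records per combination of the given fields (a simple GROUP BY)."""
    return Counter(tuple(record[k] for k in keys) for record in records)


# Hypothetical raw events, one per download.
downloads = [
    {"date": "2010-07-01", "language": "de"},
    {"date": "2010-07-01", "language": "de"},
    {"date": "2010-07-01", "language": "en"},
    {"date": "2010-07-02", "language": "en"},
]

per_day = summarize(downloads, ["date"])
per_day_and_language = summarize(downloads, ["date", "language"])
```

<p>Four raw records shrink to two rows per day, or three rows per day and language, which is the kind of reduction that makes the data manageable for a spreadsheet or for R. 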
</p> <h3>How does Kettle help here?</h3> <p> Kettle is a powerful node-based tool to model and implement data integration tasks. It is simple to visually express data transformations without getting lost in a soup of nested function calls and intermingled SQL-statements. Nodes in the graph represent transformation steps, while the edges indicate the flow of records. Each step handles one record at a time, allowing for parallel handling of large data sets. </p> <p> There are lots of predefined steps available to import and export data to and from text/CSV files or database tables, to perform basic calculation, to group and to sort. There are steps for details-lookup, for merging and for splitting of record sets. When manually programmed, such operations require a lot of hand coding and are easy to get wrong. With Kettle, it is just a matter of connecting the right nodes and setting the appropriate configuration. </p> <p> Whenever the predefined steps are insufficient, it is possible to write custom steps in JavaScript, or in Java if performance is a concern. </p> <h3>No downsides?</h3> <p> Of course, there is always room for improvement; with Pentaho Data Integration, especially in the area of usability. On Mac OS X, the visual editor does not use the appropriate shortcuts (it requires you to press ctrl-c instead of cmd-c for copy to clipboard). Also, the node editing is a bit fiddly: The editor for the various transformation steps should be a contextual inspector that automatically shows the editing options for the currently selected steps. Instead, I manually have to open it for each step (that means: all the time). </p> <p> In the end, Kettle is very useful nonetheless: It allows for headless execution of the created transformations, so that the data integration process can be automated. 
And it has very good support for regular expressions, allowing for the parsing of fairly complicated source formats.</p> <p>If you have to deal with the conversion of file and/or database formats on a regular basis, you might want to give Kettle a try. There is a free community edition of PDI available at the Pentaho website. </p> 72 Hours at Mozilla2010-06-24T08:42:50ZMichael Kurzehttp://www.thefoundation.de/about/michael72-hours-mozilla<p>On Monday, my internship at Mozilla started, and I <em>can</em> tell you how great it is. Since everything is open source anyway, I am actually encouraged to blog and talk about my work there.</p><p>Mozilla has more than a hundred people at its headquarters in Mountain View, while even more contributors work from all around the world. There are more than thirty other interns here, involved in various projects from mobile development, to metrics (such as myself), to developer engagement. I am going to write more about the data integration project I am working on, but first I want to give a quick impression of what an internship at Mozilla entails.</p> <p> <iframe width="400" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.de/maps?f=q&amp;source=s_q&amp;hl=de&amp;geocode=&amp;q=mountain+view&amp;sll=37.528778,-78.921737&amp;sspn=21.895083,32.387695&amp;ie=UTF8&amp;hq=&amp;hnear=Mountain+View,+Santa+Clara+County,+Kalifornien,+Vereinigte+Staaten&amp;ll=37.386052,-122.083851&amp;spn=1.374946,2.024231&amp;z=9&amp;iwloc=poi0&amp;output=embed"></iframe></p> <p>There are various benefits that help you to relax when you need to, and to focus on your work otherwise. 
For example, free snacks and drinks (even beer) are provided for everyone, and there are some big-screen TV sets (yes, I watched the German team advance in the World Cup today, wooot), as well as pool and ping pong tables.</p> <p>Also, there are <strong>lots</strong> of small conference rooms, so whenever you need some quiet for work and/or a phone call, you can find a spot, or just hang out at <em>Ten Forward</em>, which is like a bar, a living room and a cinema in one. Funny thing: All conference rooms are either inspired by the Star Trek series, or pick up on classic internet memes. Besides Ten Forward, there are, for example, <em><a href="http://en.wikipedia.org/wiki/Holodeck" title="Wikipedia: Holodeck">Holodeck</a></em> (where new interns are trained), <em><a href="http://en.wikipedia.org/wiki/All_your_base" title="Wikipedia: All your Base">All your Base</a></em> and the <em><a href="http://en.wikipedia.org/wiki/Bikeshed" title="Wikipedia: Parkinson&#x2019;s Law of Triviality">Bikeshed</a></em>.</p> <p>Every Monday after the all-hands meeting there is free food for everyone, and every Wednesday evening there is Intern Movie Night (we watched <a href="http://www.imdb.com/title/tt0390384/" title="IMDB: Primer">Primer</a> tonight: great, but mind-boggling). Yesterday, I was at lunch with the Metrics Team: There is some great <a href="http://www.thecantankerousfish.com/">seafood</a> to be had in Mountain View.</p> <p>In July, there will be the Mozilla Summit in Whistler, Canada. I will make sure to write about that, too.</p> First day in California2010-06-21T00:04:19ZMichael Kurzehttp://www.thefoundation.de/about/michaelfirst-day-california<p>After nineteen hours of travel, with stops in London and Los Angeles, yesterday I arrived in Mountain View, CA. It is a beautiful place.</p><p> Mozilla takes great care of the interns, so now I am living in the Oakwood apartments, roughly a mile from the Mozilla offices. 
There are three other interns living with me at the apartment (one from the Bay Area, two from other parts of the States), and there are about two hundred interns at the complex, most of them interning at Facebook. The weather is warm and sunny at about 70° F (about 22° C). </p> <p> When I was looking for my roommates (they were out when I arrived), I had what I am going to think of as a stereotypical Silicon Valley experience: The first guy I asked invited me to his BBQ. Turns out he works at Google, and all his guests were Stanford graduates. Of course, one of them is just now getting his startup going. Another one of them actually worked on the <a href="http://djangoproject.com">Django project</a> in Lawrence, Kansas together with the folks around <a href="http://holovaty.com/">Adrian Holovaty</a>, <a href="http://jacobian.org/">Jacob Kaplan-Moss</a> and <a href="http://simonwillison.net/">Simon Willison</a>. So that was absolutely great. </p> <p> Today we visited the farmers market, where strawberries cost half as much as in Germany and are four times as big. We just went out for some beef and vegetables, which we will put onto the grill later. I can't wait for tomorrow, when I will be meeting the people over at Mozilla... </p>Going to Mozilla2010-06-18T17:07:33ZMichael Kurzehttp://www.thefoundation.de/about/michaelgoing-mozilla<p>Starting on Monday, June 21, I am going to intern at the <a href="http://www.mozilla.com" title="mozilla.com">Mozilla Corporation</a> (MoCo) in Mountain View, California. Yay!</p><p> For quite some time I have been following the <a href="http://mozillazine.org" title="mozillaZine">mozine</a> and later <a href="http://planet.mozilla.org" title="Planet Mozilla">PMO</a>. So I am absolutely thrilled to have this opportunity, and this also means that this blog will get a new <a href="/michael/on/mozilla" title="Articles on Mozilla">topic</a> added. 
Not only will I get to know many more interns with whom I am going to live in <a href="http://en.wikipedia.org/wiki/Mountain_View,_California" title="Mountain View (Wikipedia)">Mountain View</a>, and participate in the Mozilla project together with all the great people at the MoCo HQ, but I will also be attending the Mozilla Summit, the biennial meeting of people from all over the world who made great projects such as the Firefox web browser and <a href="http://addons.mozilla.org" title="Mozilla Addons">AMO</a> possible. </p> <p> My internship position will be at the <a href="http://blog.mozilla.com/metrics/" title="Mozilla Blog of Metrics">metrics department</a> led by Ken Kovash, and quite probably I will be allowed to go into the details of my project there, either at this blog or at a Mozilla blog. </p> <photo slug="leaving-aachen" size="display">Leaving for CA</photo> <p> If you plan to go abroad to the U.S. for an internship, I suggest you apply for the internship position(s) of your choice at least two months before the actual start of the internship. I was a bit late to the party and that led to a rather tight schedule: As a Germany-based student at RWTH Aachen University, I had to invest some time in getting the visa. But fortunately there is a very helpful <a href="http://cicdgo.com/" title="CICD">visa sponsoring partner</a>, so everything went smoothly after all. </p> <p> I do not know about other areas, but as a student in computer science you can expect compensation for an internship in the U.S., which is not necessarily the case in Germany. I applied at two organizations, and in both cases their offers covered living expenses and the flight to California. So I really do recommend that next spring you visit the web site of any company or organization you always wanted to get to know, and apply for an internship there. 
Make sure that the professional and academic experience on your resume matches the position you apply for, and prepare for two to three phone interviews. </p> OneSocialWeb: more than Jabber for Apps2010-04-30T14:00:21ZMichael Kurzehttp://www.thefoundation.de/about/michaelonesocialweb-more-than-jabber<p>Almost a month ago, the presentation of <a href="http://onesocialweb.org" title="OneSocialWeb">OneSocialWeb</a> at the Android developers conference <a href="http://droidcon.be" title="Droidcon 2010 Belgium">droidcon.be</a> was one of the most interesting talks there. Recently the XMPP-centric framework has gone open source.</p><p> Last year a group of fellow students and I were tasked with creating an Android application to organize meetings spontaneously (think something like doodle, only mobile and more short-term). At that time we were thinking about using <a href="http://de.wikipedia.org/wiki/Extensible_Messaging_and_Presence_Protocol" title="Extensible Messaging and Presence Protocol">XMPP</a> for real-time communication, but were hesitant because of the time this would cost us to implement. In the end we used a traditional REST-based web service rather than a peer-to-peer system. Luckily there already is an effort underway which is called OneSocialWeb, funded by the telco provider Vodafone. It allows people to work together on Java objects using XMPP. </p> <p> This means that all the work that has to do with XMPP protocol handling and conflict management will be handled by this abstraction layer, while we developers can focus on delivering useful applications. You could use this for simple things like associating chat conversations with arbitrary objects in your application. 
You could also try to model your entire application domain around this collaboration: In his presentation at droidcon <a href="http://eschnou.com/" title="Blog of Laurent Eschenauer">Laurent Eschenauer</a> demonstrated this using a collaborative shopping list where each participant could check off items, notifying the others immediately. </p> <p> The Android platform with its service model might really help in getting this concept to work, as XMPP protocol handling could be handled by one central service, dispatching updates to any interested Activities. This could well become the bidirectional, decentralized alternative to Apple&#x2019;s Mobile Push service. </p>And Apos Semicolon: A Cathapostrophe2010-03-25T16:32:32ZMichael Kurzehttp://www.thefoundation.de/about/michaeland-apos-semicolon-cathapostrophe<p>This morning on Facebook syndication, I reviewed the <a href="http://www.thefoundation.de/michael/2010/mar/24/thoughts-on-android-platform/" title="Thoughts on the Android">article on android</a> that I wrote yesterday. And one of the few HTML-incompatible XHTML properties assaulted my eyes, in the shape of a bunch of entity references.</p><p>Specifically, I had escaped the <em>typewriter apostrophe (&#x0027;)</em> using named entity reference syntax (&amp;apos;). Unfortunately, I had forgotten that, while this entity is defined by XHTML 1.0, it is actually illegal in plain ol&#x2019; HTML. This should not have been a problem, as these pages are served using the XHTML 1.0 doctype where &amp;apos; points to the Unicode code point 0x27, which lets you write an apostrophe inside attributes delimited by single quotes. </p> <p>The Django RSS framework however would put a plain &quot;html&quot; content-type into the Atom feed, so the references to the apostrophe remained unresolved when the feed readers converted my contents for display. 
Instead, they correctly escaped the ampersand, which led to a lot of ugly entity references on my Facebook feed.</p> <p>So for now I am going to reference the apostrophe using the Unicode code point reference &amp;#x2019; (<em>punctuation apostrophe: &#x2019;</em>) which is actually recommended over the ASCII-compatible &amp;#x0027; (<em>typewriter apostrophe: &#x0027;</em>). Strictly speaking, I would not even need to use any entity reference here, as the code point 0x2019 has no special meaning in XML syntax. Next I need to figure out if there is a way to configure the <a href="http://docs.djangoproject.com/en/dev/ref/contrib/syndication/" title="The Django syndication feeds framework">Django feeds framework</a> to use XHTML as a content type for Atom feeds and to check if the results are real-world-compatible.</p> <p>But really, this just shows once more that it is absolutely inhumane to edit XHTML by hand. So I&#x2019;ll be looking for a suitable <a href="http://en.wikipedia.org/wiki/WYSIWYM#In_web_environments" title="What you see is what you mean">WYSIWYM</a> editor to maybe handle this stuff.</p>Thoughts on the Android2010-03-24T13:45:36ZMichael Kurzehttp://www.thefoundation.de/about/michaelthoughts-on-android-platform<p>Nearly finished with an Android app that some fellow students and I built for the mobile lab at RWTH University (simple foursquare-like thing: maps, contacts, web services...). I am going to collect some of the things that I liked about the Android platform and stuff that annoyed me when working with it.</p><p> First of all, let&#x2019;s state that it was fairly easy to get our application up and running. There were simply no Android problems that were too obscure or too complicated to deal with. Compared to the intricate hells of other Java environments, such as J2EE (think class loaders, think various closed source implementations...) I&#x2019;d say it was an easy ride for the most part. 
</p> <h3> The good parts </h3> <p> There was some stuff I really liked, plus everything to do with performance, as they obviously really thought about that, compared to, say, Sun when they introduced their first JVM. </p> <ul> <li> <h4> generated resource identifiers </h4> <p> No need to mess around with strings in order to access resources. You define each resource (localized string, layout component id, image) in an XML file, and an ID (as a public static final int) is generated for you. This yields performance benefits and some compile time safety. </p> </li> <li> <h4> Eclipse integration </h4> <p> The Android SDK does <em>not</em> depend on using Eclipse, which is great if you like to learn what&#x2019;s actually going on and you know your way around the shell. But today&#x2019;s developers are usually bred with Eclipse support built right in (kiddin&#x2019;), so the Eclipse plugin really helps as it calls the tools for you (e.g. to regenerate resource identifiers) so you don&#x2019;t forget. </p> </li> <li> <h4> layout tools </h4> <p> Compared to the Interface Builder for iPhone, the XML based layout definitions need to allow for more flexible layouts, as Android apps are not tailored to a specific screen size or orientation. They do a fairly good job of this, and whenever you are not satisfied you can simply define overriding layouts for specific devices or screen dimensions. The layouts also have an advantage over CSS based layouts (Palm Pre) because you can easily fill horizontally or vertically or align to the bottom of the screen (that is just annoying with CSS). </p> </li> <li> <h4> The APIs </h4> <p> Batteries are included: There is JSON support, various UI components, the mature HTTP libraries from Apache Commons, and everything from the good old JavaSE that is really useful. 
The awesome <a href="http://developer.android.com/reference/android/webkit/WebView.html" title="Android WebView documentation">WebView</a> lets JavaScript running inside the view interact with your Java code really well! </p> </li> </ul> <h3> The bad parts </h3> <ul> <li> <h4> The APIs: JSON </h4> <p> Yes, again. Some of the APIs seem to me like they had to freeze them in a hurry. The JSON API is only <em>similar</em> to the one you get from <a href="http://www.json.org" title="JSON Website">json.org</a> (an older, incompatible version). We wanted our serialization layer to live in a separate project from the Android client, so I actually had to get the JSON sources from the Android source repository to downgrade our non-Android implementation to use the platform library.<br> The platform developers should offer standalone packages of all 3rd party libraries they include, <em>in the version that got shipped with Android</em>. </p> </li> <li> <h4> The APIs: Google Maps </h4>Compared to the awesome JavaScript API that you can use when you embed Google Maps into a webapp (or a Palm Pre app), the Android Maps API is just tiny. We wanted to set draggable markers and to reverse <a href="http://en.wikipedia.org/wiki/Geocoding" title="Wikipedia: Geocoding">geocode</a> the selected location, so we just used an Android WebView and embedded a Google Map from a web site of ours. That is a tiny bit slower, but so much more powerful to just <abbr title="get things done">GTD</abbr>. While doing that I noticed again how quickly you can get stuff going with just JavaScript and browser refresh compared to Java&#x2019;s holy "compile, deploy, navigate to screen" trinity of UI development. </li> <li> <h4> The APIs: HTTP </h4>Honestly, I could not believe that there was no simple callback-based HTTP abstraction layer over Apache Commons. Have you guys learned nothing from the success of AJAX? 
I mean, being with Google, they probably should have, but anyway: It is much too painful to access web services. We had to write our own layer for that (and no, I don&#x2019;t want to go all SOAP or even WSDL, I just want to get some lists or some image from our server...), instantiating thread pools, managing listeners and such. Perhaps I should use a WebView for that too, it can do that, you know... </li> <li> <h4> Internationalization </h4> <p> Perhaps I missed something there, but I18N can get really messy because a missing translation will cause no errors before runtime. This would ideally have a spreadsheet based editor or something like that, so you spot missing translations right away. Also, there are still some APIs (<a href="http://developer.android.com/reference/android/app/ProgressDialog.html#setMessage%28java.lang.CharSequence%29" title="ProgressDialog.setMessage">ProgressDialog.setMessage</a>) that take strings where numeric identifiers should be used. Not sure about the reason, but this can lead to untranslated UI quickly (and did in our case...). </p> </li> <li> <h4> The date and time pickers </h4> <p> This is mainly about the usability. Compared to the iPhone, these simply suck. The rolling barrels that the iPhone uses are probably patented, but it must be possible to make something that works more quickly than this. </p> </li> </ul> <p> All in all it was interesting to work with Android. I must look into JSON serialization some more, as the various Java libraries for automatic serialization (gson, json-simple, XStream, Jackson) that I glanced at when the json.org one annoyed me all seemed to have serious disadvantages, such as requiring you to change your model, requiring lots of glue code, or working badly with object graphs containing cycles. But that is another topic, for another time. 
</p> Google Earth Plug-in and Shiretoko2009-05-10T15:11:16ZMichael Kurzehttp://www.thefoundation.de/about/michaelgoogle-earth-plugin-firefox-3_5-beta<p>While not officially supported, the Google Earth Browser Plug-in seems to work just fine with nightly builds of Mozilla Firefox 3.5, codename <em>Shiretoko</em>. Here is a simple hint to get it up and running.</p><p> When accessing the <a href="http://code.google.com/apis/earth/">plug-in homepage</a> using Shiretoko, Google tells you that your browser version is not supported. Of course there is a reason for that. Messing around with untested software, especially with plug-ins, might hurt your user experience or data in ways I cannot predict here. </p> <p> That said, the <a href="http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-1.9.1/">recent nightly builds</a> on the Firefox 3.5 / Mozilla 1.9.1 branch seem to be pretty stable, at least on Mac OS 10.5. To get it running, you have to fire up <em>about:config</em> and change the key <em>general.useragent.extra.firefox</em> to something like <em>Firefox/3.0.10</em>. Now reload the plug-in homepage, then download and install the Google Earth Browser Plug-in. After that, you have to keep the modified user-agent or the plug-in will cease to work. </p> <p> Keep in mind that this might affect which add-on versions will be offered to you at <a href="http://addons.mozilla.com" title="Firefox Add-ons">addons.mozilla.org</a>. Speaking of which, I recently tried out the <a href="https://addons.mozilla.org/en-US/firefox/addon/5203" title="Minimap Sidebar :: Firefox Addons">Minimap Sidebar</a> that kind of gives your browser a Google Earth in the sidebar (or in a full tab). Really great: It allows you to switch between Google Maps, Google Earth and the awesome <a href="http://www.openstreetmap.org">OpenStreetMap</a> at any time without losing position or bookmarks. 
It also integrates with various other location based services such as <a href="http://wikimapia.org/#lat=40.7683217&lon=-73.9513779&z=13&l=5&m=a&v=2" title="WikiMapia: Manhattan">WikiMapia</a> or <a href="http://loc.alize.us/#/geo:50.774112,6.081276,15,k/" title="Aachen in loc.alize.us">loc.alize.us</a>. </p> It appears I was wrong2009-04-08T23:18:11ZMichael Kurzehttp://www.thefoundation.de/about/michaelit-appears-i-was-wrong<p>Yesterday, Google <a href="http://googleappengine.blogspot.com/2009/04/seriously-this-time-new-language-on-app.html" title="Seriously this time, the new language on App Engine: Java™">announced</a> the availability of Java as the new programming language for the App Engine, refuting <a href="http://www.thefoundation.de/michael/2008/sep/21/javascript-next-app-engine-language/" title="Is JavaScript The Next App Engine Language?">my guess</a> from last year that it might be JavaScript, though of course not entirely.</p><p> If you take the demand of the users into account, Java is of course the right choice as the next language. It might well have alienated lots of Java developers if a niche language of the server zoo such as JavaScript had emerged first. Additionally, Java support is just one step short of full (albeit sandboxed) <abbr title="Java Virtual Machine">JVM</abbr> support. </p> <p> To allow for sandboxing, Google wraps some of its own <abbr title="Application Programmer Interface">API</abbr>s into Java SE or <abbr title="Java Specification Request">JSR</abbr>-standardized services such as <tt>javax.mail</tt> and <tt>java.net.URL</tt>. Also, there is a <a href="http://code.google.com/appengine/docs/java/jrewhitelist.html" title="The JRE Class White List">white-list</a> currently containing 1323 of the 3700+ <a title="Overview (Java Platform SE 6)" href="http://java.sun.com/javase/6/docs/api/">Java SE 6</a> classes. Most of the classes that are not available are from the Swing and AWT suites which a web developer will not need anyway. 
Instead, Google provides the homegrown <abbr title="Google Web Toolkit">GWT</abbr>. </p> <p> Thanks to the free and open source <a href="http://www.mozilla.org/rhino/" title="Mozilla Rhino: JavaScript for Java">Rhino</a> JavaScript interpreter written in Java, server side JavaScript on the App Engine is rather easy to achieve now. I guess I might have to check it out and report back about it later, so I just signed up for the Java technology preview on my App Engine account. </p> <p>There are <a title="Campfire One: App Engine Redux" href="http://www.youtube.com/view_play_list?p=DFDBB63922B90A70">some videos</a> from the Google Campfire event over at YouTube. Most of the time they are rather interesting, plus Kevin Gibbs does a pretty decent imitation of Steve Jobs during the presentation (intentional or not).</p>