Coder Thoughts on software, technology and programming.

Piotr Mionskowski

  • Guide to javascript generators

    15 June 2014 javascript, nodejs, ES6, and generators

    For a while now I've been hearing and reading about new exciting features of ES6 like arrow functions, array comprehensions, classes, modules, generators and many more. While many of them are easy to understand, generators can take a while to grok. So I thought it would be a good idea to write some samples to master this particular feature.

    What are generators

    Generators, or to be more precise generator functions, are functions which can be exited and later re-entered with their execution context (variable bindings) preserved - a limited version of coroutines, if you will. Traditionally they are used to write iterators more easily, but of course they have many more interesting use cases.

    To run the following samples your node version has to support harmony generators (you will most likely also need to start node with the --harmony flag). You can check that with node --v8-options | grep harmony. The latest unstable version, 0.11.12, supports them.

    A simple generator

    Here's a very simple example of a generator that yields only two values; when executed it will output:

    {value: 1, done: false}
    {value: 2, done: false}
    {value: undefined, done: true}
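
    A minimal sketch of a generator producing that output, assuming two plain yield statements and three next() calls (the names are illustrative, not the original sample):

    function *twoValues(){
      yield 1;
      yield 2;
    }

    var iterator = twoValues();
    console.log(iterator.next());
    console.log(iterator.next());
    console.log(iterator.next());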
    

    A Fibonacci generator

    Of course, since we now have an easy way to create iterators, we also need a way to iterate over them. One approach is to just use a while loop:
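
    A minimal sketch, assuming a fibo generator bounded at an arbitrary limit and a manual loop over its iterator (the names and the bound are illustrative):

    function *fibo(){
      // yields Fibonacci numbers below an arbitrary bound chosen for this sketch
      var previous = 0, current = 1;
      while(current < 100){
        yield current;
        var next = previous + current;
        previous = current;
        current = next;
      }
    }

    var iterator = fibo();
    var item = iterator.next();
    while(!item.done){
      console.log(item.value);
      item = iterator.next();
    }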

    but honestly this would be a step back. Fortunately ES6 comes with a new kind of for-of loop designed specifically to work with iterators:

    for(var current of fibo()){
      console.log(current);
    }
    

    Iterators and iterables

    I've mentioned the word iterator a couple of times, but what exactly is it? In ES6 an iterator is an object that has a next method which, when called, returns an object with 2 properties:

    • value - the value returned by the iterator
    • done - a boolean indicating whether the iterator has done its job or not

    As we saw in the above generator examples, every generator is an implicit iterator, thus we can loop through it using the new for-of syntax.

    There is one more object type it is good to know of - iterable. According to the ES6 wiki an object is iterable if it has an internal property named iterator which is a function returning an iterator. Unfortunately the latest specs aren't yet implemented in v8 (or incorporated into node.js), so the following example is only my understanding of the specs:

    var iterable = {};
    iterable[Symbol.iterator] = function(){
        console.log('calling Symbol.iterator')
        return (function *(){
          yield 1;
          yield 2;
        }());
    };
    
    for(var item of iterable){
      console.log(item);
    }
    

    which would print:

    calling Symbol.iterator
    1
    2
    

    Passing values to generators

    So far we've covered the most simplistic usage of generators; for those familiar with .NET it resembles the yield keyword introduced in C# 2.0 back in 2005. Since the design of ES6 generators was influenced a lot by Python generators, no wonder we can also pass values to generator functions. You do it by passing a value to the next function call:
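
    A minimal sketch of such an oracle generator and its caller, assuming the answers are just random numbers (the names are illustrative, not the original sample):

    function *oracle(){
      while(true){
        // whatever question is passed to next() ends up here
        var question = yield Math.random();
      }
    }

    var askOracle = oracle();
    askOracle.next('Will you answer me?'); // starts the generator - this value is ignored
    console.log('How old are you?', 'oracle says: ', askOracle.next('How old are you?').value);
    console.log('Why?', 'oracle says: ', askOracle.next('Why?').value);
    console.log('Thank you!');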

    Running it should output something similar to:

    How old are you? oracle says:  0.760754773626104
    Why? oracle says:  0.36784305213950574
    Thank you!
    

    A careful reader will notice that the first question is silently ignored by our oracle generator. This is logical, although it might seem strange at first. The generator function body isn't executed until we call next for the first time - which is like starting the generator. That call to next executes the function body until it encounters yield - the control flow changes and the function is left, to be re-entered later on a future next call. Notice that there is no logical place to store/assign the value passed to the first next() call - a value passed to the first next() call is simply ignored.

    Thanks to @NOtherDev for pointing that out.

    Generators error handling

    Now that we know how to yield values and receive them inside generators, it's equally important to be able to handle exceptions gracefully. Since generators execute synchronously, the semantics of a typical try {} catch {} should be preserved. A generator that throws after yielding its first value would output something like:

    Generator entered, yielding first value
    Generator asked for second value - will go bum
    [Error: You can only call this generator once, sorry!]
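
    A minimal sketch of a generator and caller that could produce the output above (the names are illustrative; throwingGenerator matches the snippet below):

    function *oneTimeGenerator(){
      console.log('Generator entered, yielding first value');
      yield 1;
      console.log('Generator asked for second value - will go bum');
      throw new Error('You can only call this generator once, sorry!');
    }

    var throwingGenerator = oneTimeGenerator();
    try {
      throwingGenerator.next();
      throwingGenerator.next();
    } catch(error){
      console.log(error);
    }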
    

    It's important to note, however, that if you try to call the next method on a generator that has already thrown an exception you'll get another error:

    throwingGenerator.next();
                      ^
    Error: Generator is already running
        at GeneratorFunctionPrototype.next (native)
    

    Now this was a case where a generator has thrown an exception at the caller. What if we would like to tell the generator that an error occurred at the caller's site? Thanks to the generator throw function that is easy enough:
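
    A minimal sketch, assuming a generator that wraps its yield loop in a try/catch and a caller that uses throw (the names are illustrative):

    function *stoppable(){
      console.log('I will stop when you tell me about error');
      try {
        while(true){
          var value = yield;
          console.log('Got value from you:', value);
        }
      } catch(error){
        console.log('Got error from you:', error);
      }
    }

    var generator = stoppable();
    generator.next();                             // start the generator
    generator.next('I will throw in a moment!');  // handled by the yield inside the loop
    generator.throw('This is the end. Goodbye!'); // caught by the try/catch inside the generator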

    This as you may expect will produce:

    I will stop when you tell me about error
    Got value from you: I will throw in a moment!
    Got error from you: This is the end. Goodbye!
    

    Practical generator usage

    An obvious use case is of course creating lazily evaluated collections. One can now very easily implement collection operators similar to those available in LINQ. Consider the following examples, which when executed will output (note the order of the DEBUG calls):

    DEBUG: concat yielding first generator values
    DEBUG: reverse will yield 6
    DEBUG: filter predicate value true for item 6
    DEBUG: take will yield  6 counter 1
    DEBUG: select will yield Even number 6
    Even number 6
    DEBUG: reverse will yield 5
    DEBUG: filter predicate value false for item 5
    DEBUG: reverse will yield 4
    DEBUG: filter predicate value true for item 4
    DEBUG: take will yield  4 counter 2
    DEBUG: select will yield Even number 4
    Even number 4
    DEBUG: concat yielding second generator values
    At the end
    

    Note that the concat function implementation uses yield *otherGenerator() to yield values yielded by another generator, which greatly reduces the boilerplate code needed otherwise.
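
    A minimal sketch of what a few such lazy operators could look like, using yield * for delegation (the names mirror the DEBUG output above, but this is not the original implementation):

    function *filter(iterable, predicate){
      for(var item of iterable){
        var matches = predicate(item);
        console.log('DEBUG: filter predicate value', matches, 'for item', item);
        if(matches){
          yield item;
        }
      }
    }

    function *take(iterable, count){
      var counter = 0;
      for(var item of iterable){
        counter++;
        console.log('DEBUG: take will yield ', item, 'counter', counter);
        yield item;
        if(counter >= count){
          return;
        }
      }
    }

    function *concat(first, second){
      console.log('DEBUG: concat yielding first generator values');
      yield *first;
      console.log('DEBUG: concat yielding second generator values');
      yield *second;
    }

    The reverse and select operators from the output above would follow the same pattern.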

    Simplifying asynchronous operations

    It may at first be a surprise that generators can streamline asynchronous code - how would a synchronously executing yield help in managing asynchronous calls? If you think of the way we write asynchronous code in most mainstream languages, it typically boils down to one of:

    • passing a callback function(s) that will be called when the operation completes
    • immediately getting back a promise of result from async function
    • using a library similar to Reactive Extensions - which, when it comes to asynchronous code, is somewhat similar in usage to promises but offers a very rich API

    Now when you look at code consuming an asynchronous result:

    var power = function(number, exponent){
      var deferred = Q.defer();
      setTimeout(function(){
        deferred.resolve(Math.pow(number, exponent))
      }, 300);
      return deferred.promise;
    };
    
    power(2,3).then(function(result){
      console.log(result);
    });
    

    I immediately translate promise calls in my head to a more natural flow:

    var result = _wait_for_async_ power(2,3);
    console.log(result);
    

    Now, since we don't have an async feature like the one C# 5 has (or rather we don't have it yet), it's not yet possible to achieve this kind of simplification. Still, if you think of how the imaginary _wait_for_async_ keyword would work, it seems that (in case of a promise-based implementation) it:

    • would wait for the promise to complete in a non blocking fashion - possibly leaving function execution context
    • when the promise completes it would return to the exact place from where it was called and return value that it got from promise or throw an exception

    Essentially that's the way the yield keyword works - however we still need umbrella machinery that will take care of calling the generator's next method so it returns to the original caller, as well as its throw method to report exceptions.

    As you may have guessed there are already plenty of implementations of this machinery - the one I like most is co. Take a look at an example from their page and enjoy a callback-free life!
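
    For example, using the power function from above, a co-based version might look roughly like this (a sketch; the exact calling convention depends on the co version - recent versions return a promise):

    var co = require('co');

    co(function *(){
      // yield the promise returned by power and get the resolved value back
      var result = yield power(2, 3);
      console.log(result); // 8
    });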

  • Setting up fluentd elasticsearch and Kibana on azure

    08 June 2014 fluentd, kibana, and elasticsearch

    A decent way to go through log files can come in handy in various situations: from debugging, through investigating problems on a production environment, to figuring out where performance bottlenecks are. For simple cases, when there is only one log file involved, you can go very far using grep/awk/sed magic or any decent editor capable of handling regexes in large files.

    It becomes tricky, however, when your logs are scattered across several files, maybe use different log formats or stamp entries with different timezones. Fortunately there are services like Splunk, Loggly and many others that take away a great amount of the pain involved in aggregating distributed log entries and searching through them.

    I was once searching for a free alternative that I could quickly get up and running in our test environment, where we had 10+ machines running 20+ web sites or services that were all part of a single product. That's when I found out about Kibana, logstash and how they make use of elasticsearch.

    Back then I considered using fluentd instead of logstash, however most of the machines were running Windows and that is not where fluentd shines. Nowadays I'm mostly using Linux VMs, that's why I decided to use fluentd.

    Set up a vm for elasticsearch

    We need a machine that will store our aggregated log entries, so let's create one using the Azure Cross-Platform Command-Line Interface. After installing it with npm install azure-cli -g we need to introduce ourselves to it. The simplest way to do this is to issue azure account download, which will open a browser to download your profile. After that you'll need to import the downloaded file with azure account import /path/to/your/credentials.publishsettings.

    I'm used to Ubuntu, so let's find a machine image to use:

    azure vm image list | grep Ubuntu-14
    data:    b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140122.1-alpha2-en-us-30GB                          Public    Linux  
    data:    b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140226.1-beta1-en-us-30GB                           Public    Linux  
    data:    b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140326-beta2-en-us-30GB                             Public    Linux  
    data:    b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140414-en-us-30GB                                   Public    Linux  
    data:    b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140416.1-en-us-30GB                                 Public    Linux  
    data:    b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140528-en-us-30GB                                   Public    Linux  
    

    The last one looks the most recent, so we'll use that. Next we need to create a virtual machine. We can use either password authentication (which isn't recommended) or an ssh certificate. To use the latter we need to create a certificate key pair first. The openssl command line tool will do it for us:

    openssl req -x509 -nodes -days 730 -newkey rsa:2048 -keyout ~/.ssh/mymachine.key -out ~/.ssh/mymachine.pem
    

    It's now time to ask azure to create a new Ubuntu machine instance for us:

    azure vm create machine-dns-name b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140528-en-us-30GB machineuser \
      --vm-size extrasmall \
      --location 'North Europe' \
      --ssh 22 --no-ssh-password --ssh-cert ~/.ssh/mymachine.pem
    
    info:    Executing command vm create
    + Looking up image b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-14_04-LTS-amd64-server-20140528-en-us-30GB
    + Looking up cloud service
    + Creating cloud service
    + Retrieving storage accounts
    + Configuring certificate
    + Creating VM
    info:    vm create command OK
    

    We can find out more details about our newly created machine with azure vm list --dns-name machine-dns-name --json. Our machine is up and running, so we can log into it using ssh (before you do that, be sure to change the .key file permissions to 0600 so that only you can access it): ssh -i ~/.ssh/mymachine.key machineuser@machine-dns-name.cloudapp.net

    Install elasticsearch

    First let's download elasticsearch and extract it to our newly created vm:

    curl https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.0.zip -o elasticsearch-1.2.0.zip
    unzip elasticsearch-1.2.0.zip
    

    To run elasticsearch we'll need Java, version 1.7 or newer:

    sudo apt-get update
    sudo apt-get install default-jre
    

    Now we can finally run elasticsearch:

    ./elasticsearch-1.2.0/bin/elasticsearch
    

    If we want to run elasticsearch as a daemon we can pass -d to the script (see below for an upstart script).

    We also need to open a port for elasticsearch:

    azure vm endpoint create --endpoint-name elasticsearch machine-dns-name 9200 9200
    

    We only want our machines to be able to push logs to elasticsearch; that's why we need to add ACL rules to the endpoint we created above. Unfortunately I couldn't find a way to do this through the CLI, so we need to resort to the web interface.

    Install Kibana

    Kibana is a web application designed to perform elasticsearch searches and display them using a neat interface. It communicates with elasticsearch directly, that's why it's not really suited to be deployed to a publicly accessible place. Fortunately there is kibana-proxy, which provides authentication on top of it. To run kibana-proxy we'll need node.js, so let's install both of them:

    sudo apt-get update
    sudo apt-get install git
    sudo apt-get install nodejs
    sudo apt-get install npm
    
    git clone https://github.com/hmalphettes/kibana-proxy.git
    cd kibana-proxy
    git submodule init
    git submodule update
    npm install
    node app.js &
    

    kibana-proxy is now running, but in order to access it from outside we need to open a firewall port:

    azure vm endpoint create --endpoint-name kibana-proxy machine-dns-name 80 3003
    

    Now when you point your browser to http://machine-dns-name.cloudapp.net/ you should see the kibana welcome screen.

    Restrict access to kibana-proxy

    You probably don't want anyone to be able to see your logs; that's why we need to set up authentication. kibana-proxy implements this through Google OAuth, so let's enable it for our installation. First we'll need an APP_ID (Client ID) and APP_SECRET (Client secret), so let's go to the Google developer console and create a new project. Next we need to enable OAuth for our project in APIs & auth > Credentials.

    Make sure the AUTHORIZED REDIRECT URI looks like http://machine-dns-name.cloudapp.net/auth/google/callback. For a more secure setup you should be using https, but for simplicity we'll skip that.

    Now we need to start kibana-proxy again, this time passing it APP_ID and APP_SECRET along with AUTHORIZED_EMAILS through environment variables (kibana-proxy-start.sh):

    export APP_ID=fill_me_appid
    export APP_SECRET=fill_me_app_secret
    export AUTHORIZED_EMAILS=my_email_list
    exec /usr/bin/nodejs app.js
    

    Now when you go to http://machine-dns-name.cloudapp.net/ you should get redirected to the Google authorization page.

    We want to run kibana-proxy continuously - the simplest way to achieve that is to make use of the forever module. However, in this guide we'll use the upstart scripts described below.
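
    If you do go the forever route, it boils down to something like this (a sketch, not used in the rest of this guide):

    sudo npm install -g forever
    forever start app.js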

    Fluentd installation

    To push log entries to our elasticsearch instance we need the fluentd agent installed on the machine we want to gather logs from. The fluentd agent comes in 2 forms: the fluentd gem, if you wish to have more control, features and updates, and td-agent, if you care for stability over new features. We'll go with the fluentd gem for this guide.

    Fluentd is implemented in Ruby, with the most performance-critical parts written in C. To install it we need Ruby first, but obtaining it with rvm is a piece of cake:

    curl -sSL https://get.rvm.io | bash -s stable
    source ~/.rvm/scripts/rvm
    rvm install 2.1.1
    sudo gem install fluentd --no-ri --no-rdoc
    sudo fluentd -s
    

    Now we have fluentd installed with sample configuration available in /etc/fluentd/fluent.conf.

    Since we would like our log entries to be pushed to elasticsearch we need plugins for that too:

    gem install fluent-plugin-elasticsearch
    gem install fluent-plugin-tail-multiline

    Let's edit the fluent.conf file so that our agent reads sample log files from a Rails app and forwards them to the elasticsearch instance:

    <source>
      type tail_multiline
      format /^(?<level_short>[^,]*), \[(?<time>[^\s]*) .(?<request_id>\d+)\]\s+(?<level>[A-Z]+)[-\s:]*(?<message>.*)/
      path /path/to/log/file/staging.log
      pos_file /path/to/positions/file/fluentd_tail.pos
      tag app_name.environment
    </source>
    
    <match app_name.**>
      type elasticsearch
      host machine-dns-name.cloudapp.net
      logstash_format true
      flush_interval 10s # for testing
    </match>
    

    The format we see above will match the following line of a Rails log file:

    I, [2014-05-22T12:24:44.434963 #11837]  INFO -- : Completed 200 OK in 884ms (Views: 395.7ms | ActiveRecord: 115.0ms)
    

    and break it down into the following groups/fields:

    Key          Value
    time         2014/05/22 12:24:44
    level_short  I
    request_id   11837
    level        INFO
    message      Completed 200 OK in 884ms (Views: 395.7ms | ActiveRecord: 115.0ms)

    Note that although the built-in in_tail plugin supports multiline log entries, I either don't completely understand how to use it or there is no way to have a full stack trace logged as a single message. Moreover, tail_multiline doesn't seem to support multiple input paths; that's why, in order to monitor multiple files, you have to duplicate the source section.

    The only thing left is to start fluentd and watch our logs:

    rvmsudo fluentd -d
    

    Upstart for fluentd, elasticsearch and kibana-proxy

    The final thing we need to do is to set up the above components to run as services. Since we're using Ubuntu for our servers, we'll use upstart configurations for that.

    Upstart for fluentd

    Since we have installed fluentd in the ubuntu user's rvm, we have to generate a wrapper script for it:

    rvm wrapper default fluentd
    

    and then use the generated wrapper ~/.rvm/wrappers/default/fluentd in our upstart script. Let's create our upstart script and put it inside /etc/init/fluentd.conf:

    description "Fluentd"
    start on runlevel [2345] # All except below
    stop on runlevel [016] # Halt, Single-User, Reboot
    respawn
    respawn limit 5 60 # Restart if ended abruptly
    exec sudo -u ubuntu /home/ubuntu/.rvm/wrappers/default/fluentd
    

    Upstart for elasticsearch

    Elasticsearch already comes with a script for starting it in a properly configured JVM; we'll use it in our upstart config /etc/init/elasticsearch.conf:

    description "Elasticsearch"
    start on runlevel [2345] # All except below
    stop on runlevel [016] # Halt, Single-User, Reboot
    respawn
    respawn limit 5 60 # Restart if ended abruptly
    setuid ubuntu
    exec /path/to/your/elasticsearch/bin/elasticsearch
    

    Upstart for kibana-proxy

    With the kibana-proxy-start.sh the upstart configuration for kibana-proxy is dead simple:

    description "kibana-proxy"
    start on runlevel [2345] # All except below
    stop on runlevel [016] # Halt, Single-User, Reboot
    respawn
    respawn limit 5 60 # Restart if ended abruptly
    chdir /where/you/installed/kibana-proxy
    setuid ubuntu
    exec ./kibana-proxy-start.sh
    

    Running services

    Once you have your upstart configurations ready you can start the services with a descriptive syntax:

    sudo service elasticsearch start
    

    If you're having trouble getting upstart to work you can always check its configuration with:

    initctl check-config
    

    To find out why a particular service won't start, check:

    dmesg | grep elasticsearch
    

    Finally, you can find your log files in /var/log/upstart/servicename.*

    Now, if you have set up everything correctly and of course have some log messages generated by apps, when you navigate to http://machine-dns-name.cloudapp.net/index.html#/dashboard/file/logstash.json you should see a nice dashboard filling up with log entries.

  • ng-click and double post

    21 May 2014 angularjs

    Recently I was fixing a bug in an application written in AngularJS which turned out to be pretty interesting to me. At first I couldn't reproduce the error, and just when I got frustrated it happened exactly as stated in the issue tracker. As it turned out, the problem was caused by fast clicking on the same button. The click handler triggered an http POST to the server which created a row in a table - no wonder it looked strange when 2 rows were created after someone accidentally clicked twice.

    Some solutions

    I quickly googled to find out what others are doing in such a scenario and found at least a couple of ways of handling it - some of them I had considered, while others were completely new to me.

    Disable button

    The simplest approach is to use ng-disabled on the same button as the ng-click directive, which would look like this:

    <button ng-click="executePost()" ng-disabled="buttonDisabled">Click me</button>
    
    $scope.executePost = function(){
      $scope.buttonDisabled = true;
      runHttpPost().then(function(){
        //do something useful with a response
      }).finally(function(){
          $scope.buttonDisabled = false;
      })
    };
    

    As you can see the code is pretty straightforward, especially when you make use of the AngularJS promise API provided by the $q service.

    Stop relying on the result of asynchronous action

    Another approach is to design the user interface in such a way that we can safely predict what result the action will produce. For instance, if we're editing an entity and have performed as much validation as possible on the client side, it's often safe to assume that the request will be processed successfully and our changes will persist. If so, we don't have to wait for the server to finish processing and can move to the next screen in the application, which will display values already available in the browser. As a bonus our application will feel much faster, and in many cases it makes a significant difference. It's not always easy though, and sometimes requires significant changes in the way we design our user interface as well as our server API. Alex MacCaw summarised it nicely.

    Make sure a second request will not make it to the server

    An approach I hadn't thought about is to actually prevent a second request that looks identical to the first one and is created just after it. Michał Ostruszka described an implementation of this technique. I haven't actually tried it, as it seemed way too complex for the issue I was fixing. The approach can be very tempting as it allows handling double requests globally - once per application. Although I suspect it may be difficult to use if the act of making a request that gets blocked causes side effects to the state of the application.

    bi-click approach

    Most async operations in modern browser-based libraries are expressed using a flavour of the promise/deferred API. This is no different in AngularJS, which provides $q. So if we would like to block the button until the operation is completed, it seems like a natural fit to use promises to achieve that. Here is an example of such a directive:
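
    A minimal sketch of what such a directive could look like, assuming the expression passed to it returns a $q promise (the directive and module names are illustrative, not the original code):

    angular.module('app').directive('biClick', function () {
      return {
        restrict: 'A',
        scope: {
          biClick: '&'
        },
        link: function (scope, element) {
          element.on('click', function () {
            // disable the element for the duration of the asynchronous action
            element.attr('disabled', true);
            scope.$apply(function () {
              var result = scope.biClick();
              if (result && typeof result.finally === 'function') {
                // re-enable once the promise settles, whether it resolved or rejected
                result.finally(function () {
                  element.removeAttr('disabled');
                });
              } else {
                element.removeAttr('disabled');
              }
            });
          });
        }
      };
    });

    It would be used as <button bi-click="executePost()">Click me</button>, keeping the disable/enable bookkeeping out of the controller.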

    Further usage

    Every web developer out there has probably implemented an ajax throbber of some kind. In the AngularJS world most of the examples I see are very similar to our button blocking with the buttonDisabled flag. Usually it will mean something along these lines:

    $scope.executePost = function(){
      $scope.showThrobber = true;
      runHttpPost().finally(function(){
          $scope.showThrobber = false;
      })
    };
    
    <div class="ajax-thorbber" ng-show="showThrobber"></div>
    

    While there is nothing wrong in particular with this example, I find handling those flags in a controller rather tedious. As with the bi-click example, one can very easily create a directive to handle throbber toggling in one place.

  • Nonintrusive http proxy in nodejs

    23 March 2014 nodejs, http, and proxy

    In my previous posts I wrote about problems that might occur when using an http proxy written in nodejs to debug http issues. Today I'm going to describe how to use the nodejs built-in parser to overcome these problems.

    Nodejs streams

    Node.js has decent support for handling streams. Especially the pipe function takes a lot of the burden away that previously had to be carefully handled by the programmer. As a first step let's use a simple socket server and tunnel http requests through it:

    var net = require('net');
    var stdout = process.stdout;
    var server = net.createServer(function(clientSocket){
      var serverSocket = net.connect(8888,'localhost', function(){
    
          clientSocket.pipe(serverSocket);
          clientSocket.pipe(stdout);
    
          serverSocket.pipe(clientSocket);
          serverSocket.pipe(stdout);
    
      });
    });
    server.listen(9000);
    

    The above code creates a socket server listening on port 9000. When it gets a connection from a client it will immediately try to connect to another server listening on port 8888 - the port used by Fiddler or Charles. After the connection is established it uses the pipe function to pass all data coming in on port 9000 to port 8888, as well as to standard output, just so we can see what data is sent from the client. We also need to pass the data returned from port 8888 back to the client - that's why there is a second pair of pipe calls.

    With that we can already use curl to see raw http traffic written directly to standard output.

    curl http://wp.pl/ --proxy http://localhost:9000/
    

    Creating http proxy on network level

    The above example, albeit simple, doesn't really provide any value as we still need an external proxy to properly pass http traffic. In order to make it a real http proxy we need to add parsing logic that will extract information about the target server where we should pass incoming data to. While building a very simple http parser isn't difficult, why don't we use the existing one that is built right into the node.js core http module?

    Using the node.js http parser isn't exactly documented and I guess it's not part of the public API. Nevertheless it is exposed to client code intentionally, and a quick look through the core http module source code gives enough examples of how to use it. With an existing parser instance the only thing left to do is to extract the Host header value and use it to connect to the target server. Here is the core part of the code required:

    net.createServer(function (socketRequest) {
        var requestParser = HttpParsingStream.createForRequest(socketRequest);
    
        requestParser.on('headers', function (req) {
            //pause the request until we setup necessary plumbing
            req.pause();
            //extract information about target server
            var hostNameHeader = req.headers.host,
                hostAndPort = hostNameHeader.split(':'),
                host = hostAndPort[0],
                port = parseInt(hostAndPort[1]) || 80;
            // now we know where to tunnel the client request
            var srvSocket = net.connect(port, host, function () {
                //a response parser as a replacement for stdout
                var responseParser = HttpParsingStream.createForResponse(srvSocket);
                srvSocket.pipe(responseParser);
                //pipe data from server to client
                srvSocket.pipe(socketRequest);
    
                //flush data buffered in parser to target server
                requestParser.writeCurrentChunks(srvSocket);
                //pipe remaining data from client to target server
                socketRequest.pipe(srvSocket);
                //resume processing
                req.resume();
            });
        });
    
        //pipe data from client to request parser
        socketRequest.pipe(requestParser);
    
    }).listen(9000);
    

    The above code makes use of HttpParsingStream, which is a hand-rolled writable stream that uses the node.js http parser to emit events. As you can see we first pipe the client socket to requestParser to get information about the target server. As soon as we get the headers event the incoming client request is paused, we set up a connection to the target server, write the raw data buffered in requestParser and set up pipes in a similar fashion to the previous example. The most important property of this http proxy is that it does not change data coming from the client or the target server in any way, which is invaluable when debugging problems in http implementations.

    HttpParsingStream explained

    The above example relies on 2 instances of HttpParsingStream for request and response respectively:

    var net = require('net'),
        http = require('http'),
        stream = require('stream'),
        Writable = stream.Writable,
        parsers = http.parsers,
        HTTPParser = process.binding('http_parser').HTTPParser,
        util = require('util');
    
    var HttpParsingStream = function (options) {
        Writable.call(this, {});
        var socket = options.socket;
        //get an instance of node parser
        var parser = parsers.alloc();
        //buffer for raw data
        var streamChunks = [];
        var that = this;
        //initialize as request or response parser
        parser.reinitialize(options.parserMode);
        parser.socket = socket;
        socket.parser = parser;
        //called by node http module when headers are parsed
        parser.onIncoming = function (request) {
            that.emit('headers', request);
            request.on('data', function () {
                //this is one of ways to get 'end' event
            });
            request.on('end', function () {
                that.emit('end');
                //free parser
                freeParser(parser, request);
            });
        };
        socket.on('close', function () {
            that.emit('socket.close');
        });
    
        this._write = function (chunk, encoding, callback) {
            streamChunks.push({
                chunk: chunk,
                encoding: encoding
            });
            //pass data to parser
            parser.execute(chunk, 0, chunk.length);
            callback && callback();
        };
    
        //write data currently in buffer to other stream
        this.writeCurrentChunks = function (writableStream) {
            streamChunks.forEach(function (chunkObj) {
                writableStream.write(chunkObj.chunk, chunkObj.encoding);
            });
        };
    };
    util.inherits(HttpParsingStream, Writable);
    

    The HttpParsingStream accepts 2 options:

    • socket for underlying request and response
    • parserMode used to properly initialise node.js http parser

    Here's how we create objects with it:

    
    HttpParsingStream.createForRequest = function (socket) {
        return new HttpParsingStream({
            socket: socket,
            parserMode: HTTPParser.REQUEST,
            //used only for debugging
            name: 'request'
        });
    };
    HttpParsingStream.createForResponse = function (socket) {
        return new HttpParsingStream({
            socket: socket,
            parserMode: HTTPParser.RESPONSE,
            //used only for debugging
            name: 'response'
        });
    };
    

    Because HttpParsingStream is a Writable stream we can use it as:

    socketRequest.pipe(HttpParsingStream.createForRequest(socketRequest));
    
    serverSocket.pipe(HttpParsingStream.createForResponse(serverSocket));
    

    and let the node.js code handle buffering, pausing and resuming. There is also one additional function used to clean up and return a parser instance back to the pool - it's copied from the node.js http module source code:

    function freeParser(parser, req) {
        if (parser) {
            parser._headers = [];
            parser.onIncoming = null;
            if (parser.socket) {
                parser.socket.onend = null;
                parser.socket.ondata = null;
                parser.socket.parser = null;
            }
            parser.socket = null;
            parser.incoming = null;
            parsers.free(parser);
            parser = null;
        }
        if (req) {
            req.parser = null;
        }
    }
    

    Why create yet another http proxy implementation?

    proxy-mirror, a simple http debugging tool that I wrote, so far relies on the excellent http-proxy module. However, because http-proxy was created to be used mostly as a reverse proxy routing http traffic to different http servers, it does not provide a way to access raw tcp data. With the above code I now have an easy way to access the data - so I can display it at the byte level - a feature of Fiddler that I find handy from time to time.

    Moreover, the target server will receive the request from the client in an unchanged form. The same applies to the response received by the client. This is extremely important when resolving issues related to improper implementations of the HTTP protocol.

    I haven't yet checked how the above code handles WebSocket connections - I'll explore that in a future post. For reference, you can find the full code in this gist.

  • Curl and a missing slash

    24 February 2014 http, node.js, https, and proxy

    While I was playing around with proxy-mirror I noticed an interesting behaviour when testing the proxy with curl. The following command:

    curl http://wp.pl --proxy http://localhost:8888/

    will result in the following output when the proxy is Fiddler:

    [Fiddler] Response Header parsing failed.
    Response Data:
    <plaintext>
    43 6F 6E 6E 65 63 74 69 6F 6E 3A 20 63 6C 6F 73 65 0D 0A 0D 0A 3C 48 54  Connection: close....<HT
    4D 4C 3E 3C 48 45 41 44 3E 3C 54 49 54 4C 45 3E 34 30 30 20 42 61 64 20  ML><HEAD><TITLE>400 Bad
    52 65 71 75 65 73 74 3C 2F 54 49 54 4C 45 3E 3C 2F 48 45 41 44 3E 0A 3C  Request</TITLE></HEAD>.<
    42 4F 44 59 3E 3C 48 32 3E 34 30 30 20 42 61 64 20 52 65 71 75 65 73 74  BODY><H2>400 Bad Request
    3C 2F 48 32 3E 0A 59 6F 75 72 20 72 65 71 75 65 73 74 20 68 61 73 20 62  </H2>.Your request has b
    61 64 20 73 79 6E 74 61 78 20 6F 72 20 69 73 20 69 6E 68 65 72 65 6E 74  ad syntax or is inherent
    6C 79 20 69 6D 70 6F 73 73 69 62 6C 65 20 74 6F 20 73 61 74 69 73 66 79  ly impossible to satisfy
    2E 0A 3C 48 52 3E 0A 3C 41 44 44 52 45 53 53 3E 3C 41 20 48 52 45 46 3D  ..<HR>.<ADDRESS><A HREF=
    22 68 74 74 70 3A 2F 2F 77 77 77 2E 77 70 2E 70 6C 2F 22 3E 61 72 69 73  "http://www.wp.pl/">aris
    3C 2F 41 3E 3C 2F 41 44 44 52 45 53 53 3E 0A 3C 2F 42 4F 44 59 3E 3C 2F  </A></ADDRESS>.</BODY></
    48 54 4D 4C 3E 0A                                                        HTML>.
    
    while a simple implementation relying on the node.js core http module and the http-proxy module outputs this:

    An error has occurred: {"bytesParsed":0,"code":"HPE_INVALID_CONSTANT"}

    Meanwhile, without the proxy parameter, the actual response is:

    HTTP/1.1 301 Moved Permanently
    Server: aris
    Location: http://www.wp.pl
    Content-type: text/html
    Content-Length: 0
    Connection: close
    

    Curl forgiving behaviour

    As it turns out, the actual outgoing HTTP request is different depending on the presence of the --proxy parameter. Without it the target server receives and responds with:

    GET / HTTP/1.1
    User-Agent: curl/7.26.0
    Host: wp.pl
    Accept: */*
    
    HTTP/1.1 301 Moved Permanently
    Server: aris
    Location: http://www.wp.pl
    Content-type: text/html
    Content-Length: 0
    Connection: close
    
    but when the proxy setting is present:
    GET http://wp.pl HTTP/1.1
    User-Agent: curl/7.26.0
    Host: wp.pl
    Accept: */*
    Proxy-Connection: Keep-Alive
    
    UNKNOWN 400 Bad Request
    Server: aris
    Content-Type: text/html
    Date: Sun, 23 Feb 2014 16:01:36 GMT
    Last-Modified: Sun, 23 Feb 2014 16:01:36 GMT
    Accept-Ranges: bytes
    Connection: close
    
    <HTML><HEAD><TITLE>400 Bad Request</TITLE></HEAD>
    <BODY><H2>400 Bad Request</H2>
    Your request has bad syntax or is inherently impossible to satisfy.
    <HR>
    <ADDRESS><A HREF="http://www.wp.pl/">aris</A></ADDRESS>
    </BODY></HTML>
    
    As you may have noticed, the difference in the requests (apart from the additional header) is in the first line, where in the first case curl assumed we want to GET /. The former uses a relative URI while the latter an absolute one. Now if we change the command line just a little bit, so that the address looks like http://wp.pl/, the server will receive a request with a correct absolute URI and will respond with 301, as shown below.
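
    That is, the same command with a trailing slash added to the URL:

    curl http://wp.pl/ --proxy http://localhost:8888/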

    UNKNOWN Status Line

    Careful readers may have already noticed that the Status Line of the response returned by the server is malformed. According to the spec it should have the following form:

    Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
    
    That's the reason why Fiddler warns users about a protocol violation, and the reason for the error inside the node.js proxy.

    A less intrusive proxy

    The example described above leads to other questions about the behaviour of an http proxy, especially if we think about implementing an HTTP debugging tool like proxy-mirror. It's only a guess, but I suspect that there are other subtle differences between requests sent by the client and those received by the target server when the proxy is implemented using the http-proxy module or the core http module. I am aware of at least one such case where HTTP headers are lowercased. If I'm right, a debugging tool relying on them would make it really hard to spot certain problems, like protocol violations on both client and server side. In my next blog post I'll describe how to build an even less intrusive proxy without building the http parser by hand.