Feb 27, 2011

GAE, cookies and everything

I'm posting this because I think the solution we found could be useful to others. So here's the story.


Some time ago we were trying to use Google App Engine to create some kind of web crawler. And pretty soon we found that doing HTTP requests to other sites with GAE is not a piece of cake. It's more like a piece of something less tasty...



Well no, I mean GAE has its URL Fetch service with two Java APIs: one of them uses the regular HttpURLConnection, and the other uses Google's own specific classes. We didn't want anything special, we just expected it to do HTTP requests and get responses, so we decided to go with the first API. And so it did. Except for one tiny problem: it did not work with cookies. Why would we care? Well, if the site you want to work with uses cookies, you do care. So the simple solution would be to set our own default cookie handler, and we'd be OK! But we weren't. The thing is that setting a default handler causes security exceptions. (The corresponding issues are here and here).
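
For readers unfamiliar with the "default cookie handler" idea, here is a minimal sketch of what such an attempt looks like in plain Java. The class name DefaultCookieHandlerAttempt is made up for illustration, and catching SecurityException is an assumption about the kind of failure GAE produced; the exact exception is described in the linked issues, not here.

```java
import java.net.CookieHandler;
import java.net.CookieManager;

// Hypothetical helper, not part of any published project.
final class DefaultCookieHandlerAttempt {

    // Tries to install a JVM-wide cookie handler for HttpURLConnection.
    // Returns false if the runtime's security policy forbids it
    // (assumed here to surface as a SecurityException).
    static boolean install() {
        try {
            CookieHandler.setDefault(new CookieManager());
            return true;
        } catch (SecurityException e) {
            return false;
        }
    }
}
```

On a regular JVM this call succeeds and HttpURLConnection starts tracking cookies; in the GAE sandbox described above it was the step that failed.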

Thinking it over, we decided to try the second API, i.e. the one that is the Google guys' own invention. And very soon we found that it's too basic to be used easily. Luckily we found a solution for wrapping it with the Apache HTTP Client (explained here and here). And it worked! But did it handle cookies? The answer is: almost. So that's where the actual part of my post starts.


So the problem is that some sites send you more than a single cookie (that's OK), and GAE combines these cookies into a single header for you (which also seems OK), but it does so in a not so great way. Specifically, it does not perform any quoting where it's required.
So imagine the server you are working with sends you a response with two cookies:
Set-Cookie: uid=1223456; expires=Tue, 19-Jan-2038 03:14:07 GMT; path=/; domain=.wtf.wtf; httponly
Set-Cookie: pass=ae2f5d13303d93d32e1c167bf884c78e; expires=Tue, 19-Jan-2038 03:14:07 GMT; path=/; domain=.wtf.wtf; httponly
But what you get from GAE is
Set-Cookie: uid=1223456; expires=Tue, 19-Jan-2038 03:14:07 GMT; path=/; domain=.wtf.wtf; httponly, pass=ae2f5d13303d93d32e1c167bf884c78e; expires=Tue, 19-Jan-2038 03:14:07 GMT; path=/; domain=.wtf.wtf; httponly
The cookies are concatenated with a comma, but the "expires" field also contains a comma and is not quoted, so the default parsing facility gives you some vermicelli instead of the cookies you expect. The corresponding issue was raised in the appengine tracker in June of 2010; the discussion there contains links to different RFCs, but gives no idea whether it will be fixed anytime soon.
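
To make the ambiguity concrete, here is a minimal sketch of one way to split such a folded header back into individual cookies. It is not the code from the project mentioned in this post; the class name FoldedSetCookieSplitter and the heuristic itself are assumptions made for illustration. The idea is to treat a comma as a cookie boundary only when it is followed by something that looks like "name=", which the comma inside an "expires" date never is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating a comma-splitting heuristic.
final class FoldedSetCookieSplitter {

    // A comma, optional whitespace, then (lookahead only) a token with no
    // ';', ',', '=' or whitespace that is immediately followed by '='.
    private static final Pattern COOKIE_BOUNDARY =
            Pattern.compile(",\\s*(?=[^;,=\\s]+=)");

    // Splits one folded Set-Cookie header value into separate cookie strings.
    static List<String> split(String foldedHeaderValue) {
        List<String> cookies = new ArrayList<String>();
        for (String part : COOKIE_BOUNDARY.split(foldedHeaderValue)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                cookies.add(trimmed);
            }
        }
        return cookies;
    }
}
```

Applied to the combined header above, this yields two strings, one starting with "uid=" and one starting with "pass=", each keeping its own "expires" date intact. The heuristic is not bulletproof: an unquoted cookie value that itself contains a comma followed by "something=" would still be split in the wrong place.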

So we had to find some solution. As a result, we wrote an implementation of Apache HTTP Client's CookieSpec interface that tries to parse that miserably formatted cookie header. Its usage is completely transparent: instead of using
AbstractHttpClient httpclient = new DefaultHttpClient();
just use
AbstractHttpClient httpclient = GaeHttpClientFactory.createHttpClient();
And that's all. As the issue seems to be causing trouble not only for us, we decided to publish the code for everybody's use. So here's an SF project. Feel free to use it and post improvements if you like.


Getting back to the discussion in the issue, it makes me think that the original source of the problem is not whether Google did something wrong, but that there's still no single standard even for such a simple matter as cookies. The problem of fuzzy Internet standards looks like complete nonsense to me, but that's another story.


Nevertheless, I still think that in this particular case Google's code should do the job of helping GAE users handle cookies, even if they think it's the original server's fault. So I invite everybody to go and vote for the issue.


Update: according to the issue log, this problem was finally fixed on 2012-08-22, exactly 2 years and 2 months after it was reported. Better late than never, right?

7 comments:

  1. Mikhail, it's great that you described the problem and published your solution.
    I hope your experience will help somebody in the future.

  2. Hi Mikhail,

    I ran into this issue as well, and I am now using your GAE cookie parser in one of my projects.

    Thank you!
    Oleg

  3. Hi Oleg,

    It's nice that you found my code useful. If you have any improvements for the code and want them to be published, feel free to contact me.

    Also consider voting for the issue, if you haven't done so yet.

  4. Hmm, I think I found a bug/missing feature.

    Here's an example of a cookie string that is not getting parsed correctly:

    CFID=93652;expires=Sat, 18-May-2041 03:14:42 GMT;path=/, CFTOKEN=54481694;expires=Sat, 18-May-2041 03:14:42 GMT;path=/

    Have you considered using unit tests? I did not see any.

  5. No, I have no unit tests for this project. Actually, I haven't spent much time on it since we achieved what we wanted. But now that it's shared with the public, it would be nice to have it polished.
    Do you have unit tests on your side, or any other ideas on how to improve the situation?

  6. Why not just post how you parse the GAE cookie, instead of giving a project? I think that would be more helpful, since we can't truly rely on third-party tools.

    Replies
    1. I think a working example is worth a thousand words. After all, most of the problem there was not parsing the cookie, but integrating everything together.
      It should be no problem to get the code and read the parsing part if that's all you are interested in.

      Also, there's no need to "rely" on the project: it's under the BSD license, so you may download the code, modify it as you please, or do whatever else you find appropriate.

      If you have any specific questions, I'll be glad to help, though I haven't touched GAE for a while.
