Wednesday, October 28, 2009

Who still watches Television?

So I know a couple of people who every night or so, sit down and watch Television. I find the very thought of it mind boggling. Why would anyone torture h[im/er]self so?

Television is all about locking yourself into someone else's timetable.
  • Each show comes on when your network decides it comes on.
  • You can't go back in case you missed something.
  • You can't rewatch a particular enjoyable scene.
  • You can't pause.
  • You can't stop it and continue it later.
  • You can't skip an annoying scene.
I find the idea of such a lack of control on the viewer's part excruciatingly painful.

People miss appointments or go to bed late because they just had to know what happens. Parents have to fight with their kids to get them to do homework, which would make them miss their favorite show.

People don't take proper care of themselves by not going to use the facilities when they need to, or getting a drink, or answering the phone. The list of problems goes on and on.

This problem was always apparent, and many people bought video cassette recorders, or other newer devices to get the same job done. But there were those that didn't, and instead preferred to torture themselves.

Okay, at first the technology may have been annoying (video cassettes), then it was too expensive. It's always another annoying peripheral in your house for the most part just taking up space. But why is this still going on today?

If you read this site often, it's quite likely you own your own computer at home and have a high speed internet connection. There's also a good chance if you bought a computer or new hard drive in the past few years, that you have gigabytes of free space that you have no idea what to do with.

If you own what I described above, why would you want to torture yourself so? Many stations are now putting their shows on their website which you can watch in mid-quality annoying flash. But ignoring that, nowadays, there are tons of people who record the show and then share that recording via Bit Torrent or file sharing websites.

Unlike traditional home recording techniques, quite often you don't even have to program your device when to record for you anymore - when using a computer. Many "Online Television" sites provide RSS feeds that you can subscribe to which would automatically download your favorite shows as each episode comes out. You don't have to screw up setting to record at 3:59, and finding out your time was three minutes behind the network, and you missed the beginning. Or instead of setting the end time to 5:01, you screwed up and selected 4:01, and recorded a grand total of two minutes of your hour long show.

These sites generally have the shows up a couple of minutes after they air in the earliest timezone showing them, and the files are of good quality, yet small enough to be downloaded in just a couple of minutes. If you happen to live in a later timezone, many times you can be finished watching a show before it even airs where you live.

So who exactly is still torturing themselves? And why?

Before you question, what would happen if everyone only started watching "Online Television", who exactly would be doing the recordings? We have to ask ourselves why aren't the companies producing these shows using the new medium available to distribute them? If they had an RSS feed with all the new shows set up to be downloaded via Bit Torrent on them (advertisements included), I'm sure many people would subscribe to them. They could even charge a few bucks a year to gain access to a private torrent, cutting out the middlemen (networks, cable companies) they use.

Sunday, October 25, 2009

Distributed HTTP

A couple years back, a friend of mine got into an area of research which was rather novel and interesting at the time. He created a website he hosted from his own computer, where one can read about the research, and download various data samples he made.

Fairly soon, it was apparent that he couldn't host any large data samples. So he found the best compression software available, and setup a BitTorrent tracker on his computer. When someone downloaded some data samples, they would be sharing the bandwidth load, allowing more interested parties to download at once without crushing my friend's connection. This was back at a time when BitTorrent was unheard of, but those interested in getting the data would make the effort to do so.

As time went on, his site popularity grew. He installed some forum software on his PC, so a community could begin discussing his findings. He also gave other people login credentials to his machine, so they can edit pages, and upload their own data.

The site evolved into something close to a wiki, where each project got its own set of pages describing it, and what was currently known on the topic. Each project got some images to visually provide an idea of what each data set covered before one downloaded the torrent file. Some experiments also started to include videos.

As the hits to his site kept increasing, my friend could no longer host his site on his own machine, and had to move his site to a commercial server which required payment on a monthly or yearly basis. While BitTorrent could cover the large data sets, it in no way provided a solution to hosting the various HTML pages and PNG images.

The site constantly gained popularity, and my friend was forced to keep upgrading to an increasingly more powerful server, where the hosting costs increased just as rapidly. Requests for donations, and ads on the server could only help offset costs to an extent.

I imagine other people and small communities have at times run into similar problems. I think its time for a solution to be proposed.

Every major browser today caches files for each site it visits, so it doesn't have to rerequest the same images, scripts, and styles on each page, and conditionally requests pages, only if they haven't been updated since what was in the cache. I think this already solves one third of the problem.

A new URI scheme can be created, perhaps dhttp:// that would act just like normal HTTP with a couple of exceptions. The browser would have some configurable options for Distributed HTTP, such as up to how many MB per site will it cache, how many MB overall, how many simultaneous uploads will it be willing to provide per site, as well as overall, which port it will run on, and a duration as to how long it will do so. When the browser connects via dhttp://, it'll include some extra headers providing the user's desired settings on this matter. The HTTP server will be modified to keep track of which IP addresses connected to it, and downloaded which files recently.

When a request for a file comes into the DHTTP server, it can respond with a list of perhaps five IP addresses to choose from (if available), chosen based on an algorithm designed to round robin the available browsers connecting to the site, and the preferences chosen therein. The browser can then request via a normal HTTP request the same file from one of those IP addresses it received. The browser would need a miniature HTTP server built in which would understand that requests coming to it that seem to be for a DHTTP server should be replied to from its cache. It would also know not to share files that are in the cache which did not originate from a DHTTP server.

If requests to each of those IP addresses have timed out, or responded with 404, then the browser can rerequest that file from the DHTTP server set with a timeout or unavailable header for each of those IP addresses, in which case the DHTTP server will respond with the requested file directly.

The HTTP server should also know to keep track of when files are updated, so it knows not to refer a visitor to an IP address which contains an old file. This forwarding concept should also be disabled in cases of private data (login information), or dynamic pages. However for static (or dynamic which is only generated periodically) public data, all requests should be satisfiable by this described method.

Thought would have to be put into how to handle "leach" browsers which never share from their cache, or just always request from the DHTTP server with timeout or unavailable headers sent.

I think something implemented along these lines can help those smaller communities that host sites on their own machines, and would like to remain self hosting, or would like to alleviate hosting costs on commercial servers.

Thoughts?

Friday, October 23, 2009

Blogger Spam

If you remember, the other day I had a bit of a meltdown in terms of all the spam I saw piling up over here.

I only have ~30 articles here, yet I had over 300 comments which were spam, and it is quite an annoying task to go delete them one by one. Especially when a week later, I'll have to go delete them one by one yet again.

Instead of just throwing my hands up in the air, I found it was time to get insane - I went to check out Blogger's API. So looking it over, I found it's really easy to log in, and about everything else after that gets annoying.

Blogger provides a way to get a list of articles, create new articles, delete articles, and also managing their comments. But the support is kind of limited if you want to specify what kind of data you want to retrieve.

At first, I thought about analyzing each comment for spam, but I didn't want to run the risk of false positives, and figured my best bet for now is just to identify spammers. I identified 25 different spam accounts.

However, Blogger only offers deleting comments by the comment ID, and then, only one by one. The only way to retrieve the comment ID is to retrieve the comments for a particular article, which includes the comments themselves and a bunch of other data. All this data is in a rather large XML file.

It would be rather easy to delete comments if Blogger provided a function like deleteCommentsOf(userId, blogId), or getCommentIdsOf(userId, blogId), or something similar. But no, one needs 4 steps just to get an XML file which contains the comments IDs along with a lot of other unnecessary data. This has to be repeated for each article.

It seems Blogger's API is really only geared towards providing various types of news feeds of a blog, and minimal remote management to allow others to create an interface for one to interact with blogger on a basic level. Nothing Blogger provides is geared towards en masse management.

Blogger also has the nice undocumented caveat that when retrieving a list of articles for a site, it includes all draft articles not published yet, if the requester is currently logged in.

But no matter, I create APIs wrapped around network requests and parsing data for a living. So using the libraries I created and use at work for this kind of thing, and 200 lines later which includes plenty of comments and whitespace, I got an API which allows me to delete all comments from a particular user from a Blogger site. So I arm an application using my new API with the 25 users I identified, and a few minutes later, presto, they're all gone.

As of the time of this posting, there should be no spam in any of the articles here. I will have to rerun my application periodically, as well as update it with the user IDs of new spam accounts, but it shouldn't be a big deal any more.

Remember the old programming dictum: Annoyance+Laziness = Great Software. It surely beats deleting things by hand every couple of days.

Monday, October 19, 2009

Why online services suck

Does anyone other than me think online services suck?

The thing that annoys me the most is language settings. Online service designers one day had this great idea to check the geographical IP address the user visited their site from, and use it to automatically set the language to the native one for the country they visited from. While this sounds nice in theory, most people only know their mother tongue, and also go on vacation now and then, or visit some other country for business purposes.

So here I am, on business in a foreign country, and I connect my laptop into the Ethernet jack in my hotel room which comes with free Internet access, so I can check my e-mail. What's the first thing I notice? The entire interface is no longer in English. Even worse is that the various menu items and buttons are moved around in this other language.

Even Google, known for being ahead of the curve when it comes to web services can't help but make the same mistakes. I'm sitting here looking at the menu on top of Blogger, wondering which one is login.

For Google this is a worse offense compared to other service providers, as I already was logged into their main site.

Google keeps their cookies set for all eternity (well, until the next time rollover disaster), and they know I always used Google in English. Now it sees me connecting from a different country than usual and thinks I want my language settings switched? Even after I set it to English on their main page, I have to figure out how to set it to English again on Blogger and YouTube?

What's really sad about all this is that every web browser sends each website as part of its request a "user agent", which tells the web server the name of the browser, a version number, operating system details, and language information. My browser is currently sending: "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11)". Notice the en-US? That tells the site the browser is in English, for the United States. If I downloaded a different version of Firefox, or installed a language package and switched Firefox to a different language, it would tell the web server that I did so. If one uses Windows in another language, Internet Explorer will also tell the web server the language Windows/Internet Explorer is in.

Why are these service providers ignoring browser information, and instead solely looking at geographical information? People travel all the times these days. Let us also not forget those in restrictive countries who use foreign proxy servers to access the internet.


However, common issues, such as annoying language support is hardly the end of the problems. In terms of online communication, virtually all of them suffer from variations of spam. Again, where is Google here? Every time I go read comments on Blogger, I see nothing but spam posts. Even when I go to cleanup my own site, the spam just fills up again a few days later.

Where's the flag as spam button? Where's the flag this user as solely a spammer button?

Sure Google as a site manager lets me block all comments on my site till I personally review them to see if they're spam, but in today's need for hi-speed communication is that really an option when you may have a hot topic on hand? Why can't readers flag posts on their own?

In terms of management, why doesn't Blogger's site management features include a list where I can check off posts and hit one mass delete, instead of having to click delete and "Yes I'm sure" on each and every spam post? Why can't I delete all posts from user X and ban that user from ever posting on my site again?


Okay, maybe this isn't so much an article why online services suck, but more about language and spam complaints, and mostly at Google for the moment. Jet-lag, and getting your E-mail interface in Gibberish does wonders for a friendly post. I'll try to come up something better for my next article.