(Alright, for HackerNews, here’s the TL;DR version: I built a Twitter analytics tool as a weekend project, started charging, and money has been coming in steadily for months without any advertising. Celebrities sign up, major investors sign up, etc. / Problem: anyone with more than 500k followers waits days, sometimes months, to be processed and allowed to see results. Distributed architecture: S3, dedicated servers, MySQL + SQLite + Twitter API.)
FRUJI.com has become a very popular service without any marketing or advertising work on my part. It started as a lazy weekend project. I sold PRO accounts to see if anyone would pay for them, and people did, lots of them. It doesn’t get better than this. The code has been rewritten from scratch 3 times now: I started all over and pushed out new versions to keep up with demand, with Twitter API changes, with the ugly UI and other problems. There is even an iPhone version that I haven’t had a chance to update.
Now, with all this, you’d have the perfect startup that many people try to build. It happened while I was busy working on things I considered to be a real startup; this was just a weekend project. Not anymore. In fact, it has a lot of potential and a lot of users requesting more features and smarter features, and they are willing to pay for them. A perfect outlook.
Except, Twitter.
Twitter has been ruthless and tough on developers with their API limits. They are trying to allow businesses to build on top of their platform, but are unwilling either to charge for more access or to adapt to requests made from the outside. I’ve found ways to work with their API limits (without violating policies or rules like many others do), but I’ve now hit a nasty bug (after two others I got them to fix) that they are currently not willing to fix. Here goes:
Twitter allows me to issue a certain number of API calls to a specific endpoint in a specific time frame. All the limits are on their API pages, but here are some made-up example numbers for illustration:
Say I get to call their API 100 times per hour (it’s more than that, but as I said, this is an example).
Now, with every one of these 100 calls, I can request detail records for a user’s followers. What’s more (and crucial), I can request up to 100 follower detail records per 1 API call.
So, say JOHN has 5000 followers. I would need to issue 50 API calls, requesting 100 follower details with each call. With 100 calls per hour at 100 records each, that means I can process 10000 follower records per hour (I can do more, but again, example).
So, this is Twitter’s official limit. I’m mostly OKish with that.
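To make the arithmetic concrete, here is a minimal sketch using the same made-up numbers (100 calls per hour, 100 records per call). The function names are mine and purely illustrative; the real limits are different.

```python
import math

def calls_needed(followers: int, records_per_call: int = 100) -> int:
    """API calls required to fetch details for every follower."""
    return math.ceil(followers / records_per_call)

def hours_needed(followers: int, calls_per_hour: int = 100,
                 records_per_call: int = 100) -> float:
    """Hours of quota consumed, assuming every call returns a full batch."""
    return calls_needed(followers, records_per_call) / calls_per_hour

print(calls_needed(5000))   # 50 calls for JOHN
print(hours_needed(5000))   # 0.5 hours of quota
print(100 * 100)            # 10000 records per hour at the example limit
```

These figures only hold when every call actually comes back with a full batch of 100 records.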
Here’s the bug they won’t fix:
Twitter has smart developers, so they implemented a time-out. Say I request 100 follower details for the user JOHN. The Twitter API goes into the databases and fetches all these records while it has me waiting. This requires a bit of time (up to 5 seconds); sometimes it goes fast, sometimes it takes a while. I am fine with waiting. But Twitter is not. They time out and drop the connection once 5 seconds or so have passed. This means the API never got back to me with any result. But, and this is the bug, they charged me 1 API call for it.
Well, what’s one call, you say? They suggest issuing the same call again, but with fewer than 100 follower details requested. OK, so my algorithm issues another API call and requests 50 records. Time-out. Hm. I go back and request 10. Time-out. Hmmmm, what? I go back and request just the first, just one follower record for the user JOHN. I receive the result (or sometimes, if their record is damaged, I receive a dead record). So I scale back up to 100 follower requests for the next call. It goes through. Next 100, fails. I scale down again … you get the idea.
Problem is this: In order to process, say, an account with 250,000 followers, I need at least one full day.
That’s one full day the user spends waiting after signing up before they can log in and see any results.
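The scale-down, scale-up loop described above can be sketched roughly like this. It is not FRUJI’s actual code: `ApiTimeout`, `fetch_all` and `fake_fetch` are hypothetical names, the fake fetcher stands in for the Twitter endpoint, and skipping a single record that still times out is my assumption to keep the loop from running forever.

```python
from typing import Callable, List, Sequence, Tuple

STEPS = (100, 50, 10, 1)  # batch sizes tried in order, as in the text

class ApiTimeout(Exception):
    """Stands in for a dropped connection (hypothetical, not a Twitter type)."""

def fetch_all(ids: Sequence[int],
              fetch: Callable[[Sequence[int]], List[dict]]
              ) -> Tuple[List[dict], int]:
    """Fetch records for all ids; return (records, api_calls_charged)."""
    records: List[dict] = []
    calls = 0
    pos = 0    # index of the next follower id to fetch
    step = 0   # index into STEPS, 0 means the full batch of 100
    while pos < len(ids):
        chunk = ids[pos:pos + STEPS[step]]
        calls += 1  # a timed-out call is charged just like a good one
        try:
            result = fetch(chunk)
        except ApiTimeout:
            if step < len(STEPS) - 1:
                step += 1          # retry the same position with fewer ids
            else:
                pos += 1           # assumption: skip a record that never loads
                step = 0
            continue
        records.extend(result)
        pos += len(chunk)
        step = 0                   # scale back up after any success
    return records, calls

def fake_fetch(chunk: Sequence[int]) -> List[dict]:
    """Simulated endpoint: id 150 is 'slow' and kills any batch it is in."""
    if 150 in chunk and len(chunk) > 1:
        raise ApiTimeout()
    return [{"id": i} for i in chunk]

records, calls = fetch_all(list(range(300)), fake_fetch)
print(len(records), calls)  # 300 9
```

In this simulation a single slow record turns 3 ideal calls into 9 charged calls. The more such records an account has, the closer it gets to one call per follower.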
Now, I’ve had a couple of celebrities that I can’t name here sign up (ask me via e-mail and I’ll send you the list), all of them having way over 1M followers. Guess how long it takes the tool to work through their accounts? 4-5 days? No. Unfortunately not (and even that amount of time would suck in terms of service experience).
It takes up to 60 days or more.
The larger an account is (especially 1M+), the more damaged records it has and the more time-outs occur. The tool mostly falls back to 1 API call = 1 follower detail record, a hundredth of what Twitter tells me I can have.
Most celebrities who signed up for FRUJI haven’t seen the results page yet, and some have been waiting for over 2 months now.
This sucks!
So, partial results, you say? Well, here’s where the complications come in. Back in the day, I had one large MySQL database containing all follower records. Once I had JOHN’s 5000 follower details, I put them into a large table. This table grew to many gigabytes of data (slowly duplicating Twitter’s database) and constantly crashed. Then, with more users signing up and especially with spikes caused by blog articles, the service was down for days in a row while I tried to repair the database.
My fix:
So I came up with a smarter solution. Every PRO user has his own SQLite database now. It’s stored safely on S3. Every night, a cron-triggered PHP script downloads the SQLite file from S3 to a fast dedicated server and checks with Twitter, for each individual follower record, whether that person is still following the user. From that, I can build the Followed You / Unfollowed You tables. It also helps me keep track of the Most Popular and Most Valuable Followers by adding new ones to those lists.
This server is cut off from the web server you see when you open FRUJI. Large numbers of visitors should not impact the crawler service. So I came up with a different (and, I feel, pretty smart) way of separating the user from the data. The user is redirected to a basic HTML page on Amazon S3, so Amazon takes the traffic. Then, for authentication, I use a dedicated server that runs the session details and account authentication through a slim MySQL instance. That server then outputs very lightweight, basic HTML for the user.
The trick is this: These HTML pages (the results) are mostly empty, but have one JavaScript call to pick up JSONP data from S3 for that particular user. So all results (anything that contains data on FRUJI) are actually pre-rendered and waiting on S3 for every user. The user’s JavaScript / browser session requests all heavy data from S3. Again, my server is out of the traffic loop, so it’s a perfect scenario.
So this means that every night, the crawlers regenerate the reports and automatically upload them to S3.
So whenever you see data on FRUJI, it’s manipulated/ordered through JavaScript, but it is pre-rendered and cannot be queried dynamically. If I change the code, it’ll require a night to re-render for everyone.
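As a rough illustration of the nightly comparison, here is a small sketch. The real job is a PHP script working on a file pulled from S3; this version uses an in-memory SQLite database, and the table layout and the `diff_followers` name are assumptions made only for this example.

```python
import sqlite3
from typing import Iterable, List, Tuple

def diff_followers(db: sqlite3.Connection,
                   current_ids: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Compare stored follower ids with tonight's list; return (followed, unfollowed)."""
    db.execute("CREATE TABLE IF NOT EXISTS followers (id INTEGER PRIMARY KEY)")
    stored = {row[0] for row in db.execute("SELECT id FROM followers")}
    current = set(current_ids)
    followed = sorted(current - stored)     # new since the last run
    unfollowed = sorted(stored - current)   # gone since the last run
    with db:  # commit both changes together
        db.executemany("DELETE FROM followers WHERE id = ?",
                       [(i,) for i in unfollowed])
        db.executemany("INSERT INTO followers (id) VALUES (?)",
                       [(i,) for i in followed])
    return followed, unfollowed

db = sqlite3.connect(":memory:")   # stand-in for the per-user file from S3
diff_followers(db, [1, 2, 3])      # first night: everyone is new
followed, unfollowed = diff_followers(db, [2, 3, 4])
print(followed, unfollowed)        # [4] [1]
```

Keeping one small database per user means a corrupt file only affects that one account, not the whole service.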
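The pre-rendering step boils down to wrapping each user’s report in a JavaScript callback and saving it as a static file. A minimal sketch, assuming a callback called `frujiReport` and a made-up report shape (neither is FRUJI’s real format), with the S3 upload left out:

```python
import json

def render_jsonp(callback: str, report: dict) -> str:
    """Wrap a report in a JSONP callback so a static page can load it via a script tag."""
    return f"{callback}({json.dumps(report, sort_keys=True)});"

payload = render_jsonp("frujiReport", {"user": "JOHN", "followers": 5000})
print(payload)  # frujiReport({"followers": 5000, "user": "JOHN"});
```

The resulting string is what would be written to a per-user file and uploaded, so the browser only ever talks to static storage for the heavy data.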
Basic users can render their reports manually, and we don’t keep their SQLite databases (they are regenerated every time). For PRO users it happens automatically, and we keep, refresh and re-upload their SQLite databases to S3 so we can track followed/unfollowed data.
So, how do we solve the problem with large Twitter accounts having to wait up to 60 days or more? If you have an idea, e-mail me at: office@twentypeople.com and I’d be happy to work with you on making this happen.
We can do a 50/50 split on upcoming FRUJI PRO accounts (this is all I can pay right now, since it’s all I get semi-regularly). I have reached my technical limits and would love some serious help.
You’ll get access to all systems and can party like you want on it.