Archive for the ‘twitter’ Category

Getting started with Cassandra

February 24th, 2010

Three things prompted me to download, install and configure Cassandra for the first time today: today’s public news of Twitter’s move from MySQL to Cassandra, my own desire to build Cassandra skills following in-depth discussions at last November’s Open SQL Camp, and yesterday’s discussion with a new client on persistent key-value store products. Not that today’s news was unexpected; if you follow the Twitter Engineering Open Source projects you would have seen Cassandra, as well as other products, being used or evaluated by Twitter.

So I went from nothing to a working Cassandra node in under 5 minutes. This is what I did.

  1. While I knew this was an Apache project, a Google search gave me, as the 3rd link, The Apache Cassandra Project at http://incubator.apache.org/cassandra/. Congrats to Cassandra on now being a top-level Apache project. This URL will update soon.
  2. Download Cassandra. Hard to miss with a big green button on the home page. Current version is 0.5.
  3. I read Getting Started, which is the 3rd top level link on menu after Home and Download. Step 1 is picking a version which I’ve already done, Step 2 is Running a single node.
  4. The Getting Started page indicated a problem on Mac OS X with the required minimum Java version. I was installing on Mac OS X 10.5 and CentOS 5.4. I’ve experienced this Java 6 default path issue before. I set my JAVA_HOME and PATH accordingly (after I updated the wiki with the correct value).
  5. I extracted the tar file, changed to the directory and took a look at the README.txt file. Yes, I always check this first with any software, and it is relevant here because it includes valuable instructions on creating the default data and log directories.
  6. Start with bin/cassandra -f. No problems!
  7. I then followed the instructions from the link in Step 2 with the CassandraCli. This tests and confirms the installation is operational.
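
The Java version check from step 4 can be sketched as a small script. This is a minimal sketch, assuming `java -version` reports the legacy `1.x.y_zz` scheme (for example `java version "1.6.0_17"`). The helper names `parse_java_version` and `java_version_ok` are hypothetical and used only for illustration.

```python
import re


def parse_java_version(version_output):
    """Extract (major, minor) from `java -version` style output.

    Assumes the legacy "1.x.y_zz" scheme, e.g. 'java version "1.6.0_17"'.
    Returns None when no version number is found.
    """
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def java_version_ok(version_output, minimum=(1, 6)):
    """Return True if the reported version is at least `minimum` (Java 6 here)."""
    version = parse_java_version(version_output)
    return version is not None and version >= minimum
```

Running this against the output of the `java` found first on your PATH shows quickly whether JAVA_HOME and PATH still point at an older default.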

Ok, a working environment. I’ve now installed and tested on a second machine; however, I now need to configure the cluster, and the documentation is not as straightforward. Time to try out Google again.

On a side note, this is one reason why I love Open Source. I followed the instructions online and found a mistake in the Mac OS X path. I simply registered and corrected it, providing the benefit of my experience to the next reader(s).


PlanetMySQL Voting: Vote UP / Vote DOWN

When it Comes to Tweets, the Key is Location, Location, Location!

February 24th, 2010

When you only have 140 characters to get your message across, you have to depend a lot on context. For Twitter, a big part of that context has become location. Knowing where someone is tweeting from can add a lot of value to the experience, and it's Raffi Krikorian's job to integrate location into Twitter. Raffi will be talking about this and other location-related topics at the upcoming Where 2.0 conference. We began by asking him how Twitter determines location, and whether it will always be an opt-in option.

Raffi Krikorian: I think part of it is based around the philosophy of Twitter itself. We only publish information that you've explicitly given to us on a tweet-by-tweet basis. So for location on your tweets, it's all opt-in. You have to give us that location information, and we'll put it out. There are other things we do behind the scenes, like our local trends information, that doesn't actually tie to an individual person. We might do some IP look-ups. We look at your user location field. But for anything that's tied to an individual, it's all opt-in.

James Turner: 140 characters is a restriction that Twitter's famous for. Location is fairly high bandwidth information. Have you considered carrying location data out of band from the 140 characters?

Raffi Krikorian: We do that right now. Originally, when people used to tweet location, they put a URL in their text field which linked to a map or linked to a service which might show where they are. But ever since we launched our geotagging API in November, we store the latitude and longitude for your tweet out of band. It's completely metadata on top of the tweet. A bunch of clients implement it, such as Tweety and Seesmic Web, they can read that metadata, and will show you either a map or attempt a reverse geocode and give you an actual name.
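
As an illustration of what "out of band" means here, the sketch below keeps the coordinates in a separate, optional field so they never count against the 140 characters. It is only a sketch: the `Tweet` class and its field names are hypothetical and are not Twitter's actual data model or API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_TWEET_LENGTH = 140


@dataclass
class Tweet:
    """Hypothetical tweet record: text plus optional out-of-band location."""

    text: str
    # (latitude, longitude); None means the user did not opt in for this tweet.
    coordinates: Optional[Tuple[float, float]] = None

    def __post_init__(self):
        # Only the text is subject to the length limit, not the metadata.
        if len(self.text) > MAX_TWEET_LENGTH:
            raise ValueError("tweet text exceeds 140 characters")
        if self.coordinates is not None:
            lat, lon = self.coordinates
            if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
                raise ValueError("coordinates out of range")
```

A client that understands the metadata can draw a map or attempt a reverse geocode; a client that does not simply shows the text.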

James Turner: What value do you see location bringing to social networking? Usually, if someone is talking about a location, it's explicit in the message or in the blog, "I'm at so-and-so and the show is really nice tonight." If you imagine that people are pervasively providing their geolocation, how does that aid social networking?

Raffi Krikorian: I think that one, it helps people like us at Twitter to be able to give more relevant context information to other people. Especially in our 140 character constrained lifestyle, you can't necessarily put fully structured information of where you might be or what you might be talking about. But since we're now trying to expand the dimensionality of our platform to include place, we can now store that structured data, and, therefore, we can analyze it better. We can deliver it to the right people better, and we can do more interesting high-level analytics. Therefore, we can deliver relevant search or relevant information to people who are wanting it.

I think one of the dreams would be, not necessarily for Twitter but for someone out there, to be able to look at status update streams with geotagging on top of it and try to figure out what are the hot bars out there tonight, or be able to see cross-referencing with my foursquare check-in, for example. I want to be able to ask the service, "What bar should I go to right now that my friends have liked that I think I'll probably like and have no line?" And you'd only do something like that kind of high-level query if you actually have some really good way, either to analyze data or to get structured data out of the system. Analysis of that is going to be hard, especially in a world where you only have 140 characters to express yourself, for providing these metadata or meta ways to include structured information, and it becomes a UI problem to get that information into the system. It should become a lot easier for other people to build applications on top of it. So I think that's where geo-type stuff would go for networking, with better recommendations or better information delivery, better stuff within social networks.

James Turner: It sounds, in some ways, like you could mine the data the same way that, for example, GPS data in people who have phones in cars or cell phone data can be used to infer traffic patterns.

Raffi Krikorian: That's actually an excellent way to think about it. In the same way that you can watch how people are moving on freeways and try to figure out what's going on, that's a very passive interaction because people are looking at something and data's being sent out of band whereas something like Twitter, you're trying to express emotion or you're trying to express sentiment through Twitter. And that sentiment with latitude and longitude attached to it inherently talks about not just a feeling, but a feeling associated with a place. If we could start tying those places not just to latitude and longitude, but to the contextual, then you could start really understanding what's going on in the world and be able to deliver it to a lot of people.

It's the foursquare world, right? It's really important for me to find out that my friends really like that bar down the street. It's also really important to me to know that my friends are down the street right now and if I'm not busy, I can go there. So I think that's the direction we're trying to take things.

James Turner: One of the things we saw, especially in Iran this year, was Twitter as a breaking news source, simply because, along with taking photos with your cell phone, it's just a very fast way of saying, "This thing is happening right now." When you add geolocation in on top of that, do you see the ability to kind of infer news just based on what people are tweeting and where they're tweeting it?

Raffi Krikorian: I think the question of what defines news is always going to be up for debate, but most certainly. We saw this a bunch of times. Right now, there's a Twitter account that watches a USGS website and tweets with geotagging exactly where an earthquake happens within a few seconds of it hitting the USGS website. We've seen examples of people uploading photos via Flickr of traffic accidents on the web. And Flickr has implemented our geotagging API, so if you upload a photo to them with geotagging, they'll pass it through to Twitter. And then on Twitter's side, we allow you to ask either for tweets within a certain location or connect to our geohost and get a stream of tweets subscribed to location. So I could see all of the events that are occurring that are geotagged within the San Francisco area, for example, in real-time or in the Bay area. And then in the Bay area, I'm going to start to see the breaking news events, like I mentioned before, traffic accidents. I'll see people just talking about random stuff. And I'll see the earthquakes pop up here and there as the USGS stuff comes out.
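
Asking for "tweets within a certain location" can be sketched as a simple radius filter. This is a minimal sketch under stated assumptions: it uses the haversine great-circle distance on a spherical Earth and a linear scan, which is not how a production service would index location, and the function names and the `(text, coordinates)` tuple shape are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def tweets_near(tweets, center, radius_km):
    """Return the geotagged tweets within `radius_km` of `center`.

    `tweets` is an iterable of (text, coordinates) pairs, where coordinates
    is a (lat, lon) tuple or None for tweets without a geotag.
    """
    lat0, lon0 = center
    return [
        (text, coords)
        for text, coords in tweets
        if coords is not None
        and haversine_km(lat0, lon0, coords[0], coords[1]) <= radius_km
    ]
```

For example, filtering around San Francisco with a radius of 50 km would keep a tweet geotagged in Oakland and drop one geotagged in New York, as well as any tweet with no geotag at all.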

James Turner: Does that alleviate some of the need for some of the function-based or event-based hashtagging, where if I'm in the San Jose Convention Center, then I'm probably attending the event that's there, so I really don't need to tag it?

Raffi Krikorian: I think yes and no. I think that the hashtag system allows for really good context, so people can at a later date understand what went on at the time without necessarily having to figure out how to cross-reference all of the geo-tweets and then cluster them with other geo-tweets in the area trying to infer high-level stuff. I think tagging, explicit tagging, still provides a nice human-readable way to understand that information. And then the geotagging provides a really good machine-readable way to dissect and also provide that out of band type information. I think it's two different use cases. I, for one, still apply hashtags whenever I talk about stuff to imply other things or imply with a different type of context than just the location alone might provide.

James Turner: We've seen a couple of interesting uses of social media that are apart from their obvious use. Someone was able to prove they weren't able to commit a crime because they were updating their Facebook page when the crime was committed. Similarly, we just saw a news item on a criminal who was caught because he was updating his Facebook page and the cops were able to figure out where he was. Certainly when you've got something like tweets that are geolocated, you can see that type of thing. Do you think that this is going to become more of a privacy issue? Do you think this kind of openness right now is a fad? Or are we seeing a real paradigm change about what people feel comfortable letting other people know about them?

Raffi Krikorian: Well, I think you have two points there, and I just want to hit the first one first, which is being able to find out where people are; being able to know that someone updated a Facebook item from a certain location. What we're doing at Twitter is not necessarily authenticated location; a lot of it's still implied. There's a lot of trust on either the application that updated Twitter or the person that's updating Twitter. We provide no guarantees that the location that's being reported by us is factual, except by the fact that someone posted that into a system. So just like you could totally lie about what you're doing at Twitter -- like I could tweet right now and say I'm sleeping, but I'm on the phone with you. I could send a tweet right now that says I'm in New York when in reality, I'm in my home right now in Oakland, California.

A statement like that, of just revealing where my home is, sort of touches upon your second point. I think privacy in this type of world is a very tricky thing that needs to be maneuvered very carefully. Something like foursquare has privacy down because you need to have a bidirectional relationship with other people in foursquare to get a notification of where I am. I need to request to be friends with you, and you need to approve that request in order for me to get notification of where you are. I think that has privacy, or at least it has a privacy model which implies certain levels of control. For Twitter, since we have our asymmetric following method, I can follow you and you don't necessarily have to approve me. The onus is on us to make sure that the user's privacy is under control. It's definitely something that services like us need to take into account, whether it means that we're fuzzing the data, whether it means we're going to be storing data with a different level of precision than you've given it to us, are all questions up for debate.
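
One way to picture "storing data with a different level of precision" is to round coordinates before they are stored or published. The sketch below is an assumption for illustration only, not a description of what Twitter does; the function name `coarsen` and the default of two decimal places are hypothetical.

```python
def coarsen(lat, lon, decimals=2):
    """Reduce coordinate precision by rounding to `decimals` decimal places.

    Two decimal places of latitude correspond to roughly 1.1 km, so the
    result identifies a neighbourhood rather than a specific building.
    """
    return round(lat, decimals), round(lon, decimals)
```

Rounding alone is not a complete privacy model: repeated tweets from the same rounded cell still reveal a pattern, so a real service would need to combine it with the kind of user-facing controls discussed here.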

If we don't do that, then location-based services won't take off at all, actually. A lot of people will be really concerned about their privacy, no matter how much of a fad there is, or how much uptake there is in the alpha nerd population. But I think if you can provide good methods of privacy control, that can be explained to everyone, so everyone understands what they're doing in a very user-friendly way, then I think there'll be a huge upsurge, because of the value of the data that can come back to people.

James Turner: We're seeing news that the FBI wants ISPs to retain two years of email and two years of surfing records. I don't know how much you could talk about it because of the wonderful government restrictions on this stuff, but is that the kind of thing you guys worry about at all, that you're going to become a source for that kind of intelligence?

Raffi Krikorian: I'm not sure how much I can talk about it. What I will say is that we default to only displaying whatever data you gave us. So we don't hide any data that you give us, unless you're a protected user. Whatever data you give us, we publish back out again. So I guess the answer is yes and no on that point.

James Turner: What do you see as the technical side of geolocation, in terms of what's going to be the new interesting technologies coming along, and how they're going to be used?

Raffi Krikorian: From Twitter's standpoint, it's how do you accept all of this real-time data, index and analyze it and spread it throughout our system in almost real-time. People have traditionally built a bunch of GIS-like systems on top of PostgreSQL or on top of MySQL, and that's fine, but it doesn't scale after a while. After you throw a couple million or a couple hundred million entries at it, the amount it takes for one of those databases to process that, to insert it, all I have to do is select against it, and you can understand it's untenable for real-time operation. And by real-time, I mean sub-second operation.

So the stuff that we're doing is more geared towards how can you accept tweets that are coming in at what you can imagine to be an incredibly fast rate. Tweets are coming in, figure out their location, attach appropriate metadata to it. Store it in our database. Span it out to anyone who wants to look at it. Run research and analytics on it and index it in their search index, and do this all within a couple of seconds on the way through the system. I think there's a lot of interesting stuff being done out there on how things are being stored, how things are being indexed. But I think our personal contribution will be how do you do it at that kind of speed?



#songsincode on Twitter, SongsInCodeDB

August 21st, 2009

Looking at Twitter's #songsincode (just search on the #songsincode tag), it appears a large chunk of the geeky/nerdy world has come to a halt while spending the day expressing song titles in code. So far we’ve seen most programming languages as well as CSS and SQL come by. I think it’s a nice example of how “the collective” can become very creative. My favourite SQL one so far (by @john_chr): SELECT * FROM walk WHERE gait LIKE '%EGYPTIAN%'

Update: a good friend of mine, Steve Thorne (@Jerub), wanted to set up a site for this, so we hacked one up: SongsInCodeDB.



Four short links: 14 August 2009

August 14th, 2009

  1. Page2Pub -- harvest wiki content and turn it into EPub and PDF. See also Sony dropping its proprietary format and moving to EPub. Open standards rock. (via oreillylabs on Twitter)
  2. SQL Pie Chart -- an ASCII pie chart, drawn by SQL code. Horrifying and yet inspiring. Compare to PostgreSQL code to produce ASCII Mandelbrot set. (via jdub on Twitter and Simon Willison)
  3. How SudokuGrab Works -- the computer vision techniques behind an iPhone app that solves Sudoku puzzles that you take a photo of. Well explained! These CV techniques are an essential part of the sensor web. (via blackbeltjones on Delicious)
  4. Twitter by the Numbers -- massive dump of charts and stats on Twitter. I love that there's a section devoted to social media marketers, the Internet's head lice. (via Kevin Marks on Twitter)



Four short links: 7 August 2009

August 7th, 2009

  1. Defragging the Stimulus -- each [recovery] site has its own silo of data, and no site is complete. What we need is a unified point of access to all sources of information: firsthand reports from Recovery.gov and state portals, commentary from StimulusWatch and MetaCarta, and more. Suggests that Recovery.gov should be the hub for this presently-decentralised pile of recovery data.
  2. Memetracker -- site accompanying the research written up by the New York Times as Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs [...] For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours [...] a relative handful of blog sites are the quickest to pick up on things that later gain wide attention on the Web. Confirming that blogs and traditional media have a symbiotic relationship, not a parasitic one. (via Stats article in NY Times)
  3. Feds at DefCon Alarmed After RFIDs Scanned (Wired) -- RFID badges make for convenient security, and for convenient attack. Black hats can read your security cards from 2 or 3 feet away, and few in government are aware of the attack vector. To help prevent surreptitious readers from siphoning RFID data, a company named DIFRWear was doing brisk business at DefCon selling leather Faraday-shielded wallets and passport holders lined with material that prevents readers from sniffing RFID chips in proximity cards.
  4. A Comparison of Open Source Search Engines and Indexing Twitter -- Detailed write-up of the open source search options and how they stack up on a pile of Tweets. While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found: Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep … And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance. (via joshua on Delicious)



MySQL: Powering a New World Religion

July 26th, 2009

As I’ve blogged many times, MySQLers frequently share off-work interests and running is one of them. I’ve also blogged about social media, which usually use MySQL under the hood. Now I’ve combined the two (running and social media) with the insight that running is a religion: I’m propagating Runnism, the Religion of Running.

It started as a thought experiment that I’m now pursuing in what in MySQL AB lingo used to be called my “copious free time” (of which there never was much). I’ve started Twitter accounts for Runnism, one for each language in which I tweet. This is in sharp contrast to my @kajarno Twitter account, where I happily mix Swedish, German and English, with the end result that certain of my Twitter followers have asked me whether I’ve already got the Twitter error message “maximum number of non-English tweets exceeded”.

I’ve so far published three Runnism blog entries in English:

In German, I’ve published the equivalent three blog entries

plus a fourth one

which is my thank-you note to the German Twitterati. After a filmed eleven-minute presentation last Wednesday to 50 German alpha twitterers, I was the happy recipient of plenty of tweets and even a long blog entry explaining “Runnismus, die neue Weltreligion” (“Runnism, the new world religion”).

Interested? Follow me on Twitter as @Runnism in English, @Runnismus in German and/or @Runnismen in Swedish!