Hacker News | znbailey's comments

There was an explanation of this during the TC50 presentation and in the Q&A afterward.

From my simple understanding, this is a management application layered on top of the Mechanical Turk platform. It provides more robust/advanced management and analytics, tracking the quality and speed of repeat turkers, and it can integrate with other commoditized/outsourced labor systems. Because it maintains more state about tasks and turkers, it can provide valuable intelligence about how trustworthy and accurate the results are.

It was very interesting during the Q&A when Bradley Horowitz asked the company rep whether Amazon might simply do this themselves; the presenter stated firmly that they had a pretty good personal relationship with the people at Amazon, who took pride in the Turk service really being a black-box type of platform.

Regardless, the main problem this business faces (which was handily pointed out by the experts) is the issue of gaining traction - how do you make big businesses aware of this as a possible solution to their labor issues? Even at MySpace where they pay 300 people to moderate photos for obscenity they have not realized they could much more effectively and cheaply be leveraging MT for this task.


MySpace can't use MT for the task because every picture sent to MT would be a potential "OMG PORN ON MYSPACE FILM AT 11" story. They have to keep their screeners in-house to deal with potential publicity issues. Trust me, the MySpace people (as would any moderated community) would love to use MT for this stuff; moderation is no one's idea of a core competency.


Content moderation is one of the things that CrowdFlower does really well -- and yes, you can have workers search for adult content (At least on Mechanical Turk. I'm not sure about our other labor pools. In any case, you can definitely use CrowdFlower for filtering out adult content.)

You'll probably want to make sure your job description mentions that, though!

(I'm a front-end developer for CrowdFlower. I don't represent the company's views. Except in regards to bacon. I represent those.)


What is MT's policy on that sort of task - do their TOS allow for submitting pictures for screening (or similar things)?

As a side note, this sounds like a potential hole in the market.


Decatur/Candler Park, GA, US


The author fails to mention another very common approach, one that does not suffer (as badly) from the drawbacks of the nested set approach. This strategy is called "materialized path".

In the materialized path strategy, each node/row has its full path in the tree stored in a column and that path is "materialized" or "derived" from a depth-first traversal to that node. In conjunction with the adjacency list approach this could yield much better results in most situations than a nested set model.

Taking the tree used in the article, the paths would look like (using '/' for path delimiter but you can use whatever you like):

  electronics
  electronics/televisions
  electronics/televisions/tube
  electronics/televisions/lcd
  electronics/televisions/plasma
  electronics/portable electronics
  electronics/portable electronics/mp3 players
  electronics/portable electronics/mp3 players/flash
  electronics/portable electronics/cd players
  electronics/portable electronics/2 way radios
Alternatively, it may make more sense to base the materialized path on an immutable field such as the primary key of the record (in the same order as the paths above, from the article):

  1
  1/2
  1/2/4
  1/2/5
  1/2/6
  1/3
  1/3/7
  1/3/7/10
  1/3/8
  1/3/9
This has the following advantages:

  * Node insertion/deletion overhead is low: unlike the nested set model, it does not require updating other nodes
  * Querying for subnodes is easy: "select * from tree where path like 'electronics/televisions/%'". To get only direct subnodes, add a parent_id specifier.
  * Getting the full path for a node is easy: it's built into the node itself
  * Node depth is easy: count the number of path segments
Downsides:

  * Moving a sub-tree requires updating the paths of all of its sub-nodes
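To make the scheme concrete, here's a minimal sketch in SQLite using the PKey-based paths from the second listing (the table layout and column names are my own, not from the article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (id INTEGER PRIMARY KEY, name TEXT, path TEXT)")

# Each row stores its full materialized path, built from primary keys.
rows = [
    (1, "electronics", "1"),
    (2, "televisions", "1/2"),
    (4, "tube", "1/2/4"),
    (5, "lcd", "1/2/5"),
    (6, "plasma", "1/2/6"),
    (3, "portable electronics", "1/3"),
    (7, "mp3 players", "1/3/7"),
    (10, "flash", "1/3/7/10"),
    (8, "cd players", "1/3/8"),
    (9, "2 way radios", "1/3/9"),
]
conn.executemany("INSERT INTO tree VALUES (?, ?, ?)", rows)

# All descendants of "televisions" (id 2): a simple prefix match on path.
subtree = conn.execute(
    "SELECT name FROM tree WHERE path LIKE '1/2/%'"
).fetchall()

# Node depth: count the path segments (here, count separators and add one).
depth = conn.execute(
    "SELECT LENGTH(path) - LENGTH(REPLACE(path, '/', '')) + 1 "
    "FROM tree WHERE id = 10"
).fetchone()[0]
```

Note that inserting a new leaf only touches one row, whereas the nested set model would have to renumber every node to the right of the insertion point.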
Hope this helps someone!

Cheers, -Zach


Materialized path is a good tool for trees that won't be deeply nested, and for tables that won't grow particularly large. However, because the path is stored in a string, read operations are not particularly efficient. It's a good tool to know, but it doesn't scale as well as Nested Set to very large datasets. That being said, it's much less complicated to implement than Nested Sets and is more efficient than adjacency list for many types of data.


I'm not so sure about inefficiency; in particular, most databases that I've seen can use an index for string-prefix queries, which means its performance ought to remain acceptable even for large datasets. (Assumption: subtree queries are the ones you care about making fast.)

Also: inserts are practically free compared to the linked article, where they require updating the left and right numbering for every following node!

Another nice property: if you make entries in the materialized tree column constant-width (e.g. by zero-padding), an alphabetical sort by that column will give you a depth-first dump of the tree -- the exact order you'd like for, say, a comment thread or a table of contents.
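A quick illustration of why the padding matters (the width-3 padding is arbitrary):

```python
# Without padding, lexicographic order diverges from numeric order:
# child 10 sorts before child 2.
unpadded = sorted(["1/2", "1/10"])   # → ["1/10", "1/2"]

# Zero-padding every segment to a fixed width fixes this; an ORDER BY
# on the padded column then yields a depth-first dump of the tree.
padded = sorted(["001", "001/002", "001/010", "001/002/004"])
# → ["001", "001/002", "001/002/004", "001/010"]
```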

When I've implemented materialized paths in the past, I have run into issues with the maximum allowed length of indexable string types (which limits the tree depth), but this was in the long ago late 90s. :) I think it's a very nice albeit imperfect way of storing certain types of trees, especially ones that are mostly insert+query-only.


How are reads not particularly efficient? If you use rooted paths in your lookup, then even a "LIKE '/a/b/c%'" query for all descendants will effectively utilize the index. I think that this would be good for deeply nested trees also. As Zach implied, the downside of this approach is moving subtrees. Unless you have a very volatile tree structure, this should be perfect for most uses.


Because B-Tree indexes perform orders of magnitude better on small lookup values, like an integer, than they do on large (and, even worse, variable-length) strings. A number of factors contribute to this, but two big ones are: the raw time it takes to compare two strings is much larger than comparing an integer that matches the register size of the machine, and the depth of the B-Tree depends on the size of the lookup key.

As I said above, if you are using the materialized path for the type of problem it's best at solving, the speed differences won't matter so much. But that's primarily because the trees aren't particularly deeply nested and/or the tree table itself isn't overly large. So in essence you are trading computational complexity for ease of use on smaller sets of data. In many cases that's exactly what's needed. On the other hand, if you will be modeling very large trees, or will have a huge number of them, nested sets are more efficient in terms of encoding and storage, as well as lookups and retrievals. The downside is that nested sets are more complex to set up and work with, and make understanding the structure of your trees more difficult.

IMHO, it's important not to fall into the trap of thinking that one technology/tool/solution/data structure will solve all of your problems. It's good to know the pros and cons of different solutions and which problems they are most efficient at solving.


Great point. I meant to mention this in my list of cons but it slipped my mind.


I'm not the only person who uses this! I wish I knew this had a name when I proposed it; everybody I was working with thought it was strange. I think it's a good mix of performance and query-ability: you can do quite a bit here using LIKE matches. Of course, the main assumptions are that the dataset isn't huge, and the hierarchy isn't changing often.


Helpful, yes. Thank you.


Interesting.

For short trees and in-memory management another option would be:

  electronics
  /televisions
  //tube
  //lcd
  //plasma
  /portable electronics
  //mp3 players
  ///flash
  //cd players
  //2 way radios
Where depth levels are represented by slashes only.

Replace with indexes and you have a shorter version.

Let's see what else we can come up with just for fun...


You seem to be assuming a record order; SQL doesn't really have that. (You can hack stuff together but you're still building on an essentially orderless foundation.) That's a fine text document format, though tabs or spaces would be more traditional in that role.


record: n int, val varchar2

Manage and sort in-memory and store already ordered when done (no matter if it is physically unordered).

That's why I said, for small trees and in-memory management.


It looks like they announced they will be making a concession to bring back at least some of the old behavior:

http://blog.twitter.com/2009/05/we-learned-lot.html

The gist: they're bringing back the old behavior only when the message starts with @username (and wasn't created by clicking the reply icon).

It sounds like they made a technical change which trumps the actual message content with a piece of reply-to metadata, which is either explicitly included by the client or parsed by their infrastructure.

Previously I believe it was "dumb" and didn't examine anything, and it was up to the client to filter based on the content of the message when it came to showing half-conversations.
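That old client-side rule could be sketched roughly like this (the function and its inputs are hypothetical illustrations, not Twitter's actual code):

```python
def show_in_timeline(tweet_text: str, followees: set) -> bool:
    """Old-style content-based filter: hide tweets that open with
    @username unless the viewer also follows the mentioned user."""
    if not tweet_text.startswith("@"):
        return True  # not a reply-style message; always show it
    mentioned = tweet_text.split()[0].lstrip("@")
    return mentioned in followees

# show_in_timeline("@alice lunch?", {"bob"}) hides the half-conversation;
# show_in_timeline("@alice lunch?", {"alice", "bob"}) shows it.
```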


We use it where I work and the designers like it a lot. It works very well to build cross-browser layouts in a very quick and simple manner and has drastically reduced the amount of time it takes them to cut PSDs into raw XHTML/CSS.

One downside you will quickly notice is that your markup becomes much less semantic and a lot more cluttered with CSS classes. This is simply by virtue of how all "grid" CSS frameworks like this work - they set up a ton of generic styles that can't be given semantic names. With Blueprint you start to see a lot of classes that look like "span-XX last" or "push-XX" or "pull-XX".

There are a number of ways to get around this, however. The first is a tool that comes with Blueprint that lets you set up a YAML configuration which maps semantic ids to these generic CSS classes. For instance you can say "#footer => span-24 last". This will then generate the CSS in a semantic manner so you can continue using id references in your XHTML.

The second way, and this is the approach that I would highly recommend, is to go ahead and use some sort of CSS metalanguage/compilation tool such as SASS (Syntactically Awesome Stylesheets - http://haml.hamptoncatlin.com/docs/rdoc/classes/Sass.html). Chris Eppstein has done a fantastic job with the Compass CSS Framework (http://wiki.github.com/chriseppstein/compass) which builds on top of SASS, which allows you to do really amazing things with CSS.

The Compass framework he has built allows you to do things like define CSS constants (colors, font sizes, etc.), perform unit arithmetic, and define abstract classes or groups of properties called "mixins" which let you abstract out a lot of your repetitive CSS properties. It is also a whitespace-sensitive DSL, which makes the highly nested nature of CSS selectors a breeze to work with. It also already comes with modules for Blueprint and other grid frameworks!

I would check out his talk about it here if this sounds interesting to you:

http://pivotallabs.com/talks

HTH!


While I like what I see in SASS and Compass, I'm afraid I won't be able to partake in the excitement, as I code in PHP for my day job and hack in Python during my spare time. Is there any equivalent out there?


Although SASS is built with Ruby, nothing about it is strictly tied to Ruby or Rails. You can use Sass as a standalone tool for pre-processing your CSS for any other web platform. Haml, however, is tightly bound to Ruby.

