Matasano Chargen - Latest Comments in Aguri: Coolest Data Structure You’ve Never Heard Of

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

venkat2009 — Tue, 30 Jun 2009 03:32:38 -0000

I always use the Internet for hosting the website on the server. Sometimes my connection gets slow; at those times I use the site's IP address for speed checking.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

none — Fri, 02 May 2008 13:33:00 -0000

FYI: aguri in Greek means cucumber.. :oS

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Wed, 30 Jan 2008 18:30:57 -0000

I could burn 8 more blog posts talking about the different theories behind packet filtering --- pcap takes a pretty idiosyncratic approach, which is to frame the problem compiler-theoretically (though the runtime performance of pcap is dominated by the IO channel you use to get packets, by lines of code, pcap performance is overwhelmingly addressed by IR optimizers).

I say that because tries are a classical approach to speeding up packet filtering. You can consider routing a special case of filtering, of course, in a single dimension. As you add dimensions, you start cross-producting multiple tries.
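To make the "routing as one-dimensional filtering" idea concrete, here is a minimal sketch of longest-prefix matching over a plain binary trie. It is not the Aguri structure or any particular router's implementation: the names `PrefixTrie`, `insert`, and `lookup` are made up for illustration, it handles IPv4 only, and it spends one node per bit instead of compressing paths the way a PATRICIA trie would.

```python
import ipaddress


class TrieNode:
    """One bit of an IPv4 prefix; `value` is set where a prefix ends."""
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = [None, None]
        self.value = None


class PrefixTrie:
    """Hypothetical uncompressed binary trie keyed by IPv4 prefix bits."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, cidr, value):
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        # Walk the prefix from the most significant bit down.
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.value = value

    def lookup(self, addr):
        """Return the value of the longest stored prefix covering addr."""
        bits = int(ipaddress.ip_address(addr))
        node = self.root
        best = node.value  # a /0 default route, if one was inserted
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node.value is not None:
                best = node.value
        return best
```

With `0.0.0.0/0`, `10.0.0.0/8`, and `10.1.0.0/16` inserted, a lookup of `10.1.2.3` follows sixteen bits and returns the `/16` entry, while `10.2.0.1` falls off the trie after the `/8` and returns that instead. A filter over more fields would need one such trie per dimension, which is where the cross-producting comes in.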

And, of course, radix tries (and edge-labelled PATRICIA tries in particular) are just a specialization of DFAs, which brings us back to pcap and optimizers and...

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Erik — Tue, 29 Jan 2008 08:09:19 -0000

Good summary on trees and tries!
I wonder if radix tries could also be used to improve the performance of keyword searches in PCAP files or sniffed traffic (like an Echelon functionality). This type of search can be performed with NetworkMiner, see:
http://networkminer.wiki.so...

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Ryan Russell — Sat, 26 Jan 2008 23:05:09 -0000

Speaking of regex and DFAs, I saw an interesting presentation from a couple of Google researchers at IT Defense this week. They have found a bunch of implementation problems in various regex engines:
http://www.it-defense.de/it...
(Scroll down to "Regular Exceptions".)

No slides online or anything yet, I don't think. If you're interested, I'll come back when they are.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

sigsegv — Fri, 25 Jan 2008 09:30:20 -0000

/me sighs

What if you have to write the in_array() function, jackass?

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Charles — Thu, 24 Jan 2008 14:35:02 -0000

"I hand you an array of random integers. How can you tell if the number three is in it?"

Maybe you should read the manual.

// in_array() takes the needle first, then the array to search
if (in_array(3, $array)) {
    echo "3 is in the array";
} else {
    echo "3 is not in the array";
}

Problem solved. I didn't even have to read the rest of the post about aguri. That is a kind of cactus, right?

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Dennis Cox — Wed, 23 Jan 2008 23:12:46 -0000

The FPP, Fast Pattern Processor, portion of the Agere Network Processor is Patricia Tree based. The same practice of using Patricia Trees can apply to generic TCP/IP packet processing and "pseudo" regular expression searching. You can leverage them in really neat ways.

I first used Patricia Trees in creating silicon for routers [NP based Cisco routers] and later on IPS systems. The downfall is the changing of rules. Meaning, reorganizing the tree(s) is an art form [in a sense] and I had to fall back on a genetic algorithm to load the tree based on user settings when it came to 220 bit and beyond Patricia Trees. This seemed to solve the rule switching issue. To show you how bad the rule switching was: on a 7200 NP based system (Cisco) it would take up to 120 seconds to reload the BGP table (back when that was hip). Later, in 2001, we all switched to creating devices with dual banked memory for Patricia Tree silicon.

Great topic by the way - would love to see more articles like this.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

RPS — Wed, 23 Jan 2008 19:00:57 -0000

Good article!

Correct me if I'm wrong though (it's been a while since my algorithms study), but in response to the original question:

"I hand you an array of random integers. How can you tell if the number three is in it?"

Yes, linear search is bad, but isn't it the best way to answer the original question? If you're sorting the array, you're introducing additional complexity that still has to touch each element O(n). From what I remember, there can't be any sorting algorithm better than O(n).

If we sort the given array we've already spent at least O(n) but we now have to search the array (providing we didn't search while we were sorting). This assumes we're only looking for one element.

Now, if we can choose to get a sorted array of random integers, that all of course changes.
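The trade-off in this comment can be sketched in a few lines. This is only an illustration with made-up function names: a single membership query on unsorted data costs one O(n) pass, while sorting first costs O(n log n) for a comparison sort and only pays off when many lookups are spread over the same sorted array.

```python
from bisect import bisect_left


def contains_linear(values, target):
    """One pass over unsorted data: O(n), no preprocessing."""
    for v in values:
        if v == target:
            return True
    return False


def contains_sorted(sorted_values, target):
    """Binary search: O(log n) per query, but only valid on sorted input."""
    i = bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target


if __name__ == "__main__":
    data = [9, 3, 7, 1]
    print(contains_linear(data, 3))            # single query: just scan
    print(contains_sorted(sorted(data), 3))    # sort once, then query often
```

For the one-off question "is three in it?", the scan wins; the sorted version makes sense only if the array arrives sorted or will be queried repeatedly.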

Anyway, that's all beside the point of Aguri. Never heard of it but I'll definitely check it out!

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Ralf — Wed, 23 Jan 2008 18:30:03 -0000

Hi Thomas,

nice to see you writing again. This data structure you're describing sounds awfully similar to Binary Decision Diagrams. Basically if I did a binary expansion of that 32-bit IPv4 address and then gave each of those bits a variable name, I could do exactly the same thing you've described with a reduced, ordered BDD. The merging operation you have described is the reduction operation that transforms a binary decision tree into a decision diagram.

Cheers,
Ralf

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Wed, 23 Jan 2008 13:12:42 -0000

Don't tell him I told you about it; he may regret leaving that out there. =)

I've used code derived from Kneel's PATRICIA tree in production, so I'm pretty comfortable with it.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Todd Hayton — Wed, 23 Jan 2008 13:09:17 -0000

"I first heard the complaint from Danny Dulai when we and Kneel Fachan wrote the Patricia library for libishiboo, which you may still be able to find online."

Ah cool - I found the libishiboo library at his (Dulai's) site. Looks like the code actually implements removal of nodes from the Patricia trie too, which is neat since I've never been able to find any good explanation of how to go about doing this (Sedgewick leaves it as an exercise in his book).

Todd H

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Chris — Wed, 23 Jan 2008 10:22:33 -0000

"none of you better have a BSCS or higher"

Never fear, dude :^)

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Wed, 23 Jan 2008 09:52:16 -0000

To be honest, without digging out my copy of Sedgewick, I can't actually back that statement up. In my defense, I first heard the complaint from Danny Dulai when we and Kneel Fachan wrote the Patricia library for libishiboo, which you may still be able to find online.

You can check the errata on Sedgewick's site, but he only has errata for the third edition on; third edition is from, I think, 2000? and my apocryphal claim about his error would be from around 1995. So, long story short, if you crib your trie code from the current Sedgewick book, you might be aces.

The graph algorithms volume of the new Sedgewick is also excellent.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Todd Hayton — Wed, 23 Jan 2008 08:51:44 -0000

Hey there, interesting post. I was wondering - I've seen two references now stating that Sedgewick got the implementation for patricia tries wrong (the other reference being http://cr.yp.to/critbit.html) - however I've never seen an explanation on just what exactly was wrong with his implementation.

Todd H

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

desentizised — Wed, 23 Jan 2008 04:48:51 -0000

i have no idea how the 4 in the binary tree can be where it seems to be; maybe this has already been pointed out or something, but 4 comes from 6 which comes from 5, so 4 is greater than 5? maybe i just didn't get it, but 4 should be a left-wing descendant of 5 instead of being a left-wing descendant of 5's right-wing descendant. at least that's how i would've done it.
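The invariant this comment appeals to can be checked mechanically. The sketch below is an assumption-laden illustration: the figure from the post is not reproduced here, so the insertion order 5, 6, 4 and the names `Node`, `insert`, and `is_bst` are hypothetical. It shows that ordinary binary search tree insertion puts 4 to the left of 5, and that a tree with 4 hanging under 6 on 5's right side fails the ordering check.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Standard unbalanced BST insert; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def is_bst(node, lo=None, hi=None):
    """True if every key lies strictly between the bounds set by its ancestors."""
    if node is None:
        return True
    if lo is not None and node.key <= lo:
        return False
    if hi is not None and node.key >= hi:
        return False
    return is_bst(node.left, lo, node.key) and is_bst(node.right, node.key, hi)
```

Inserting 5, 6, 4 in that order leaves 4 as the left child of 5 and 6 as the right child. Building the questioned shape by hand (5, with right child 6, whose left child is 4) makes `is_bst` return `False`, because everything in 5's right subtree must be greater than 5.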

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Wed, 23 Jan 2008 01:17:17 -0000

For a disgusting example of what I'm talking about that Dug Song just mentioned, consider Judy arrays.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Wed, 23 Jan 2008 01:13:33 -0000

I'm not talking about the storage hierarchy. We agree, external search, different problem from in-memory search.

I'm specifically talking about tree structure layouts that minimize cache misses and overhead for links.

You don't even need to store links directly; you can use minimum-bit or succinct encodings. If you're concerned about optimizing a specific memory access pattern --- for instance, increasing the chance that all the traversal fetches are going to be in loaded cache lines --- you can tailor the memory layout to that as well.

It's true that "struct treenode { void *key; void *data; struct treenode *left; struct treenode *right; };" is not a particularly memory-efficient or cache-efficient layout. I'm just strongly objecting to tarring the whole concept of a tree with that naive implementation. Realtime network production code running at n mpps wouldn't use "struct treenode", or a hash table.
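One way to picture a tree without stored links is the implicit array layout, where the children of slot `i` live at slots `2*i + 1` and `2*i + 2`. The sketch below is only an illustration of that indexing scheme, not the layout any particular production system uses, and the function names are invented here. Python lists hold references rather than packed keys, so this shows the addressing idea and says nothing about real cache behavior; in C the same scheme would be a flat array of keys.

```python
def build_implicit(sorted_keys):
    """Place sorted keys into an array so that slot i's children are 2i+1 and 2i+2."""
    n = len(sorted_keys)
    tree = [None] * n
    it = iter(sorted_keys)

    def fill(i):
        # An in-order walk over the implicit tree hands out keys in sorted
        # order, which yields a valid search tree with no pointers at all.
        if i < n:
            fill(2 * i + 1)
            tree[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return tree


def contains(tree, key):
    """Search by index arithmetic instead of following child pointers."""
    i = 0
    while i < len(tree):
        if key == tree[i]:
            return True
        i = 2 * i + 1 if key < tree[i] else 2 * i + 2
    return False
```

For the keys 1, 3, 4, 5, 6, 8, 9 this produces the array `[5, 3, 8, 1, 4, 6, 9]`: the root and its first levels sit next to each other at the front of the array, which is the property a cache-conscious layout is after.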

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Ray Lee — Wed, 23 Jan 2008 00:39:01 -0000

Obviously we're both just miscommunicating here, so please bear with me a little longer.

My initial statement was pretty simple. You said "Binary trees are a better default table implementation than hash tables." and I said it depends. Take the below with that context, okay?

I don't understand what you mean by "insistent on defining 'tree structures' as 'structs with two child pointers' and 'memory' as 'what malloc returns'." Nothing I said is dependent upon a specific representation. Quite the contrary, I gave the serialized-to-disk example of a ternary tree to counter that possible issue. And as for memory being what malloc returns, all I can say is "huh?". You seem to think that RAM has different characteristics whether it's heap or stack? I'm horribly, horribly confused by what you are trying to get across here.

So...?

From my point of view a tree is something that requires a set of traversals. I would define a single "traversal step" as a load from memory and a branch decision (based on a comparison). Whether memory comes from heap or stack or is even purely on disk is immaterial to my argument. Whether it's a b-tree, radix tree, binary tree, or some hybrid is also immaterial.

So, I guess I just flat out don't understand your point. I'm saying in certain circumstances, a hash is a far better data structure than a tree, whatever kind of tree that may be. Are you disagreeing with that? I just can't fathom how. Are you saying that a tree of some sort is *always* better than a hash? If so, why do you think ComSci profs teach it? Merely to confuse students?

I mean, in none of my comments have I been strident about this. I'm just trying to say that in your otherwise nice writeup, you gloss over a point that needs to be addressed more carefully for those who are not as versed in data structures as others. And by reading the comments, there are many who just aren't as well trained as we may hope. Don't mislead them. That's all I'm saying. Okay?

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Wed, 23 Jan 2008 00:21:47 -0000

Judy is disgusting.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

dugsong — Tue, 22 Jan 2008 23:05:22 -0000

hey tom. going to explain judy trees next? ;-)

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Tue, 22 Jan 2008 19:22:01 -0000

Ray, all you're pointing out is that loading whole 64 bit pointers out of memory and following them to some random other region of memory causes cache misses. You seem insistent on defining "tree structures" as "structs with two child pointers", and "memory" as "what malloc returns".

You're obviously a smart guy. Why am I having such a hard time getting you past that idea?

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Ray Lee — Tue, 22 Jan 2008 17:28:57 -0000

"You’re really comparing external search to in-memory search? Come on, Ray."

Wow. I must be really miscommunicating here if that's all you took away from that. It was an example to motivate your intuition.

But to go with it for a moment, yes, I am, as they have parallels. A cache miss in memory is akin to a head seek on a drive. To see why, just think about how the L2 cache is the same as the track cache. The L1 is equivalent to whether a sector is in the RAM cache or not. But this was merely an example. Discard it if it offends you so deeply.

The point is that cache misses are expensive on modern processors. Back when I started on the 6502 et al (1979), there was no difference. Nowadays, there's a massive difference between code that makes many memory accesses or few, and whether those memory accesses are spread out randomly or in order.

Try it. Make a large 2-d array of integers (say 1024x1024), and try walking the array in column order first versus row order first, and timing it.
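The experiment suggested above can be sketched as follows. This is a rough illustration with invented function names, and the numbers it prints depend heavily on the machine and runtime. In Python the grid is a list of row lists, so part of any gap comes from interpreter overhead and not from cache misses alone; the same loops in C over a flat `int` array isolate the cache effect much more cleanly.

```python
import time


def make_grid(n):
    """n x n grid stored as a list of rows (row-major order)."""
    return [[r * n + c for c in range(n)] for r in range(n)]


def sum_row_major(grid):
    """Visit each row in turn, walking along it left to right."""
    total = 0
    for r in range(len(grid)):
        row = grid[r]
        for c in range(len(row)):
            total += row[c]
    return total


def sum_column_major(grid):
    """Visit each column in turn, jumping to a different row on every step."""
    total = 0
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    for c in range(n_cols):
        for r in range(n_rows):
            total += grid[r][c]
    return total


def time_it(fn, grid):
    start = time.perf_counter()
    fn(grid)
    return time.perf_counter() - start


if __name__ == "__main__":
    grid = make_grid(1024)
    print("row-major:   ", time_it(sum_row_major, grid))
    print("column-major:", time_it(sum_column_major, grid))
```

Both functions compute the same sum; only the order in which memory is touched differs, which is the point of the exercise.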

Anyway, if you don't want your intuition motivated, well then, I guess I'm just going to have to beg you to measure the speed difference between walking a tree and looking up a key in a hash.

And for the third time, there are times when hashes are completely worthless -- such as partial path lookups and needing to traverse neighbors (L/R/Parent) in an ordered collection. But for many things you're generally going to be better off with a hash.

Don't take my word for it. Go measure it.
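For the cases conceded above, where a hash does not help, here is a small sketch of neighbor and prefix queries. It assumes a sorted Python list searched with `bisect` as a stand-in for an ordered tree, and the function names are hypothetical. A plain `dict` answers exact-match questions only, so it has no equivalent of either function short of scanning every key.

```python
from bisect import bisect_left, bisect_right


def neighbors(sorted_keys, key):
    """Return (predecessor, successor) of key; None where there is none."""
    lo = bisect_left(sorted_keys, key)
    hi = bisect_right(sorted_keys, key)
    pred = sorted_keys[lo - 1] if lo > 0 else None
    succ = sorted_keys[hi] if hi < len(sorted_keys) else None
    return pred, succ


def with_prefix(sorted_words, prefix):
    """All words starting with prefix, found by jumping to the first candidate."""
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches
```

Given the keys 10, 20, 30, asking for the neighbors of 25 returns 20 and 30 even though 25 itself is absent, which is exactly the kind of question an ordered structure answers and a hash table cannot.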

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Thomas Ptacek — Tue, 22 Jan 2008 16:21:12 -0000

You're really comparing external search to in-memory search? Come on, Ray.

Re: Aguri: Coolest Data Structure You’ve Never Heard Of

Ray Lee — Tue, 22 Jan 2008 15:30:20 -0000

"(2) See N comments ago where I addressed the received wisdom that trees are necessarily hard on memory and cache performance. No, your trees are."

Sigh. If you're not doing prefix or neighbor lookups, then a tree is never a win in memory due to cache effects. It's especially not a win on a disk drive as you're causing more seeks.

This isn't wisdom whispered in halls, this is just plain measurable on any test.

And yes, I know about optimizing trees to deal with cache effects. I wrote a bone-head boggle solver that creates a ternary tree of all the words, so that I could quickly prune search paths. In part of the optimization of the algorithm, I measured cache misses (to disk, as I had serialized the tree and mapped it directly into memory from disk so that it wouldn't have to be re-run on every invocation). Part of what came up immediately was it was important to create the tree in a nice order so that after the first few comparisons, everything was under a page size. This is your point that you make above.

But, bottom line, if you have to traverse a tree structure to merely see whether or not a key exists (or just acquire the associated leaf data), it will always be slower, and worse on the entire system (due to cache effects) than a hash.

I like tries. I use tries. But they aren't always appropriate and you're doing your readers a disservice if you tell them any differently.