[shadow-dev] Extending Shadow to download objects embedded in HTML

Rob Jansen jansen at cs.umn.edu
Thu May 31 17:26:01 CDT 2012

Hey Arne,

Sorry for the delayed response, and thanks for your patience. I've been out
of the office for much of the last 10 days and am trying to catch up on
email.

On Thu, May 24, 2012 at 9:46 AM, Arne Diekmann
<caffeine at parttimegeeks.net>wrote:

> Hey Rob and the others,
> I want to use shadow/scallion to analyze countermeasures in particular
> against
> the fingerprinting attack and against traffic analysis attacks in general.
> More
> details are in my thread "Using Shadow/Scallion to connect to 'real'
> Servers"
> in the shadow-support list.

The thread starts here:

> Rob convinced me that it is easiest to build upon the filetransfer plugin
> to
> add parsing of HTML and requesting further objects.
> I've now written a program to make such a browser-like request with
> libcurl.
> It can be examined and played with here:
> https://gist.github.com/2781392
> Any comments on the code are greatly appreciated. I know it has the
> following
> shortcomings which I might look at in the future, if classification gets
> too
> hard:
> - Javascript is considered to be parsed immediately, when in reality it may
> take considerable time, blocking downloads in the meantime
> - A lot of objects are actually referred to in the CSS (webfonts,
> background-images)
> - @import is ignored

After adding -I/usr/lib64/glib-2.0/include/ to the Makefile, it compiled
correctly on my machine. It might also help to add fetcher.d to the clean
target and to give the executable a static name (mine was called
'gist2781392-c03a8ce20f7166795553d483601415249d7af92d' after running make).
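A sketch of what those Makefile tweaks might look like. The file and target names (`fetcher.c`, `fetcher`) are assumptions standing in for the gist's actual names, and it assumes pkg-config is available, which avoids hard-coding the glib include path:

```make
# Hypothetical Makefile sketch; file/target names are assumptions.
# pkg-config supplies the glib include dirs (e.g. -I/usr/lib64/glib-2.0/include)
# portably, and a fixed target name avoids the hash-suffixed executable.
CFLAGS  += $(shell pkg-config --cflags glib-2.0 libcurl)
LDLIBS  += $(shell pkg-config --libs glib-2.0 libcurl)

fetcher: fetcher.c
	$(CC) $(CFLAGS) -o $@ $< $(LDLIBS)

clean:
	rm -f fetcher fetcher.d
```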

I tried downloading http://www.google.com, and after a bunch of HTML parser
errors, it found the logo image and downloaded it. It appears to actually
download things.

I think you are right to worry about the JS and CSS stuff later.

> The next step for me is to actually integrate that code into shadow. I
> have a few questions about that. Most of them revolve around the virtual
> network and how it works in detail.
> 1. Can I just use libcurl, or do I have to use epoll like the filetransfer
> plugin currently does? I use libcurl's multi interface, which uses only a
> single thread.

I think it's great that you are using libcurl! I had already been hoping
that we would move in this direction for future download plugins.

The answer depends on what libcurl is doing under the covers. So far we
only have epoll implemented in Shadow, though there are other kernel event
handlers out there. (Libevent -- http://libevent.org/ -- is an app library
that wraps all the kernel handlers.) If libcurl uses epoll or can be
configured to use epoll, then you should be good. Otherwise, we'll have to
either:

1. implement whatever kernel handler they use in Shadow
2. intercept their event calls and somehow force them through an epoll
3. modify libcurl to MAKE it support epoll.

I don't recall if libcurl uses epoll or not.

Another important thing to keep in mind is that whatever code you use MUST
NOT BLOCK. Because Shadow is a discrete event simulator, your program will
pause infinitely if your code blocks while Shadow is running it. According
to the libcurl website -- http://curl.haxx.se/libcurl/c/libcurl-multi.html --
there are still a couple of places that have blocking code. You have to
either not use these things, or fix them in libcurl so they no longer block.

> 2. If there are any calls from libcurl which need to be done in a different
> manner, is there any way I can intercept them just like it is done with
> Tor? Or is this far more complex than using epoll?

This is not too complex, as most of the hard work has already been done. In
other words, you can just follow the techniques that we use in Shadow and
Scallion to intercept functions. See:

> 3. Should I create a new plugin or extend shd-service-filegetter.c ?

I suggest you thoroughly examine shd-service-filegetter first, and
determine the path you'd like to take. We separated filegetter into a
'service' so that our main code can run as a stand-alone exe, as its own
Shadow plug-in, and as part of a multiple service plug-in like Scallion
(each Scallion node runs Tor and a filegetter). You might gain some insight
into writing the type of non-blocking, asynchronous code that Shadow
requires by examining the existing plug-in.

I wouldn't mind a fresh plug-in implementation, because the existing
filegetter sometimes causes problems. If you feel comfortable redesigning
the code to work better, go for it. Otherwise, I would probably start with
the existing plug-in, and modify things wherever I thought they could be
improved while incorporating the new code.

> 4. I need to extract certain features from the Tor traffic for
> classification. E.g. I need to determine the following features:
> - Total trace time
> - Total transferred bytes
> - Individual packet sizes
> - The time when each packet was received
> - ...
> The best (and supposedly easiest) way would be if I could get them in a
> libpcap-like format (because I need to conduct experiments without using
> Tor and only the fetcher script above). What is the easiest and best way
> to get that data?

I'm going to leave this question to John, as he's been working on code that
dumps pcap-style logs. He will be able to give you more insight here than I.

> I know it's a lot of (possibly very stupid) questions again. But I hope
> you can help me, knowing that I will be grateful forever :)
> - Arne Diekmann

No problem! I'm glad you're excited to use and contribute to Shadow :)
