CS 294-105: Homework #1 - Due 9PM Wed Sep 3
The first 3 parts of this assignment should be straightforward and not
take much time. Part D has a lot of elements, with some of the later
ones being quite challenging. Give each one some thought, and put in some
decent effort exploring them as far as you can - but don't kill yourself
trying to do all of them if you find you're at a loss or it's taking too
much time. Since this is a new assignment, I don't have a sense of whether
it gets too hard too fast, or only at the end of part D.
Complete the assignment by 9PM Wed Sep 3 if possible. If you
can't finish it by that time, you should still strive to have it done
prior to our next class meeting at 5PM Thu Sep 4, as you will
get a lot more out of the accompanying lecture if you've done so.
Turn in your answers (only for part D) via email to vern@cs.berkeley.edu
with the term Homework in the Subject.
Update: this is Version 3 of the assignment:
Diffs versus Version 2.
(Diffs of Version 2 versus Version 1.)
- Complete the student survey by 3PM on Friday Aug 29.
- If needed, (re)familiarize yourself with:
- How to construct histograms and cumulative distribution
functions (CDFs).
- The (very) basics of (homogeneous) Poisson processes.
This means the gist of what they capture (counting
the arrivals of a memoryless arrival process) and
their most basic properties (independent arrivals;
exponentially-distributed interarrivals).
- The workings of traceroute, a tool very widely used
to measure the routes that individual messages ("packets")
take through the Internet.
(Update: my best guess now is that we won't get to
talking about traceroute during the next
class meeting, so if you're unfamiliar with it, you
could postpone learning about it.)
For all of these, an understanding at the level of the "easy parts" of the
corresponding Wikipedia page (or Unix man page, for traceroute)
will suffice.
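As a refresher on the first two items, here is a minimal sketch that simulates a homogeneous Poisson process and builds an empirical CDF of its interarrivals. The rate `lam` and the sample size are arbitrary illustration choices, not anything specified by the assignment.

```python
import math
import random

random.seed(0)
lam = 2.0  # events per second

# Poisson process interarrivals are independent Exponential(lam) draws.
gaps = [random.expovariate(lam) for _ in range(10000)]

# Empirical CDF: the fraction of the sample at or below t.
xs = sorted(gaps)
n = len(xs)

def ecdf(t):
    return sum(1 for x in xs if x <= t) / n

# Two sanity checks against exponential theory:
print(round(sum(gaps) / n, 2))            # mean should be near 1/lam = 0.5
print(round(ecdf(math.log(2) / lam), 2))  # CDF at the median should be near 0.5
```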
- Download the keystroke timing dataset.
This dataset is derived (with one exception, as discussed below)
from actual measurements. Each row in the dataset represents
activity seen from a given user during a remote login session.
For the most part, such activity represents individual keystrokes
that the user types and which are then sent to a remote server.
(A minority of the entries instead represent network protocol
messages used by the login service.)
The first column is a timestamp in seconds, and the second is an
anonymized user ID.
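A minimal loading sketch based on the description above. The filename and the assumption that the two columns are whitespace-separated are my guesses; adjust to match the actual file.

```python
def load_keystrokes(path):
    """Return a list of (timestamp_seconds, user_id) tuples."""
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            ts, user = fields
            rows.append((float(ts), user))
    return rows

# e.g., rows = load_keystrokes("keystrokes.txt")  # filename is a guess
```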
- In your homework writeup, briefly answer as much of the following
as you're able to:
- Summarize the most basic properties of the dataset.
(I'm leaving this vague so you can reflect on just
what constitutes those basic properties.)
- Do you have any comments about the data's quality?
If not, simply state that you don't.
- How would you characterize the per-user duration of activity
and number of keystrokes?
- There's a large body of theoretical modeling premised on the
notion that the "arrival" of new users (i.e., the times at
which they first show up) looks like a Poisson process.
To the degree that you can, do you believe that this data
supports that premise? Explain how you arrived at your
conclusion. Recall that Poisson processes have
exponentially-distributed interarrivals, i.e.,
if you compute the time difference between consecutive
arrivals, it should be exponentially distributed.
Note that you'll have to do a bit of processing of the
data to extract the interarrivals between new users.
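That extraction step might be sketched as follows, assuming the data has already been parsed into (timestamp, user_id) tuples; the function name is mine, not part of the assignment.

```python
def user_arrival_interarrivals(rows):
    """From (timestamp, user_id) tuples, keep each user's FIRST
    timestamp and return the gaps between consecutive first
    appearances (the new-user interarrivals)."""
    first_seen = {}
    for ts, user in sorted(rows):        # process in time order
        first_seen.setdefault(user, ts)  # only the earliest time sticks
    arrivals = sorted(first_seen.values())
    return [b - a for a, b in zip(arrivals, arrivals[1:])]
```

These gaps are what you would then compare against an exponential distribution.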
- Extract from the data the interarrivals of each user's
individual keystrokes. Analyze these collectively
(pooling them across all users). For example, if
in total the data looked like:
1.3 user5
1.6 user7
1.9 user5
2.1 user5
2.8 user7
then you would extract the values 0.6 (the time between
user5's 1st and 2nd keystrokes), 0.2
(the time between their 2nd and 3rd keystrokes), and
1.2 (the time between user7's 1st and 2nd
keystrokes). The pool of aggregate data you'd analyze
would then be these three values (in no particular order):
0.2, 0.6, and 1.2.
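The extraction just described can be sketched like this (again assuming (timestamp, user_id) tuples; the function name is my own):

```python
def pooled_keystroke_interarrivals(rows):
    """Group timestamps by user, then pool the within-user gaps
    across all users."""
    by_user = {}
    for ts, user in sorted(rows):
        by_user.setdefault(user, []).append(ts)
    pooled = []
    for times in by_user.values():
        pooled.extend(b - a for a, b in zip(times, times[1:]))
    return pooled

# The worked example from the text:
example = [(1.3, "user5"), (1.6, "user7"), (1.9, "user5"),
           (2.1, "user5"), (2.8, "user7")]
print(sorted(round(d, 1) for d in pooled_keystroke_interarrivals(example)))
# [0.2, 0.6, 1.2]
```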
- The aggregate set of interarrivals that you extracted has
more than one qualitatively different type of behavior.
See to what degree you can identify these, and explain how
you did so.
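One possible way in (not the only one): bucket the pooled interarrivals by order of magnitude. Mass clustered at well-separated time scales hints at qualitatively distinct regimes. The function and its binning granularity are my own illustrative choices.

```python
import math
from collections import Counter

def magnitude_buckets(interarrivals, bins_per_decade=4):
    """Count interarrivals by (fractional) order of magnitude."""
    counts = Counter()
    for d in interarrivals:
        if d > 0:  # log10 is undefined at 0
            counts[round(math.log10(d) * bins_per_decade)] += 1
    return dict(sorted(counts.items()))
```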
- Similar to the above regarding the arrival of users, some
theories presume that the arrival of an individual user's
keystrokes likewise looks like a Poisson process. Based
on the aggregate interarrivals of user keystrokes
that you extracted above, explore (to the degree that you
can) whether you believe that this data supports that
premise. Explain how you arrived at your conclusion.
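One rough, self-contained way to quantify such a comparison is a Kolmogorov-Smirnov-style statistic against an exponential fitted by its mean; in practice you might instead reach for a library routine such as scipy.stats.kstest.

```python
import math

def exponential_fit_gap(interarrivals):
    """Largest absolute difference between the empirical CDF of the
    gaps and the CDF of an exponential with the same mean (a crude
    KS statistic). Small values are consistent with the exponential
    premise; large ones are not."""
    xs = sorted(interarrivals)
    n = len(xs)
    rate = n / sum(xs)  # maximum-likelihood exponential rate
    worst = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst
```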
- I've added a synthetic artifact to the data that manifests
when analyzing the interarrivals of individual user
keystrokes. (This is based on a true quirk that occurs
in other such keystroke datasets, though one that didn't
happen to occur in this one.)
See if you can spot it. If so, explain how you found it,
and whether you also find any evidence that it's synthetic
rather than arising from actual user activity.
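How you hunt for the artifact is up to you, but as one purely hypothetical angle (this is a guess at the kind of quirk involved, not a hint about the actual one): exactly repeated interarrival values are rare in genuine keystroke timings, so tallying the most common rounded gaps can flag suspicious spikes.

```python
from collections import Counter

def most_common_gaps(interarrivals, k=5, ndigits=3):
    """Return the k most frequent interarrival values after rounding,
    with their counts; an outsized count may merit a closer look."""
    counts = Counter(round(d, ndigits) for d in interarrivals)
    return counts.most_common(k)
```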
- Three crucial tips:
- If after a bit of exploration and thinking about
it, you're at a loss for what a given question is
asking, or how to go about it, ask on Piazza.
- When replying to such queries on Piazza, be careful
not to "telegraph" the answer, but instead just
provide strategic guidance.
- This assignment is not graded in any sense other
than that I'd like to ensure that you've put effort/thought
into it. If you have done so, and you come up empty
for some / most of it, just say so.
- How much time did you spend on this assignment?
- What additional data or meta-data did you find you wished
this dataset had included (if any)?
- (Optional:) Any comments on the assignment? Was it
illuminating/frustrating/fun/boring? What would
improve it?