Hacker News | rw's comments

Operational Transformation and Conflict-Free Replicated Datatypes are very different from each other.

As the author explains, OT relies on some ordering of system events, and CRDTs don't. That means CRDT operations need to be commutative (and, for a state-based CRDT, the merge must also be associative and idempotent), and OT operations don't.
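
As a toy illustration of that commutativity requirement (my own sketch, not from the article), consider a grow-only counter (G-Counter), one of the simplest state-based CRDTs. Its merge takes the per-node maximum, so replicas converge no matter what order the states arrive in:

    # Toy G-Counter sketch (illustrative only). State is a dict of
    # node_id -> increment count; merge takes the per-node maximum.
    def merge(a, b):
        return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

    replica_1 = {"a": 3, "b": 1}
    replica_2 = {"b": 2, "c": 5}
    replica_3 = {"a": 1, "c": 7}

    # Commutative and associative, so gossip order doesn't matter.
    assert merge(replica_1, replica_2) == merge(replica_2, replica_1)
    assert merge(merge(replica_1, replica_2), replica_3) == \
           merge(replica_1, merge(replica_2, replica_3))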

So, OT is less scalable but more powerful, and CRDTs are more scalable but less powerful (in theory).

It's sort of like comparing Paxos/Raft to BitTorrent.

(I am not an expert on OT.)


Stegasuras is convincing work and the quality looks excellent.

I wrote a steganographic tool in this same spirit back in 2011, called Plainsight.

Back then, deep learning hadn't reached NLP, and the "ImageNet moment for NLP" had yet to arrive.

My Python code, with examples, is here: https://github.com/rw/plainsight

Unlike the OP's method, my Plainsight algorithm is 100% invertible by construction and accepts binary input. (I verified the inversion process with "roundtrip fuzzing", a technique I still use today.)

Plainsight uses each bit of the input message to generate tokens. Bits are used to decide how to traverse a Huffman-style n-gram tree, weighted by frequency. This tree of n-grams is the model used in both the encoding and decoding steps. The drawbacks to my method are that the output 1) can be verbose and 2) does not convince a human that it's plausible, except for short messages.
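
To make that concrete, here is a toy, context-free version of the idea (my own sketch, not the actual Plainsight code, which uses context-dependent n-gram models): build a Huffman tree over token frequencies, spend message bits choosing branches, and emit the token at each leaf; decoding replays each token's bit path.

    # Toy sketch of bit-driven Huffman traversal (illustrative only).
    import heapq

    def huffman_tree(freqs):
        # freqs: {token: count}; returns nested (left, right) pairs with token leaves.
        heap = [(count, i, token) for i, (token, count) in enumerate(freqs.items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (c1 + c2, i, (left, right)))
            i += 1
        return heap[0][2]

    def encode(bits, tree):
        # Spend bits to choose branches; emit the token at each leaf reached.
        tokens, node = [], tree
        for b in bits:
            node = node[int(b)]
            if not isinstance(node, tuple):   # reached a leaf
                tokens.append(node)
                node = tree
        return tokens

    def codes(tree, prefix=""):
        # Invert the tree: token -> bit string, used for decoding.
        if not isinstance(tree, tuple):
            return {tree: prefix}
        table = dict(codes(tree[0], prefix + "0"))
        table.update(codes(tree[1], prefix + "1"))
        return table

    def decode(tokens, tree):
        table = codes(tree)
        return "".join(table[t] for t in tokens)

    tree = huffman_tree({"the": 9, "cat": 5, "sat": 3, "mat": 1})
    bits = "0110100"
    tokens = encode(bits, tree)           # -> ['the', 'cat', 'the', 'mat']
    assert decode(tokens, tree) == bits   # roundtrip, as in roundtrip fuzzing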

Stegasuras has orders-of-magnitude better output and seems to solve the problems I couldn't solve eight years ago. I would venture that their new result has as much to do with advances in language modeling as it does with the particulars of their encoding and decoding algorithms.

I'll also note that I'm glad these researchers were able to use grant money to do this work. As a non-academic, I applied for an AI Grant to support me in upgrading Plainsight to use deep learning, but I was turned away at the time.

Finally, one of the ideas I picked up back then is that spam can be used to contain secret messages. Send enough gibberish to enough people, with your intended recipient included, and you'll look like a spammer--not a spy:

   $ wget https://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
   $ tar -jxvf 20030228_spam.tar.bz2
   $ cat spam/0* > spam-corpus.txt

   $ echo "The Magic Words are Squeamish Ossifrage" | plainsight -m encipher -f spam-corpus.txt > spam_ciphertext
   
   $ cat spam_ciphertext
   (8.11.6/8.11.6) 3 (Normal) Internet can send e-mails until to transfer 26 10 [127.0.0.1]
   also include address from the most logical, mail business for your Car have a many our
   portals ESMTP Thu, 29 1.0 this letter on internet, <a style=3D"color: 0px; text/plain;
   cellspacing=3D"0" how quoted-printable about receiving you would like width=3D"15%"
   width=3D"15%" border="0" width="511" Date: Tue, 27 Thu, 19 26 because
   zzzz@localhost.spamassassin.taint.org for
   
   $ cat spam_ciphertext | plainsight -m decipher -f spam-corpus.txt
   Adding models:
   Model: spam-corpus.txt added in 2.57s (context == 2)
   input is "<stdin>", output is "<stdout>"   
   deciphering: 100% | 543.84  B/s | Time: 0:00:00
   
   The Magic Words are Squeamish Ossifrage


The TimescaleDB benchmark code is a fork of code I wrote, as an independent consultant, for InfluxData in 2016 and 2017. The purpose of my project was to rigorously compare InfluxDB and InfluxDB Enterprise to Cassandra, Elasticsearch, MongoDB, and OpenTSDB. It's called influxdb-comparisons and is an actively maintained project on GitHub at [0]. I am no longer affiliated with InfluxData, and these are my own opinions.

I designed and built the influxdb-comparisons benchmark suite to be easy for customers to understand. From a technical perspective, it is simulation-based, verifiable, fast, fair, and extensible. In particular, I created the "use-case approach" so that, no matter how technical our benchmark reports got, customers could say to themselves, "I understand this!" For example, in the devops use case, we generate data and queries from a realistic simulation of telemetry collected from a server fleet. Doing it this way creates benchmarking stories that appeal to a wide variety of both technical and nontechnical customers.
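
For a rough picture of what that simulation looks like (a hypothetical sketch with made-up field names, not the actual influxdb-comparisons generator), imagine deterministically producing CPU telemetry rows for a small fleet on a fixed interval:

    # Hypothetical devops-style data generator (illustrative only).
    import random

    def simulate_cpu(host_count=3, points_per_host=4, interval_s=10, seed=42):
        rng = random.Random(seed)      # seeded, so runs are reproducible
        rows = []
        for host in range(host_count):
            usage = rng.uniform(20, 80)
            for i in range(points_per_host):
                # Random-walk the CPU usage, clamped to [0, 100].
                usage = min(100.0, max(0.0, usage + rng.gauss(0, 5)))
                rows.append({
                    "measurement": "cpu",
                    "host": f"host_{host}",
                    "time_s": i * interval_s,
                    "usage_user": round(usage, 2),
                })
        return rows

    for row in simulate_cpu():
        print(row)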

This user-first design of a benchmarking suite was novel at the time, and it was a large factor in the success of the project.

Another aspect of the project is that we tried to do right by the competition. That means we spoke with experts (sometimes the creators of the databases themselves) on how best to achieve our goals. In particular, I worked hard to make the Cassandra, Elasticsearch, MongoDB, and OpenTSDB benchmarks show their respective databases in the best light possible. Concretely, each database was configured in a way that is 1) featureful, like InfluxDB, 2) fast at writes, 3) fast at reads, and 4) efficient with disk space.

As an example of my diligence in implementing this benchmark suite for InfluxData, I included a mechanism by which the benchmark query results can be verified for correctness across competing databases, to within floating point tolerances. This is important because, when building adapters for drastically different databases, it is easy to introduce bugs that could give a false advantage to one side or the other (e.g. by accidentally throwing data away, or by executing queries that don't range over the whole dataset).
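
The idea, roughly (a minimal sketch with made-up names, not the actual influxdb-comparisons code), is to run the same query against each database, normalize the responses into rows, and compare them within a floating-point tolerance:

    # Hypothetical cross-database result verification (illustrative only).
    import math

    def results_match(rows_a, rows_b, rel_tol=1e-9, abs_tol=1e-6):
        # rows_*: lists of (timestamp, value) pairs, sorted by timestamp.
        if len(rows_a) != len(rows_b):
            return False
        for (ts_a, val_a), (ts_b, val_b) in zip(rows_a, rows_b):
            if ts_a != ts_b:
                return False
            if not math.isclose(val_a, val_b, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        return True

    # e.g. max CPU per hour from two databases; a tiny rounding delta still passes.
    rows_db_1 = [(1456380000, 91.28437), (1456383600, 88.10021)]
    rows_db_2 = [(1456380000, 91.28437), (1456383600, 88.100210000001)]
    assert results_match(rows_db_1, rows_db_2)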

I don't see that TimescaleDB is using the verification functionality I created. I encourage TimescaleDB to run query verification, and write up their benchmarking methods in detail, like I did here: [1].

I think it's great that TimescaleDB is taking these ideas and extending them. At InfluxData, we made the code open-source so that others could build and learn from our work. In that tradition, I hope that the ongoing discussion about how to do excellent benchmarking of time-series databases keeps evolving.

[0] https://github.com/influxdata/influxdb-comparisons (Note that others maintain this project now.)

[1] https://rwinslow.com/rwinslow-benchmark-tech-paper-influxdb-...


Hey rw, one of the core contributors to TSBS here. First of all, thank you for the work you did on influxdb-comparisons; it gave us a lot to work with and helped us understand Timescale's strengths and weaknesses against other systems early on. We do appreciate the diligence and transparency that went into the project. We outline some of the reasons for our eventual decision to fork the project in our recent release post [1]. Most of the reasons boil down to needing more flexibility in the data models/use cases we benchmark, and needing a more maintainable code design since we're using this widely for a lot of internal testing.

Verification of the correctness of the query results is obviously something we take very seriously; otherwise, running these benchmarks would be pretty pointless. We carefully verified the correctness of all of the query benchmarks we published. However, it's a process we haven't fully automated yet. From what we can tell, the same is true of influxdb-comparisons — the validation pretty-prints full responses, but each database has a significantly different format, so one needs to manually parse the results or set up a separate tool to do so. We have our own methods for doing that internally — once we get the process more standardized and automated, we will definitely be adding it to TSBS. We encourage anyone with ideas around that (or anything else) to take a look at the open source TSBS code and consider contributing [2].

[1] https://blog.timescale.com/time-series-database-benchmarks-t...

[2] https://github.com/timescale/tsbs


Hi Scott, thank you for writing your blog all these years. Your Busy Beaver essay ignited my passion for computer science, especially in algorithm analysis, logic, undecidability, and probability theory. I used to be someone who only thought in code; thanks to you, I now also think in math.


Thanks; that made my day!!!


Why the hard dependency on MySQL?


Well, it's not a hard dependency. Diamond works fine without MySQL; it only depends on the mysql-native library, not on MySQL itself. It's fairly lightweight, so it has no big impact on the framework itself.


"Your scientists were so preoccupied with whether or not they could, that they didn't stop to think if they should."

- Jeff Goldblum as Dr. Ian Malcolm in Jurassic Park


This only runs LDA on the descriptions of your starred repositories, to find topic terms that describe them. Those topic terms are then used to query the GitHub search API for matching repositories, and the results are sorted by star count.

That is a clever way to make use of a search API like GitHub's. The principled way to do this, though, is to run LDA over all descriptions on GitHub, then use that similarity index to find similar repositories. You could run LDA over code, too.
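
For a sense of what that first step looks like (a hypothetical sketch with made-up descriptions, not the OP's code), extracting topic terms from a handful of starred-repo descriptions might be:

    # Hypothetical LDA-over-descriptions sketch (illustrative only).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    starred_descriptions = [
        "Fast JSON parser and serializer for Python",
        "Distributed key-value store written in Go",
        "Deep learning framework with automatic differentiation",
        "A lightweight HTTP server and routing library",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(starred_descriptions)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)

    # Top words per topic become the terms sent to the GitHub search API.
    terms = vectorizer.get_feature_names_out()
    for topic in lda.components_:
        print([terms[i] for i in topic.argsort()[::-1][:3]])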

I'll note that there is a cold-start problem with this implementation: running LDA on such a small set of short documents will often lead to uninformative topics with words that are too specific. You need a big corpus to capture e.g. synonym relationships.


Your point is quite interesting, although I'm not sure running LDA on the entire code would be useful. I spent half a year writing my postgraduate thesis on a recommender system for streaming services based on LDA; in particular, we wanted to infer who is watching what, and when, in a shared account. From all the tests I did with LDA, I believe the best thing would be to run it on the README files.


Good idea, the READMEs would be best of all.


That's right, but one additional level of depth in fetching repositories increases the API latency so much that it almost becomes unusable for a web app, at least for a hobby web app :P Hence the shortcuts. Open to ideas and suggestions.


As I said, your approach is a clever way to use the GitHub API. I think you need to change the title and readme to indicate that this isn't an LDA index of GitHub descriptions. To ML practitioners, that's what you are implying with a title of "Show HN: Using LDA to suggest GitHub repositories based on what you have starred".


Lab41 has done work on code recommendations by using a word2vec representation on the code itself.


Links please?



Why has this article been removed from the top 250 news results? It was #1 for a few minutes, then #5, and now it's gone. We've successfully discussed much more risqué topics here on HN...

Why did the comment by TAForObvReasons calling out this apparent censorship get deleted?


I would like to know the same. It's pretty ironic for an article titled "Mark Zuckerberg's Trust Problem".


OK, thanks. I was thinking I was paranoid, but it's clear I'm not the only one who has seen this.


No, it's called insufficient feature engineering. Data leakage is when your test data contaminates your training data.
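
A toy illustration of the distinction (my own example, not from the parent): leakage is when statistics from test rows sneak into what the model trains on, e.g. centering features with a mean computed over the full dataset before splitting.

    # Toy data-leakage example (illustrative only).
    data = [1.0, 2.0, 3.0, 100.0]          # pretend the last row is the test set
    train, test = data[:3], data[3:]

    leaky_mean = sum(data) / len(data)     # uses the test row -> leakage
    clean_mean = sum(train) / len(train)   # training rows only -> no leakage

    leaky_features = [x - leaky_mean for x in train]
    clean_features = [x - clean_mean for x in train]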

