I found those; I just would have appreciated it if the content of the mathematics hadn't been sidelined to a separate download, as if it's not important.
I felt the explanation on the page was shallow, as if they just want people to accept it's a black box.
All I've learnt from this is that they used an unstated amount of computational resources to basically brute-force what a human is already capable of doing in far less time.
Very few humans go after this type of training. In my "math talent" school (most of the Serbian/Yugoslavian medal winners came from it), at most a dozen students "trained" for this over 4 high school generations (500 students).
The problems are certainly not trivial, but humans are not really putting all their effort into them either, and the few who do train for it medal on average 50% of the time and get a silver or better 25% of the time (by design), with much less time available to do the problems.
This is disingenuous. People who train are already self-selected people who are talented at math. And of the people who train, not everyone gets to this level. Sadly, I speak from personal experience.
This school is full of people talented at math — you can't get in if you don't pass a special math exam (looking at the list, of Serbia's 16 gold medals, I can see that 14 went to students of this school, plus numerous silvers and bronzes — Serbia has participated as an independent country since 2006, with a population of roughly 7M, if you want to compare it with other countries on the IMO medal table). So in general, out of this small pool (10 talented and motivated people across 4 generations), Serbia could get a gold medal winner on average almost once every year. I am sure there were other equally talented mathematicians among the 490 students who did not train for the competition (and some have achieved more academic success later on).
Most students were simply not interested. And certainly, not everybody is equally talented, but the motivation to achieve competition success is needed too — perhaps you had the latter but not enough of the former. I also believe competitive maths is entirely different from research maths (time pressure, even the luck of a good idea coming up quickly, etc.). Since you said you were a potential bronze medal winner, it might not even be a talent issue: maybe you just had strong competition, and someone had better luck in one or two tests and ranked above you (better luck as in the right idea/approach coming to them quicker, or the type of problem on the test suiting them more). And if you are from a large country like the USA, China or Russia (topping the medal table), it's going to be freakishly hard to get onto the team, since you'll have so many worthy students (and the fact that they are not always scoring only golds out of such large pools tells me the performance is not deterministic).
As a mathematician, I am sure you'd agree you'd want to run a lot more than a dozen tests to establish statistical significance for any ranking between two people at IMO-style competitive maths, especially if they are close in the first few. As an anecdote, many at my school participated in national-level maths and informatics competitions (they start at the school level, go through the county/city level, up to the national level) — other than the few "trained" competitors staying at the top, the rest of the group mostly rotated through the other spots below them regardless of the level (school/county/nation). We actually joked amongst ourselves about who had the better intuition "this time around" for a problem or two, while still beating the rest of the country handily (we obviously had a better base level of education + decently high base talent), but not coming close to the "competitors".
I, for instance, never enjoyed working through math problems and math competitions (after winning a couple of early-age local ones): I finished the equivalent of a math + CS MSc while skipping classes, by only learning the theory (reading through axioms, theorems and proofs that seemed non-obvious) and using that to solve problems in exams. I mostly enjoyed building things with the acquired knowledge (including my own proofs on the spot, but mostly programming), even though I understood that you build up speed with more practice (I was also lazy :)).
So, let's not trivialize solving IMO-style problems, but let's not put them on a pedestal either. Out of a very small pool of people who train for it, many score higher than the AI did here, and the results don't predict future theoretical math performance either. Competition performance mostly predicts competition performance, and even that with large error bars.
To mathematicians the problems are basically easy (at least after a few weeks of extra training), and after having seen all the other AI advances lately, I don't think it's surprising that with huge amounts of computing resources one can 'search' for a solution.
Sorry, that's wrong. I have a math PhD and I trained for Olympiads in high school. These problems are not easy for me at all. Maybe for top mathematicians who used to compete.
I've done that many times and never used exact equality comparisons.
If you do exact comparisons for any non-trivial cases, you'll find different compilers, optimization settings, runtimes, and processors give different results.
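To make that concrete, here's a tiny, made-up illustration (not from the parent comment) of why a tolerance-based comparison is the safer default:

    # Minimal sketch: the "same" value computed two ways differs in the last bits,
    # and results can also shift between compilers/optimization levels/hardware,
    # so compare within a tolerance instead of using ==.
    import math

    a = 0.1 + 0.2
    b = 0.3
    print(a == b)                             # False (a is 0.30000000000000004)
    print(math.isclose(a, b, rel_tol=1e-9))   # True: equal within a relative tolerance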
Only if each value is equally likely. If you see $1,000 but figure the envelope-filler is a lot more likely to have been willing to put $1,500 in than $3,000 then you should stick.
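To put rough numbers on that (my own back-of-the-envelope sketch, using the figures above): seeing $1,000, the other envelope holds $500 if the filler used a $1,500 total and $2,000 if they used a $3,000 total, so whether switching pays depends entirely on how likely you think each total is.

    # Hypothetical helper, just to illustrate the expected value of switching.
    def expected_switch_value(p_small_total):
        # p_small_total = P(the filler used the $1,500 total | we saw $1,000)
        return p_small_total * 500 + (1 - p_small_total) * 2000

    print(expected_switch_value(0.5))  # 1250.0 -> naive "equally likely" case says switch
    print(expected_switch_value(0.8))  # 800.0  -> if the smaller total is likely, stick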
You're assuming here that there are discrete stages that do different things. I think a better way to conceptualise these deep nets is that they're doing exactly what you want - each layer is "correcting" the mistakes of the previous layer.
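A loose sketch of that reading (my own illustration, assuming PyTorch; none of this is from the comment): with skip connections, each block only has to learn a correction on top of what the earlier layers already produced.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Each block outputs the previous representation plus a learned 'fix-up'."""
        def __init__(self, dim):
            super().__init__()
            self.correction = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def forward(self, x):
            return x + self.correction(x)  # previous layer's output + its correction

    net = nn.Sequential(*[ResidualBlock(64) for _ in range(8)])
    print(net(torch.randn(1, 64)).shape)   # torch.Size([1, 64])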
Most "deep" networks are organized into layers and information flows in a particular direction although it doesn't have to be that way. Hinton wasn't saying we shouldn't have layers but that we should train the layers together rather than as black boxes that work in isolation.
Also, when people talk about solving problems, they talk about layers; layers play a big role in the conceptual models people have for how they do tasks, even if they don't really do tasks that way.
For instance, in that ambiguous sentence, somebody might say it hinges on whether or not you think "bite" is a verb or a noun.
(Every concept in linguistics is suspect, if only because linguistics has proven to have little value for developing systems that understand language. For instance, I'd say a "word" doesn't really exist as a clean unit, because there are subword pieces that behave like a word, such as "non-", and phrases that behave like a word (e.g. "dog bite" fills the same slot as "bite").)
Another ambiguous example is this notorious picture
which most people experience as "flipping" between two states. Since you only see one at a time, there is some kind of inhibition between the two states. Who knows how people really see things, but if I'm going to talk about features, I'm going to say that one part is the nose of one of the ladies or the chin of the other lady.
Deep networks as we know them have nothing like that.
Standard RL algorithms will converge to optimal play against a fixed opponent, but will not find an optimal policy via self-play.
One intuitive way to see this: a sequence of improving pure policies A < B < C < ... will converge to optimal play in a perfect-information game like chess, but not necessarily in an imperfect-information game like rock/paper/scissors, where Rock < Paper < Scissors < Rock, and so on.
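A toy way to see the cycling (my own sketch, not anything from the comment): iterating pure best responses in rock/paper/scissors never settles, while the self-play optimum is the mixed strategy (1/3, 1/3, 1/3).

    # Each step switches to the pure strategy that beats the previous one,
    # so the sequence cycles forever instead of converging.
    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

    policy = "rock"
    for step in range(6):
        policy = BEATS[policy]
        print(step, policy)   # paper, scissors, rock, paper, scissors, rock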
Seems to have missed the existence of jax.jit, which basically constructs an XLA program (call it a graph if you like) from your Python function, which can then be optimized.
The author gives that quote (from the JAX documentation) but does not seem to internalize it, as his conclusion says:
> This is the niche that Theano (or rather, Theano-PyMC/Aesara) fills that other contemporary tensor computation libraries do not: the promise is that if you take the time to specify your computation up front and all at once, Theano can optimize the living daylight out of your computation - whether by graph manipulation, efficient compilation or something else entirely - and that this is something you would only need to do once.
That is exactly what JAX does. There is a computational graph in JAX (it's encoded in XLA and specified with their numpy-like syntax); it is built once, optimized, and then run on the GPU.
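For anyone who hasn't seen it, a minimal sketch of that flow (names and numbers are mine, not the article's): jax.jit traces the function into an XLA computation the first time it is called with a given shape/dtype, optimizes it, and reuses the compiled program afterwards.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def predict(w, b, x):
        # the whole expression is staged out to XLA as a single optimized program
        return jnp.tanh(x @ w + b)

    w = jnp.ones((3, 2))
    b = jnp.zeros(2)
    x = jnp.ones((5, 3))

    predict(w, b, x)        # first call: trace + compile
    predict(w, b, x + 1.0)  # same shapes/dtypes: reuses the compiled program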
Not even close. jax.jit allows you to compute almost anything using lax.fori_loop, lax.cond and other lax and jax constructs; PyTorch's jit does not allow that, it's just extra optimization for static PyTorch functions.
JAX autograd will work on most any jitted fn - the control-flow limitation is that there's no autograd for code with for/while loops, since there's a statically unknowable trip count through the loop body. Much looping code can be handled differentiably using a "scan", though.
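Rough example of the "scan" point (my own sketch; the names step/run are just illustrative): a loop written with lax.scan has a known trip count, so it both jits and differentiates.

    import jax
    from jax import lax

    def run(x0, n=10):
        def step(x, _):
            x = x - 0.1 * (x ** 2 - 2.0)   # a few fixed-point-style updates
            return x, x                    # (carry, per-step output)
        final, _history = lax.scan(step, x0, xs=None, length=n)
        return final

    print(jax.jit(run)(1.0))   # compiles and runs
    print(jax.grad(run)(1.0))  # reverse-mode AD through the loop works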
...which fits into less than 700 MB compressed. Some of the most exciting stories I've read recently in machine learning are cases where learning is re-used between different problems. Strip off a few layers, do minimal re-training, and it learns a new problem, quickly. In the next decade, I can easily see some unanticipated techniques blowing the lid off this field.
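Something like this is what that "strip off a few layers and retrain" workflow tends to look like in practice (a hedged sketch of my own, assuming torchvision; the 10-class head is arbitrary):

    import torch
    import torch.nn as nn
    from torchvision import models

    # Take a pretrained backbone and freeze it, keeping the learned features.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in backbone.parameters():
        p.requires_grad = False

    # Replace the final layer with a new head for the new task and train only that.
    backbone.fc = nn.Linear(backbone.fc.in_features, 10)
    optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
    # ...then train as usual: far fewer examples and far less compute than from scratch.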
It indeed strikes me as particularly domain-narrow when I hear neuro or ML scientists claim as self-evident that "humans can learn new stuff with just a few examples!", when the hardware upon which said learning takes place has been exposed to such 'examples' likely trillions of times over billions of years before — encoded as DNA and whatever else runs the 'make' command on us.
The usual corollary (that ML should "therefore" be able to learn with a few examples) may only apply, as I see it, if we somehow encode previous "learning" about the problem in the very structure (architecture, hardware, design) of the model itself.
It's really intuition based on 'natural' evolution, but I think you don't get to train much "intelligence" in one generation of a being, however complex that being might be (or else humans would be rising exponentially in intelligence every generation by now, and think of what that means for the symmetrical assumption about silicon-based intelligence).
"The usual corollary (that ML should "therefore" be able to learn with a few examples) may only apply, as I see it, if we somehow encode previous "learning" about the problem in very the structure (architecture, hardware, design) of the model itself."
Yes, and they do. They aren't choosing completely arbitrary algorithms when they attempt to solve an ML problem; they are typically using approaches that have already been proven to work well on related problems, or at least variants of proven approaches.
The question is how much information is encoded in those algorithms (to me, low-order logical truths about a few elementary variables, a low degree of freedom for the system overall), compared to how much information is encoded in the "algos of the human brain" (and actually the whole body, if we admit that intelligence has little motivation to emerge if there's no signal to process and no action ever to be taken).
I was merely pointing out this outstanding asymmetry, as I see it, and the unfairness of judging our AI progress (or setting goals for it) relative to anything even remotely close to evolved species, in terms of end-result behavior and emergent high-level observations.
Think of it this way: a tiny neural net (equivalent to the brain of what, not even an insect?) "generationally evolved" enough by us to recognize cats and license plate numbers, process human speech, suggest songs and whatnot is really not too shabby. I'd call it a monumental success to be able to focus a NN so well on a vertical skill. But that's also low-order and low-freedom in the grander scheme of things, and "focus" (verticality) is just one aspect of intelligence (e.g. the raging battle right now is for "context": horizontality and sequentiality of knowledge; and you can see how the concept of "awareness", even just mechanical, lies behind that). So, many more steps to go. So vastly much more to encode in our models before they're able to take a lesson in one sitting and a few examples.
It really took big-big-big data for evolution to do it, anyway, and we're speeding that up thanks to focused design and to electronics hastening information processing, but not fundamentally changing the law of neural evolution, it seems.
If you ask me, the next step is to encode structural information in the neuron itself, as a machine or even a network thereof, because that's how biology does it (the "dumb" logic-gate transistor model is definitely wrong on all accounts, too simplistic). Seems like the next obvious move, architecturally.