Karpathy on VS Code Cursor and Sonnet 3.5 vs. GitHub Copilot (twitter.com/karpathy)
91 points by nabla9 on Aug 24, 2024 | 67 comments


I think for people with fairly high experience, the current AI boom is definitely a win: improved productivity for simple tasks and a responsive rubber duck to run your ideas by. However, for new grads, students, and even mid-level devs, whenever I see them with Copilot (and just today, Cursor) out, I cry a bit inside and lower my expectations.

I'm not sure why, but in my experience juniors/students who use AI generally struggle more to debug problems and give up earlier.

Edit: and if I sound arrogant, let it be known that I definitely don't have the experience/intelligence to use AI productively, and thus I've avoided it beyond experimenting with its capabilities.


As someone responsible for the fallout from these tools being used to cut corners, I doubt it. And I'm talking about senior developers here with 10+ years of experience.

People don't bother to understand problems now. Only the happy path is considered. It takes real intelligence to understand the side effects even a simple thing can have. What used to be a whiteboard session and a thinking process is now blindly trusted to some third-party agent.

Consider reliably receiving a message from a queue and processing it at least once. Twice now I have encountered messages that were processed zero times. That is a big problem. Turns out the code was generated, and no one knows or cares enough, including the people reviewing it, because their magic friend wrote it, their other magic friend reviewed it, and management is happy that they were sold magic, because some other magic thing wrote a marketing spiel that absolved them from any decision-making responsibility.
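
(To make the failure mode concrete, here is a minimal sketch with a hypothetical queue client - receive/ack/nack are assumed names, not a real library. Acknowledging before processing quietly turns "at least once" into "possibly zero times"; acknowledging only after the handler succeeds is what preserves the guarantee.)

    # Minimal sketch; `queue` is a hypothetical client, not a real library.

    def consume_at_most_once(queue, handle):
        msg = queue.receive()
        queue.ack(msg)   # BUG: acknowledged before processing;
        handle(msg)      # a crash here and the message was processed zero times

    def consume_at_least_once(queue, handle):
        msg = queue.receive()
        try:
            handle(msg)      # process first; the handler should be idempotent,
                             # since redelivery can mean duplicate processing
            queue.ack(msg)   # acknowledge only after processing succeeded
        except Exception:
            queue.nack(msg)  # hand the message back for redelivery
            raise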

In some contexts a mistake like that has an actual tangible capital cost if it's a trade or a transaction in a regulated industry. It's not "just a mistake you can learn from" at that point.

People tend to lazy, to steal a mathematical expression, and these tools bring out the worst in people.


That's partly why I've only been using these tools for inconsequential tasks whose primary time cost is typing, like throwing together a primitive prototype or mock generator and using my brain to build on it or get the rough structure down. HTML/CSS templating, for example, is still easy to screw up if you actually care about quality, but if you don't, then you may as well find the fastest path there.

I agree, though, that the more convenience people rely on, the less willing they seem to sit with a hard problem and learn it the hard way (which so many areas of programming require), or to grind through possible edge cases and robust ways to implement something.


I've found that AI takes many more edge cases into account than my lazy brain does. It has no problem writing tedious code covering things that are very unlikely to happen. When I see the solution it suggests, it's often more robust than my initial plan.


> In some contexts a mistake like that has an actual tangible capital cost

Developers need to be just as mindful of the reputational cost.

Being known as someone who doesn't know what they are doing and is harming the team can cost you.


How is it different from the common situation today where a piece of code was written long ago by someone who no longer works at the company, so nobody understands how it works?


The old code is probably close enough to correct. Even if it has not been actively worked on, it has been battle-tested for a while. If it has operated for a good while without immediately noticeable issues, it is likely okay for current uses.


For me the AI boom has not been a win at all.

It is at best on par with Stack Overflow, and it only works when the LLM has a lot of prior examples it has trained on. Professional coding cannot tolerate introducing bugs that take forever to debug simply because the LLM lacks enough training data.

It also lacks the concept of libraries having different API versions, so you will see the generated code happily mix methods from both versions. And because the models take so long to train, you can't use newer libraries.


This has been my experience with the chat models (and yes, I'm talking about SOTA, both Claude and GPT), but I've found Copilot autocomplete to be an improvement that is worth the cost.

That's not a huge endorsement given that it's not especially expensive, but it definitely decreases the burden of creating a large amount of new code quickly, which makes bootstrapping a new personal project much easier than before. The key is to give it discrete chunks to generate—small helpers, single lines, or test cases—and not expect it to come up with a reasonable architecture or to even generate a single class.

But I wholeheartedly agree about the chat models. Every time I've tried them I've wished I hadn't. The experience of interacting with them is similar to interacting with a junior developer, and it's not worth pairing with a junior if it's not an investment in their training.


They will. And then they’ll learn the fundamentals. I thought I knew most things I needed when I was 20 because who needs fundamentals when you can just use frameworks? Ah well…

I don’t think this is any different from “young devs don’t know shit, they only know Unity but have no clue of how things work in reality.” Developers need to start somewhere and will work their way down the stack. As the height of the stack increases, that journey takes longer. But the height increases because every additional layer increases productivity.


I agree with this sentiment. However, maybe in a few years AI will get good enough that you won't need to debug it?


As long as the task remains non-trivial and there is still something to be done – mostly forming a clear idea of what you want to do and what you need and laying it out, now in English – why would this worry be any less wrong than all the other times the incumbents got worried and went all uphill-both-ways?

When has this purity-of-craft idea ever stood the test of time? Is it only music when it's on tape? Is hip-hop even music? Can't you do art with Photoshop?

Things evolve and the things that once seemed important won't be. People will be lazy and clever and adapt.


He meant the Cursor editor [1] and not VS Code Cursor. He corrected it in the next X post [2].

[1] https://www.cursor.com [2] https://x.com/karpathy/status/1827148812168871986


The Cursor editor is a fork of VS Code, which is probably why he got it mixed up.


"Sometimes you get a 100-line diff to your code that nails it, which could have taken 10+ minutes before" so it only took him 10 mins to write correct 100 line block without LLM? I guess some people are really 10x.


There is some hilarity in having Karpathy, who most likely made significant contributions to OpenAI's current offerings, recommend that people use Sonnet instead.


He worked there for a year. He had minimal impact. Probably not even among the top 50 most impactful people in OpenAI's history.


He was one of the founding members: https://openai.com/index/introducing-openai/


He is a co-founder; he worked there 2015-2017 and came back later for a year.



And just as the previous thread was nuked, so was this one. It's a shame but I guess people don't want to discuss this.


Here's a video of an 8-year-old creating a chatbot with Cursor and Claude:

https://x.com/rickyrobinett/status/1825581674870055189

Clearly she's learned a few tricks from her dad, but it's an impressive result nonetheless.


I am whiny about being anti-AI, but no doubt these models are well suited to English<->programming translation tasks, which, along with a well-tuned bag of tricks, makes for a useful tool.

But there are two issues the tech community has not spent nearly enough effort discussing:

1) As a big fan of Idris, I am worried that these tools will strongly disincentivize language development: why design an elegant language if an LLM can write the boilerplate faster than you can write a cleaner implementation?

2) I still don't think these tools are even slightly ethical. In 2022 I kicked the tires on ChatGPT-3.5 for F# codegen, and got some truly terrible results. I copy-pasted some lines into GitHub and found the unique repositories which ChatGPT was obviously plagiarizing from, and with 15 seconds of prompt "engineering" I got it to spit out ~200 lines verbatim from my personal F# linear algebra library - the only thing that was changed was stripping out the comments and updating some syntax to F# 4.7. Pure plagiarism. It is especially frustrating that GPT is more likely to plagiarize that library precisely because there aren't very many similar repos on GitHub.

Obviously the plagiarism problem can be fixed. (and it seemingly has been...for F#. Not sure about Idris!) However, it really seems like that sort of RLHF fine-tuning is about covering OpenAI's tracks, not "teaching" the AI how to "generalize." In particular I refuse to use the tool because now instead of reliably getting it to plagiarize from F# developers, I have no clue whatsoever if it's stealing or if it managed to truly autoregress its way into an ethical solution. So instead of rolling the dice on being a graceless scumbag, I'll just take my time writing out my code by hand.

And it was striking that GPT-3.5 had read and memorized more F# than Don Syme has seen in his entire life, yet in response to simple questions it was a mindless plagiarist. It's a stark illustration of why the legal argument that ANN learning = human learning is vacuous, and why OpenAI should lose most of the copyright lawsuits it's facing.


What he describes is why I have stayed away from using these tools so far. I don't want to be exposed to something that is just useful enough to feel indispensable, but still comes with so many drawbacks.


Can't wait for Opus 3.5 or GPT-5, and of course Copilot.

Honestly, writing code is so much more fun (especially Go code: if err, if err, if err...).

I just tried Cursor because of this tweet, and it is really nice, I did pay for it.

I just turn all assistance off when I have to think, because it violently interrupts my thoughts with random suggestions; but it turns out that most of the time I just have to spit out code.


Does anyone have a write-up of the response format the LLM uses in these types of editors? I'm assuming the LLM can't generate an exact diff format without making errors.


Yeah, diffs are unreliable.

A couple of interesting things to read:

1. This reverse-engineering of VS Code GitHub Copilot - it's a bit old but still very relevant to understanding what's going on under the hood: https://thakkarparth007.github.io/copilot-explorer/posts/cop...

2. This post from Aider (a very sophisticated CLI-based coding assistant tool) talking about how they benchmarked different editing formats: https://aider.chat/2023/07/02/benchmarks.html#edit-formats - and this later post as well, about how diffs without line numbers can work well: https://aider.chat/2023/12/21/unified-diffs.html#unified-dif...

This whole field is still very much under active development, so I'm sure there are lots of other interesting techniques out there.
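
For anyone curious what the alternative to line-numbered diffs looks like in practice, here is a minimal sketch (with a made-up block syntax loosely in the spirit of the search/replace formats discussed in those Aider posts, not any tool's actual format): the model emits the exact text to find and the text to put in its place, and the editor applies it with a plain string match, so there are no line numbers to get wrong.

    import re

    # Hypothetical SEARCH/REPLACE block format; illustrative only.
    EDIT_BLOCK = re.compile(
        r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
        re.DOTALL,
    )

    def apply_edits(source: str, llm_output: str) -> str:
        """Apply each block by exact string match instead of line numbers."""
        for search, replace in EDIT_BLOCK.findall(llm_output):
            if search not in source:
                # The model produced text that isn't in the file: reject the
                # edit rather than guessing where it should go.
                raise ValueError("edit does not match the current file")
            source = source.replace(search, replace, 1)
        return source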


Thank you. I went looking for how the original Copilot worked a few months ago but didn't find it.


I don't know why it's worthwhile to pay for Cursor when Copilot is free for unlimited use for open-source developers.


Sorry to go meta, but this post has been demoted to page 6 since I first read it about an hour ago, despite being at 80 points after 3 hours. I think it's because people have flagged it. I'm frustrated at how powerful flagging is compared to upvotes for stories. Tiny rant over.



Karpathy spends his days rewriting the same code for educational purposes. I'm not surprised he finds such macros useful.

As a professional, though, I find them pretty useless. Sad to see Karpathy fueling vaporware.


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


I pity comments like this. Instead of trying to up-skill to use the newest programming tools, you are setting yourself up for failure.

Sonnet 3.5 digests close to 400k bytes of text and produces coherent code that works on the first try. If someone says it's not working and they are a professional programmer, get ready to feel like you've been hit by a ton of bricks next year. The productivity boost is only going to accelerate, and those who can't adopt will be left behind.


a) There is no up-skilling needed to use LLMs. They are very basic to use.

b) Many of us have used them for a while now and can speak from experience that they aren't providing a meaningful productivity boost. Simply because they don't work well enough to provide a positive ROI. And no amount of prompting expertise can change that.

c) For me it is junior developers who love these tools because they think it's a shortcut to becoming experienced. But it's akin to cheating. You're not actually learning why and how things are supposed to work. And that will hurt you in professional environments where you often need to explain why you wrote that code and introduced that bug.


Your (a) does not match your (b), because there are anecdotes contrary to yours (the tweet in question and my personal one). I have close to two decades of experience in a variety of languages and frameworks, and I have never felt this powerful and liberated with any previous tool. In the past year I have developed two complex products nearing market launch, with just me working on a part-time basis.

My professional colleagues continue to feel exactly the way you feel and, despite my best efforts, refuse to even bother using them for anything. Using LLMs might appear simple, and the prompt length might be similar between an experienced user and a naive one, but the way intent is conveyed varies with skill level.

My only complaints about LLMs are: 1) context is still a limiting factor (so only medium-sized projects), and 2) I still have to copy-paste the code (no IDE truly helps here).

What has improved in the past six months: Sonnet happened, and I no longer have to worry about the code being wrong or containing obvious mistakes. In many cases, what I thought it got wrong turned out to be a clever way to minimize the number of changes needed, or to do more with less. We are approaching the point where humans are no longer intelligent enough to appreciate the LLMs.


I look forward to the day that I can be "intelligent enough" to truly appreciate LLMs. Maybe I need to buy a course from someone on X.

And not from months of experience using Claude, where over and over again it gives me algorithms that are wrong, assures me every time that it is right, and does so using versions of libraries that are typically a year or more old.


"There is no up-skilling needed to use LLMs. They are very basic to use."

Hard disagree on that. Using LLMs effectively is deceptively deep. Sure, anyone can throw a prompt at a chatbot - but I've been using them on an almost daily basis for over two years at this point and I still feel like I'm finding out new ways to improve my prompting several times a week.

I talked more about how hard they are to use here: https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide....

"Many of us have used them for a while now and can speak from experience that they aren't providing a meaningful productivity boost."

I'm getting a meaningful productivity boost, which gets more meaningful the more time I spend learning how best to apply them.


Many of us professionals share Karpathy's opinion on this and know for a fact that it provides a very meaningful productivity boost. It may not be for everyone, but I absolutely cannot imagine going back, and I can confidently say it's not just junior developers who love these tools.


Why isn't there a single screencast (un-edited, un-cherry-picked) of anyone showing off their 10x productivity boost in a full "typical" coding session?


Someone recently asked this on Twitter; Simon Willison responded with https://simonwillison.net/2024/Jun/21/search-based-rag/ which I have not yet watched but which he claimed was a good example of this genre.


Having rewatched that myself the other day, it's not actually as good an example as I thought - I use Claude 3.5 Sonnet a bit in it (which was released the morning we recorded that video) and then get a bit of benefit out of Val Town's integration with Codeium, which is similar to VS Code Copilot - but not as much of the code in it was LLM-generated as I remembered.

A better (written) description of how I use these tools is this one: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/ - and this whole series of posts: https://simonwillison.net/tags/ai-assisted-programming/


I would point out that the OCR example (and from what I see the series of posts you linked to) aren't "live" coding screen shares and don't convey the nitty gritty of how these things are used and how well they work.


Right, but they’re the best I have - I don’t do much live coding aside from that Val Town one which doesn’t use LLMs very much.


I'd love to see this operationalised as concrete predictions, as one might find on a prediction market! Do you have any specific predictions about programming next year?

I ask (for example) because I suspect shitting out CRUD apps is cheaper via LLM than via human now, and I guess probably most programming work is of that nature, but there are programmers out there whose job is not shitting out CRUD apps, and it's not clear from your statement whether you intend the sentiment to cover those programmers too.


The answer lies in your question. I foresee consolidation in programming languages and frameworks, with compact and well-known ones edging out esoteric and niche ones. In a couple of years, I predict there will be new languages specifically targeting LLMs that aren't as human-readable but are extremely compact, similar to bytecode (compactness is preferred because the context-size limitation isn't fully going away).

So, in a nutshell, I feel most things will be LLM-generated, with humans focused mostly on stitching together system boundaries, and on extreme cases like quant and medical domains where human oversight might be needed.


Let's cross that bridge when we come to it, shall we? Meanwhile, you should be glad we are refusing to use it. If it works as well as you claim, this situation is to your advantage.


I've been waiting since 2021 when I saw demos of GitHub copilot


I'd like to see that. Link(s)?

I've been on the sidelines, waiting for the dust to settle. Kind of like waiting a few months before applying the latest major OS updates.


Ah yes, just like stocks can only go up. No one will feel like hit by a ton of bricks.


Curious to know why you think it's vaporware. Are the latest LLMs like 3.5 Sonnet bad at original programming based on your experience? It hasn't been the case for me when using it for real world projects lately.


I wrote a cycle-detecting reference-counting system, asked Sonnet 3.5 to fix it, and it failed.

I hand-held it, gave it tests, called out flaws in its reasoning, etc., and it still didn't fix it.

The longer the chat went on, the more likely it was that Sonnet forgot a clarification I had provided.

Overall it was a huge waste of time: I lost my train of thought, and I ended up helping an LLM rubber-duck and fail repeatedly.


Don't judge it by one single experience. It'll take several before you understand how impactful these tools are for SWE.


I had an XML file format from one app that I needed converted to a JSON file format for another app.

I threw both schemas at Claude and asked it to write converter code.

For writing mocks, Claude saves an hour or more when mocking out complex classes.

I'd never written graphics code before; I had a PNG animation film strip, and Claude wrote code to load, parse, and animate it.


"vaporware"?

You may not like Cursor, but they have a product that I - as a professional - use every day.

Vaporware isn't the word you're looking for


This whole "I became 10x better" is vaporware.

"It can kinda sorta maybe help sometimes a little bit" is not vaporware.


I would encourage you to learn how to use these systems rather than discounting their value.


Please record a screencast and educate everyone.


Vaporware means something is a concept that hasn't actually shipped yet. You seem to be using "vaporware" to mean over-hyped.


I agree with the other commenter. You don't know how to use them. LLMs are not programmers.


These days someone is usually not a good professional at all if they think they're writing something novel.


Yeah, if you're getting paid $50K outside the Bay Area, maybe.

If you want big bucks, you are writing original code; no two ways about it.


Most of the code my friends and I write isn't original. And it's not just people who make $50K/year. Obviously LLM-assisted code writing is still in its infancy, but it has already made a lot of mundane things a breeze. It sucks that you have to know its shortcomings to make it actually useful (e.g. I won't ask it to write a context-aware function right away, but I know it's great at generating stubs). But we'll get there, I think.


"Andrej Karpathy (born 23 October 1986[2]) is a Slovak-Canadian computer scientist who served as the director of artificial intelligence and Autopilot Vision at Tesla. He co-founded and formerly worked at OpenAI"

I think he should get some cred with such a track record.


You didn't learn how to use these tools properly. If you did you wouldn't have that opinion. Karpathy doesn't just write code snippets for educational purposes. Most of the code he writes is for real world systems and it's not publicly available.


What do you mean by professional that doesn't include Karpathy?


I'm trying to understand how you can so easily dismiss all the professionals thinking it's useful. A more charitable explanation could be that it's useless for you but not others.




