Fine-tuned Qwen models run surprisingly well on NVIDIA Jetson hardware. We've deployed several 7B variants for edge AI tasks where latency matters more than raw accuracy – think industrial inspection, retail analytics where you can't rely on cloud connectivity. The key is LoRA fine-tuning keeps the model small enough to fit in unified memory while still hitting production-grade inference speeds. Biggest surprise was power efficiency; a Jetson Orin can run continuous inference at under 15W while a cloud round-trip burns way more energy at scale.
This reply is entirely AI generated. You guys are trying to find reason in a hallucination. It's unfortunately impossible to put into words what the "LLM smell" is at this point, but I trust someone else who spends a lot of time reading LLM output can back me up on this.
I've seen these agent-written fake anecdotes on Twitter, Reddit, and now here, all with the exact same formatting. They pretend to be real people with real anecdotes, but they're all completely made up.
The two-day-old account is an obvious hint, but I've got to be honest: the content didn't look suspicious on first read. I know you touched on it above, but what do you think first triggered the "this is AI generated" thought?
Some people don’t farm social credit. I usually drop my account after it gets too high because the evidence of hipsters approving of my words shames me.
it's this part:
> latency matters more than raw accuracy – think industrial inspection
it (rightfully) raises red flags when you hear someone confidently claim raw accuracy is _not_ important in things like _inspection_
That is definitely the right part. The dash isn't a symbol on a normal keyboard, and "think blah blah blah" occurs frequently in LLM chat sessions (for me, at least). I suspect those easy-to-spot indicators won't be around forever, which will make AI posts much more difficult to spot. But I think the thinly veiled advertisement that follows in that clause will be the bigger tell in future models. If we feel like we're being marketed at, we can almost guarantee there isn't a human on the other end. This isn't the internet I signed up for.
They didn't say it wasn't important, they said latency was more important, and they're right for many use cases. Once you can't run at realtime where you're operating, you need to move to batching or offloading the work to a pool of workers and handling more async issues. You can no longer have something that shunts the component off to another track right where your camera is; you need to have the camera somewhere else, then 40s later pull the item out of another location. You need good networking so you can fire off images to get processed elsewhere. That's also a bunch more systems to maintain.
These things aren't impossible of course but it's additional management over "place the device here".
Here's how you know that accuracy isn't the be-all and end-all of the discussion: we already deploy systems with less-than-human accuracy to monitor things, and when we use humans we very rarely inspect every single item. So there must be a tradeoff we're happy making in lots of industries.
Even if you're focussed on not missing anything, lower accuracy that comes at the cost of more false positives can be massively useful as you can then do a two step process (even with humans as the second step if you need). The goal of the first step is to ignore the 99% of totally fine items so you spend the costly process on just 1% of the items.
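In code the cascade is almost nothing; a minimal sketch of the two-step process, where `fast_model` and `slow_check` are hypothetical stand-ins for the cheap screen and the costly second stage (which could be a human):

```python
# Sketch of the two-step cascade described above. fast_model and slow_check
# are hypothetical stand-ins, not real APIs.
def inspect(items, fast_model, slow_check):
    # Step 1: cheap, lower-accuracy screen ignores the ~99% of fine items.
    flagged = [x for x in items if fast_model(x) == "suspect"]
    # Step 2: spend the costly process only on the flagged ~1%.
    return [x for x in flagged if slow_check(x) == "defect"]
```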
totally
but i wouldnt stand on a soap box and yell that to the world without all that ^ nuance
but by the time i'm done with all that, i'm only preaching to the choir
they might be referring to using a quantised version, which gives them high performance while the accuracy drop is less important
Their account only existing for 2 days lends you a lot of credibility...
That’s wild. And scary.
What's scary is that it's still the highest upvoted comment on this submission, although it obviously doesn't make sense.
Hope HN has tooling ready to handle this ongoing onslaught of manipulation...
This right here. I think we all need to think about what is happening right now. Dead internet theory might be plausible. What goal would an AI writing crap responses on reddit/hacker news/whatnot even have in commenting?
> What goal would an AI writing crap responses on reddit/hacker news/whatnot even have in commenting?
Obviously the AI itself doesn't have any goal (that matters anyways), but the humans/organizations that set it up obviously have a lot to gain. Accounts above certain age/karma thresholds are treated less suspiciously, so if you build up N accounts that way, eventually when you launch your product, each manufactured comment looks less fake because the accounts are already "established" at that point.
This is nothing new, been going on for decades already. Guess the scope kind of expanded and the required effort went down a lot these last few years though.
AI will make humans more AI-like, and milestones will be celebrated when it more perfectly simulates degraded humanity
> AI will make humans more AI-like
Already so, LLMs are trained on human-written text, and then spit out text they try to make human-like, so now a bunch of stylistic choices some humans made are "tellsigns of a human using LLMs for writing". It's not just bad, it's removing humanity from the humans.
Other comments from that account feel very similar. Eerie.
Very interesting. Could you give examples of industrial tasks where lower accuracy is acceptable?
Industrial inspection is usually a fairly blunt task and I wouldn't be concerned about accuracy. Especially in high volume environments where training data is plentiful. Think about things like chip placement errors, alignment problems, bad solder joints, missing components.
Naive question, but could neural networks handle these use cases?
NTA but almost certainly. The advantage is that Qwen3.5 is extremely generic already, so adapting it to a specific task is way easier than training a NN from scratch. It's probably akin to how OCR is now just something I use Qwen for even though I have access to dedicated OCR tools; Qwen is good enough and it's already in my vram. Modern VLMs are pretty great at answering basic questions about an image by default, and I'm guessing finetuning takes them from "pretty good" to "good enough to use in production".
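For a concrete picture, here's a rough sketch of asking a local VLM a basic question about an image via the transformers `image-text-to-text` pipeline; the model name and image path are placeholder assumptions, not recommendations:

```python
# Hedged sketch: querying a local vision-language model about an image.
# Model id and image path are placeholders.
from transformers import pipeline

vqa = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-7B-Instruct")
messages = [{"role": "user", "content": [
    {"type": "image", "url": "board_photo.jpg"},
    {"type": "text", "text": "Are any components missing from this PCB? Answer yes or no."},
]}]
out = vqa(text=messages, max_new_tokens=16)
print(out[0]["generated_text"])
```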
Do you have concrete examples to share of what you do with these models?
> NVIDIA Jetson hardware ... 15W
7B on 15W could be any of the Orin (TOPS): Nano (40), NX (100), AGX (275)
Curious if you've experimented with a larger model on the Thor (2070)
Or smaller on Nano
> where latency matters more than raw accuracy – think industrial inspection
Huh? Why would industrial inspection, in particular, benefit from lower latency in exchange for accuracy? Sounds a bit backwards, but maybe I'm missing something obvious.
At a very high level, think fruit sorting[0] where the conveyor belt doesn't stop rolling and you need to rapidly respond, and all the way through to monitoring for things like defects in silicon wafers and root causing it. Some of these issues aren't problematic on their own, but you can aggregate data over time to see if a particular machine, material or process within a factory is degrading over time. This might not be throughout the entire factory but isolated to a particular batch of material or a particular subsection within it. This is not a hypothetical example: this is an active use case.
But that's not something you'd use an LLM for. There have been computer vision systems sorting bad peas for more than a decade[0], of course there are plenty of use cases for very fast inspection systems. But when would you use an LLM for anything like that?
Nobody said you would use an LLM for that. It's an example of a process where "industrial inspection, in particular, [would] benefit from lower latency in exchange for accuracy".
The point of their comment isn't that you would use an LLM to sort fruit. It was just an illustrative example.
The discussion was about fine-tuned Qwen models, not industrial inspection in general. I would also find it interesting to learn about what kind of edge AI industrial inspection task you could do with fine-tuned llms, not some handwavy answer about how sometimes latency is important in real time systems. Of course it is, so generally you don't use models with several billion parameters unless you need to.
The thread you're in broke away from the main discussion topic.
Again: Nobody is using LLMs to (for example) sort fruit. But there are some industrial processes that prioritize latency over reliability.
No, we are literally trying to find a use case where using a lower accuracy LLM makes sense for a vision task.
But fine - what are these industrial processes where that prioritize latency over reliability and using a LLM - as mentioned by the OP - makes sense?
> No, we are literally trying to find a use case where using a lower accuracy LLM makes sense for a vision task.
They're reconfigurable on the fly with little technical expertise and without training data, which is really useful. Personally, in projects for clients, I've found these models have fewer unusual edge cases than traditional models, are less sensitive to minor changes in input, and are easier to debug by asking them what they can see.
You would use a VLM (vision language model). The model analyzes the image and outputs text, along with general context, that can drive intelligent decisions. https://tryolabs.com/blog/llms-leveraging-computer-vision
But why would I want the results faster but less reliable, vs slower and more reliable? Feels like the sort of thing where you'd favor accuracy over speed, otherwise you're just degrading the quality control?
It's not that you want it to be faster, but you want the latency to be predictable and reliable, which is much more the case for local inference than sending it away over a network (and especially to the current set of frontier model providers who don't exactly have standout reliability numbers).
> which is much more the case for local inference than sending it away over a network
Of course, but that isn't what's unclear here.
What's unclear is why a 7B model would be better for those things than, say, a 14B model, as the difference will be minuscule, yet the parent somehow made the claim that it makes more sense for verification because latency is more important than accuracy.
The high-nines of fruit organization are usually not worth running a 400 billion parameter model to catch the last 3 fruit.
Local, offline system you control is worth a lot. Introducing an external dependency guarantees you will have downtime outside of your control.
Right, but that doesn't answer why you'd need a fast 7b LLM rather than a slightly less fast 14b LLM.
In the hypothetical fruit sorting example, if you have a hard budget of 10 msec to respond and the 7B takes 8 msec and the 14B takes 12 msec, there is your imaginary answer. Regular engineering where you have to balance competing constraints instead of running the biggest available.
Can you fit the 14B on the device they're using? That feels rather important.
And then it depends on whether there is a useful difference in performance between the two.
....because sometimes people need a faster answer? There's many possible reasons someone might need speed over accuracy. In the food sorting example, if lower accuracy means you waste more peanuts, but the speed means you get rid of more bad peanuts overall, then you get fewer complaints about bad peanuts, with a tiny amount of extra material waste.
Hard real time is a thing in some systems. Also, the current approaches might have 85% accuracy -- if the LLM can deliver 90% accuracy while being "less exact" that's still a win!
the fact that the comment is made-up LLM nonsense. you're missing that
What are some sample real world cases folks are using to fine tune their own small/medium models?
Oh I wrote up a post on X on this exact question! https://x.com/danielhanchen/status/1979389893165060345?s=20
1. Cursor used online RL to get +28% approval rate: https://cursor.com/blog/tab-rl
2. Vercel used RFT for their AutoFix model for V0: https://vercel.com/blog/v0-composite-model-family
3. Perplexity's Sonar for Deep Research Reasoning I think was a finetuned model: https://docs.perplexity.ai/docs/getting-started/overview
4. Doordash uses LoRA, QLoRA for a "Generalized Attribute Extraction model" https://careersatdoordash.com/blog/unleashing-the-power-of-l...
5. NASA flood water detection https://earthdata.nasa.gov/news/nasa-ibm-openly-release-geospatial-ai-foundation-model-nasa-earth-observation-data
6. Online RL for robotics - imagine teaching a robot in the future via some mini finetuning
7. OpenAI's RFT page has more: https://developers.openai.com/api/docs/guides/rft-use-cases
8. For larger models - https://www.mercor.com/blog/expert-data-drives-model-perform...
Just to prompt thought on this exact question; I'm interested in answers:
I just ran a benchmark against haiku on a very simple document classification task that at the moment we farm out to haiku in parallel. Very naive: same prompt, same system, via the same AWS Bedrock API. I can see that a few of the 4b models are a pretty good match, and could easily be run locally or cheaply via a hosted provider. The "how much data and how much improvement" question is one I don't have a good intuition for anymore. I don't even have an order-of-magnitude guess on those two axes.
Here's the raw numbers to spark discussion:
| Model          | DocType% | Year% | Subject% | In $/MTok |
|----------------|----------|-------|----------|-----------|
| llama-70b      | 83       | 98    | 96       | $0.72     |
| gpt-oss-20b    | 83       | 97    | 92       | $0.07     |
| ministral-14b  | 84       | 100   | 90       | $0.20     |
| gemma-4b       | 75       | 93    | 91       | $0.04     |
| glm-flash-30b  | 83       | 93    | 90       | $0.07     |
| llama-1b       | 47       | 90    | 58       | $0.10     |
Percents are doc type (categorical), year, and subject name match against haiku. Just uses the first 4 pages.
In the old world where these were my own in-house models, I'd be interested in seeing if I could uplift those numbers with training, but I haven't done that with the new LLMs in a while. Keen to get even a finger in the air if possible.
Can easily generate tens of thousands of examples.
Might try myself, but always keen for an opinion.
_edit for table formatting_
You can fine tune a small LLM with a few thousand examples in just a few hours for a few dollars. It can be a bit tricky to host, but if you share a rough idea of the volume and whether this needs to be real-time or batched, I could list some of the tradeoffs you'd think about.
Source: Consulted for a few companies to help them finetune a bunch of LLMs. Typical categorical / data extraction use cases would have ~10x fewer errors at 100x lower inference cost than using the OpenAI models at the time.
Ok, even that "few thousand examples" heuristic is useful. The use case would be to run this task over, I'd say, somewhere in the order of magnitude of 100k extractions in a run, batched not real time, and we'd be interested in (and already do) reruns regularly with minor tweaks to the extracted blob (1-10 simple fields, nothing complex).
My interest in fine tuning at all is based on an adjacent interest in self-hosting small models, although I tested this on AWS Bedrock for ease of comparison. My hope is that since we are self-hosting, fine tuning and hosting our tuned model shouldn't be terribly difficult, at least compared to managed finetuning solutions on cloud providers, which I'm generally wary of. Happy for those assumptions to be challenged.
Labeling or categorization tasks like this are the bread and butter of small fine tuned models. Especially if you need outputs in a specific json format or whatever.
I ran an experiment doing very simple SFT on Mistral 7B, and it was extremely good at converting receipt images into structured JSON outputs, and I only used 1,000 examples. The difficulty is trying to get a diverse enough set of examples, evaling, etc.
If you have great data with simple input output pairs, you should really give it a shot.
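To make "simple input output pairs" concrete, one SFT example might look like the sketch below. The schema and values are invented for illustration; for receipt images the user turn would also carry the image:

```python
# One hypothetical SFT example in chat format; the JSON schema is made up.
example = {
    "messages": [
        {"role": "user",
         "content": "Extract this receipt as JSON with keys merchant, date, total."},
        {"role": "assistant",
         "content": '{"merchant": "ACME Hardware", "date": "2024-06-03", "total": 12.50}'},
    ]
}
```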
if you add 2 spaces at the start of the line, you turn it into a code block
like this:

  | Model          | DocType% | Year% | Subject% | In $/MTok |
  |----------------|----------|-------|----------|-----------|
  | llama-70b      | 83       | 98    | 96       | $0.72     |
  | gpt-oss-20b    | 83       | 97    | 92       | $0.07     |
  | ministral-14b  | 84       | 100   | 90       | $0.20     |
  | gemma-4b       | 75       | 93    | 91       | $0.04     |
  | glm-flash-30b  | 83       | 93    | 90       | $0.07     |
  | llama-1b       | 47       | 90    | 58       | $0.10     |

thank you so much! i suffered with this, and now i never will again!
Hi! I think this is a pretty good example:
https://www.atredis.com/blog/2024/6/3/how-to-train-your-larg...
I am thinking of fine-tuning it to better recognize my handwriting. It already works quite well by default, but my writing is just horrible, so it has trouble sometimes.
Fine tuning is a story that is nice to tell, but with modern LLMs it makes less and less sense. Modern LLMs are so powerful that they can few-shot learn complicated things, so a strong prompt plus augmenting the generation (given the massive context window of Qwen3.5, too) is usually the best option available. There are models for which fine tuning is great, like image models: there, with LoRA, you can get good results in many ways. And LLMs of the past, too: it made sense for certain use cases. But now, why? LLMs are already released after seeing (after pre-training) massive amounts of data for SFT and then RL. Removing the censorship is much more efficiently done with other techniques. So I have a strong feeling that fine tuning becomes less relevant every day, and is already quite irrelevant. This, again, in the specific case of LLMs. For other foundational models fine tuning still makes sense and is useful (images, text to speech, ...).
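For what that looks like in practice, here's a rough sketch of the few-shot approach against an OpenAI-compatible endpoint; the local server URL, model name, FEW_SHOT_EXAMPLES, and new_ticket are all assumptions, not a fixed recipe:

```python
# Hedged sketch: few-shot prompting instead of fine-tuning. The endpoint,
# model name, FEW_SHOT_EXAMPLES and new_ticket are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
messages = [{"role": "system",
             "content": "Label each support ticket as billing, bug, or other."}]
for text, label in FEW_SHOT_EXAMPLES:  # a handful of curated (input, label) pairs
    messages += [{"role": "user", "content": text},
                 {"role": "assistant", "content": label}]
messages.append({"role": "user", "content": new_ticket})
resp = client.chat.completions.create(model="qwen3.5", messages=messages)
print(resp.choices[0].message.content)
```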
I think the biggest case for fine tuning is probably that you can take small models, fine tune them for applications that require structured output, and then run cheap inference at scale. "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.
Especially for super constrained applications. I don't care if the language model that I use for my extremely specific business domain can solve PhD math or remember the works of Shakespeare. I'd trade all of that for pure task specific accuracy.
Can you share more details about your use case? The good applications of fine tuning are usually pretty niche, which tends to make people feel like others might not be interested in hearing the details.
As a result it's really hard to read about real-world use cases online. I think a lot of people would love to hear more details - at least I know I would!
Exactly, inference cost is a very good reason to fine tune with something like Qwen
I agree.
Also for certain use cases there are constraints like embedded hardware systems with no internet access. These LLMs have to be trained to specialize for clearly defined use cases under hardware constraints.
Frontier LLMs also rarely function in isolation; instead they orchestrate a system of specialized units, aka subsystems and agents.
While costs and effort are one thing, being able to downsize these monster LLMs through finetuning in the first place is extremely valuable.
Wouldn’t it be better to use a grammar in the token sampler? Tuning is fine, but doesn’t guarantee syntactically correct structured output. But if the sampler is grammar-aware it could.
I think both should be done, they don't really serve the same purpose.
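A rough sketch of the sampler-side half using llama-cpp-python's GBNF support; the toy grammar, labels, and model path are assumptions:

```python
# Hedged sketch of grammar-constrained sampling with llama-cpp-python.
# The GBNF grammar, labels, and model path are toy assumptions.
from llama_cpp import Llama, LlamaGrammar

grammar = LlamaGrammar.from_string(r'''
root  ::= "{" ws "\"label\":" ws label ws "}"
label ::= "\"ok\"" | "\"defect\""
ws    ::= [ \t\n]*
''')

llm = Llama(model_path="finetuned-qwen-7b-q4.gguf")
out = llm("Classify this inspection note: hairline crack near mount.",
          grammar=grammar, max_tokens=32)
print(out["choices"][0]["text"])  # the sampler forces output to match the grammar
```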
> "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.
I am no expert in this topic, but I am wondering if a large cached context is actually cheap to run, and whether frontier models would be cost-efficient too in such a setting?
I'd like to read more about that if anyone has any suggestions.
I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs. It needs to be as tiny as possible so it doesn't take resources away from the running game.
> I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs.
Unless your game states have combinatorial explosion, would it not be better to generate all of that pre-build? If templated, you can generate a few hundred thousand templates to use for any circumstance, then instantiate and stitch together those templates during the game runtime.
There are a bunch of tutorials on how to use GRPO to fine tune a small Qwen. Depending what you're doing LoRA or even just prefix tuning can give pretty good results with no special hardware.
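In case it helps, the LoRA route really is only a few lines these days with trl + peft. A minimal sketch, where the model id, dataset file, and hyperparameters are placeholder assumptions, not a tuned recipe:

```python
# Hedged sketch of LoRA SFT with trl + peft. Model id, dataset file and
# hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="lora-out", per_device_train_batch_size=4),
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```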
How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?
> How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?
I dunno, for game prose I expect that a tiny highly quantized model would be sufficient (generating no more than a paragraph), so 300MB - 500MB maybe? Running on CPU not GPU is feasible too, I think.
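One way to sanity-check the footprint before committing to a C++ integration is to load a small quantized GGUF with llama-cpp-python first; the file name below is an assumption, and llama.cpp itself embeds into C++ the same way:

```python
# Quick footprint/quality check with a tiny quantized model; the GGUF
# file name is an assumption.
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf",
            n_ctx=1024, n_threads=4)  # CPU-only is fine at this size
out = llm("Narrate these game events as one short paragraph:\n"
          "- player enters the crypt\n- skeleton ambush\n- player flees with 3 gold\n",
          max_tokens=96)
print(out["choices"][0]["text"])
```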
This is literally what I'm waiting for. I want a ~8B model that works well with OpenClaw.
I don't think you will get that anytime soon because for a model to work well with something like openclaw it needs a massive context window.
but but but but unified memory! (jk, I don't actually believe in Apple marketing words)
There might be future optimizations. Like, have your small model do COT to find where to look for memory that is relevant.
Qwen 9B doesn't?
Nothing is really usable outside Opus.
I've tried too. Wasted a few days trying out even high end paid models.
These are fair points considering LLMs are getting smarter and better every week - but to be fair, the biggest benefits of finetuning / RL are not yet realized:
1. If we have robots at home, they need some sort of efficient continual learning, which could be on the go finetuning / RL via some small LoRA - this will need to do multimodal finetuning with sparse reward signals - one could also imagine all data is aggregated to one central processing center after anonymization, and training a larger model with more data + RL like that
2. Agreed images, audio, video etc is what still LoRA does well - the guide at https://unsloth.ai/docs/models/qwen3.5/fine-tune is actually a vision + text finetuning guide, so you can finetune the vision layers on your own use case
3. Model routing is going to be more the norm in the future - ie locally smallish models with LoRA for continuous finetuning can be used, but complex tasks can be offloaded to a large LLM in the cloud.
4. I also wrote about more use-cases below on the post - DoorDash, Vercel, Mercor, Stripe, NASA, Perplexity, Cursor and many others all do finetuning - for eg Cursor, Perplexity finetune large OSS LLMs themselves for their specific product lines - so there is definitely value if you have the data for it.
I work on Gemma and Gemini models and I want to echo Daniel's point here. Small finetuned models have their place even alongside larger general-purpose models.
For example, last year with Daniel/Unsloth's help we released a tiny specialized model that gets Gemini-level quality, purpose-built specifically for FC (function calling). For folks that need efficient, limited-purpose models, small models like this can fit a specific need.
https://blog.google/innovation-and-ai/technology/developers-...
Especially on device. https://developers.googleblog.com/on-device-function-calling...
It's the same with chips, we have general purpose CPUs but we still have specialized silicon for tasks that are smaller, more power efficient, cheaper, and because they're single purpose it simplifies and derisks certain designs.
And I have to add, if you want to learn about finetuning models efficiently the Unsloth guides are at the top of my list. They're practical, have all the technical details, and most importantly Daniel and the others are working around the clock to keep it up to date in what is an incredibly fast moving space of models and hardware. I am continually astounded by their work.
Function calling, and finetuning with FC, is a big use-case across many companies - we constantly see large orgs with internal APIs that have some schema, and JSON guided output is good, but finetuning with FC is just much more powerful since the model actually starts to understand how to utilize the tools more effectively!
Nice work with Gemma and Gemini as usual! :) Excited for more cool models this year!
For me, trying to fine-tune a model to write "best day" prose I would accept over 80% of the time.
You are correct if we are talking about knowledge.
However it is bad at hyper-idiosyncratic, gritty style transfer.
I first noticed the issue when asking claude code to draft email responses. The choice of register was off. ("Register in writing refers to the level of formality and tone chosen to suit a specific audience, purpose, and context.")
I decided to take all my HN comments and rewrite them in various bad LLM prose, and see if I could use DSPy to optimize a prompt using in-context learning (ICL; I give it 10 examples of my HN comments), and the results were abysmal. RLHF fine-tuned frontier LLMs have a deep-seated aversion to the target stylistic distribution of my comments.
I tried fine-tuning qwen3, llama, and gemma models. Instruct models are already so tuned that they could not be tuned. This is using several hundred comments as gold targets and 5 different LLM degradations per gold as the input.
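For anyone curious, the DSPy/ICL attempt described above looks roughly like this sketch; the LM, signature fields, metric, and `pairs` are all invented stand-ins:

```python
# Hedged sketch of prompt optimization with DSPy for style transfer.
# The LM, field names, metric, and `pairs` are invented stand-ins.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class Restyle(dspy.Signature):
    """Rewrite the LLM-flavored draft in the author's own voice."""
    draft: str = dspy.InputField()
    rewritten: str = dspy.OutputField()

def style_metric(gold, pred, trace=None):
    # Placeholder: a real run would score stylistic similarity to the gold text.
    return gold.rewritten.lower() == pred.rewritten.lower()

trainset = [dspy.Example(draft=d, rewritten=g).with_inputs("draft")
            for d, g in pairs]  # pairs = (degraded LLM rewrite, original comment)
optimized = dspy.BootstrapFewShot(metric=style_metric).compile(
    dspy.ChainOfThought(Restyle), trainset=trainset)
```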
> Instruct models are already so tuned that they could not be tuned
Some models have the base model available, that is before instruction tuning. For example llama 3 comes in "pre-trained and instruction tuned variants" [1]. I'm guessing you already know that though.
Llama-3-8B is a coprolite at this point.
How well would you say it worked? I do like the idea of taking my historical forum posts and e-mails and whatnot and training an autocomplete LLM that is specifically "my voice".
I think fine-tuning still matters for production problems where you need deterministic, auditable behavior or to reliably reduce hallucinations that clever prompting alone cannot eliminate. In my experience the best pragmatic approach is parameter efficient tuning, for example LoRA or QLoRA with bitsandbytes for 4-bit training to keep costs down, paired with a RAG layer over a FAISS vector DB so you do not stuff the model context and blow your token budget. I've found that managing a few tuned adapters and a small ops pipeline is a simpler, cheaper long term tradeoff than endless prompt gymnastics, and it saves you from praying to the prompt gods every time requirements creep.
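For models bitsandbytes does support, the 4-bit QLoRA setup mentioned there is just a quantization config plus a LoRA wrapper. A sketch, with the model id and ranks as assumptions:

```python
# Hedged sketch of a 4-bit QLoRA setup with transformers + bitsandbytes + peft.
# Model id and LoRA hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"))
model.print_trainable_parameters()  # only the adapter weights train
```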
This time even Unsloth could not provide bitsandbytes 4-bit models. bitsandbytes does not support new models with MoE and linear attention, and it's much less flexible than GGUF. Nowadays I think it's better to train a LoRA over a GGUF base model, see the discussion at https://github.com/huggingface/transformers/issues/40070
I'll find some time to do this, and I hope someone can do it earlier than me.
They are great for specialized use-cases: (a) where the problem is not hard enough (you don't need reasoning), or (b) not diverse enough (you don't need a world model), (c) where you want cheap inference (and you can make it happen hardware-wise), and (d) where you either have enough data or a workflow that accumulates data (with fine tuning and enough data you can sometimes beat a premier model while ensuring low latency - ofc, assuming (a) and (b) apply).
I make it sound like a rare perfect storm needs to exist to justify fine tuning, but these circumstances are not uncommon - to an extent (a), (c) and (d) were already prerequisites for deploying traditional ML systems.
In one word, porn.
Qwen filtered out a lot of porn during data curation, and a finetuned model can perform much better than context engineering. Abliteration can only remove censorship, not add something non-existent in the training data.
This guy did some great work in the age of Qwen 3.0: https://huggingface.co/chenrm/qwen3-235b-a22b-h-corpus-lora
Fine-tuning still makes sense for cost/latency-sensitive applications. Massive context windows drastically slow down generation, and modern models' performance and instruction following ability relies heavily on a reasoning step that can consume orders of magnitude more tokens than the actual response (depending on the application), while a fine-tuned model can skip/significantly reduce that step.
Using the large model to generate synthetic data offline with the techniques you mentioned, then fine-tuning the small model on it, is an underrated technique.
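Mechanically that offline loop is simple; a sketch, where PROMPT and raw_documents are assumptions and any frontier-model client would work:

```python
# Hedged sketch of the distillation loop: a large model labels raw inputs,
# and the pairs become SFT data for the small model. PROMPT and
# raw_documents are assumptions.
import json
from openai import OpenAI

client = OpenAI()
with open("distill.jsonl", "w") as f:
    for doc in raw_documents:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT + "\n\n" + doc}])
        pair = {"prompt": doc, "completion": resp.choices[0].message.content}
        f.write(json.dumps(pair) + "\n")
```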
If that were true, we would be able to run working agents out of the box on any domain.
We are far from that still, for reliability in most applications you need fine tuning.
For any new modality you need fine tuning
For voice, image and video models you need fine tuning
For continual learning you (often) need fine tuning.
For any domain that is somewhat OOD you need fine tuning.
To fully ground a model you need fine tuning
As strong as current LLMs are they are easily distracted from the task often. At production scale, fine tuning can make a lot more sense given you provide the model a very specific task.
For agentic coding, which do you prefer:
a) qwen3-coder
b) qwen3.5 (general)
The problem with this is context. Whatever examples you provide compete with whatever content you actually want analyzed. If the problem is sufficiently complex, you quickly run out of context space. You must also describe the response format you want. For many applications, it's better to fine-tune.
where it makes sense IMO is when you need it to know about a large amount of information that's not already in the model, such as a company knowledgebase, code repositories or a trove of specialized legal documents... in that case it's not realistic to try to stuff the context window every time with that information, especially if you're trying to make a responsive chat bot.
With the current context windows, and the RL these models did to work as agents, it's much faster and more reliable for them to use tools and find the information before replying. Much better: no hallucination problems (or a lot fewer), and no fine tuning needed when information changes. I believe it is exactly in this case that fine tuning is no longer useful, and even in the past it worked at very different degrees of quality.
Wouldn't a RAG make more sense for this use case?
indeed, and in practical terms this is more often than not, particularly with large knowledge bases. also makes super sense for VLMs and ViT models.
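A minimal RAG skeleton for that case; the embedding model, `load_chunks` helper, and paths are placeholder assumptions:

```python
# Minimal RAG sketch: embed the knowledge base once, retrieve top-k chunks
# per query, and put only those in the prompt. load_chunks is a placeholder.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = load_chunks("company_kb/")  # assumed helper returning a list of strings
vecs = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine after normalization
index.add(vecs)

def retrieve(query, k=5):
    qv = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(qv, k)
    return [chunks[i] for i in ids[0]]
```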
> But now, why?
Because these models are good in general but their Latvian output is half-drivel, like the roots of the words are usually the right ones, but not the rest.
That, and EuroLLM is really slow to release new models that would be similarly good off the shelf.
I would like model adaptation algorithms like Doc-to-LoRA (https://pub.sakana.ai/doc-to-lora/) to go mainstream.
Awesome guide, shame how a couple of the Qwen leads got kicked out and replaced with more “business” minded leadership. Hopefully this doesn’t mean the end of the open source era from Qwen.
Oh, I saw this on X a few days ago: https://x.com/poezhao0605/status/2029151951167078454 - Alibaba's CEO and CTO are having an emergency all-hands now! Hope it all goes well!
Does fine tuning really improve anything over pure RAG approaches for use cases that involve tons of direct document context?
Specialised models easily beat SOTA, case in point: https://nehmeailabs.com/flashcheck
Remember how the tab-next-action model from Cursor was all the rage ~2 years ago when they launched it? That was a fine-tune of a ~70b model (they kinda alluded to this in a podcast).
Unfortunately, this looks to only cover the larger MoE models. I imagine the smaller models are what most people would target. 9B just dropped two days ago, so I'm not surprised it's not explicitly documented, but it does use a hybrid mamba architecture that I expect needs some special consideration.