Training Custom LoRA for Z-Image-Turbo (ZIT) via Local PC
⚠️ Important Disclaimer & Ethical Use
Before proceeding with the tutorial, please acknowledge the following ethical guidelines regarding AI model training:
- Non-Real Person: The character trained and demonstrated in this guide (Nano-Banana) is a purely fictional AI-generated character and does not represent any real-life individual.
- Responsible Use: I strictly discourage and prohibit the use of this technology to recreate real people without their explicit consent. Please use these tools ethically to respect the privacy and rights of others.
Introduction
In this post, I will provide a comprehensive guide on training a custom LoRA to achieve character consistency using the Z-Image-Turbo (ZIT) architecture.
It is important to note from the outset: LoRA training is a resource-intensive process that demands significant hardware performance. While I will demonstrate how to optimize settings for various environments, this task inherently requires substantial GPU power and VRAM.
If you are working with a lower-spec machine, such as one with 8GB of VRAM, you must be prepared for extended processing times—ranging from several hours to an entire day. However, by fine-tuning specific parameters, it is possible to achieve high-quality results even on consumer-grade hardware. This guide will focus on the technical execution and the necessary trade-offs to successfully generate your own .safetensors model.
Hardware Recommendations
If you want a smooth local experience, I recommend:
Local: Recommended 24GB+ System RAM, 12GB+ VRAM, but 8GB is possible with optimization.
Cloud: RunPod or Vast.ai (RTX 3090/4090) for those who want speed over patience.
Toolkit: We use Ostris (AI-Toolkit) for its superior memory efficiency and SDXL/ZIT support.
The Toolkit: Why Ostris (AI-Toolkit)?
There are several training tools available, such as Kohya_ss and Ostris (AI-Toolkit). In this guide, we use Ostris.
Pros: Highly optimized for SDXL/ZIT, intuitive UI, and superior memory management (Layer Offloading).
Cons: Less granular control over niche optimizers compared to Kohya.
Step-by-Step: Local Implementation
Check the [Quick Summary of Settings] below.
Installation: Download the toolkit from the official repository [Link].
Setup: Create a dedicated folder (e.g., C:/ostris/), extract the files, and run the setup. It will automatically install Python, Git, and Node.js. If it doesn't auto-launch, run Start-AI-Toolkit.bat.
Dataset Tab: Open the 'Dataset' tab and click New Dataset.
Tip: Upload 15–20 high-quality images where the character's face and features are clearly visible.
Job Configuration: Now, let’s move on to the most critical part of the setup. Please refer to the screenshot below, where I have highlighted the essential settings in red-lined rectangular boxes. You must configure these three areas precisely to ensure a successful training run:
Trigger Word : The Trigger Word is a unique identifier that tells the AI exactly when to apply your LoRA's specific features. Without a clear trigger word, the model may struggle to distinguish your custom character from the base model's existing knowledge.
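If you caption your dataset, the trigger word should lead every caption. As a concrete illustration, here is a small Python sketch that writes sidecar caption files prefixed with the trigger word; the folder path, trigger word, and caption text are placeholders, and it assumes your trainer reads `.txt` captions that share each image's filename (a common convention that AI-Toolkit follows):

```python
from pathlib import Path

# Hypothetical dataset folder and trigger word -- substitute your own.
DATASET_DIR = Path("dataset/nano_banana")
TRIGGER = "nanobanana"

def write_captions(dataset_dir: Path, trigger: str) -> int:
    """Create a sidecar .txt caption for every image, prefixed with the trigger word.

    Skips files that already have a caption so hand-written ones are never overwritten.
    Returns the number of caption files created.
    """
    count = 0
    for img in sorted(dataset_dir.glob("*")):
        if img.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        caption = img.with_suffix(".txt")
        if not caption.exists():
            caption.write_text(f"{trigger}, a photo of a woman\n", encoding="utf-8")
            count += 1
    return count

# write_captions(DATASET_DIR, TRIGGER)  # run once after collecting your images
```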
Model: Select Z-Image-Turbo. It excels at capturing facial features during training.
Optimization (Low VRAM & Layer Offloading):
Recommendation: Turn these ON if you are using low-spec hardware (8GB–12GB VRAM).
Why: These settings shift a portion of the GPU's workload to the system RAM/CPU. This is a critical safety measure that prevents "Out of Memory" crashes on 8GB cards, ensuring a stable training session even on consumer-grade PCs.
Target Settings (Rank & Alpha): The Rank (Network Dim) determines the capacity of the LoRA to store information. Choosing the right rank is a balancing act between detail and model flexibility.
Low Rank (4–16):
Pros: Focuses on the "core essence" of the subject (e.g., general face shape, main colors). It results in a smaller file size and is much more flexible, making it easier to combine with other LoRAs or styles.
Best for: Character consistency where you want to maintain the ability to change outfits or backgrounds easily.
High Rank (32–128+):
Pros: Captures intricate details, complex textures, and specific artistic styles with high fidelity.
Cons: Significantly increases the risk of Overfitting (where the model mimics the training images too literally, including backgrounds or specific poses) and requires more VRAM during training.
Standard Practice: For most character LoRAs, a rank of 16 to 32 is considered the industry standard "sweet spot," providing a balance between detail and versatility.
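To make the size trade-off concrete, here is a rough back-of-the-envelope Python sketch. The layer count and width below are invented illustrative numbers, not ZIT's actual shapes, so treat the megabyte figures as relative, not exact:

```python
def lora_params(in_dim: int, out_dim: int, rank: int) -> int:
    """LoRA adds two low-rank matrices per target layer: A (rank x in) and B (out x rank)."""
    return rank * in_dim + out_dim * rank

def estimate_size_mb(num_layers: int, dim: int, rank: int, bytes_per_param: int = 2) -> float:
    """Rough .safetensors size, assuming square (dim x dim) target layers stored in BF16."""
    return num_layers * lora_params(dim, dim, rank) * bytes_per_param / 1024 ** 2

# File size (and capacity) grows linearly with rank -- doubling rank doubles both.
for rank in (16, 32, 128):
    print(f"rank {rank:>3}: ~{estimate_size_mb(200, 2048, rank):.0f} MB")
```

This is why rank 16–32 sits in the sweet spot: enough capacity for a face, without a bloated file that memorizes backgrounds.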
Precision & Saving:
BF16 (Bfloat16): Highly recommended for newer NVIDIA architectures (Ampere and Ada Lovelace / RTX 30 and 40 series). It offers a significantly wider dynamic range, which helps prevent "gradient overflows" and ensures much more stable training for high-resolution models like SDXL or ZIT.
FP16 (Half Precision): The standard choice for older GPU architectures (Pascal, Turing / GTX 10 and RTX 20 series). While it is memory-efficient, it has a narrower numerical range than BF16, which may occasionally lead to training instability in complex models.
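The range difference follows directly from the bit layouts (FP16: 5 exponent / 10 mantissa bits; BF16: 8 exponent / 7 mantissa bits) and is easy to verify with a few lines of Python:

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style float with the given bit layout."""
    bias = 2 ** (exp_bits - 1) - 1  # exponent bias; the top biased exponent is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias

print(max_finite(exp_bits=5, mantissa_bits=10))  # FP16 -> 65504.0
print(max_finite(exp_bits=8, mantissa_bits=7))   # BF16 -> ~3.39e38
# Any value above ~65k overflows to inf in FP16 but is comfortably finite in BF16,
# which is exactly the "gradient overflow" headroom described above.
```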
Checkpoint Management: Save Every
Managing disk space is a critical part of the workflow, especially during long training sessions.
Save Every (Interval): Setting this to 500 steps is a widely accepted balance. It provides enough "recovery points" to revert if the model begins to overfit, while preventing the storage from being overwhelmed by too many large .safetensors files.
Storage Tip: If you are working with limited disk space, consider increasing this interval to 1,000 steps, or manually deleting earlier checkpoints once a stable version is reached.
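A quick Python estimate of the disk cost (the ~100 MB checkpoint size here is a hypothetical figure; substitute the file size your rank actually produces):

```python
def checkpoint_disk_mb(total_steps: int, save_every: int, ckpt_mb: float) -> float:
    """Total disk consumed by the checkpoints saved at each interval."""
    return (total_steps // save_every) * ckpt_mb

print(checkpoint_disk_mb(1500, 500, 100.0))   # save every 500  -> 300.0 MB (3 files)
print(checkpoint_disk_mb(1500, 1000, 100.0))  # save every 1000 -> 100.0 MB (1 file)
```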
Training Parameters: These settings dictate how efficiently and accurately the model learns from your dataset. For high-quality character replication on consumer hardware, pay close attention to these four parameters:
Batch Size (Recommended: 1 or 2):
Low-Spec (8GB–12GB VRAM): Keep this at 1 for maximum stability.
High-Spec (24GB+ VRAM): You can increase this to 2 for significantly better training efficiency. While it consumes more VRAM, a higher batch size allows the model to process more data at once, leading to more stable gradients and faster overall training.
Steps (Recommended: 1,500+):
Impact: The total number of iterations the model undergoes.
Dataset Size: While 20 high-quality images are sufficient to achieve great results, the more diverse and high-quality images you have, the better. A larger dataset allows the model to understand the character from multiple angles and lighting conditions, leading to much higher fidelity and flexibility.
Text Encoder Optimizations (Cache Text Embeddings):
Recommendation: ON (for Low-VRAM setups).
Why: Cache Text Embeddings pre-calculates the text data once instead of re-encoding it every step, saving processing power. Enabling it significantly reduces the risk of memory crashes.
Differential Guidance Scale (Recommended: 3.0): This experimental setting amplifies the learning process to hit training targets faster and capture much sharper details in every scenario.
Dataset and Sample: To ensure your training is on the right track without crashing your system, pay attention to these two settings:
Num Repeats (Dataset Repetition): This setting determines how many times the model "looks" at each image during one epoch. Balancing this is crucial to prevent the LoRA from becoming too rigid.
Small Datasets (10–20 images): Set this to 10–12. Since the data is limited, each image needs more attention. However, do not go too high; excessive repeats can cause Overfitting, making the model "break" or fail to generate new poses.
Large Datasets (50+ images): Set this to 8 or lower. With a wealth of data, the model can learn enough variety with fewer repeats, maintaining better flexibility for diverse prompts.
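The interplay between dataset size, repeats, batch size, and total steps can be sanity-checked with a few lines of Python (the example numbers mirror this article's 16-image run):

```python
import math

def steps_per_epoch(num_images: int, repeats: int, batch_size: int) -> int:
    """One epoch = every image seen `repeats` times, processed in batches."""
    return math.ceil(num_images * repeats / batch_size)

def epochs_completed(total_steps: int, num_images: int, repeats: int, batch_size: int) -> float:
    """How many passes over the (repeated) dataset a given step budget buys."""
    return total_steps / steps_per_epoch(num_images, repeats, batch_size)

print(steps_per_epoch(16, 12, 1))                  # 192 steps per epoch
print(round(epochs_completed(600, 16, 12, 1), 2))  # a 600-step run is ~3.12 epochs
```

If the epoch count comes out very high for your step budget, lowering repeats is usually safer than cutting steps.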
Resolution (Dataset/Model Res): Resolution is the biggest factor in VRAM consumption. You should adjust this based on your hardware:
Low-Spec (8GB - 16GB VRAM): Set this to 512 x 512 or 768 x 768. Lower resolution drastically reduces the memory load and speeds up the process.
High-Spec (24GB+ VRAM): You can push this to 1024 x 1024 for maximum detail and higher output quality.
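Because the memory load scales roughly with pixel count, resolution changes compound quadratically; a tiny Python check makes the trade-off concrete:

```python
def relative_memory(res_a: int, res_b: int) -> float:
    """Approximate activation-memory ratio between two square training resolutions."""
    return (res_a ** 2) / (res_b ** 2)

print(relative_memory(1024, 512))  # 4.0  -> 1024px costs roughly 4x the memory of 512px
print(relative_memory(768, 512))   # 2.25 -> 768px is a middle ground at ~2.25x
```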
Samples (Sample Every: 500): This generates test images every 500 steps during training. By entering specific Prompts in the sample configuration, you can see exactly how the LoRA is interpreting your character in real-time. If the samples look correct, let it run; if not, you can stop early to save time.
- Create Job and Launch: Now that everything is configured, it's time to start the engine.
Click the [Create Job] button first. Then, navigate to the icon indicated by the Red Arrow in the screenshot below to initiate the training process. Once the training begins, you can monitor the Progress Bar (%) and the Loss Graph.
The Loss Graph: This is the most critical indicator. A steady, downward-sloping graph (ideally stabilizing between 0.08 and 0.12) confirms that the model is learning the character's features correctly without diverging.
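Raw step losses are noisy, so dashboards typically smooth them before you judge the trend. A minimal exponential-moving-average sketch in Python (the loss values below are made up for illustration):

```python
def ema(values: list[float], beta: float = 0.9) -> list[float]:
    """Exponential moving average -- the smoothing most training dashboards apply to raw loss."""
    smoothed, prev = [], values[0]
    for v in values:
        prev = beta * prev + (1 - beta) * v
        smoothed.append(prev)
    return smoothed

raw = [0.35, 0.30, 0.22, 0.40, 0.15, 0.12, 0.11, 0.10, 0.09, 0.10]
trend = ema(raw)
# Judge convergence by the smoothed tail drifting steadily downward,
# not by any single noisy step (like the 0.40 spike above).
```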
Wrapping Up
Training Results
After a training session of only 800 steps, the results are surprisingly accurate. Despite the short training time, the model captures the core essence of the character with high fidelity.
Sample Input: The photos used for training are images generated by Nano Banana.
Model Download (Step 600): [Lora_zit]
Usage Guide:
Recommended Prompting: Use terms like she, woman to define the subject.
Avoidance Tags (Negative Prompts): To maintain the intended artistic style, avoid using ethnic-specific tags such as asian, korean, or japanese.
Training Environment & Hardware Specs
For those curious about the setup used for this 1.5-hour session, here are the technical details:
- Hardware: NVIDIA GeForce RTX 4090
- Dataset Size: 16 High-Quality Images
- Base Model: Z-Image-Turbo
- Precision: BF16 (Bfloat16)
⚙️ Training Configuration:
- Steps: 600 (Total Time: ~1.5 hours)
- Batch Size: 1 | Learning Rate: 1.2e-4
- Data Strategy: 12 Repeats | Resolution: 768
- Optimization: Cache Text Embeddings (ON)
- Checkpoints: Save & Sample Every 200 steps