[RPG-DiffusionMaster] I Tried Making Super Cute AI Beauties with an Ultra-High-Performance Image Generation AI

2024-01-27

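I'm Nakata, an LLM researcher in the WEEL Media Division.

On January 22, the text-to-image model "RPG-DiffusionMaster", which is built on a powerful technique called RPG, was released.

With this model, you can generate high-quality images from text instructions!

It has already passed 600 stars on GitHub, which shows how much attention it is getting.

This article covers how to use RPG-DiffusionMaster and also tests how well it actually works. Read it closely and you will get a real sense of what RPG-DiffusionMaster can do, and you may not want to go back to earlier image generation AIs.

Please read through to the end.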

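Overview of RPG-DiffusionMaster

RPG-DiffusionMaster is a text-to-image model whose performance is boosted by a technique called RPG.

Specifically, it uses multimodal LLMs (MLLMs) such as GPT-4 and Gemini-Pro, or open-source local LLMs such as miniGPT-4, to plan the conversion from text to image.

In other words, an image is generated through the following steps.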

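1. The user enters a text prompt.
2. The MLLM extracts key phrases from the prompt.
3. A summary of the prompt is generated from the extracted key phrases.
4. Using the summary and chain-of-thought (CoT) reasoning, a prompt is generated for each fine-grained region of the image.
5. The image for each region is generated from those prompts.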
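
The last two steps can be pictured with a small sketch. This is not the RPG-DiffusionMaster implementation: `RegionPlan` and `build_regional_prompt` are hypothetical names introduced here for illustration, and the `BREAK` separator and `1;1;1;1` ratio format are taken from the tool's log output shown later in this article. The sketch only shows how one sub-prompt per region could be assembled into a single regional prompt with equal region weights.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionPlan:
    """Hypothetical container for a plan (not the tool's own class)."""
    split_ratio: str       # e.g. "1;1;1;1", one weight per region
    regional_prompt: str   # sub-prompts joined by the BREAK keyword


def build_regional_prompt(subprompts: List[str]) -> RegionPlan:
    """Join one sub-prompt per region, giving every region equal weight."""
    if not subprompts:
        raise ValueError("at least one region prompt is required")
    cleaned = [p.strip() for p in subprompts]
    ratio = ";".join("1" for _ in cleaned)
    prompt = " BREAK\n".join(cleaned)
    return RegionPlan(split_ratio=ratio, regional_prompt=prompt)


plan = build_regional_prompt([
    "The ancient city in spring, with cherry blossoms.",
    "The city in summer, under vibrant sunlight.",
])
print(plan.split_ratio)
print(plan.regional_prompt)
```

In the real tool the sub-prompts and the ratio come from the MLLM's planning step; here they are simply passed in by hand.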

Also, compared with "ControlNet", image generation using the RPG strategy has been found to capture the content of the prompt more accurately.

Pricing of RPG-DiffusionMaster

RPG-DiffusionMaster is open source, so anyone can use it for free.

Be sure to check the license before using an image generation AI commercially.

How to Use RPG-DiffusionMaster

Here we run it on a T4 GPU in Google Colab.

First, run the following code to clone the repository and install the required libraries.

!git clone https://github.com/YangLing0818/RPG-DiffusionMaster
%cd RPG-DiffusionMaster
!pip install -r requirements.txt

Restart the runtime at this point.

Next, run the following code to reinstall torch.

# Reinstall torch
%cd RPG-DiffusionMaster
!pip uninstall -y torch
!pip install torch

Next, run the following code to download the repositories it depends on.

!mkdir -p repositories
!mkdir -p generated_imgs/demo_imgs
!mkdir -p models/Stable-diffusion
%cd repositories
!git clone https://github.com/Stability-AI/generative-models
!git clone https://github.com/Stability-AI/stablediffusion
!git clone https://github.com/sczhou/CodeFormer
!git clone https://github.com/crowsonkb/k-diffusion
!git clone https://github.com/salesforce/BLIP
!mv stablediffusion stable-diffusion-stability-ai
%cd ..

Next, run the following code to download the model.

# Download the model
!wget -P ./models/Stable-diffusion/ https://huggingface.co/Linaqruf/animagine-xl/resolve/main/animagine-xl.safetensors

Finally, run the following code to generate an image.

!python RPG.py \
    --user_prompt 'From left to right, an ancient Chinese city in spring, summer, autumn and winter in four different regions' \
    --model_name 'animagine-xl.safetensors' \
    --version_number 0 \
    --api_key '<Your OpenAI API key>' \
    --use_gpt

Generated images are saved in "/RPG-DiffusionMaster/generated_imgs/demo_imgs".
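
To check what was written there, a short helper like the following can list the output files. It is a minimal sketch: `list_generated_images` and the set of file extensions are assumptions made for this example, not part of RPG-DiffusionMaster, and the path is given relative to the cloned repository root.

```python
from pathlib import Path
from typing import List

# Assumed output formats; adjust if the tool writes something else.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def list_generated_images(output_dir: str) -> List[Path]:
    """Return image files in output_dir, newest first (empty list if missing)."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    images = [
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    return sorted(images, key=lambda p: p.stat().st_mtime, reverse=True)


# Directory created by the earlier mkdir step, relative to the repository root.
for image_path in list_generated_images("generated_imgs/demo_imgs"):
    print(image_path)
```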

PC Specs Required to Run RPG-DiffusionMaster

Set aside at least 10 GB of storage. You also need a GPU with at least 10 GB of VRAM, plus 16 GB or more of RAM.

■ Python version
Python 3.8 or later

■ Required packages
torch and others
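
These requirements can be checked before starting. The following is a rough sketch under stated assumptions: the thresholds are the figures quoted above, the function names are made up for this example, and the VRAM check only works when torch is installed with CUDA support (otherwise it reports `None`). System RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)
MIN_FREE_DISK_GB = 10
MIN_VRAM_GB = 10


def check_python(version_info=sys.version_info) -> bool:
    """True if the interpreter is Python 3.8 or later."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def free_disk_gb(path: str = ".") -> float:
    """Free disk space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def gpu_vram_gb():
    """Total VRAM of GPU 0 in GB, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


vram = gpu_vram_gb()
print("Python OK:", check_python())
print("Disk OK:", free_disk_gb() >= MIN_FREE_DISK_GB)
print("VRAM OK:", vram is not None and vram >= MIN_VRAM_GB)
```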

Trying Out RPG-DiffusionMaster

Let's generate an image with the prompt "From left to right, an ancient Chinese city in spring, summer, autumn and winter in four different regions".

The generated image is shown below.

The accuracy is quite high! Incidentally, the RPG planning process is printed to the command line as follows.

### Original Caption:
"From left to right, an ancient Chinese city in spring, summer, autumn, and winter in four different regions."

### Key phrases identification:
We have one main subject (an ancient Chinese city) represented in four different seasonal atmospheres. As such, we must divide the image into four equal horizontal regions, each depicting a different season.

1. Ancient Chinese city in spring (the vibrancy of blooming flowers and a fresh, lively atmosphere)
2. Ancient Chinese city in summer (the warmth of sunlight, lush greenery, and bustling life)
3. Ancient Chinese city in autumn (the changing leaves, a cooler ambiance, and a sense of harvest)
4. Ancient Chinese city in winter (the tranquility of snow-covered scenes and a serene, peaceful mood)

### Split Ratio Planning:
#### Horizontal Split Ratio: `1;1;1;1`
- This ratio splits the image into four horizontal rows, each dedicated to a different season.

#### Vertical Split Ratio: None
- We don't require vertical divides since each season is horizontally delineated.

#### Detailed Subregion Prompts:
1. **First Row** (`1`):
  - **Region 0:** The ancient city in spring, with cherry blossoms adorning the architecture and soft pink hues radiating a renewal of life.
2. **Second Row** (`1`):
  - **Region 1:** The city in summer, showcasing the vibrant sunlight casting golden glows on the rooftops, and streets teeming with activity beneath verdant trees.
3. **Third Row** (`1`):
  - **Region 2:** Autumn within the city, where fiery reds and oranges of foliage frame the traditional buildings, and a coolness settles in the air.
4. **Fourth Row** (`1`):
  - **Region 3:** Winter's gentle embrace, as the city slows under a blanket of pristine snow, with delicate icicles hanging from eaves and a peaceful silence.

#### Composition Logic:
- Each row captures the essence of a season within the city, using colors and elements associated with the respective times of year to convey the atmosphere. 

#### Aesthetic Considerations:
- Spring is depicted with soft pinks and a sense of awakening.
- Summer is represented with warm, golden tones and a feeling of vibrancy.
- Autumn is defined by rich reds and oranges, bringing a sense of change.
- Winter is portrayed with whites and blues, fostering a tranquil and serene environment.

By following this layout, each region effectively encapsulates the distinctive features and ambiance of an ancient Chinese city throughout the seasons, with a focus on aesthetic appeal and coherence.

Now, let's output the split ratio and regional prompt we get in the planning process.

### Output:
Horizontal split ratio: 1;1;1;1
Vertical split ratio: None
Split ratio: 1;1;1;1
Regional Prompt: The ancient city in spring, with cherry blossoms adorning the architecture and soft pink hues radiating a renewal of life. BREAK
The city in summer, showcasing the vibrant sunlight casting golden glows on the rooftops, and streets teeming with activity beneath verdant trees. BREAK
Autumn within the city, where fiery reds and oranges of foliage frame the traditional buildings, and a coolness settles in the air. BREAK
Winter's gentle embrace, as the city slows under a blanket of pristine snow, with delicate icicles hanging from eaves and a peaceful silence.
Horizontal split ratio: 1;1;1;1
Vertical split ratio: None
Split ratio: 1;1;1;1
Regional Prompt: The ancient city in spring, with cherry blossoms adorning the architecture and soft pink hues radiating a renewal of life. BREAK
The city in summer, showcasing the vibrant sunlight casting golden glows on the rooftops, and streets teeming with activity beneath verdant trees. BREAK
Autumn within the city, where fiery reds and oranges of foliage frame the traditional buildings, and a coolness settles in the air. BREAK
Winter's gentle embrace, as the city slows under a blanket of pristine snow, with delicate icicles hanging from eaves and a peaceful silence.
{'split ratio': '1;1;1;1', 'Regional Prompt': "The ancient city in spring, with cherry blossoms adorning the architecture and soft pink hues radiating a renewal of life. BREAK\nThe city in summer, showcasing the vibrant sunlight casting golden glows on the rooftops, and streets teeming with activity beneath verdant trees. BREAK\nAutumn within the city, where fiery reds and oranges of foliage frame the traditional buildings, and a coolness settles in the air. BREAK\nWinter's gentle embrace, as the city slows under a blanket of pristine snow, with delicate icicles hanging from eaves and a peaceful silence."}
select_checkpoint: animagine-xl.safetensors [6f4f816f9d]
process_script_args (True, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1;1;1;1', 0.3, False, False, False, 'Attention', [False], 0, 0, 0.4, None, 0, 0, False)
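
The `Split ratio` and `Regional Prompt` lines in this log are what drive the regional generation. As an illustration only, the sketch below pulls them out of one such output block and converts the weights into pixel ranges. The names `parse_plan` and `region_spans` are hypothetical, the parser assumes a single output block with a flat ratio such as `1;1;1;1`, and the proportional split along one axis is my assumption about how the weights are interpreted, not a description of the tool's internals.

```python
import re
from typing import List, Tuple


def parse_plan(text: str) -> Tuple[List[float], List[str]]:
    """Extract region weights and BREAK-separated prompts from one log block."""
    ratio = re.search(r"^Split ratio:\s*(.+)$", text, flags=re.MULTILINE)
    prompt = re.search(
        r"^Regional Prompt:\s*(.+)", text, flags=re.MULTILINE | re.DOTALL
    )
    if ratio is None or prompt is None:
        raise ValueError("log block lacks 'Split ratio' or 'Regional Prompt'")
    weights = [float(w) for w in ratio.group(1).strip().split(";")]
    regions = [part.strip() for part in prompt.group(1).split("BREAK")]
    if len(weights) != len(regions):
        raise ValueError("number of weights and regional prompts differ")
    return weights, regions


def region_spans(weights: List[float], total_px: int) -> List[Tuple[int, int]]:
    """Proportional [start, end) pixel ranges along a single axis (assumed)."""
    total = sum(weights)
    edges = [
        round(total_px * sum(weights[:i]) / total)
        for i in range(len(weights) + 1)
    ]
    return list(zip(edges[:-1], edges[1:]))


sample = (
    "Split ratio: 1;1;1;1\n"
    "Regional Prompt: spring city BREAK\n"
    "summer city BREAK\n"
    "autumn city BREAK\n"
    "winter city"
)
weights, regions = parse_plan(sample)
for span, region in zip(region_spans(weights, 1024), regions):
    print(span, region)
```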

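Is RPG, the Headline Feature of RPG-DiffusionMaster, Really That Powerful?

To find out whether RPG is really as impressive as claimed, I had RPG-DiffusionMaster generate the following three images.

Image containing text (a signboard for a Chinese restaurant)
AI beauty
Web ad (a web ad for a soccer school)

# Image containing text
Signboard of a Chinese restaurant

# AI beauty
Photo of a beautiful woman

# Web ad
Soccer School Web Ad Creative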
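
To run all three prompts with the same settings as the earlier command, the invocations can be assembled in a loop. This is a sketch: the flags are the ones used in the command shown earlier in this article, while the dictionary keys and the `build_command` helper are names invented for this example. The API key is left as a placeholder.

```python
import shlex
from typing import List

PROMPTS = {
    "signboard": "Signboard of a Chinese restaurant",
    "portrait": "Photo of a beautiful woman",
    "web_ad": "Soccer School Web Ad Creative",
}


def build_command(prompt: str, api_key: str) -> List[str]:
    """Build the argument list for one RPG.py run (flags as used earlier)."""
    return [
        "python", "RPG.py",
        "--user_prompt", prompt,
        "--model_name", "animagine-xl.safetensors",
        "--version_number", "0",
        "--api_key", api_key,
        "--use_gpt",
    ]


for label, prompt in PROMPTS.items():
    command = build_command(prompt, "<Your OpenAI API key>")
    # Print a copy-pasteable shell line; quoting is handled by shlex.
    print(label, "->", shlex.join(command))
```

Passing each list to `subprocess.run(command, check=True)` from the repository root would execute the runs one after another.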

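These kinds of images are generally considered hard for AI to generate. If the model can produce all three at good quality, its ability will be hard to dispute. So, here are the results.

The AI beauty is reasonably high quality, but the signboard with text and the web ad are underwhelming.

Next, I had Stable Diffusion and DALL-E 3 tackle the same tasks.

Results from Stable Diffusion.

Results from DALL-E 3.

The verdict: DALL-E 3 delivers the highest quality! (lol)

As for RPG-DiffusionMaster itself, my honest impression is that its performance still has a long way to go.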

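Summary

RPG-DiffusionMaster is a text-to-image model whose performance is boosted by a technique called RPG.

When I tested it, my honest impression was that its performance still has a long way to go.

RPG combines prompt summarization by a multimodal LLM with prompt engineering techniques such as CoT in order to achieve higher-performance image generation. Multimodal AI and prompt engineering will likely continue to be applied to text-to-image generation.

Perhaps, by also combining prompt input via voice or images, it may become possible to generate images that are almost indistinguishable from real photographs.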
