[Reader-LM] Turning HTML into Markdown! A New Technique for Improving ChatGPT Performance

2024-09-24

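Modern web content is full of noise and extraneous information, which makes it hard to extract the information you actually need.

Against this backdrop, Jina AI released Reader-LM on September 11, 2024: a model that automatically strips unnecessary elements from HTML and converts the result into clean Markdown.

This article looks at how the model, which needs no hand-written rules or manual work, streamlines web content conversion and what that means for data analysis and content management.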

Overview of Reader-LM

Reader-LM is a small, efficient language model developed by Jina AI that specializes in converting HTML to Markdown.

Announcing reader-lm-0.5b and reader-lm-1.5b, https://t.co/jnxcxPzndy two Small Language Models (SLMs) inspired by Jina Reader, and specifically trained to generate clean markdown directly from noisy raw HTML. Both models are multilingual and support a context length of up to…
— Jina AI (@JinaAI_) September 11, 2024

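Unlike conventional approaches, the model does not rely on regular expressions or complex rule sets. It converts noisy HTML directly into clean, structured Markdown.

Because it removes unnecessary elements such as ads and scripts and keeps only the information that matters, it can make web scraping, data analysis, and content conversion considerably more efficient.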

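The main functions and features of Reader-LM are summarized below.

Automatic HTML-to-Markdown conversion
- Reader-LM uses its language-modeling ability to clean up web page noise (ads, scripts, navigation bars, and so on) automatically while preserving structure, and converts the HTML to Markdown.
Compact model sizes and a long context window
- Reader-LM comes in two models: Reader-LM-0.5B and Reader-LM-1.5B.
- The 0.5B model is lightweight, so it runs in low-resource environments and works smoothly even on free GPU environments such as Google Colab.
- The 1.5B model has more parameters and performs better on demanding tasks where accuracy matters most.
High performance
- Reader-LM achieves a ROUGE-L score of 0.72 on Markdown conversion.
- It can handle a long context of 256K tokens, so it can also convert large HTML documents and complex web pages.
Multilingual support
- Reader-LM can process HTML content in a variety of languages, which makes it suitable for international web data.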

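Performance is the most notable point. On ROUGE-L, a metric that scores how much of the reference text appears in the output in the same order (it is based on the longest common subsequence), Reader-LM scores well above GPT-4o and Gemini 1.5 Pro.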
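To make the ROUGE-L metric concrete, here is a minimal sketch of how an F1-style ROUGE-L score can be computed from the longest common subsequence of two texts. It assumes plain whitespace tokenization and equal weighting of precision and recall; the evaluation Jina AI actually ran may tokenize and weight differently, and the two sample strings are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (a simplifying assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference Markdown and model output
reference = "# Hello world\n\nThis is a test page"
candidate = "# Hello world\n\nThis is a page"
score = rouge_l_f1(reference, candidate)
print(round(score, 3))
```

A score of 1.0 means the output reproduces the reference token for token in order; dropped or reordered tokens lower it.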

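Reader-LM was trained in two stages.

The first stage used relatively short, simple HTML files: a total of 1.5 billion training tokens, with sequences of up to 32K tokens.

The second stage targeted longer, more complex HTML documents: the model was fine-tuned on 1.2 billion tokens with sequences of up to 128K tokens.

As a result, Reader-LM is more accurate on large HTML files and keeps its output consistent while removing noise and preserving Markdown structure.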

Reader-LM License

Reader-LM is released under the CC-BY-NC-4.0 license.

Note that this restricts the model to non-commercial use.

Use	Permitted
Commercial use	❌
Modification	⭕️
Distribution	⭕️ (non-commercial purposes only)
Patent use	–
Private use	⭕️

How to Use Reader-LM

A Google Colab notebook for Reader-LM is publicly available, which makes it easy to try out.

This section also explains how to run it locally.

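First, install transformers at version 4.43.4 or earlier. The version specifier is quoted so that the shell does not treat `<` as a redirect.

pip install "transformers<=4.43.4"

Then run the following code to load the model and use it.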

# Requires: pip install "transformers<=4.43.4"
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"

device = "cuda"  # use "cuda" for GPU or "cpu" for CPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Example HTML content to convert
html_content = "<html><body><h1>Hello, world!</h1></body></html>"

# Wrap the raw HTML in a single user message and render the chat template as text
messages = [{"role": "user", "content": html_content}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

print(input_text)

# Tokenize the prompt and generate with greedy decoding (do_sample=False)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

# The decoded string contains the prompt followed by the generated Markdown
print(tokenizer.decode(outputs[0]))
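Because the decoded string contains the prompt as well as the generated text, you will usually want to keep only the model's reply. The sketch below does that with plain string handling. It assumes the chat template wraps turns in ChatML-style markers (`<|im_start|>` and `<|im_end|>`); that is an assumption, not something confirmed here, so check the `input_text` printed above and adjust the markers if yours differ. The helper name `extract_assistant_text` and the sample string are hypothetical.

```python
def extract_assistant_text(decoded, start="<|im_start|>assistant", end="<|im_end|>"):
    """Return the text after the last assistant marker, without the end marker.

    The marker strings are assumptions about the chat template; if no
    assistant marker is found, the input is returned stripped but unchanged.
    """
    idx = decoded.rfind(start)
    if idx == -1:
        return decoded.strip()
    body = decoded[idx + len(start):]
    end_idx = body.find(end)
    if end_idx != -1:
        body = body[:end_idx]
    return body.strip()


# Hypothetical decoded output in the assumed format
sample = (
    "<|im_start|>user\n<html><body><h1>Hello, world!</h1></body></html><|im_end|>\n"
    "<|im_start|>assistant\n# Hello, world!<|im_end|>"
)
markdown = extract_assistant_text(sample)
print(markdown)
```

In the script above, this would be applied to the string returned by `tokenizer.decode(outputs[0])`.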

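Specs Needed to Run Reader-LM

Because Reader-LM is a lightweight model, its requirements are relatively modest, but a high-performance GPU is recommended for fast, stable inference.

The expected requirements are summarized below.

GPU: The 0.5B model runs with less VRAM, but the 1.5B model needs more, so 24 GB or more is recommended. GPUs such as the NVIDIA A100, V100, RTX 3090, or RTX 4090 can be expected to improve inference speed.
RAM: At least 16 GB of RAM is required, and 32 GB or more is recommended when handling larger datasets or models.
Storage: At least several tens of GB of storage are needed to hold the model and data.

On Colab, the free T4 GPU reportedly works, but a paid high-performance GPU such as an L4 or A100 is recommended for more stable operation.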
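As a rough cross-check on these figures, the memory needed just to hold the weights can be estimated as parameter count times bytes per parameter. The sketch below uses the nominal sizes 0.5B and 1.5B as approximations (the exact parameter counts are not given in this article). It deliberately ignores activations and the attention key-value cache, which grow with input length.

```python
# Bytes used per parameter for common dtypes
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype="float16"):
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal parameter counts, assumed from the model names
for name, params in [("reader-lm-0.5b", 0.5e9), ("reader-lm-1.5b", 1.5e9)]:
    for dtype in ("float32", "float16"):
        print(f"{name} {dtype}: {weight_memory_gib(params, dtype):.2f} GiB")
```

By this estimate the weights of the 1.5B model take only a few GiB, so the VRAM needed in practice depends heavily on how long the input HTML is.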

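Feeding a Reader-LM Markdown Conversion to GPT-4o

From here, we use Reader-LM to convert the site below to Markdown, give the result to GPT-4o, and ask questions about the content.

We also compare this against giving GPT-4o the original HTML as-is, to see whether accuracy and the content of the responses differ.

Hacker News

We originally wanted to run this test on one of our own blog articles. However, perhaps because the page contained too much content, the model either repeated the same output over and over or failed to generate anything at all.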

For that reason, this test uses the site above, which was set as the default in the Colab notebook and therefore seemed likely to work.
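A repetition failure of this kind can be caught automatically before the output is passed downstream. The following is a minimal sketch, not part of Reader-LM or the Colab notebook: it counts how often each non-trivial line occurs and flags lines that appear at least `threshold` times. The function name, the `min_length` and `threshold` defaults, and the sample strings are all hypothetical choices for illustration.

```python
from collections import Counter


def find_repeated_lines(markdown, min_length=20, threshold=3):
    """Return lines of at least `min_length` chars that occur `threshold`+ times."""
    counts = Counter(
        line.strip()
        for line in markdown.splitlines()
        if len(line.strip()) >= min_length
    )
    return {line: n for line, n in counts.items() if n >= threshold}


# Hypothetical outputs: one stuck in a loop, one healthy
looping = "\n".join(
    ["Reader-LM converts HTML to Markdown."] * 5 + ["A unique closing line here."]
)
healthy = "First paragraph of the post.\nSecond paragraph of the post."

print(find_repeated_lines(looping))
print(find_repeated_lines(healthy))
```

Pages that legitimately repeat long lines would trigger false positives, so the thresholds need tuning per site.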

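The questions are as follows.

Latest popular articles
The most-commented article on Hacker News
A summary of a specific listed article

First, let's convert the page to Markdown with Reader-LM.

The model is Reader-LM-1.5B, run in the Colab notebook mentioned earlier on an A100 GPU.

For this run, max_tokens is set to 8192.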

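Once that is done, all that remains is to run the cells in order and specify the URL to convert.

Incidentally, resource usage during the conversion showed 35.3 GB of GPU RAM in use, so with the 1.5B model a T4 GPU may run out of memory.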

Here is the result of the Markdown conversion.

Hacker News new | past | comments | ask | show | jobs | submit    login



1.    
    The Department of Everything – Dispatches from the telephone reference desk (hedgehogreview.com)
    34 points by pseudolus 2 hours ago | hide | 13 comments

2.    
    Show HN: Meet.hn – Meet the Hacker News community in your city
    208 points by sirbg 3 hours ago | hide | 100 comments

3.    
    They don't make readers like they used to (antipope.org)
    68 points by andyjohnson0 2 hours ago | hide | 53 comments

4.    
    Lisp implemented in Rust macros (github.com/ryanwelly)
    256 points by quasigloam 18 hours ago | hide | 62 comments

5.    
    Why use metaphors in conflicts? Because understanding is remembering in disguise (2009) (westallen.typepad.com)
    46 points by yamrzou 6 hours ago | hide | 37 comments

6.    
    The data on extreme human ageing is rotten from the inside out (theconversation.com)
    107 points by enopod_ 3 hours ago | hide | 63 comments

7.    
    Porting SBCL to the Nintendo Switch (tymoon.eu)
    364 points by todsacerdoti 1 day ago | hide | 65 comments

8.    
    Techniques I use to create a great user experience for shell scripts (nochlin.com)
    345 points by hundredwatt 19 hours ago | hide | 183 comments

9.    Icon (real creators, AI ads) is hiring founding engineers (NYC) (icon.me)
    3 hours ago | hide

10.    
    My 71 TiB ZFS NAS After 10 Years and Zero Drive Failures (louwrentius.com)
    352 points by louwrentius 16 hours ago | hide | 245 comments

11.    
    The Legend of Holy Sword: An Immersive Experience for Concentration Enhancement (arxiv.org)
    114 points by PaulHoule 16 hours ago | hide | 54 comments

12.    
    Intents: The Endgame for Computing Interfaces? (http200.substack.com)
    18 points by dpim 6 hours ago | hide | 7 comments

13.    
    Billiards is a good game (1975) (uchicago.edu)
    54 points by chucksmash 11 hours ago | hide | 25 comments

14.    
    Void captures over a million Android TV boxes (drweb.com)
    122 points by Katana_zero 13 hours ago | hide | 76 comments

15.    
    Defend against vampires with 10 gbps network encryption (synacktiv.com)
    224 points by alxjsn 1 day ago | hide | 66 comments

16.    
    The Podcast Index (podcastindex.org)
    11 points by logn 5 hours ago | discuss

17.    
    CrowdStrike ex-employees: 'Quality control was not part of our process' (semafor.com)
    476 points by everybodyknows 19 hours ago | hide | 235 comments

18.    
    The Cellular Secret to Resisting the Pressure of the Deep Sea (quantamagazine.org)
    69 points by Brajeshwar 19 hours ago | hide | 11 comments

19.    
    Haskell Postgres Stored Procedures (github.com/ed-o-saurus)
    45 points by runeks 14 hours ago | hide | 10 comments

20.    
    Haskell Postgres Stored Procedures (github.com/ed-o-saurus)
    45 points by runeks 14 hours ago | hide | 10 comments

21.    
    Solving the bowling problem with dynamic programming (simonensemble.github.io)
    17 points by signa11 8 hours ago | hide | discuss

22.    
    Grounding AI in reality with a little help from Data Commons (research.google)
    83 points by throwaway888abc 18 hours ago | hide | 12 comments

23.    
    Do Animals Know That They Will Die? (theatlantic.com)
    37 points by XzetaU8 11 hours ago | hide | 10 comments

24.    
    Zero-Click Calendar invite vulnerability chain in macOS (mikko-kenttala.medium.com)
    422 points by jviide 22 hours ago | hide | 150 comments

25.    
    Coupled Oscillators for Computing (2020) (aip.org)
    35 points by nickpsecurity 15 hours ago | hide | 8 comments

26.    
    Ambient Meaning and the Algorithms of Culture (thelastwave.substack.com)
    22 points by Johanam 13 hours ago | hide | 7 comments

27.    
    Gottfried Mind, the Raphael of Cats (publicdomainreview.org)
    37 points by prismatic 17 hours ago | hide | discuss

28.    
    Martin Luther King Jr. In Berlin (german-way.com)
    85 points by Tomte 1 day ago | hide | 24 comments

29.    
    JG Ballard's Apocalyptic Art (newstatesman.com)
    108 points by thinkingemote 1 day ago | hide | 29 comments

30.    
    FDA Authorizes First Over-the-Counter Hearing Aid Software (fda.gov)
    412 points by mgerdts 1 day ago | hide | 227 comments


    More




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact


Search:

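The page appears to have been converted to Markdown correctly.

Now let's run the test.

The Most-Commented Article on Hacker News

Next, let's ask a more specific question to narrow the target.

We enter the following prompt.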

What is the article on Hacker News that has received the most comments? And how many replies have there been to the comment that has received the most replies?

The results were as follows.

Result with Markdown

The article on Hacker News that has received the most comments is titled **"My 71 TiB ZFS NAS After 10 Years and Zero Drive Failures"** from louwrentius.com. It has received 245 comments.

I will now check how many replies the most replied comment has received.

The dataset you provided contains information about articles, their points, and comment counts, but it does not include the details of individual comments or their replies. Therefore, I can't determine the number of replies to the most replied comment from this dataset.

If you have a more detailed file or another source, I can help analyze that data as well!

Result with HTML

The article on Hacker News that has received the most comments is:

**[Show HN: Meet.hn – Meet the Hacker News community in your city](https://news.ycombinator.com/item?id=41539125)**  
This article has **126 comments**.

The comment with the most replies is from the same article, and it has **246 replies**【4†source】.

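The Markdown input produced the better result: GPT-4o understood the question correctly and returned accurate, appropriate information.

It gave the correct answer we were looking for, and where the information was not available it did not make something up.

With the HTML input, by contrast, the answer was simply wrong.

It is surprising to see this much difference from the same model, and a high-performance one like GPT-4o at that, but it suggests how much the Markdown conversion by Reader-LM helps.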
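The Markdown answer can also be checked mechanically, since the converted list has a regular shape: a title line followed by a line ending in `| N comments`. The sketch below parses a few entries copied from the conversion result above and returns the one with the most comments. The regular expression and the helper name `most_commented` are assumptions based on that output, not part of any tool used in this article.

```python
import re

# Matches the trailing "| 245 comments" on a metadata line
COMMENTS = re.compile(r"\|\s*(\d+)\s+comments\s*$")


def most_commented(markdown):
    """Return (title, comment_count) for the entry with the most comments."""
    best = (None, -1)
    previous = ""  # last non-empty line seen, i.e. the candidate title
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = COMMENTS.search(line)
        if match and int(match.group(1)) > best[1]:
            best = (previous, int(match.group(1)))
        previous = line
    return best


# Three entries copied from the Markdown conversion result
excerpt = """
2.    
    Show HN: Meet.hn – Meet the Hacker News community in your city
    208 points by sirbg 3 hours ago | hide | 100 comments

10.    
    My 71 TiB ZFS NAS After 10 Years and Zero Drive Failures (louwrentius.com)
    352 points by louwrentius 16 hours ago | hide | 245 comments

17.    
    CrowdStrike ex-employees: 'Quality control was not part of our process' (semafor.com)
    476 points by everybodyknows 19 hours ago | hide | 235 comments
"""
title, count = most_commented(excerpt)
print(title, count)
```

On this excerpt the top entry is the ZFS NAS article with 245 comments, which matches the answer GPT-4o gave from the Markdown input.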

Summary of a Specific Listed Article

Finally, we specify one article and ask for a summary of it.

We enter the following prompt.

Summarize the article [Void captures over a million Android TV boxes] from Hacker News.

The results were as follows.

Result with Markdown

The article titled **"Void captures over a million Android TV boxes"** in the Markdown file is sourced from *drweb.com*, having received **122 points** and **76 comments** on Hacker News, posted by **Katana_zero** about 13 hours ago.

However, there is no detailed description of the article's content in the Markdown file, which primarily includes basic metadata such as title, source, points, and comments. For a complete summary, I would recommend referring to the actual content from external sources like Dr.Web's site, which details how the malware Void infected over 1.3 million Android TV devices.

Result with HTML

The article **[Void captures over a million Android TV boxes](https://news.drweb.com/show/?i=14900)** discusses a large-scale malware infection targeting Android TV boxes. The malware, known as **Void**, has infected over a million devices. These Android TV boxes are primarily used for illegal streaming, and the malware enables remote control of the devices for various malicious purposes, such as creating botnets for Distributed Denial of Service (DDoS) attacks or mining cryptocurrency.

The article highlights the growing concern around security vulnerabilities in smart devices, particularly those with outdated or insufficient security measures. It also emphasizes the importance of vigilance and updates for IoT devices to prevent similar large-scale attacks.

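Both inputs produced the answers we expected.

With Markdown, there is no information about the article body, so GPT-4o reported as much as it could from what the Markdown contains.

With HTML, it output a correct summary of "Void captures over a million Android TV boxes", a highly accurate answer in sharp contrast to the previous question.

In this test, the Markdown that Reader-LM produced from HTML was high quality with the noise removed, and the LLM's answers were more accurate than when it referred to the raw HTML.

However, although the model is advertised as able to convert large HTML documents, it failed to convert a blog article that was not especially large.

If it can only convert simple sites with relatively little content, its practical uses currently seem limited.

We look forward to further improvements!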

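The Value Reader-LM Brings

Reader-LM is a tool that automatically removes noise from HTML and converts it to Markdown, which makes web content easier to manage and analyze.

It is attracting attention from users for the following points in particular.

Automatic noise removal and Markdown conversion strip out ads, scripts, and similar elements and extract clean content.
Support for large HTML documents and multiple languages lets it handle international and complex data.
An accurate yet lightweight model delivers strong performance even in resource-constrained environments.

With these characteristics, it can streamline web scraping, content conversion, and data analysis, and it looks like a very useful tool for engineers, researchers, and writers.

Reader-LM opens up new possibilities for information processing, and its use is likely to spread to more fields.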
