AI-Media2Doc (Author: hanshuaikang): AI Video Graphic Creation Assistant is a web tool built on large AI models. It requires no login or registration, the front end and back end can both be deployed locally, and it lets you convert video/audio into styled documents at very low cost.
1. What is the project?
From its README:
“One-click convert audio and video into various styles of documents such as Xiaohongshu/official accounts/knowledge notes/mind maps/video subtitles.”
“AI Video Image Creation Assistant is a web tool based on AI large models that transforms video and audio into various styles of documents with one click.”
In other words, its core purpose is to convert multimedia content (audio/video) into structured text/document output in a variety of styles (suited to official accounts, Xiaohongshu, notes, mind maps, subtitles, etc.).
It is open source (MIT licensed) and supports on-premises deployment, which means you don’t have to use cloud services or third-party platforms.
It contains front-end + back-end + deployment script (Docker) etc.
2. Main functions/features
The project lists its core features in the README; I list and interpret them below:
| Function | Explanation / Details |
|---|---|
| Completely open source + on-premises | With an MIT license, users can deploy it on their own servers or local machines without relying on external services. |
| Privacy protection | No login is required, and task records are saved locally and are not uploaded to external servers. |
| Front-end processing | Uses ffmpeg.wasm to process multimedia in the browser, so users do not need to install ffmpeg locally. |
| Multiple document style support | It supports output in the form of Xiaohongshu style, official account style, knowledge notes, mind maps, content summary, etc. |
| AI conversation / Q&A interaction | After conversion, you can ask follow-up questions based on the video content. |
| Subtitle export | The result can be exported as a subtitle file (e.g. SRT, VTT, etc.) for direct use in the video. |
| Smart screenshot illustration | Automatically captures keyframe screenshots based on subtitle information and inserts them into the article, producing an illustrated result without relying on large vision models. |
| Customizable prompts | On the front-end, prompts can be customized to suit different style or formatting needs. |
| One-click deployment / Docker support | Docker images, docker-compose, etc. are available for quick deployment. |
| Access password | After deployment, you can set an access password on the front end to control visitor access. |
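The subtitle-export feature, conceptually, just serializes timed transcript segments into the SRT format. Here is a minimal sketch; the segment structure and function names are my own illustration, not the project's actual code:

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Serialize (start, end, text) tuples into SRT subtitle text."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello, welcome."), (2.5, 5.0, "Today we discuss AI.")]))
```

The same segment list could be re-serialized to WebVTT by swapping the comma for a dot in timestamps and prepending a `WEBVTT` header.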
In addition, its “future plans” mention support for fast-whisper local models for speech recognition, further reducing dependence on, and the cost of, external services.
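The “customizable prompts” feature presumably boils down to a per-style system prompt sent to the LLM together with the transcript. A hedged sketch of what such templates might look like (the style names and wording here are illustrative, not the project's actual prompts):

```python
# Illustrative per-style prompt templates; the real project lets users
# edit prompts in the front end.
STYLE_PROMPTS = {
    "xiaohongshu": "Rewrite the transcript as a lively Xiaohongshu post with short paragraphs.",
    "notes": "Condense the transcript into structured knowledge notes with headings and bullets.",
    "mindmap": "Extract a hierarchical outline from the transcript for rendering as a mind map.",
}

def build_messages(style: str, transcript: str) -> list:
    """Assemble a chat-completion message list for the chosen style."""
    return [
        {"role": "system", "content": STYLE_PROMPTS[style]},
        {"role": "user", "content": transcript},
    ]

msgs = build_messages("notes", "Today we cover three basics of machine learning...")
print(msgs[0]["content"])
```

Adding a new output style then only requires adding one entry to the template table, which matches the “customizable prompts” design.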
3. Workflow/architecture
The README also includes a “Process/Architecture” diagram of the overall flow. Here is a plausible flow, based on the README plus common practice in similar projects:
- The front end receives the video/audio upload
  - Audio/video decoding and preprocessing (e.g., segmentation, format conversion)
  - The front end handles some conversion tasks with ffmpeg.wasm
- Speech-to-text
  - Transcribe the audio track to text (possibly Whisper, an OpenAI API, etc.)
- Text comprehension / summarization / restructuring
  - A large language model summarizes, structures, segments, and stylizes the transcript
  - There may also be Q&A, supplementation, polishing, etc.
- Illustration / smart screenshots
  - Capture video keyframes at subtitle or key-sentence timestamps and insert them into the text
- Output in a variety of formats/styles
  - Lay out and reorganize the content for the target style (Xiaohongshu / official account / notes / mind map / subtitles, etc.)
  - Generate downloadable documents, subtitle files, illustrated articles, and more
- Front-end display / export / interaction
  - Users view results, ask follow-up questions, adjust styles, and export files on the front end
The backend is responsible for model invocation, text processing, storage, access control, etc.
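The smart-screenshot step above can be approximated with plain ffmpeg: seek to a subtitle's timestamp and grab a single frame. A sketch of building such a command (the paths and midpoint heuristic are my assumptions, and note the project reportedly does this in the browser via ffmpeg.wasm rather than the CLI shown here):

```python
def screenshot_cmd(video: str, timestamp: float, out_img: str) -> list:
    """Build an ffmpeg command that grabs one frame at `timestamp` seconds."""
    return [
        "ffmpeg",
        "-ss", f"{timestamp:.3f}",  # seek before the input for fast seeking
        "-i", video,
        "-frames:v", "1",           # capture a single frame
        "-q:v", "2",                # high-quality JPEG output
        out_img,
    ]

# e.g. grab a frame at the midpoint of a subtitle shown from 2.5 s to 5.0 s
cmd = screenshot_cmd("talk.mp4", (2.5 + 5.0) / 2, "frame_001.jpg")
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a backend; picking the midpoint of each subtitle's time span is one simple way to align screenshots with the text they illustrate.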
4. Advantages and limitations/risks
Pros:
- One-stop: From audio and video to a variety of documents and subtitles, all in one place.
- On-premises + privacy: Particularly attractive for users who don’t want to upload their audio and video content to the cloud.
- Diverse Styles: Output document styles that adapt to multiple platforms.
- Interactive: the AI follow-up Q&A feature enhances its utility.
- Automatic Illustration: Generate graphic effects without additional image models.
- Open Source / Customizable: Users can modify the prompt or extend the functionality according to their needs.
Limitations/Risks:
- Recognition/semantic errors: speech recognition or text-understanding modules can make mistakes, especially with noisy audio.
- Quality is limited by the model: The quality of the output and the degree of stylization depend on the LLM or large model capabilities used.
- Resource Consumption / Performance: CPU/GPU may be demanding when deploying on-premises, especially when processing video and model inference.
- Screenshot/keyframe judgment errors: automatic screenshots may capture frames that are visually or semantically inappropriate.
- Insufficient style adaptation: Different styles, especially platform specifications, can be difficult to fully adapt automatically.
- Dependence on external models/interfaces: If speech recognition/text processing relies on cloud APIs, there will be cost and privacy considerations.
- Front-end compatibility/browser performance limitations: Using wasm, front-end processing has performance and environmental compatibility challenges.
5. Applicable scenarios & inapplicable scenarios
Applicable scenarios
- Content creators / self-media operators who want to convert video, livestream, or audio content into text/graphic posts for platforms such as official accounts and Xiaohongshu.
- Knowledge Management / Note-Taking Organization: Convert course videos and interview audios into notes, maps, and summaries.
- Users who want to process media content in a local, controlled environment and are unwilling to upload it to third-party platforms.
- You need to quickly generate subtitles/text/summaries of the video for subsequent editing and processing.
Not applicable / Scenarios where performance may be poor
- Poor audio/video quality: heavy background noise, strong accents, and overlapping or simultaneous speakers can hurt recognition and understanding.
- Extremely strict requirements for output style/formatting/customization: Automation tools may struggle to meet refined typography/style standards.
- Scenarios with strict real-time requirements (e.g., live-stream-to-text, real-time translation): this project's pipeline likely introduces too much latency.
- Extremely resource-limited environments (low-spec PC, no GPU) may not run it efficiently.
GitHub: https://github.com/hanshuaikang/AI-Media2Doc
YouTube: