Wikimedia made Wikipedia data more AI-friendly
Kyiv • UNN
Wikimedia's German chapter has presented a project that allows AI systems to work more easily with Wikipedia and Wikidata. Thanks to semantic search, almost 120 million records are now accessible by meaning, not just by keywords.

A new project presented in Germany will make it easier for artificial intelligence systems to work with Wikipedia and Wikidata. Thanks to semantic search, almost 120 million records can now be found by meaning, not just by keywords, UNN writes with reference to TechCrunch.
Details
"The system, called the 'Wikidata Embedding Project,' applies vector semantic search — a technique that helps computers understand the meaning and relationships between words — to existing data in Wikipedia and its sister platforms, comprising nearly 120 million records," the publication writes.
Combined with support for the new Model Context Protocol (MCP) — a standard that lets AI systems work more effectively with data sources — "the project opens up the possibility of performing natural language queries directly to LLMs." According to the announcement, the initiative was implemented by Wikimedia Deutschland, the German Wikimedia chapter, in cooperation with the neural search company Jina.AI and DataStax, which specializes in real-time data processing.
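To make the mechanics concrete, here is a minimal sketch of what vector semantic search does, under the assumption of an off-the-shelf open-source embedding model (the project itself works with Jina.AI embeddings; the records and query below are invented for illustration):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Hypothetical records standing in for Wikidata entries.
records = [
    "Marie Curie, physicist and chemist, pioneer of radioactivity research",
    "Bell Labs, industrial research laboratory in Murray Hill, New Jersey",
    "Photosynthesis, the process plants use to convert light into energy",
]

# Map the records and the query into the same vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")
record_vecs = model.encode(records, normalize_embeddings=True)
query_vec = model.encode(["famous nuclear scientist"], normalize_embeddings=True)[0]

# With normalized vectors, the dot product is cosine similarity.
# The Curie entry ranks first even though the word "scientist" never
# appears in it -- matching by meaning is the point of semantic search.
scores = record_vecs @ query_vec
for score, text in sorted(zip(scores, records), reverse=True):
    print(f"{score:.3f}  {text}")
```

A keyword index would miss the Curie record entirely; the embedding-based ranking surfaces it first.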
How it worked before
Wikidata has offered machine-readable data from Wikimedia resources for years, but previous tools only allowed keyword search and queries in SPARQL, a specialized query language. The new system is a better fit for retrieval-augmented generation (RAG) systems, which let AI models pull in external information, giving developers the ability to ground their models in knowledge verified by Wikipedia editors.
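For comparison, the pre-existing route looks roughly like this: an exact, structured SPARQL query sent to Wikidata's public query service. The endpoint and query syntax are standard Wikidata tooling; the Python wrapper is just one common way to call it:

```python
# pip install requests
import requests

# Find people whose occupation (property P106) is "scientist" (item Q901).
# SPARQL requires knowing these identifiers up front -- there is no
# "close enough" matching, which is what semantic search adds.
SPARQL = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "wikidata-example/0.1 (demo script)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```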
The data is also structured to provide important semantic context. For example, querying the database for the word "scientist" will yield lists of prominent nuclear scientists, as well as scientists who worked at Bell Labs. The results also include translations of the word "scientist" into different languages, images of scientists at work, and related concepts such as "researcher" and "scholar."
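That retrieved context is exactly what a RAG pipeline hands to a language model. Below is a toy sketch of the final assembly step, with a hypothetical prompt template and hand-picked stand-ins for retrieved records:

```python
# Toy RAG assembly step: pack retrieved records into the prompt so the
# model answers from Wikidata-derived facts rather than from memory.
# The template and the "retrieved" records are invented for illustration.
def build_rag_prompt(question: str, retrieved: list[str]) -> str:
    context = "\n".join(f"- {fact}" for fact in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

retrieved = [
    "Claude Shannon, mathematician, worked at Bell Labs from 1941",
    "John Bardeen, physicist, co-invented the transistor at Bell Labs",
]
print(build_rag_prompt("Which scientists worked at Bell Labs?", retrieved))
```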
The essence of the new project
The new project arrives as AI developers struggle to find high-quality data sources for fine-tuning models. Training systems themselves have grown more complex — often assembled as elaborate training environments rather than simple datasets — but they still require carefully curated data to work properly.
For AI systems that require maximum accuracy, the need for verified and reliable data is particularly acute. And while Wikipedia is sometimes underestimated, its information is significantly more fact-oriented than general datasets like Common Crawl — a huge collection of web pages from across the internet, the publication says.
Finding quality data can, however, come at a steep price for AI labs. In August, for example, Anthropic agreed to settle a lawsuit brought by a group of authors whose works had been used as training material, paying $1.5 billion to avoid further claims.
In a press statement, Wikidata AI project manager Philippe Saadé emphasized the initiative's independence from the big AI labs and technology corporations.
"The launch of this Embedding Project shows that powerful artificial intelligence does not have to be controlled by a handful of companies. It can be open, collaborative, and built to serve everyone," Saadé said.