Document clustering | How AI identifies similar files

The World Wide Web is the largest shared information source. Despite the existence of search engines like Google, Yahoo!, and Bing, retrieval of specific pages or documents can be overwhelming. This is how AI makes searching for documents easy.

We know that artificial intelligence (AI) can recognise faces and other biometric data to find duplicates. AI also performs tasks based on voice commands. But could this technology help compare two PDF files? Here’s how AI helps do it.

How does it do it?

Researchers use document text clustering to segregate documents based on their content. They analyse the documents for clusters of similar words, phrases, and sentences.

This way of grouping and segregating data helps simplify the extraction process needed to pull relevant information, especially when the user is presented with large amounts of data.
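
To make the idea of comparing documents by their words concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used by any particular product: it counts the words in each text (a "bag of words") and scores two texts by cosine similarity. The function names and the three sample sentences are hypothetical, and the text is assumed to have already been extracted from the files.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count each word (a simple bag-of-words model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, standing in for text extracted from files.
doc_a = "The bank approved the loan application"
doc_b = "The loan application was approved by the bank"
doc_c = "Heavy rain is expected across the coast"

sim_ab = cosine_similarity(word_counts(doc_a), word_counts(doc_b))
sim_ac = cosine_similarity(word_counts(doc_a), word_counts(doc_c))
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

The two sentences about the loan share most of their words and score much higher than the unrelated pair. Real systems usually go further, for example by down-weighting very common words such as "the".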

Clustering techniques of this kind are commonly used in data analysis and mining, image analysis, data compression, and information retrieval.

Where is it used?

The World Wide Web is the largest shared information source. Despite the existence of search engines like Google, Yahoo!, and Bing, retrieval of specific pages or documents can be overwhelming.

So, algorithms use one or more methods to group documents on the web based on their content. Common clustering applications include Vivisimo, KartOO and DuckDuckGo.

Search result clustering involves grouping content based on parameters like hyperlinks, user’s context and web usage. The most common method employed by clustering engines is grouping of short text or snippets that hint at what the actual document contains, researchers said in a study titled ‘Web Search Result Clustering based on Heuristic Search and k-means.’

How does AI cluster documents in these applications?

There are two types of algorithms used to cluster documents – hierarchical clustering and non-hierarchical clustering.

A hierarchical clustering algorithm divides and aggregates documents in a predefined, hierarchical manner. Pairs of clusters of data objects in the hierarchy are then linked together. Although this system may be easy to read and understand, it may not be as efficient as non-hierarchical clustering. Clustering may also be difficult in cases where the data contains many errors.
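
The linking of pairs of clusters can be sketched in a few lines of Python. This is one possible bottom-up ("agglomerative") variant, chosen here for illustration rather than taken from the article's sources. The assumptions are that documents are compared by Jaccard distance on their sets of words and that the distance between two clusters is that of their closest members (single linkage). The function names and sample headlines are hypothetical.

```python
import re


def words(text):
    """The set of distinct lower-cased words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 - (shared words / all words); 0.0 means identical vocabularies."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerate(docs, target_clusters):
    """Start with one cluster per document and repeatedly merge the closest pair."""
    sets = [words(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    merges = []  # the hierarchy: which two clusters were linked, in order
    while len(clusters) > max(target_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (an assumption): use the closest pair of members.
                d = min(jaccard_distance(sets[x], sets[y])
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return clusters, merges


# Hypothetical headlines: two about finance, two about cricket.
docs = [
    "stock markets fall as banks report losses",
    "banks report losses and stock markets fall further",
    "team wins cricket final after tense chase",
    "cricket team wins final in tense last over",
]
clusters, merges = agglomerate(docs, target_clusters=2)
print(clusters)
```

Every round compares all remaining pairs of clusters, which is one reason this approach can be slower than non-hierarchical methods on large collections.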

Non-hierarchical clustering forms new clusters by merging and splitting existing ones. It is a relatively fast, reliable and stable clustering technique.
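
One widely used non-hierarchical method is k-means, which is also named in the title of the study cited above. The sketch below is a simplified illustration, not that study's algorithm: it turns each document into word counts, then alternates between assigning documents to the nearest cluster centre and moving each centre to the average of its members. The function names and sample snippets are hypothetical, and the starting centres are picked by hand as an assumption (implementations commonly choose them at random).

```python
import re
from collections import Counter


def vectorise(docs):
    """Turn each document into a list of word counts over a shared vocabulary."""
    counts = [Counter(re.findall(r"[a-z0-9]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[w]) for w in vocab] for c in counts]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, initial_centroids, max_iterations=100):
    """Return a cluster label for every point."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda k: squared_distance(p, centroids[k]))
                      for p in points]
        if new_labels == labels:
            break  # no document changed cluster, so the grouping is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical search-result snippets: two about finance, two about cricket.
snippets = [
    "stock markets fall as banks report losses",
    "team wins cricket final after tense chase",
    "banks report losses and stock markets fall further",
    "cricket team wins final in tense last over",
]
points = vectorise(snippets)
# Assumption: seed the two clusters with the first two snippets.
labels = k_means(points, initial_centroids=[points[0], points[1]])
print(labels)
```

Unlike the hierarchical approach, the number of clusters is fixed in advance and each pass only compares documents with the cluster centres, not with every other document.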
