
Document Management Made Easy: Making Company Documents Searchable with Typesense

Irem Ebenstein

Recently, a company approached us with hundreds of Word reports, and managing so many files and finding the right information had become too complicated. We were tasked with making these reports searchable to simplify their daily operations. In this blog post, I will show you how we solved this problem and the tools and techniques we used.

Background

The Problem with Word Documents

Many companies write documents, analyses, and reports in individual Word files and save them. While Word files are convenient for users in the short term, over time they accumulate into a multitude of isolated documents that sit in cloud or local storage like discarded items.

The typical problems with this setup are:

  1. Isolated Data Islands: The documents are stored in isolation and are not searchable together. Individual reports cannot be easily compared, and users have to click through hundreds of files to find the desired information.
  2. Difficult Versioning: Versioning must be done manually, leading to numerous filenames like “Report_2024_v9.docx”. This complicates oversight, and if a new interim version is forgotten, changes can be hard to revert.
  3. Error-Prone: Manual data entry in Word files leads to errors as there is no automatic validation. Rules for headings, formatting, or even just entering numbers and sources are not always adhered to, even with the highest self-discipline. Automatic data and content validation is mostly not possible.

Alternative: A Search Engine for Company Documents

With a small number of documents, the flexibility of Word outweighs these disadvantages. However, once hundreds of reports are involved, it’s time to find a solution that combines the advantages of Word with those of a database.

A database with a flexible web interface is the best solution. The database allows reports to be searched, filtered, and analyzed. The web interface provides a user-friendly way to browse the reports and create new ones with a high level of digital and even AI support.

A web interface can automatically check user inputs during entry, highlight errors, and perform calculations safely. This allows your employees to focus on the content.

The Project

One of the tasks that always excites me is text analysis. Whenever text data becomes an insurmountable problem for a company, it’s exciting to find a suitable solution.

In this case, hundreds of company reports awaited us. All these reports, created over the years with countless hours of work, could no longer be viewed and analyzed together. We decided to create an application for this company and prepare an environment where they could easily access the information, and we began our research.

In such projects, the first step is to plan what we want to see at the end and then look for the most suitable applications and put them into practice.

Step 1: Project Requirements

The company wanted to retrieve information from all reports based on specific keywords. The application should allow them to search hundreds of reports for specific words or phrases and easily compare the results, for example: “What was written about section X in all reports of the past three years?”

This led to an important decision in the implementation: Each chapter had to be uploaded as a separate document into the search engine. This way, we could directly display each chapter as a query result, along with the associated chapter information.

Step 2: Data Extraction from Word Documents

Direct Extraction from Word

For extraction, we first used python-docx. This Python package allows easy extraction of data from Word documents. Here is a simple example of how to extract content from a Word document:

from docx import Document

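# Load the Word document
doc_path = "path_to_report.docx"
doc = Document(doc_path)

# All text sections including headings are available under "paragraphs"
for paragraph in doc.paragraphs:
    print(paragraph.text)

# Tables are available separately under "tables"
for table in doc.tables:
    # Print the cell texts row by row instead of the table object itself
    for row in table.rows:
        print([cell.text for cell in row.cells])

# Inline images are available separately under "inline_shapes"
for image in doc.inline_shapes:
    # Each inline shape exposes its display size
    print(image.width, image.height)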

This package is powerful but not ideal for particularly large reports in which many headings, tables, and images are scattered throughout. As the code example above shows, texts, tables, and images are extracted separately, so determining the position of each element in the document is not always easy.
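One way to recover the original order, sketched here as an assumption rather than the approach taken in this project, is to walk the document body directly. A .docx file is a ZIP archive whose main part, word/document.xml, lists paragraphs (w:p) and tables (w:tbl) in document order. The helper name iter_body_blocks and the sample XML below are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def element_text(element):
    """Concatenate all text runs (w:t) found below an element."""
    return "".join(t.text or "" for t in element.iter(f"{W}t"))


def iter_body_blocks(document_xml):
    """Yield ("paragraph", text) or ("table", rows) in document order."""
    body = ET.fromstring(document_xml).find(f"{W}body")
    for child in body:
        if child.tag == f"{W}p":
            yield ("paragraph", element_text(child))
        elif child.tag == f"{W}tbl":
            rows = [
                [element_text(cell) for cell in row.findall(f"{W}tc")]
                for row in child.findall(f"{W}tr")
            ]
            yield ("table", rows)


# Hypothetical, heavily reduced document.xml for demonstration
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
    "<w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl>"
    "<w:p><w:r><w:t>Outro</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

blocks = list(iter_body_blocks(SAMPLE_XML))
```

For a real file, the XML could be read with zipfile.ZipFile(doc_path).read("word/document.xml") from the standard library. The sketch ignores images, nested tables, and content controls.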

Document Conversion to HTML

Since we wanted to present the chapter information in our application in HTML format, we had to convert each document to HTML first. This allowed us to retain the formatting and structures (tables, lists, images) contained in the Word document and offer the user the best of both worlds: full-text search in an appealing HTML presentation.

We used the Python package Mammoth to convert the document to HTML. Here is a simple example of how to perform the conversion with Mammoth:

import mammoth

def convert_doc_to_html(doc_path):
    # Mammoth expects the .docx file opened in binary mode
    with open(doc_path, "rb") as doc_file:
        result = mammoth.convert_to_html(doc_file)
        html_content = result.value  # The generated HTML
        return html_content

# Function usage
doc_path = "path_to_report.docx"
html_content = convert_doc_to_html(doc_path)
print(html_content)

Through this conversion, we could extract chapters as individual HTML elements and align them with the desired part in the search system.
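How the chapters are cut out of the HTML depends on the structure of the reports. As a minimal sketch, assume that every chapter starts with a Word "Heading 1" paragraph, which Mammoth renders as an h1 element by default. The function name split_html_into_chapters and the sample HTML are hypothetical. A production version would more likely use an HTML parser than regular expressions.

```python
import re

# Matches a top-level heading such as <h1>Summary</h1> (attributes allowed)
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
# Matches any tag, used to strip markup from the heading text
TAG_PATTERN = re.compile(r"<[^>]+>")


def split_html_into_chapters(html):
    """Return one {"section_name", "content"} dict per <h1> chapter."""
    chapters = []
    headings = list(H1_PATTERN.finditer(html))
    for index, heading in enumerate(headings):
        # A chapter runs from the end of its heading to the next heading
        start = heading.end()
        end = headings[index + 1].start() if index + 1 < len(headings) else len(html)
        chapters.append({
            "section_name": TAG_PATTERN.sub("", heading.group(1)).strip(),
            "content": html[start:end].strip(),
        })
    return chapters


# Hypothetical HTML as it might come out of the conversion step
sample_html = "<h1>Summary</h1><p>All good.</p><h1>Risks</h1><p>Few.</p>"
chapters = split_html_into_chapters(sample_html)
```

Each resulting dict corresponds to one chapter that can later be stored as a separate document in the search engine. Content before the first heading is dropped in this sketch.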

Step 3: Developing a Demo Application

Our task was to develop a demo to quickly show the company how the solution would look and how it could be used. Flask is an excellent framework for such cases. It is not only lightweight and quick to set up but also offers full flexibility for developing API-based applications.

After creating a simple, user-friendly design, we could focus on the APIs we needed in the background. Flask is ideal for quickly creating an API for data access – especially for a prototype or demo.

Step 4: Choosing the Right Search Solution – OpenSearch or Typesense?

The company wanted to search the reports quickly, so the solution had to provide fast and precise full-text search. That left us with a decision: OpenSearch or Typesense? Both offer powerful search functions, but depending on the project requirements, they have their respective strengths and weaknesses. Here are my insights and reasons for choosing Typesense.

OpenSearch – Stability and Flexibility for Large Data Volumes

OpenSearch is a fork of the well-known Elasticsearch. It offers similar functions but is released under a more permissive license (Apache 2.0). It is a proven tool known for its stability and flexibility, and it is well suited to processing and analyzing large data volumes. For companies with existing Elasticsearch setups or whose data is highly structured and regularly updated, OpenSearch is an excellent choice. The support for complex queries and aggregations is another advantage.

However, OpenSearch has some challenges:

  1. Setup and Configuration: OpenSearch is powerful but also complex. Getting started requires more effort as the configuration is versatile, which is often overkill for simple search projects.
  2. Performance: OpenSearch is resource-intensive and requires powerful infrastructure. This can be limiting, especially for smaller projects or limited resources.
  3. Docker Setup and API Integration: The setup with Docker is well-documented but involves many configuration details that hinder a quick start. The APIs are flexible but often too extensive for small projects.

Typesense – Speed and User-Friendliness for Specific Search Queries

Typesense is focused on speed and user-friendliness and is specifically optimized for full-text search and fuzzy search. The application area is smaller but ideal for specific search queries.

Since the company wanted to search reports without many additional analyses or aggregations, Typesense was the right choice.

The main advantages of Typesense for our project:

  1. Easy Setup: Typesense was quickly set up via Docker and ready to use in minutes. The simple setup saves valuable time for smaller projects.
  2. API and Speed: Typesense offers an API specifically optimized for full-text search. The speed is impressive and ideal when fast and precise answers are required for the end user.
  3. Focus on Relevance and Error Tolerance: Typesense supports fuzzy search and recognizes typos, providing good results even for inaccurate search queries. For cases where users use different terms for the same thing, Typesense additionally supports configurable synonyms.
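To make "error tolerance" more concrete: typo-tolerant search is commonly explained through edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The following sketch only illustrates this idea on a small word list. It is not how Typesense is implemented, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, vocabulary, max_typos=1):
    """Return the words that are within max_typos edits of the query."""
    return [
        word for word in vocabulary
        if edit_distance(query.lower(), word.lower()) <= max_typos
    ]


matches = fuzzy_matches("raport", ["report", "import", "reports"])
```

With one allowed typo, the misspelled query "raport" still finds "report", while "import" and "reports" are two edits away and are not returned.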

The user-friendliness and efficiency of Typesense allowed us to quickly create a demo for the company. With just a few lines of code, we could build a powerful search function. The error tolerance and direct integration with our Flask backend were significant plus points.

Here is an example of how we set up a simple search with Typesense:

import typesense

client = typesense.Client({
    'nodes': [{
        'host': 'localhost',
        'port': '8108',
        'protocol': 'http'
    }],
    'api_key': 'xyz',
    'connection_timeout_seconds': 2
})

# Schema for Typesense, i.e., the format in which the data is stored in Typesense
# For our project, it was important to store chapters as separate documents
schema = {
    "name": "reports",
    "fields": [
        {"name": "document_name", "type": "string"},
        {"name": "content", "type": "string"},
        {"name": "section_name", "type": "string"},
        # Unix timestamp: the default sorting field must be numeric
        {"name": "created", "type": "int64"},
    ],
    "default_sorting_field": "created"
}

client.collections.create(schema)

# Indexing documents
def index_document(section_name, document_name, content, created):
    document = {
        "section_name": section_name,
        "document_name": document_name,
        "content": content,
        "created": created
    }
    client.collections['reports'].documents.create(document)

Advanced Search Options and Autocomplete

After indexing the documents with this function, the search functionality can now be implemented. Typesense offers advanced search options and autocomplete functionalities with just a few lines of code, which are crucial for precise and error-tolerant search. Here is an example of implementing search and autocomplete functionality in Flask:

from flask import Flask, jsonify, request

app = Flask(__name__)

# "client" is the typesense.Client created in the previous example

@app.route('/search', methods=['GET'])
def search():
    query = request.args.get('q', '*')
    section_name = request.args.get('section_name', '')

    search_parameters = {
        'q': query,
        'query_by': 'document_name,content,section_name',
        'per_page': 10,

        # Specifies how many tokens surround the match in the result snippet
        'highlight_affix_num_tokens': 20,
    }

    # Filters by the selected chapter if specified as a filter
    if section_name:
        search_parameters['filter_by'] = f'section_name:={section_name}'

    results = client.collections['reports'].documents.search(search_parameters)
    return jsonify(results)

@app.route('/autocomplete', methods=['GET'])
def autocomplete():
    # Fall back to an empty query if the parameter is missing
    query = request.args.get('q', '')
    search_parameters = {
        'q': query,
        'query_by': 'document_name,content,section_name',
        # Prefix matching is enabled by default, so partial words already match
        'num_typos': 1,
        'per_page': 5
    }
    results = client.collections['reports'].documents.search(search_parameters)
    return jsonify(results)

The autocomplete feature significantly improves user-friendliness as it provides relevant suggestions while the user types the search term. It also tolerates typos, so the end user reaches the right results faster.
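The endpoints above return the raw Typesense response. For a suggestion dropdown, the frontend usually needs far less. The following sketch reduces a response to the fields of our schema plus a text snippet. It assumes the response shape described in the Typesense documentation ("hits", each with a "document" and a list of "highlights" containing "field" and "snippet"). The function name to_suggestions and the sample data are hypothetical.

```python
def to_suggestions(results):
    """Reduce a Typesense search response to a compact list for the frontend."""
    suggestions = []
    for hit in results.get("hits", []):
        document = hit.get("document", {})
        # Use the first highlighted snippet if the hit has one
        highlights = hit.get("highlights", [])
        snippet = highlights[0].get("snippet", "") if highlights else ""
        suggestions.append({
            "document_name": document.get("document_name", ""),
            "section_name": document.get("section_name", ""),
            "snippet": snippet,
        })
    return suggestions


# Hypothetical response, shortened to the keys used above
sample_response = {
    "found": 1,
    "hits": [
        {
            "document": {
                "document_name": "Report_2024.docx",
                "section_name": "Risks",
                "content": "<p>The main risk is delay.</p>",
            },
            "highlights": [
                {"field": "content", "snippet": "The main <mark>risk</mark> is delay."}
            ],
        }
    ],
}

suggestions = to_suggestions(sample_response)
```

In the Flask routes, jsonify(to_suggestions(results)) could then replace jsonify(results) if the smaller payload is preferred.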

Summary: OpenSearch vs. Typesense

The following table provides an overview of when OpenSearch and when Typesense is more suitable:

| Criteria | OpenSearch | Typesense |
| --- | --- | --- |
| Setup | Complex with detailed configuration options | Simple and quick via Docker |
| Performance | Ideal for large data volumes and aggregations | Optimized for small and medium search queries |
| Error Tolerance | Basic fuzzy search | Excellent fuzzy search functionality |
| API | Flexible but extensive | Simple, focused on full-text search |

Conclusion

Choosing Typesense was the right decision for this project as the company wanted fast and precise search results without dealing with complex aggregations. OpenSearch remains a strong alternative when complex data analysis and scalability are the focus.

What initially seemed like an insurmountable task turned into an opportunity to create an intelligent solution for the company that positively impacts not only the present but also the future of data usage.

If you also have too many reports and documents in Word and Excel, let’s talk about it. We offer consultation and support for migrating to a scalable database solution that reduces errors, enables new analyses, and saves you time and costs!

