Indexing PDF with Solr

2023-03-20 19:42 问答作者：

Can anyone point me to a tutorial.

My main experience with Solr is indexing CSV files. But I cannot find any simple instructions/tutorial to tell me what I need to do to index pdfs.

I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler

But it makes very little sense to me. Do I need to install Tika?

开发者_JS百科

Im lost - please help

With solr-4.9 (the latest version as of now), extracting data from rich documents like pdfs, spreadsheets(xls, xlxs family), presentations(ppt, ppts), documentation(doc, txt etc) has become fairly simple. The sample code examples provided in the downloaded archive from here contains a basic solr template project to get you started quickly.

The necessary configuration changes are as follows:

Change the solrConfig.xml to include following lines :

<lib dir="<path_to_extraction_libs>" regex=".*\.jar" /> <lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />

create a request handler as follows:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" > <lst name="defaults" /> </requestHandler>

2.Add the necessary jars from the solrExample to your project.

3.Define the schema as per your needs and fire a query like :

curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "myfile=@testDocToExtractFrom.txt"

go to the GUI portal and query to see the indexed contents.

Let me know if you face any problems.

You could use the dataImportHandler. The DataImortHandle will be defined at the solrconfig.xml, the configuration of the DataImportHandler should be realized in an different XML config file (data-config.xml)

For indexing pdf's you could

1.) crawl the directory to find all the pdf's using the FileListEntityProcessor

2.) reading the pdf's from an "content/index"-XML File, using the XPathEntityProcessor

If you have the list of related pdf's, use the TikaEntityProcessor look at this http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ (example with ppt) and this Solr : data import handler and solr cell

The hardest part of this is getting the metadata from the PDFs, using a tool like Aperture simplifies this. There must be tonnes of these tools

Aperture is a Java framework for extracting and querying full-text content and metadata from PDF files

Apeture grabbed the metadata from the PDFs and stored it in xml files.

I parsed the xml files using lxml and posted them to solr

Use the Solr, ExtractingRequestHandler. This uses Apache-Tika to parse the pdf file. I believe that it can pull out the metadata etc. You can also pass through your own metadata. Extracting Request Handler

public class SolrCellRequestDemo {
public static void main (String[] args) throws IOException, SolrServerException {
SolrClient client = new
HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build();
ContentStreamUpdateRequest req = new
ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("my-file.pdf"));
req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
NamedList<Object> result = client.request(req);
System.out.println("Result: " +enter code here result);
}

This may help.

Apache Solr can now index all sort of binary files like PDF, Words, etc ... check out this doc:
https://lucene.apache.org/solr/guide/8_5/uploading-data-with-solr-cell-using-apache-tika.html

继续阅读：apache-tika full-text-search solr solr-cell solrj

Indexing PDF with Solr

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？