Store Receipts to structured information

Created: 2015-02-08 17:04:27      Last updated: 2015-02-08 17:26:25

As the title suggests, this process is a tool for transforming receipts into a spreadsheet table. It is made for receipts that have already been scanned and processed with an OCR tool.

Input: .txt files

Output: a table with the following columns: date, price, category, receipt index, buyer, product description

Roughly speaking, the process is divided into the following steps:

1.    .txt2exset: This subprocess segments a receipt.txt file. Every line becomes one example, so the example set has as many examples as the original file has lines. For every example (i.e. line), two regular expressions extract the price: “(-?([0-9]+([\,\.][0-9]{2})))” and “(((-?([0-9]|[0-9]{2}|[0-9]{3})([\,\.\'\;\s][0-9]{2}))(\s[ab]|[ab]|[12]|\s[12])))”

2.    Categorization: Each product should be assigned a category. To keep it simple, I chose to build a dictionary of rules that map products to a category. Each rule has the form: categoryX:.*product.keywordX.*  . To apply the rules, the process has to transform the values of the data set into documents, loop over these documents, and put the results back into a data set.

3.    Manual corrections: Of course, this procedure alone does not create a usable table; manual corrections are necessary. That is why the process stores the example set several times in a sheet for manual correction, which means that the process has three break points. The first (“Art und Preis”, i.e. type and price) is for correcting the extracted prices and categories if necessary; the second (“datum und geschäft”, i.e. date and store) is for correcting the extracted date and the extracted store. This was a bit tricky because you have to pay attention to the continuity of the data set during manual editing.

4.    Output: Because of OCR errors, there are still many wrong characters in the extracted prices. You can correct these errors and format the values in the spreadsheet with formulas like these: “=GLÄTTEN(WECHSELN(WECHSELN(WECHSELN(WECHSELN(WECHSELN(W2;"b";"");"a";"");";";",");".";",");"'";","))” and “=WERT(SÄUBERN(X2))” (German function names; the English equivalents are TRIM, SUBSTITUTE, VALUE and CLEAN). At the end, you may cross-check the result with a pivot table.
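For readers without RapidMiner at hand, steps 1, 2 and 4 above can be roughly sketched in Python. This is only an illustration, not the workflow itself: the two price patterns are copied from step 1, but the function names, the order in which the patterns are tried, the two sample category rules, and the "uncategorized" fallback are all assumptions made for the example. The cleaning function imitates the spreadsheet formulas from step 4 and approximates WERT(SÄUBERN(...)) with a plain float conversion.

```python
import re
from typing import Dict, List, Optional

# The two price patterns quoted in step 1, copied unchanged.
PRICE_SIMPLE = re.compile(r"(-?([0-9]+([\,\.][0-9]{2})))")
PRICE_TAGGED = re.compile(
    r"(((-?([0-9]|[0-9]{2}|[0-9]{3})([\,\.\'\;\s][0-9]{2}))(\s[ab]|[ab]|[12]|\s[12])))"
)

# Hypothetical rules in the "categoryX:.*product.keywordX.*" style of step 2.
# Categories and keywords are invented for illustration.
CATEGORY_RULES: Dict[str, str] = {
    "dairy": r".*(milk|butter).*",
    "bakery": r".*(bread|roll).*",
}


def extract_price(line: str) -> Optional[str]:
    """Return the first raw price string found in one OCR line, or None."""
    # Assumption: try the pattern with the trailing marker first, then the plain one.
    for pattern in (PRICE_TAGGED, PRICE_SIMPLE):
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


def categorize(line: str, default: str = "uncategorized") -> str:
    """Return the first category whose rule matches the whole line."""
    lowered = line.lower()
    for category, pattern in CATEGORY_RULES.items():
        if re.fullmatch(pattern, lowered):
            return category
    return default


def clean_price(raw: str) -> Optional[float]:
    """Imitate the nested WECHSELN/GLÄTTEN formula, then parse the number."""
    text = raw
    # Same substitutions, in the same order, as the spreadsheet formula.
    for old, new in (("b", ""), ("a", ""), (";", ","), (".", ","), ("'", ",")):
        text = text.replace(old, new)
    # GLÄTTEN (TRIM): strip the ends and collapse repeated inner spaces.
    text = " ".join(text.split())
    try:
        # Decimal comma to decimal point before converting.
        return float(text.replace(",", "."))
    except ValueError:
        # Still not a number: leave it for manual correction.
        return None


def receipt_to_rows(text: str) -> List[dict]:
    """Turn one OCR'd receipt into one row (example) per line."""
    rows = []
    for line in text.splitlines():
        raw = extract_price(line)
        rows.append({
            "line": line,
            "category": categorize(line),
            "raw_price": raw,
            "price": clean_price(raw) if raw is not None else None,
        })
    return rows


if __name__ == "__main__":
    for row in receipt_to_rows("Milk 1,99 a\nBread 12.49\nThank you"):
        print(row)
```

Rows whose price comes back as None (no match, or a value the substitutions could not repair) are the ones that would need attention at the manual-correction break points described in step 3.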

Workflow Type: RapidMiner

Uploader: Jml

Version: 1 (of 1)
