Deep learning data sets: a large collection for text, number, and character recognition

We recently collected a large number of data sets related to text and number recognition, covering digit recognition as well as language and character recognition. Without further ado, here they are!

1. 500 handwritten pinyin data set

A data set of 500 handwritten pinyin samples, including the pictures and the corresponding annotations in txt format. A script for converting the data to lmdb format is also provided.

Data acquisition address: https://www.dilitanxianjia.com/2540/

2. Large-scale Chinese semantic analysis data set in the financial field

The data set uses tables from the financial field as its data source, covering fund products and their attributes. Participants need to build a model that converts a user's natural language question into a structured query statement (Structured Query Language, SQL). The AntSQL data set is provided by Ant Fortune and hosted on the Alibaba Cloud Tianchi platform. It aims to promote the healthy development of Chinese NLP technology and its community in the financial field, encourage interdisciplinary research in digital finance, and serve the national strategic need for the healthy development of the digital economy.

Data acquisition address: https://www.dilitanxianjia.com/2492/

3. Chinese couplet data set

This data set contains more than 700,000 Chinese couplets, split into words. It is divided into a training set, a test set and a vocabulary. The training set and the test set are each divided into an upper part and a lower part (the first and second lines of each couplet).

Data acquisition address: https://www.dilitanxianjia.com/2462/
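
Because the couplet data above is stored as separate upper and lower parts, the two parts have to be read side by side to recover the pairs. Here is a minimal sketch of that step. It assumes that line i of the upper part matches line i of the lower part and that tokens are separated by whitespace. Neither detail is confirmed by the description, and the sample tokens are placeholders rather than real data.

```python
import io
from itertools import zip_longest


def read_couplets(upper_file, lower_file):
    """Yield (upper_tokens, lower_tokens), one pair per couplet.

    Assumes line i of the upper part matches line i of the lower part
    and that tokens are separated by whitespace (both are assumptions).
    """
    for upper, lower in zip_longest(upper_file, lower_file):
        if upper is None or lower is None:
            raise ValueError("upper and lower parts have different lengths")
        yield upper.split(), lower.split()


# Placeholder tokens standing in for real couplet text
upper_part = io.StringIO("u1 u2 u3\nu4 u5\n")
lower_part = io.StringIO("l1 l2 l3\nl4 l5\n")

couplets = list(read_couplets(upper_part, lower_part))
print(couplets[0])
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(..., encoding="utf-8")`.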

4. Symbol image data set

This data set contains a total of 1363 image files: 1361 JPEG files and 2 PNG files. The images represent the 29 letters of the English and Scandinavian alphabets, that is, the letters A-Z plus æ, ø and å. The data set can be used for machine learning tasks such as image classification and character recognition.

Data acquisition address: https://www.dilitanxianjia.com/2435/5

5. Dataset of 120,000 Russian jokes

A dataset of 120,000 Russian jokes

Data acquisition address: https://www.dilitanxianjia.com/2085/

6. Geometric shape classification data set

This data set consists of 3 classes, each representing a geometric shape (triangle, square or circle). Each class consists of 10,000 generated images, for 30,000 images in total.

Data acquisition address: https://www.dilitanxianjia.com/2066/

7. Page image data set with numbers

A data set of page images with numbers: 10 pictures of handwritten Arabic numerals in total.

Data acquisition address: https://www.dilitanxianjia.com/1992/

8. 10,000 character document recognition data set

A document recognition data set of 10,000 characters. The images contain letters (A-Z), numbers (0-9) and special characters (such as #).

Data acquisition address: https://www.dilitanxianjia.com/1989/

9. Digit data set in various fonts

A data set of digits in various fonts, intended for recognizing digits regardless of the font used.

Data acquisition address: https://www.dilitanxianjia.com/1716/

10. Handwritten numbers and English characters data set

A data set of handwritten numbers and English characters. It contains 5 CSV files: datasetphanum, datasetchars, datasetmnist and datasetmnist, which hold alphanumeric, alphabetic, EMNIST handwritten letters and numbers data respectively, and datasetfinal, a merged file containing all of the above. Each image is 28 x 28 grayscale and is stored in 784 columns of the data set. The last column contains the label.

Data acquisition address: https://www.dilitanxianjia.com/1713/
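
To make the CSV layout above concrete (784 pixel columns followed by one label column), here is a small sketch that turns one row back into a 28 x 28 grid. It assumes the pixels are stored row by row and that the file has no header line. Neither is stated in the description, and the sample row is synthetic.

```python
import csv
import io

SIDE = 28                # image height and width, from the description
N_PIXELS = SIDE * SIDE   # 784 pixel columns


def row_to_image(row):
    """Split one CSV row into (image, label).

    Assumes 784 pixel values followed by one label column, with the
    pixels stored row by row (an assumption, not confirmed by the source).
    """
    if len(row) != N_PIXELS + 1:
        raise ValueError(f"expected {N_PIXELS + 1} columns, got {len(row)}")
    pixels = [int(float(value)) for value in row[:N_PIXELS]]
    label = row[N_PIXELS]
    # Cut the flat list into 28 rows of 28 pixels each
    image = [pixels[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]
    return image, label


def read_images(csv_file):
    """Yield (image, label) for every row of an open CSV file."""
    for row in csv.reader(csv_file):
        yield row_to_image(row)


# Synthetic one-row file standing in for one of the real CSV files
fake_row = ",".join(str(i % 256) for i in range(N_PIXELS)) + ",A"
image, label = next(read_images(io.StringIO(fake_row)))
print(len(image), len(image[0]), label)
```

If the real files turn out to have a header line, it would need to be skipped before calling `row_to_image`.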

11. Chinese news data set with 20 categories

The Fudan University news classification data set contains Chinese news in 20 different categories. The files under the train folder are for training (9804 documents in total), and the files under the answer folder are for testing (9833 documents in total).

Data acquisition address: https://www.dilitanxianjia.com/1710/

12. Oracle bone script image data set

An image data set of oracle bone script characters.

Data acquisition address: https://www.dilitanxianjia.com/1199/

13. Ancient Persian cuneiform font data set

For the ancient Persian cuneiform data set, the open source Tesseract engine was selected for character segmentation, learning and classification. Because the inscriptions contain noise (cracks in the stone), the authors apply image processing techniques to remove it. The final output of the system includes the extracted cuneiform script, a transcription of the sentences in Persian and English, and the pronunciation and translation of the sentences. The large number of extracted Persian and English words gives a better understanding of how people spoke during that era. The results obtained through validation and result slicing show that the system handles cuneiform recognition well, classifying the characters of the test data with an accuracy of about 92%.

Data acquisition address: https://www.dilitanxianjia.com/1196/

14. Handwritten digits from 0 to 9 image data set

This data set contains 200 images of handwritten digits from 0 to 9. All digits were handwritten by the author on white paper and then photographed with a smartphone camera. After each photo was taken, the extra white area was cropped.

Data acquisition address: https://www.dilitanxianjia.com/1192/

15. Russian handwritten letters data set

This data set includes a folder with a total of 14,190 images of handwritten Russian letters in PNG format, which makes it convenient to classify handwritten letters with a CNN.

Data acquisition address: https://www.dilitanxianjia.com/1188/

16. Invoice information recognition data set

This data set consists of XML files and images. Each XML file contains the data extracted from an invoice image. For clarity, the XML files keep the same names as the files they belong to. Users of the data set should extract entities such as invoice numbers, invoice data, company names (the invoice goes from company 1 to company 2), company phone numbers, addresses, etc.

Data acquisition address: https://www.dilitanxianjia.com/1182/
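
Since matching files in the invoice data set share a name, the XML annotation for an invoice can be found by comparing base names. The sketch below shows one way to do that. The file names and the .jpg extension are hypothetical, and any file that is not .xml is treated as the other half of a pair.

```python
from pathlib import PurePath


def pair_by_stem(filenames):
    """Map each shared base name to (xml_name, other_name).

    Files without a partner are left out of the result.
    """
    xml_by_stem = {}
    other_by_stem = {}
    for name in filenames:
        path = PurePath(name)
        target = xml_by_stem if path.suffix.lower() == ".xml" else other_by_stem
        target[path.stem] = name
    shared = sorted(xml_by_stem.keys() & other_by_stem.keys())
    return {stem: (xml_by_stem[stem], other_by_stem[stem]) for stem in shared}


# Hypothetical file names, not taken from the data set
names = ["inv_001.xml", "inv_001.jpg", "inv_002.xml", "inv_003.jpg"]
pairs = pair_by_stem(names)
print(pairs)
```

Leaving unpaired files out makes it easy to spot annotations without an image, or the other way round, by comparing the result with the full file list.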

17. Sanskrit character data set

The CSV file of this Sanskrit character data set has 92000 rows and 1025 columns. There are 1024 input features holding grayscale pixel values (0 to 255). The "Character" column gives the name of the Sanskrit character shown in each image.

Data acquisition address: https://www.dilitanxianjia.com/1179/
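
As a quick sanity check on the Sanskrit CSV described above, the following sketch reads rows with a header, separates the "Character" column from the pixel columns, and counts samples per character. The pixel column names (p0, p1, ...) and the class names are made up for the example, and the exact spelling of the label column in the real file is an assumption. Note that 1024 = 32 x 32, so the images are presumably 32 x 32, which the description does not state.

```python
import csv
import io
from collections import Counter

LABEL_COLUMN = "Character"   # from the description; exact spelling is an assumption
N_FEATURES = 1024            # pixel columns per row


def load_rows(csv_file):
    """Yield (pixels, character_name) for each row of a CSV with a header."""
    for record in csv.DictReader(csv_file):
        name = record.pop(LABEL_COLUMN)
        pixels = [int(value) for value in record.values()]
        if len(pixels) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(pixels)}")
        if not all(0 <= p <= 255 for p in pixels):
            raise ValueError("pixel value outside the 0 to 255 range")
        yield pixels, name


# Synthetic file: hypothetical column and class names, not real data
header = ",".join([f"p{i}" for i in range(N_FEATURES)] + [LABEL_COLUMN])
row_a = ",".join(["0"] * N_FEATURES + ["class_a"])
row_b = ",".join(["255"] * N_FEATURES + ["class_b"])
sample = "\n".join([header, row_a, row_a, row_b])

counts = Counter(name for _, name in load_rows(io.StringIO(sample)))
print(counts)
```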


Origin blog.csdn.net/weixin_44906759/article/details/134475396