JDownloader Community - Appwork GmbH
 

  #61  
04.05.2019, 08:13
raztoki
English Supporter
Join Date: Apr 2010
Location: Australia
Posts: 16,200

are you sure you just want to ignore the http or https component of the protocol?
you would be better off with code like jiaz indicated. i would personally recommend a bash-like script, where you can run multiple regular expressions one after another (unlike in most text/word processors).

for example: cat /file/text | grep patternexpression1 | grep patternexpression2
this lets you pre-filter the text and then apply additional patterns to find what you want. you can even write the findings to files and parse them multiple times if you require different outcomes.
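
The same pre-filter-then-refine idea can be sketched outside the shell as well. The Python below is only an illustration: the `grep` helper, the sample lines, and both patterns are assumptions made up for the example, not patterns taken from real data in this thread.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain a match for pattern, like grep."""
    rx = re.compile(pattern)
    return (line for line in lines if rx.search(line))


# Hypothetical sample input, one candidate per line.
sample = [
    "1. See http://example.com/a.b for details.",
    "This line has no terminator",
    "Is this a sentence?",
]

# Stage 1 pre-filters, stage 2 refines what survived stage 1.
ends_like_sentence = grep(r"[.!?]$", sample)            # keep lines ending in . ! ?
result = list(grep(r"^(?!\d+\.)", ends_like_sentence))  # drop lines starting with "N."

print(result)
```

Each stage stays small enough to test on its own, which is the main advantage over one large expression.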

raztoki
__________________
raztoki @ jDownloader reporter/developer
http://svn.jdownloader.org/users/170

Don't fight the system, use it to your advantage. :]
  #62  
04.05.2019, 13:05
djmakinera
JD Legend
Join Date: May 2010
Location: Poland
Posts: 8,294

The pattern is invalid because it ignores only the protocol; it must be changed to exclude the entire address.
  #63  
04.05.2019, 15:11
raztoki

yes i gathered, hence my question. since you know the answer then you should be able to fix the expression.
  #64  
04.05.2019, 17:10
djmakinera

Quote:
Originally Posted by raztoki
yes i gathered, hence my question. since you know the answer then you should be able to fix the expression.

I have changed the regex. It now ignores the links, but sentence selection is still wrong, and in some lines it does not mark the whole text.

(?!(http|https?://|**External links are only visible to Support Staff**ftp://|www\.|[^\s:=]+@www\.).*?[a-z_\/0-9\-\#=&])(?=(\.|,|;|\?|\!)?("|'|«|»|\[|\s|\r|\n|$))[^.!?0-9]+[.!?]
  #65  
04.05.2019, 17:21
raztoki

you are on the right track with encasing, but you have now introduced more issues. anyway I'm not providing you with any assistance with regular expressions. I'm glad you're learning though!
  #66  
05.05.2019, 17:55
djmakinera

(?<!")😊(?!")|(?<!"(?=😊"))😊|😊(?!"(?<="😊"))|(?<!")😊|😊(?!")|😊(?!"(?:(?:[^"]*"){2})*[^"]*)|(?:"😊".*?)*\k😊|(?:(?>{[^}]*?})[^{}]*?)*\k😊
  #67  
05.05.2019, 22:24
djmakinera

As far as I know, almost the same question has been asked many times on the programming forum stackoverflow.com.
But there is no good solution. "Nothing you do will be perfect." The aim is to reduce the error rate as much as possible: run the program on a large set of texts and add exceptions until you reach an acceptable level of error. However, if you need more than a few dozen rules, you'll probably want to rethink the problem.

Step1:

Find sentences that end with one of the allowed characters . ! ?
Example sentences:
!
Code:
Gdy patrzę na świat, to jest tak piękne i straszne w tym samym czasie!
-or-
.
Code:
Gdy patrzę na świat, to jest tak piękne i straszne w tym samym czasie.
-or-
?
Code:
Gdy patrzę na świat, to jest tak piękne i straszne w tym samym czasie?

Step2:
Do not treat a dot as the end of a sentence when it follows a number at the beginning of a line:
0. (ANY NUMBER + DOT)
5. (ANY NUMBER + DOT)
156. (ANY NUMBER + DOT)
This applies only at the beginning of a line; everywhere else a number followed by a dot is acceptable.

Step3:
All languages of the world are allowed, except for Russian.

Step4:
Add a search exception for any links (URLs): ignore them completely.

Step5:

Detect a sentence boundary when one sentence ends with three dots, three exclamation marks, or three question marks and the next one begins with a capital letter:
Example:
Code:
Jestem w innym świecie... W świecie o innej kulturze, języku, tradycjach, architekturze, przyrodzie, kuchni, pogodzie.
Code:
Jestem w innym świecie!!! W świecie o innej kulturze, języku, tradycjach, architekturze, przyrodzie, kuchni, pogodzie.
Code:
Jestem w innym świecie??? W świecie o innej kulturze, języku, tradycjach, architekturze, przyrodzie, kuchni, pogodzie.
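
The steps above could also be handled in several small passes instead of one pattern. The following Python is a rough sketch under stated assumptions, not a finished solution: the link shapes in `URL`, the placeholder scheme, and the name `find_sentences` are all invented for the illustration. It covers Step1, Step2, Step4 and Step5 only; it does not check for a capital letter after "..." and does nothing about Step3.

```python
import re
from typing import List

# Assumed link shapes (Step4); trailing sentence punctuation stays outside the link.
URL = re.compile(r"(?:https?://|ftp://|www\.)\S*[^\s.!?,;]", re.IGNORECASE)
# "N." numbering at the very beginning of a line (Step2).
NUMBERING = re.compile(r"\s*(\d+\.\s*)")
# Text up to and including a run of . ! ? (Step1 and Step5).
SENTENCE = re.compile(r"[^\s.!?][^.!?]*[.!?]+")
PLACEHOLDER = re.compile(r"\x00(\d+)\x00")


def find_sentences(line: str) -> List[str]:
    """Return the sentences found in one line, with links left intact."""
    links: List[str] = []

    def mask(match):
        # Swap each link for a placeholder that contains no . ! ?
        links.append(match.group(0))
        return "\x00%d\x00" % (len(links) - 1)

    masked = URL.sub(mask, line)

    # Peel off leading numbering so its dot is not taken as a sentence end.
    prefix = ""
    numbered = NUMBERING.match(masked)
    if numbered:
        prefix = numbered.group(1)
        masked = masked[numbered.end():]

    sentences = [m.group(0) for m in SENTENCE.finditer(masked)]
    if sentences:
        # The numbering belongs to the first sentence of the line.
        sentences[0] = prefix + sentences[0]

    # Put the original links back.
    return [PLACEHOLDER.sub(lambda m: links[int(m.group(1))], s) for s in sentences]


print(find_sentences("Jestem w innym świecie... W świecie o innej kulturze."))
print(find_sentences("1. See http://example.com/a.b?x=1 for details."))
```

Abbreviations and decimal numbers in the middle of a line would still be split, so exceptions would have to be added on top of this, as described above.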
  #68  
06.05.2019, 01:47
djmakinera

Quote:
Step2:
Regex ignores URLs and finds the sentences.
Code:
^(?!\d+\.).*[.!?]$
What still remains is to solve the numbering of some sentences (only at the beginning of the line). Screenshot text: **External links are only visible to Support Staff****External links are only visible to Support Staff**

----------------------------------
For now this is only a workaround, and the two expressions work only separately.
I still need a solution so that a sentence with numbering at the beginning of the line is treated as one whole sentence.
It has to cover both cases: with numbering and without numbering.

The word "KONIEC" (Polish for "END") marks the end of the text and is followed by the separator "=".

Code:
^(?!\d+\.)|(?!KONIEC).*[.!?]$
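
One thing worth checking in the expression above is precedence: alternation binds more loosely than anything else, so the pattern reads as "`^(?!\d+\.)` OR `(?!KONIEC).*[.!?]$`" rather than as two conditions on the same line. The small Python check below shows the effect. The `grouped` variant is only one assumed reading of the intent (skip numbered lines and the KONIEC marker), not a confirmed fix, and the sample lines are made up.

```python
import re

# The expression as posted: two independent alternatives.
original = re.compile(r"^(?!\d+\.)|(?!KONIEC).*[.!?]$")
# Assumed intent: both exclusions apply to the same line.
grouped = re.compile(r"^(?!\d+\.|KONIEC).*[.!?]$")


def matched_text(rx, line):
    """Return the text the pattern matched in line, or None if nothing matched."""
    m = rx.search(line)
    return m.group(0) if m else None


for line in ["Hello world.", "1. Numbered line.", "KONIEC"]:
    print(repr(line), repr(matched_text(original, line)), repr(matched_text(grouped, line)))
```

With the posted expression, an ordinary line produces an empty match at position 0 (the left alternative succeeds without consuming anything), while a numbered line is matched in full by the right alternative.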
  #69  
07.05.2019, 11:48
djmakinera

Update:
New pattern to ignore spaces before the selection.

Basically, a sentence either ends with one of ". ? !", or it begins a line with a number + "." and ends with one of ". ? !".

Screenshot:
**External links are only visible to Support Staff****External links are only visible to Support Staff**

The regular expression works, except that it does not ignore the . ! ? characters inside links. That error is still not solved.
  #70  
22.05.2019, 14:58
djmakinera

@Jiaz - Can you correct the pattern so that it does not detect the characters . ! ? in links?

Spoiler:
(\S+\.(com|net|org|edu|gov|ru|pl)(\/\S+)?)|((^\d+\..*?|[^\s].*?)(\.\.\.|[\.?!]))
  #71  
23.05.2019, 11:43
Jiaz
JD Manager
Join Date: Mar 2009
Location: Germany
Posts: 65,456

^\d+\..*? -> .*? -> you allow everything
^\d+\.[^\.!\?]*?

[^\s].*? -> .*? -> you allow everything
[^\s\.!\?]*?
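
The effect of `.*?` "allowing everything" can be seen on a line that contains a link. The Python below compares the pattern from post #70 (without its link alternative) with a stricter body. The `strict` pattern is only a sketch under assumptions: the `LINK` shapes and the sample text are invented, and the body uses `[^.!?]` rather than a class that also excludes whitespace, because otherwise a sentence could not contain spaces. Numbering at the start of a line is left out to keep the comparison short.

```python
import re

# Pattern from the thread: ".*?" may also run through the characters of a link.
loose = re.compile(r"(^\d+\..*?|[^\s].*?)(\.\.\.|[\.?!])")

# Assumed link shapes for the sketch; trailing . ! ? , ; stay outside the link.
LINK = r"(?:https?://|ftp://|www\.)\S*[^\s.!?,;]"
# The body takes a whole link in one step, or one character that is not . ! ?
strict = re.compile(r"(?=\S)(?:" + LINK + r"|[^.!?])+?(?:\.\.\.|[.!?])")

text = "See http://example.com/a.b for info. Next one!"

loose_hits = [m.group(0) for m in loose.finditer(text)]
strict_hits = [m.group(0) for m in strict.finditer(text)]
print(loose_hits)
print(strict_hits)
```

The loose pattern cuts the link into pieces at every dot, while the strict one keeps the link inside its sentence.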
__________________
JD-Dev & Server-Admin
  #72  
23.05.2019, 14:42
djmakinera

Quote:
Originally Posted by Jiaz
^\d+\..*? -> .*? -> you allow everything
^\d+\.[^\.!\?]*?

[^\s].*? -> .*? -> you allow everything
[^\s\.!\?]*?

(^\d+\..*?|.*?)(\.\.\.|[\.?!]). Basically, a sentence ends with a ".?!" OR a sentence begins a line with a number + "." and ends with ".?!"

-or-
This works better: (^\d+\..*?|[^\s].*?)(\.\.\.|[\.?!]) to ignore spaces before the selection.

I've tested. Unfortunately, this pattern is incorrect because it matches the links as sentences.

See screenshot:
**External links are only visible to Support Staff****External links are only visible to Support Staff**
  #73  
23.05.2019, 15:45
Jiaz

Please understand that I simply don't have the time to help you with your pattern. If you want to do this all within a single pattern, then you have to learn it and learn to write more complex patterns, like
emailregex.com
  #74  
23.05.2019, 16:11
djmakinera

Anyway, thanks for partial help.
This question was surprisingly difficult to find an answer for. The regexes I found were too complicated to understand, and anything more than a regex is overkill and too difficult to implement.
  #75  
23.05.2019, 16:48
djmakinera

Quote:
Originally Posted by Jiaz
like
emailregex.com

Which regex engine does a filter (e.g. on a name or package) use? Because none of the ready-made "E-MAIL" patterns works :D
  #76  
24.05.2019, 17:09
Jiaz

Quote:
Originally Posted by djmakinera
Which regex engine does a filter (e.g. on a name or package) use? Because none of the ready-made "E-MAIL" patterns works :D
normal pattern/regex. no *engine*.
  #77  
17.06.2019, 10:26
djmakinera

How do I reverse the order of numbers separated by a comma?

Example: 01,03 -> 03,01
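
For a task like this, a few lines in a scripting language are usually shorter than any pattern. A minimal Python sketch follows; the function name `reverse_numbers` is made up for the illustration. The items are kept as text, so leading zeros such as "01" survive.

```python
def reverse_numbers(value: str, sep: str = ",") -> str:
    """Reverse the order of the sep-separated items, keeping each item as text."""
    return sep.join(reversed(value.split(sep)))


print(reverse_numbers("01,03"))
```

The same function handles any number of items, so a list of 20 numbers needs no extra work.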
  #78  
17.06.2019, 11:03
Jiaz

There comes a time you should learn some sort of coding language and not try to achieve everything with regex.
  #79  
17.06.2019, 16:05
djmakinera

The net is full of solutions for Unix, but I do not see anything for Windows. Besides, these scripts and commands also take time: writing the patterns costs as much effort as changing the text manually. I do not see anything in them that would make the swap easier; to convert 20 numbers I would have to spend several minutes each time, and doing it manually is a lot faster.
  #80  
17.06.2019, 16:31
Jiaz

Ever thought about regex NOT being the answer for all of your *stuff*?
All times are GMT +2.